LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence
I find that when training a transformer, the embedding matrix moves slowly, so it is difficult for the model to jump out of its initial noisy embedding.

Initial embedding:

[[-0.0073  0.0062 -0.0261 …  0.0086  0.0107 -0.008 ] … ]

After 1 step, the directions of the embedding vectors have barely changed, because each entry moves by roughly the learning rate (~4e-4):

[[-0.0069  0.0066 -0.0265 …  0.009   0.0111 -0.0084] … ]

So I propose initializing the embedding matrix with very small values and applying a LayerNorm directly to the embedding output, i.e. LayerNorm(SmallInit(Embedding)).
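A minimal PyTorch sketch of what this looks like; the module name and the init scale of 1e-4 are illustrative assumptions, not values taken from the excerpt:

```python
import torch
import torch.nn as nn

class SmallInitEmbedding(nn.Module):
    """Embedding initialized with tiny values, followed by a LayerNorm.

    Because the raw embedding values are tiny, the LayerNorm output can change
    direction quickly in the first optimizer steps instead of staying stuck
    near the initial random noise.
    """
    def __init__(self, vocab_size: int, d_model: int, init_scale: float = 1e-4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        # SmallInit: assumed uniform init in [-init_scale, init_scale]
        nn.init.uniform_(self.emb.weight, -init_scale, init_scale)
        # LayerNorm applied right after the embedding lookup
        self.ln = nn.LayerNorm(d_model)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # LayerNorm(SmallInit(Embedding))
        return self.ln(self.emb(idx))
```

The LayerNorm rescales the tiny embeddings to unit scale for the rest of the network, while gradients can still rotate the underlying vectors rapidly relative to their small magnitude.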