Issue #95 – Constrained Parameter Initialisation for Deep Transformers in Neural MT
20 Aug 2020
Author: Dr. Patrik Lambert, Senior Machine Translation Scientist @ Iconic
Introduction
As the Transformer model is the state of the art in Neural MT, researchers have tried to build wider (with higher-dimensional vectors) and deeper (with more layers) Transformer networks. Wider networks are more costly in terms of training and generation time, and are thus not the best option in production environments. However, adding encoder layers may improve translation quality with very little impact on generation speed. In this post, we take a look at a paper by Hongfei Xu et al. (2020) which revisits the convergence issues of deep standard Transformer models and proposes a simple solution based on constraining the parameter initialisation.
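To give a concrete feel for what constraining parameter initialisation means in practice, here is a minimal PyTorch sketch. The depth-dependent scaling rule below (a Xavier-style uniform bound divided by the square root of the number of layers) is an illustrative assumption, not the exact formula from Xu et al. (2020); the helper name `constrained_uniform_init_` is likewise hypothetical.

```python
import math
import torch.nn as nn

def constrained_uniform_init_(module: nn.Module, num_layers: int) -> None:
    """Illustrative sketch: shrink the uniform initialisation range as depth grows.

    Assumption for demonstration only: a Xavier-style bound divided by
    sqrt(num_layers), not the exact constraint proposed in the paper.
    """
    for m in module.modules():
        if isinstance(m, nn.Linear):
            fan_in, fan_out = m.in_features, m.out_features
            bound = math.sqrt(6.0 / (fan_in + fan_out)) / math.sqrt(num_layers)
            nn.init.uniform_(m.weight, -bound, bound)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# Example: a 12-layer Transformer encoder whose linear weights are kept small
# at initialisation, so early updates are less likely to destabilise training.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
constrained_uniform_init_(encoder, num_layers=12)
```

The design intuition is simply that smaller initial parameter values keep the magnitude of each layer's output, and hence the gradients flowing back through many layers, within a manageable range during the first training steps.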
Layer normalisation and residual connections
The advantage of deeper networks is that they can represent more complex functions. However, as already seen in Issue #41 of our blog series, a common obstacle to the convergence of deep Transformer training is the vanishing gradient problem. Vanishing gradients can arise when the network weights are updated. The gradient of the loss function is calculated at each