Issue #95 – Constrained Parameter Initialisation for Deep Transformers in Neural MT
20 Aug 2020
Author: Dr. Patrik Lambert, Senior Machine Translation Scientist @ Iconic
Introduction
As the Transformer model is the state of the art in Neural MT, researchers have tried to build wider (with higher-dimensional vectors) and deeper (with more layers) Transformer networks. Wider networks are more costly in terms of training and generation time, and are thus not the best option in production environments. However, adding encoder layers may improve translation quality with very little impact on generation speed. In this post, we take a look at a paper by Hongfei Xu et al. (2020) which revisits the convergence issues of deep standard Transformer models and proposes a simple solution based on constraining the parameter initialisation.
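To give a concrete feel for what constraining parameter initialisation means in practice, here is a minimal PyTorch sketch. The depth-dependent scaling rule below (a Xavier-style uniform bound divided by the square root of the number of layers) is an illustrative assumption, not the exact formula from Xu et al. (2020); the helper name `constrained_uniform_init_` is likewise hypothetical.

```python
import math
import torch.nn as nn

def constrained_uniform_init_(module: nn.Module, num_layers: int) -> None:
    """Illustrative sketch: shrink the uniform initialisation range as depth grows.

    Assumption for demonstration only: a Xavier-style bound divided by
    sqrt(num_layers), not the exact constraint proposed in the paper.
    """
    for m in module.modules():
        if isinstance(m, nn.Linear):
            fan_in, fan_out = m.in_features, m.out_features
            bound = math.sqrt(6.0 / (fan_in + fan_out)) / math.sqrt(num_layers)
            nn.init.uniform_(m.weight, -bound, bound)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# Example: a 12-layer Transformer encoder whose linear weights are kept small
# at initialisation, so early updates are less likely to destabilise training.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
constrained_uniform_init_(encoder, num_layers=12)
```

The design intuition is simply that smaller initial parameter values keep the magnitude of each layer's output, and hence the gradients flowing back through many layers, within a manageable range during the first training steps.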
Layer normalisation and residual connections
The advantage of deeper networks is that they can represent more complex functions. However, as already seen in Issue #41 of our blog series, a common obstacle to the convergence of deep Transformer training is the vanishing gradient problem. Vanishing gradients can arise when the network weights are updated. The gradient of the loss function is calculated at each