Issue #41 – Deep Transformer Models for Neural MT
13 Jun 2019
Author: Dr. Patrik Lambert, Machine Translation Scientist @ Iconic
The Transformer is a state-of-the-art Neural MT model, as we covered previously in Issue #32. So what happens when something works well with neural networks? We try to go wider and deeper! There are two research directions that look promising to enhance the Transformer model: building wider networks by increasing the size of word representation and attention vectors, or building deeper networks (i.e. with more encoder and decoder layers). In this post, we take a look at a paper by Wang et al. (2019) which proposes a deep Transformer architecture overcoming well known difficulties of this approach.
Problems of wide and deep Transformer networks
Wide Transformer networks (so-called Transformer-big) are a common choice when a large amount of training data is available. However, they contain many more parameters, which slows down both training and generation (roughly 3 times slower than the so-called Transformer-base, which offers a reasonable trade-off between quality and efficiency).
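To give a sense of the size gap, here is a minimal Python sketch (not from the paper) comparing the standard Transformer-base and Transformer-big hyper-parameters from Vaswani et al. (2017). The parameter count is a rough per-encoder estimate that ignores embeddings, biases, layer norms and the decoder, so the exact numbers are only illustrative.

```python
# Rough sketch: standard Transformer-base vs Transformer-big settings
# (Vaswani et al., 2017) and an approximate encoder parameter count.

def layer_params(d_model: int, d_ff: int) -> int:
    """Approximate parameters of one encoder layer:
    self-attention projections (4 * d_model^2) + feed-forward (2 * d_model * d_ff)."""
    return 4 * d_model * d_model + 2 * d_model * d_ff

configs = {
    "transformer-base": {"layers": 6, "d_model": 512,  "d_ff": 2048, "heads": 8},
    "transformer-big":  {"layers": 6, "d_model": 1024, "d_ff": 4096, "heads": 16},
}

for name, c in configs.items():
    total = c["layers"] * layer_params(c["d_model"], c["d_ff"])
    print(f"{name}: ~{total / 1e6:.0f}M encoder parameters")
```

Doubling the model width roughly quadruples the per-layer parameter count, which is where the extra training and decoding cost comes from.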
The usual problems of deep Transformer networks are vanishing gradients and information forgetting. Vanishing gradients occur because, in order to update the network weights, the gradient of the loss must be back-propagated from the output down through the whole stack of layers; at each layer it is multiplied by terms that can be smaller than one, so by the time it reaches the lowest layers it may be too small to produce a meaningful update.
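The toy NumPy sketch below (my own illustration, not code from Wang et al.) shows this effect: the gradient is multiplied by one Jacobian per layer, and when those Jacobians shrink the signal even slightly, the gradient norm decays roughly geometrically with depth.

```python
# Toy illustration of vanishing gradients in a deep stack of layers.
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 30

grad = rng.normal(size=d)  # gradient arriving at the top layer
print(f"layer {n_layers:2d}: ||grad|| = {np.linalg.norm(grad):.4f}")

for layer in range(n_layers - 1, -1, -1):
    # Random Jacobian that shrinks a typical vector by ~10% per layer.
    jacobian = rng.normal(scale=0.9 / np.sqrt(d), size=(d, d))
    grad = jacobian.T @ grad  # chain rule: propagate the gradient one layer down
    if layer % 10 == 0:
        print(f"layer {layer:2d}: ||grad|| = {np.linalg.norm(grad):.6f}")
```

After 30 such layers the gradient norm has shrunk by more than an order of magnitude, which is why the lower layers of a naively stacked deep Transformer learn very slowly.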