Issue #73 – Mixed Multi-Head Self-Attention for Neural MT
12 March 2020
Author: Dr. Patrik Lambert, Machine Translation Scientist @ Iconic
Self-attention is a key component of the Transformer, a state-of-the-art neural machine translation architecture. In the Transformer, self-attention is divided into multiple heads so that the system can attend independently to information from different representation subspaces. Recently, however, it has been shown that there is some redundancy among these heads. In this post, we take a look at approaches that ensure that different heads capture distinct features.
Multi-head Self-attention
Attention mechanisms selectively focus on specific parts of the sentence during translation. In self-attention networks (such as the Transformer), the hidden state of each word is calculated by attending to every other word in the sentence, so self-attention relies on global information. Furthermore, the Transformer is designed with multiple attention heads, giving the model the ability to attend to different parts of the word representation vectors in parallel. These different subspaces of the representation vectors contain, in principle, information about different word characteristics. Thus, in theory, by attending to different types of information, multi-head attention can capture different features. The multiple heads significantly improve the Transformer's performance (in terms of translation quality).
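To make the mechanism concrete, here is a minimal NumPy sketch of multi-head self-attention for a single sentence, following the standard scaled dot-product formulation of the original Transformer. The dimensions, randomly initialised projection matrices, and function names are illustrative assumptions for this post, not the implementation of any particular toolkit.

```python
# Minimal sketch of multi-head self-attention (scaled dot-product form).
# Weights are random stand-ins for learned parameters.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    """x: (seq_len, d_model) word representations for one sentence."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    # Randomly initialised projections stand in for learned parameters.
    w_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    w_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    w_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    w_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    # Project, then split into per-head subspaces: each head sees a
    # different d_head-dimensional slice of the representation.
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention: every word attends to every word.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))          # 5 words, d_model = 16
out, attn = multi_head_self_attention(x, num_heads=4, rng=rng)
print(out.shape, attn.shape)              # (5, 16) (4, 5, 5)
```

Note that each head produces its own attention distribution over the sentence (the `(4, 5, 5)` tensor above); the redundancy discussed in this post arises when several of these distributions end up attending to largely the same information.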