Machine Translation Weekly 45: Deep Encoder, Shallow Decoder, and the Fall of Non-autoregressive models
Researchers concerned with machine translation speed have invented several methods that are supposed to speed up translation significantly while preserving as much of the translation quality of state-of-the-art models as possible. The methods are usually based on generating as many words as possible
in parallel.
State-of-the-art models do not generate in parallel; they are autoregressive: they generate words one by one and condition each decision about the next word on the previously generated words. On the other hand, all other computations in the Transformer model can be heavily parallelized, so that a sentence can be processed in almost constant time with respect to its length. Normally, this applies only to the encoder, because the decoder needs to proceed word by word anyway.
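To make the contrast concrete, here is a minimal sketch in Python of the two decoding regimes. The functions `encode`, `decoder_step`, and `decode_all_positions` are hypothetical toy stand-ins, not an actual Transformer implementation; they only illustrate that autoregressive decoding is a sequential loop over target positions, whereas non-autoregressive decoding is a single parallel pass.

```python
import random

# A toy vocabulary so the sketch runs without any real model.
VOCAB = ["the", "cat", "sat", "on", "mat"]
EOS = "</s>"


def encode(src_tokens):
    # Stand-in for the encoder: in a real Transformer this is a single
    # pass that is parallel over all source positions.
    return list(src_tokens)


def decoder_step(encoder_states, prefix):
    # Stand-in for one decoder step: a real decoder would attend to the
    # encoder states and to the whole previously generated prefix.
    if len(prefix) >= len(encoder_states):
        return EOS
    return random.choice(VOCAB)


def decode_all_positions(encoder_states, target_len):
    # Stand-in for a non-autoregressive decoder: every target position is
    # predicted independently in one parallel pass.
    return [random.choice(VOCAB) for _ in range(target_len)]


def autoregressive_decode(src_tokens, max_len=20):
    """Sequential decoding: each token is conditioned on the previous ones."""
    states = encode(src_tokens)
    output = []
    for _ in range(max_len):  # this loop cannot be parallelized
        token = decoder_step(states, output)
        output.append(token)
        if token == EOS:
            break
    return output


def non_autoregressive_decode(src_tokens, target_len):
    """Parallel decoding: all target positions are generated at once."""
    states = encode(src_tokens)
    return decode_all_positions(states, target_len)


print(autoregressive_decode(["die", "Katze", "saß"]))
print(non_autoregressive_decode(["die", "Katze", "saß"], target_len=3))
```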
This parallelism is of course very attractive for researchers, who tried to parallelize the decoding phase as well and generate all (or at