Issue #16 – Revisiting synthetic training data for Neural MT
08 Nov 2018
Author: Dr. Patrik Lambert, Machine Translation Scientist @ Iconic
In a previous guest post in this series, Prof. Andy Way explained how to create training data for Neural MT through back-translation. This technique involves translating monolingual data in the target language into the source language, producing a parallel corpus of “synthetic” source and “authentic” target sentences; the name comes from the fact that the data is translated backwards, from target to source. Andy reported interesting findings: a few million sentences of synthetic training data can be nearly as effective as the same amount of authentic data. He also observed a tipping point beyond which adding more synthetic data actually harms translation quality. This means that for many languages and domains we cannot simply use all of the monolingual data available. It thus raises the question of whether we can select the data to be back-translated so as to optimise translation quality.
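To make the mechanics concrete, here is a minimal sketch of the back-translation step, which was not part of the original post. It assumes a pre-trained target-to-source model is available; the Hugging Face MarianMT model name, the example German sentences, and the English-to-German direction are all illustrative assumptions, not the setup used in the experiments discussed here.

```python
from transformers import MarianMTModel, MarianTokenizer

# Hypothetical monolingual data in the target language (German),
# gathered to augment training data for an English->German system.
target_monolingual = [
    "Maschinelle Übersetzung hat sich stark verbessert.",
    "Synthetische Daten können beim Training helfen.",
]

# A reverse (target->source) model; this model name is illustrative.
model_name = "Helsinki-NLP/opus-mt-de-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Back-translate: German (authentic target) -> English (synthetic source).
batch = tokenizer(target_monolingual, return_tensors="pt", padding=True)
outputs = model.generate(**batch)
synthetic_source = [
    tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs
]

# Pair each synthetic source sentence with its authentic target sentence
# to form the synthetic parallel corpus.
synthetic_corpus = list(zip(synthetic_source, target_monolingual))
for src, tgt in synthetic_corpus:
    print(f"{src}\t{tgt}")
```

In practice, the synthetic pairs produced this way would be concatenated with the authentic parallel corpus before training the forward source-to-target system.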
As is the nature of Neural MT, there have already been new developments that give further insight into this question, as well as into other aspects of back-translation. Let’s take a look at them here.