Issue #77 – Neural MT with Subword Units Using BPE-Dropout
09 April 2020
Author: Dr. Chao-Hong Liu, Machine Translation Scientist @ Iconic
The ability to translate subword units enables machine translation (MT) systems to translate rare words that might not appear in the training data used to build MT models. Ideally we don’t want to find these subword units (and their corresponding translated “segments”) as a preprocessing procedure, it would be much easier if we could recognise them directly, and automatically, from the corpus of parallel sentences that is used to train an MT model. In this post, we review the work by Provilkov et al. (2019) on improving byte pair encoding (BPE) for NMT.
BPE-Dropout
Using subword units for MT is highly desirable because it allows us to translate rare words whose subunits appear in the training corpus, even when the words themselves do not. This is especially useful when training MT systems for low-resource languages. Another advantage of subword-unit translation is that it can be applied to non-segmenting languages, e.g. Chinese and Thai; in this case, it can be used in the training pipeline without a separate word segmentation step.
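To make the idea concrete, here is a minimal sketch of BPE segmentation with dropout in the spirit of Provilkov et al. (2019): merges are applied in their learned priority order, but each individual merge is skipped with probability `dropout`, so the same word can be split differently on each pass. The merge table and the word are toy examples, not from the paper.

```python
import random

# Hypothetical toy merge table, in learned priority order (highest first).
MERGES = [("u", "n"), ("r", "e"), ("re", "l"),
          ("a", "t"), ("e", "d"), ("at", "ed")]

def bpe_segment(word, merges, dropout=0.0, rng=random):
    """Segment a word with BPE. With dropout > 0, each candidate merge
    is randomly skipped, yielding stochastic segmentations (BPE-dropout)."""
    symbols = list(word)
    for a, b in merges:  # apply merges in priority order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b and rng.random() >= dropout:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

# dropout=0.0 recovers standard deterministic BPE:
print(bpe_segment("unrelated", MERGES))               # ['un', 'rel', 'ated']
# dropout=1.0 blocks every merge, falling back to characters:
print(bpe_segment("unrelated", MERGES, dropout=1.0))
```

With an intermediate dropout (the paper uses values around 0.1), each training epoch sees different subword views of the same word, which regularises the model and improves robustness to segmentation errors.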