Issue #3 – Improving vocabulary coverage
01 Aug 2018
Author: Raj Nath Patel, Machine Translation Scientist @ Iconic
Machine Translation typically operates with a fixed vocabulary, i.e. it knows how to translate a finite number of words. This is obviously an issue, because translation is an open-vocabulary problem: we might want to translate any possible word! It is a particular issue for Neural MT, where the vocabulary must be fixed before training begins, because the computation and memory cost of the output layer grows with the vocabulary size. The problem is exacerbated for morphologically rich languages, where each word can surface in many inflected forms, inflating the vocabulary further. In this post we will look at a few of the available options for handling the open-vocabulary problem in Neural MT, and their effectiveness in improving overall translation quality.
Sub-word Units
The most common way to handle open vocabulary and rich morphology in NMT is to split word forms into smaller units, also known as “subwords”. This is based on the observation that various word classes are translated better via units smaller than the word: names, for instance, are more efficiently and robustly handled via character copying or transliteration, and compounds via compositional translation (for example, the German compound “Abwasserbehandlungsanlage” can be translated piece by piece as “waste water treatment plant”).
Byte-pair encoding (BPE)