Issue #21 – Revisiting Data Filtering for Neural MT
17 January 2019
Author: Dr. Patrik Lambert, Machine Translation Scientist @ Iconic
The Neural MT Weekly is back for 2019 after a short break over the holidays! 2018 was a very exciting year for machine translation, as documented over the first 20 articles in this series. What was striking was the pace of development, even in the 6 months since we started publishing these articles. This was illustrated by the fact that certain topics – such as data creation and terminology – were revisited in subsequent articles because the technology had already moved on! We’re kicking off 2019 in the same vein, by revisiting the topic of data cleaning because, no matter how good the algorithms are, clean data is better data. We’ll let Patrik take it from here…
As we described in Issue #2 of this series, Neural MT is particularly sensitive to noise in the training data (e.g. wrong language, bad alignments, poor translations, misspellings, etc.). As a result, the task of filtering out noisy sentence pairs from a parallel corpus has recently attracted even more interest. A shared task for parallel corpus filtering was organised at WMT 2018.
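To make those noise types concrete, here is a minimal sketch of rule-based filtering over a parallel corpus. The file names, thresholds, and the choice of the `langid` library for language identification are illustrative assumptions, not the method the article goes on to describe; in practice such heuristics are usually just a first pass before more sophisticated scoring.

```python
# A minimal sketch of heuristic parallel-corpus filtering, assuming a
# tab-separated file of German-English sentence pairs. Thresholds and
# file names are illustrative, not from the article.
import langid

MAX_LEN = 100        # drop very long segments (in tokens)
MAX_LEN_RATIO = 3.0  # drop pairs with implausible length ratios

def keep_pair(src: str, tgt: str, src_lang: str = "de", tgt_lang: str = "en") -> bool:
    """Return True if the sentence pair passes basic noise filters."""
    src_toks, tgt_toks = src.split(), tgt.split()
    # Empty or overly long segments are often alignment errors.
    if not src_toks or not tgt_toks:
        return False
    if len(src_toks) > MAX_LEN or len(tgt_toks) > MAX_LEN:
        return False
    # A large length mismatch frequently signals a bad alignment.
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > MAX_LEN_RATIO or ratio < 1.0 / MAX_LEN_RATIO:
        return False
    # Wrong-language segments are a common source of noise.
    if langid.classify(src)[0] != src_lang or langid.classify(tgt)[0] != tgt_lang:
        return False
    return True

with open("corpus.de-en.tsv", encoding="utf-8") as fin, \
     open("corpus.filtered.tsv", "w", encoding="utf-8") as fout:
    for line in fin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and keep_pair(*parts):
            fout.write(line)
```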