Issue #126 – Learning Feature Weights for Denoising Parallel Corpora
15 Apr 2021
Author: Dr. Patrik Lambert, Senior Machine Translation Scientist @ Iconic
Introduction
Large web-crawled parallel corpora are a valuable source of data for improving neural machine translation (NMT) engines. However, their usefulness is limited by the large amount of noise they usually contain. As early as issue #2 of this series, we pointed out that NMT is particularly sensitive to noise in the training data. In issue #21, we presented the dual cross-entropy method for filtering noise out of parallel corpora (Junczys-Dowmunt, 2018). In this post, we take a look at a paper by Kumar et al. (2021), which goes a step further and proposes a denoising