Issue #126 – Learning Feature Weights for Denoising Parallel Corpora

15 Apr 2021

Author: Dr. Patrik Lambert, Senior Machine Translation Scientist @ Iconic

Introduction

Large web-crawled parallel corpora are a very useful source of data for improving neural machine translation (NMT) engines. However, their effectiveness is reduced by the large amount of noise they usually contain. As early as issue #2 of this series, we pointed out that NMT is particularly sensitive to noise in the training data. In issue #21, we presented the dual cross-entropy method for filtering noise out of parallel corpora (Junczys-Dowmunt, 2018). In this post, we take a look at a paper by Kumar et al. (2021), which goes a step further and proposes a denoising

