Issue #90 – Tangled up in BLEU: Reevaluating how we evaluate automatic metrics in Machine Translation

16 Jul 2020


Author: Dr. Karin Sim, Machine Translation Scientist @ Iconic

Introduction

Automatic metrics have a crucial role in Machine Translation (MT). They are used to tune MT systems during the development phase, to determine which model is best, and subsequently to assess the quality of the final translations. The performance of these automatic metrics is judged by how well they correlate with human judgments of translations produced by various systems. WMT currently uses Pearson's correlation, which is highly sensitive to outliers; as a result, the correlation can appear erroneously high.
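To see why outlier sensitivity matters, here is a minimal sketch (using invented per-system scores, not WMT data, and `scipy.stats.pearsonr`) of how a single very weak outlier system can flip an otherwise poor metric-human correlation into an apparently strong one.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-system scores: among the competitive systems the metric
# actually ranks systems roughly opposite to the human judgments ...
human  = np.array([0.71, 0.69, 0.73, 0.70, 0.72])   # human adequacy scores
metric = np.array([31.2, 33.5, 30.8, 32.9, 31.7])   # automatic metric scores

r_without_outlier, _ = pearsonr(human, metric)

# ... but one clearly inferior outlier system scores low on both scales,
# dragging the overall Pearson correlation strongly positive.
human_o  = np.append(human,  0.30)
metric_o = np.append(metric, 12.0)

r_with_outlier, _ = pearsonr(human_o, metric_o)

print(f"Pearson r over competitive systems only: {r_without_outlier:+.2f}")
print(f"Pearson r with one outlier included:     {r_with_outlier:+.2f}")
```

In this toy setup the correlation jumps from strongly negative to strongly positive purely because of the outlier, even though the metric tells us nothing useful about the systems we actually care to compare.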

As reported by Mathur et al. (2020), despite strong evidence of the shortcomings of BLEU, it continues to be the industry standard. As this research indicates, there are serious flaws in the way that automatic metrics are evaluated, which we briefly highlight in this post.

Findings

The most recent WMT (reported in Ma et al., 2019) found that, with a large number of systems, there were discrepancies in the correlation between the best metrics and human scores, depending on the number of systems considered.
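The kind of analysis behind this finding can be sketched as follows (a minimal illustration with synthetic scores, not the actual WMT data): recompute the metric-human Pearson correlation while restricting the evaluation to the top-N systems by human score, and watch how unstable it becomes.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic per-system scores: 20 systems, human scores sorted best-first,
# and a noisy "BLEU-like" metric that loosely tracks them.
n_systems = 20
human  = np.sort(rng.normal(0.0, 1.0, n_systems))[::-1]
metric = 20.0 + 8.0 * human + rng.normal(0.0, 4.0, n_systems)

# Correlation over all systems looks healthy, but shrinks (or becomes
# erratic) once only the top few, closely matched systems are considered.
for top_n in (20, 10, 6, 4):
    r, _ = pearsonr(human[:top_n], metric[:top_n])
    print(f"Pearson r over top-{top_n:2d} systems: {r:+.2f}")
```

The design point is simple: the full pool of systems spans a wide quality range, which inflates the correlation; among the top systems the genuine quality differences are small relative to the metric's noise, so the correlation tells us much less about which system is really best.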
