Issue #92 – The Importance of References in Evaluating MT Output
30 Jul 2020
Author: Dr. Carla Parra Escartín, Global Program Manager @ Iconic
Introduction
Over the years, BLEU has become the “de facto standard” for automatic Machine Translation evaluation. However, despite being the metric referenced in virtually every MT research paper, it is equally criticized for not providing a reliable evaluation of MT output. In today’s blog post we look at the work of Freitag et al. (2020), who investigate to what extent BLEU (Papineni et al., 2002) is to blame, as opposed to the references used to evaluate the MT output.
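To make the dependence on references concrete, here is a minimal sketch using the open-source sacrebleu implementation of BLEU. The example sentences are invented for illustration and are not taken from Freitag et al. (2020); the point is simply that the same MT output can score very differently against two equally acceptable references.

# Minimal sketch with sacrebleu (pip install sacrebleu).
# The sentences below are invented for illustration only.
import sacrebleu

# One system output for a single source sentence.
hypothesis = ["The cat sat on the mat."]

# Two equally valid human translations of the same source sentence.
reference_a = [["The cat sat on the mat."]]
reference_b = [["There was a cat sitting on the mat."]]

# Identical MT output, very different BLEU depending on the reference used.
print(sacrebleu.corpus_bleu(hypothesis, reference_a).score)  # 100.0
print(sacrebleu.corpus_bleu(hypothesis, reference_b).score)  # much lower

This is the crux of the problem the paper examines: a low BLEU score may say as much about the reference as it does about the translation.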
In our very first blog post on evaluation (#8), Dr. Sheila Castilho was already questioning the quality of the data we use, asking whether MT evaluation results can be trustworthy if the quality of the data sets is very poor. In issues #80 and #81 we also reviewed a set of recommendations made recently by Läubli et al. (2020) for performing MT evaluations aimed at assessing Human Parity in MT. Finally, in issues #87 and #90