Issue #92 – The Importance of References in Evaluating MT Output

30 July 2020

Author: Dr. Carla Parra Escartín, Global Program Manager @ Iconic

Introduction

Over the years, BLEU has become the “de facto standard” for automatic Machine Translation evaluation. However, despite being the metric referenced in virtually every MT research paper, it is equally criticized for not providing a reliable evaluation of MT output. In today’s blog post we look at the work of Freitag et al. (2020), who investigate to what extent BLEU (Papineni et al., 2002) itself is to blame, as opposed to the references used to evaluate the MT output.
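As a quick illustration of why the reference matters (a toy example of our own, not taken from Freitag et al., and assuming the sacrebleu Python library is installed), the following sketch scores the same MT output against two different, equally valid human references:

```python
# A minimal sketch (our own toy example, not from Freitag et al. 2020),
# assuming sacrebleu is available: pip install sacrebleu
import sacrebleu

# One MT output sentence to be evaluated.
hypothesis = ["The cat sat on the mat."]

# Two different, equally valid human references for the same source sentence.
reference_a = ["The cat sat on the mat."]        # matches the MT output exactly
reference_b = ["A cat was sitting on the mat."]  # same meaning, different wording

# BLEU measures n-gram overlap between hypothesis and reference,
# so the choice of reference directly determines the score.
score_a = sacrebleu.corpus_bleu(hypothesis, [reference_a])
score_b = sacrebleu.corpus_bleu(hypothesis, [reference_b])

print(f"BLEU against reference A: {score_a.score:.1f}")  # 100.0 (exact match)
print(f"BLEU against reference B: {score_b.score:.1f}")  # far lower, same meaning
```

Even though both references are perfectly acceptable translations, the scores diverge sharply. That sensitivity to the particular reference used is exactly the kind of effect Freitag et al. (2020) set out to investigate.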

In our very first blog post on evaluation (#8), Dr. Sheila Castilho already questioned the quality of the data we use, asking whether MT evaluation results can be trusted if the quality of the data sets is very poor. In issues #80 and #81 we also reviewed a set of recommendations recently made by Läubli et al. (2020) for carrying out MT evaluations aimed at assessing human parity in MT. Finally, in issues #87 and #90
