Machine Translation Weekly 82: Multimodal Translation and the Visual Context
This week I am going to discuss (and criticize) a paper on multimodal machine
translation that attempts, once again, to evaluate whether and how visual
information can be useful in machine translation. The paper is titled
Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual
Context in Multimodal Machine Translation;
its authors come from several institutions in China and Hong Kong, and it will
appear at this year’s ACL.
Multimodal machine translation (also the topic of my dissertation) is defined
as the translation of image captions where both the caption text and the image
itself are inputs to the translation system. There is a specialized dataset
for this task, Multi30k, with 30k simple sentences accompanied by images. Over
time, the dataset became too easy
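
To make the two-input setup concrete, here is a minimal sketch (not the model
from the paper) of how a multimodal encoder might combine a caption with
pre-extracted image features, e.g. a pooled CNN vector. The 2048-dimensional
feature size and all hyperparameters below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ToyMultimodalEncoder(nn.Module):
    """Toy encoder: prepends a projected image feature vector to the
    embedded caption tokens, so the Transformer encoder can attend to
    the visual context alongside the text."""

    def __init__(self, vocab_size=10000, d_model=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Hypothetical pre-extracted image feature (e.g. a 2048-d
        # pooled CNN vector), projected to the model dimension
        self.img_proj = nn.Linear(img_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, caption_ids, img_feat):
        # caption_ids: (batch, seq_len), img_feat: (batch, img_dim)
        tokens = self.embed(caption_ids)               # (batch, seq, d_model)
        visual = self.img_proj(img_feat).unsqueeze(1)  # (batch, 1, d_model)
        # The image enters as one extra "token" at position 0
        return self.encoder(torch.cat([visual, tokens], dim=1))

enc = ToyMultimodalEncoder()
out = enc(torch.randint(0, 10000, (2, 12)), torch.randn(2, 2048))
print(out.shape)  # torch.Size([2, 13, 256])
```

The only point here is the data flow: the image is just one more input the
encoder can attend to, which the model is free to use or to ignore.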