Highlights from Machine Translation and Multilinguality in December 2023 and January 2024
Many things happened in the field in December: EMNLP took place, Google released Gemini,
and Mixtral appeared. January was seemingly not as packed with new events,
but plenty of interesting new work popped up on arXiv.
Predicting Human Translation Difficulty with Neural Machine Translation
Folks from the University of Melbourne found that features from NMT models, most
notably target-sentence perplexity and something they call flow features,
are good predictors of human translation time.
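To make the perplexity feature concrete, here is a minimal sketch (not the authors' code) of how target-sentence perplexity under an off-the-shelf NMT model could be computed with Hugging Face Transformers; the specific checkpoint and the example sentence pair are illustrative assumptions.

```python
# Sketch: perplexity of a human translation under an NMT model.
# The checkpoint and example sentences are illustrative assumptions.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # any MarianMT checkpoint would do
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).eval()

def target_perplexity(source: str, target: str) -> float:
    """Perplexity of `target` given `source` under the NMT model."""
    batch = tokenizer(source, text_target=target, return_tensors="pt")
    with torch.no_grad():
        # `loss` is the mean token-level cross-entropy over the target sentence.
        loss = model(**batch).loss
    return torch.exp(loss).item()

# A higher value means the NMT model finds the reference translation less
# predictable, which the paper reports correlates with human translation time.
print(target_perplexity("The proposal was rejected.",
                        "Der Vorschlag wurde abgelehnt."))
```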
Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?
A preprint from Zurich and Edinburgh experiments with instruction tuning of an
English-centric LLM in multiple languages and finds that it helps the cross-lingual
performance of LMs a lot. They use (authentic, i.e., not machine-translated)
data from the OpenAssistant
dataset and try
finetuning