Highlights from Machine Translation and Multilinguality in Summer 2024
Here are summaries of a few papers that I liked during the (long academic) summer.
BertaQA: How Much Do Language Models Know About Local Culture?
People from the University of the Basque Country prepared a QA dataset consisting of local knowledge about the Basque Country, hopefully including facts that might not exist on the English-speaking Internet, and contrast that with global (but it probably means Western) facts. The questions are in multiple-choice style. Then, they asked professional translators to translate the questions into English. They experimented with SoTA LLMs at that time (LLaMA2, Gemma, and a few commercial ones) and observed that LLMs are much worse at local knowledge than at global knowledge. The most interesting finding is that finetuning the models on Basque improves the local QA accuracy even when the questions are asked in English.
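Since the benchmark is multiple-choice, a common way to evaluate open-weight causal LLMs on it is to score each answer option by its log-likelihood given the question and pick the highest-scoring one. Below is a minimal sketch of that approach; the paper's exact prompting and scoring setup may differ, and the model choice, example question, and options here are my own illustrative assumptions.

```python
# Minimal sketch: multiple-choice QA evaluation by option log-likelihood.
# The model, example question, and options are illustrative assumptions,
# not the exact setup from the BertaQA paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probs of `option`, conditioned on the question."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    # Assumes the prompt tokenization is a prefix of the full tokenization,
    # which typically holds for BPE tokenizers when the option is appended
    # with a leading space.
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t of the (shifted) logits predicts the token at position t + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    option_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[t, full_ids[0, t + 1]].item() for t in option_positions)


def predict(question: str, options: list[str]) -> int:
    """Return the index of the option with the highest log-likelihood."""
    return max(range(len(options)), key=lambda i: option_logprob(question, options[i]))


# Hypothetical item in the style of the local-knowledge subset:
question = "Which city hosts the Aste Nagusia festival?"
options = ["Bilbao", "Madrid", "Lisbon"]
print(options[predict(question, options)])
```

Summing raw log-probabilities favors shorter options; length-normalizing the score (dividing by the number of option tokens) is a common variant when option lengths differ a lot.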