Highlights from Machine Translation and Multilinguality in October 2024

Here are summaries of a few preprints that I noticed on arXiv during October. LangSAMP: Language-Script Aware Multilingual Pretraining Folks from LMU Munich try a relatively simple trick to improve multilingual encoder models, particularly for non-Latin-script and low-resource languages. They use additional information about the language identity and the script, but only during training, so at inference time, we can still use the model without caring about what language we feed in. They add static language and script embeddings before the […]
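The idea of injecting static language and script embeddings only during training can be sketched roughly as below. This is a toy illustration under my own assumptions (function names, dimensions, and the plain-list representation are all illustrative, not from the paper): the extra vectors are simply added to token embeddings when language labels are available, and omitted at inference.

```python
import random

def make_embedding(n, dim, seed=0):
    """Random embedding table as a list of n vectors of size dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(n)]

def embed(token_ids, tok_emb, lang_vec=None, script_vec=None):
    """Token embeddings plus optional static language/script vectors.
    During pretraining, lang_vec/script_vec are passed; at inference
    they are omitted, so the model runs without language labels."""
    out = []
    for t in token_ids:
        v = list(tok_emb[t])
        if lang_vec is not None:
            v = [a + b for a, b in zip(v, lang_vec)]
        if script_vec is not None:
            v = [a + b for a, b in zip(v, script_vec)]
        out.append(v)
    return out
```

The training-time call passes the language/script vectors; the inference-time call leaves them out, so the rest of the encoder never needs a language label at test time.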

Read more

Highlights from Machine Translation and Multilinguality in Summer 2024

Here are summaries of a few papers that I liked during the (long academic) summer. BertaQA: How Much Do Language Models Know About Local Culture? People from the University of the Basque Country prepared a QA dataset consisting of local knowledge about the Basque Country, hopefully including facts that might not exist on the English-speaking Internet, and contrast that with global (but probably Western) facts. The questions are multiple-choice. Then, they asked professional translators to […]

Read more

Lessons learned from analyzing values in multilingual encoders and what it means for LLMs

This post is a retrospective on two studies of multilingual sentence embeddings that we published a year ago, with comments on what I think people analyzing LLMs today should take away from them. In late 2022, we (the work was mainly done by Kathy Hämmerl from Munich and Björn Deiseroth and Patrick Schramowski from Darmstadt) finished a paper called Speaking Multiple Languages Affects the Moral Bias of Language Models (later published in Findings of ACL 2023), where we tried to compare […]

Read more

Highlights from Machine Translation and Multilinguality in May 2024

Here are short summaries of three pre-prints that I enjoyed reading in May. Zero-Shot Tokenizer Transfer Folks from the University of Cambridge and the University of Edinburgh propose a nice trick for changing the vocabulary of an already trained language model. They train a hyper-network (a neural network that predicts parameters of a different neural network) that predicts what embeddings a token would have if it were trained with the rest of the model. For each training batch, they build […]
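The core idea (an embedding for an unseen token is predicted from its surface form rather than trained) can be illustrated with a deliberately trivial stand-in for the hyper-network. Everything here is my own toy assumption: the real hyper-network is a trained neural model, not a character-average, and the names below are invented for illustration.

```python
def predict_token_embedding(token, char_emb, dim=4):
    """Toy stand-in for a hyper-network: map a new token's surface
    string to an embedding by averaging character vectors. The real
    model learns this mapping jointly with the language model."""
    vecs = [char_emb.get(c, [0.0] * dim) for c in token]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

The point of the sketch is only the interface: given a token string that the original vocabulary never contained, we can still produce an embedding for it, which is what makes swapping tokenizers without retraining possible.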

Read more

Highlights from Machine Translation and Multilinguality in April 2024

Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation Folks from the University of the Basque Country prepared an English-Spanish dataset for natural language inference (i.e., deciding if sentences follow from each other, are in contradiction, or have nothing to do with each other) with metaphorical expressions. Unlike with the standard version of this task (XNLI), which does not use figurative language, there is a large gap between in-language training and cross-lingual transfer. (Transfer means that we finetune a multilingual […]

Read more

Highlights from Machine Translation and Multilinguality in March 2024

Did Translation Models Get More Robust Without Anyone Even Noticing? Folks from Lisbon study how robust the newest MT systems are against source-side noise. Machine translation using large models, including translation-specific NLLB or via LLMs (such as Tower or GPT-3.5), is much more robust both to synthetic noise (the nice feature of synthetic noise is that you can check the translation quality at different noise levels) and to real-world noisy data from social networks. Tracing the Roots of Facts in […]

Read more

Highlights from Machine Translation and Multilinguality in February 2024

With a new month, here are a few papers that I noticed on arXiv in February. Linear-time Minimum Bayes Risk Decoding with Reference Aggregation A preprint from the University of Zurich proposes a linear-time version of Minimum Bayes Risk (MBR) decoding in machine translation. This decoding algorithm does not aim to generate the most probable sequence given the model but the most typical one. This is typically done by sampling dozens of candidate output sentences, from which we select […]
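Standard (quadratic) MBR decoding, the baseline the preprint speeds up, can be sketched as follows. This is a minimal illustration under my own assumptions: the `overlap` utility is a toy token-overlap score standing in for a real metric like chrF or a neural metric, and the function names are mine.

```python
def mbr_decode(candidates, utility):
    """Pick the candidate with the highest average utility against all
    other candidates, i.e., the most 'typical' translation. This is the
    quadratic baseline: every pair of candidates is compared."""
    best, best_score = None, float("-inf")
    for hyp in candidates:
        # Expected utility: every other candidate acts as a pseudo-reference.
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

def overlap(hyp, ref):
    """Toy utility: Jaccard overlap of token sets."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

candidates = ["the cat sat", "the cat sat down", "a dog ran"]
print(mbr_decode(candidates, overlap))
```

The double loop over candidates is exactly the quadratic cost that reference aggregation reduces to linear time.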

Read more

Highlights from Machine Translation and Multilinguality in December 2023 and January 2024

Many things happened in the field in December: EMNLP, Google released Gemini, and Mixtral appeared. January was seemingly not that packed with new events, but plenty of interesting new work popped up on arXiv. Predicting Human Translation Difficulty with Neural Machine Translation Folks from the University of Melbourne found that features from NMT, most notably the target-sentence perplexity and something they call flow features, are a good predictor of human translation time. Turning English-centric LLMs Into Polyglots: How […]
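For readers unfamiliar with the perplexity feature mentioned above, here is how sentence-level perplexity is computed from per-token log-probabilities. The function name is mine; the formula itself is the standard definition (exponential of the negative mean log-probability).

```python
import math

def sentence_perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log):
    exp of the negative mean log-prob. A higher value means the
    model finds the sentence less predictable."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A sequence where every token has probability 0.5 has perplexity ~2.
sentence_perplexity([math.log(0.5)] * 4)
```

Intuitively, a target sentence the NMT model itself finds surprising is also one a human translator has to slow down for, which is why this is a plausible difficulty predictor.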

Read more

Highlights from Machine Translation and Multilinguality in October 2023

Here is my monthly summary of what papers on multilinguality and machine translation I found the most noteworthy during October 2023. There were 2,881 preprints in the computation and language category on arXiv (a new record number), so there is a big chance that there were preprints I would like to read that I missed. Navigating Cultural Chasms: Exploring and Unlocking the Cultural POV of Text-To-Image Models A preprint from Israeli Technion, Google Research, and Cambridge University studies cultural awareness […]

Read more

Highlights from Machine Translation and Multilinguality in November 2023

Here are a couple of articles that caught my attention in November. Narrowing the Gap between Zero- and Few-shot Machine Translation by Matching Styles A team from Johns Hopkins University published a pre-print that belongs to the currently trendy genre: stuff we can do with LLMs. This time, it is about how to use them efficiently for domain-specific machine translation. It is known that few-shot prompting works much better than zero-shot prompting, but you need to select proper parallel examples. […]

Read more