Highlights from Machine Translation and Multilinguality in October 2024
Here are summaries of a few preprints that I noticed on arXiv during October.
LangSAMP: Language-Script Aware Multilingual Pretraining
Folks from LMU Munich try a relatively simple trick to improve multilingual encoder models, particularly for non-Latin-script and low-resource languages. They use additional information about the language identity and the script, but only during training, so at inference time, we can still use the model without caring about what language we feed in. They add static language and script embeddings before the language-modeling head of the MLM objective, which is typically not used at inference time. Presumably, this way, the model does not have to care about what language and script the output should be and can focus more on meaning (whatever that is). This, in turn, should make the cross-lingual representations more language-neutral.
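A minimal PyTorch sketch of how I read the idea (my own illustration, not the authors' code; the module and argument names are made up): static language and script embeddings are added to the hidden states right before the MLM head during pretraining, and at inference time the head and the extra embeddings are simply not used.

```python
import torch
import torch.nn as nn

class LangScriptMLMHead(nn.Module):
    """MLM head with static language/script embeddings, used only in pretraining."""

    def __init__(self, hidden_size, vocab_size, num_languages, num_scripts):
        super().__init__()
        # Static (non-contextual) embeddings for language and script identity.
        self.lang_emb = nn.Embedding(num_languages, hidden_size)
        self.script_emb = nn.Embedding(num_scripts, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states, lang_id=None, script_id=None):
        # hidden_states: (batch, seq_len, hidden_size) from the encoder.
        if lang_id is not None and script_id is not None:
            # Pretraining: the head is told which language and script to
            # predict, so the encoder does not need to store that information.
            bias = self.lang_emb(lang_id) + self.script_emb(script_id)
            hidden_states = hidden_states + bias.unsqueeze(1)
        return self.lm_head(hidden_states)
```

At inference, the encoder's hidden states are used directly (e.g., for retrieval or fine-tuning) and this head is discarded, which is exactly why the language and script information never needs to be supplied downstream.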