Highlights from Machine Translation and Multilinguality in October 2024

Here are summaries of a few preprints that I noticed on arXiv during October.

LangSAMP: Language-Script Aware Multilingual Pretraining

Folks from LMU Munich try a relatively simple trick to improve multilingual encoder models, particularly for non-Latin-script and low-resource languages. They use additional information about the language identity and the script, but only during training, so at inference time we can still use the model without caring about what language we feed in. They add static language and script embeddings before the language modeling head of the MLM objective, which is typically not used at inference time anyway. Presumably, this way, the model does not have to care about what language and script the output should be and can focus more on meaning (whatever that is). This, in turn, should make cross-lingual transfer easier.
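A minimal PyTorch sketch of how such static embeddings could be injected before the MLM head; the module layout, names, and dimensions here are my assumptions, not the authors' actual implementation:

```python
import torch
import torch.nn as nn

class LangScriptMLMHead(nn.Module):
    """Sketch of the LangSAMP idea: static language/script embeddings are
    added to the encoder output right before the MLM head, so the head
    (rather than the encoder) absorbs language- and script-specific
    information. Hypothetical layout, not the authors' code."""

    def __init__(self, hidden_size, vocab_size, n_languages, n_scripts):
        super().__init__()
        # Static (non-contextual) embeddings, one vector per language / script.
        self.lang_emb = nn.Embedding(n_languages, hidden_size)
        self.script_emb = nn.Embedding(n_scripts, hidden_size)
        self.mlm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states, lang_id=None, script_id=None):
        # hidden_states: (batch, seq_len, hidden_size) from the encoder.
        if lang_id is not None and script_id is not None:
            # Training: tell the head which language/script it is predicting,
            # so the encoder output can stay more language-neutral.
            bias = self.lang_emb(lang_id) + self.script_emb(script_id)
            hidden_states = hidden_states + bias.unsqueeze(1)
        # At inference the MLM head (and with it these embeddings) is dropped
        # anyway, so the encoder is used without any language/script input.
        return self.mlm_head(hidden_states)
```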