Machine Translation Weekly 81: Unsupervised MT and Parallel Sentence Mining
This week I am going to briefly comment on a paper that uses unsupervised
machine translation to improve unsupervised scoring for parallel data mining.
The paper, titled Unsupervised Multilingual Sentence Embeddings for
Parallel Corpus Mining, has authors
from Charles University and the University of the Basque Country and will
appear at this year’s ACL Student Research Workshop.
The idea of the paper is quite simple. They took XLM, a BERT-like model that
was trained for 100 languages using the masked language modeling objective
(randomly masking words in the input and predicting what the missing words are).
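The masking step of this objective can be sketched in a few lines. This is a simplified illustration, not the XLM implementation: the mask probability, the `[MASK]` symbol, and the word-level tokenization are assumptions for clarity (real models operate on subwords and mix in random or kept tokens).

```python
import random

MASK = "[MASK]"  # placeholder symbol; the real vocabulary has a dedicated mask token


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace a fraction of tokens with a mask symbol.

    Returns the masked sequence and a mapping from masked positions to
    the original tokens, which the model is trained to predict.
    """
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the training loss is computed on these positions only
        else:
            masked.append(tok)
    return masked, targets


tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
```

The model sees `masked` as input and is penalized only for its predictions at the positions stored in `targets`, which is what lets it learn contextual representations without any parallel supervision.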
In such a setup, nothing forces the model to represent parallel sentences
similarly, although it happens to some extent anyway. This property dramatically improves
when the model is provided with parallel sentences and