Highlights from Machine Translation and Multilinguality in March 2022
Here is my monthly summary of what I found most interesting on arXiv
in machine translation and multilinguality. This month was the camera-ready
deadline for ACL 2022, so many of the interesting papers are accepted to ACL.
Overlapping BPE
When training BPE, the merges do not actually have to follow the simple objective
of merging the most frequent token pair. In massively multilingual models,
there is an imbalance between languages, and some of them get segmented almost
down to characters. Therefore, we might want to have a higher vocabulary
overlap between languages. A paper from IIT Bombay and
Google that will appear at ACL suggests
interpolating the bigram frequency with a factor reflecting how many
languages the particular merge would appear in. This leads to