Issue #94 – Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation
13 Aug 2020
Author: Dr. Chao-Hong Liu, Machine Translation Scientist @ Iconic
Introduction
Curating corpora of high-quality sentence pairs is a fundamental task in building Machine Translation (MT) systems. Such data can be obtained from Translation Memory (TM) systems, where human translations are recorded. In most cases, however, we have no TM databases, only comparable corpora, e.g. news articles covering the same story in different languages. In this post, we review an unsupervised parallel sentence extraction method based on bilingual word embeddings (BWEs) by Hangya and Fraser (2019).
Parallel Segment Detection Using Bilingual Word Embeddings
Word embeddings represent the meaning of a word as a point in a multidimensional space, where words with similar meanings appear close to each other. A key insight behind this technology is that the space can be shared across languages, which makes it useful for many tasks in MT. Recent developments in unsupervised bilingual word embeddings (BWEs) have even enabled the building of MT systems using only monolingual corpora (Lample et al., 2018).
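To make the idea of a shared embedding space concrete, here is a minimal sketch of a cross-lingual nearest-neighbour lookup using cosine similarity. The three-dimensional vectors, the word lists, and the helper names `cosine` and `nearest` are all invented for illustration; real BWEs have hundreds of dimensions and are learned from corpora, and this is not the extraction method of Hangya and Fraser (2019) itself.

```python
import math

# Toy vectors in a shared English-German space.
# All values are made up for illustration only.
EN = {
    "house": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.1],
}
DE = {
    "Haus": [0.8, 0.2, 0.1],
    "Hund": [0.2, 0.8, 0.0],
    "Auto": [0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, source, target):
    """Return the target-language word whose vector is closest to `word`."""
    return max(target, key=lambda t: cosine(source[word], target[t]))


if __name__ == "__main__":
    for w in EN:
        print(w, "->", nearest(w, EN, DE))
```

Because both languages live in the same space, a word's translation candidates can be found simply by looking for its nearest neighbours on the other side, without any parallel data at lookup time.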
Fig. 1 shows an example of the approach