Issue #132 – Tokenization strategies for Korean MT tasks
27 May 2021
Author: Dr. Jingyi Han, Machine Translation Scientist @ Iconic
Introduction
Asian languages have always been challenging for machine translation (MT) because their grammar and writing systems differ so much from those of Western languages. Chinese and Japanese have dedicated segmenters, since these languages do not put spaces between words. Korean does separate words with spaces, but is a standard tokenizer designed for Western languages good enough for it? In this post, we take a look at a paper by Park et al. (2020), in which they conducted a set of