Issue #132 – Tokenization strategies for Korean MT tasks

27 May 2021

in Model improvement, The Neural MT Weekly

Author: Dr. Jingyi Han, Machine Translation Scientist @ Iconic

Introduction

Asian languages have long been challenging for machine translation (MT) because their grammar and writing systems differ substantially from those of Western languages. Chinese and Japanese have dedicated segmenters because these languages do not put spaces between words. Korean does separate words with spaces, but is a standard tokenizer designed for Western languages good enough for it? In this post, we take a look at a paper by Park et al. (2020), in which they conducted a set of
