Issue #136 – Neural Machine Translation without Embeddings
28 Jun 2021
Author: Dr. Jingyi Han, Machine Translation Scientist @ Language Weaver
Introduction
Byte Pair Encoding (BPE) has become one of the most commonly used tokenization strategies thanks to its universality and its effectiveness in handling rare words. Although much previous work shows that subword models with embedding layers generally achieve more stable and competitive results in neural machine translation (NMT), character-based (see issue #60) and byte-based (see issue #64) models have also proven better in certain scenarios. In this post, we take a look at work by Shaham and Levy (2021), which investigates a simple but universal byte tokenization strategy without embedding layers for NMT.
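To make the contrast concrete, here is a minimal sketch of what byte tokenization means in practice: instead of learning a subword vocabulary with BPE merges, the input text is simply its raw UTF-8 byte sequence, giving a fixed vocabulary of at most 256 symbols (plus any special tokens). The snippet below is purely illustrative and not code from the paper.

```python
# Byte tokenization: encode text as its raw UTF-8 bytes.
# The "vocabulary" is fixed at 256 possible byte values,
# so no learned merge rules or large embedding table is needed.
text = "héllo"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)  # 'é' is encoded as two bytes: 195, 169
```

Note that non-ASCII characters expand into multiple bytes, so byte sequences are longer than the corresponding character or subword sequences.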
Embeddingless model with byte tokenization
UTF-8 is an