Issue #136 – Neural Machine Translation without Embeddings
28 Jun 2021
Author: Dr. Jingyi Han, Machine Translation Scientist @ Language Weaver
Introduction
Byte Pair Encoding (BPE) has become one of the most commonly used tokenization strategies thanks to its universality and its effectiveness in handling rare words. Although much previous work shows that subword models with embedding layers generally achieve more stable and competitive results in neural machine translation (NMT), character-based (see issue #60) and byte-based (see issue #64) models have also proven better in certain scenarios. In this post, we take a look at work by Shaham and Levy (2021), which investigates a simple but universal byte tokenization strategy without embedding layers for NMT.
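To make the contrast concrete, here is a minimal sketch of what byte tokenization means in practice: instead of learning a subword vocabulary with BPE merges, the input text is simply its raw UTF-8 byte sequence, giving a fixed vocabulary of at most 256 symbols (plus any special tokens). The snippet below is purely illustrative and not code from the paper.

```python
# Byte tokenization: encode text as its raw UTF-8 bytes.
# The "vocabulary" is fixed at 256 possible byte values,
# so no learned merge rules or large embedding table is needed.
text = "héllo"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)  # 'é' is encoded as two bytes: 195, 169
```

Note that non-ASCII characters expand into multiple bytes, so byte sequences are longer than the corresponding character or subword sequences.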
Embeddingless model with byte tokenization
UTF-8 is an