Issue #121 – Finding the Optimal Vocabulary Size for Neural Machine Translation
11 Mar 2021
Author: Akshai Ramesh, Machine Translation Scientist @ Iconic
Introduction
Sennrich et al. (2016) introduced a variant of byte pair encoding (BPE) (Gage, 1994) for word segmentation, which can encode open vocabularies with a compact symbol vocabulary of variable-length subword units. With BPE, Neural Machine Translation (NMT) systems are capable of open-vocabulary translation, representing rare and unseen words as sequences of subword units.
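The core of BPE learning is simple: starting from a character-level vocabulary, repeatedly find the most frequent pair of adjacent symbols and merge it into a single new symbol. A minimal sketch of that learning loop (using the toy corpus from Sennrich et al.'s paper; the helper names are illustrative, not from any library):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq
            for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated sequence of characters
# ending in an end-of-word marker, weighted by its corpus frequency.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # the single BPE hyperparameter: number of merge operations
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
```

After ten merges, frequent whole words such as `newest</w>` and `low</w>` have become single symbols, while rarer words remain split into subword units; raising `num_merges` grows the symbol vocabulary toward whole words, lowering it pushes segmentation back toward characters.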
Today, subword tokenisation schemes inspired by BPE have become the norm across many Natural Language Processing tasks. The BPE algorithm has a single hyperparameter – “number of merge operations” – that governs the vocabulary size. According to