Issue #64 – Neural Machine Translation with Byte-Level Subwords
13 Dec 2019
Author: Dr. Patrik Lambert, Machine Translation Scientist @ Iconic
In order to limit vocabulary size, most neural machine translation engines are based on subwords. In some settings, character-based systems are even better (see issue #60). However, rare characters in noisy data or in character-based languages can unnecessarily occupy vocabulary slots and limit how compact the vocabulary can be. In this post we take a look at an alternative, proposed by Wang et al. (2019), in which subwords are learned directly from bytes instead of characters.
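To make the vocabulary argument concrete, here is a small, purely illustrative Python snippet (the toy corpus is made up, not taken from the post or the paper): a single noisy line adds several rare characters, each of which would claim its own slot in a character-derived vocabulary.

```python
# Toy illustration (assumed data): rare characters inflate a character inventory.
corpus_clean = ["the cat sat on the mat", "la traduction automatique"]
corpus_noisy = corpus_clean + ["☃ ¯\\_(ツ)_/¯ 𠀋"]  # one noisy line with rare symbols

chars_clean = {c for line in corpus_clean for c in line}
chars_noisy = {c for line in corpus_noisy for c in line}

print(len(chars_clean))  # small base character inventory
print(len(chars_noisy))  # every rare character adds a vocabulary slot of its own
```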
Byte-Level Text Representation
In UTF-8 encoding, each character is encoded into 1 to 4 bytes. This allows us to model a sentence as a sequence of bytes instead of characters. While there are about 138,000 Unicode characters, any sentence can be represented as a sequence of UTF-8 bytes, which uses only 248 of the 256 possible byte values. A byte-level representation of text is up to 4 times longer than a character-level one, and thus computationally more expensive. As an alternative, Wang et al. segment the byte sequence into byte-level subwords, that is, byte n-grams. They do so with the Byte Pair Encoding (BPE) algorithm, applied to bytes rather than characters, which yields byte-level BPE (BBPE).
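Below is a minimal sketch of byte-level BPE training, written for illustration only; it is not Wang et al.'s implementation, and the toy corpus and the function name bbpe_train are assumptions for the example. Each sentence is first turned into its UTF-8 byte sequence, and the most frequent adjacent pair of symbols is then merged repeatedly, so the learned subwords are byte n-grams that may even cut across character boundaries.

```python
from collections import Counter

def bbpe_train(corpus, num_merges):
    """Learn byte-level BPE merges: start from raw UTF-8 bytes and repeatedly
    merge the most frequent adjacent pair of symbols into a byte n-gram."""
    # Each sentence starts as a list of single-byte symbols.
    sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b  # a byte n-gram; it need not decode to whole characters
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return merges, sequences

corpus = ["byte level subwords", "byte pair encoding", "字节级子词"]
merges, segmented = bbpe_train(corpus, num_merges=10)
print(merges[:3])    # most frequent byte pairs merged first
print(segmented[0])  # the first sentence as byte-level subwords
```

In essence, this is what a standard BPE implementation does when it is fed bytes rather than characters; a complete system additionally has to handle subwords that end in partial characters when converting model output back into text.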