Unsupervised text tokenizer focused on computational efficiency
YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than the Hugging Face, fastBPE and SentencePiece implementations. In some test cases, it is 90 times faster. Check out our benchmark results.
Key advantages:
- Multithreading for training and tokenization
- The algorithm has O(N) complexity, where N is the length of training data
- Highly efficient implementation in C++
- Python wrapper and command-line interface
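
For readers unfamiliar with BPE, the idea can be sketched in a few lines of plain Python. This is a deliberately naive illustration and not YouTokenToMe's implementation: it recounts all pairs on every merge, so it does not achieve the O(N) complexity mentioned above, and it is single-threaded. The function names `train_bpe`, `merge_pair` and `encode` are hypothetical and chosen for this example only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (naive version)."""
    # Every word starts out as a sequence of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the whole corpus with the most frequent pair merged.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Tokenize one word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_corpus = ["low", "low", "lower", "lowest"]
    rules = train_bpe(toy_corpus, num_merges=2)
    print(rules)
    print(encode("lowest", rules))
```

Each training step scans the full corpus, so the cost of this sketch grows with the number of merges times the corpus size. Avoiding that repeated scan is exactly where an efficient implementation differs from the naive one.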
Extra features:
As well as in the