Hugging Face Releases New NLP ‘Tokenizers’ Library Version (v0.8.0)
Hugging Face is at the forefront of a lot of updates in the NLP space. They have released one groundbreaking NLP library after another in the last few years. Honestly, I have learned and improved my own NLP skills a lot thanks to the work open-sourced by Hugging Face.
And today, they’ve released another big update: a brand new version of their popular Tokenizers library.
A Quick Introduction to Tokenization
So, what is tokenization? Tokenization is a crucial cog in Natural Language Processing (NLP). It’s a fundamental step both in traditional NLP methods like Count Vectorizer and in advanced deep learning-based architectures like Transformers.
Tokens are the building blocks of Natural Language.
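To make "building blocks" concrete, here is a minimal sketch in plain Python that breaks text into tokens at three granularities. It does not use the Tokenizers library API; the sample sentence, the helper name toy_subword_split, and the tiny vocabulary are all assumptions made up for illustration, and the greedy longest-match split is only a toy stand-in for real subword algorithms.

```python
# Illustrative only: plain Python, not the Hugging Face Tokenizers API.
text = "Never give up"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: every non-space character becomes a token.
char_tokens = [ch for ch in text if not ch.isspace()]


def toy_subword_split(word, vocab):
    """Greedy longest-match split of `word` using a hypothetical vocabulary.

    Falls back to single characters when no vocabulary entry matches.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary, chosen by hand for this example.
toy_vocab = {"token", "ization"}
subword_tokens = toy_subword_split("tokenization", toy_vocab)

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```

The same text yields a few long tokens at the word level and many short ones at the character level, with subwords sitting in between.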
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be words, characters, or subwords. Hence, tokenization can