What is Tokenization in NLP? Here’s All You Need To Know
Highlights
- Tokenization is a key (and mandatory) aspect of working with text data
- We’ll discuss the various nuances of tokenization, including how to handle Out-of-Vocabulary (OOV) words
Introduction
Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect. If you’ve ever picked up a language that wasn’t your mother tongue, you’ll relate to this! There are so many layers to peel back and syntax rules to consider – it’s quite a challenge.
And it’s much the same for our machines. In order to get a computer to understand any text, we need to break that text down into smaller units it can work with. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in.
Simply put, we can’t work with text data if we don’t perform tokenization. Yes, it’s really that important!
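To make this concrete, here’s a minimal sketch of what tokenization can look like in practice, using plain Python string splitting and a simple regular expression. The sample sentence and the token rules are just illustrative assumptions, not the only way to do it:

```python
import re

text = "Language is a thing of beauty."

# Naive whitespace tokenization: split on spaces only
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['Language', 'is', 'a', 'thing', 'of', 'beauty.']

# A slightly smarter rule: treat words and punctuation as separate tokens
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
# ['Language', 'is', 'a', 'thing', 'of', 'beauty', '.']
```

Notice how the naive split leaves the period glued to “beauty.”, while the regex-based rule separates it out – real-world tokenizers (such as those in NLTK or spaCy) handle many more such edge cases.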
And here’s the intriguing thing about tokenization