Part 3: Step by Step Guide to NLP – Text Cleaning and Preprocessing
Introduction
This article is part of an ongoing blog series on Natural Language Processing (NLP). In part-1 and part-2 of this series, we covered the theoretical concepts related to NLP. Continuing from there, this article introduces some new concepts.
In this article, we will first go over the required terminology and then begin our journey into text cleaning and preprocessing, which is a crucial component of any NLP task.
This is part-3 of the blog series on the Step by Step Guide to Natural Language Processing.
Table of Contents
1. Familiar with Terminologies
- Corpus
- Tokens
- Tokenization
- Text object
- Morpheme
- Lexicon
2. What is Tokenization?
- White-space Tokenization
- Regular Expression Tokenization
- Sentence and Word Tokenization
3. Noise Entities Removal
- Removal of Punctuation marks
- Removal of stopwords, etc.
4. Data Visualization for Text Data
5. Parts of Speech (POS) Tagging
Familiar with Terminologies