Part 3: Step by Step Guide to NLP – Text Cleaning and Preprocessing

This article was published as a part of the Data Science Blogathon

Introduction

This article is part of an ongoing blog series on Natural Language Processing (NLP). In part-1 and part-2 of this series, we covered the theoretical concepts related to NLP. Continuing from those parts, this article introduces some new concepts.

In this article, we will first get familiar with the required terminology and then begin our journey into text cleaning and preprocessing, a crucial component of any NLP task.

This is part-3 of the blog series on the Step by Step Guide to Natural Language Processing.

Table of Contents

1. Familiar with Terminologies

Corpus
Tokens
Tokenization
Text object
Morpheme
Lexicon

2. What is Tokenization?

White-space Tokenization
Regular Expression Tokenization
Sentence and Word Tokenization

3. Noise Entities Removal

Removal of Punctuation marks
Removal of stopwords, etc.

4. Data Visualization for Text Data

5. Parts of Speech (POS) Tagging

Familiar with Terminologies

