A Quick Guide to Text Cleaning Using the nltk Library
This article was published as a part of the Data Science Blogathon.
Introduction
NLTK is a string-processing library: it takes strings as input and returns either a string or a list of strings. It provides many algorithms, which makes it especially useful for learning, since you can compare the outputs of different variants. Other libraries include spaCy, CoreNLP, PyNLPl, and Polyglot. NLTK and spaCy are the most widely used; spaCy works well with large volumes of text and for advanced NLP.
To build an understanding of the basic text-cleaning steps, I’m using the NLTK library, which is great for learning.
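As a small illustration of the string-in, string-or-list-out behaviour described above, here is a minimal sketch. It assumes NLTK is installed (pip install nltk). The sample sentence and variable names are made up for this example, and the two functions were chosen only because they need no extra data downloads.

```python
# A minimal sketch: NLTK functions take a string and return
# either a list of strings or a single string.
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import PorterStemmer

text = "NLTK takes strings as input."  # made-up sample sentence

# String in, list of strings out: a regex-based tokenizer that
# splits words from punctuation and needs no downloaded models.
tokens = wordpunct_tokenize(text)
print(tokens)  # ['NLTK', 'takes', 'strings', 'as', 'input', '.']

# String in, string out: the Porter stemmer reduces a word to its stem.
stemmer = PorterStemmer()
stem = stemmer.stem("cleaning")
print(stem)  # clean
```

Other NLTK tokenizers and stemmers follow the same pattern, which is what makes it easy to swap them in and compare their outputs.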
Data scraped from websites is mostly raw text. It needs to be cleaned before you analyze it or fit a model to it. Cleaning the text is necessary to highlight the attributes that you’re going to want your machine learning system to pick up on. Cleaning