How to Encode Text Data for Machine Learning with scikit-learn
Last Updated on June 28, 2020
Text data requires special preparation before you can start using it for predictive modeling.
The text must first be parsed to split it into words, a step called tokenization. The words then need to be encoded as integers or floating point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).
The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.
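To make the two steps concrete, here is a minimal sketch that uses CountVectorizer with its default settings. The example sentence and the variable names are my own and are only there for illustration. The analyzer performs the tokenization, and fit_transform() performs the feature extraction.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A single illustrative document (made up for this sketch).
text = ["The quick brown fox jumped over the lazy dog."]

vectorizer = CountVectorizer()

# Step 1, tokenization: the default analyzer lowercases the text,
# drops punctuation, and splits it into word tokens.
analyze = vectorizer.build_analyzer()
tokens = analyze(text[0])
print(tokens)

# Step 2, feature extraction: learn the vocabulary and encode the
# document as a vector of word counts.
counts = vectorizer.fit_transform(text)
vocabulary = vectorizer.vocabulary_
row = counts.toarray()[0]
print(vocabulary)
print(row)
```

With the defaults, the word "the" appears twice in the sentence, so its position in the encoded vector holds a count of 2 while every other word has a count of 1.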
In this tutorial, you will discover exactly how you can prepare your text data for predictive modeling in Python with scikit-learn.
After completing this tutorial, you will know:
- How to convert text to word count vectors with CountVectorizer.
- How to convert text to word frequency vectors with TfidfVectorizer.
- How to convert text to fixed-length vectors by hashing words to integers with HashingVectorizer.
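As a quick preview, the three vectorizers listed above can be sketched side by side. The three short documents and the choice of n_features=20 are assumptions made for illustration only, not recommended settings.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Three tiny illustrative documents (made up for this sketch).
docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox",
]

# Word counts: one column per word in the learned vocabulary.
count_vectors = CountVectorizer().fit_transform(docs)

# TF-IDF: counts reweighted so words common to all documents matter less.
tfidf_vectors = TfidfVectorizer().fit_transform(docs)

# Hashing: no vocabulary is learned, so transform() can be called directly.
# Words are hashed into a fixed number of columns (20 here, an arbitrary choice).
hashed_vectors = HashingVectorizer(n_features=20).transform(docs)

print(count_vectors.shape)
print(tfidf_vectors.shape)
print(hashed_vectors.shape)
```

The count and TF-IDF vectors have one column per vocabulary word, so their width depends on the data. The hashed vectors always have the width you ask for, at the cost of possible collisions between words.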
Let’s get started.