Python for NLP: Creating TF-IDF Model from Scratch
This is the 14th article in my series on Python for NLP. In my previous article, I explained how to convert sentences into numeric vectors using the bag of words approach. To get a better understanding of that approach, we implemented the technique in Python.
In this article, we will build upon the concepts that we learned in the last article and implement the TF-IDF scheme from scratch in Python. The term TF stands for “term frequency”, while the term IDF stands for “inverse document frequency”.
Problem with Bag of Words Model
Before we look at the TF-IDF model itself, let us first discuss a few problems associated with the bag of words model.
In the last article, we had the following three example sentences:
- “I like to play football”
- “Did you go outside to play tennis”
- “John and I play tennis”
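As a quick refresher, the binary bag of words vectors for these three sentences can be sketched as follows. This is a minimal sketch and not the code from the previous article: the `vocabulary` list is hand-picked here to match the eight words used in this article's table, the helper name `bag_of_words` is hypothetical, and tokenization is assumed to be simple lowercasing and splitting on whitespace.

```python
# Assumption: the vocabulary is fixed by hand to the eight words
# that appear as columns in the bag of words table.
vocabulary = ["play", "tennis", "to", "i", "football", "did", "you", "go"]

sentences = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]


def bag_of_words(sentence, vocabulary):
    """Return a binary vector: 1 if the vocabulary word occurs in the sentence, else 0."""
    # Assumption: lowercase + whitespace split is enough for these sentences.
    tokens = set(sentence.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


vectors = [bag_of_words(sentence, vocabulary) for sentence in sentences]

for sentence, vector in zip(sentences, vectors):
    print(vector, sentence)
```

Each vector records only whether a word is present, so a word such as "play", which occurs in every sentence, gets the same value as a word that occurs in only one.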
The resulting bag of words model looked like this:
| | Play | Tennis | To | I | Football | Did | You | Go |
---|---|---|---|---|---|---|---|---|
Sentence 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
Sentence 2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
Sentence |