Python for NLP: Creating Bag of Words Model from Scratch
This is the 13th article in my series of articles on Python for NLP. In the previous article, we saw how to create a simple rule-based chatbot that uses cosine similarity between the TF-IDF vectors of the words in the corpus and the user input, to generate a response. The TF-IDF model was basically used to convert word to numbers.
In this article, we will study another very useful model that converts text to numbers i.e. the Bag of Words (BOW).
Since most of the statistical algorithms, e.g machine learning and deep learning techniques, work with numeric data, therefore we have to convert text into numbers. Several approaches exist in this regard. However, the most famous ones are Bag of Words, TF-IDF, and word2vec. Though several libraries exist, such as Scikit-Learn and NLTK, which can implement these techniques in one line of code, it is important to understand the working principle behind these word embedding techniques. The best way to do so is to implement these techniques from scratch in Python and this is what we are going to do today.
In this article, we will see how to implement the