Implementing Word2Vec with Gensim Library in Python
Introduction
Humans have a natural ability to understand what other people are saying and what to say in response. This ability is developed by consistently interacting with other people and with society over many years. Language plays a very important role in how humans interact. Languages that humans use for interaction are called natural languages.
The rules of natural languages vary. However, two things are common to all of them: flexibility and evolution.
Natural languages are highly flexible. Suppose you are driving a car and your friend says one of these three utterances: “Pull over”, “Stop the car”, “Halt”. You immediately understand that he is asking you to stop the car. This is because natural languages are extremely flexible: there are multiple ways to say one thing.
Another important aspect of natural languages is that they are constantly evolving. For instance, a few years ago there was no term such as “Google it”, which refers to searching for something on the Google search engine.
On the contrary, computer languages follow a strict syntax. If you want to tell a computer to print something on the screen, there is a special command for that. The task of Natural Language Processing is to make computers understand and generate human language in a way similar to humans.
This is a huge task and there are many hurdles involved. This video lecture from the University of Michigan contains a very good explanation of why NLP is so hard.
In this article we will implement the Word2Vec word embedding technique used for creating word vectors with Python’s Gensim library. However, before jumping straight to the coding section, we will first briefly review some of the most commonly used word embedding techniques, along with their pros and cons.
Word Embedding Approaches
One of the reasons that Natural Language Processing is a difficult problem to solve is the fact that, unlike human beings, computers can only understand numbers. We have to represent words in a numeric format that is understandable by the computers. Word embedding refers to the numeric representations of words.
Several word embedding approaches currently exist and all of them have their pros and cons. We will discuss three of them here:
- Bag of Words
- TF-IDF Scheme
- Word2Vec
Bag of Words
The bag of words approach is one of the simplest word embedding approaches. We will walk through the steps for generating bag of words embeddings with the help of an example. Suppose you have a corpus with three sentences.
- S1 = I love rain
- S2 = rain rain go away
- S3 = I am away
To convert the above sentences into their corresponding word embedding representations using the bag of words approach, we need to perform the following steps:
- Create a dictionary of unique words from the corpus. In the above corpus, we have the following unique words: [I, love, rain, go, away, am]
- Parse each sentence. For every word in the dictionary, record how many times it occurs in the sentence, and record zero for dictionary words that do not appear in the sentence. For instance, the bag of words representation for sentence S1 (I love rain) looks like this: [1, 1, 1, 0, 0, 0]. Similarly, the bag of words representations for S2 and S3 are [0, 0, 2, 1, 1, 0] and [1, 0, 0, 0, 1, 1], respectively.
Notice that for S2 we added 2 in place of “rain” in the dictionary; this is because S2 contains “rain” twice.
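To make the procedure concrete, here is a minimal sketch in plain Python that reproduces the vectors above; the variable names and hard-coded sentences are just for illustration.

# Bag of words sketch for the three example sentences
corpus = ["I love rain", "rain rain go away", "I am away"]

# Step 1: build the dictionary of unique words, in order of first appearance
vocabulary = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: count how many times each dictionary word occurs in each sentence
vectors = [[sentence.split().count(word) for word in vocabulary] for sentence in corpus]

print(vocabulary)  # ['I', 'love', 'rain', 'go', 'away', 'am']
print(vectors)     # [[1, 1, 1, 0, 0, 0], [0, 0, 2, 1, 1, 0], [1, 0, 0, 0, 1, 1]]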
Pros and Cons of Bag of Words
The bag of words approach has both pros and cons. Its main advantage is that you do not need a huge corpus to get reasonable results. You can see that we built a very basic bag of words model with just three sentences. Computationally, a bag of words model is not very complex.
A major drawback of the bag of words approach is that we need to create huge vectors that are mostly zeros (a sparse matrix), which wastes memory and space. In the previous example we only had 3 sentences, yet you can already see three zeros in every vector.
Imagine a corpus with thousands of articles. In such a case, the number of unique words in a dictionary can be thousands. If one document contains 10% of the unique words, the corresponding embedding vector will still contain 90% zeros.
Another major issue with the bag of words approach is the fact that it doesn’t maintain any context information. It doesn’t care about the order in which the words appear in a sentence. For instance, it treats the sentences “Bottle is in the car” and “Car is in the bottle” equally, which are totally different sentences.
A type of bag of words approach, known as n-grams, can help maintain the relationship between words. An n-gram is a contiguous sequence of n words. For instance, the 2-grams for the sentence “You are not happy” are “You are”, “are not” and “not happy”. Although the n-grams approach can capture relationships between words, the size of the feature set grows very quickly as n increases.
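As a quick illustration, the 2-grams of a tokenized sentence can be generated with a short list comprehension; the sketch below just uses the example sentence from the text.

# Generating 2-grams from the example sentence
words = "You are not happy".split()
bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
print(bigrams)  # [('You', 'are'), ('are', 'not'), ('not', 'happy')]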
TF-IDF Scheme
The TF-IDF scheme is a type of bag of words approach where, instead of adding zeros and ones to the embedding vector, you add floating point numbers that carry more useful information than zeros and ones. The idea behind the TF-IDF scheme is that words that occur frequently in one document but rarely in all the other documents are more crucial for classification.
TF-IDF is a product of two values: Term Frequency (TF) and Inverse Document Frequency (IDF).
Term frequency refers to the number of times a word appears in the document and can be calculated as:
Term frequency = (Number of occurrences of a word) / (Total words in the document)
For instance, if we look at sentence S1 from the previous section, i.e. “I love rain”, every word in the sentence occurs once, so each word has a term frequency of 1/3. On the contrary, for S2, i.e. “rain rain go away”, the term frequency of “rain” is 2/4, while for the rest of the words it is 1/4.
IDF refers to the log of the total number of documents divided by the number of documents in which the word exists, and can be calculated as:
IDF(word) = Log((Total number of documents)/(Number of documents containing the word))
For instance, the IDF value for the word “rain” is 0.1760: the total number of documents is 3 and “rain” appears in 2 of them, and log(3/2) is 0.1760. On the other hand, the word “love” appears in only one of the three documents, so its IDF value is log(3), which is 0.4771.
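The calculation can be verified with a small sketch that follows the two formulas above; a base-10 logarithm is assumed here because it reproduces the 0.1760 and 0.4771 values quoted in the text.

import math

documents = [["i", "love", "rain"],
             ["rain", "rain", "go", "away"],
             ["i", "am", "away"]]

def tf(word, document):
    # Term frequency: occurrences of the word divided by total words in the document
    return document.count(word) / len(document)

def idf(word, documents):
    # Inverse document frequency: log of (total documents / documents containing the word)
    containing = sum(1 for doc in documents if word in doc)
    return math.log10(len(documents) / containing)

print(idf("rain", documents))                              # log10(3/2) = 0.1760...
print(idf("love", documents))                              # log10(3/1) = 0.4771...
print(tf("rain", documents[1]) * idf("rain", documents))   # TF-IDF of "rain" in S2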
Pros and Cons of TF-IDF
Though TF-IDF is an improvement over the simple bag of words approach and yields better results for common NLP tasks, the overall pros and cons remain the same. We still need to create a huge sparse matrix, which also takes a lot more computation than the simple bag of words approach.
Word2Vec
The Word2Vec embedding approach, developed by Tomas Mikolov, is considered the state of the art. The Word2Vec approach uses deep learning and neural network-based techniques to convert words into corresponding vectors in such a way that semantically similar words are close to each other in N-dimensional space, where N refers to the dimensionality of the vector.
Word2Vec returns some astonishing results. Word2Vec’s ability to maintain semantic relations is reflected by a classic example: if you take the vector for the word “King”, subtract the vector for the word “Man”, and add the vector for “Woman”, you get a vector that is close to the “Queen” vector. This relation is commonly represented as:
King - Man + Woman = Queen
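If you have access to a large pretrained model, this relationship can be tested directly with Gensim’s vector arithmetic. The sketch below assumes the glove-wiki-gigaword-100 vectors available through gensim.downloader, which are just one convenient pretrained option; any sufficiently large word-vector model would do.

import gensim.downloader as api

# Download and load a pretrained set of word vectors (an assumption, not the corpus used later in this article)
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman should land near the vector for "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))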
The Word2Vec model comes in two flavors: the skip-gram model and the continuous bag of words (CBOW) model.
In the skip-gram model, the context words are predicted using the base word. For instance, given the sentence “I love to dance in the rain”, the skip-gram model will predict “love” and “dance” given the word “to” as input.
On the contrary, the CBOW model will predict “to”, if the context words “love” and “dance” are fed as input to the model. The model learns these relationships using deep neural networks.
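In Gensim, the choice between the two architectures is controlled by the sg parameter of the Word2Vec class, with CBOW as the default. The toy sentence below is only there to make the snippet self-contained.

from gensim.models import Word2Vec

sentences = [["i", "love", "to", "dance", "in", "the", "rain"]]

cbow_model = Word2Vec(sentences, min_count=1, sg=0)       # sg=0: CBOW (the default)
skip_gram_model = Word2Vec(sentences, min_count=1, sg=1)  # sg=1: skip-gram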
Pros and Cons of Word2Vec
Word2Vec has several advantages over the bag of words and TF-IDF schemes. Word2Vec retains the semantic meaning of different words in a document. The context information is not lost. Another great advantage of the Word2Vec approach is that the size of the embedding vector is very small. Each dimension in the embedding vector contains information about one aspect of the word. We do not need huge sparse vectors, unlike the bag of words and TF-IDF approaches.
Note: The mathematical details of how Word2Vec works involve an explanation of neural networks and softmax probability, which is beyond the scope of this article. If you want to understand the mathematical grounds of Word2Vec, please read this paper: https://arxiv.org/abs/1301.3781
Word2Vec in Python with Gensim Library
In this section, we will implement a Word2Vec model with the help of Python’s Gensim library. Follow these steps:
Creating Corpus
We discussed earlier that in order to create a Word2Vec model, we need a corpus. In real-life applications, Word2Vec models are created using billions of documents. For instance, Google’s Word2Vec model is trained on a huge news corpus and contains vectors for 3 million words and phrases. However, for the sake of simplicity, we will create a Word2Vec model using a single Wikipedia article. Our model will not be as good as Google’s, but it is good enough to explain how a Word2Vec model can be implemented using the Gensim library.
Before we can build the model, we need to fetch the Wikipedia article that will serve as our corpus. To do so we will use a couple of libraries. The first library that we need to download is Beautiful Soup, which is a very useful Python utility for web scraping. Execute the following command at the command prompt to download the Beautiful Soup utility:
$ pip install beautifulsoup4
Another important library that we need in order to parse XML and HTML is the lxml library. Execute the following command at the command prompt to download lxml:
$ pip install lxml
The article we are going to scrape is the Wikipedia article on Artificial Intelligence. Let’s write a Python script to scrape the article from Wikipedia:
import bs4 as bs
import urllib.request
import re
import nltk

# Download the Wikipedia article and parse its HTML
scrapped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scrapped_data.read()

parsed_article = bs.BeautifulSoup(article, 'lxml')

paragraphs = parsed_article.find_all('p')

# Join the text of all paragraph tags into a single string
article_text = ""
for p in paragraphs:
    article_text += p.text
In the script above, we first download the Wikipedia article using the urlopen function from Python’s urllib.request module. We then read the article content and parse it using an object of the BeautifulSoup class. Wikipedia stores the text content of the article inside p tags, so we use the find_all method of the BeautifulSoup object to fetch all the contents from the paragraph tags of the article. Finally, we join all the paragraphs together and store the scraped article in the article_text variable for later use.
Preprocessing
At this point we have imported the article. The next step is to preprocess the content for the Word2Vec model. The following script preprocesses the text:
# Cleaning the text
processed_article = article_text.lower()
processed_article = re.sub('[^a-zA-Z]', ' ', processed_article)
processed_article = re.sub(r'\s+', ' ', processed_article)

# Preparing the dataset
nltk.download('punkt')       # tokenizer models used by sent_tokenize and word_tokenize
all_sentences = nltk.sent_tokenize(processed_article)

all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

# Removing Stop Words
from nltk.corpus import stopwords
nltk.download('stopwords')   # English stop word list
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stopwords.words('english')]
In the script above, we convert all the text to lowercase and then remove all the digits, special characters, and extra spaces from the text. After preprocessing, we are only left with the words.
The Word2Vec model is trained on a list of tokenized sentences. First, we need to convert our article into sentences, which we do with the nltk.sent_tokenize utility. To convert sentences into words, we use the nltk.word_tokenize utility. As a last preprocessing step, we remove all the stop words from the text.
After the script completes its execution, the all_words object contains a list of tokenized sentences, each of which is a list of words. We will use this list to create our Word2Vec model with the Gensim library.
Creating Word2Vec Model
With Gensim, it is extremely straightforward to create a Word2Vec model. The list of tokenized sentences is passed to the Word2Vec class of the gensim.models package. We also need to specify a value for the min_count parameter. A value of 2 for min_count specifies that only words that appear at least twice in the corpus should be included in the Word2Vec model. The following script creates a Word2Vec model using the Wikipedia article we scraped.
from gensim.models import Word2Vec
word2vec = Word2Vec(all_words, min_count=2)
To see the dictionary of unique words that occur at least twice in the corpus, execute the following script. Note that in Gensim 4.0 and later, the vocabulary is exposed through the wv.key_to_index mapping rather than the older wv.vocab attribute:
vocabulary = word2vec.wv.key_to_index
print(vocabulary)
When the above script is executed, you will see a list of all the unique words occurring at least twice.
Model Analysis
We successfully created our Word2Vec model in the last section. Now is the time to explore what we created.
Finding Vectors for a Word
We know that the Word2Vec model converts words to their corresponding vectors. Let’s see how we can view the vector representation of any particular word.
v1 = word2vec.wv['artificial']
The vector v1 contains the vector representation for the word “artificial”. By default, Gensim Word2Vec creates a hundred-dimensional vector. This is a much, much smaller vector compared to what the bag of words approach would have produced. If we use the bag of words approach to embed the article, the length of the vector for each document will be 1206, since there are 1206 unique words with a minimum frequency of 2. If the minimum frequency of occurrence is set to 1, the size of the bag of words vector will increase further. On the other hand, vectors generated through Word2Vec are not affected by the size of the vocabulary.
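You can confirm the dimensionality directly, since the returned vector is a NumPy array; in recent Gensim versions the size is controlled by the vector_size parameter of the Word2Vec class.

print(type(v1))   # <class 'numpy.ndarray'>
print(v1.shape)   # (100,) with the default vector size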
Finding Similar Words
Earlier we said that contextual information of the words is not lost using the Word2Vec approach. We can verify this by finding all the words similar to the word “intelligence”.
Take a look at the following script:
sim_words = word2vec.wv.most_similar('intelligence')
If you print the sim_words variable to the console, you will see the words most similar to “intelligence”, as shown below:
('ai', 0.7124934196472168)
('human', 0.6869025826454163)
('artificial', 0.6208730936050415)
('would', 0.583903431892395)
('many', 0.5610555410385132)
('also', 0.5557990670204163)
('learning', 0.554862380027771)
('search', 0.5522681474685669)
('language', 0.5408136248588562)
('include', 0.5248900055885315)
From the output, you can see the words similar to “intelligence” along with their similarity index. The word “ai” is the most similar word to “intelligence” according to the model, which actually makes sense. Similarly, words such as “human” and “artificial” often coexist with the word “intelligence”. Our model has successfully captured these relations using just a single Wikipedia article.
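If you want to compare two specific words rather than retrieve a ranked list, the wv.similarity method returns their cosine similarity; the exact value will depend on your training run.

# Cosine similarity between two words from the model's vocabulary
print(word2vec.wv.similarity('artificial', 'intelligence'))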
Conclusion
In this article, we implemented a Word2Vec word embedding model with Python’s Gensim library. We did this by scraping a Wikipedia article and building our Word2Vec model using the article as a corpus. We also briefly reviewed the most commonly used word embedding approaches, along with their pros and cons, as a comparison to Word2Vec.
I would suggest that you create a Word2Vec model of your own with the help of any text corpus and see if you can get better results compared to the bag of words approach.