Python for NLP: Vocabulary and Phrase Matching with spaCy
This is the third article in this series on Python for Natural Language Processing. In the previous article, we saw how Python’s NLTK and spaCy libraries can be used to perform simple NLP tasks such as tokenization, stemming and lemmatization. We also saw how to perform part-of-speech tagging, named entity recognition and noun-parsing. However, all of these operations are performed on individual words.
In this article, we will move a step further and explore vocabulary and phrase matching using the spaCy library. We will define patterns and then see which phrases match them. This is similar to defining regular expressions that involve parts of speech.
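To make the regular-expression analogy a little more concrete, here is a minimal sketch of a token-level pattern. It assumes spaCy v3 (where patterns are passed to `add` as a list) and uses a blank English pipeline, so only lexical attributes such as `LOWER` and `IS_PUNCT` are available. Part-of-speech keys such as `POS` would need a trained pipeline. The rule name `MACHINE_LEARNING` and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dictionary per token, read left to right like a regular expression:
# "machine", then an optional punctuation token, then "learning".
pattern = [
    {"LOWER": "machine"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "learning"},
]

# Assumes the spaCy v3 signature: add(rule_name, list_of_patterns).
matcher.add("MACHINE_LEARNING", [pattern])

doc = nlp("Machine learning and machine-learning are the same thing.")

# Each match is a (match_id, start, end) tuple of token offsets into doc.
matched_phrases = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_phrases)
```

With the default English tokenizer, the hyphenated form is split into three tokens, so the single pattern should pick up both spellings of the phrase.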
Rule-Based Matching
The spaCy library comes with the Matcher tool, which can be used to specify custom rules for phrase matching. The process for using the Matcher tool is pretty straightforward. First, you define the patterns that you want to match. Next, you add the patterns to the Matcher tool. Finally, you apply the Matcher tool to the document that you want