Python Scikit-learn to Simplify Machine Learning: {Bag of Words} to [TF-IDF]
Text (word) analysis and tokenized text modeling can send a chill down your spine, especially when you are new to machine learning. Thankfully, Python and its extended libraries offer warm support for text analytics and machine learning. Scikit-learn is a savior in text processing, particularly once you understand concepts like “bag of words”, “clustering” and “vectorization”. Vectorization is a must-know technique for all machine learning learners, text miners and algorithm implementors. I personally consider it a revolution in analytical calculations. Read one of my earlier posts about vectorization. Let’s look at the implementors of vectorization and try to zero in on the process of text analysis.
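To make "vectorization" a little less abstract before going further, here is a minimal sketch in plain Python of turning text into count vectors. It is not scikit-learn's implementation (scikit-learn provides a `CountVectorizer` class for this job); the two sentences, the simple regular-expression tokenizer, and the helper names `tokenize` and `vectorize` are all made up for illustration.

```python
import re
from collections import Counter

# Two made-up sentences, used only for illustration.
docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
]

def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())

# Vocabulary: every distinct token across all documents,
# sorted so that each word gets a stable column position.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

def vectorize(text):
    # Count each token, then read the counts off in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]

vectors = [vectorize(doc) for doc in docs]

print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
for vector in vectors:
    print(vector)  # [1, 0, 0, 1, 1, 1, 2] then [0, 1, 1, 0, 1, 1, 2]
```

Each document becomes a row of numbers with one column per vocabulary word, which is the form a mathematical model can work with. Word order is thrown away; only the counts survive.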
Fundamentally, before we start any text analysis, we first need to tokenize every word in a given text so that we can apply a mathematical model to these words. Once the text is tokenized, it can be transformed into the {bag of words} model of document classification, which represents each document by the counts of the words it contains, ignoring word order. This {bag of words} model is used as a feature to train classifiers. We’ll observe in code how the feature and