Handling Imbalanced Data – Machine Learning, Computer Vision and NLP
This article was published as a part of the Data Science Blogathon.
Introduction:
In the real world, the data we gather is often heavily imbalanced. So, what is an imbalanced dataset? It is one in which the training samples are not equally distributed across the target classes. For instance, in a personal loan classification problem, it is much easier to collect ‘not approved’ records than ‘approved’ ones. As a result, the model becomes biased toward the class with the larger number of training instances, which degrades its predictive power, particularly on the minority class.
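To make the idea of an unequal class distribution concrete, here is a minimal sketch that counts the samples per class and computes an imbalance ratio. The label list and the 950/50 split are made-up numbers chosen for illustration, not figures from a real loan dataset.

```python
from collections import Counter

# Hypothetical loan labels: 950 'not approved' vs 50 'approved' (illustrative numbers only).
labels = ["not approved"] * 950 + ["approved"] * 50

# Count how many training samples fall into each target class.
counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls}: {n} samples ({n / total:.1%})")

# Imbalance ratio: size of the largest class divided by size of the smallest class.
majority = max(counts.values())
minority = min(counts.values())
imbalance_ratio = majority / minority
print(f"imbalance ratio = {imbalance_ratio:.0f}:1")
```

A ratio close to 1 suggests a roughly balanced dataset, while a large ratio is a sign that the minority class may need special handling.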
In a typical binary classification problem where the minority class is the positive class, it also results in an increase in Type II errors (false negatives). This stumbling block is not limited to classical machine learning models; it is also common in computer vision and NLP. It can be handled effectively with techniques suited to each area.
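The link between imbalance and Type II errors can be sketched with a deliberately extreme case: a "model" that always predicts the majority class. The labels below are hypothetical (1 = ‘approved’ as the minority positive class, 0 = ‘not approved’), and the always-negative predictor stands in for a heavily biased classifier rather than any real trained model.

```python
# Hypothetical ground truth: 950 negatives (0) and 50 positives (1).
y_true = [0] * 950 + [1] * 50

# A degenerate baseline that always predicts the majority class (0).
y_pred = [0] * len(y_true)

# Tally the four confusion-matrix cells by hand.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type II errors
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type I errors
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # fraction of actual positives that were found

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}, false negatives = {fn}")
```

Under these assumptions the baseline scores high accuracy while missing every positive example, which is why accuracy alone can be misleading on imbalanced data.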
Notes: This article will give a brief overview of various data