Dimensionality Reduction in Python with Scikit-Learn
Introduction
In machine learning, a model's performance benefits from additional features only up to a point. Every feature fed into a model adds a dimension to the data, and as dimensionality grows, the training samples become sparser in the feature space and overfitting becomes more likely.
There are multiple techniques for fighting overfitting, and dimensionality reduction is one of the most effective. Dimensionality reduction identifies the most important components of the feature space, preserves them, and drops the rest.
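As a quick illustration, here is a minimal sketch using scikit-learn's PCA on the Iris dataset (the dataset and the choice of two components are just assumptions for the example):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load a small example dataset: 150 samples with 4 features each
X, y = load_iris(return_X_y=True)

# Project the 4-dimensional feature space onto its 2 most
# informative principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (150, 4)
print(X_reduced.shape)  # (150, 2)

# Fraction of the original variance kept by each component
print(pca.explained_variance_ratio_)
```

The reduced data keeps the directions along which the samples vary the most, which is what "preserving the most important components" means in practice for PCA.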
Why is Dimensionality Reduction Needed?
Dimensionality reduction is used in machine learning for a few reasons: to reduce computational cost, to control overfitting, and to visualize and help interpret high-dimensional datasets.
Often in machine learning, the more features present in the dataset, the better a classifier can learn. However, more features also mean a higher computational cost. Not only can high dimensionality lead to long training times, but more features also make an algorithm more likely to overfit as it tries to build a model that explains every feature in the data.
Because dimensionality reduction reduces the overall number of features, it can reduce the computational demands associated with training a model.
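To make that concrete, here is a hedged sketch of reducing the feature count before fitting a classifier. The dataset, the number of components, and the choice of logistic regression are illustrative assumptions, not the only reasonable options:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 569 samples with 30 features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

# Standardize, reduce 30 features to 10 components, then fit the
# classifier on the smaller representation
model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Here the classifier only ever sees 10 inputs per sample instead of 30, which shrinks both the training time and the model's opportunity to fit noise in redundant features.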