How to Perform Feature Selection with Categorical Data
Last Updated on August 18, 2020
Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable.
Feature selection is often straightforward when working with real-valued data, such as using the Pearson’s correlation coefficient, but can be challenging when working with categorical data.
The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g. classification predictive modeling) are the chi-squared statistic and the mutual information statistic.
In this tutorial, you will discover how to perform feature selection with categorical input data.
After completing this tutorial, you will know:
- The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.
- How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.
- How to perform feature selection for categorical data when fitting and evaluating a classification model.
Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.