How to Create Custom Data Transforms for Scikit-Learn
Last Updated on July 19, 2020
The scikit-learn Python library for machine learning offers a suite of data transforms for changing the scale and distribution of input data, as well as removing input features (columns).
There are many simple data cleaning operations, such as removing outliers and removing columns with few observations, that are often performed manually to the data, requiring custom code.
The scikit-learn library provides a way to wrap these custom data transforms in a standard way so they can be used just like any other transform, either on data directly or as a part of a modeling pipeline.
In this tutorial, you will discover how to define and use custom data transforms for scikit-learn.
After completing this tutorial, you will know:
- That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
- How to develop and apply a custom transform to remove columns with few unique values.
- How to develop and apply a custom transform that replaces outliers for each column.
Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.