How to Avoid Data Leakage When Performing Data Preparation

Last Updated on August 17, 2020

Data preparation is the process of transforming raw data into a form that is appropriate for modeling.

A naive approach to preparing data applies the transform to the entire dataset before evaluating the performance of the model. This causes a problem referred to as data leakage, where knowledge of the hold-out test set, such as its minimum, maximum, or mean values, leaks into the dataset used to train the model. This can result in an incorrect, often optimistic, estimate of model performance when making predictions on new data.
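To make the leak concrete, here is a minimal sketch using NumPy and a hand-written min-max scaling function. The synthetic data, the two-thirds split, and the helper name `min_max_scale` are illustrative assumptions, not part of the original tutorial. It contrasts scaling statistics computed over all rows with statistics computed over the training rows only.

```python
import numpy as np

# Synthetic single-feature dataset (an assumption for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 1))

# Hold out the last third of the rows as the test set
split = len(X) * 2 // 3
X_train, X_test = X[:split], X[split:]

def min_max_scale(data, low, high):
    """Rescale data using the supplied minimum and maximum (hypothetical helper)."""
    return (data - low) / (high - low)

# Naive: statistics computed over ALL rows, so the test rows influence them
naive_min, naive_max = X.min(axis=0), X.max(axis=0)
X_train_naive = min_max_scale(X_train, naive_min, naive_max)
X_test_naive = min_max_scale(X_test, naive_min, naive_max)

# Leak-free: statistics computed over the training rows only,
# then reused unchanged on the test rows
train_min, train_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_clean = min_max_scale(X_train, train_min, train_max)
X_test_clean = min_max_scale(X_test, train_min, train_max)

print("min/max from all rows:     ", naive_min[0], naive_max[0])
print("min/max from training rows:", train_min[0], train_max[0])
```

Whenever the two pairs of statistics differ, the naive version has let information about the test rows shape the training data. With the leak-free version, scaled test values may fall slightly outside the range 0 to 1, which is what would also happen with genuinely new data.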

A careful application of data preparation techniques is required in order to avoid data leakage, and this varies depending on the model evaluation scheme used, such as train-test splits or k-fold cross-validation.

In this tutorial, you will discover how to avoid data leakage during data preparation when evaluating machine learning models.

After completing this tutorial, you will know:

Naive application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
Data preparation must be fit on the training set only in order to avoid data leakage.
How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.
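As a preview of the points above, the following is a minimal sketch assuming scikit-learn, a synthetic classification dataset, min-max scaling, and logistic regression. These choices are illustrative assumptions; the text above does not specify a library, dataset, transform, or model. For a train-test split, the transform is fit on the training rows and then applied to both sets. For k-fold cross-validation, a `Pipeline` refits the transform on the training folds within each iteration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic dataset (an assumption for illustration only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# --- Train-test split ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
split_accuracy = model.score(X_test_scaled, y_test)
print("Train-test accuracy: %.3f" % split_accuracy)

# --- k-fold cross-validation ---
# The pipeline is refit from scratch on the training folds of each iteration,
# so the scaler never sees the held-out fold before it is scored.
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
cv_scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
print("Cross-validation accuracy: %.3f (%.3f)" % (cv_scores.mean(), cv_scores.std()))
```

Passing the pipeline, rather than pre-scaled data, to `cross_val_score` is the design choice that matters here: scaling `X` once up front and then cross-validating would reintroduce the leak in every fold.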

