How to Avoid Data Leakage When Performing Data Preparation

Last Updated on August 17, 2020

Data preparation is the process of transforming raw data into a form that is appropriate for modeling.

A naive approach to preparing data applies the transform on the entire dataset before evaluating the performance of the model. This results in a problem referred to as data leakage, where knowledge of the hold-out test set leaks into the dataset used to train the model. This can result in an incorrect estimate of model performance when making predictions on new data.

A careful application of data preparation techniques is required in order to avoid data leakage, and this varies depending on the model evaluation scheme used, such as train-test splits or k-fold cross-validation.

In this tutorial, you will discover how to avoid data leakage during data preparation when evaluating machine learning models.

After completing this tutorial, you will know:

  • Naive application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
  • Data preparation must be prepared on the training set only in order to avoid data leakage.
  • How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.

Kick-start your
To finish reading, please visit source site