How to Avoid Data Leakage When Performing Data Preparation
Last Updated on August 17, 2020
Data preparation is the process of transforming raw data into a form that is appropriate for modeling.
A naive approach to preparing data applies the transform to the entire dataset before evaluating the performance of the model. This results in a problem referred to as data leakage, where knowledge of the hold-out test set leaks into the dataset used to train the model. This can result in an incorrect, often optimistic, estimate of model performance when making predictions on new data.
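The tutorial's own code is not shown here, but the leaky pattern can be sketched with scikit-learn; the synthetic dataset, MinMaxScaler, and LogisticRegression below are illustrative choices, not necessarily those of the original tutorial:

```python
# Naive (leaky) approach: the scaler is fit on the ENTIRE dataset,
# so the min/max statistics of the test rows influence the training data.
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# data leakage: fit_transform sees rows that will later form the test set
X_scaled = MinMaxScaler().fit_transform(X)

# the split happens AFTER the transform, which is the mistake
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.33, random_state=1)

model = LogisticRegression()
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print('Accuracy: %.3f' % acc)
```

The reported accuracy here can be biased because the scaling step already used information from the hold-out rows.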
A careful application of data preparation techniques is required in order to avoid data leakage, and this varies depending on the model evaluation scheme used, such as train-test splits or k-fold cross-validation.
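For a train-test split, the fix is to split first and then fit the transform on the training rows only; a minimal sketch, again assuming scikit-learn with an illustrative scaler and model:

```python
# Leakage-free approach: split FIRST, then fit the transform on the
# training set only and reuse its statistics on the test set.
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# split before any data preparation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # apply training statistics to test data

model = LogisticRegression()
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print('Accuracy: %.3f' % acc)
```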
In this tutorial, you will discover how to avoid data leakage during data preparation when evaluating machine learning models.
After completing this tutorial, you will know:
- Naive application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
- Data preparation transforms must be fit on the training set only in order to avoid data leakage.
- How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.
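Under k-fold cross-validation, the same principle means the transform must be re-fit inside every fold. One way to sketch this with scikit-learn is a Pipeline, which `cross_val_score` re-fits on each fold's training portion automatically; the scaler and model choices are illustrative:

```python
# k-fold cross-validation without leakage: wrap the data preparation and
# the model in a Pipeline so the scaler is re-fit within each fold.
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),      # fit on each fold's training split only
    ('model', LogisticRegression()),
])

cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
print('Mean accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Passing the raw `X` is safe here because all fitting, including the scaler's, happens inside the cross-validation loop.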