kNN Imputation for Missing Values in Machine Learning

Last Updated on August 17, 2020

Datasets may have missing values, and this can cause problems for many machine learning algorithms.

As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.

A popular approach to missing data imputation is to use a model to predict the missing values. This requires a model to be created for each input variable that has missing values. Although any of a range of different models can be used to predict the missing values, the k-nearest neighbor (KNN) algorithm has proven to be generally effective. This approach is often referred to as “nearest neighbor imputation.”
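As a rough illustration of the idea, the sketch below uses scikit-learn's KNNImputer on a tiny array. The data is made up for this example, and the choice of two neighbors is an assumption rather than a recommendation. Each missing entry is filled with the mean of that column taken from the nearest rows, where distance is measured only on the values both rows have present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data, invented for illustration; np.nan marks the missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# n_neighbors=2 is an arbitrary choice for this small example.
imputer = KNNImputer(n_neighbors=2, weights="uniform")

# fit_transform learns from X and returns a copy with the NaNs filled in.
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

In practice the number of neighbors is a setting worth tuning, since it can change the imputed values and therefore the downstream model's performance.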

In this tutorial, you will discover how to use nearest neighbor imputation strategies for missing data in machine learning.

After completing this tutorial, you will know:

Missing values must be marked with NaN values and can be replaced with nearest neighbor estimated values.
How to load a CSV file with missing values, mark the missing values with NaN values, and report the number and percentage of missing values for each column.
How to impute
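
The loading and reporting step listed above might look something like the following sketch with pandas. The CSV content, the column names, and the use of "?" as the missing-value marker are all assumptions made for illustration; a real file may mark missing values differently.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
# Here "?" is assumed to be the marker for a missing value.
csv_text = """age,pulse,temp
9,?,38.1
?,72,37.5
4,88,?
7,?,38.6
"""

# na_values tells pandas which strings to convert to NaN while loading.
df = pd.read_csv(StringIO(csv_text), na_values="?")

# Count the NaN values per column and convert the counts to percentages.
n_missing = df.isna().sum()
pct_missing = n_missing / len(df) * 100

for col in df.columns:
    print(f"{col}: {n_missing[col]} missing ({pct_missing[col]:.1f}%)")
```

To read an actual file, the StringIO object would be replaced with the file path.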
