Missing Data with k-Nearest Neighbor Imputation

In today’s blog, I want to give a case study of using k-Nearest Neighbor (kNN) imputation to fill in missing data. About a year ago, I talked about using the kNN machine learning algorithm for a classification problem: predicting drought location. There are several great resources with Python code examples from Analytics Vidhya, Machine Learning Mastery, and A. Mueller. As always, let me know what you think of this post and its “big picture” description.

The problem I’m trying to solve is how to fill in missing data in a coffee rust dataset from Brazil. It’s a side project I began in November 2018, growing out of an earlier graduate-school project in the same problem space. In a nutshell, I’m trying to use machine learning to quantify the relationship between coffee rust (a disease of the plant) and temperature, rainfall, and production variables. The dataset has 1,584 weekly observations, drawn from seven sources, covering different coffee-growing regions in Brazil from January 1, 1991 through July 30, 2018. The first few rows, sorted by increasing rust amount, follow.

Date        Temp     Rain     Production  Futures  Rust
12/1/1993   25.7498  208.841  2347.25     0.7865   0
12/8/1993   25.7498  208.841  2347.25     0.7715   0
12/15/1993  25.7498  208.841  2347.25     0.798    0
12/22/1993  25.7498  208.841  2347.25     0.763    0
12/29/1993  25.7498  208.841  2347.25     0.749    0
8/1/1994    24.2654  56.0826  2349.33     2.116    0
8/8/1994    24.2654  56.0826  2349.33     1.786    0
8/15/1994   24.2654  56.0826  2349.33     1.94     0
8/22/1994   24.2654  56.0826  2349.33     1.8275   0
8/29/1994   24.2654  56.0826  2349.33     2.1116   0


The dataset has 773 missing rust values (48.8%) and 109 rust values greater than 50% (6.9%). Both the missing and the implausibly large observations were replaced using the k-Nearest Neighbor imputation method described by Lorenzo Beretta and Alessandro Santaniello. kNN imputation helps control noise in the data: it fills in each missing value based on the closest observations (neighbors) that don’t have a missing value. This may not be entirely foolproof for my coffee dataset, since the rust values are discrete rather than continuous.
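
The original post doesn’t include code, but here’s a minimal sketch of this kind of imputation using scikit-learn’s KNNImputer. The post doesn’t say which implementation was actually used; the file name, the 0.5 cutoff as a fraction, and the choice of k = 5 are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical file name -- the dataset itself isn't published with this post.
df = pd.read_csv("coffee_rust_brazil.csv", parse_dates=["Date"])

# Treat the 109 implausibly large readings as missing too, so they get
# re-estimated along with the 773 truly missing values. This assumes rust
# is recorded as a fraction, so 0.5 corresponds to 50%.
df.loc[df["Rust"] > 0.5, "Rust"] = np.nan

features = ["Temp", "Rain", "Production", "Futures", "Rust"]

# KNNImputer fills each NaN with the (distance-weighted) mean of that column
# over the k rows whose non-missing features are closest in Euclidean distance.
imputer = KNNImputer(n_neighbors=5, weights="distance")
df[features] = imputer.fit_transform(df[features])
```

One design note: because Production is orders of magnitude larger than Futures, standardizing the columns before imputing (and inverting the transform afterward) would keep any single feature from dominating the distance calculation; k is also worth tuning, since a larger k smooths more aggressively.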

One important assumption is that the data is missing completely at random; in our case, there was no known data available for these 882 observations. Let’s take a look at some kNN-imputed rust values: 0.17, 0.33, and 0.67. Each was determined from the observations whose other four variables (Temp, Rain, Production, and Futures) are not missing and are closest to the row being filled in; a small hand-computed sketch follows the table below. With the missing data filled in with imputed values, you can continue with your machine learning / deep learning models to solve your problem.

Temp     Rain     Production  Futures  Rust
26.642   112.547  4452.33     1.811    0.17
26.642   112.547  4452.33     1.8215   0.17
26.3971  195.832  4730.33     1.429    0.33
26.3971  195.832  4358.25     1.4445   0.4
26.3971  195.832  4730.33     1.445    0.4
25.7173  169.603  4558.17     1.425    0.5
26.1414  174.786  4216        2.1775   0.67
25.8444  179.768  2347.25     0.7755   1
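
To make “closest on the other four variables” concrete, here is a hand-rolled version of what the imputer computes for a single row. The numbers below are placeholders in the same ranges as the table above, not actual rows from the dataset.

```python
import numpy as np

# Complete observations: columns are Temp, Rain, Production, Futures, Rust.
complete = np.array([
    [26.64, 112.5, 4452.3, 1.811, 0.17],
    [26.40, 195.8, 4730.3, 1.429, 0.33],
    [25.72, 169.6, 4558.2, 1.425, 0.50],
    [26.14, 174.8, 4216.0, 2.178, 0.67],
])

# Row with a missing Rust value: only Temp, Rain, Production, Futures known.
query = np.array([26.5, 150.0, 4500.0, 1.6])

# Standardize the four predictors so no single feature dominates the distance.
X = complete[:, :4]
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xz = (X - mu) / sigma
qz = (query - mu) / sigma

# Euclidean distance to every complete row, then average Rust over the k nearest.
k = 2
dist = np.sqrt(((Xz - qz) ** 2).sum(axis=1))
nearest = np.argsort(dist)[:k]
imputed_rust = complete[nearest, 4].mean()
print(f"Imputed rust: {imputed_rust:.3f}")
```

This is the essence of the method: the missing value is borrowed from the rows that look most alike on everything else.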

Next month, join me as I give a beginners’ case study on natural language processing.
