In today’s blog, I want to give a case study of using k-Nearest Neighbor (kNN) imputation to fill in missing data. About a year ago, I talked about using the kNN machine learning algorithm for a classification problem of predicting drought location. There are several great resources out there from Analytics Vidhya, Machine Learning Mastery, and A Mueller with Python code examples. As always, let me know what you think of this post with its “big picture” description.
The problem I’m trying to solve is how to fill in missing data for a coffee rust dataset from Brazil. It’s a side project I started working on in November 2018, which grew out of an earlier project in the same problem space from graduate school. In a nutshell, I’m trying to use machine learning to quantify the relationship between coffee rust (a disease on the plant) and temperature, rainfall, and production variables. The dataset has 1584 observations merged from seven sources and covers different coffee-growing regions in Brazil each week from January 1, 1991 through July 30, 2018. A screenshot of the first few rows of data sorted by increasing rust amount follows.
The dataset has 773 missing rust values (48.8%) and 109 rust values greater than 50% (6.9%). These missing and implausibly large observations were replaced using the k-Nearest Neighbor imputation method described by Lorenzo Beretta and Alessandro Santaniello, which also helps control noise in the data. kNN imputation fills each missing value based on the observations that are most similar on the variables that are present. This may not be entirely foolproof for my coffee dataset, since the rust values are discrete rather than continuous.
One important assumption is that the data is missing completely at random. In our case, no data was known to be available for these 882 observations. Let’s take a look at some kNN imputed rust values: 0.17, 0.33, and 0.67. These values were determined by finding the observations whose other four variables (Temp, Rain, Production, and Futures) are not missing and are closest to those of the row being imputed. Now that the missing data has been filled in with imputed values, you can continue with your machine learning / deep learning models to solve your problem.
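If you want to try this yourself, here is a minimal sketch of the approach using scikit-learn’s KNNImputer. The column names (Temp, Rain, Production, Futures, Rust) follow the dataset described above, but the numbers are made-up illustrative values, not rows from the real data, and the choice of k is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for the coffee rust dataset (values are invented).
df = pd.DataFrame({
    "Temp":       [21.0, 22.5, 20.1, 23.4, 21.8, 22.0],
    "Rain":       [110.0, 95.0, 130.0, 80.0, 100.0, 90.0],
    "Production": [50.0, 48.0, 55.0, 47.0, 52.0, 49.0],
    "Futures":    [1.10, 1.15, 1.05, 1.20, 1.12, 1.14],
    "Rust":       [0.17, np.nan, 0.33, 62.0, np.nan, 0.67],
})

# Treat rust values above 50% as unreliable and mark them missing too,
# mirroring the 109 large-value observations replaced above.
df.loc[df["Rust"] > 50, "Rust"] = np.nan

# Each missing Rust value is replaced by the average Rust value of the
# k nearest rows, where distance is computed on the non-missing columns.
imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed["Rust"])
```

One caveat: KNNImputer uses Euclidean distance, so on real data you would standardize the feature columns first (e.g. with StandardScaler) so a wide-ranging variable like Rain doesn’t dominate the neighbor search.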
Next month, join me as I give a beginner’s case study on natural language processing.