In a September 2018 blog post, I discussed a K-means clustering case study of cyber profiling in Indonesia. Today I want to continue that discussion with a case study of dimensionality reduction, including some Python 3 code snippets (written in a Jupyter Notebook environment). I’ve gone more than three years without posting code, so even though this blog is primarily for a non-technical audience, stay with me here. Dimensionality reduction is a way to select the features that matter most to a decision. If you were shopping for a car that had 1,500 features, only a handful of them would really drive your choice. Perhaps you care about gas mileage, appearance, and the number of passengers it can carry. Perhaps you care more about the technology package, the comfort of the seats, and the gas mileage. In machine learning, dimensionality reduction removes redundant features so that less space is required for data storage and answers can be computed more quickly. By keeping only the relevant information, it can also improve a model’s performance.
The DrivenData competition sponsored by the Inter-American Development Bank used predictive analytics to estimate income levels in Costa Rica so that economic aid reaches the households that need it most. The training dataset had observations from 9,557 households with 142 features (read more about the data). The first step in the data analysis process was to understand the data. The features included monthly rent payment, whether bedrooms were overcrowded, size of the household, years of education of the household head, and so on. The goal was to figure out which of these 142 features could predict whether a household was in extreme poverty. This target variable is named ‘Target’ in the dataset.
After understanding the problem, I moved to the next step in data science: exploratory data analysis (EDA). The data came as a cleaned CSV file (think spreadsheet rows and columns). I used the Python 3 programming language and a couple of its libraries to load and visualize the data. (The most popular EDA notebook in the competition was by Will Koehrsen.)
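To make this concrete, here is a minimal sketch of that loading-and-inspection step. Since the competition CSV can’t ship with a blog post, the snippet builds a tiny stand-in DataFrame using a few of the real column names (‘v2a1’, ‘rooms’, ‘escolari’, ‘Target’) with invented values; with the real file you would simply call `pd.read_csv` instead.

```python
import pandas as pd

# With the real data you would load the competition CSV, e.g.:
# train = pd.read_csv('train.csv')   # file name assumed
# For illustration, a tiny stand-in with a few of the real column names:
train = pd.DataFrame({
    'v2a1':     [190000, 135000, None, 180000],  # monthly rent payment
    'rooms':    [3, 4, 8, 5],                    # number of rooms
    'escolari': [10, 12, 11, 11],                # years of schooling
    'Target':   [4, 4, 1, 4],                    # poverty level label
})

print(train.shape)                      # (4, 4) here; (9557, 143) for the real file
print(train['Target'].value_counts())  # distribution of poverty labels
print(train.isnull().sum())            # missing values per column
```

The same three lines at the end (shape, label counts, missing values) are the usual first look at any tabular dataset before modeling.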
Following exploratory data analysis, I chose Spearman’s correlation to see how the variables relate to one another, adapting the code from datascience.com. Pairs of variables that are highly correlated have coefficients close to +1 or -1, while uncorrelated pairs have coefficients close to 0, and there is a bit of art in choosing the cutoff. For example, the variable ‘v2a1’ has a Spearman’s correlation of -0.1 with the variable ‘hacdor’ but a correlation of 0.38 with the ‘rooms’ variable. This means there is a stronger, positive association between ‘v2a1’ and ‘rooms’ than between ‘v2a1’ and ‘hacdor’.
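In pandas, Spearman’s correlation is a one-liner: `DataFrame.corr(method='spearman')`. The sketch below demonstrates it on made-up numbers for three of the variables named above, so the printed coefficients will not match the -0.1 and 0.38 quoted from the real data.

```python
import pandas as pd

# Toy numeric data standing in for the competition features (values invented)
df = pd.DataFrame({
    'v2a1':   [190000, 135000, 80000, 180000, 130000],
    'hacdor': [0, 0, 1, 0, 1],
    'rooms':  [5, 4, 2, 5, 3],
})

# Spearman's correlation works on ranks, so it captures monotonic
# (not just linear) relationships between variables
corr = df.corr(method='spearman')
print(corr)

# e.g. the correlation between 'v2a1' and 'rooms'
print(corr.loc['v2a1', 'rooms'])
```

The result is a square matrix with 1.0 on the diagonal (every variable correlates perfectly with itself) and one coefficient for every pair of features.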
Since highly correlated variables are redundant and give us little additional information, we can remove them from the dataset. In our case, we will remove any feature whose Spearman’s correlation with another feature exceeds 0.2 in absolute value. The last bit of code we need identifies which features to eliminate from our training dataset.
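That feature-elimination step can be sketched roughly as follows: take the absolute Spearman correlation matrix, keep only its upper triangle so each pair of features is considered once, and flag every feature whose correlation with an earlier feature exceeds the 0.2 threshold. The variable names below are from the real dataset but the values are invented toy data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training features (the real data has 142 columns)
df = pd.DataFrame({
    'v2a1':     [190000, 135000, 80000, 180000, 130000],
    'hacdor':   [0, 0, 1, 0, 1],
    'rooms':    [5, 4, 2, 5, 3],
    'escolari': [6, 12, 4, 11, 15],
})

# Absolute Spearman correlations: we care about strength, not sign
corr = df.corr(method='spearman').abs()

# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag every feature correlated above the threshold with an earlier feature
threshold = 0.2
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

On this toy data, ‘hacdor’ and ‘rooms’ get flagged because they track ‘v2a1’ closely, while ‘escolari’ survives; on the real training set, the same logic yields the list of redundant features.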
Once you have this list of 106 features to drop, you can delete them from your pandas DataFrame and then move on to model selection and optimization. Let me know if you’d like me to go into detail on other dimensionality reduction techniques. In the next blog post, I will look at a case study that uses the k-Nearest Neighbors algorithm.
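Deleting the flagged columns is a single pandas call. A minimal sketch, again on toy data (the real `to_drop` list has 106 entries; this one has two):

```python
import pandas as pd

# Toy training data and a stand-in for the real 106-feature drop list
train = pd.DataFrame({
    'v2a1':   [190000, 135000, 80000],
    'hacdor': [0, 0, 1],
    'rooms':  [5, 4, 2],
    'Target': [4, 1, 2],
})
to_drop = ['hacdor', 'rooms']

# Remove the redundant features, keeping everything else (including Target)
train_reduced = train.drop(columns=to_drop)
print(train_reduced.columns.tolist())  # ['v2a1', 'Target']
```

The reduced DataFrame is what you would then feed into the model selection and optimization process.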