In last month’s blog, I promised we’d go into more detail on a use case for classification machine learning algorithms. Today, let’s see if we can predict foreign economic aid disbursement categories using machine learning techniques. Using the following chart, we decide on classification algorithms, since we are trying to predict a category and our data is labeled (more on that later). Now let’s dive a bit deeper into the problem we’re trying to solve using data science.
Foreign assistance has the potential to promote America’s interests, improve lives abroad and contribute to the United Nations Sustainable Development Goals. What if decision makers could use data science techniques to predict approximately how much aid a country will receive? Countries could then plan for the future. This project was part of the Applied Machine Learning class I took in my Data Science Master’s program at Indiana University in the Spring of 2017. In this project, I extracted economic aid disbursement amounts in US dollars from the U.S. Agency for International Development from 2014 to 2016 for 176 countries using the freely available data source – explorer.usaid.gov. Economic aid is defined as aid used for programs with a development or humanitarian objective.
The data was cleaned into a pandas data frame (think rows and column format like in Microsoft Excel) in Python 3.5 with three features: “2014”, “2015” and “2016”. The economic aid disbursement values are shown in US dollars for a few of the countries.
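The structure of that cleaned data frame can be sketched as follows. The country names and dollar amounts below are illustrative placeholders, not values from the actual dataset:

```python
import pandas as pd

# A minimal sketch of the cleaned data frame: one row per country and
# one column ("feature") per year of economic aid disbursement in USD.
# Countries and amounts here are made up for illustration.
aid = pd.DataFrame(
    {
        "2014": [45_000_000, 12_500_000, 310_000_000],
        "2015": [18_000_000, 14_200_000, 295_000_000],
        "2016": [22_000_000, 3_900_000, 350_000_000],
    },
    index=["Country A", "Country B", "Country C"],
)

print(aid.shape)  # (3, 3): three countries, three yearly features
```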
As you can begin to see from this sample data, the values vary quite significantly from year to year. (Here’s a link to the entire cleaned dataset). For example, the difference in Benin’s aid from 2014 to 2015 is $27 million, and in Guinea the difference between 2015 and 2016 is $67 million. Eighty-one percent of the countries (143) had aid amounts decrease by more than 15% in 2016. Twenty percent of the countries (36) that received economic aid in 2015 received no economic aid in 2016. According to the World Bank’s calculations, aid remittance to developing countries decreased by 2% cumulatively in 2016.
All 176 countries had data for at least two of the three years during the studied time period of 2014-2016. Thirty-seven countries (21%) were missing a value for at least one of the years. The missing values were assumed to be independent of their features and to occur at random. To avoid a reduced model that would exclude those countries, each missing value was replaced with the average of the two existing values, in a manner similar to work by Saar-Tsechansky.
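The imputation step described above can be sketched in pandas as a row-wise mean fill. The countries and amounts below are illustrative, not the real data:

```python
import numpy as np
import pandas as pd

# Sketch of the imputation: a country's missing year is replaced with
# the mean of its two observed years (values are made up).
aid = pd.DataFrame(
    {
        "2014": [10_000_000, np.nan],
        "2015": [20_000_000, 40_000_000],
        "2016": [np.nan, 60_000_000],
    },
    index=["Country A", "Country B"],
)

# Row-wise mean of the observed values fills each country's gap;
# Series.mean() skips NaN by default.
aid = aid.apply(lambda row: row.fillna(row.mean()), axis=1)

print(aid.loc["Country A", "2016"])  # 15000000.0, mean of $10M and $20M
```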
Linear and logistic regression classifiers were used to predict the 2017 economic aid amounts from the historical 2014, 2015 and 2016 data. Since the accuracies of these classifiers were 0, my project partner and I decided to encode the aid data into categories. A value of 1 (“Micro”) is given for aid values less than $5 million; 2 (“Small”) for $5 to $20 million; 3 (“Medium”) for $20 to $50 million; 4 (“Large”) for $50 to $100 million; 5 (“X-Large”) for $100 to $200 million; and 6 (“Substantial”) for aid values over $200 million.
| Economic Aid Disbursement | Category | Value |
|---|---|---|
| Less than $5,000,000 | Micro | 1 |
| $5,000,000 – $20,000,000 | Small | 2 |
| $20,000,000 – $50,000,000 | Medium | 3 |
| $50,000,000 – $100,000,000 | Large | 4 |
| $100,000,000 – $200,000,000 | X-Large | 5 |
| More than $200,000,000 | Substantial | 6 |
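This encoding can be sketched with `pandas.cut`, using the bin edges from the category scheme above (the sample amounts are illustrative):

```python
import pandas as pd

# Bin edges follow the category scheme: Micro (<$5M) through
# Substantial (>$200M); labels are the encoded values 1-6.
bins = [0, 5e6, 20e6, 50e6, 100e6, 200e6, float("inf")]
labels = [1, 2, 3, 4, 5, 6]

# Illustrative aid amounts in USD.
amounts = pd.Series([3_000_000, 75_000_000, 450_000_000])
categories = pd.cut(amounts, bins=bins, labels=labels)

print(list(categories))  # [1, 4, 6]: Micro, Large, Substantial
```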
There were 528 observations (176 countries * 3 years) in the final dataset. Over half of the economic aid from 2014-2016 fell into one of two categories – Micro (less than $5 million) or Substantial (more than $200 million). Machine learning models then predicted the 2017 categories of economic aid disbursement. A baseline accuracy of 59.1% was calculated from the highest number of correct predictions across all of the years (104/176). K-nearest Neighbors had the highest accuracy at 93.18%, followed by support vector machine at 92.04% and decision tree and Naive Bayes, each at 90.91%.
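A setup like the best-performing model, K-nearest Neighbors, can be sketched with scikit-learn. The data below is randomly generated to mimic the shape of the problem only (prior-year categories as features, next year’s category 1-6 as the target); the accuracy it produces is not the project’s 93.18% result:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 176 "countries" with two prior-year
# category features, and a toy target correlated with the second year.
rng = np.random.default_rng(0)
X = rng.integers(1, 7, size=(176, 2))
y = X[:, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a 5-nearest-neighbors classifier and score it on held-out data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(accuracy)
```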
Now that we’ve looked at regression and classification machine learning case studies, in the next blog I’ll explain a clustering case study.