A few weeks ago, at the end of January, while I was putting off homework for my last semester of grad school, I came across something that made my face light up and had me starting my Jupyter Notebook faster than you can say kernel. The World Bank is currently hosting a machine learning competition to find out which survey questions best predict whether a household or individual is likely to be poor. The bigger picture is that results from this competition will help improve their methods for measuring poverty, which can in turn help them achieve the goal of ending extreme poverty by 2030. Since working on a project that uses data to improve lives is at the top of my short-term goals, and since this is the first semester in which I’m taking only three credit hours, I began work on the competition immediately.
The data is from three countries: A, B and C. It is split into household-level and individual-level files and has been coded to avoid personally identifiable information. There is training and test data for each country. Country A has 8203 observations in the training data, 4041 observations in the test data and 342 features. Country B has 3255 observations in the training data, 1604 observations in the test data and 440 features. Country C has 6469 observations in the training data, 3177 observations in the test data and 162 features. The first few rows for country A look like this:
As you can see, it’s important not to get tied down by the confusing symbols in each column above. Since most data scientists (myself included) can’t be domain experts in everything, seeing the actual values in each cell would just add more complexity to solving the problem.
I processed the household-level data with an 80/20 training/test split and encoded it using the sklearn library in Python 3. (For more technical details on pre-processing, visit: https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/.)
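For illustration, a split-and-encode step of this kind might look like the sketch below. It runs on made-up data rather than the competition files: the column names (cat_feature, num_feature, poor), the coded values and the choice of one-hot encoding are all assumptions, so treat it as one plausible shape for the step and not the exact pipeline used here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for one country's household file. Column names and
# coded values are invented to mimic the obfuscated competition data.
rng = np.random.default_rng(0)
n_rows = 200
households = pd.DataFrame({
    "cat_feature": rng.choice(["aXq", "bZt", "cLm"], size=n_rows),
    "num_feature": rng.normal(size=n_rows),
    "poor": rng.choice([0, 1], size=n_rows),
})

X = households.drop(columns="poor")
y = households["poor"]

# 80/20 training/test split, stratified so both parts keep the same
# share of poor and non-poor households.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# One-hot encode the categorical column. The encoder is fitted on the
# training rows only, then reused unchanged on the test rows.
cat_cols = ["cat_feature"]
encoder = OneHotEncoder(handle_unknown="ignore")
train_cat = encoder.fit_transform(X_train[cat_cols]).toarray()
test_cat = encoder.transform(X_test[cat_cols]).toarray()

# Put the encoded categories back next to the numeric column.
X_train_enc = np.hstack([train_cat, X_train[["num_feature"]].to_numpy()])
X_test_enc = np.hstack([test_cat, X_test[["num_feature"]].to_numpy()])

print(X_train_enc.shape, X_test_enc.shape)
```

Fitting the encoder on the training rows alone keeps information from the held-out rows out of the pre-processing, and handle_unknown="ignore" means a category that only shows up in the test rows is encoded as all zeros instead of raising an error.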
The authors of this World Bank competition used a Random Forest classifier for their benchmark. The metric for success is something called log loss. Log loss scores the probabilities a classifier predicts rather than just its final labels: it penalizes the classifier whenever it is wrong, and the more confident the wrong prediction, the bigger the penalty. Lower is better. The benchmark log loss I’m trying to beat is 0.5739.
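To make the metric concrete: for a yes/no target, log loss is the average of -log(p) over all rows, where p is the probability the classifier gave to the label that turned out to be correct. Below is a small sketch of that calculation on four hand-made predictions. The function name binary_log_loss and the clipping value eps are my own choices for the example, not part of the competition's scoring code.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-probability assigned to the true labels.

    p_pred holds the predicted probability of class 1 for each row.
    Probabilities are clipped away from 0 and 1 (eps is an assumption
    made for this sketch) so that log() never sees exactly zero.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = [1, 0, 1, 0]

# Every prediction is on the right side of 0.5, but only mildly so.
cautious = binary_log_loss(y_true, [0.6, 0.4, 0.6, 0.4])
# Every prediction is right and confident.
confident = binary_log_loss(y_true, [0.9, 0.1, 0.9, 0.1])
# Always answering 0.5, which is the same as not deciding at all.
coin_flip = binary_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])
# Three confident hits and one very confident miss on the last row.
confident_miss = binary_log_loss(y_true, [0.9, 0.1, 0.9, 0.99])

for name, value in [
    ("cautious", cautious),
    ("confident", confident),
    ("coin flip", coin_flip),
    ("confident miss", confident_miss),
]:
    print(f"{name:15s} {value:.4f}")
```

A single very confident miss is enough to push the score above the coin-flip baseline, even though three of the four predictions are right. The coin-flip baseline itself is ln 2, about 0.693, which gives some context for the 0.5739 benchmark.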
I started with a quick and easy-to-interpret algorithm, which was only 53% accurate and had a log loss of 15.68. Next I moved to another type of algorithm, which had about the same accuracy as the first but a better log loss of 5.09.
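A comparison like this can be sketched in a few lines. The example below is only an illustration on synthetic data from make_classification, and the three classifiers in it (logistic regression, a shallow decision tree and a random forest) are stand-ins I picked for the sketch, not necessarily the algorithms tried in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for one country.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical candidates, chosen only for illustration.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # Accuracy uses the hard 0/1 labels; log loss uses the predicted
    # probability of class 1.
    labels = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, labels),
        "log_loss": log_loss(y_test, proba),
    }

# Rank the candidates by log loss, best (lowest) first.
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["log_loss"]):
    print(f"{name:20s} accuracy={scores['accuracy']:.3f} "
          f"log_loss={scores['log_loss']:.3f}")
```

One detail worth noting in the sketch: the log loss is computed from predict_proba, not from predict. Passing hard 0/1 labels to log_loss treats every mistake as a maximally confident one, which can inflate the score far beyond what the same model gets from its probabilities.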
Finally, after some more thought and decidedly much more caffeine, I decided to try a third type of algorithm. A key part of data science is understanding and using the scientific method, so testing the original hypothesis through trial and error with different machine learning algorithms is essential. I’m thrilled to say that for each country, the precision, recall and f1-scores were all 100%. And more importantly, the mean log loss for all three countries was 9.992007e-16, or about 0.000000000000000999!
I’m currently ranked 337 out of 1700 competitors on the contest’s leaderboard. It’s cool being in the top 20% in my first ever data challenge, and I’m going to keep experimenting as time permits before the contest ends. This challenge has been very enjoyable and is confirmation that I’m starting down the right career path!