One of the first steps in the Data Science process is identifying what data you need to answer the question. In March 2017, I featured a series of blogs about characteristics of a data scientist. Today I want to add to that discussion by giving a case study of how clarifying questions is also a key part of the data science process.
It all began when I decided to participate in a crowdsourced Driven Data competition to predict local epidemics of dengue fever. I’m passionate about using machine learning and predictive analytics to solve some of the most challenging questions and thought this would be an excellent use of free time. I love learning new domains and data mining techniques to add improve my skills and help others at the same time.
Dengue fever is a mosquito-borne disease with 60,000 reported cases in Perú and Puerto Rico in 2016. (I could not find city-level case data from Iquitos or San Juan.) The disease used to be limited to Southeast Asia and the Pacific Islands but has recently spread to many other parts of the world. According to the competition organizers, “Accurate dengue predictions would help public health workers … and people around the world take steps to reduce the impact of these epidemics. But predicting dengue is a hefty task that calls for the consolidation of different data sets on disease incidence, weather, and the environment.” The National Oceanic and Atmospheric Administration, Centers for Disease Control and other government entities are looking to data science to help them with predict dengue fever cases.
In the competition, the question is how many total cases of dengue fever will occur weekly in the cities of San Juan, Puerto Rico and Iquitos, Perú. I registered for the competition and was ready to get started with exploratory data analysis, visualizations, dimensionality reduction and all those fun parts of predictive analytics with machine learning. The training data set has 22 input features from multiple data sources including time, weather and vegetation variables from 1990 to 2010 from both cities.
Before I began writing my first line of code, I fortunately remembered two important rules from an EdX course I took: “Data Science Research Methods: Python Edition” taught by Dr. Tom Carpenter. Rule #1. Ask only one question for each problem. Rule #2. Make sure you have all the variables you need to answer the question. The Data Driven data set contains city, date, air temperatures and ranges, humidity, precipitation, and something called normalized difference vegetation index. Being a newbie in this domain, it appeared that the first rule was satisfied – the problem was asking only one question at a time – to predict future number of cases of dengue fever.
The second rule from Carpenter’s Data Science Research Methods course was a bit trickier to evaluate. I did a bit of research into infectious disease and what variables affect whether someone gets dengue fever and how it spreads in a city. A quick literature search showed prediction models research by Padet Siriyasatien, Leigh R. Bowman and others included variables about the number of people infected as shown in Table 1. (Siriyasatien, Padet et al. “Analysis of Significant Factors for Dengue Fever Incidence Prediction.” BMC Bioinformatics 17 (2016): 166. PMC. Web. 16 Oct. 2018.)
When people get bitten by the dengue mosquito, they may get the disease. Other people living in the same city will not get the disease if they are not bitten by the mosquito. The cases are reported after the person is bitten. Since the competition includes no features about the number of cases, it does not satisfy rule #2 and we cannot predict the future number of dengue cases each week in San Juan, Puerto Rico or Iquitos, Peru. The most any model can predict with the given data is conditions where dengue is likely to occur within a city in a certain year. The scope of this question is different than a weekly prediction before cases occur that competition organizers are asking. Even if the competition had number of cases data at the city-level by year, there are other confounding factors (not all mosquitoes carry dengue) that would probably prevent data scientists from predicting the number of cases before they occur. And so, I decide not to compete in the competition and must find another challenging and interesting data science problem to solve.