Last week I began a five-part series on my team’s graduate school project, which tells the story of coffee rust, production and futures. Today I’m going to focus on the data acquisition process, some challenges we faced in getting the data and the ethics we considered. Our project mirrors real-world data science projects in that the data comes from multiple sources, is not always in English and needs to be cleaned before it can be analyzed.
The analysis is limited to data that is available for reuse under a Creative Commons license. Since the data is open and does not contain personally identifiable information, it is not considered private, and reusing it for our visualization research does not violate any data science ethics principles. Considering the purpose and reuse of the data is a key component of ensuring ethical data use in the data acquisition process.
Surprisingly, there was no freely available rust data from some of the other top coffee-producing nations, such as Vietnam or Indonesia, even though rust is known to exist in all coffee-producing regions. Perhaps with the assistance of a coffee domain expert we would have been able to include this data in our analysis. However, since Brazil and Colombia are among the top 10 coffee producers, we considered our data a representative sample of coffee-producing countries. Our data covered Brazil (2005-2006 and 2008-2009), Colombia (1995 and 2011-2013) and Papua New Guinea (1989-1991). It included average monthly rainfall in mm, average monthly temperature in degrees Celsius, average monthly percent rust, average monthly production measured in thousands of 60-kg bags, and bi-weekly coffee futures in US dollars.
We included only weather and physical crop properties that affect rust, to keep the model as simple and easy to understand as possible. The team chose not to use crop management variables such as fertilization and pesticide use, both because we did not want them to act as confounding variables and because they are known methods of controlling coffee rust after it occurs. Soil humidity (Quinones), soil pH (Lamouroux) and day length (Quinones) may also be significant variables that affect rust, but they were not included because there was no consistent and reliable source of data for the countries in the scope of the project (Brazil, Colombia and Papua New Guinea). There are 339 observations of the 7 features (see GitHub repo for CSV file). The team concluded that the variables formed a correlative model in which environmental factors influence coffee production. The desired output is the futures price.
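A quick sanity check on a combined file like this can catch problems before any analysis starts. The sketch below is only an illustration: the seven column names are placeholders I made up (this post does not list the actual CSV headers), and the two sample rows are invented values, not project data. It counts observations, reports any expected columns that are missing and flags rows whose numeric fields do not parse.

```python
import csv
import io

# Hypothetical column names: placeholders only, since the real CSV headers
# are not given in this post.
EXPECTED_COLUMNS = [
    "country", "date", "rainfall_mm", "temperature_c",
    "rust_pct", "production_1000_bags", "futures_usd",
]
NUMERIC_COLUMNS = EXPECTED_COLUMNS[2:]


def summarize(csv_text):
    """Return (row_count, missing_columns, line_numbers_with_bad_numbers)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    bad_rows = []
    count = 0
    # The header is line 1 of the file, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        count += 1
        for col in NUMERIC_COLUMNS:
            try:
                float(row.get(col))
            except (TypeError, ValueError):
                # Absent column (None) or a non-numeric value such as "n/a".
                bad_rows.append(line_no)
                break
    return count, missing, bad_rows


# Invented sample rows for illustration only; not values from the project.
SAMPLE = """country,date,rainfall_mm,temperature_c,rust_pct,production_1000_bags,futures_usd
Colombia,2011-01,120.5,21.3,12.0,850,230.1
Colombia,2011-02,98.0,21.9,n/a,790,245.6
"""

count, missing, bad_rows = summarize(SAMPLE)
print(count, missing, bad_rows)
```

Run against the real CSV with its real headers substituted in, the first number should come back as 339 if the file matches the description above.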
The main challenges in data acquisition were that the data came from multiple sources, was not always in English and needed to be cleaned before it could be analyzed. All documents regarding coffee rust in Brazil were in Portuguese, so we used Google Translate to translate the data we needed. The project team requested supplemental rust data from one of the more prolific researchers, Corrales, and although we received a response, the data descriptions and titles were in Spanish. Fortunately, one of the project team members is fluent in Spanish and provided the necessary translation. On average, each country had 6-7 different data sources that had to be manipulated and cleaned for further analysis and visualization.
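To give a feel for what combining translated sources involves, here is a minimal sketch using only the Python standard library. It is not the team's actual pipeline: the Spanish and Portuguese header spellings in `HEADER_MAP`, the English column names and the sample rows are all assumptions made up for illustration. The idea is to rename each source's headers to one English vocabulary, then fold the sources into one record per year and month.

```python
import csv
import io

# Hypothetical header translations (Spanish and Portuguese to English).
# The real source files may label their columns differently.
HEADER_MAP = {
    "año": "year", "ano": "year",
    "mes": "month", "mês": "month",
    "precipitación": "rainfall_mm", "precipitação": "rainfall_mm",
    "temperatura": "temperature_c",
    "roya": "rust_pct", "ferrugem": "rust_pct",
}


def read_source(csv_text):
    """Read one source and return its rows keyed by English column names."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for key, value in raw.items():
            if key is None:
                continue  # extra cells beyond the header are ignored
            name = key.strip().lower()
            row[HEADER_MAP.get(name, name)] = (value or "").strip()
        rows.append(row)
    return rows


def merge_sources(country, sources):
    """Combine several sources into one record per (year, month)."""
    merged = {}
    for rows in sources:
        for row in rows:
            key = (int(row["year"]), int(row["month"]))
            record = merged.setdefault(
                key, {"country": country, "year": key[0], "month": key[1]}
            )
            for name, value in row.items():
                # Skip the key columns and empty cells; the rest are numeric.
                if name not in ("year", "month") and value != "":
                    record[name] = float(value)
    return [merged[k] for k in sorted(merged)]


# Invented sample sources for illustration only; not values from the project.
weather = "Año,Mes,Precipitación,Temperatura\n2011,1,120.5,21.3\n2011,2,98.0,21.9\n"
rust = "Año,Mes,Roya\n2011,2,14.5\n2011,1,12.0\n"

merged = merge_sources("Colombia", [read_source(weather), read_source(rust)])
for record in merged:
    print(record)
```

Keying on year and month means the sources do not need to arrive in the same order or cover exactly the same months; a month missing from one source simply lacks that column in the merged record.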
Next week I’ll move on to the exploratory data analysis using histograms and kernel density estimate visualizations.