Morning Joe Project – Data Acquisition

This is supporting documentation for the Morning Joe Viz Project.

Data Acquisition Process
Data Sources

Brazil

Papua New Guinea

Colombia

  • Rust Data: January – December 1995 – Ruiz Cardenas
    • Figure 1 – % rust incidence = percent dead leaves = 100 - % alive leaves = 100 - (striped bars in Figure 1). The average of %Supia, %La Catalina and %Naranjal at 60 days after flowering is used; samples are from trees, not individual leaves.

% rust incidence = 100 - ((48 + 38 + 20)/3) = 100 - 35.33 = 64.67%. This one value is used for all months.


As with most real-world data science projects, the analysis was limited to openly available data. Surprisingly, there is no freely available rust data from some of the other top coffee-producing nations, such as Vietnam and Indonesia. Since Vietnam and Indonesia grow predominantly robusta rather than the Coffea arabica variety, the lack of data from these countries has little bearing on the research. Since Brazil and Colombia are both among the top five coffee producers, the data is a reasonable sample of coffee-producing regions.

Data Assumptions

  • Latitude and Longitude – a negative value represents South and West directions.
  • Rainfall, temperature and rust percent are monthly values; the only data available weekly is the coffee futures price data. The same monthly rainfall, temperature, rust percent and calculated production amount were therefore used for every date within a given month.
  • Production amounts data is only available per country per year. To calculate the production amount per month, yearly production amount was divided by 12.
  • Coffee futures prices are worldwide prices and are not broken down by country.
  • Altitudes are similar for Brazil (1010 m), Colombia (1310 m and Cauca, 1761 m) and Papua New Guinea (1410-1770 m), so using data from these three coffee-growing regions is appropriate from a geographic and agricultural point of view. Where the altitude of a coffee-growing region was not given explicitly in the cited literature, it was determined with the freemaptools.com ‘Elevation Finder’.
  • The impact of climate change on coffee-growing regions has been studied by Jaramillo and Alves. However, any possible implications of climate change for rust, production or futures were not included in the analysis.
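The assumptions above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the function names and the sample numbers are hypothetical, and the project data may have been prepared differently (for example in a spreadsheet).

```python
from datetime import date


def signed_coordinate(degrees, hemisphere):
    """Return a signed coordinate: South and West are negative."""
    return -abs(degrees) if hemisphere.upper() in ("S", "W") else abs(degrees)


def monthly_production(yearly_production):
    """Spread a yearly production figure evenly across 12 months."""
    return yearly_production / 12


def attach_monthly_values(report_dates, monthly_values):
    """Give every futures report date the values recorded for its (year, month)."""
    rows = []
    for d in report_dates:
        values = monthly_values[(d.year, d.month)]
        rows.append({"date": d, **values})
    return rows


# Hypothetical illustration values, not figures from the project data set.
monthly_values = {
    (1995, 1): {
        "rain_mm": 200.0,
        "temp_c": 22.0,
        "production": monthly_production(12000.0),
    },
}
report_dates = [date(1995, 1, 6), date(1995, 1, 20)]
rows = attach_monthly_values(report_dates, monthly_values)
```

Both report dates in January 1995 end up with identical rainfall, temperature and production values, which is the behaviour the second and third assumptions describe.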


Attribute Selection

Seven features were selected for analysis:

  1. Date (Month/Day/Year)
  2. Country
  3. Latitude/Longitude
  4. Avg Monthly Rain (mm)
  5. Average Temperature (degrees C)
  6. % Rust
  7. Production (quantity measured in 1000 60 kg bags of coffee beans)

Rationale for Attribute Selection:
Weather and physical crop properties that affect rust were included to keep the model as simple and easy to interpret as possible. The literature suggests that two main variables determine coffee rust: air temperature and rainfall amounts (Galli, Zambolim, Bock). The literature also suggests that farmers apply rust control measures such as pesticides when crop leaves have 5-20% brown spots (Cunha). Crop management variables such as fertilization and pesticide use were not included, to avoid introducing confounding variables into the analysis. Furthermore, these crop management variables would add little value to analysing the relationships between variables and the hypotheses, since they are known methods of controlling coffee rust after it reaches a 5% threshold.

Other attributes such as soil humidity (Quinones), pH (Lamouroux) and day length (Quinones) may be significant variables that affect rust, but they were not included in the analysis since there was no consistent and reliable source data for Brazil, Colombia and Papua New Guinea.

Data Summary

Category                  Attributes           Type
Weather conditions        Latitude, Longitude  Numerical
                          Temperature          Numerical
                          Average Rainfall     Numerical
Physical crop properties  Location             Nominal
                          Coffee Variety       Nominal
                          % Rust               Numerical
                          Production amount    Numerical
Financial properties      Futures price        Numerical

There were 337 observations of the 7 features. The variables formed a correlative model where environmental factors such as temperature and rainfall were assumed to influence coffee production.

Data Acquisition Challenges

  • Lacking Domain Expertise

Gathering enough accurate data with all the necessary variables was challenging at the beginning of the project and took a significant amount of time. Since the team did not have domain knowledge of coffee production and rust, it relied on existing literature from experts. Learning how to search for and find the necessary data was part of the data acquisition process.

For the coffee futures price data, printed historical charts scanned at low resolution were the only data found during the initial phase of the project proposal (see 1989 Coffee Historical Prices chart below).

http://futures.tradingcharts.com/historical/CF/1989/3/linewchart.html

futures-1989.png

Since no more precise data was found, prices at the beginning of each month were initially approximated by hand and rounded to the nearest decimal point.

Fortunately, after the initial project proposal deadline, more accurate futures price data was found from the U.S. Commodity Futures Trading Commission. This resource covers 1986-2012 with prices given to three decimal places, reported 4-5 times each month (except in 1989-1991, when there were only 2 reports per month). These amounts were used as the authoritative data source for futures.

  • Language Barrier

All documents regarding coffee rust occurrences in Brazil were in Portuguese (Paolo, Japiassu, Galli, Zambolim, Carvalho), so Google Translate was used to translate the relevant data. The team also requested supplemental rust data from Corrales, one of the more prolific rust researchers. Corrales’ supplemental CSV data was in Spanish. Fortunately, one of the project team members is fluent in Spanish and provided the necessary translation.

  • Combining Data from Multiple Sources

On average, each country had five different data sources that had to be manipulated and cleaned before the analysis and visualization. Units of measurement were verified for each variable to ensure they were consistent for each country.
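A minimal sketch of what combining sources and verifying units might look like is shown below. It is an assumption about the workflow rather than the project's actual code: the helper names, the record layout keyed by (country, year, month), and the sample values are all hypothetical. The expected units are the ones listed under Attribute Selection.

```python
# Units the combined data set is expected to use (from the attribute list).
EXPECTED_UNITS = {
    "rain": "mm",
    "temperature": "C",
    "production": "1000 60-kg bags",
}


def unit_mismatches(source_units):
    """Return {variable: (found, expected)} for every unit that differs."""
    return {
        var: (unit, EXPECTED_UNITS[var])
        for var, unit in source_units.items()
        if var in EXPECTED_UNITS and unit != EXPECTED_UNITS[var]
    }


def combine_sources(sources):
    """Merge records from several sources keyed by (country, year, month)."""
    combined = {}
    for records in sources:
        for key, values in records.items():
            combined.setdefault(key, {}).update(values)
    return combined


# Hypothetical sample records, not values from the project data set.
rain_source = {("Brazil", 1995, 1): {"rain": 210.0}}
temp_source = {("Brazil", 1995, 1): {"temperature": 22.5}}
combined = combine_sources([rain_source, temp_source])

# A source reporting rainfall in inches would be flagged for conversion.
mismatches = unit_mismatches({"rain": "in", "temperature": "C"})
```

Flagging mismatches before merging keeps a source with different units from silently mixing into the combined records.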
