Today I’ll go into a bit more detail about the initial findings from the Morning Joe project, using histograms and kernel density estimates. The exploratory data visualization process begins with a look at summary statistics of the data in Python. Average monthly rain ranges from 0.2 to 407.7 mm, average monthly temperature ranges from 23.36 to 27.16 °C, average monthly rust ranges from 0.33 to 50%, monthly calculated production ranges from 80.33 to 3832.67 (thousand 60-kg bags), and bi-weekly futures range from 21.98 to 175.18 US$.
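As a rough illustration of that first step, here is a minimal sketch of how the minimum, mean and maximum of each variable could be pulled out in Python. It uses only the standard library; the post does not say which library was actually used, and the `summarize` helper, the column names and the sample values are all made up for the example rather than taken from the project's data.

```python
from statistics import mean

# Placeholder monthly values for illustration only; NOT the project's data.
sample = {
    "rain_mm": [12.0, 150.5, 88.3, 240.1],
    "temperature_c": [24.1, 25.6, 26.3, 23.9],
    "rust_pct": [2.5, 14.0, 8.75, 31.0],
}


def summarize(columns):
    """Return {name: (min, mean, max)} for each column of numbers."""
    return {
        name: (min(values), mean(values), max(values))
        for name, values in columns.items()
    }


# Print one line per variable, e.g. the range quoted in the text above.
for name, (lo, avg, hi) in summarize(sample).items():
    print(f"{name}: min={lo}, mean={avg:.2f}, max={hi}")
```

With the real data loaded into the same dictionary shape, the min and max of each column would give the ranges quoted above.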
To understand how the coffee data is distributed, a histogram and kernel density estimate (KDE) were created for each of three variables: rust, production and futures. After a bit of experimentation with other values, the default of 10 bins was kept for each histogram. The KDE color was changed to purple since it contrasts more strongly with the histogram than the default blue.
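For readers who want to try something similar, here is a minimal sketch of a 10-bin histogram with a Gaussian KDE drawn over it in purple. It is not the project's actual plotting code: the function names are hypothetical, the KDE is computed by hand with NumPy, and the bandwidth uses the normal reference rule of thumb, which is an assumption since the post does not state a bandwidth.

```python
import numpy as np


def histogram_and_kde(values, bins=10, grid_points=200):
    """Return (density, edges, grid, kde) for a 1-D sample.

    Assumes the sample is not constant (its standard deviation is > 0).
    """
    x = np.asarray(values, dtype=float)
    # density=True rescales the bars so the histogram integrates to 1,
    # which puts it on the same vertical scale as the KDE curve.
    density, edges = np.histogram(x, bins=bins, density=True)
    # Normal reference rule-of-thumb bandwidth (an assumption, see lead-in).
    n = x.size
    h = 1.06 * x.std(ddof=1) * n ** (-1 / 5)
    # Evaluate the KDE a little beyond the data so the tails are visible.
    grid = np.linspace(x.min() - 3 * h, x.max() + 3 * h, grid_points)
    # Gaussian KDE: the average of normal bumps centred on each observation.
    z = (grid[:, None] - x[None, :]) / h
    kde = np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    return density, edges, grid, kde


def plot_hist_kde(values, title, bins=10):
    """Draw the histogram with the KDE on top and return the figure."""
    # Imported here so the numeric helper works without a plotting backend.
    import matplotlib.pyplot as plt

    density, edges, grid, kde = histogram_and_kde(values, bins=bins)
    fig, ax = plt.subplots()
    ax.bar(edges[:-1], density, width=np.diff(edges), align="edge", alpha=0.6)
    ax.plot(grid, kde, color="purple")  # purple stands out against the bars
    ax.set_title(title)
    ax.set_ylabel("Density")
    return fig
```

Calling `plot_hist_kde(rust_values, "Coffee Rust")` on a list of rust percentages (here `rust_values` is a stand-in name) would give one such chart; changing `bins` is an easy way to repeat the bin-count experiment.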
The coffee rust distribution peaks between 5 and 20%. Note that farmers begin coffee rust pest management practices when leaf rust exceeds 3%.
From the Coffee Production histogram and KDE, coffee production peaks at 1000 and then again at 3000 to 4000 (thousand 60-kg bags). There is a gap in the data between 1000 and 3000 since Papua New Guinea’s production is much lower (80.33-97.92 in 1989-1991) than Colombia’s production of 1010.33 in 2011-2012.
The coffee futures histogram shown above peaks around 30 and 125 US$, and most of the data falls between 100 and 150 US$. The KDE shows a bimodal distribution. The relationship between time and futures needs to be better understood to see whether it explains the two modes. Recall that the data set has futures prices in US$ from 1989-2013.
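One simple way to start probing that question is to split the prices into an earlier and a later period and compare the typical price in each. The sketch below is only an illustration of the idea: the `median_by_period` helper, the cutoff year and the (year, price) pairs are all invented for the example and say nothing about how futures actually moved between 1989 and 2013.

```python
from statistics import median


def median_by_period(records, cutoff_year):
    """Split (year, price) pairs at cutoff_year and return both medians."""
    early = [price for year, price in records if year < cutoff_year]
    late = [price for year, price in records if year >= cutoff_year]
    return median(early), median(late)


# Placeholder (year, price) pairs, invented so the two groups differ.
# They are NOT the project's data.
records = [
    (1990, 32.0), (1995, 41.5), (2000, 28.0),
    (2005, 110.0), (2010, 135.5), (2013, 121.0),
]

# The cutoff year is an arbitrary choice for the example.
early_med, late_med = median_by_period(records, cutoff_year=2003)
print(f"median before cutoff: {early_med}, median from cutoff on: {late_med}")
```

If the real data showed two periods clustering around clearly different price levels, that would support time as the explanation for the two peaks; if both periods covered the same spread of prices, something else would have to be behind them.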
Next week I’ll shift focus and talk in more detail about the other visualizations I used to find correlations between the variables.