A Summer of Learning: Univariate Statistics

Over the last five weeks, I’ve talked about what I’ve learned in my summer courses at Indiana University. In those posts I went into a bit more detail about data acquisition using MySQL, exploratory data analysis using R, and data visualization using Tableau. Today I want to discuss some key lessons I learned from a condensed 5-week univariate statistics course.

The picture at the beginning of this post provides a mind map, or visual overview, of the material covered. Knowing the basics of statistics is important for data scientists, as I discussed in a February 2017 blog. Univariate statistics looks at data with one variable and tries to find patterns that describe the data. Using Michael Trosset’s An Introduction to Statistical Inference with R, we covered about three chapters of the book each week. We began the course by learning how to collect data, reviewing some basic mathematical principles, and studying probability and random variables.

The entire course looked at data that followed a normal, or Gaussian, distribution. The instructor emphasized that statisticians often use this distribution when they are not sure of the actual distribution of the data. The figure below shows four normal distributions; each is symmetrical about its center, and the shape is also known as a bell curve.
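Separately from the figure, the symmetry of a normal curve can be checked with a few lines of code. Our class worked in R; the sketch below uses Python's standard library instead, and the four means and standard deviations are made up for illustration. They are not the parameters from the figure or from the course.

```python
import math
from statistics import NormalDist

# Four illustrative normal curves (assumed parameters, not taken from the figure).
curves = {
    "mean 0, sd 1": NormalDist(mu=0, sigma=1),
    "mean 0, sd 2": NormalDist(mu=0, sigma=2),
    "mean 3, sd 1": NormalDist(mu=3, sigma=1),
    "mean -2, sd 0.5": NormalDist(mu=-2, sigma=0.5),
}


def is_symmetric(dist, offsets=(0.5, 1.0, 2.0)):
    """True if the density matches at equal distances on both sides of the mean."""
    return all(
        math.isclose(dist.pdf(dist.mean - d), dist.pdf(dist.mean + d))
        for d in offsets
    )


def share_within_one_sd(dist):
    """Probability of landing within one standard deviation of the mean."""
    return dist.cdf(dist.mean + dist.stdev) - dist.cdf(dist.mean - dist.stdev)


for label, dist in curves.items():
    print(label, is_symmetric(dist), round(share_within_one_sd(dist), 4))
```

Whatever the mean and spread, each curve is symmetric and puts roughly 68% of its probability within one standard deviation of the mean. Moving the mean slides the bell left or right, and changing the standard deviation makes it narrower or wider.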

The following week we continued with attributes of the data such as the population mean, the assumptions we were making in order to apply the mathematical formulas, and how to extract information from sample data. We covered the Weak Law of Large Numbers and the Central Limit Theorem, and how to use the corresponding R commands to apply those principles to our data. We concluded the class by spending two weeks on estimation, correlation and regression.
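To make the two theorems a little more concrete, here is a small simulation. It is only a sketch: the class used R, while this uses Python's standard library, and the exponential population, the seed, the sample sizes and the helper names are all my own assumptions rather than anything from the course.

```python
import random
import statistics


def sample_mean(rng, n, rate=1.0):
    """Mean of n draws from an exponential population (true mean = 1 / rate)."""
    return statistics.fmean(rng.expovariate(rate) for _ in range(n))


def sample_means(rng, n, repeats, rate=1.0):
    """Repeat the experiment 'draw n values and average them' many times."""
    return [sample_mean(rng, n, rate) for _ in range(repeats)]


rng = random.Random(2017)  # fixed seed (arbitrary) so the run is repeatable

# Weak Law of Large Numbers: the sample mean drifts toward the true mean (1.0)
# as the sample size grows.
lln_means = {n: sample_mean(rng, n) for n in (10, 1_000, 100_000)}

# Central Limit Theorem: means of many samples of size n cluster around the
# true mean with spread close to sigma / sqrt(n), even though the exponential
# population itself is strongly skewed. For rate = 1.0, sigma = 1.0.
n = 50
clt_means = sample_means(rng, n, repeats=2_000)
clt_center = statistics.fmean(clt_means)
clt_spread = statistics.stdev(clt_means)
predicted_spread = 1.0 / n ** 0.5

print("sample means by sample size:", lln_means)
print("center of the sample means:", round(clt_center, 3))
print("observed spread:", round(clt_spread, 3))
print("predicted spread:", round(predicted_spread, 3))
```

I picked an exponential population on purpose: it looks nothing like a bell curve, yet a histogram of `clt_means` would look approximately normal, which is what the Central Limit Theorem predicts.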

After the first two weeks of class, it became clear that I was NOT going to get a PhD in Statistics even if that was the last career on Earth, and that I most likely would not ace this class. Maybe the online format was not the best for this type of material, perhaps it had been too long since I took statistics as an undergraduate, or perhaps 13 chapters in 4 ½ weeks was just too much for me. It’s good to be reminded from time to time that you don’t know everything; it keeps me humble as I grow to become a data scientist. I am also forever grateful to my TA and my professor for showing me mercy in grading.

About a week before the end of the class, I found a book written for data scientists whose teaching and writing style is very different from my class. I plan to go through it on my own, at my own pace, to make sure I learn the material I struggled with. It’ll also be nice to review concepts that may not have been covered in class but that I may need in my new career.

I’ll use the remaining three weeks before the fall term begins to discuss other things I’ve been learning on my own as I train to become a data scientist.