Last week I briefly discussed what I consider the most important skills a great data scientist has – analytical thinking and problem solving. This week I want to continue this “Anatomy of a Data Scientist” series by looking at the role of statistics in the field of data science.
Let’s be clear, however: a statistician is not the same as a data scientist, as a MixPanel article states, even though the two roles share some commonalities. Companies rely on data scientists to bring together knowledge from statistics, programming, and domain expertise to automatically collect information about the past and use it to make decisions about the future.
Shannon Cebron, pictured above, is a Data Scientist at Pegged Software who uses statistics to solve real-world problems, such as extracting data from writing samples to gain insights about a job candidate. For Shannon and other data scientists, a background in statistics is key to understanding the theory needed to work with data and solve problems. Even the American Statistical Association and other traditional statistics institutions have recently brought data science under their umbrella. And as I discussed in a January 2017 blog post, all four types of data scientists have statistics skills.
Another viewpoint comes from Jingfen Zhu, a Chief Scientist at Genpact, who says that statistics is critical in two of the three typical phases of a data science project: Data Collection, Data Analysis, and Results Communication. How much statistics is used in data science is still debated, and I think the answer will continue to change as the field matures. A Talend article from 2016 provides a comparison that summarizes the current state of both fields, as shown in the table at the end of the article. Next week we’ll look at the role machine learning plays in data science.