Image from University of Texas-Austin
Today is part three of my mini-series on the Anatomy of a Data Scientist. I spent the first two weeks talking about the problem-solving and analytical skills a data scientist needs, and about how data scientists use statistics in their job. Today we’re going to learn where machine learning fits into the skill set of an effective data scientist. We’re going to look at the boundaries between statistics and coding, and at the role each plays within this emerging field.
Data science and machine learning are closely intertwined. One of the best definitions I have seen is that the data scientist generally determines which machine learning approach to use, models the algorithm, and then prototypes and tests it in a coding language such as R or Python. Machine learning is a way to find patterns in past data in order to predict what may happen in the future. I like to think the data scientist is responsible more for data strategy, in that they decide which algorithm to use to solve the problem, while the machine learning engineer implements that algorithm in production at large scale.
(Image by Drew Conway)
There are two main types of machine learning: supervised (predictive) and unsupervised (descriptive). Five main steps are used in machine learning: collecting the data, preparing the data, training a model, evaluating the model and improving its performance. Keep in mind the key point of machine learning is to quantitatively answer a business problem. I think it can be easy for us, aspiring and actual data scientists alike, to forget that we’re using all of these tools to answer a problem and derive value for the organization. We do this by gaining knowledge from the data.
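To make the five steps concrete, here is a minimal sketch in Python using scikit-learn. Everything specific in it is my own assumption for illustration: the bundled iris data set, the k-nearest-neighbours classifier, the 70/30 split, the candidate values for `n_neighbors`, and the hypothetical function name `run_iris_workflow`. Treat it as one possible way to walk through the steps, not the definitive way.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def run_iris_workflow(seed=0):
    """Walk through the five steps on a small supervised problem (illustrative only)."""
    # Step 1: collect the data (here, a data set that ships with scikit-learn).
    iris = load_iris()
    X, y = iris.data, iris.target

    # Step 2: prepare the data by holding out 30% of the rows for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )

    # Step 3: train a model (k-nearest neighbours is an arbitrary choice).
    baseline = KNeighborsClassifier(n_neighbors=5)
    baseline.fit(X_train, y_train)

    # Step 4: evaluate the model on data it has not seen.
    baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

    # Step 5: try to improve performance by searching over n_neighbors,
    # using 5-fold cross-validation on the training rows only.
    search = GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
        cv=5,
    )
    search.fit(X_train, y_train)
    tuned_accuracy = accuracy_score(y_test, search.predict(X_test))

    return baseline_accuracy, tuned_accuracy, search.best_params_


if __name__ == "__main__":
    baseline_accuracy, tuned_accuracy, best_params = run_iris_workflow()
    print("baseline accuracy:", baseline_accuracy)
    print("tuned accuracy:", tuned_accuracy)
    print("best parameters:", best_params)
```

Note that the tuned model is not guaranteed to beat the baseline on the held-out rows. Step five is about trying improvements and measuring them, which ties back to answering the business problem quantitatively.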
One of the first machine learning problems most people tackle when they are learning this discipline uses a data set to predict which species of iris flower you have, given certain characteristics of the flower. There are many tutorials on this: http://machinelearningmastery.com/machine-learning-in-python-step-by-step/ and http://scikit-learn.org/stable/tutorial/basic/tutorial.html. The gist of machine learning is to describe characteristics in numerical terms so that the future can be predicted. Next week I’ll conclude the ‘Anatomy of a Data Scientist’ series by looking at the soft skills needed to become a data science ninja.