After a two-week miniseries on data ethics, today I’m going back to the conversation we were having about a month ago about the basic principles of machine learning, including the random forest algorithm. Part 9 of the machine learning for beginners blog talked about using decision tree algorithms to determine what type of animal you have. Today I want to expand on the ideas presented in my Random Forest “Data Science in 90 Seconds” YouTube video and continue the discussion in plain language without math or code.
If you recall from earlier discussions, supervised machine learning is the task of inferring a function from labeled training data. Unlike unsupervised machine learning, which looks for hidden structure in unlabeled data, supervised machine learning gives the computer observations that already carry a predetermined class or category label. The algorithm learns from these labeled observations and then tries to predict the labels of new ones. Random forest is a type of supervised machine learning: it is a collection of decision trees, each trained on labeled examples.
I want to study an example of using random forest with sports data. In a 2016 paper by Vaidya et al. in the International Journal of Computer Applications, the researchers use random forest to predict the outcomes of English Premier League football (soccer in the USA) matches. It’s a fairly difficult problem to solve with machine learning since the three outcomes are close to evenly matched: a team wins, loses or ties about 35%, 35% and 29% of the time, respectively.
The random forest algorithm classifies data points into labeled groups, here win, loss or tie. The features from a training data set include number of points, home loss, away win, shots on target and goals on target. These features form the branches of a decision tree. At each split in this tree diagram (e.g., is the number of points greater than 1?), the computer looks at a random subset of the features and picks the one whose question best separates the outcomes.
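For readers who are curious what a single split looks like in code, here is an optional, minimal sketch in plain Python. It is not the method from the Vaidya et al. paper. The four toy matches, the feature names `points` and `shots_on_target`, and the use of Gini impurity as the "how mixed is this group" score are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 means every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, n_candidates, rng):
    """Try yes/no questions on a random subset of features; keep the best one.

    Returns (weighted impurity, feature name, threshold), or None if no
    question separates the rows into two non-empty groups.
    """
    feature_names = sorted(rows[0])
    candidates = rng.sample(feature_names, n_candidates)  # the "random" part
    best = None
    for feature in candidates:
        for threshold in sorted({row[feature] for row in rows}):
            # Question asked at this split: is row[feature] > threshold?
            no = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            yes = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not no or not yes:
                continue  # this question does not split the data at all
            score = (len(no) * gini(no) + len(yes) * gini(yes)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

# Made-up toy data, not taken from the paper.
rows = [
    {"points": 0, "shots_on_target": 2},
    {"points": 1, "shots_on_target": 3},
    {"points": 3, "shots_on_target": 6},
    {"points": 3, "shots_on_target": 7},
]
labels = ["loss", "tie", "win", "win"]

result = best_split(rows, labels, n_candidates=2, rng=random.Random(0))
print(result)
```

A lower score means the two groups produced by the question are less mixed. In a real forest, each tree repeats this step at every split and sees a different random subset of features each time, which is what makes the trees differ from one another.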
The algorithm uses multiple decision trees that are different from one another (aka the forest) to classify whether a team will win, lose or tie the match. As you go down each tree and answer the questions about the features, that tree predicts the outcome of the match as a win, loss or tie. The forest then takes a vote, and the outcome predicted by the most trees becomes the final answer. Vaidya et al. report an accuracy of 47%, which is noticeably higher than the roughly 35% accuracy you would get by randomly guessing the game outcome.
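If you would like to see the voting idea and the comparison against a guessing baseline in code, here is an optional sketch in plain Python. The three hand-written "trees", the four toy matches and their outcomes are hypothetical and exist only to show the mechanics. They are not the trees or the data from the paper, and the numbers they produce say nothing about real matches.

```python
from collections import Counter

def forest_predict(trees, match):
    """Ask every tree for a prediction and return the most common answer."""
    votes = Counter(tree(match) for tree in trees)
    return votes.most_common(1)[0][0]

def accuracy(predictions, actual):
    """Fraction of predictions that match the real outcomes."""
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)

# Three hypothetical trees, each asking different questions about a match.
def tree_a(match):
    return "win" if match["points"] > 1 else "loss"

def tree_b(match):
    if match["shots_on_target"] >= 5:
        return "win"
    return "tie" if match["shots_on_target"] >= 3 else "loss"

def tree_c(match):
    if match["points"] == 1:
        return "tie"
    return "win" if match["shots_on_target"] >= 4 else "loss"

# Made-up matches and outcomes, for illustration only.
matches = [
    {"points": 3, "shots_on_target": 6},
    {"points": 1, "shots_on_target": 3},
    {"points": 0, "shots_on_target": 2},
    {"points": 3, "shots_on_target": 2},
]
actual = ["win", "tie", "loss", "win"]

trees = [tree_a, tree_b, tree_c]
predictions = [forest_predict(trees, m) for m in matches]

# Baseline: always guess the single most common outcome.
most_common_outcome = Counter(actual).most_common(1)[0][0]
baseline = accuracy([most_common_outcome] * len(actual), actual)

print("forest accuracy:", accuracy(predictions, actual))
print("baseline accuracy:", baseline)
```

The comparison at the end is the same kind of check as 47% versus 35%: a model is only interesting if it beats what you could get from guessing.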