I’ve devoted at least a dozen blog posts in 2018 to case studies describing specific machine learning algorithms, and I even have a video blog series that explains these algorithms in a different way, which you might want to check out. Today I want to start a mini-series of case studies that will help us determine which algorithms to choose in the first place. If you’ll recall, machine learning is a key component of data science in which a computer learns from the past data it is given and can predict patterns. If you’re new to data science and would like more information on machine learning basics, check out this clear Wikipedia article.
The most important reason we’re using data analytics is to solve a problem or answer a question. So let’s start with the data question we’re trying to answer: can data science be used to predict the future opening price of natural gas futures on the New York Stock Exchange (NYSE)? Let’s take a look at the financial data we have to see if we can systematically determine what type of machine learning algorithm(s) we need to use.
The screenshot above shows a portion of the 15,000 daily observations of closing, open, high and low natural gas futures prices on the NYSE. The closing, open, high and low prices are the four features in our data set. Here’s a flowchart I made (yes, I know, another data visualization, but who doesn’t like pictures?) that might help us choose our machine learning approach.
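Before picking an algorithm, it’s worth sanity-checking the data itself. Here’s a minimal sketch of that step using pandas; since the real export isn’t reproduced here, a few synthetic rows stand in for the 15,000 daily observations, and the column names (`Close`, `Open`, `High`, `Low`) are my assumed labels for the four features:

```python
import pandas as pd

# Stand-in for the real export: a few synthetic daily rows with the four
# price features described above (column names are illustrative).
df = pd.DataFrame({
    "Close": [2.91, 2.95, 2.88],
    "Open":  [2.90, 2.92, 2.94],
    "High":  [2.97, 2.99, 2.96],
    "Low":   [2.86, 2.90, 2.85],
})

features = ["Close", "Open", "High", "Low"]
print(df[features].describe())     # quick summary of the price columns
print(df[features].isna().sum())   # confirm there are no missing values
```

With the real file you would swap the inline frame for something like `pd.read_csv(...)` and run the same two checks.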
In our data, we are trying to predict a quantity (the closing price), so we will choose a regression algorithm. Using the scikit-learn algorithm cheat sheet featured above, we follow the path to the “regression” blue bubble. We have fewer than 100,000 samples, four features and no sparsity (that is, our data set is dense: virtually none of the feature values are zero). Based on these criteria, we should choose the Elastic Net and LASSO regression algorithms. According to a talk given by Yunting Sun, Elastic Net regression enforces sparsity and does not limit the number of selected variables. Note that according to Quora, the biggest pro of LASSO is that it can automatically choose variables; however, it may drop nonsignificant variables that are nonetheless interesting or important.
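To make the comparison concrete, here’s a minimal sketch of fitting both cheat-sheet candidates with scikit-learn. The data is synthetic (random features with a known linear signal), not the actual futures prices, and the hyperparameters `alpha` and `l1_ratio` are illustrative defaults, not tuned values:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.model_selection import train_test_split

# Synthetic stand-in: four features (think open/high/low/close) driving a
# linear target with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.5, 0.2, 0.1, 0.2]) + rng.normal(scale=0.05, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (Lasso(alpha=0.01), ElasticNet(alpha=0.01, l1_ratio=0.5)):
    model.fit(X_train, y_train)
    # coef_ shows which variables each penalty kept; score() is R^2.
    print(type(model).__name__, model.coef_, model.score(X_test, y_test))
```

LASSO’s automatic variable selection shows up in `coef_`: with a large enough `alpha` it zeroes out weak predictors entirely, while Elastic Net’s L2 component tends to keep correlated variables together.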
In our case, since we don’t have sparsity in our natural gas futures price data, I chose the simplest approach: linear regression. I know, I know: data scientists are supposed to sprinkle purple unicorn dust like artificial neural networks in the cloud on a problem to solve it, not use an old basic statistics method. (Well, not really, but that’s a whole different blog.) I then added decision tree regression and AdaBoost for comparison with linear regression, since these models are transparent and easy to interpret. The regression fit statistics (measures such as R² and mean squared error) came out well, which is a good indication that these models were a good choice for the analysis.
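A three-way comparison like this is easy to set up in scikit-learn. The sketch below again uses synthetic stand-in data rather than the actual futures prices, and the `max_depth` setting on the tree is an illustrative choice, not the one used in the analysis:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the four price features and a linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.6, 0.2, 0.1, 0.1]) + rng.normal(scale=0.05, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Regression models are scored with R^2 and error measures like MSE.
    print(f"{name}: R^2={model.score(X_test, y_test):.3f}, "
          f"MSE={mean_squared_error(y_test, pred):.4f}")
```

Keeping all three models in one dictionary and looping over them makes it trivial to add or drop candidates later, which is handy when you’re still deciding which algorithm fits the problem.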