Last week I looked at Singular Value Decomposition (SVD), an unsupervised machine learning technique, as part of a four-part series on data science concepts for beginners. Remember that unsupervised machine learning is data driven, whereas supervised machine learning is task driven. Today we’ll stay in the dimension-reduction part of unsupervised machine learning, as shown in the cheat sheet below, and talk about principal component analysis, or PCA.
Like SVD, PCA reduces the number of dimensions for data exploration. The PCA method finds the directions along which the variance of the data is largest and converts a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components.
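To make that concrete, here is a minimal from-scratch sketch of the idea: centre the data, take the covariance matrix, and use its eigenvectors (the principal components) as the new, uncorrelated axes. The toy data and variable names here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 samples of 3 variables, with column 1 deliberately
# correlated with column 0 (illustrative numbers only)
X = rng.normal(size=(100, 3))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Step 1: centre each variable at zero
Xc = X - X.mean(axis=0)

# Step 2: covariance matrix of the centred data
cov = np.cov(Xc, rowvar=False)

# Step 3: eigendecomposition; eigenvectors are the principal components,
# eigenvalues are the variance along each component
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components by decreasing variance explained
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the top 2 components (3-D -> 2-D)
X2 = Xc @ eigvecs[:, :2]

print(X2.shape)                  # (100, 2)
print(eigvals / eigvals.sum())   # fraction of total variance per component
```

Because the components are eigenvectors of the covariance matrix, the projected columns are linearly uncorrelated, and keeping only the top components keeps as much of the original variance as possible.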
A great example from Francois Labelle uses three variables (population, average income and area) for 20 countries (Australia, Canada, US, Russia, Brazil, Mexico, Spain, France, Italy, Germany, UK, South Korea, Japan, Iran, Turkey, Thailand, Indonesia, Pakistan, India and China). We want to reduce the dimensionality so that we can show the features in a two-dimensional rather than three-dimensional space, as shown below. In other words, we reduce the variables to reduce redundancy.
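In practice you would use a library rather than hand-rolled eigendecomposition. Here is a short sketch of that country example using scikit-learn's `PCA`; the numbers are random stand-ins for (population, average income, area), not the actual figures from Labelle's example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for (population, average income, area) for 20 countries;
# illustrative values only, not real country statistics
rng = np.random.default_rng(42)
X = rng.lognormal(mean=3.0, sigma=1.0, size=(20, 3))

# The three variables are on wildly different scales, so standardise first
# (otherwise the largest-scale variable dominates the components)
X_std = StandardScaler().fit_transform(X)

# Project the three standardised variables down to two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                     # (20, 2) -- one 2-D point per country
print(pca.explained_variance_ratio_)  # variance captured by each component
```

The two columns of `X_2d` are exactly the axes you would plot to get a two-dimensional scatter of the countries, and `explained_variance_ratio_` tells you how much of the original three-variable variance the plot preserves.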
Some great worked examples and source code come from the SciKit Learn (Python) documentation and Michael Barton’s (R) blog. Some of the best articles are from the University of Illinois (ppt) and the PCA paper by Abdi, and Georgia Tech’s Udacity course has one of the best YouTube video explanations. Other technology applications include image processing, pattern recognition, time series prediction, neuroscience and the Internet of Things.
Next week we’ll take a look at the Apriori unsupervised machine learning algorithm.