(Image by Quora)
During one of my spring 2017 graduate courses, one of my assignments was to develop a one-page knowledge base solution that would be a one-stop resource for people new to data science. For those of you not familiar with the term, a knowledge base is a store of information that is available to draw upon. The result would be added to the existing Data Science Knowledge Base at Indiana University, where I’m getting my degree. Each of my knowledge base solutions would include a definition of the concept in plain language, key vocabulary terms, GitHub source code, examples, tutorials, the best white papers or articles, videos, some applications of the concept, and forums where you can learn the lingo from the experts. Over the next four blog posts, we’ll look at one concept each from the field of unsupervised machine learning. To learn what unsupervised machine learning is, check out my video “Data Science in 90 seconds: Part 1.”
Today we’ll begin the series by looking at K-means clustering, an unsupervised machine learning technique. According to Wikipedia, unsupervised machine learning is the task of inferring a function to describe hidden structure from “unlabeled” data (data whose classification or categorization is not included in the observations). Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm, which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning.
As the name suggests, clustering is when we try to group raw data points together into clusters, or groups. K is the number of clusters you want the algorithm to form, and it can be adjusted to get the best results. In the image at the top of this article, k = 7, since there are seven different colored circles, or clusters, of data points (black, red, yellow, light blue, dark blue, purple and green). Another way of understanding this concept is that, given a set of observations, the goal of the k-means clustering method is to partition the observations into k sets. There are a number of methods to determine the ‘best’ number of clusters in a data set.
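To make the idea concrete, here is a minimal sketch of the standard k-means procedure (Lloyd’s algorithm) in Python with NumPy. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The function name, parameters, and the toy two-blob data set are my own illustration, not from any of the resources linked below; a production workflow would more likely use scikit-learn’s KMeans.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """A minimal k-means sketch (Lloyd's algorithm). Assumes k <= len(X)
    and that no cluster empties out during iteration."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct random observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so we have converged
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated 2-D blobs, so k = 2 should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

Note that the result depends on the random initialization, which is why practical implementations run the algorithm several times from different starting centroids and keep the partition with the lowest within-cluster sum of squares.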
Examples of K-means clustering can be found here: DataCamp (R); SciPy (Python); University of Illinois Data Mining Concepts and Techniques (ppt). Some of the best free online tutorials include: the Python scikit-learn tutorial; the R-Studio Wine Data Tutorial; and Data Mining in R (book; pp. 77+). If my explanation is still unclear, Anil Jain has a great white paper with a more in-depth explanation: Data clustering: 50 years beyond K-means. My favorite YouTube video is by Kanza Haider, and my favorite GitHub source code is by The Lazy Programmer (Python) and DGRTwo (R). To practice more on your own, here are some sample data sets: Causality Workbench; Digging into Data; Kaggle; UCI Machine Learning Repository. Finally, some forums where you can learn the lingo from the experts include Data Science Central, Quora, Stack Overflow, FastML and/or KDnuggets. Some technology applications include disease detection, sports performance prediction, food safety and fraud detection.
Next week I’ll talk about Singular Value Decomposition, another unsupervised machine learning technique.