In last month’s blog, we looked at a supervised machine learning classification case study. Today we will look at a case study by Muhammad Zulfadhilah et al. on cyber profiling of higher education in Indonesia using K-means clustering, an unsupervised technique. As Saurav Kaushik aptly puts it, ‘clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups.’ K-means clustering is a centroid model, meaning that it assigns each data point to a group based on how close the point is to the center of that group’s cluster.
This cyber profiling case study explores data from educational institutions in Indonesia to categorize the activities users perform on the Internet. This type of data mining helps find new patterns in the data and is also referred to as behavioral segmentation. The researchers use K-means clustering to group websites by the number of visits they receive, which answers the question of which websites are the most popular. User profiling is done when Internet browsing data is married with user information. The full dataset consisted of 320,773 records of Internet network traffic collected over five days from Indonesian educational institutions. Note that the dataset records traffic at the network level, not at the level of individual computers. A sample of the data is shown below in Table 1.
There are four main steps in K-means clustering, which I will discuss in generalized terms. Step 1 is to pick K random points as the initial cluster centers (also known as centroids). Step 2 is to assign each data point to the nearest cluster by calculating its distance to each centroid. Step 3 is to find a new center for each cluster by taking the average of the data points assigned to it. The final step is to repeat steps 2 and 3 until none of the cluster assignments change. For the nitty gritty math, here’s a great blog. Here are two sample blogs on how to implement it in Python and R.
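The four steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the case study: the function name `kmeans`, the `max_iter` and `seed` parameters, and the six sample points are all assumptions made up for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k of the data points at random as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        # (squared Euclidean distance gives the same ordering as Euclidean).
        new_assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop once no assignment has changed since the last pass.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical sample: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

Because the starting centroids are random, the cluster numbers themselves are arbitrary. What matters is which points end up sharing a number.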
Using the K-means clustering algorithm, the level of visits to a website is divided into three groups: low, medium and high. There were 2 users and 1,467 websites in Cluster 1, 19 users and 126 websites in Cluster 2, and 46 users and 33 websites in Cluster 3. Cluster 1 has a low level of network traffic and its content is all advertisements. Cluster 2 has a moderate level of network traffic and its content is all news sites. Cluster 3 has a high level of network traffic and its content is social media sites. The clustering suggests that most of the sampled users at Indonesian educational institutions use the Internet to get content from social media sites.
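To make the low, medium and high grouping concrete, here is a rough sketch of how visit counts could be split into three clusters and then named by ranking the cluster centers. It is not the researchers' code. The function `label_visit_levels`, the starting centroids (smallest, median and largest count), and the site names and visit counts are all invented for illustration and do not come from the study's data.

```python
from statistics import mean, median_low

def label_visit_levels(visits_by_site, max_iter=100):
    """Cluster 1-D visit counts into three groups and name them by centroid rank."""
    counts = list(visits_by_site.values())
    # Assumed deterministic start: smallest, median, and largest count.
    centroids = [min(counts), median_low(counts), max(counts)]

    def nearest(value):
        # Index of the centroid closest to this visit count.
        return min(range(3), key=lambda c: abs(value - centroids[c]))

    for _ in range(max_iter):
        groups = [[], [], []]
        for v in counts:
            groups[nearest(v)].append(v)
        # Recompute each centroid; an empty group keeps its previous centroid.
        updated = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated

    # Rank clusters by centroid so the names follow traffic volume.
    order = sorted(range(3), key=lambda c: centroids[c])
    name_of = {cluster: name for cluster, name in zip(order, ["low", "medium", "high"])}
    return {site: name_of[nearest(v)] for site, v in visits_by_site.items()}

# Hypothetical visit counts, invented for this example.
visits_by_site = {
    "ads-1.example": 2, "ads-2.example": 3, "ads-3.example": 1,
    "news-1.example": 40, "news-2.example": 55,
    "social-1.example": 400, "social-2.example": 520,
}
levels = label_visit_levels(visits_by_site)
for site, level in levels.items():
    print(site, level)
```

Naming the clusters after the fact is needed because K-means only returns arbitrary cluster numbers. Sorting the centroids is one simple way to attach an ordered label such as low, medium or high.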