K-Means clustering is an unsupervised machine learning algorithm. Being unsupervised means that it requires no label or categories with the data under observation. If you are interested in supervised algorithms then you can start here.
K-means clustering is a surprisingly simple algorithm that creates groups (clusters) of similar data points within our entire dataset. This algorithm proves to be a very handy tool when looking for hidden patterns in scattered data.
The entire code used in this tutorial can be found here.
This tutorial will cover the following elements:
· A brief overview of the k means algorithm.
· Implementing the k means with random initialization.
· Overview of the importance of optimal centroid initialization.
· Implementation of K means++ for smart centroid initialization.
Let us get started:
1. Understanding the Algorithm:
Suppose we have some random-looking data as shown in the picture below. We wish to create groupings in the data, so it looks a little more structured however, with the naked eye, it becomes difficult to decide which points to associate together. K-means will do this for us.
One shortcoming of the K means algorithm is that you need to specify how many clusters you need from the data. This can be a problem in cases where you want to segregate your data but are unsure how many categories there should be optimal.
- * Methods like the elbow method can be used to find an optimal number of clusters but those are not discussed in this article. *
The K-means algorithm follows the following steps:
- Pick n data points that will act as the initial centroids.
- Calculate the Euclidean distance of each data point from each of the centroid points selected in step 1.
- Form data clusters by assigning every data point to whichever centroid it has the smallest distance from.
- Take the average of each formed cluster. The mean points are our new centroids.
Repeat steps 2 through 4 until there is no longer a change in centroids.
Enough theory lets us get to coding.
First, we will take a look at our dataset. For this tutorial, we will use the dataset of human height-weight body index. The dataset is available for free and can be downloaded here.
#loading up the libraries import pandas as pd import numpy as np import random as rd import matplotlib.pyplot as plt #loading data data = pd.read_csv("500_Person_Gender_Height_Weight_Index.csv")
From the scatter plot, we can see that the data has quite a random distribution and there are no clear clusters. It would be interesting to see what sort of groupings our algorithm performs.
Let’s quickly write down a function to get random points.
def get_random_centroids(data, k = 3): #return random samples from the dataset cent = (X.sample(n = k)) return cent
Let’s see if it works.
These random centroids are…well… quite random 😵.
Let us code down the routine to reach appropriate centroids and create corresponding clusters.
def k_means_fit(X,centroids, n = 5): #get a copy of the original data X_data = X diff = 1 j=0 while(diff!=0): #creating a copy of the original dataframe i=1 #iterate over each centroid point for index1,row_c in centroids.iterrows(): ED= #iterate over each data point for index2,row_d in X_data.iterrows(): #calculate distance between current point and centroid d1=(row_c["Height"]-row_d["Height"])**2 d2=(row_c["Weight"]-row_d["Weight"])**2 d=np.sqrt(d1+d2) #append distance in a list 'ED' ED.append(d) #append distace for a centroid in original data frame X[i]=ED i=i+1 C= for index,row in X.iterrows(): #get distance from centroid of current data point min_dist=row pos=1 #loop to locate the closest centroid to current point for i in range(n): #if current distance is greater than that of other centroids if row[i+1] < min_dist: #the smaller distanc becomes the minimum distance min_dist = row[i+1] pos=i+1 C.append(pos) #assigning the closest cluster to each data point X["Cluster"]=C #grouping each cluster by their mean value to create new centroids centroids_new = X.groupby(["Cluster"]).mean()[["Weight","Height"]] if j == 0: diff=1 j=j+1 else: #check if there is a difference between old and new centroids diff = (centroids_new['Weight'] - centroids['Weight']).sum() + (centroids_new['Height'] - centroids['Height']).sum() print(diff.sum()) centroids = X.groupby(["Cluster"]).mean()[["Weight","Height"]] return X, centroids
Let’s run this routine to locate 4 clusters in the data.
centroids = get_random_centroids(X, k = 4) clustered, cent = k_means_fit(X,centroids, n= 4)
Let’s see what we got.
#setting color values for our color=['brown','blue','green','cyan'] #plot data for k in range(len(color)): cluster=clustered[clustered["Cluster"]==k+1] plt.scatter(cluster["Height"],cluster["Weight"],c=color[k]) #plot centroids plt.scatter(cent["Height"],cent["Weight"],c='red') plt.xlabel('Height/ cm') plt.ylabel('Weight/ kg')
Looks like we have our groupings all done ✌️. Now it is time to see if initializing the centroids any different would make a difference.
Let’s talk about the K means++ initialization algorithm.
3. Optimal Centroid Initialization.
When initializing the centroids, the initially selected points must be fairly apart. If the points are too close together, there is a good chance the points will find a cluster in their local region and the actual cluster will be blended with another cluster.
When randomly initializing centroids, we have no control over where their initial position will be.
The K means++ is a smart way to tackle this problem.
Just like K Means itself, K Means++ too is a very simple algorithm.
1. The first centroid is selected randomly.
2. Calculate the Euclidean distance between the centroid and every other data point in the dataset. The point farthest away will become our next centroid.
3. Create clusters around these centroids by associating every point with its nearest centroid.
4. The point which has the farthest distance from its centroid will be our next centroid.
5. Repeat steps 3 and 4 until n number of centroids are located.
4. Implementation of K Means++
Let’s code our algorithm.
def get_kmeans_pp_centroids(X1,k = 5): centroids = X1.sample() print(centroids) i = 1 dist =  while i != k: max_dist = [0,0] #go through the centroids for index, row in centroids.iterrows(): #calculate distance of every centroid with every other data point d = np.sqrt((X1["Height"] - row["Height"])**2 +(X1["Weight"] - row["Weight"])**2) #check which centroid has a max distance with another point if max(d) > max(max_dist): max_dist = d X1 = pd.concat([X1, max_dist], axis = 1) idx = X1.iloc[:,i+1].idxmax() max_coor = pd.DataFrame(X1.iloc[idx][["Height", "Weight"]]).T centroids = pd.concat([centroids,max_coor]) X1 = X1.drop(idx) i+=1 return centroids
With the algorithm penned down, let us test it on the K-Means algorithm we built above.
centroids = get_kmeans_pp_centroids(X, k = 4) clustered, cent = k_means_fit(X,centroids, n= 4)
We can see that this time we have found different clusters. This shows the difference initial centroids can make.
In summary, K-Means is a simple yet powerful algorithm. It can be used to create clusters in your current data. These clusters help you get a better picture of your current data and the clusters can be used to analyze any future data. This can be helpful if you are trying to analyze the customer base of your business. Grouping together customers can help you create personalized policies for each cluster and when a new customer joins, they can be easily associated with the already formed clusters, the possibilities are limitless.