** K-Means clustering** is an

*algorithm. Being unsupervised means that it requires no label or categories with the data under observation. If you are interested in*

*unsupervised machine learning***then you can start here.**

**supervised algorithms**K-means clustering is a surprisingly simple algorithm that creates groups (clusters) of similar data points within our entire dataset. This algorithm proves to be a very handy tool when **looking for hidden patterns in scattered data.**

The entire code used in this tutorial can be found here.

This tutorial will cover the following elements:

· **A brief overview of the k** **means algorithm.**

**· Implementing the k** **means with random initialization.**

**· Overview of the importance of optimal centroid initialization.**

· **Implementation****of K** **means++ for smart centroid initialization.**

Let us get started:

### 1. Understanding the Algorithm:

Suppose we have some random-looking data as shown in the picture below. We wish to create groupings in the data, so it looks a little more structured however, with the naked eye, it becomes difficult to decide which points to associate together. K-means will do this for us.

* One shortcoming of the K means algorithm* is that you need to specify how many clusters you need from the data. This can be a problem in cases where you want to segregate your data but are unsure how many categories there should be optimal.

*** Methods like the elbow method can be used to find an optimal number of clusters but those are not discussed in this article. ***

The K-means algorithm follows the following steps:

- Pick n data points that will act as the initial centroids.
- Calculate the Euclidean distance of each data point from each of the centroid points selected in step 1.
- Form data clusters by assigning every data point to whichever centroid it has the smallest distance from.
- Take the average of each formed cluster. The mean points are our new centroids.

* Repeat steps 2 through 4 until there is no longer a change in centroids*.

### 2. Implementation

Enough theory lets us get to coding.

First, we will take a look at our dataset. For this tutorial, we will use the dataset of human height-weight body index. The dataset is available for free and can be downloaded here.

```
#loading up the libraries
import pandas as pd
import numpy as np
import random as rd
import matplotlib.pyplot as plt
#loading data
data = pd.read_csv("500_Person_Gender_Height_Weight_Index.csv")
```

Code language: PHP (php)

From the scatter plot, we can see that the data has quite a random distribution and there are no clear clusters. It would be interesting to see what sort of groupings our algorithm performs.

Let’s quickly write down a function to get random points.

```
def get_random_centroids(data, k = 3):
#return random samples from the dataset
cent = (X.sample(n = k))
return cent
```

Code language: PHP (php)

Let’s see if it works.

These random centroids are…well… quite random 😵.

Let us code down the routine to reach appropriate centroids and create corresponding clusters.

```
def k_means_fit(X,centroids, n = 5):
#get a copy of the original data
X_data = X
diff = 1
j=0
while(diff!=0):
#creating a copy of the original dataframe
i=1
#iterate over each centroid point
for index1,row_c in centroids.iterrows():
ED=[]
#iterate over each data point
for index2,row_d in X_data.iterrows():
#calculate distance between current point and centroid
d1=(row_c["Height"]-row_d["Height"])**2
d2=(row_c["Weight"]-row_d["Weight"])**2
d=np.sqrt(d1+d2)
#append distance in a list 'ED'
ED.append(d)
#append distace for a centroid in original data frame
X[i]=ED
i=i+1
C=[]
for index,row in X.iterrows():
#get distance from centroid of current data point
min_dist=row[1]
pos=1
#loop to locate the closest centroid to current point
for i in range(n):
#if current distance is greater than that of other centroids
if row[i+1] < min_dist:
#the smaller distanc becomes the minimum distance
min_dist = row[i+1]
pos=i+1
C.append(pos)
#assigning the closest cluster to each data point
X["Cluster"]=C
#grouping each cluster by their mean value to create new centroids
centroids_new = X.groupby(["Cluster"]).mean()[["Weight","Height"]]
if j == 0:
diff=1
j=j+1
else:
#check if there is a difference between old and new centroids
diff = (centroids_new['Weight'] - centroids['Weight']).sum() + (centroids_new['Height'] - centroids['Height']).sum()
print(diff.sum())
centroids = X.groupby(["Cluster"]).mean()[["Weight","Height"]]
return X, centroids
```

Code language: PHP (php)

Let’s run this routine to locate 4 clusters in the data.

`centroids = get_random_centroids(X, k = 4) clustered, cent = k_means_fit(X,centroids, n= 4)`

All done!

Let’s see what we got.

```
#setting color values for our
color=['brown','blue','green','cyan']
#plot data
for k in range(len(color)):
cluster=clustered[clustered["Cluster"]==k+1]
plt.scatter(cluster["Height"],cluster["Weight"],c=color[k])
#plot centroids
plt.scatter(cent["Height"],cent["Weight"],c='red')
plt.xlabel('Height/ cm')
plt.ylabel('Weight/ kg')
```

Code language: PHP (php)

Looks like we have our groupings all done ✌️. Now it is time to see if initializing the centroids any different would make a difference.

Let’s talk about the K means++ initialization algorithm.

### 3. Optimal Centroid Initialization.

When initializing the centroids, the initially selected points must be fairly apart. If the points are too close together, there is a good chance the points will find a cluster in their local region and the actual cluster will be blended with another cluster.

*When randomly initializing centroids, we have no control over where their initial position will be.*

The K means++ is a smart way to tackle this problem.

Just like K Means itself, K Means++ too is a very simple algorithm.

1. The first centroid is selected randomly.

2. Calculate the Euclidean distance between the centroid and every other data point in the dataset. The point farthest away will become our next centroid.

3. Create clusters around these centroids by associating every point with its nearest centroid.

4. The point which has the farthest distance from its centroid will be our next centroid.

5. Repeat steps 3 and 4 until n number of centroids are located.

### 4. Implementation of K Means++

Let’s code our algorithm.

```
def get_kmeans_pp_centroids(X1,k = 5):
centroids = X1.sample()
print(centroids)
i = 1
dist = []
while i != k:
max_dist = [0,0]
#go through the centroids
for index, row in centroids.iterrows():
#calculate distance of every centroid with every other data point
d = np.sqrt((X1["Height"] - row["Height"])**2 +(X1["Weight"] - row["Weight"])**2)
#check which centroid has a max distance with another point
if max(d) > max(max_dist):
max_dist = d
X1 = pd.concat([X1, max_dist], axis = 1)
idx = X1.iloc[:,i+1].idxmax()
max_coor = pd.DataFrame(X1.iloc[idx][["Height", "Weight"]]).T
centroids = pd.concat([centroids,max_coor])
X1 = X1.drop(idx)
i+=1
return centroids
```

Code language: PHP (php)

With the algorithm penned down, let us test it on the K-Means algorithm we built above.

`centroids = get_kmeans_pp_centroids(X, k = 4) clustered, cent = k_means_fit(X,centroids, n= 4)`

We can see that this time we have found different clusters. This shows the difference initial centroids can make.

### 5. Conclusion

In summary, K-Means is a simple yet powerful algorithm. It can be used to create clusters in your current data. These clusters help you get a better picture of your current data and the clusters can be used to analyze any future data. This can be helpful if you are trying to analyze the customer base of your business. Grouping together customers can help you create personalized policies for each cluster and when a new customer joins, they can be easily associated with the already formed clusters, the possibilities are limitless.