• December 2, 2021 The worlds most valuable resource is no longer oil, but data

As surprising as the above statement may sound, it is absolutely true. We now have now moved towards a world where data is the real power but as we all know,

with great data, comes great analytical problems.

Well maybe not ALL of us but the data engineers out there know what  I am talking about. Many times we have huge chunks of randomly spread data that make no sense at first glance. This is where clustering comes to our aid.

In the simplest of terms, clustering is creating groups within data of similar-looking points. There are many clustering algorithms that have proven very effective,

• K means Clustering
• K median Clustering
• DBSCAN Clustering

… to name a few.

For this article, we forget the first two and focus on the last one that is DBSCAN Clustering.

DBSCAN stands for Density-based spatial clustering of applications with noise.

• Understanding the DBSCAN algorithm
• Implementing the DBSCAN algorithm
• DBSCAN vs K means

Let’s get to it.

NOTE: Fun Bonus at the end

### Understanding the Algorithm

Say we have some random data points.

Visually, it is impossible to make out which points should be grouped together as many of the points have very minor differences in them.

To start off the algorithm, we choose a random point.

It is time to introduce 2 major parameters for DBSCAN clustering.

1. Epsilon

This is the radius of the circle that we must draw around our pointing focus.

2. Minimum number of points

This is the number of points that we want in the neighbourhood of our point in focus (within the circle).

…continuing with the algorithm

We draw a circle of radius ‘epsilon‘ around the chosen point and inspect the neighbourhood.

Now let’s suppose that we kept the ‘minimum points‘ parameter to be 3. This is how it would matter.

There are few rules to follow here;

• A point is a ‘core point’ if its neighbouring points are greater than or equal to the minimum points we chose at the start.
• A point is a ‘border point’ if its neighbouring points are less than the minimum points we chose at the start.
• A point is considered ‘Noise‘ if it has no points in its neighbourhood

According to the rules of the algorithm, this point is a core point since it has more neighbours (5) than the number we decided (3).

These all points are said to be part of a single cluster.

**Since it was a core point, we now move on to apply the same algorithm on all of its neighbours. **

Checking the neighbourhood of all the neighbours we see that there is one more point that is within the neighbourhood but not part of a cluster. It now becomes part of the red cluster.

The point in whose neighbourhood this new point lies is also a core point.

So we check this new point to continue the cluster.

Seems like this is a border point, even though it becomes part of the red cluster we will not check its neighbourhood to continue the cluster.

Seem like we have hit a dead-end; there are no more points to consider for this cluster. Well, this means we start the algorithm all over with new points, creating a new cluster.

This point seems like a dead end on its own; it doesn’t have any neighbours.

Good thing we defined rules for every situation. This point is what we label as ‘Noise

Looks like we gotta start over.

Following the algorithm, we apply the set rules to all unvisited points until we have them all in a cluster.

• Unsupervised learning

The DBSCAN algorithm requires no labels to create clusters hence it can be applied to all sorts of data.

• Self cluster forming

Unlike its much more famous counterpart, k means, DBSCAN does not require a number of clusters to be defined beforehand. It forms clusters using the rules we defined above.

• Noise detection

As demonstrated above, it is not influenced by noisy points and can detect and isolate any outliers in the data.

### Implementation

Full code available here.

We start by importing all the useful libraries.

``````from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random``````

Now we obviously need some data that we can cluster.

``````centers = [(0, 4), (5, 5) , (8,2)]
cluster_std = [1.2, 1, 1.1]

X, y= make_blobs(n_samples=200, cluster_std=cluster_std, centers=centers, n_features=2, random_state=1)

plt.scatter(X[y == 0, 0], X[y == 0, 1], s=10, label="Cluster1")
plt.scatter(X[y == 1, 0], X[y == 1, 1], s=10, label="Cluster2")
plt.scatter(X[y == 2, 0], X[y == 2, 1], s=10, label="Cluster3")
plt.title("Scattered data")``````

** Note: The differently coloured data points represent the 3 different distributions that the data is drawn from **

The data that our algorithm will work on will look like this;

Let’s define our two main functions;

``````def check_core_point(eps,minPts, df, index):
#get points from given index
x, y = df.iloc[index]['X']  ,  df.iloc[index]['Y']

temp =  df[((np.abs(x - df['X']) <= eps) & (np.abs(y - df['Y']) <= eps)) & (df.index != index)]

#check how many points are present within radius
if len(temp) >= minPts:
#return format (dataframe, is_core, is_border, is_noise)
return (temp.index , True, False, False)

elif (len(temp) < minPts) and len(temp) > 0:
#return format (dataframe, is_core, is_border, is_noise)
return (temp.index , False, True, False)

elif len(temp) == 0:
#return format (dataframe, is_core, is_border, is_noise)
return (temp.index , False, False, True)``````

The above function will look at each point in focus and determine if it is a core point, border point, or noise.

``````def cluster_with_stack(eps, minPts, df):

#initiating cluster number
C = 1
#initiating stacks to maintain
current_stack = set()
unvisited = list(df.index)
clusters = []

while (len(unvisited) != 0): #run until all points have been visited

#identifier for first point of a cluster
first_point = True

#choose a random unvisited point

while len(current_stack) != 0: #run until a cluster is complete

#pop current point from stack
curr_idx = current_stack.pop()

#check if point is core, neighbour or border
neigh_indexes, iscore, isborder, isnoise = check_core_point(eps, minPts, df, curr_idx)

#dealing with an edge case
if (isborder & first_point):
#for first border point, we label it aand its neighbours as noise
clusters.append((curr_idx, 0))
clusters.extend(list(zip(neigh_indexes,[0 for _ in range(len(neigh_indexes))])))

#label as visited
unvisited.remove(curr_idx)
unvisited = [e for e in unvisited if e not in neigh_indexes]

continue

unvisited.remove(curr_idx) #remove point from unvisited list

neigh_indexes = set(neigh_indexes) & set(unvisited) #look at only unvisited points

if iscore: #if current point is a core
first_point = False

clusters.append((curr_idx,C)) #assign to a cluster
current_stack.update(neigh_indexes) #add neighbours to a stack

elif isborder: #if current point is a border point
clusters.append((curr_idx,C))

continue

elif isnoise: #if current point is noise
clusters.append((curr_idx, 0))

continue

if not first_point:
#increment cluster number
C+=1

return clusters``````

This is our main function where our entire algorithm logic resides.

Before running this code there is one important thing to consider when coding the algorithm. There is one particular edge case that can cause the algorithm to run forever or cause problems in clustering.

Whenever picking out a point, it might be possible that the first point you pick out might turn out to be a border point. Now here we have 3 possibilities;

1. Ignore the point and start the algorithm again with another random point.
2. Create a separate cluster of this point and its neighbours.
3. label the point as noise and move on with the algorithm.

The first option above seems plausible but we may run into a dead end. Your data distribution might be such that at the end you are left with several such points that just form small clusters. Your algorithm will simply keep on skipping these points and run forever.

The second option seemed good too but such small clusters did not make sense in the data and seem useless.

I went with the third option to keep the clustering as clean as possible,

``````#dealing with an edge case
if (isborder & first_point):
#for first border point, we label it aand its neighbours as noise
clusters.append((curr_idx, 0))
clusters.extend(list(zip(neigh_indexes,[0 for _ in range(len(neigh_indexes))])))

#label as visited
unvisited.remove(curr_idx)
unvisited = [e for e in unvisited if e not in neigh_indexes]

continue``````

This snippet is from the main function above and deals with this edge case.

### Results

Now we run the algorithm and check the results.

``````#radius of the circle defined as 0.6
eps = 0.6
#minimum neighbouring points set to 3
minPts = 3

data = pd.DataFrame(X, columns = ["X", "Y"] )
clustered = cluster_with_stack(eps, minPts, data)

idx , cluster = list(zip(*clustered))
cluster_df = pd.DataFrame(clustered, columns = ["idx", "cluster"])

plt.figure(figsize=(10,7))
for clust in np.unique(cluster):
plt.scatter(X[cluster_df["idx"][cluster_df["cluster"] == clust].values, 0], X[cluster_df["idx"][cluster_df["cluster"] == clust].values, 1], s=10, label=f"Cluster{clust}")

plt.legend([f"Cluster {clust}" for clust in np.unique(cluster)], loc ="lower right")
plt.title('Clustered Data')
plt.xlabel('X')
plt.ylabel('Y')``````

Seems like we have reasonable clusters. If you feel like some data has been ‘mis-clustered’ then you can play around with the epsilon and min-points value.

Below are some results from changing these values.

### DBSCAN vs K Means

K Means, even though much more popular, lacks behind DBSCAN in some cases. Firstly, it needs a predefined number for how many clusters we need. We may not always have this information and it can be quite difficult to analyse data.

K Means also does not identify outliers and tries to group every point in a cluster.

``````from sklearn.cluster import KMeans

clustering = KMeans(3).fit(X)

plt.figure(figsize=(10,7))
for clust in np.unique(clustering.labels_):
plt.scatter(X[clustering.labels_ == clust, 0], X[clustering.labels_ == clust, 1], s=10, label=f"Cluster{clust}")

plt.legend([f"Cluster {clust}" for clust in np.unique(clustering.labels_)], loc ="lower right")``````

We will apply K Means clustering from sklearn library to the same data.

The 3 clusters seemed to be formed alright but we have many outliers that should not be in any cluster.

There are certain cases wherein K Means vs DBSCAN, DBSCAN takes the cake. The illustration below explains this perfectly.