DBSCAN Clustering Algorithm Implementation from scratch | Python

· 8 min read
DBSCAN Clustering Algorithm Implementation from scratch | Python
The worlds most valuable resource is no longer oil, but data

As surprising as the above statement may sound, it is absolutely true. We now have now moved towards a world where data is the real power but as we all know,

with great data, comes great analytical problems.

Well maybe not ALL of us but the data engineers out there know what  I am talking about. Many times we have huge chunks of randomly spread data that make no sense at first glance. This is where clustering comes to our aid.

In the simplest of terms, clustering is creating groups within data of similar-looking points. There are many clustering algorithms that have proven very effective,

  • K means Clustering
  • K median Clustering
  • DBSCAN Clustering    

... to name a few.

For this article, we forget the first two and focus on the last one that is DBSCAN Clustering.

DBSCAN stands for Density-based spatial clustering of applications with noise.

This article will cover the following;

  • Understanding the DBSCAN algorithm
  • DBSCAN algorithm Advantages
  • Implementing the DBSCAN algorithm
  • DBSCAN vs K means

Let's get to it.

NOTE: Fun Bonus at the end


data documents
source

Understanding the Algorithm

Say we have some random data points.

scattered data points

Visually, it is impossible to make out which points should be grouped together as many of the points have very minor differences in them.

To start off the algorithm, we choose a random point.

choose a random point
Red point will be our selected point
The red point will be our selected point

It is time to introduce 2 major parameters for DBSCAN clustering.

  1. Epsilon

This is the radius of the circle that we must draw around our pointing focus.

2. Minimum number of points

This is the number of points that we want in the neighbourhood of our point in focus (within the circle).

...continuing with the algorithm

We draw a circle of radius 'epsilon' around the chosen point and inspect the neighbourhood.

Now let's suppose that we kept the 'minimum points' parameter to be 3. This is how it would matter.

There are few rules to follow here;

  • A point is a 'core point' if its neighbouring points are greater than or equal to the minimum points we chose at the start.
  • A point is a 'border point' if its neighbouring points are less than the minimum points we chose at the start.
  • A point is considered 'Noise' if it has no points in its neighbourhood
circle drawn
draw a circle and inspect the neighbourhood

According to the rules of the algorithm, this point is a core point since it has more neighbours (5) than the number we decided (3).

These all points are said to be part of a single cluster.

clustered points
clustered together
**Since it was a core point, we now move on to apply the same algorithm on all of its neighbours. **
stray point found
found another stray point

Checking the neighbourhood of all the neighbours we see that there is one more point that is within the neighbourhood but not part of a cluster. It now becomes part of the red cluster.

The point in whose neighbourhood this new point lies is also a core point.

Another core point
Another core point

So we check this new point to continue the cluster.

border point
A border point found

Seems like this is a border point, even though it becomes part of the red cluster we will not check its neighbourhood to continue the cluster.

Seem like we have hit a dead-end; there are no more points to consider for this cluster. Well, this means we start the algorithm all over with new points, creating a new cluster.

New point
New point

This point seems like a dead end on its own; it doesn't have any neighbours.

Good thing we defined rules for every situation. This point is what we label as 'Noise'

Noise point
Noise found

Looks like we gotta start over.

algortihm repeat
Continuing the process

Following the algorithm, we apply the set rules to all unvisited points until we have them all in a cluster.

DBSCAN Advantages

  • Unsupervised learning

The DBSCAN algorithm requires no labels to create clusters hence it can be applied to all sorts of data.

  • Self cluster forming

Unlike its much more famous counterpart, k means, DBSCAN does not require a number of clusters to be defined beforehand. It forms clusters using the rules we defined above.

  • Noise detection

As demonstrated above, it is not influenced by noisy points and can detect and isolate any outliers in the data.


Implementation

Full code available here.

We start by importing all the useful libraries.

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random

Now we obviously need some data that we can cluster.

centers = [(0, 4), (5, 5) , (8,2)]
cluster_std = [1.2, 1, 1.1]

X, y= make_blobs(n_samples=200, cluster_std=cluster_std, centers=centers, n_features=2, random_state=1)

plt.scatter(X[y == 0, 0], X[y == 0, 1], s=10, label="Cluster1")
plt.scatter(X[y == 1, 0], X[y == 1, 1], s=10, label="Cluster2")
plt.scatter(X[y == 2, 0], X[y == 2, 1], s=10, label="Cluster3")
plt.title("Scattered data")
Synthetically generated data

** Note: The differently coloured data points represent the 3 different distributions that the data is drawn from **

The data that our algorithm will work on will look like this;

All the same

Let's define our two main functions;

The above function will look at each point in focus and determine if it is a core point, border point, or noise.

This is our main function where our entire algorithm logic resides.

Before running this code there is one important thing to consider when coding the algorithm. There is one particular edge case that can cause the algorithm to run forever or cause problems in clustering.  

Whenever picking out a point, it might be possible that the first point you pick out might turn out to be a border point. Now here we have 3 possibilities;

  1. Ignore the point and start the algorithm again with another random point.
  2. Create a separate cluster of this point and its neighbours.
  3. label the point as noise and move on with the algorithm.

The first option above seems plausible but we may run into a dead end. Your data distribution might be such that at the end you are left with several such points that just form small clusters. Your algorithm will simply keep on skipping these points and run forever.

The second option seemed good too but such small clusters did not make sense in the data and seem useless.

I went with the third option to keep the clustering as clean as possible,

#dealing with an edge case
            if (isborder & first_point):
                #for first border point, we label it aand its neighbours as noise 
                clusters.append((curr_idx, 0))
                clusters.extend(list(zip(neigh_indexes,[0 for _ in range(len(neigh_indexes))])))
                
                #label as visited
                unvisited.remove(curr_idx)
                unvisited = [e for e in unvisited if e not in neigh_indexes]
    
                continue

This snippet is from the main function above and deals with this edge case.

Results

Now we run the algorithm and check the results.

#radius of the circle defined as 0.6
eps = 0.6
#minimum neighbouring points set to 3
minPts = 3

data = pd.DataFrame(X, columns = ["X", "Y"] )
clustered = cluster_with_stack(eps, minPts, data)

idx , cluster = list(zip(*clustered))
cluster_df = pd.DataFrame(clustered, columns = ["idx", "cluster"])

plt.figure(figsize=(10,7))
for clust in np.unique(cluster):
    plt.scatter(X[cluster_df["idx"][cluster_df["cluster"] == clust].values, 0], X[cluster_df["idx"][cluster_df["cluster"] == clust].values, 1], s=10, label=f"Cluster{clust}")

plt.legend([f"Cluster {clust}" for clust in np.unique(cluster)], loc ="lower right")
plt.title('Clustered Data')
plt.xlabel('X')
plt.ylabel('Y')
Clustered using DBSCAN
Clustered using DBSCAN

Seems like we have reasonable clusters. If you feel like some data has been 'mis-clustered' then you can play around with the epsilon and min-points value.

Below are some results from changing these values.

eps = 0.5 - minPts = 3
eps = 0.5 - minPts = 3
eps = 0.55 - minPts = 3
eps = 0.58 - minPts = 4

DBSCAN vs K Means

K Means, even though much more popular, lacks behind DBSCAN in some cases. Firstly, it needs a predefined number for how many clusters we need. We may not always have this information and it can be quite difficult to analyse data.

K Means also does not identify outliers and tries to group every point in a cluster.

from sklearn.cluster import KMeans

clustering = KMeans(3).fit(X)

plt.figure(figsize=(10,7))
for clust in np.unique(clustering.labels_):
    plt.scatter(X[clustering.labels_ == clust, 0], X[clustering.labels_ == clust, 1], s=10, label=f"Cluster{clust}")

plt.legend([f"Cluster {clust}" for clust in np.unique(clustering.labels_)], loc ="lower right")

We will apply K Means clustering from sklearn library to the same data.

K-Means with 3 clusters
K-Means with 3 clusters

The 3 clusters seemed to be formed alright but we have many outliers that should not be in any cluster.

There are certain cases wherein K Means vs DBSCAN, DBSCAN takes the cake. The illustration below explains this perfectly.

k means vs dbscan
source

Bonus content

For a fun visualisation of DBSCAN clustering, visit the following link.

Visualizing DBSCAN Clustering

Conclusion

DBSCAN may not be the most widely known clustering algorithm but it surely has its benefits. The illustration above shows how K Means can go create clusters that may actually not make sense whereas DBSCAN finds the right patterns. The final decision depends on your data type and which would benefit your more.