DBSCAN Clustering Algorithm Implementation from scratch | Python

The world's most valuable resource is no longer oil, but data

As surprising as the above statement may sound, it is absolutely true. We have now moved towards a world where data is the real power, but as we all know,

with great data, comes great analytical problems.

Well, maybe not ALL of us, but the data engineers out there know what I am talking about. Many times we have huge chunks of randomly spread data that make no sense at first glance. This is where clustering comes to our aid.

In the simplest of terms, clustering is creating groups of similar-looking points within data. There are many clustering algorithms that have proven very effective,

  • K-means Clustering
  • K-medians Clustering
  • DBSCAN Clustering

… to name a few.

For this article, we will set aside the first two and focus on the last one: DBSCAN Clustering.

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

This article will cover the following:

  • Understanding the DBSCAN algorithm
  • DBSCAN algorithm Advantages
  • Implementing the DBSCAN algorithm
  • DBSCAN vs K-means

Let’s get to it.

NOTE: Fun Bonus at the end


Understanding the Algorithm

Say we have some random data points.

Visually, it is impossible to make out which points should be grouped together, as many of the points differ only very slightly from one another.

To start off the algorithm, we choose a random point.

The red point will be our selected point

It is time to introduce 2 major parameters for DBSCAN clustering.

  1. Epsilon

This is the radius of the circle that we must draw around the point in focus.

  2. Minimum number of points

This is the number of points that we want in the neighbourhood of our point in focus (within the circle).
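
In code terms, these two parameters boil down to a single neighbourhood query. Here is a minimal hypothetical sketch (the `pts` array and the helper name are illustrative, not part of the article's implementation):

```python
import numpy as np

def neighbours_within_eps(pts, i, eps):
    # Euclidean distances from point i to every point
    dists = np.linalg.norm(pts - pts[i], axis=1)
    # indexes of all other points inside the epsilon-radius circle
    return [j for j, d in enumerate(dists) if d <= eps and j != i]

pts = np.array([[0.0, 0.0], [0.3, 0.1], [0.2, -0.2], [5.0, 5.0]])
print(neighbours_within_eps(pts, 0, 0.6))             # -> [1, 2]
print(len(neighbours_within_eps(pts, 0, 0.6)) >= 3)   # dense enough? here: False
```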

…continuing with the algorithm

We draw a circle of radius ‘epsilon‘ around the chosen point and inspect the neighbourhood.

Now let’s suppose that we set the ‘minimum points‘ parameter to 3. This is how it would matter.

There are a few rules to follow here:

  • A point is a ‘core point’ if the number of its neighbouring points is greater than or equal to the minimum points we chose at the start.
  • A point is a ‘border point’ if the number of its neighbouring points is greater than zero but less than the minimum points we chose at the start.
  • A point is considered ‘noise‘ if it has no points at all in its neighbourhood.
draw a circle and inspect the neighbourhood

According to the rules of the algorithm, this point is a core point since it has more neighbours (5) than the number we decided (3).
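
In code, that rule is just a comparison of counts; here is a minimal hypothetical helper mirroring the three rules above (the full version appears in the implementation section):

```python
def classify_point(num_neighbours, minPts):
    # hypothetical helper illustrating the three rules
    if num_neighbours >= minPts:
        return "core"      # enough neighbours to keep growing the cluster
    elif num_neighbours > 0:
        return "border"    # joins a cluster but does not expand it
    return "noise"         # completely isolated

print(classify_point(5, 3))  # -> "core", matching the example above
```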

All of these points are said to be part of a single cluster.

clustered together

**Since it was a core point, we now move on to apply the same algorithm on all of its neighbours.**

found another stray point

Checking the neighbourhood of all the neighbours, we see that there is one more point that is within the neighbourhood but not part of a cluster. It now becomes part of the red cluster.

The point in whose neighbourhood this new point lies is also a core point.

Another core point

So we check this new point to continue the cluster.

A border point found

Seems like this is a border point; even though it becomes part of the red cluster, we will not check its neighbourhood to continue the cluster.

Seems like we have hit a dead end; there are no more points to consider for this cluster. Well, this means we start the algorithm all over with a new point, creating a new cluster.

New point

This point seems like a dead end on its own; it doesn’t have any neighbours.

Good thing we defined rules for every situation. This point is what we label as ‘Noise’.

Noise found

Looks like we gotta start over.

Continuing the process

Following the algorithm, we apply the set rules to all unvisited points until every point is either in a cluster or labelled as noise.
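
Put together, the walkthrough above condenses into a short conceptual sketch (hypothetical simplified code, reusing the `neighbours_within_eps` helper sketched earlier; the article's full implementation follows in the implementation section):

```python
def dbscan_sketch(pts, eps, minPts):
    # conceptual outline only, not the article's implementation
    labels, cluster_id = {}, 0
    for i in range(len(pts)):
        if i in labels:
            continue  # already visited
        if len(neighbours_within_eps(pts, i, eps)) < minPts:
            labels[i] = 0  # tentatively noise; may become a border point later
            continue
        cluster_id += 1
        stack = [i]
        while stack:  # expand the cluster from every core point
            p = stack.pop()
            if labels.get(p, 0) != 0:
                continue  # already assigned to a cluster
            labels[p] = cluster_id
            neighbours = neighbours_within_eps(pts, p, eps)
            if len(neighbours) >= minPts:  # only core points keep expanding
                stack.extend(neighbours)
    return labels
```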

DBSCAN Advantages

  • Unsupervised learning

The DBSCAN algorithm requires no labels to create clusters, hence it can be applied to all sorts of data.

  • Self-forming clusters

Unlike its much more famous counterpart, K-means, DBSCAN does not require the number of clusters to be defined beforehand. It forms clusters using the rules we defined above.

  • Noise detection

As demonstrated above, it is not influenced by noisy points and can detect and isolate any outliers in the data.
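
These properties are also reflected in scikit-learn's built-in implementation: sklearn.cluster.DBSCAN takes just the two density parameters, infers the number of clusters itself, and labels noise as -1. A quick reference sketch, independent of our from-scratch code:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

# only eps and min_samples; no cluster count is given
labels = DBSCAN(eps=0.6, min_samples=3).fit_predict(X)
print(set(labels))  # noise points, if any, come back labelled -1
```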


Implementation

Full code available here.

We start by importing all the useful libraries.

```python
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random
```

Now we obviously need some data that we can cluster.

```python
centers = [(0, 4), (5, 5), (8, 2)]
cluster_std = [1.2, 1, 1.1]

X, y = make_blobs(n_samples=200, cluster_std=cluster_std,
                  centers=centers, n_features=2, random_state=1)

plt.scatter(X[y == 0, 0], X[y == 0, 1], s=10, label="Cluster1")
plt.scatter(X[y == 1, 0], X[y == 1, 1], s=10, label="Cluster2")
plt.scatter(X[y == 2, 0], X[y == 2, 1], s=10, label="Cluster3")
plt.title("Scattered data")
```
Synthetically generated data

**Note: The differently coloured data points represent the 3 different distributions that the data is drawn from.**

The data that our algorithm will work on will look like this:

All the same
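
That view is just the same data re-plotted without the generating labels; assuming the X array and imports from above, one call reproduces it:

```python
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Scattered data, unlabelled")
plt.show()
```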

Let’s define our two main functions:

```python
def check_core_point(eps, minPts, df, index):
    # get the coordinates of the point at the given index
    x, y = df.iloc[index]['X'], df.iloc[index]['Y']

    # find the available points within the radius
    temp = df[((np.abs(x - df['X']) <= eps) & (np.abs(y - df['Y']) <= eps)) & (df.index != index)]

    # check how many points are present within the radius
    if len(temp) >= minPts:
        # return format: (neighbour indexes, is_core, is_border, is_noise)
        return (temp.index, True, False, False)

    elif (len(temp) < minPts) and len(temp) > 0:
        # return format: (neighbour indexes, is_core, is_border, is_noise)
        return (temp.index, False, True, False)

    elif len(temp) == 0:
        # return format: (neighbour indexes, is_core, is_border, is_noise)
        return (temp.index, False, False, True)
```

The above function will look at a given point and determine whether it is a core point, a border point, or noise. (Note that, for simplicity, it checks a square neighbourhood of side 2·eps around the point rather than a true circle.)
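
As a quick sanity check, the function can be called on a single index (assuming the `data` DataFrame with 'X' and 'Y' columns that we build in the results section below):

```python
neigh_idx, is_core, is_border, is_noise = check_core_point(0.6, 3, data, 0)
print(is_core, is_border, is_noise, len(neigh_idx))
```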

```python
def cluster_with_stack(eps, minPts, df):
    # initiating the cluster number
    C = 1
    # initiating the stack and the list of unvisited points
    current_stack = set()
    unvisited = list(df.index)
    clusters = []

    while (len(unvisited) != 0):  # run until all points have been visited

        # identifier for the first point of a cluster
        first_point = True

        # choose a random unvisited point
        current_stack.add(random.choice(unvisited))

        while len(current_stack) != 0:  # run until a cluster is complete

            # pop the current point from the stack
            curr_idx = current_stack.pop()

            # check if the point is core, border or noise
            neigh_indexes, iscore, isborder, isnoise = check_core_point(eps, minPts, df, curr_idx)

            # dealing with an edge case
            if (isborder and first_point):
                # for a first border point, we label it and its neighbours as noise
                clusters.append((curr_idx, 0))
                clusters.extend(list(zip(neigh_indexes, [0 for _ in range(len(neigh_indexes))])))

                # label as visited
                unvisited.remove(curr_idx)
                unvisited = [e for e in unvisited if e not in neigh_indexes]

                continue

            unvisited.remove(curr_idx)  # remove the point from the unvisited list

            neigh_indexes = set(neigh_indexes) & set(unvisited)  # look at only unvisited points

            if iscore:  # if the current point is a core point
                first_point = False
                clusters.append((curr_idx, C))  # assign it to the current cluster
                current_stack.update(neigh_indexes)  # add its neighbours to the stack

            elif isborder:  # if the current point is a border point
                clusters.append((curr_idx, C))
                continue

            elif isnoise:  # if the current point is noise
                clusters.append((curr_idx, 0))
                continue

        if not first_point:
            # increment the cluster number once a cluster is complete
            C += 1

    return clusters
```

This is our main function where our entire algorithm logic resides.

Before running this code, there is one important thing to consider when coding the algorithm: one particular edge case can cause the algorithm to run forever or cause problems in clustering.

When picking a random starting point, it is possible that it turns out to be a border point. Here we have 3 possibilities:

  1. Ignore the point and start the algorithm again with another random point.
  2. Create a separate cluster out of this point and its neighbours.
  3. Label the point and its neighbours as noise and move on with the algorithm.

The first option seems plausible, but we may run into a dead end: the data distribution might be such that at the end we are left with several such points that form only small clusters. The algorithm would simply keep skipping these points and run forever.

The second option seems reasonable too, but such small clusters do not make much sense in the data and are of little use.

I went with the third option to keep the clustering as clean as possible:

```python
# dealing with an edge case
if (isborder and first_point):
    # for a first border point, we label it and its neighbours as noise
    clusters.append((curr_idx, 0))
    clusters.extend(list(zip(neigh_indexes, [0 for _ in range(len(neigh_indexes))])))

    # label as visited
    unvisited.remove(curr_idx)
    unvisited = [e for e in unvisited if e not in neigh_indexes]

    continue
```

This snippet is from the main function above and deals with this edge case.

Results

Now we run the algorithm and check the results.

```python
# radius of the circle defined as 0.6
eps = 0.6

# minimum neighbouring points set to 3
minPts = 3

data = pd.DataFrame(X, columns=["X", "Y"])
clustered = cluster_with_stack(eps, minPts, data)

idx, cluster = list(zip(*clustered))
cluster_df = pd.DataFrame(clustered, columns=["idx", "cluster"])

plt.figure(figsize=(10, 7))
for clust in np.unique(cluster):
    plt.scatter(X[cluster_df["idx"][cluster_df["cluster"] == clust].values, 0],
                X[cluster_df["idx"][cluster_df["cluster"] == clust].values, 1],
                s=10, label=f"Cluster{clust}")
plt.legend([f"Cluster {clust}" for clust in np.unique(cluster)], loc="lower right")
plt.title('Clustered Data')
plt.xlabel('X')
plt.ylabel('Y')
```
Clustered using DBSCAN

Seems like we have reasonable clusters. If you feel like some data has been ‘mis-clustered’, you can play around with the epsilon and minimum-points values.

Below are some results from changing these values.

eps = 0.5 – minPts = 3
eps = 0.55 – minPts = 3
eps = 0.58 – minPts = 4
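
One way to generate such comparisons is to loop over a few parameter pairs and re-run the clustering, as in this small sketch using the functions defined above (in our implementation, noise carries the label 0):

```python
for eps, minPts in [(0.5, 3), (0.55, 3), (0.58, 4)]:
    clustered = cluster_with_stack(eps, minPts, data)
    n_clusters = len({c for _, c in clustered if c != 0})
    n_noise = sum(1 for _, c in clustered if c == 0)
    print(f"eps={eps}, minPts={minPts}: {n_clusters} clusters, {n_noise} noise points")
```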

DBSCAN vs K Means

K-means, even though much more popular, lags behind DBSCAN in some cases. Firstly, it needs a predefined number of clusters. We may not always have this information, and estimating it from the data can be quite difficult.

K-means also does not identify outliers and tries to force every point into a cluster.

We will apply K-means clustering from the sklearn library to the same data.

```python
from sklearn.cluster import KMeans

clustering = KMeans(3).fit(X)

plt.figure(figsize=(10, 7))
for clust in np.unique(clustering.labels_):
    plt.scatter(X[clustering.labels_ == clust, 0],
                X[clustering.labels_ == clust, 1],
                s=10, label=f"Cluster{clust}")
plt.legend([f"Cluster {clust}" for clust in np.unique(clustering.labels_)], loc="lower right")
```

K-Means with 3 clusters

The 3 clusters seem to have formed alright, but we have many outliers that should not be in any cluster.

There are certain cases where, comparing K-means and DBSCAN, DBSCAN takes the cake; the classic example is non-convex data such as two concentric rings, where K-means cuts straight across the rings while DBSCAN recovers each ring as its own cluster.
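
If you would like to reproduce that kind of illustration yourself, scikit-learn's make_circles generates the classic case. A sketch (the eps value here is illustrative and may need tuning):

```python
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN, KMeans

Xc, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=1)

km_labels = KMeans(2).fit_predict(Xc)                        # cuts straight across both rings
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(Xc)   # recovers each ring

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(Xc[:, 0], Xc[:, 1], c=km_labels, s=10)
axes[0].set_title("K-means")
axes[1].scatter(Xc[:, 0], Xc[:, 1], c=db_labels, s=10)
axes[1].set_title("DBSCAN")
```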

Bonus content

For a fun visualisation of DBSCAN clustering, visit the following link.

Conclusion

DBSCAN may not be the most widely known clustering algorithm, but it surely has its benefits. The comparison above shows how K-means can create clusters that may not actually make sense, whereas DBSCAN finds the right patterns. The final decision depends on your data and which algorithm would benefit you more.
