The world's most valuable resource is no longer oil, but data

As surprising as the above statement may sound, it is absolutely true. We have now moved into a world where data is the real power, but as we all know,

with great data come great analytical problems.

Well, maybe not ALL of us, but the data engineers out there know what I am talking about. We often have huge chunks of randomly spread data that make no sense at first glance. This is where clustering comes to our aid.

In the simplest of terms, clustering means grouping similar-looking data points together. Many clustering algorithms have proven very effective,

  • K-Means clustering
  • K-Medians clustering
  • DBSCAN clustering

… to name a few.

For this article, we will set aside the first two and focus on the last one: DBSCAN clustering.

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

This article will cover the following:

  • Understanding the DBSCAN algorithm
  • Advantages of the DBSCAN algorithm
  • Implementing the DBSCAN algorithm
  • DBSCAN vs K-Means

Let’s get to it.

NOTE: Fun Bonus at the end


Understanding the Algorithm

Say we have some random data points.

Visually, it is impossible to make out which points should be grouped together, as many of the points differ only slightly from one another.

To start off the algorithm, we choose a random point.

The red point will be our selected point

It is time to introduce 2 major parameters for DBSCAN clustering.

  1. Epsilon

This is the radius of the circle that we must draw around our point in focus.

  2. Minimum number of points

This is the number of points that we want in the neighbourhood of our point in focus (within the circle).
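
To make these two parameters concrete, here is a minimal sketch (with made-up toy points, not the article's data) that counts how many neighbours fall within a Euclidean distance of epsilon of a chosen point:

import numpy as np

#hypothetical toy data: five 2-D points
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.3, -0.2], [2.5, 2.5], [0.1, 0.3]])

eps = 0.5      #epsilon: radius of the circle drawn around the point in focus
min_pts = 3    #minimum number of points required inside that circle

#Euclidean distance from the first point to every point (including itself)
dists = np.linalg.norm(points - points[0], axis=1)

#neighbours are the *other* points that fall within the circle
n_neighbours = np.sum(dists <= eps) - 1
print(n_neighbours >= min_pts)  #True here: the first point has 3 neighbours

The from-scratch implementation later in this article uses a slightly simpler box-shaped neighbourhood check, but the idea is the same.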

…continuing with the algorithm

We draw a circle of radius ‘epsilon‘ around the chosen point and inspect the neighbourhood.

Now let’s suppose that we set the ‘minimum points‘ parameter to 3. Here is how it matters.

There are a few rules to follow here:

  • A point is a ‘core point’ if its neighbouring points are greater than or equal to the minimum points we chose at the start.
  • A point is a ‘border point’ if its neighbouring points are less than the minimum points we chose at the start.
  • A point is considered ‘noise‘ if it has no points in its neighbourhood.

draw a circle and inspect the neighbourhood

According to the rules of the algorithm, this point is a core point since it has more neighbours (5) than the number we decided (3).

All of these points are said to be part of a single cluster.

clustered together

**Since it was a core point, we now move on and apply the same algorithm to all of its neighbours.**

found another stray point

Checking the neighbourhoods of these neighbours, we see one more point that lies within a neighbourhood but is not yet part of a cluster. It now becomes part of the red cluster.

The point in whose neighbourhood this new point lies is also a core point.

Another core point

So we check this new point to continue the cluster.

A border point found

Seems like this is a border point: it becomes part of the red cluster, but we will not check its neighbourhood to expand the cluster further.

Seems like we have hit a dead end; there are no more points to consider for this cluster. This means we start the algorithm all over with a new point, creating a new cluster.

New point

This point seems like a dead end on its own; it doesn’t have any neighbours.

Good thing we defined rules for every situation. This point is what we label as ‘noise‘.

Noise found

Looks like we gotta start over.

Continuing the process

Following the algorithm, we apply these rules to all unvisited points until every point has either been assigned to a cluster or labelled as noise.

DBSCAN Advantages

  • Unsupervised learning

The DBSCAN algorithm requires no labels to create clusters; hence, it can be applied to all sorts of data.

  • Self cluster forming

Unlike its much more famous counterpart, K-Means, DBSCAN does not require the number of clusters to be defined beforehand. It forms clusters using the rules we defined above.

  • Noise detection

As demonstrated above, it is not influenced by noisy points and can detect and isolate any outliers in the data.
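
As a quick, hedged illustration of these advantages, here is a small sketch using scikit-learn's built-in DBSCAN (not the from-scratch version built below): no cluster count is supplied, and any point that cannot be placed in a cluster is labelled -1, i.e. noise.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

#small synthetic dataset, similar in spirit to the one generated later in the article
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=1)

#only eps and min_samples are needed; the number of clusters is never specified
db = DBSCAN(eps=0.6, min_samples=3).fit(X_demo)

print(set(db.labels_))  #cluster ids found by the algorithm, with -1 marking noise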


Implementation

Full code available here.

We start by importing all the useful libraries.

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random

Now we obviously need some data that we can cluster.

centers = [(0, 4), (5, 5), (8, 2)]
cluster_std = [1.2, 1, 1.1]

X, y = make_blobs(n_samples=200, cluster_std=cluster_std, centers=centers, n_features=2, random_state=1)

plt.scatter(X[y == 0, 0], X[y == 0, 1], s=10, label="Cluster1")
plt.scatter(X[y == 1, 0], X[y == 1, 1], s=10, label="Cluster2")
plt.scatter(X[y == 2, 0], X[y == 2, 1], s=10, label="Cluster3")
plt.title("Scattered data")
Synthetically generated data

**Note: The differently coloured data points represent the 3 different distributions that the data was drawn from.**

The data that our algorithm will work on will look like this:

All the same
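
For reference, this unlabelled view can be produced from the X array generated above with a short sketch like this:

#plot the raw points without any label information
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Unlabelled data")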

Let’s define our two main functions:

def check_core_point(eps, minPts, df, index):
    #get the x, y coordinates of the point at the given index
    x, y = df.iloc[index]['X'], df.iloc[index]['Y']
    
    #check available points within the neighbourhood
    #(note: this uses an axis-aligned box of half-width eps around the point,
    # a simple approximation of the epsilon circle described above)
    temp = df[((np.abs(x - df['X']) <= eps) & (np.abs(y - df['Y']) <= eps)) & (df.index != index)]
    
    #check how many points are present within the neighbourhood
    if len(temp) >= minPts:
        #return format (neighbour indexes, is_core, is_border, is_noise)
        return (temp.index, True, False, False)
    
    elif (len(temp) < minPts) and len(temp) > 0:
        #return format (neighbour indexes, is_core, is_border, is_noise)
        return (temp.index, False, True, False)
    
    elif len(temp) == 0:
        #return format (neighbour indexes, is_core, is_border, is_noise)
        return (temp.index, False, False, True)

The above function will look at each point in focus and determine if it is a core point, border point, or noise.
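
As a quick sanity check (the parameter values here are purely illustrative), the function can be called on a single index once the generated points are wrapped in a DataFrame with 'X' and 'Y' columns, the same layout used in the Results section below:

#wrap the generated points in the column layout the function expects
df_check = pd.DataFrame(X, columns=["X", "Y"])

#classify the point at index 0 with an illustrative eps and minPts
neigh_indexes, is_core, is_border, is_noise = check_core_point(0.6, 3, df_check, 0)
print(is_core, is_border, is_noise, len(neigh_indexes))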

def cluster_with_stack(eps, minPts, df):
    
    #initiating cluster number
    C = 1
    #initiating stacks to maintain
    current_stack = set()
    unvisited = list(df.index)
    clusters = []
    
    
    while (len(unvisited) != 0): #run until all points have been visited

        #identifier for first point of a cluster
        first_point = True
        
        #choose a random unvisited point
        current_stack.add(random.choice(unvisited))
        
        while len(current_stack) != 0: #run until a cluster is complete
            
            #pop current point from stack
            curr_idx = current_stack.pop()
            
            #check if the point is core, border, or noise
            neigh_indexes, iscore, isborder, isnoise = check_core_point(eps, minPts, df, curr_idx)
            
            #dealing with an edge case
            if (isborder & first_point):
                #for a first border point, we label it and its neighbours as noise
                clusters.append((curr_idx, 0))
                clusters.extend(list(zip(neigh_indexes,[0 for _ in range(len(neigh_indexes))])))
                
                #label as visited
                unvisited.remove(curr_idx)
                unvisited = [e for e in unvisited if e not in neigh_indexes]
    
                continue
                
            unvisited.remove(curr_idx) #remove point from unvisited list
            
            
            neigh_indexes = set(neigh_indexes) & set(unvisited) #look at only unvisited points
            
            if iscore: #if current point is a core
                first_point = False
                
                clusters.append((curr_idx,C)) #assign to a cluster
                current_stack.update(neigh_indexes) #add neighbours to a stack

            elif isborder: #if current point is a border point
                clusters.append((curr_idx,C))
                
                continue

            elif isnoise: #if current point is noise
                clusters.append((curr_idx, 0))
                
                continue
                
        if not first_point:
            #increment cluster number
            C+=1
        
    return clusters

This is our main function where our entire algorithm logic resides.

Before running this code, there is one important thing to consider when coding the algorithm: there is a particular edge case that can cause the algorithm to run forever or produce poor clusters.

Whenever we pick a point to start a new cluster, that first point might turn out to be a border point. Here we have 3 possibilities:

  1. Ignore the point and start the algorithm again with another random point.
  2. Create a separate cluster of this point and its neighbours.
  3. Label the point as noise and move on with the algorithm.

The first option seems plausible, but we may run into a dead end: the data distribution might be such that, towards the end, you are left with only such points, forming tiny would-be clusters. The algorithm would simply keep skipping these points and run forever.

The second option seems reasonable too, but such tiny clusters rarely make sense in the data and end up being useless.

I went with the third option to keep the clustering as clean as possible:

#dealing with an edge case
            if (isborder & first_point):
                #for a first border point, we label it and its neighbours as noise
                clusters.append((curr_idx, 0))
                clusters.extend(list(zip(neigh_indexes,[0 for _ in range(len(neigh_indexes))])))
                
                #label as visited
                unvisited.remove(curr_idx)
                unvisited = [e for e in unvisited if e not in neigh_indexes]
    
                continue

This snippet is from the main function above and deals with this edge case.

Results

Now we run the algorithm and check the results.

#radius of the circle defined as 0.6
eps = 0.6
#minimum neighbouring points set to 3
minPts = 3

data = pd.DataFrame(X, columns=["X", "Y"])
clustered = cluster_with_stack(eps, minPts, data)

idx, cluster = list(zip(*clustered))
cluster_df = pd.DataFrame(clustered, columns=["idx", "cluster"])

plt.figure(figsize=(10, 7))
for clust in np.unique(cluster):
    #plot the points that ended up with this cluster label (label 0 is noise)
    points_in_cluster = cluster_df["idx"][cluster_df["cluster"] == clust].values
    plt.scatter(X[points_in_cluster, 0], X[points_in_cluster, 1], s=10, label=f"Cluster{clust}")

plt.legend([f"Cluster {clust}" for clust in np.unique(cluster)], loc ="lower right")
plt.title('Clustered Data')
plt.xlabel('X')
plt.ylabel('Y')
Clustered using DBSCAN

Seems like we have reasonable clusters. If you feel like some data has been ‘mis-clustered’, you can play around with the epsilon and min-points values.

Below are some results from changing these values.

eps = 0.5 – minPts = 3
eps = 0.55 – minPts = 3
eps = 0.58 – minPts = 4
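
One simple way to explore this yourself is to sweep over a few candidate settings and re-run the clustering for each. This sketch just reuses cluster_with_stack and the data DataFrame defined above, and reports how many clusters and noise points each setting produces (exact numbers will vary slightly because the starting points are chosen at random):

#try a few parameter combinations and compare the resulting cluster counts
for eps_val, min_pts in [(0.5, 3), (0.55, 3), (0.58, 4), (0.6, 3)]:
    labels = [c for _, c in cluster_with_stack(eps_val, min_pts, data)]
    n_clusters = len(set(labels) - {0})   #cluster label 0 is used for noise in this implementation
    n_noise = labels.count(0)
    print(f"eps={eps_val}, minPts={min_pts}: {n_clusters} clusters, {n_noise} noise points")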

DBSCAN vs K-Means

K-Means, even though much more popular, lags behind DBSCAN in some cases. Firstly, it needs the number of clusters to be defined beforehand. We may not always have this information, and working it out can require quite a bit of extra analysis of the data.

K-Means also does not identify outliers and tries to force every point into a cluster.

We will apply K-Means clustering from the sklearn library to the same data.

from sklearn.cluster import KMeans

clustering = KMeans(n_clusters=3).fit(X)

plt.figure(figsize=(10,7))
for clust in np.unique(clustering.labels_):
    plt.scatter(X[clustering.labels_ == clust, 0], X[clustering.labels_ == clust, 1], s=10, label=f"Cluster{clust}")

plt.legend([f"Cluster {clust}" for clust in np.unique(clustering.labels_)], loc ="lower right")

K-Means with 3 clusters

The 3 clusters seem to be formed alright, but many outliers have been pulled into clusters when they should not belong to any.

There are certain cases where, comparing K-Means vs DBSCAN, DBSCAN clearly takes the cake. The illustration below shows this perfectly.
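
A classic example of such a case is non-globular data, e.g. two interleaving half-moons. The sketch below (using sklearn's make_moons, with illustrative parameter values) produces the kind of comparison the illustration refers to:

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

#two interleaving half-moon shapes: a shape-based structure K-Means cannot capture
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=1)

kmeans_labels = KMeans(n_clusters=2).fit(X_moons).labels_
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit(X_moons).labels_

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X_moons[:, 0], X_moons[:, 1], c=kmeans_labels, s=10)
ax1.set_title("K-Means (cuts across the moons)")
ax2.scatter(X_moons[:, 0], X_moons[:, 1], c=dbscan_labels, s=10)
ax2.set_title("DBSCAN (follows the moon shapes)")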

Bonus content

For a fun visualisation of DBSCAN clustering, visit the following link.

Conclusion

DBSCAN may not be the most widely known clustering algorithm, but it surely has its benefits. The illustration above shows how K-Means can create clusters that may not actually make sense, whereas DBSCAN finds the right patterns. The final decision depends on your data and which algorithm would benefit you more.
