Naive Bayes Python Implementation and Understanding

An explanation of the Bayes Theorem of conditional probability, followed by an implementation of the Naive Bayes classifier in Python.
Naive Bayes Explained in Python

Naive Bayes is a Machine Learning classifier that is based on the Bayes Theorem of conditional probability. In this article, we will first understand conditional probability (the Bayes Theorem) and then see how it translates into the Naive Bayes classifier. We will look at the mathematics behind this classifier and then finally code it in Python. If you are interested in more classification algorithms, you can start here.

Bayes Theorem

[Image: Thomas Bayes, after whom the theorem is named (source: Wikipedia)]

Named after the statistician Thomas Bayes, this theorem is also known as the theorem of conditional probability. It allows us to calculate the probability of a particular event GIVEN a set of prior conditions. For example, the probability that it will rain tomorrow GIVEN that it rained yesterday.

The formula for calculating conditional probability is shown below.

P(A | B) = P(A ∩ B) / P(B)

The term on the left-hand side is read as ‘the probability of event A occurring given that event B has occurred’. The term on the right-hand side is the probability of both events occurring together divided by the probability of event B occurring. The formula is quite straightforward. I will not be delving into its derivation or intuition, as this article is not about the Bayes Theorem but rather the Naive Bayes classifier.
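To make the formula concrete, here is a small sketch that estimates the probability of rain today given rain yesterday from frequency counts; the weather observations are made up purely for illustration.

# Made-up weather data: each pair is (rained_yesterday, rains_today) for one observed day.
observations = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 1), (0, 0), (1, 0)]

n = len(observations)
p_b = sum(1 for yesterday, _ in observations if yesterday == 1) / n  # P(B): it rained yesterday
p_a_and_b = sum(1 for yesterday, today in observations
                if yesterday == 1 and today == 1) / n                # P(A and B): rain on both days

# P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 0.6 -> rain followed rain on 3 of the 5 rainy-yesterday days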

The Naive Bayes Classifier

Just like any other classifier, Naive Bayes classifies data into discrete labels, so we have a set of input features as well as their corresponding output class. A Naive Bayes classifier calculates the probability of each class using the following formula.

P(y | x_1, x_2, …, x_n) = [ P(y) × P(x_1 | y) × P(x_2 | y) × … × P(x_n | y) ] / [ P(x_1) × P(x_2) × … × P(x_n) ]

The left-hand side is the probability that we have y_1 as our output given that our inputs were {x_1, x_2, x_3}. Now let’s suppose that our problem had a total of 2 classes, i.e. {y_1, y_2}. We will use the above formula twice: first to calculate the probability of y_1 occurring and then the probability of y_2 occurring. Whichever has the higher probability will be our predicted class.

An important point to note here is that this classifier makes one assumption: it assumes that every input feature is independent of the others. This is specifically why the term ‘Naive’ appears in the name.
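To see how this plays out, here is a minimal sketch of the resulting decision rule, using hypothetical, hand-picked probability tables for two binary features and two classes. Since the denominator is the same for every class, comparing the numerators is enough to pick a winner.

# Hypothetical, hand-picked probability tables for illustration only.
class_prob = {"y_1": 0.6, "y_2": 0.4}     # P(y)
cond_prob = {
    "y_1": {"x_1": 0.8, "x_2": 0.3},      # P(x_i = 1 | y_1)
    "y_2": {"x_1": 0.4, "x_2": 0.7},      # P(x_i = 1 | y_2)
}

def naive_bayes_score(label, observed_features):
    # independence assumption: P(x_1, x_2 | y) = P(x_1 | y) * P(x_2 | y)
    score = class_prob[label]
    for feature in observed_features:
        score *= cond_prob[label][feature]
    return score

observed = ["x_1", "x_2"]  # both features observed with value 1
scores = {label: naive_bayes_score(label, observed) for label in class_prob}
print(max(scores, key=scores.get), scores)  # y_1 wins: 0.144 vs 0.112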

This is how Naive Bayes is used for classification.

Naive Bayes Python Implementation

Let’s start coding it in Python.

The entire code can be found in the following GitHub repository.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

These are the necessary libraries for playing with the data. If you do not know how to work with Pandas, then you might want to read about it here first.

Now let’s load our dataset. I have used the Heart Disease prediction dataset, which can be found on Kaggle.

data = pd.read_csv('heart-disease-data/heart.csv')  # read the dataset
data.head()
[Output: the head of the dataset]

For a Naive Bayes classifier, we need discrete variables, since our implementation estimates probabilities by counting occurrences of each value, which does not work for continuous variables. So we need to drop the continuous columns here, such as chol (cholesterol) and trestbps (resting blood pressure).

data.drop(["age", "trestbps", "chol", "thalach", "oldpeak", "slope"],axis = 1 ,inplace=True) #drop irrelevant columns data.head()
[Output: the dataset with the continuous columns dropped]
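As an aside, rather than dropping these columns, we could have discretized them into a few bins so that the frequency counts still work. A minimal sketch of that alternative with pandas’ pd.cut (not used in the rest of this article):

# An alternative to the drop above (not used below): discretize the
# continuous columns into a few bins so they become countable categories.
# pd.cut with labels=False returns integer bin codes.
for column in ["age", "trestbps", "chol", "thalach", "oldpeak"]:
    data[column] = pd.cut(data[column], bins=4, labels=False)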
X = data[data.keys()[:-1]]
y = data[data.keys()[-1]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)  # train-test split

data_train = pd.concat([X_train, y_train], axis=1)  # concatenate the training data back together
data_test = pd.concat([X_test, y_test], axis=1)

Now we need to code the helper functions that will calculate all the necessary probabilities.

According to the formula, we will need to calculate the probability of occurrence of every input value and of every output class, as well as the conditional probability of each input value given each class label.

First, we will calculate the probability for each input variable.

# Calculating probabilities for each input variable independently
def get_probabilities_for_inputs(n, column_name, data_frame):
    temp = data_frame[column_name]  # isolate the targeted column
    temp = temp.value_counts()      # count the occurrences of each value
    return temp / n                 # probability of occurrence: count divided by the total number of data points

Next, we will calculate conditional probabilities for the input given an output class.

# Calculating the conditional probability of an input given an output class
def get_conditional_probabilities(data_frame, n, target, given):
    focused_data = data_frame[[target, given]]    # isolate the target column and the focused input column
    targets_unique = data_frame[target].unique()  # list of unique outputs in the data
    groups = focused_data.groupby(by=[given, target]).size().reset_index()  # joint counts of (input, output)
    groups[0] = groups[0] / n  # joint probability P(input, output)
    for target_value in targets_unique:
        target_probability = len(focused_data[focused_data[target] == target_value]) / n  # P(output)
        # P(input | output) = P(input, output) / P(output)
        groups[0] = np.where(groups[target] == target_value,
                             groups[0].div(target_probability), groups[0])
    return groups

Next, we will write down our ‘fit’ function that will calculate and return all the necessary probabilities which we will then use for making classifications.

def calculate_probabilities(data):
    # split the input data into features and target
    x = data[data.keys()[:-1]]
    y = data[data.keys()[-1]]
    target = y.name
    n = len(data)  # length of the dataframe

    # probabilities for each individual input and for the output
    f_in = lambda column: get_probabilities_for_inputs(n, column, x)
    input_probabilities = list(map(f_in, x.keys()))
    output_probabilities = get_probabilities_for_inputs(n, target, y.to_frame())

    # conditional probabilities for every input against every output
    f_cond = lambda column: get_conditional_probabilities(data, n, target, column)
    conditional_probabilities = list(map(f_cond, data.keys()[:-1]))

    return input_probabilities, output_probabilities, conditional_probabilities

Now that we have all the necessary calculations done and out of the way, we need a function that gives us our output class label by applying the Naive Bayes formula that we wrote above.

def naive_bayes_calculator(target_values, input_values, in_prob, out_prob, cond_prob):
    target_values.sort()  # sort the target values to ensure ascending order
    classes = []          # list of probabilities, one per class

    # the denominator is the product of P(x_i) and is the same for every class
    den = 1
    for i, x in enumerate(input_values):
        den *= in_prob[i][x]

    for target_value in target_values:
        num = out_prob[target_value]  # start the numerator with P(y)
        for i, x in enumerate(input_values):
            # multiply in P(x_i | y) for the current input value
            temp_df = cond_prob[i]
            num *= temp_df[(temp_df.iloc[:, 0] == x) & (temp_df.iloc[:, 1] == target_value)][0].values[0]
        classes.append(num / den)  # final conditional probability for this class

    return classes.index(max(classes)), classes

Now that we have all our functions out of the way, we can move on to running them and storing our calculations.

in_prob, out_prob, cond_prob = calculate_probabilities(data_train)  # use the training data for the initial calculations

# testing with dummy data
naive_bayes_calculator([1, 0], [1, 1, 0, 2, 1, 3, 3], in_prob, out_prob, cond_prob)
[Output: the predicted class index and the per-class probabilities]

We have our class prediction and the probabilities for each class inside a tuple.

Testing the Naive Bayes classifier

Now it’s time to test on our ‘test data’.

The following function takes a set of inputs and returns the predicted class for each of them in a list.

def naive_bayes_predictor(test_data, outputs, in_prob, out_prob, cond_prob):
    final_predictions = []  # empty list to store the test predictions
    for row in test_data:
        # get the prediction for the current data point
        predicted_class, probabilities = naive_bayes_calculator(outputs, row, in_prob, out_prob, cond_prob)
        final_predictions.append(predicted_class)  # append to the list
    return final_predictions

Now calculate accuracy.

test_data_as_list = X_test.values.tolist()
unique_targets = y_test.unique().tolist()
predicted_y = naive_bayes_predictor(test_data_as_list, unique_targets, in_prob, out_prob, cond_prob)
print("Accuracy:", (np.count_nonzero(y_test == predicted_y) / len(y_test)) * 100)
[Output: an accuracy of roughly 77.4%]

An accuracy of 77.4% is certainly not a bad number, considering that we dropped several informative columns and that the naivety of the algorithm ignores correlations between the input variables.
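As a sanity check, one could compare this hand-rolled implementation against scikit-learn's CategoricalNB, which takes the same frequency-counting approach. A minimal sketch; note that CategoricalNB applies Laplace smoothing by default, so its numbers may differ slightly from ours:

# Cross-check against scikit-learn's categorical Naive Bayes.
# CategoricalNB smooths counts with alpha=1.0 by default, so the
# accuracy may differ slightly from our smoothing-free implementation.
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score

model = CategoricalNB()
model.fit(X_train, y_train)
print("sklearn accuracy:", accuracy_score(y_test, model.predict(X_test)) * 100)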

Conclusion

Naive Bayes is a very simple classifier, provided that you understand basic probability and the concept of inputs and outputs in Machine Learning. The algorithm does have certain shortcomings, such as ignoring the dependency of input variables on each other. Still, it is very simple to build and gives good results if your data meets its requirements.
