# Naive Bayes Python Implementation and Understanding

An explanation of Bayes' theorem of conditional probability, followed by an implementation of a Naive Bayes classifier in Python.

Naive Bayes is a machine learning classifier based on Bayes' theorem of conditional probability. In this article, we will first look at conditional probability (Bayes' theorem) and then move on to how it translates to the Naive Bayes classifier. We will go through the mathematics behind this classifier and then finally code it in Python. If you are interested in more classification algorithms then you can start here.

## Bayes Theorem

Named after the statistician Thomas Bayes, this theorem is also known as the theorem of conditional probability. It allows us to calculate the probability of a particular event GIVEN a set of prior conditions. For example, the probability that it will rain tomorrow GIVEN that it rained yesterday.

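The formula for calculating conditional probability is shown below.

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The term on the left-hand side is read as 'the probability of event A occurring given that event B has occurred'. The term on the right-hand side is the probability of both events occurring together divided by the probability of event B occurring. Since $P(A \cap B) = P(B \mid A)\,P(A)$, the same quantity can also be written as $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, which is the form usually called Bayes' theorem and the one the classifier builds on. The formula is quite straightforward. I will not be delving into its derivation or intuition, as this article is not about Bayes Theorem but rather the Naive Bayes classifier.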
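To make the formula concrete, here is a small sketch that estimates a conditional probability from counts. The weather records are made up purely for illustration, so only the arithmetic matters.

```python
# Hypothetical records: (rained_yesterday, rains_today); invented for illustration only
records = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
    (True, True), (False, False), (True, False), (False, False),
]

n = len(records)

# P(B): probability that it rained yesterday
p_b = sum(1 for yesterday, _ in records if yesterday) / n

# P(A and B): probability that it rained on both days
p_a_and_b = sum(1 for yesterday, today in records if yesterday and today) / n

# P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b

print(f"P(rain today | rain yesterday) = {p_a_given_b:.2f}")
```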

## The Naive Bayes Classifier

Just like any other classifier, Naive Bayes sorts data into discrete labels, so we will have a set of input features as well as their corresponding output class. A Naive Bayes classifier calculates the probability of a class using the following formula.

$$P(y_1 \mid x_1, x_2, x_3) = \frac{P(x_1 \mid y_1)\, P(x_2 \mid y_1)\, P(x_3 \mid y_1)\, P(y_1)}{P(x_1)\, P(x_2)\, P(x_3)}$$

The left side asks: what is the probability that we have y_1 as our output given that our inputs were {x_1, x_2, x_3}? Now let's suppose that our problem has a total of 2 classes, i.e. {y_1, y_2}. We will use the above formula twice, first to calculate the probability of y_1 occurring and then of y_2 occurring. Whichever has the higher probability will be our predicted class.

An important point to note here is that this classifier makes one assumption: every input feature is independent of the others. This assumption is what allows the joint probabilities to be written as simple products, and it is why the term 'Naive' appears in the name.

This is how Naive Bayes is used for classification.
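As a worked illustration of the formula, the sketch below scores two classes for a single observation and picks the larger one. All probabilities are invented for the example; the names `prior`, `likelihood` and `evidence` are just labels chosen here.

```python
from math import prod

# Hypothetical probabilities for one observation (x_1, x_2, x_3); numbers are invented
prior = {"y_1": 0.6, "y_2": 0.4}        # P(y)
likelihood = {                          # P(x_i | y) for the observed x_i
    "y_1": [0.5, 0.2, 0.7],
    "y_2": [0.3, 0.6, 0.4],
}
evidence = [0.42, 0.36, 0.58]           # P(x_i) for the observed x_i

scores = {}
for label in prior:
    numerator = prod(likelihood[label]) * prior[label]  # product of conditionals times the prior
    denominator = prod(evidence)                        # product of the input probabilities
    scores[label] = numerator / denominator

# the class with the highest score is the prediction
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

The denominator is the same for every class, so it never changes which class wins; it only rescales the scores.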

## Naive Bayes Python Implementation

Let’s start coding it in Python.

The entire code can be found in the following GitHub repository.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
```

These are the libraries we need for working with the data. If you do not know your way around Pandas, you might want to read about it here first.

Now let’s load our dataset. I have used the Heart disease prediction data set which can be found on

```python
data = pd.read_csv('heart-disease-data/heart.csv')  # read the dataset
```

This implementation estimates probabilities by counting how often each value occurs, so it needs discrete variables; continuous variables cannot be handled this way. So we need to drop some columns here, such as chol (cholesterol) and trestbps.

```python
# drop the columns we will not use (mostly continuous-valued)
data.drop(["age", "trestbps", "chol", "thalach", "oldpeak", "slope"], axis=1, inplace=True)

X = data[data.keys()[:-1]]  # every column except the last is an input
y = data[data.keys()[-1]]   # the last column is the target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)  # test-train split

data_train = pd.concat([X_train, y_train], axis=1)  # concat back
data_test = pd.concat([X_test, y_test], axis=1)
```
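Dropping continuous columns is not the only option. As a sketch of an alternative (not what this article does), a continuous column can be turned into a discrete one by binning it with `pd.cut`. The readings, bin edges and labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical cholesterol readings; values and bin edges are illustrative only
toy = pd.DataFrame({"chol": [180, 210, 250, 199, 310, 240]})

# pd.cut assigns each value to an interval and returns a categorical column
toy["chol_bin"] = pd.cut(
    toy["chol"],
    bins=[0, 200, 240, 1000],  # the right edge of each bin is inclusive by default
    labels=["normal", "borderline", "high"],
)

print(toy)
```

A binned column like `chol_bin` could then be counted in the same way as the other discrete features.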

Now we need to code the helper functions that will calculate all the necessary probabilities.

According to the formula, we need the probability of occurrence of every input value and every output class, as well as the conditional probability of each input value given each class label.

First, we will calculate the probability for each input variable.

```python
# Calculating probabilities for inputs independently
def get_probabilities_for_inputs(n, column_name, data_frame):

    temp = data_frame[column_name]  # isolate the targeted column
    temp = temp.value_counts()      # count the occurrences of each value

    return temp / n  # probability of occurrence: count divided by total no. of data points
```
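As a quick sanity check of the idea, the same relative frequencies can be obtained on a toy column with `value_counts(normalize=True)`. The column name and values here are invented for illustration.

```python
import pandas as pd

# Hypothetical discrete feature; values are illustrative only
toy = pd.DataFrame({"cp": [0, 1, 1, 2, 1, 0, 2, 1]})
n = len(toy)

by_hand = toy["cp"].value_counts() / n              # counts divided by the number of rows
built_in = toy["cp"].value_counts(normalize=True)   # pandas does the division for us

print(by_hand.sort_index())
```

Both Series are indexed by the feature value, so `by_hand[1]` is the estimated probability of seeing the value 1.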

Next, we will calculate conditional probabilities for the input given an output class.

```python
# calculating conditional probabilities P(input value | output class)
def get_conditional_probabilities(data_frame, n, target, given):
    # n is unused here but kept so the call signature stays the same

    focused_data = data_frame[[target, given]]  # isolate the target column and the input column

    # count every (input value, class) combination
    groups = focused_data.groupby(by=[given, target]).size().reset_index(name="probability")

    # divide each count by the number of rows belonging to its class
    class_counts = focused_data[target].value_counts()
    groups["probability"] = groups["probability"] / groups[target].map(class_counts)

    return groups  # columns: input value, class, conditional probability
```
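To see what a table of conditional probabilities looks like, here is a minimal sketch on toy data using `pd.crosstab`. This is an alternative way to get the same kind of numbers, not the helper used in this article, and the data is invented.

```python
import pandas as pd

# Hypothetical feature/target pairs; values are illustrative only
toy = pd.DataFrame({
    "exang":  [0, 0, 1, 1, 0, 1, 0, 0],
    "target": [1, 1, 0, 0, 1, 0, 0, 1],
})

# Each column of the table is one class; normalize="columns" makes every column sum to 1,
# so the cell at row x, column y holds P(exang = x | target = y)
cond = pd.crosstab(toy["exang"], toy["target"], normalize="columns")

print(cond)
```

Here `cond.loc[1, 0]` reads as the probability that `exang` is 1 given that `target` is 0.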

Next, we will write down our ‘fit’ function that will calculate and return all the necessary probabilities which we will then use for making classifications.

```python
def calculate_probabilities(data):
    # splitting input data
    x = data[data.keys()[:-1]]
    y = data[data.keys()[-1]]
    target = y.name

    # get length of dataframe
    n = len(data)

    # get probabilities for each individual input and output
    input_probabilities = [get_probabilities_for_inputs(n, column, x) for column in x.keys()]

    output_probabilities = get_probabilities_for_inputs(n, target, y.to_frame())

    # get conditional probabilities for every input against every output
    conditional_probabilities = [
        get_conditional_probabilities(data, n, target, column) for column in x.keys()
    ]

    return input_probabilities, output_probabilities, conditional_probabilities
```

Now that we have all the necessary calculations done and out of the way, we need to make a function that will give us our output class label by making calculations according to the Naive Bayes formula that we wrote above.

```python
def naive_bayes_calculator(target_values, input_values, in_prob, out_prob, cond_prob):

    target_values = sorted(target_values)  # sort the target values to ensure ascending order
    classes = []  # initialise empty probabilities list

    for target_value in target_values:
        num = 1  # initialise numerator
        den = 1  # initialise denominator
        # calculate denominator according to the formula
        for i, x in enumerate(input_values):
            den *= in_prob[i][x]
        # calculate numerator according to the formula
        for i, x_1 in enumerate(input_values):
            temp_df = cond_prob[i]
            match = temp_df[(temp_df.iloc[:, 0] == x_1) & (temp_df.iloc[:, 1] == target_value)]
            # a combination never seen in the training data gets probability 0
            num *= match.iloc[0, 2] if len(match) else 0
        num *= out_prob[target_value]
        final_probability = num / den  # final conditional probability value

        classes.append(final_probability)  # append probability for current class to the list

    # return the class with the highest probability, plus all class probabilities
    return (target_values[classes.index(max(classes))], classes)
```

Now we have all our functions out of the way, we can move on to running them and storing our calculations.

```python
in_prob, out_prob, cond_prob = calculate_probabilities(data_train)  # use training data for the initial calculations
```

```python
# testing with dummy data
naive_bayes_calculator([1, 0], [1, 1, 0, 2, 1, 3, 3], in_prob, out_prob, cond_prob)
```

We have our class prediction and the probabilities for each class inside a tuple.
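One thing to keep in mind is that the values in the returned list are scores rather than true probabilities: because the denominator multiplies the individual input probabilities together, the scores for the classes need not add up to 1. If values that sum to 1 are wanted, a simple option is to divide each score by their total. The numbers below are hypothetical stand-ins for a returned list.

```python
# Hypothetical per-class scores, standing in for the list returned by the calculator
scores = [0.33, 0.48]

total = sum(scores)
normalised = [s / total for s in scores]  # rescale so the values sum to 1

print(normalised)
```

Rescaling does not change which class has the highest value, so the prediction stays the same.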

### Testing the Naive Bayes classifier

Now it’s time to test on our ‘test data’.

The following function takes a set of inputs and returns the predicted class against each in a list.

```python
def naive_bayes_predictor(test_data, outputs, in_prob, out_prob, cond_prob):

    final_predictions = []  # initialise empty list to store test predictions

    for row in test_data:
        # get prediction for current data
        predicted_class, probabilities = naive_bayes_calculator(outputs, row, in_prob, out_prob, cond_prob)
        # append to list
        final_predictions.append(predicted_class)

    return final_predictions
```

Now calculate accuracy.

```python
test_data_as_list = X_test.values.tolist()
unique_targets = y_test.unique().tolist()
predicted_y = naive_bayes_predictor(test_data_as_list, unique_targets, in_prob, out_prob, cond_prob)
print("Accuracy:", (np.count_nonzero(y_test == predicted_y) / len(y_test)) * 100)
```

The accuracy reported for this run, 77.4%, is not bad considering that we dropped some important columns and that the algorithm's naive assumption ignores correlations between the input variables.
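A hand-written classifier is easier to trust when it can be compared against a library version. As a hedged sketch of such a cross-check, scikit-learn ships `CategoricalNB` for integer-encoded categorical features. The toy arrays below are invented, and the accuracy is measured on the same rows the model was fitted on, so it only demonstrates the API rather than real performance.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical integer-encoded categorical features and binary labels
X_toy = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 0]])
y_toy = np.array([1, 1, 0, 0, 1, 0])

model = CategoricalNB()  # applies Laplace smoothing (alpha=1.0) by default
model.fit(X_toy, y_toy)

predictions = model.predict(X_toy)
accuracy = np.mean(predictions == y_toy) * 100  # percentage of matching labels
print("Accuracy:", accuracy)
```

Note that `CategoricalNB` smooths its counts, so its probabilities will generally differ slightly from the unsmoothed counts used in this article.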

## Conclusion

Naive Bayes is a very simple classifier, provided that you understand basic probability and the concept of inputs and outputs in machine learning. The algorithm does have certain shortcomings, such as ignoring the dependency of input variables on each other. It is very simple to build and gives good results if your data meets its requirements.
