*THIS IS A TWO-PART SERIES. PART 1 COVERS THE THEORY BEHIND THE LOGISTIC REGRESSION ALGORITHM. PART 2 (THIS ONE) COVERS THE CODE, WHERE WE IMPLEMENT LOGISTIC REGRESSION IN PYTHON.*

If you are confused about the theoretical working of the logistic regression algorithm, you can read part 1 here.

The entire code used in this tutorial can be found here.

Since logistic regression and linear regression are closely related, I would advise you to read up on linear regression as well to get a better understanding of both algorithms. You can read about it here.

Since we have already covered the theory in part 1, here we will jump straight into the code.


Python Implementation

We will start by importing the libraries we need. We will not use the logistic regression implementation from any library, as that would defeat the purpose of this article.

#Getting all the important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

Let’s first explore the dataset which we will use.

Since we predict discrete classes using logistic regression, we will use the dataset on the possibility of cardiovascular disease. The dataset is open source and can be found here.

#loading the dataset
heart = pd.read_csv("heart.csv")

#shuffling the dataset 
heart = heart.sample(frac=1).reset_index(drop=True)

#Separating the input and output variables
X = heart[["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal"]]
Y = heart["target"]

#Splitting the dataset into test and training samples
sample_x = X[:250]
sample_y = Y[:250].values
sample_y = sample_y.reshape(sample_y.shape[0],-1)
test_x = X[250:]
test_y = Y[250:].values
test_y = test_y.reshape(test_y.shape[0],-1)
Overview of our Input features | heart.csv

From the above overview, we see that we have a total of 13 input features (the columns selected into X); the training rows are stored in the variable sample_x.

Our outputs are a series of 1’s and 0’s where 1’s support the possibility of cardiovascular disease and 0’s reject it.

We have roughly 300 samples in total. We use the first 250 samples as training data and keep the remaining samples (roughly 50) for testing.
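Because the shuffle above is random, every run produces a different split. If you want repeatable results, one option is to pass a seed to `sample`. The following is a minimal sketch on a synthetic stand-in for heart.csv; the 300-row frame, the helper name `shuffle_and_split`, and the seed value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for heart.csv: 300 rows with a row id and a 0/1 target.
demo = pd.DataFrame({
    "row_id": np.arange(300),
    "target": np.arange(300) % 2,
})

def shuffle_and_split(df, n_train=250, seed=42):
    """Shuffle df reproducibly, then split it into train and test parts."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    return shuffled[:n_train], shuffled[n_train:]

train_df, test_df = shuffle_and_split(demo)
print(len(train_df), len(test_df))  # 250 50
```

Calling the helper again with the same seed returns the rows in the same order, so accuracy figures can be compared between runs.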

Algorithm

  1. Calculate outputs using randomly initialized weights.
  2. Calculate loss between original and predicted outputs.
  3. Update weights using gradient descent.
  4. Repeat until the maximum number of iterations is reached or the loss has dropped sufficiently.

From the above steps we can isolate the functions we will need. Let’s get to coding.

  • Sigmoid
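def get_sigmoid(inp):
    
    #returning the sigmoid of our values, clipped slightly away from 0 and 1
    #so that the logarithms in the loss function stay finite
    eps = 0.00000001
    return np.clip(1/(1+np.exp(-inp)), eps, 1-eps)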
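The small constant in the function above keeps the output away from the boundary so that the logarithms in the loss stay finite. A separate numerical concern is that `np.exp(-inp)` can trigger an overflow warning for very negative inputs. The sketch below shows one common way around this; `stable_sigmoid` is a hypothetical name and is not used elsewhere in this article.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow in np.exp for large |z|."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)); exp(z) < 1.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```

With min-max scaled inputs and small weights the dot products rarely get this extreme, so this mainly matters for unscaled data.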
  • Loss
def get_loss(y_hat, y, n):
    
    #simply returning the loss calculated using our loss function
    return -(1/n)*(np.sum(y*(np.log(y_hat)) + (1-y)*np.log(1-y_hat)))
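To get a feel for this loss function, it helps to evaluate it on a few hand-picked predictions. The sketch below uses made-up probabilities and a hypothetical helper `binary_cross_entropy`, which applies the same formula as `get_loss` with n equal to the number of samples.

```python
import numpy as np

def binary_cross_entropy(y_hat, y):
    """Mean binary cross-entropy; same formula as get_loss with n = len(y)."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0])

# Confident and correct: probabilities close to the true labels.
good = binary_cross_entropy(np.array([0.9, 0.1]), y)
# Undecided: 0.5 for every sample gives log(2).
unsure = binary_cross_entropy(np.array([0.5, 0.5]), y)
# Confident and wrong: probabilities close to the opposite labels.
bad = binary_cross_entropy(np.array([0.1, 0.9]), y)

print(round(good, 4), round(unsure, 4), round(bad, 4))  # 0.1054 0.6931 2.3026
```

The loss grows quickly as predictions become confidently wrong, which is what pushes the weights in the right direction during training.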
  • Weight Update

def update_params(n, X, y_hat, y, alpha, w_old):
  
    #computing the gradient for all parameters at once (no explicit loop needed)
    diff = y_hat-y
    update = (1/n)*np.dot(X.T,diff)
   
    #applying the update equation
    w_new = w_old-(alpha)*(update.T)
    
    #return the updated weights
    return w_new
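One way to gain confidence in the update rule above is to compare its gradient, (1/n) * X^T (y_hat - y), with a numerical estimate obtained by nudging each weight slightly. The following is a minimal sketch on synthetic data; the helper names and the step size `eps` are assumptions made for this check only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Mean binary cross-entropy for weights w of shape (1, d)."""
    y_hat = sigmoid(X @ w.T)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grad(w, X, y):
    """Gradient used in update_params: (1/n) * X^T (y_hat - y), as a row."""
    y_hat = sigmoid(X @ w.T)
    return ((X.T @ (y_hat - y)) / len(X)).T

def numeric_grad(w, X, y, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for j in range(w.shape[1]):
        step = np.zeros_like(w)
        step[0, j] = eps
        grad[0, j] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.random((20, 3))                              # synthetic features in [0, 1)
y = rng.integers(0, 2, size=(20, 1)).astype(float)   # synthetic 0/1 labels
w = rng.random((1, 3))

max_gap = np.max(np.abs(analytic_grad(w, X, y) - numeric_grad(w, X, y)))
print("largest difference:", max_gap)
```

If the two gradients agree to within a tiny tolerance, the update equation matches the loss function it is supposed to minimise.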

Now, all there is left to do is code a routine that uses these functions in the algorithmic way we discussed above.

def train_logistic_regression(X,y, alpha = 0.01, epochs = 10000):
    
    #first we get the number of examples present in our dataset
    n = len(X)
    
    #normalise data using min_max scaling
    scaler = MinMaxScaler()
    scaler.fit(X)
    X = scaler.transform(X)
    
    #Randomly initialising weights of our model
    w = np.random.rand(1,X.shape[1])
    print("Original weights =",w)
    
    #running gradient descent for the given number of epochs
    for i in range(epochs):
        
        dot = np.dot(X, w.T)
        y_hat = get_sigmoid(dot)
        
        #calculate the loss against the predicted outputs
        loss = get_loss(y_hat, y, n)
        
        print("Loss ================>",loss)
        
        #updating the weights based on the loss
        w = update_params(n, X, y_hat, y, alpha, w)
    
    #we return the predictions made on the train dataset as well as the final weights.
    return w, y_hat

One thing you will note is that the first step in our routine above is the min-max scaling of the data. This normalizes the dataset: each input feature covers a different numeric range, and without scaling the algorithm has trouble learning the pattern.
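Keep in mind that a scaler fitted on one set of rows only knows that set's minimum and maximum. A common practice is to compute these statistics on the training data and reuse them for any later data, so that a given raw value always maps to the same scaled value. Below is a minimal NumPy sketch of that idea with made-up numbers; the helper names `fit_min_max` and `apply_min_max` are hypothetical, and this is not the routine used in this article, which relies on MinMaxScaler.

```python
import numpy as np

def fit_min_max(X):
    """Record the per-column minimum and range of the training data."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid dividing by zero for constant columns
    return col_min, col_range

def apply_min_max(X, col_min, col_range):
    """Scale X with statistics that were fitted elsewhere."""
    return (X - col_min) / col_range

# Made-up values for two columns (think age and cholesterol).
train = np.array([[30.0, 150.0], [50.0, 250.0], [70.0, 350.0]])
test = np.array([[40.0, 200.0], [80.0, 300.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(train_scaled)
print(test_scaled)
```

Test values outside the training range end up slightly below 0 or above 1, which is expected when the training statistics are reused.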

The rest of the routine follows the steps we discussed above.

Since we have a small dataset, we can afford to run it for a large number of epochs. We let ours run for the default value of 10000.

%%time
weights, out = train_logistic_regression(sample_x,sample_y)

After the run, we have our final updated weights.
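The routine above always runs for the full number of epochs, whereas step 4 of the algorithm also allows stopping once the loss has stopped dropping. The sketch below shows one way to add such a check. It uses synthetic data instead of heart.csv, and the function name, the tolerance `tol`, the zero initial weights, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def train_with_early_stop(X, y, alpha=0.5, epochs=10000, tol=1e-6):
    """Gradient descent that stops when the loss improves by less than tol."""
    n = len(X)
    w = np.zeros((1, X.shape[1]))
    prev_loss = np.inf
    for epoch in range(epochs):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w.T)))
        y_hat = np.clip(y_hat, 1e-8, 1 - 1e-8)   # keep the logs finite
        loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        if prev_loss - loss < tol:
            break                                 # improvement too small, stop
        prev_loss = loss
        w = w - alpha * ((X.T @ (y_hat - y)) / n).T
    return w, loss, epoch + 1

rng = np.random.default_rng(1)
X = rng.random((100, 2))
y = (X[:, :1] > 0.5).astype(float)    # label depends on the first feature

w, final_loss, epochs_run = train_with_early_stop(X, y)
print("stopped after", epochs_run, "epochs with loss", round(final_loss, 4))
```

A looser tolerance stops earlier at a slightly higher loss, so `tol` becomes one more hyperparameter to tune.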

It is time to build some additional functions to aid our testing.

  • Prediction function
def predict_logistic_regression(weights, x_test):
    
    #doing the necessary calculations before applying the sigmoid function
    X = x_test.values
    #normalise data using min_max scaling
    scaler = MinMaxScaler()
    scaler.fit(X)
    X = scaler.transform(X)
    
    #calculate the predictions
    product = np.dot(X,weights.T)
    y_preds = get_sigmoid(product)
    
    #return the predictions with applied thresholds
    return [1 if i >= 0.5 else 0 for i in y_preds]
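A short note on the 0.5 threshold: for the plain sigmoid (ignoring the small constant used in `get_sigmoid`), a probability of at least 0.5 corresponds exactly to a non-negative dot product. The sketch below checks this on a few made-up scores.

```python
import numpy as np

z = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])   # made-up values of np.dot(X, w.T)
probs = 1.0 / (1.0 + np.exp(-z))

from_probs = (probs >= 0.5).astype(int)     # threshold the probabilities
from_scores = (z >= 0).astype(int)          # threshold the raw scores instead
print(from_probs, from_scores)              # [0 0 1 1 1] [0 0 1 1 1]
```

Raising or lowering the threshold trades false positives against false negatives, which can matter for a medical dataset like this one.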
  • Accuracy
'''
function to calculate the percentage accuracy from the given predictions
'''

def get_accuracy(y_pred, y):
    
    assert len(y_pred) == len(y)
    n = len(y)
    Y = y
    
    count = 0
    for i in range(n):
        if y_pred[i] == Y[i]:
            count+=1
     
    #print the accuracy and also return it so that it can be reused
    accuracy = round(((count/n)*100),2)
    print("Accuracy:", accuracy, "%")
    return accuracy
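The loop above compares the predictions one by one. The same percentage can also be computed without a loop. Here is a minimal sketch with made-up predictions and labels; `accuracy_percent` is a hypothetical name.

```python
import numpy as np

def accuracy_percent(y_pred, y):
    """Percentage of predictions that match the labels."""
    y_pred = np.asarray(y_pred).ravel()   # flatten lists or (n, 1) arrays
    y = np.asarray(y).ravel()
    assert y_pred.shape == y.shape
    return round(float(np.mean(y_pred == y)) * 100, 2)

preds = [1, 0, 1, 1, 0]                          # e.g. thresholded model output
labels = np.array([[1], [0], [0], [1], [0]])     # column vector, like test_y
print("Accuracy:", accuracy_percent(preds, labels), "%")  # Accuracy: 80.0 %
```

Flattening both inputs first avoids accidental broadcasting between a flat list and a column vector.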

Using the above helper functions on the test samples, we see that our model gives an accuracy of 75%.

This number can be improved by tuning the hyperparameters but that is up to you.

Conclusion:

After a long journey, we have finally reached the end of this learning trip. In this two-part series, we learned the mathematics of logistic regression and applied that knowledge to a real dataset. The results we achieved are acceptable but can be improved if you play around with the learning rate and the number of epochs.

*DON’T FORGET TO READ UP ON PART 1 HERE*

*IF YOU HAVE ANY SUGGESTIONS/QUESTIONS, DO POST THEM DOWN IN THE COMMENTS*
