*THIS IS A TWO-PART SERIES. PART 1 COVERS THE THEORY BEHIND THE LOGISTIC REGRESSION ALGORITHM. PART 2 (THIS ONE) COVERS THE CODE, WHERE WE IMPLEMENT LOGISTIC REGRESSION IN PYTHON.*
If you are confused about the theoretical working of the logistic regression algorithm, you can read part 1 here.
The entire code used in this tutorial can be found here.
Since logistic regression and linear regression are closely related, I would advise you to read up on linear regression as well to get a better understanding of both algorithms. You can read about it here.
Since we have already covered the theory in part 1, here we will jump straight into the code.
We will start by importing the libraries we need. We will not use a ready-made logistic regression from any library, as that would defeat the purpose of this article; scikit-learn is imported only for its MinMaxScaler.
# Getting all the important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
Let’s first explore the dataset which we will use.
Since logistic regression predicts discrete classes, we will use a dataset on the possibility of cardiovascular disease. The dataset is open source and can be found here.
# Loading the dataset
heart = pd.read_csv("heart.csv")

# Shuffling the dataset
heart = heart.sample(frac=1).reset_index(drop=True)

# Separating the input and output variables
X = heart[["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal"]]
Y = heart["target"]

# Splitting the dataset into training and test samples
sample_x = X[:250]
sample_y = Y[:250].values
# shape[0] is the number of rows, so the labels become a column vector of shape (n, 1)
sample_y = sample_y.reshape(sample_y.shape[0], -1)

test_x = X[250:]
test_y = Y[250:].values
test_y = test_y.reshape(test_y.shape[0], -1)
From the code above, we see that we have a total of 13 input features stored in the variable sample_x.
Our outputs are a series of 1’s and 0’s, where a 1 supports the possibility of cardiovascular disease and a 0 rejects it.
We have roughly 300 samples in total. We separate the first 250 samples as training data and use the remaining 50 or so for testing.
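The reshape step in the split is easy to overlook, so here is a minimal sketch of what it does and why it matters. The array `labels` is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical labels standing in for heart["target"].values
labels = np.array([1, 0, 1, 1, 0])

# A flat array of shape (5,) becomes a column vector of shape (5, 1)
column = labels.reshape(labels.shape[0], -1)
print(labels.shape, column.shape)

# Mixing the two shapes broadcasts to a (5, 5) matrix instead of
# an element-wise difference, which would silently corrupt the loss
mixed = column - labels
print(mixed.shape)
```

Keeping the labels as a column vector means they line up with the (n, 1) predictions the model produces, so `y_hat - y` stays element-wise.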
- Calculate outputs using randomly initialized weights.
- Calculate loss between original and predicted outputs.
- Update weights using gradient descent.
- Repeat until the maximum number of iterations is reached or the loss has dropped significantly.
From the above steps we can isolate the functions we will need. Let’s get to coding.
- Sigmoid function
def get_sigmoid(inp):
    # Apply the sigmoid to our values; subtracting a small constant
    # keeps the result strictly below 1
    return (1 / (1 + np.exp(-inp))) - 0.00000001
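Subtracting a constant keeps the sigmoid output below 1, but for very negative inputs it also pushes the output slightly below 0, where the logarithm is undefined. One possible alternative, shown here only as a sketch, is to clip the output into a safe range. The name `get_sigmoid_clipped` and the `eps` parameter are hypothetical and not part of the original code.

```python
import numpy as np

def get_sigmoid_clipped(inp, eps=1e-8):
    # Hypothetical variant of get_sigmoid: clip into [eps, 1 - eps]
    # instead of subtracting a constant
    out = 1.0 / (1.0 + np.exp(-inp))
    return np.clip(out, eps, 1.0 - eps)

z = np.array([-40.0, 0.0, 40.0])
probs = get_sigmoid_clipped(z)
print(probs)

# Both logarithms used by the loss stay finite at the extremes
print(np.log(probs), np.log(1 - probs))
```

With clipping, both `log(y_hat)` and `log(1 - y_hat)` remain finite no matter how confident the model becomes.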
- Loss function
def get_loss(y_hat, y, n):
    # Simply returning the loss calculated using our loss function
    return -(1 / n) * (np.sum(y * (np.log(y_hat)) + (1 - y) * np.log(1 - y_hat)))
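To get a feel for the numbers this loss produces, we can evaluate it on a few hand-made predictions. The arrays below are invented for illustration; the function body is the same formula as above, repeated so the snippet runs on its own.

```python
import numpy as np

def get_loss(y_hat, y, n):
    # Same binary cross-entropy formula as in the article
    return -(1 / n) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Made-up labels and two sets of made-up predictions
y = np.array([[1], [0], [1], [0]])
undecided = np.full((4, 1), 0.5)
confident = np.array([[0.9], [0.1], [0.9], [0.1]])

loss_undecided = get_loss(undecided, y, len(y))
loss_confident = get_loss(confident, y, len(y))
print(loss_undecided)  # ln(2), about 0.693
print(loss_confident)  # -ln(0.9), about 0.105
```

A model that always answers 0.5 scores ln(2), so any loss well below roughly 0.69 during training suggests the weights are picking up something useful.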
- Weight Update
def update_params(n, X, y_hat, y, alpha, w_old):
    # Gradient of the loss for every parameter at once (vectorised, no loop needed)
    diff = y_hat - y
    update = (1 / n) * np.dot(X.T, diff)

    # Applying the update equation
    w_new = w_old - (alpha) * (update.T)

    # Return the updated weights
    return w_new
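The update above relies on the gradient of the loss being `(1/n) * X.T @ (y_hat - y)`. A quick way to convince ourselves is to compare that expression with a finite-difference estimate on random data. This is only a sanity-check sketch; the helper names `sigmoid` and `loss` and the random data are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    # Binary cross-entropy for weights w of shape (1, d)
    y_hat = sigmoid(X @ w.T)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Made-up data: 20 examples, 3 features, binary labels
rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = rng.integers(0, 2, size=(20, 1))
w = rng.random((1, 3))

# Analytical gradient, same expression as in update_params
analytic = ((1 / len(X)) * (X.T @ (sigmoid(X @ w.T) - y))).T

# Numerical gradient via central differences, one weight at a time
h = 1e-6
numeric = np.zeros_like(w)
for j in range(w.shape[1]):
    step = np.zeros_like(w)
    step[0, j] = h
    numeric[0, j] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * h)

max_gap = np.max(np.abs(analytic - numeric))
print(max_gap)
```

If the two gradients agree to several decimal places, the vectorised update is doing what the math says it should.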
Now, all there is left to do is code a routine that uses these functions in the algorithmic way we discussed above.
def train_logistic_regression(X, y, alpha=0.01, epochs=10000):
    # First we get the number of examples present in our dataset
    n = len(X)

    # Normalise data using min-max scaling
    scaler = MinMaxScaler()
    scaler.fit(X)
    X = scaler.transform(X)

    # Randomly initialising the weights of our model, one per input feature
    w = np.random.rand(1, X.shape[1])
    print("Original weights =", w)

    for i in range(epochs):
        # Calculate the predicted outputs with the current weights
        dot = np.dot(X, w.T)
        y_hat = get_sigmoid(dot)

        # Calculate the loss against the predicted outputs
        loss = get_loss(y_hat, y, n)
        print("Loss ================>", loss)

        # Updating the weights based on the gradient of the loss
        w = update_params(n, X, y_hat, y, alpha, w)

    # We return the final weights as well as the predictions made on the train dataset
    return w, y_hat
One thing you will note is that the first step in our routine above is the min-max scaling of the data. This normalizes the dataset: each input feature covers a different numeric range, and without scaling the algorithm has trouble learning the pattern.
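To make the scaling step concrete, here is roughly what min-max scaling to the range 0 to 1 computes, written out with plain NumPy. The feature values are invented, and the column names in the comment are only for flavour.

```python
import numpy as np

# Made-up feature matrix: the columns could be age and cholesterol
train = np.array([[29.0, 126.0], [54.0, 240.0], [77.0, 564.0]])
test = np.array([[60.0, 300.0]])

# Per-column minimum and maximum, learned from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# (x - min) / (max - min) maps every training column onto [0, 1]
train_scaled = (train - col_min) / (col_max - col_min)

# Reusing the training statistics keeps test data on the same scale
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)
```

Note that this sketch reuses the training minimum and maximum for the test rows, whereas the prediction function in this article fits a fresh scaler on whatever data it receives. Both run, but reusing the training statistics is a common choice because the weights were learned on that scale.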
The rest of the routine follows the steps we discussed.
Since we have a small dataset, we can afford to run it for a large number of epochs. We let ours run for the default value, 10000.
%%time
weights, out = train_logistic_regression(sample_x, sample_y)
After the run, we have our final updated weights.
It is time to build some additional functions to aid our testing.
- Prediction function
def predict_logistic_regression(weights, x_test):
    # Doing the necessary calculations before applying the sigmoid function
    X = x_test.values

    # Normalise data using min-max scaling (the scaler is fitted on the data passed in)
    scaler = MinMaxScaler()
    scaler.fit(X)
    X = scaler.transform(X)

    # Calculate the predictions
    product = np.dot(X, weights.T)
    y_preds = get_sigmoid(product)

    # Return the predictions with the 0.5 threshold applied
    return [1 if i >= 0.5 else 0 for i in y_preds.ravel()]
- Accuracy function
def get_accuracy(y_pred, y):
    '''
    Function to calculate the percentage accuracy from the given predictions
    '''
    assert len(y_pred) == len(y)
    n = len(y)
    count = 0
    # Count the predictions that match the true labels
    for i in range(n):
        if y_pred[i] == y[i]:
            count += 1
    print("Accuracy:", round(((count / n) * 100), 2), "%")
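If you prefer to avoid the explicit loop, the same percentage can be computed with a vectorised comparison. This is just an alternative sketch on made-up predictions and labels, not a replacement for the function above.

```python
import numpy as np

# Made-up predictions (a plain list) and labels (a column vector like test_y)
y_pred = [1, 0, 1, 1, 0]
y_true = np.array([[1], [0], [0], [1], [0]])

# Flatten the labels so the comparison is element-wise, then average the matches
matches = np.array(y_pred) == y_true.ravel()
accuracy = np.mean(matches) * 100
print("Accuracy:", round(accuracy, 2), "%")
```

Flattening the labels first matters here: comparing a length-5 list against a (5, 1) column would broadcast to a 5 by 5 grid instead of five pairwise comparisons.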
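Using the above helper functions, we see that our model gives an accuracy of around 75% on the test samples. The exact figure changes from run to run because the dataset is shuffled and the weights are initialised randomly.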
This number can be improved by tuning the hyperparameters but that is up to you.
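As a starting point for that tuning, one simple approach is to train with a few learning rates and compare the final loss. The sketch below does this on synthetic data with a compact trainer that mirrors our update rule; the data, the helper `final_loss`, and the candidate learning rates are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_loss(X, y, alpha, epochs):
    # Plain batch gradient descent using the same update rule as the article
    w = np.zeros((1, X.shape[1]))
    for _ in range(epochs):
        y_hat = sigmoid(X @ w.T)
        w = w - alpha * ((X.T @ (y_hat - y)) / len(X)).T
    # Clip before taking logs so the loss stays finite
    y_hat = np.clip(sigmoid(X @ w.T), 1e-8, 1 - 1e-8)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# Synthetic data: the label depends on the first two of three features
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, [0]] + X[:, [1]] > 1.0).astype(float)

# Train once per candidate learning rate and record the final loss
results = {alpha: final_loss(X, y, alpha, epochs=500)
           for alpha in (0.001, 0.01, 0.1, 1.0)}
for alpha, loss in results.items():
    print(alpha, round(loss, 4))
```

On the real dataset you would compare accuracy on held-out samples rather than the training loss, and the learning rates that work best there may well differ from the ones in this toy setup.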
After a long journey, we have finally reached the end of this learning trip. In this 2-part series, we learned the mathematics of logistic regression and applied that very knowledge to a real dataset. The results we achieved are somewhat acceptable but can be improved if you play around with the learning rate and the number of epochs.
*DON’T FORGET TO READ UP ON PART 1 HERE*
*IF YOU HAVE ANY SUGGESTIONS/QUESTIONS, DO POST THEM DOWN IN THE COMMENTS*