*THIS IS A TWO-PART SERIES. PART 1 COVERS THE THEORETICAL UNDERSTANDING OF THE LOGISTIC REGRESSION ALGORITHM. PART 2 (THIS ONE) COVERS THE CODING PART, WHERE WE IMPLEMENT LOGISTIC REGRESSION IN PYTHON.*
If you are confused about the theoretical working of the logistic regression algorithm, you can read part 1 here.
The entire code used in this tutorial can be found here.
Since logistic regression is closely related to linear regression, I would advise you to read up on linear regression as well to get a better understanding of both algorithms. You can read about it here.
Since we have already covered the theory in part 1, here we will jump straight into the code.
We will start by importing the libraries we need. We will not use a ready-made logistic regression from any library, as that would defeat the purpose of this article.
Let’s first explore the dataset which we will use.
Since we predict discrete classes using logistic regression, we will use the dataset on the possibility of cardiovascular disease. The dataset is open source and can be found here.
From the above overview, we see that we have a total of 12 input features stored in the variable sample_x.
Our outputs are a series of 1’s and 0’s where 1’s support the possibility of cardiovascular disease and 0’s reject it.
We have roughly 300 total samples. We separate the first 250 samples as training data and the remaining 50 will be used for testing.
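The split described above can be sketched as follows. The real dataset is not loaded here: `all_x` and `all_y` are hypothetical stand-ins filled with random numbers, and the sample count of exactly 300 is an assumption made only so the shapes are easy to follow.

```python
import numpy as np

# Hypothetical stand-in for the real data: 300 samples, 12 input features.
rng = np.random.default_rng(0)
all_x = rng.normal(size=(300, 12))
all_y = rng.integers(0, 2, size=(300, 1))  # labels are 0 or 1

# The first 250 rows are used for training, the remaining rows for testing.
n_train = 250
train_x, test_x = all_x[:n_train], all_x[n_train:]
train_y, test_y = all_y[:n_train], all_y[n_train:]

print(train_x.shape, test_x.shape)
```

A plain slice like this keeps the rows in their original order, so it only makes sense if the dataset is not sorted by label.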
Recall the steps of the algorithm:
- Calculate outputs using randomly initialized weights.
- Calculate the loss between the original and predicted outputs.
- Update the weights using gradient descent.
- Repeat until the maximum number of iterations is reached or the loss has reduced significantly.
From the above steps we can isolate the functions we will need. Let’s get to coding.
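- Sigmoid function
```python
def get_sigmoid(inp):
    # Return the sigmoid of the input values. The tiny offset presumably
    # keeps the output strictly below 1, so log(1 - y_hat) never hits log(0).
    return (1 / (1 + np.exp(-inp))) - 0.00000001
```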
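One thing to be aware of: subtracting a constant keeps the output below 1, but for very negative inputs it can push the output slightly below 0, where `log(y_hat)` is undefined. A possible alternative, shown here only as a sketch and not as the code used in this article, is to clip the output on both sides. The name `get_sigmoid_clipped` and the `eps` parameter are made up for this example.

```python
import numpy as np

def get_sigmoid_clipped(inp, eps=1e-8):
    # Hypothetical variant: compute the plain sigmoid, then clip it into
    # [eps, 1 - eps] so both log(y_hat) and log(1 - y_hat) stay finite.
    out = 1.0 / (1.0 + np.exp(-inp))
    return np.clip(out, eps, 1.0 - eps)

vals = get_sigmoid_clipped(np.array([-50.0, 0.0, 50.0]))
print(vals)
print(np.log(vals), np.log(1 - vals))  # all finite
```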
- Loss function
```python
def get_loss(y_hat, y, n):
    # Return the binary cross-entropy loss averaged over the n samples
    return -(1 / n) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```
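To get a feel for the numbers this loss produces, here is a small worked example. The two-sample inputs are made up for illustration; the function body is the same as above.

```python
import numpy as np

def get_loss(y_hat, y, n):
    # Binary cross-entropy averaged over the n samples
    return -(1 / n) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0])

# Confident and correct predictions give a small loss: -log(0.9)
confident = get_loss(np.array([0.9, 0.1]), y, 2)

# Predicting 0.5 for everything gives log(2), the "no information" baseline
unsure = get_loss(np.array([0.5, 0.5]), y, 2)

print(confident, unsure)
```

The closer the predicted probabilities are to the true labels, the closer the loss gets to 0.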
- Weight Update
```python
def update_params(n, X, y_hat, y, alpha, w_old):
    # Error between the predicted and the original outputs
    diff = y_hat - y
    # Gradient of the loss for all parameters at once (vectorized, no loop)
    update = (1 / n) * np.dot(X.T, diff)
    # Apply the update equation
    w_new = w_old - alpha * update.T
    # Return the updated weights
    return w_new
```
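Written out, the update that `update_params` computes corresponds to the following gradient descent step. The symbols are chosen here to mirror the code: $X$ is the input matrix, $\hat{\mathbf{y}}$ is `y_hat`, $\alpha$ is `alpha`, and $n$ is the number of samples.

```latex
\mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} - \alpha \cdot \frac{1}{n} X^{\top} \left( \hat{\mathbf{y}} - \mathbf{y} \right)
```

The term $\frac{1}{n} X^{\top}(\hat{\mathbf{y}} - \mathbf{y})$ is the gradient of the cross-entropy loss with respect to the weights.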
Now, all there is left to do is code a routine that uses these functions in the algorithmic way we discussed above.
One thing you will note is that the first step in our routine is the min-max scaling of the data. This normalizes the dataset: each input feature covers a different numeric range, and without scaling the algorithm has trouble learning the pattern.
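Min-max scaling maps every feature column into the range 0 to 1. A minimal sketch is given below; the function name `min_max_scale` and the guard for constant columns are assumptions added for this example.

```python
import numpy as np

def min_max_scale(X):
    # Per-column minimum and maximum
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against division by zero when a column is constant
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span

# Two features on very different scales end up in the same [0, 1] range
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])
scaled = min_max_scale(X_demo)
print(scaled)
```

When you scale test data later, reuse the minimum and maximum computed from the training data so that both sets are transformed in the same way.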
The rest of the routine is the same as we discussed.
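Put together, such a routine might look like the sketch below. This is not the exact routine used in this article: `train_sketch`, the learning rate of 0.1, the `tol` stopping threshold, the seed, the absence of a bias term, and the synthetic demo data are all assumptions made for illustration. The three helper functions are repeated so the block runs on its own.

```python
import numpy as np

def get_sigmoid(inp):
    return 1.0 / (1.0 + np.exp(-inp)) - 0.00000001

def get_loss(y_hat, y, n):
    return -(1 / n) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def update_params(n, X, y_hat, y, alpha, w_old):
    update = (1 / n) * np.dot(X.T, y_hat - y)
    return w_old - alpha * update.T

def train_sketch(X, y, alpha=0.1, max_iter=10000, tol=1e-3, seed=0):
    # Hypothetical training loop; all default values are assumptions.
    n, d = X.shape
    # Step 0: min-max scale every feature column into [0, 1]
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    X = (X - col_min) / span
    # Step 1: randomly initialized weights, shape (1, d), no bias term
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=(1, d))
    for _ in range(max_iter):
        y_hat = get_sigmoid(X @ w.T)        # predicted outputs, shape (n, 1)
        loss = get_loss(y_hat, y, n)        # Step 2: loss
        if loss < tol:                      # Step 4: stop once the loss is small
            break
        w = update_params(n, X, y_hat, y, alpha, w)  # Step 3: gradient step
    # Return the final weights and the outputs they produce
    return w, get_sigmoid(X @ w.T)

# Synthetic demo data (not the cardiovascular dataset): label is 1 when
# the first feature is larger than the second.
rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(200, 3))
y_demo = (X_demo[:, [0]] > X_demo[:, [1]]).astype(float)

weights, out = train_sketch(X_demo, y_demo, max_iter=2000)
final_loss = get_loss(out, y_demo, len(y_demo))
train_acc = float(np.mean((out >= 0.5).astype(float) == y_demo))
print(weights.shape, final_loss, train_acc)
```

Returning both the weights and the final outputs mirrors the `weights, out` pair used in the call further down.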
Since we have a small dataset, we can afford to run it for a large number of epochs. We let ours run for the default value of 10,000.
```python
%%time
weights, out = train_logistic_regression(sample_x, sample_y)
```
After the run, we have our final updated weights.
It is time to build some additional functions to aid our testing.
- Prediction function
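A prediction helper and an accuracy helper could look like the sketch below. The names `predict` and `accuracy`, the 0.5 threshold, and the toy weights and inputs are assumptions for this example, not the exact helpers behind the result reported in this article. The inputs are assumed to be scaled the same way as the training data.

```python
import numpy as np

def predict(X_scaled, weights, threshold=0.5):
    # Probability of class 1 for every sample, shape (n, 1)
    probs = 1.0 / (1.0 + np.exp(-(X_scaled @ weights.T)))
    # Turn probabilities into hard 0/1 class labels
    return (probs >= threshold).astype(int)

def accuracy(y_pred, y_true):
    # Fraction of samples where the prediction matches the label
    return float(np.mean(y_pred == y_true))

# Toy example with hand-picked weights (shape (1, d), as in training)
w_demo = np.array([[2.0, -2.0]])
X_demo = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.6, 0.4],
                   [0.3, 0.7]])
y_true = np.array([[1], [0], [0], [0]])

y_pred = predict(X_demo, w_demo)
acc = accuracy(y_pred, y_true)
print(y_pred.ravel(), acc)
```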
Using the above helper functions, we see that our model gives an accuracy of 75%.
This number can be improved by tuning the hyperparameters but that is up to you.
After a long journey, we have finally reached the end of this learning trip. In this 2-part series, we learned the mathematics of logistic regression and applied that very knowledge to a real dataset. The results we achieved are somewhat acceptable but can be improved if you play around with the learning rate and the number of epochs.
*DON’T FORGET TO READ UP ON PART 1 HERE*
*IF YOU HAVE ANY SUGGESTIONS/QUESTIONS, DO POST THEM DOWN IN THE COMMENTS*