Multi-variable Linear Regression: Python Implementation

Linear Regression is the most basic Machine Learning algorithm, and understanding it is imperative for anyone looking to build a career in Artificial Intelligence.
Linear Regression in Python

Machine Learning algorithms have gained massive popularity over the last decade. Today these algorithms are used in many fields for all sorts of data manipulation and prediction. In this tutorial, we will be implementing the most basic machine learning algorithm, “Linear Regression”.

If we were to describe linear regression in a single line, it would be something like:

“Fitting a straight line through your data points.”

This is a supervised machine learning algorithm. If you wish to learn about unsupervised algorithms, click here.

This tutorial will consist of 3 main parts:

  • Understanding and implementing the algorithm step-by-step.
  • Applying the built algorithm on 2-dimensional data (for better visualization).
  • Applying the algorithm on n-dimensional data.

**The entire code used in this tutorial can be found at the following GitHub repository**.

Let’s get started.

Step-by-step implementation

Basic understanding

As we have discussed before, linear regression is as simple as fitting a straight line through our data. Hence, for a basic understanding, we need to revisit some basic mathematics and look at the equation of a straight line.

y = mx + c

The equation for a straight line

In the equation above, x and y are the data points, and m and c are the slope and y-intercept respectively. We need to find suitable values for ‘m’ and ‘c’ in order to fit a good straight line.
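For example, with a slope of m = 2 and an intercept of c = 1, the input x = 3 gives y = 2(3) + 1 = 7.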

Say we have multiple values of ‘x’ (several input features) against a single ‘y’; then we can generalize the above equation as follows:

y = a_1x_1 + a_2x_2 + … + a_nx_n + c

Linear equation for n variables

Instead of just ‘m’ and ‘c’, we now need to find the intercept ‘c’ and all the coefficients of x from a_1 to a_n, and that is the complete concept of multivariable linear regression.

Implementation of linear regression

We will need the following libraries for basic math functionality and visualization.

```python
# importing relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

First, we define a function that returns the predicted values of y (conventionally called y-hat) for given values of the coefficients a and the bias b.

```python
def get_pred_value(a, b, X):
    # broadcast the bias across all rows
    b_s = np.full((X.shape[0], 1), b)
    # applying the linear equation y = X·a + b
    y = np.dot(X, a) + b_s
    return y
```

Next, we need to calculate how far off we were from the original values of y. In machine learning terms, this is called calculating the cost. There are several different cost functions, but for our case we will use the good old “Mean Squared Error”. The formula for MSE is:

MSE = (1/n) Σ (y_i − ŷ_i)²

The formula for calculating Mean Squared Error

Let’s put this into code:

```python
def get_loss(y_hat, y):
    # getting the length of our data
    n = len(y)
    # ensuring a column vector
    y = y.reshape(-1, 1)
    # calculating the mean squared error
    diff = (y_hat - y) ** 2
    summed = diff.sum(0)
    return (1 / n) * summed
```

Our aim is to minimize the cost calculated by the above function, so that our predictions come as close to the original values as possible.

For this, we need to see how each of our parameters affects the overall cost. This can be done by calculating the rate of change of the cost, also known as the gradient, with respect to every parameter. The equations are:

∂MSE/∂c = (2/n) Σ (ŷ_i − y_i)
∂MSE/∂a_j = (2/n) Σ x_ij (ŷ_i − y_i)

Gradients with respect to the bias and coefficients
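As a quick sanity check (our addition, not part of the original tutorial), the analytic gradient of the bias can be compared against a finite-difference estimate computed from the get_loss and get_pred_value functions defined above:

```python
# Compare the analytic gradient d(MSE)/db against a numerical estimate
# on some synthetic data (a toy example)
X = np.random.rand(50, 1)
y = 3 * X[:, 0] + 2 + 0.1 * np.random.randn(50)
a = np.random.rand(1, 1)
b = 0.5
eps = 1e-6

# analytic gradient with respect to the bias: (2/n) * sum(y_hat - y)
y_hat = get_pred_value(a, b, X)
analytic = (2 / len(y)) * np.sum(y_hat - y.reshape(-1, 1))

# numerical gradient via a central difference on the loss
loss_plus = get_loss(get_pred_value(a, b + eps, X), y)
loss_minus = get_loss(get_pred_value(a, b - eps, X), y)
numerical = (loss_plus - loss_minus) / (2 * eps)

print(analytic, numerical)  # the two values should agree closely
```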

In order to minimize the cost, we carry out a process called “Gradient Descent”. If we plot the cost against a particular parameter’s value, we get a curve whose lowest point is the minimum cost; to reach it, we must iteratively change the value of the parameter, and that is where the update equations come in.

The final equations to update the values of our parameters are as below,

c := c − α · ∂MSE/∂c
a_j := a_j − α · ∂MSE/∂a_j

Update equations for the parameters involved

Alpha in the above equations is called the learning rate. Its value needs to be set carefully: too small a value will update the parameters very slowly, while too large a value can cause the updated parameters to overshoot and never converge to the global minimum.
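To make this concrete, here is a tiny standalone sketch (our addition) of gradient descent on the one-dimensional cost f(w) = w², whose gradient is 2w and whose minimum is at w = 0:

```python
def descend(alpha, steps=20, w=5.0):
    # repeatedly apply the update rule w := w - alpha * f'(w)
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(descend(alpha=0.01))  # too small: w creeps toward 0 very slowly
print(descend(alpha=0.5))   # well chosen: jumps straight to the minimum
print(descend(alpha=1.1))   # too large: the updates overshoot and diverge
```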

Let us code these update equations in a general fashion, so that any number of parameters can be updated.

```python
def update_params(data, y, y_hat, coeff, b, alpha=0.0001):
    n = data.shape[0]
    y = y.reshape(-1, 1)
    # updating the bias with its update equation
    b_new = b - alpha * (2 / n) * np.sum(y_hat - y)
    # running a loop to update every coefficient
    for i in range(len(coeff)):
        current_feat = data[:, i].reshape(len(data), -1)
        diff = y_hat - y
        # rate of change of the cost with respect to this coefficient
        dot = np.dot(current_feat.T, diff)
        eva = alpha * (2 / n) * np.sum(dot)
        # updating the value of the coefficient
        coeff[i] = coeff[i] - eva
    return coeff, b_new
```

With all the functions in place, we just need a training routine that calls them in order to minimize the cost and find suitable parameters.

```python
def fit(X_train, y_train, iterr=1000):
    # an initially huge value of loss
    loss = 10000
    # randomly defining the initial coefficients
    a = np.random.rand(X_train.shape[1], 1)
    # setting the initial bias to zero
    b = 0
    for i in range(iterr):
        # get predictions based on the current coefficients and bias
        y_hat = get_pred_value(a, b, X_train)
        # calculate the loss for the predicted values
        loss = get_loss(y_hat, y_train)
        print("Loss =======>", loss)
        # update the parameters to minimise the loss
        a, b = update_params(X_train, y_train, y_hat, a, b, alpha=0.01)
    print("Fitting complete!!")
    print("Final Loss achieved:", loss)
    # returning the final weights, bias, and predictions
    return a, b, y_hat
```

Testing

It is now time to test our algorithm. We will first test it on a simpler dataset, which can be downloaded for free from here. The filtering and visualization steps are carried out below.

```python
# loading the training and test data
data_train = pd.read_csv("Linear data/train.csv")
data_test = pd.read_csv("Linear data/test.csv")
# filtering out rows with missing values
data_train = data_train.dropna()
data_test = data_test.dropna()
# exploring the dataset
data_train.head(10)
```
Sample data values
The data seems to have a clear linear relationship, so it will give us good insight into how well our algorithm works.

We will normalize the data using the min-max scaling technique, which maps every value into the range [0, 1] via x′ = (x − min)/(max − min). Let’s define a function for min-max scaling. (There are several libraries that can do this for you, but it’s much more fun to code.)

```python
def min_max_scaler(df):
    # work on a copy so the original dataframe is left untouched
    temp = df.copy()
    cols = temp.keys()
    # iterating over each column
    for key in cols:
        # getting the max and min values
        max_val = temp[key].max()
        min_val = temp[key].min()
        # applying the min-max formula
        denom = max_val - min_val
        temp[key] = (temp[key] - min_val) / denom
    return temp

data_scaled = min_max_scaler(data_train)
```

With normalization done, we can begin to fit our model to the data.

```python
X_train = data_scaled["x"].values.reshape(-1, 1)
y_train = data_scaled["y"].values.reshape(-1, 1)

coeff, intercept, preds = fit(X_train, y_train, iterr=3000)
```
Final cost

We run the process for 3000 iterations. The final loss is quite low, so we can expect a good fit. Let’s visualize the results.

Fit on training data
Fit on test data

We can see that our model fits quite accurately on the training set and reasonably well on the test set. The results might improve if the model is trained for more iterations. (The experimentation is up to you 😀)

Another good measure to check our model is the “R-squared value”, also known as the “goodness of fit”. For this purpose, we’ll use the sklearn library (because sometimes you get tired of coding 😅).
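The article does not show how the test-set predictions are obtained, so here is a plausible sketch: we scale the test data with our min_max_scaler and run it through get_pred_value (the names data_test_scaled, X_test, and y_pred are our own):

```python
# Assumed prediction step (not shown explicitly in the article):
# scale the test data the same way and predict with the fitted parameters
data_test_scaled = min_max_scaler(data_test)
X_test = data_test_scaled["x"].values.reshape(-1, 1)
y_pred = get_pred_value(coeff, intercept, X_test)
```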

```python
from sklearn.metrics import r2_score

print("R_square Score:", r2_score(data_test_scaled["y"], y_pred))
```
R-squared score

The score shows a goodness of fit of 92.6%, which is quite good.
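As a cross-check (our addition, not part of the original tutorial), R-squared can also be computed by hand from its definition, R² = 1 − SS_res / SS_tot:

```python
# Manual R-squared from its definition; should match sklearn's r2_score
y_true = data_test_scaled["y"].values.reshape(-1, 1)
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print("Manual R_square:", 1 - ss_res / ss_tot)
```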

Testing on n-dimensional data

Now let’s check whether our algorithm actually works on n-dimensional data. For this purpose, we use the housing prices dataset. (The original dataset has a lot of features, but we have selected only a few of them for this test.)

```python
'''
The dataset is quite extensive so we select the features that seem
most relevant to keep things simple
'''
housing = pd.read_csv("train.csv",
                      usecols=["MSZoning", "LotArea", "Street", "LandContour",
                               "Utilities", "HouseStyle", "OverallQual",
                               "OverallCond", "RoofStyle", "ExterQual",
                               "Foundation", "GarageQual", "MiscVal", "SalePrice"])
'''
Most of the variables are categorical and represented as strings, so we
convert them into numerical labels since we have to work with numeric data
'''
housing["MSZoning"] = pd.factorize(housing["MSZoning"])[0]
housing["Street"] = pd.factorize(housing["Street"])[0]
housing["LandContour"] = pd.factorize(housing["LandContour"])[0]
housing["Utilities"] = pd.factorize(housing["Utilities"])[0]
housing["HouseStyle"] = pd.factorize(housing["HouseStyle"])[0]
housing["RoofStyle"] = pd.factorize(housing["RoofStyle"])[0]
housing["ExterQual"] = pd.factorize(housing["ExterQual"])[0]
housing["Foundation"] = pd.factorize(housing["Foundation"])[0]
housing["GarageQual"] = pd.factorize(housing["GarageQual"])[0]
```

Since most of the features were string types, we convert them into numerical labels.
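As a quick illustration (a toy example, not taken from the dataset) of what pd.factorize does:

```python
# pd.factorize assigns an integer code to each distinct string value
codes, uniques = pd.factorize(pd.Series(["RL", "RM", "RL", "FV"]))
print(codes)    # [0 1 0 2] -- one integer label per row
print(uniques)  # Index(['RL', 'RM', 'FV'], dtype='object')
```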

Housing prices dataset

We normalize the dataset just as before and train for 2000 iterations (the preparation step is sketched below).
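The preparation step is not shown in the original article, so the following is a plausible sketch: we scale the housing data with our min_max_scaler, separate SalePrice as the target, and hold out a test portion with sklearn’s train_test_split (the 80/20 split and random_state are our assumptions):

```python
from sklearn.model_selection import train_test_split

# Assumed preparation step: scale, separate features from the target,
# and hold out 20% of the rows for testing (the ratio is our choice)
housing_scaled = min_max_scaler(housing)
X = housing_scaled.drop(columns=["SalePrice"]).values
y = housing_scaled["SalePrice"].values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```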

```python
weights, biases, y_predicted = fit(X_train, y_train, iterr=2000)
```

The final cost for the housing data

The final cost is again quite low, as we would expect. Now let’s check the “R-squared value”.

```python
from sklearn.metrics import r2_score

# predict on the held-out rows with the fitted weights and bias
predicted = get_pred_value(weights, biases, X_test)
print(r2_score(y_test, predicted))
```
A 27.9% goodness of fit

The goodness of fit is quite low, and there can be multiple reasons for this: running the algorithm for more iterations might increase the score, or, as the number of input features grows, the data becomes more complex and a purely linear relationship is less likely to hold.

Conclusion

We have successfully implemented the Linear Regression algorithm in Python and achieved a reasonably good fit on our first dataset. We did not get a very good score on our second (multi-variable) dataset because it is difficult for such a complex dataset to follow a linear trend. For datasets like these, there are algorithms such as “Polynomial Regression” or SVMs; they are a little more complex, but they achieve much better scores.
