Machine learning algorithms have gained massive popularity over the last decade. Today they are used across many fields for all sorts of data analysis and prediction.

In this tutorial, we will be implementing the most basic machine learning algorithm called “Linear Regression”. If we were to describe linear regression in a single line, it would be something like:

“Fitting a straight line through your data points.”

This is a supervised machine learning algorithm. If you wish to learn about unsupervised algorithms, click here.

This tutorial will consist of 3 main parts:

  • Understanding and implementing the algorithm step-by-step.
  • Applying the built algorithm on 2-dimensional data (for better visualization).
  • Applying the algorithm on n-dimensional data.

**The entire code used in this tutorial can be found at the following GitHub repository**.

Let’s get started.

Step-by-step implementation

Basic understanding

As we have discussed before, linear regression is as simple as fitting a straight line across our data, hence for basic understanding we need to revisit some basic mathematics and look at the equation for a straight line.

y = mx + c

The equation for a straight line

In the equation above, X and Y are data points and m and c are the slope and y-intercept respectively. We need the appropriate values for ‘m’ and ‘c’ in order to create an appropriate straight line.

Say we have multiple input variables ‘X’ that together determine a single ‘Y’; then we can generalize the above equation as follows:

y = a_0·x_0 + a_1·x_1 + … + a_n·x_n + c

Linear equation for n variables

Now, instead of just ‘m’ and ‘c’, we need to find the intercept ‘c’ and all the coefficients of X, from a_0 to a_n. That is the complete concept of multivariable linear regression.
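
As a quick worked example (numbers made up for illustration): with an intercept c = 1 and coefficients a_0 = 2 and a_1 = 0.5, an input with x_0 = 3 and x_1 = 4 would give y = (2)(3) + (0.5)(4) + 1 = 9.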

Implementation

We will need the following libraries for basic math functionality and visualization.

#importing relevant libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

First, we define a function that returns the predicted values of y (conventionally called y-hat) from given values of the coefficients a and the intercept (bias) b.

def get_pred_value(a,b,X):
    
    b_s = np.full((X.shape[0],1),b)
    
    #Applying linear equation
    y = np.dot(X,a)+b_s

    return y
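
As a quick sanity check (my own addition, not part of the tutorial code), we can call this on a tiny hand-made dataset; with a single feature, the function should reduce to y = mx + c:

#three samples of a single feature
X_toy = np.array([[1.0], [2.0], [3.0]])

#a slope of 2 and an intercept of 1, i.e. y = 2x + 1
a_toy = np.array([[2.0]])
print(get_pred_value(a_toy, 1.0, X_toy))   #the predictions come out as 3, 5 and 7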

Next, we need to calculate how far off we were from the original values of Y. In machine learning terms this is called calculating the cost. There are several different cost functions defined but for our case, we will use the good old “Mean-Squared Error”. The function for MSE is,

MSE = (1/n) * Σ (ŷ_i − y_i)²

The formula for calculating Mean-Squared Error

Let’s code this down:

def get_loss(y_hat,y):
    #getting length of our data
    n = len(y)
    
    #ensuring y is a column vector
    y = y.reshape(-1,1)
    
    #calculating the mean squared error
    diff = (y_hat-y)**2
    summed = diff.sum(0)
    
    return (1/n)*summed
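
A quick way to convince ourselves the function behaves sensibly (again, my own check, not from the tutorial): identical predictions should give a loss of zero, and predictions that are off by exactly 1 everywhere should give a mean squared error of 1.

y_true = np.array([[1.0], [2.0], [3.0]])

#perfect predictions -> MSE of 0
print(get_loss(y_true, y_true))

#every prediction off by exactly 1 -> MSE of 1
print(get_loss(y_true + 1.0, y_true))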

Our aim is to minimize the cost calculated by the above function, in order to come as close to the original values as possible.

For this, we need to see how each of our parameters affects the overall cost. This can be done by calculating the rate of change, also known as gradient, with respect to every parameter. The equations would be:

∂MSE/∂b = (2/n) * Σ (ŷ_i − y_i)
∂MSE/∂a_j = (2/n) * Σ (ŷ_i − y_i) * x_ij

Gradients with respect to the bias and coefficients

In order to minimize the cost, we carry out a process called “Gradient Descent”. Imagine the cost for a particular parameter plotted against that parameter’s value: to reach the minimum cost, we must keep changing the value of the parameter, and that is where our update equations come in.

The final equations to update the values of our parameters are as below,

b = b − α * (2/n) * Σ (ŷ_i − y_i)
a_j = a_j − α * (2/n) * Σ (ŷ_i − y_i) * x_ij

Update equations for the parameters involved

Alpha in the above equations is called the learning rate. Its value needs to be set carefully: too small a value updates the parameters very slowly, while too large a value can cause the updates to overshoot and never converge to the global minimum.

Let us code these update equations in a general fashion, so that any number of parameters can be updated.

def update_params(data,y,y_hat,coeff,b,alpha = 0.0001):
    #number of samples and a column-vector version of y
    n = data.shape[0]
    y = y.reshape(-1,1)
    
    #updating the bias with its update equation
    b_new = b - alpha*(2/n)*np.sum(y_hat-y)
    
    #running a loop to update every coefficient
    for i in range(len(coeff)):
        
        #gradient of the cost with respect to the i-th coefficient
        current_feat = data[:,i].reshape(len(data),-1)
        diff = y_hat-y
        dot = np.dot(current_feat.T,diff)
        eva = alpha*(2/n)*np.sum(dot)
        
        #updating the value of the coefficient
        coeff[i] = coeff[i] - eva
        
    return coeff, b_new
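
To see the update in action, here is a minimal check on toy data of my own (not from the tutorial): starting from a slope of zero on data generated from y = 2x, a single call should nudge the coefficient upwards, towards 2.

#toy data that follows y = 2x exactly
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy

#start with a slope of 0 and a bias of 0
a_toy = np.zeros((1,1))
y_hat_toy = get_pred_value(a_toy, 0.0, X_toy)

a_new, b_new = update_params(X_toy, y_toy, y_hat_toy, a_toy, 0.0, alpha = 0.01)
print(a_new, b_new)   #the slope moves above 0, in the direction of 2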

With all the functions in place, we just need to create a training routine that would call these functions in order to minimize the cost and find suitable parameters.

def fit(X_train,y_train,iterr = 1000): 
    #an initial, deliberately huge value of the loss
    loss = 10000
    
    #randomly initialising the coefficients
    a = np.random.rand(X_train.shape[1],1)

    #setting the initial bias to zero
    b = 0

    for i in range(iterr):
        
        #get predictions based on the current coefficients and bias
        y_hat = get_pred_value(a,b,X_train)
                                
        #calculate the loss for the above predicted values
        loss = get_loss(y_hat,y_train)
        print("Loss =======>", loss)
        
        #update the parameters to minimise the loss
        a, b = update_params(X_train,y_train,y_hat,a,b,alpha = 0.01)

    print("Fitting complete!!")
    print("Final Loss achieved:", loss)
    
    #returning the final coefficients, bias and predictions
    return a,b, y_hat
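
Before moving on to real data, a cheap synthetic sanity check (my own addition) confirms the routine converges: generate points from a known line, fit, and check that the recovered slope and intercept are close to the true ones.

#100 points scattered around y = 3x + 2
rng = np.random.default_rng(0)
X_syn = rng.random((100,1))
y_syn = 3*X_syn + 2 + 0.05*rng.standard_normal((100,1))

#the recovered parameters should land close to 3 and 2
a_syn, b_syn, _ = fit(X_syn, y_syn, iterr = 3000)
print(a_syn, b_syn)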

Testing

It is now time to test our algorithm. We will first test on a simpler dataset. The dataset can be downloaded for free from here. The filtering and visualizing steps of the data are carried out below.


#loading training data
data_train = pd.read_csv("Linear data/train.csv")
data_test = pd.read_csv("Linear data/test.csv")


#filtering the data
data_train = data_train.dropna()
data_test = data_test.dropna()

#Exploring the dataset
data_train.head(10)
Sample data values

Scatter plot of the training data

The data seems to have a clear linear relationship, so it will give us a good insight into how well our algorithm works.

We will normalize the data using the min-max scaling technique. Let’s define a function for min-max scaling. (There are several libraries that can do this for you, but it’s much more fun to code it ourselves.)

def min_max_scaler(df):
    
    temp = df
    cols = temp.keys()
    
    #iterating over each column
    for key in cols:
        
        #getting the max value
        max_val = temp[key].max()
        
        #getting the min value
        min_val = temp[key].min()
        
        #applying the formula
        denom = (max_val-min_val)
        temp[key] = (temp[key]-min_val)/denom
        
    return temp

data_scaled = min_max_scaler(data_train)
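
A tiny illustration of what the scaler does (a toy frame of my own): every column ends up spanning exactly 0 to 1.

toy = pd.DataFrame({"x": [10, 20, 30], "y": [5, 7, 9]})
print(min_max_scaler(toy))   #both columns become 0.0, 0.5, 1.0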

With normalization done, we can begin to fit data on our model.

X_train = data_scaled["x"].values.reshape(-1,1)
y_train = data_scaled["y"].values.reshape(-1,1)
coeff, intercept, preds = fit(X_train, y_train, iterr = 3000)
Final cost

We run the training for 3000 iterations. The final loss is quite low, so we can expect a good fit. Let’s visualize the results.

Fit on training data

Fit on test data

We can see that our model fits quite accurately on the training set and reasonably well on the test set. The results might improve if the model is trained for more iterations. (The experimentation is up to you 😀)
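
The test-set curve above, and the y_pred used for the R-squared score below, can be reproduced with the functions we already have. A minimal sketch, assuming the test file has the same x and y columns and is normalized with the same min_max_scaler:

#scaling the test data the same way as the training data
#note: min_max_scaler modifies the frame in place, so data_test["y"] used below is also scaled
data_test_scaled = min_max_scaler(data_test)
X_test = data_test_scaled["x"].values.reshape(-1,1)

#predictions on the unseen test points using the trained parameters
y_pred = get_pred_value(coeff, intercept, X_test)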

Another good measure to check our model is the “R-squared value”, also known as the “goodness of fit”. For this purpose, we’ll use the sklearn library (because sometimes you get tired of coding 😅).

from sklearn.metrics import r2_score  
print("R_square Score:",r2_score(data_test["y"],y_pred))
R-squared score

The score above shows a goodness of fit of about 92.6%, which is quite good.
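
For intuition, R-squared compares the model’s squared error against the error of always predicting the mean; a minimal hand-rolled version (equivalent to sklearn’s r2_score for our 1-D case) looks like this:

def r_squared(y_true, y_hat):
    #residual sum of squares of the model
    ss_res = np.sum((np.ravel(y_true) - np.ravel(y_hat))**2)
    
    #total sum of squares around the mean
    ss_tot = np.sum((np.ravel(y_true) - np.mean(y_true))**2)
    
    return 1 - ss_res/ss_tot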

Testing on n-dimensional data

Now let’s check whether our algorithm actually works on n-dimensional data. For this purpose, we use the housing prices dataset. (The original dataset has a lot of features, but we have selected only a few of them for this test.)

'''
The dataset is quite extensive so we select features that seem most relevant to keep things simple
'''
housing = pd.read_csv("train.csv", usecols = ["MSZoning", "LotArea", "Street", "LandContour", "Utilities", "HouseStyle", "OverallQual", "OverallCond","RoofStyle", "ExterQual", "Foundation", "GarageQual","MiscVal","SalePrice"])
'''
Most of the variables are categorical and represented in string form so we convert them into numerical labels since 
we have to work with numeric data

'''

housing["MSZoning"] = pd.factorize(housing["MSZoning"])[0]
housing["Street"] = pd.factorize(housing["Street"])[0]
housing["LandContour"] = pd.factorize(housing["LandContour"])[0]
housing["Utilities"] = pd.factorize(housing["Utilities"])[0]
housing["HouseStyle"] = pd.factorize(housing["HouseStyle"])[0]
housing["RoofStyle"] = pd.factorize(housing["RoofStyle"])[0]
housing["ExterQual"] = pd.factorize(housing["ExterQual"])[0]
housing["Foundation"] = pd.factorize(housing["Foundation"])[0]
housing["GarageQual"] = pd.factorize(housing["GarageQual"])[0]

Since most of the features were string types, we convert them into numerical labels.
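
If you have not used pd.factorize before, it simply maps each distinct string to an integer code in the order it first appears; a tiny example:

codes, uniques = pd.factorize(pd.Series(["RL", "RM", "RL", "FV"]))
print(codes)     #[0 1 0 2]
print(uniques)   #Index(['RL', 'RM', 'FV'], dtype='object')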

Housing prices dataset

We normalize the dataset just as before and set it to train for 2000 epochs.
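
The preprocessing code for this part is not shown above, so here is a minimal sketch of how the scaled features and target could be split out; the 80/20 train/test split via sklearn is my own assumption:

from sklearn.model_selection import train_test_split

#min-max scaling the housing data, just like the simpler dataset
housing_scaled = min_max_scaler(housing)

#everything except SalePrice is a feature; SalePrice is the target
X = housing_scaled.drop(columns = ["SalePrice"]).values
y = housing_scaled["SalePrice"].values.reshape(-1,1)

#holding back 20% of the rows to test on
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)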

weights, biases, y_predicted = fit(X_train,y_train, iterr = 2000)
The final cost for the housing data

The final cost is again quite low, as we would expect. Now let’s check the “R-squared value”.
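
As before, we first need predictions on the held-out rows (assuming the X_test and y_test from the split sketched earlier):

#predictions on the test rows using the trained weights and bias
predicted = get_pred_value(weights, biases, X_test)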

from sklearn.metrics import r2_score
print(r2_score(y_test,predicted))
A 27.9% goodness of fit

The goodness of fit is quite low, and there can be multiple reasons for this. Running the algorithm for more iterations might increase the score, and as we increase the number of input features, the data becomes more complex, so a purely linear relationship is less likely to hold.

Conclusion

We have successfully implemented the Linear Regression algorithm using Python and achieved a reasonably good fit on our initial data set. We did not get a very good score on our second (multi-variable) dataset because it is very difficult for such a complex dataset to follow a linear trend. For datasets like these, there are algorithms such as “Polynomial Regression” or SVM. These algorithms are a little more complex, but they achieve much better scores.
