PCA implementation in Python

PCA, or Principal Component Analysis, is a long-established Machine Learning technique whose main use is dimensionality reduction. It is a mathematical technique that lets you engineer new features from a given dataset: there are fewer of them than the original features, yet they still represent the original data well. These reduced features can then be passed to a Machine Learning model, which can still give reasonable results at drastically reduced complexity.

That first paragraph packs in a lot of big terms. Don’t worry if you do not understand much right now, because we will be seeing PCA in action step by step.

What is PCA?

Let’s break down the term itself:

Principal: Reflecting importance

Component: A part of something

Analysis: Analyzing something

So together it means to find or analyze the most important parts of some entity. In Machine Learning the entity is data and the job of PCA is to extract the most important features from that data.

Extracting principal components from data

One clarification is needed here. PCA does not simply drop columns from the data, as many people falsely believe. It creates new features that are linear combinations of the original ones, chosen so that the result is a very close (if not exact) representation of the original data.

Why do we need PCA in Machine Learning?

PCA is used to counter problems that occur with high-dimensional data, also known as the curse of dimensionality. Dimensions refer to the number of features in every example of your dataset. Take the example of the dummy data below.

Tabular data with 3 columns

The columns age, weight and gender are the input features of the data. This means that our data has 3 input features. Now say we have data for 100 people, so our total input data points become 100 × 3 = 300. Now let’s say we add another feature to the dataset, Exercises. Our total data points increase to 100 × 4 = 400. We just increased our dataset by 100 points by adding a single feature. This becomes a real problem when datasets grow into the thousands.

Data with 4 columns

Not only does this increase the training time on this data, it also reduces the probability that the data covers all possible combinations found in the real world.
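A rough way to see the coverage problem is to count combinations. The sketch below assumes, purely for illustration, that every feature is split into 10 coarse bins; the bin count and the helper name max_coverage are hypothetical and not part of the dataset above.

```python
# Hypothetical illustration: assume each feature is split into 10 coarse bins.
n_samples = 100
bins_per_feature = 10

def max_coverage(n_features):
    """Upper bound on the fraction of feature-combination cells n_samples can occupy."""
    n_cells = bins_per_feature ** n_features
    return min(1.0, n_samples / n_cells)

for n_features in (1, 2, 3, 4):
    total_values = n_samples * n_features  # the 100 x features count from the text
    print(n_features, total_values, max_coverage(n_features))
```

Under these assumptions, 100 examples can at best touch a tenth of the possible combinations with 3 features and a hundredth with 4, even though the number of stored values only grew from 300 to 400.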

Using PCA we can reduce the dimensions of this data while conserving most of the information carried by the complete dataset.

PCA Implementation (Step by step)

In order to find the principal components from a given dataset, the following steps are carried out:

  • Normalize the original dataset using z-score normalization.
  • Calculate the covariance matrix of the normalized features.
  • Calculate Eigenvalues and Eigenvectors for the Covariance Matrix.
  • Choose top N Eigenvectors as your Principal components based on their Eigenvalues.
  • Use the Eigenvectors to transform the original data to a new (lower dimension) space.
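Before going through each step on the real dataset, the five steps above can be sketched in one short function. This is only an illustrative outline on synthetic data: the function name pca_sketch, the random data, and the use of np.linalg.eigh (NumPy's routine for symmetric matrices) are my assumptions, not the code used in the rest of this article.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Illustrative PCA following the five steps above (a sketch, not the article's code)."""
    # Step 1: z-score normalisation of every column
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the normalised features
    cov = Z.T @ Z / (len(Z) - 1)
    # Step 3: eigen-decomposition; eigh is intended for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Step 5: project the normalised data onto the chosen eigenvectors
    return Z @ components, eigenvalues[order]

rng = np.random.default_rng(0)       # synthetic data, purely for illustration
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * X[:, 3]    # make one column nearly redundant
reduced, kept_eigenvalues = pca_sketch(X, n_components=2)
print(reduced.shape)                 # (200, 2)
```

A useful sanity check on a sketch like this is that the variance of each new column equals the eigenvalue of the eigenvector that produced it.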

Let’s do the Python Implementation on a real dataset.

The dataset used can be found on Kaggle here.

Load the most important libraries for data analysis.

If you would like to learn more about data analysis, click here.

import pandas as pd
import numpy as np

…and now load the dataset.

data = pd.read_csv("archive/bodyPerformance.csv")
data.head()
Complete Dataset to be used for PCA calculation

We only need the numerical input variables, so we will drop the output ‘class’ and the categorical variable ‘gender’.

data.drop(columns = ["gender", "class"], inplace = True)
data.head()
Filtered dataset

Now for Step 1

Data Normalization (Z-score Normalisation)

Data Normalization for PCA

Data normalization here means shifting each feature so that its mean is 0 and rescaling it so that its standard deviation is 1. These two steps are carried out for each feature (column) in the data.

Z-score Normalization formula
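Written out, the z-score of a single value can be expressed as below. The symbols are my own notation: x is a value in a column, μ is that column's mean and σ is its standard deviation (pandas' std() uses the sample standard deviation by default).

```latex
z = \frac{x - \mu}{\sigma}
```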
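def data_normalisation(scaled_data):
    #note: the columns are overwritten in place, so the DataFrame passed in is modified too
    for col in scaled_data.columns: #iterate over each column
        scaled_data[col] = (scaled_data[col]-scaled_data[col].mean())/scaled_data[col].std() #z-score: subtract the mean, divide by the standard deviation
    
    return scaled_data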

This returns a scaled dataset as follows.

scaled_data = data_normalisation(data)
print(scaled_data)
Normalized data

Covariance calculation

Covariance is a measure of how one variable changes with respect to another. For example, if we have two variables X and Y, the covariance between them tells us how one changes with respect to the other. Covariance is a real number that can take any value (negative or positive). What is easiest to interpret is not its magnitude but its sign.

Data correlation
  • Negative covariance means that the two variables tend to move in opposite directions, i.e. when one increases the other tends to decrease.
  • Positive covariance means that they tend to move in the same direction, i.e. when one increases the other tends to increase as well.
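The two cases above can be checked on a toy example. The variable names and values below are made up for illustration only.

```python
import numpy as np

# Toy values chosen only for illustration
hours_studied = np.array([1.0, 2.0, 3.0, 4.0])
exam_score = np.array([2.0, 4.0, 6.0, 8.0])    # rises with hours_studied
hours_slept = np.array([8.0, 6.0, 4.0, 2.0])   # falls as hours_studied rises

# np.cov returns a 2x2 matrix; the off-diagonal entry is the covariance
cov_same = np.cov(hours_studied, exam_score)[0, 1]
cov_opposite = np.cov(hours_studied, hours_slept)[0, 1]

print(cov_same > 0, cov_opposite < 0)  # True True
```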

We have the following formula to calculate covariance.

Covariance formula for 2 variables X and Y
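For reference, the sample covariance can be written as follows. The notation is mine: n is the number of examples, and x̄ and ȳ are the means of X and Y. The n − 1 divisor is chosen to match the code in this article.

```latex
\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```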

And this is how we code it in Python.

def covariance_calculation(mean_subtracted):
    #calculate covariance amongst scaled values
    #subtract each column's mean (this changes nothing if the data is already normalised)
    centred = mean_subtracted - mean_subtracted.mean()
        
    #vectorised covariance: (X^T . X) / (n - 1)
    return np.dot(centred.T,centred)/(len(centred) - 1)

Note that we have used vectorization to calculate the covariance matrix in the return statement above. This is a Python pro tip: whenever you have to perform mathematical operations on a large array of numbers, check whether you can do it via vector/matrix operations, which are usually much faster than explicit loops.
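To make the comparison concrete, here is a small sketch that computes the same covariance matrix with explicit loops and with one matrix product, then checks both against NumPy's own np.cov. The random data is hypothetical and only there to have something to compute on.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))     # hypothetical data: 50 rows, 3 features
centred = X - X.mean(axis=0)
n_rows, n_cols = centred.shape

# Loop version: one covariance entry at a time
loop_cov = np.zeros((n_cols, n_cols))
for i in range(n_cols):
    for j in range(n_cols):
        products = (centred[k, i] * centred[k, j] for k in range(n_rows))
        loop_cov[i, j] = sum(products) / (n_rows - 1)

# Vectorised version: a single matrix product
vector_cov = np.dot(centred.T, centred) / (n_rows - 1)

print(np.allclose(loop_cov, vector_cov))                 # True
print(np.allclose(vector_cov, np.cov(X, rowvar=False)))  # True
```

Passing rowvar=False tells np.cov that each column, not each row, is a variable, which matches the layout used here.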

Eigenvector Calculation

Eigenvalues and eigenvectors are a very important part of linear algebra; however, it is far beyond the scope of this article to explain what they are and how they work. If you want to understand them in depth, you can find several helpful resources on the internet.

The simplest explanation of an eigenvector is that it is a vector in an N-dimensional space such that a particular matrix transformation (say A) does not cause any rotation of this vector. This eigenvector is specific to the transformation A; a different transformation will have different eigenvectors corresponding to it in the same N-dimensional space.

Even though no rotation occurs, this eigenvector is scaled by a certain factor and that factor is called the eigenvalue for that vector.
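This property is easy to check numerically. The sketch below uses a small 2×2 matrix that I picked for illustration and confirms that applying the matrix to each eigenvector gives the same vector scaled by its eigenvalue.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # small symmetric matrix chosen for illustration

eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    vector = eigenvectors[:, i]   # eigenvectors are stored as columns
    transformed = A @ vector
    # the transformed vector equals the eigenvalue times the original vector
    print(np.allclose(transformed, eigenvalues[i] * vector))  # True
```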

As I mentioned before, the in-depth concepts behind eigenvectors and eigenvalues are beyond the scope of this article, so we will use the eig function from numpy.linalg to calculate both of them.

from numpy.linalg import eig

cov_calc = covariance_calculation(scaled_data) #covariance matrix of the normalised data
w,v=eig(cov_calc)

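w contains the eigenvalues and v contains the corresponding eigenvectors, stored as columns (v[:, i] belongs to w[i]). Note that eig does not guarantee that the eigenvalues come back in any particular order.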

The importance of each eigenvector is given by its corresponding eigenvalue. A higher eigenvalue means the eigenvector captures a greater spread (variance) of the data, which is what we want, so let’s plot the eigenvalues and see the trend.

import matplotlib.pyplot as plt
plt.bar(["e" + str(i+1) for i in range(len(w))], w) #plotting the eigenvalues
plt.title("Eigenvalues")
plt.xlabel("Eigenvector")
plt.ylabel("Magnitude of the eigenvalue")
plt.show()
Plot of the eigenvalues calculated

Well, it looks like the first 4 or 5 eigenvectors (e1 through e5) should be more than enough to give us a representation of our data.
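Instead of judging the bar chart by eye, one common approach is to look at the share of the total variance each eigenvalue explains. The sketch below uses made-up eigenvalues, not the ones computed from this dataset, and the 85% threshold is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical eigenvalues, NOT the ones from the dataset above
example_eigenvalues = np.array([1.5, 4.0, 0.4, 2.5, 1.0, 0.6])

# Sort from largest to smallest before accumulating
sorted_eigenvalues = np.sort(example_eigenvalues)[::-1]
explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)

threshold = 0.85  # arbitrary target share of the total variance
n_components = int(np.searchsorted(cumulative_ratio, threshold) + 1)
print(n_components)  # 4
```

With the real data you would pass in w in place of example_eigenvalues and pick a threshold that suits your problem.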

Final calculation

The final step to complete PCA

Now we can use these eigenvectors to transform our original data into a 5-dimensional space (basically reduce it to 5 dimensions). We can do that simply by taking the dot product between the normalized dataset and these eigenvectors.

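order = np.argsort(w)[::-1] #indices of the eigenvalues from largest to smallest, since eig does not sort them
useful_pc = v[:,order[:5]] #keeping only the eigenvectors of the 5 largest eigenvalues
principle_data = np.dot(scaled_data.values, useful_pc) #dot product of the normalised data and the eigenvectors
print(principle_data)
print("Shape of the New data is:", principle_data.shape)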
Original data transformed into a 5-dimensional space

and done!!

And that’s how we transformed the original dataset (with 10 features) to something much smaller that still represents the original data.
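One way to check how well the smaller dataset represents the original is to project it back and measure what was lost. The sketch below does this on synthetic data built from two hidden factors; the data, the variable names and the use of np.linalg.eigh are my assumptions for illustration, so the numbers say nothing about the body performance dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))                      # two hidden factors
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))  # six correlated features
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)        # z-score normalisation

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]                 # keep the top 2 eigenvectors

reduced = Z @ components                                # 300 x 2
reconstructed = reduced @ components.T                  # back to 300 x 6

relative_error = np.linalg.norm(Z - reconstructed) / np.linalg.norm(Z)
retained = eigenvalues[order[:2]].sum() / eigenvalues.sum()
print(relative_error, retained)
```

The squared relative error equals one minus the retained share of the eigenvalues, which is why the eigenvalue plot is a reasonable guide to how many components to keep.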
