PCA, or Principal Component Analysis, is a long-established Machine Learning technique whose main use is dimensionality reduction. It is a mathematical technique that lets you engineer new features from a given dataset: there are fewer of them than the original features, but they still represent the original features well. These reduced features can be passed to a Machine Learning model and still give reasonable results while considerably reducing complexity.
There are a lot of big terms in that first paragraph. Don’t worry if you do not understand much right now, because we will be seeing PCA in action step by step.
What is PCA?
Let’s break down the term itself:
Principal: Reflecting importance
Component: A part of something
Analysis: Analyzing something
So together it means to find or analyze the most important parts of some entity. In Machine Learning the entity is data and the job of PCA is to extract the most important features from that data.
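One clarification is needed here. PCA does not simply drop columns from your data, as many people believe. Instead, it builds new features as linear combinations of the original features. If you keep all of the new features, they represent the original data exactly; if you keep only the most important ones, they give a close approximation, and what is lost lies in the directions where the data varies least.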
Why do we need PCA in Machine Learning?
PCA is used to counter problems that occur with high-dimensional data, also known as the curse of dimensionality. Dimensions refer to the number of features in every example of your dataset. Take the example of the dummy data below.
The columns Age, Weight, and Gender are the input features of the data. This means that our data has 3 input features. Now say we have data for 100 people, so our input holds 100 x 3 = 300 values. Now let’s say we add another feature to the dataset, Exercises. The total increases to 100 x 4 = 400 values. We just grew our dataset by 100 values by adding a single feature. This becomes a real problem when datasets grow to thousands of examples or features.
Not only does this increase the training time, but it also reduces the chance that the data covers all the combinations that occur in the real world.
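To put a rough number on that coverage problem, here is a small sketch. It assumes, purely for illustration, that every feature is split into 10 bins, so the data can fall into 10^d distinct cells for d features. The names `max_coverage` and `bins_per_feature` are hypothetical and not part of any library.

```python
def max_coverage(n_samples, n_features, bins_per_feature=10):
    """Upper bound on the fraction of feature-space cells that n_samples can occupy."""
    n_cells = bins_per_feature ** n_features  # cells grow exponentially with features
    return min(1.0, n_samples / n_cells)      # each sample fills at most one cell


for d in (1, 2, 3, 4):
    print(f"{d} features: at most {max_coverage(100, d):.2%} of cells covered")
```

Under this assumption, 100 people can cover at most 10% of the cells with three features and at most 1% with four, which is why every extra feature makes the data sparser.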
Using PCA we can reduce the dimensions of this data while conserving most of the information carried by the complete dataset.
PCA Implementation (Step by step)
In order to find the principal components from a given dataset, the following steps are carried out:
- Normalize the original dataset using z-score normalization.
- Calculate a covariance matrix between the normalized data points.
- Calculate Eigenvalues and Eigenvectors for the Covariance Matrix.
- Choose top N Eigenvectors as your Principal components based on their Eigenvalues.
- Use the Eigenvectors to transform the original data to a new (lower dimension) space.
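Before going through each step in detail, the five steps above can be sketched in one compact NumPy function. This is only an illustrative sketch on synthetic data, not the implementation built later in this article: the name `pca_sketch` is hypothetical, and it uses `np.linalg.eigh` (intended for symmetric matrices such as a covariance matrix) rather than `eig`.

```python
import numpy as np


def pca_sketch(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    # Step 1: z-score normalization (sample standard deviation, ddof=1)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the normalized features
    cov = Z.T @ Z / (len(Z) - 1)
    # Step 3: eigenvalues and eigenvectors of the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort by eigenvalue, largest first, and keep the top n_components
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Step 5: transform the normalized data into the lower-dimensional space
    return Z @ components, eigenvalues[order]


rng = np.random.default_rng(42)
toy = rng.normal(size=(50, 4))  # synthetic stand-in data, not the Kaggle dataset
projected, top_eigenvalues = pca_sketch(toy, n_components=2)
print(projected.shape)
```

The variance of each projected column equals the eigenvalue of the eigenvector it was projected onto, which is why the eigenvalues can be used to rank the components.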
Let’s do the Python Implementation on a real dataset.
The dataset used can be found on Kaggle here.
Load the most important libraries for data analysis.
If you would like to learn more about data analysis, click here.
    import pandas as pd
    import numpy as np
…and now load the dataset.
    data = pd.read_csv("archive/bodyPerformance.csv")
    data.head()
We only need the numerical input variables, so we will drop the output ‘class’ and the categorical variable ‘gender’.
    data.drop(columns=["gender", "class"], inplace=True)
    data.head()
Now for Step 1
Data Normalization (Z-score Normalisation)
Z-score normalization rescales the data so that its mean becomes 0 and its standard deviation becomes 1: subtract the mean, then divide by the standard deviation. These two steps are carried out for each feature (column) in the data.
    def data_normalisation(scaled_data):
        # note: this modifies the DataFrame that is passed in, column by column
        for col in scaled_data.columns:  # iterate over each column
            scaled_data[col] = (scaled_data[col] - scaled_data[col].mean()) / scaled_data[col].std()  # data normalisation
        return scaled_data
This returns a scaled dataset as follows.
    scaled_data = data_normalisation(data)
    print(scaled_data)
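To see what the normalization does on numbers small enough to check by hand, here is a NumPy-only sketch on a made-up 3 x 2 array (the values are invented for illustration). One detail worth knowing: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `.std()` defaults to `ddof=0`, so the sketch passes `ddof=1` explicitly to match the function above.

```python
import numpy as np

# Made-up data: 3 examples, 2 features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# ddof=1 gives the sample standard deviation, matching pandas' default for .std()
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(z)  # each column becomes -1, 0, 1
```

Both columns end up identical after scaling, even though the second feature was ten times larger than the first. This is the point of the step: no feature dominates just because of its units.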
Covariance measures how two variables change together. For example, if we have two variables X and Y, the covariance between them tells us how one tends to change as the other changes. Covariance is a real number that can take any value (negative or positive). Its magnitude depends on the units of the variables, so it is mainly the sign that is easy to interpret.
- Negative covariance means that the two variables tend to move in opposite directions, i.e. when one increases, the other tends to decrease.
- Positive covariance means that they tend to move in the same direction, i.e. when one increases, the other tends to increase too.
We have the following formula to calculate covariance.
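In its usual sample form it reads as follows. Which variant is meant is an assumption here, chosen because its n - 1 divisor matches the one used in the code in this article:

```latex
\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

Here n is the number of examples, x_i and y_i are the values of the two features for example i, and \bar{x} and \bar{y} are the feature means. Computing this for every pair of features gives the covariance matrix.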
And this is how we code it in Python.
    def covariance_calculation(mean_subtracted):
        # calculate covariance amongst scaled values
        # subtract each column's mean without modifying the input DataFrame
        centered = (mean_subtracted - mean_subtracted.mean()).values
        return np.dot(centered.T, centered) / (len(centered) - 1)
Note that we have used vectorization to calculate the covariance matrix in the return statement above. This is a Python pro tip: whenever you have to perform mathematical operations on a large array of numbers, always check whether you can do it via vector/matrix operations. It is usually much faster than looping in Python.
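As a sanity check on the vectorization tip, the sketch below computes the same covariance matrix with explicit loops and with a single matrix product, then compares both against NumPy's built-in `np.cov`. The function names and the random data are made up for illustration.

```python
import numpy as np


def covariance_loops(X):
    """Covariance matrix computed entry by entry with Python loops."""
    n, d = X.shape
    means = X.mean(axis=0)
    cov = np.zeros((d, d))
    for j in range(d):
        for k in range(d):
            total = 0.0
            for i in range(n):
                total += (X[i, j] - means[j]) * (X[i, k] - means[k])
            cov[j, k] = total / (n - 1)
    return cov


def covariance_vectorized(X):
    """Same computation as one matrix product."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered / (len(X) - 1)


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))  # synthetic data
print(np.allclose(covariance_loops(X), covariance_vectorized(X)))
print(np.allclose(covariance_vectorized(X), np.cov(X, rowvar=False)))
```

Both versions give the same matrix, but the vectorized one replaces three nested Python loops with one call into NumPy's compiled code.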
Eigenvalues and eigenvectors are a very important part of linear algebra; however, it is far beyond the scope of this article to explain what they are and how they work. If you want to understand them in depth, you can find several helpful resources on the internet.
The simplest explanation of an eigenvector is that it is a vector in an N-dimensional space that a particular matrix transformation (say A) does not rotate off its own line: after the transformation it still lies along the same line through the origin. This eigenvector is specific to the transformation A; another transformation B will have its own, generally different, eigenvectors in the same N-dimensional space.
Even though no rotation occurs, this eigenvector is scaled by a certain factor and that factor is called the eigenvalue for that vector.
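This defining property, A v = λ v, is easy to verify numerically. The sketch below uses a small made-up symmetric matrix and checks that applying the matrix to each eigenvector only scales it by its eigenvalue.

```python
import numpy as np

# Made-up 2 x 2 symmetric matrix, chosen only for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]  # eigenvectors are stored as columns
    lam = eigenvalues[i]
    # A @ v points along the same line as v, stretched by the factor lam
    print(lam, np.allclose(A @ v, lam * v))
```

For this matrix the eigenvalues are 1 and 3. Note that the eigenvectors come back as the columns of the returned array, not its rows.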
As I mentioned before, the in-depth concepts behind eigenvectors and eigenvalues are beyond the scope of this article, so we will use the eig function from numpy.linalg to calculate both of them.
    from numpy.linalg import eig

    cov_calc = covariance_calculation(scaled_data)  # covariance matrix of the normalized data
    w, v = eig(cov_calc)
w contains the eigenvalues and the columns of v contain the corresponding eigenvectors.
The importance of each eigenvector is given by its corresponding eigenvalue. A higher eigenvalue means the eigenvector captures a greater spread of the data, which is what we want, so let’s plot the eigenvalues and see the trend.
    import matplotlib.pyplot as plt

    w_sorted = np.sort(w)[::-1]  # largest eigenvalue first
    plt.bar(["e" + str(i+1) for i in range(len(w_sorted))], w_sorted)  # plotting the eigenvalues
    plt.title("Eigenvalues")
    plt.xlabel("Eigenvector")
    plt.ylabel("Magnitude of the eigenvalue")
    plt.show()
Well, it looks like the first 4 or 5 eigenvectors (e1 through e5) should be more than enough to give us a good representation of our data.
Now we can use them to transform our normalized data into a 5-dimensional space (basically reduce it to 5 dimensions). We can do that simply by taking the dot product between the normalized dataset and these eigenvectors.
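A common way to back up a visual judgment like this is to look at the cumulative share of the total variance that the largest eigenvalues account for. The sketch below shows the calculation on made-up eigenvalues, not the ones from the dataset above; `cumulative_explained_variance` is a hypothetical helper name.

```python
import numpy as np


def cumulative_explained_variance(eigenvalues):
    """Cumulative fraction of total variance, counting the largest eigenvalues first."""
    ordered = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return np.cumsum(ordered) / ordered.sum()


example_eigenvalues = [1.0, 4.0, 2.0, 1.0]  # made-up values for illustration
print(cumulative_explained_variance(example_eigenvalues))
# cumulative shares: 0.5, 0.75, 0.875, 1.0
```

With these example values, the top two components already account for 75% of the variance. Applying the same function to `w` would show how much of the spread the first 4 or 5 eigenvectors capture.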
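    order = np.argsort(w)[::-1]  # eig does not guarantee sorted eigenvalues, so sort largest first
    useful_pc = v[:, order[:5]]  # keeping only the 5 eigenvectors with the largest eigenvalues
    principal_data = np.dot(scaled_data.values, useful_pc)  # dot product of the two terms
    print(principal_data)
    print("Shape of the New data is:", principal_data.shape)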
And that’s how we transformed the original dataset (with 10 features) to something much smaller that still represents the original data.
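One way to check how well the smaller data still represents the original is to map it back to the full feature space and measure what was lost. The sketch below does this on synthetic data (not the Kaggle dataset) in which two of the three features are strongly related. The helper `reconstruction_error` is hypothetical, and `np.linalg.eigh` is used instead of `eig` because the covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: two strongly related features plus one independent feature
shared = rng.normal(size=(200, 1))
X = np.hstack([shared + 0.1 * rng.normal(size=(200, 1)),
               2 * shared + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # z-score normalization

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
eigenvectors = eigenvectors[:, np.argsort(eigenvalues)[::-1]]  # largest eigenvalue first


def reconstruction_error(Z, eigenvectors, k):
    """Mean squared error after projecting onto k components and mapping back."""
    W = eigenvectors[:, :k]
    return np.mean((Z - Z @ W @ W.T) ** 2)


for k in (1, 2, 3):
    print(k, reconstruction_error(Z, eigenvectors, k))
```

Keeping all three components reconstructs the data exactly (up to floating-point error), while keeping two loses only a little because the third direction carries almost no variance.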