
PCA, or *Principal Component Analysis*, is a long-established Machine Learning technique whose main use is *dimensionality reduction*. It is a mathematical technique that lets you engineer new features from a given dataset: there are fewer of them than the original features, yet they still represent most of the information in the original features. These reduced features can then be passed to a Machine Learning model, which can still get reasonable results at a much lower complexity.

There are a lot of big terms in the first paragraph. Don’t worry if you do not understand much right now, because we will be seeing PCA in action step-by-step.

## What is PCA?

Let’s break down the term itself:

**Principal**: *Reflecting importance*

**Component**: *A part of something*

**Analysis**: *Analyzing something*

So together it means to **find or analyze the most important parts of some entity**. In Machine Learning the entity is data and the job of PCA is to extract the most important features from that data.

One clarification is needed here: PCA does not simply drop columns from the data, as many people believe. Each new feature is a linear combination of all the original features, so the resulting data is a very close (*if not exact*) representation of the original data. Information is lost only when you keep fewer components than there were original features.

## Why do we need PCA in Machine Learning?

PCA is used to counter problems that occur with high-dimensional data, also known as the *curse of dimensionality*. **Dimensions** refer to the number of features in every example of your dataset. Take the example of the dummy data below.

The columns **Age**, **weight** and **gender** are the input features of the data. This means that our data has 3 input features. Now say we have data for 100 people, so our total number of input values becomes *100 x 3 = 300*. Now let’s say we add another feature to the dataset, **Exercises**. Our total increases to *100 x 4 = 400*. We increased the dataset by 100 values just by adding one feature. This becomes a real problem when datasets grow into the thousands.

Not only does this increase the training time, it also makes it less likely that the data covers all the combinations of feature values that occur in the real world.

Using PCA we can reduce the dimensions of this data all the while conserving the information depicted by the complete dataset.

**PCA Implementation (Step by step)**

In order to find the principal components from a given dataset, the following steps are carried out:

1. **Normalize** the original dataset using z-score normalization.
2. Calculate a **covariance matrix** between the **normalized data points**.
3. Calculate **Eigenvalues** and **Eigenvectors** for the **Covariance Matrix**.
4. Choose the **top N Eigenvectors** as your **Principal components** based on their **Eigenvalues**.
5. Use the **Eigenvectors** to transform the original data to a new (**lower dimension**) space.
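Before working through each step on the real dataset, the five steps can be sketched end to end on a small synthetic array. This is only a minimal sketch, not the implementation used later in this article: the helper name `pca_sketch`, the random data and the choice of `n_components=2` are assumptions made for illustration, and it uses NumPy's `eigh` (intended for symmetric matrices) rather than `eig`.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Hypothetical helper: project X (samples x features) onto its top principal components."""
    # Step 1: z-score normalisation of every column
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the normalised features
    cov = Z.T @ Z / (len(Z) - 1)
    # Step 3: eigen-decomposition of the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Step 5: project the normalised data onto those eigenvectors
    return Z @ components, eigenvalues[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # synthetic data: 100 samples, 4 features
projected, top_eigenvalues = pca_sketch(X, n_components=2)
print(projected.shape)          # (100, 2)
```

The variance of each projected column equals the eigenvalue of the eigenvector it was projected onto, which is why the eigenvalues are used to rank the components.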

Let’s do the Python Implementation on a real dataset.

The dataset used can be found on Kaggle here.

Load the most important libraries for data analysis.

If you would like to know more about data analysis, click here.

```
import pandas as pd
import numpy as np
```

…and now load the dataset.

```
data = pd.read_csv("archive/bodyPerformance.csv")
data.head()
```

We only need the numerical input variables, so we will drop the output ‘**class**’ and the categorical variable ‘**gender**’.

```
data.drop(columns = ["gender", "class"], inplace = True)
data.head()
```


Now for *Step 1*

### Data Normalization (Z-score Normalisation)

Z-score normalization (also called standardization) rescales the data so that its mean becomes 0 and its standard deviation becomes 1. This is done separately for each feature (column) in the data.

```
def data_normalisation(scaled_data):
    #note: the DataFrame passed in is modified in place and also returned
    for col in scaled_data.columns: #iterate over each column
        scaled_data[col] = (scaled_data[col]-scaled_data[col].mean())/scaled_data[col].std() #z-score: subtract the mean, divide by the std
    return scaled_data
```

This returns a scaled dataset as follows.

```
scaled_data = data_normalisation(data)
print(scaled_data)
```
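A quick way to convince yourself that the normalisation worked is to check the mean and standard deviation of every column afterwards. The sketch below does this on a tiny made-up frame (the column names and values are assumptions, not rows from the Kaggle dataset) and applies the same formula to all columns at once.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0],
                    "weight_kg": [55.0, 70.0, 65.0, 90.0]})

# Same z-score formula as above, applied column-wise without an explicit loop
scaled_toy = (toy - toy.mean()) / toy.std()

print(scaled_toy.mean().round(6))   # every column mean is (numerically) 0
print(scaled_toy.std().round(6))    # every column standard deviation is 1
```

Note that pandas' `std()` uses the sample standard deviation (dividing by n - 1) by default, which matches the n - 1 divisor used for the covariance matrix in the next step.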

### Covariance calculation

Covariance measures how two variables change together. For example, if we have two variables X and Y, then the covariance between them tells us how one changes with respect to the other. Covariance is a real number that can take any value (negative or positive). Its sign tells us the direction of the relationship; its magnitude depends on the units of the variables and is harder to interpret on its own.

- **Negative covariance** means that the two variables tend to move in opposite directions: when one increases, the other tends to decrease.
- **Positive covariance** means that they tend to move in the same direction: when one increases, the other tends to increase as well.
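To make the sign concrete, here is a small sketch with made-up numbers (the variable names and values are purely illustrative). It uses NumPy's `np.cov`, which returns a 2x2 matrix for two variables; the off-diagonal entry is the covariance between them.

```python
import numpy as np

# Made-up illustrative values
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exam_score = np.array([52.0, 58.0, 65.0, 71.0, 80.0])   # rises as hours rise
hours_gaming = np.array([9.0, 7.0, 6.0, 4.0, 2.0])      # falls as hours rise

# np.cov returns a 2x2 matrix; entry [0, 1] is cov(x, y)
cov_pos = np.cov(hours_studied, exam_score)[0, 1]
cov_neg = np.cov(hours_studied, hours_gaming)[0, 1]
print(cov_pos > 0, cov_neg < 0)   # True True
```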

We have the following formula to calculate covariance.
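For two features $X$ and $Y$ observed on $n$ samples, the sample covariance, written here in the form that matches the $n - 1$ divisor used in the code in this section, is:

$$
\operatorname{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)
$$

where $\bar{x}$ and $\bar{y}$ are the column means. Stacking all mean-subtracted features into a matrix $Z$ (one row per sample), the full covariance matrix is $C = \frac{1}{n-1} Z^{\top} Z$.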

And this is how we code it in Python.

```
def covariance_calculation(mean_subtracted):
    #calculate covariance amongst scaled values
    for col in mean_subtracted.columns:
        mean_subtracted[col] = mean_subtracted[col]-mean_subtracted[col].mean() #centre each column on 0
    return np.dot(mean_subtracted.T,mean_subtracted)/(len(mean_subtracted) - 1) #(X^T X)/(n-1)
```

Note that we have used *vectorization* to calculate the covariance matrix in the return statement above. This is a Python pro tip: whenever you have to perform mathematical operations on a large array of numbers, always check whether you can do it via *vector/matrix operations*. In NumPy this is usually much faster than using Python loops.
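As a sanity check, the vectorized dot-product formula can be compared with NumPy's built-in `np.cov`. The sketch below uses synthetic random data (an assumption for illustration) and passes `rowvar=False` because, as in this article, each column is a feature and each row is a sample.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))            # synthetic data: 200 samples, 3 features

centered = X - X.mean(axis=0)            # subtract each column's mean
cov_manual = centered.T @ centered / (len(X) - 1)   # vectorized formula used above
cov_numpy = np.cov(X, rowvar=False)                 # rowvar=False: columns are features

print(np.allclose(cov_manual, cov_numpy))   # True
```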

### Eigenvector Calculation

Eigenvalues and eigenvectors are a very important part of linear algebra; however, it is far beyond the scope of this article to explain what they are and how they work. If you want to understand them in-depth, you can find several helpful resources on the internet.


The simplest explanation of an **eigenvector** is that it is a vector in an N-dimensional space such that a particular matrix transformation (say *A*) does not cause any rotation to this particular vector. This **eigenvector** is particular to the transformation *A*; if there is another transformation *B*, it will have different eigenvectors corresponding to it in the same N-dimensional space.

Even though no rotation occurs, this eigenvector is scaled by a certain factor and that factor is called the **eigenvalue** for that vector.
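The "scaled but not rotated" property can be checked numerically: for every eigenvector v of a matrix A with eigenvalue λ, multiplying by A gives the same result as multiplying by λ. The small symmetric matrix below is an arbitrary example chosen for illustration.

```python
import numpy as np

# Arbitrary 2x2 symmetric matrix used only for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)   # eigenvectors are stored as columns

for i in range(len(eigenvalues)):
    v_i = eigenvectors[:, i]
    # A @ v_i points along the same line as v_i, scaled by the eigenvalue
    print(np.allclose(A @ v_i, eigenvalues[i] * v_i))   # True
```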

As I mentioned before, the in-depth concepts behind eigenvectors and eigenvalues are beyond the scope of this article, so we will use the **eig** function from `numpy.linalg` to calculate both.

```
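from numpy.linalg import eig
cov_calc = covariance_calculation(scaled_data)  # covariance matrix from the previous step
w, v = eig(cov_calc)  # w: eigenvalues, v: eigenvectors (one per column)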
```

**w** contains the eigenvalues and **v** contains the corresponding eigenvectors (one per column).

The importance of each eigenvector is depicted by its corresponding eigenvalue. A higher eigenvalue means a greater spread of data along the eigenvector, which is what we want, so let’s plot and see the trend of the eigenvalues.

```
import matplotlib.pyplot as plt
plt.bar(["e" + str(i+1) for i in range(len(w))], w) #plotting the eigenvalues
plt.title("Eigenvalues")
plt.xlabel("Eigenvector")
plt.ylabel("Magnitude of the eigenvalue")
plt.show()
```

Well, it looks like the first 4 or 5 eigenvectors (e1 through e5) should be more than enough to give us a representation of our data.
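Reading the cut-off from a bar chart is a judgment call. A common complement is to look at the share of total variance each eigenvalue explains and keep enough components to reach a chosen threshold. The sketch below is one way to do that under stated assumptions: the data is synthetic, the 95% threshold is an arbitrary choice, and it uses `eigvalsh` with an explicit sort because NumPy's `eig` does not guarantee that eigenvalues come back in any particular order.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic, correlated data: 6 observed features driven by 2 latent factors plus noise
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
# eigvalsh returns eigenvalues in ascending order; reverse to get largest first
eigenvalues = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]

explained = eigenvalues / eigenvalues.sum()   # share of variance per component
cumulative = np.cumsum(explained)             # running total of those shares
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)   # smallest N reaching 95% (assumed threshold)
print(cumulative.round(3), n_keep)
```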

### Final calculation

Now we can use these eigenvectors to transform our normalized data into a 5-dimensional space (basically reduce it to 5 dimensions). We can simply do that by taking the dot product between the dataset and the chosen eigenvectors.

```
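top_idx = np.argsort(w)[::-1][:5] #indices of the 5 largest eigenvalues (eig does not sort them)
useful_pc = v[:, top_idx] #keeping only those 5 eigenvectors
principle_data = np.dot(scaled_data.values, useful_pc) #project the normalized data onto them
print(principle_data)
print("Shape of the New data is:", principle_data.shape)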
```

and done!!

And that’s how we transformed the original dataset (with 10 features) to something much smaller that still represents the original data.
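One way to check how well the reduced data "still represents the original data" is to map it back to the original feature space and measure what was lost. The sketch below does this on synthetic data (the data, the choice of 3 components and the variable names are assumptions for illustration, not results from the body performance dataset).

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data: 8 observed features driven by 3 latent factors plus a little noise
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.05 * rng.normal(size=(300, 8))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

w, v = np.linalg.eigh(np.cov(Z, rowvar=False))
top = v[:, np.argsort(w)[::-1][:3]]     # 3 strongest eigenvectors as columns

reduced = Z @ top                        # 300 x 3 representation
restored = reduced @ top.T               # map back to the 8-feature space
relative_error = np.linalg.norm(Z - restored) / np.linalg.norm(Z)
print(reduced.shape, round(float(relative_error), 4))
```

The squared relative error equals the share of total variance carried by the discarded eigenvalues, so a small error means the dropped components held little information.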

## 3 replies on “PCA Implementation in Python made Easy”

Maybe you should review your “Nomalization formula”, Or you would like to write Standardization.

Hi, thanks for pointing it out. Standardization is also called Z- score normalization. I forgot to write the complete term
