Feature Selection Algorithms for Machine Learning

Let's look at two powerful feature selection algorithms and how they can help you reduce problems in your Machine Learning model.

Feature selection is an optional, yet important, preprocessing step for your Machine Learning model. It is common practice to feed a model the data exactly as you receive it. This is a very common rookie mistake: data from almost any source will contain errors. There are 2 important steps that you should carry out on this data before moving on to training the models.

  1. Data Cleaning
  2. Feature Filtering/Selection

I have talked about Data Cleaning multiple times in other articles. If you would like to learn more about it, you can check the following blog post.

Feature Selection

In this post, however, we will talk about selecting the important input features from your dataset, so that the complexity of the data is reduced with minimal impact on the model's outputs.

First, let’s talk about how correlation amongst input variables affects a Machine Learning model.

Correlation between features

Suppose you have a dataset with the following input features and the output labeled Y, as shown in the diagram below.

Dataset on which we will use feature selection
Dataset with highly correlated values

Variables x_2 and x_3 have exactly the same values throughout the dataset. A Machine Learning model learns the patterns in the input variables that correspond to each output value. For variables that are highly correlated (x_2 and x_3 in the diagram above), one of them provides no valuable information to the model: both follow the same pattern, so it is as if they were the same variable.

Dataset after feature selection

Since one of these variables adds no additional information to the model, we can remove it altogether. What's the harm in keeping such a variable, you may ask?

Well…

  1. It increases the dimensionality of the data unnecessarily. This increases training time, and we may run into the curse of dimensionality.
  2. In most models, keeping the feature may not make the model any “worse”, but highly correlated variables affect different models differently. In linear models, for example, they make the fitted coefficients unstable and hard to interpret, and in some cases they reduce performance.
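The redundancy described in this section can be checked directly with a correlation matrix. The snippet below is a minimal sketch, not part of either algorithm covered later: the toy values, the helper name `drop_correlated`, and the 0.95 threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): x_3 is an exact copy of x_2.
df = pd.DataFrame({
    "x_1": [1.0, 4.0, 2.0, 8.0, 5.0],
    "x_2": [3.0, 1.0, 7.0, 2.0, 9.0],
})
df["x_3"] = df["x_2"]


def drop_correlated(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


reduced = drop_correlated(df)
print(list(reduced.columns))
```

A fixed threshold like this only looks at pairs of inputs and ignores the output variable, which is one reason to reach for a dedicated algorithm.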


Algorithms

At this point it may seem quite simple: just remove one of the 2 features at random. In real life, however, you will rarely find variables that show EXACTLY the same values across the dataset, so there are a few more things to consider before removing a variable, or even before deciding whether two variables are highly correlated.

This is why there are specialized algorithms that decide for you which variables to keep and which to remove. We will talk about 2 such algorithms.

Boruta Feature Selection

The Boruta feature selection algorithm was first introduced as a package for R. It is a very useful algorithm that derives its own importance threshold from the data and returns the features it judges relevant.

I will write a separate article on the inner workings of Boruta; a short summary follows.

Boruta shuffles the provided input features (each feature column separately) and concatenates these shuffled copies (called shadow features) with the original data. A Random Forest model is then trained on the combined dataset and returns a feature importance for every column, real and shadow. Boruta then sets the threshold to the importance of the strongest shadow feature.

Any real feature that consistently scores lower than the most important shadow feature across the iterations is dropped. Boruta has a Python package that runs this selection for you. Below is a demonstration of how it works.

```python
# install the package (notebook cell syntax)
!pip install boruta

# import important libraries
import pandas as pd
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor
import numpy as np
```

Now we load the dataset and clean it up a little: we convert the categorical variables to a numerical representation and remove rows with NaN values.

```python
# load data
heart_data = pd.read_csv("healthcare-dataset-stroke-data.csv")

# converting to numeric
heart_data["gender"] = pd.factorize(heart_data["gender"])[0]
heart_data["ever_married"] = pd.factorize(heart_data["ever_married"])[0]
heart_data["work_type"] = pd.factorize(heart_data["work_type"])[0]
heart_data["Residence_type"] = pd.factorize(heart_data["Residence_type"])[0]
heart_data["smoking_status"] = pd.factorize(heart_data["smoking_status"])[0]

# additional cleaning
heart_data.dropna(inplace=True)
heart_data.drop("id", axis=1, inplace=True)
heart_data.head()
```
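As a side note on the encoding step, here is a small sketch of what `pd.factorize` returns. The values are made up and are not taken from the dataset. By default a missing value is encoded as -1 rather than kept as NaN, so a later `dropna()` will not remove such rows from an already factorized column. It may be worth checking for -1 codes if your categorical columns can contain missing values.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration, including one missing entry.
smoking = pd.Series(["never smoked", "smokes", np.nan, "never smoked"])

codes, uniques = pd.factorize(smoking)
print(codes)          # one integer code per row; the missing value becomes -1
print(list(uniques))  # categories in order of first appearance
```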

The final dataset looks like the one below.

Dataset for feature selection
Dataset from Kaggle

Now let’s run the Boruta algorithm.

```python
X = heart_data.drop("stroke", axis=1)
y = heart_data["stroke"]

# we will use the random forest algorithm
forest = RandomForestRegressor(n_jobs=-1, max_depth=10)

# initialize Boruta
boruta = BorutaPy(estimator=forest, n_estimators="auto", max_iter=50)

# Boruta accepts np.array
boruta.fit(np.array(X), np.array(y))

# get results
green_area = X.columns[boruta.support_].to_list()
blue_area = X.columns[boruta.support_weak_].to_list()
print("Selected Features:", green_area)
print("Blue area features:", blue_area)
```
Result of the Boruta algorithm

So, out of the 10 original features, Boruta considers only the 2 returned features important enough to make a reasonable decision.
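To make the shadow-feature idea from the summary more concrete, here is a rough sketch of a single Boruta-style round written with plain scikit-learn. It is not the BorutaPy implementation: the synthetic data from `make_classification`, the forest settings, and the variable names are assumptions for illustration. The real algorithm repeats such rounds many times and uses a statistical test on the number of hits before accepting or rejecting a feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: with shuffle=False the 3 informative columns come first.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Shadow features: each column permuted independently, which breaks any link to y.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_combined = np.hstack([X, X_shadow])

forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(X_combined, y)

n_real = X.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_threshold = forest.feature_importances_[n_real:].max()

# A real feature scores a "hit" in this round if it beats the best shadow feature.
hits = real_importance > shadow_threshold
print("Threshold:", round(float(shadow_threshold), 4))
print("Hits this round:", np.flatnonzero(hits).tolist())
```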

mRMR Feature Selection

MRMR stands for Maximum Relevance Minimum Redundancy. While Boruta searches the features for the most important ones, MRMR makes sure that the selected features not only have a high correlation with the output variable but also have minimal correlation with one another.

This algorithm was first introduced in the following paper.

MRMR works iteratively. It first asks you how many features you want to keep (K). In every iteration it then picks the 1 feature that is most relevant to the output variable and least redundant with the features already selected. Once a feature is selected, it is removed from the pool of candidates, and the next iteration begins, until K iterations are completed.
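The greedy loop can be sketched in a few lines. This is an illustrative version only, under stated assumptions: relevance is measured with the ANOVA F-statistic, redundancy with the mean absolute Pearson correlation to the already selected features, and the two are combined as a quotient. The function name `mrmr_sketch`, the toy data, and the small floor on the redundancy are hypothetical choices, not the package's code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def mrmr_sketch(X: pd.DataFrame, y: pd.Series, k: int) -> list:
    """Greedy mRMR sketch: relevance = F-statistic, redundancy = mean |Pearson r|."""
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    abs_corr = X.corr().abs()
    selected, candidates = [], list(X.columns)
    for _ in range(min(k, len(candidates))):
        if selected:
            # Average correlation of each candidate with the features chosen so far.
            redundancy = abs_corr.loc[candidates, selected].mean(axis=1)
        else:
            # First round: nothing selected yet, so only relevance matters.
            redundancy = pd.Series(1.0, index=candidates)
        # The floor (an arbitrary choice here) avoids division by zero.
        score = relevance[candidates] / redundancy.clip(lower=1e-3)
        best = score.idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (made up): "b" is a near-copy of "a"; "c" is weaker but mostly independent.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=400))
a = y + rng.normal(0, 0.8, size=400)
X = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(0, 0.01, size=400),
    "c": y + rng.normal(0, 1.0, size=400),
})
picked = mrmr_sketch(X, y, k=2)
print(picked)
```

In this toy setup the second pick is "c" rather than the near-duplicate of the first pick, even though the duplicate is individually more relevant. That is the redundancy penalty at work.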

I will explain the details of the algorithm in a separate post. For now, let's look at its Python implementation.

Install the Python package using the following command:

!pip install mrmr_selection

You can find the complete documentation for this package at its official GitHub repository here.

The usage is quite straightforward.

```python
from mrmr import mrmr_classif

selected_features = mrmr_classif(X=X, y=y, K=2)
```

I have set K to 2 just to see whether the selected features match the ones returned by Boruta.

```python
print(selected_features)
```
Features returned by MRMR with K=2

And yes, we get exactly the same features as from the Boruta algorithm above. What makes MRMR flexible, however, is that if you believe 2 features might not be enough for a good result, you can choose to keep as many as you want.

Let’s carry out a few more runs.

```python
# top 4 features
top_4 = mrmr_classif(X=X, y=y, K=4)

# top 6 features
top_6 = mrmr_classif(X=X, y=y, K=6)

print("Best 4 features:", top_4)
print("Best 6 features:", top_6)
```
Features returned by MRMR for k = 4 and k = 6

Conclusion

Feature selection is a lifesaver when you are low on memory resources and, at times, can even help improve the performance of your model. It is a worthwhile step in the process of building your machine learning model.
