Boruta Feature Selection Explained in Python

Learn to implement the Boruta feature selection algorithm in Python and improve your Machine Learning model.

This article aims to explain the very popular Boruta feature selection algorithm. Boruta automates the process of feature selection: it determines any thresholds automatically and returns the features that are most meaningful in your dataset. Boruta works on the “all-relevant” principle, meaning it gives you ALL the features that are relevant to your Machine Learning problem.

Need for Feature Selection?

Datasets can contain features that are completely irrelevant to your problem. These features increase the size of your dataset, add complexity to the model, and either have no impact on the output or actively worsen the results. It is important to identify and remove these features before moving on to the training stage.

You can find a little more detail on Feature selection in the following article.

Boruta Algorithm

This algorithm was first introduced as a package for R. It comprises the following steps:

1. Create copies of the original features by randomly shuffling each one (these are the Shadow Features).

2. Concatenate these shadow features to the original dataset.

3. Train a Random Forest Classifier on this new dataset.

4. Check the feature importance of the highest-ranked shadow feature.

5. Keep every original feature that is more important than that most important shadow feature.

6. Repeat steps 3-5 for a number of iterations (20 is a reasonable number) and keep track of how often each feature comes out as important.

7. Use a binomial distribution to decide which features appear important often enough to be kept in the final list.
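The rest of this article implements each of these steps by hand on a real dataset. As a bird's-eye view, here is a minimal sketch of the loop described above, assuming a pandas DataFrame X of features, a target series y, and scikit-learn's RandomForestClassifier (the function and variable names are only illustrative):

# A minimal sketch of the Boruta loop described above (illustrative only)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def boruta_sketch(X, y, n_trials=20):
    # Steps 1-2: create a shuffled copy of every column and append it as a shadow feature
    X_boruta = X.copy()
    for col in X.columns:
        X_boruta[f"shadow_{col}"] = X[col].sample(frac=1).reset_index(drop=True)

    hits = {col: 0 for col in X.columns}
    for _ in range(n_trials):
        # Step 3: train a Random Forest on the extended dataset
        rf = RandomForestClassifier(max_depth=20)
        rf.fit(X_boruta, y)
        importances = dict(zip(X_boruta.columns, rf.feature_importances_))
        # Step 4: importance of the best shadow feature
        best_shadow = max(v for k, v in importances.items() if k.startswith("shadow_"))
        # Step 5: count a "hit" for every original feature that beats it
        for col in X.columns:
            if importances[col] > best_shadow:
                hits[col] += 1
    # Steps 6-7: the hit counts are then compared against a binomial distribution
    return hits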

Before moving on, if you are finding this article helpful, do consider supporting me on Ko-Fi.

Ko-fi.com/moosaali9906

Implementation

The complete code can be found at the Github-Repo.

Before running any algorithm, we obviously need some data to perform feature selection on. For this purpose, we will use the same dataset that we used in the last article about feature selection.

This dataset can be found at the following link on Kaggle.

Load and process data

# important libraries
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import scipy as sp
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

We will use all of the above libraries.

data = pd.read_csv("healthcare-dataset-stroke-data.csv")
data.head()
Heart Stroke Dataset

A lot of useless data, nothing we haven’t seen before.

Time for a cleanup. For a more detailed look at data cleaning and processing, refer to the following post.

# converting to numeric
data["gender"] = pd.factorize(data["gender"])[0]
data["ever_married"] = pd.factorize(data["ever_married"])[0]
data["work_type"] = pd.factorize(data["work_type"])[0]
data["Residence_type"] = pd.factorize(data["Residence_type"])[0]
data["smoking_status"] = pd.factorize(data["smoking_status"])[0]

# additional cleaning
data.dropna(inplace=True)
data.drop("id", axis=1, inplace=True)
data.reset_index(inplace=True, drop=True)
data.head()
Cleaned data for feature selection

All freshened up.

# separate input and output variables
X = data.drop("stroke", axis=1)
y = data["stroke"]

Separating inputs and outputs.

1. Creating Shadow Features

For this, we just need to shuffle the original features and concatenate them to the original dataset.

for col in X.columns:
    X[f"shadow_{col}"] = X[col].sample(frac=1).reset_index(drop=True)
Shadow features concatenated

2. Calculate Importance

def get_important_features(X, y):
    # Initialize Random Forest Classifier
    rf = RandomForestClassifier(max_depth=20)
    # Fit Random Forest on provided data
    rf.fit(X, y)
    # Create dictionary of feature importances
    importances = {feature_name: f_importance for feature_name, f_importance in zip(X.columns, rf.feature_importances_)}
    # Isolate importances of shadow features
    only_shadow_feat_importance = {key: value for key, value in importances.items() if "shadow" in key}
    # Get importance level of the most important shadow feature
    highest_shadow_feature = list(dict(sorted(only_shadow_feat_importance.items(), key=lambda item: item[1], reverse=True)).values())[0]
    # Get original features which fulfill the Boruta selection criterion
    selected_features = [key for key, value in importances.items() if value > highest_shadow_feature]
    return selected_features

This function trains a Random Forest classifier on our heart stroke dataset. The classifier returns the importance it assigns to each feature in the variable `feature_importances_`.

We then create a dictionary of each feature along with its importance and single out the most important shadow feature.

Finally, it returns a list of all original features whose importance score is greater than that of the singled-out shadow feature.

Now since one trial isn’t enough, we need to run multiple trials to make sure we get satisfactory results.

Multiple Trials

TRIALS = 50
feature_hits = {i: 0 for i in data.columns}

for _ in tqdm(range(TRIALS)):
    imp_features = get_important_features(X, y)
    for key, _ in feature_hits.items():
        if key in imp_features:
            feature_hits[key] += 1

print(feature_hits)

The results of our 50 runs are as follows.

{'gender': 0, 'age': 50, 'hypertension': 0, 'heart_disease': 0, 'ever_married': 0, 'work_type': 0, 'Residence_type': 0, 'avg_glucose_level': 50, 'bmi': 1, 'smoking_status': 0, 'stroke': 0}

Age and avg_glucose_level come out as important in all 50 trials, while BMI comes out as important in only 1 trial. To decide whether appearing important in just 1 trial is enough to keep BMI, we will use a binomial distribution.

Binomial Distribution

The following line of code returns us the probabilities according to a binomial distribution.

# Calculate the probability mass function
pmf = [sp.stats.binom.pmf(x, TRIALS, .5) for x in range(TRIALS + 1)]

In each trial, a feature about which we have no information has a 50% chance of beating the most important shadow feature, so the number of hits over our trials follows a binomial distribution with probability 0.5. This distribution has a bell-shaped curve, and we will treat each tail as the region containing 5% of the overall probability.

First, we need a function that gives us the number of iterations that form the tail.

# number of trials that form the tail
def get_tail_items(pmf):
    total = 0
    for i, x in enumerate(pmf):
        total += x
        if total >= 0.05:
            break
    return i

The rules are simple. If the number of hits falls in the right tail, the feature is in the green zone (features that must be kept). If it falls under the middle of the bell, it is in the blue zone (features that can be experimented with), and if it falls in the left tail, it is in the red zone (features that should be dropped).
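As a rough sanity check of where these boundaries land for our 50 trials, scipy's inverse CDF should agree with the pmf loop we wrote above:

# Sanity check of the zone boundaries for TRIALS = 50 (scipy's inverse CDF)
cutoff = int(sp.stats.binom.ppf(0.05, TRIALS, 0.5))
print(cutoff)           # 19 -> red zone: fewer than 19 hits
print(TRIALS - cutoff)  # 31 -> green zone: 31 hits or more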

Let’s visualize the distribution we created.

# plot the binomial distribution
plt.plot([i for i in range(TRIALS + 1)], pmf, "-o")
plt.title(f"Binomial distribution for {TRIALS} trials")
plt.xlabel("No. of trials")
plt.ylabel("Probability")
plt.grid(True)
Binomial distribution for 50 trials


Final Selection

Now we just need to code the rules we discussed above for deciding which features fall into the green, blue, and red zones.

# select features from n number of trials
def choose_features(feature_hits, TRIALS, thresh):
    # define boundaries
    green_zone_thresh = TRIALS - thresh
    blue_zone_upper = green_zone_thresh
    blue_zone_lower = thresh
    green_zone = [key for key, value in feature_hits.items() if value >= green_zone_thresh]
    blue_zone = [key for key, value in feature_hits.items() if (value >= blue_zone_lower and value < blue_zone_upper)]
    return green_zone, blue_zone

Now run the above functions in the following order.

thresh = get_tail_items(pmf)
green, blue = choose_features(feature_hits, TRIALS, thresh)
green, blue
Important Features according to our Boruta Algorithm
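For reference, with the hit counts we printed earlier (age: 50, avg_glucose_level: 50, bmi: 1) and a left-tail cutoff of 19, this call should come out as (['age', 'avg_glucose_level'], []): age and avg_glucose_level land in the green zone, the blue zone is empty, and BMI with only 1 hit out of 50 falls in the red zone and is dropped.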

As we can see, these are exactly the features we got when we ran the Python implementation of Boruta in the other article.

Also, don’t forget to buy me a Ko-Fi.
