EDA stands for ‘Exploratory Data Analysis’. Data analysis is a core part of a data scientist’s or analyst’s job, and it can be tiring work, especially when you have a lot of data of varying types at hand.
Data visualization in Python is still no easy task. Some Python data visualization libraries, such as ‘matplotlib’, require a fair amount of expertise to use. Wouldn’t it be great if we had a free data visualization tool, one that provides interactive data visualization?
SweetViz is one such open-source tool, developed to aid your data analytics tasks, and by aid I mean reduce them to almost two lines of code. Literally!
Let’s see what magic this library holds.
Step 1 is, obviously, to install the library. Thankfully, this takes a single pip command.
pip install sweetviz
Once this is installed we can import this library into our Jupyter notebook.
import sweetviz as sv
import pandas as pd
We import pandas because of course, we need to load our dataset.
The dataset we used is open-source and available on Kaggle. You can download it here.
data = pd.read_csv("heart.csv")
data.head()
Now for the magic!!
my_report = sv.analyze(data, "HeartDisease")
That’s it! That’s the magic: this single line of code has done everything for us.
The line above builds the report object. Calling my_report.show_html() then writes an HTML file to your working directory, and if you’re working in Jupyter you can display the report inline with my_report.show_notebook() instead.
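A small helper can choose between the two ways of displaying a report. This is only a sketch: the helper name render_report and the file name heart_report.html are made up for illustration, while show_html() and show_notebook() are the methods SweetViz provides on the report object. Passing open_browser=False simply keeps a browser tab from opening automatically.

```python
def render_report(report, in_notebook=False, filepath="heart_report.html"):
    """Render a SweetViz report inline (Jupyter) or as an HTML file.

    `report` is expected to be the object returned by sv.analyze().
    The helper name and the default file name are illustrative assumptions.
    """
    if in_notebook:
        # Embeds the report in the output cell of the notebook.
        report.show_notebook()
        return None
    # Writes the report to `filepath` without opening a browser tab.
    report.show_html(filepath=filepath, open_browser=False)
    return filepath
```

With the report built earlier, render_report(my_report, in_notebook=True) would display it inside the notebook, and render_report(my_report) would write the HTML file instead.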
Let’s see what we get within this file.
On the very top, we see a very brief summary of the data frame. The summary includes:
- The total number of rows in the dataset.
- Number of duplicate rows.
- The memory it occupies in RAM.
- Total features (columns) and how they are distributed by type (numerical, categorical, text).
In the very same box, we see a button labeled “Associations”. Clicking this button opens another box.
Here we see how each variable is associated with the others. The description at the top explains what the squares and circles mean: squares are categorical associations in the data, and circles are correlations between numerical features. Since most of our data is categorical, we don’t see many numerical correlations on the grid.
The color bar on the right shows the degree of relationship between the variables. We can see that ‘MaxHR’ and ‘Age’ have a mildly negative correlation.
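To make the idea of a negative correlation concrete, here is a minimal sketch of the Pearson coefficient, a common measure of linear association between two numerical columns, written in plain Python. The ages and heart rates below are made-up toy values, not rows from heart.csv, and the function name pearson is only for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy values for illustration only (not taken from heart.csv).
ages = [30, 40, 50, 60, 70]
max_hr = [190, 175, 160, 150, 130]

r = pearson(ages, max_hr)
print(round(r, 3))  # a negative value: heart rate falls as age rises
```

A value close to -1 means the two columns move in opposite directions almost perfectly, while a value near 0 means there is little linear relationship.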
Now coming back to the main dashboard, we can see a list of tabs, each representing a feature of the dataset.
Features of the dataset.
The ages are represented in the form of a histogram. We can see that our data consists mostly of ages between 48 and about 56. Below that we see a distribution of the largest values, smallest values, and the most frequently occurring values.
Throughout the screenshots shared above, you will notice that, apart from the histogram, there is also a line drawn on each graph. This line depicts our target variable. This is another feature introduced in SweetViz 2.0, with which you can check how the target variable trends against every other feature in the dataset.
This is a useful feature but a little difficult to understand in our case since our data consists mostly of categorical variables and it does not make intuitive sense to compare different types of variables against each other. In the above graphs, we have the ‘HeartDisease’ feature compared to others. This is a binary variable and it doesn’t really make sense to plot it against a continuous variable, on the same grid.
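One way to read that line, sketched here under the assumption that it traces the average of the target within each histogram bin: for a 0/1 target such as ‘HeartDisease’, the average is simply the share of positive cases in the bin. The bin edges, toy records, and function name below are illustrative and are not taken from the dataset or from SweetViz’s source.

```python
def target_rate_by_bin(values, targets, edges):
    """Return the mean target per bin; bin i covers [edges[i], edges[i+1])."""
    sums = [0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for value, target in zip(values, targets):
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                sums[i] += target
                counts[i] += 1
                break
    # None marks an empty bin, where no rate can be computed.
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy records for illustration only (not rows from heart.csv).
ages = [34, 38, 45, 47, 52, 55, 58, 63, 66, 71]
heart_disease = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]

rates = target_rate_by_bin(ages, heart_disease, edges=[30, 40, 50, 60, 70, 80])
print(rates)
```

Read this way, the line is less a comparison of a binary variable with a continuous one and more a per-bin rate, which is easier to interpret.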
However, to make use of this feature, we can look at pairs of features where the comparison actually makes sense.
Here we have a comparison between ‘MaxHR’ and the ‘Age’ of the patients. The trend is quite clear: the two variables are negatively correlated, exactly as the Associations grid discussed earlier in this article suggested. As the maximum heart rate increases, the age of the person decreases. Most people above the age of about 55 have a heart rate that lies in the bin of 60 to 70. Similarly, people around 30 years of age have a heart rate that reaches around 200.
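SweetViz also allows us to compare two datasets that share the same features, such as a training and a test split, by plotting them on the same grid and analyzing the trends. This is done with sv.compare(), which takes the two data frames, each optionally paired with a display name, along with the target feature.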
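A comparison call might look roughly like the sketch below. The helper name compare_splits and the display names "Train" and "Test" are assumptions made for illustration; sv.compare() itself accepts each dataset either as a data frame or as a [data frame, name] pair, followed by the target feature.

```python
def compare_splits(train_df, test_df, target="HeartDisease"):
    """Build a SweetViz report comparing two data frames with the same columns."""
    # Imported here so the helper can be defined even where SweetViz is absent.
    import sweetviz as sv

    # Each dataset is passed as [DataFrame, display name].
    return sv.compare([train_df, "Train"], [test_df, "Test"], target)
```

Assuming train and test are two pandas data frames split from data, compare_splits(train, test).show_html() would write the comparison report to an HTML file.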
The above snapshot shows how the test and train data frames compare.
Exploratory data analysis can be a tedious task, which is why tools like this come in handy for a quick overview of the data. For in-depth analysis you will, of course, have to write your own scripts tailored to your needs and use case. Still, it is good to know that whenever you are short on time or need a quick analysis, tools like SweetViz are available to make your work easier.