Data Visualization and Analysis in Python using MatPlotLib

A question people often raise is “Data Science vs Data Analytics”. We have discussed data science in many other articles, but today we talk about data analytics, and specifically about its first step: data visualization.

In an earlier article, we saw how you can visualize your data and perform analysis in about 2 lines of code using the SweetViz library (a data visualization tool). If you missed that article, you can read it here.


Two lines of code may sound like a great deal, but most of the time your data visualization work will require much more complex analysis, and for that you need to create visualizations of your own that cater to your requirements.

MatPlotLib is the most popular Python library that is used for plotting all sorts of charts and graphs. This library allows a great deal of flexibility and customization down to the smallest level.

In this article, we will explore different kinds of plots, look into the customization features that MatPlotLib allows, and work through several data visualization examples.


Getting started with Matplotlib

Of course, the first step is always to install the desired library. Matplotlib can simply be installed by running the following command in your terminal/command prompt.

pip install matplotlib

In this article, we will use the Covid-19 Vaccinations dataset that can be found here.

For loading the dataset we will use the pandas library. If you do not have pandas installed, you can install it by running the following command in your terminal.

pip install pandas

The first steps are to load and explore the dataset to see what values can be plotted for a useful visualization.

# importing the libraries
import matplotlib.pyplot as plt
import pandas as pd

covid = pd.read_csv("country_vaccinations.csv")  # load the dataset
covid.sample(5)  # sample 5 rows from the dataset
Loaded Dataset

So what we have here is the information from all the countries regarding the no. of vaccinations administered. We can definitely do some fun plotting here. Also important to note, we have some missing values in the dataset (denoted by NaN above).
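Before plotting, it helps to know how many values are missing in each column. Here is a minimal sketch of that check. It uses a small hand-made frame with made-up numbers in place of the real CSV, so the counts below say nothing about the actual dataset; on the real data you would call the same method on `covid`.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for country_vaccinations.csv (illustrative values only)
sample = pd.DataFrame({
    "country": ["Pakistan", "Pakistan", "India", "India"],
    "date": ["2021-03-01", "2021-03-02", "2021-03-01", "2021-03-02"],
    "total_vaccinations": [100.0, np.nan, 2500.0, 2900.0],
    "people_vaccinated": [90.0, np.nan, np.nan, 2600.0],
})

# Count the missing entries in every column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```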

Alright, enough waiting. Let’s plot.

covid_pak = covid[covid["country"] == "Pakistan"]  # single out data for Pakistan
covid_pak = covid_pak.sort_values(by="date")  # sort data by date
plt.plot(covid_pak["date"], covid_pak["total_vaccinations"])  # plot the relevant data

Here we have singled out data for the country “Pakistan” and the last line uses matplotlib to plot the total vaccine doses administered against the date on the x-axis.

Plot for covid vaccinations in Pakistan

That single line created our plot for us. The line is broken because, as I mentioned before, the data has missing values, and MatPlotLib leaves a gap wherever a value is missing. We can see that the number of vaccinations is increasing with time.
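If you would rather have one continuous line, one option is to drop the rows with missing values before plotting. The sketch below shows the idea on a tiny made-up frame (`demo` is a stand-in for `covid_pak`, not real data). Keep in mind that this joins the neighbouring points and so hides the fact that some days had no reported value.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up series with a gap, standing in for covid_pak
demo = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-02", "2021-03-03", "2021-03-04"],
    "total_vaccinations": [100.0, np.nan, 300.0, 450.0],
})

# Keep only the rows that have a value so the line is drawn without gaps
demo_clean = demo.dropna(subset=["total_vaccinations"])

plt.figure()
plt.plot(demo_clean["date"], demo_clean["total_vaccinations"])
```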

The plot is there but it is quite – UGLY – for lack of a better word. Let’s add some more elements and make the graph a little larger for better viewing.

plt.figure(figsize=(12, 10))  # define the figure along with its size
plt.plot(covid_pak["date"], covid_pak["total_vaccinations"])  # plot the relevant data
plt.title("Total Vaccinations administered in Pakistan")  # set title for the plot
plt.xlabel("Date")  # set x-axis label
plt.ylabel("No. of vaccines")  # set y-axis label
Graph with some necessary elements

In the code above we have first altered the figure size to (12,10) corresponding to the width and height of the figure. Next, we go on to plot the data and then set the labels of the X and Y axis, and finally give the plot a title.

Another important concept to learn is the matplotlib subplot. In the above figure we have a single line graph, but what if we want to look at multiple graphs? In such a case we create subplots.

The simple command to initiate a matplotlib subplot is:

plt.subplot(2,2,1) #create a 2 by 2 grid to create 4 different plots

The above line creates an empty 2 x 2 grid with space for 4 different plots. The ‘1’ in the last position inside the parentheses means that the plot created immediately after this line will fill up cell no. 1. Let’s create 4 such plots.

# isolate the records for the other three countries
# (these frames were not defined earlier; names as spelled in the dataset)
India = covid[covid["country"] == "India"].sort_values(by="date")
america = covid[covid["country"] == "United States"].sort_values(by="date")
uk = covid[covid["country"] == "United Kingdom"].sort_values(by="date")

plt.figure(figsize=(12, 10))  # define the figure along with its size

plt.subplot(2, 2, 1)  # fill up cell no. 1
plt.plot(India["people_vaccinated"])
plt.title("No. of Vaccinated people in India")
plt.xlabel("Date")  # set x-axis label
plt.ylabel("No. of vaccines")  # set y-axis label
# ----------------------------------------------------
plt.subplot(2, 2, 2)  # fill up cell no. 2
plt.plot(covid_pak["people_vaccinated"])
plt.title("No. of Vaccinated people in Pakistan")
plt.xlabel("Date")  # set x-axis label
plt.ylabel("No. of vaccines")  # set y-axis label
# ----------------------------------------------------
plt.subplot(2, 2, 3)  # fill up cell no. 3
plt.plot(america["people_vaccinated"])
plt.title("No. of Vaccinated people in America")
plt.xlabel("Date")  # set x-axis label
plt.ylabel("No. of vaccines")  # set y-axis label
# ----------------------------------------------------
plt.subplot(2, 2, 4)  # fill up cell no. 4
plt.plot(uk["people_vaccinated"])
plt.title("No. of Vaccinated people in the United Kingdom")
plt.xlabel("Date")  # set x-axis label
plt.ylabel("No. of vaccines")  # set y-axis label
2×2 grid with 4 subplots

A grid like this makes it much easier to compare graphs of different data points.
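The four cells repeat nearly the same code, so a loop can build the same kind of grid with less typing. The sketch below is one way to do it, using `plt.subplots` (note the plural), which returns the figure and all the axes at once. The `frames` dictionary and its numbers are made up for illustration; in the article’s setting the values would be the per-country DataFrames.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values standing in for the four per-country frames
frames = {
    "India": pd.DataFrame({"people_vaccinated": [10, 30, 60]}),
    "Pakistan": pd.DataFrame({"people_vaccinated": [5, 9, 20]}),
    "America": pd.DataFrame({"people_vaccinated": [40, 80, 120]}),
    "the United Kingdom": pd.DataFrame({"people_vaccinated": [15, 35, 50]}),
}

# plt.subplots returns the figure and a 2 x 2 array of axes in one call
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# axes.flat walks the grid row by row: cell 1, 2, 3, 4
for ax, (name, frame) in zip(axes.flat, frames.items()):
    ax.plot(frame["people_vaccinated"])
    ax.set_title(f"No. of Vaccinated people in {name}")
    ax.set_xlabel("Row index")
    ax.set_ylabel("No. of vaccines")

fig.tight_layout()  # keep titles and labels from overlapping
```

Working with the axes objects directly (`ax.set_title` instead of `plt.title`) also makes it explicit which cell each call affects.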

Note: in the subplots above, we haven’t specified any values for the x-axis. In such a case MatPlotLib fills them in itself. For a plain list it uses the positions 0 to N-1, where N is the number of data points. For a pandas Series, as used here, it takes the Series index, so the x-axis shows row numbers rather than dates.
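To see which x values MatPlotLib picks when only one sequence is passed to `plt.plot`, you can read them back from the line object it returns. This is a small sketch with made-up numbers; the variable names are just for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

plt.figure()

# Plain list: x defaults to the positions 0, 1, 2
(list_line,) = plt.plot([5, 7, 9])
list_x = list(list_line.get_xdata())

# pandas Series: x defaults to the Series index (here 10, 11, 12)
series = pd.Series([5, 7, 9], index=[10, 11, 12])
(series_line,) = plt.plot(series)
series_x = list(series_line.get_xdata())

print(list_x, series_x)
```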


Different kinds of plots

What we have created above is a simple line graph. MatPlotLib allows you to create several different types of plots according to your need. Some of these plots are:

  • Scatter Plot
  • Bar Graph
  • Pie Chart

Let’s see how we can create these.

  • MatPlotLib Scatter Plot
plt.figure(figsize=(12, 10))  # define the figure along with its size
plt.scatter(country_name, max_vaccinations)  # scatter plot of the total vaccinations in every country
plt.title("Total vaccinations in a country")
plt.xlabel("Country Name")
plt.ylabel("No. of Vaccinations")
Scatter Plot for total vaccinations in every country
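The scatter plot code uses `country_name` and `max_vaccinations`, which are not defined anywhere in the article. One plausible way to build them is sketched below. It assumes that `total_vaccinations` is a running total, so the maximum per country is that country’s latest figure. The frame `covid_demo` holds made-up rows in place of the real `covid` data.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the covid DataFrame
covid_demo = pd.DataFrame({
    "country": ["India", "India", "Pakistan", "Pakistan"],
    "total_vaccinations": [2500.0, 2900.0, 100.0, np.nan],
})

# Assumption: total_vaccinations is cumulative, so the maximum per country
# is that country's latest total; max() skips missing values
max_by_country = covid_demo.groupby("country")["total_vaccinations"].max()

country_name = max_by_country.index       # x values for plt.scatter
max_vaccinations = max_by_country.values  # y values for plt.scatter
```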

The graph has a very odd pattern: almost every country seems to sit close to zero, with the exception of a handful. This does not mean that those countries did not receive any vaccine. If you look at the top of the y-axis you can read ‘1e9’. This defines the scale of the graph and means that the numerals on the y-axis are to be multiplied by 10^9. So many of the countries might have values such as 0.1 x 10^9, which is 100 million. That is still a large number, but the skew in our data makes it hard to see. We can fix this by changing the y-axis scale to logarithmic, simply by adding the following line to the code.

plt.yscale("log")
Scatter plot with a log scale on the y-axis

This makes a lot more sense. This data visualization example shows how important it is to adjust graphs so that we can make proper sense of our data.
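Here is a self-contained sketch of the same idea on a handful of made-up totals, where one value is thousands of times larger than the others. The country letters and numbers are placeholders, not values from the dataset.

```python
import matplotlib.pyplot as plt

# Made-up totals: one value dwarfs the rest, as in the scatter plot above
countries = ["A", "B", "C", "D"]
totals = [2.0e9, 3.5e7, 1.2e6, 4.0e4]

fig, ax = plt.subplots()
ax.scatter(countries, totals)
ax.set_yscale("log")  # each major gridline is now a factor of 10, not a fixed step
```

On a linear axis the last three points would all appear to lie on zero. On the log axis each of them gets its own clearly separate height.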

There is one more disturbing element in all the plots we have created so far. The labels on the x-axis are not visible at all. This is because there are so many labels to display that they are all jumbled up. Fear not, we will address this issue in the advanced portion of this tutorial where we look to beautify our plots more and make the most out of them.  

  • MatPlotLib Bar Plot

A bar plot is usually used to display the magnitudes of a certain quantity against different independent variables. In MatPlotLib you can create a bar plot with the following function.

plt.bar()

We need to create some additional data to display a useful bar plot.

canada = covid[covid["country"] == "Canada"]  # isolate records for Canada

# extract records for date: 2021-11-04
uk_onedate = uk[uk["date"] == "2021-11-04"]
america_onedate = america[america["date"] == "2021-11-04"]
canada_onedate = canada[canada["date"] == "2021-11-04"]
pak_onedate = covid_pak[covid_pak["date"] == "2021-11-04"]

labels = ["UK", "USA", "Canada", "Pakistan"]
plt.figure(figsize=(12, 10))  # define the figure along with its size
# argument 1 is the list of labels to display on the x-axis,
# argument 2 is the list of values for those labels
plt.bar(labels, [uk_onedate["people_vaccinated"].iloc[0],
                 america_onedate["people_vaccinated"].iloc[0],
                 canada_onedate["people_vaccinated"].iloc[0],
                 pak_onedate["people_vaccinated"].iloc[0]])
plt.title("Total people vaccinated till 04-11-2021")
plt.xlabel("Countries")
plt.ylabel("No. of people vaccinated")
Bar plot to display total vaccinated people till 04-11-2021
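Filtering every country separately gets repetitive as the list grows. A possible alternative, sketched here on made-up numbers, is to filter the date once and then look up each country by name. The `countries` mapping and the `covid_demo` frame are illustrative assumptions, not part of the original code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows standing in for the covid DataFrame
covid_demo = pd.DataFrame({
    "country": ["United Kingdom", "United States", "Canada", "Pakistan"],
    "date": ["2021-11-04"] * 4,
    "people_vaccinated": [50.0, 220.0, 30.0, 70.0],
})

# Short display label -> country name as spelled in the dataset
countries = {"UK": "United Kingdom", "USA": "United States",
             "Canada": "Canada", "Pakistan": "Pakistan"}

# Filter the date once, then index the rows by country for easy lookup
one_date = covid_demo[covid_demo["date"] == "2021-11-04"].set_index("country")
values = [one_date.loc[name, "people_vaccinated"] for name in countries.values()]

plt.figure(figsize=(12, 10))
bars = plt.bar(list(countries.keys()), values)
```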
  • MatPlotLib Pie Chart

A pie chart is usually used to visualize the percentages of different elements inside a single entity.

MatplotLib has the following function to draw a pie chart.

plt.pie()

Let’s just use dummy data for this example. We’ll use our covid data in the next section.

plt.figure(figsize=(12, 10))  # define the figure along with its size
_ = plt.pie([24, 40, 12])
Pie Chart for dummy values

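We passed the function the values 24, 40 and 12. In the above pie chart, the orange wedge represents 40: it is the second value passed (orange is the second default color) and, being the largest value, it takes up the biggest share of the circle. The remaining wedges follow the same logic.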
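Reading a pie chart by color alone is error-prone, so it is usually safer to label the wedges. The sketch below adds labels (the names are hypothetical, since the data is dummy data) and checks that each wedge’s angle is proportional to its value.

```python
import matplotlib.pyplot as plt

values = [24, 40, 12]
names = ["first", "second", "third"]  # hypothetical labels for the dummy data

plt.figure(figsize=(12, 10))
# Without autopct, plt.pie returns the wedge patches and the label texts
wedges, texts = plt.pie(values, labels=names)

# Each wedge spans a share of the 360 degrees proportional to its value
angles = [w.theta2 - w.theta1 for w in wedges]
```

Here the second wedge covers 40 / (24 + 40 + 12) of the circle, which is a little over half.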


Beautifying the graphs and charts (Advanced options)

All the plots we have created above do the job well, but none of them look professional. Luckily, what we’ve seen so far is just the tip of the iceberg. MatPlotLib has a plethora of customization options.

Let’s create some cool-looking graphs from the same covid dataset that we loaded earlier. We will need some additional information for the plots.

population_uk = 67.2e6  # population of the UK
population_usa = 329.8e6  # population of the USA

Above we have defined the total population of the two mentioned countries.

Now we write our code.

# assumption: n was not defined in the original code; here it is the number of rows both frames share
n = min(len(uk), len(america))

fig = plt.figure(figsize=(14, 10))  # create figure and set figure size
# plot both countries against the UK dates (assumes both frames cover the same dates)
plt.plot(uk["date"][:n], uk["people_vaccinated"][:n], color="red", label="UK")
plt.plot(uk["date"][:n], america["people_vaccinated"][:n], color="blue", label="USA")
# color in between the two plots
plt.fill_between(uk["date"][:n], uk["people_vaccinated"][:n], america["people_vaccinated"][:n],
                 facecolor="#0435d4", alpha=0.5)
plt.axhline(population_uk, linestyle="--", color="red", label="Total population (UK)")  # population of the UK
plt.axhline(population_usa, linestyle="--", color="blue", label="Total population (USA)")  # population of the USA

plt.ylabel("No. of vaccine doses administered", fontsize=18)  # set y-axis label
plt.xlabel("TimeStamp", fontsize=18)  # set x-axis label
plt.title("Vaccinations in UK and USA (2021)")  # set title for the chart
plt.legend(loc=2)  # place legend in the upper left, using the labels set above

ax = plt.gca()  # get the current axes to modify them
ax.patch.set_facecolor("blue")  # set background color
ax.patch.set_alpha(0.1)  # set opacity of background
ax.grid(axis="y", linewidth=1, alpha=0.5)  # enable horizontal gridlines
ax.xaxis.set_major_locator(plt.MaxNLocator(20))  # limit the no. of ticks shown on the x-axis
plt.xticks(rotation=45)  # rotate x-axis ticks

# hide chart edges
for side in ("right", "top", "left", "bottom"):
    ax.spines[side].set_visible(False)

That is a lot of code for one plot but the result is totally worth it.

Comparing UK and USA in terms of total vaccinations

Now isn’t that just gorgeous!! 😍 Let’s first talk about what the graph represents.

We have the number of people vaccinated against Covid in the UK and the USA. Even though the USA seems to be well ahead in the number of vaccinations, the dashed lines representing the total populations tell a different story. The UK has vaccinated almost 80% of its population, whereas the USA is only at about 60%.
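A quick back-of-the-envelope calculation makes the comparison concrete. The sketch below takes the rough shares read off the chart (80% and 60%) and the population figures defined earlier, and converts them back into head counts. The shares are approximations from the chart, not exact values from the dataset.

```python
population_uk = 67.2e6    # population figures used in the article
population_usa = 329.8e6

# Approximate shares read off the chart
share_uk = 0.80
share_usa = 0.60

# Convert the shares back into head counts
vaccinated_uk = share_uk * population_uk
vaccinated_usa = share_usa * population_usa

print(f"UK:  about {vaccinated_uk / 1e6:.1f} million people")
print(f"USA: about {vaccinated_usa / 1e6:.1f} million people")
```

So the USA has vaccinated far more people in absolute terms, yet a smaller fraction of its population.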

Now let’s see what we have changed here.

  • Added some gridlines.
  • Created multiple plots on the same graph.
  • Shaded the region between the lines to highlight the differences.
  • Reduced the number of labels displayed on the x-axis to avoid jumbling.
  • Increased font sizes for the x and y labels
  • Added a MatPlotLib legend

The code above is well-commented so you can see which line does what.
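Two of these changes, thinning out the x-axis labels and hiding the chart edges, are useful on almost any plot, so here they are in isolation. This is a minimal sketch on made-up data: the dates are generated rather than taken from the dataset, and the tick limit of 5 is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# Made-up daily dates as strings, the same form the dataset uses
dates = pd.date_range("2021-01-01", periods=30).strftime("%Y-%m-%d").tolist()
values = list(range(30))

fig, ax = plt.subplots(figsize=(14, 10))
ax.plot(dates, values)

# Show only a handful of x ticks instead of one per date
ax.xaxis.set_major_locator(MaxNLocator(5, integer=True))
ax.tick_params(axis="x", labelrotation=45)  # tilt the remaining labels

# Hide all four edges of the plot area
for side in ("top", "right", "left", "bottom"):
    ax.spines[side].set_visible(False)

tick_count = len(ax.get_xticks())
```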

Now let’s see a few more examples.

# creating some additional data
countries = ["United States", "United Kingdom", "India", "Pakistan", "China"]
single_date = covid[(covid["date"] == "2021-11-05") & (covid["country"].isin(countries))]

# adding population for more countries
population_china = 1.402e9
population_pak = 220.9e6
population_india = 1.38e9

# look up each row's population by country name so the values line up with the bars
populations = single_date["country"].map({
    "China": population_china,
    "India": population_india,
    "Pakistan": population_pak,
    "United Kingdom": population_uk,
    "United States": population_usa,
})

fig = plt.figure(figsize=(10, 7))  # create figure and set figure size
plt.scatter(single_date["country"], populations, s=40, color="red", zorder=4,
            label="Total population of country")
bars = plt.bar(single_date["country"], single_date["total_vaccinations"], edgecolor="blue", zorder=3,
               label="Total vaccinations done")

plt.grid(linewidth=0.5, zorder=-1)  # show grid
plt.legend(loc=1)  # show legend

# color a bar green when the vaccinations exceed the population
for bar, vaccinations, population in zip(bars, single_date["total_vaccinations"], populations):
    if vaccinations > population:
        bar.set_color("green")
A customized bar chart

This bar chart shows the total vaccinations in each country. A bar is colored green if the number of vaccinations exceeds the total population of the country, which is possible because the vaccines are given in multiple doses. This gives a quick gist of how well the vaccinations are going.

Now finally a good-looking pie chart.

vacc_demo_pak = covid_pak[covid_pak["date"] == "2021-11-05"]
fully_vaccinated = vacc_demo_pak["people_fully_vaccinated"].iloc[0]
partially_vaccinated = vacc_demo_pak["people_vaccinated"].iloc[0] - fully_vaccinated
unvaccinated = population_pak - vacc_demo_pak["people_vaccinated"].iloc[0]

# organising data for the pie chart
data = [partially_vaccinated, unvaccinated, fully_vaccinated]
mylabels = ["Partially Vaccinated", "Unvaccinated", "Fully Vaccinated"]
mycolors = ["blue", "red", "green"]

fig = plt.figure(figsize=(10, 7))
_ = plt.pie(data, labels=mylabels, explode=[0, 0, 0.2], shadow=True, autopct="%1.1f%%", colors=mycolors)
_ = plt.title("Vaccination demographics in Pakistan")
Pie chart representing the Covid vaccinations in Pakistan 

Now that looks so much better than the earlier one. It even highlights the share of people that are fully vaccinated by exploding that slice, that is, by pulling it slightly out of the pie.


Conclusion

This was a very long article, but it still does not cover half of the capabilities of the MatPlotLib library. The data visualization examples shown above are just to get you started; there is so much more to it. Only with practice will you be able to avoid bad data visualization practices and carry out some amazing data visualization projects.
