Applied Data Science with python and Pandas

Data Science is a very important skill that has become a necessity of the 21st century. With the increase in data, the need for data scientists has increased. When it comes to learning Data Science, Python is the go-to programming language. Python comes with a plethora of tools that aid the data science process and one of the most prominent tools is Pandas. In this article, we will be looking at how to process data using the Pandas library such that it can be used for other tasks such as Analysis or Machine Learning.

What is Data Science?

Surprisingly there is no right answer to this question. People define Data Science only in the way that THEY work with data. In reality, Data Science is a general term that covers all those techniques and procedures that we apply to data. This is why when looking at a Data Science job you will find a variety of tools, as a requirement for the job, that may or may not be related to each other e.g. SQL, Python, R, or Power BI

A person that works with extracting data and ETL would be proficient with SQL but may or may not know Python. Conversely, a person who builds Machine Learning models will be proficient in Python but may or may not know SQL. However, both of these can call themselves data scientists since they work with data but have different responsibilities.

In this article, I will be explaining Data Cleaning and filtering as part of the Data Science domain. If you are interested in Machine Learning then you can start reading here.

Downloading data

If you land a data science job then you will be working with real-world collected by whichever organization you work for. You will have to extract that data from an SQL database using …you guessed it…SQL!. However, for the purpose of this tutorial, I will be downloading an open-source dataset from Kaggle. 

We will use the “World Energy Consumption” dataset which you can download from here.

Getting started with pandas

This entire notebook is available on the following GitHub repository.

You will first have to install pandas to start working. It can be done by simply running the following command in your command prompt.

pip install pandas

If you are unfamiliar with working with the command prompt or just generally installing packages for python, you can start reading here.

Once Pandas is installed we are ready to load the data set.

Import pandas first.

import pandas as pd

Notice that we have written ‘as pd’ in the import statement above, this is how we alias imports in Python. Since it is difficult to write the compete term Pandas everywhere you use the library, you simply reference it by using the term ‘pd’. 

Pandas allow you to load data from various sources such as excel sheets, CSV(comma separated values) files, text files, and even SQL tables. Our data is in the form of a CSV file so we will use the following function to load our dataset.

world_power_data = pd.read_csv('World Energy Consumption.csv')

Let’s view what we have in the data.

Head of the data frame from pandas in python
top 5 rows of the dataset

The .head() function provides us with the first 5 rows of the dataset (the head of the dataset). We can enter any number inside the parenthesis to pass it as an argument and the function will return the number of rows equal to the number passed.

Conversely, we also have a .tail() function which does the exact opposite i.e. returns the last 5 rows of the dataset.

Looks like we have 122 columns in this dataset, let’s get an overview of the data.
Information about the dataframe

Alright, so we have 119 columns with float values, 1 column with integers, and 2 columns as objects. An Object data type in Pandas simply means that the column contains values of multiple datatypes. 

We also now know that the dataset has taken up 16.2+ MB of RAM.

So this is a very huge dataset.

What is a data frame?

Before we move on to carry out an analysis of the data, it is important that we understand what a data frame is and what its structure is. A data frame has 3 essential parts. 

  1. Header Row
  2. Index
  3. Data
Structure of a data table
Structure of a data table

The Header row defines the column names of the data. The Index gives an identifier to each row and the data is the main crux of the data frame, the numerical data which holds meaning.

We can jump to any row or column of the dataset using the indexes and columns.

By Index

world_power_data.loc[45:50] #return data rows from index no. 45 to 50
Indexing dataframe by index

By Column

world_power_data[["country", "year", "coal_prod_change_pct", "coal_prod_change_twh" ,"gas_prod_change_pct"]].head() #show only certain columns
Indexing data frame by column

Removing 2020 data

Looks like this dataset contains values till the year 2019 only. Let’s remove all rows with sparse data from the year 2020. We will do this in the following steps.

  • Get indexes of the data rows for the year 2020
twentytwenty_idx = world_power_data[world_power_data['year'] == 2020].index

Now tell pandas to drop the columns with these indexes.

world_power_data.drop(index = twentytwenty_idx, inplace = True)

We have the data removed but we have messed up the indexes so for consistency across the entire data frame we need to reset the indexes.

world_power_data.reset_index(inplace = True)

Keeping only focus countries

Now we want to perform our analysis only on some specific developed countries since these are the ones that are the major energy consumers in the world.

#our focus countries
focus_countries = ['United States', 'Canada', 'Russia', 'United Kingdom', 'China']
#filter for focus countries
power_data = world_power_data[world_power_data["country"].isin(focus_countries)]
Reduced data frame with specific countries only

Filtering data based on year

We have data since the year 1900 (which really raises questions on the credibility of this dataset XD). We want to analyze data for only the 21st century so let’s get to filtering.

power_data = power_data[power_data["year"] >= 2000] #return data for only years greater than or equal to 2000
Data frame filtered on year

Column Filtering

We have a total of 122 columns in this data set. We are only interested in the data of consumption of energy so we only want to keep those columns.

We will start by creating a list of all the column names which have the term ‘consumption’ in them.

#get column names with the word consumption in them
consumption_cols = [col for col in power_data.keys()[power_data.keys().str.contains('consumption')]]

Next, we will still need identifiers for the country name and year.

consumption_cols.append('country') #append country column
consumption_cols.append('year') #append year column
                                                                                          Only show these specific columns

Let’s reduce the dataset to only show these columns. Note, since the year and country are appended at the end of the list they will appear as the last columns, this is okay for our analysis but if you want to fix this for aesthetic reasons then you can have several workarounds.

power_data = power_data[consumption_cols]
Reduced columns in our dataset.
Oly 15 columns left

Total energy consumptions

Our next problem statement is to find the total energy consumed by each of these countries over the last 2 decades for each of the energy sources. We will use the group by statement for this purpose, group by is a very useful and important aggregation tool.

#group by country and sum the values in the remainig columns
total_power_consumed = power_data.groupby(['country']).sum()

The above statement tells pandas to group by the data using the ‘country’ column and then sum the values in the remaining columns. The illustration explains it better.

Showing how a groupby clause works in data science
Groupby and sum data

You can have any type of aggregation you want for example average, count etc. The aggregation of the year column does not make sense and it does not even matter for our case so we can ignore this column or even drop it if we want.

Data grouped by country name.
Grouped data

We now have the total power consumed by each country for each resources type. We want this data to be saved separately so that we don’t have to carry out the analysis every time.

total_power_consumed.to_csv('Total_Power_Consumed.csv') #save as csv

The above function saves our data as a separate CSV file.

Let’s perform some more analysis. Say we want to answer the following questions.

Which resource is used most widely for energy consumption by each country?

Before we answer this question we need to take care of one problem. There is a column in our data named ‘primary_energy_consumption’ which gives a vague number for total energy consumed but since we want to data specifically for individual energy resources. So we will drop this column.

total_power_consumed.drop(columns = ['primary_energy_consumption'],inplace = True)

Now we will first see which country uses which energy resource most. For this we will sum the dataset column-wise and return the name of the column with the highest sum.

total_power_consumed.idxmax(axis = 1) #which resource is consumed most
Data for which power resource is most utilized by each country

Looks like everyone uses fossil fuel as their primary energy generator.

Now we will see in what amount are these consumed. For this, we will simply return the max of the column-wise sums.

print(total_power_consumed.max(axis = 1))
Data for max power consumed by each country
Maximum consumption of power

So we see that china actually utilizes 48k+ TW of power from fossil fuels alone!!

Now for one final question, let’s answer how the usage trend changed in America over the last two decades for Solar, Wind, Nuclear, and Fossil Fuel power.

First, we filter the data for only US Data.

united_states = power_data[power_data["country"] == "United States"]
united_states.reset_index(drop = True, inplace =True)

Next, we plot the data for the respective columns. Even though we have specialized tools for plotting Python, however, Pandas itself also allows you to plot data directly from a data frame. If you want to learn plotting in python, you can read below.

For Pandas, we do the following.

['solar_consumption','wind_consumption', 'nuclear_consumption', 'fossil_fuel_consumption'],\
logy = True ,figsize= (16,12), title = "Trend of Energy Consumption in America")

Here we have specified the following items.

  1. ‘year’ should be used for the x-axis.
  2. Columns used for y-axis values.
  3. Use logarithmic y-axis (To see trends clearly)
  4. Set figure size to [16×12]
  5. Give the plot a title.
A trend plot of all the power consumption in the united states

Looks like the use of Fossil fuel and Nuclear energy has been quite consistent however there has been a sharp rise in the use of Wind and Solar power. This is most probably because of the increase in demand for electricity.

Final Thoughts

So what you saw above is just one part of Data Science. This is what we call data processing. You ask yourself what sort of insight you want from the data and then you apply different processing techniques to gain that insight. This article aimed and showed you a handful of techniques that you can apply to your dataset to gain valuable insight.

Leave a Reply

Your email address will not be published.

3 replies on “Applied Data Science with Python and Pandas”