For young scientists looking for a data science job or internship, data processing is a must-have skill. You can always pick these skills up on the fly, but knowing the basics beforehand will both strengthen your profile and save you a huge amount of time on the job.
After thorough consideration, I have narrowed the list down to the 5 techniques that I learned the hard way, so you don’t have to.
Looking for NaN/empty values
Working with data in the 21st century no longer means a few hundred records. As a data science professional you will be looking at datasets that start at several thousand rows and go up to several million!
The most frustrating part of working with data at this scale is that it is impossible to verify it visually. Having NaN/NULL/inf values in the dataset is one such problem that will have you pulling your hair out, because these can be hard to identify. I remember when I first coded a linear regression predictor, my loss would constantly rise to infinity without any explanation. It took me two days of searching to realize that my dataset contained infinity values.
So whenever you get a dataset, the first step should always be to identify NaN/NULL/inf values. The below example shows how that can be done.
The data we have:
The data has a total of 110527 rows. A glance over the data doesn’t show any problems. But we have to be sure. Let's look for NaN values.
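As a minimal, self-contained sketch of this check (the column names mirror the dataset above, but the values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the appointment data; values are made up
data = pd.DataFrame({
    "Age": [25, 40, np.nan, 31],
    "Diabetes": [0, np.nan, 1, 0],
})

# Select the rows where at least one column holds a NaN value
bad_rows = data[data.isna().any(axis=1)]
print(len(bad_rows))  # 2
```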
The ‘isna()’ function is built into pandas and returns a data frame of Boolean values: True where there is a NaN value and False otherwise. Calling ‘any(axis=1)’ on that result collapses it to a single Boolean per row, True for any row that contains at least one NaN. Together, the two functions give us the rows where there are NaN values.
As we can see, we have multiple rows with NaN values in the Diabetes column. Now all we have to do is remove these corrupted rows from the original data.
#drop NA values from dataframe
data.dropna(inplace=True)

#check for NA values again
len(data[data.isna().any(axis=1)])
‘dropna()’ is also a built-in pandas function; it removes every row that contains a NaN value. The inplace=True flag simply ensures that the changes are made to the original data rather than to a copy.
The second statement returns a length of 0.
Similarly, you can remove inf values by first converting them to NaN and then dropping the NaNs.
data = data.replace([np.inf, -np.inf], np.nan).dropna()
The above single line would handle the conversion and dropping of useless data.
Notice how I have chained multiple functions into a single statement. This style of method chaining is common in pandas, and code like this is often referred to as ‘Pythonic’. (Note that chaining only works when each call returns a new data frame, which is why inplace=True cannot be used inside a chain.)
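Here is a runnable sketch of that chained cleanup on a toy column (the values are invented; the technique is the same replace-then-drop chain described above):

```python
import numpy as np
import pandas as pd

# Toy data containing both inf and NaN values
data = pd.DataFrame({"loss": [0.5, np.inf, 0.1, -np.inf, np.nan]})

# Convert inf/-inf to NaN, then drop every row containing NaN
clean = data.replace([np.inf, -np.inf], np.nan).dropna()
print(clean["loss"].tolist())  # [0.5, 0.1]
```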
Converting to date-time
Many times you will have to deal with dates. The problem is: what variable category do dates fall into? Are they integers? Are they strings, or are they alphanumeric? If you have worked with SQL, you will know that date-time is a type of its own.
Converting such data to date-time format makes it very easy to play around with dates.
The data we have used above has dates in it. So let’s get to it.
#convert to datetime format
data["ScheduledDay"] = pd.to_datetime(data["ScheduledDay"])
We have used the pandas to_datetime function to convert the ScheduledDay column to a datetime type. Now pandas knows that this column contains dates, which makes them much easier to work with in processing.
#extracting the day, month, and year from the date
print("Day:", data["ScheduledDay"].dt.day)
print("Month:", data["ScheduledDay"].dt.month)
print("Year:", data["ScheduledDay"].dt.year)
The datetime type provides accessors for extracting the individual elements of a date; note that on a whole column these are reached through the ‘.dt’ accessor. The above example demonstrates how this is done.
The datetime type also encapsulates time, so you can extract finer-grained features such as hours, minutes, and seconds.
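A small self-contained sketch of those time-level extractions (the timestamps here are invented, in the same ISO format the dataset uses):

```python
import pandas as pd

df = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-05-03T09:15:00Z"],
})
df["ScheduledDay"] = pd.to_datetime(df["ScheduledDay"])

# The .dt accessor exposes every component of the timestamps
print(df["ScheduledDay"].dt.hour.tolist())    # [18, 9]
print(df["ScheduledDay"].dt.minute.tolist())  # [38, 15]
```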
Consistent data type across columns
Data can be confusing sometimes. For example:
#defining variables
a = "2"
b = 2

#printing variables
print("a =", a)
print("b =", b)
Anyone who looks at the output would assume that ‘a’ and ‘b’ hold identical data; however, that is far from the truth. If we print the type of these variables, we get the full picture.
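That check is a single call to Python’s built-in type():

```python
# Same-looking values, different types
a = "2"
b = 2

print(type(a))  # <class 'str'>
print(type(b))  # <class 'int'>
```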
‘a’ belongs to the class str (string), while ‘b’ is an integer, and Python interprets the two very differently.
Why does this matter? Because raw data can have exactly this problem. Looking at the initially loaded dataset above, multiple fields look like integers, but by looks alone we cannot determine whether they are stored as integers or as strings.
We could check the data type of every element in a column, but for larger datasets, running such a check element by element is not reasonable.
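Fortunately, pandas stores one dtype per column, so a single call reveals them all. A sketch with toy columns (the values are invented; a column loaded as strings shows up as ‘object’):

```python
import pandas as pd

# Toy frame where a numeric-looking column was loaded as strings
data = pd.DataFrame({
    "Hipertension": ["0", "1", "0"],  # strings, not integers
    "Age": [25, 40, 31],
})

# One line shows every column's data type at once
print(data.dtypes)
# Hipertension    object
# Age              int64
```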
Here’s what we do:
data["Hipertension"] = data["Hipertension"].astype("int")
data["Alcoholism"] = data["Alcoholism"].astype("int")
The above statements ensure that the ‘Hipertension’ and ‘Alcoholism’ columns contain only integers. This step is not strictly critical, but performing it leaves us with a dataset we are sure is clean.
Filtering data frame rows (Boolean Masking)
A dataset may be spread over millions of rows, but that does not mean all of it is useful to you. Sometimes we want to view only the rows that satisfy some condition. This is where Boolean masks come in handy.
Let’s say for the given dataset I only want to view data of patients who are alcoholics. I would create the mask in the following manner.
data["Alcoholism"] == 1
However, this expression only produces Boolean values; we still need to apply the mask to the data frame.
#applying a boolean mask to the data
data[data["Alcoholism"] == 1]
In the single line above we have created the mask and applied it at the same time.
Now let's say I want patients who are alcoholics and over the age of 40. This mask would be created in the following way.
#applying more than one mask to the data
data[(data["Alcoholism"] == 1) & (data["Age"] > 40)]
The above line of code also demonstrates why maintaining data types is important: if the Age column held strings rather than integers, the comparison would not have worked.
Encoding categorical data
If you are working with machine learning models, categorical (text) data will do you no good. This data needs to be converted into a computer-readable format, i.e. numbers. In technical terms, this is called encoding.
Pandas allows a very simple method for encoding such categorical variables.
#encoding categorical variables
#factorize returns a (codes, uniques) tuple, so we keep the codes
data["No-show-labels"] = pd.factorize(data["No-show"])[0]
The pd.factorize() function maps each unique value in the ‘No-show’ column to an integer code (it also returns the array of unique labels alongside the codes). Codes are assigned in order of first appearance, so assuming ‘No’ appears first, the computer now understands that 0 means No and 1 means Yes.
This is a very helpful function as now we have data that can be directly fed to the algorithms.
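A self-contained sketch of that encoding, with toy labels in the same spirit as the ‘No-show’ column; note how factorize hands back both the integer codes and the original labels:

```python
import pandas as pd

df = pd.DataFrame({"No-show": ["No", "Yes", "No", "No", "Yes"]})

# factorize returns (codes, uniques); codes number the labels
# in order of first appearance
codes, uniques = pd.factorize(df["No-show"])
df["No-show-labels"] = codes

print(df["No-show-labels"].tolist())  # [0, 1, 0, 0, 1]
print(list(uniques))                  # ['No', 'Yes']
```

Keeping ‘uniques’ around is handy for decoding the predictions back into the original labels later.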