Regex 101 for Python Data Science

Regular expressions (regex) are a pattern-matching technique, widely used in NLP, for finding structure and format in text documents. This article is a regex 101 guide to help you start working with regular expressions.

When working with data, we rarely come across clean, processed data. Before it can be used anywhere, we need to apply a lot of processing to make it useful. One of these techniques is pattern matching in strings, either to keep only the desired pattern or, conversely, to remove unwanted patterns. We find these patterns using Regular Expressions, or regex for short. A regex is a compact, widely supported syntax for describing the format of a string.

In this article, we will explore what these expressions are and how we can utilize them in Python to cleanse our data. To learn about other data processing techniques click here.



Understanding Regex

Regex are a set of rules defined to find patterns in strings.

Regex are defined by a set of characters. There are ordinary characters and then there are certain special characters. Together these are used to define your regular expression.

Ordinary Characters

These are the basic alphabetic, numeric, or symbolic characters that we use every day. These are mostly used in an exact match scenario. For example, if I want to find all the exclamation marks within a piece of text then my regex would be simply ‘!’. The same applies to matching entire words or sentences as it is.

Special Characters

What we discussed above is string matching but that’s not really what regex are used for. These expressions are created to match complex patterns within the text and that is where we use the special characters. The table below represents some of the most common special characters used.

Character           Explanation
. (Dot)             Matches any character except a newline
* (Asterisk)        Matches 0 or more repetitions of the preceding expression
+ (Plus)            Matches 1 or more repetitions of the preceding expression
^ (Caret)           Matches the start of a string
$ (Dollar)          Matches the end of a string
? (Question mark)   Matches 0 or 1 repetitions of the preceding expression; 'ab?' matches either 'a' or 'ab'
{m}                 Matches exactly m repetitions of the preceding expression
{m,n}               Matches at least m and at most n repetitions of the preceding expression
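
To make the table concrete, here is a minimal sketch that runs one small pattern per special character through re.findall(). The sample strings and the names `samples` and `results` are made up for illustration; they are not part of the article's data.

```python
import re

# One illustrative pattern per special character (sample strings are hypothetical).
samples = {
    r"h.t": "hat hot h\nt",            # . matches any character except a newline
    r"ab*": "a ab abbb",               # * zero or more 'b' after 'a'
    r"ab+": "a ab abbb",               # + one or more 'b' after 'a'
    r"ab?": "a ab abbb",               # ? zero or one 'b' after 'a'
    r"^Up": "Up! Get Up!",             # ^ only at the start of the string
    r"Now!$": "Now! Up! Now!",         # $ only at the end of the string
    r"o{2}": "o oo ooo",               # {m} exactly two 'o'
    r"o{2,3}": "o oo ooo oooo",        # {m,n} two or three 'o'
}

results = {pattern: re.findall(pattern, text) for pattern, text in samples.items()}

for pattern, matches in results.items():
    print(f"{pattern!r:12} -> {matches}")
```

Reading the printed matches next to the table is a quick way to check that each symbol behaves as described.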

Let’s look at a few examples. We will use the following text from Harry Potter and the Philosopher’s Stone.

Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day.

“Up! Get up! Now!”

Harry woke with a start. His aunt rapped on the door again.

Harry Potter and the Philosopher’s Stone.

If you want to match an exact string then all you need to do is pass in that exact string as the expression. For example, if you want to find Harry Potter in the above text then you will write “Harry Potter” as the expression.
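
As a minimal sketch of exact matching, the snippet below searches a shortened copy of the excerpt above for literal strings. The variable name `excerpt` is an assumption made for this illustration.

```python
import re

# A shortened copy of the excerpt quoted above (variable name is illustrative).
excerpt = (
    "Yet Harry Potter was still there, asleep at the moment, but not for long. "
    "“Up! Get up! Now!” Harry woke with a start."
)

# Ordinary characters match themselves, so the pattern is just the text we want.
full_name = re.findall("Harry Potter", excerpt)
first_name = re.findall("Harry", excerpt)
exclamations = re.findall("!", excerpt)

print(full_name)          # every occurrence of the exact phrase
print(len(first_name))    # how many times "Harry" appears
print(len(exclamations))  # how many exclamation marks appear
```

For a plain yes/no check on a literal string, Python's `in` operator ("Harry Potter" in excerpt) does the same job without a regex.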

You can write a regex that matches only a specific set of characters, for example only alphabetic characters or only digits in a particular text. This is done using square brackets ‘[]’. The square brackets enclose the set of characters we want; e.g. if we want only alphabetic characters we write:

[A-Za-z]

This specifies two ranges of characters: A to Z and a to z, so both uppercase and lowercase letters are matched. We can also write an expression that matches all characters in a text.
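
The character-class idea can be sketched as follows. The sample sentence is made up for illustration, and the `+` after each class is an addition so that whole words and numbers are returned rather than single characters.

```python
import re

# Hypothetical sample sentence for illustration.
sentence = "Harry woke at 7 with a start."

# [A-Za-z]+ : one or more letters from either range, so words of any case.
letters = re.findall(r"[A-Za-z]+", sentence)

# [0-9]+ : one or more digits.
digits = re.findall(r"[0-9]+", sentence)

# An alternative to listing both ranges: one range plus the IGNORECASE flag.
letters_ignorecase = re.findall(r"[a-z]+", sentence, flags=re.IGNORECASE)

print(letters)
print(digits)
```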

Let’s say we want to match the entire string that is between quotation marks. Our expression will look like this:

".*"

The above expression can be read as:

Character   Explanation
"           Look for a quotation mark
.           Look for any character afterward
*           Applies to the preceding character: look for any number of occurrences of it (the dot in our case, so any number of occurrences of any character)
"           Look for another quotation mark to end the search

Explanation of the above regular expression

So the regex above will look for a quotation mark, then any characters after it until another quotation mark is found. Because * is greedy, the match extends to the last quotation mark on the line, not the nearest one.
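
The quoted-string pattern is worth trying on a line with more than one quotation, because `.*` takes as much text as it can. The sketch below uses a made-up sentence and compares the pattern with two common variants; the lazy `.*?` form and the `[^"]*` form are additions for illustration, not part of the original walkthrough.

```python
import re

# Hypothetical line containing two separate quotations.
line = 'She said "Up!" and then "Now!" very loudly.'

# Greedy: .* runs to the LAST quotation mark on the line.
greedy = re.findall(r'".*"', line)

# Lazy: .*? stops at the NEAREST closing quotation mark.
lazy = re.findall(r'".*?"', line)

# Negated class: anything except a quotation mark between the two quotes.
negated = re.findall(r'"[^"]*"', line)

# The excerpt above uses curly quotes, which need the curly characters in the pattern.
curly = re.findall(r"“.*?”", "“Up! Get up! Now!”")

print(greedy)
print(lazy)
```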

All of this might seem difficult to understand. Regular expressions can be hard to read, and the only real way to learn them is to practice. So instead of just writing them, we will now see them in action in Python.

Regex in Python

Read till the end for a bonus tip (you don’t want to miss out on this.)

You can try out the entire code used in this tutorial by cloning the following repository. Click here.

If you want to learn to build a Jupyter notebook like the one in the git repository, then read here.

First, we need to import the Python library that is designed to work with regular expressions.

import re

Now let’s find some text where we need matches.

txt = '''We have contacted Mr. Jhon Doe and havbe confirmed that he will be joining us for the meeting this evening. If you would like to contact him yourself you can call him on +1-415-5552671 or email him at [email protected] We also have the contact details of his assistant, you can contact him in case Mr. Doe does not respond. The assistants email id is [email protected]'''

The above text has a lot of information. We don’t need it all. Let’s say we just want to extract the contact information from it.

Let’s start with the email.

email_regex = r'\S+@{1}\S+(.com){1}'  # regex to find an email address
x = re.search(email_regex, txt)
print(x)  # print out the result
Regex object

The above object shows the match that was found with the regex as well as the indexes on which the match was found.

Let’s first see how to read this regex.

Symbol   Meaning
\S       Matches a non-whitespace character
+        Repeats the preceding \S one or more times
@        Matches a literal ‘@’ symbol
{1}      Requires exactly one ‘@’ symbol
\S       Again matches a non-whitespace character
+        Requires at least one non-whitespace character
(.com)   Intended to match ‘.com’; because an unescaped dot matches any character, write \.com to match a literal dot
{1}      Requires exactly one occurrence of the ‘.com’ group

The re.search() function returns only the first occurrence of the match, as a match object, or None when nothing matches. It is most useful for checking whether a match is present and where it occurs. The 're' library has many other useful functions that we can utilize.
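
A minimal sketch of working with the object that re.search() returns. The sample sentence and the address someone@example.com are hypothetical placeholders, and the pattern is a simplified variant with the dot escaped.

```python
import re

# Hypothetical sample text; the address is a placeholder, not from the article.
sample = "Reach the hypothetical address someone@example.com today."
pattern = r"\S+@\S+\.com"

match = re.search(pattern, sample)

if match:
    found = match.group()        # the matched text
    start, end = match.span()    # start and end indexes of the match
    print(found, start, end)

# When nothing matches, re.search() returns None, so test before using it.
no_match = re.search(pattern, "no address here")
print(no_match is None)
```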

Find all occurrences

To return all matches of the provided regex we use the re.findall() function.

emails = re.findall(r'\S+@{1}\S+(?:\.com)', txt)  # finding all emails
print(emails)
Emails found in the text

There is one slight difference to take note of here. I changed the regex a little when using it with re.findall(). The reason is that when a pattern contains a capturing group, findall() returns the text captured by the group rather than the entire match. When we specified (.com) at the end, it was treated as a group and only that group was returned. Placing ‘?:’ at the start of the parentheses makes it a non-capturing group (according to the documentation), so the function returns the entire match. The dot is also escaped as \. so that it matches a literal period rather than any character.
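
The effect of the group on re.findall() is easy to see side by side. This sketch uses a made-up sentence with placeholder addresses; the re.finditer() variant at the end is an extra option shown for comparison.

```python
import re

# Hypothetical sample text with placeholder addresses.
text = "Write to first@example.com or second@example.com."

# Capturing group: findall() returns only what the group captured.
capturing = re.findall(r"\S+@\S+(\.com)", text)

# Non-capturing group: findall() returns the entire match.
non_capturing = re.findall(r"\S+@\S+(?:\.com)", text)

# Alternative: keep the capturing group and read the full match from each match object.
via_finditer = [m.group(0) for m in re.finditer(r"\S+@\S+(\.com)", text)]

print(capturing)
print(non_capturing)
```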

Substitute Expressions

We can use regular expressions to remove certain parts of a string by specifying its pattern.

This is done using the re.sub() function.

substituted_string = re.sub(r'\S+@{1}\S+(.com){1}', '', txt)  # remove emails from the given text
print(substituted_string)
We have contacted Mr. Jhon Doe and havbe confirmed that he will be joining us for the meeting this evening. If you would like to contact him yourself you can call him on +1-415-5552671 or email him at . We also have the contact details of his assistant, you can contact him in case Mr. Doe does not respond. The assistants email id is .

As you can see in the text above, the emails have been replaced with a blank string. This is a good way of removing unwanted entities from your text.

Another good use of this is redacting documents. Say we want to forward the above text to someone but hide the emails. We can do it as follows.

redacted = re.sub(r'\S+@{1}\S+(.com){1}', '<email>', txt)  # place email tags in the text
print(redacted)
We have contacted Mr. Jhon Doe and havbe confirmed that he will be joining us for the meeting this evening. If you would like to contact him yourself you can call him on +1-415-5552671 or email him at <email>. We also have the contact details of his assistant, you can contact him in case Mr. Doe does not respond. The assistants email id is <email>.

Our receiver now knows that a certain portion is to contain an email address, but the actual address is hidden from them.
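
The removal and redaction steps can be sketched on a small, self-contained string. The note text, the placeholder addresses, and the helper `mask_user` are all hypothetical; passing a function as the replacement is a standard re.sub() feature, used here as one possible way to hide only part of each address.

```python
import re

# Precompiled pattern (simplified variant with the dot escaped).
EMAIL = re.compile(r"\S+@\S+\.com")

# Hypothetical note with placeholder addresses.
note = "Email alice@example.com or bob@example.com for details."

removed = EMAIL.sub("", note)        # delete the addresses
tagged = EMAIL.sub("<email>", note)  # replace them with a tag


def mask_user(match):
    """Hypothetical helper: hide the user name but keep the domain."""
    _user, domain = match.group().split("@", 1)
    return "***@" + domain


masked = EMAIL.sub(mask_user, note)

print(tagged)
print(masked)
```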

Multiple expressions in a single line

We can tell Python to look for several expressions with a single regex. This is done by separating the expressions with the | (bar) symbol. The complete expression will look like this:

(exp1)|(exp2)|(exp3)

We can specify as many expressions as we want. The bar is read as an OR and Python is told to find expression 1 OR expression 2 OR expression 3.

Let’s try it out. We can see that our text also has a phone number. Let’s redact all contact information from the text.

re.findall(r'\S+@{1}\S+(?:\.com)|\+[0-9]{1}-[0-9]{3}-[0-9]{7}', txt)
Contact information from the text

Seems like our multi-regex string is working ok.

# tagging all contact information as confidential
redacted = re.sub(r'\S+@{1}\S+(?:\.com)|\+[0-9]{1}-[0-9]{3}-[0-9]{7}', '<confidential>', txt)
print(redacted)
We have contacted Mr. Jhon Doe and havbe confirmed that he will be joining us for the meeting this evening. If you would like to contact him yourself you can call him on <confidential> or email him at <confidential> We also have the contact details of his assistant, you can contact him in case Mr. Doe does not respond. The assistants email id is <confidential>
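
For a version that runs on its own, here is a minimal sketch of the same OR pattern compiled once and reused for both finding and redacting. The message text and the address someone@example.com are hypothetical; the phone number is the one from the text above.

```python
import re

# Email pattern OR phone pattern, compiled once for reuse.
CONTACT = re.compile(r"\S+@\S+\.com|\+[0-9]-[0-9]{3}-[0-9]{7}")

# Hypothetical message; the address is a placeholder.
message = "Call +1-415-5552671 or write to someone@example.com today."

# Matches come back in the order they appear in the text.
found = CONTACT.findall(message)

# The same compiled pattern drives the redaction.
redacted_message = CONTACT.sub("<confidential>", message)

print(found)
print(redacted_message)
```

Compiling the pattern keeps the find and redact steps in sync, since both use exactly the same expression.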

BONUS TIP!!

If you want to practice regular expressions and want to see what your particular regex does then regexr.com is a very helpful tool.

Final Thoughts

Regex are very helpful if you work with data a lot. They help you make sure that your data is in the right shape and structure. Beyond data science, they are also widely used in software development, especially to validate that user input from the front end is in the right format. Regular expressions may seem a little difficult to understand at first, but they become clearer with practice.
