3 Best Libraries for Fuzzy Matching In Python

Learn about 3 of the most used fuzzy matching libraries in Python. These libraries offer fast implementations of multiple fuzzy matching algorithms.
Fuzzy matching in python

Data scientists have to deal with all sorts of data during their day-to-day tasks. One of the biggest challenges to data processing comes with unstructured data. Data is called unstructured when it can not be represented in a tabular (structured) manner. Unstructured data includes text, audio, video files, etc.

Text data is processed via use of special techniques such as fuzzy matching. Fuzzy matching refers to finding similarities between strings, these similarity metrics can be used for multiple purposes. Fuzzy matching can be done via multiple algorithms, each with its pros and cons. Let’s discuss these algorithms below.

Common Fuzzy Matching Algorithms

Fuzzy matching is used to check whether two strings are the same or different and, in the case of the latter, by what factor are they dissimilar. Fuzzy matching is often referred to as edit-distance, i.e., how much does string A need to be edited to match string B? There are multiple methods of calculating the similarity metric. Let’s discuss three of the most common ones.

  • Levenshtein Distance

Levenshtein distance was first used in 1965 by Vladimir Levenshtein. It decides the similarity between 2 strings by computing the minimum number of single-character edits required to convert the first string to the second string. Levenshtein considers the following operations as an edit:

  1. Addition
  2. Deletion
  3. Substitution

The smaller the number of required edits is, the closer the two strings are.

  • Damerau–Levenshtein distance

The Damerau-Levenshtein works in exactly the same way as the conventional Levenshtein algorithm but offers one slight improvement. Apart from the three edits mentioned above, Damerau-Levenshtein also takes into account transposition. Transposition corresponds to swapping two adjacent characters and is considered a single operation. Damerau stated that these 4 operations account for 80% of all human misspellings.

  • Jaro-Winkler Distance

The Jaro-Winkler distance algorithm was proposed in 1990. It builds upon the Jaro similarity algorithm, which uses the following equation to calculate the similarity.

  • m = number of matching characters
  • t is half of the number of transpositions
  • |s1| and |s2| are the lengths of each string, respectively.

The Jaro similarity returns a score between 0 and 1, where 0 represents no match, and 1 represents that the strings are exactly alike. Jaro-Winkler modifies the formula by applying more weightage to the first i  matching characters. I will not go into too much detail, but the final form is represented by the equation below.

Jaro Winkler similarity equation
Jaro Winkler similarity equation

Fuzzy matching libraries in python

Python has a lot of implementations for fuzzy matching algorithms. I have compiled a small list of some of the best libraries available for open-source use.

  1. FuzzyWuzzy

FuzzyWuzzy uses the Levenshtein distance implementation of fuzzy matching to give you a ratio score between the provided sentences.

Install the library using the following command.

fuzzywuzzy installation for fuzzy string matching
fuzzywuzzy installation

Once installed, the library can be used simply with one line, as shown below.

Testing fuzzywuzzy

The resulting ratio comes out to be 90, meaning the 2 sentences are 90% similar.

The library also comes with an additional package that improves the calculation speed up to 10x. Learn more about FuzzyWuzzy from their documentation.

2. Rapidfuzz

RapidFuzz uses the same implementation as FuzzyWuzzy but is mostly written in C++ and offers additional algorithmic improvements over it. Its C++ implementation makes this library extremely fast, and it offers some additional benefits over FuzzyWuzzy.

  • It is an MIT license, so you are free to choose whichever license for your project.
  • It includes additional implementations such as Jaro-Winkler, which are not present in FuzzyWuzzy

You can install this library using the following command.

Command for installing RapidFuzz in python for fuzzy matching
RapidFuzz Installation

Its functionality is exactly the same as FuzzyWuzzy but with the added benefit of speed.

Testing RapidFuzz
RapidFuzz Testing

The ratio comes out to be 80.7%

RapidFuzz also offers more functionality, similar to FuzzyWuzzy. Read more about them in the official documentation.

3. Jaro-Winkler

The Jaro-Winkler distance also has multiple implementations in python, but apparently, many of those are incorrect. According to one user on StackOverflow, the results produced by pyjarowinkler, a very old and popular python library, seem incorrect. I tested the same problem with the jaro-winkler library and got matching results to other correct implementations.

Install the library using the following command.

jaro winkler installation via pip command
installing jaro-winkler

Calculate the distance using the “jaro_winkler_metric” function. See the application below.

Testing jaro winkler similarity score
Testing Jaro-Winkler library

The similarity comes out as 78.3 %

Jaro-Winkler distance has a very important advantage over Levenshtein or Damerau–Levenshtein when calculating ratios or percentages. The latter-mentioned algorithms provide misleading results for smaller strings. Let’s see an example below.

Testing edit distance for short strings
A 67% match score
Result

Even if the strings only differ by one character, the ratio comes out to be 67% which is a very significant number. Now let’s compare it with the Jaro-Winkler distance.

Testing jaro winkler distance for short string
Result comparison between jaro-winkler and Levenshtein distance
Result comparison

The Jaro-Winkler algorithm returns a similarity percentage of 82.2% , which is a much more reasonable number.

Applications of Fuzzy Matching

String matching plays an important part in everyday applications. When you search something on google, and it corrects your typo with the famous “Including results for: …”.

Google using string matching to improve search results
Google using string matching to improve search results

In the background, Google actually runs your query through an edit-distance algorithm and matches it with every word in a pre-defined dictionary. Another example of the same functionality is the auto-correct feature on your phone.

It is also used in many search algorithms that pop up suggestions of terms that resemble the actual query. This makes the search feature more advanced as you can find the correct match even if you do not input the exact terms.

A more application of fuzzy-matching is in the de-duplication of records in a database. Many times our records contain similar data, so tricks like ‘groupby’ do not work. Fuzzy matching helps with picking out all records within the defined threshold.

Concluding Thoughts

Fuzzy matching is a simple yet very useful data processing technique. It is used in web applications in different aspects and is also included in many data processing pipelines. Python offers some amazing libraries that implement some form of fuzzy matching. These libraries offer simple APIs to calculate the string matching score and can be utilized in your applications.

Total
0
Shares
Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Prev
Memory Management and Garbage Collection in Python
macro shot photo of a computer RAM

Memory Management and Garbage Collection in Python

Python is a high-level interpreted, dynamically typed language

Next
Finding duplicates in a List: Python Programming Challenge 1
python programming

Finding duplicates in a List: Python Programming Challenge 1

The Python Programming Challenge series is a new addition at WritersByte where

You May Also Like