Data scientists have to deal with all sorts of data during their day-to-day tasks. One of the biggest challenges to data processing comes with unstructured data. Data is called unstructured when it can not be represented in a tabular (structured) manner. Unstructured data includes text, audio, video files, etc.
Text data is processed via use of special techniques such as fuzzy matching. Fuzzy matching refers to finding similarities between strings, these similarity metrics can be used for multiple purposes. Fuzzy matching can be done via multiple algorithms, each with its pros and cons. Let’s discuss these algorithms below.
Common Fuzzy Matching Algorithms
Fuzzy matching is used to check whether two strings are the same or different and, in the case of the latter, by what factor are they dissimilar. Fuzzy matching is often referred to as edit-distance, i.e., how much does string A need to be edited to match string B? There are multiple methods of calculating the similarity metric. Let’s discuss three of the most common ones.
- Levenshtein Distance
Levenshtein distance was first used in 1965 by Vladimir Levenshtein. It decides the similarity between 2 strings by computing the minimum number of single-character edits required to convert the first string to the second string. Levenshtein considers the following operations as an edit:
The smaller the number of required edits is, the closer the two strings are.
- Damerau–Levenshtein distance
The Damerau-Levenshtein works in exactly the same way as the conventional Levenshtein algorithm but offers one slight improvement. Apart from the three edits mentioned above, Damerau-Levenshtein also takes into account transposition. Transposition corresponds to swapping two adjacent characters and is considered a single operation. Damerau stated that these 4 operations account for 80% of all human misspellings.
- Jaro-Winkler Distance
The Jaro-Winkler distance algorithm was proposed in 1990. It builds upon the Jaro similarity algorithm, which uses the following equation to calculate the similarity.
- m = number of matching characters
- t is half of the number of transpositions
- |s1| and |s2| are the lengths of each string, respectively.
The Jaro similarity returns a score between 0 and 1, where 0 represents no match, and 1 represents that the strings are exactly alike. Jaro-Winkler modifies the formula by applying more weightage to the first i matching characters. I will not go into too much detail, but the final form is represented by the equation below.
Fuzzy matching libraries in python
Python has a lot of implementations for fuzzy matching algorithms. I have compiled a small list of some of the best libraries available for open-source use.
FuzzyWuzzy uses the Levenshtein distance implementation of fuzzy matching to give you a ratio score between the provided sentences.
Install the library using the following command.
Once installed, the library can be used simply with one line, as shown below.
The resulting ratio comes out to be 90, meaning the 2 sentences are 90% similar.
The library also comes with an additional package that improves the calculation speed up to 10x. Learn more about FuzzyWuzzy from their documentation.
RapidFuzz uses the same implementation as FuzzyWuzzy but is mostly written in C++ and offers additional algorithmic improvements over it. Its C++ implementation makes this library extremely fast, and it offers some additional benefits over FuzzyWuzzy.
- It is an MIT license, so you are free to choose whichever license for your project.
- It includes additional implementations such as Jaro-Winkler, which are not present in FuzzyWuzzy
You can install this library using the following command.
Its functionality is exactly the same as FuzzyWuzzy but with the added benefit of speed.
The ratio comes out to be 80.7%
RapidFuzz also offers more functionality, similar to FuzzyWuzzy. Read more about them in the official documentation.
The Jaro-Winkler distance also has multiple implementations in python, but apparently, many of those are incorrect. According to one user on StackOverflow, the results produced by pyjarowinkler, a very old and popular python library, seem incorrect. I tested the same problem with the jaro-winkler library and got matching results to other correct implementations.
Install the library using the following command.
Calculate the distance using the “jaro_winkler_metric” function. See the application below.
The similarity comes out as 78.3 %
Jaro-Winkler distance has a very important advantage over Levenshtein or Damerau–Levenshtein when calculating ratios or percentages. The latter-mentioned algorithms provide misleading results for smaller strings. Let’s see an example below.
Even if the strings only differ by one character, the ratio comes out to be 67% which is a very significant number. Now let’s compare it with the Jaro-Winkler distance.
The Jaro-Winkler algorithm returns a similarity percentage of 82.2% , which is a much more reasonable number.
Applications of Fuzzy Matching
String matching plays an important part in everyday applications. When you search something on google, and it corrects your typo with the famous “Including results for: …”.
In the background, Google actually runs your query through an edit-distance algorithm and matches it with every word in a pre-defined dictionary. Another example of the same functionality is the auto-correct feature on your phone.
It is also used in many search algorithms that pop up suggestions of terms that resemble the actual query. This makes the search feature more advanced as you can find the correct match even if you do not input the exact terms.
A more application of fuzzy-matching is in the de-duplication of records in a database. Many times our records contain similar data, so tricks like ‘groupby’ do not work. Fuzzy matching helps with picking out all records within the defined threshold.
Fuzzy matching is a simple yet very useful data processing technique. It is used in web applications in different aspects and is also included in many data processing pipelines. Python offers some amazing libraries that implement some form of fuzzy matching. These libraries offer simple APIs to calculate the string matching score and can be utilized in your applications.