0% found this document useful (0 votes)
6 views

Note 4

This document discusses fuzzy string matching techniques, which are used to find approximate matches between strings in various applications such as spell checking and plagiarism detection. It covers several algorithms including Levenshtein, Damerau-Levenshtein, Bitap, and n-gram, explaining their methodologies and implementations in Python. The article also highlights the differences between online and offline algorithms and provides references for further reading.

Uploaded by

adhinanm12
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

Note 4

This document discusses fuzzy string matching techniques, which are used to find approximate matches between strings in various applications such as spell checking and plagiarism detection. It covers several algorithms including Levenshtein, Damerau-Levenshtein, Bitap, and n-gram, explaining their methodologies and implementations in Python. The article also highlights the differences between online and offline algorithms and provides references for further reading.

Uploaded by

adhinanm12
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 1

Fuzzy matching

algorithms
Madhurima Nath, PhD · Follow
5 min read · Jan 8, 2024

Listen Share

Fuzzy string matching is technique to find


strings which have approximate matches.
They are widely used in spell checkers, de-
duplication of records, master data
management, plagiarism detection,
bioinformatics and DNA sequencing, spam
filtering, content searches, similarity
matches etc.

This article will cover a few algorithms —


Levenshtein, Damerau-Levenshtein, Bitap
and n-gram — which are implemented for
such approximate string matchings. The
detailed python implementation and codes
are available in the Jupyter notebook in
the GitHub repo.

(This was presented as a Women Who


Code Data Science session in April 2022.
The recording of the session is on
YouTube.)

Introduction to fuzzy matching


String matching or fuzzy matching is a
method to find strings which match a
given pattern or string approximately. It
identifies the likelihood or probability that
two records are true match based on some
parameters. In the example shown below,
the algorithms try to match the 5 different
variations to the given string ‘Microsoft
Corporation’ and score each of them based
on how close they match the true value.

Example where string/fuzzy matching is used

Algorithms used
Most commonly used fuzzy matching
algorithms involve calculating the edit
distance metrics between the strings. Edit
distance metric quantifies how dissimilar
two strings are by counting the minimum
number of operations required to transform
one string into the other. Some of the well-
known distance metrics are

Levenshtein distance

Damerau–Levenshtein distance

Longest common subsequence

Hamming distance

Jaro distance

Bitap algorithm (shift-or, shift-and


algorithm or Baeza-Yates–Gonnet
algorithm) which tells whether a given text
contains a substring which is
“approximately equal” to a given pattern
also makes use of Levenshtein distance. It
is very efficient for relatively short pattern
strings. The Bitap algorithm is the heart of
the utility function agrep/grep in Unix
systems.

n-gram algorithm uses a Markov model to


predict the next item in a sequence of text.
n-gram is a pattern of n characters (words,
letters, symbols etc.) in some particular
order. In text processing, the use of n-
grams help capture some information
related to the order of words.

The main difference between the Bitap


and n-gram algorithms is that the former
is an on-line method, while the latter is the
off-line one. On-line algorithm do the
search without an index and therefore
their performance on large data is
inefficient. The use of indexing in off-line
techniques makes the search drastically
faster and is widely used for text
processing.

Other commonly used algorithms are


Needleman–Wunsch algorithm, Smith–
Waterman algorithm, BK Tree metric,
Soundex or Metaphone (this is a phonetic
algorithm).

Edit distance metrics — Levenshtein


distance & Damerau-Levenshtein
distance
The edit distance metric measures the
number of edits needed to transform one
word into another. Levenshtein distance is
a popular method to calculate edit
distance metric. The figures below show
how the Levenshtein and Damerau-
Levenshtein distances work and the
difference between these two.

Levenshtein and Damerau-Levenshtein distances

Calculating Levenshtein and Damerau-Levenshtein


distances

The math behind the algorithm is


explained below

Math behind calculating the Levenshtein distance

Levenshtein distance has the following


properties:

It is zero if and only if the strings are


equal.

It is at least the difference of the sizes


of the two strings.

It is at most the length of the longer


string.

Triangle inequality: The Levenshtein


distance between two strings is no
greater than the sum of their
Levenshtein distances from a third
string.

Implementation in python

Bitap algorithm
This is an on-line method of searching
(i.e., search without indexing) and uses
Levenshtein distance to calculate
approximate equality between the search
string and the given pattern. Bitap
algorithm uses bitwise operations on the
bitmasks (a bitmask is the data used for
bitwise operations, and multiple bits in a
byte, word etc. can be set either on or off,
or inverted from on to off or vice versa in a
single bitwise operation using bitmasks)
which are extremely fast. It performs best
on patterns of short lengths due to the
underlying data structures.

Example 1:
input text: womenwhocode, pattern: code
output: Pattern found at index: 8

Example 2:
input text: youareawesome, pattern:
youareamazing
output: No Match

Implementation in python

this is my python implementation of the code in


GeekforGeeks for bitap seach

this is my python implementation of the code in


GeekforGeeks for bitap search

n-gram algorithm
This algorithm predicts next item in a
sequence of text in form of a Markov
model. It is an off-line search, i.e., the
search is performed on the indices,
making this much computationally
efficient for large data. Currently, n-gram
techniques are used in almost every
Natural Language processing algorithms.
n-gram is a set of values generated from a
string by pairing sequentially occurring n

characters/words. The goal is to compute


probability of a sequence of
characters/words or sentence.

Math of n-gram algorithm

In this notebook, these algorithms are


applied on a simple example to show the
similarities/differences between these
methods.

References
1. Levenshtein, Vladimir I. “Binary codes
capable of correcting deletions,
insertions, and reversals.” In Soviet
physics doklady, vol. 10, no. 8, pp. 707–
710. 1966.

2. Damerau, Fred J. “A technique for


computer detection and correction of
spelling errors.” Communications of
the ACM 7, no. 3 (1964): 171–176.

3. Cayrol, M., Farreny, H. and Prade, H.


(1982), ‘Fuzzy Pattern Matching’,
Kybernetes, Vol. 11 №2, pp. 103–116.

4. Ukkonen, Esko. “Algorithms for


approximate string matching.”
Information and control 64, no. 1–3
(1985): 100–118.

5. Geek for Geeks — applications of fuzzy


string matching

6. Geek for Geeks — Bitap Algorithm

7. Stanford slides on n-gram

8. Data camp tutorial — fuzzy string


matching

9. Levenshtein distance theory

10. Article on record linking and fuzzy


matching

11. Medium post on Levenshtein distance

12. stackoverflow for n-gram similarity

130 1
Machine Learning NLP

Fuzzy Matching String Matching

Naturallanguageprocessing

130 1

Follow

Written by Madhurima Nath,


PhD
55 Followers · 2 Following

Data professional, physicist, passionate about diversity


in STEM

Responses (1)

To respond to this story,


Open in app
get the free Medium app.

Akash Agarwal
Apr 20, 2024

Bitmap

did you mean Bitap instead right?

1 reply

More from Madhurima Nath, PhD

Madhurima Nath, PhD

Topic modeling algorithms


Learn about the mathematical concepts behind
LDA, NMF, BERTopic models

Aug 21, 2023 27

Madhurima Nath, PhD

Residual plots in Linear Regression


in R
Learn how to check the distribution of residuals
in linear regression.

Aug 13, 2023 13

Madhurima Nath, PhD

Interpretation of output values of a


simple linear regression model in R
The output variables and functions of the linear
regression model generated in R differs from…
that in python. This article discusses what…
Aug 12, 2023 7

Madhurima Nath, PhD

Implementation of end-to-end
machine learning solution
Solutioning and designing the end-to-end
architecture for a enterprise wide/large scale…
implementation of machine learning models
Aug 20, 2023 1

See all from Madhurima Nath, PhD

Recommended from Medium

In Coding Beauty by Tari Ibaba

This new IDE from Google is an


absolute game changer
This new IDE from Google is seriously
revolutionary.

Mar 12 4.2K 232

Roya

Minimum Height Trees


This blog series attempts to solve the 500 Top
Leet Code Interview Questions with the help of…
AI Code Assistance, such as Gemini and GPT.
Mar 11

Python Coding

Network Graph using Python


This code snippet demonstrates how to create
and visualize a simple network graph using the…
networkx and matplotlib libraries in Python.
Nov 10, 2024 5 1

Sebastian Carlos

Fired From Meta After 1 Week: Here’s


All The Dirt I Got
This is not just another story of a disgruntled ex-
employee. I’m not shying away from the seriou…
corporate espionage or the ethical…
Jan 8 20K 462

In Science Spectrum by Laurel W

Simple Ways to Tell if Python Code


Was Written by an LLM
Yes, We Can Tell

Mar 23 987 50

In Level Up Coding by Jacob Bennett

The 5 paid subscriptions I actually


use in 2025 as a Staff Software…
Engineer
Tools I use that are cheaper than Netflix

Jan 7 12.4K 310

See more recommendations

You might also like