Note 4

This document discusses fuzzy string matching techniques, which are used to find approximate matches between strings in various applications such as spell checking and plagiarism detection. It covers several algorithms including Levenshtein, Damerau-Levenshtein, Bitap, and n-gram, explaining their methodologies and implementations in Python. The article also highlights the differences between online and offline algorithms and provides references for further reading.

Uploaded by

adhinanm12

Fuzzy matching algorithms
Madhurima Nath, PhD
5 min read · Jan 8, 2024


Fuzzy string matching is a technique for finding strings that match a pattern approximately rather than exactly. It is widely used in spell checkers, de-duplication of records, master data management, plagiarism detection, bioinformatics and DNA sequencing, spam filtering, content search and similarity matching.

This article covers a few algorithms used for approximate string matching: Levenshtein, Damerau-Levenshtein, Bitap and n-gram. The detailed Python implementations are available in the Jupyter notebook in the GitHub repo.

(This was presented as a Women Who Code Data Science session in April 2022. The recording of the session is on YouTube.)

Introduction to fuzzy matching

Approximate string matching, or fuzzy matching, is a method for finding strings that approximately match a given pattern or string. It estimates the likelihood that two records are a true match based on some parameters. In the example shown below, the algorithms try to match 5 different variations to the given string ‘Microsoft Corporation’ and score each of them based on how closely it matches the true value.

Example where string/fuzzy matching is used
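The figure is not reproduced in this text, so here is a rough sketch of the same idea: scoring a handful of variations against a reference string. The variations below are invented for illustration, and the score comes from the standard library's `difflib.SequenceMatcher`, which is a matching-blocks ratio rather than one of the edit distances discussed later.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1]; 1.0 means identical ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


target = "Microsoft Corporation"

# Hypothetical variations, made up for this sketch (not taken from the figure).
candidates = [
    "Microsoft Corp.",
    "Microsft Corporation",
    "Micro Soft Corporation",
    "MICROSOFT CORPORATION",
    "Apple Inc.",
]

scores = {name: similarity(target, name) for name in candidates}

# Print the candidates from best to worst match.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.2f}  {name}")
```

In practice a threshold on the score (chosen per dataset) decides which candidates count as a match.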

Algorithms used
The most commonly used fuzzy matching algorithms compute an edit distance between the strings. An edit distance quantifies how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. Some well-known distance metrics are:

Levenshtein distance

Damerau–Levenshtein distance

Longest common subsequence

Hamming distance

Jaro distance
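As a small illustration of two of the simpler metrics in this list, here is a minimal sketch of Hamming distance and of the length of the longest common subsequence. The function names are my own, and the test strings are arbitrary examples.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(hamming_distance("karolin", "kathrin"))
print(lcs_length("AGGTAB", "GXTXAYB"))
```

Hamming distance only allows substitutions, which is why it needs equal lengths; the longest common subsequence only allows insertions and deletions.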

The Bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) tells whether a given text contains a substring that is “approximately equal” to a given pattern, where approximate equality is defined in terms of Levenshtein distance. It is very efficient for relatively short patterns. The Bitap algorithm is at the heart of agrep, the approximate-matching counterpart of grep on Unix systems.

The n-gram algorithm uses a Markov model to predict the next item in a sequence of text. An n-gram is a contiguous sequence of n items (characters, words, symbols etc.) in a particular order. In text processing, n-grams help capture some information about the order of words.

The main difference between the Bitap and n-gram algorithms is that the former is an on-line method, while the latter is an off-line one. On-line algorithms search without an index, so they have to scan the whole text for every query, which is slow on large data. Off-line techniques build an index first, which makes the search drastically faster, and they are widely used for text processing.
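To make the on-line/off-line distinction concrete, here is a minimal sketch of an off-line approach: an inverted index from character trigrams to record ids, built once and then used to shortlist candidates for a query. The record list, the function names and the `min_shared` threshold are all assumptions made for this example.

```python
from collections import defaultdict


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of lowercase character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(records: list[str], n: int = 3) -> dict[str, set[int]]:
    """Off-line step: map every n-gram to the ids of the records containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for rec_id, record in enumerate(records):
        for gram in char_ngrams(record, n):
            index[gram].add(rec_id)
    return index


def candidates(query: str, index: dict[str, set[int]],
               n: int = 3, min_shared: int = 2) -> set[int]:
    """Query step: ids of records sharing at least min_shared n-grams with the query."""
    counts: dict[int, int] = defaultdict(int)
    for gram in char_ngrams(query, n):
        for rec_id in index.get(gram, ()):
            counts[rec_id] += 1
    return {rec_id for rec_id, shared in counts.items() if shared >= min_shared}


# Hypothetical records, for illustration only.
records = ["Microsoft Corporation", "Apple Inc.", "Alphabet Inc."]
index = build_index(records)
print(candidates("Microsft Corp", index))
```

Only the shortlisted records then need a more expensive comparison such as an edit distance, which is where the speed-up over scanning every record comes from.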

Other commonly used techniques are the Needleman–Wunsch algorithm, the Smith–Waterman algorithm, BK-trees, and the phonetic algorithms Soundex and Metaphone.

Edit distance metrics: Levenshtein distance & Damerau-Levenshtein distance
The edit distance measures the number of edits needed to transform one word into another. Levenshtein distance is a popular edit distance; it counts single-character insertions, deletions and substitutions. Damerau-Levenshtein distance additionally counts the transposition of two adjacent characters as a single edit. The figures below show how the Levenshtein and Damerau-Levenshtein distances work and the difference between the two.

Levenshtein and Damerau-Levenshtein distances

Calculating Levenshtein and Damerau-Levenshtein distances

The math behind the algorithm is explained below.

Math behind calculating the Levenshtein distance
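The figure is not reproduced in this text. For reference, the standard recurrence for the Levenshtein distance between the first i characters of a string a and the first j characters of a string b can be written as follows (the symbols a, b, i and j are my notation and may differ from the figure):

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0, \\[4pt]
  \min
  \begin{cases}
    \operatorname{lev}_{a,b}(i-1,j) + 1 \\
    \operatorname{lev}_{a,b}(i,j-1) + 1 \\
    \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise,}
\end{cases}
```

where the three terms of the inner minimum correspond to a deletion, an insertion and a substitution (or a match, when the indicator $1_{(a_i \neq b_j)}$ is zero). The distance between the full strings is $\operatorname{lev}_{a,b}(|a|,|b|)$.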

Levenshtein distance has the following properties:

It is zero if and only if the strings are equal.

It is at least the difference of the sizes of the two strings.

It is at most the length of the longer string.

Triangle inequality: The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string.

Implementation in Python
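The notebook code is not included in this text. As a stand-in, here is a minimal sketch of both distances using dynamic programming. Note the assumption in the second function: it implements the restricted variant of Damerau-Levenshtein, usually called optimal string alignment, in which no substring is edited more than once. The function names are my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # Swapping two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("abcd", "acbd"))       # 2: the swap costs two substitutions
print(osa_distance("abcd", "acbd"))      # 1: the swap is a single transposition
```

The last two lines show the difference between the two metrics: a swap of neighbouring characters costs two edits under Levenshtein and one under Damerau-Levenshtein.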

Bitap algorithm
This is an on-line method of searching (i.e., search without indexing) that uses Levenshtein distance to measure approximate equality between the text and the given pattern. The Bitap algorithm works with bitwise operations on bitmasks, which are extremely fast. (A bitmask is data used for bitwise operations: multiple bits in a byte, word etc. can be set on or off, or inverted, in a single bitwise operation.) It performs best on short patterns, since the matching state for the whole pattern then fits in a single machine word.

Example 1:
input text: womenwhocode, pattern: code
output: Pattern found at index: 8

Example 2:
input text: youareawesome, pattern: youareamazing
output: No Match

Implementation in Python

This is my Python implementation of the GeeksforGeeks code for Bitap search.
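The code itself did not survive in this text. The following is an independent minimal sketch of the shift-and form of Bitap, not the author's notebook code. It includes an optional error budget in the style of the Wu-Manber extension; the function names and the choice to return the end position of a fuzzy match are my own assumptions.

```python
def bitap_fuzzy_end(text: str, pattern: str, max_errors: int = 0) -> int:
    """Index in text where the first match with at most max_errors edits ends, or -1."""
    m = len(pattern)
    if m == 0 or not 0 <= max_errors < m:
        raise ValueError("need a non-empty pattern and 0 <= max_errors < len(pattern)")
    # masks[c] has bit i set when pattern[i] == c.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    full = (1 << m) - 1
    accept = 1 << (m - 1)
    # states[d] bit i set: pattern[:i + 1] matches a suffix of the text read so far
    # with at most d edits. Initially d leading pattern characters may be deleted.
    states = [(1 << d) - 1 for d in range(max_errors + 1)]
    for pos, ch in enumerate(text):
        mask = masks.get(ch, 0)
        old_prev = states[0]
        states[0] = ((states[0] << 1) | 1) & mask
        for d in range(1, max_errors + 1):
            old_cur = states[d]
            states[d] = ((((old_cur << 1) | 1) & mask)     # characters match
                         | ((old_prev << 1) | 1)           # substitution
                         | old_prev                        # extra character in text
                         | ((states[d - 1] << 1) | 1)      # pattern character skipped
                         ) & full
            old_prev = old_cur
        if states[max_errors] & accept:
            return pos
    return -1


def bitap_search(text: str, pattern: str) -> int:
    """Start index of the first exact occurrence of pattern in text, or -1."""
    end = bitap_fuzzy_end(text, pattern, 0)
    return -1 if end == -1 else end - len(pattern) + 1


print(bitap_search("womenwhocode", "code"))            # 8, as in Example 1
print(bitap_search("youareawesome", "youareamazing"))  # -1, as in Example 2
print(bitap_fuzzy_end("womenwhocode", "cide", 1))      # 11: "code" is one edit from "cide"
```

Python integers are unbounded, so this sketch works for any pattern length; in a language with fixed-width words the pattern would be limited to the word size unless several words are chained.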

n-gram algorithm
This algorithm treats a sequence of text as a Markov model and predicts the next item in the sequence. It is an off-line search, i.e., the search is performed on an index, which makes it much more computationally efficient for large data. Currently, n-gram techniques are used widely across Natural Language Processing. An n-gram is a set of values generated from a string by grouping n sequentially occurring characters/words. The goal is to compute the probability of a sequence of characters/words or of a sentence.
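As a rough illustration of both ideas, the sketch below generates n-grams from a sequence and estimates bigram probabilities by counting. The toy corpus and the function names are invented for this example; a real model would also need smoothing for unseen pairs.

```python
from collections import Counter


def ngrams(items, n: int) -> list[tuple]:
    """All runs of n consecutive items (characters of a string, or words of a list)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


print(ngrams("code", 2))  # character bigrams

# Hypothetical toy corpus, for illustration only.
words = "the cat sat on the mat".split()

bigram_counts = Counter(ngrams(words, 2))
# How often each word occurs as the first element of a bigram.
context_counts = Counter(first for first, _ in ngrams(words, 2))


def bigram_probability(prev: str, word: str) -> float:
    """Maximum-likelihood estimate of P(word | prev) from the toy corpus."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


print(bigram_probability("the", "cat"))  # "the" is followed by "cat" once out of twice
```

Predicting the next item then amounts to picking the word with the highest estimated probability given the previous one.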

Math of n-gram algorithm
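The figure is not reproduced in this text. The usual formulation, written here in my own notation for a sequence of words $w_1, \dots, w_m$, starts from the chain rule and applies the Markov assumption that each word depends only on the previous $n-1$ words:

```latex
P(w_1, \dots, w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1}),
\qquad
P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
  \approx \frac{\operatorname{count}(w_{i-n+1}, \dots, w_{i-1}, w_i)}
               {\operatorname{count}(w_{i-n+1}, \dots, w_{i-1})}.
```

The second expression is the maximum-likelihood estimate obtained by counting n-grams in a corpus.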

In this notebook, these algorithms are applied to a simple example to show the similarities and differences between these methods.

References
1. Levenshtein, Vladimir I. “Binary codes capable of correcting deletions, insertions, and reversals.” In Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710. 1966.
2. Damerau, Fred J. “A technique for computer detection and correction of spelling errors.” Communications of the ACM 7, no. 3 (1964): 171–176.
3. Cayrol, M., Farreny, H. and Prade, H. (1982), ‘Fuzzy Pattern Matching’, Kybernetes, Vol. 11 No. 2, pp. 103–116.
4. Ukkonen, Esko. “Algorithms for approximate string matching.” Information and Control 64, no. 1–3 (1985): 100–118.
5. GeeksforGeeks: applications of fuzzy string matching
6. GeeksforGeeks: Bitap Algorithm
7. Stanford slides on n-gram
8. DataCamp tutorial: fuzzy string matching
9. Levenshtein distance theory
10. Article on record linking and fuzzy matching
11. Medium post on Levenshtein distance
12. Stack Overflow for n-gram similarity
