Simple and Efficient Algorithm for Approximate Dictionary Matching

Naoaki Okazaki
University of Tokyo
[email protected]

Jun'ichi Tsujii
University of Tokyo / University of Manchester / National Centre for Text Mining
[email protected]

Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 851-859, Beijing, August 2010
Abstract

This paper presents a simple and efficient algorithm for approximate dictionary matching designed for similarity measures such as cosine, Dice, Jaccard, and overlap coefficients. We propose this algorithm, called CPMerge, for the τ-overlap join of inverted lists. First we show that this task is solvable exactly by a τ-overlap join. Given the inverted lists retrieved for a query, the algorithm collects fewer candidate strings and prunes unlikely candidates to efficiently find the strings that satisfy the constraint of the τ-overlap join. We conducted experiments of approximate dictionary matching on three large-scale datasets that include person names, biomedical names, and general English words. The algorithm exhibited scalable performance on the datasets. For example, it retrieved strings in 1.1 ms from the string collection of Google Web1T unigrams (with cosine similarity and threshold 0.7).

1 Introduction

Languages are flexible enough to express the same meaning through different diction. At the same time, inconsistency of surface expressions has persisted as a serious problem in natural language processing. For example, in the biomedical domain, cardiovascular disorder can be described by various expressions: cardiovascular diseases, cardiovascular system disorder, and disorder of the cardiovascular system. It is a nontrivial task to find the entry from these surface expressions appearing in text.

This paper addresses approximate dictionary matching, which consists of finding all strings in a string collection V whose similarity to a query string x is no smaller than a threshold α. This task has a broad range of applications, including spelling correction, flexible dictionary look-up, record linkage, and duplicate detection (Henzinger, 2006; Manku et al., 2007). Formally, the task obtains a subset Y_{x,α} ⊆ V,

    Y_{x,α} = {y ∈ V | sim(x, y) ≥ α},    (1)

where sim(x, y) denotes the similarity between x and y. A naïve solution to this task is to compute similarity values |V| times, i.e., between x and every string y ∈ V. However, this solution is impractical when the number of strings |V| is huge (e.g., more than one million).

In this paper, we present a simple and efficient algorithm for approximate dictionary matching designed for similarity measures such as cosine, Dice, Jaccard, and overlap coefficients. Our main contributions are twofold.

1. We show that the problem of approximate dictionary matching is solved exactly by a τ-overlap join (Sarawagi and Kirpal, 2004) of inverted lists. Then we present CPMerge, a simple and efficient algorithm for the τ-overlap join that is also easy to implement.

2. We demonstrate the efficiency of the algorithm on three large-scale datasets with person names, biomedical concept names,
and general English words. We compare the algorithm with state-of-the-art algorithms, including Locality Sensitive Hashing (Ravichandran et al., 2005; Andoni and Indyk, 2008) and DivideSkip (Li et al., 2008). The proposed algorithm retrieves strings the most rapidly, e.g., in 1.1 ms from Google Web1T unigrams (with cosine similarity and threshold 0.7).
2 Proposed Method

2.1 Necessary and sufficient conditions

In this paper, we assume that the features of a string are represented arbitrarily by a set. Although it is important to design a string representation for an accurate similarity measure, we do not address this problem: our emphasis is not on designing a better representation for string similarity but on establishing an efficient algorithm. The most popular representation is given by n-grams: all substrings of size n in a string. We use trigrams throughout this paper as an example of string representation. For example, the string "methyl sulphone" is expressed by 17 elements of letter trigrams: {'$$m', '$me', 'met', 'eth', 'thy', 'hyl', 'yl ', 'l s', ' su', 'sul', 'ulp', 'lph', 'pho', 'hon', 'one', 'ne$', 'e$$'}. We insert two '$'s before and after the string to denote its start and end. In general, a string x consisting of |x| letters yields (|x| + n − 1) n-grams. We call |x| and |X| the length and size, respectively, of the string x. In practice, we attach ordinal numbers to n-grams to represent multiple occurrences of an n-gram in a string (Chaudhuri et al., 2006); for example, the string "prepress", which contains two occurrences of the trigram 'pre', yields the set {'$$p'#1, '$pr'#1, 'pre'#1, 'rep'#1, 'epr'#1, 'pre'#2, 'res'#1, 'ess'#1, 'ss$'#1, 's$$'#1}.
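As a concrete illustration of this featurization, here is a minimal Python sketch; the function name and the use of (n-gram, ordinal) pairs in place of the '#k' suffix notation are our own choices.

```python
def trigrams(s, n=3):
    """Letter n-grams of s with (n - 1) '$' pads on each side; a repeated
    n-gram is distinguished by an ordinal, so ('pre', 2) stands for 'pre'#2.
    Illustrative sketch; names and the pair encoding are our own choices."""
    padded = '$' * (n - 1) + s + '$' * (n - 1)
    counts, features = {}, set()
    for i in range(len(padded) - n + 1):
        g = padded[i:i + n]
        counts[g] = counts.get(g, 0) + 1
        features.add((g, counts[g]))
    return features

assert len(trigrams('methyl sulphone')) == 17   # |x| + n - 1 = 15 + 2
```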
Let X and Y denote the feature sets of the strings x and y, respectively. The cosine similarity between the two strings x and y is

    cosine(X, Y) = |X ∩ Y| / √(|X| |Y|).    (2)

By integrating this definition with Equation 1, we obtain the necessary and sufficient condition for approximate dictionary matching,

    ⌈α √(|X| |Y|)⌉ ≤ |X ∩ Y| ≤ min{|X|, |Y|}.    (3)

This inequality states that the two strings x and y must have at least τ = ⌈α √(|X| |Y|)⌉ features in common. When ignoring |X ∩ Y| in the inequality, we obtain an inequality about |X| and |Y|,

    ⌈α² |X|⌉ ≤ |Y| ≤ ⌊|X| / α²⌋.    (4)

This inequality presents the search range for retrieving similar strings; that is, we can ignore strings whose feature-set size is out of this range. Similar derivations apply to other similarity measures, including the Dice, Jaccard, and overlap coefficients. Table 1 summarizes the conditions for these similarity measures.

Table 1: Conditions for each similarity measure

    Measure   min |Y|          max |Y|          τ (= min |X ∩ Y|)
    Dice      (α/(2−α)) |X|    ((2−α)/α) |X|    (1/2) α (|X| + |Y|)
    Jaccard   α |X|            |X| / α          α (|X| + |Y|) / (1 + α)
    Cosine    α² |X|           |X| / α²         α √(|X| |Y|)
    Overlap   —                —                α min{|X|, |Y|}

We explain one usage of these conditions. Let the query string be x = "methyl sulphone" and the threshold for approximate dictionary matching be α = 0.7 with cosine similarity. Representing the strings with letter trigrams, we have the size of x, |X| = 17. Inequality 4 gives the search range for the size |Y| of retrieved strings, 9 ≤ |Y| ≤ 34. Presuming that we are searching for strings of |Y| = 16, we obtain from Inequality 3 the necessary and sufficient condition for approximate dictionary matching, τ = 12 ≤ |X ∩ Y|. Thus, we need to search for strings that have at least 12 letter trigrams in common with X. Consider the string y = "methyl sulfone", a spelling variant of x (ph → f): this string is a solution for approximate dictionary matching because |X ∩ Y| = 13 (≥ τ); the actual similarity is cosine(X, Y) = 13/√(17 × 16) = 0.788 (≥ α).
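For the cosine measure these bounds are straightforward to compute. The following sketch (function names are ours) reproduces the numbers of the example above:

```python
from math import ceil, floor, sqrt

def cosine_range(size_x, alpha):
    """Search range [min |Y|, max |Y|] for cosine similarity (Table 1)."""
    return ceil(alpha * alpha * size_x), floor(size_x / (alpha * alpha))

def cosine_min_overlap(size_x, size_y, alpha):
    """Minimum number of common features tau for cosine (Table 1)."""
    return ceil(alpha * sqrt(size_x * size_y))

print(cosine_range(17, 0.7))             # (9, 34)
print(cosine_min_overlap(17, 16, 0.7))   # 12
```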
2.2 Data structure and algorithm

Algorithm 1 presents the pseudocode of approximate dictionary matching based on Table 1.

    Input: V: collection of strings
    Input: x: query string
    Input: α: threshold for the similarity
    Output: Y: list of strings similar to the query
    1   X ← string_to_feature(x);
    2   Y ← [ ];
    3   for l ← min_y(|X|, α) to max_y(|X|, α) do
    4       τ ← min_overlap(|X|, l, α);
    5       R ← overlapjoin(X, τ, V, l);
    6       foreach r ∈ R do append r to Y;
    7   end
    8   return Y;

Algorithm 1: Approximate dictionary matching.
Given a query string x, a collection of strings V, and a similarity threshold α, the algorithm computes the size range (line 3) given by Table 1. For each size l in the range, the algorithm computes the minimum number of overlaps τ (line 4). The function overlapjoin (line 5) finds similar strings by solving the following problem (the τ-overlap join): given the list of features X of the query string and the minimum number of overlaps τ, enumerate the strings of size l in the collection V that have at least τ feature overlaps with X.
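Assuming the helper sketches from Section 2.1, the outer loop of Algorithm 1 for the cosine measure takes only a few lines of Python; here, index stands for the inverted index described next, and overlapjoin for any implementation of the τ-overlap join, such as the AllScan and CPMerge sketches given below.

```python
def approx_dict_match(x, index, alpha, overlapjoin):
    """Algorithm 1 for the cosine measure (sketch): walk the size range of
    Table 1, derive tau for each target size l, and delegate the tau-overlap
    join to the supplied implementation (AllScan or CPMerge below)."""
    X = sorted(trigrams(x))                            # features of the query
    lo, hi = cosine_range(len(X), alpha)               # line 3: size range
    results = []
    for l in range(lo, hi + 1):
        tau = cosine_min_overlap(len(X), l, alpha)     # line 4
        results.extend(overlapjoin(X, tau, index, l))  # line 5
    return results
```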
To solve this problem efficiently, we build an inverted index that stores a mapping from the features to their originating strings. Then, we can perform the τ-overlap join by finding strings that appear at least τ times in the inverted lists retrieved for the query features X.
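A minimal in-memory rendering of such an index is sketched below; the published system, SimString, implements it in C++ with on-disk index files, so the Python names here are our own. Keys combine the string size l with a feature q, and the lists hold string ids in sorted order, which CPMerge later exploits for binary search.

```python
from collections import defaultdict

def build_index(strings):
    """Inverted index sketch: (string size, feature) -> sorted list of ids."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        feats = trigrams(s)
        for q in feats:
            index[(len(feats), q)].append(sid)  # ids arrive in increasing order
    return index

def get(index, l, q):
    """Inverted list of ids of size-l strings that contain feature q."""
    return index.get((l, q), [])
```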
Algorithm 2 portrays a naïve solution for the τ-overlap join (the AllScan algorithm); in this algorithm, the function get(V, l, q) returns the inverted list of strings (of size l) for the feature q.

    Input: X: array of features of the query string
    Input: τ: minimum number of overlaps
    Input: V: collection of strings
    Input: l: size of target strings
    Output: R: list of strings similar to the query
    1   M ← {};
    2   R ← [ ];
    3   foreach q ∈ X do
    4       foreach i ∈ get(V, l, q) do
    5           M[i] ← M[i] + 1;
    6           if τ ≤ M[i] then
    7               append i to R;
    8           end
    9       end
    10  end
    11  return R;

Algorithm 2: AllScan algorithm.

In short, this algorithm scans the strings in the inverted lists retrieved for the query features X, counts the occurrences of every string in those lists, and returns the strings whose occurrence count is no smaller than τ.
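Under the same assumptions as the index sketch above, AllScan translates almost line for line into Python:

```python
def allscan(X, tau, index, l):
    """Algorithm 2 (sketch): scan every inverted list, count occurrences,
    and report each id exactly once when its count reaches tau."""
    M = {}   # occurrence count per candidate id
    R = []   # qualified ids
    for q in X:
        for i in get(index, l, q):
            M[i] = M.get(i, 0) + 1
            if M[i] == tau:   # '==' avoids duplicate reports past tau
                R.append(i)
    return R
```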
This algorithm is inefficient in that it scans all strings in the inverted lists. The number of scanned strings is large, especially when some query features appear frequently in the strings, e.g., 's$$' (words ending with 's') and 'pre' (words containing the substring 'pre'). To make matters worse, such features are too common to characterize string similarity. The AllScan algorithm thus maintains numerous candidate strings in M, but most candidates are unlikely to qualify because they have few overlaps with X.

To reduce the number of candidate strings, we turn to signature-based algorithms (Arasu et al., 2006; Chaudhuri et al., 2006):

Property 1. Let there be a set X (of size h) and a set Y (of any size). Consider any subset Z ⊆ X of size (h − τ + 1). If |X ∩ Y| ≥ τ, then Z ∩ Y ≠ ∅.

We explain one usage of this property. Let the query string be x = "methyl sulphone" with trigram feature set X (therefore, |X| = h = 17). Presuming that we seek strings whose trigram sets have size 16 and share 12 overlaps with X, a string y must then have at least one overlap with any subset of X of size 6 (= 17 − 12 + 1). We call such a subset the signatures. The property leads to an algorithmic design by which we obtain a small set of candidate strings from the inverted lists of the signatures, i.e., (|X| − τ + 1) features in X, and then verify whether each candidate string reaches τ overlaps using the remaining (τ − 1) n-grams.

Algorithm 3 presents the pseudocode employing this idea. In line 1, we arrange the features in X in ascending order of the number of strings in their inverted lists. We denote the k-th element of the ordered features as X_k (k ∈ {0, ..., |X| − 1}), where the index number begins with 0. Based on this notation, X_0 and X_{|X|−1} are the most uncommon and the most common features in X, respectively.
    Input: X: array of features of the query string
    Input: τ: minimum number of overlaps
    Input: V: collection of strings
    Input: l: size of target strings
    Output: R: list of strings similar to the query
    1   sort elements in X by order of |get(V, l, X_k)|;
    2   M ← {};
    3   for k ← 0 to (|X| − τ) do
    4       foreach s ∈ get(V, l, X_k) do
    5           M[s] ← M[s] + 1;
    6       end
    7   end
    8   R ← [ ];
    9   for k ← (|X| − τ + 1) to (|X| − 1) do
    10      foreach s ∈ M do
    11          if bsearch(get(V, l, X_k), s) then
    12              M[s] ← M[s] + 1;
    13          end
    14          if τ ≤ M[s] then
    15              append s to R;
    16              remove s from M;
    17          else if M[s] + (|X| − k − 1) < τ then
    18              remove s from M;
    19          end
    20      end
    21  end
    22  return R;

Algorithm 3: CPMerge algorithm.
In lines 2-7, we use the (|X| − τ + 1) features X_0, ..., X_{|X|−τ} to generate a compact set of candidate strings; the algorithm stores the occurrence count of each string s in M[s]. In lines 9-21, we increment the occurrence counts using the inverted lists of X_{|X|−τ+1}, ..., X_{|X|−1}. For each string s in the candidates (line 10), we perform a binary search on the inverted list (line 11) and increment the overlap count if the string s exists in the list (line 12). If the overlap count of the string reaches τ (line 14), we append the string s to the result list R and remove s from the candidate list (lines 15-16). We prune a candidate string (lines 17-18) if the candidate is found to be unable to reach τ overlaps even if it appears in all of the unexamined inverted lists. A Python rendering of the complete algorithm is sketched below.
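The following sketch of CPMerge follows Algorithm 3 step by step, under the same assumptions as the earlier snippets (bisect supplies the binary search); the final sweep over M is our own addition for the degenerate case τ ≤ 1, in which the verification loop never runs.

```python
from bisect import bisect_left

def contains(sorted_ids, s):
    """Binary search for s in a sorted inverted list (line 11)."""
    j = bisect_left(sorted_ids, s)
    return j < len(sorted_ids) and sorted_ids[j] == s

def cpmerge(X, tau, index, l):
    """Algorithm 3 (sketch): candidate generation, verification, pruning."""
    # Line 1: examine the rarest features first.
    X = sorted(X, key=lambda q: len(get(index, l, q)))
    # Lines 2-7: candidates from the (|X| - tau + 1) rarest inverted lists.
    M = {}
    for k in range(len(X) - tau + 1):
        for s in get(index, l, X[k]):
            M[s] = M.get(s, 0) + 1
    # Lines 8-21: verify candidates against the remaining (tau - 1) lists.
    R = []
    for k in range(len(X) - tau + 1, len(X)):
        lst = get(index, l, X[k])
        for s in list(M):                 # snapshot: M shrinks inside the loop
            if contains(lst, s):
                M[s] += 1                 # line 12
            if tau <= M[s]:
                R.append(s)               # lines 14-16: qualified
                del M[s]
            elif M[s] + (len(X) - k - 1) < tau:
                del M[s]                  # lines 17-18: prune unreachable
    R.extend(s for s, c in M.items() if c >= tau)   # our tau <= 1 safeguard
    return R
```

With the running example of Section 2.1 (|X| = 17, l = 16, τ = 12), only the 6 rarest lists are scanned to generate candidates, while the remaining 11 lists are touched only through binary searches.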
3 Experiments

We report the experimental results of approximate dictionary matching on large-scale datasets with person names, biomedical names, and general English words. We implemented the following systems of approximate dictionary matching.

• Proposed: the CPMerge algorithm.

• Naive: a naïve algorithm that computes the cosine similarity |V| times for every query.

• AllScan: the AllScan algorithm.

• Signature: the CPMerge algorithm without pruning; this is equivalent to Algorithm 3 without lines 17-18.

• DivideSkip: our implementation of the algorithm (Li et al., 2008). We tuned the parameter µ ∈ {0.01, 0.02, 0.04, 0.1, 0.2, 0.4, 1, 2, 4, 10, 20, 40, 100} for each dataset and selected the value with the fastest response.

• Locality Sensitive Hashing (LSH) (Andoni and Indyk, 2008): this baseline follows the design of previous work (Ravichandran et al., 2005). The system approximately solves Equation 1 by finding dictionary entries whose LSH values are within a (bit-wise) hamming distance θ of the LSH value of a query string. To adapt the method to approximate dictionary matching, we used a 64-bit LSH function computed with letter trigrams. By design, this method does not find an exact solution to Equation 1; in other words, it can miss dictionary entries that are actually similar to the query strings. The system has three parameters, θ, q (number of bit permutations), and B (search width), to control the tradeoff between retrieval speed and recall; we follow the notation of the original paper (Ravichandran et al., 2005), which defines θ, q, and B. Generally speaking, increasing these parameters improves the recall but slows down the search. We determined θ = 24 and q = 24 experimentally (q = 24 so that the arrays of shuffled hash values fit in memory; θ = 24 chosen from {8, 16, 24} as a good balance between accuracy and speed), and measured the performance for B ∈ {16, 32, 64}.

All systems except LSH share the same implementation of Algorithm 1 so that we can specifically examine the differences among the algorithms for the τ-overlap join. The C++ source code of the system used for this experiment is available at http://www.chokkan.org/software/simstring/. We ran all experiments on an application server running Debian GNU/Linux 4.0 with an Intel Xeon 5140 CPU (2.33 GHz) and 8 GB of main memory.
3.1 Datasets

We used three large datasets with person names (IMDB actors), general English words (Google Web1T), and biomedical names (UMLS).

• IMDB actors: this dataset comprises actor names extracted from the IMDB database (ftp://ftp.fu-berlin.de/misc/movies/database/). We used all actor names (1,098,022 strings; 18 MB) from the file actors.list.gz. The average number of letter trigrams per string is 17.2, and the total number of trigrams is 42,180. The system generated index files of 83 MB in 56.6 s.

• Google Web1T unigrams: this dataset consists of the English word unigrams in the Google Web1T corpus (LDC2006T13). We used all word unigrams (13,588,391 strings; 121 MB) after removing the frequency information. The average number of letter trigrams per string is 10.3, and the total number of trigrams is 301,459. The system generated index files of 601 MB in 551.7 s.

• UMLS: this dataset consists of English names and descriptions of biomedical concepts in the Unified Medical Language System (UMLS). We extracted all English concept names (5,216,323 strings; 212 MB) from MRCONSO.RRF.aa.gz and MRCONSO.RRF.ab.gz in UMLS Release 2009AA. The average number of letter trigrams per string is 43.6, and the total number of trigrams is 171,596. The system generated index files of 1.1 GB in 1,216.8 s.

For each dataset, we prepared 1,000 query strings by sampling strings randomly from the dataset. To simulate the situation where query strings are not only identical but also merely similar to dictionary entries, we introduced random noise into the strings: one-third of the query strings are unchanged from the original (sampled) strings, one-third have one letter changed, and one-third have two letters changed. When changing a letter, we randomly chose a letter position from a uniform distribution and replaced the letter at that position with an ASCII letter chosen randomly from a uniform distribution, as in the sketch below.
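This noise procedure can be sketched as follows; the function names and the use of distinct edit positions are our own choices.

```python
import random
import string

def corrupt(s, n_edits):
    """Replace n_edits distinct, randomly chosen positions of s with ASCII
    letters drawn uniformly at random (a sketch of the procedure above)."""
    chars = list(s)
    for pos in random.sample(range(len(chars)), n_edits):
        chars[pos] = random.choice(string.ascii_letters)
    return ''.join(chars)

def make_queries(dictionary, n=1000):
    """Sample n strings and corrupt them: roughly one-third unchanged,
    one-third with one edit, one-third with two edits."""
    sample = random.sample(dictionary, n)
    return [corrupt(s, i % 3) for i, s in enumerate(sample)]
```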
3.2 Results

To examine the scalability of each system, we varied the number of indexed strings from 10% to 100% of each dataset and issued the 1,000 queries. Figure 1 portrays the average response time for retrieving strings whose cosine similarity values are no smaller than 0.7. Although LSH (B=16) appears to be the fastest in the graph, this system missed many true positives (in solving Equation 1, all systems are expected to retrieve exactly the set of strings retrieved by the Naïve algorithm): its recall scores for approximate dictionary matching were 15.4% (IMDB), 13.7% (Web1T), and 1.5% (UMLS). Increasing the parameter B improves the recall at the expense of the response time. LSH (B=64), whose response time on the IMDB dataset was 29.72 ms (100% index size), not only ran slower than the proposed method but also suffered from low recall scores: 25.8% (IMDB), 18.7% (Web1T), and 7.1% (UMLS). LSH was useful only when a quick response matters much more than recall.

The other systems are guaranteed to find the exact solution (100% recall). The proposed algorithm was the fastest of the exact systems on all datasets: the response times per query (100% index size) were 1.07 ms (IMDB), 1.10 ms (Web1T), and 20.37 ms (UMLS). The response times of the Naïve algorithm were far too slow: 32.8 s (IMDB), 236.5 s (Web1T), and 416.3 s (UMLS).

The proposed algorithm achieved substantial improvements over the AllScan algorithm: the proposed method was 65.3 times (IMDB), 227.5 times (Web1T), and 13.7 times (UMLS) faster than the AllScan algorithm. We observed that the Signature algorithm, i.e., Algorithm 3 without lines 17-18, did not perform well: it was 1.8 times slower (IMDB), 2.1 times faster (Web1T), and 135.0 times slower (UMLS) than the AllScan algorithm. These results indicate that it is imperative to minimize the number of candidates in order to reduce the number of binary-search operations. The proposed algorithm was also 11.1-13.4 times faster than DivideSkip.
[Figure 1 omitted: three line charts. Panels (a) IMDB actors, (b) Google Web1T unigrams, and (c) UMLS plot the average response per query [ms] against the number of indexed strings (%) for Proposed, AllScan, Signature, DivideSkip, and LSH (B = 16, 32, 64).]

Figure 1: Average response time for processing a query (cosine similarity; α = 0.7).
[Figure 2 omitted: three line charts. Panels (a) IMDB actors, (b) Google Web1T unigrams, and (c) UMLS plot the average response per query [ms] against the similarity threshold (0.2-1.0) for the Dice, Jaccard, cosine, and overlap measures.]

Figure 2: Average response time for processing a query.

Figure 2 presents the average response time of the proposed algorithm for different similarity measures and threshold values. When the similarity threshold is lowered, the algorithm runs slower because the number of retrieved strings |Y| increases exponentially. The Dice coefficient and cosine similarity produced similar curves.

Table 2 summarizes the run-time statistics of the proposed method on each dataset (with cosine similarity and threshold 0.7). On the IMDB dataset, the proposed method searched for strings whose size was between 8.74 and 34.06 on average, and it retrieved 4.63 strings per query. The proposed algorithm scanned 279.7 strings in 4.6 inverted lists to obtain 232.5 candidate strings, and then performed binary searches on 4.3 inverted lists containing 7,561.8 strings in total. In contrast, the AllScan algorithm had to scan 16,155.1 strings in 17.7 inverted lists and consider 9,788.7 candidate strings, only to find the same 4.63 similar strings.

This table clearly demonstrates three key contributions of the proposed algorithm to efficient approximate dictionary matching. First, the proposed algorithm scanned far fewer strings than the AllScan algorithm did. For example, to obtain candidate strings in the IMDB dataset, the proposed algorithm scanned 279.7 strings, whereas the AllScan algorithm scanned 16,155.1 strings; across the three datasets, the algorithm examined only 1.1%-3.5% of the strings in the entire inverted lists. Second, the proposed algorithm considered far fewer candidates than the AllScan algorithm did: the number of candidate strings considered by the algorithm was 1.2%-6.6% of those considered by the AllScan algorithm. Finally, the proposed algorithm read fewer inverted lists than the AllScan algorithm did: it actually read 8.9 (IMDB), 6.0 (Web1T), and 31.7 (UMLS) inverted lists during the experiments (these values are 4.6 + 4.3, 3.1 + 2.9, and 14.3 + 17.4 in Table 2). In other words, the proposed algorithm solved the τ-overlap join problems by checking only 50.3% (IMDB), 53.6% (Web1T), and 51.9% (UMLS) of the total inverted lists retrieved for the queries.
Table 2: Run-time statistics of the proposed algorithm for each dataset

    Averaged item       IMDB      Web1T     UMLS      Description
    min |y|             8.74      5.35      21.87     minimum size of trigram sets of target strings
    max |y|             34.06     20.46     88.48     maximum size of trigram sets of target strings
    τ                   14.13     9.09      47.77     minimum number of overlaps required/sufficient per query
    |Y|                 4.63      3.22      111.79    number of retrieved strings per query

    Total (averaged for each query and target size):
    # inverted lists    17.7      11.2      61.1      number of inverted lists retrieved for a query
    # strings           16,155.1  52,557.6  49,561.4  number of strings in the inverted lists
    # unique strings    9,788.7   44,834.6  17,457.5  number of unique strings in the inverted lists

    Candidate stage (averaged for each query and target size):
    # inverted lists    4.6       3.1       14.3      number of inverted lists scanned to generate candidates
    # strings           279.7     552.7     1,756.3   number of strings scanned to generate candidates
    # candidates        232.5     523.7     1,149.7   number of candidates generated for a query

    Validation stage (averaged for each query and target size):
    # inverted lists    4.3       2.9       17.4      number of inverted lists examined by binary search for a query
    # strings           7,561.8   19,843.6  20,443.7  number of strings targeted by binary search

4 Related Work

Numerous studies have addressed approximate dictionary matching. The most popular configuration uses n-grams as the string representation and the edit distance as the similarity measure. Gravano et al. (1998; 2001) presented various filtering strategies, e.g., count filtering, position filtering, and length filtering, to reduce the number of candidates. Kim et al. (2005) proposed two-level n-gram inverted indices (n-Gram/2L) to eliminate the redundancy of position information in n-gram indices. Li et al. (2007) explored the use of variable-length grams (VGRAMs) for improving query performance. Lee et al. (2007) extended n-grams to include wild cards and developed algorithms based on a replacement semi-lattice. Xiao et al. (2008) proposed the Ed-Join algorithm, which utilizes mismatching n-grams.

Several studies have addressed different paradigms for approximate dictionary matching. Bocek et al. (2007) presented Fast Similarity Search (FastSS), an enhancement of neighborhood-generation algorithms in which multiple variants of each string record are stored in a database. Wang et al. (2009) further improved the technique of neighborhood generation by introducing partitioning and prefix pruning. Huynh et al. (2006) developed a solution to the k-mismatch problem in compressed suffix arrays. Liu et al. (2008) stored string records in a trie and proposed a framework called TITAN. These studies are specialized for the edit distance measure.

A few studies have addressed approximate dictionary matching for similarity measures such as the cosine and Jaccard similarities. Chaudhuri et al. (2006) proposed the SSJoin operator for similarity joins with several measures, including the edit distance and Jaccard similarity. This algorithm first generates signatures for strings, finds all pairs of strings whose signatures overlap, and finally outputs the subset of these candidate pairs that satisfy the similarity predicate. Arasu et al. (2006) addressed signature schemes, i.e., methodologies for obtaining signatures from strings; they also presented an implementation of the SSJoin operator in SQL. Although we did not implement this algorithm in SQL, it is equivalent to the Signature algorithm in Section 3.

Sarawagi and Kirpal (2004) proposed the MergeOpt algorithm for the τ-overlap join, targeting approximate string matching with the overlap, Jaccard, and cosine measures. This algorithm splits the inverted lists A for a given query into two groups, S and L, maintains a heap to collect candidate strings on S, and performs a binary search on L to verify the condition of the τ-overlap join for each candidate string. Their subsequent work includes an efficient algorithm for the top-k search of the overlap join (Chandel et al., 2006).

Li et al. (2008) extended this algorithm into the SkipMerge and DivideSkip algorithms. The SkipMerge algorithm uses a heap to compute the τ-overlap join on the entire set of inverted lists A, but has an additional mechanism to increment the frontier pointers of the inverted lists efficiently based on the strings popped most recently from the heap.
Consequently, SkipMerge can reduce the number of strings that are pushed to the heap. Similarly to the MergeOpt algorithm, DivideSkip splits the inverted lists A into two groups S and L, but it applies SkipMerge to S. In Section 3, we reported the performance of DivideSkip.

Charikar (2002) presented the Locality Sensitive Hash (LSH) function (Andoni and Indyk, 2008), which preserves the property of cosine similarity. The essence of this function is to map strings into N-bit hash values such that the bitwise hamming distance between the hash values of two strings approximately corresponds to the angle between the two strings. Ravichandran et al. (2005) applied LSH to the task of noun clustering. Adapting this algorithm to approximate dictionary matching, we discussed its performance in Section 3.
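As a rough illustration of Charikar's idea (not the exact configuration used in Section 3), a random-hyperplane LSH for the cosine measure can be sketched as follows; the feature-hashing dimension and the seed are arbitrary choices of ours.

```python
import numpy as np

_RNG = np.random.default_rng(0)
_PLANES = _RNG.standard_normal((64, 4096))   # 64 shared random hyperplanes

def lsh64(features, dim=4096):
    """64-bit cosine LSH sketch (Charikar, 2002): hash the feature set into
    a dim-dimensional vector, then take the sign of 64 random projections.
    The hamming distance between two signatures approximates the angle
    between the underlying feature vectors."""
    v = np.zeros(dim)
    for f in features:
        v[hash(f) % dim] += 1.0   # note: Python's hash is per-process stable
    return (_PLANES @ v) >= 0.0   # boolean array; one bit per hyperplane

def hamming(a, b):
    return int(np.count_nonzero(a != b))
```

Dictionary entries whose signatures fall within hamming distance θ of the query signature then become the candidate set.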
Several researchers have presented refined similarity measures for strings (Winkler, 1999; Cohen et al., 2003; Bergsma and Kondrak, 2007; Davis et al., 2007). Although these studies are sometimes regarded as a research topic of approximate dictionary matching, they assume that the two strings whose similarity is to be computed are given; in other words, finding the strings in a large collection that are similar to a given string is out of their scope. Thus, a reasonable approach is for approximate dictionary matching to quickly collect candidate strings with a loose similarity threshold, and for a refined similarity measure to scrutinize each candidate string for the target application.

5 Conclusions

We presented a simple and efficient algorithm for approximate dictionary matching with the cosine, Dice, Jaccard, and overlap measures. We conducted experiments of approximate dictionary matching on large-scale datasets with person names, biomedical names, and general English words. Even though the algorithm is very simple, our experimental results showed that it executed very quickly. We also confirmed that the proposed method drastically reduces the number of candidate strings considered during approximate dictionary matching. We believe that this study will advance practical NLP applications for which the execution time of approximate dictionary matching is critical.

An advantage of the proposed algorithm over existing algorithms (e.g., MergeSkip) is that it does not need to read all the inverted lists retrieved by the query n-grams. We observed that the proposed algorithm solved τ-overlap joins by checking approximately half of the inverted lists (with cosine similarity and threshold α = 0.7). This characteristic is well suited to processing compressed inverted lists because the algorithm needs to decompress only about half of them. It is natural to extend this study toward compressing and decompressing inverted lists to reduce disk space and further improve query performance (Behm et al., 2009).

Acknowledgments

This work was partially supported by Grants-in-Aid for Scientific Research on Priority Areas (MEXT, Japan) and for Solution-Oriented Research for Science and Technology (JST, Japan).

References

Andoni, Alexandr and Piotr Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117-122.

Arasu, Arvind, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient exact set-similarity joins. In VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases, pages 918-929.

Behm, Alexander, Shengyue Ji, Chen Li, and Jiaheng Lu. 2009. Space-constrained gram-based indexing for efficient approximate string search. In ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering, pages 604-615.

Bergsma, Shane and Grzegorz Kondrak. 2007. Alignment-based discriminative string similarity. In ACL '07: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 656-663.

Bocek, Thomas, Ela Hunt, and Burkhard Stiller. 2007. Fast similarity search in large dictionaries. Technical Report ifi-2007.02, Department of Informatics (IFI), University of Zurich.
Chandel, Amit, P. C. Nagesh, and Sunita Sarawagi. 2006. Efficient batch top-k search for dictionary-based entity recognition. In ICDE '06: Proceedings of the 22nd International Conference on Data Engineering.

Charikar, Moses S. 2002. Similarity estimation techniques from rounding algorithms. In STOC '02: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 380-388.

Chaudhuri, Surajit, Venkatesh Ganti, and Raghav Kaushik. 2006. A primitive operator for similarity joins in data cleaning. In ICDE '06: Proceedings of the 22nd International Conference on Data Engineering.

Cohen, William W., Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), pages 73-78.

Davis, Jason V., Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. 2007. Information-theoretic metric learning. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 209-216.

Gravano, Luis, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. 2001. Approximate string joins in a database (almost) for free. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, pages 491-500.

Henzinger, Monika. 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 284-291.

Huynh, Trinh N. D., Wing-Kai Hon, Tak-Wah Lam, and Wing-Kin Sung. 2006. Approximate string matching using compressed suffix arrays. Theoretical Computer Science, 352(1-3):240-249.

Kim, Min-Soo, Kyu-Young Whang, Jae-Gil Lee, and Min-Jae Lee. 2005. n-Gram/2L: a space and time efficient two-level n-gram inverted index structure. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 325-336.

Lee, Hongrae, Raymond T. Ng, and Kyuseok Shim. 2007. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 195-206.

Li, Chen, Bin Wang, and Xiaochun Yang. 2007. VGRAM: improving performance of approximate queries on string collections using variable-length grams. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 303-314.

Li, Chen, Jiaheng Lu, and Yiming Lu. 2008. Efficient merging and filtering algorithms for approximate string searches. In ICDE '08: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 257-266.

Liu, Xuhui, Guoliang Li, Jianhua Feng, and Lizhu Zhou. 2008. Effective indices for efficient approximate string search and similarity join. In WAIM '08: Proceedings of the Ninth International Conference on Web-Age Information Management, pages 127-134.

Manku, Gurmeet Singh, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 141-150.

Navarro, Gonzalo and Ricardo Baeza-Yates. 1998. A practical q-gram index for text retrieval allowing errors. CLEI Electronic Journal, 1(2).

Ravichandran, Deepak, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: using locality sensitive hash functions for high speed noun clustering. In ACL '05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 622-629.

Sarawagi, Sunita and Alok Kirpal. 2004. Efficient set joins on similarity predicates. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 743-754.

Wang, Wei, Chuan Xiao, Xuemin Lin, and Chengqi Zhang. 2009. Efficient approximate entity extraction with edit distance constraints. In SIGMOD '09: Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 759-770.

Winkler, William E. 1999. The state of record linkage and current research problems. Technical Report R99/04, Statistics of Income Division, Internal Revenue Service.

Xiao, Chuan, Wei Wang, and Xuemin Lin. 2008. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. In VLDB '08: Proceedings of the 34th International Conference on Very Large Data Bases, pages 933-944.
