and general English words. We compare the algorithm with state-of-the-art algorithms, including Locality Sensitive Hashing (Ravichandran et al., 2005; Andoni and Indyk, 2008) and DivideSkip (Li et al., 2008). The proposed algorithm retrieves strings the most rapidly, e.g., in 1.1 ms from Google Web1T unigrams (with cosine similarity and threshold 0.7).

2 Proposed Method
2.1 Necessary and sufficient conditions

In this paper, we assume that the features of a string are represented arbitrarily by a set. Although it is important to design a string representation for an accurate similarity measure, we do not address this problem: our emphasis is not on designing a better representation for string similarity but on establishing an efficient algorithm. The most popular representation is given by n-grams: all substrings of size n in a string. We use trigrams throughout this paper as an example of string representation. For example, the string "methyl sulphone" is expressed by 17 letter trigrams¹: {'$$m', '$me', 'met', 'eth', 'thy', 'hyl', 'yl ', 'l s', ' su', 'sul', 'ulp', 'lph', 'pho', 'hon', 'one', 'ne$', 'e$$'}. We insert two $s before and after the string to denote its start and end. In general, a string x consisting of |x| letters yields (|x| + n − 1) n-grams. We call |x| and |X| the length and size, respectively, of the string x.

¹ In practice, we attach ordinal numbers to n-grams to represent multiple occurrences of n-grams in a string (Chaudhuri et al., 2006). For example, the string "prepress", which contains two occurrences of the trigram 'pre', yields the set {'$$p'#1, '$pr'#1, 'pre'#1, 'rep'#1, 'epr'#1, 'pre'#2, 'res'#1, 'ess'#1, 'ss$'#1, 's$$'#1}.

Let X and Y denote the feature sets of the strings x and y, respectively. The cosine similarity between the two strings x and y is

    cosine(X, Y) = |X ∩ Y| / √(|X||Y|).    (2)

Integrating this definition with Equation 1, we obtain the necessary and sufficient condition for approximate dictionary matching,

    ⌈α√(|X||Y|)⌉ ≤ |X ∩ Y| ≤ min{|X|, |Y|}.    (3)

This inequality states that two strings x and y must have at least τ = ⌈α√(|X||Y|)⌉ features in common. Ignoring |X ∩ Y| in the inequality, we obtain an inequality about |X| and |Y| alone,

    α²|X| ≤ |Y| ≤ |X|/α².    (4)

This inequality presents the search range for retrieving similar strings; that is, we can ignore strings whose feature size falls outside this range. Analogous derivations apply to other similarity measures, including the Dice, Jaccard, and overlap coefficients. Table 1 summarizes the conditions for these measures.

Table 1: Conditions for each similarity measure

    Measure    min |Y|         max |Y|         τ (= min |X ∩ Y|)
    Dice       (α/(2−α))|X|    ((2−α)/α)|X|    (1/2) α (|X| + |Y|)
    Jaccard    α|X|            |X|/α           α (|X| + |Y|) / (1 + α)
    Cosine     α²|X|           |X|/α²          α √(|X||Y|)
    Overlap    —               —               α min{|X|, |Y|}

We explain one usage of these conditions. Let the query string be x = "methyl sulphone" and the threshold for approximate dictionary matching be α = 0.7 with cosine similarity. Representing the strings with letter trigrams, the size of x is |X| = 17. Inequality 4 gives the search range of |Y| for retrieved strings, 9 ≤ |Y| ≤ 34. Presuming that we are searching for strings of size |Y| = 16, Inequality 3 gives the necessary and sufficient condition for approximate dictionary matching, τ = 12 ≤ |X ∩ Y|. Thus, we need to search for strings that have at least 12 letter trigrams in common with X. Considering the string y = "methyl sulfone", a spelling variant of x (ph → f), we confirm that it is a solution for approximate dictionary matching because |X ∩ Y| = 13 (≥ τ); the actual similarity is cosine(X, Y) = 13/√(17 × 16) = 0.788 (≥ α).
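To make these computations concrete, the following C++ sketch (our own illustration, not the distributed SimString code; letter_trigrams is a hypothetical helper) extracts letter trigrams with the $-padding and ordinal-number conventions above, and evaluates the cosine row of Table 1 for the worked example:

    #include <cmath>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    // Trigram set of a string, padded with two '$'s on each side; ordinal
    // numbers (#1, #2, ...) distinguish repeated trigrams (footnote 1).
    std::set<std::string> letter_trigrams(const std::string& s) {
        std::string t = "$$" + s + "$$";
        std::map<std::string, int> count;
        std::set<std::string> X;
        for (size_t i = 0; i + 3 <= t.size(); ++i) {
            std::string g = t.substr(i, 3);
            X.insert(g + "#" + std::to_string(++count[g]));
        }
        return X;
    }

    int main() {
        const double alpha = 0.7;
        std::set<std::string> X = letter_trigrams("methyl sulphone");
        int nx = static_cast<int>(X.size());                            // |X| = 17
        // Inequality 4 (cosine row of Table 1): search range for |Y|.
        int min_y = static_cast<int>(std::ceil(alpha * alpha * nx));    // 9
        int max_y = static_cast<int>(std::floor(nx / (alpha * alpha))); // 34
        // Inequality 3: minimum overlap tau for a candidate size |Y| = 16.
        int ny = 16;
        int tau = static_cast<int>(std::ceil(alpha * std::sqrt(double(nx) * ny))); // 12
        std::cout << "|X| = " << nx << ", " << min_y << " <= |Y| <= " << max_y
                  << ", tau = " << tau << "\n";
    }

Running the sketch reproduces the numbers above: |X| = 17, 9 ≤ |Y| ≤ 34, and τ = 12 for |Y| = 16.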
2.2 Data structure and algorithm

Algorithm 1 presents the pseudocode of approximate dictionary matching based on Table 1.

    Input: V: collection of strings
    Input: x: query string
    Input: α: threshold for the similarity
    Output: Y: list of strings similar to the query
    1  X ← string_to_feature(x);
    2  Y ← [];
    3  for l ← min_y(|X|, α) to max_y(|X|, α) do
    4      τ ← min_overlap(|X|, l, α);
    5      R ← overlapjoin(X, τ, V, l);
    6      foreach r ∈ R do append r to Y;
    7  end
    8  return Y;

Algorithm 1: Approximate dictionary matching.

    Input: X: array of features of the query string
    Input: τ: minimum number of overlaps
    Input: V: collection of strings
    Input: l: size of target strings
    Output: R: list of strings similar to the query
    1  M ← {};
    2  R ← [];
    3  foreach q ∈ X do
    4      foreach i ∈ get(V, l, q) do
    5          M[i] ← M[i] + 1;
    6          if τ ≤ M[i] then
    7              append i to R;
    8          end
    9      end
    10 end
    11 return R;

Algorithm 2: AllScan algorithm for the τ-overlap join.
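The following C++ sketch (ours, under an assumed in-memory layout rather than the data structures of the distributed code) implements Algorithms 1 and 2. The inverted index maps a pair (target size l, feature q) to the sorted ids of dictionary strings of size l that contain q, and the driver inlines min_y, max_y, and min_overlap with the cosine expressions from Table 1:

    #include <cmath>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using InvertedIndex =
        std::map<std::pair<int, std::string>, std::vector<int>>;

    // get(V, l, q): inverted list for feature q among strings of size l.
    inline const std::vector<int>& get(const InvertedIndex& V, int l,
                                       const std::string& q) {
        static const std::vector<int> kEmpty;
        auto it = V.find({l, q});
        return it == V.end() ? kEmpty : it->second;
    }

    // Algorithm 2 (AllScan): count each string id over all inverted lists;
    // ids reaching tau occurrences share at least tau features with X.
    std::vector<int> all_scan(const std::vector<std::string>& X, int tau,
                              const InvertedIndex& V, int l) {
        std::map<int, int> M;
        std::vector<int> R;
        for (const std::string& q : X)
            for (int i : get(V, l, q))
                if (++M[i] == tau)   // == tau reports each id exactly once
                    R.push_back(i);
        return R;
    }

    // Algorithm 1: enumerate every admissible target size l (Table 1,
    // cosine row) and run a tau-overlap join for each size.
    std::vector<int> approx_dict_match(const std::vector<std::string>& X,
                                       double alpha, const InvertedIndex& V) {
        int nx = static_cast<int>(X.size());
        int lo = static_cast<int>(std::ceil(alpha * alpha * nx));
        int hi = static_cast<int>(std::floor(nx / (alpha * alpha)));
        std::vector<int> Y;
        for (int l = lo; l <= hi; ++l) {
            int tau = static_cast<int>(
                std::ceil(alpha * std::sqrt(double(nx) * l)));
            for (int r : all_scan(X, tau, V, l))  // or CPMerge (Algorithm 3)
                Y.push_back(r);
        }
        return Y;
    }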
    Input: X: array of features of the query string
    Input: τ: minimum number of overlaps
    Input: V: collection of strings
    Input: l: size of target strings
    Output: R: list of strings similar to the query
    1  sort elements in X by order of |get(V, l, Xk)|;
    2  M ← {};
    3  for k ← 0 to (|X| − τ) do
    4      foreach s ∈ get(V, l, Xk) do
    5          M[s] ← M[s] + 1;
    6      end
    7  end
    8  R ← [];
    9  for k ← (|X| − τ + 1) to (|X| − 1) do
    10     foreach s ∈ M do
    11         if bsearch(get(V, l, Xk), s) then
    12             M[s] ← M[s] + 1;
    13         end
    14         if τ ≤ M[s] then
    15             append s to R;
    16             remove s from M;
    17         else if M[s] + (|X| − k − 1) < τ then
    18             remove s from M;
    19         end
    20     end
    21 end
    22 return R;

Algorithm 3: CPMerge algorithm.

In lines 3–7, the algorithm scans the inverted lists of the (|X| − τ + 1) features X_0, ..., X_{|X|−τ}, which have the shortest inverted lists after the sort in line 1, to generate a compact set of candidate strings. The algorithm stores the occurrence count of each string s in M[s]. In lines 9–21, we increment the occurrence counts if the inverted lists of X_{|X|−τ+1}, ..., X_{|X|−1} contain the candidate strings. For each string s in the candidates (line 10), we perform a binary search on the inverted list (line 11) and increment the overlap count if the string s exists (line 12). If the overlap counter of the string reaches τ (line 14), we append the string s to the result list R and remove s from the candidate list (lines 15–16). We prune a candidate string (lines 17–18) if the candidate cannot reach τ overlaps even if it appears in all of the unexamined inverted lists.
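A CPMerge sketch under the same assumed layout as the AllScan sketch above (InvertedIndex and get are defined there) might look as follows; the inverted lists must be sorted by string id for the binary search to be valid:

    #include <algorithm>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using InvertedIndex =
        std::map<std::pair<int, std::string>, std::vector<int>>;
    const std::vector<int>& get(const InvertedIndex& V, int l,
                                const std::string& q);  // as defined above

    std::vector<int> cp_merge(std::vector<std::string> X, int tau,
                              const InvertedIndex& V, int l) {
        const int n = static_cast<int>(X.size());
        // Line 1: sort features so that the rarest come first.
        std::sort(X.begin(), X.end(),
                  [&](const std::string& a, const std::string& b) {
                      return get(V, l, a).size() < get(V, l, b).size();
                  });
        // Lines 3-7: scan the (n - tau + 1) shortest lists for candidates.
        std::map<int, int> M;
        for (int k = 0; k <= n - tau; ++k)
            for (int s : get(V, l, X[k])) ++M[s];
        std::vector<int> R;
        if (tau <= 1) {  // degenerate case: every candidate already qualifies
            for (const auto& kv : M) R.push_back(kv.first);
            return R;
        }
        // Lines 9-21: validate candidates on the remaining lists.
        for (int k = n - tau + 1; k < n && !M.empty(); ++k) {
            const std::vector<int>& list = get(V, l, X[k]);
            for (auto it = M.begin(); it != M.end(); ) {
                if (std::binary_search(list.begin(), list.end(), it->first))
                    ++it->second;                     // lines 11-12
                if (tau <= it->second) {              // lines 14-16: accept
                    R.push_back(it->first);
                    it = M.erase(it);
                } else if (it->second + (n - k - 1) < tau) {
                    it = M.erase(it);                 // lines 17-18: prune
                } else {
                    ++it;
                }
            }
        }
        return R;
    }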
3 Experiments

We report the experimental results of approximate dictionary matching on large-scale datasets with person names, biomedical names, and general English words. We implemented the following systems of approximate dictionary matching:

• Proposed: CPMerge algorithm.

• Naive: naïve algorithm that computes the cosine similarity |V| times for every query.

• AllScan: AllScan algorithm.

• Signature: CPMerge algorithm without pruning; this is equivalent to Algorithm 3 without lines 17–18.

• DivideSkip: our implementation of the algorithm (Li et al., 2008)².

• Locality Sensitive Hashing (LSH) (Andoni and Indyk, 2008): this baseline follows the design of previous work (Ravichandran et al., 2005). The system approximately solves Equation 1 by finding dictionary entries whose LSH values are within a (bit-wise) Hamming distance of θ from the LSH value of a query string. To adapt the method to approximate dictionary matching, we used a 64-bit LSH function computed with letter trigrams; a sketch of such a signature function appears after this list. By design, this method does not find an exact solution to Equation 1; in other words, it can miss dictionary entries that are actually similar to the query strings. The system has three parameters, θ, q (number of bit permutations), and B (search width), which control the tradeoff between retrieval speed and recall³. Generally speaking, increasing these parameters improves the recall but slows down the search. We determined θ = 24 and q = 24 experimentally⁴, and measured the performance for B ∈ {16, 32, 64}.

The systems, excluding LSH, share the same implementation of Algorithm 1 so that we can specifically examine the differences among the algorithms for the τ-overlap join. The C++ source code of the system used for this experiment is available⁵. We ran all experiments on an application server running Debian GNU/Linux 4.0 with an Intel Xeon 5140 CPU (2.33 GHz) and 8 GB of main memory.

² We tuned the parameter µ ∈ {0.01, 0.02, 0.04, 0.1, 0.2, 0.4, 1, 2, 4, 10, 20, 40, 100} for each dataset, and selected the value with the fastest response.
³ We follow the notation of the original paper (Ravichandran et al., 2005) here. Refer to the original paper for definitions of the parameters θ, q, and B.
⁴ q was set to 24 so that the arrays of shuffled hash values can be stored in memory. We chose θ = 24 from {8, 16, 24} because it showed a good balance between accuracy and speed.
⁵ http://www.chokkan.org/software/simstring/
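For concreteness, the following sketch shows a Charikar (2002)-style 64-bit signature over letter trigrams in the spirit of the LSH baseline above. It is our own illustration under assumed details: the feature hash and the per-plane sign derivation are our choices, and the baseline's search over q bit permutations with width B is omitted:

    #include <bitset>
    #include <cstdint>
    #include <set>
    #include <string>

    // Map a trigram set to a 64-bit signature: one random hyperplane per
    // bit; bit b is the sign of the sum of pseudo-random {-1, +1} weights
    // assigned to the features on plane b.
    uint64_t lsh64(const std::set<std::string>& X) {
        int sum[64] = {0};
        for (const std::string& g : X) {
            uint64_t h = 14695981039346656037ULL;     // FNV-1a feature hash
            for (unsigned char c : g) { h ^= c; h *= 1099511628211ULL; }
            for (int b = 0; b < 64; ++b) {
                uint64_t r = h + 0x9e3779b97f4a7c15ULL * (b + 1);
                r ^= r >> 33; r *= 0xff51afd7ed558ccdULL; r ^= r >> 33;
                sum[b] += (r & 1) ? +1 : -1;          // weight of g on plane b
            }
        }
        uint64_t sig = 0;
        for (int b = 0; b < 64; ++b)
            if (sum[b] > 0) sig |= (1ULL << b);
        return sig;
    }

    // The Hamming distance between two signatures approximates the angle
    // between the trigram vectors; the baseline retrieves dictionary
    // entries whose signatures lie within distance theta of the query's.
    int hamming64(uint64_t a, uint64_t b) {
        return static_cast<int>(std::bitset<64>(a ^ b).count());
    }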
3.1 Datasets

We used three large datasets with person names (IMDB actors), general English words (Google Web1T), and biomedical names (UMLS).

• IMDB actors: This dataset comprises actor names extracted from the IMDB database⁶. We used all actor names (1,098,022 strings; 18 MB) from the file actors.list.gz. The average number of letter trigrams per string is 17.2, and the total number of distinct trigrams is 42,180. The system generated index files of 83 MB in 56.6 s.

• Google Web1T unigrams: This dataset consists of English word unigrams included in the Google Web1T corpus (LDC2006T13). We used all word unigrams (13,588,391 strings; 121 MB) in the corpus after removing the frequency information. The average number of letter trigrams per string is 10.3, and the total number of distinct trigrams is 301,459. The system generated index files of 601 MB in 551.7 s.

• UMLS: This dataset consists of English names and descriptions of biomedical concepts included in the Unified Medical Language System (UMLS). We extracted all English concept names (5,216,323 strings; 212 MB) from MRCONSO.RRF.aa.gz and MRCONSO.RRF.ab.gz in UMLS Release 2009AA. The average number of letter trigrams per string is 43.6, and the total number of distinct trigrams is 171,596. The system generated index files of 1.1 GB in 1216.8 s.

For each dataset, we prepared 1,000 query strings by sampling strings randomly from the dataset. To simulate the situation where query strings are not only identical but also similar to dictionary entries, we introduced random noise into the strings. In this experiment, one-third of the query strings are unchanged from the original (sampled) strings, one-third have one letter changed, and one-third have two letters changed. When changing a letter, we randomly chose a letter position from a uniform distribution and replaced the letter at that position with an ASCII letter randomly chosen from a uniform distribution.

⁶ ftp://ftp.fu-berlin.de/misc/movies/database/
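A minimal sketch of this corruption procedure (our own; the paper does not give code, and we restrict replacements to lowercase ASCII letters for brevity):

    #include <random>
    #include <string>

    // Replace `edits` letters of s (0, 1, or 2 in the paper), choosing the
    // position and the substituted ASCII letter uniformly at random.
    std::string corrupt(std::string s, int edits, std::mt19937& rng) {
        if (s.empty()) return s;
        std::uniform_int_distribution<size_t> pos(0, s.size() - 1);
        std::uniform_int_distribution<int> letter(0, 25);
        for (int e = 0; e < edits; ++e)
            s[pos(rng)] = static_cast<char>('a' + letter(rng));
        return s;
    }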
3.2 Results

To examine the scalability of each system, we controlled the number of strings to be indexed from 10% to 100% and issued 1,000 queries. Figure 1 portrays the average response time for retrieving strings whose cosine similarity values are no smaller than 0.7. Although LSH (B=16) appears to be the fastest in the graph, this system missed many true positives⁷; its recall scores for approximate dictionary matching were 15.4% (IMDB), 13.7% (Web1T), and 1.5% (UMLS). Increasing the parameter B improves the recall at the expense of the response time. LSH (B=64)⁸ not only ran slower than the proposed method but also suffered from low recall scores: 25.8% (IMDB), 18.7% (Web1T), and 7.1% (UMLS). LSH was useful only when a quick response mattered much more than recall.

The other systems were guaranteed to find the exact solution (100% recall). The proposed algorithm was the fastest of all exact systems on all datasets: the response times per query (100% index size) were 1.07 ms (IMDB), 1.10 ms (Web1T), and 20.37 ms (UMLS). The response times of the naïve algorithm were far too slow: 32.8 s (IMDB), 236.5 s (Web1T), and 416.3 s (UMLS).

The proposed algorithm achieved substantial improvements over the AllScan algorithm: the proposed method was 65.3 times (IMDB), 227.5 times (Web1T), and 13.7 times (UMLS) faster than the AllScan algorithm. We observed that the Signature algorithm, which is Algorithm 3 without lines 17–18, did not perform well: the Signature algorithm was 1.8 times slower (IMDB), 2.1 times faster (Web1T), and 135.0 times slower (UMLS) than the AllScan algorithm. These results indicate that it is imperative to minimize the number of candidates in order to reduce the number of binary-search operations. The proposed algorithm was 11.1–13.4 times faster than DivideSkip.

⁷ Solving Equation 1, all systems are expected to retrieve the exact set of strings retrieved by the naïve algorithm.
⁸ The response time of LSH (B=64) on the IMDB dataset was 29.72 ms (100% index size).
[Figure 1: Average response time for processing a query (cosine similarity; α = 0.7). Three panels: (a) IMDB actors, (b) Google Web1T unigrams, (c) UMLS; x-axis: number of indexed strings (%); y-axis: average response time per query.]

[Figure 2: Average response time per query [ms] of the proposed algorithm as a function of the similarity threshold (0.2–1.0), for different similarity measures, on the three datasets.]
Figure 2 presents the average response time of the proposed algorithm for different similarity measures and threshold values. When the similarity threshold is lowered, the algorithm runs slower because the number of retrieved strings |Y| increases exponentially. The Dice coefficient and cosine similarity produced similar curves.

Table 2 summarizes the run-time statistics of the proposed method for each dataset (with cosine similarity and threshold 0.7). On the IMDB dataset, the proposed method searched for strings whose size was between 8.74 and 34.06 and retrieved 4.63 strings per query string. The proposed algorithm scanned 279.7 strings in 4.6 inverted lists to obtain 232.5 candidate strings, and performed a binary search on 4.3 inverted lists containing 7,561.8 strings in all. In contrast, the AllScan algorithm had to scan 16,155.1 strings in 17.7 inverted lists and considered 9,788.7 candidate strings, only to find 4.63 similar strings.

This table clearly demonstrates three key contributions of the proposed algorithm for efficient approximate dictionary matching. First, the proposed algorithm scanned far fewer strings than did the AllScan algorithm. For example, to obtain candidate strings in the IMDB dataset, the proposed algorithm scanned 279.7 strings, whereas the AllScan algorithm scanned 16,155.1 strings. The algorithm thus examined only 1.1%–3.5% of the strings on the entire inverted lists across the three datasets. Second, the proposed algorithm considered far fewer candidates than did the AllScan algorithm: the number of candidate strings considered by the algorithm was 1.2%–6.6% of the number considered by the AllScan algorithm. Finally, the proposed algorithm read fewer inverted lists than did the AllScan algorithm. The proposed algorithm actually read 8.9 (IMDB), 6.0 (Web1T), and 31.7 (UMLS) inverted lists during the experiments⁹. These values indicate that the proposed algorithm can solve τ-overlap join problems by checking only 50.3% (IMDB), 53.6% (Web1T), and 51.9% (UMLS) of the total inverted lists retrieved for a query.

⁹ These values are 4.6 + 4.3, 3.1 + 2.9, and 14.3 + 17.4, respectively.
Table 2: Run-time statistics of the proposed algorithm for each dataset

    Averaged item      IMDB      Web1T     UMLS      Description
    min |Y|            8.74      5.35      21.87     minimum size of trigrams of target strings
    max |Y|            34.06     20.46     88.48     maximum size of trigrams of target strings
    τ                  14.13     9.09      47.77     minimum number of overlaps required per query
    |Y|                4.63      3.22      111.79    number of retrieved strings per query
    Total (averaged for each query and target size):
    # inverted lists   17.7      11.2      61.1      number of inverted lists retrieved for a query
    # strings          16,155.1  52,557.6  49,561.4  number of strings on the inverted lists
    # unique strings   9,788.7   44,834.6  17,457.5  number of unique strings on the inverted lists
    Candidate stage (averaged for each query and target size):
    # inverted lists   4.6       3.1       14.3      number of inverted lists scanned for generating candidates
    # strings          279.7     552.7     1,756.3   number of strings scanned for generating candidates
    # candidates       232.5     523.7     1,149.7   number of candidates generated for a query
    Validation stage (averaged for each query and target size):
    # inverted lists   4.3       2.9       17.4      number of inverted lists examined by binary search for a query
    # strings          7,561.8   19,843.6  20,443.7  number of strings targeted by binary search
4 Related Work

MergeSkip advances the frontier pointers of inverted lists efficiently based on the strings popped most recently from the heap. Consequently, MergeSkip can reduce the number of strings that are pushed to the heap. Similarly to the MergeOpt algorithm, DivideSkip splits the inverted lists A into two groups, S and L, but applies MergeSkip to S. In Section 3, we reported the performance of DivideSkip.

Charikar (2002) presented the Locality Sensitive Hash (LSH) function (Andoni and Indyk, 2008), which preserves the properties of cosine similarity. The essence of this function is to map strings into N-bit hash values such that the bitwise Hamming distance between the hash values of two strings approximately corresponds to the angle between the two strings. Ravichandran et al. (2005) applied LSH to the task of noun clustering. Adapting this algorithm to approximate dictionary matching, we discussed its performance in Section 3.

Several researchers have presented refined similarity measures for strings (Winkler, 1999; Cohen et al., 2003; Bergsma and Kondrak, 2007; Davis et al., 2007). Although these studies are sometimes regarded as research on approximate dictionary matching, they assume that the two strings whose similarity is to be computed are given; in other words, finding the strings in a large collection that are similar to a given string is out of their scope. Thus, a reasonable approach is for approximate dictionary matching to quickly collect candidate strings with a loose similarity threshold, and for a refined similarity measure to scrutinize each candidate string for the target application.

5 Conclusions

We presented a simple and efficient algorithm for approximate dictionary matching with the cosine, Dice, Jaccard, and overlap measures. We conducted experiments of approximate dictionary matching on large-scale datasets with person names, biomedical names, and general English words. Even though the algorithm is very simple, our experimental results showed that the proposed algorithm executed very quickly. We also confirmed that the proposed method drastically reduced the number of candidate strings considered during approximate dictionary matching. We believe that this study will advance practical NLP applications for which the execution time of approximate dictionary matching is critical.

An advantage of the proposed algorithm over existing algorithms (e.g., MergeSkip) is that it does not need to read all of the inverted lists retrieved by the query n-grams. We observed that the proposed algorithm solved τ-overlap joins by checking approximately half of the inverted lists (with cosine similarity and threshold α = 0.7). This characteristic is well suited to processing compressed inverted lists because the algorithm needs to decompress only half of the inverted lists. It is natural to extend this study to compressing and decompressing inverted lists to reduce disk space and improve query performance (Behm et al., 2009).

Acknowledgments

This work was partially supported by Grants-in-Aid for Scientific Research on Priority Areas (MEXT, Japan) and for Solution-Oriented Research for Science and Technology (JST, Japan).

References

Andoni, Alexandr and Piotr Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117–122.

Arasu, Arvind, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient exact set-similarity joins. In VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases, pages 918–929.

Behm, Alexander, Shengyue Ji, Chen Li, and Jiaheng Lu. 2009. Space-constrained gram-based indexing for efficient approximate string search. In ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering, pages 604–615.

Bergsma, Shane and Grzegorz Kondrak. 2007. Alignment-based discriminative string similarity. In ACL '07: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 656–663.

Bocek, Thomas, Ela Hunt, and Burkhard Stiller. 2007. Fast similarity search in large dictionaries. Technical Report ifi-2007.02, Department of Informatics (IFI), University of Zurich.
Chandel, Amit, P. C. Nagesh, and Sunita Sarawagi. 2006. Efficient batch top-k search for dictionary-based entity recognition. In ICDE '06: Proceedings of the 22nd International Conference on Data Engineering.

Charikar, Moses S. 2002. Similarity estimation techniques from rounding algorithms. In STOC '02: Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 380–388.

Chaudhuri, Surajit, Venkatesh Ganti, and Raghav Kaushik. 2006. A primitive operator for similarity joins in data cleaning. In ICDE '06: Proceedings of the 22nd International Conference on Data Engineering.

Cohen, William W., Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), pages 73–78.

Davis, Jason V., Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. 2007. Information-theoretic metric learning. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 209–216.

Gravano, Luis, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. 2001. Approximate string joins in a database (almost) for free. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, pages 491–500.

Henzinger, Monika. 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 284–291.

Huynh, Trinh N. D., Wing-Kai Hon, Tak-Wah Lam, and Wing-Kin Sung. 2006. Approximate string matching using compressed suffix arrays. Theoretical Computer Science, 352(1-3):240–249.

Kim, Min-Soo, Kyu-Young Whang, Jae-Gil Lee, and Min-Jae Lee. 2005. n-Gram/2L: a space and time efficient two-level n-gram inverted index structure. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 325–336.

Lee, Hongrae, Raymond T. Ng, and Kyuseok Shim. 2007. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 195–206.

Li, Chen, Bin Wang, and Xiaochun Yang. 2007. VGRAM: improving performance of approximate queries on string collections using variable-length grams. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 303–314.

Li, Chen, Jiaheng Lu, and Yiming Lu. 2008. Efficient merging and filtering algorithms for approximate string searches. In ICDE '08: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 257–266.

Liu, Xuhui, Guoliang Li, Jianhua Feng, and Lizhu Zhou. 2008. Effective indices for efficient approximate string search and similarity join. In WAIM '08: Proceedings of the Ninth International Conference on Web-Age Information Management, pages 127–134.

Manku, Gurmeet Singh, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 141–150.

Navarro, Gonzalo and Ricardo Baeza-Yates. 1998. A practical q-gram index for text retrieval allowing errors. CLEI Electronic Journal, 1(2).

Ravichandran, Deepak, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: using locality sensitive hash functions for high speed noun clustering. In ACL '05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 622–629.

Sarawagi, Sunita and Alok Kirpal. 2004. Efficient set joins on similarity predicates. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 743–754.

Wang, Wei, Chuan Xiao, Xuemin Lin, and Chengqi Zhang. 2009. Efficient approximate entity extraction with edit distance constraints. In SIGMOD '09: Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 759–770.

Winkler, William E. 1999. The state of record linkage and current research problems. Technical Report R99/04, Statistics of Income Division, Internal Revenue Service.

Xiao, Chuan, Wei Wang, and Xuemin Lin. 2008. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. In VLDB '08: Proceedings of the 34th International Conference on Very Large Data Bases, pages 933–944.