IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 2, FEBRUARY 2013 461

Supporting Search-As-You-Type
Using SQL in Databases
Guoliang Li, Jianhua Feng, Member, IEEE, and Chen Li, Member, IEEE

Abstract: A search-as-you-type system computes answers on-the-fly as a user types in a keyword query character by character. We study how to support search-as-you-type on data residing in a relational DBMS. We focus on how to support this type of search using the native database language, SQL. A main challenge is how to leverage existing database functionalities to meet the high-performance requirement to achieve an interactive speed. We study how to use auxiliary indexes stored as tables to increase search performance. We present solutions for both single-keyword queries and multikeyword queries, and develop novel techniques for fuzzy search using SQL by allowing mismatches between query keywords and answers. We present techniques to answer first-N queries and discuss how to support updates efficiently. Experiments on large, real data sets show that our techniques enable DBMS systems on a commodity computer to support search-as-you-type on tables with millions of records.

Index Terms: Search-as-you-type, databases, SQL, fuzzy search

1 INTRODUCTION

MANY information systems nowadays improve user search experiences by providing instant feedback as users formulate search queries. Most search engines and online search forms support autocompletion, which shows suggested queries or even answers on the fly as a user types in a keyword query character by character. For instance, consider the Web search interface at Netflix,^1 which allows a user to search for movie information. If a user types in a partial query "mad", the system shows movies with a title matching this keyword as a prefix, such as "Madagascar" and "Mad Men: Season 1". The instant feedback helps the user not only in formulating the query, but also in understanding the underlying data. This type of search is generally called search-as-you-type or type-ahead search.

Since many search systems store their information in a backend relational DBMS, a question arises naturally: how to support search-as-you-type on the data residing in a DBMS? Some databases such as Oracle and SQL Server already support prefix search, and we could use this feature to do search-as-you-type. However, not all databases provide this feature. For this reason, we study new methods that can be used in all databases. One approach is to develop a separate application layer on the database to construct indexes, and implement algorithms for answering queries. While this approach has the advantage of achieving a high performance, its main drawback is duplicating data and indexes, resulting in additional hardware costs. Another approach is to use database extenders, such as DB2 Extenders, Informix DataBlades, Microsoft SQL Server Common Language Runtime (CLR) integration, and Oracle Cartridges, which allow developers to implement new functionalities to a DBMS. This approach is not feasible for databases that do not provide such an extender interface, such as MySQL. Since it needs to utilize proprietary interfaces provided by database vendors, a solution for one database may not be portable to others. In addition, an extender-based solution, especially one implemented in C/C++, could cause serious reliability and security problems to database engines.

In this paper we study how to support search-as-you-type on DBMS systems using the native query language (SQL). In other words, we want to use SQL to find answers to a search query as a user types in keywords character by character. Our goal is to utilize the built-in query engine of the database system as much as possible. In this way, we can reduce the programming efforts to support search-as-you-type. In addition, the solution developed on one database using standard SQL techniques is portable to other databases which support the same standard. Similar observations are also made by Gravano et al. [17] and Jestes et al. [23], who use SQL to support similarity join in databases.

A main question when adopting this attractive idea is: Is it feasible and scalable? In particular, can SQL meet the high-performance requirement to implement an interactive search interface? Studies have shown that such an interface requires each query be answered within 100 milliseconds [38]. DBMS systems are not specially designed for keyword queries, making it more challenging to support search-as-you-type. As we will see later in this paper, some important functionalities to support search-as-you-type require join operations, which could be rather expensive to execute by the query engine.

1. http://www.netflix.com/BrowseSelection.

G. Li and J. Feng are with the Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: {liguoliang, fengjh}@tsinghua.edu.cn.
C. Li is with the Department of Computer Science, School of Information and Computer Sciences, University of California, Irvine, CA 92697-3435. E-mail: [email protected].
Manuscript received 30 Dec. 2010; revised 6 May 2011; accepted 9 June 2011; published online 23 June 2011.
Recommended for acceptance by P. Ipeirotis.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2010-12-0704.
Digital Object Identifier no. 10.1109/TKDE.2011.148.
1041-4347/13/$31.00 2013 IEEE. Published by the IEEE Computer Society

TABLE 1
Table dblp: A Sample Publication Table (about Privacy)

The scalability becomes even more unclear if we want to support two useful features in search-as-you-type, namely multikeyword search and fuzzy search. In multikeyword search, we allow a query string to have multiple keywords, and find records that match these keywords, even if the keywords appear at different places. For instance, we allow a user who types in a query "privacy mining rak" to find a publication by "Rakesh Agrawal" with a title including the keywords "privacy" and "mining", even though these keywords are at different places in the record. In fuzzy search, we want to allow minor mismatches between query keywords and answers. For instance, a partial query "aggraw" should find a record with a keyword "agrawal" despite the typo in the query. While these features can further improve user search experiences, supporting them makes it even more challenging to do search-as-you-type inside DBMS systems.

In this paper, we develop various techniques to address these challenges. In Section 3, we propose two types of methods to support search-as-you-type for single-keyword queries, based on whether they require additional index structures stored as auxiliary tables. We discuss the methods that use SQL to scan a table and verify each record by calling a user-defined function (UDF) or using the LIKE predicate. We study how to use auxiliary tables to increase performance.

In Section 4, we study how to support fuzzy search for single-keyword queries. We discuss a gram-based method and a UDF-based method. As the two methods have a low performance, we propose a new neighborhood-generation-based method, using the idea that two strings are similar only if they have common neighbors obtained by deleting characters. To further improve the performance, we propose to incrementally answer a query by using previously computed results and utilizing built-in indexes on key attributes.

In Section 5, we extend the techniques to support multikeyword queries. We develop a word-level incremental method to efficiently answer multikeyword queries. Notice that when deployed in a Web application, the incremental-computation algorithms do not need to maintain session information, since the results of earlier queries are stored inside the database and shared by future queries.

We propose efficient techniques to progressively find the first-N answers in Section 6. We also discuss how to support updates efficiently in Section 7.

We have conducted a thorough experimental evaluation using large, real data sets in Section 8. We compare the advantages and limitations of different approaches for search-as-you-type. The results show that our SQL-based techniques enable DBMS systems running on a commodity computer to support search-as-you-type on tables with millions of records.

It is worth emphasizing that although our method shares an incremental-search idea with the earlier results in [24], developing new techniques in a DBMS environment is technically very challenging. A main challenge is how to utilize the limited expressive power of the SQL language (compared with other languages such as C++ and Java) to support efficient search. We study how to use the available resources inside a DBMS, such as the capabilities to build auxiliary tables, to improve query performance. An interesting observation is that despite the fact we need SQL queries with join operations, using carefully designed auxiliary tables, built-in indexes on key attributes, foreign-key constraints, and incremental algorithms using cached results, these SQL queries can be executed efficiently by the DBMS engine to achieve a high speed.

2 PRELIMINARIES

We first formulate the problem of search-as-you-type in DBMS (Section 2.1) and then discuss different ways to support search-as-you-type (Section 2.2).

2.1 Problem Formulation

Let $T$ be a relational table with attributes $A_1, A_2, \ldots, A_\ell$. Let $R = \{r_1, r_2, \ldots, r_n\}$ be the collection of records in $T$, and $r_i[A_j]$ denote the content of record $r_i$ in attribute $A_j$. Let $W$ be the set of tokenized keywords in $R$.

2.1.1 Search-as-You-Type for Single-keyword Queries

Exact Search: As a user types in a single partial (prefix) keyword $w$ character by character, search-as-you-type on-the-fly finds the records that contain keywords with a prefix $w$. We call this search paradigm prefix search. Without loss of generality, each tokenized keyword in the data set and queries is assumed to use lower case characters. For example, consider the data in Table 1, $A_1$ = title, $A_2$ = authors, $A_3$ = booktitle, and $A_4$ = year. $R = \{r_1, \ldots, r_{10}\}$. $r_3[\text{booktitle}]$ = "sigmod".

$W = \{\text{privacy}, \text{sigmod}, \text{sigir}, \ldots\}$. If a user types in a query "sig", we return records $r_3$, $r_6$, and $r_9$. In particular, $r_3$ contains a keyword "sigmod" with a prefix "sig".

Fuzzy Search: As a user types in a single partial keyword $w$ character by character, fuzzy search on-the-fly finds records with keywords similar to the query keyword. In Table 1, assuming a user types in a query "corel", record $r_7$ is a relevant answer since it contains a keyword "correlation" with a prefix "correl" similar to the query keyword "corel".

We use edit distance to measure the similarity between strings. Formally, the edit distance between two strings $s_1$ and $s_2$, denoted by $ed(s_1, s_2)$, is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform $s_1$ to $s_2$. For example, $ed(\text{corelation}, \text{correlation}) = 1$ and $ed(\text{coralation}, \text{correlation}) = 2$. Given an edit-distance threshold $\tau$, we say a prefix $p$ of a keyword in $W$ is similar to the partial keyword $w$ if $ed(p, w) \le \tau$. We say a keyword $d$ in $W$ is similar to the partial keyword $w$ if $d$ has a prefix $p$ such that $ed(p, w) \le \tau$. Fuzzy search finds the records with keywords similar to the query keywords.

2.1.2 Search-as-You-Type for Multikeyword Queries

Exact Search: Given a multikeyword query $Q$ with $m$ keywords $w_1, w_2, \ldots, w_m$, as the user is completing the last keyword $w_m$, we treat $w_m$ as a partial keyword and other keywords as complete keywords.^2 As a user types in query $Q$ character by character, search-as-you-type on-the-fly finds the records that contain the complete keywords and a keyword with a prefix $w_m$. For example, if a user types in a query "privacy sig", search-as-you-type returns records $r_3$, $r_6$, and $r_9$. In particular, $r_3$ contains the complete keyword "privacy" and a keyword "sigmod" with a prefix "sig".

Fuzzy Search: Fuzzy search on-the-fly finds the records that contain keywords similar to the complete keywords and a keyword with a prefix similar to partial keyword $w_m$. For instance, suppose edit-distance threshold $\tau = 1$. Assuming a user types in a query "privicy corel", fuzzy type-ahead search returns record $r_7$ since it contains a keyword "privacy" similar to the complete keyword "privicy" and a keyword "correlation" with a prefix "correl" similar to the partial keyword "corel".

2. Our method can be easily extended to the case that every keyword is treated as a partial keyword.

2.2 Different Approaches for Search-as-You-Type

We discuss different possible methods to support search-as-you-type and give their advantages and limitations.

The first method is to use a separate application layer, which can achieve a very high performance as it can use various programming languages and complex data structures. However, it is isolated from the DBMS systems.

The second method is to use database extenders. However, this extension-based method is not safe for the query engine, as it could cause reliability and security problems to the database engine. This method depends on the API of the specific DBMS being used, and different DBMS systems have different APIs. Moreover, this method does not work if a DBMS has no such extender feature, e.g., MySQL.

The third method is to use SQL. The SQL-based method is more compatible since it uses standard SQL. Even if DBMS systems do not provide the search-as-you-type extension feature (indeed no DBMS systems provide such an extension), the SQL-based method can also be used in this case. Thus, the SQL-based method is more portable to a different platform than the first two methods.

In this paper, we focus on the SQL-based method and develop various techniques to achieve a high interactive speed.

3 EXACT SEARCH FOR SINGLE KEYWORD

This section proposes two types of methods to use SQL to support search-as-you-type for single-keyword queries. In Section 3.1, we discuss no-index methods. In Section 3.2, we build auxiliary tables as index structures to answer a query.

3.1 No-Index Methods

A straightforward way to support search-as-you-type is to issue an SQL query that scans each record and verifies whether the record is an answer to the query. There are two ways to do the checking: 1) Calling User-Defined Functions (UDFs). We can add functions into databases to verify whether a record contains the query keyword; and 2) Using the LIKE predicate. Databases provide a LIKE predicate to allow users to perform string matching. We can use the LIKE predicate to check whether a record contains the query keyword. This method may introduce false positives, e.g., keyword "publication" contains the query string "ic", but the keyword does not have the query string "ic" as a prefix. We can remove these false positives by calling UDFs. The two no-index methods need no additional space, but they may not scale since they need to scan all records in the table (Section 8 gives the results).

3.2 Index-Based Methods

In this section, we propose to build auxiliary tables as index structures to facilitate prefix search. Some databases such as Oracle and SQL Server already support prefix search, and we could use this feature to do prefix search. However, not all databases provide this feature. For this reason, we develop a new method that can be used in all databases. In addition, our experiments in Section 8.3 show that our method performs prefix search more efficiently.

Inverted-index table. Given a table $T$, we assign unique ids to the keywords in table $T$, following their alphabetical order. We create an inverted-index table $I_T$ with records in the form $\langle kid, rid \rangle$, where $kid$ is the id of a keyword and $rid$ is the id of a record that contains the keyword. Given a complete keyword, we can use the inverted-index table to find records with the keyword.

Prefix table. Given a table $T$, for all prefixes of keywords in the table, we build a prefix table $P_T$ with records in the form $\langle p, lkid, ukid \rangle$, where $p$ is a prefix of a keyword, $lkid$ is the smallest id of those keywords in the table $T$ having $p$ as a prefix, and $ukid$ is the largest id of those keywords having $p$ as a prefix. An interesting observation is that a complete word with $p$ as a prefix must have an ID in the keyword range $[lkid, ukid]$, and each complete word in the table $T$ with an ID in this keyword range must have a prefix $p$. Thus, given a prefix keyword $w$, we can use the prefix table to find the range of keywords with the prefix.
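The construction of the two auxiliary tables described above can be illustrated with a minimal in-memory sketch. It is written in Python rather than SQL, the three toy records are hypothetical (they are not the rows of Table 1), and the names `build_index_tables` and `prefix_search` are illustrative only. The sketch tokenizes on whitespace and stores only non-empty prefixes, both of which are simplifying assumptions.

```python
def build_index_tables(records):
    """Build (kid_of, inverted, prefix_table) from a dict {rid: text}."""
    tokens = {rid: text.lower().split() for rid, text in records.items()}

    # Assign keyword ids 1..n following alphabetical order.
    keywords = sorted({tok for toks in tokens.values() for tok in toks})
    kid_of = {kw: kid for kid, kw in enumerate(keywords, start=1)}

    # Inverted-index table: one (kid, rid) row per distinct keyword/record pair.
    inverted = sorted({(kid_of[tok], rid)
                       for rid, toks in tokens.items() for tok in toks})

    # Prefix table: prefix -> (lkid, ukid), the smallest and largest keyword
    # ids among the keywords that start with that prefix.
    prefix_table = {}
    for kw, kid in kid_of.items():
        for end in range(1, len(kw) + 1):
            p = kw[:end]
            lkid, ukid = prefix_table.get(p, (kid, kid))
            prefix_table[p] = (min(lkid, kid), max(ukid, kid))
    return kid_of, inverted, prefix_table


def prefix_search(w, inverted, prefix_table):
    """Return the ids of records containing a keyword with prefix w."""
    if w not in prefix_table:
        return []
    lkid, ukid = prefix_table[w]
    return sorted({rid for kid, rid in inverted if lkid <= kid <= ukid})


# Hypothetical toy data, for illustration only.
records = {1: "privacy sigmod", 2: "mining sigir", 3: "database vldb"}
kid_of, inverted, prefix_table = build_index_tables(records)
print(prefix_table["sig"])                           # (4, 5)
print(prefix_search("sig", inverted, prefix_table))  # [1, 2]
```

Because ids follow alphabetical order, all keywords sharing a prefix occupy a contiguous id range, which is why a single pair (lkid, ukid) per prefix is enough.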

TABLE 2
The Inverted-Index Table and Prefix Table

Fig. 1. Using inverted-index table and prefix table to support search-as-you-type.

For example, Table 2 illustrates the inverted-index table and the prefix table for the records in Table 1.^3 The inverted-index table has a tuple $\langle k_8, r_3 \rangle$ since keyword $k_8$ ("sigmod") is in record $r_3$. The prefix table has a tuple $\langle \text{sig}, k_7, k_8 \rangle$ since $k_7$ ("sigir") is the minimal id of keywords with a prefix "sig", and $k_8$ ("sigmod") is the maximal id of keywords with a prefix "sig". The ids of keywords with a prefix "sig" must be in the range $[k_7, k_8]$.

3. Here, we only use several keywords for ease of presentation.

Given a partial keyword $w$, we first get its keyword range $[lkid, ukid]$ using the prefix table $P_T$, and then find the records that have a keyword in the range through the inverted-index table $I_T$ as shown in Fig. 1. We use the following SQL to answer the prefix-search query $w$:

    SELECT T.* FROM P_T, I_T, T
    WHERE P_T.prefix = "w" AND
          P_T.ukid >= I_T.kid AND P_T.lkid <= I_T.kid AND
          I_T.rid = T.rid.

For example, assuming a user types in a partial query "sig" on table dblp (Table 1), we issue the following SQL:

    SELECT dblp.* FROM P_dblp, I_dblp, dblp
    WHERE P_dblp.prefix = "sig" AND
          P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid AND
          I_dblp.rid = dblp.rid,

which returns records $r_3$, $r_6$, and $r_9$. The SQL query first finds the keyword range $[k_7, k_8]$ based on the prefix table. Then it finds the records containing a keyword with ID in $[k_7, k_8]$ using the inverted-index table.

To answer the SQL query efficiently, we create built-in indexes on attributes prefix, kid, and rid. The SQL could first use the index on prefix to find the keyword range, and then compute the answers using the indexes on kid and rid. For example, assuming a user types in a partial query "sig" on table dblp (Table 1), we first get the keyword range of "sig" ($[k_7, k_8]$) using the index on prefix and then find records $r_3$, $r_6$, and $r_9$ using the index on kid.

4 FUZZY SEARCH FOR SINGLE KEYWORD

4.1 No-Index Methods

Recall the two no-index methods for exact search in Section 3.1. Since the LIKE predicate does not support fuzzy search, we cannot use the LIKE-based method. We can use UDFs to support fuzzy search. We use a UDF $PED(w, s)$ that takes a keyword $w$ and a string $s$ as two parameters, and returns the minimal edit distance between $w$ and the prefixes of keywords in $s$. For instance, in Table 1,

    PED("pvb", r_10[title]) = PED("pvb", "privacy in database publishing") = 1

as $r_{10}$ contains a prefix "pub" with edit distance of 1 to the query. We can improve the performance by doing early termination in the dynamic-programming computation [30] using an edit-distance threshold (if prefixes of two strings are not similar enough, then the two substrings cannot be similar), and devise a new UDF $PEDTH(w, s, \tau)$. If there is a keyword in string $s$ having prefixes with an edit distance to $w$ within $\tau$, PEDTH returns true. In this way, we issue an SQL query that scans each record and calls UDF PEDTH to verify the record.

4.2 Index-Based Methods

This section proposes to use the inverted-index table and prefix table to support fuzzy search-as-you-type. Given a partial keyword $w$, we compute its answers in two steps. First we compute its similar prefixes from the prefix table $P_T$, and get the keyword ranges of these similar prefixes. Then we compute the answers based on these ranges using the inverted-index table $I_T$ as discussed in Section 3.2. In this section, we focus on the first step: computing $w$'s similar prefixes.

4.2.1 Using UDF

Given a keyword $w$, we can use a UDF to find its similar prefixes from the prefix table $P_T$. We issue an SQL query that scans each prefix in $P_T$ and calls the UDF to check if the prefix is similar to $w$. We issue the following SQL query to answer the prefix-search query $w$:

    SELECT T.* FROM P_T, I_T, T
    WHERE PEDTH(w, P_T.prefix, τ) AND
          P_T.ukid >= I_T.kid AND P_T.lkid <= I_T.kid AND
          I_T.rid = T.rid.

We can use length filtering to improve the performance, by adding the following clause to the where clause:

    LENGTH(P_T.prefix) <= LENGTH(w) + τ AND
    LENGTH(P_T.prefix) >= LENGTH(w) - τ.

4.2.2 Gram-Based Method

There are many q-gram-based methods to support approximate string search [17]. Given a string $s$, its q-grams are its

TABLE 3
Neighborhood-Generation Table ($\tau = 1$)

Fig. 2. Using the q-gram table and the neighborhood generation table to support fuzzy search.

substrings with length $q$. Let $G_q(s)$ denote the set^4 of its q-grams and $|G_q(s)|$ denote the size of $G_q(s)$. For example, for "pvldb" and "vldb", we have $G_2(\text{pvldb}) = \{\text{pv}, \text{vl}, \text{ld}, \text{db}\}$ and $G_2(\text{vldb}) = \{\text{vl}, \text{ld}, \text{db}\}$. Strings $s_1$ and $s_2$ have an edit distance within threshold $\tau$ only if

$$|G_q(s_1) \cap G_q(s_2)| \ge \max(|s_1|, |s_2|) + 1 - q - \tau \cdot q \quad [17],$$

where $|s_1|$ and $|s_2|$ are the lengths of string $s_1$ and $s_2$, respectively. This technique is called count filtering.

4. We need to use multisets to accommodate duplicated grams.

To find similar prefixes of a query keyword $w$, besides maintaining the inverted-index table and the prefix table, we need to create a q-gram table $G_T$ with records in the form $\langle p, qgram \rangle$, where $p$ is a prefix in the prefix table and $qgram$ is a q-gram of $p$. Given a partial keyword $w$, we first find the prefixes in $G_T$ with no fewer than $|w| + 1 - q - \tau \cdot q$ grams in $G_q(w)$. We use the following SQL with GROUP BY command to get the candidates of $w$'s similar prefixes:

    SELECT P_T.prefix FROM G_T, P_T
    WHERE G_T.prefix = P_T.prefix AND G_T.qgram IN G_q(w)
    GROUP BY G_T.prefix
    HAVING COUNT(G_T.qgram) >= |w| + 1 - q - τ·q.

As this method may involve false positives, we have to use UDFs to verify the candidates to get the similar prefixes of $w$. Fig. 2 illustrates how to use the gram-based method to answer a query. We can further improve the query performance by using additional filtering techniques, e.g., length filtering or position filtering [17].

It could be expensive to use GROUP BY in databases, and the q-gram-based method is inefficient, especially for large q-gram tables. Moreover, this method is rather inefficient for short query keywords [46], as short keywords have smaller numbers of q-grams and the method has low pruning power.

4.2.3 Neighborhood-Generation-Based Method

Ukkonen proposed a neighborhood-generation-based method to support approximate string search [45]. We extend this method to use SQL to support fuzzy search-as-you-type. Given a keyword $w$, the strings obtained from $w$ by deleting $i$ characters are called i-deletion neighborhoods of $w$. Let $D_i(w)$ denote the set of i-deletion neighborhoods of $w$ and $\hat{D}_\tau(w) = \bigcup_{i=0}^{\tau} D_i(w)$. For example, given a string "pvldb", $D_0(\text{pvldb}) = \{\text{pvldb}\}$, and $D_1(\text{pvldb}) = \{\text{vldb}, \text{pldb}, \text{pvdb}, \text{pvlb}, \text{pvld}\}$. Suppose $\tau = 1$, $\hat{D}_\tau(\text{pvldb}) = \{\text{pvldb}, \text{vldb}, \text{pldb}, \text{pvdb}, \text{pvlb}, \text{pvld}\}$. Moreover, there is a good property that given two strings $s_1$ and $s_2$, if $ed(s_1, s_2) \le \tau$, $\hat{D}_\tau(s_1) \cap \hat{D}_\tau(s_2) \ne \emptyset$, as formalized in Lemma 1.

Lemma 1. Given two strings $s_1, s_2$, if $ed(s_1, s_2) \le \tau$, $\hat{D}_\tau(s_1) \cap \hat{D}_\tau(s_2) \ne \emptyset$.

Proof. We can use deletion operations to replace the substitution and insertion operations as follows: suppose we can transform $s_1$ to $s_2$ with $d$ deletions, $i$ insertions, and $r$ substitutions, such that $ed(s_1, s_2) = d + i + r \le \tau$. We can transform $s_1$ and $s_2$ to the same string by doing $d + r$ deletions on $s_1$ and $i + r$ deletions on $s_2$, respectively. Thus, $\hat{D}_\tau(s_1) \cap \hat{D}_\tau(s_2) \ne \emptyset$. □

We use this property as a filter to find similar prefixes of the query keyword $w$. We can prune all the prefixes if they have no common i-deletion neighborhoods with $w$. To this end, for prefixes in the prefix table $P_T$, we create a deletion-based neighborhood-generation table $D_T$ with records in the form $\langle p, \text{i-deletion}, i \rangle$, where $p$ is a prefix in the prefix table $P_T$ and i-deletion is an i-deletion neighborhood of $p$ ($i \le \tau$). For example, Table 3 gives a neighborhood-generation table. Given a query keyword $w$, we first find the similar prefixes in $D_T$ which have i-deletion neighborhoods in $\hat{D}_\tau(w)$. Then we use UDFs to verify the candidates to get similar prefixes. Formally, we use the following SQL to generate the candidates of $w$'s similar prefixes:

    SELECT DISTINCT prefix FROM D_T
    WHERE D_T.i-deletion IN \hat{D}_τ(w).

Assuming a user types in a keyword "pvldb", we find the prefixes in $D_T$ that have i-deletion neighborhoods in {pvldb, vldb, pldb, pvdb, pvlb, pvld}. Here we find "vldb" similar to "pvldb" with edit distance 1.

This method is efficient for short strings. However, it is inefficient for long strings, especially for large edit-distance thresholds, because a string with length $n$ has $\binom{n}{i}$ i-deletion neighborhoods and in total $O(\min(n^\tau, 2^n))$ neighborhoods. It needs large space to store these neighborhoods.

As the three methods have some limitations, we propose an incremental algorithm which uses previously computed results to answer subsequent queries in Section 4.3.

4.3 Incrementally Computing Similar Prefixes

The previous methods have the following limitations. First, they need to find similar prefixes of a keyword from scratch. Second, they may need to call UDFs many times. In this section, we propose a character-level incremental method to find similar prefixes of a keyword as a user types character by character. Chaudhuri and Kaushik [14] and Ji et al. [24]

TABLE 4
Similar-Prefix Table $S_{dblp}^{vldb}$ ($\tau = 1$)

Fig. 3. Using character-level incremental method to support fuzzy search.

proposed to use a trie structure to incrementally compute similar prefixes. While adopting a similar incremental-search framework, we focus on the challenge of how to use SQL to do it. We develop effective index structures using auxiliary tables and devise pruning techniques to achieve a high speed. We develop novel techniques on how to use auxiliary tables, built-in indexes on key attributes, and pruning techniques. We also prove the correctness of our method.

4.3.1 Incremental-Computation Framework

Assume a user has typed in a keyword $w = c_1 c_2 \cdots c_x$ character by character. For each prefix $p = c_1 c_2 \cdots c_i$ ($i \le x$), we maintain a similar-prefix table $S_T^p$ with records in the form $\langle prefix, ed(p, prefix) \rangle$, which keeps all the prefixes similar to $p$ and their corresponding edit distances. As the similar-prefix tables are small (usually within 1,000 records), we can use in-memory tables to store them. The similar-prefix table is shared by different queries. If the table gets too big, we can periodically remove some of its entries. In other words, the incremental-computation algorithm does not need to maintain session information for different queries.

Suppose the user types one more character $c_{x+1}$ and submits a new query $w' = c_1 c_2 \cdots c_x c_{x+1}$. We use table $S_T^w$ to compute $S_T^{w'}$ (Section 4.3.2), find the keyword ranges of similar prefixes in $S_T^{w'}$ by joining the similar-prefix table $S_T^{w'}$ and the prefix table $P_T$, and compute the answer of $w'$ using the inverted-index table $I_T$ (Fig. 3).

Based on similar-prefix table $S_T^{w'}$ of keyword $w'$, we use the following SQL to answer the single-keyword query $w'$:

    SELECT T.* FROM S_T^{w'}, P_T, I_T, T
    WHERE S_T^{w'}.prefix = P_T.prefix AND
          P_T.ukid >= I_T.kid AND P_T.lkid <= I_T.kid AND
          I_T.rid = T.rid.

We can create indexes on the attribute prefix of the prefix table $P_T$ and the similar-prefix table $S_T^{w'}$, and in this way the SQL can be executed efficiently.

For example, suppose $\tau = 1$. Assume a user has typed in a keyword "vld" and its similar-prefix table $S_{dblp}^{vld} = \{\langle \text{vl}, 1 \rangle, \langle \text{vld}, 0 \rangle, \langle \text{vldb}, 1 \rangle, \langle \text{pvld}, 1 \rangle\}$ has been computed and cached. Suppose the user types in one more character "b". We first generate the similar-prefix table $S_{dblp}^{vldb} = \{\langle \text{vldb}, 0 \rangle, \langle \text{vld}, 1 \rangle, \langle \text{pvldb}, 1 \rangle, \langle \text{vldbj}, 1 \rangle\}$ based on $S_{dblp}^{vld}$ (Section 4.3.2), as shown in Table 4, and then issue the following SQL to answer the query "vldb":

    SELECT dblp.* FROM S_dblp^{vldb}, P_dblp, I_dblp, dblp
    WHERE S_dblp^{vldb}.prefix = P_dblp.prefix AND
          P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid AND
          I_dblp.rid = dblp.rid,

which returns records $r_1$, $r_4$, and $r_8$.

4.3.2 Incremental Computation Using SQL

In this section, we discuss how to incrementally compute the similar-prefix table $S_T^w$.

Initialization. For the empty string $\epsilon$, its similar-prefix table $S_T^\epsilon$ has all the prefixes with a length within the edit-distance threshold $\tau$, and we can compute such prefixes and their corresponding edit distances using the following SQL:

    SELECT prefix, LENGTH(prefix) AS ed FROM P_T
    WHERE LENGTH(prefix) <= τ.

For example, consider the prefix table with prefixes $\{\epsilon, \text{p}, \text{pv}, \text{pvl}, \text{pvld}, \text{pvldb}, \text{v}, \text{vl}, \text{vld}, \text{vldb}\}$. Suppose $\tau = 1$. This SQL returns $S_T^\epsilon = \{\langle \epsilon, 0 \rangle, \langle \text{p}, 1 \rangle, \langle \text{v}, 1 \rangle\}$, as shown in Table 5a. We can precompute and materialize the similar-prefix table of the empty string.

Then, we discuss how to use $S_T^w$ to compute $S_T^{w'}$. We observe that only those prefixes having a prefix in $S_T^w$ could be similar prefixes of $w'$, based on Lemma 2.

Lemma 2. If $\langle v', e' \rangle \in S_T^{w'}$, then $\exists \langle v, e \rangle \in S_T^w$ such that $v$ is a prefix of $v'$ and $e \le e'$.

TABLE 5
Similar-Prefix Tables of a Prefix Table with Strings in $\{\epsilon, \text{v}, \text{vl}, \text{vld}, \text{vldb}, \text{p}, \text{pv}, \text{pvl}, \text{pvld}, \text{pvldb}\}$

($\tau = 1$. For ease of presentation, we add two columns from and op, where column from denotes where the record is derived from, and column op denotes operations: m: match, d: deletion, i: insertion, s: substitution.)
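The initialization step of Section 4.3.2 can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in DBMS. It assumes a simplified prefix table named `PT` that holds only the prefix column (the lkid and ukid columns are omitted), and it stores the empty string for the empty prefix. The table name and the `ORDER BY` clause are choices made for this illustration.

```python
import sqlite3

TAU = 1  # edit-distance threshold used in the running example

conn = sqlite3.connect(":memory:")
# Simplified prefix table: only the prefix column is kept in this sketch.
conn.execute("CREATE TABLE PT (prefix TEXT PRIMARY KEY)")
prefixes = ["", "p", "pv", "pvl", "pvld", "pvldb", "v", "vl", "vld", "vldb"]
conn.executemany("INSERT INTO PT VALUES (?)", [(p,) for p in prefixes])

# Similar-prefix table of the empty string: every prefix whose length is
# within TAU, with that length as its edit distance to the empty string.
rows = conn.execute(
    "SELECT prefix, LENGTH(prefix) AS ed FROM PT "
    "WHERE LENGTH(prefix) <= ? ORDER BY prefix",
    (TAU,),
).fetchall()
print(rows)  # [('', 0), ('p', 1), ('v', 1)]
```

The three rows correspond to the entries for the empty string, "p", and "v" given in the example above.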

Fig. 4. Incrementally computing similar prefixes. ($d$ is a string with length $|d|$ no larger than $\tau - e$.)

Proof. Consider a transformation from node $v'$ to keyword $w'$ with $ed(v', w')$ operations. In the transformation, we consider the last match case between two characters in $v'$ and $w'$. Let $p_v$ and $p_w$ be, respectively, the prefixes of $v'$ and $w'$ before the last match characters (including the two characters).
If $p_v = p_w = \epsilon$,$^5$ we have $e' = ed(v', w') = \max(|v'|, |w'|)$. Let $v$ and $w$, respectively, denote the prefixes of $v'$ and $w'$ without the last characters. We have $e = ed(v, w) = \max(|v'|, |w'|) - 1$. Thus, $v$ is a prefix of $v'$, $e \le e'$, and $\langle v, e\rangle$ must be in $ST_w$. Otherwise $p_v = p_w \ne \epsilon$. Let $v$ and $w$, respectively, denote the prefixes of $p_v$ and $p_w$ without the last characters. We have $e = ed(v, w) = ed(v', w') = e'$. Thus, $v$ is a prefix of $v'$, $e \le e'$, and $\langle v, e\rangle$ must be in $ST_w$. □

Based on this property, for a new prefix query $w' = w c_{x+1}$ obtained by concatenating query $w$ and a new character $c_{x+1}$, we construct $ST_{w'}$ by checking each record in $ST_w$ as follows: for $\langle v, e\rangle \in ST_w$ ($ed(v, w) = e$), we consider the following basic edit operations (as illustrated in Fig. 4).

Deletion. We can transform $w' = w c_{x+1}$ to $v$ by first transforming $w$ to $v$ (with $e$ operations) and then deleting $c_{x+1}$ from $w'$. The transformation distance (the number of operations in the transformation) is $e + 1$.$^6$ If $e + 1 \le \tau$, $v$ is similar to $w'$. We get such prefixes using the following SQL:

SELECT prefix, ed + 1 AS ed FROM ST_w WHERE ed < τ.

For example, suppose $w' = v$, $w = \epsilon$, and $ST_\epsilon$ is the one shown in Table 5a. Since only $\langle \epsilon, 0\rangle \in ST_\epsilon$ satisfies the SQL condition, this SQL returns $\langle \epsilon, 1\rangle$. Apparently $\epsilon$ is similar to the new query $v$ with an edit distance of 1.

Match. Let $v_m$ be the concatenated string of string $v$ and character $c_{x+1}$, i.e., $v_m = v c_{x+1}$. We can transform $w'$ to $v_m$ by first transforming $w$ to $v$ (with $e$ operations) and then matching the last character. Thus, the transformation distance is $e$. As $e \le \tau$, $v_m$ is similar to $w'$, and we can get all such similar prefixes using the following SQL:

SELECT PT.prefix, ed FROM ST_w, PT
WHERE PT.prefix = CONCAT(ST_w.prefix, c_{x+1}),

where CONCAT(s, t) concatenates two strings s and t. In the SQL, ST_w.prefix corresponds to $v$ and PT.prefix corresponds to $v_m$. In our example, as $\langle \epsilon, 0\rangle \in ST_\epsilon$ and $\langle p, 1\rangle \in ST_\epsilon$ satisfy the SQL condition, this SQL returns $\langle v, 0\rangle$ and $\langle pv, 1\rangle$.

5. The operation between the empty string and the root node is also a match case.
6. For each similar string of $w'$, we can compute its real edit distance to $w'$ by keeping the smallest one as discussed later.

Insertion. If $v_m$ is in $PT$, the strings with $v_m$ as a prefix could also be similar to $w'$. For each such string $v_i$, we can transform $w'$ to $v_i$ by first transforming $w'$ to $v_m$ (with $e$ operations) and then inserting $|v_i| - |v_m| = |v_i| - |v| - 1$ characters after $v_m$. Thus, the transformation distance is $e + |v_i| - |v| - 1$. If $e + |v_i| - |v| - 1 \le \tau$, $v_i$ is similar to $w'$. We can get such similar prefixes using the following SQL:

SELECT PT.prefix, LENGTH(PT.prefix) - LENGTH(ST_w.prefix) - 1 + ed AS ed FROM ST_w, PT
WHERE LENGTH(PT.prefix) > LENGTH(ST_w.prefix) + 1 AND
ed + LENGTH(PT.prefix) - LENGTH(ST_w.prefix) - 1 <= τ AND
SUBSTR(PT.prefix, 1, LENGTH(ST_w.prefix) + 1) = CONCAT(ST_w.prefix, c_{x+1}),

where SUBSTR(str, pos, len) returns a substring with len characters from the string str, starting at position pos. In the SQL statement, ST_w.prefix, ed, and PT.prefix, respectively, correspond to $v$, $e$, and $v_i$. In our running example, as only $\langle \epsilon, 0\rangle \in ST_\epsilon$ satisfies the SQL condition, this SQL returns $\langle vl, 1\rangle$. Note that $v$ matches the query keyword and we can do an insertion of $l$ after $v$; thus $vl$ is similar to the query $v$ with an edit distance of 1.

Substitution. Let $v_s$ be the concatenated string of $v$ and a character $c'$ where $c' \ne c_{x+1}$, i.e., $v_s = v c'$. We can transform $w'$ to $v_s$ by first transforming $w$ to $v$ (with $e$ operations) and then substituting $c_{x+1}$ for $c'$. Thus, the transformation distance is $e + 1$. If $e + 1 \le \tau$, $v_s$ is similar to $w'$, and we can get all such prefixes using the following SQL:

SELECT PT.prefix, ed + 1 AS ed FROM ST_w, PT
WHERE ed < τ
AND SUBSTR(PT.prefix, 1, LENGTH(PT.prefix) - 1) = ST_w.prefix
AND PT.prefix != CONCAT(ST_w.prefix, c_{x+1}).

In the SQL, PT.prefix and ST_w.prefix, respectively, correspond to $v_s$ and $v$. In our running example, as $\langle \epsilon, 0\rangle \in ST_\epsilon$, this SQL returns $\langle p, 1\rangle$ since we can use $p$ to replace $v$.

We insert the results of these SQL queries into the similar-prefix table $ST_{w'}$. During the insertion, it is possible to add multiple pairs $\langle v', e'_1\rangle$ and $\langle v', e'_2\rangle$ for the same similar prefix $v'$. In this case, only the pair with the smallest edit distance should be kept in $ST_{w'}$, because we only keep the minimum number of edit operations needed to transform the string $v'$ to the string $w'$. To this end, we can first insert all strings generated by the above SQL queries, and then use the following SQL to prune the entries with larger edit distances in a postprocessing step:

DELETE S1 FROM ST_{w'} AS S1, ST_{w'} AS S2
WHERE S1.prefix = S2.prefix AND S1.ed > S2.ed.

Theorem 1 shows the correctness of our method.

Theorem 1. For a query string $w = c_1 c_2 \cdots c_x$, let $ST_w$ be its similar-prefix table. Consider a new query string $w' = c_1 c_2 \cdots c_x c_{x+1}$. $ST_{w'}$ is generated by the above SQL queries based on $ST_w$. 1) Completeness: Every similar prefix of the new query string $w'$ will be in $ST_{w'}$. 2) Soundness: Each string in $ST_{w'}$ is a similar prefix of the new query string $w'$.

Proof. 1) We first prove the completeness. We prove it by induction. This claim is obviously true when $w = w_0$.

Suppose the claim is true for $w_x = w$ with $x$ characters. We want to prove this claim is also true for a new query string $w_{x+1}$, where $w_{x+1} = w' = w_x c_{x+1}$.

Suppose $v$ is a similar prefix of $w_{x+1}$. If $v = \epsilon$, then by definition $ed(v, w_{x+1}) = ed(\epsilon, w_{x+1}) = x + 1 \le \tau$, and $x \le \tau - 1 < \tau$. Thus, $ed(v, w_x) = ed(\epsilon, w_x) = x \le \tau$, and $v$ is also a similar prefix of $w_x$. When we consider this node $v$, we add the pair $\langle v, x + 1\rangle$ (i.e., $\langle v, ed(v, w_{x+1})\rangle$) into $ST_{w_{x+1}}$.

Now consider the case where the similar prefix $v$ of $w_{x+1}$ is not the empty string. Let $v = n_{y+1} = n_y d$, i.e., it has $y + 1$ characters, and is concatenated from a string $n_y$ and a character $d$. By definition, $ed(n_{y+1}, w_{x+1}) \le \tau$. We want to prove that $\langle n_{y+1}, ed(n_{y+1}, w_{x+1})\rangle$ will be added to $ST_{w_{x+1}}$.

Based on the idea in the classic dynamic-programming algorithm, we consider the following four cases in the minimum number of edit operations to transform $n_{y+1}$ to $w_{x+1}$.

Case 1: Deleting the last character $c_{x+1}$ from $w_{x+1}$, and transforming $n_{y+1}$ to $w_x$. Since $ed(n_{y+1}, w_{x+1}) = ed(n_{y+1}, w_x) + 1 \le \tau$, we have $ed(n_{y+1}, w_x) \le \tau - 1 < \tau$. Thus, $n_{y+1}$ is a similar prefix of $w_x$. Based on the induction assumption, $\langle n_{y+1}, ed(n_{y+1}, w_x)\rangle$ must be in $ST_{w_x}$. From the node $n_{y+1}$, our method considers the deletion case when it considers the node $n_y$, and adds $\langle n_{y+1}, ed(n_{y+1}, w_x) + 1\rangle$ to $ST_{w_{x+1}}$, which is exactly $\langle n_{y+1}, ed(n_{y+1}, w_{x+1})\rangle$.

Case 2: Substituting the character $d$ of $n_{y+1}$ for the last character $c_{x+1}$ of $w_{x+1}$. Since $ed(n_{y+1}, w_{x+1}) = ed(n_y, w_x) + 1 \le \tau$, we have $ed(n_y, w_x) \le \tau - 1 < \tau$. Thus, $n_y$ is a similar prefix of $w_x$. Based on the induction assumption, $\langle n_y, ed(n_y, w_x)\rangle$ must be in $ST_{w_x}$. From node $n_y$, our method considers the substitution case when it considers this child node ($n_{y+1}$) of the node $n_y$, and adds $\langle n_{y+1}, ed(n_y, w_x) + 1\rangle$ to $ST_{w_{x+1}}$, which is exactly $\langle n_{y+1}, ed(n_{y+1}, w_{x+1})\rangle$.

Case 3: The last character $c_{x+1}$ of $w_{x+1}$ matching the character $d$ of $n_{y+1}$. Since $ed(n_{y+1}, w_{x+1}) = ed(n_y, w_x) \le \tau$, $n_y$ is a similar prefix of $w_x$. Based on the induction assumption, $\langle n_y, ed(n_y, w_x)\rangle$ must be in $ST_{w_x}$. From node $n_y$, our method considers the match case when it considers this child node ($n_{y+1}$) of the node $n_y$, and adds $\langle n_{y+1}, ed(n_y, w_x)\rangle$ to $ST_{w_{x+1}}$, which is exactly $\langle n_{y+1}, ed(n_{y+1}, w_{x+1})\rangle$.

Case 4: Transforming $n_y$ to $w_{x+1}$ and inserting character $d$ of $n_{y+1}$. For each transformation from $n_y$ to $w_{x+1}$, we consider the last character $c_{x+1}$ of $w_{x+1}$. First, we can show that this transformation cannot delete the character $c_{x+1}$, since otherwise we can combine this deletion of $c_{x+1}$ and the insertion of $d$ into one substitution, yielding another transformation with a smaller number of edit operations, contradicting the minimality of edit distance. Thus, we can just consider two possible operations on the character $c_{x+1}$ in this transformation. 1) Matching $c_{x+1}$ for the character of an ancestor $n_a$ of $n_{y+1}$: in this case, since $ed(n_{y+1}, w_{x+1}) = ed(n_{a-1}, w_x) + y - a + 1 \le \tau$, we have $ed(n_{a-1}, w_x) \le \tau$, and $n_{a-1}$ is a similar prefix of $w_x$. Based on the induction assumption, $\langle n_{a-1}, ed(n_{a-1}, w_x)\rangle$ must be in $ST_{w_x}$. From node $n_{a-1}$, the algorithm considers the matching case, and adds $\langle n_{y+1}, ed(n_{a-1}, w_x) + y - a + 1\rangle$ to $ST_{w_{x+1}}$, which is $\langle n_{y+1}, ed(n_{y+1}, w_{x+1})\rangle$. 2) Substituting $c_{x+1}$ for the character of an ancestor $n_a$ of $n_{y+1}$: in this case, instead of substituting $c$ for the character of $n_a$ and inserting the character $d$, we can insert the character of $n_a$ and substitute $c_{x+1}$ for the last character $d$. Then we get another transformation with the same number of edit operations. (Characters $c$ and $d$ cannot be the same, since otherwise the new transformation could have fewer edit operations, contradicting the minimality of edit distance.) We use the same argument as in Case 2 to show that our method adds $\langle n_{y+1}, ed(n_{y+1}, w_{x+1})\rangle$ to $ST_{w_{x+1}}$.

In summary, for all cases the algorithm adds $\langle n_{y+1}, ed(n_{y+1}, w_{x+1})\rangle$ to $ST_{w_{x+1}}$.

2) Then we prove the soundness. By definition, the transformation distance of the two strings in each tuple added by the algorithm is no less than their edit distance. That is, $ed(n, w') \le \tau$. Thus, $n$ must be a similar prefix of $w'$. □

4.3.3 Improving Performance Using Indexes
As we can create indexes on the attribute prefix of the prefix table $PT$ and the similar-prefix table $ST_w$, the SQL statements for the deletion and match cases can be efficiently executed using the indexes. However, the SQL for the substitution case contains the predicate SUBSTR(PT.prefix, 1, LENGTH(PT.prefix) - 1) = ST_w.prefix. Although we can create an index on the attribute prefix of the similar-prefix table $ST_w$, if there is no index to support the expression SUBSTR(PT.prefix, 1, LENGTH(PT.prefix) - 1), it is rather expensive to execute the SQL. To improve the performance, we can alter table $PT$ by adding an attribute parent = SUBSTR(PT.prefix, 1, LENGTH(PT.prefix) - 1), and create a table $PT\langle prefix, lkid, ukid, parent\rangle$. Using this table, we propose an alternative method and use the following SQL for the substitution case:

SELECT PT.prefix, ed + 1 AS ed FROM ST_w, PT
WHERE PT.parent = ST_w.prefix AND ed < τ AND
PT.prefix != CONCAT(ST_w.prefix, c_{x+1}).

We can create an index on attribute parent of prefix table $PT$ to increase search performance.

Similarly, it is inefficient to execute the SQL for the insertion case as it contains a complicated predicate

SUBSTR(PT.prefix, 1, LENGTH(ST_w.prefix) + 1) = CONCAT(ST_w.prefix, c_{x+1}).

Next we discuss how to improve the SQL using indexes. Let $Y_0$ be the similar-prefix table of the results of the SQL for the match case. For insertions, we need to find the similar strings with prefixes in $Y_0$. Let $Y_{i+1}$ be the similar-prefix table composed of prefixes obtained by appending one more character to those prefixes in $Y_i$, for $0 \le i \le \tau - 1$. Obviously $\bigcup_{i \ge 1} Y_i$ is exactly the result of the SQL for the insertion case. Note that $Y_0$ can be efficiently computed as we can use the SQL for the match case to generate it. Iteratively, we compute $Y_{i+1}$ based on $Y_i$ using the following SQL:

SELECT PT.prefix, ed + 1 AS ed FROM Y_i, PT
WHERE PT.parent = Y_i.prefix AND ed < τ.

We can create indexes on the parent attribute of the prefix table $PT$ and the prefix attribute of $Y_i$ to improve the performance. In our running example, $Y_0 = \{\langle v, 0\rangle, \langle pv, 1\rangle\}$. As only $\langle v, 0\rangle \in Y_0$ satisfies the SQL condition, this SQL returns $\langle vl, 1\rangle$. Thus, we can use several SQL statements that can be efficiently executed to replace the original complicated SQL statement for the insertion case.
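The four rules and the index-friendly variants of Section 4.3.3 can be tried end to end on the running example. The following is a minimal sketch, not the authors' implementation, and it rests on several assumptions: it uses Python's standard `sqlite3` module instead of MySQL (so `CONCAT(a, b)` is written `a || b`), the function and table names are hypothetical, the intermediate tables `ST` and `Y` are simply recreated on every call, and the final minimum-distance pruning is done with a Python dictionary in place of the DELETE postprocessing statement.

```python
import sqlite3


def build_prefix_table(conn, keywords):
    """Create PT(prefix, parent) with every prefix of every keyword."""
    prefixes = {""}
    for kw in keywords:
        for i in range(1, len(kw) + 1):
            prefixes.add(kw[:i])
    conn.execute("CREATE TABLE PT (prefix TEXT PRIMARY KEY, parent TEXT)")
    conn.executemany(
        "INSERT INTO PT VALUES (?, ?)",
        [(p, p[:-1] if p else None) for p in sorted(prefixes)],
    )
    # Index on the extra attribute, as suggested for the substitution case.
    conn.execute("CREATE INDEX pt_parent ON PT(parent)")


def initial_st(conn, tau):
    """Similar prefixes of the empty query: every prefix of length <= tau."""
    rows = conn.execute(
        "SELECT prefix, LENGTH(prefix) FROM PT WHERE LENGTH(prefix) <= ?", (tau,)
    )
    return dict(rows.fetchall())


def extend_st(conn, st, c, tau):
    """Derive the similar-prefix table of w + c from the table st of w."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS ST")
    cur.execute("CREATE TABLE ST (prefix TEXT PRIMARY KEY, ed INTEGER)")
    cur.executemany("INSERT INTO ST VALUES (?, ?)", list(st.items()))

    # Deletion: drop the new character from the query.
    candidates = cur.execute(
        "SELECT prefix, ed + 1 FROM ST WHERE ed < ?", (tau,)
    ).fetchall()
    # Match: the child reached through the new character.
    matched = cur.execute(
        "SELECT PT.prefix, ed FROM ST, PT WHERE PT.prefix = ST.prefix || ?", (c,)
    ).fetchall()
    candidates += matched
    # Substitution: any other child, found through the parent attribute.
    candidates += cur.execute(
        "SELECT PT.prefix, ed + 1 FROM ST, PT "
        "WHERE PT.parent = ST.prefix AND ed < ? AND PT.prefix != ST.prefix || ?",
        (tau, c),
    ).fetchall()
    # Insertion: Y_0 is the match result; Y_{i+1} holds the children of Y_i.
    level = matched
    while level:
        cur.execute("DROP TABLE IF EXISTS Y")
        cur.execute("CREATE TABLE Y (prefix TEXT, ed INTEGER)")
        cur.executemany("INSERT INTO Y VALUES (?, ?)", level)
        level = cur.execute(
            "SELECT PT.prefix, ed + 1 FROM Y, PT "
            "WHERE PT.parent = Y.prefix AND ed < ?",
            (tau,),
        ).fetchall()
        candidates += level

    # Keep only the smallest distance per prefix.
    new_st = {}
    for prefix, ed in candidates:
        if ed < new_st.get(prefix, tau + 1):
            new_st[prefix] = ed
    return new_st


conn = sqlite3.connect(":memory:")
build_prefix_table(conn, ["pvldb", "vldb"])
st_empty = initial_st(conn, tau=1)
st_v = extend_st(conn, st_empty, "v", tau=1)
st_vl = extend_st(conn, st_v, "l", tau=1)
```

With the prefix table of the running example and a threshold of 1, `st_v` and `st_vl` hold the similar prefixes of the queries `v` and `vl`. In a real deployment the intermediate tables would more likely be per-session temporary tables; recreating them here only keeps the sketch short.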
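Once the similar-prefix table of the current keyword is available, the answer query given at the start of Section 4.3 joins it with the prefix table, the inverted-index table, and the data table through the keyword-id ranges. The sketch below illustrates that join under stated assumptions: the three records, the helper names, and the hand-written similar-prefix rows are all hypothetical, SQLite stands in for MySQL, and DISTINCT is added (the paper's statement does not include it) because several similar prefixes can lead to the same record.

```python
import sqlite3


def build_indexes(conn, records):
    """Create T, IT and PT (with keyword-id ranges) from {rid: text}."""
    keywords = sorted({w for text in records.values() for w in text.split()})
    kid = {w: i + 1 for i, w in enumerate(keywords)}  # ids in alphabetical order
    conn.executescript("""
        CREATE TABLE T  (rid INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE IT (kid INTEGER, rid INTEGER);
        CREATE TABLE PT (prefix TEXT PRIMARY KEY, lkid INTEGER, ukid INTEGER);
        CREATE TABLE ST (prefix TEXT PRIMARY KEY, ed INTEGER);
    """)
    conn.executemany("INSERT INTO T VALUES (?, ?)", list(records.items()))
    postings = {(kid[w], rid) for rid, text in records.items() for w in text.split()}
    conn.executemany("INSERT INTO IT VALUES (?, ?)", sorted(postings))
    # Keywords sharing a prefix are contiguous in alphabetical order, so each
    # prefix maps to one range [lkid, ukid] of keyword ids.
    ranges = {}
    for w in keywords:
        for i in range(1, len(w) + 1):
            lo, hi = ranges.get(w[:i], (kid[w], kid[w]))
            ranges[w[:i]] = (min(lo, kid[w]), max(hi, kid[w]))
    conn.executemany(
        "INSERT INTO PT VALUES (?, ?, ?)",
        [(p, lo, hi) for p, (lo, hi) in ranges.items()],
    )


def answer(conn, similar_prefixes):
    """Return the records whose keywords fall in the range of a similar prefix."""
    conn.execute("DELETE FROM ST")
    conn.executemany("INSERT INTO ST VALUES (?, ?)", similar_prefixes)
    rows = conn.execute("""
        SELECT DISTINCT T.rid, T.title FROM ST, PT, IT, T
        WHERE ST.prefix = PT.prefix AND
              PT.ukid >= IT.kid AND PT.lkid <= IT.kid AND
              IT.rid = T.rid
        ORDER BY T.rid
    """)
    return rows.fetchall()


# Hypothetical toy data; not taken from the DBLP data set.
RECORDS = {1: "privacy in vldb", 2: "pvldb papers", 3: "sigmod keynote"}
conn = sqlite3.connect(":memory:")
build_indexes(conn, RECORDS)

# Similar prefixes of the query "vld" for a threshold of 1, written by hand.
fuzzy_hits = answer(conn, [("vl", 1), ("vld", 0), ("vldb", 1), ("pvld", 1)])
# Exact prefix search is the special case of a single row with distance 0.
exact_hits = answer(conn, [("sig", 0)])
```

The range predicate is what lets one prefix row reach all of its complete keywords without enumerating them, which is why the similar-prefix table can stay small.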

5 SUPPORTING MULTIKEYWORD QUERIES

In this section, we propose efficient techniques to support multikeyword queries.

5.1 Computing Answers from Scratch

Given a multikeyword query $Q$ with $m$ keywords $w_1, w_2, \ldots, w_m$, there are two ways to answer it from scratch. 1) Using the INTERSECT operator: a straightforward way is to first compute the records for each keyword using the previous methods, and then use the INTERSECT operator to join these records for different keywords to compute the answers. 2) Using full-text indexes: we first use full-text indexes (e.g., the CONTAINS command) to find records matching the first $m - 1$ complete keywords, and then use our methods to find records matching the last prefix keyword. Finally, we join the results. These two methods cannot use the precomputed results and may lead to low performance. To address this problem, we propose an incremental-computation method.

5.2 Word-Level Incremental Computation
We can use previously computed results to incrementally answer a query. Assuming a user has typed in a query $Q$ with keywords $w_1, w_2, \ldots, w_m$, we create a temporary table $C_Q$ to cache the record ids of query $Q$. If the user types in a new keyword $w_{m+1}$ and submits a new query $Q'$ with keywords $w_1, w_2, \ldots, w_m, w_{m+1}$, we use temporary table $C_Q$ to incrementally answer the new query.

Exact search. As an example, we focus on the method that uses the prefix table and inverted-index table. As $C_Q$ contains all results for query $Q$, we check whether the records in $C_Q$ contain keywords with the prefix $w_{m+1}$ of new query $Q'$. We issue the following SQL query to answer keyword query $Q'$ using $C_Q$:

SELECT T.* FROM PT, IT, C_Q, T
WHERE PT.prefix = "w_{m+1}" AND
PT.ukid >= IT.kid AND PT.lkid <= IT.kid AND
IT.rid = C_Q.rid AND C_Q.rid = T.rid.

For example, suppose a user has typed in a query $Q$ = "privacy sigmod" and we have created a temporary table $C_Q = \{r_3, r_6\}$. Then the user types in a new keyword "pub" and submits a new query $Q'$ = "privacy sigmod pub". We check whether records $r_3$ and $r_6$ contain a keyword with the prefix "pub". Using $C_Q$, we find that only $r_6$ contains a keyword "publishing" with the prefix "pub".

Fuzzy search. As an example, we consider the character-level incremental method. We first compute $ST_{w_{m+1}}$ using the character-level incremental method for the new keyword $w_{m+1}$, and then use $ST_{w_{m+1}}$ to answer the query. Based on the temporary table $C_Q$, we use the following SQL query to answer $Q'$:

SELECT T.* FROM ST_{w_{m+1}}, PT, IT, C_Q, T
WHERE ST_{w_{m+1}}.prefix = PT.prefix AND
PT.ukid >= IT.kid AND PT.lkid <= IT.kid AND
IT.rid = C_Q.rid AND C_Q.rid = T.rid.

If the user modifies the keyword $w_m$ of query $Q$ to $w'_m$ and submits a query with keywords $w_1, w_2, \ldots, w_{m-1}, w'_m$, we can use the cached result of query $w_1, w_2, \ldots, w_{m-1}$ to answer the new query using the above method. Similarly, if the user arbitrarily modifies the query, we can easily extend this method to answer the new query.

Fig. 5. Incrementally computing first-N answers.

6 SUPPORTING FIRST-N QUERIES
The previous methods focus on computing all the answers. As a user types in a query character by character, we usually give the user the first-N (any-N) results as the instant feedback. This section discusses how to compute the first-N results.

Exact first-N queries. For exact search, we can use the "LIMIT N" syntax in databases to return the first-N results. For example, MYSQL uses "LIMIT $n_1$, $n_2$" to return $n_2$ rows starting from the $n_1$th row. As an example, we focus on how to extend the method based on the inverted-index table and the prefix table (Section 3.2). Our techniques can be easily extended to other methods.

For a single-keyword query, we can use "LIMIT 0, N" to find the first-N answers. For example, assume a user types in a keyword query "sig". To compute the first-2 answers, we issue the following SQL:

SELECT dblp.* FROM P_dblp, I_dblp, dblp
WHERE P_dblp.prefix = "sig" AND
P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid AND
I_dblp.rid = dblp.rid
LIMIT 0, 2,

which returns records $r_3$ and $r_6$.

For multikeyword queries, if we use the INTERSECT operator, we can use the LIMIT operator to find the first-N answers. But it is not straightforward to extend the word-level incremental method to support first-N queries, since the cached results of a query $Q$ with keywords $w_1, w_2, \ldots, w_m$ have $N$ records, instead of all the answers. For a query $Q'$ with one more keyword $w_{m+1}$, we may not get $N$ answers for $Q'$ using the cached results $C_Q$, and need to continue to access records from the inverted-index table. To address this issue, we first use the incremental algorithms as discussed in Section 5 to answer $Q'$ using $C_Q$. Let $R(C_Q, Q')$ denote the results. If the temporary table $C_Q$ has fewer than $N$ records or $R(C_Q, Q')$ has $N$ records, $R(C_Q, Q')$ is exactly the answer set. Otherwise, we continue to access records from the inverted-index table. Fig. 5 shows how to incrementally find first-N results.

We progressively access the records that contain the first keyword $w_1$. Suppose we have accessed the records of $w_1$

from the 0th row to the $(\lambda - 1)$th row for answering query $Q$. Then for query $Q'$, we continue to access the records of $w_1$ from the $\lambda$th row. To get enough answers, we access $\gamma$ records for each SQL query, where $\gamma$ is a parameter depending on the keyword distribution (usually set to $m \times N$).

TABLE 6
Data Sets and Index Costs

Next we discuss how to assign $\lambda$ iteratively. Initially, $m = 1$, that is, the query $Q$ has only one keyword and $Q'$ has two keywords. When answering $Q$, we have visited $N$ records for $w_1$, and we need to continue to access $\gamma$ records starting from the $N$th record. Thus, we set $\lambda = N$. If we cannot get $N$ answers, we set $\lambda = \lambda + \gamma$ until we get $N$ results (or we have accessed all of the records).

For example, assume a user types in a query "privacy ic" character by character, and $N = 2$. When the user types in keyword "privacy", we issue the following SQL:

SELECT dblp.* FROM P_dblp, I_dblp, dblp
WHERE P_dblp.prefix = "privacy" AND
P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid AND
I_dblp.rid = dblp.rid
LIMIT 0, 2,

which returns records $r_1$ and $r_2$. We compute and cache $C_Q = \{r_1, r_2\}$. When the user types in another keyword "ic" and submits a query "privacy ic", we first use $C_Q$ to answer the query and get record $r_2$. As we want to compute first-2 results, we need to issue the following SQL:

SELECT dblp.* FROM P_dblp, I_dblp, dblp
WHERE P_dblp.prefix = "privacy" AND
P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid AND
I_dblp.rid = dblp.rid
LIMIT 2, 2 × 2
INTERSECT
SELECT dblp.* FROM P_dblp, I_dblp, dblp
WHERE P_dblp.prefix = "ic" AND
P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid AND
I_dblp.rid = dblp.rid,

which returns record $r_5$. Thus, we get first-2 results (records $r_2$ and $r_5$) and terminate the execution.

Fuzzy first-N queries. The above methods cannot be easily extended to support fuzzy search, as they cannot distinguish the results of exact search and fuzzy search. Generally, we need to first return the best results with smaller edit distances. To address this issue, we propose to progressively compute the results. As an example, we consider the character-level incremental method (Section 4.3).

For a single-keyword query $w$, we first get the results with edit distance 0. If we have gotten $N$ answers, we terminate the execution; otherwise, we progressively increase the edit-distance threshold and select the records with edit-distance thresholds $1, 2, \ldots, \tau$, until we get $N$ answers.

For example, suppose $\tau = 2$. Considering a keyword "vld", to get the first-3 answers, we first issue the following SQL:

SELECT dblp.* FROM S_dblp^vld, P_dblp, I_dblp, dblp
WHERE S_dblp^vld.ed = 0 AND S_dblp^vld.prefix = P_dblp.prefix
AND P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid
AND I_dblp.rid = dblp.rid
LIMIT 0, 3,

which returns records $r_4$ and $r_8$. As we only get two results, we increase the edit-distance threshold to 1, and issue a new SQL, which returns record $r_1$. Thus, we get first-3 results and terminate the execution. We do not need to consider the case that $\tau = 2$ as we have gotten the first-3 results.

7 SUPPORTING UPDATES EFFICIENTLY
We can use a trigger to support data updates. We consider insertions and deletions of records.

Insertion. Assume a record is inserted. We first assign it a new record ID. For each keyword in the record, we insert the keyword into the inverted-index table. For each prefix of the keyword, if the prefix is not in the prefix table, we add an entry for the prefix. For the keyword-range encoding of each prefix, we can reserve extra space for prefix ids to accommodate future insertions. We only need to do global reordering if the space reserved for insertions is consumed.

Deletion. Assume a record is deleted. For each keyword in the record, in the inverted-index table we use a bit to denote whether a record is deleted. Here we use the bit to mark the record to be deleted. We do not update the table until we need to rebuild the index. For the range encoding of each prefix, we can use the deleted prefix ids for future insertions.

The ranges of ids are assigned based on the inverse document frequency (idf) of keywords. We use a larger range for a keyword with a smaller idf. In most cases, we can use the reserved extra space for updates. But in the worst case, we need to rebuild the index. The problem of range selection and analysis is beyond the scope of this paper.

8 EXPERIMENTAL STUDY
We implemented the proposed methods on two real data sets. 1) DBLP: It included 1.2 million computer science publications.$^7$ 2) MEDLINE: It included 5 million biomedical articles.$^8$ Table 6 summarizes the data sets and index sizes. We see that the size of inverted-index table and prefix table is acceptable, compared with the data set size. As a keyword may have many deletion-based neighbors, the size of prefix-deletion table is rather large. The size of q-gram table is also larger than that of our method, since a substring

7. http://dblp.uni-trier.de/xml/.
8. http://www.ncbi.nlm.nih.gov/pubmed/.

has multiple overlapped q-grams. Note that the size of similar-prefix table is very small as it only stores similar prefixes of a keyword.

TABLE 7
Ten Example Keyword Queries

Fig. 6. Exact-search performance for answering single-keyword queries (varying keyword length).

Fig. 7. Exact-search performance of answering multikeyword queries (varying keyword numbers).

Fig. 8. Exact-search performance of computing first-N answers by varying different N values. (a) Single keywords: DBLP. (b) Multikeywords: MEDLINE.

We used 1,000 real queries for each data set from the logs of our deployed systems. We assumed the characters of a query were typed in one by one. Table 7 gives ten example queries.

We used a Windows 7 machine with an Intel Core 2 Quad processor (X5450 3.00 GHz and 4 GB memory). We used three databases, MYSQL, SQL Server 2005, and Oracle 11g. By default, we used MYSQL in the experiments. We will compare different databases in Section 8.3.

8.1 Exact Search
Single-keyword queries. We implemented three methods for single-keyword queries: 1) using UDF; 2) using the LIKE predicate; and 3) using the inverted-index table and the prefix table (called IPTables). We compared the performance of the three methods to compute the first-N answers. Unless otherwise specified, $N = 10$. Fig. 6 shows the results. We see that both the UDF-based method and the LIKE-based method had a low search performance as they needed to scan records. IPTables achieved a high performance by using indexes. As the keyword length increased, the performance of the first two methods decreased, since the keyword became more selective, and the two methods needed to scan more records in order to find the same number ($N$) of answers. As the keyword length increased, IPTables had a higher performance, since there were fewer complete keywords for the query and the query needed fewer join operations.

Multikeyword queries. We implemented six methods for multikeyword queries:
1. using UDF;
2. using the LIKE predicate;
3. using full-text indexes and UDF (called FI+UDF);
4. using full-text indexes and the LIKE predicate (called FI+LIKE);
5. using the inverted-index table and prefix table (IPTables);
6. using the word-level incremental method (called IPTables+).
Fig. 7 shows the results. We only show the results of four methods as the UDF-based method and the LIKE-based method achieved similar results and the two methods using full-text indexes got similar results.

We see that the LIKE-based method had the worst performance. The method using the full-text indexes achieved a better performance. For example, on the MEDLINE data set, the LIKE-based method took 5,000 ms to answer a query, and the latter method reduced the time to 150 ms. IPTables+ achieved the highest performance. It could answer a query within 10 ms for the DBLP data set and 30 ms for the MEDLINE data set, as it used an incremental method to find first-N answers and did not scan all records.

Varying the number of answers N. We compared the performance of the methods to compute first-N answers by varying the number of first results. Fig. 8 shows the experimental results. We can see that IPTables achieved the highest performance for single-keyword queries and IPTables+ outperformed other methods for multikeyword queries, for different $N$ values. For example, IPTables computed 100 answers for single-keyword queries within 2 ms on the DBLP data set and IPTables+ computed 100 answers for multikeyword queries within 60 ms on the MEDLINE data set. This difference shows the advantages of our index structures and our incremental algorithms.

8.2 Fuzzy Search
Single-keyword queries. We first evaluated the performance of different methods to compute similar keywords of single-keyword queries. We implemented four methods:
1. using UDF;
2. using the gram-based method (called Gram) described in [30];$^9$

9. We set $q = 2$ and used the techniques of count filtering, length filtering, and position filtering.

3. using the neighborhood-generation-based method (called NGB); and
4. using the character-level incremental algorithms (called Incre) to compute similar keywords for a given query keyword using the prefix table as discussed in Section 4.2.
Fig. 9 shows the results.

Fig. 9. Fuzzy-search performance of computing similar keywords for single-keyword queries by varying the query keyword length ($\tau = 2$).

Fig. 10. Fuzzy-search performance (overall) of computing first-N answers for multikeyword queries by varying the keyword number in a query ($\tau = 2$).

Fig. 11. Fuzzy-search performance (2 steps) of computing first-N answers for multikeyword queries by varying keyword numbers in a query ($\tau = 2$).

Fig. 12. Fuzzy-search performance of computing first-N answers by varying different N values.

Fig. 13. Comparison of different methods (SQL Server).

As the keyword length increased, the running time of Gram, UDF, and NGB increased while that of Incre decreased. The main reasons were the following. First, the UDF-based method needed more time for computing edit distances for longer strings. Second, long strings have many more i-deletion neighborhoods, and NGB needed longer time to find an i-deletion neighborhood of the query string from the deletion table. Third, there were more grams for longer strings and Gram needed longer time to process large numbers of grams. Besides the inverted-index table and prefix table, Gram and NGB maintained additional indexes. Fourth, Incre can incrementally compute the similar prefixes and longer strings have a smaller number of similar prefixes; thus its running time decreased.

Multikeyword queries. We evaluated the performance of different methods to compute first-N answers for multikeyword queries. Gram and the UDF-based methods were too slow to support search-as-you-type. We implemented two algorithms using NGB and Incre to find similar keywords on top of the prefix table, and then computed the answers based on the inverted-index table. For multikeyword queries, we also implemented their word-level incremental algorithms, called NGB+ and Incre+, respectively.

Fig. 10 shows the results. We see that the word-level incremental algorithms can improve the performance for multikeyword queries by using previously computed results to answer queries. For example, Incre+ achieved a very high performance; it could answer a query within 50 ms for the DBLP data set and 100 ms for the MEDLINE data set.

In addition, we evaluated the running time in two steps: 1) finding similar keywords (called NGB-SP and Incre-SP); 2) computing first-N answers (called NGB-R and Incre-R). Fig. 11 shows the results. We see that NGB needed more time for finding similar keywords and Incre needed nearly the same amount of time for the two steps. For example, on the MEDLINE data set, NGB needed 200 ms to compute similar prefixes and 50 ms to compute the answers for queries with six keywords. Incre reduced the time to 50 ms for computing similar prefixes.

Varying the number of returned results (N). We compared the performance of different algorithms to compute first-N answers by varying $N$. Fig. 12 shows the results. We can see that both Incre and NGB can efficiently compute the first-N answers for different $N$ values.

8.3 Comparisons of Different Approaches
We compared different methods to support search-as-you-type. The first method uses existing built-in functionalities (e.g., full-text indexes and the CONTAINS command) in Oracle and SQL Server. We used their indexes to maximize their performance. The second method builds a separate application layer on the DBMS using techniques in [24], namely StandAlone. For the extender-based method, we implemented the proposed techniques in [24] and added them as extenders of Oracle Cartridge in Oracle 11g and CLR in Microsoft SQL Server. The fourth method is based on SQL. We evaluated the scalability for both exact search and fuzzy search. We set $\tau = 2$ and $N = 100$. Figs. 13 and 14 show the results.

Fig. 14. Comparison of different methods (Oracle).

Fig. 15. Performance of updates (10,000 records).

We see that the StandAlone method achieved the highest performance as it used in-memory data index structures, while the SQL-based method (DB-SQL) had a lower but comparable performance. DB-SQL outperformed the built-in support of Oracle and SQL Server, since we can incrementally compute answers using effective index structures, which is very important for search-as-you-type. For exact search, all these methods achieved a high performance and scaled well as the data set increased. For example, for exact search, DB-SQL could answer a query within 2 ms for 100,000 records and 18 ms for one million records. For fuzzy search,$^{10}$ DB-SQL outperformed the built-in support in Oracle and SQL Server by 1-2 orders of magnitude. DB-SQL scaled well as data sizes increased, which reflects the superiority of our techniques. For example, the methods using built-in capabilities of Oracle and SQL Server took about 1,000 ms to answer a query, because they cannot incrementally answer a query, whereas DB-SQL could answer a query within 20 ms for one million records and 100 ms for five million records. Note that Oracle and Microsoft took fuzzy search as a black box and we do not know how they support fuzzy search. Here we took them as a baseline. The results suggested that practitioners need to consider both the performance and other system aspects to decide the best approach for search-as-you-type.

8.4 Data Updates
We tested the cost of updates on the DBLP data set. We first built indexes for 1 million records, and then inserted 10,000 records at each time. We compared the performance of the three methods on inserting 10,000 records. Fig. 15 shows the results. It took more than 40 seconds to reindex the data, while our incremental-indexing method only took 0.5 seconds.

sentences with a phrase of the query string. Bast et al. proposed HYB indexes to support search-as-you-type [4], [5], [6]. Ji et al. [24] extended autocompletion to support fuzzy, full-text instant search. Chaudhuri and Kaushik [14] studied how to find similar strings interactively as users type in a query string, and they did not study how to support multikeyword queries. Li et al. [34] studied search-as-you-type on a database with multiple tables by modeling relational data as graphs. Qin et al. [42] proposed to use SQL to answer traditional keyword search, taking the query keywords as complete keywords. Li et al. [32] proposed to suggest SQL queries based on keywords. Different from existing studies [24], we study how to use SQL to support search-as-you-type. We proposed to use the available resources inside a DBMS and develop effective pruning techniques to improve the performance.

Approximate string search and similarity join. There have been recent studies to support efficient approximate string search [9], [26], [3], [13], [11], [18], [30], [31], [27], [50], [19], [45], which, given a set of strings and a query string, finds all strings in the set that are similar to the query string. Many studies used gram-based index structures to support approximate string search (e.g., [30], [28], [18]). The experiments in [24] and [14] showed that these approaches are not as efficient as trie-based methods for fuzzy search. Similarity joins are extensively studied [17], [3], [7], [12], [43], [48], [49], [46], which, given two sets of strings, find all similar string pairs from the two sets. Gravano et al. [17] proposed to use DBMS capabilities to support fuzzy joins of strings. Their methodology (q-gram-based techniques) has a low performance to support search-as-you-type (see the experimental results in Section 8). Jestes et al. [23] proposed to use min-hash to improve performance. Chaudhuri et al. [10] studied data cleansing operators in Microsoft SQL Server.
Summary. 1) In order to achieve a high speed, we have to There are also some studies on estimating selectivity of
rely on index-based methods. 2) The approach using approximate string queries [20], [28], [29] and approximate
inverted-index tables and the prefix tables can support prefix, entity extraction [1], [9], [47].
Keyword search in data bases. There are many studies
fuzzy search, and achieve the best performance among all
on keyword search in data bases [22], [2], [8], [25], [36], [37],
these methods and outperform the built-in methods in SQL
[35], [44], [15], [16], [37], [21], [39], [41], and [33].
Server and Oracle. 3) Our SQL-based method can achieve a
Our work complements these earlier studies by investi-
high interactive speed and scale well. gating how to support search-as-you-type inside DBMS. To
our best knowledge, our work is the first study on
9 RELATED WORK supporting search-as-you-type inside a DBMS using SQL,
even supporting multikeyword queries and fuzzy search.
Autocompletion and search-as-you-type. An autocomple-
tion system can predict a word or phrase that a user may type
in next based on the partial string the user has already typed 10 CONCLUSION AND FUTURE WORK
[40]. Nandi and Jagadish studied phrase prediction, which
In this paper, we studied the problem of using SQL to
took the query string as a single keyword and computed all
support search-as-you-type in data bases. We focused on
10. In Oracle one can set a fuzzy score between 0 and 80 to do fuzzy the challenge of how to leverage existing DBMS function-
search, we used the default value 60. alities to meet the high-performance requirement to achieve
an interactive speed. To support prefix matching, we proposed solutions that use auxiliary tables as index structures and SQL queries to support search-as-you-type. We extended the techniques to the case of fuzzy queries, and proposed various techniques to improve query performance. We proposed incremental-computation techniques to answer multikeyword queries, and studied how to support first-N queries and incremental updates. Our experimental results on large, real data sets showed that the proposed techniques can enable DBMS systems to support search-as-you-type on large tables.

There are several open problems to support search-as-you-type using SQL. One is how to support ranking queries efficiently. Another one is how to support multiple tables.

ACKNOWLEDGMENTS
Guoliang Li and Jianhua Feng are partially supported by the National Natural Science Foundation of China under Grant No. 61003004 and 61272090, the National Grand Fundamental Research 973 Program of China under Grant No. 2011CB302206, and National S&T Major Project of China under Grant No. 2011ZX01042-001-002. Chen Li is partially supported by the NIH grant 1R21LM010143-01A1 and the NSF grant IIS-1030002. Chen Li declares financial interest in Bimaple Technology Inc., which is commercializing some of the techniques used in this publication.

REFERENCES
[1] S. Agrawal, K. Chakrabarti, S. Chaudhuri, and V. Ganti, "Scalable Ad-Hoc Entity Extraction from Text Collections," Proc. VLDB Endowment, vol. 1, no. 1, pp. 945-957, 2008.
[2] S. Agrawal, S. Chaudhuri, and G. Das, "DBXplorer: A System for Keyword-Based Search over Relational Databases," Proc. 18th Int'l Conf. Data Eng. (ICDE '02), pp. 5-16, 2002.
[3] A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-Similarity Joins," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 918-929, 2006.
[4] H. Bast, A. Chitea, F.M. Suchanek, and I. Weber, "ESTER: Efficient Search on Text, Entities, and Relations," Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '07), pp. 671-678, 2007.
[5] H. Bast and I. Weber, "Type Less, Find More: Fast Autocompletion Search with a Succinct Index," Proc. 29th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '06), pp. 364-371, 2006.
[6] H. Bast and I. Weber, "The Complete Search Engine: Interactive, Efficient, and Towards IR & DB Integration," Proc. Conf. Innovative Data Systems Research (CIDR), pp. 88-95, 2007.
[7] R.J. Bayardo, Y. Ma, and R. Srikant, "Scaling up all Pairs Similarity Search," Proc. 16th Int'l Conf. World Wide Web (WWW '07), pp. 131-140, 2007.
[8] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan, "Keyword Searching and Browsing in Databases Using Banks," Proc. 18th Int'l Conf. Data Eng. (ICDE '02), pp. 431-440, 2002.
[9] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin, "An Efficient Filter for Approximate Membership Checking," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '08), pp. 805-818, 2008.
[10] S. Chaudhuri, K. Ganjam, V. Ganti, R. Kapoor, V. Narasayya, and T. Vassilakis, "Data Cleaning in Microsoft SQL Server 2005," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), pp. 918-920, 2005.
[11] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '03), pp. 313-324, 2003.
[12] S. Chaudhuri, V. Ganti, and R. Kaushik, "A Primitive Operator for Similarity Joins in Data Cleaning," Proc. 22nd Int'l Conf. Data Eng. (ICDE '06), pp. 5-16, 2006.
[13] S. Chaudhuri, V. Ganti, and R. Motwani, "Robust Identification of Fuzzy Duplicates," Proc. 21st Int'l Conf. Data Eng. (ICDE), pp. 865-876, 2005.
[14] S. Chaudhuri and R. Kaushik, "Extending Autocompletion to Tolerate Errors," Proc. 35th ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 433-439, 2009.
[15] B.B. Dalvi, M. Kshirsagar, and S. Sudarshan, "Keyword Search on External Memory Data Graphs," Proc. VLDB Endowment, vol. 1, no. 1, pp. 1189-1204, 2008.
[16] B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin, "Finding Top-K Min-Cost Connected Trees in Databases," Proc. IEEE 23rd Int'l Conf. Data Eng. (ICDE '07), pp. 836-845, 2007.
[17] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate String Joins in a Database (Almost) for Free," Proc. 27th Int'l Conf. Very Large Data Bases (VLDB '01), pp. 491-500, 2001.
[18] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava, "Fast Indexes and Algorithms for Set Similarity Selection Queries," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE '08), pp. 267-276, 2008.
[19] M. Hadjieleftheriou, N. Koudas, and D. Srivastava, "Incremental Maintenance of Length Normalized Indexes for Approximate String Matching," Proc. 35th ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 429-440, 2009.
[20] M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava, "Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries," Proc. VLDB Endowment, vol. 1, no. 1, pp. 201-212, 2008.
[21] H. He, H. Wang, J. Yang, and P.S. Yu, "Blinks: Ranked Keyword Searches on Graphs," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '07), pp. 305-316, 2007.
[22] V. Hristidis and Y. Papakonstantinou, "Discover: Keyword Search in Relational Databases," Proc. 28th Int'l Conf. Very Large Data Bases (VLDB '02), pp. 670-681, 2002.
[23] J. Jestes, F. Li, Z. Yan, and K. Yi, "Probabilistic String Similarity Joins," Proc. Int'l Conf. Management of Data (SIGMOD '10), pp. 327-338, 2010.
[24] S. Ji, G. Li, C. Li, and J. Feng, "Efficient Interactive Fuzzy Keyword Search," Proc. 18th Int'l Conf. World Wide Web (WWW), pp. 371-380, 2009.
[25] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar, "Bidirectional Expansion for Keyword Search on Graph Databases," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 505-516, 2005.
[26] M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee, "N-Gram/2l: A Space and Time Efficient Two-Level N-Gram Inverted Index Structure," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 325-336, 2005.
[27] N. Koudas, C. Li, A.K.H. Tung, and R. Vernica, "Relaxing Join and Selection Queries," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 199-210, 2006.
[28] H. Lee, R.T. Ng, and K. Shim, "Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 195-206, 2007.
[29] H. Lee, R.T. Ng, and K. Shim, "Power-Law Based Estimation of Set Similarity Join Size," Proc. VLDB Endowment, vol. 2, no. 1, pp. 658-669, 2009.
[30] C. Li, J. Lu, and Y. Lu, "Efficient Merging and Filtering Algorithms for Approximate String Searches," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE '08), pp. 257-266, 2008.
[31] C. Li, B. Wang, and X. Yang, "VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 303-314, 2007.
[32] G. Li, J. Fan, H. Wu, J. Wang, and J. Feng, "Dbease: Making Databases User-Friendly and Easily Accessible," Proc. Conf. Innovative Data Systems Research (CIDR), pp. 45-56, 2011.
[33] G. Li, J. Feng, X. Zhou, and J. Wang, "Providing Built-in Keyword Search Capabilities in Rdbms," VLDB J., vol. 20, no. 1, pp. 1-19, 2011.
[34] G. Li, S. Ji, C. Li, and J. Feng, "Efficient Type-Ahead Search on Relational Data: A Tastier Approach," Proc. 35th ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 695-706, 2009.
[35] G. Li, B.C. Ooi, J. Feng, J. Wang, and L. Zhou, "EASE: An Effective 3-in-1 Keyword Search Method for Unstructured, Semi-Structured and Structured Data," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '08), pp. 903-914, 2008.
[36] F. Liu, C.T. Yu, W. Meng, and A. Chowdhury, "Effective Keyword Search in Relational Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '06), pp. 563-574, 2006.
[37] Y. Luo, X. Lin, W. Wang, and X. Zhou, "Spark: Top-K Keyword Query in Relational Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '07), pp. 115-126, 2007.
[38] R.B. Miller, "Response Time in Man-Computer Conversational Transactions," Proc. AFIPS '68: Fall Joint Computer Conf., Part I, pp. 267-277, 1968.
[39] S. Mitra, M. Winslett, W.W. Hsu, and K.C.-C. Chang, "Trustworthy Keyword Search for Compliance Storage," VLDB J., vol. 17, no. 2, pp. 225-242, 2008.
[40] A. Nandi and H.V. Jagadish, "Effective Phrase Prediction," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 219-230, 2007.
[41] L. Qin, J. Yu, and L. Chang, "Ten Thousand Sqls: Parallel Keyword Queries Computing," Proc. VLDB Endowment, vol. 3, no. 1, pp. 58-69, 2010.
[42] L. Qin, J.X. Yu, and L. Chang, "Keyword Search in Databases: The Power of Rdbms," Proc. 35th ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 681-694, 2009.
[43] S. Sarawagi and A. Kirpal, "Efficient Set Joins on Similarity Predicates," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '04), pp. 743-754, 2004.
[44] T. Tran, H. Wang, S. Rudolph, and P. Cimiano, "Top-K Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data," Proc. IEEE Int'l Conf. Data Eng. (ICDE '09), pp. 405-416, 2009.
[45] E. Ukkonen, "Finding Approximate Patterns in Strings," J. Algorithms, vol. 6, no. 1, pp. 132-137, 1985.
[46] J. Wang, G. Li, and J. Feng, "Trie-Join: Efficient Trie-Based String Similarity Joins with Edit-Distance Constraints," Proc. VLDB Endowment, vol. 3, no. 1, pp. 1219-1230, 2010.
[47] W. Wang, C. Xiao, X. Lin, and C. Zhang, "Efficient Approximate Entity Extraction with Edit Distance Constraints," Proc. 35th ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 759-770, 2009.
[48] C. Xiao, W. Wang, and X. Lin, "Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints," Proc. VLDB Endowment, vol. 1, no. 1, pp. 933-944, 2008.
[49] C. Xiao, W. Wang, X. Lin, and H. Shang, "Top-K Set Similarity Joins," Proc. IEEE Int'l Conf. Data Eng. (ICDE '09), pp. 916-927, 2009.
[50] C. Xiao, W. Wang, X. Lin, and J.X. Yu, "Efficient Similarity Joins for Near Duplicate Detection," Proc. 17th Int'l Conf. World Wide Web (WWW '08), 2008.

Guoliang Li received the PhD degree in computer science from Tsinghua University in 2009, and the bachelor's degree in computer science from Harbin Institute of Technology in 2004. He is an assistant professor in the Department of Computer Science, Tsinghua University, Beijing, China. His research interests include integrating databases and information retrieval, data cleaning, database usability, and data integration.

Jianhua Feng received the BS, MS, and PhD degrees in computer science and technology from Tsinghua University. He is currently working as a professor in the Department of Computer Science and Technology at Tsinghua University. His main research interests include databases, native XML databases, and keyword search over structured data. He is a member of the ACM and the IEEE, and a senior member of China Computer Federation (CCF).

Chen Li received the BS and MS degrees in computer science from Tsinghua University, China, in 1994 and 1996, respectively, and the PhD degree in computer science from Stanford University in 2001. He is an associate professor in the Department of Computer Science at the University of California, Irvine. He received a US National Science Foundation (NSF) CAREER Award in 2003 and a few other NSF grants and industry gifts. He was once a part-time visiting research scientist at Google. His research interests include the fields of data management and information search. He is the founder of Bimaple.com. He is a member of the IEEE.
