Supporting Search-As-You-Type
Using SQL in Databases
Guoliang Li, Jianhua Feng, Member, IEEE, and Chen Li, Member, IEEE
Abstract: A search-as-you-type system computes answers on-the-fly as a user types in a keyword query character by character. We
study how to support search-as-you-type on data residing in a relational DBMS. We focus on how to support this type of search using
the native database language, SQL. A main challenge is how to leverage existing database functionalities to meet the high-
performance requirement to achieve an interactive speed. We study how to use auxiliary indexes stored as tables to increase search
performance. We present solutions for both single-keyword queries and multikeyword queries, and develop novel techniques for fuzzy
search using SQL by allowing mismatches between query keywords and answers. We present techniques to answer first-N queries
and discuss how to support updates efficiently. Experiments on large, real data sets show that our techniques enable DBMS systems
on a commodity computer to support search-as-you-type on tables with millions of records.
1 INTRODUCTION
TABLE 1
Table dblp: A Sample Publication Table (about Privacy)
The scalability becomes even more unclear if we want to support two useful features in search-as-you-type, namely multikeyword search and fuzzy search. In multikeyword search, we allow a query string to have multiple keywords, and find records that match these keywords, even if the keywords appear at different places. For instance, we allow a user who types in a query "privacy mining rak" to find a publication by "Rakesh Agrawal" with a title including the keywords "privacy" and "mining", even though these keywords are at different places in the record. In fuzzy search, we want to allow minor mismatches between query keywords and answers. For instance, a partial query "aggraw" should find a record with a keyword "agrawal" despite the typo in the query. While these features can further improve user search experiences, supporting them makes it even more challenging to do search-as-you-type inside DBMS systems.

In this paper, we develop various techniques to address these challenges. In Section 3, we propose two types of methods to support search-as-you-type for single-keyword queries, based on whether they require additional index structures stored as auxiliary tables. We discuss the methods that use SQL to scan a table and verify each record by calling a user-defined function (UDF) or using the LIKE predicate. We study how to use auxiliary tables to increase performance.

In Section 4, we study how to support fuzzy search for single-keyword queries. We discuss a gram-based method and a UDF-based method. As the two methods have a low performance, we propose a new neighborhood-generation-based method, using the idea that two strings are similar only if they have common neighbors obtained by deleting characters. To further improve the performance, we propose to incrementally answer a query by using previously computed results and utilizing built-in indexes on key attributes.

In Section 5, we extend the techniques to support multikeyword queries. We develop a word-level incremental method to efficiently answer multikeyword queries. Notice that when deployed in a Web application, the incremental-computation algorithms do not need to maintain session information, since the results of earlier queries are stored inside the database and shared by future queries. We propose efficient techniques to progressively find the first-N answers in Section 6. We also discuss how to support updates efficiently in Section 7.

We have conducted a thorough experimental evaluation using large, real data sets in Section 8. We compare the advantages and limitations of different approaches for search-as-you-type. The results show that our SQL-based techniques enable DBMS systems running on a commodity computer to support search-as-you-type on tables with millions of records.

It is worth emphasizing that although our method shares an incremental-search idea with the earlier results in [24], developing new techniques in a DBMS environment is technically very challenging. A main challenge is how to utilize the limited expressive power of the SQL language (compared with other languages such as C++ and Java) to support efficient search. We study how to use the available resources inside a DBMS, such as the capabilities to build auxiliary tables, to improve query performance. An interesting observation is that despite the fact we need SQL queries with join operations, using carefully designed auxiliary tables, built-in indexes on key attributes, foreign-key constraints, and incremental algorithms using cached results, these SQL queries can be executed efficiently by the DBMS engine to achieve a high speed.

2 PRELIMINARIES
We first formulate the problem of search-as-you-type in DBMS (Section 2.1) and then discuss different ways to support search-as-you-type (Section 2.2).

2.1 Problem Formulation
Let $T$ be a relational table with attributes $A_1, A_2, \ldots, A_\ell$. Let $R = \{r_1, r_2, \ldots, r_n\}$ be the collection of records in $T$, and $r_i[A_j]$ denote the content of record $r_i$ in attribute $A_j$. Let $W$ be the set of tokenized keywords in $R$.

2.1.1 Search-as-You-Type for Single-keyword Queries
Exact Search: As a user types in a single partial (prefix) keyword $w$ character by character, search-as-you-type on-the-fly finds the records that contain keywords with a prefix $w$. We call this search paradigm prefix search. Without loss of generality, each tokenized keyword in the data set and queries is assumed to use lower case characters. For example, consider the data in Table 1, $A_1$ = title, $A_2$ = authors, $A_3$ = booktitle, and $A_4$ = year. $R = \{r_1, \ldots, r_{10}\}$. $r_3[\text{booktitle}]$ = "sigmod".
LI ET AL.: SUPPORTING SEARCH-AS-YOU-TYPE USING SQL IN DATABASES 463
$W = \{\text{privacy}, \text{sigmod}, \text{sigir}, \ldots\}$. If a user types in a query "sig", we return records $r_3$, $r_6$, and $r_9$. In particular, $r_3$ contains a keyword "sigmod" with a prefix "sig".

Fuzzy Search: As a user types in a single partial keyword $w$ character by character, fuzzy search on-the-fly finds records with keywords similar to the query keyword. In Table 1, assuming a user types in a query "corel", record $r_7$ is a relevant answer since it contains a keyword "correlation" with a prefix "correl" similar to the query keyword "corel".

We use edit distance to measure the similarity between strings. Formally, the edit distance between two strings $s_1$ and $s_2$, denoted by $ed(s_1, s_2)$, is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform $s_1$ to $s_2$. For example, $ed(\text{corelation}, \text{correlation}) = 1$ and $ed(\text{coralation}, \text{correlation}) = 2$. Given an edit-distance threshold $\tau$, we say a prefix $p$ of a keyword in $W$ is similar to the partial keyword $w$ if $ed(p, w) \le \tau$. We say a keyword $d$ in $W$ is similar to the partial keyword $w$ if $d$ has a prefix $p$ such that $ed(p, w) \le \tau$. Fuzzy search finds the records with keywords similar to the query keywords.

2.1.2 Search-as-You-Type for Multikeyword Queries
Exact Search: Given a multikeyword query $Q$ with $m$ keywords $w_1, w_2, \ldots, w_m$, as the user is completing the last keyword $w_m$, we treat $w_m$ as a partial keyword and other keywords as complete keywords (footnote 2). As a user types in query $Q$ character by character, search-as-you-type on-the-fly finds the records that contain the complete keywords and a keyword with a prefix $w_m$. For example, if a user types in a query "privacy sig", search-as-you-type returns records $r_3$, $r_6$, and $r_9$. In particular, $r_3$ contains the complete keyword "privacy" and a keyword "sigmod" with a prefix "sig".

Fuzzy Search: Fuzzy search on-the-fly finds the records that contain keywords similar to the complete keywords and a keyword with a prefix similar to partial keyword $w_m$. For instance, suppose edit-distance threshold $\tau = 1$. Assuming a user types in a query "privicy corel", fuzzy type-ahead search returns record $r_7$ since it contains a keyword "privacy" similar to the complete keyword "privicy" and a keyword "correlation" with a prefix "correl" similar to the partial keyword "corel".

2.2 Different Approaches for Search-as-You-Type
We discuss different possible methods to support search-as-you-type and give their advantages and limitations.

The first method is to use a separate application layer, which can achieve a very high performance as it can use various programming languages and complex data structures. However, it is isolated from the DBMS systems.

The second method is to use database extenders. However, this extension-based method is not safe for the query engine and could cause reliability and security problems in the database engine. This method depends on the API of the specific DBMS being used, and different DBMS systems have different APIs. Moreover, this method does not work if a DBMS has no such extender feature, e.g., MySQL.

The third method is to use SQL. The SQL-based method is more compatible since it uses standard SQL. Even if DBMS systems do not provide the search-as-you-type extension feature (indeed no DBMS systems provide such an extension), the SQL-based method can also be used in this case. Thus, the SQL-based method is more portable to a different platform than the first two methods.

In this paper, we focus on the SQL-based method and develop various techniques to achieve a high interactive speed.

3 EXACT SEARCH FOR SINGLE KEYWORD
This section proposes two types of methods to use SQL to support search-as-you-type for single-keyword queries. In Section 3.1, we discuss no-index methods. In Section 3.2, we build auxiliary tables as index structures to answer a query.

3.1 No-Index Methods
A straightforward way to support search-as-you-type is to issue an SQL query that scans each record and verifies whether the record is an answer to the query. There are two ways to do the checking: 1) Calling User-Defined Functions (UDFs). We can add functions into databases to verify whether a record contains the query keyword; and 2) Using the LIKE predicate. Databases provide a LIKE predicate to allow users to perform string matching. We can use the LIKE predicate to check whether a record contains the query keyword. This method may introduce false positives, e.g., keyword "publication" contains the query string "ic", but the keyword does not have the query string "ic" as a prefix. We can remove these false positives by calling UDFs. The two no-index methods need no additional space, but they may not scale since they need to scan all records in the table (Section 8 gives the results).

3.2 Index-Based Methods
In this section, we propose to build auxiliary tables as index structures to facilitate prefix search. Some databases such as Oracle and SQL Server already support prefix search, and we could use this feature to do prefix search. However, not all databases provide this feature. For this reason, we develop a new method that can be used in all databases. In addition, our experiments in Section 8.3 show that our method performs prefix search more efficiently.

Inverted-index table. Given a table $T$, we assign unique ids to the keywords in table $T$, following their alphabetical order. We create an inverted-index table $I_T$ with records in the form $\langle kid, rid \rangle$, where $kid$ is the id of a keyword and $rid$ is the id of a record that contains the keyword. Given a complete keyword, we can use the inverted-index table to find records with the keyword.

Prefix table. Given a table $T$, for all prefixes of keywords in the table, we build a prefix table $P_T$ with records in the form $\langle p, lkid, ukid \rangle$, where $p$ is a prefix of a keyword, $lkid$ is the smallest id of those keywords in the table $T$ having $p$ as a prefix, and $ukid$ is the largest id of those keywords having $p$ as a prefix. An interesting observation is that a complete word with $p$ as a prefix must have an ID in the keyword range $[lkid, ukid]$, and each complete word in the table $T$ with an ID in this keyword range must have a prefix $p$. Thus, given a prefix keyword $w$, we can use the prefix table to find the range of keywords with the prefix.

2. Our method can be easily extended to the case that every keyword is treated as a partial keyword.
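The two no-index checks of Section 3.1 can be illustrated with a small sketch. It assumes SQLite through Python's sqlite3 module; the table `pubs`, its three rows, and the UDF name `has_prefix_keyword` are hypothetical and chosen only for illustration. The LIKE predicate alone matches "publication" for the query "ic", and the UDF removes that false positive.

```python
import sqlite3

def has_prefix_keyword(text, prefix):
    """UDF body: 1 if some whitespace-separated keyword of text starts with prefix."""
    return int(any(tok.startswith(prefix) for tok in text.lower().split()))

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it (hypothetical UDF name).
conn.create_function("has_prefix_keyword", 2, has_prefix_keyword)
conn.execute("CREATE TABLE pubs (rid INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO pubs VALUES (?, ?)", [
    (1, "a publication about privacy"),   # contains "ic" only inside a keyword
    (2, "icde keynote on indexing"),      # has a keyword with prefix "ic"
    (3, "sigmod record"),                 # no match at all
])

query = "ic"
pattern = "%" + query + "%"

# LIKE-only scan: substring match, so it may return false positives.
like_hits = [r[0] for r in conn.execute(
    "SELECT rid FROM pubs WHERE title LIKE ? ORDER BY rid", (pattern,))]

# LIKE as a cheap filter, UDF as the exact prefix check.
verified = [r[0] for r in conn.execute(
    "SELECT rid FROM pubs WHERE title LIKE ? AND has_prefix_keyword(title, ?) "
    "ORDER BY rid", (pattern, query))]
```

Both statements still scan every row, which is the scalability concern the section raises.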
464 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 2, FEBRUARY 2013
TABLE 2
The Inverted-Index Table and Prefix Table
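As a rough illustration of how the two tables named in this caption fit together, the sketch below builds them in SQLite from Python. The three sample records, the table names `inverted_index` and `prefix_table`, and the helper `prefix_search` are assumptions made for this example rather than the paper's schema; only non-empty prefixes are stored.

```python
import sqlite3

# Hypothetical sample data: record id -> tokenized, lower-case text.
records = {1: "privacy sigmod", 2: "sigir search", 3: "data privacy"}

# Keyword ids follow alphabetical order, as Section 3.2 requires.
keywords = sorted({w for text in records.values() for w in text.split()})
kid = {w: i + 1 for i, w in enumerate(keywords)}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inverted_index (kid INTEGER, rid INTEGER)")
conn.execute("CREATE INDEX inverted_kid ON inverted_index(kid)")
conn.execute(
    "CREATE TABLE prefix_table (prefix TEXT PRIMARY KEY, lkid INTEGER, ukid INTEGER)")

conn.executemany(
    "INSERT INTO inverted_index VALUES (?, ?)",
    [(kid[w], rid) for rid, text in records.items() for w in set(text.split())])

# For every prefix, keep the smallest and largest id of keywords that share it.
ranges = {}
for w in keywords:
    for n in range(1, len(w) + 1):
        p = w[:n]
        lo, hi = ranges.get(p, (kid[w], kid[w]))
        ranges[p] = (min(lo, kid[w]), max(hi, kid[w]))
conn.executemany(
    "INSERT INTO prefix_table VALUES (?, ?, ?)",
    [(p, lo, hi) for p, (lo, hi) in ranges.items()])

def prefix_search(p):
    """Record ids containing a keyword with prefix p, via the keyword-id range."""
    rows = conn.execute(
        "SELECT DISTINCT i.rid FROM prefix_table AS pt "
        "JOIN inverted_index AS i ON i.kid BETWEEN pt.lkid AND pt.ukid "
        "WHERE pt.prefix = ? ORDER BY i.rid", (p,))
    return [r[0] for r in rows]
```

Because ids are alphabetical, all keywords sharing a prefix occupy one contiguous id range, so a single range join replaces a string comparison per keyword.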
TABLE 3
Neighborhood-Generation Table ($\tau = 1$)
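The rows of such a neighborhood-generation table can be derived mechanically from the i-deletion neighborhoods defined in Section 4.2.3. The following is a minimal sketch; the function names are illustrative, and the edit-distance routine is the textbook dynamic program used here only to verify candidates, not code from the paper.

```python
from itertools import combinations

def deletion_neighborhood(s, tau):
    """All strings obtained from s by deleting at most tau characters."""
    out = set()
    for i in range(min(tau, len(s)) + 1):
        for drop in combinations(range(len(s)), i):
            dropped = set(drop)
            out.add("".join(c for k, c in enumerate(s) if k not in dropped))
    return out

def edit_distance(a, b):
    """Classic dynamic program over insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def may_be_similar(s1, s2, tau):
    """Lemma 1 as a filter: a shared neighbor is necessary for ed(s1, s2) <= tau."""
    return bool(deletion_neighborhood(s1, tau) & deletion_neighborhood(s2, tau))
```

The filter is necessary but not sufficient: "ab" and "ba" share the neighbors "a" and "b" for a threshold of 1 although their edit distance is 2, which is why candidates still need verification.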
Fig. 2. Using the q-gram table and the neighborhood generation table to support fuzzy search.

substrings with length $q$. Let $G^q(s)$ denote the set (footnote 4) of its q-grams and $|G^q(s)|$ denote the size of $G^q(s)$. For example, for "pvldb" and "vldb", we have $G^2(\text{pvldb}) = \{pv, vl, ld, db\}$ and $G^2(\text{vldb}) = \{vl, ld, db\}$. Strings $s_1$ and $s_2$ have an edit distance within threshold $\tau$ only if

$|G^q(s_1) \cap G^q(s_2)| \ge \max(|s_1|, |s_2|) + 1 - q - \tau \cdot q$ [17],

where $|s_1|$ and $|s_2|$ are the lengths of string $s_1$ and $s_2$, respectively. This technique is called count filtering.

To find similar prefixes of a query keyword $w$, besides maintaining the inverted-index table and the prefix table, we need to create a q-gram table $G_T$ with records in the form $\langle p, qgram \rangle$, where $p$ is a prefix in the prefix table and $qgram$ is a q-gram of $p$. Given a partial keyword $w$, we first find the prefixes in $G_T$ with no fewer than $|w| + 1 - q - \tau \cdot q$ grams in $G^q(w)$. We use the following SQL with the "GROUP BY" command to get the candidates of $w$'s similar prefixes:

SELECT PT.prefix FROM GT, PT
WHERE GT.prefix = PT.prefix AND GT.qgram IN $G^q(w)$
GROUP BY GT.prefix
HAVING COUNT(GT.qgram) >= $|w| + 1 - q - \tau \cdot q$.

As this method may involve false positives, we have to use UDFs to verify the candidates to get the similar prefixes of $w$. Fig. 2 illustrates how to use the gram-based method to answer a query. We can further improve the query performance by using additional filtering techniques, e.g., length filtering or position filtering [17].

It could be expensive to use "GROUP BY" in databases, and the q-gram-based method is inefficient, especially for large q-gram tables. Moreover, this method is rather inefficient for short query keywords [46], as short keywords have smaller numbers of q-grams and the method has low pruning power.

4.2.3 Neighborhood-Generation-Based Method
Ukkonen proposed a neighborhood-generation-based method to support approximate string search [45]. We extend this method to use SQL to support fuzzy search-as-you-type. Given a keyword $w$, the substrings of $w$ by deleting $i$ characters are called "i-deletion neighborhoods" of $w$. Let $D_i(w)$ denote the set of i-deletion neighborhoods of $w$ and $\hat{D}_\tau(w) = \bigcup_{i=0}^{\tau} D_i(w)$. For example, given a string "pvldb", $D_0(\text{pvldb}) = \{pvldb\}$, and $D_1(\text{pvldb}) = \{vldb, pldb, pvdb, pvlb, pvld\}$. Suppose $\tau = 1$, $\hat{D}_\tau(\text{pvldb}) = \{pvldb, vldb, pldb, pvdb, pvlb, pvld\}$. Moreover, there is a good property that given two strings $s_1$ and $s_2$, if $ed(s_1, s_2) \le \tau$, $\hat{D}_\tau(s_1) \cap \hat{D}_\tau(s_2) \ne \emptyset$, as formalized in Lemma 1.

Lemma 1. Given two strings $s_1$, $s_2$, if $ed(s_1, s_2) \le \tau$, $\hat{D}_\tau(s_1) \cap \hat{D}_\tau(s_2) \ne \emptyset$.

Proof. We can use deletion operations to replace the substitution and insertion operations as follows: suppose we can transform $s_1$ to $s_2$ with $d$ deletions, $i$ insertions, and $r$ substitutions, such that $ed(s_1, s_2) = d + i + r \le \tau$. We can transform $s_1$ and $s_2$ to the same string by doing $d + r$ deletions on $s_1$ and $i + r$ deletions on $s_2$, respectively. Thus, $\hat{D}_\tau(s_1) \cap \hat{D}_\tau(s_2) \ne \emptyset$. $\square$

We use this property as a filter to find similar prefixes of the query keyword $w$. We can prune all the prefixes if they have no common i-deletion neighborhoods with $w$. To this end, for prefixes in the prefix table $P_T$, we create a deletion-based neighborhood-generation table $D_T$ with records in the form $\langle p, \text{i-deletion}, i \rangle$, where $p$ is a prefix in the prefix table $P_T$ and i-deletion is an i-deletion neighborhood of $p$ ($i \le \tau$). For example, Table 3 gives a neighborhood-generation table. Given a query keyword $w$, we first find the similar prefixes in $D_T$ which have i-deletion neighborhoods in $\hat{D}_\tau(w)$. Then we use UDFs to verify the candidates to get similar prefixes. Formally, we use the following SQL to generate the candidates of $w$'s similar prefixes:

SELECT DISTINCT prefix FROM DT
WHERE DT.i-deletion IN $\hat{D}_\tau(w)$.

Assuming a user types in a keyword "pvldb", we find the prefixes in $D_T$ that have i-deletion neighborhoods in {pvldb, vldb, pldb, pvdb, pvlb, pvld}. Here we find "vldb" similar to "pvldb" with edit distance 1.

This method is efficient for short strings. However, it is inefficient for long strings, especially for large edit-distance thresholds, because given a string with length $n$, it has $\binom{n}{i}$ i-deletion neighborhoods and in total $O(\min(n^\tau, 2^n))$ neighborhoods. It needs large space to store these neighborhoods.

As the three methods have some limitations, we propose an incremental algorithm which uses previously computed results to answer subsequent queries in Section 4.3.

4.3 Incrementally Computing Similar Prefixes
The previous methods have the following limitations. First, they need to find similar prefixes of a keyword from scratch. Second, they may need to call UDFs many times. In this section, we propose a character-level incremental method to find similar prefixes of a keyword as a user types character by character. Chaudhuri and Kaushik [14] and Ji et al. [24]

4. We need to use multisets to accommodate duplicated grams.
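The count-filtering bound of the gram-based method, with q-grams kept as multisets as footnote 4 requires, can be sketched as follows. The function names are illustrative assumptions; `collections.Counter` stands in for the multiset, and its `&` operator takes the minimum count per gram.

```python
from collections import Counter

def qgrams(s, q):
    """Multiset of the length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def count_filter_bound(s1, s2, q, tau):
    """Minimum number of shared q-grams if ed(s1, s2) <= tau."""
    return max(len(s1), len(s2)) + 1 - q - tau * q

def passes_count_filter(s1, s2, q, tau):
    """True if the pair survives count filtering and must still be verified."""
    shared = sum((qgrams(s1, q) & qgrams(s2, q)).values())
    return shared >= count_filter_bound(s1, s2, q, tau)
```

For "pvldb" and "vldb" with q = 2 and a threshold of 1, the bound is 2 and the strings share 3 grams, so the pair is kept. For two-character strings the bound drops to -1 and every pair passes, which matches the remark that the method has low pruning power for short keywords.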
TABLE 4
Similar-Prefix Table $S_{dblp}^{vldb}$ ($\tau = 1$)
TABLE 5
Similar-Prefix Tables of a Prefix Table with Strings in $\{\epsilon, v, vl, vld, vldb, p, pv, pvl, pvld, pvldb\}$
($\tau = 1$. For ease of presentation, we add two columns from and op, where column from denotes where the record is derived from, and column op denotes operations: m: match, d: deletion, i: insertion, s: substitution.)
Suppose the claim is true for $w_x = w$ with $x$ characters. We want to prove this claim is also true for a new query string $w_{x+1}$, where $w_{x+1} = w' = w_x c_{x+1}$.

Suppose $v$ is a similar prefix of $w_{x+1}$. If $v = \epsilon$, then by definition $ed(v, w_{x+1}) = ed(\epsilon, w_{x+1}) = x + 1 \le \tau$, and $x \le \tau - 1 < \tau$. Thus, $ed(v, w_x) = ed(\epsilon, w_x) = x \le \tau$, and $v$ is also a similar prefix of $w_x$. When we consider this node $v$, we add the pair $\langle v, x + 1 \rangle$ (i.e., $\langle v, ed(v, w_{x+1}) \rangle$) into $S_T^{w_{x+1}}$.

Now consider the case where the similar prefix $v$ of $w_{x+1}$ is not the empty string. Let $v = n_{y+1} = n_y d$, i.e., it has $y + 1$ characters, and is concatenated from a string $n_y$ and a character $d$. By definition, $ed(n_{y+1}, w_{x+1}) \le \tau$. We want to prove that $\langle n_{y+1}, ed(n_{y+1}, w_{x+1}) \rangle$ will be added to $S_T^{w_{x+1}}$.

Based on the idea in the classic dynamic-programming algorithm, we consider the following four cases in the minimum number of edit operations to transform $n_{y+1}$ to $w_{x+1}$.

Case 1: Deleting the last character $c_{x+1}$ from $w_{x+1}$, and transforming $n_{y+1}$ to $w_x$. Since $ed(n_{y+1}, w_{x+1}) = ed(n_{y+1}, w_x) + 1 \le \tau$, we have $ed(n_{y+1}, w_x) \le \tau - 1 < \tau$. Thus, $n_{y+1}$ is a similar prefix of $w_x$. Based on the induction assumption, $\langle n_{y+1}, ed(n_{y+1}, w_x) \rangle$ must be in $S_T^{w_x}$. From the node $n_{y+1}$, our method considers the deletion case when it considers the node $n_y$, and adds $\langle n_{y+1}, ed(n_{y+1}, w_x) + 1 \rangle$ to $S_T^{w_{x+1}}$, which is exactly $\langle n_{y+1}, ed(n_{y+1}, w_{x+1}) \rangle$.

Case 2: Substituting the character $d$ of $n_{y+1}$ for the last character $c_{x+1}$ of $w_{x+1}$. Since $ed(n_{y+1}, w_{x+1}) = ed(n_y, w_x) + 1 \le \tau$, we have $ed(n_y, w_x) \le \tau - 1 < \tau$. Thus, $n_y$ is a similar prefix of $w_x$. Based on the induction assumption, $\langle n_y, ed(n_y, w_x) \rangle$ must be in $S_T^{w_x}$. From node $n_y$, our method considers the substitution case when it considers this child node ($n_{y+1}$) of the node $n_y$, and adds $\langle n_{y+1}, ed(n_y, w_x) + 1 \rangle$ to $S_T^{w_{x+1}}$, which is exactly $\langle n_{y+1}, ed(n_{y+1}, w_{x+1}) \rangle$.

Case 3: The last character $c_{x+1}$ of $w_{x+1}$ matching the character $d$ of $n_{y+1}$. Since $ed(n_{y+1}, w_{x+1}) = ed(n_y, w_x) \le \tau$, $n_y$ is a similar prefix of $w_x$. Based on the induction assumption, $\langle n_y, ed(n_y, w_x) \rangle$ must be in $S_T^{w_x}$. From node $n_y$, our method considers the match case when it considers this child node ($n_{y+1}$) of the node $n_y$, and adds $\langle n_{y+1}, ed(n_y, w_x) \rangle$ to $S_T^{w_{x+1}}$, which is exactly $\langle n_{y+1}, ed(n_{y+1}, w_{x+1}) \rangle$.

Case 4: Transforming $n_y$ to $w_{x+1}$ and inserting character $d$ of $n_{y+1}$. For each transformation from $n_y$ to $w_{x+1}$, we consider the last character $c_{x+1}$ of $w_{x+1}$. First, we can show that this transformation cannot delete the character $c_{x+1}$, since otherwise we can combine this deletion of $c_{x+1}$ and the insertion of $d$ into one substitution, yielding another transformation with a smaller number of edit operations, contradicting the minimality of edit distance. Thus, we can just consider two possible operations on the character $c_{x+1}$ in this transformation. 1) Matching $c_{x+1}$ for the character of an ancestor $n_a$ of $n_{y+1}$: in this case, since $ed(n_{y+1}, w_{x+1}) = ed(n_{a-1}, w_x) + y - a + 1 \le \tau$, we have $ed(n_{a-1}, w_x) \le \tau$, and $n_{a-1}$ is a similar prefix of $w_x$. Based on the induction assumption, $\langle n_{a-1}, ed(n_{a-1}, w_x) \rangle$ must be in $S_T^{w_x}$. From node $n_{a-1}$, the algorithm considers the matching case, and adds $\langle n_{y+1}, ed(n_{a-1}, w_x) + y - a + 1 \rangle$ to $S_T^{w_{x+1}}$, which is $\langle n_{y+1}, ed(n_{y+1}, w_{x+1}) \rangle$. 2) Substituting $c_{x+1}$ for the character of an ancestor $n_a$ of $n_{y+1}$: in this case, instead of substituting $c_{x+1}$ for the character of $n_a$ and inserting the character $d$, we can insert the character of $n_a$ and substitute $c_{x+1}$ for the last character $d$. Then we get another transformation with the same number of edit operations. (Characters $c_{x+1}$ and $d$ cannot be the same, since otherwise the new transformation could have fewer edit operations, contradicting the minimality of edit distance.) We use the same argument as in Case 2 to show that our method adds $\langle n_{y+1}, ed(n_{y+1}, w_{x+1}) \rangle$ to $S_T^{w_{x+1}}$.

In summary, for all cases the algorithm adds $\langle n_{y+1}, ed(n_{y+1}, w_{x+1}) \rangle$ to $S_T^{w_{x+1}}$.

2) Then we prove the soundness. By definition, a transformation distance of two strings in each added tuple by the algorithm is no less than their edit distance. That is, $ed(n, w') \le \tau$. Thus, $n$ must be a similar prefix of $w'$. $\square$

4.3.3 Improving Performance Using Indexes
As we can create indexes on the attribute prefix of the prefix table $P_T$ and the similar-prefix table $S_T^w$, the SQL statements for the deletion and match cases can be efficiently executed using the indexes. However, for the SQL of the substitution case, the SQL contains a statement SUBSTR(PT.prefix, 1, LENGTH(PT.prefix) - 1) = $S_T^w$.prefix. Although we can create an index on the attribute prefix of the similar-prefix table $S_T^w$, if there is no index to support the predicate SUBSTR(PT.prefix, 1, LENGTH(PT.prefix) - 1), it is rather expensive to execute the SQL. To improve the performance, we can alter table $P_T$ by adding an attribute parent = SUBSTR(PT.prefix, 1, LENGTH(PT.prefix) - 1), and create a table $P_T \langle prefix, lkid, ukid, parent \rangle$. Using this table, we propose an alternative method and use the following SQL for the substitution case:

SELECT PT.prefix, ed+1 AS ed FROM $S_T^w$, PT
WHERE PT.parent = $S_T^w$.prefix AND ed < $\tau$ AND
PT.prefix != CONCAT($S_T^w$.prefix, $c_{x+1}$).

We can create an index on attribute parent of prefix table $P_T$ to increase search performance.

Similarly, it is inefficient to execute the SQL for the insertion case as it contains a complicated statement

SUBSTR(PT.prefix, 1, LENGTH($S_T^w$.prefix) + 1) = CONCAT($S_T^w$.prefix, $c_{x+1}$).

Next we discuss how to improve the SQL using indexes. Let $Y_0$ be the similar-prefix table of the results of the SQL for the match case. For insertions, we need to find the similar strings with prefixes in $Y_0$. Let $Y_{i+1}$ be the similar-prefix table composed of prefixes by appending one more character to those prefixes in $Y_i$ for $0 \le i \le \tau - 1$. Obviously $\bigcup_{i=1}^{\tau} Y_i$ is exactly the result of the SQL for the insertion case. Note that $Y_0$ can be efficiently computed as we can use the SQL for the match case to generate it. Iteratively, we compute $Y_{i+1}$ based on $Y_i$ using the following SQL:

SELECT PT.prefix, ed+1 AS ed FROM $Y_i$, PT
WHERE PT.parent = $Y_i$.prefix AND ed < $\tau$.

We can create indexes on the parent attribute of the prefix table $P_T$ and the prefix attribute of $Y_i$ to improve the performance. In our running example, $Y_0 = \{\langle v, 0 \rangle, \langle pv, 1 \rangle\}$. As only $\langle v, 0 \rangle \in Y_0$ satisfies the SQL condition, this SQL returns $\langle vl, 1 \rangle$. Thus, we can use several SQL statements that can be efficiently executed to replace the original complicated SQL statement for the insertion case.
TABLE 6
Data Sets and Index Costs

from the 0th row to the $(\delta - 1)$th row for answering query $Q$. Then for query $Q'$, we continue to access the records of $w_1$ from the $\delta$th row. To get enough answers, we access $\gamma$ records for each SQL query, where $\gamma$ is a parameter depending on the keyword distribution (usually set to $m \times N$).

Next we discuss how to assign $\delta$ iteratively. Initially, for $m = 1$, that is, the query $Q$ has only one keyword and $Q'$ has two keywords. When answering $Q$, we have visited $N$ records for $w_1$, and we need to continue to access records starting from the $N$th record. Thus, we set $\delta = N$. If we cannot get $N$ answers, we set $\delta = \delta + \gamma$ until we get $N$ results (or we have accessed all of the records).

For example, assume a user types in a query "privacy ic" character by character, and $N = 2$. When the user types in keyword "privacy", we issue the following SQL:

SELECT dblp.* FROM P_dblp, I_dblp, dblp
WHERE P_dblp.prefix = "privacy" AND
P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid AND
I_dblp.rid = dblp.rid
LIMIT 0, 2,

which returns records $r_1$ and $r_2$. We compute and cache $C_Q = \{r_1, r_2\}$. When the user types in another keyword "ic" and submits a query "privacy ic", we first use $C_Q$ to answer the query and get record $r_2$. As we want to compute first-2 results, we need to issue the following SQL:

SELECT dblp.* FROM P_dblp, I_dblp, dblp
WHERE P_dblp.prefix = "privacy" AND
P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid AND
I_dblp.rid = dblp.rid
LIMIT 2, 2 × 2
INTERSECT
SELECT dblp.* FROM P_dblp, I_dblp, dblp
WHERE P_dblp.prefix = "ic" AND
P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid AND
I_dblp.rid = dblp.rid,

which returns record $r_5$. Thus, we get first-2 results (records $r_2$ and $r_5$) and terminate the execution.

Fuzzy first-N queries. The above methods cannot be easily extended to support fuzzy search, as they cannot distinguish the results of exact search and fuzzy search. Generally, we need to first return the best results with smaller edit distances. To address this issue, we propose to progressively compute the results. As an example, we consider the character-level incremental method (Section 4.3).

For a single-keyword query $w$, we first get the results with edit distance 0. If we have gotten $N$ answers, we terminate the execution; otherwise, we progressively increase the edit-distance threshold and select the records with edit-distance thresholds $1, 2, \ldots, \tau$, until we get $N$ answers.

For example, suppose $\tau = 2$. Considering a keyword "vld", to get the first-3 answers, we first issue the following SQL:

SELECT dblp.* FROM S_dblp^vld, P_dblp, I_dblp, dblp
WHERE S_dblp^vld.ed = 0 AND S_dblp^vld.prefix = P_dblp.prefix
AND P_dblp.ukid >= I_dblp.kid AND P_dblp.lkid <= I_dblp.kid
AND I_dblp.rid = dblp.rid
LIMIT 0, 3,

which returns records $r_4$ and $r_8$. As we only get two results, we increase the edit-distance threshold to 1, and issue a new SQL, which returns record $r_1$. Thus, we get first-3 results and terminate the execution. We do not need to consider the case that $\tau = 2$ as we have gotten the first-3 results.

7 SUPPORTING UPDATES EFFICIENTLY
We can use a trigger to support data updates. We consider insertions and deletions of records.

Insertion. Assume a record is inserted. We first assign it a new record ID. For each keyword in the record, we insert the keyword into the inverted-index table. For each prefix of the keyword, if the prefix is not in the prefix table, we add an entry for the prefix. For the keyword-range encoding of each prefix, we can reserve extra space for prefix ids to accommodate future insertions. We only need to do global reordering if the space reserved for insertions is used up.

Deletion. Assume a record is deleted. For each keyword in the record, in the inverted-index table we use a bit to denote whether a record is deleted. Here we use the bit to mark the record to be deleted. We do not update the table until we need to rebuild the index. For the range encoding of each prefix, we can use the deleted prefix ids for future insertions.

The ranges of ids are assigned based on the inverse document frequency (idf) of keywords. We use a larger range for a keyword with a smaller idf. In most cases, we can use the reserved extra space for updates. But in the worst case, we need to rebuild the index. The problem of the range selection and analysis is beyond the scope of this paper.

8 EXPERIMENTAL STUDY
We implemented the proposed methods on two real data sets. 1) "DBLP": It included 1.2 million computer science publications (footnote 7). 2) "MEDLINE": It included 5 million biomedical articles (footnote 8). Table 6 summarizes the data sets and index sizes. We see that the size of inverted-index table and prefix table is acceptable, compared with the data set size. As a keyword may have many deletion-based neighbors, the size of prefix-deletion table is rather large. The size of q-gram table is also larger than that of our method, since a substring

7. http://dblp.uni-trier.de/xml/.
8. http://www.ncbi.nlm.nih.gov/pubmed/.
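The progressive strategy for fuzzy first-N queries (one query per edit distance, stopping as soon as N answers are found) can be outlined as below. This is a sketch under stated assumptions: `fetch` is a hypothetical callback standing in for the SQL statement issued at a given edit distance, and `by_distance` replays the record ids of the "vld" example; its entry for distance 2 is invented purely to show that it is never requested.

```python
def first_n_fuzzy(fetch, n, tau):
    """Collect up to n answers, trying edit distances 0, 1, ..., tau in order."""
    answers = []
    for ed in range(tau + 1):
        # Ask only for as many records as are still missing.
        for rid in fetch(ed, n - len(answers)):
            if rid not in answers:
                answers.append(rid)
        if len(answers) >= n:
            break          # larger edit distances are never queried
    return answers

# Stand-in for the database: record ids per exact edit distance (hypothetical for 2).
by_distance = {0: ["r4", "r8"], 1: ["r1"], 2: ["r2", "r3"]}
issued = []                # edit distances for which a query was issued

def fetch(ed, limit):
    issued.append(ed)
    return by_distance.get(ed, [])[:limit]

top3 = first_n_fuzzy(fetch, 3, 2)
```

Answers come back ordered by edit distance, so exact matches are always listed before fuzzy ones.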
TABLE 7
Ten Example Keyword Queries
an interactive speed. To support prefix matching, we proposed solutions that use auxiliary tables as index structures and SQL queries to support search-as-you-type. We extended the techniques to the case of fuzzy queries, and proposed various techniques to improve query performance. We proposed incremental-computation techniques to answer multikeyword queries, and studied how to support first-N queries and incremental updates. Our experimental results on large, real data sets showed that the proposed techniques can enable DBMS systems to support search-as-you-type on large tables.

There are several open problems to support search-as-you-type using SQL. One is how to support ranking queries efficiently. Another one is how to support multiple tables.

ACKNOWLEDGMENTS
Guoliang Li and Jianhua Feng are partially supported by the National Natural Science Foundation of China under Grant No. 61003004 and 61272090, the National Grand Fundamental Research 973 Program of China under Grant No. 2011CB302206, and National S&T Major Project of China under Grant No. 2011ZX01042-001-002. Chen Li is partially supported by the NIH grant 1R21LM010143-01A1 and the NSF grant IIS-1030002. Chen Li declares financial interest in Bimaple Technology Inc., which is commercializing some of the techniques used in this publication.

REFERENCES
[1] S. Agrawal, K. Chakrabarti, S. Chaudhuri, and V. Ganti, "Scalable Ad-Hoc Entity Extraction from Text Collections," Proc. VLDB Endowment, vol. 1, no. 1, pp. 945-957, 2008.
[2] S. Agrawal, S. Chaudhuri, and G. Das, "DBXplorer: A System for Keyword-Based Search over Relational Data Bases," Proc. 18th Int'l Conf. Data Eng. (ICDE '02), pp. 5-16, 2002.
[3] A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-Similarity Joins," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 918-929, 2006.
[4] H. Bast, A. Chitea, F.M. Suchanek, and I. Weber, "ESTER: Efficient Search on Text, Entities, and Relations," Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '07), pp. 671-678, 2007.
[5] H. Bast and I. Weber, "Type Less, Find More: Fast Autocompletion Search with a Succinct Index," Proc. 29th Ann. Int'l ACM
[12] S. Chaudhuri, V. Ganti, and R. Kaushik, "A Primitive Operator for Similarity Joins in Data Cleaning," Proc. 22nd Int'l Conf. Data Eng. (ICDE '06), pp. 5-16, 2006.
[13] S. Chaudhuri, V. Ganti, and R. Motwani, "Robust Identification of Fuzzy Duplicates," Proc. 21st Int'l Conf. Data Eng. (ICDE), pp. 865-876, 2005.
[14] S. Chaudhuri and R. Kaushik, "Extending Autocompletion to Tolerate Errors," Proc. 35th ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 433-439, 2009.
[15] B.B. Dalvi, M. Kshirsagar, and S. Sudarshan, "Keyword Search on External Memory Data Graphs," Proc. VLDB Endowment, vol. 1, no. 1, pp. 1189-1204, 2008.
[16] B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin, "Finding Top-K Min-Cost Connected Trees in Data Bases," Proc. IEEE 23rd Int'l Conf. Data Eng. (ICDE '07), pp. 836-845, 2007.
[17] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate String Joins in a Data Base (Almost) for Free," Proc. 27th Int'l Conf. Very Large Data Bases (VLDB '01), pp. 491-500, 2001.
[18] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava, "Fast Indexes and Algorithms for Set Similarity Selection Queries," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE '08), pp. 267-276, 2008.
[19] M. Hadjieleftheriou, N. Koudas, and D. Srivastava, "Incremental Maintenance of Length Normalized Indexes for Approximate String Matching," Proc. 35th ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 429-440, 2009.
[20] M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava, "Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries," Proc. VLDB Endowment, vol. 1, no. 1, pp. 201-212, 2008.
[21] H. He, H. Wang, J. Yang, and P.S. Yu, "Blinks: Ranked Keyword Searches on Graphs," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '07), pp. 305-316, 2007.
[22] V. Hristidis and Y. Papakonstantinou, "Discover: Keyword Search in Relational Data Bases," Proc. 28th Int'l Conf. Very Large Data Bases (VLDB '02), pp. 670-681, 2002.
[23] J. Jestes, F. Li, Z. Yan, and K. Yi, "Probabilistic String Similarity Joins," Proc. Int'l Conf. Management of Data (SIGMOD '10), pp. 327-338, 2010.
[24] S. Ji, G. Li, C. Li, and J. Feng, "Efficient Interactive Fuzzy Keyword Search," Proc. 18th Int'l Conf. World Wide Web (WWW), pp. 371-380, 2009.
[25] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar, "Bidirectional Expansion for Keyword Search on Graph Data Bases," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 505-516, 2005.
[26] M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee, "N-Gram/2l: A Space and Time Efficient Two-Level N-Gram Inverted Index Structure," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 325-336, 2005.
[27] N. Koudas, C. Li, A.K.H. Tung, and R. Vernica, "Relaxing Join and Selection Queries," Proc. 32nd Int'l Conf. Very Large Data Bases
SIGIR Conf. Research and Development in Information Retrieval (VLDB 06), pp. 199-210, 2006.
(SIGIR 06), pp. 364-371, 2006.
[28] H. Lee, R.T. Ng, and K. Shim, Extending Q-Grams to Estimate
[6] H. Bast and I. Weber, The Complete Search Engine: Interactive,
Selectivity of String Matching with Low Edit Distance, Proc. 33rd
Efficient, and Towards IR & DB Integration, Proc. Conf. Innovative
Intl Conf. Very Large Data Bases (VLDB 07), pp. 195-206, 2007.
Data Systems Research (CIDR), pp. 88-95, 2007.
[7] R.J. Bayardo, Y. Ma, and R. Srikant, Scaling up all Pairs Similarity [29] H. Lee, R.T. Ng, and K. Shim, Power-Law Based Estimation of
Search, Proc. 16th Intl Conf. World Wide Web (WWW 07), pp. 131- Set Similarity Join Size, Proc. VLDB Endowment, vol. 2, no. 1,
140, 2007. pp. 658-669, 2009.
[8] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. [30] C. Li, J. Lu, and Y. Lu, Efficient Merging and Filtering
Sudarshan, Keyword Searching and Browsing in Data Bases Algorithms for Approximate String Searches, Proc. IEEE 24th
Using Banks, Proc. 18th Intl Conf. Data Eng. (ICDE 02), pp. 431- Intl Conf. Data Eng. (ICDE 08), pp. 257-266, 2008.
440, 2002. [31] C. Li, B. Wang, and X. Yang, VGRAM: Improving Performance of
[9] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin, An Efficient Approximate Queries on String Collections Using Variable-
Filter for Approximate Membership Checking, Proc. ACM Length Grams, Proc. 33rd Intl Conf. Very Large Data Bases (VLDB
SIGMOD Intl Conf. Management of Data (SIGMOD 08), pp. 805- 07), pp. 303-314, 2007.
818, 2008. [32] G. Li, J. Fan, H. Wu, J. Wang, and J. Feng, Dbease: Making Data
[10] S. Chaudhuri, K. Ganjam, V. Ganti, R. Kapoor, V. Narasayya, and Bases User-Friendly and Easily Accessible, Proc. Conf. Innovative
T. Vassilakis, Data Cleaning in Microsoft SQL Server 2005, Proc. Data Systems Research (CIDR), pp. 45-56, 2011.
ACM SIGMOD Intl Conf. Management of Data (SIGMOD 05), [33] G. Li, J. Feng, X. Zhou, and J. Wang, Providing Built-in Keyword
pp. 918-920, 2005. Search Capabilities in Rdbms, VLDB J., vol. 20, no. 1, pp. 1-19,
[11] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, Robust and 2011.
Efficient Fuzzy Match for Online Data Cleaning, Proc. ACM [34] G. Li, S. Ji, C. Li, and J. Feng, Efficient Type-Ahead Search on
SIGMOD Intl Conf. Management of Data (SIGMOD 03), pp. 313- Relational Data: A Tastier Approach, Proc. 35th ACM SIGMOD
324, 2003. Intl Conf. Management of Data (SIGMOD 09), pp. 695-706, 2009.
[35] G. Li, B.C. Ooi, J. Feng, J. Wang, and L. Zhou, "EASE: An Effective 3-in-1 Keyword Search Method for Unstructured, Semi-Structured and Structured Data," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '08), pp. 903-914, 2008.
[36] F. Liu, C.T. Yu, W. Meng, and A. Chowdhury, "Effective Keyword Search in Relational Data Bases," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '06), pp. 563-574, 2006.
[37] Y. Luo, X. Lin, W. Wang, and X. Zhou, "Spark: Top-K Keyword Query in Relational Data Bases," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '07), pp. 115-126, 2007.
[38] R.B. Miller, "Response Time in Man-Computer Conversational Transactions," Proc. AFIPS '68: Fall Joint Computer Conf., Part I, pp. 267-277, 1968.
[39] S. Mitra, M. Winslett, W.W. Hsu, and K.C.-C. Chang, "Trustworthy Keyword Search for Compliance Storage," VLDB J. (Int'l J. Very Large Data Bases), vol. 17, no. 2, pp. 225-242, 2008.
[40] A. Nandi and H.V. Jagadish, "Effective Phrase Prediction," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 219-230, 2007.
[41] L. Qin, J. Yu, and L. Chang, "Ten Thousand Sqls: Parallel Keyword Queries Computing," Proc. VLDB Endowment, vol. 3, no. 1, pp. 58-69, 2010.
[42] L. Qin, J.X. Yu, and L. Chang, "Keyword Search in Data Bases: The Power of Rdbms," Proc. 35th ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 681-694, 2009.
[43] S. Sarawagi and A. Kirpal, "Efficient Set Joins on Similarity Predicates," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '04), pp. 743-754, 2004.
[44] T. Tran, H. Wang, S. Rudolph, and P. Cimiano, "Top-K Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data," Proc. IEEE Int'l Conf. Data Eng. (ICDE '09), pp. 405-416, 2009.
[45] E. Ukkonen, "Finding Approximate Patterns in Strings," J. Algorithms, vol. 6, no. 1, pp. 132-137, 1985.
[46] J. Wang, G. Li, and J. Feng, "Trie-Join: Efficient Trie-Based String Similarity Joins with Edit-Distance Constraints," Proc. VLDB Endowment, vol. 3, no. 1, pp. 1219-1230, 2010.
[47] W. Wang, C. Xiao, X. Lin, and C. Zhang, "Efficient Approximate Entity Extraction with Edit Distance Constraints," Proc. 35th ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 759-770, 2009.
[48] C. Xiao, W. Wang, and X. Lin, "Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints," Proc. VLDB Endowment, vol. 1, no. 1, pp. 933-944, 2008.
[49] C. Xiao, W. Wang, X. Lin, and H. Shang, "Top-K Set Similarity Joins," Proc. IEEE Int'l Conf. Data Eng. (ICDE '09), pp. 916-927, 2009.
[50] C. Xiao, W. Wang, X. Lin, and J.X. Yu, "Efficient Similarity Joins for Near Duplicate Detection," Proc. 17th Int'l Conf. World Wide Web (WWW '08), 2008.

Guoliang Li received the PhD degree in computer science from Tsinghua University in 2009, and the bachelor's degree in computer science from Harbin Institute of Technology in 2004. He is an assistant professor in the Department of Computer Science, Tsinghua University, Beijing, China. His research interests include integrating data bases and information retrieval, data cleaning, data base usability, and data integration.

Jianhua Feng received the BS, MS, and PhD degrees in computer science and technology from Tsinghua University. He is currently working as a professor in the Department of Computer Science and Technology at Tsinghua University. His main research interests include data bases, native XML data bases, and keyword search over structured data. He is a member of the ACM and the IEEE, and a senior member of China Computer Federation (CCF).

Chen Li received the BS and MS degrees in computer science from Tsinghua University, China, in 1994 and 1996, respectively, and the PhD degree in computer science from Stanford University in 2001. He is an associate professor in the Department of Computer Science at the University of California, Irvine. He received a US National Science Foundation (NSF) CAREER Award in 2003 and a few other NSF grants and industry gifts. He was once a part-time visiting research scientist at Google. His research interests include the fields of data management and information search. He is the founder of Bimaple.com. He is a member of the IEEE.