
Similarity Metrics for SQL Query Clustering


Gokhan Kul, Student Member, IEEE, Duc Thanh Anh Luong, Ting Xie, Varun Chandola,
Oliver Kennedy, Member, IEEE, and Shambhu Upadhyaya, Senior Member, IEEE

The authors are with the Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260.
E-mail: {gokhanku, ducthanh, tingxie, chandola, okennedy, shambhu}@buffalo.edu.
Manuscript received 25 Sept. 2017; revised 15 Mar. 2018; accepted 19 Apr. 2018. Date of publication 30 Apr. 2018; date of current version 5 Nov. 2018. (Corresponding author: Gokhan Kul.) Recommended for acceptance by J. Levandoski. Digital Object Identifier no. 10.1109/TKDE.2018.2831214

Abstract—Database access logs are the starting point for many forms of database administration, from database performance tuning, to security auditing, to benchmark design, and many more. Unfortunately, query logs are also large and unwieldy, and it can be difficult for an analyst to extract broad patterns from the set of queries found therein. Clustering is a natural first step towards understanding the massive query logs. However, many clustering methods rely on the notion of pairwise similarity, which is challenging to compute for SQL queries, especially when the underlying data and database schema are unavailable. We investigate the problem of computing similarity between queries, relying only on the query structure. We conduct a rigorous evaluation of three query similarity heuristics proposed in the literature, applied to query clustering on multiple query log datasets representing different types of query workloads. To improve the accuracy of the three heuristics, we propose a generic feature engineering strategy, using classical query rewrites to standardize query structure. The proposed strategy results in a significant improvement in the performance of all three similarity heuristics.

Index Terms—Clustering, query logs, similarity metric, summarization

1 INTRODUCTION

DATABASE access logs are used in a wide variety of settings, including evaluating database performance tuning [1], benchmark development [2], database auditing [3], and compliance validation [4]. Also, many user-centric systems utilize query logs to help users by providing recommendations and personalizing the user experience [5], [6], [7], [8], [9], [10]. As the basic unit of interaction between a database and its users, the sequence of SQL queries that a user issues effectively models the user's behavior. Queries that are similar in structure imply that they might be issued to perform similar duties. Examining a history of the queries serviced by a database can help database administrators with tuning, or help security analysts to assess the possibility and/or extent of a security breach. However, logs from enterprise database systems are far too large to examine manually. As one example, a recent study of queries at a major US bank for a period of 19 hours found nearly 17 million SQL queries and over 60 million stored procedure execution events [3]. Even excluding stored procedures, it is unrealistic to expect any human to manually inspect all 17 million queries per day.

Let us consider an analyst (call her Jane) faced with the task of analyzing such a query log. Jane might first attempt to identify some interesting query fragments and their aggregate properties. For example, she might count how many times each table is accessed or the frequency with which different classes of join predicates occur. Unfortunately, such fine-grained properties lack the context to clearly communicate how the data is being used, combined, and/or manipulated. To see the complete context, Jane must look at entire queries. Naively, she might look at all distinct query strings in the log. Even comparatively small production databases typically log hundreds or thousands of distinct query strings, making direct inspection impractical. Furthermore, it is unclear that distinct query strings are the right level of granularity in the first place. Consider the following example queries:

1) SELECT name FROM user
   WHERE rank IN ('adm', 'sup')
2) SELECT SUM(balance) FROM accounts
3) SELECT name FROM user WHERE rank = 'adm'
   UNION SELECT name FROM user WHERE rank = 'sup'
4) SELECT SUM(accounts.balance) FROM accounts
   NATURAL JOIN user WHERE user.rank = 'adm'

Queries 1 and 2 are clearly distinct: Their structures differ, they reference different datasets, and they perform different computations. The remaining queries, however, are less so. Query 3 is logically equivalent to Query 1: Both compute identical results. Conversely, although Query 4 is distinct from Queries 1 and 2, it is conceptually similar to both and shares many structural features with each.

The exact definition of similarity may depend on Jane's exact task, the content of the log, the database schema, database records, and numerous other details, some of which may not be available to Jane immediately when she first begins analyzing the log. It is also likely that some of this information, like the precise contents of the database or even the database schema, may not even be available to Jane for reasons of privacy or security. As a result, this type of log analysis can quickly become a tedious, time-consuming process [11]. An earlier work of Aligon et al. [12] attempted to address this problem for OLAP operations by performing query log analysis and exploration. Within the scope of this article, we focus on analysis of SQL queries instead of OLAP queries. In particular, we lay the groundwork for a more automated approach to SQL query log exploration based on hierarchical clustering. Given a hierarchical clustering of the SQL query log, Jane can manually adjust how aggressively the log is summarized. She can select an appropriate level of granularity without needing to specify a priori exactly what constitutes a similar query.

The primary focus of this article is to study the suitability of three existing query distance metrics [13], [14], [15] to be used with hierarchical algorithms for clustering query logs. All of these metrics operate on the query structure and do not rely on the availability of underlying data or schema, thus making them applicable in a wide variety of practical settings. We evaluate the three metrics on two types of data: human-authored and machine-generated. Thus, using an appropriate similarity metric, one can cluster the queries to obtain a meaningful clustering of the query log.

For our evaluation, we use three evaluation data sets:

i) a large set of student-authored queries released by IIT Bombay [16],
ii) a smaller set of student queries gathered at the University at Buffalo, and released as part of this publication, and
iii) SQL logs that capture all activities on 11 Android phones for a period of one month [2].

Student-written queries are appealing, as queries are already labeled by their ground-truth clusterings: for each question, the student is attempting to accomplish one specific stated task. Conversely, machine-generated queries on smartphones present a conceptually easier challenge, as they produce more rigid, structured queries. The three similarity metrics are evaluated on these data sets using three standard clustering evaluation statistics: Silhouette Coefficient, BetaCV, and Dunn Index [17].

None of the similarity metrics perform as well as desired, so we propose and evaluate a pre-processing step to create more regular, uniform query representations by leveraging query equivalence rules and data partitioning operations. These rules are commonly utilized by database management systems when parsing and evaluating SQL queries. This process significantly improves the quality of all three distance metrics. We also investigate and identify sources of errors in the clustering process. Experimental results show that our regularization pre-processing technique consistently improves clustering for different query comparison schemes from the literature.

Concretely, the specific contributions of this article are:

(1) a survey of existing SQL query similarity metrics,
(2) an evaluation of these metrics on multiple query logs, and
(3) the application of query standardization techniques to improve query clustering accuracy.

This article is organized as follows. We start by performing a literature survey on log clustering and SQL query similarity in Section 2. We describe a feature engineering technique called regularization in Section 3. In Section 4, we explain our query workloads and propose a strategy for evaluating the quality of query similarity metrics. The evaluation is presented in Section 5. We discuss our experiment results, findings, and ideas to further build upon the surveyed techniques in Section 6, and in Section 7, we explain how this work can be beneficial by giving real-life examples. Finally, we conclude by identifying the steps needed to deploy query log clustering into practice using the techniques evaluated in this article in Section 8.
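To make the intended workflow concrete, the sketch below shows how a hierarchical clustering could be driven by any of the pairwise metrics discussed in this article. It is a minimal illustration in Python (the study itself used a Java implementation); query_distance is a placeholder for one of the surveyed metrics, and the choice of average linkage is an assumption on our part, not a prescription.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_queries(queries, query_distance, num_clusters):
    # Build the symmetric pairwise distance matrix from the chosen metric.
    n = len(queries)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = query_distance(queries[i], queries[j])
    # linkage expects the condensed (upper-triangular) form of the matrix.
    tree = linkage(squareform(dist), method="average")
    # Cutting the dendrogram at num_clusters mimics Jane choosing a granularity.
    return fcluster(tree, t=num_clusters, criterion="maxclust")

Because the dendrogram is built once, Jane can re-cut it at different values of num_clusters without recomputing the distance matrix.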
2 BACKGROUND

Analyzing query logs mostly relies on the structure of queries [18], although the motivations differ; some methods prefer using the log as a resource to collect information to build user profiles, and others utilize structural similarity to perform tasks like query recommendation [6], [7], [10], performance optimization [13], session identification [14] and workload analysis [15]. A summary of these methods is given in Table 1.

TABLE 1
SQL Query Similarity Literature Review

Paper title | Motivation | Features | Feature Structure | Distance Function | Similarity Ratio
Agrawal et al. (2006) [21] | Q. reply importance | Schema, rules | Vector | Cosine similarity | No
Giacometti et al. (2009) [6] | Q. recommendation | Difference pairs | Set | Difference query | No
Yang et al. (2009) [7] | Q. recommendation | Selection/join, projection | Graph | Jaccard coefficient on the graph edges | No
Stefanidis et al. (2009) [8] | Data recommendation | Inner product of two queries | Vector | - | No
Khoussainova et al. (2010) [9] | Q. recommendation | Popularity of each query object | Graph | - | No
Chatzopoulou et al. (2011) [10] | Q. recommendation | Syntactic element frequency | Vector | Jaccard coefficient and cosine similarity | No
Aouiche et al. (2006) [13] | View selection | Selection/join, group-by | Vector | Hamming distance | Yes
Aligon et al. (2014) [14] | Session similarity | Selection/join, projection, group-by | 3 sets | Jaccard coefficient | Yes
Makiyama et al. (2016) [15] | Workload analysis | Term frequency of projection, selection/join, from, group-by and order-by | Vector | Cosine similarity | Yes

There are also other possible approaches, like data-centric query comparison [19], and utilizing the access areas of user queries by inspecting the data partition the query is interested in [20] from the WHERE condition. However, these approaches are out of our scope, since we are interested in comparing and improving methods based on structural similarity; we assume that we do not have access to the data or to statistical information about the database.
Agrawal et al. [21] aim to rank the tuples returned by an SQL query based on the context. They create a ruleset for contexts and evaluate the results of queries that belong to a context according to the ruleset. They capture context and query as feature vectors and capture similarity through cosine distance between the vectors.

Chatzopoulou et al. [10] aim to assist non-expert users of scientific databases by tracking their querying behavior and generating personalized query recommendations. They deconstruct an SQL query into a bag of fragments. Each distinct fragment is a feature, with a weight assigned to it indicating its importance. Each feature has two types of importance: (1) within the query and (2) for the overall workload. Similarity is defined upon common vector-based measures such as cosine similarity. A summarization/user profile for this approach is just a sum over all single-query feature vectors that belong to the workload.

Yang et al. [7], on the other hand, build a graph from the query log by connecting associations of table attributes from the input and output of queries, which are then used to compute the likelihood of an attribute appearing in a query with a similarity function like the Jaccard coefficient. Their aim is again to assist users in writing SQL queries by analyzing query logs. Giacometti et al. [6], similarly, aim to make recommendations based on the discoveries made in previous sessions, so that users spend less time investigating similar information. They introduce difference pairs in order to measure the relevance of the previous discoveries. Difference pairs are essentially the result columns that are not included in the other returned results; hence the method depends on having access to the data. Stefanidis et al. [8] take a different approach, and instead of recommending candidate queries, they recommend tuples that may be of interest to the user. By doing so, the users may decide to change the selection criteria of their queries in order to include these results.

Sapia [5] creates a model that learns query templates to prefetch data in OLAP systems based on the user's past activity. SnipSuggest [9], on the other hand, is a context-aware SQL-autocomplete system that helps database users to write SQL queries by suggesting SQL snippets. In particular, it assigns a probability score to each subtree of a query based on the subtree's frequency in a query log. These probabilities are used to discover the most likely subtree that a user is attempting to construct, at interactive speeds.

Although these methods [6], [7], [8], [9], [10], [21] utilize query similarity one way or another to achieve their purpose, they don't directly offer a way to compare query similarity. We aim to summarize the log, and the most practical way to describe a query log is to group similar queries together so that we can provide summaries of these groups to the users. For this purpose, we need to be able to measure the pairwise similarity between queries, hence we need a metric that can do so. As shown in Table 1, this condition is only satisfied by [13], [14], [15].

Aouiche et al. [13] is the first work we encountered that proposes a pairwise similarity metric between two SQL queries, although it is not the aim of their work. They aim to optimize view selection in warehouses based on the queries posed to the system. They consider the selection, join and group-by items in the query to create vectors and use Hamming distance to measure how similar two queries are. While creating the vector, it does not matter whether an item appears more than once or where the item is. They cluster similar queries that create a workload on the system and base their view creation strategy on the clustering result.

Aligon et al. [14] study various approaches to defining a similarity function to compare OLAP sessions. They focus on comparing session similarity while also performing a survey on query similarity metrics. They identify selection and join items as the most relevant components in a query, followed by the group-by set. Inspired by these findings, they propose their own query similarity metric, which considers projection, group-by, and selection-join items for queries issued on OLAP datacubes. OLAP datacubes are multidimensional models, and they have hierarchy levels for the same attributes. Aligon et al. [14] measure the distance between the attributes on different hierarchy levels, and compute the set similarity for the projection, group-by, and selection-join sets individually when comparing two queries. In our experiments, since we do not consider the hierarchy levels in an OLAP system but focus on databases, we consider all queries to be on the same level in the schema to adjust the formulas presented in the paper. Namely, we compute the set similarity of the projection, group-by, and selection-join sets of two queries with the Jaccard coefficient. Also, Aligon et al. [14] provide the flexibility to adjust the weights of the three feature sets based on domain needs. We explore how the clustering quality is affected by various weightings in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2831214.

Makiyama et al. [15] approach query log analysis with the goal of analyzing a system's workload, and they provide a set of experiments on the Sloan Digital Sky Survey (SDSS) dataset. They extract the terms in the selection, join, projection, from, group-by and order-by items separately and record their appearance frequency. They create a feature vector using the frequency of these terms, which they use to calculate the pairwise similarity of queries with cosine similarity. Instead of clustering, they perform the workload analysis with Self-Organizing Maps (SOM).

To further illustrate how the three structural metrics [13], [14], [15] work, we show the feature representations for the following query for each method in Table 2.

SELECT u.username, u.yearenrolled
FROM user u, accounts a
WHERE u.id = a.userid
AND a.balance > 1000
AND u.id > 20050001
GROUP BY u.yearenrolled
ORDER BY u.yearenrolled

In the next section, we propose a generalized feature engineering scheme for query comparison methods to improve the clustering quality. Our work evaluates the performance of the three methods [13], [14], [15] that directly describe a pairwise similarity metric in Section 4, due to the lack of performance evaluation for the query similarity metrics in the given studies. We also show that our feature engineering scheme improves the clustering quality with both statistical and empirical methods.
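As a concrete illustration of the Aligon-style comparison adopted above, the following sketch computes the mean Jaccard coefficient over the projection, group-by, and selection-join sets of two queries. The equal weighting and the convention that two empty sets count as identical are our simplifying assumptions, and feature extraction from the parsed query is assumed to have already happened, as in Table 2.

def jaccard(a, b):
    if not a and not b:
        return 1.0  # assumption: two empty feature sets count as identical
    return len(a & b) / len(a | b)

def aligon_similarity(q1, q2):
    # Mean Jaccard coefficient over the three feature sets (equal weights).
    parts = ("projection", "group_by", "selection_join")
    return sum(jaccard(q1[p], q2[p]) for p in parts) / len(parts)

# The example query above, and a variant that drops the GROUP BY clause.
q_a = {"projection": {"u.username", "u.yearenrolled"},
       "selection_join": {"u.id", "a.userid", "a.balance"},
       "group_by": {"u.yearenrolled"}}
q_b = dict(q_a, group_by=set())
print(aligon_similarity(q_a, q_b))  # (1 + 0 + 1) / 3 = 0.666...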
TABLE 2
Representation of Three Similarity Metrics

Paper title | Extracted Feature Vector
Aouiche et al. (2006) [13] | {'u.id', 'a.userid', 'a.balance', 'u.yearenrolled'}
Aligon et al. (2014) [14] | projection: {'u.username', 'u.yearenrolled'}; selection-join: {'u.id', 'a.userid', 'a.balance'}; group-by: {'u.yearenrolled'}
Makiyama et al. (2016) [15] | {'SELECT_u.username' → 1, 'SELECT_u.yearenrolled' → 1, 'FROM_user' → 1, 'FROM_accounts' → 1, 'WHERE_u.id' → 2, 'WHERE_a.userid' → 1, 'WHERE_a.balance' → 1, 'GROUPBY_u.yearenrolled' → 1, 'ORDERBY_u.yearenrolled' → 1}

TABLE 3
Syntactic Desugaring

Before | After
b {>, ≥} a | a {<, ≤} b
x BETWEEN (a, b) | a ≤ x AND x ≤ b
x IN (a, b, ...) | x = a OR x = b OR ...
isnull(x, y) | CASE WHEN x IS NULL THEN y END
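The Makiyama representation in Table 2 can likewise be compared directly. The sketch below stores the clause-tagged term frequencies as a sparse dictionary and applies cosine similarity; for the Aouiche representation, one would instead binarize the vector and use Hamming distance. This is an illustrative rendering, not the original implementation.

from math import sqrt

def cosine(v1, v2):
    dot = sum(c * v2.get(term, 0) for term, c in v1.items())
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Feature vector of the example query, transcribed from Table 2.
q = {"SELECT_u.username": 1, "SELECT_u.yearenrolled": 1,
     "FROM_user": 1, "FROM_accounts": 1,
     "WHERE_u.id": 2, "WHERE_a.userid": 1, "WHERE_a.balance": 1,
     "GROUPBY_u.yearenrolled": 1, "ORDERBY_u.yearenrolled": 1}
print(cosine(q, q))  # 1.0: a query is maximally similar to itself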

3 FEATURE ENGINEERING

The grammar of SQL is declarative. By design, users can write queries in the way they feel most comfortable, letting well-established equivalence rules dictate a final evaluation strategy. As a result, many syntactically distinct queries may still be semantically equivalent. Recall example queries 1 and 3, paraphrased here:

1) SELECT name FROM user
   WHERE rank = 'a' OR rank = 's'
3) SELECT name FROM user WHERE rank = 'a'
   UNION SELECT name FROM user WHERE rank = 's'

Though syntactically distinct, these queries produce identical results for any input. Unfortunately, similarity of results is not practical to implement: general query equivalence is NP-complete [22] for SQL92 and earlier, while SQL99 and later versions of SQL are Turing-complete, due to the introduction of recursive queries.

However, we can still significantly improve clustering quality by standardizing certain SQL features into a more regular form, with techniques such as canonicalizing names and aliases, removing syntactic sugar, and standardizing nested query predicates. This process of regularization aims to produce a new query that is more likely to be structurally similar to other semantically similar queries. Because the output is an ordinary SQL query, regularization may be used with any similarity metric. This process is similarly used in [23], [24], where Chandra et al. [23] generate mutations of SQL queries to catch deviations from a baseline query, and Sapia [24] creates OLAP query prototypes based on selected features and models user profiles.

Although the techniques we utilize for regularization are widely used in other settings, to the best of our knowledge, we are the first to apply them to improve clustering quality. We also test each of the techniques individually to find its impact on the regularization's overall effect. Our experiments in Section 5.2 show consistent improvements for all metrics evaluated in practical real-world settings. In this section, we describe the transformations that we apply to regularize queries and the conditions under which they may be applied.

3.1 Regularization Rules

Canonicalize Names and Aliases. As we will show in our experiments in Section 5, table and attribute aliases are a significant source of error in matching. Consider the following two queries:

5) SELECT name FROM user
6) SELECT id
   FROM (SELECT name AS id FROM user) AS t

Although these queries are functionally identical, variable names are aliased in different ways. This is especially damaging for the three structural heuristics that we evaluate, each of which assumes that variable names follow a globally consistent pattern. Our first regularization step attempts to create a canonical naming scheme for both attributes and tables, similar to the one used in [23].

Syntax Desugaring. We remove SQL's redundant syntactic sugar following basic pattern replacements, as shown in Table 3.

EXISTS Standardization. Although SQL admits four classes of nested query predicates (EXISTS, IN, ANY, and ALL), the EXISTS predicate is general enough to capture the semantics of the remaining operators [23]. Queries using the others are rewritten:

x IN (SELECT y ...) becomes
EXISTS (SELECT * ... WHERE x = y)
x < ANY (SELECT y ...) becomes
EXISTS (SELECT * ... WHERE x < y)
x < ALL (SELECT y ...) becomes
NOT EXISTS (SELECT * ... WHERE x ≥ y)

DNF Normalization. We normalize all boolean-valued expressions by converting them to disjunctive normal form (DNF). The choice of DNF is motivated by the ubiquity of conjunctive queries in most database applications, as well as by the natural correspondence between disjunctions and unions that we exploit below.

Commutative Operator Ordering. We standardize the order of expressions involving commutative and associative operators (e.g., ∧, ∨, +, and ×) by defining a canonical order of all operands and traversing the expression tree bottom-up to ensure a consistent order of all operands.

Flatten FROM-Nesting. We merge nested sub-queries in a FROM clause with their parent query, as described in [23].

Nested Query De-correlation. A common database optimization called nested-query de-correlation [25] converts some EXISTS predicates into joins for more efficient evaluation. Note that this rewrite does not guarantee query result equivalence under bag semantics, due to duplicated rows in the result. Hence we require that the parent query is either a SELECT DISTINCT or a duplicate-insensitive aggregate [26] (e.g., max{1, 1} = max{1}, but sum{1, 1} ≠ sum{1}).
If the EXISTS predicate is in a purely conjunctive WHERE clause, the de-correlation process simply moves the query nested in the EXISTS into the FROM clause of its parent query. The (formerly) nested query's WHERE clause can then be merged into the parent's WHERE clause. Specifically, if the input query is of the form:

SELECT ... FROM R WHERE
EXISTS (SELECT ... FROM S WHERE q)

then the output query will have the form:

SELECT ... FROM R, (SELECT ... FROM S) WHERE q

To de-correlate a NOT EXISTS predicate, we use the set-difference operator EXCEPT. If the input is of the form:

SELECT DISTINCT ... FROM R WHERE
NOT EXISTS (SELECT ... FROM S WHERE q)

then the output will be of the form:

(SELECT DISTINCT ... FROM R) EXCEPT
(SELECT DISTINCT ... FROM R WHERE
EXISTS (SELECT ... FROM S WHERE q))

OR-UNION Transform. We use a regularization transformation that exploits the relationship between OR and UNION. This rewrite does not guarantee query result equivalence either, also due to potentially duplicated rows in the query result. Recall the equivalence between logical OR and UNION mentioned in our first example. Naively, we might convert the DNF-form predicates into UNION queries:

SELECT ... WHERE q OR p OR ... becomes
SELECT ... WHERE q UNION SELECT ... WHERE p UNION ...

However, duplicates caused by possible correlation between clauses in the DNF will break the equivalence of this rewrite. Consider the following query:

SELECT Score FROM Exam WHERE Score > 60 OR Pass = 1

Students who pass the exam overlap with those whose score is greater than 60. Thus the rewritten query would not be exactly equivalent, as it may include duplicate rows. As a result, we require the query to satisfy the same condition mentioned in the previous rule, Nested Query De-correlation.

Union Pull-Out. Since the prior transformation may introduce UNION operators in nested subqueries, we push selection predicates down into the union as well.
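A minimal sketch of how such rewrites can be mechanized is shown below, using the IN-list rule from Table 3 over a toy expression tree. The node classes are stand-ins for whatever AST a real SQL parser produces; only the recursive match-and-rebuild structure is the point.

from dataclasses import dataclass

@dataclass
class In:           # x IN (v1, v2, ...)
    column: str
    values: tuple

@dataclass
class Eq:           # x = v
    column: str
    value: str

@dataclass
class Or:           # p OR q
    left: object
    right: object

def desugar(expr):
    # Rewrite IN-lists into chains of OR'd equalities (Table 3), bottom-up.
    if isinstance(expr, In):
        out = Eq(expr.column, expr.values[0])
        for v in expr.values[1:]:
            out = Or(out, Eq(expr.column, v))
        return out
    if isinstance(expr, Or):
        return Or(desugar(expr.left), desugar(expr.right))
    return expr

# rank IN ('adm', 'sup')  becomes  (rank = 'adm') OR (rank = 'sup')
print(desugar(In("rank", ("adm", "sup"))))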
4 QUALITY METRICS

In this section, we introduce the quality measures and workloads used to evaluate the three query similarity metrics and the feature engineering scheme. Our goal is to evaluate how well a query similarity metric captures the task behind a query, with and without regularization. We use two types of real-world query workloads: human- and machine-generated. We expect the problem of query similarity to be harder on human-generated workloads, as queries generated by machines are more likely to follow a strict, rigid structural pattern.

As a source of human-generated queries, we use two different sets of student answers to database course assignments. Many database courses include homework or exam questions where students are asked to translate prose into a precise SQL query. This provides us with a ground-truth source of queries with different structures that should be similar. As machine-generated queries, we use PocketData [2], a log of 33 million queries issued by smartphone apps running on 11 phones in the wild over the course of a month.

In Section 4.1, we outline the datasets used. Then, in Section 4.2, we outline the experimental methodology used to evaluate distance metrics, and propose a set of measures for quantitatively assessing how effective a query similarity metric is at clustering queries with similar tasks.

4.1 Workloads

We use three specific query sets: student assignments gathered by IIT Bombay [16], student exams gathered at our department (denoted as the UB dataset in the experiments) and released as part of this article,[1] and SQL query logs of the Google+ app extracted from the PocketData dataset [2].

The first dataset [16] consists of student answers to SQL questions given in IIT Bombay's undergraduate databases course. The dataset consists of student answers to 14 separate query-writing tasks, given as part of 3 separate homework assignments. The query-writing tasks have varying degrees of difficulty. Answers are not linked to anonymous student identifiers and there is no grade information. The IIT Bombay dataset is exclusively answers to homework assignments, so we expect generally high-quality answers due to the lack of time pressure and the availability of resources for validating query correctness.

The second dataset consists of student answers to SQL questions given as part of our department's graduate database course. The dataset consists of student answers to 2 separate query-writing tasks, each given as part of midterm exams in 2014 and 2015 respectively. SQL queries were transcribed from hand-written exam answers, anonymized for IRB compliance, and labeled with the grade the answer was given. We expect quality to vary, as exams are closed-book and students have limited time. Since 50 percent of the grade is the failing criterion, we assume that answers conform with the task of the question if the grade is over 50 percent. We also explore 20 and 80 percent thresholds in Appendix B, available in the online supplemental material.

The third dataset consists of SQL logs that capture all database activities of 11 Android phones for a period of one month. We selected the Google+ application for our study since it is one of the few applications where all users created a workload. The SQL queries collected were anonymized and some of the identified query constraints were deleted for IRB compliance [2].

A summary of all datasets is given in Tables 4, 5, and 6. The prose questions asked for the IIT Bombay and UB Exam datasets can be found in Tables 7 and 8. Not all student responses are legitimate SQL, and so we ignore queries that cannot be successfully parsed by our open-source SQL parser.[2] We also released the source code we used in the experiments.[3]

[1] http://odin.cse.buffalo.edu/public_data/2016-UB-Exam-Queries.zip
[2] https://github.com/UBOdin/jsqlparser
[3] https://github.com/UBOdin/EttuBench
TABLE 4
Summary of the IIT Bombay Dataset

Question | Total number of queries | Number of parsable queries | Number of distinct query strings
1 | 55 | 54 | 4
2 | 57 | 57 | 10
3 | 71 | 71 | 66
4 | 78 | 78 | 51
5 | 72 | 72 | 67
6 | 61 | 61 | 11
7 | 77 | 66 | 61
8 | 79 | 73 | 64
9 | 80 | 77 | 70
10 | 74 | 74 | 52
11 | 69 | 69 | 31
12 | 70 | 60 | 22
13 | 72 | 70 | 68
14 | 67 | 52 | 52

TABLE 5
Summary of the UB Exam Dataset

Year | 2014 | 2015
Total number of queries | 117 | 60
Number of syntactically correct queries | 110 | 51
Number of distinct query strings | 110 | 51
Number of queries with score > 50% | 62 | 40

TABLE 6
Summary of the PocketData Dataset and Google+

 | PocketData | Google+
All queries | 45,090,798 | 2,340,625
SELECT queries | 33,470,310 | 1,352,202
Distinct query strings | 34,977 | 135

TABLE 7
IIT Bombay Dataset Questions [16]

ID | Question
1 | Find course_id and title of all the courses
2 | Find course_id and title of all the courses offered by the "Comp. Sci." department
3 | Find course_id, title and instructor ID for all the courses offered in Spring 2010
4 | Find id and name of all the students who have taken the course "CS-101"
5 | Find which all departments are offering courses in Spring 2010
6 | Find the course ID and titles of all courses that have more than 3 credits
7 | Find, for each course, the number of distinct students who have taken the course; in case the course has not been taken by any student, the value should be 0
8 | Find id and title of all the courses offered in Spring 2010, which have no pre-requisite
9 | Find the ID and names of all students who have (in any year/semester) taken two courses in the same timeslot
10 | Find the departments (without duplicates) of courses that have the maximum credits
11 | Show a list of all instructors (ID and name) along with the course_id of courses they have taught. If they have not taught any course, show the ID and name with a null value for course_id
12 | Find IDs and names of all students whose name contains the substring "sr", ignoring case. (Hint: Oracle supports the functions lower and upper)
13 | Using a combination of outer join and the is null predicate, but WITHOUT USING "except/minus" and "not in", find IDs and names of all students who have not enrolled in any course in Spring 2010
14 | A course is included in your CPI calculation if you passed it, or if you have failed it and have not subsequently passed it (in other words, a failed course is removed from the CPI calculation if you have subsequently passed it). Write an SQL query that shows all tuples of the relation other than those eliminated by the above rule, also eliminating tuples with a null value for grade

TABLE 8
UB Exam Dataset Questions

Year | Question
2014 | How many distinct species of bird have ever been seen by the observer who saw the most birds on December 15, 2013?
2015 | You are hired by a local birdwatching organization, whose database uses the Birdwatcher Schema on page 2. You are asked to design a leader board for each species of Bird. The leader board ranks Observers by the number of Sightings for Birds of the given species. Write a query that computes the set of names of all Observers who are highest ranked on at least one leader board. Assume that there are no tied rankings.

In the first two datasets, the query-writing task is specific. We can expect that student answers to a single question are written with the same task in mind. Thus, we would expect a good distance metric to rate answers to the same question as close and answers to different questions as distant. Similarly, using the distance metric for clustering, we would expect each query cluster to uniformly include answers to the same question.

In the third dataset, PocketData-Google+, the queries are generated by the Google+ application. Since some of the constants are replaced with standard placeholders for IRB compliance, the number of distinct queries drops significantly. Since there is no information about what kind of task a query is trying to perform, we inspected and manually labeled each distinct query string. Queries were labeled with one of 8 different categories: Account, Activity, Analytics, Contacts, Feed, Housekeeping, Media and Photo.

4.2 Clustering Validation Measures

In addition to the workload datasets, we define a set of measures to be used for evaluating queries. Given a set of queries labeled with tasks and an inter-query similarity metric, we want to understand how well the metric can (1) put queries that perform the same task close together even if they are written differently, and (2) differentiate queries that are labeled with different tasks.

We evaluate each metric according to how well it aligns with the ground-truth cluster labels. Rather than evaluating the clustering output itself, we evaluate an intermediate

step: the pairwise distance matrix for the set of queries in a given workload. With this matrix and a labeled dataset, we can use various clustering validation measures to understand how effectively a similarity metric characterizes the partition of a set of queries. Specifically, clustering validation measures are used to validate the quality of a labeled dataset by estimating two quantities: (1) the degree of tightness of observations in the same label group, and (2) the degree of separation between observations in different label groups. As a result, we use three clustering validation measures [17, Chapter 17], the Average Silhouette Coefficient, BetaCV, and Dunn Index, as they all quantify the two qualities mentioned above in their formulations.

Silhouette Coefficient. For every data point in the dataset, its silhouette coefficient is a measure of how similar it is to its own cluster in comparison to other clusters. In particular, the silhouette coefficient for a data point i is measured as

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the average distance from i to all other data points in the same cluster, and b(i) is the average distance from i to all data points in the closest neighboring cluster. The range of the silhouette coefficient is from -1 to 1. s(i) is close to 1 when i is closer to data points from its own cluster than to data points from different clusters, which represents a good match. On the other hand, an s(i) close to -1 indicates that the data point i is placed in the wrong cluster, as it is closer to data points in different clusters than to those in its own. Since the silhouette coefficient represents a measure of the degree of goodness for each data point, to validate the effectiveness of the distance metric given a query partition, we use the average silhouette coefficient over all data points (all queries) in the dataset.

BetaCV Measure. The BetaCV measure is the ratio of the total mean of intra-cluster distances to the total mean of inter-cluster distances. The smaller the value of BetaCV, the better the similarity metric characterizes the cluster partition of queries on average.

Dunn Index. The Dunn Index is defined as the ratio between the minimum distance between query pairs from different clusters and the maximum distance between query pairs from the same cluster. In other words, this is the ratio between the closest pair of points from different clusters and the largest diameter among all clusters. Higher values of the Dunn Index indicate better worst-case performance of the clustering metric.
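For reference, the three measures can be computed directly from a pairwise distance matrix and ground-truth task labels. The sketch below is a straightforward, unoptimized reading of the definitions above (our evaluation itself was implemented in Java); D is an n × n symmetric matrix, labels assigns each query to its task, and the sketch assumes at least two clusters and at least one same-task pair.

import numpy as np

def average_silhouette(D, labels):
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from a(i)
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def beta_cv(D, labels):  # smaller is better
    labels = np.asarray(labels)
    n = len(labels)
    intra = [D[i, j] for i in range(n) for j in range(i + 1, n) if labels[i] == labels[j]]
    inter = [D[i, j] for i in range(n) for j in range(i + 1, n) if labels[i] != labels[j]]
    return float(np.mean(intra) / np.mean(inter))

def dunn_index(D, labels):  # larger is better; a worst-case measure
    labels = np.asarray(labels)
    n = len(labels)
    closest = min(D[i, j] for i in range(n) for j in range(n) if labels[i] != labels[j])
    diameter = max(D[i, j] for i in range(n) for j in range(i + 1, n) if labels[i] == labels[j])
    return float(closest / diameter)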
5 EXPERIMENTS

In this section, we perform experiments to evaluate the performance of the three similarity metrics previously discussed in Section 2: Makiyama's similarity [15], Aligon's similarity [14] and Aouiche's similarity [13]. We implemented each of these similarity metrics in Java and evaluated them using the three clustering validation measures discussed in Section 4.2. In particular, we evaluate these three similarity metrics on their ability to capture the tasks performed by SQL queries. In addition, we also evaluate the effectiveness of the feature engineering step introduced in Section 3 and examine how query similarity can be improved by applying this step to the SQL query. We also look more closely at feature engineering by breaking it down into different modules and analyzing the effect of each module on capturing the tasks performed by queries.

Fig. 1. Clustering validation measures for each metric with and without the regularization step.

5.1 Evaluation on SQL Similarity Metrics

In the first experiment, we evaluate the three similarity metrics mentioned in Section 2. The aim of the experiment is to evaluate which similarity metric can best capture the task performed by each query.

The black columns in Fig. 1 show a comparison of the three similarity metrics using each of the three quality measures (Average Silhouette Coefficient, BetaCV and Dunn Index). As can be seen in Fig. 1, Aligon seems to work the best for both the IIT Bombay and UB Exam datasets, while achieving second-best for the PocketData-Google+ dataset, under the Average Silhouette Coefficient measure. When considering the BetaCV measure, Aligon also attains the best result for both the IIT Bombay and UB Exam datasets while having a comparable result for the PocketData-Google+ dataset. Aligon also performs well on the Dunn Index, coming in first on the UB Exam dataset, and second-best for the IIT Bombay and PocketData-Google+ datasets. Especially given that the Dunn Index measures only worst-case performance, Aligon's metric seems to be ideal for our workloads. This shows that even a fairly simple approach can capture task similarity well.

For a closer look at Aligon's similarity metric, Figs. 2a, 2c, and 2e show the distribution of silhouette coefficients for each query and their respective tasks. Recall that a silhouette coefficient below 0 effectively indicates a query closer to another cluster than its own, or a query that would be misclassified. The further below zero, the greater the error. For the UB Exam dataset (Fig. 2c), the majority of queries would have been successfully classified, and only a small fraction exhibit minor errors. For the PocketData-Google+ dataset (Fig. 2e), there are some erroneous queries in clusters 4, 5 and 6, while clusters 1, 2, 3, 7 and 8 have very few errors. For the Bombay dataset (Fig. 2a), the distribution of errors varies. Clusters 1, 2, 4, 6, 12 and 14 exhibit virtually no error, while clusters 7, 8, and 9 exhibit particularly egregious errors.

5.2 Evaluation of Feature Engineering

We next evaluate the effectiveness of regularization by applying it to each of the three metrics described in Section 2. We use our quality evaluation scheme to compare the quality of each measure both with and without feature engineering.

Fig. 1 shows the values of the three validation measures for each of the three similarity metrics, both with and without regularization. As shown in Fig. 1, regularization significantly improves the Average Silhouette Coefficient and BetaCV measures for all similarity metrics, except for the case of the Makiyama similarity metric with the PocketData-Google+ dataset. The Dunn index is relatively unchanged or slightly improved for the IIT Bombay and PocketData-Google+ datasets and shows slight signs of worsening with regularization on the UB Exam dataset. To understand the reason for the worse
Dunn Index, we compare Fig. 2c (original) with Fig. 2d (with regularization). The silhouette coefficients for answers that are originally positive in each question are considerably increased, and those for answers that are originally negative (regarded as erroneous) are decreased even further as a result of regularization, since regularization reduces the diversity of query structures, which leads to separating queries better. In other words, for erroneous answers with negative silhouette coefficients, distance metrics like Aligon's distinguish them even further from answers with positive silhouette coefficients after regularization. Since erroneous answers are treated as the 'worst cases' for each question, the Dunn Index, which measures worst-case performance, naturally gets worse.

Fig. 2. Distribution of silhouette coefficients when using Aligon's similarity (a), (c), and (e) without regularization, and (b), (d), and (f) when regularization is applied.

5.2.1 Per-Query Similarity

Figs. 2b, 2d, and 2f show the distributions of silhouette coefficients for the Aligon similarity metric after regularization is applied. For the IIT Bombay dataset, comparing against Fig. 2a, there is a slight improvement at the tail end of clusters 9, 11, 12, 13 and 14: several of the negative coefficients have been removed. Furthermore, positive matches have been improved, particularly for clusters 7, 9, 10, 12 and 13. Finally, there has been a significant reduction in the degree of error in cluster 10. Cluster 10 is a particularly egregious case of aliasing, as the correct answer involves two self-joins in the same query. As a result, aliasing is a fundamental part of the correct query answer, and our rewrites could not reliably create a uniform set of alias names. In the UB Exam and PocketData-Google+ datasets, the improvement provided by regularization can be seen for queries with both positive and negative values of s(i).

5.3 Case Study

As part of our analysis, we attempted to provide empirical explanations for query errors, in particular for queries where s(i) < 0 for all three similarity metrics. Namely, we looked into the queries that are too far apart from the clusters they belong to, and we categorized the reasons for misclassification based on these queries. We then investigated how the regularization process particularly affects these queries.

Almost all of these egregiously misclassified queries appear in the IIT Bombay dataset, the distribution of which is summarized in Table 9. The PocketData-Google+ dataset includes no egregiously misclassified queries, while the UB Exam dataset includes only one such query (which we tagged as a case of Contextual equivalence). We tagged each egregiously misclassified query with an explanation that justifies why the query has a low s(i). Tags were drawn from the following list:

Ground-Truth Error. A student's response to the question may have been legitimately incorrect. This is a query that is correctly classified as an outlier. For example:

SELECT *
FROM (SELECT id, name, time_slot_id
      FROM (SELECT *
            FROM (SELECT *
                  FROM student
                  NATURAL JOIN takes) b1) a, section
      WHERE a.course_id = section.course_id) a1
This query was attempting to complete the task "Find the ID and names of all students who have (in any year/semester) taken two courses in the same timeslot."

Nested Subquery. A student's response is equivalent to a legitimately correct answer, but uses nested subqueries in such a way that a heuristic distance metric cannot recognize the equivalence. For example:

SELECT id, name FROM student
WHERE id IN (SELECT DISTINCT s.id
  FROM (SELECT * FROM takes NATURAL JOIN section) s,
       (SELECT * FROM takes NATURAL JOIN section) t
  WHERE s.id = t.id
  AND s.time_slot_id = t.time_slot_id
  AND s.course_id <> t.course_id)

Here, the subquery nesting structure is significantly different from that of other queries for the same question.

Aliasing. Aliasing (e.g., AS in SQL) breaks a distance metric that relies on attribute and relation names. For example:

SELECT DISTINCT student.id, student.name
FROM student, takes, section AS a, section AS b
WHERE student.id = takes.id
AND takes.course_id = a.course_id
AND takes.course_id = b.course_id
AND a.course_id <> b.course_id
AND a.time_slot_id = b.time_slot_id

The student's use of a and b makes this query hard to distinguish from other queries that may use other names for the attributes.

Insufficient Features. Relevant query components are not sufficiently captured as features for a heuristic distance metric to distinguish between answers from sufficiently similar questions.

Too Many Features. Irrelevant query components create redundant features that artificially increase the distance between the query and the cluster center. For example:
SELECT DISTINCT student.name, takes.id,
       s1.course_id, s2.course_id
FROM section AS s1, section AS s2, takes, student
WHERE takes.course_id = s1.course_id
AND s1.course_id <> s2.course_id
AND s1.time_slot_id = s2.time_slot_id
AND s1.semester = s2.semester
AND s1.year = s2.year
AND takes.sec_id = s1.sec_id
AND s1.semester = takes.semester
AND s1.year = takes.year
AND student.id = takes.id
AND s2.time_slot_id = s2.time_slot_id
AND takes.sec_id = s2.sec_id
AND s2.semester = takes.semester
AND s2.year = takes.year

Contextual Equivalence. Establishing query equivalence to properly clustered queries requires domain-specific knowledge not available to the distance metric (e.g., attribute uniqueness). For example:

SELECT student.id, student.name
FROM student
WHERE student.id
IN (SELECT takes.id
    FROM takes, section
    WHERE takes.course_id = section.course_id
    AND takes.sec_id = section.sec_id
    AND takes.semester = section.semester
    AND takes.year = section.year
    GROUP BY takes.id,
             takes.semester,
             takes.year,
             section.time_slot_id
    HAVING count(*) > 1)

Table 9 shows the primary reason why each of these queries could not be classified correctly. Note that there may be more than one reason for a query to be placed in a different cluster, but in Table 9, we only give the empirically determined primary reason.

TABLE 9
Empirical Error Reasons for the IIT Bombay Dataset

Cause | Erroneous queries without regularization | Erroneous queries with regularization
All queries | 33 (100%) | 27 (100%)
Ground-truth quality | 14 (42.4%) | 14 (51.8%)
Nested subquery | 7 (21.2%) | 5 (18.5%)
Aliasing | 8 (24.2%) | 5 (18.5%)
Insufficient features | 2 (6.0%) | 1 (3.7%)
Too many features | 1 (3.0%) | 1 (3.7%)
Contextual equivalence | 1 (3.0%) | 1 (3.7%)

Many of the queries with low silhouette coefficients are identified as incorrect answers for the given task. These answers directly affect the ground-truth quality, and therefore reduce the average silhouette coefficient. Another reason for erroneous queries with low silhouette coefficients is aliasing. Although it is convenient for a user to use aliases in the query to refer to a particular item, it is difficult for a machine to approximate the tasks the query authors are trying to accomplish, since different query authors have different ways of naming particular items in the query. This problem is particularly prevalent in question 9 of the IIT Bombay dataset.

Although the distribution of the error reasons can be expected to change, all the tags provided in this section can generically be applied to other query logs given a ground truth. The regularization method cannot be expected to fix errors originating from misclassifications in the ground truth, since such queries do not actually share any similarities with their cluster.

After the regularization process, the silhouette coefficient under all three similarity metrics for each query is computed again, and the result yields an 18 percent overall reduction in the number of erroneous queries (s(i) < 0) in the IIT Bombay dataset.

Fig. 3. Effect of each module in regularization.

5.4 Analysis of Regularization by Module

In Section 5.2, we analyzed the overall effect of regularization on query similarity. However, as described in Section 3,
2418 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 30, NO. 12, DECEMBER 2018

regularization is composed of many different transforma- For expressions of the form: X IN fx1 ; x2 ; . . . ; xn g, feature
tion rules. In this experiment, we group these rules into four duplication becomes dominant when n grows large. In
separate modules and inspect their impact on the clustering Fig. 3, Aligon and Makiyama suffer from feature duplication
quality. One may observe that Commutative Operator brought by Expression Standardization in some cases while
Ordering is guaranteed to provide benefit in structure simi- Aouiche does not. Because Aouiche records feature existence
larity comparison, hence we include it in all four modules. instead of occurrence in its vector. Although in some cases
In addition, there are dependencies between rules that such as this, simply replacing feature occurence with exis-
require them to operate one before another. For example, tence solves the problem of feature duplication, feature
we should better apply Syntax Desugaring and then DNF occurence can also be a good indicator for the interests of the
Normalization to simplify the boolean expression in query. We believe this problem can be addressed with explo-
WHERE clause before OR-Union Transformation. As ration of feature weighting strategies. Therefore, the problem
another example, Exists Standardization should better be of feature duplication will be further explored as a part of
applied on nested sub-queries before we de-correlate them feature weighting strategies in our future work.
using Nested Query De-correlation. As a result, we group
the rules from Section 3 into four modules: 6 DISCUSSION
1) Naming: Canonicalize Names and Aliases We have reviewed several similarity metrics for clustering
2) Expression Standardization: Syntax Desugaring, queries and focused on three syntax-based methods that
Exists Standardization, DNF Normalization, offer an end-to-end similarity metric. The advantage of this
Nested Query Decorrelation, OR-Union Transform preference is that, syntax-based methods do not require
3) FROM-Nesting: Flatten FROM-Nesting access to the data in the database or database properties.
4) Union Pullout: OR-UNION Pullout Considering that only logs are usually transferred between
organizations, and requiring access to the data for investiga-
Commutative Operator Ordering is included in all modules. tions can cause privacy violations, we preferred focusing on
Fig. 3 provides a comparison of each module in regulari- the syntax-based approach.
zation. From this figure, one can observe that, since students The survey we performed shows that most of the metrics
use different names/aliases for their convenience when con- make use of selection and join operations in the queries and
structing queries, the Naming module is the most effective consider them as the most important items for similarity cal-
one in terms of improving clustering quality for IIT Bombay culation. Group-by aggregate follows them closely while
and UB Exam datasets. On the other hand, for PocketData- projection items take the third most important item set.
Google+ dataset, names are already canonicalized as they There are other possible feature sets that can be used, such
are machine-generated. In this case, Expression Standardiza- as tables accessed or the abstract syntax tree (AST) of a
tion seems to be the most effective module, especially when query, but these feature sets are generally overlooked.
using Aligon or Aouiche as similarity metric. In Pocket- Although Aouiche et al. [13] make use of the most impor-
Data-Google+ dataset, referred tables and boolean expres- tant features selection, joins, and group-by items, they don’t
sions in the queries are both informative in distinguishing utilize the number of times an item appears, or after the
between different query categories or clusters. For this rea- parsing, they don’t consider what kind of feature an item is.
son, Makiyama similarity metric which considers both This means, it does not matter if a query has rank column
works well even without regularization while Aligon and in group-by, and the other one has rank column in selec-
Aouiche can get commensurate performance only after tion; they are considered the same. Makiyama et al. [15], on
applying Expression Standardization module. the other hand, follow Aligon et al. [14] in separating the
Note that in Fig. 3, Expression Standardization makes different features, and improves on it by making use of
Average Silhouette Coefficient worse in some cases for IIT appearance count of items. However, while trying to make
Bombay and UB Exam data sets. The performance degrada- use of every item like FROM and Order-By predicates, they
tion is majorly due to feature duplication. More specifically, consider these low priority predicates with same impor-
consider the example query with Expression Standardization. tance as the selection and join predicates.
Example 1. Syntax Desugaring with OR-UNION Transform

1) SELECT name FROM usr WHERE
   rank IN ('admin', 'normal')
2) SELECT name FROM usr WHERE
   rank = 'admin' OR rank = 'normal'
3) SELECT name FROM usr
   WHERE rank = 'admin'
   UNION
   SELECT name FROM usr
   WHERE rank = 'normal'
Query (1) is transformed into (2) by syntax desugaring, and then into (3) by the OR-UNION Transform. From (1) to (2), the feature WHERE rank is replicated; from (2) to (3), the features SELECT name and FROM usr are duplicated.
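The effect on similarity scores can be seen in a small worked example. The feature counts below are hand-coded for illustration; they are not output from our pipeline.

from math import sqrt

def cosine(u, v):
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

other = {"SELECT name": 1, "FROM usr": 1, "WHERE dept": 1}
q1 = {"SELECT name": 1, "FROM usr": 1, "WHERE rank": 1}  # query (1)
q2 = {"SELECT name": 1, "FROM usr": 1, "WHERE rank": 2}  # after desugaring

print(round(cosine(q1, other), 3))  # 0.667
print(round(cosine(q2, other), 3))  # 0.471

Although queries (1) and (2) perform exactly the same task, the replicated WHERE rank feature shifts weight away from the features shared with other queries, so the rewritten query drifts in feature space. This is precisely the noise that depresses the Average Silhouette Coefficient.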
Computing similarity from query semantics, by contrast, requires access to the data in the database or to database properties. Considering that usually only logs are transferred between organizations, and that requiring access to the data for an investigation can cause privacy violations, we preferred to focus on the syntax-based approach.

The survey we performed shows that most metrics rely on the selection and join operations in a query and treat them as the most important items for similarity computation. Group-by aggregates follow closely, and projection items form the third most important item set. Other feature sets could be used, such as the tables accessed or the abstract syntax tree (AST) of a query, but these are generally overlooked. Although Aouiche et al. [13] use the most important features (selections, joins, and group-by items), they neither count how many times an item appears nor record, after parsing, what kind of feature an item is: a query with the rank column in its group-by clause and a query with rank in a selection predicate are treated as identical. Makiyama et al. [15], on the other hand, follow Aligon et al. [14] in keeping the different feature kinds separate, and improve on that scheme by counting item appearances. However, in trying to use every item, including FROM and ORDER BY predicates, they give these low-priority predicates the same importance as the selection and join predicates.
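For concreteness, the following sketch extracts clause-tagged, count-aware features in the spirit of Aligon et al. [14] and Makiyama et al. [15]. The toy regular-expression splitter stands in for a real SQL parser and handles only trivial queries; it is illustrative, not our implementation.

import re
from collections import Counter

CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY"]

def extract_features(sql):
    # Tag each identifier with its clause, so that rank in GROUP BY
    # and rank in a selection predicate remain distinct features.
    parts = re.split(r"\b(" + "|".join(CLAUSES) + r")\b", sql,
                     flags=re.IGNORECASE)
    features, clause = Counter(), None
    for part in parts:
        if part.upper() in CLAUSES:
            clause = part.upper()
        elif clause:
            part = re.sub(r"'[^']*'", " ", part)  # drop string literals
            for ident in re.findall(r"[A-Za-z_][A-Za-z_0-9]*", part):
                features[(clause, ident.lower())] += 1
    return features

print(extract_features("SELECT name FROM usr WHERE rank = 'admin'"))
# Counter({('SELECT', 'name'): 1, ('FROM', 'usr'): 1, ('WHERE', 'rank'): 1})

Dropping the clause tag recovers an Aouiche-style feature set; keeping the tag and the counts gives the finer-grained representation that the later metrics rely on.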
Makiyama et al. [15] use a more complete encoding of the query AST; hence, when queries are simple, as in the PocketData-Google+ dataset, their technique can be slightly better. For a complex query with redundant features, however, mixing features captured from the various components of the query without proper re-weighting effectively dilutes the features that are most informative. Hence, on the student exam datasets we observe that Aligon et al. [14] performs better than the others, while on the PocketData-Google+ dataset Makiyama et al. [15] is better.

These methods could be improved further by making fuller use of the abstract syntax tree of a SQL statement. Because SQL is a declarative language, the AST of a statement acts as a proxy for the query author's task. This suggests that a comparison of ASTs can be a meaningful metric for query similarity.
For instance, we can group a query Q with other queries whose ASTs are nearly (or completely) identical to Q's. This structural definition of task has already seen substantial use, particularly in the translation of natural language queries into SQL [27]. For two SQL queries Q1 and Q2, one reasonable measure would be to count the connected subgraphs of Q1 that are isomorphic to a subgraph of Q2. Subgraph isomorphism is NP-complete, but a computationally tractable simplification of this metric can be found in the Weisfeiler-Lehman (WL) Algorithm [3], [28].
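A tree-adapted sketch of the idea follows. Real WL kernels iterate hashed relabelings over arbitrary graphs [28]; the version below performs the analogous relabeling over toy tuple-encoded ASTs and compares the resulting label multisets, with the encoding being an illustrative assumption.

from collections import Counter

def wl_labels(node):
    # Augment each node's label with the sorted labels of its children,
    # mimicking a WL relabeling pass, and collect all labels produced.
    label, children = node
    if not children:
        return label, Counter([label])
    results = [wl_labels(c) for c in children]
    new_label = label + "(" + ",".join(sorted(r[0] for r in results)) + ")"
    bag = Counter([new_label])
    for _, child_bag in results:
        bag += child_bag
    return new_label, bag

def wl_similarity(t1, t2):
    b1, b2 = wl_labels(t1)[1], wl_labels(t2)[1]
    return sum((b1 & b2).values()) / sum((b1 | b2).values())  # Jaccard

ast1 = ("SELECT", [("name", []), ("FROM", [("usr", [])])])
ast2 = ("SELECT", [("name", []), ("FROM", [("emp", [])])])
print(round(wl_similarity(ast1, ast2), 2))  # 0.14: only 'name' is shared

Because each composite label encodes an entire subtree, two queries score highly only when they share whole query fragments, approximating the (intractable) subgraph isomorphism count at roughly linear cost.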
As Tables 4 and 5 show, as the complexity or difficulty of a question increases, the number of distinct queries also increases; that is, students find different ways to solve the same problem. In Table 5 in particular, no two students answer a question using the same structure. This phenomenon motivates the need for regularization when comparing SQL queries: as the complexity of a query grows, so does the number of ways to write it to achieve the same task. Fig. 1 shows that our assumption that regularizing queries improves overall clustering quality is correct. Our proposed feature engineering scheme improves the overall clustering quality of all three metrics on all three datasets, including both human- and machine-generated queries.
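As a reminder of how clustering quality is scored in this evaluation, the sketch below computes the Average Silhouette Coefficient directly from a pairwise distance matrix (distance = 1 - similarity). The four-query matrix is hand-coded for illustration and is not drawn from our datasets.

def avg_silhouette(dist, labels):
    n, scores = len(labels), []
    for i in range(n):
        same = [dist[i][j] for j in range(n)
                if labels[j] == labels[i] and j != i]
        if not same:
            continue  # silhouette is undefined for singleton clusters
        a = sum(same) / len(same)           # mean intra-cluster distance
        b = min(sum(dist[i][j] for j in range(n) if labels[j] == c) /
                sum(1 for j in range(n) if labels[j] == c)
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))  # nearest other cluster vs. own
    return sum(scores) / len(scores)

# Queries 0 and 1 answer one exam question; queries 2 and 3 another.
dist = [[0.0, 0.2, 0.9, 0.8],
        [0.2, 0.0, 0.7, 0.9],
        [0.9, 0.7, 0.0, 0.1],
        [0.8, 0.9, 0.1, 0.0]]
print(round(avg_silhouette(dist, [0, 0, 1, 1]), 2))  # 0.82: well separated

Regularization helps precisely by shrinking the intra-cluster distances (the a term) for queries that express the same task in different surface syntax.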
7 APPLICATION SCENARIOS

In this section, we provide three scenarios in which the clustering scheme, coupled with the proposed regularization, is applicable.

The first is Jane the DBA, who takes on the task of improving database performance. After performing the straightforward database indexing tasks, she needs to select candidate views, which are virtual tables defined by a query; they can be queried just like tables because they pre-fetch records from existing tables. Constructing a view for a frequent, complex join operation can increase the querying performance of the database substantially. To find the ideal views, Jane first clusters similar queries together to see which kinds of queries are most frequent. Making the most frequent complex query types faster by creating views for them could improve database performance substantially [13], [14].
The second is Jane the security auditor, who suspects that someone is leaking classified information from her organization. Among other strategies, she can choose to investigate database access patterns, which involves query clustering [29]. After identifying the query clusters, she can partition the queries by department or role to gain an intuition for which departments and roles normally use which parts of the database. She can then detect outliers from that behavior to determine suspects for further investigation.
Lastly, Jane the researcher needs to investigate the properties of the SQL query dataset that she is going to use in her research. A new graduate student on her team clusters the queries and provides her with the cluster assignment of each query. She doubts the quality of the clustering and wonders whether the operation could be performed better.
Having a better clustering of queries would potentially enhance the quality of Jane's work in all of the examples given above. Moreover, the works cited in this section [13], [14], [29], along with many others, can benefit from the framework described in this article.

8 CONCLUSION AND FUTURE WORK

The focus of this work is to understand and improve similarity metrics for SQL queries that rely only on query structure, for use in clustering queries. We described a quality evaluation scheme that captures the notion of query task using student answers to query-construction problems and a real-world smartphone query load. We used this scheme to evaluate three existing query similarity metrics. We also proposed a feature engineering technique for standardizing query representations. Through further experiments, we showed that different workloads have different characteristics and that no single similarity metric surveyed was always good. The feature engineering steps provided an improvement across the board because they addressed the sources of error we identified.

The approaches described in this article represent only the first steps toward tools for summarizing logs by task. Concretely, we plan to extend our work in several directions. First, we will explore new feature extraction mechanisms such as the Weisfeiler-Lehman framework [3], feature weighting strategies, and new labeling rules, in order to better capture the task behind logged queries. Second, we will introduce the temporal order of the log to increase query clustering quality; in this article we focused on query structure, and exploring inter-query feature correlations based on query order could be used to summarize query logs in addition to clustering them. Third, we will examine user interfaces that better present clusters of queries, for example, different feature sorting strategies in Frequent Pattern Trees (FP Trees) [30] to help the user distinguish important features from irrelevant ones. Lastly, we will investigate the temporal effects on query clustering.

ACKNOWLEDGMENTS

This material is based in part upon work supported by the US National Science Foundation under award number CNS-1409551. Usual disclaimers apply.

REFERENCES

[1] N. Bruno and S. Chaudhuri, "Automatic physical database tuning: A relaxation-based approach," in Proc. ACM Int. Conf. Manage. Data, 2005, pp. 227–238.
[2] O. Kennedy, J. Ajay, G. Challen, and L. Ziarek, "Pocket data: The need for TPC-MOBILE," in Proc. Technol. Conf. Perform. Eval. Benchmarking, 2015, pp. 8–25.
[3] G. Kul, D. Luong, T. Xie, P. Coonan, V. Chandola, O. Kennedy, and S. Upadhyaya, "Ettu: Analyzing query intents in corporate databases," in Proc. 25th Int. Conf. Companion World Wide Web, 2016, pp. 463–466.
[4] C. Dwork, "Differential privacy," in Automata, Languages and Programming, M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, Eds. New York, NY, USA: Springer, 2006.
[5] C. Sapia, "PROMISE: Predicting query behavior to enable predictive caching strategies for OLAP systems," in Proc. Int. Conf. Data Warehousing Knowl. Discovery, 2000, pp. 224–233.
[6] A. Giacometti, P. Marcel, E. Negre, and A. Soulet, "Query recommendations for OLAP discovery driven analysis," in Proc. ACM 12th Int. Workshop Data Warehousing OLAP, 2009, pp. 81–88.
[7] X. Yang, C. M. Procopiuc, and D. Srivastava, "Recommending join queries via query log analysis," in Proc. IEEE 25th Int. Conf. Data Eng., 2009, pp. 964–975.
[8] K. Stefanidis, M. Drosou, and E. Pitoura, "'You May Also Like' results in relational databases," in Proc. 3rd Int. Workshop Personalized Access, Profile Management, Context Awareness Databases, 2009, pp. 37–42.
[9] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu, "SnipSuggest: Context-aware autocompletion for SQL," Proc. VLDB Endowment, vol. 4, no. 1, pp. 22–33, Oct. 2010.
[10] G. Chatzopoulou, M. Eirinaki, S. Koshy, S. Mittal, N. Polyzotis, and J. S. V. Varman, "The QueRIE system for personalized query recommendations," IEEE Data Eng. Bull., vol. 34, no. 2, pp. 55–60, 2011.
[11] W. Gatterbauer, "Databases will visualize queries too," Proc. VLDB Endowment, vol. 4, no. 12, pp. 1498–1501, 2011.
[12] J. Aligon, K. Boulil, P. Marcel, and V. Peralta, "A holistic approach to OLAP sessions composition: The Falseto experience," in Proc. 17th Int. Workshop Data Warehousing OLAP, 2014, pp. 37–46.
[13] K. Aouiche, P.-E. Jouve, and J. Darmont, "Clustering-based materialized view selection in data warehouses," in Proc. East Eur. Conf. Adv. Databases Inf. Syst., 2006, pp. 81–95.
[14] J. Aligon, M. Golfarelli, P. Marcel, S. Rizzi, and E. Turricchia, "Similarity measures for OLAP sessions," Knowl. Inf. Syst., vol. 39, pp. 463–489, 2014.
[15] V. H. Makiyama, M. J. Raddick, and R. D. Santos, "Text mining applied to SQL queries: A case study for the SDSS SkyServer," in Proc. SIMBig, 2015, pp. 66–72.
[16] B. Chandra, B. Chawda, B. Kar, K. V. Reddy, S. Shah, and S. Sudarshan, "Data generation for testing and grading SQL queries," Int. J. Very Large Data Bases, vol. 24, pp. 731–755, 2015.
[17] M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge, MA, USA: Cambridge Univ. Press, 2014.
[18] A. Kamra, E. Terzi, and E. Bertino, "Detecting anomalous access patterns in relational databases," Int. J. Very Large Data Bases, vol. 17, pp. 1063–1077, 2007.
[19] S. Mathew, M. Petropoulos, H. Q. Ngo, and S. Upadhyaya, "A data-centric approach to insider attack detection in database systems," in Proc. Int. Workshop Recent Adv. Intrusion Detection, 2010, pp. 382–401.
[20] H. V. Nguyen, K. Böhm, F. Becker, B. Goldman, G. Hinkel, and E. Müller, "Identifying user interests within the data space: A case study with SkyServer," in Proc. 18th Int. Conf. Extending Database Technol., 2015.
[21] R. Agrawal, R. Rantzau, and E. Terzi, "Context-sensitive ranking," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2006, pp. 383–394.
[22] A. K. Chandra and P. M. Merlin, "Optimal implementation of conjunctive queries in relational data bases," in Proc. 9th Annu. ACM Symp. Theory Comput., 1977, pp. 77–90.
[23] B. Chandra, M. Joseph, B. Radhakrishnan, S. Acharya, and S. Sudarshan, "Partial marking for automated grading of SQL queries," Proc. VLDB Endowment, vol. 9, no. 13, pp. 1541–1544, Sep. 2016.
[24] C. Sapia, "On modeling and predicting query behavior in OLAP systems," in Proc. Int. Workshop Des. Manage. Data Warehouses, 1999, pp. 1–10.
[25] P. Seshadri, H. Pirahesh, and T. Y. C. Leung, "Complex query decorrelation," in Proc. 12th Int. Conf. Data Eng., 1996, pp. 450–458.
[26] A. Gupta, V. Harinarayan, and D. Quass, "Aggregate-query processing in data warehousing environments," in Proc. 21st Int. Conf. Very Large Data Bases, 1995, pp. 358–369.
[27] F. Li and H. V. Jagadish, "Constructing an interactive natural language interface for relational databases," Proc. VLDB Endowment, vol. 8, no. 1, pp. 73–84, 2014.
[28] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, "Weisfeiler-Lehman graph kernels," J. Mach. Learn. Res., vol. 12, pp. 2539–2561, 2011.
[29] Y. Sun, H. Xu, E. Bertino, and C. Sun, "A data-driven evaluation for insider threats," Data Sci. Eng., vol. 1, no. 2, pp. 73–85, 2016.
[30] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent-pattern tree approach," Data Mining Knowl. Discovery, vol. 8, pp. 53–87, 2004.

Gokhan Kul received the MS degree from METU, Turkey, and worked as a software engineer there. He is working toward the PhD degree in computer science at the University at Buffalo. His research interests include database systems and cyber-threat detection. His current work focuses on threat modeling and insider threat detection under the supervision of Dr. Shambhu Upadhyaya and Dr. Oliver Kennedy. He is a graduate student member of the IEEE.

Duc Thanh Anh Luong received the BS degree in computer science from the University of Science at Ho Chi Minh City, Vietnam, in April 2012. He is working toward the PhD degree under the supervision of Dr. Varun Chandola in the Department of Computer Science and Engineering, University at Buffalo. His research is broadly in the field of machine learning and data mining; in particular, he has developed methods for probabilistic modeling and clustering of complex data. Applications of his research include anomaly detection, clustering patient health profiles, and forecasting retail demand.

Ting Xie received the bachelor's degree from the Beijing University of Posts and Telecommunications, and master's degrees from the University of Pennsylvania and the University of Illinois at Urbana-Champaign. She is currently working toward the PhD degree in the CSE Department, University at Buffalo, where she is supervised by Dr. Oliver Kennedy and works as a research assistant in the Odin Lab.

Varun Chandola received the PhD degree in computer science and engineering from the University of Minnesota. He is a tenure-track assistant professor with the Computer Science Department, University at Buffalo (UB), and the Center for Computational and Data-Enabled Science and Engineering (CDSE). His research covers the application of data mining and machine learning to problems involving big and complex data, focusing on anomaly detection. Before joining UB, he was a scientist with the Computational Sciences and Engineering Division, Oak Ridge National Laboratory.

Oliver Kennedy is an assistant professor with the University at Buffalo. His primary area of research is databases, although his research interests frequently cross over into programming languages and data structures. His work focuses on self-service analytics, making messy data, schema design, and physical layout decisions more approachable. Through real-world usage metrics gathered from industry collaborations and the use of real-world testbeds, his work aims to address the practical problems faced by data consumers everywhere. He is a member of the IEEE.

Shambhu Upadhyaya is a professor of computer science and engineering with the State University of New York at Buffalo, where he also directs the Center of Excellence in Information Systems Assurance Research and Education (CEISARE), designated by the National Security Agency. Prior to July 1998, he was a faculty member with the Electrical and Computer Engineering Department. His research interests include information assurance, computer security, and fault-tolerant computing. He has authored or coauthored more than 260 articles in refereed journals and conferences in these areas. His research has been supported by the U.S. National Science Foundation, the U.S. Air Force Research Laboratory, the U.S. Air Force Office of Scientific Research, DARPA, and the National Security Agency. He is a senior member of the IEEE.