by
Yabin Meng
School of Computing
Queen’s University
September, 2007
processing (OLTP) queries and large complex queries such as those typical of online
analytical processing (OLAP). OLAP queries usually involve multiple joins, arithmetic
operations, nested sub-queries, and other system or user-defined functions and they
typically operate on large data sets. These resource-intensive queries can monopolize the
database system resources and negatively impact the performance of smaller, possibly
that involves the decomposition of large queries into an equivalent set of smaller queries
and then scheduling the smaller queries so that the work is accomplished with less impact
DB2™ and present a set of experiments to evaluate the effectiveness of the approach.
Acknowledgments
for his great guidance and help over the years that I have pursued my education and
I would also like to thank Wendy Powley for her support. She has always been a
wonderful source of advice and suggestions. Without her great work, I could not have
finished my thesis so smoothly.
for their support. I would also like to acknowledge IBM Canada Ltd., NSERC, and
I would like to thank my lab mates and fellow students for their encouragement,
Finally I would like to thank my family for their love and understanding. Their
Table of Contents
Acknowledgments
5.2.3 Scenario 3: One Database, Shared Buffer Pool, Different Table Sets
5.2.4 Scenario 4: One Database, Shared Buffer Pool, Same Table Set
References
Appendix F: QEPs of TPC-H Q21 and Q22 from DB2’s Explain Utility
List of Tables
Table 10: E-Value for collected throughput data (Q22)
List of Figures
Figure 4: Segments and virtual nodes for QEP in Figure 3
Figure 16: DB2 optimized SQL statement for TPC-H Q21
Figure 17: Example of matching a query’s QEP with its Optimized SQL Statement
Figure 22: Scenario 2 – one database, separate buffer pools (Q21)
Figure 23: Scenario 2 – one database, separate buffer pools (Q22)
Figure 24: Scenario 3 – one database, shared buffer pools, different table sets (Q21)
Figure 25: Scenario 3 – one database, shared buffer pools, different table sets (Q22)
Figure 26: Scenario 4 – one database, shared buffer pools, same table set (Q21)
Figure 27: Scenario 4 – one database, shared buffer pools, same table set (Q22)
List of Equations
Chapter 1
Introduction
1.1 Motivation
The database management system (DBMS) has been very successful over the last half-
century. According to a 2006 IDC report by C. Olofson [1], the worldwide market for
DBMS software was about $15 billion in 2005 alone, with an estimated growth rate of
10% per year. DBMSs and database applications have become a
core component in most organizations’ computing systems. These systems are becoming
increasingly complex, and the task of managing them to ensure acceptable performance for
all applications is very difficult. In recent years, this complexity has approached a point
where even database administrators (DBAs) and other highly skilled IT professionals are
unable to comprehend all aspects of a DBMS’s day-to-day performance [29] and manual
initiative [29] [31]. An autonomic computing system is one that is self-managed in a way
the efforts towards autonomic DBMSs involve workload control, that is, controlling the
type of queries and the intensity of different workloads presented to the DBMS to ensure
the most efficient use of the system resources. One challenge involved in the
implementation of workload control is the handling of very large queries that are common
in data warehousing and online analytical processing (OLAP) systems. These queries are
crucial in answering critical business questions. They usually involve very complicated
SQL and access huge amounts of data in a database. When executed in a DBMS, they
tend to consume a large portion of the database resources, often for long periods of time.
The existence of these queries can dramatically affect overall database performance and
restrict other workloads requiring access to the DBMS. Our goal is to design a
1.2 Problem
a study conducted by Lyman and Varian [2], there were 5 exabytes (10^18 bytes) of “static”
information (in the form of paper, film, and magnetic and optical storage media) and another
internet, etc.) in the year 2002, with the growth rate estimated to be about 30% per year.
Ninety-two percent of the static information is stored on magnetic media, mostly on hard
disks. In order to effectively manage such large volumes of information, DBMSs have
been widely used, thus leading to an astonishing boost in the volume of data that a single
database must manage. According to the 2005 report of the “TopTen Program” by the
Winter Corporation [3], the world’s largest data warehouse in 2005 contained 100,386
Due to the high degree of competition within the business environment, more and
more companies are employing data warehousing and OLAP technologies to help the
“knowledge worker” (executive, manager, analyst, etc.) [4] make better and faster
decisions. Decision-support queries usually take very complex forms, including multiple
system- or user-defined functions. Moreover, they also operate over huge amounts of
data.
select s_name,
       count(*) as numwait
from supplier,
     lineitem l1,
     orders,
     nation
where s_suppkey = l1.l_suppkey
  and o_orderkey = l1.l_orderkey
  and o_orderstatus = 'F'
  and l1.l_receiptdate > l1.l_commitdate
  and exists (
        select *
        from lineitem l2
        where l2.l_orderkey = l1.l_orderkey
          and l2.l_suppkey <> l1.l_suppkey
      )
  and not exists (
        select *
        from lineitem l3
        where l3.l_orderkey = l1.l_orderkey
          and l3.l_suppkey <> l1.l_suppkey
          and l3.l_receiptdate > l3.l_commitdate
      )
  and s_nationkey = n_nationkey
  and n_name = '[NATION]'
group by
    s_name
order by
    numwait desc,
    s_name;
Figure 1: TPC-H Q21, an example of decision-support queries
Figure 1 shows one query, Query 21, of the TPC-H benchmark [5] which is a
Query 21 is one of the suite of business-oriented ad-hoc queries specified in the
benchmark and is used to “identify suppliers, for a given nation, whose product was part
of a multi-supplier order (with current status of 'F') where they were the only supplier
who failed to meet the committed delivery date” [5]. As we can see from Figure 1, this
query has a complex SQL statement including multiple joins among four different tables,
three of which are relatively large (lineitem, orders, and suppliers). It also includes
When a query like TPC-H Q21 is submitted to a high volume database for
execution, it tends to consume many of the physical database resources such as CPU,
buffer pool or disk I/O and/or the logical resources such as system catalogs, locks, etc.
The query may consume the resources for long periods of time, thus impacting other,
possibly more important, queries which may require these resources to complete their
The situation is made worse by the emerging trend of server consolidation and
shifting the functionality of several under-utilized servers onto one powerful server.
This trend on database servers means that one single database server must support very
servers. One direct consequence of this trend is that the DBMS must now be able to
handle multiple workloads with diverse characteristics, dynamic resource demands, and
delivery within a SOA [6]. The main purpose of SOM is to guarantee a differentiated
service delivery based on Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
In order to realize this goal, a DBMS must have the ability to dynamically control its
resource allocations for the queries that are submitted to it.
Figure 2 shows an example of the performance degradation experienced by a series
of small read-only OLTP-like queries due to the interference of a large query (TPC-H
Q21) in a DBMS. The throughput of the small queries in the system is monitored for
600 seconds and sampled every 20 seconds, with and without the large query running. It
can be seen from Figure 2 that when the large query is running (starting at the 3rd
sampling point and ending at the 13th sampling point), the average throughput of the
small queries drops dramatically to less than 50% of the original throughput.
This approach has two disadvantages. First, the large query is simply delayed and no
progress on that work is achieved. Second, in businesses with 24/7 availability there may
exist no time at which the large query will not interfere with other work. A more flexible
approach such as dynamically adjusting the DBMS resources of a running query, which
service environment.
query) is, however, not a trivial task. Ideally, low-level approaches, such as directly
assigning CPU cycles or disk I/O bandwidths to a query based on its complexity and/or
importance, are desirable. In practice, however, these approaches are problematic for two
reasons. First, running a query against a DBMS involves many different and interrelated
DBMS components. It is impossible to ensure that a query is treated equally (from the
determine the appropriate settings for the resource allocations for all the components.
The goal of this research is to investigate a high-level approach to controlling the impact
that the execution of large queries has on the performance of other workload classes in a
DBMS. Our approach divides a large query into an equivalent set of smaller queries and
Our work makes two main contributions. The first contribution is an original
method of breaking up a large query into smaller queries based on its access plan
structure and the estimated query cost information. The second contribution is a prototype
algorithm to break up queries, if necessary, and manages the execution of the queries
submitted to a DBMS.
background and related work. The core part of our work, the decomposition algorithm, is
Chapter 2
Very complex queries have gained plenty of research attention in online analytical
processing (OLAP) and data warehousing systems due to the emphasis on increasing
query throughput and decreasing response time in these systems [4]. On one hand, much
of the research focuses on minimizing the query completion time, or providing feedback
more quickly for the large query itself. In Section 2.1 we present some research efforts in
this area. On the other hand, how to reallocate DBMS resources to meet different quality
based on some pre-defined business objectives and policies, is attracting more and more
attention. Section 2.2 describes research efforts in this area. In both sections, we outline
how our work relates to these previous research efforts. In Section 2.3, we will briefly
present the general query decomposition technique that is commonly used in
distributed database systems and show how our decomposition algorithm is different from
that technique.
In recent years, users have come to expect more interactive control over running large
queries, such as those typical of OLAP and data warehousing, in a DBMS. The traditional optimization
techniques which are common in current database systems often fail to meet this new
requirement because of their inherent “batch mode” characteristics. This means that once
a large query is submitted to a DBMS, users have no control over its execution and they
often wait for a long period of time without any feedback until a precise answer is
returned.
feedback are proposed. Luo et al. [7] and Chaudhuri et al. [8] investigate the possibility of
providing an online progress indicator (percentage of the task that has completed) for
long-running large queries. In both approaches, the progress estimator works on the query
execution plan (QEP) that is chosen by the query optimizer for a given query. They differ
in their choice of the basic unit of query execution work. Luo et al. use one page of
bytes processed along the QEP as the basic unit. Chaudhuri et al. choose
one “GetNext()” call by the operators in the QEP as the basic unit. These techniques do
not shorten the execution time of the large queries themselves, but they can provide users
Haas et al. [9] propose a join algorithm, called Ripple Joins, for online multi-table
aggregation queries and Hellerstein et al. [10] investigate how to apply this new algorithm
in a DBMS to generate results more quickly. The underlying reasoning of their work
comes from the observation that, since large aggregation queries tend to give a general
picture of the data set, it is more appealing to provide users with estimated online
aggregation results together with a confidence interval indicating their proximity to the
final result. Their algorithm adopts the
statistical method of sampling from base relations in order to generate answers more
quickly. A major advantage of this approach is that it allows users to make a tradeoff
between the estimation precision and the updating rate. Their approaches do not
necessarily speed up the query execution itself. There may be some improvement,
however, by the replacement of a blocking join algorithm like hash join with the non-
blocking ripple join algorithm. Nevertheless, this is not the main objective of their work.
If appropriately used, materialized views (MVs) can provide performance
improvement in query processing time since a (large) portion of the final result is pre-
computed. The difficulty of using this approach, however, lies in how and when to exploit
the MVs. Goldstein et al. [11] present a fast and scalable view-matching algorithm for
determining whether part or all of a query can be computed from materialized views.
They also demonstrate an index structure, called a filter tree, to help speed up the search
for an appropriate view among the views maintained by a DBMS. This approach is very
attractive in a situation where system workloads are stable because in these systems we
are able to create useful MVs, that is, MVs with repeatable usage among different queries
the system’s workloads are diverse and ad hoc, it is impossible to do so, and therefore this
Kabra et al. [12] examine the possibility of dynamic memory reallocation for
physical operators within a QEP based on improved estimates of statistics. Most modern
algorithms for basic relational operators use DBMS statistics to estimate their memory
requirement which, in turn, determines the algorithms’ performance. In their work, Kabra
et al. propose a run-time statistics collection technique which can be used to help improve
the estimation of the database statistics. Their work involves the modification of a QEP
by inserting “Statistics Collector” operators at several points in the QEP. The collected
statistics can be used to obtain more accurate estimates for the remainder of the query or,
All the research efforts presented above mainly focus on increasing the performance
(or perceived performance) of a large query itself. They do not directly address the
problem of controlling the execution of large queries. However, the ideas presented
provide useful insights into our own research. First, our approach involves a
decomposition algorithm that works on the QEP of a query and tries to identify pipelined
parts within the QEP, just as the techniques used by Luo et al. [7] and Chaudhuri et al. [8] do.
Second, the ultimate goal of our work is not only to improve the performance of other
queries in the presence of large queries, but also to minimize the impact of our approach
on the large queries themselves. The techniques of providing answers more quickly
or speeding up the large query’s execution as presented by Haas et al. [9], Hellerstein et al.
[10], Goldstein et al. [11], and Kabra et al. [12] could therefore be helpful in satisfying
this purpose.
The problem of resource allocation within a DBMS is very complicated. The reason is
rooted in the inherent heterogeneity and multiplicity of the DBMS resources. A DBMS
contains not only the common physical resources, like CPU, memory, and disk I/O, but it
also contains many logical resources such as system catalogs, locks, etc. These resources,
either physical or logical, are often inter-related and interact with each other, thus further complicating the problem.
Traditionally, much of the work that has been done with regards to DBMS resource
allocation has been implemented through static tuning of database parameters in order to
optimize system wide performance. In recent years, with the emerging trend of server
consolidation, the increased complexity of a DBMS and the ongoing emphasis on service-
oriented management, a more dynamic and goal-oriented approach is attracting more
research interest.
DBMS. They develop a specific priority-based algorithm for managing the key physical
DBMS resources, especially the disk(s) and the buffer pool(s). Their simulation results
showed that “the objective of priority scheduling cannot be met by a single priority-based
scheduler”, which means that no matter whether the bottleneck of a DBMS is the CPU or
the disk, it is always essential to also use a priority-based replacement algorithm on the
buffer pool.
their work, they propose a feedback-based algorithm, called M&M, which adjusts DBMS
automatic way, to achieve a set of per-class response time goals for a multi-class complex
workload while leaving the largest possible left-over resources for the non-goal, or best-
effort classes. In their work, they adopt a per-class solution strategy, which means that, in
a given timeframe, the algorithm is only activated for one class and takes action for that
specific class in isolation. They use additional heuristics to compensate for the
Niu et al. [15] aim to optimize overall database resource usage by controlling the
workloads presented to it. In their work, a workload detection process is used to monitor
the characterization of the current workloads and to predict the future trends of the
workloads. Based on the classification of the workloads made by the workload detection
process, the workload control process is invoked to automatically adjust the multiprogramming
levels (MPLs) assigned to each class such that the SLO for each class is satisfied. Unlike the average
response time goal used by Brown et al. [14], Niu et al. use Query Velocity as the goal,
goals and the tuning policies in IT-friendly ways such as response time or throughput.
Although this allows the computer system to understand and control the workloads’
behavior easily, it makes it more difficult for the decision makers. Boughton et al. [16]
low-level system tuning policies using an economic model. The effectiveness of their
economic model is tested in the context of the buffer pool sizing problem in a DBMS.
Currently, commercial database systems also provide a certain level of support for
dynamic DBMS resource management. IBM DB2 Query Patroller [17] is a query
management system that aims to boost overall database system resource utilization. Using
Query Patroller, queries submitted to a DB2 database are grouped into different
categories based on their size and the submitters’ identities. Each query class can have its
own class-level policy (e.g. maximum number of queries allowed for each class). The
system in general can have a high-level system policy affecting all query classes (e.g.
Teradata’s Priority Scheduler (Teradata PS) [18] introduces the concepts of “user
group”, “performance group” and “allocation group”. A Teradata DBMS uses “user
groups” to classify the queries that are submitted by database users. It then establishes a
record. The performance group is a priority scheduler component that associates users with
“allocation groups” which, together with their predefined relative weights, determine the real
physical database resource usage such as the frequency of accessing CPU and the relative
DBMS resource management, most research work and current commercial products treat
extremely large queries in a static and somewhat “crude” way. A popular approach is to
adopt some kind of admission control mechanism to keep large queries out of the
system in advance and delay their execution until a system off-peak time. Our research
investigates an approach such that not only do other queries in the system have more
reasonable resource allocation, but the large queries themselves can be controlled in a
more flexible and manageable way. The “utility throttling” technique used by Parekh et al.
[19] for controlling the performance impact that a database administration utility has on
the system has similar goals to our work but adopts a different approach. Unlike our
approach which is implemented outside of the database engine and achieves the dynamic
control over a large query by breaking it into pieces, their approach is implemented
sleep for a while if the predefined workload objectives are not satisfied.
In a typical distributed database system, the data that needs to be accessed by a SQL
query usually resides on several inter-connected remote sites. In order to process these
remotely distributed data effectively and efficiently, new query processing techniques are
required. D. Kossmann [31] presents a high-level overview of the state of the art of query
most of the distributed DBMSs support a basic processing model of “moving query to
data”, which means that an “administrative” site (the site that receives the query) has to
break down the query somehow such that each sub-query, after being sent to a remote site,
only accesses the data that resides on that site. The purpose of decomposition here is
mainly to reduce the communication cost that is usually the dominant factor of the query
distributed DBMS environment highly depends on how the underlying data is partitioned
across the different sites. Our approach to query decomposition, on the other hand,
currently focuses on a centralized environment, and the decomposition method
depends solely on the structure and the operator cost distribution of a query’s execution
plan as suggested by the query compiler. The purpose of our method is also different from
that used in a distributed system. The intention of our approach is to control the resource
consumption by a large query so other queries, possibly more important, can get more
DBMS resources for their own execution. The goal of decomposition in a distributed
database, however, is to reduce the query processing cost and/or response time for the
query in consideration.
Chapter 3
Decomposition Algorithm
The goal of our work is to control the impact that the execution of large queries has on the
performance of other workload classes. Our approach to decomposing a large query into a
set of smaller queries is based on two observations. First, at any given time, a smaller
query will likely hold fewer resources than a large query and so, interferes less with other
parts of the workload. Second, running a large query as a series of smaller queries means
that all resources are released between queries in the series and so are available to other parts of the workload.
Unlike distributed database systems where queries are re-written to access data from
multiple sources, our approach focuses on breaking up a large query into an equivalent set of smaller queries.
The output of a query optimizer for a declarative query statement is called a Query
Execution Plan (QEP). The structure of a QEP determines the order of operations for
query execution. The QEP is typically represented using a tree structure where each node
represents a physical database operator (e.g. nested loop join, table scan etc). Multiple
plans may exist for the same query and it is a query optimizer’s top priority to choose an
optimal plan. To supplement the QEP, most query optimizers produce performance-
related information such as cost information, predicates, selectivity estimates for each
predicate and statistics for all objects referenced in the query statement.
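To make this representation concrete, a cost-annotated QEP can be modelled as a small tree structure. The Java sketch below is illustrative only; the class and field names are assumptions, not part of any actual optimizer output, and the accumulated-cost method anticipates the cost figures discussed in Section 4.2.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a QEP node: an operator, its estimated cost, and its children.
class PlanNode {
    final int id;                  // operator ID reported by the optimizer
    final String operator;         // e.g. "TBSCAN", "HSJOIN", "SORT"
    final double cost;             // this node's estimated cost (timerons in DB2)
    final List<PlanNode> children = new ArrayList<>();

    PlanNode(int id, String operator, double cost) {
        this.id = id;
        this.operator = operator;
        this.cost = cost;
    }

    // Accumulated cost of the sub-tree rooted at this node (cf. Section 4.2).
    double accumulatedCost() {
        double sum = cost;
        for (PlanNode child : children) {
            sum += child.accumulatedCost();
        }
        return sum;
    }
}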
Figure 3 shows an example QEP that we use for illustrative purposes throughout
this chapter. In this QEP, data from four different database tables (Tables A, B, C, and D)
are retrieved, filtered, joined, and then aggregated to create the desired final results (See
note that the plan structure shown in Figure 3 is only a conceptual structure and not an
actual plan from a query optimizer. It is used for illustrative purposes only.
3.2 Virtual Node, Segment, and CB-Segment
operators. An operator is blocking if it does not produce any output until it has consumed
at least one of its inputs completely. Pipelining operators produce outputs immediately
and continuously until all inputs have been processed. The hash join operator (Node 8) in
Figure 3, for example, is a blocking operator and the filter operators (Nodes 5 and 9) are
pipelining operators.
• Table Scan, Index Scan, Filter, Column Selection and Nested Loop Join are
pipelining operators
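As a rough illustration only, this classification could be encoded as a lookup over operator names. The sketch below is an assumption rather than the prototype's code; it uses DB2-style operator mnemonics and combines the operators named above with the DB2-specific rules given later in Section 4.4.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: classify plan operators as blocking or pipelining.
class OperatorKind {
    // Blocking: no output is produced until at least one input is fully consumed.
    private static final Set<String> BLOCKING = new HashSet<>(Arrays.asList(
            "SORT", "HSJOIN",        // sort, hash join
            "TEMP", "GENROW"));      // DB2-specific, treated as blocking (Section 4.4)

    // Pipelining: output is produced immediately and continuously.
    private static final Set<String> PIPELINING = new HashSet<>(Arrays.asList(
            "TBSCAN", "IXSCAN", "FILTER", "NLJOIN",   // table scan, index scan, filter, nested loop join
            "IXAND", "RIDSCN", "EISCAN"));            // DB2-specific, treated as pipelining (Section 4.4)

    static boolean isBlocking(String operator) {
        if (BLOCKING.contains(operator)) return true;
        if (PIPELINING.contains(operator)) return false;
        throw new IllegalArgumentException("unclassified operator: " + operator);
    }
}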
(see Section 3.3) between two segments and is implemented as a Table Scan node that
A Segment is a sub-tree of a QEP such that: (1) the root node of a segment must be
a blocking node or the return node of the original QEP, (2) a segment can have at most
one blocking node, and (3) all non-root nodes within a segment are pipelining nodes,
including virtual nodes. The definition of segment guarantees that any identified segment
[Figure 4: Segments and virtual nodes for the QEP in Figure 3. The operator tree is partitioned into segments rooted at its blocking nodes, and Virtual Nodes I, II, and III stand in for Segments I, II, and III respectively.]
Figure 4 shows the segments and virtual nodes for the QEP in Figure 3. In this
Figure, Virtual Nodes I, II, and III represent Segments I, II, and III respectively. Virtual
Node I creates a dependency relationship between Segment I and Segment III. Similarly,
dependency relationships are also established between Segments II and III by Virtual
Node II and between Segments III and IV by Virtual Node III (see Section 3.3 and
Section 3.4 for more details on segment dependency as well as the detailed segment
identification process).
A Cost-Based Segment (CB-Segment) is any valid sub-tree of a QEP. Unlike a
with cost information such as the total cost of the CB-Segment, or the cost percentage of
the CB-Segment over the total QEP cost. Cost is expressed in units adopted by a
particular DBMS. In DB2, for example, a unit called timeron is used (Appendix C). In
CB-Segments). In the rest of the thesis, unless explicitly stated, we use the term segment
this Figure, segment I and segment II can be merged together to create segment III or
segment III can be decomposed to create segment I and segment II. The merging process
requires that segment I depends on segment II, meaning that segment I has a virtual leaf
node that represents segment II (see Section 3.3 for segment dependency). When segment
I and segment II are merged together, the virtual node in segment I is removed and
segment II is added as a child sub-tree of the virtual node’s parent node (Node 1 in this
case) at the virtual node’s original position in segment I. The newly created tree becomes
the merged segment III. If multiple virtual nodes exist in a segment that needs to be
merged, then each virtual node is replaced by the segment represented by the virtual node.
This process is explained in Figure 5 in the direction from top to bottom (marked by the
solid line on the left). Similarly but for the reverse direction (from bottom to top and
marked by the dotted line on the right), when segment III needs to be decomposed, the
virtual node and thus creating segment I. During this procedure, a segment dependency
relationship between segment I and segment II is established (see Section 3.3 for detail).
According to the definitions in Section 3.2, a segment may contain virtual nodes as well
as other regular operation nodes. In our work, each virtual node within a segment is used
to represent another segment, which means that the outside segment depends on the
segment represented by the virtual node. If a segment does not include a virtual node,
segments because they contain no virtual nodes. Segment III includes two virtual nodes,
virtual nodes I and II, which represent segments I and II, respectively. Therefore, segment
III depends on both segments I and II. Similarly, segment IV depends on segment III
because it contains virtual node III, which represents Segment III. The dependency
In a QEP, if segment A depends on segment B, then the subpart of the QEP that is
because it needs the output of segment B in order to produce its own results. If segments
are independent of each other in a QEP, then they can be executed in any order or in
parallel.
In our work, we define the execution order of all the segments in a QEP as the
Segment Schedule for the QEP. Based on the segment dependency relationships in
Figure 6, the Segment Schedule for the QEP in Figure 3 can be one of the three cases
Figure 7: Segment schedule for QEP in Figure 3
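A minimal sketch of how such a schedule could be executed is given below: a segment runs only after every segment it depends on has finished, and independent segments may run in any order. The class names are hypothetical, and the optional pause parameter anticipates the one-minute-pause runs used in the experiments of Chapter 5.

import java.util.ArrayList;
import java.util.List;

// Sketch: run segments children-first so that dependencies are always satisfied.
class Segment {
    final String name;                               // e.g. "Segment I"
    final List<Segment> dependsOn = new ArrayList<>();
    Segment(String name) { this.name = name; }
}

class SegmentScheduler {
    // pauseMillis > 0 releases all resources between segments for a fixed time.
    static void execute(Segment seg, long pauseMillis) throws InterruptedException {
        for (Segment prerequisite : seg.dependsOn) {
            execute(prerequisite, pauseMillis);
        }
        System.out.println("running " + seg.name);   // the segment's SQL would be submitted here
        if (pauseMillis > 0) {
            Thread.sleep(pauseMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Dependencies for the QEP of Figure 3: III depends on I and II; IV depends on III.
        Segment s1 = new Segment("Segment I");
        Segment s2 = new Segment("Segment II");
        Segment s3 = new Segment("Segment III");
        Segment s4 = new Segment("Segment IV");
        s3.dependsOn.add(s1);
        s3.dependsOn.add(s2);
        s4.dependsOn.add(s3);
        execute(s4, 0);   // prints I, II, III, IV – one of the valid schedules
    }
}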
The algorithm takes two passes. The first pass identifies all possible segments in a QEP
by exploring its tree structure. During this pass, a bottom-up scan of the QEP is used to
search for blocking nodes. Each blocking node forms the root of a segment and its lower-
level descendants form the sub-tree. Once a segment is discovered, the sub-tree that it
represents in the original QEP is replaced by a virtual node, thus creating a new tree with
a virtual node as one of its leaves. This search-and-replace procedure continues until all
nodes in the original QEP are processed and all segments are identified. At the same time,
by means of the virtual nodes that are created during this pass, the dependency
relationships among the segments are also defined. A segment having a virtual node as a
A pseudo code description of the first pass of the decomposition algorithm is shown
query’s QEP. When the iterative process of this procedure is done, all segments in the
QEP as well as their dependency relationships are identified and stored in two global sets,
1 FindSegments (QEP) {
2 subTreeSet ← create an empty tree set;
9 REPEAT
10 curSubTree ← get next sub-tree from subTreeSet;
11 newSeg ← NULL;
12 IF the root node of curSubTree has only one input in the original QEP THEN
13 newSeg ← create a new segment that has the same structure as curSubTree;
14 ELSE
15 matchedSubTrees ←find other sub-trees that have the same root as curSubTree;
16 IF matchedSubTrees is not empty THEN
17 newSeg ← create a new segment by merging matchedSubTrees with curSubTree
such that shared nodes only appear once in the segment;
18 remove matchedSubTrees from subTreeSet;
19 ENDIF;
20 ENDIF;
27 REPEAT
28 call procedure “FindSegments(newQEP)”;
29 segRels ← create segment dependency relationships between any two segments that are
found in two consecutive iterations if they are connected by a virtual node;
30 add segRels into GSegRelSet;
31 UNTIL all nodes in the original QEP are processed
32 }
Figure 8: Decomposition algorithm – the first pass
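The key step of this pass, detaching the sub-tree rooted at a blocking node and leaving a virtual node in its place, could look roughly like the sketch below; the node structure and the method name are hypothetical.

import java.util.ArrayList;
import java.util.List;

// Sketch: replace a blocking-rooted sub-tree with a virtual node. The detached
// sub-tree becomes a new segment; the virtual node records the dependency and
// will later be executed as a table scan over a temporary table (Section 3.6).
class QepNode {
    final String operator;                  // e.g. "HSJOIN", "TBSCAN", or "VIRTUAL"
    final List<QepNode> children = new ArrayList<>();
    QepNode(String operator) { this.operator = operator; }
}

class SegmentExtractor {
    // Detaches the childIndex-th sub-tree of parent and returns it as the new segment.
    static QepNode detachSegment(QepNode parent, int childIndex, int virtualId) {
        QepNode segmentRoot = parent.children.get(childIndex);
        QepNode virtualNode = new QepNode("VIRTUAL-" + virtualId);
        parent.children.set(childIndex, virtualNode);   // leave the placeholder behind
        return segmentRoot;
    }
}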
The first pass of the algorithm creates a set of smaller queries such that pipelined
operations are never interrupted. It does not, however, take cost information into
resulting segments (sub-queries) may be more costly than others (see Section 3.5 for a
A “skewed” solution has two major drawbacks which make it impractical. The first
drawback is that some of the generated segments may themselves be large, costly queries.
Case A in Figure 9 shows this situation. In Figure 9, the percentage number beside each
segment represents the segment’s cost as a percentage of the total QEP cost. As we can see in
Case A, segment III covers 97% of the total cost while the other three segments together
cover the remaining 3%. In this situation, breaking a large query this way will not solve
our original problem. It is more reasonable to decompose segment III further, if possible.
[Figure 9: Two examples of skewed solutions, Case A and Case B, with each segment annotated with its cost as a percentage of the total QEP cost.]
The second drawback of a skewed solution lies in the possibility of unnecessary
execution overhead such as that shown in Case B in Figure 9. As will be seen in Section
3.6, our approach of decomposing a large query into multiple smaller queries incurs
and segment II each cover a very small portion of the total cost. If we implement this
solution, the intermediate results from these two segments are stored in temporary tables,
In order to overcome these drawbacks, the algorithm needs to be extended such that a
more cost balanced solution is reached. The second pass of the algorithm aims to
implement this goal. Figure 10 shows the pseudo code for this pass.
1 ReOrganizeSegments (QEP) {
2 call procedure “FindSegments(QEP)” to generate the global segment set “GSegSet” and the
global segment relationship set “GSegRelSet”;
8 REPEAT
9 minSeg ← find the smallest segment (having minimum cost) in GSegSet;
10 conSeg ← find the smallest segment that is connected to minSeg in GSegRelSet (i.e., it
either depends on minSeg or is depended on by minSeg);
11 newCBSeg ← merge minSeg and conSeg to create a new larger CB-segment;
12 update GSegSet such that minSeg and conSeg are removed and newCBSeg is added;
13 update GSegRelSet such that all segment dependency relationships that involve minSeg
and conSeg are modified correctly to involve newCBSeg instead;
14 curSKF ← calculate the “skew factor” for GSegSet;
15 UNTIL curSKF is within validSKFRange OR there is only one segment left in GSegSet;
The pseudo code in Figure 10 defines how a cost-balanced solution is reached by
merging the segments that are found by the first pass of the algorithm. It does not
always interrupts a pipelined operation. This is usually much less efficient in practice and
can incur excessive overhead. For this reason, our algorithm always tries to reach a cost-
generated to bring the situation to the attention of a DBA (or whoever else is running the query
decomposition). Upon receiving the message, this person may ignore it and simply treat
the large query as un-decomposable, or he/she may choose to manually inspect the nodes
as well as their cost information within each segment to determine whether or not to
break a segment further to reach a more cost balanced solution by interrupting a pipelined
operation. The rule of thumb for this manual process is that the smaller segments
identified should equally share the cost of the original large segment. In our algorithm, an
We note that it is not always possible to decompose a large query into a cost-
balanced solution. One common example of this situation is when the cost of a single
node (not a segment) covers the majority of the total cost because our algorithm does not
such a query. Such a case is shown in Figure 11. When applying the decomposition
algorithm to this sample QEP, only one segment is generated because all the operators in
the QEP are pipelining operators. Among all these operators, the Table Scan A node
(node 6) alone covers almost all of the total QEP cost (95%). In this situation, even with
human intervention we cannot find a cost-balanced solution. Moreover, applying the
algorithm detects the existence of such a case and provides appropriate feedback to a
DBA and/or the query submitter to indicate that the query cannot be decomposed.
The second pass of the decomposition procedure uses the “skew factor” (SKF) of a
Suppose that for a set of segments SEGS = {seg1, seg2… segn}, its related cost set is
COSTS = {cost1, cost2… costn}, in which cost1 is the cost value for seg1, cost2 is the cost
value for seg2, and so on. The SKF for SEGS measures how skewed SEGS is in terms of
COSTS. To put it another way, the SKF value for SEGS measures the variance of
COSTS. The higher the variance is, the higher the SKF value should be.
Equation 1 defines how the SKF value for SEGS can be calculated. In this equation,
VAR represents sample variance and cost’ is the average value of COSTS that is
calculated by equation 2.
SKF(SEGS) = VAR(COSTS) = [ Σ_{i=1..n} (cost_i − cost’)² ] / (n − 1)

Equation 1: Skew factor

cost’ = MEAN(COSTS) = ( Σ_{i=1..n} cost_i ) / n     (Equation 2)
In the equations above, the segment costs can be specified as either absolute values
(in whatever appropriate unit) or relative values. The relative cost value of a segment in
SEGS is defined as the percentage of the segment’s absolute cost value over the total
absolute cost value of all the segments in SEGS. The advantage of using relative cost
values in Equations 1 and 2 is that it normalizes the calculated SKF value to the range of
[0, 1]. Without the normalization, there is no way to easily specify a general threshold
SKF value that can be employed by the second pass of the decomposition algorithm to
find a cost-balanced solution. The SKF value calculated using Equations 3.1 and 3.2
would fluctuate widely depending on the query and how the query is decomposed.
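The calculation can be made concrete with a short sketch that first normalizes a set of absolute segment costs to relative values and then applies the sample-variance formula of Equation 1. The 0.07 value used below is the default administrative threshold discussed in the next paragraph; the example cost sets are made-up numbers.

// Sketch of the skew-factor computation (Equations 1 and 2) over relative costs.
class SkewFactor {
    static final double DEFAULT_THRESHOLD = 0.07;

    static double skewFactor(double[] absoluteCosts) {
        int n = absoluteCosts.length;
        double total = 0.0;
        for (double c : absoluteCosts) total += c;

        // Relative costs sum to 1, which keeps the SKF normalized.
        double[] relative = new double[n];
        for (int i = 0; i < n; i++) {
            relative[i] = absoluteCosts[i] / total;
        }

        double mean = 1.0 / n;                       // Equation 2 applied to relative costs
        double sumOfSquares = 0.0;
        for (double r : relative) sumOfSquares += (r - mean) * (r - mean);
        return (n > 1) ? sumOfSquares / (n - 1) : 0.0;   // Equation 1: sample variance
    }

    public static void main(String[] args) {
        // Roughly the Case A distribution of Figure 9 (97% vs. three tiny segments): skewed.
        System.out.println(skewFactor(new double[] {97, 1, 1, 1}) > DEFAULT_THRESHOLD);  // true
        // A nearly even two-segment split: cost balanced.
        System.out.println(skewFactor(new double[] {55, 45}) > DEFAULT_THRESHOLD);       // false
    }
}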
In our approach, the relative costs of segments are employed in calculating the SKF
value. The default administrative threshold value is set as 0.07, which corresponds to a
“30% vs. 70%” cost distribution in a 2-segment solution, meaning that a large query can
be decomposed into 2 smaller segments with one covering 30% of the total cost and
another covering 70% of the total cost. Any solution whose SKF value is greater than the
threshold value is considered a skewed solution by our algorithm and therefore needs
The decomposition algorithm breaks a large QEP into a set of inter-dependent smaller
segments and can form a segment schedule for the QEP. Following the schedule, the
execution of the set of generated segments will generate the same result as the original query.
There are two main problems that need to be solved. The first problem is how to
store the intermediate results of a segment so that dependent segments can make use of
the results. In our approach, we solve this problem by creating temporary database tables
to hold the intermediate results. The overhead resulting from this solution includes: 1) the
cost of creating empty temporary tables, 2) the cost of inserting intermediate results into
the temporary tables, and 3) the cost of retrieving the stored intermediate results from the
temporary tables. The overhead could be large, especially in cases where segments are
unavoidable due to the fact that our approach is implemented outside the database engine.
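A minimal JDBC sketch of this mechanism is given below. It assumes an already open DB2 connection; the table name, its column list, and the two segment SQL strings are hypothetical, and whether ordinary tables or declared temporary tables are used is an implementation detail that the text does not fix.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: materialize one segment's result in a temporary table so that the
// dependent segment can read it in place of its virtual node.
class SegmentMaterializer {
    // segment1Sql is assumed to be a SELECT whose columns match TMP_SEG1;
    // segment2Sql is assumed to reference TMP_SEG1 in its FROM clause.
    static void runTwoSegments(Connection con, String segment1Sql, String segment2Sql)
            throws SQLException {
        try (Statement st = con.createStatement()) {
            // 1) Cost of creating the empty temporary table.
            st.executeUpdate("CREATE TABLE TMP_SEG1 (ORDERKEY INTEGER, SUPPKEY INTEGER)");
            // 2) Cost of inserting the intermediate result.
            st.executeUpdate("INSERT INTO TMP_SEG1 " + segment1Sql);
            // 3) Cost of retrieving it again while running the dependent segment
            //    (its result set would be consumed here), then clean up.
            st.execute(segment2Sql);
            st.executeUpdate("DROP TABLE TMP_SEG1");
        }
    }
}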
However, techniques that are able to exploit advanced database optimizer information
The second problem relates to how to execute a segment in practice. So far in this
chapter, virtual nodes are considered as other regular physical nodes that could appear in a real QEP. In
our work, we use an approach similar to the one used by Venkataraman et al. [27] to
The basic step of the transformation is to traverse a QEP and translate each operator
encountered into a part (or several parts) of the resulting SQL statement. For example, a
filter node in a QEP can be translated into a “where” condition in a SQL statement. When
all the operators in the QEP are translated, we then assemble all the translated parts
together such that a syntactically correct SQL statement is generated. During
this process, we need to acquire additional information from the optimizer to complete the
transformation, e.g. the type of condition used in a filter. In our approach, virtual nodes
are treated as table scan nodes. The input table for the scan is an intermediate temporary
table which is used to hold the data produced by the segment (or sub-query) referred to by
the virtual node. A detailed discussion of the segment-to-statement procedure is deferred until Section 4.5, where the
3.7 Decomposition Argument
Proposition: Given a query Q, the decomposition algorithm produces a set of queries {Q1,
Q2, …, Qn} and a dependency graph G such that if the queries Q1, Q2, …, Qn are executed
Assumption:
We assume that during the decomposition procedure, all other workloads that could be
accessing the same tables used by the large query are read only queries. This means that
the data processed by both the large query (before the decomposition) and its equivalent
segment schedule (after the decomposition) are the same. Without this assumption, the
result equivalency of our approach cannot be guaranteed. Our proof below is based on
this assumption.
Argument:
Given a QEP for a query Q, we know each edge of the QEP corresponds to a relation that
is the result of execution of the source nodes of the edge. The decomposition algorithm
identifies segments that can be executed as the sub-queries Qis and replaces each segment
with a virtual node by placing the result of its Qi on the edge leaving that node. During
this process, the algorithm maintains the same operator sequence within each segment as
that in the original QEP for the query Q. The result of the set of replacements is a
dependency graph G.
The original QEP is a tree, so G is a tree with each node of G representing a sub-
query Qi. The execution of the Qis is determined by moving up G from the leaves such
that a node in G for a query Qj is only executed after its children, if any, have executed. G
maintains all dependencies in the original QEP so: (1) each Qi will receive the same input
as its corresponding segment in the QEP; (2) when all Qis are executed according to G,
the ordering of the operators encountered is the same as that in Q except some virtual
nodes along the execution paths which, however, do not change the result because they
just simply store the intermediate results for previous segments. {Q1, Q2, …, Qn} will therefore produce the same final result as the original query Q.
Chapter 4
Query Disassembler
decomposition using IBM DB2. It implements the decomposition algorithm and provides
a framework for managing the decomposition process and scheduling the execution of the
Figure 12 shows the Query Disassembler. Each large query is submitted to Query
Disassembler before it is executed by the DBMS (step 1). Query Disassembler calls
DB2’s Explain utility to obtain a (cost-augmented) QEP for the submitted query (steps 2
and 3). The decomposition algorithm then divides the QEP into multiple segments, if
possible, while keeping track of dependency relationships among the segments (steps 4
and 4’). The Segment Translation procedure transforms the resulting segments into
executable SQL statements (step 5), which are then scheduled for execution by the
Schedule Generation procedure (step 5’). The generated SQL statements are submitted to
the DBMS for execution as per the schedule that is obtained in step 5’ (step 6).
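Steps 2 and 3 could, for instance, be realized by issuing an EXPLAIN statement over JDBC and reading the populated explain tables back, as in the sketch below. The sketch assumes that the explain tables already exist for the connecting schema and that the usual EXPLAIN_OPERATOR columns are available; exact table and column names may differ between DB2 versions, and the tree edges would be read from EXPLAIN_STREAM in the same manner.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: explain a statement and list the operators of its QEP with their costs.
// Table/column names follow the standard DB2 explain tables and may vary by version.
class ExplainReader {
    static void explainAndPrint(Connection con, String largeQuerySql) throws SQLException {
        try (Statement st = con.createStatement()) {
            // Populate the explain tables for this statement.
            st.execute("EXPLAIN PLAN FOR " + largeQuerySql);

            // Read back the operators of the most recent explain, with cost estimates.
            String q = "SELECT OPERATOR_ID, OPERATOR_TYPE, TOTAL_COST, IO_COST, CPU_COST "
                     + "FROM EXPLAIN_OPERATOR "
                     + "WHERE EXPLAIN_TIME = (SELECT MAX(EXPLAIN_TIME) FROM EXPLAIN_OPERATOR) "
                     + "ORDER BY OPERATOR_ID";
            try (ResultSet rs = st.executeQuery(q)) {
                while (rs.next()) {
                    System.out.printf("%d %s total=%.1f io=%.1f cpu=%.1f%n",
                            rs.getInt(1), rs.getString(2),
                            rs.getDouble(3), rs.getDouble(4), rs.getDouble(5));
                }
            }
        }
    }
}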
Figure 12: Query Disassembler framework
If the decomposition algorithm determines that it is impossible to break up the
submitted large query, for example a single operator within the QEP for the large query
covers most of the total cost, Query Disassembler notifies an Exception Management
Module to handle this situation (step 7). The Exception Management Module is not
using an appropriate mechanism such as delaying the execution of the large query to an
Figure 13 shows the main Graphical User Interface (GUI) for Query Disassembler. The
left part of the GUI (Part I) lists the explained query instances and query statements that
are returned by the DB2 Explain utility [22]. Details of DB2 Explain are found in
Appendix C. Two SQL statements are shown for each explained SQL query. One is the
original SQL statement that is submitted by the user, and the other is the optimized SQL
statement that is suggested by the DB2 compiler as a result of applying the compiler’s
internal rewriting rules on the original statement. The optimized SQL statement is
executed more efficiently than the original query. In our work, the optimized SQL
statement is mainly used for translating segments into their equivalent SQL statements.
The right part of the GUI (Part II) shows the QEP for the explained query. This
QEP shows the estimated operational tree structure and also includes other useful
performance-related data, such as cost, node predicates, and so on. The QEP and its
related data are either directly provided by the DB2 Explain utility or calculated from that
information.
[Figure 13: The main GUI of Query Disassembler, with Part I (the explained queries) on the left and Part II (the QEP display) on the right.]
execution-cost value for the entire QEP and each of its internal nodes. The cost is
measured in a DB2-specific unit called timeron [24] and is further divided into sub-costs
that are directly linked with IO and CPU. The total cost for the entire QEP is shown at
The cost for each internal node of the QEP is expressed in the following ways. The
latter three costs are useful in the second pass of our decomposition algorithm to
• The node’s cost as a percentage of the cost of the entire QEP.
The accumulated cost value up to a node, say Node A, in a QEP refers to the total
cost of all the nodes in a sub-tree of the QEP rooted by Node A. The accumulated cost
percentage up to a node in a QEP is the accumulated cost value up to the node in the QEP
expressed as a percentage of the total cost of the QEP. The accumulated cost value as
well as the percentage up to a node is shown directly in the GUI on each node beneath the
node name and the node ID. Each node within the QEP as shown in Figure 13 is given a
When the user chooses to disassemble a QEP from the popup menu, he/she is given a
choice of using the decomposition algorithm to break up the tree automatically (the “By
Cost” option) or to disassemble the tree manually (the “Manual” option). The “cost”
used to decompose the QEP can be the IO-related cost, the CPU-related cost, or the
combined total cost depending on whether the large query to be decomposed is IO-
Section 3.4 to break up the query automatically by analyzing its QEP structure as well as
the related cost information. If a cost-balanced solution can be reached, then the GUI
pops up a window to illustrate how the QEP is decomposed by displaying the breakpoints,
that is, the node numbers above which the QEP is decomposed. A segment schedule
object (a Java object) is also created. The segment schedule object contains information
about what segments are decomposed from the large query and their execution order
(schedule). Section 4.3 gives a detailed explanation of this object. If the tree cannot be
The “Manual” option of the Query Disassembler allows a DBA to specify a list of
breakpoints that he/she thinks is appropriate for decomposition. The manual
Chapter 3, but it directly utilizes the specified node numbers to form segments and the
corresponding execution schedule. Similar to the “By Cost” option, a segment schedule
Figure 14 illustrates how the manual disassembly procedure works. The left part of
Figure 14 shows an example QEP and we suppose that the specified breakpoints are 3 and
4 (as shown by “X” marks in Figure 14). As shown in the right part of Figure 14, the first
step of the procedure creates two segments (segment I and segment II) such that each
segment is equivalent to a sub-tree of the QEP and has one of the breakpoints as its root
node. In the second step, two virtual nodes are created to replace the two segments in the
original QEP and form the third segment (segment III) which depends on both segment I
Figure 14: An example of manual disassembly procedure
Figure 15 shows the class diagram for the segment schedule object provided by our
program. In this diagram, Schedule is the core component for segment scheduling. It
includes one Query object that represents the original large query and a set of
ScheduleUnit objects, each of which stands for a single scheduling unit (for example, a
single segment) that is managed by the Schedule object. Each ScheduleUnit object has its
own Query object which represents one small query that is decomposed from the original
large query. A ScheduleUnit object is also used to create and populate the temporary
Figure 15: Segment schedule class diagram
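In code, the relationships of Figure 15 could be captured roughly as follows; the field names are hypothetical and only the relationships described above are modelled.

import java.util.ArrayList;
import java.util.List;

// Sketch of the Figure 15 structure: a Schedule owns the original large Query
// and an ordered list of ScheduleUnits, each wrapping one decomposed query.
class Query {
    final String sqlText;
    Query(String sqlText) { this.sqlText = sqlText; }
}

class ScheduleUnit {
    final Query smallQuery;          // one query decomposed from the original
    final String tempTableName;      // temporary table this unit creates and populates
    ScheduleUnit(Query smallQuery, String tempTableName) {
        this.smallQuery = smallQuery;
        this.tempTableName = tempTableName;
    }
}

class Schedule {
    final Query originalQuery;                            // the original large query
    final List<ScheduleUnit> units = new ArrayList<>();   // kept in execution order
    Schedule(Query originalQuery) { this.originalQuery = originalQuery; }
}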
Other than the common operators listed in Section 3.2, there are some IBM DB2 specific
operators, like CMPEXP and EISCAN, that can appear in an IBM DB2 QEP. Our
algorithm supports some of these operators. The supported DB2 specific operators are
The following heuristics define how the DB2 specific operators are handled by our
algorithm. This is supplementary to the rules defined in Section 3.2. If the QEP for a large
query contains unsupported DB2 operators, the query is not considered by our algorithm
• CMPEXP and PIPE (for debug usage) are not supported by our algorithm.
• TEMP (storing data in a temporary table) and GENROW (generates a table of rows,
using no input from tables, indexes, or operators) are supported and are treated
as blocking operators.
• IXAND (index ANDing), RIDSCN (row ID scan), and EISCAN (scans a user-defined
index to produce a reduced stream of rows) are supported and are treated as
pipelining operators.
following the general translation procedure described in Section 3.6. A common situation
is how to handle the DB2-specific ROWID predicates. A ROWID predicate is a predicate
that includes ROWID as an operand (ROWID is used by the DB2 compiler to directly
pinpoint a row rather than to go through the regular search procedure and therefore is
much more efficient). DB2, however, does not provide facilities to get ROWID in a
In our approach, when we encounter such a situation, we utilize the optimized SQL
statement that is provided by DB2 Explain as the source to the translation process. This
optimized statement is equivalent to the original query statement and consists of multi-
level nested sub-queries. After studying this version of a query we found that it is most
often amenable to our translation process. Figure 16 shows the optimized form for TPC-H
Q21.
select q10.$c0 as "s_name",
       q10.$c1 as "numwait"
from (
    select q9.$c0,
           count(*)
    from (
        select distinct q8.$c0
        from (
            select q7.$c6
            from lineitem as q1 right outer join (
                select distinct q3.l_orderkey,
                       q3.l_suppkey,
                       q2.s_name
                from supplier as q2, lineitem as q3,
                     orders as q4, nation as q5,
                     lineitem as q6
                where (q2.s_suppkey = q3.l_suppkey)
                  and (q4.o_orderkey = q3.l_orderkey)
                  and (q4.o_orderstatus = 'F')
                  and (q3.l_commitdate < q3.l_receiptdate)
                  and (q2.s_nationkey = q5.n_nationkey)
                  and (q5.n_name = 'SAUDI ARABIA')
                  and (q6.l_suppkey <> q3.l_suppkey)
                  and (q6.l_orderkey = q4.o_orderkey)
            ) as q7 on
                (q1.l_orderkey = q7.$c3)
                and (q1.l_suppkey <> q7.$c4)
                and (q1.l_commitdate < q1.l_receiptdate)
        ) as q8
    ) as q9
    group by q9.$c0
) as q10
order by
    q10.$c1 desc,
    q10.$c0;
Figure 16: DB2 optimized SQL statement for TPC-H Q21
In Section 3.4 we point out that any sub-tree within a query’s QEP can be viewed as
another QEP that corresponds to a smaller query contained in the original large query.
Therefore, from the QEP point of view, a query’s structure is also nested in multiple levels.
The similarity between a query’s QEP and its optimized SQL statement makes the
optimized SQL statement an excellent resource for the task of translating segments into their equivalent SQL statements.
Figure 17: Example of matching a query’s QEP with its Optimized SQL Statement
Figure 17 shows a simple example of how the matching process works. A rule of
thumb for this process is to match level by level. We start by matching the highest node in
a QEP to the outermost sub-query in the optimized SQL statement and continue until the
lowest possible node in QEP is matched to the innermost sub-query in the optimized SQL
statement. Although such a matching process works in DB2, we cannot guarantee that it
Chapter 5
Experiments
approach for controlling the execution of a large query. The computer system used is an
IBM xSeries® 240 machine with dual 1 GHz CPUs, four PCI/ISA controllers, and 17
Seagate ST 318436LC SCSI disks. We use IBM DB2 Version 8.2 as the database server.
5.1 Workload
The workload consists of a set of small read-only queries and one large query, which is
either the TPC-H Q21 or the TPC-H Q22 query. The small query set consists of eight
parameterized OLTP-like read-only queries (see Appendix E for detail). Each client
submits a random stream of these queries. The average response time for these queries is
typically less than half a second. We control the intensity of the workload by varying the
number of clients.
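A single client stream of this kind could be approximated by a sketch like the one below; the query text and the parameter range are placeholders, since the actual eight queries appear only in Appendix E.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Random;

// Sketch: one client stream repeatedly submitting a randomly parameterized
// read-only query; overall intensity is set by how many streams run at once.
class SmallQueryStream implements Runnable {
    private final Connection con;
    private final Random random = new Random();
    volatile boolean stopped = false;

    SmallQueryStream(Connection con) { this.con = con; }

    @Override
    public void run() {
        // Placeholder query and key range – not one of the queries in Appendix E.
        String sql = "SELECT O_ORDERSTATUS FROM ORDERS WHERE O_ORDERKEY = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            while (!stopped) {
                ps.setInt(1, 1 + random.nextInt(1000000));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* consume the small result */ }
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}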
Q21 is an IO-intensive query that accesses five different tables, four of which are
relatively large in size. Its SQL statement is complex and includes aggregation and sub-
queries. Q22 is a CPU-intensive query that accesses two different tables, including one
large table. Its SQL statement is less complicated than that of Q21, but, in addition to
We examined the QEPs of TPC-H queries 1 through 20 (Q1 – Q20) and found that
all are highly skewed under the current experimental database configuration so there is no
way to find a cost-balanced solution. Within each of the QEPs for this set of queries,
there is always a single Table Scan node that covers most of the total QEP cost (at least
90%). Q21 and Q22, however, are two queries that can be decomposed by the
running alone in our test-bed environment (no interference from any other query), Q21
takes about 60 seconds to run and Q22 takes about 30 seconds to run.
Using our algorithm, Q21 is broken into two smaller queries. The first query
accounts for approximately 70% of the total cost and the second covers the remaining
30%. Similarly, Q22 is also decomposed into two smaller queries that account for 60%
and 40% of the total cost, respectively. Unlike Q21, Q22 is decomposed such that a
pipelined operation is interrupted. Figure 18 shows the QEP for Q22 and illustrates how it
is decomposed. We do not show the QEP for Q21 here because it is too large to see
clearly (Appendix F shows the QEP anyway), but the process of decomposing it and how
Figure 18: QEP of Q22 and its decomposition
In Figure 18, the QEP for Q22 is divided just above node 7 (NLJOIN) marked by an
X, thus creating two segments – one is the sub-tree rooted at node 7 (segment I) and the
other is the QEP for Q22 with a virtual node (segment II) replacing the sub-tree rooted at
node 7. The cost estimates in Figure 18 show that segment I covers almost 60% of the
total QEP cost for Q22 and segment II takes the remaining 40%.
Figure 19: Workload generation class diagram
Figure 19 shows the class diagram of the workload generation in our experiments.
which, in turn, contains a set of Stream objects and a QuerySubmitter object. Each Stream
object has a unique stream ID and represents a single client that submits the OLTP-like
small queries. The number of Stream objects managed by the StreamManager object
controls the workload intensity of the small queries. The LargeQuerySubmitter object, a
sub-object of the QuerySubmitter object, is used to submit a large OLAP query to the
database without decomposing it. The ScheduleSubmitter object, another sub-object of the
QuerySubmitter object, is used to submit a Segment Schedule consisting of the sub-
queries that have been decomposed from the large query using the decomposition
algorithm. The integer interval parameter of this object controls the length of the pause
We conducted experiments to test the effectiveness of our approach under four scenarios
in which the large OLAP query causes different degrees of contention for resources.
• In scenario 1, the workloads run in two separate databases on the same system and just compete for system resources like CPU and IO.
• In scenario 2, the workloads run in the same database instance but use separate
buffer pools, which adds contention for general DBMS resources such as system
• In scenario 3, the workloads run in the same database instance and use the same
• In scenario 4, the workloads access the same tables. This adds contention for
Within each case, four types of throughput data for the small query set are
collected:
• Type2 – the small query set and a large query (before decomposition) run
simultaneously.
• Type3 – the small query set and a segment schedule (composed of the small
queries that were decomposed from the large query) run simultaneously.
• Type4 – the small query set and a segment schedule (composed of the small
queries that were decomposed from the large query) run simultaneously with a
one minute pause between executing the small queries contained in the
schedule. This data is used to confirm our observation that running a large
query as a series of smaller queries will release all resources between queries in
the series and so they are available to other parts of the workload.
used to calculate the confidence intervals for all experimental cases. Each run lasts for
600 seconds and is sampled every 20 seconds. Within each run, the small query set starts
its run 60 seconds earlier than the large query (or its corresponding query schedule)
and uses this time as a warm-up period. It runs continuously within each run and its
throughput is monitored. On the other hand, due to the interference of the small query set,
the execution time for the large query (or its corresponding query schedule) is
substantially prolonged and therefore it runs only once per run. The first run is considered
as a general database warm-up period, especially for the large query, and the results
collected during this run are therefore excluded for the final analysis. Figures 20 to 27
show the throughput data for the small query set for the four cases. The analysis of the
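The sampling itself can be pictured with the following sketch (class and field names are assumptions, not the thesis code): the Stream threads increment a shared counter whenever a small query completes, and a monitor records the throughput of each 20-second interval over the 600-second run.

import java.util.concurrent.atomic.AtomicLong;

public class ThroughputSampler {
    // Incremented by every Stream thread when one small query finishes.
    static final AtomicLong completedQueries = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        final int runSeconds = 600;
        final int sampleSeconds = 20;
        long previous = 0;
        for (int elapsed = sampleSeconds; elapsed <= runSeconds; elapsed += sampleSeconds) {
            Thread.sleep(sampleSeconds * 1000L);
            long total = completedQueries.get();
            double qps = (total - previous) / (double) sampleSeconds;
            System.out.println("sample " + (elapsed / sampleSeconds) + ": " + qps + " q/s");
            previous = total;
        }
    }
}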
To accommodate the different purposes of the four scenarios, two databases, Db_1
and Db_2, are used for our experiments. Db_1 has one user table space (Db1Ts_1) and
Db_2 has two user table spaces (Db2Ts_1 and Db2Ts_2). Within these table spaces, there
are four different sets of the standard TPC-H tables (TblSet1 to TblSet4) used in the
experiments. Table 1 shows the size and the location of these table sets.
The buffer pool size for each table space is scaled to 2% of the table space size. A
more detailed description of the buffer pool configuration is provided in Section 5.3.
Other key database parameters are left at their default values. No indexes, other than
the primary key indexes, are created on the database tables.
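As a rough illustration of the 2% rule (the exact sizes used are given in Section 5.3): a table space holding the 2 GB table set would be assigned a buffer pool of about 0.02 × 2048 MB ≈ 41 MB, while a table space one twentieth that size would receive about 2 MB, which is consistent with the 20:1 buffer pool ratio noted in Section 5.2.1.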
In our experiments, the large query (Q21 or Q22) always accesses the 2GB table set
(TblSet4) and the small query set may access any table set (TblSet1 – TblSet4) depending
on the experimental case. Table 2 shows which table sets are used in the various
experimental scenarios.
5.2.1 Scenario 1: Separate Databases
This scenario tests the effectiveness of our approach in a situation where a large query
competes with other queries for system resources such as CPU and disk I/O but does not
share database-specific resources such as locks and buffer pool memory. The small query
set and the large query run in two separate databases (Db_1 and Db_2 respectively). The
large query accesses large tables in TblSet4 whereas the small query set accesses small
tables in TblSet1. The size of the corresponding buffer pools is configured such that it is
proportional to the table space size, which means that the buffer pool size for the table
space Db2Ts_2 is 20 times as big as that of table space Db1Ts_1. The results of this
scenario are shown in Figures 20 and 21 when the large query is Q21 and Q22,
respectively.
Figure 20: Scenario 1 – separate databases (Q21)
Figure 21: Scenario 1 – separate databases (Q22)
Figures 20 and 21, as well as Figures 22 to 27 in the following sections, all confirm
that running a large query in a database has a significant impact on the performance of the
other workloads in that database. It can also be seen that, in this scenario, our
decomposition approach is unsuccessful: the throughput of the small query set is even
worse when the large query is decomposed, whether or not a one-minute pause is applied.
This is understandable, however, because in this scenario the large query and the small
query set compete only for operating-system-managed resources such as CPU and disk
I/O. The decomposition of a large query brings extra CPU and disk I/O overhead, and
there is little the DBMS can do to alleviate the resulting performance degradation.
5.2.2 Scenario 2: One Database, Separate Buffer Pools
Scenario 2 tests the effectiveness of our approach in a situation where a large query
competes with other queries for both CPU and I/O resources and general DBMS
resources such as catalogs and queues, but not for buffer pool memory. In this case, both
the large query and the small query set run in Db_2. The large query accesses large tables
in TblSet4 whereas the small query set accesses small tables in TblSet2. The buffer pool
size for table space Db2Ts_2 is the same as in scenario 1 and the buffer pool size for table
space Db2Ts_1 is the same as that for table space Db1Ts_1 in scenario 1. The results of this
case using Q21 as the large query are shown in Figure 22 and the results using Q22 are
shown in Figure 23.
Figure 22: Scenario 2 – one database, separate buffer pools (Q21)
Figure 23: Scenario 2 – one database, separate buffer pools (Q22)
Compared with scenario 1, the four types of throughput data for the small query set
are all worse in this scenario due to the added competition. Other than this, however, the
trends in the throughput data before and after decomposing the large query are similar to
those in scenario 1, and the decomposition approach still does not work well here.
5.2.3 Scenario 3: One Database, Shared Buffer Pool, Different Table Sets
Scenario 3 reflects the situation where a large query competes with other queries for all
physical database resources, namely CPU, disk I/O and memory. In this case, however,
there is no lock contention as the queries are accessing different sets of tables. Both the
large query and the small query set run in Db_2. The large query accesses large tables in
TblSet4 whereas the small query set accesses small tables in TblSet3. There is a single,
shared buffer pool and its size is configured to be the sum of the two buffer pool sizes used
in scenario 2. The results of this case using Q21 are shown in Figure 24 and Figure 25
shows the results using Q22.
Figure 24: Scenario 3 – one database, shared buffer pools, different table sets (Q21)
Figure 25: Scenario 3 – one database, shared buffer pools, different table sets (Q22)
Compared with scenarios 1 and 2, this scenario shows some improvement as a result
of the decomposition of the large query, especially between the 3rd and the 7th sampling
points when TPC-H Q22 is the large query. In Figure 25, both the Type3 and Type4 curves
show two obvious performance drops. These are most likely caused by the extra overhead
introduced by the decomposition approach. Similar trends exist in the other scenarios for
both Q21 and Q22, although they are less obvious for Q21.
5.2.4 Scenario 4: One Database, Shared Buffer Pool, Same Table Set
Scenario 4 reflects the situation where a large query competes with other queries for all
physical database resources including locks. Both the large query and the small query set
run in Db_2 and access the same large tables in TblSet4. There is only one buffer pool
involved in this case and its size is configured to be the same as that for Db2Ts_2 in
scenario 2. The results of this case using Q21 are shown in Figure 26 and the results using
Q22 in Figure 27. Compared with scenario 3, this scenario shows further improvement
brought by the decomposition of the large query.
Figure 26: Scenario 4 – one database, shared buffer pools, same table set (Q21)
Figure 27: Scenario 4 – one database, shared buffer pools, same table set (Q22)
We first observe that, as expected, decomposing the large queries causes significant
increases in their response times. Table 3 shows how the response times of Q21 and Q22
change when the queries are decomposed.
As we can see from Table 3, the response time for query Q21, which is an IO-
intensive query, increases an average of 20% over the four cases (162s normal execution
versus 195s for the decomposed query). The response time for query Q22, which is a
CPU-intensive query, increases an average of 87% over the four cases (67s normal
execution, 125s for the decomposed parts). The increased response time of the
decomposed queries is mainly due to the IO associated with the introduction of temporary
tables. This additional IO is more significant for the CPU-intensive query (Q22) than for
the IO-intensive query (Q21).
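These percentages follow directly from the measured times: $(195 - 162)/162 \approx 0.20$ for Q21 and $(125 - 67)/67 \approx 0.87$ for Q22.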
We also observe that increased contention for resources has a negative impact on
the throughput of the OLTP-like workload. Tables 4 and 5 show the average throughput
of the small query set over the “busy” period (different among experiment cases, but all
between sampling points 3 and 18) within each experimental scenario when the large
query is Q21 and Q22, respectively. Looking at the Type 1 column (small queries
running alone) in both tables, we see that the throughput of the small query set decreases
61% when the workloads are placed in the same database instance, decreases another 45%
when the workloads share the same buffer pool, and decreases another 7% when the
workloads access the same set of tables.
Tables 4 and 5 (column headings): Experimental Scenario, Type1 (q/s), Type2 (q/s), Type3 (q/s), Type4 (q/s)
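As a quick check using the Type 1 sample means reported in Table 9 (Appendix D), these drops work out to $(30.54 - 11.81)/30.54 \approx 61\%$, $(11.81 - 6.48)/11.81 \approx 45\%$, and $(6.48 - 6.00)/6.48 \approx 7\%$.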
Tables 4 and 5 also show that the throughput of the small query set is worse when
the large query is decomposed (Type3 and Type4) than when it is not decomposed
(Type2). This is mainly due to the extra I/O overhead incurred by our approach to write
temporary intermediate database tables.
The overhead itself depends on how many intermediate results are written and it is
unavoidable due to the fact that Query Disassembler is implemented outside the DBMS
engine.
We see that, in the cases where the large query and the other queries run in one
database and also share the same buffer pool, our approach works well. From Tables 4 and
5, we can see that the overall average throughput of the OLTP-like workload is only
slightly lower when the large query is decomposed than when it is not decomposed.
Moreover, this minor decrease is achieved in spite of the large, currently unavoidable
overhead introduced by our approach (the overhead and possible ways to reduce it are
discussed in Section 6.1). Furthermore, if we focus only on the interval between the 4th
and the 7th sampling points in Figures 25 and 27 (that is, assuming the overhead brought
by our approach could be reduced to a minimum), the Type 3 and Type 4 throughputs are
actually higher than the Type 2 throughput. This observation supports the claim that, at
any given time, a smaller query has less impact on the other queries in the system than a
large query.
We also note that the throughput for Type 4, which includes the one-minute delay
between the executions of the decomposed query parts, is better than that for Type 3,
where no delay is introduced. This confirms that, by running a large query as a series of
smaller queries, all resources occupied by the large query can be released between queries
in the series and thus become available to the other queries during that time.
Our approach does not help in situations like scenario 1 (separate databases) or
scenario 2 (one database, separate buffer pools). This, however, is expected. Carey et al.
[13] point out that “whether the system bottleneck is the CPU or the disk, it is essential
that priority scheduling on the critical resource be used in conjunction with a priority-
based buffer management algorithm”. In scenario 1 and scenario 2, the large query does
not compete for memory (buffer pool) with other queries. Therefore, our approach will
not make much difference in these two cases, even when there is no overhead involved.
The most surprising observation in our experiments comes from decomposing Q21
(Figures 20, 22, 24, and 26). Since our decomposition algorithm breaks both Q21 and Q22
into two smaller units, we expected that, when a one-minute pause is applied between
executing the two smaller units of Q21, the shape of the Type 4 curve in Figures 20, 22,
24, and 26 would be similar to that in Figures 21, 23, 25, and 27, respectively. To put it
another way, we expected a throughput increase during this one-minute period before the
throughput decreases again. In Figures 20, 22, 24, and 26, however, this trend is not
obvious.
The reason for this is subtle. In Section 5.1, we mentioned that our decomposition
algorithm breaks Q21 into two “70% and 30%” smaller parts, and breaks Q22 into two
“60% and 40%” smaller parts. In our experiments the actual execution for Q22 reflects
this 60-40 division, but the execution of Q21 does not. In reality, the first smaller part of
Q21 consumes most of the total execution time. How to take advantage of extra database
compiler information to detect this type of circumstance in advance is slated for future
work.
Chapter 6
Conclusions and Future Work
In this thesis we present an approach to managing the execution of large complex queries
in a database and thereby controlling their impact on other smaller, possibly more
important, queries. The approach is based on a decomposition algorithm that breaks up a
large query into an equivalent set of smaller queries, together with a scheduler that
submits these smaller queries to the database as a segment schedule.
6.1 Conclusions
Our experiments show that concurrent execution of large resource-intensive queries can
have significant impact on the performance of other workloads, especially as the points of
contention between the workloads increase. We conclude that there is a need to be able to
manage the execution of these large queries in order to control their impact.
The experiments show that our approach is viable, especially in cases when
contention among the workloads is high, for example when a large query and other
workloads run in the same database and share buffer pools. In other cases when the
competition is low (by “low”, we mean that the workloads do not share buffer pools), our
approach does not work well. In these cases, the performance degradation that is caused
by the overhead of our approach dominates and therefore makes our approach
impracticable.
In our approach, the major overhead comes from the cost of saving the intermediate
results that connect the decomposed queries. Specifically, these costs include those related
to creating, populating, accessing, and destroying the temporary tables needed to hold the
intermediate results. This overhead could be reduced substantially if the intermediate
results were handled inside the database engine.
Currently, because our approach is implemented outside of the database engine, we
have no choice but to use an expensive way to store the intermediate results: each
intermediate table is managed through ordinary SQL, such as a “CREATE TABLE”
statement, statements to populate and read the table, and a “DROP TABLE” statement. If
we had the ability to save the
intermediate results from inside a database engine, we could probably design a cheaper
and faster mechanism to save the intermediate results. A possible solution would be to
save the ROWID and COLUMNID information of a table instead of storing its real record
values. There are two main advantages of doing so. First, it can create a much smaller
intermediate table because the ROWID and COLUMNID information of a table record is
usually much smaller in size than the real record value. Second, it can also create a much
faster intermediate table because the DBMS could use the ROWID and COLUMNID
information to locate the corresponding records directly.
Another advantage of saving the intermediate results from inside the database engine
is that it would avoid the overhead caused by the DBMS following its full SQL processing
path for every statement that manipulates the temporary tables.
The experiments also show that our approach always causes performance
degradation for the large query itself and sometimes the reduction can be significant,
especially when the large query is decomposed in a way that a pipelined operation is
interrupted. One part of the degradation comes from the decomposition process itself and
another comes from creating, accessing, and deleting the intermediate tables. The first
type of degradation is unavoidable in our approach. We could, however, shorten the
overall delay by using more advanced techniques for saving the intermediate results, as
discussed above.
6.2 Future Work
Our work shows the feasibility and potential of managing the execution of large complex
queries through decomposition. Several issues remain for future work:
• Currently, the small query set in our experimental workload contains only read-only
queries. It would be interesting to also include update queries (INSERT, UPDATE, and
DELETE) in the workload. These queries tend to create more lock contention and would
exercise our approach under a more realistic mix of work.
• Our decomposition algorithm must translate each QEP segment back into an equivalent
SQL statement. This step is highly vendor-specific and has some limitations that are
inherent in our current approach due to the fact that it is implemented outside the database
engine.
• The approach to controlling the execution of a large query in our work is to decompose
the query, before it runs, based on its QEP. This approach is static and cannot handle all
types of large queries. It would be very attractive to investigate a more dynamic approach
that can adjust the decomposition during execution, so that the large queries that cannot
be handled by our current approach could also be managed.
• Our current approach relies solely on the DB2 compiler to provide the cost information
used for decomposition; in a sense, the DBMS screens system configuration changes on
the approach's behalf. However, it would be very interesting to investigate how our
approach could react to changes in the system configuration in a more active and informed
way. For example, if more CPUs are added to the system, our decomposition algorithm
could use that information to generate a more parallel segment schedule and submit
several segments concurrently.
References
http://www.oracle.com/corporate/analyst/reports/infrastructure/dbms/idc-201692.pdf.
http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/.
http://www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp.
Technology, ACM SIGMOD Record 26(1), March 1997, pp. 65- 74.
http://searchwebservices.techtarget.com/sDefinition/0,,sid26_gci929186,00.html.
Indicator for Database Queries, Proc. of the 2004 ACM SIGMOD Int. Conf. on
for SQL Queries, Proc. of the 1996 ACM SIGMOD Int. Conf. on Management of Data,
[9] P. J. Haas, J. M. Hellerstein. Ripple Joins for Online Aggregation, Proc. of the
1999 ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, U.S.A, June 1999,
[10] J. M. Hellerstein, P. J. Hass, H. J. Wang. Online Aggregation, Proc. of the 1997
ACM SIGMOD Int. Conf. on Management of Data, Tucson, U.S.A, June 1997, pp. 171 –
182.
Practical, Scalable Solution, Proc. of the 2001 ACM SIGMOD Int. Conf. on Management
Query Execution Plans, Proc. of the 1998 ACM SIGMOD Int. Conf. on Management of
the 15th Int. Conf. on Very Large Data Bases, Amsterdam, The Netherlands, August 1989,
Tuning for Complex Workloads, Proc. of the 20th Int. Conf. on Very Large Data Bases,
Autonomic DBMSs, Proc. of the 2006 Conf. of the Centre for Advanced Studies on
Policies for Distributed Systems and Networks, London, Canada, June 2006, pp. 13-22.
[17] IBM DB2 Query Patroller Guide: Installation, Administration, and Usage, IBM
online documentation,
ftp://ftp.software.ibm.com/ps/products/db2/info/vr82/pdf/en_US/db2dwe81.pdf
[18] C. Ballinger, Introduction to Teradata Priority Scheduler, July 2006,
http://www.teradata.com/library/pdf/eb3092.pdf.
August 2003.
Complex SQL Queries Using Automatic Summary Tables, Proc. of the 2000 ACM
SIGMOD Int. Conf. on Management of Data, Dallas, USA, June 2000, pp. 105 -116.
ftp://ftp.software.ibm.com/ps/products/db2/info/vr8/pdf/letter/nlv/db2tvb80.pdf
ftp://ftp.software.ibm.com/ps/products/db2/info/vr82/pdf/ko_KR/db2s1k81.pdf.
ftp://ftp.software.ibm.com/ps/products/db2/info/vr82/pdf/ja_JP/db2d3j81.pdf.
Universal DataJoiner, Proc. of the 24th Int. Conf. on Very Large Data Bases, New York
[27] H. Pirahesh, J. Hellerstein, W. Hasan. Extensible Rule-based Query Rewrite
Optimization in Starburst, Proc. of the 1992 ACM SIGMOD Int. Conf. on Management of
Starburst, Proc. of the 1989 ACM SIGMOD Int. Conf. on Management of Data, Portland,
[29] J.O. Kephart, D.M. Chess. The Vision of Autonomic Computing, IEEE Computers,
http://www.research.ibm.com/autonomic/overview/problem.html.
http://www-03.ibm.com/autonomic/pdfs/AC_BrochureFinal.pdf.
[32] D. Kossamann. The State of the Art in Distributed Query Processing, ACM
Glossary of Acronyms
MV Materialized View
Appendix A
TPC-H Benchmark
The TPC-H benchmark models a decision support system that runs business-oriented
ad-hoc queries against a standard database (containing a large volume of data) under
controlled conditions in order to give answers to critical business questions. It is designed
such that both the queries and the data reflect broad industry-wide relevance.
A TPC-H database contains eight base tables. The relationships between these
tables are illustrated in Figure 28. In Figure 28, the arrows point in the direction of one-
to-many relationships between tables. The parentheses following each table name define
the prefix of the column names for that table. For example, the real column name for the
name of a nation should be “N_NAME”. The number below each table name represents
the cardinality (number of rows) of the table. The SF in front of the number represents the
scale factor used to obtain a chosen database size. Taking the SUPPLIER table as an
example, an SF value of 5 means that the actual SUPPLIER table has 50,000 (5 × 10,000)
rows. A TPC-H database with an SF value of 1 (that is, with all TPC-H tables at their base
cardinalities) holds approximately 1 GB of raw data.
Figure 28: TPC-H schema [5]
TPC-H defines twenty-two decision support queries (Q1 to Q22). Table 6 lists the
business questions for which Q21 and Q22 provide answers. For the business questions
addressed by Q1 to Q20, please see the official TPC-H specification [5].
Table 6 (column headings): Query #, Business Question Description
Each query is defined by a template containing one or more substitution parameters, for
which substitution by a randomly selected real value is required. The TPC-H specification
defines the value range for each of the parameters involved and suggests a default value
for it for the purpose of query validation. Figure 29 shows the SQL statement template
for TPC-H Q22. The template for Q21 can be found in Figure 1 in Section 1.2:
select cntrycode,
       count(*) as numcust,
       sum(c_acctbal) as totacctbal
from (
       select substring(c_phone from 1 for 2) as cntrycode,
              c_acctbal
       from customer
       where substring(c_phone from 1 for 2) in ('[I1]','[I2]','[I3]','[I4]','[I5]','[I6]','[I7]')
         and c_acctbal > (
              select avg(c_acctbal)
              from customer
              where c_acctbal > 0.00
                and substring(c_phone from 1 for 2) in
                    ('[I1]','[I2]','[I3]','[I4]','[I5]','[I6]','[I7]') )
         and not exists (
              select *
              from orders
              where o_custkey = c_custkey )
     ) as custsale
group by cntrycode
order by cntrycode;
Figure 29: TPC-H Q22 statement (template)
Appendix B
Table 7 lists the common physical operators that can appear in a QEP, along with a simple
description of each.
Operator Description
Table Scan A TBSCAN operator retrieves the data of a database table by reading
(TBSCAN) all the required data directly from the data pages.
Index Scan An IXSCAN operator scans an index to produce a reduced stream of
(IXSCAN) data.
Filter A FILTER operator filters a stream of data based on the criteria
(FILTER) supplied by the filter predicates.
Column Selection A COLSEL operator selects the data for designated columns from a
(COLSEL) stream of data.
Nested Loop Join A NLJOIN operator joins two streams of data using the standard nested
(NLJOIN) loop join algorithm.
Distinct/Unique A UNIQUE operator eliminates duplicates from a stream of data.
(UNIQUE)
Sort A SORT operator sorts a data stream in the order of one or more of its
(SORT) columns, optionally eliminating duplicate entries.
Hash Join A HSJOIN operator joins two streams of data using the standard hash
(HSJOIN) join algorithm.
Merge-Sort Join A MSJOIN operator joins two streams of data using the standard
(MSJOIN) merge-sort join algorithm. A merge-sort join is also called a merge
scan join or a sorted merge join.
Union A UNION operator concatenates two data streams (having the same data
(UNION) structure) and retrieves all data from both streams.
Intersect An INTERSECT operator combines two data streams (having the same
(INTERSECT) data structure) and retrieves the data that are shared by both streams.
Except An EXCEPT operator combines two data streams (having the same data
(EXCEPT) structure) and retrieves the data from the first data stream that is not
contained in the second stream.
Aggregation/Group By A GRPBY operator groups data by common values of designated
(GRPBY) columns or functions. It is required to produce a group of values, or to
evaluate set functions.
Table 7: Common QEP Operators
Appendix C
IBM DB2 provides a facility called SQL Explain to allow a DBA to capture information
about the access plan that is chosen by the DB2 optimizer [24]. The information captured
includes: 1) the sequence of operations used to process the query; 2) cost information; 3)
selectivity estimates for each predicate; 4) statistics for all objects referenced in the SQL
statement; and 5) values for the host variables, parameter markers, or special registers.
This information helps a DBA understand how database tables and indexes are accessed
for a submitted query and evaluate performance tuning strategies.
DB2 uses a suite of explain tables to store the captured explain data, which can be
accessed in several ways. For example, the db2expln and dynexpln tools show the access
plan information for static SQL statements and for dynamic SQL statements without
parameter markers, respectively, and the explain tables can also be queried directly with
SQL.
Table 8 lists the relational tables that are provided by DB2 to store the explain
information; these are the tables used by our program to build up the QEP as well as its
related cost information.
Table 8 (column headings): Table Name, Description
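As an illustration of how these tables can be read (a minimal JDBC sketch, not the Query Disassembler code; it assumes the explain tables exist in the current schema and have already been populated for the statement of interest), the operators of an explained statement and their cumulative costs can be listed directly:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainOperatorReader {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:db2:tpch", "user", "pwd");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT OPERATOR_ID, OPERATOR_TYPE, TOTAL_COST "
                 + "FROM EXPLAIN_OPERATOR ORDER BY OPERATOR_ID")) {
            // Print each operator of the explained access plan with its
            // estimated cumulative cost (in timerons).
            while (rs.next()) {
                System.out.println(rs.getInt("OPERATOR_ID") + "  "
                    + rs.getString("OPERATOR_TYPE").trim() + "  "
                    + rs.getDouble("TOTAL_COST") + " timerons");
            }
        }
    }
}

The tree structure of the plan can then be rebuilt by joining this result with the EXPLAIN_STREAM table, whose SOURCE_ID and TARGET_ID columns record which operator (or base object) feeds which operator.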
All explain information as stored in the explain tables is organized around the
concept of an explain instance, which represents one invocation of the explain facility.
Each explain instance can contain the explain information for multiple SQL statements,
either static or dynamic. The information stored in the explain tables reflects the access
plans chosen by the optimizer for those statements at the time they were explained.
In addition to the operation sequence of an access plan, the explain facility also
captures cost information for each operator. The cost captured for an operator is an
estimated cumulative cost, from the start of access plan execution up to and including the
execution of that operator. The captured costs include:
• The total cost (in timerons).
• The cost (in timerons) of fetching the first row, including any initial overhead
required.
The unit of cost is the timeron, a DB2-specific relative cost unit. It does not map
directly to any actual unit of measure, such as response time or throughput, but gives a
rough relative estimate of the resources needed to execute an access plan.
Appendix D
In Section 5.3, we describe the experiment cases and the type of data to collect. There are
four different experimental scenarios and, within each scenario, four different data types. In
this appendix, for simplicity we use S1T1, S1T2, S1T3, S1T4, S2T1, S2T2, S2T3, S2T4,
S3T1, S3T2, S3T3, S3T4, S4T1, S4T2, S4T3, and S4T4 to name the 16 different types of
throughput data to be collected, in which S means the experiment scenario and T means
the data type. The number following S and T means the experiment scenario number and
the data type number, respectively. For each type of data, we compute the E-value

    E = z_{\alpha/2} \cdot \sigma / \sqrt{n}

meaning the maximum possible error between the sample mean and the population mean.
In this equation, n is the sample size, σ is the standard deviation of the sample, and
z_{\alpha/2} is the z-value for a confidence level of (1 − α) · 100%. In our experiment, n is
equal to 10 and the confidence level is 95%, so z_{\alpha/2} = 1.96.
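As a worked example using the S4T2 row of Table 9: $E = z_{\alpha/2}\,\sigma/\sqrt{n} = 1.96 \times 0.13 / \sqrt{10} \approx 0.08$, which gives the 95% confidence interval $5.06 \pm 0.08 = (4.98,\ 5.14)$ q/s.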
Tables 9 and 10 list the calculated E-values for the 16 different types of data when
the large query is TPC-H Q21 and Q22, respectively. It can be seen from these tables that,
with a confidence level of 95%, the maximum errors for the 16 types of throughput data
collected by our experimental method are all less than 0.4 queries per second, indicating
that the reported sample means are reliable.
Data    Sample MEAN (q/s)    Sample STDEV (q/s)    E-Value (95% Conf.) (q/s)    Conf. Interval of MEAN (95% Conf.) (q/s)
S1T1 30.54 0.30 0.18 (30.36, 30.72)
S1T2 27.58 0.56 0.34 (27.24, 27.92)
S1T3 27.05 0.47 0.29 (26.76, 27.34)
S1T4 27.14 0.40 0.25 (26.89, 27.39)
S2T1 11.81 0.26 0.16 (11.65, 11.97)
S2T2 10.27 0.30 0.19 (10.08, 10.46)
S2T3 9.55 0.28 0.17 (9.38, 9.72)
S2T4 9.82 0.29 0.18 (9.64, 10)
S3T1 6.48 0.41 0.25 (6.23, 6.73)
S3T2 6.22 0.36 0.22 (6, 6.44)
S3T3 6.14 0.33 0.21 (5.93, 6.35)
S3T4 6.12 0.38 0.24 (5.88, 6.36)
S4T1 6.00 0.18 0.11 (5.89, 6.11)
S4T2 5.06 0.13 0.08 (4.98, 5.14)
S4T3 4.90 0.13 0.08 (4.82, 4.98)
S4T4 4.92 0.15 0.09 (4.83, 5.01)
Table 9: E-Value for collected throughput data (Q21)
Appendix E
As part of our experimental workload, we run a small query set consisting of eight queries
which access the TPC-H tables. The templates for these queries are shown below. Within
each template, [?] denotes a parameter that is filled in when the query is submitted.
Query 1:
Query 2:
group by r.r_name
Query 3:
group by n.n_name
Query 4:
Template select c.c_mktsegment, count(c.c_custkey)
group by c.c_mktsegment
Query 5:
from part as p
group by p.p_container
Query 6:
from partsupp as ps
Query 7:
from orders as o
where [?] < o.o_orderkey and o.o_orderkey < [?] + 100000 and
o.o_orderpriority = '1-URGENT'
group by o.o_orderstatus
Query 8:
o.o_orderpriority = '1-URGENT'
group by l.l_orderkey
Appendix F
Figure 30 shows the QEP structure of TPC-H Q22 that is returned by DB2's Explain
Utility. This structure is very similar to the one shown in Figure 18 in Section 5.1, which is
the QEP structure of Q22 returned by Query Disassembler, except that the cost
information in Query Disassembler's structure is more versatile (e.g., relative cost,
cumulative cost, etc.), which makes it a better input for the decomposition algorithm.
Similarly, Figure 31 shows the QEP structure of TPC-H Q21 from DB2's point of
view. Unlike the case of Q22, we do not show its Query Disassembler structure because it
is too large to display clearly.
Figure 30 : QEP of TPC-H Q22 by DB2 Explain Utility
Figure 31 : QEP of TPC-H Q21 by DB2 Explain Utility