
Database Technologies

Query Processing and Optimization

Query Execution
Department of Computer Science and Engineering

Acknowledgement: Authors of the prescribed textbooks and materials sourced online.


1
Query Execution

List of Contents
- Query Execution
- Query Compilation
- Physical Query Plan - Operators
- The Computation Model for Physical Operators

Department of Computer Science and Engineering


2
DATABASE TECHNOLOGIES
Query Execution

• The query processor is a very important component of the DBMS that turns user queries and
data-modification commands into a sequence of database operations and executes those operations.

• The query processor determines the execution strategy.

3
DATABASE TECHNOLOGIES
Query Compilation

1. Parsing - A parse tree for the query is constructed


2. Query Rewrite - The parse tree is converted to an initial query plan,
which is an algebraic representation of the query.
3. Physical Plan Generation - The abstract query plan is then turned into
a physical query plan by selecting algorithms to implement each of the
operators of the logical plan and by selecting an order of execution for
these operators.

4
DATABASE TECHNOLOGIES
Query Compilation - Example

1. Select Name, Salary
From Employee
Where ID = 123;

2. Logical query plan: π NAME, SALARY (σ ID = 123 (EMPLOYEE))

3. Physical query plan:
Operators : Selection, Projection.

5
DATABASE TECHNOLOGIES
Query Compilation - Example

1. Select E.Name, E.Salary, D.Dept_Name
From Employee E, Department D
Where E.ID between 100 and 200 and
D.Dept_Name in (‘CSE’, ‘ME’);

2. Logical query plan:
π NAME, SALARY, DEPT_NAME ((σ ID >= 100 AND ID <= 200 (EMPLOYEE)) × (σ DEPT_NAME = ‘CSE’ OR DEPT_NAME = ‘ME’ (DEPARTMENT)))

3. Physical query plan:
Operators : Selection, Projection, Join
6
DATABASE TECHNOLOGIES
Physical Query Plan - Operators

Physical query plans are built from operators, each of which implements one step of the plan

Scanning - read the contents of a relation R. There are two approaches to locating the tuples of a
relation R.
1. Table scan – Read the blocks containing the tuples of R one by one from secondary storage
2. Index scan – If there is an index on any attribute of R, use this index to get all the tuples of R.

Sorting
The physical-query-plan operator sort-scan takes a relation R and a specification of the attributes
on which the sort is to be made and produces R in that sorted order
7
DATABASE TECHNOLOGIES
The Computation Model for Physical Operators

1. A physical query plan is composed of several physical operators


2. Need to estimate the “cost” of each operator used in a physical query plan
3. The number of disk I/O’s is used as a measure of cost for an operation
4. When comparing algorithms for the same operation, assume that the arguments of the
operation are in secondary storage and the result will be left in main memory.
5. Parameters considered for computing the cost.
1. The number of blocks B that are needed to hold all the tuples of R = B(R)
2. The number of tuples T in the relation R = T(R)
3. The number of distinct values that appear in a column a of a relation R = V (R,a)

8
DATABASE TECHNOLOGIES
The Computation Model for Physical Operators

I/O Cost for Scan Operators


• If relation R is clustered, then the number of disk I/O’s for the table-scan operator is
approximately B.
• If relation R is clustered, then the number of disk I/O’s for the sort-scan operator is
approximately B if the relation R can be housed in memory for sorting.
• If relation R is not clustered, then the number of disk I/O’s for the table-scan operator is
much higher than B. A table-scan for R may require reading as many blocks as there are
tuples of R, i.e., the I/O cost could be T.
• Similarly, if the relation R is not clustered, then the number of disk I/O’s for the sort-scan
operator could be T
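
As a rough illustration of the table-scan rule above, here is a minimal Python sketch (hypothetical, not from the slides) that returns the approximate scan cost from B(R), T(R) and a clustered flag:

# a minimal sketch of the table-scan cost rule; B = B(R), T = T(R)
def table_scan_cost(B, T, clustered):
    # roughly B disk I/O's when R is clustered, up to T (about one per tuple) when it is not
    return B if clustered else T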
9
DATABASE TECHNOLOGIES
The Computation Model for Physical Operators

Iterators for Implementation of Physical Operators


• Iterators support efficient execution when they are composed within query plans.
• When iterators are used, many operations are active at once.
• Tuples pass between operators as needed, thus reducing the need for storage.
• Not all physical operators support the iteration approach
• Many physical operators can be implemented as an iterator, which is a group of three methods that
allows a consumer of the result of the physical operator to get the result one tuple at a time
• The three methods forming the iterator for an operation are:
1. Open()
2. GetNext()
3. Close()
10
DATABASE TECHNOLOGIES
The Computation Model for Physical Operators

Iterators for Implementation of Physical Operators


Open()
{
b := the first block of R;
t := the first tuple of block b;
}

• This method starts the process of getting tuples, but does not get a tuple.
• It initializes any data structures needed to perform the operation and calls Open() for
any arguments of the operation
11
DATABASE TECHNOLOGIES
The Computation Model for Physical Operators

Iterators for Implementation of Physical Operators


GetNext()
{
    IF (t is past the last tuple on block b)
    {
        increment b to the next block;
        IF (there is no next block)
            RETURN NotFound;
        ELSE /* b is a new block */
            t := first tuple on block b;
    }
    /* now we are ready to return t and increment */
    oldt := t;
    increment t to the next tuple of b;
    RETURN oldt;
}

• This method returns the next tuple in the result and adjusts data structures as necessary to allow
subsequent tuples to be obtained.
• If there are no more tuples to return, GetNext() returns a special value NotFound
12
DATABASE TECHNOLOGIES
The Computation Model for Physical Operators

Iterators for Implementation of Physical Operators


Close()
{
}

• This method ends the iteration after all tuples, or after all tuples that
the consumer wanted have been obtained.
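
The three methods can also be illustrated with a small Python sketch (an assumed rendering of the table-scan iterator above; the relation is represented as a list of blocks, each block a list of tuples, and None stands in for NotFound):

class TableScan:
    def __init__(self, blocks):
        self.blocks = blocks              # the blocks of R on "disk"

    def open(self):
        self.b = 0                        # index of the current block
        self.t = 0                        # index of the current tuple within block b

    def get_next(self):
        # skip past exhausted (or empty) blocks
        while self.b < len(self.blocks) and self.t >= len(self.blocks[self.b]):
            self.b += 1
            self.t = 0
        if self.b >= len(self.blocks):
            return None                   # NotFound: no more tuples
        tup = self.blocks[self.b][self.t]
        self.t += 1
        return tup

    def close(self):
        pass                              # nothing to release in this sketch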

13
DATABASE TECHNOLOGIES
The Computation Model for Physical Operators

Iterators for Implementation of Physical Operators


Example R U S

Open()
{
R.Open();
CurRel := R;
}

14
DATABASE TECHNOLOGIES
The Computation Model for Physical Operators

Iterators for Implementation of Physical Operators


Example R U S
GetNext()
{
IF (CurRel = R)
{
t := R.GetNext();
IF (t <> NotFound) /* R is not exhausted */
RETURN t;
ELSE /* R is exhausted */
{
S.Open();
CurRel := S;
}
} /* here, we must read from S */
RETURN S.GetNext(); /* notice that if S is exhausted, S.GetNext() will return NotFound, which is the correct action for our GetNext as well */
}
15
DATABASE TECHNOLOGIES
The Computation Model for Physical Operators

Iterators for Implementation of Physical Operators


Example R U S

Close()
{
R.Close();
S.Close();
}
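
A Python sketch of the same R U S iterator, composed on top of the TableScan sketch shown earlier (assumed names; None again plays the role of NotFound):

class UnionScan:
    def __init__(self, r, s):
        self.r, self.s = r, s

    def open(self):
        self.r.open()
        self.cur = self.r                 # CurRel := R

    def get_next(self):
        t = self.cur.get_next()
        if t is not None or self.cur is self.s:
            return t                      # a tuple, or NotFound once S is also exhausted
        self.s.open()                     # R is exhausted: switch to S
        self.cur = self.s
        return self.s.get_next()

    def close(self):
        self.r.close()
        self.s.close()

For example, UnionScan(TableScan(r_blocks), TableScan(s_blocks)) can be opened and then driven by repeated get_next() calls until it returns None.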

16
THANK YOU
Query Execution
Department of Computer Science and Engineering

17
Database Technologies
Query Processing and Optimization

One Pass Algorithms


Department of Computer Science and Engineering

Acknowledgement: Authors of the prescribed textbooks and materials sourced online. 1


One Pass Algorithms

List of Contents
- One-Pass Algorithms for Tuple-at-a-Time Operations
- One-Pass Algorithms for Unary, Full-Relation
Operations (Duplicate Elimination, Grouping)
- One-Pass Algorithms for Binary Operations (Set Union,
Set intersection, Set Difference, Product, Natural Join)

Department of Computer Science and Engineering


2
DATABASE TECHNOLOGIES
Query Execution

• How should each of the individual steps of a logical query plan be


executed? — for example, a projection, join or selection
• The choice of algorithm for each operator is an essential part of the
process of transforming a logical query plan into a physical query plan
• Algorithms for operators are divided into three “degrees” of difficulty
and cost
i. One-Pass Algorithms - read data only once from disk
ii. Two-Pass Algorithms - read data first time from disk, process it, write
data to disk and again read it for further processing
iii. Multi-Pass Algorithms - read data from disk, process it, write data to disk
multiple times 3
DATABASE TECHNOLOGIES
Query Execution

One-Pass Algorithms for Tuple-at-a-Time Operations


• Tuple-at-a-time operations are σ(R) and π(R)
• Read the blocks of R one at a time into an input buffer
• Perform the operation on each tuple
• Move the selected tuples or the projected tuples to the output buffer
• The output buffer may be an input buffer of some other operator
• The cost for this process depends only on whatever it takes to perform a
table-scan or index-scan of R.
• The number of disk I/O’s is B if R is clustered and T if it is not clustered.
• We require only that M ≥ 1 for the input buffer, regardless of B.
4
DATABASE TECHNOLOGIES
Query Execution

One-Pass Algorithms for Unary, Full-Relation Operations


• Full-Relation operations are duplicate elimination (δ) and grouping (γ)
• Duplicate Elimination
• Read each block of R one at a time
• Consider each tuple from the block read, compare it with all tuples read already and
if it is not equal to any of those tuples, copy it to the output and add it to the in-
memory list of tuples
• Keep in memory one copy of every tuple read
• One memory buffer holds one block of R’s tuples
• The remaining M − 1 buffers are used to hold a single copy of every tuple read so far
• B(δ(R)) ≤ M – 1
• The number of disk I/O’s is B(R)
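
A minimal Python sketch of this one-pass duplicate elimination (assumed representation: R as a list of blocks of hashable tuples, with a set standing in for the M − 1 buffers of tuples seen so far):

def one_pass_distinct(blocks):
    seen = set()                      # must fit in M - 1 buffers, i.e. B(delta(R)) <= M - 1
    for block in blocks:              # one disk I/O per block of R
        for t in block:
            if t not in seen:
                seen.add(t)
                yield t               # copy the first occurrence to the output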
5
DATABASE TECHNOLOGIES
Query Execution

One-Pass Algorithms for Unary, Full-Relation Operations


• Grouping
• Create in main memory one entry for each group (for each value of the grouping attributes)
• Read each block of R one at a time
• For MIN(a) or MAX(a), record the minimum or maximum value respectively, change the recorded value if
necessary
• For COUNT(a), increment the value for the corresponding group
• For SUM(a) add the value of attribute a to the accumulated sum for its group
• For AVG(a) maintain two accumulations: the count of the number of tuples in the group and the sum of the values
of attribute a. After all tuples of R are read, compute the quotient of the sum and count to obtain the average.
• Write the tuple for each group
• The number of disk I/O’s needed is B
• The number of memory buffers required will usually be less than B.
6
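
A minimal Python sketch of one-pass grouping (assumed: tuples are dicts, grouping attribute g, aggregated attribute a, and the per-group entries fit in main memory); it accumulates SUM and COUNT and derives AVG at the end:

def one_pass_group_avg(blocks, g, a):
    acc = {}                                        # group value -> (sum, count)
    for block in blocks:                            # one disk I/O per block of R
        for t in block:
            s, c = acc.get(t[g], (0, 0))
            acc[t[g]] = (s + t[a], c + 1)
    # write one tuple per group: (group value, SUM, COUNT, AVG)
    return [(k, s, c, s / c) for k, (s, c) in acc.items()]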
DATABASE TECHNOLOGIES
Query Execution

One-Pass Algorithms for Binary Operations


• Binary operations: union, intersection, difference, product and join.
• Set Union – R U S
• Read S into M − 1 buffers of main memory and build a search structure whose search key is
the entire tuple
• All these tuples are also copied to the output
• Read each block of R into the Mth buffer, one at a time
• For each tuple t of R, check if t is in S, and if not, copy t to the output. If t is also in S, ignore t.
• min (B(R), B(S)) ≤ M - 1
• The number of disk I/O’s is B(R) + B(S)
7
DATABASE TECHNOLOGIES
Query Execution

One-Pass Algorithms for Binary Operations


• Set Intersection – R ∩ S
• Read S into M − 1 buffers of main memory and build a search structure whose search key is
the entire tuple
• Read each block of R into the Mth buffer, one at a time
• For each tuple t of R, check if t is in S. If yes, copy t to the output. If not, ignore t.
• min (B(R), B(S)) ≤ M - 1
• The number of disk I/O’s is B(R) + B(S)

8
DATABASE TECHNOLOGIES
Query Execution

One-Pass Algorithms for Binary Operations


• Set Difference R – S
• Read S into M − 1 buffers of main memory and build a search structure whose search key
is the entire tuple
• Read each block of R into the Mth buffer, one at a time
• For each tuple t of R, check if t is in S. If yes, ignore t. If not, copy t to the output buffer.
• min (B(R), B(S)) ≤ M - 1
• The number of disk I/O’s is B(R) + B(S)

9
DATABASE TECHNOLOGIES
Query Execution

One-Pass Algorithms for Binary Operations


• Set Difference S – R
• Read S into M − 1 buffers of main memory and build a search structure whose search key
is the entire tuple
• Read each block of R into the Mth buffer, one at a time
• For each tuple t of R, check if t is in S. If yes, delete the copy of t in main memory of S. If
not, ignore t.
• Copy the remaining tuples of S to the output buffer
• min (B(R), B(S)) ≤ M - 1
• The number of disk I/O’s is B(R) + B(S)

10
DATABASE TECHNOLOGIES
Query Execution

One-Pass Algorithms for Binary Operations


• Product - R x S
• Read S into M − 1 buffers of main memory
• Read each block of R into the Mth buffer, one at a time
• For each tuple t of R concatenate t with each tuple of S in main memory.
• Output each concatenated tuple as it is formed
• min (B(R), B(S)) ≤ M - 1
• The number of disk I/O’s is B(R) + B(S)

11
DATABASE TECHNOLOGIES
Query Execution

One-Pass Algorithms for Binary Operations


• Natural Join – R * S
• Consider R(X,Y) is being joined with S(Y,Z) where Y represents all the attributes that R and S have in common,
X is all attributes of R that are not in the schema of S and Z is all attributes of S that are not in the schema of R.
• Read all the tuples of S into M − 1 buffers of main memory and form a main-memory search structure with
the attributes of Y as the search key
• Read each block of R into the Mth buffer, one at a time
• For each tuple t of R, find the tuples of S that match with t on all attributes of Y using the search structure
• For each matching tuple of S form a tuple by joining it with t, and move the resulting tuple to the output
buffer
• min (B(R), B(S)) ≤ M - 1
• The number of disk I/O’s is B(R) + B(S) 12
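
A minimal Python sketch of this one-pass natural join (assumed: tuples are dicts, a single shared attribute y, and S small enough for its in-memory search structure to fit in M − 1 buffers):

from collections import defaultdict

def one_pass_join(r_blocks, s_blocks, y):
    index = defaultdict(list)                       # search structure on Y, built from S
    for block in s_blocks:                          # read S once
        for s in block:
            index[s[y]].append(s)
    for block in r_blocks:                          # read R one block at a time
        for r in block:
            for s in index.get(r[y], []):
                yield {**r, **s}                    # join r with each matching tuple of S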
DATABASE TECHNOLOGIES
Query Execution

Main Memory and Disk I/O Requirements for One-Pass Algorithms for Different Operations

13
THANK YOU
One Pass Algorithms
Department of Computer Science and Engineering

14
Database Technologies
Query Processing and Optimization

Two Pass Algorithms


Department of Computer Science and Engineering

Acknowledgement: Authors of the prescribed textbooks and materials sourced online.


1
Two Pass Algorithms

List of Contents
- Two Pass Algorithms based on Sorting (Two-Phase Multiway
Merge-Sort, Duplicate Elimination, Grouping & Aggregation,
Union, Intersection, Difference, Join, Merge-Join)
- Two Pass Algorithms based on Hashing (Partitioning Relations,
Duplicate Elimination, Grouping & Aggregation, Union,
Intersection and Difference)
- Saving some disk I/Os
Department of Computer Science and Engineering
2
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Two pass algorithms based on sorting


• Consider a relation R for which B(R) > M. Divide relation R into chunks of size M, sort them
and then process the sorted sublists in some fashion that requires only one block of each
sorted sublist in main memory at any one time
• Two-Phase Multiway Merge-Sort (TPMMS)
• Consider M main memory buffers to use for the sort.
• Phase 1 : Repeatedly fill the M buffers with new tuples from R and sort them using any
main-memory sorting algorithm. Write out each sorted sublist to secondary storage.
• Phase 2 : Merge the sorted sublists. For this phase to work, there can be at most M − 1
sorted sublists, which limits the size of R. Allocate one input block to each sorted sublist
and one block to the output. 3
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Phase 2 :
• Merge the sorted sublists into one sorted list with all the records as
follows.
• Find the smallest key among the first remaining elements of all the lists.
• Move the smallest element to the first available position of the output block
• If the output block is full, write it to disk and reinitialize the same buffer in main
memory to hold the next output block
• If the block from which the smallest element was just taken is now exhausted of
records, read the next block from the same sorted sublist into the same buffer
that was used for the block just exhausted. If no blocks remain, then leave its
buffer empty and do not consider elements from that list in any further
competition for smallest remaining elements.
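
A minimal Python sketch of TPMMS (assumed: the relation is a list of blocks, each a list of comparable tuples; lists stand in for disk, and heapq.merge plays the role of holding one block of each sorted sublist at a time):

import heapq

def tpmms(blocks, M, block_size):
    # Phase 1: repeatedly fill M buffers, sort in memory, write out each sorted sublist
    runs = []
    for i in range(0, len(blocks), M):
        tuples = sorted(t for b in blocks[i:i + M] for t in b)
        runs.append([tuples[j:j + block_size]
                     for j in range(0, len(tuples), block_size)])
    # Phase 2: merge the sorted sublists (requires len(runs) <= M - 1);
    # heapq.merge consumes each run lazily, one element at a time
    merged = heapq.merge(*[(t for b in run for t in b) for run in runs])
    out, block = [], []
    for t in merged:
        block.append(t)
        if len(block) == block_size:                # output block full: "write" it to disk
            out.append(block)
            block = []
    if block:
        out.append(block)
    return out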

4
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Two-Phase Multiway Merge-Sort Example :

(Figure: the blocks of R are sorted in memory into sorted sublists; one block of each sublist is then held in memory while the smallest remaining key is repeatedly moved to the output block.)

5
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Analysis of Two-Phase, Multiway Merge-Sort Algorithm


• For TPMMS to work, there must be no more than M − 1 sublists
• Suppose R fits in B blocks. Since each sublist consists of M blocks, the number of sublists is B/M
• Therefore, B/M ≤ M − 1, or B ≤ M(M − 1) (or about B ≤ M²).
• The algorithm reads B blocks in the first pass and another B disk I/O’s to write the sorted
sublists.
• The sorted sublists are each read again in the second pass, resulting in a total of 3B disk I/O’s
• If the sorted sublists are written to disk in the second pass, a total of 4B disk I/O’s are
performed

6
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Two-Phase, Multiway Merge-Sort Algorithm


• Consider Block Size = 64 KB = 2¹⁶ bytes
• Main Memory available = 1 GB
• Therefore M = 1 GB / 64 KB = 16,384 = 16K = 2¹⁴
• A table fitting in B blocks can be sorted as long as B is no more than (2¹⁴)² = 2²⁸
• A table can be sorted as long as its size is not greater than 2²⁸ * 2¹⁶ = 2⁴⁴ bytes

7
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Two Pass Duplicate Elimination Using Sorting


• Repeatedly fill the M buffers with new tuples from R and sort them using any main-memory
sorting algorithm. Write out each sorted sublist to secondary storage.
• In the second pass, use the available main memory to hold one block from each sorted sublist
and one output block
• Repeatedly select the first (in sorted order) unconsidered tuple t among all the sorted
sublists.
• Write one copy of t to the output and eliminate from the input blocks all occurrences of t.
• Thus, the output will consist of exactly one copy of any tuple in R; they will in fact be
produced in sorted order.
• The number of disk I/O’s performed will be 3 * B(R) 8
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Two Pass Duplicate Elimination Example :

(Figure: sorted sublists of R merged while each repeated tuple is written to the output only once.)

9
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Analysis of Two Pass Duplicate Elimination Algorithm


• For Two Pass Duplicate Elimination algorithm to work, there must be no more than M − 1
sublists
• Suppose R fits in B blocks. Since each sublist consists of M blocks, the number of sublists is B/M
• Therefore, B/M ≤ M − 1, or B ≤ M(M − 1) (or about B ≤ M²).
• To eliminate duplicates with the two-pass algorithm requires only √B(R) blocks of main
memory, rather than the B(R) blocks required for a one-pass algorithm.

10
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Grouping and Aggregation Using Sorting


• Repeatedly fill the M buffers with new tuples from R and sort them, using the grouping
attributes as the sort key. Write each sorted sublist to disk.
• Use one main-memory buffer for each sublist and initially load the first block of each sublist
into its buffer.
• Repeatedly find the least value of the sort key (grouping attributes) present among the first
available tuples in the buffers.
• Examine each of the tuples with sort key v, and accumulate the needed aggregates
• When there are no more tuples with sort key v available, output a tuple consisting of the
grouping attributes of L and the associated values of the aggregations computed for the
group 11
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Two Pass Aggregation Example :

(Figure: sorted sublists merged while a running count is kept per group key, ending with A=2, B=1, F=2, G=1, H=1, K=1, U=1, Z=1.)

12
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

A Sort-Based Union Algorithm – R U S


• Create sorted sublists from both R and S
• Use one main-memory buffer for each sublist of R and S. Initialize each with the first block
from the corresponding sublist
• Repeatedly find the first remaining tuple t among all the buffers.
• Copy t to the output and remove from the buffers all copies of t
• Each tuple of R and S is read twice into main memory, once when the sublists are being
created, and the second time as part of one of the sublists. The tuple is also written to disk
once, as part of a newly formed sublist.
• The cost of disk I/O’s is 3 * (B(R) + B(S)) .
• The total size of the two relations must not exceed M². That is, B(R) + B(S) ≤ M².
13
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Two Pass R U S example :

(Figure: sorted sublists of R and S merged into a single sorted, duplicate-free union.)

14
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Sort-Based Intersection
• Create sorted sublists from both R and S
• Use one main-memory buffer for each sublist of R and S. Initialize each with the first block
from the corresponding sublist
• Repeatedly find the first remaining tuple t among all the buffers.
• Copy t to the output if it exists in R and S and remove from the buffers all copies of t
• The cost of disk I/O’s is 3 * (B(R) + B(S)) .
• The total size of the two relations must not exceed M². That is, B(R) + B(S) ≤ M².

15
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Two Pass R ∩ S example :

(Figure: sorted sublists of R and S merged; only tuples appearing in both relations reach the output.)

16
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Sort-Based Difference
• Create sorted sublists from both R and S
• Use one main-memory buffer for each sublist of R and S. Initialize each with the first block
from the corresponding sublist
• Repeatedly find the first remaining tuple t among all the buffers.
• Copy t to the output if it appears in R but not in S (for R − S; for S − R, if it appears in S but
not in R) and remove from the buffers all copies of t
• The cost of disk I/O’s is 3 * (B(R) + B(S)) .
• The total size of the two relations must not exceed M². That is, B(R) + B(S) ≤ M².

17
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Two Pass R - S example :

(Figure: sorted sublists of R and S merged; tuples of R that do not appear in S reach the output.)

18
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Sort-Based Join
• Consider relations R(X,Y ) and S(Y,Z) to join and M blocks of main memory for buffers
• Sort R and S using 2PMMS with Y as the sort key
• Merge the sorted R and S using only two buffers: one for the current block of R and the other for the current block of S
by repeating the following
• Find the least value y of the join attributes Y that is currently at the front of the blocks for R and S.
• If y does not appear at the front of the other relation, then remove the tuple(s) with sort key y.
• Otherwise, identify all the tuples from both relations having sort key y. If necessary, read blocks from the sorted R
and/or S, until we are sure there are no more y’s in either relation. As many as M buffers are available for this
purpose.
• Output all the tuples that can be formed by joining tuples from R and S that have a common Y -value y.
• If either relation has no more unconsidered tuples in main memory, reload the buffer for that relation 19
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

• Two Pass R ⋈ S example :

(Figure: sorted sublists of R and S merged on the join attribute; matching tuples from R and S are joined and sent to the output.)

20
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Analysis of Sort-Based Join

• The number of disk I/O’s = 5 * (B(R) + B(S))


• Also, max(B(R), B(S)) ≤ M²
• In addition, all the tuples with a common value of the join attributes Y should fit in M blocks

21
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

A more efficient Sort-Based Join – “sort-merge-join”


• Consider relations R(X,Y ) and S(Y,Z) to join and M blocks of main memory for buffers
• Create sorted sublists of size M using Y as the sort key for both R and S.
• Bring the first block of each sublist into a buffer assuming there are no more than M sublists
in all.
• Repeatedly find the least Y value y among the first available tuples of all the sublists.
• Identify all the tuples of both relations that have Y -value y, using some of the M available
buffers to hold them
• Output the join of all tuples from R with all tuples from S that share this common Y -value.
• If the buffer for one of the sublists is exhausted, then replenish it from disk.

22
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Analysis of “sort-merge-join”
• The number of disk I/O’s is 3 * (B(R) + B(S))
• The sizes of the sorted sublists are M blocks and there can be at most M of them among the two lists

• B(R) + B(S) ≤ M²

Example of “sort-merge-join”
• Consider joining relations R and S of sizes 1000 and 500 blocks respectively using 101 buffers
• Divide R into 10 sublists and S into 5 sublists each of length 100 and sort them
• Use 15 buffers to hold the current blocks of each of the sublists. Use the remaining 86 buffers to store tuples in case
there are many tuples with a fixed Y value
• We need to do three disk I/O’s per block of data. Two to create the sorted sublists and one for the block of every
sorted sublist that is read into main memory one more time in the multiway merging process.
• The total number of disk I/O’s = 3 * (1000 + 500) = 4500
23
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Summary of sort based algorithms

24
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Two-Pass Algorithms Based on


Hashing

25
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Two-Pass Algorithms Based on Hashing


• If the data is too big to store in main-memory buffers, hash all the tuples of the argument
or arguments using an appropriate hash key.
• Select the hash key so all the tuples that need to be considered together when we
perform the operation fall into the same bucket
• Perform the operation by working on one bucket at a time in case of unary operations
and on a pair of buckets with the same hash value in case of binary operations.

26
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Partitioning Relations by Hashing:


• Goal: Using M buffers, partition a relation R into (M − 1) buckets of roughly
equal size
• h() is the hash function that takes complete tuples of R as its argument (all
attributes of R are part of the hash key)
• Associate one buffer with each bucket. The last buffer holds blocks of R, one
at a time.
• Each tuple t in the block is hashed to bucket h(t) and copied to the
appropriate buffer.
• If that buffer is full, write it out to disk and initialize another block for the
same bucket.
• At the end, write out the last block of each bucket if it is not empty.
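
A minimal Python sketch of this partitioning step (assumed: R as a list of blocks of hashable tuples, and in-memory lists standing in for the blocks written to disk):

def hash_partition(blocks, M, block_size, key=lambda t: t):
    buckets = [[] for _ in range(M - 1)]        # one in-memory buffer block per bucket
    partitions = [[] for _ in range(M - 1)]     # blocks "written to disk" for each bucket
    for block in blocks:                        # the M-th buffer holds one block of R at a time
        for t in block:
            i = hash(key(t)) % (M - 1)          # h(t): bucket number of tuple t
            buckets[i].append(t)
            if len(buckets[i]) == block_size:   # buffer full: write it out and start a new block
                partitions[i].append(buckets[i])
                buckets[i] = []
    for i, b in enumerate(buckets):             # finally, write out any non-empty last blocks
        if b:
            partitions[i].append(b)
    return partitions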

27
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Two Pass Hashing Example : M = 4

(Figure: with M = 4, each tuple of R is hashed into one of M − 1 = 3 buckets, each bucket having one in-memory buffer whose full blocks are written to disk.)

28
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Hash-Based Algorithm for Duplicate Elimination


• Hash R to (M − 1) buckets.
• Copies of the same tuple t will hash to the same bucket. Why?
• Examine one bucket at a time, perform duplicate elimination on that bucket in isolation.
• Use one-pass algorithm to eliminate duplicates from each Ri in turn (Ri is the portion of R that
hashes to the ith bucket) and write out the resulting unique tuples.
• This algorithm will work as long as the individual Ri’s are sufficiently small to fit in main
memory and thus allow a one-pass algorithm.
• Since the hash function h partitions R into equal-sized buckets, each Ri will be approximately
B(R)/(M − 1) blocks in size.
• The total number of Disk I/O’s = 3 * B(R). 29
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Hash-Based Algorithm for Duplicate Elimination Example : M = 4

(Figure: with M = 4, R is hashed into 3 buckets; duplicates are then eliminated within each bucket separately, one bucket at a time.)

30
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Hash-Based Algorithm for Grouping and Aggregation


• Hash R to (M − 1) buckets.
• Choose a hash function that depends only on the grouping attributes.
• Use the one-pass algorithm for aggregation to process each bucket in turn, provided B(R) ≤ M².
• In the second pass, need only one record per group as we process each bucket.
• The total number of Disk I/O’s = 3 * B(R).

31
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Hash-Based Algorithm for Union, Intersection and Difference


• Hash R and S to M − 1 buckets each.
• Since the operations are binary, use the same hash function to hash tuples of both R and S.
• Union: If a tuple t appears in both R and S, then for some i we shall find t in both Ri and Si. Take the
set-union of Ri with Si for all i and output the result.
• Intersection: Use one-pass algorithm to each pair of corresponding buckets and output the result.
• Difference: Use one-pass algorithm to each pair of corresponding buckets and output the result.
• The total number of Disk I/O’s = 3 * (B(R) + B(S)).
• Also, min (B(R), B(S)) ≤ M²

32
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Hash-Based Algorithm for Join


• Consider relations R(X, Y) and S(Y, Z) to join.
• Hash R and S to (M − 1) buckets each. Use the same hash function to hash tuples of both R and
S, considering only the join attributes.
• If tuples of R and S join, they will wind up in corresponding buckets Ri and Si for some i.
• Use one-pass join of all pairs of corresponding buckets.
• The total number of Disk I/O’s = 3 * (B(R) + B(S)).
• Also, min (B(R), B(S)) ≤ M²
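
A minimal Python sketch of this two-pass hash join, reusing the hash_partition and one_pass_join sketches given earlier (assumed: tuples are dicts and each bucket of the smaller relation fits in M − 1 buffers):

def hash_join(r_blocks, s_blocks, y, M, block_size):
    # first pass: partition both relations on the join attribute with the same hash function
    r_parts = hash_partition(r_blocks, M, block_size, key=lambda t: t[y])
    s_parts = hash_partition(s_blocks, M, block_size, key=lambda t: t[y])
    # second pass: one-pass join of each pair of corresponding buckets R_i and S_i
    for r_i, s_i in zip(r_parts, s_parts):
        yield from one_pass_join(r_i, s_i, y)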

33
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Saving Some Disk I/O’s


• There are opportunities to avoid writing some of the buckets to disk and then reading them again.
• Consider relations R(X,Y) and S(Y,Z) to join, S being the smaller one.
• Create k buckets, where k is much less than M, the available memory.
• When we hash S, we can choose to keep m of the k buckets entirely in main memory, while keeping
only one block for each of the other k − m buckets.
• We can manage to do so provided the expected size of the buckets in memory, plus one block for each
of the other buckets, does not exceed M. That is, m*B(S)/k + (k − m) ≤ M
• When the tuples of the other relation R are read, hash that relation into buckets, we keep in memory:
• The m buckets of S that were never written to disk and one block for each of the k−m buckets of R
whose corresponding buckets of S were written to disk.

34
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Saving Some Disk I/O’s


• If a tuple t of R hashes to one of the first m buckets, then immediately join it with all the tuples of the
corresponding S bucket, as if this were a one pass, hash join.
• If t hashes to one of the buckets whose corresponding S bucket is on disk, then t is sent to the main-
memory block for that bucket, and eventually migrates to disk, as for a two-pass, hash-based join.
• On the second pass, join the corresponding buckets of R and S.
• The savings in disk I/O’s is equal to two for every block of the buckets of S that remain in memory, and
their corresponding R-buckets.
• Since m/k of the buckets are in memory, the savings is 2(m/k) * ( B(R) + B(S) )
• Assuming that k is about B(S)/M and m = 1, the savings in disk I/O’s is 2M * ( B(R) + B(S) ) / B(S) and the
total cost is (3 − 2M/B(S)) * (B(R) + B(S)).

35
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Saving Some Disk I/O’s Example

• Consider relations R and S of 1000 and 500 blocks respectively using M = 101
• For hybrid hash-join, k = 500 / 101 = 5
• On average, each bucket will have 100 blocks of S
• To fit one of these buckets and four extra blocks for the other four buckets, we
need 104 blocks of main memory. Therefore, there is a chance that the in-memory
bucket may overflow
• So choose k = 6

36
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Saving Some Disk I/O’s Example


• For hashing S on the first pass, we have five buffers for five of the buckets and up to 96 buffers
for the one in-memory bucket, whose expected size is 500/6 = 83 blocks
• The number of disk I/O’s we use for S on the first pass is thus 500 to read all of S and 500 − 83 = 417
to write five buckets to disk.
• For hashing R on the first pass, we need to read all of R (1000 disk I/O’s) and write 5 of its 6 buckets.
• The number of disk I/O’s for R = (1000 * 5/6) = 833.
• On the second pass, we need to read all the buckets written to disk, or 417 + 833 = 1250 disk I/O’s.
• The total number of disk I/O’s = 1500 to read R and S + 1250 to write 5/6 of these relations + 1250
to read those tuples again = 4,000 disk I/O’s.
• Number of disk I/O’s in case of regular hash-join or sort join = 4,500.
37
DATABASE TECHNOLOGIES
Query Execution – Two pass algorithms

Summary of Sort-Based Algorithms
Summary of Hash-Based Algorithms

38
THANK YOU
Two Pass Algorithms
Department of Computer Science and Engineering

39
Database Technologies
Query Processing and Optimization

Buffer Management, Clustered Indexes


Department of Computer Science and Engineering

Acknowledgement: Authors of the prescribed textbooks and materials sourced online.


1
Database Technologies

Buffer Management, Clustered Indexes

List of Contents
- Buffer Management
- Buffer Management Strategies (Least Recently Used, First In
First Out, Clock Algorithm)
- Relationship Between Physical Operator Selection and Buffer
Management
- Index Scan
- Clustering and Non Clustering Indexes

Department of Computer Science and Engineering


2
DATABASE TECHNOLOGIES
Query Execution – Buffer Management

• The Buffer Manager manages the central task of making main-memory


buffers available to processes, such as queries
• It is the responsibility of the buffer manager to allow processes to get
the memory they need, while minimizing the delay
• When the buffer manager controls main memory directly and requests
exceed available space, it has to select a buffer to empty, by writing its
contents to disk.
• If the buffered block has not been changed, then it may simply be erased
from main memory
• Normally, the number of buffers is a parameter set when the DBMS is
3
initialized
DATABASE TECHNOLOGIES
Query Execution – Buffer Management

Buffer Management Strategies


The critical choice/decision that the buffer manager has to make is which block to
throw out of the buffer pool when a buffer is needed for a newly requested block.
• Least-Recently Used (LRU)
• Throw out the block that has not been read or written for the longest
time.
• The buffer manager maintains a table indicating the last time the block
in each buffer was accessed
• It also requires that each database access make an entry in this table, so
there is significant effort in maintaining this information 4
DATABASE TECHNOLOGIES
Query Execution – Buffer Management

• First-In-First-Out (FIFO)
• When a buffer is needed, the buffer that has been occupied the longest by the
same block is emptied and used for the new block
• The buffer manager needs to know only the time at which the block currently
occupying a buffer was loaded into that buffer. An entry into a table is made
when the block is read from disk, and there is no need to modify the table when
the block is accessed
• It requires less maintenance than LRU, but it can make more mistakes. A block
that is used repeatedly, say the root block of a B-tree index will eventually
become the oldest block in a buffer. It will be written back to disk, only to be
reread shortly thereafter into another buffer.
5
DATABASE TECHNOLOGIES
Query Execution – Buffer Management

The “Clock” Algorithm


• This algorithm is a commonly implemented, efficient approximation to
LRU
• Think of the buffers as arranged in a circle
• A “hand” points to one of the buffers, and will rotate clockwise if it
needs to find a buffer in which to place a disk block
• Each buffer has an associated “flag,” which is either 0 or 1.
• Buffers with a 0 flag are vulnerable to having their contents sent back to
disk; buffers with a 1 are not
• When a block is read into a buffer, its flag is set to 1.
• Likewise, when the contents of a buffer are accessed, its flag is set to 1.
6
DATABASE TECHNOLOGIES
Query Execution – Buffer Management

The “Clock” Algorithm (Cont’d)

• When the buffer manager needs a buffer for a new


block, it looks for the first 0 it can find, rotating
clockwise
• If it passes a 1, it sets it to 0.
• Thus, a block is only thrown out of its buffer if it remains unaccessed for the time it takes the
hand to make a complete rotation to set its flag to 0 and then make another complete rotation
to find the buffer with its 0 unchanged.
7
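
A minimal Python sketch of victim selection in the Clock algorithm (assumed representation: flags is the list of 0/1 flags, hand is the current hand position):

def clock_pick_victim(flags, hand):
    # returns (victim buffer index, new hand position)
    n = len(flags)
    while True:
        if flags[hand] == 0:            # found a vulnerable buffer: evict its block
            return hand, (hand + 1) % n
        flags[hand] = 0                 # passed a 1: set it to 0 and keep rotating
        hand = (hand + 1) % n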
DATABASE TECHNOLOGIES
Query Execution – Buffer Management

The Relationship Between Physical Operator Selection and Buffer Management


• The query optimizer will eventually select a set of physical operators that
will be used to execute a given query.
• This selection of operators may assume that a certain number of buffers M
is available for execution of each of these operators
• The buffer manager may not be able to guarantee the availability of these
M buffers when the query is executed.

8
DATABASE TECHNOLOGIES
Query Execution – Buffer Management

The Relationship Between Physical Operator Selection and Buffer Management


• Example: Nested-Loop Join
• The basic algorithm does not really depend on the value of M, although its performance depends on M.
• It is sufficient to find out what M is just before execution begins. It is even possible that M will change at
different iterations of the outer loop.
• Each time we load main memory with a portion of the relation S (the relation of the outer loop), we can
use all but one of the buffers available at that time;
• The remaining buffer is reserved for a block of R, the relation of the inner loop.
• The number of times we go around the outer loop depends on the average number of buffers available
at each iteration
• It may even happen that, in the first iteration, enough buffers are available to hold all of S, in which
case the nested-loop join gracefully becomes a one-pass join
9
DATABASE TECHNOLOGIES
Query Execution – Buffer Management

The Relationship Between Physical Operator Selection and Buffer Management


• Other algorithms also are impacted by the fact that M can vary and by the buffer-
replacement strategy used by the buffer manager
• Sort-Based algorithms
• If M shrinks, we can change the size of a sublist, since the sort-based algorithms do not depend on the
sublists being the same size. The major limitation is that as M shrinks, we could be forced to create so
many sublists that we cannot then allocate a buffer for each sublist in the merging process.

• Hash-Based algorithms
• We can reduce the number of buckets if M shrinks, as long as the buckets do not then become so large
that they do not fit in allotted main memory. However, these algorithms cannot respond to changes in
M while the algorithm executes.

10
DATABASE TECHNOLOGIES
Query Execution – Indexes

Index Scan
If there is an index on any attribute of R, we may be able to use this index to get
all the tuples of R.
For example, a sparse index on R can be used to lead us to all the blocks holding R,
even if we don’t know otherwise which blocks these are.
This operation is called Index-scan.

The important observation is that the index can be used not only to get all the tuples of the relation
it indexes, but also to get only those tuples that have a particular value (or sometimes a particular
range of values) in the attribute or attributes that form the search key for the index.

11
DATABASE TECHNOLOGIES
Query Execution – Indexes

Clustering and Non Clustering Indexes


• A relation is “clustered” if its tuples are packed into roughly as few blocks as can
possibly hold those tuples.
• All the analyses we have done so far assume that relations are clustered.
• Clustering indexes are indexes on an attribute or attributes such that all the tuples with a fixed
value for the search key of this index appear on roughly as few blocks as can hold them.
• A relation that isn’t clustered cannot have a clustering index, but even a clustered relation can
have nonclustering indexes.
12
DATABASE TECHNOLOGIES
Query Execution – Indexes

Clustering and Non Clustering Indexes


• A relation that isn’t clustered cannot have a clustering index, but even a clustered relation can
have nonclustering indexes.
0 0 :: Non Clustered Relation => Non Clustered Index ✔OK

0 1 :: Non Clustered Relation => Clustered Index 🗶 Not OK

1 0 :: Clustered Relation => Non Clustered Index ✔OK

1 1 :: Clustered Relation => Clustered Index ✔OK


13
DATABASE TECHNOLOGIES
Query Execution – Indexes

Clustering and Non Clustering Indexes


Example:
• A relation R(a, b) that is sorted on attribute a and stored in that order, packed into blocks,
• is surely clustered.
• An index on a is a clustering index, since for a given a-value a1, all the tuples with that value
• for a are consecutive. They thus appear packed into blocks, except possibly for the first
• and last blocks that contain a-value a1, as suggested in the Fig.
• However, an index on b is unlikely to be clustering, since the tuples with a fixed b-value will
• be spread all over the file unless the values of a and b are very closely correlated.

Figure: A clustering index has all tuples with a fixed value packed into (close to) the minimum
possible number of blocks

14
THANK YOU
Buffer Management, Clustered Indexes
Department of Computer Science and Engineering

15
Database Technologies
Query Processing and Optimization

Query Compiler
Department of Computer Science and Engineering

Acknowledgement: Authors of the prescribed textbooks and materials sourced online. 1


Query Compiler

List of Contents
- Query Parsing and Preprocessing
- Syntax Analysis and Parse Trees
- Grammar for a Simple Subset of SQL
- Preprocessor
- Preprocessing Queries Involving Views

Department of Computer Science and Engineering


2
DATABASE TECHNOLOGIES
The Query Compiler

Query Parsing and Preprocessing


1. Parser - Parser takes text written in a language such as SQL and converts it to
a parse tree
2. Preprocessor - Responsible for semantic checking. The preprocessor
replaces views in a parse tree by the objects which represent how the view is
constructed from the base tables.
3. Logical query plan generator – Converts parse trees into logical query plans
4. Query rewriter – Rewrites logical query plans and selects the preferred
logical query plan.

3
DATABASE TECHNOLOGIES
The Query Compiler – Parser

Syntax Analysis and Parse Trees


• Parser - Parser takes text written in a language such as SQL and converts it to a Parse
Tree
• The nodes of a parse tree correspond to
1. Atoms - lexical elements such as keywords (e.g., SELECT), names of attributes or
relations, constants, parentheses, operators and other schema elements
2. Syntactic categories - which are names for families of query subparts that all play a
similar role in a query such as group by, order by, having
• If a node is an atom, then it has no children
• If a node represents a syntactic category, then its children are described by one of the
rules of the grammar for the language 4
DATABASE TECHNOLOGIES
The Query Compiler – Parser

Example:
• Consider the following relations
StarsIn (movieTitle, movieYear, starName)

MovieStar (name, address, gender, birthdate)

• Find the movies with stars born in 1960

5
DATABASE TECHNOLOGIES
The Query Compiler – Parser

Syntax Analysis and Parse Trees

SELECT movieTitle
FROM StarsIn
WHERE starName IN
( SELECT name FROM MovieStar
WHERE birthdate LIKE ’%1960’ );

(Figure: the parse tree of this query; syntactic categories label the intermediate nodes and atoms label the leaf nodes.)

6
DATABASE TECHNOLOGIES
The Query Compiler – Parser

Grammar for a Simple Subset of SQL


• The syntactic category <Query> is intended to represent (some of the) queries of SQL
• <Query> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
• The syntactic categories <SelList> and <FromList> represent lists that can follow SELECT and
FROM respectively
• The syntactic category <Condition> represents SQL conditions

7
DATABASE TECHNOLOGIES
The Query Compiler – Parser

Example:
• Consider the following relations
StarsIn (movieTitle, movieYear, starName)

MovieStar (name, address, gender, birthdate)

• Find the movies with stars born in 1960


SELECT …

8
DATABASE TECHNOLOGIES
The Query Compiler – Parser

Example:
• Consider the following relations
StarsIn (movieTitle, movieYear, starName)

MovieStar (name, address, gender, birthdate)

• Find the movies with stars born in 1960


SELECT movieTitle
FROM StarsIn
WHERE starName IN
( SELECT name
FROM MovieStar
WHERE birthdate LIKE ’%1960’ );

9
DATABASE TECHNOLOGIES
The Query Compiler – Parser

Example:
• Consider the following relations
StarsIn (movieTitle, movieYear, starName)

MovieStar (name, address, gender, birthdate)

• Find the movies with stars born in 1960


SELECT movieTitle
FROM StarsIn, MovieStar
WHERE starName = name AND
birthdate LIKE ’%1960’;

10
DATABASE TECHNOLOGIES
The Query Compiler – Preprocessor

Preprocessor
• The preprocessor is also responsible for semantic checking
• Semantic rules
1. Check relation uses - Every relation mentioned in a FROM-clause must be a relation or
view in the current schema
2. Check and resolve attribute uses - Every attribute that is mentioned in the SELECT or
WHERE clause must be an attribute of some relation in the current scope.
3. Check types - All attributes must be of a type appropriate to their use. Operators are
checked to see that they apply to values of appropriate and compatible types. For example, the
attribute birthdate in the earlier query can be treated as a string, hence its use with LIKE is valid.
11
DATABASE TECHNOLOGIES
The Query Compiler – Preprocessor

Preprocessing Queries Involving Views


• When an operand in a query is a view, the preprocessor needs to replace the operand by a
piece of parse tree that represents how the view is constructed from the base tables
• Consider a query Q whose relational-algebra expression tree uses the views V and W as operands
• To form the query over base tables, substitute for each leaf in the tree for Q that is a view,
the root of a copy of the tree that defines that view
• The resulting tree is a query over base tables that is equivalent to the original query using
views

12
DATABASE TECHNOLOGIES
The Query Compiler – Preprocessor

Preprocessing Queries Involving Views Example


View:
CREATE VIEW ParamountMovies AS
SELECT title, year
FROM Movies
WHERE studioName = ’Paramount’;
Query:
SELECT title
FROM ParamountMovies
WHERE year = 1979;
13
DATABASE TECHNOLOGIES
The Query Compiler – Preprocessor

Preprocessing Queries Involving Views Example


CREATE VIEW ParamountMovies AS
SELECT title, year
FROM Movies
WHERE studioName = ’Paramount’;

Query:
SELECT title
FROM ParamountMovies
WHERE year = 1979;
14
DATABASE TECHNOLOGIES
The Query Compiler – Preprocessor

Preprocessing Queries Involving Views Example


CREATE VIEW ParamountMovies AS
SELECT title, year
FROM Movies
WHERE studioName = ’Paramount’;

Query:
SELECT title
FROM ParamountMovies
WHERE year = 1979;
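
For this example, substituting the view definition for the leaf ParamountMovies yields (an assumed reconstruction of the slide’s figure, in the notation used earlier):

π title (σ year = 1979 (π title, year (σ studioName = ’Paramount’ (Movies))))

which is a query over the base table Movies equivalent to the original query over the view.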
15
THANK YOU
Query Compiler
Department of Computer Science and Engineering

16
Database Technologies
Query Processing and Optimization

Algebraic Laws for Improving Query


Plans
Department of Computer Science and Engineering

Acknowledgement: Authors of the prescribed textbooks and materials sourced online.


1
Algebraic Laws for Improving Query Plans

List of Contents
- Laws for Selection
- Laws for Projection
- Laws for Joins and Product
- Laws for Duplicate Elimination
- Laws for Grouping and Aggregation

Department of Computer Science and Engineering


2
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Algebraic Laws for Improving Query Plans


1. Commutative Law:
● x + y = y + x and
● x × y = y × x
2. Associative Law:
● (x + y) + z = x + (y + z);
● (x × y) × z = x × (y × z)
• Several of the operators of relational algebra are
both associative and commutative.

3
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Laws for Selection

• One of the most important rules of efficient query processing is to push the selections down the
tree as far as they will go without changing what the expression does
1) σC1 AND C2 (R) = σC1 (σC2 (R))
2) σC1 OR C2 (R) = σC1 (R) ∪ σC2 (R)
3) σC1 (σC2 (R)) = σC2 (σC1 (R))
4) σC (R ∪ S) = σC (R) ∪ σC (S)
5) σC (R − S) = σC (R) − S
6) σC (R − S) = σC (R) − σC (S)
7) σC (R ⋈ S) = σC (R) ⋈ S (valid when C involves only attributes of R)
8) σC (R ⋈ D S) = σC (R) ⋈ D S (valid when C involves only attributes of R)
9) σC (R ∩ S) = σC (R) ∩ S
10) σC (R × S) = R × σC (S) (valid when C involves only attributes of S)
11) σC (R × S) = σC (R) × S (valid when C involves only attributes of R)
12) σC (R ⋈ S) = σC (R) ⋈ σC (S) (valid when all attributes of C appear in both R and S)
4
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Pushing Selection Example


• Consider the relations
StarsIn (title, year, starName)
Movies (title, year, length, genre, studioName, producerC#)
• CREATE VIEW MoviesOf1996 AS SELECT * FROM Movies WHERE year = 1996;
• SQL Query: SELECT starName, studioName FROM MoviesOf1996 NATURAL JOIN StarsIn;
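
Expanding the view and applying law 12 above (year is an attribute of both relations), the selection can be pushed to StarsIn as well; a possible improved plan, sketched here for illustration, is:

π starName, studioName (σ year = 1996 (Movies) ⋈ σ year = 1996 (StarsIn))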

5
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Laws for Projection


• πL(R ⋈ S) = πL (πM (R) ⋈ πN (S))
• πL(R ⋈ C S) = πL (πM(R) ⋈ C πN (S))
• πL(R × S) = πL (πM(R) × πN (S))
• πL(R ∪ S) = πL(R) ∪ πL(S)

6
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Laws for Joins and Products

• R ⋈ C S = σC (R × S)
• R ⋈ S = πL (σC (R × S))
Laws for Duplicate Elimination

• δ(R × S) = δ(R) × δ(S)


• δ(R ⋈ S) = δ(R) ⋈ δ(S)
• δ(R ⋈ C S) = δ(R) ⋈ C δ(S)
• δ(σC (R)) = σC (δ(R))
• Note:- δ(R) = R, if R has no duplicates 7
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Laws for Grouping and Aggregation


• δ (γL(R)) = γL(R)
• γL(R) = γL (πM(R))
• γL(R) = γL (δ(R)) for MIN(), MAX()

8
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Laws for Grouping and Aggregation Example

• Consider the relations


MovieStar (name, addr, gender, birthdate)
StarsIn (movieTitle, movieYear, starName)

Query:
SELECT movieYear, MAX(birthdate)
FROM MovieStar, StarsIn
WHERE name = starName
GROUP BY movieYear;
9
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Laws for Grouping and Aggregation Example

• Consider the relations


MovieStar (name, addr, gender, birthdate)
StarsIn (movieTitle, movieYear, starName)

Query:
SELECT movieYear, MAX(birthdate)
FROM MovieStar, StarsIn
WHERE name = starName
GROUP BY movieYear;
10
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Laws for Grouping and Aggregation


Example
• Consider the relations
MovieStar (name, addr, gender, birthdate)
StarsIn (movieTitle, movieYear, starName)

Query:
SELECT movieYear, MAX(birthdate)
FROM MovieStar, StarsIn
WHERE name = starName
GROUP BY movieYear; 11
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Laws for Grouping and Aggregation Example


• Consider the relations
MovieStar (name, addr, gender, birthdate)
StarsIn (movieTitle, movieYear, starName)

Query:
SELECT movieYear, MAX(birthdate)
FROM MovieStar, StarsIn
WHERE name = starName
GROUP BY movieYear;

12
THANK YOU
Algebraic Laws for Improving Query
Plans
Department of Computer Science and Engineering

13
Database Technologies
Query Processing and Optimization

From Parse Trees to Logical Query Plan


Department of Computer Science and Engineering

Acknowledgement: Authors of the prescribed textbooks and materials sourced online.


1
From Parse Trees to Logical Query Plan

List of Contents
- Conversion to Relational Algebra
- Removing Subqueries From Conditions
- Improving the Logical Query Plan
- Improving the Logical Query Plan - most commonly used
optimization techniques
- Grouping Associative/Commutative Operators

Department of Computer Science and Engineering


2
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

From Parse Trees to Logical Query Plan

1. Conversion to Relational Algebra - Transform SQL parse trees to algebraic logical query plans.

2. Improving the Logical Query Plan – Rewrite the logical plan using algebraic laws

3
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Conversion to Relational Algebra Example

• Consider a <query> with a <condition> that has no subqueries

• Construct a relational-algebra expression consisting of:

i. The product of all the relations mentioned in the <FromList>, which is the argument of:

ii. A selection σC , where C is the expression in the construct being replaced, which in turn

is the argument of:

iii. A projection πL, where L is the list of attributes in the <SelList>.

4
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Conversion to Relational Algebra Example


SELECT movieTitle

FROM StarsIn, MovieStar

WHERE starName = name AND

birthdate LIKE ’%1960’;
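
Following the three steps listed above, this query translates to the relational-algebra expression below (an assumed reconstruction of the slide’s figure):

π movieTitle (σ starName = name AND birthdate LIKE ’%1960’ (StarsIn × MovieStar))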

5
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Removing Subqueries From Conditions

SELECT movieTitle
FROM StarsIn
WHERE starName IN
( SELECT name
FROM MovieStar
WHERE birthdate LIKE ’%1960’);

π movieTitle (StarsIn ⋈ starName = name (π name (σ birthdate LIKE ’%1960’ (MovieStar))))

6
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Improving the Logical Query Plan


• When we convert a query to relational algebra we obtain one possible logical
query plan.
• Rewrite the plan using the algebraic laws
• We could generate more than one logical plan, representing different orders or
combinations of operators
• The query rewriter chooses a single logical query plan that it believes is “best,”
meaning that it is likely to result ultimately in the cheapest physical plan.

7
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Improving the Logical Query Plan - most commonly used optimization techniques
• Selections can be pushed down the expression tree as far as they can go.
• If a selection condition is the AND of several conditions, then we can split the condition and push
each piece down the tree separately. This strategy is probably the most effective improvement
technique.
• Projections can be pushed down the tree, or new projections can be added.
• Duplicate eliminations can sometimes be removed, or moved to a more convenient position in
the tree
• Certain selections can be combined with a product below to turn the pair of operations into an
equijoin, which is generally much more efficient to evaluate than are the two operations
separately. 8
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Example:

SELECT movieTitle
FROM StarsIn
WHERE starName IN
( SELECT name
FROM MovieStar
WHERE birthdate LIKE ’%1960’);

9
DATABASE TECHNOLOGIES
Query Execution – Query Compiler

Grouping Associative/Commutative Operators:


• An operator that is associative and commutative may be thought of as
having any number of operands.
• An operator like a join can have any number of operands. We can
reorder those operands so that when the multi-way join is executed as
a sequence of binary joins, they take less time than if we had executed
the joins in the order implied by the parse tree.
• For each portion of the subtree that consists of nodes with the same
associative and commutative operator, group the nodes with these
operators into a single node with many children.
• Products can be considered as a special case of natural join and
combined with joins if they are adjacent in the tree.
10
THANK YOU
From Parse Trees to Logical Query Plan
Department of Computer Science and Engineering

11
Database Technologies
Query Processing and Optimization

Cost Based Plan Selection


Department of Computer Science and Engineering

Acknowledgement: Authors of the prescribed textbooks and materials sourced online.


1
Cost Based Plan Selection

List of Contents
- Introduction to Cost-Based Plan Selection
- Obtaining Estimates for Size Parameters
- Computation of Statistics
- Heuristics for Reducing the Cost of Logical Query Plans
- Approaches to Enumerating Physical Plans

Department of Computer Science and Engineering


2
DATABASE TECHNOLOGIES
Query Compiler

Cost-Based Plan Selection


The “cost” of evaluating an expression is approximated well by the number of disk I/O’s
performed.
• The number of disk I/O’s in turn is influenced by:
i. Logical query plan - particular logical operators chosen to implement the query
ii. The size of intermediate results
iii. The physical operators used to implement the logical operators
iv. The ordering of similar operations, especially joins
v. The method of passing arguments from one physical operator to the next

3
DATABASE TECHNOLOGIES
Query Compiler

Obtaining Estimates for Size Parameters


• The size is estimated knowing certain important parameters
○ T(R), the number of tuples in a relation R and
○ V(R, a), the number of different values in the column of relation R for attribute a.
• Modern DBMS generally allows the user or administrator explicitly request the gathering of
statistics such as T(R) and V(R, a). These statistics are then used in query optimization,
unchanged until the next command to gather statistics.
• By scanning an entire relation R, it is straightforward to count the number of tuples T(R) and to
discover the number of different values V(R, a) for each attribute a.
• The number of blocks in which R can fit, B(R) can be estimated either
○ by counting the actual number of blocks used (if R is clustered), or
○ by dividing T(R) by the number of R’s tuples that can fit in one block.
4
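A toy illustration of gathering these parameters by a full scan, assuming the relation is held as a non-empty list of Python dicts (all with the same attributes) and that the number of tuples per block is known:

import math

def gather_stats(tuples, tuples_per_block):
    T = len(tuples)                                              # T(R)
    V = {a: len({t[a] for t in tuples}) for a in tuples[0]}      # V(R, a) for each attribute a
    B = math.ceil(T / tuples_per_block)                          # rough B(R) from tuples per block
    return T, V, B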
DATABASE TECHNOLOGIES
Query Compiler

Obtaining Estimates for Size Parameters using Histograms


• A DBMS may compute a histogram of the values for a given attribute.
• If V(R, a) is not too large, then the histogram may consist of the number of the tuples having
each of the values of attribute a.
• The most common types of histograms are:
i. Equal-width: A width w is chosen along with a constant v0. Counts are provided of the
number of tuples with values v in the ranges v0 ≤ v < (v0 + w), (v0 + w) ≤ v < (v0 + 2w), and so on.
ii. Equal-height: These are the common “percentiles.” Pick some fraction p and list the
lowest value, the value that is fraction p from the lowest, the fraction 2p from the lowest
and so on up to the highest value.
iii. Most-frequent-values: List the most common values and their number of occurrences.
5
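A small sketch of building two of these histogram types for one attribute; the bucket layout follows the descriptions above, and the parameters v0, w and k are chosen by the caller:

from collections import Counter

def equal_width_histogram(values, v0, w):
    bands = Counter((v - v0) // w for v in values)    # band i covers v0 + i*w <= v < v0 + (i+1)*w
    return {(v0 + i * w, v0 + (i + 1) * w): n for i, n in sorted(bands.items())}

def most_frequent_values(values, k):
    # keep the k most common values explicitly; everything else goes into an 'others' total
    top = dict(Counter(values).most_common(k))
    return top, len(values) - sum(top.values())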
DATABASE TECHNOLOGIES
Query Compiler
Example of Obtaining Estimates for Size Parameters using the Most-Frequent Values Histogram

• Suppose we want to compute the join R(a, b) ⋈ S(b, c)


• Let the histogram for R.b be: 0: 150, 1: 200, 5: 100, others: 550. Therefore T(R) = 1000
• Let the histogram for S.b be: 0: 100, 1: 80, 2: 70, others: 250. Therefore T(S) = 500
• Suppose V (R,b) = 14 and V (S,b) = 13
• The 550 tuples of R with other (unlisted) b-values are divided among 14 − 3 = 11 values (the listed values are 0, 1 and 5), for an average of 50 tuples each
• The 250 tuples of S with other (unlisted) b-values are divided among 13 − 3 = 10 values (the listed values are 0, 1 and 2), for an average of 25 tuples each
○ Values 0 and 1 appear explicitly in both histograms; value 5 is listed only for R, and value 2 only for S
○ (150 × 100) for value 0 + (200 × 80) for value 1 + (50 × 70) for value 2 + (100 × 25) for value 5 + ((13 − 4) × 50 × 25) for the remaining 9 values assumed to match = 48,250 tuples in the result set
• Estimated value using a simpler method discussed earlier:
○ T(R) × T(S) / max(V(R, b), V(S, b)) = 1000 × 500 / 14 ≈ 35,714 tuples
6
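The arithmetic above can be reproduced with a short estimator; the containment assumption (the relation with fewer distinct b-values has all of them matched in the other relation) mirrors the calculation on this slide:

def estimate_join_size(expl_R, others_R, v_R, expl_S, others_S, v_S):
    # expl_X: counts for the explicitly listed values; others_X: total tuples with other values
    avg_R = others_R / (v_R - len(expl_R))            # 550 / 11 = 50
    avg_S = others_S / (v_S - len(expl_S))            # 250 / 10 = 25
    listed = set(expl_R) | set(expl_S)
    total = sum(expl_R.get(v, avg_R) * expl_S.get(v, avg_S) for v in listed)
    total += (min(v_R, v_S) - len(listed)) * avg_R * avg_S   # remaining values assumed to match
    return total

print(estimate_join_size({0: 150, 1: 200, 5: 100}, 550, 14,
                         {0: 100, 1: 80, 2: 70}, 250, 13))    # 48250.0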
DATABASE TECHNOLOGIES
Query Compiler

Example of Obtaining Estimates for Size Parameters using the Equal-Width Histogram
• SELECT Jan.day, July.day FROM Jan, July WHERE Jan.temp = July.temp;
• If two corresponding bands have T1 and T2 tuples respectively and the number of values
in a band is V, then the estimate for the number of tuples in the join of those bands is
T1*T2/V
• Many of these products are 0, because one or the other of T1 and T2 is 0. The only bands
for which neither is 0 are 40–49 and 50–59.
• Estimate for the 40–49 band = 10 × 5 / 10 = 5
• Estimate for the 50–59 band = 5 × 20 / 10 = 10
• Estimate for the size of this join is 5 + 10 = 15 tuples
• Estimated value using the simpler method discussed earlier = 245 × 245 / 100 ≈ 600 tuples
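A sketch of the band-by-band estimate, assuming the two temperature attributes use matching equal-width histograms; only the two bands mentioned above are filled in:

def join_size_from_bands(hist1, hist2, values_per_band):
    # hist1/hist2 map the low end of each band to its tuple count;
    # corresponding bands contribute T1 * T2 / V each
    return sum(t1 * hist2.get(band, 0) / values_per_band for band, t1 in hist1.items())

print(join_size_from_bands({40: 10, 50: 5}, {40: 5, 50: 20}, 10))   # 15.0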
8
DATABASE TECHNOLOGIES
Query Compiler

Computation of Statistics
• Statistics normally are computed periodically.
• The re-computation of statistics might be triggered automatically after some period of time or after
some number of updates
• Computing statistics for an entire relation R can be very expensive, particularly if we compute V (R,a)
for each attribute a in the relation
• One common approach is to compute approximate statistics by sampling only a fraction of the data
• In a small sample of R, say 1% of its tuples, if we find that most of the a-values we see are different,
then it is likely that V (R,a) is close to T(R).
• If we find that the sample has very few different values of a, then it is likely that we have seen most of
the a-values
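One very rough way to act on this observation, purely as an illustration; the 1% sampling fraction comes from the slide, while the 0.9 cut-off and the fallback of returning the sampled distinct count are arbitrary choices made for this sketch:

import random

def estimate_V(tuples, attr, sample_fraction=0.01):
    sample = random.sample(tuples, max(1, int(len(tuples) * sample_fraction)))
    distinct = len({t[attr] for t in sample})
    if distinct > 0.9 * len(sample):      # almost every sampled value was different
        return len(tuples)                # guess: V(R, a) is close to T(R)
    return distinct                       # guess: the sample has already seen most a-values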

9
DATABASE TECHNOLOGIES
Query Compiler

Heuristics for Reducing the Cost of Logical Query Plans


• We already have observed that pushing selections down the tree certainly improves the
cost of a logical query plan regardless of relation sizes
• In the query optimization process, estimating the cost both before and after a
transformation will allow us to apply a transformation where it appears to reduce cost or
avoid the transformation.
• While estimating the cost of a logical query plan, since we have not yet made decisions
about the physical operators that will be used to implement the operators of relational
algebra, our cost estimate cannot be based on disk I/O’s
• We estimate the sizes of all intermediate results using the techniques learnt and their sum
is our heuristic estimate for the cost of the entire logical plan.
10
DATABASE TECHNOLOGIES
Query Compiler

Heuristics for Reducing the Cost of Logical Query Plans Example


• Consider the initial logical query plan
• Let the selection be pushed down as far as possible
• Fig a – Estimating the size of the selection σa=10(R) = 5000/50 = 100
• Fig a – Estimating the size of δ(σa=10(R)) = 100/2 = 50
• Fig a – Estimating the size of δ(S) = 2000/2 = 1000
• Fig a – Estimating the size of the join (of the two δ results) = 50 × 1000 / max(V(R,b), V(S,b))
= 50 × 1000 / 200 = 250
• Fig b – Estimating the size of the selection σa=10(R) = 5000/50 = 100
• Fig b – Estimating the size of the join σa=10(R) ⋈ S = 100 × 2000 / max(V(R,b), V(S,b))
= 100 × 2000 / 200 = 1000
• Fig b – Estimating the size of δ of the join result = 1000/2 = 500
• Cost estimate of Plan a = 100 + 50 + 1000 + 250 = 1400
• Cost estimate of Plan b = 100 + 1000 + 500 = 1600
11
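A short script that reproduces the two cost estimates, using the size-estimation rules applied above (a selection on a = 10 divides T(R) by V(R, a), duplicate elimination halves the input, a join divides the product of the input sizes by the larger V on the join attribute); the parameter values are the ones assumed in this example:

T_R, V_R_a, T_S = 5000, 50, 2000      # sizes assumed in the example
V_b = 200                             # max(V(R, b), V(S, b))

def sel(t, v): return t / v           # sigma_{a=10}
def dedup(t):  return t / 2           # delta (rule of thumb used above)
def join(t1, t2, v): return t1 * t2 / v

# Plan (a): duplicate elimination pushed below the join
s  = sel(T_R, V_R_a)                  # 100
d1 = dedup(s)                         # 50
d2 = dedup(T_S)                       # 1000
j  = join(d1, d2, V_b)                # 250
print(s + d1 + d2 + j)                # 1400.0

# Plan (b): a single duplicate elimination above the join
jb = join(s, T_S, V_b)                # 1000
print(s + jb + dedup(jb))             # 1600.0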
DATABASE TECHNOLOGIES
Query Compiler

Approaches to Enumerating Physical Plans

• Top-down: Work down the tree of the logical query plan from the root. For each possible
implementation of the operation at the root, consider each possible way to evaluate its
argument(s) and compute the cost of each combination, taking the best

• Bottom-up: For each subexpression of the logical-query-plan tree, compute the costs of
all possible ways to compute that subexpression. The possibilities and costs for a
subexpression E are computed by considering the options for the subexpressions of E and
combining them in all possible ways with implementations for the root operator of E

12
DATABASE TECHNOLOGIES
Query Compiler- Approaches to Enumerating Physical Plans

Heuristic Enumeration
• Use the same approach to selecting a physical plan that is generally used for selecting a logical plan. That is, make a
sequence of choices based on heuristics.
• Most commonly used heuristic approaches:
i. If the logical plan calls for a selection σ A = c(R) and relation R has an index on attribute A, then perform an
index-scan to obtain only the tuples of R with A-value equal to c.
ii. If the selection involves one condition like A = c and other conditions as well, implement the selection by an
index scan followed by a further selection on the tuples, which shall be represented by the physical operator
filter.
iii. If an argument of a join has an index on the join attribute(s), then use an index-join with that relation in the
inner loop
iv. If one argument of a join is sorted on the join attribute(s), then prefer a sort-join to a hash-join, although not
necessarily to an index-join if one is possible.
v. When computing union or intersection of three or more relations, group the smallest relations first
13
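A toy dispatcher for the first two heuristics in the list above; the tuple-based plan descriptions are placeholders for real physical operators:

def choose_scan(conditions, indexed_attrs):
    # conditions: list of (attribute, operator, constant) triples that are ANDed together
    for cond in conditions:
        attr, op, const = cond
        if op == "=" and attr in indexed_attrs:                    # heuristic (i): use the index
            rest = [c for c in conditions if c != cond]
            return ("index-scan", attr, const), ("filter", rest)   # heuristic (ii): filter the rest
    return ("table-scan",), ("filter", conditions)

# choose_scan([("A", "=", 10), ("B", "<", 5)], {"A"})
#   -> (("index-scan", "A", 10), ("filter", [("B", "<", 5)]))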
DATABASE TECHNOLOGIES
Query Compiler- Approaches to Enumerating Physical Plans

Branch-and-Bound Plan Enumeration


1. This approach is often used in practice.
2. Using heuristics find a good physical plan for the entire logical query plan. Let the cost of this plan be C.
3. Consider other plans for subqueries. Eliminate any plan for a subquery that has a cost greater than C.
4. If we construct a plan for the complete query that has a cost less than C, replace C by the cost of this
better plan in subsequent exploration of the physical query plans.
5. If the cost C is small, then even if there are much better plans to be found, the time spent finding them
may exceed C, so it does not make sense to continue the search.
6. If C is large, then investing time in the hope of finding a faster plan makes sense.
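A compact sketch of the strategy, simplified to pruning at the level of complete plans; heuristic_plan, plans, cost and the stopping threshold are stand-ins for whatever the optimizer actually provides:

def branch_and_bound(plans, heuristic_plan, cost, cheap_enough):
    best = heuristic_plan()
    C = cost(best)                    # bound obtained from the heuristically chosen plan
    for plan in plans():
        if C <= cheap_enough:         # C already so small that further search costs more than it saves
            break
        c = cost(plan)
        if c < C:                     # keep only plans that beat the current bound
            best, C = plan, c
    return best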

14
DATABASE TECHNOLOGIES
Query Compiler- Approaches to Enumerating Physical Plans

Hill Climbing
• Start with a heuristically selected physical plan.
• Make small changes to the plan, e.g., replacing one method for executing an operator by
another, or reordering joins by using the associative and/or commutative laws, to find
“nearby” plans that have lower cost.
• If you find a plan such that no small modification yields a plan of lower cost, choose the
physical query plan.
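A generic hill-climbing loop over physical plans; neighbours would generate the “nearby” plans by swapping an operator implementation or reordering joins, as described above:

def hill_climb(start, neighbours, cost):
    current = start
    while True:
        cheaper = [p for p in neighbours(current) if cost(p) < cost(current)]
        if not cheaper:
            return current                    # no small change lowers the cost
        current = min(cheaper, key=cost)      # move to the cheapest nearby plan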

15
DATABASE TECHNOLOGIES
Query Compiler- Approaches to Enumerating Physical Plans

Dynamic Programming
• This is a variation of the general bottom-up strategy
• For each subexpression only the plan with least cost is considered
• As we work up the tree, consider possible implementations of each node, assuming the
best plan for each subexpression

16
DATABASE TECHNOLOGIES
Query Compiler- Approaches to Enumerating Physical Plans

Selinger-style Optimization
• This approach improves upon the dynamic-programming approach by keeping for each
subexpression not only the plan of least cost, but certain other plans that have higher
cost, yet produce a result that is sorted in an order that may be useful higher up in the
expression tree
• If we take the cost of a plan to be the sum of the sizes of the intermediate relations, then
there appears to be no advantage to having an argument sorted
• If we use the more accurate measure disk I/O’s as the cost, then the advantage of having
an argument sorted becomes clear if we can use one of the sort-based algorithms and
save the work of the first pass for the argument that is sorted already

17
THANK YOU
Cost Based Plan Selection
Department of Computer Science and Engineering

18
Database Technologies
Query Processing and Optimization

Choosing An Order For Joins


Department of Computer Science and Engineering

Acknowledgement: Authors of the prescribed textbooks and materials sourced online.


1
Choosing An Order For Joins

List of Contents
- Significance of Left and Right Join Arguments
- Join Trees
- Left-Deep, Right-Deep & Bushy Join Trees
- Dynamic Programming to Select a Join Order and Grouping
- A Greedy Algorithm for Selecting a Join Order

Department of Computer Science and Engineering


2
DATABASE TECHNOLOGIES
Query Compiler – Choosing an Order for Joins

• A critical problem in cost-based optimization is selecting an order for the (natural) join of three or more
relations;
• Join methods are asymmetric. The roles played by the two argument relations are different and the cost of
the join depends on which relation plays which role;
• In a one-pass join, one relation (preferably the smaller) is read into main memory and a structure such as a
hash table is built over it to facilitate matching; the other relation is then read one block
at a time and its tuples are joined with the tuples stored in memory;
• Hash join: Assume the left argument of the join is the smaller relation and store it in a main-memory data
structure. This relation is called the build relation. The right argument of the join, called the probe relation,
is read a block at a time and its tuples are matched in main memory with those of the build relation;
• Nested-loop join: Assume the left argument is the relation of the outer loop.
• Index-join: Assume the right argument has the index.
3
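A toy one-pass hash join illustrating the build and probe roles described above; tuples are plain Python dicts and attr is the join attribute (a sketch, not a block-oriented implementation):

from collections import defaultdict

def one_pass_hash_join(build_rel, probe_rel, attr):
    table = defaultdict(list)
    for t in build_rel:                       # build relation: kept entirely in memory
        table[t[attr]].append(t)
    for s in probe_rel:                       # probe relation: streamed past the in-memory table
        for t in table.get(s[attr], []):
            yield {**t, **s}                  # each matching pair becomes one joined tuple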
DATABASE TECHNOLOGIES
Query Compiler- Choosing an Order for Joins

Join Trees
• When two relations are to be joined, there are only two choices for a join tree— take either of the
two relations to be the left argument
• When the join involves more than two relations, the number of possible join trees grows rapidly.
There are n! ways to order n relations
• Consider joining four relations R, S, T and U
Left Deep Join Tree
• A binary tree is Left-deep if all the right children are leaves
• Left-deep trees for joins interact well with common join algorithms, nested-loop joins and one-
pass joins in particular.
• Query plans based on left-deep trees plus these join implementations will tend to be more efficient
than the same algorithms used with non-left-deep trees.
4
DATABASE TECHNOLOGIES
Query Compiler- Choosing an Order for Joins

Right Deep Join Tree


• A binary tree is Right-deep if all the left children are leaves
Bushy Tree
• A tree that is neither left-deep nor right-deep is called a bushy tree

• The “leaves” in a left or right-deep join tree can actually be interior nodes with operators other than a join
• The total number of tree shapes also grows rapidly with the number of relations (1, 1, 2, 5, 14, 42, ... shapes for 1 to 6 leaves), before even assigning relations to the leaves
• If one-pass joins are used and the build relation is on the left, then the amount of memory needed at any
one time tends to be smaller than if we used a right-deep tree or a bushy tree for the same relations.
• If we use nested-loop joins with the relation of the outer loop on the left, then we avoid constructing any
intermediate relation more than once

5
DATABASE TECHNOLOGIES
Query Compiler- Choosing an Order for Joins

Dynamic Programming to select a Join Order and Grouping


• Consider R1 R2 ··· Rn
• Construct a table with an entry for each subset of one or more of the n relations. In that
table, store the following:
1. The estimated size of the join of these relations
2. The least cost of computing the join of these relations
3. The expression that yields the least cost
• Build the table computing entries for all subsets until we get an entry for the one subset of
size n. That entry tells us the best way to compute the join of all the relations; it also gives
us the estimated cost of that method.

6
DATABASE TECHNOLOGIES
Query Compiler- Choosing an Order for Joins

Example of Dynamic Programming to select a Join Order and Grouping

1. Consider R ⋈ S ⋈ T ⋈ U. Let T(R) = T(S) = T(T) = T(U) = 1000

2. For the singleton sets, the cost is 0, since there are no intermediate relations

3. Consider the pairs of relations. The cost for each is 0, since there are still no intermediate relations
in a join of two relations

7
DATABASE TECHNOLOGIES
Query Compiler- Choosing an Order for Joins

Example of Dynamic Programming to select a Join Order and Grouping


4. Consider the table for joins of three out of the four relations. The only way to compute a join of
three relations is to pick two to join first

Consider {R,S,T}. We must consider each of the three pairs {R, S}, {R, T}, {S, T} in turn.
Cost of {R, S} = 5,000, {R, T} = 1,000,000 and {S, T} = 2,000. Pick {S, T} since it has the lowest cost
5. Consider joining all four relations

8
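A sketch of the bottom-up table construction; size(S) is left abstract because in the example it comes from the statistics shown in the accompanying tables, and the cost of a grouping is taken, as above, to be the sum of the intermediate-result sizes:

from itertools import combinations

def best_join_order(relations, size):
    # table[subset] = (estimated size, least cost, best expression)
    table = {frozenset([r]): (size({r}), 0, r) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            best = None
            for m in range(1, k):
                for left in map(frozenset, combinations(subset, m)):
                    right = subset - left
                    lsz, lcost, lexpr = table[left]
                    rsz, rcost, rexpr = table[right]
                    # base relations contribute no intermediate-result cost
                    cost = lcost + rcost + (lsz if len(left) > 1 else 0) + (rsz if len(right) > 1 else 0)
                    if best is None or cost < best[1]:
                        best = (size(subset), cost, f"({lexpr} JOIN {rexpr})")
            table[subset] = best
    return table[frozenset(relations)]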
DATABASE TECHNOLOGIES
Query Compiler- Choosing an Order for Joins

A Greedy Algorithm for selecting a Join Order


• Dynamic programming approach needs to compute a number of calculations.
• When the number of joins grow, we choose not to invest the time necessary for an
exhaustive search and choose to use a heuristic algorithm.
• The most common choice of heuristic is a greedy algorithm, where we make one decision at a
time about the order of joins and never backtrack or reconsider decisions once made.
• Consider a greedy algorithm that only selects a left-deep tree.
• The “greediness” is based on the idea that we want to keep the intermediate relations as
small as possible at each level of the tree.

9
DATABASE TECHNOLOGIES
Query Compiler- Choosing an Order for Joins

Example of Greedy Algorithm for Selecting a Join Order

1. Consider R ⋈ S ⋈ T ⋈ U. Let T(R) = T(S) = T(T) = T(U) = 1000

2. For the singleton sets, the size is the same

3. Consider the pairs of relations. The pair {T, U} has the least size

10
DATABASE TECHNOLOGIES
Query Compiler- Choosing an Order for Joins

Example of Greedy Algorithm for Selecting a Join Order


4. Consider the table for joins of three out of the four relations. The only way to compute a
join of three relations is to pick two to join first.

Pick {S, T, U} since it has the lowest size


5. Consider joining all four relations

((T ⋈ U) ⋈ S) ⋈ R; cost = 1000 + 2000 = 3000 (the sum of the sizes of the intermediate relations T ⋈ U and (T ⋈ U) ⋈ S)

6. NOTE: In the two examples solved above, the tree resulting from the greedy algorithm is the same as that selected by the dynamic programming algorithm.
7. However, there are examples where the greedy algorithm fails to find the best join order.
11
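A sketch of the greedy construction of a left-deep join order; relations is a collection of relation names, size(S) is again an abstract estimator, and ties are broken arbitrarily:

def greedy_join_order(relations, size):
    rest = set(relations)
    # start with the pair whose join is estimated to be smallest
    a, b = min(((x, y) for x in rest for y in rest if x < y), key=lambda p: size(set(p)))
    order, joined = [a, b], {a, b}
    rest -= joined
    while rest:
        # at each step add the relation that keeps the running intermediate result smallest
        nxt = min(rest, key=lambda r: size(joined | {r}))
        order.append(nxt)
        joined.add(nxt)
        rest.remove(nxt)
    return order        # e.g. ['T', 'U', 'S', 'R'] for the example above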
THANK YOU
Choosing An Order For Joins
Department of Computer Science and Engineering

12
