Query Processing and Optimization
Prof. Smita Joshi
● Overview
● Measures of query cost
● Selection operation, sorting, join operations, and other operations
● Evaluation of expressions
● Query optimization: translation of SQL queries into relational
algebra
● Heuristic approach
● Cost-based optimization
Overview
● Techniques used internally by a DBMS to process high-level queries. A query expressed in a
high-level query language such as SQL must first be scanned, parsed, and validated
● The scanner identifies the query tokens—such as SQL keywords, attribute names, and relation
names—that appear in the text of the query
● The parser checks the query syntax to determine whether it is formulated according to the syntax rules
(rules of grammar) of the query language
● The query must also be validated by checking that all attribute and relation names are valid and
semantically meaningful names in the schema of the particular database being queried
● An internal representation of the query is then created, usually as a tree data structure called a query
tree
● It is also possible to represent the query using a graph data structure called a query graph, which is
generally a directed acyclic graph (DAG)
● The DBMS must then devise an execution strategy or query plan for retrieving the results of the query
from the database files
● A query has many possible execution strategies, and the process of choosing a suitable one for
processing a query is known as query optimization
● First we focus on how queries are processed and what algorithms are used to perform individual
operations within the query
Steps of processing a query
The basic steps involved in processing a
query are
1. Parsing and translation.
2. Optimization.
3. Evaluation.
Parsing and translation
● Submitted queries undergo lexical, syntactic, and semantic analysis.
● Essentially, the query gets broken down into tokens, and white space is
removed along with the comments (Lexical Analysis).
● In the next step, the query is checked for correctness, both syntactically and
semantically.
● The query processor first checks whether the rules of SQL grammar have been
correctly followed (Syntactic Analysis).
● Finally, the query processor checks whether the meaning of the query is valid: are
the table(s) mentioned in the query present in the database? Are the
column(s) referred to from those table(s) actually present in them? (Semantic
Analysis)
● Once the above mentioned checks pass, the flow moves to convert all the tokens into
relational expressions, graphs, and trees.
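As an illustrative sketch (not a real DBMS component), lexical analysis of this kind can be modeled with a small regex-based tokenizer that drops white space and SQL line comments:

```python
import re

# Hypothetical tokenizer: splits a query string into tokens, discarding
# white space and "--" line comments. Illustrative only, not a full SQL lexer.
def tokenize(query):
    query = re.sub(r"--[^\n]*", " ", query)  # strip SQL line comments
    # match keywords/identifiers, integer literals, or single-character symbols
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+|[<>=*,;()]", query)

tokens = tokenize("SELECT salary FROM instructor WHERE salary < 75000;")
# tokens == ['SELECT', 'salary', 'FROM', 'instructor', 'WHERE', 'salary', '<', '75000', ';']
```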
Parsing and translation
As an example, consider the query:

SELECT salary
FROM instructor
WHERE salary < 75000;

The query would be divided into the following tokens:
SELECT, salary, FROM, instructor, WHERE, salary, <, 75000.

The tokens (and hence the query) get validated:
● The name of the queried table is looked up in the data
dictionary.
● The names of the columns mentioned (salary) in the
tokens are validated for existence.
● The types of the columns being compared have to be
the same (salary and the value 75000 should have
the same data type).
The next step is to translate the generated set of tokens into a relational algebra query.
These are easy to handle for the optimizer in further processes.
Parsing and translation
This query can be translated into either of the following
relational-algebra expressions:

σ salary<75000 (Π salary (instructor))
Π salary (σ salary<75000 (instructor))

Measures of Query Cost
○ tT – average time to transfer one block of data (assuming, for simplicity, that the write cost is
the same as the read cost; in reality, read cost and write cost differ)
○ tS – average block-access time (time for one seek = disk seek time + rotational latency)
For a linear file scan, the cost is
Cost = 1 * tS + br * tT
(one initial seek plus br block transfers, where br is the number of blocks in the relation)
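The linear-scan cost formula can be computed directly; the timing values below are made-up illustrative numbers, not measurements:

```python
# Linear-scan cost: one initial seek plus one transfer per block.
# t_S, t_T, and b_r are assumed illustrative values.
def linear_scan_cost(b_r, t_S, t_T):
    return 1 * t_S + b_r * t_T

# e.g. 4 ms average seek, 0.1 ms per block transfer, relation of 1000 blocks
cost = linear_scan_cost(1000, 4.0, 0.1)  # 4 + 100 = 104 ms
```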
Selection operation: A1(Linear search, Equality on key)
● If the selection is on a key attribute, the scan stops as soon as the record is found
Select *
from employee
where emp_id = 1004;
Worst case Cost = tS + br * tT
Average case Cost = tS + (br / 2) * tT (on average, half the blocks are scanned before the record is found)
Selection operation (primary B+-tree index, equality on non-key), where hi is the height of the index:
No. of seeks = hi + 1
No. of block transfers = hi + b, assuming b blocks contain the search key
Cost = (hi + 1) * tS + (hi + b) * tT
Cost = hi * (tT + tS) + tS + b * tT
● One seek for each level of the tree, plus one seek for the first data block.
● Here b is the number of blocks containing records with the specified
search key, all of which are read.
● These blocks are leaf blocks assumed to be stored sequentially (since it
is a primary index) and don’t require additional seeks.
Hence Cost = hi * (tT + tS) + tS + b * tT
Selection operation: (primary Index (dense/sparse), equality on non key)
Dense Index, Linear Search, equality on non key
Total Cost: cost of selection in index table + cost of selection in data file
Cost of selection in index table : tS+ br * tT
Cost of selection in data file : (n * tT + tS)
Cost = tS+ br * tT + n * tT + tS
Sparse Index, Binary Search, equality on non key
Total Cost: cost of selection in index table + cost of selection in data file
Cost of selection in index table : ⌈log2(br)⌉ * (tT + tS)
Cost of selection in data file : (n * tT + tS)
Cost = ⌈log2(br)⌉ * (tT + tS) + n * tT + tS
Selection operation: (Secondary Index, equality, more than one record)
Dense Index, Linear Search, equality, more than one record
Cost = tS+ br * tT + n * (tT + tS)
Sparse Index, Binary Search, equality, more than one record
Cost = ⌈log2(br)⌉ * (tT + tS) + n * (tT + tS)
Cost = (⌈log2(br)⌉ + n) * (tT + tS)
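To make the formulas concrete, the sketch below compares three of the access paths above under assumed statistics (tS = 4 ms, tT = 0.1 ms, br = 10,000 blocks, B+-tree height hi = 3, b = 5 matching blocks; all numbers are illustrative):

```python
import math

# Illustrative comparison of the selection cost formulas; all statistics
# are assumptions, not measurements from a real system.
t_S, t_T, b_r, h_i, b = 4.0, 0.1, 10_000, 3, 5

linear = t_S + b_r * t_T                          # linear scan
binary = math.ceil(math.log2(b_r)) * (t_T + t_S)  # binary search on sorted file
btree  = h_i * (t_T + t_S) + t_S + b * t_T        # primary B+-tree, non-key

print(f"linear={linear:.1f} ms, binary={binary:.1f} ms, btree={btree:.1f} ms")
```

The index-based paths cost milliseconds, while the full scan costs about a second, which is why the optimizer's choice of access path matters.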
Sorting in Query Processing
A sorting operation is required when
● SQL queries specify that the output be sorted
● For efficient query processing
○ E.g., for the merge-join operation we have to sort the relations first
● Logical sorting
○ Sort a relation by building an index on the sort key, and use that index to read the
relation in sorted order
○ Reading tuples in sorted order may lead to a disk access for each record,
which can be very expensive, since the number of records can be much larger
than the number of blocks
● Hence it is desirable to order the records physically.
Sorting in Query Processing
There are two different types of sorting
1. Internal sorting.
2. External sorting.
Internal sorting:
When all the tuples fit into the memory then we can use a standard in-memory sorting
algorithm. Some of the algorithms are Quick sort and Bubble sort.
External sorting:
Refers to sorting algorithms that are suitable for large files of records stored on disk that
do not fit entirely in main memory, such as most database files. If data does not fit in
memory, then we need to use a technique that is aware of the cost of writing data out to
disk.
External Merge Sort Algorithm
The algorithm works in two stages:
● In the first stage, the relation is divided into runs according to the main memory space
available; each run is sorted in memory and written back to the
disk
● In the second stage, all the sorted runs that have been generated are merged
○ Read one block of each of the N run files into a buffer block in memory
repeat
● Choose the first tuple in sort order among all the buffer blocks
● Write the tuple to the output and delete it from its buffer block
● If the buffer block of any run Ri becomes empty and end-of-file(Ri) has not been reached
○ Then read the next block of Ri into the buffer block
until all input buffer blocks are empty
External Merge Sort Algorithm
Let M denote the number of blocks that can fit in the memory available for
sorting.
● Sorted runs are produced.
○ M blocks are read at a time.
○ Sorted in memory.
○ M blocks are written back.
● Merge M-1 runs at a time ((M-1)-way merge), using M-1 input buffers and 1
output buffer for the sorted output
○ Read the first block of each of the M-1 runs into the input buffers.
○ Repeatedly choose the smallest record among the input buffers and move it to the output buffer, until the output buffer is full.
○ Write the full output buffer block to disk and empty it.
○ When a block of a run is exhausted, the next block of that run is read.
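The two stages can be sketched in a simplified in-memory form; this version assumes M = 3 single-tuple buffers and keeps the runs as Python lists, whereas a real implementation streams runs to and from disk:

```python
import heapq

# Simplified sketch of external merge sort: stage 1 produces sorted runs
# of at most M tuples; stage 2 merges up to M-1 runs at a time until one
# run remains. Real systems read/write runs on disk block by block.
def external_merge_sort(tuples, M=3):
    # Stage 1: create sorted runs of size at most M
    runs = [sorted(tuples[i:i + M]) for i in range(0, len(tuples), M)]
    # Stage 2: repeated (M-1)-way merge passes
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + M - 1]))
                for i in range(0, len(runs), M - 1)]
    return runs[0] if runs else []

print(external_merge_sort([24, 19, 31, 33, 14, 16, 21, 3, 7, 2, 10, 42]))
```

With M = 3, the 12-tuple input yields four runs of three tuples each, which are then merged two at a time over two passes.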
Assumption:
● Number of buffers that fit in
memory, M = 3
● Number of tuples that fit in one
buffer = 1
● Read M-1 input buffers and use 1 output
buffer to write sorted output
Example: 3 Buffer pages to sort 12 page file
Join Operation
There are several different algorithms that can be used to implement
joins (natural-join, equi-join, condition-join)
○ Nested-Loop Join
○ Block Nested-Loop Join
○ Index Nested-Loop Join
○ Sort-Merge Join
○ Hash-Join
● Choice of a particular algorithm is based on cost estimate
Nested-Loop Join Algorithm
● A nested loop join is a join that contains a pair of nested for
loops.
● To compute the theta join r ⋈θ s of two relations r and s,
we use an algorithm known as the nested-loop join algorithm.
● The computation takes place as:
r ⋈θ s
where r is known as the outer relation and s is the inner
relation of the join, because the for loop over r encloses the for
loop over s.
Nested-Loop Join Algorithm (r ⋈ θ s)
for each tuple tr in r do begin
    for each tuple ts in s do begin
        test pair (tr, ts) to see whether they satisfy the given join condition
        if the test is satisfied, add tr · ts to the result
    end inner loop
end outer loop

Here,
r - outer relation
s - inner relation
tr and ts are the tuples of relations r and s, respectively.
The notation tr · ts denotes a tuple constructed by concatenating the attribute values of
tuples tr and ts.
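The pseudocode above translates directly into a runnable sketch; here relations are lists of dicts and the join condition θ is passed as a function (the relation contents are illustrative):

```python
# Nested-loop join: for each tuple of the outer relation, scan the entire
# inner relation and emit concatenated pairs that satisfy theta.
def nested_loop_join(r, s, theta):
    result = []
    for tr in r:                          # outer relation
        for ts in s:                      # inner relation
            if theta(tr, ts):             # test the pair against the condition
                result.append({**tr, **ts})  # concatenate tr and ts
    return result

employee = [{"name": "Asha", "dept_id": 1}, {"name": "Ravi", "dept_id": 2}]
dept = [{"dept_id": 1, "dept_name": "Sales"}]
print(nested_loop_join(employee, dept,
                       lambda tr, ts: tr["dept_id"] == ts["dept_id"]))
# [{'name': 'Asha', 'dept_id': 1, 'dept_name': 'Sales'}]
```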
Nested-Loop Join Example
s⋈θr
Here,
● s is the outer relation
● r is the inner relation
Nested-Loop Join Algorithm Example
Here,
Employee is the outer relation
Dept is the inner relation
Nested-Loop Join Algorithm
○ The nested-loop join does not need any index; like a linear file scan, it simply reads
the data.
○ The nested-loop join does not depend on the form of the join condition; it works for
any join condition.
○ The nested-loop join algorithm is expensive, because it computes and
examines every pair of tuples in the given two relations.
Cost of Nested-Loop Join Algorithm
For analyzing the cost of the nested-loop join algorithm:
● The number of tuple pairs examined is nr * ns.
Here, nr is the number of tuples in relation r and ns is the number of
tuples in relation s.
● For each tuple in the outer relation r, a complete scan of the inner relation s is performed. Thus,
Worst Case
● The buffer can hold only one block of each relation
Total number of block transfers in worst case = nr * bs + br
Total number of seeks required in worst case = nr + br
where,
br - number of blocks containing tuples of relation r
bs - number of blocks containing tuples of relation s
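The worst-case formulas can be evaluated for assumed statistics (nr = 5000 tuples, br = 100 blocks, bs = 400 blocks; all values are made up for illustration):

```python
# Worst-case nested-loop join cost: one buffer block per relation, so the
# inner relation is re-read for every outer tuple. Statistics are assumed.
def nlj_worst_case(n_r, b_r, b_s, t_S, t_T):
    transfers = n_r * b_s + b_r   # n_r * b_s + b_r block transfers
    seeks = n_r + b_r             # n_r + b_r seeks
    return transfers * t_T + seeks * t_S

cost = nlj_worst_case(5000, 100, 400, t_S=4.0, t_T=0.1)  # in ms
```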
Cost of Nested-Loop Join Algorithm
Best Case
● Enough space for both relations to fit simultaneously in memory, so each block would
have to be read only once
● If one of the relations fits entirely in memory, it is beneficial to use that relation as the
inner relation, since the inner relation would then be read only once
Note that the projection of course onto (course_id, title) is required since course shares the
attribute dept_name with instructor; if we did not remove this attribute using the projection, the
expression using natural joins would return only courses from the Music department,
even if some Music-department instructors taught courses in other departments.
Query Optimization
The above expression constructs a large intermediate relation
Query Optimization
The transformed expression tree takes less time to produce the output, because it
directly applies the selection dept_name = ”Music” to instructor first.
An evaluation plan defines exactly what algorithm should be used for each operation, and how
the execution of the operations should be coordinated.
● Different operators may use different algorithms, such as hash join, merge join (sorted join), etc.
● The ID attribute is used to merge/sort the relations. Where edges are marked as pipelined,
the output of the producer is pipelined directly to the consumer, without being
written out to disk.
● Given a relational-algebra expression, it is the job of the query optimizer to come up
with a query-evaluation plan that computes the same result as the given expression,
and is the least-costly way of generating the result.
Query Optimization
● To find the least-costly query-evaluation plan, the optimizer needs to generate
alternative plans that produce the same result as the given expression, and to choose
the least-costly one.
Generation of query-evaluation plans involves three
steps: (1) generating expressions that are logically equivalent to the given expression,
(2) annotating the resulting expressions to produce alternative query-evaluation plans, and
(3) estimating the cost of each plan and choosing the cheapest.
Materialization: materialize (i.e., store into temporary relations on disk) intermediate
results from lower-level operations, and use them as inputs to upper-level operations.
Pipelining: pass the result tuples of one operation on to the next as they are produced.
It is much cheaper than materialization: there is no need to store a temporary relation on disk. For
pipelining to be effective, use evaluation algorithms that generate output tuples even as
tuples are received for the inputs to the operation.
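Python generators give a toy model of pipelining: each operator yields tuples to its consumer as they are produced, and no temporary relation is ever materialized (the operator names and sample data below are illustrative):

```python
# Each operator is a generator: tuples flow through the plan one at a
# time instead of being stored in intermediate relations.
def scan(relation):
    for t in relation:
        yield t

def select(pred, tuples):
    for t in tuples:
        if pred(t):
            yield t

def project(attrs, tuples):
    for t in tuples:
        yield {a: t[a] for a in attrs}

instructor = [{"name": "Kim", "salary": 60000}, {"name": "Wu", "salary": 90000}]
plan = project(["name"], select(lambda t: t["salary"] < 75000, scan(instructor)))
print(list(plan))  # [{'name': 'Kim'}]
```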
Query Optimization
❏ Transformation of Relational Expressions
Two relational-algebra expressions are said to be equivalent if, on every legal database
instance, they generate the same set of tuples.
The two expressions may generate the tuples in different orders, but they are
considered equivalent as long as the set of tuples is the same.
Query Optimization
Equivalence Rules:
An equivalence rule says that expressions of two forms are equivalent; we can replace an
expression of the first form by an expression of the second form, and vice versa.
The two expressions generate the same result on any valid database.
Example of Transformation:
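One classic transformation pushes a selection below a join. The sketch below demonstrates (for one sample instance, not as a proof) that applying the selection before or after the join yields the same set of tuples; the relations and attribute names are illustrative:

```python
# Toy data: instructor and teaches relations as lists of dicts.
instructor = [{"ID": 1, "dept_name": "Music"}, {"ID": 2, "dept_name": "Physics"}]
teaches = [{"ID": 1, "course_id": "MU-199"}, {"ID": 2, "course_id": "PHY-101"}]

def join(r, s):
    # equi-join on the shared ID attribute
    return [{**tr, **ts} for tr in r for ts in s if tr["ID"] == ts["ID"]]

def select_music(tuples):
    # selection: dept_name = "Music"
    return [t for t in tuples if t["dept_name"] == "Music"]

late = select_music(join(instructor, teaches))   # selection applied after the join
early = join(select_music(instructor), teaches)  # selection pushed below the join
assert late == early  # both plans produce the same result
```

The second plan is typically cheaper, since the join operates on a smaller relation; this is exactly the kind of rewrite a heuristic optimizer applies.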