Adb ch2
12/10/24
Query processing & optimization 1
Query Processing…
Refers to the range of activities involved in
extracting data from a database.
This includes translation of high-level queries into
low-level expressions that can be used
at the physical level of the file system.
[Figure: diagram with the labels Execution Strategies, DBMS, Optimizer, Data, and Answer]
Query Processing…
The scanner identifies the query tokens that appear in the text of the query, such as
SQL keywords,
attribute names, and
relation names.
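A minimal sketch of the scanning step, assuming a toy keyword list and a regular-expression tokenizer. The names `KEYWORDS`, `TOKEN_RE`, and `scan` are illustrative and not taken from any particular DBMS. The scanner cannot tell attribute names from relation names, so both come out as identifiers; a later validation step would resolve them against the catalog.

```python
import re

# Hypothetical keyword list; a real SQL scanner recognizes many more.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}

# One alternative per token class: numbers, names, operators, punctuation.
TOKEN_RE = re.compile(
    r"\s*(?:(\d+)|([A-Za-z_][A-Za-z_0-9]*)|(<=|>=|<>|[=<>])|([,()*.;]))"
)


def scan(query):
    """Split a query string into (kind, text) tokens."""
    tokens = []
    pos = 0
    query = query.rstrip()
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        number, name, operator, punct = match.groups()
        if number is not None:
            tokens.append(("NUMBER", number))
        elif name is not None:
            kind = "KEYWORD" if name.upper() in KEYWORDS else "IDENTIFIER"
            tokens.append((kind, name))
        elif operator is not None:
            tokens.append(("OPERATOR", operator))
        else:
            tokens.append(("PUNCT", punct))
        pos = match.end()
    return tokens


tokens = scan("SELECT balance FROM account WHERE balance > 2500")
```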
Query Processing…
The query optimizer module has the task of producing
a good execution plan.
The code generator generates the code to execute that
plan.
The runtime database processor has the task of
running (executing) the query code, whether in
compiled or interpreted mode, to produce the query
result.
If a runtime error results, an error message is generated
by the runtime database processor.
Query Processing… Query tree
SQL query:
Select balance
From account
Where balance > 2500
Equivalent relational algebra expressions:
$\pi_{balance}(\sigma_{balance>2500}(account))$
$\sigma_{balance>2500}(\pi_{balance}(account))$
Steps in Query Processing
1. Parsing and translation
Translate the query into its internal form.
The internal form is then translated into relational algebra.
2. Optimization
The query optimizer translates a relational algebra expression
into an execution plan.
A relational algebra expression may have many equivalent
expressions, each of which gives rise to a different evaluation
plan.
Amongst all equivalent evaluation plans choose the one with
lowest cost.
2. Optimization…
An annotated expression specifying a detailed evaluation
strategy is called an evaluation plan.
Query Optimization: Amongst all equivalent evaluation
plans choose the one with lowest cost.
$\sigma_{balance>2500}(\pi_{balance}(account))$ is equivalent to
$\pi_{balance}(\sigma_{balance>2500}(account))$
Operations and Costs
Operations: $\sigma$, $\pi$, $\cup$, $\cap$, $-$, $\times$, $\bowtie$
Costs:
$N_R$: number of records in R
$L_R$: size of a record in R
$F_R$: blocking factor
• number of records per page
$B_R$: number of pages needed to store relation R
$V(A,R)$: number of distinct values of attribute A in R
$SC(A,R)$: selection cardinality of A in R
• A is a key: $SC(A,R) = 1$
• A is a nonkey: $SC(A,R) = N_R / V(A,R)$
$HT_i$: number of levels in index i
Fractions and logarithms are rounded up.
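A small worked example of these statistics, as a sketch. The numbers are hypothetical, and the page count assumes fixed-size records that do not span pages, so that the number of pages is the number of records divided by the blocking factor, rounded up.

```python
import math


def pages_needed(n_records, blocking_factor):
    """B_R: pages needed for N_R records at F_R records per page, rounded up."""
    return math.ceil(n_records / blocking_factor)


def selection_cardinality(n_records, distinct_values, is_key):
    """SC(A,R): 1 for a key attribute, N_R / V(A,R) (rounded up) otherwise."""
    if is_key:
        return 1
    return math.ceil(n_records / distinct_values)


# Hypothetical statistics for a relation R.
N_R = 10_000   # number of records
F_R = 40       # records per page (blocking factor)
V_A_R = 50     # distinct values of a nonkey attribute A

B_R = pages_needed(N_R, F_R)
SC = selection_cardinality(N_R, V_A_R, is_key=False)
```

With these assumed values R occupies 250 pages, and an equality selection on A is expected to return about 200 records.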
Steps in Query Processing
3) Evaluation
The query-execution engine takes a query-evaluation plan
and executes that plan.
Query Processing vs.
Optimization
Query Processing
Query Optimization
PROJECT can produce many tuples with the same value.
Formal relational algebra eliminates such duplicates, but SQL does not: one difference between formal and actual query languages.
Relational Algebra: Select
$\sigma_{<predicate>}(R)$
<predicate> is a conditional expression of the type
that we are familiar with from conventional
programming languages
<attribute> <op> <attribute>
<attribute> <op> <constant>
attribute $\in$ R
op $\in$ {$=$, $\neq$, $<$, $>$, $\leq$, …, AND, OR}
Pictorially
Movie
title          year  length  filmType
Star Wars      1977  124     color
Mighty Ducks   1991  104     color
Wayne's World  1992  95      color
[Figure: a selection over Movie produces a result set with the same attributes A1 … An and a subset of the tuples]
The fraction of tuples selected is referred to as the selectivity of the condition
Cartesian Product
RxS
Sets of all pairs that can be formed by choosing the first
element of the pair to be any element of R, the second
any element of S.
Resulting schema may be ambiguous
Use R.A or S.A to disambiguate an attribute that occurs in both
schemas
R
A  B
1  2
3  4

S
B  C   D
2  5   6
4  7   8
9  10  11

R x S
A  R.B  S.B  C   D
1  2    2    5   6
1  2    4    7   8
1  2    9    10  11
3  4    2    5   6
3  4    4    7   8
3  4    9    10  11
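The product of the example relations R(A, B) and S(B, C, D) can be sketched as follows. Tuples are plain Python tuples, and the function name `cartesian_product` is illustrative.

```python
from itertools import product

R = [(1, 2), (3, 4)]                       # schema (A, B)
S = [(2, 5, 6), (4, 7, 8), (9, 10, 11)]    # schema (B, C, D)


def cartesian_product(r, s):
    """Pair every tuple of r with every tuple of s by concatenation."""
    return [left + right for left, right in product(r, s)]


RxS = cartesian_product(R, S)   # schema (A, R.B, S.B, C, D)
```

The result has 2 * 3 = 6 tuples, one for each pairing, and its schema needs R.B and S.B to tell the two B columns apart.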
Join Operations
R $\bowtie$ S
Match only those tuples from R and S that agree on
whatever attributes are common to the schemas of R
and S
If tuples r from R and s from S are successfully paired, the result is
called a joined tuple
This join operation is the same one we used in an earlier
section to recombine relations that had been projected
onto two subsets of their attributes (e.g., as a result of a
BCNF decomposition)
Example
R
A  B
1  2
3  4

S
B  C   D
2  5   6
4  7   8
9  10  11

R join S
A  B  C  D
1  2  5  6
3  4  7  8
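A nested-loop sketch of the natural join on the same example data. The helper `natural_join` and its schema arguments are illustrative; it simply pairs tuples that agree on every attribute name the two schemas share.

```python
def natural_join(r, r_schema, s, s_schema):
    """Join tuples of r and s that agree on every shared attribute."""
    common = [a for a in r_schema if a in s_schema]
    extra = [a for a in s_schema if a not in common]
    schema = list(r_schema) + extra
    rows = []
    for left in r:
        left_row = dict(zip(r_schema, left))
        for right in s:
            right_row = dict(zip(s_schema, right))
            if all(left_row[a] == right_row[a] for a in common):
                # Keep the shared attributes once, then the rest of s.
                rows.append(left + tuple(right_row[a] for a in extra))
    return schema, rows


R = [(1, 2), (3, 4)]
S = [(2, 5, 6), (4, 7, 8), (9, 10, 11)]
schema, joined = natural_join(R, ("A", "B"), S, ("B", "C", "D"))
```

The S tuple (9, 10, 11) finds no partner in R, so it does not appear in the result.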
Select B
From R
Where R.A = "c" AND R.C > 10
SQL Primer (contd.)
We will focus on SPJ (select-project-join) queries.
Relational Algebra - can be used to
describe plans
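$\pi_{B,D}[\sigma_{R.A=\text{"c"} \wedge S.E=2 \wedge R.C=S.C}(R \times S)]$
[Figure: plan tree with $\pi_{B,D}$ at the root above $\times$ over the leaves R and S]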
General syntax query parser
Translating SQL into Relational Algebra
SELECT A1, …, An
FROM R1, …, Rk
WHERE P
$\pi_{A_1,\ldots,A_n}(\sigma_P(R_1 \times \cdots \times R_k))$
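The translation can be read as a recipe: product first, then selection, then projection. The sketch below evaluates it literally. The function `evaluate_spj`, the qualified "Name.attr" naming, and the sample account rows are all assumptions made for illustration.

```python
from itertools import product


def evaluate_spj(select_attrs, relations, predicate):
    """Evaluate pi_{A1..An}(sigma_P(R1 x ... x Rk)) literally, step by step.

    relations maps a relation name to a list of dict rows; attributes in the
    combined row are qualified as "Name.attr" to keep them unambiguous.
    """
    names = list(relations)
    result = []
    # Cartesian product: one combined row per choice of one row per relation.
    for combo in product(*(relations[name] for name in names)):
        row = {}
        for name, part in zip(names, combo):
            for attr, value in part.items():
                row[f"{name}.{attr}"] = value
        # Selection: keep rows satisfying P.
        if predicate(row):
            # Projection: keep only the requested attributes.
            result.append(tuple(row[a] for a in select_attrs))
    return result


# Hypothetical sample data for the account relation.
account = [{"number": "A-1", "balance": 900}, {"number": "A-2", "balance": 3000}]
rows = evaluate_spj(["account.balance"], {"account": account},
                    lambda row: row["account.balance"] < 2500)
```

Like SQL, this sketch keeps duplicates in the projection. It is also the most expensive way to run the query, which is what the optimizer tries to improve on.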
Tree Representation of
Relational Algebra
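$\pi_{A_1,\ldots,A_n}(\sigma_P(R_1 \times \cdots \times R_k))$
[Figure: query tree with $\pi_{A_1,\ldots,A_n}$ at the root, $\sigma_P$ below it, and a chain of $\times$ nodes over the leaves R1, R2, R3, …, Rk]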
query parser Example
SELECT balance
FROM account
WHERE balance<2500
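$\pi_{balance}(\sigma_{balance<2500}(account))$
[Figure: query tree with $\pi_{balance}$ at the root, $\sigma_{balance<2500}$ below it, and account as the leaf]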
Making An Evaluation Plan
Query Evaluation Plan (or simply Plan): a tree of
relational algebra operators (essentially σ-π-join
as the basic block, with the remaining operators carried out on
the result) with a choice of algorithm for each operator.
An evaluation plan defines exactly what
algorithm is used for each operation, and how
the execution of the operations is coordinated.
A query plan presents a specific order of operations
for executing a query.
Query Evaluation Plan
Used to fully specify how to evaluate a query, each
operation in the query tree is annotated with instructions
which specify the algorithm or the index to be used to
evaluate that operation.
Query Optimization: Amongst all equivalent evaluation
plans choose the one with lowest cost.
Cost is estimated using statistical information from the
database catalog
e.g. number of tuples in each relation, size of tuples, etc.
Query Evaluation
How do we evaluate an individual relational operation?
Relational Algebra - can be used to
describe plans
[Figure: plan tree with $\pi_{B,D}$ at the root above $\times$ over the leaves R and S]
methods of query optimization
There are two methods of query optimization.
Cost based Optimization (Physical)
This is based on the cost of the query.
The query can use different paths based on indexes, constraints,
sorting methods etc.
This method mainly uses the statistics like :
• record size,
• number of records,
• number of records per block,
• number of blocks,
• table size,
• whether whole table fits in a block,
• organization of tables,
• uniqueness of column values,
• size of columns, etc.
methods of query optimization
Heuristic Optimization (Logical)
This method is also known as rule-based optimization.
It is based on the equivalence rules for relational
expressions;
• hence the number of query combinations to consider is reduced,
• and hence the cost of the query is reduced too.
This method creates a relational tree for the given query based
on the equivalence rules.
By providing an alternative way of
writing and evaluating the query, these equivalence rules give a better path to
evaluate the query.
A rule need not improve the query in all cases,
so the result needs to be examined after applying the rules.
Example
SELECT Lname
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE Pname='Aquarius' AND Pnumber=Pno AND Essn=Ssn AND Bdate > '1957-12-31';
Cont…
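1. Conjunctive selection operations can be deconstructed into a sequence of
individual selections.
$\sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$
3. Only the last in a sequence of projection operations is needed; the others can be omitted.
$\pi_{L_1}(\pi_{L_2}(\ldots(\pi_{L_n}(E))\ldots)) = \pi_{L_1}(E)$
4. Selections can be combined with Cartesian products and theta joins.
a. $\sigma_\theta(E_1 \times E_2) = E_1 \bowtie_\theta E_2$
b. $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \wedge \theta_2} E_2$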
Cont…
5. Theta-join operations (and natural joins) are commutative.
$E_1 \bowtie_\theta E_2 = E_2 \bowtie_\theta E_1$
6. (a) Natural join operations are associative:
$(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$
(b) Theta joins are associative in the following manner:
Cont…
7. The selection operation distributes over the theta join operation under
the following two conditions:
(a) When all the attributes in $\theta_0$ involve only the attributes of one
Cont…
8. The projection operation distributes over the theta join operation as
follows:
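(a) if $\theta$ involves only attributes from $L_1 \cup L_2$:
$\pi_{L_1 \cup L_2}(E_1 \bowtie_\theta E_2) = (\pi_{L_1}(E_1)) \bowtie_\theta (\pi_{L_2}(E_2))$
(b) in general, where $L_3$ and $L_4$ are the attributes of $E_1$ and $E_2$, respectively, that appear in $\theta$ but not in $L_1 \cup L_2$:
$\pi_{L_1 \cup L_2}(E_1 \bowtie_\theta E_2) = \pi_{L_1 \cup L_2}((\pi_{L_1 \cup L_3}(E_1)) \bowtie_\theta (\pi_{L_2 \cup L_4}(E_2)))$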
Cont…
9. The set operations union and intersection are commutative
$E_1 \cup E_2 = E_2 \cup E_1$
$E_1 \cap E_2 = E_2 \cap E_1$
(set difference is not commutative).
10. Set union and intersection are associative.
$(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$
$(E_1 \cap E_2) \cap E_3 = E_1 \cap (E_2 \cap E_3)$
11. The selection operation distributes over $\cup$, $\cap$ and $-$.
$\sigma_\theta(E_1 - E_2) = \sigma_\theta(E_1) - \sigma_\theta(E_2)$
and similarly for $\cup$ and $\cap$ in place of $-$
Also: $\sigma_\theta(E_1 - E_2) = \sigma_\theta(E_1) - E_2$
and similarly for $\cap$ in place of $-$, but not for $\cup$
12. The projection operation distributes over union
$\pi_L(E_1 \cup E_2) = (\pi_L(E_1)) \cup (\pi_L(E_2))$
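Rules 11 and 12 can be sanity-checked on one small instance, as sketched below. The relations `E1` and `E2`, the predicate `theta`, and the helper names are hypothetical, and agreement on one instance illustrates the rules rather than proving them.

```python
def select(predicate, relation):
    """sigma_predicate(relation) over a set of tuples."""
    return {t for t in relation if predicate(t)}


def project(indexes, relation):
    """pi over the given column positions; duplicates collapse in a set."""
    return {tuple(t[i] for i in indexes) for t in relation}


# Hypothetical relations with schema (x, y).
E1 = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
E2 = {(2, "b"), (4, "d"), (0, "z")}


def theta(t):
    """Selection condition: x > 1."""
    return t[0] > 1


# Rule 11: selection distributes over set difference, in both forms.
rule_11 = select(theta, E1 - E2) == select(theta, E1) - select(theta, E2)
rule_11_also = select(theta, E1 - E2) == select(theta, E1) - E2

# Rule 12: projection distributes over union.
rule_12 = project([1], E1 | E2) == project([1], E1) | project([1], E2)

# The "Also" form does not carry over to union: E2 contributes (0, "z"),
# which fails theta, so the two sides differ on this instance.
union_counterexample = select(theta, E1 | E2) != select(theta, E1) | E2
```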
Algebraic Laws
Commutative and Associative Laws
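$R \cup S = S \cup R$, $R \cup (S \cup T) = (R \cup S) \cup T$
$R \cap S = S \cap R$, $R \cap (S \cap T) = (R \cap S) \cap T$
Laws involving selection:
$\sigma_{C \wedge C'}(R) = \sigma_C(\sigma_{C'}(R)) = \sigma_C(R) \cap \sigma_{C'}(R)$
$\sigma_{C \vee C'}(R) = \sigma_C(R) \cup \sigma_{C'}(R)$
$\sigma_C(R \bowtie S) = \sigma_C(R) \bowtie S$
• When C involves only attributes of R
$\sigma_C(R - S) = \sigma_C(R) - S$
$\sigma_C(R \cup S) = \sigma_C(R) \cup \sigma_C(S)$
$\sigma_C(R \cap S) = \sigma_C(R) \cap S$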
Initial Logical Plan
Select B,D
From R,S
Where R.A = "c" AND R.C = S.C
$\pi_{B,D}[\sigma_{R.A=\text{"c"} \wedge R.C=S.C}(R \times S)]$
Apply Rewrite Rule (1)
$\pi_{B,D}[\sigma_{R.A=\text{"c"} \wedge R.C=S.C}(R \times S)]$
becomes
$\pi_{B,D}[\sigma_{R.C=S.C}[\sigma_{R.A=\text{"c"}}(R \times S)]]$
Apply Rewrite Rule (2)
$\pi_{B,D}[\sigma_{R.C=S.C}[\sigma_{R.A=\text{"c"}}(R \times S)]]$
becomes
$\pi_{B,D}[\sigma_{R.C=S.C}[[\sigma_{R.A=\text{"c"}}(R)] \times S]]$
Apply Rewrite Rule (3)
$\pi_{B,D}[\sigma_{R.C=S.C}[[\sigma_{R.A=\text{"c"}}(R)] \times S]]$
becomes, using a natural join,
$\pi_{B,D}[[\sigma_{R.A=\text{"c"}}(R)] \bowtie S]$
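The three rewrites can be replayed as code to see that each step keeps the answer unchanged. This is a sketch: the tuples are the sample R(A, B, C) and S(C, D, E) instances from this chapter's running example, and the function names are illustrative.

```python
# Sample relations, schemas R(A, B, C) and S(C, D, E).
R = [("a", 1, 10), ("b", 1, 20), ("c", 2, 10), ("d", 2, 35), ("e", 3, 45)]
S = [(10, "x", 2), (20, "y", 2), (30, "z", 2), (40, "x", 1), (50, "y", 3)]


def initial_plan():
    """pi_{B,D}[ sigma_{R.A='c' AND R.C=S.C} (R x S) ]"""
    pairs = [(r, s) for r in R for s in S]
    kept = [(r, s) for r, s in pairs if r[0] == "c" and r[2] == s[0]]
    return [(r[1], s[1]) for r, s in kept]


def after_rule_1():
    """Split the conjunctive selection into two selections in sequence."""
    pairs = [(r, s) for r in R for s in S]
    step = [(r, s) for r, s in pairs if r[0] == "c"]
    kept = [(r, s) for r, s in step if r[2] == s[0]]
    return [(r[1], s[1]) for r, s in kept]


def after_rule_2():
    """Push sigma_{R.A='c'} below the product so it filters R first."""
    r_c = [r for r in R if r[0] == "c"]
    pairs = [(r, s) for r in r_c for s in S]
    kept = [(r, s) for r, s in pairs if r[2] == s[0]]
    return [(r[1], s[1]) for r, s in kept]


def after_rule_3():
    """Fuse the product and sigma_{R.C=S.C} into a natural join on C."""
    r_c = [r for r in R if r[0] == "c"]
    s_by_c = {}
    for s in S:
        s_by_c.setdefault(s[0], []).append(s)
    return [(r[1], s[1]) for r in r_c for s in s_by_c.get(r[2], [])]


results = [initial_plan(), after_rule_1(), after_rule_2(), after_rule_3()]
```

All four versions return the same single tuple; what changes is how many intermediate pairs each one builds.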
• How do we execute this query?
Select B,D
From R,S
Where R.A = "c" AND S.E = 2 AND R.C = S.C
One idea:
- Do Cartesian product
- Select tuples
- Do projection
R
A  B  C
a  1  10
b  1  20
c  2  10
d  2  35
e  3  45

S
C   D  E
10  x  2
20  y  2
30  z  2
40  x  1
50  y  3

Select B,D
From R,S
Where R.A = "c" AND S.E = 2 AND R.C = S.C

Answer
B  D
2  x
An Example (cont.)
Plan 1
Cross product of R & S
Select tuples using WHERE conditions
Project on B & D
Algebra expression
$\pi_{B,D}[\sigma_{R.A=\text{"c"} \wedge S.E=2 \wedge R.C=S.C}(R \times S)]$
Select B,D
From R,S
Where R.A = "c" AND S.E = 2 AND R.C = S.C

R X S
R.A  R.B  R.C  S.C  S.D  S.E
a    1    10   10   x    2
a    1    10   20   y    2
.
.
c    2    10   10   x    2    Found! Got one...
.
.
An Example (cont.)
Plan 2
Select R tuples with R.A=“c”
Select S tuples with S.E=2
Natural join
Project B & D
$\pi_{B,D}[\sigma_{R.A=\text{"c"}}(R) \bowtie \sigma_{S.E=2}(S)]$
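To see why Plan 2 is preferable, the sketch below runs both plans on the example relations and records the largest intermediate result each one builds. Using intermediate size as a stand-in for cost is an assumption made for illustration; the function names are hypothetical.

```python
# Example relations, schemas R(A, B, C) and S(C, D, E).
R = [("a", 1, 10), ("b", 1, 20), ("c", 2, 10), ("d", 2, 35), ("e", 3, 45)]
S = [(10, "x", 2), (20, "y", 2), (30, "z", 2), (40, "x", 1), (50, "y", 3)]


def plan_1():
    """Cross product, then selection, then projection."""
    pairs = [(r, s) for r in R for s in S]
    selected = [(r, s) for r, s in pairs
                if r[0] == "c" and s[2] == 2 and r[2] == s[0]]
    answer = [(r[1], s[1]) for r, s in selected]
    return answer, max(len(pairs), len(selected))


def plan_2():
    """Select on each relation first, then join, then project."""
    r_sel = [r for r in R if r[0] == "c"]
    s_sel = [s for s in S if s[2] == 2]
    joined = [(r, s) for r in r_sel for s in s_sel if r[2] == s[0]]
    answer = [(r[1], s[1]) for r, s in joined]
    return answer, max(len(r_sel), len(s_sel), len(joined))


answer_1, largest_1 = plan_1()
answer_2, largest_2 = plan_2()
```

Both plans give the same answer, but Plan 1 materializes all 25 pairs of R x S while Plan 2 never holds more than 3 tuples at once.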
Another idea:
Plan II, over R(A,B,C) and S(C,D,E):
$\pi_{B,D}[\sigma_{R.A=\text{"c"}}(R) \bowtie \sigma_{S.E=2}(S)]$
Select B,D
From R,S
Where R.A = "c" AND S.E = 2 AND R.C = S.C
Measures of Query Cost
Cost is generally measured as total elapsed time for
answering query
Many factors contribute to time cost
• disk accesses, CPU, or even network communication
Typically disk access is the predominant cost, and is
also relatively easy to estimate. It is measured by taking
into account
Number of seeks * average-seek-cost
Number of blocks read * average-block-read-cost
Number of blocks written * average-block-write-cost
• The cost to write a block is greater than the cost to read a
block
Measures of Query Cost
(Cont.)
For simplicity we just use the number of block transfers from
disk and the number of seeks as the cost measures
$t_T$: time to transfer one block
$t_S$: time for one seek
Cost for b block transfers plus S seeks:
$b \cdot t_T + S \cdot t_S$
We do not include the cost of writing output to disk in the cost
formulae
We ignore CPU costs for simplicity, as they tend to be much
lower
Real systems do take CPU cost into account, but it is usually less
significant
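The cost formula translates directly into code. In this sketch the device timings are hypothetical round numbers in microseconds, chosen only to make the arithmetic easy to follow.

```python
def disk_cost(block_transfers, seeks, t_transfer, t_seek):
    """Estimated time for b block transfers plus S seeks: b * tT + S * tS."""
    return block_transfers * t_transfer + seeks * t_seek


# Hypothetical device parameters, in microseconds.
T_TRANSFER_US = 100    # tT: time to transfer one block
T_SEEK_US = 4000       # tS: time for one seek

# Reading 100 contiguous blocks after a single seek.
sequential_us = disk_cost(100, 1, T_TRANSFER_US, T_SEEK_US)

# Reading the same 100 blocks scattered over the disk, one seek each.
scattered_us = disk_cost(100, 100, T_TRANSFER_US, T_SEEK_US)
```

With these assumed timings the sequential read costs 14,000 microseconds and the scattered read 410,000, which shows why the number of seeks matters as much as the number of blocks.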
Algorithms for Selection
Operation
File scan – search algorithms that locate and retrieve
records that fulfill a selection condition.
Algorithm A1 (linear search). Scan each file block and
test all records to see whether they satisfy the selection
condition.
Cost estimate = $b_r$ block transfers + 1 seek, where $b_r$ is the number of blocks containing records of relation r
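A sketch of linear search that also counts the block transfers and seeks it incurs. The file is modeled as a list of blocks, the blocks are assumed to be stored contiguously (hence a single seek), and the account records are made up for illustration.

```python
def linear_search(blocks, predicate):
    """Scan every block of a file and return the matching records.

    Returns (matches, block_transfers, seeks): one initial seek, then one
    transfer per block, assuming the blocks are stored contiguously.
    """
    matches = []
    transfers = 0
    seeks = 1
    for block in blocks:
        transfers += 1
        for record in block:
            if predicate(record):
                matches.append(record)
    return matches, transfers, seeks


# Hypothetical account file: 3 blocks of (account_number, balance) records.
blocks = [
    [("A-101", 500), ("A-102", 400)],
    [("A-201", 900), ("A-215", 700)],
    [("A-217", 750), ("A-305", 350)],
]
matches, transfers, seeks = linear_search(blocks, lambda rec: rec[1] > 600)
```

Every block is read no matter how few records match, so the transfer count equals the number of blocks in the file.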
Algorithms for Selection (Cont.)
A2 (binary search). Applicable only if the selection is an
Selections Using Indices
Index scan – search algorithms that use an index
Database Index
Data is stored in the form of records. Every record has a key
field, which allows it to be identified uniquely
Search key: an attribute or set of attributes used to look up records
in a file.
An index file consists of records (called index entries) of the
form (search-key, pointer)
Index files are typically much smaller than the original file
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order
Hash indices: search keys are distributed uniformly across
“buckets” using a “hash function”.
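The two kinds of index can be contrasted with a small in-memory sketch. Here a pointer is just a position in a Python list, the bucket count is arbitrary, and all names and data are hypothetical.

```python
import bisect

# Hypothetical data file: a list of (search_key, payload) records.
data_file = [(30, "c"), (10, "a"), (40, "d"), (20, "b")]

# Ordered index: (search-key, pointer) entries kept sorted on the search key.
ordered_index = sorted((key, pos) for pos, (key, _) in enumerate(data_file))
ordered_keys = [key for key, _ in ordered_index]

# Hash index: search keys spread over buckets by a hash function.
N_BUCKETS = 4
hash_index = [[] for _ in range(N_BUCKETS)]
for pos, (key, _) in enumerate(data_file):
    hash_index[hash(key) % N_BUCKETS].append((key, pos))


def lookup_ordered(key):
    """Binary-search the sorted entries, then follow the pointer."""
    i = bisect.bisect_left(ordered_keys, key)
    if i < len(ordered_keys) and ordered_keys[i] == key:
        return data_file[ordered_index[i][1]]
    return None


def lookup_hash(key):
    """Search only the bucket the key hashes to."""
    for entry_key, pos in hash_index[hash(key) % N_BUCKETS]:
        if entry_key == key:
            return data_file[pos]
    return None


def range_ordered(low, high):
    """Range search: a contiguous run of entries in the ordered index."""
    start = bisect.bisect_left(ordered_keys, low)
    stop = bisect.bisect_right(ordered_keys, high)
    return [data_file[pos] for _, pos in ordered_index[start:stop]]
```

Both indices answer equality lookups, but only the ordered index supports the range search without examining every bucket.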
Index Evaluation Metrics
Access types supported efficiently.
Equality searches – records with a specified value in
an attribute.
Range searches – records with an attribute value falling
within a specified range
Access time: time to find and access records in a file
Insertion time: time to insert a new record
Deletion time: time to delete a record
Space overhead: how much extra space is needed for the
index itself.
Ordered Indices
In an ordered index, index entries are stored sorted on
the search key value
E.g. the author catalog in a library
Primary index: in a sequentially ordered file, the
index whose search key specifies the sequential order
of the actual file.
Also called clustering index
Secondary index: an index whose search key specifies
an order different from the sequential order of the file.
Also called non-clustering index.
Dense Index Files
Dense index: an index record appears for every search-key value in the file.
Lookups are faster, but more space is required to
store the index itself.
E.g. index on ID
Dense Index Files (Cont.)
Dense index on dept_name, with instructor file sorted on dept_name
The index does not hold a pointer to every record, only one pointer per search-key value
Sparse Index Files
Sparse Index: contains index records for only some search-
key values
To locate a record with search-key value K:
Find the index record with the largest search-key value ≤ K
Search the file sequentially starting at the record to which the
index record points
You reach the nearest index record, then follow its pointer.
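The lookup procedure can be sketched as below. The data file, its block size of three records, and the choice of one index entry per block are assumptions for illustration. The sketch takes the largest index key that does not exceed K, so a key that itself appears in the index is found directly.

```python
import bisect

# Hypothetical sorted data file, split into blocks of three records each.
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Sparse index: one entry per block, holding the first key in that block.
# The entry's position in this list doubles as the pointer to the block.
sparse_keys = [block[0][0] for block in blocks]


def sparse_lookup(key):
    """Find the last index entry whose key is <= key, then scan forward."""
    entry = bisect.bisect_right(sparse_keys, key) - 1
    if entry < 0:
        return None  # key is smaller than every key in the file
    for block in blocks[entry:]:
        for record_key, payload in block:
            if record_key == key:
                return (record_key, payload)
            if record_key > key:
                return None  # passed the place where key would be stored
    return None
```

For example, looking up 50 lands on the index entry for key 40 and scans forward two records within that block.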
Multilevel Index
If primary index does not fit in memory, access becomes
expensive.
Solution: treat primary index kept on disk as a sequential file
and construct a sparse index on it.
outer index – a sparse index of primary index
inner index – the primary index file
If even outer index is too large to fit in main memory, yet
another level of index can be created, and so on.
Indices at all levels must be updated on insertion or deletion
from the file.
B+-Tree Index Files
All the data is stored in the leaf nodes.