ch2 PDF
12/10/2024
Query processing & optimization 1
Query Processing…
Refers to the range of activities involved in
extracting data from a database.
This includes:
translation of high-level queries into low-level expressions that can be used
at the physical level of the file system,
query optimization, and
actual execution of the query to get the result.
Query Processing & Optimization Process
[Figure: a Query passes through the Scanner and Parser to an Internal representation; the Optimizer works with Execution Strategies; the DBMS uses the Data to produce the Answer]
Query Processing…
The scanner identifies the query tokens, such as
SQL keywords,
attribute names, and
relation names, that appear in the text of the query.
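The scanning step can be sketched in a few lines. This is only an illustration, not how any particular DBMS does it: the keyword set, the token kinds and the `scan` function are all assumptions made for this example. Note that the scanner alone cannot tell an attribute name from a relation name; both come out as `NAME` tokens and are resolved later.

```python
import re

# Hypothetical, deliberately tiny keyword set; a real SQL scanner knows many more.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}

# One named group per token kind (names are assumptions for this sketch).
TOKEN_PATTERN = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<NAME>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<OP><=|>=|<>|=|<|>)
  | (?P<PUNCT>[(),.*;])
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def scan(query):
    """Return a list of (kind, text) tokens for a simple SQL query string."""
    tokens = []
    pos = 0
    while pos < len(query):
        match = TOKEN_PATTERN.match(query, pos)
        if match is None:
            raise ValueError(f"unexpected character {query[pos]!r} at position {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not a token itself
        if kind == "NAME" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens


tokens = scan("SELECT balance FROM account WHERE balance > 2500")
```

Here `tokens` holds eight entries: three keywords, three names, one operator and one number.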
Query Processing…
An internal representation of the query is then created,
Query Processing…Query tree
SQL query:
Select balance
From account
Where balance > 2500
π_balance(σ_balance>2500(account))
σ_balance>2500(π_balance(account))
Both are equivalent queries, i.e. they display the same results.
Generally, Basic Steps in Query Processing
[Figure: the basic steps in query processing, ending with Evaluation]
Steps in Query Processing
1. Parsing and translation
translate the query into its internal form.
This is then translated into relational algebra.
2. Optimization
The query optimizer translates a relational algebra expression
into an execution plan.
A relational algebra expression may have many equivalent
expressions, each of which gives rise to a different evaluation
plan.
Amongst all equivalent evaluation plans choose the one with
lowest cost.
2. Optimization…
An annotated expression specifying a detailed evaluation
strategy is called an evaluation plan.
Query Optimization: Amongst all equivalent evaluation
plans choose the one with lowest cost.
σ_balance<2500(π_balance(account))
π_balance(σ_balance<2500(account))
Operations and Costs
Operations: σ, π, ∪, ∩, −, ×, ⋈
Costs:
N_R : number of records in R
L_R : size of a record in R
F_R : blocking factor
• number of records in a page
B_R : number of pages to store relation R
V(A,R) : number of distinct values of attribute A in R
SC(A,R) : selection cardinality of A in R
• A key: SC(A,R) = 1
• A nonkey: SC(A,R) = N_R / V(A,R)
HT_i : number of levels in index i
Fractions and logarithms are rounded up.
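A minimal sketch of how these statistics combine, under stated assumptions: records are not split across pages, so F_R is the page size divided by L_R rounded down, and B_R is N_R / F_R rounded up. The function names and the sample numbers (4096-byte pages, 200-byte records) are hypothetical and only serve the illustration.

```python
import math


def blocking_factor(page_size, record_size):
    """F_R: whole records per page (assumes records never span pages)."""
    return page_size // record_size


def pages_needed(num_records, f_r):
    """B_R: pages needed to store N_R records, rounded up."""
    return math.ceil(num_records / f_r)


def selection_cardinality(num_records, distinct_values, is_key):
    """SC(A,R): 1 for a key, otherwise N_R / V(A,R) rounded up.

    The non-key case assumes values are uniformly distributed.
    """
    if is_key:
        return 1
    return math.ceil(num_records / distinct_values)


# Hypothetical relation: N_R = 10000 records of L_R = 200 bytes on 4096-byte pages.
f_r = blocking_factor(4096, 200)
b_r = pages_needed(10000, f_r)
sc_nonkey = selection_cardinality(10000, 50, is_key=False)
sc_key = selection_cardinality(10000, 10000, is_key=True)
```

With these numbers F_R is 20, B_R is 500, and a non-key attribute with 50 distinct values has a selection cardinality of 200.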
Steps in Query Processing
3) Evaluation
The query-execution engine takes a query-
evaluation plan,
• executes that plan, and
• returns the answers to the query.
Query Processing vs. Optimization
Query Processing
Query Optimization
[Figure: PROJECT applied to a relation]
PROJECT can produce many tuples with the same value.
Relational algebra removes these duplicates; SQL does not, which is one difference between formal and actual query languages.
Relational Algebra: Select
σ_<predicate>(R)
<predicate> is a conditional expression of the type
that we are familiar with from conventional
programming languages
<attribute> <op> <attribute>
<attribute> <op> <constant>
attribute ∈ R
op ∈ {=, ≠, <, >, ≤, …, AND, OR}
Pictorially
Movie
title          year  length  filmType
Star Wars      1977  124     color
Mighty Ducks   1991  104     color
Wayne's World  1992  95      color
[Figure: selection takes i tuples over attributes A1, A2, A3, …, An to a result set of j tuples over the same attributes, with i ≥ j]
The number of selected tuples is referred to as the selectivity of the condition.
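As a rough sketch, selection can be written as a filter over a list of tuples. The `select` helper and the predicate used here (length of at least 100 and year after 1980) are assumptions chosen for illustration; only the Movie rows come from the slide.

```python
def select(relation, predicate):
    """σ_predicate(relation): keep the rows for which the predicate holds."""
    return [row for row in relation if predicate(row)]


movie = [
    {"title": "Star Wars", "year": 1977, "length": 124, "filmType": "color"},
    {"title": "Mighty Ducks", "year": 1991, "length": 104, "filmType": "color"},
    {"title": "Wayne's World", "year": 1992, "length": 95, "filmType": "color"},
]

# <attribute> <op> <constant> comparisons combined with AND.
result_set = select(movie, lambda row: row["length"] >= 100 and row["year"] > 1980)

# Number of selected tuples, which the slide calls the selectivity.
selected_count = len(result_set)
```

The schema of the result is the same as the schema of the input; only the number of rows changes.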
Cartesian Product
R × S
Sets of all pairs that can be formed by choosing the first
element of the pair to be any element of R, the second
any element of S.
Resulting schema may be ambiguous
Use R.A or S.A to disambiguate an attribute that occurs in both
schemas
R:
A  B
1  2
3  4

S:
B  C   D
2  5   6
4  7   8
9  10  11

R × S:
A  R.B  S.B  C   D
1  2    2    5   6
1  2    4    7   8
1  2    9    10  11
3  4    2    5   6
3  4    4    7   8
3  4    9    10  11
Join Operations
R ⋈ S
Match only those tuples from R and S that agree in
whatever attributes are common to the schemas of R
and S.
If tuples r from R and s from S are successfully paired, the result is
called a joined tuple.
This join operation is the same one we used in an earlier
section to recombine relations that had been projected
onto two subsets of their attributes (e.g., as a result of a
BCNF decomposition).
Example
R:
A  B
1  2
3  4

S:
B  C   D
2  5   6
4  7   8
9  10  11

R join S:
A  B  C  D
1  2  5  6
3  4  7  8
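The two operations can be compared on the same R and S with a short sketch. The helper names are assumptions, and for simplicity the Cartesian product here prefixes every attribute with its relation name, not only the ambiguous one.

```python
from itertools import product


def cartesian_product(r_rows, r_name, s_rows, s_name):
    """R × S: every pairing of a row of R with a row of S."""
    result = []
    for r, s in product(r_rows, s_rows):
        row = {f"{r_name}.{attr}": value for attr, value in r.items()}
        row.update({f"{s_name}.{attr}": value for attr, value in s.items()})
        result.append(row)
    return result


def natural_join(r_rows, s_rows):
    """R ⋈ S: pair rows that agree on all attributes common to both schemas."""
    result = []
    for r, s in product(r_rows, s_rows):
        common = r.keys() & s.keys()
        if all(r[attr] == s[attr] for attr in common):
            result.append({**r, **s})  # common attributes appear once
    return result


R = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]
S = [{"B": 2, "C": 5, "D": 6}, {"B": 4, "C": 7, "D": 8}, {"B": 9, "C": 10, "D": 11}]

pairs = cartesian_product(R, "R", S, "S")
joined = natural_join(R, S)
```

The product has 2 × 3 = 6 rows, while the join keeps only the two pairings that agree on B; the S tuple with B = 9 finds no partner.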
Relational Algebra - can be used to describe plans
π_B,D [σ_R.A=“c” ∧ S.E=2 ∧ R.C=S.C (R × S)]
Plan tree, root first:
π_B,D
  σ_R.A=“c” ∧ S.E=2 ∧ R.C=S.C
    ×
      R
      S
General syntax query parser
Translating SQL into Relational Algebra
π_A1,…,An(σ_P(R1 × … × Rk))
Tree Representation of Relational Algebra
π_A1,…,An(σ_P(R1 × … × Rk))
Tree, root first:
π_A1,…,An
  σ_P
    ×
      ×
        ×
          R1
          R2
        R3
      … Rk
query parser Example
SELECT balance
FROM account
WHERE balance<2500
π_balance(σ_balance<2500(account))
Tree, root first:
π_balance
  σ_balance<2500
    account
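One way to picture the parser's output is as a small tree of operator nodes that can be evaluated bottom-up. The sketch below is an assumption-laden illustration: the node classes, the `evaluate` function and the sample account rows are all hypothetical, and projection removes duplicates as relational algebra prescribes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Relation:
    name: str
    rows: List[dict]


@dataclass
class Select:
    predicate: Callable[[dict], bool]
    child: object


@dataclass
class Project:
    attributes: List[str]
    child: object


def evaluate(node):
    """Evaluate a query tree bottom-up and return a list of rows."""
    if isinstance(node, Relation):
        return node.rows
    if isinstance(node, Select):
        return [row for row in evaluate(node.child) if node.predicate(row)]
    if isinstance(node, Project):
        seen, result = set(), []
        for row in evaluate(node.child):
            key = tuple(row[attr] for attr in node.attributes)
            if key not in seen:  # relational algebra removes duplicates
                seen.add(key)
                result.append(dict(zip(node.attributes, key)))
        return result
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Hypothetical account rows, invented for this example.
account = Relation("account", [
    {"account_number": "A-101", "balance": 500},
    {"account_number": "A-102", "balance": 3000},
    {"account_number": "A-103", "balance": 1200},
    {"account_number": "A-104", "balance": 500},
])

# π_balance(σ_balance<2500(account))
tree = Project(["balance"], Select(lambda row: row["balance"] < 2500, account))
balances = evaluate(tree)
```

Because the tree is plain data, an optimizer could rearrange the nodes before `evaluate` is called.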
Making An Evaluation Plan
Query Evaluation Plan (or simply Plan): a tree of
relational algebra operators with a choice of algorithm for each operator.
The σ-π-join operators essentially form the basic block, while the
remaining operators are carried out on its result.
An evaluation plan defines exactly what
algorithm is used for each operation, and how
the execution of the operations is coordinated.
A query plan presents a specific order of operations
for executing a query.
Query Evaluation Plan
Used to fully specify how to evaluate a query, each
operation in the query tree is annotated with instructions
which specify the algorithm or the index to be used to
evaluate that operation.
Query Optimization: Amongst all equivalent evaluation
plans choose the one with lowest cost.
Cost is estimated using statistical information from the
database catalog
e.g. number of tuples in each relation, size of tuples, etc.
Query Evaluation
How is each individual relational operation evaluated?
Relational Algebra - can be used to describe plans
[Figure: plan trees over R and S, with π_B,D at the root and × above the leaves]
methods of query optimization
There are two methods of query optimization.
Cost based Optimization (Physical)
This is based on the cost of the query.
The query can use different paths based on indexes,
constraints, sorting methods etc.
This method mainly uses the statistics like :
• record size,
• number of records,
• number of records per block,
• number of blocks,
• table size,
• whether whole table fits in a block,
• organization of tables,
• uniqueness of column values,
• size of columns, etc.
methods of query optimization
Heuristic Optimization (Logical)
This method is also known as rule-based optimization.
It is based on the equivalence rules on relational
expressions;
• hence the number of query combinations to consider is reduced.
• Hence the cost of the query is reduced too.
This method creates a relational tree for the given query based
on the equivalence rules.
By providing an alternative way of
writing and evaluating the query, the equivalence rules give a better path to
evaluate the query.
A rule need not give an improvement in all cases.
The result needs to be examined after applying the rules.
Example
SELECT Lname
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE Pname='Aquarius' AND Pnumber=Pno AND Essn=Ssn AND Bdate>'1957-12-31';
Cont…
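1. Conjunctive selection operations can be deconstructed into a sequence of
individual selections.
σ_θ1∧θ2(E) = σ_θ1(σ_θ2(E))
2. Selection operations are commutative.
σ_θ1(σ_θ2(E)) = σ_θ2(σ_θ1(E))
3. Only the last in a sequence of projection operations is needed; the others can be omitted.
π_L1(π_L2(…(π_Ln(E))…)) = π_L1(E)
4. Selections can be combined with Cartesian products and theta joins.
a. σ_θ(E1 × E2) = E1 ⋈_θ E2
b. σ_θ1(E1 ⋈_θ2 E2) = E1 ⋈_θ1∧θ2 E2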
Cont…
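5. Theta-join operations (and natural joins) are commutative.
E1 ⋈_θ E2 = E2 ⋈_θ E1
6. (a) Natural join operations are associative:
(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
(b) Theta joins are associative in the following manner:
(E1 ⋈_θ1 E2) ⋈_θ2∧θ3 E3 = E1 ⋈_θ1∧θ3 (E2 ⋈_θ2 E3)
where θ2 involves attributes from only E2 and E3.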
Cont…
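7. The selection operation distributes over the theta join operation under
the following two conditions:
(a) When all the attributes in θ0 involve only the attributes of one
of the expressions (E1) being joined:
σ_θ0(E1 ⋈_θ E2) = (σ_θ0(E1)) ⋈_θ E2
(b) When θ1 involves only the attributes of E1 and θ2 involves only the attributes of E2:
σ_θ1∧θ2(E1 ⋈_θ E2) = (σ_θ1(E1)) ⋈_θ (σ_θ2(E2))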
Cont…
8. The projection operation distributes over the theta join operation as
follows:
(a) if θ involves only attributes from L1 ∪ L2:
π_L1∪L2(E1 ⋈_θ E2) = (π_L1(E1)) ⋈_θ (π_L2(E2))
(b) in general, where L3 and L4 are the attributes of E1 and E2 that appear in θ but not in L1 ∪ L2:
π_L1∪L2(E1 ⋈_θ E2) = π_L1∪L2((π_L1∪L3(E1)) ⋈_θ (π_L2∪L4(E2)))
Cont…
9. The set operations union and intersection are commutative
E1 ∪ E2 = E2 ∪ E1
E1 ∩ E2 = E2 ∩ E1
(set difference is not commutative).
10. Set union and intersection are associative.
(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
(E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)
11. The selection operation distributes over ∪, ∩ and –.
σ_θ(E1 – E2) = σ_θ(E1) – σ_θ(E2)
and similarly for ∪ and ∩ in place of –
Also: σ_θ(E1 – E2) = σ_θ(E1) – E2
and similarly for ∩ in place of –, but not for ∪
12. The projection operation distributes over union
π_L(E1 ∪ E2) = (π_L(E1)) ∪ (π_L(E2))
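Rules 11 and 12 can be checked on a small example by modelling relations as Python sets of tuples. This is a spot check on hypothetical data, not a proof; the relations E1 and E2, the predicate and the helper names are all invented for the illustration.

```python
def select(rows, predicate):
    """σ_θ over a relation modelled as a set of tuples."""
    return {row for row in rows if predicate(row)}


def project(rows, positions):
    """π_L over a set of tuples; the set removes duplicates."""
    return {tuple(row[i] for i in positions) for row in rows}


# Hypothetical union-compatible relations.
E1 = {(1, "a"), (2, "b"), (3, "c")}
E2 = {(2, "b"), (0, "z"), (4, "d")}


def theta(row):
    return row[0] >= 2


# Rule 11: σ_θ(E1 – E2) = σ_θ(E1) – σ_θ(E2)
rule_11_difference = select(E1 - E2, theta) == select(E1, theta) - select(E2, theta)
# Also: σ_θ(E1 – E2) = σ_θ(E1) – E2, and the same for ∩
rule_11_difference_short = select(E1 - E2, theta) == select(E1, theta) - E2
rule_11_intersection_short = select(E1 & E2, theta) == select(E1, theta) & E2
# ... but not for ∪: the row (0, "z") of E2 fails θ yet survives on the right.
rule_11_union_short = select(E1 | E2, theta) == select(E1, theta) | E2
# Rule 12: π_L(E1 ∪ E2) = π_L(E1) ∪ π_L(E2)
rule_12 = project(E1 | E2, [0]) == project(E1, [0]) | project(E2, [0])
```

All flags come out true except `rule_11_union_short`, which matches the "but not for ∪" remark in rule 11.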
Algebraic Laws
Commutative and Associative Laws
R ∪ S = S ∪ R        R ∪ (S ∪ T) = (R ∪ S) ∪ T
R ∩ S = S ∩ R        R ∩ (S ∩ T) = (R ∩ S) ∩ T
Laws involving selection:
σ_C AND C'(R) = σ_C(σ_C'(R)) = σ_C(R) ∩ σ_C'(R)
σ_C OR C'(R) = σ_C(R) ∪ σ_C'(R)
σ_C(R ⋈ S) = σ_C(R) ⋈ S
• When C involves only attributes of R
σ_C(R – S) = σ_C(R) – S
σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)
σ_C(R ∩ S) = σ_C(R) ∩ S
Initial Logical Plan
Select B,D
From R,S
Where R.A = “c” ∧ R.C = S.C
Plan tree, root first:
π_B,D
  σ_R.A=“c” ∧ R.C=S.C
    ×
      R
      S
Apply Rewrite Rule (1)
Before:
π_B,D(σ_R.A=“c” ∧ R.C=S.C(R × S))
After:
π_B,D(σ_R.C=S.C(σ_R.A=“c”(R × S)))
Apply Rewrite Rule (2)
Before:
π_B,D(σ_R.C=S.C(σ_R.A=“c”(R × S)))
After:
π_B,D [σ_R.C=S.C [σ_R.A=“c”(R) × S]]
Apply Rewrite Rule (3)
Before:
π_B,D [σ_R.C=S.C [σ_R.A=“c”(R) × S]]
After (natural join):
π_B,D [[σ_R.A=“c”(R)] ⋈ S]
• How do we execute this query?
Select B,D
From R,S
Where R.A = “c” ∧ S.E = 2 ∧ R.C = S.C
One idea:
- Do Cartesian product
- Select tuples
- Do projection
R:
A  B  C
a  1  10
b  1  20
c  2  10
d  2  35
e  3  45

S:
C   D  E
10  x  2
20  y  2
30  z  2
40  x  1
50  y  3

Select B,D
From R,S
Where R.A = “c” ∧ S.E = 2 ∧ R.C = S.C

Answer:
B  D
2  x
An Example (cont.)
Plan 1
Cross product of R & S
Select tuples using WHERE conditions
Project on B & D
Algebra expression
π_B,D [σ_R.A=“c” ∧ S.E=2 ∧ R.C=S.C (R × S)]
Select B,D
From R,S
Where R.A = “c” ∧ S.E = 2 ∧ R.C = S.C

R × S:
R.A  R.B  R.C  S.C  S.D  S.E
a    1    10   10   x    2
a    1    10   20   y    2
.
.
c    2    10   10   x    2    ← Found! Got one...
.
.
An Example (cont.)
Plan 2
Select R tuples with R.A=“c”
Select S tuples with S.E=2
Natural join
Project B & D
π_B,D [σ_R.A=“c”(R) ⋈ σ_S.E=2(S)]
Relational Algebra Primer
Another idea:
Select B,D
From R,S
Where R.A = “c” ∧ S.E = 2 ∧ R.C = S.C
Plan II, root first:
π_B,D
  natural join
    σ_R.A=“c”
      R(A,B,C)
    σ_S.E=2
      S(C,D,E)
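The two plans can be run side by side on the R and S tables given earlier to see that they return the same answer while touching very different numbers of tuple pairs. This is a sketch only: the function names are assumptions, and "pairs examined" is used here as a crude stand-in for cost, not the disk-based measure a real optimizer would use.

```python
from itertools import product

# Rows as tuples: R(A, B, C) and S(C, D, E), copied from the example tables.
R = [("a", 1, 10), ("b", 1, 20), ("c", 2, 10), ("d", 2, 35), ("e", 3, 45)]
S = [(10, "x", 2), (20, "y", 2), (30, "z", 2), (40, "x", 1), (50, "y", 3)]


def plan_1(r_rows, s_rows):
    """Cartesian product, then select, then project on B and D."""
    pairs = list(product(r_rows, s_rows))
    selected = [(r, s) for r, s in pairs
                if r[0] == "c" and s[2] == 2 and r[2] == s[0]]
    answer = [(r[1], s[1]) for r, s in selected]
    return answer, len(pairs)


def plan_2(r_rows, s_rows):
    """Select on each relation first, then join on C, then project."""
    r_selected = [r for r in r_rows if r[0] == "c"]
    s_selected = [s for s in s_rows if s[2] == 2]
    joined = [(r, s) for r, s in product(r_selected, s_selected) if r[2] == s[0]]
    answer = [(r[1], s[1]) for r, s in joined]
    return answer, len(r_selected) * len(s_selected)


answer_1, pairs_1 = plan_1(R, S)
answer_2, pairs_2 = plan_2(R, S)
```

Both plans return the single tuple (2, x). Plan 1 builds all 25 pairs of R × S, whereas Plan 2 pairs the one selected R tuple with the three selected S tuples.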
Measures of Query Cost
Cost is generally measured as total elapsed time for
answering query
Many factors contribute to time cost
• disk accesses, CPU, or even network communication
Typically disk access is the predominant cost, and is
also relatively easy to estimate. Measured by taking
into account
Number of seeks * average-seek-cost
Number of blocks read * average-block-read-cost
Number of blocks written* average-block-write-cost
• The cost to write a block is greater than the cost to read a
block
Measures of Query Cost (Cont.)
For simplicity we just use the number of block transfers from
disk and the number of seeks as the cost measures
tT – time to transfer one block
tS – time for one seek
Cost for b block transfers plus S seeks:
b * tT + S * tS
We do not include the cost of writing output to disk in the cost
formulae
We ignore CPU costs for simplicity, as they tend to be much
lower
Real systems do take CPU cost into account, but it is less
significant than disk cost
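The cost formula b * tT + S * tS translates directly into code. The function name and the timing figures below (100 microseconds per block transfer, 4000 microseconds per seek) are hypothetical values picked only to make the arithmetic concrete.

```python
def disk_cost(block_transfers, seeks, t_transfer, t_seek):
    """Cost of b block transfers plus S seeks: b * tT + S * tS."""
    return block_transfers * t_transfer + seeks * t_seek


# Hypothetical timings in microseconds.
T_TRANSFER_US = 100
T_SEEK_US = 4000

# Reading 100 consecutive blocks after a single seek.
sequential_cost = disk_cost(100, 1, T_TRANSFER_US, T_SEEK_US)

# Reading the same 100 blocks with a seek before each one.
scattered_cost = disk_cost(100, 100, T_TRANSFER_US, T_SEEK_US)
```

Under these assumed timings the scattered read costs 410000 microseconds against 14000 for the sequential read, which is why the number of seeks is tracked separately from the number of block transfers.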
Algorithms for Selection Operation
File scan – search algorithms that locate and retrieve
records that fulfill a selection condition.
Algorithm A1 (linear search). Scan each file block and
test all records to see whether they satisfy the selection
condition.
Cost estimate = b_r block transfers + 1 seek,
where b_r is the number of blocks in the file.
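A1 can be sketched by modelling the file as a list of blocks, each block being a list of records, and counting one block transfer per block read. The data layout, the function name and the sample account records are assumptions for illustration; the single seek assumes the blocks are stored contiguously.

```python
def linear_search(blocks, predicate):
    """A1: scan every block and test every record against the condition.

    Returns (matching records, block transfers, seeks).
    """
    matches = []
    block_transfers = 0
    for block in blocks:
        block_transfers += 1  # each block is read exactly once
        for record in block:
            if predicate(record):
                matches.append(record)
    seeks = 1 if blocks else 0  # one initial seek, assuming contiguous blocks
    return matches, block_transfers, seeks


# Hypothetical account file: 3 blocks of 2 (account number, balance) records.
account_file = [
    [("A-101", 500), ("A-102", 3000)],
    [("A-103", 1200), ("A-104", 2600)],
    [("A-105", 900), ("A-106", 4100)],
]

matches, transfers, seeks = linear_search(account_file, lambda rec: rec[1] > 2500)
```

The scan always reads all three blocks no matter how many records match, so the cost depends only on the file size.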
Algorithms for Selection (Cont.)
A2 (binary search). Applicable only if the selection is an
Selections Using Indices
Index scan – search algorithms that use an index
Database Index
Data is stored in the form of records. Every record has a key
field, which helps it to be recognized uniquely.
Search key - attribute or set of attributes used to look up records
in a file.
An index file consists of records (called index entries) of the
form (search-key, pointer).
Index files are typically much smaller than the original file.
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order
Hash indices: search keys are distributed uniformly across
“buckets” using a “hash function”.
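The two kinds of index can be contrasted with a small in-memory sketch. It is an analogy, not a disk-based implementation: a Python `dict` stands in for the hash buckets, a sorted list searched with `bisect` stands in for the ordered index, and the "pointer" is simply a position in the record list. The records and function names are hypothetical.

```python
import bisect

# Hypothetical unordered file of (account number, balance) records.
records = [("A-217", 750), ("A-101", 500), ("A-305", 350), ("A-110", 600)]

# Hash index: search key -> pointer (position in the file).
hash_index = {key: pos for pos, (key, _) in enumerate(records)}

# Ordered index: (search-key, pointer) entries kept sorted on the search key.
ordered_index = sorted((key, pos) for pos, (key, _) in enumerate(records))
ordered_keys = [key for key, _ in ordered_index]


def lookup_hash(key):
    """Equality search through the hash index."""
    pos = hash_index.get(key)
    return None if pos is None else records[pos]


def lookup_ordered(key):
    """Equality search through the ordered index (binary search)."""
    i = bisect.bisect_left(ordered_keys, key)
    if i < len(ordered_keys) and ordered_keys[i] == key:
        return records[ordered_index[i][1]]
    return None


def range_ordered(low, high):
    """Range search low <= key <= high; natural for an ordered index."""
    start = bisect.bisect_left(ordered_keys, low)
    stop = bisect.bisect_right(ordered_keys, high)
    return [records[pos] for _, pos in ordered_index[start:stop]]


in_range = range_ordered("A-100", "A-200")
```

Both indices answer equality searches, but only the ordered one supports the range search without examining every entry.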
Index Evaluation Metrics
Access types supported efficiently.
Equality searches – records with a specified value in
an attribute.
Range searches – records with an attribute value falling
within a specified range
Access time: time to find and retrieve a record
Insertion time: time to insert a new record
Deletion time: time to delete a record
Space overhead: how many extra bytes are needed for the
index itself.
Ordered Indices
In an ordered index, index entries are stored sorted on
the search key value
E.g. the author catalog in a library
Primary index: in a sequentially ordered file, the
index whose search key specifies the sequential order
of the actual file.
Also called clustering index
Secondary index: an index whose search key specifies
an order different from the sequential order of the file.
Also called non-clustering index.
Dense Index Files
Dense index: an index record appears for every search-key value in
the file, i.e. there is an entry for every possible search-key value. Lookups are faster, but
more space is required to store the index itself.
E.g. index on ID
Dense Index Files (Cont.)
Dense index on dept_name, with instructor file sorted on
dept_name
The index does not have a pointer to every record, only one pointer for
each search-key value
Sparse Index Files
Sparse Index: contains index records for only some search-
key values
To locate a record with search-key value K :
Find index record with largest search-key value < K
Search file sequentially starting at the record to which the
index record points
You reach the nearest index record, then follow its pointer and scan forward.
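The lookup procedure can be sketched as follows, under several assumptions: the file is sorted on the search key and split into blocks, the sparse index holds one entry per block (the first key of the block), and the lookup starts from the largest indexed key that is less than or equal to K, a common variant of the rule stated above. All names and data are hypothetical.

```python
import bisect

# Hypothetical file sorted on the search key, two records per block.
blocks = [
    [(10, "r10"), (20, "r20")],
    [(30, "r30"), (40, "r40")],
    [(50, "r50"), (60, "r60")],
]

# Sparse index: one (search-key, block number) entry per block.
sparse_index = [(block[0][0], number) for number, block in enumerate(blocks)]
index_keys = [key for key, _ in sparse_index]


def lookup(key):
    """Find the record with the given search key, or None."""
    # Largest index entry whose key is <= the key being looked up.
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None  # smaller than every key in the file
    start_block = sparse_index[i][1]
    # Scan the file sequentially from the block the entry points to.
    for block in blocks[start_block:]:
        for record_key, value in block:
            if record_key == key:
                return (record_key, value)
            if record_key > key:
                return None  # passed the place where it would be
    return None


found = lookup(40)
missing = lookup(35)
```

Looking up 40 uses the index entry for 30, then scans forward one record; looking up 35 stops as soon as the scan reaches 40.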
Multilevel Index
If primary index does not fit in memory, access becomes
expensive.
Solution: treat primary index kept on disk as a sequential file
and construct a sparse index on it.
outer index – a sparse index of primary index
inner index – the primary index file
If even outer index is too large to fit in main memory, yet
another level of index can be created, and so on.
Indices at all levels must be updated on insertion or deletion
from the file.
B+-Tree Index Files
All the data is stored in the leaf nodes.
at a node.