
Chapter - 2 Query Processing

This document covers query processing and optimization in database systems, detailing the steps involved in transforming high-level queries into efficient execution strategies. It discusses various optimization techniques, including heuristic and cost-based methods, as well as the importance of query decomposition, semantic analysis, and restructuring. Additionally, it emphasizes the significance of minimizing evaluation time and provides examples of query optimization strategies.


Advanced Database Systems (CoSc2042)

Chapter Two

QUERY PROCESSING & OPTIMIZATION


Query Processing and Optimization: Outline
• Query processing
• Operator evaluation strategies
  ◦ Selection
  ◦ Join
• Query optimization
  ◦ Heuristic query optimization
  ◦ Cost-based query optimization
• Query tuning

Overview of Query Processing
• Query processing: the activities involved in parsing, validating, optimizing, and executing a query.
• Aims:
  ◦ To transform a query written in a high-level language, typically SQL, into a correct and efficient execution strategy expressed in a low-level language (implementing the relational algebra), and
  ◦ To execute the strategy to retrieve the required data.

Steps of Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation

• The DBMS has algorithms to implement relational algebra expressions.
• SQL is a high-level, declarative language: it specifies what is wanted, not how it is obtained.

Query optimization: the activity of choosing an efficient execution strategy for processing a query.
• Task: find an efficient physical query plan (also called an execution plan) for an SQL query.
• Goal: minimize the evaluation time for the query, i.e., compute the query result as fast as possible.
• Cost factors: disk accesses (I/O, page transfers for read/write operations); CPU time is typically ignored.
• Optimization: because a query can usually be evaluated in more than one way, find the most efficient evaluation plan.

Examples:

Example 2
• Find all Managers who work at a London branch.
SELECT * FROM Staff s, Branch b WHERE s.branchNo = b.branchNo
AND (s.position = 'Manager' AND b.city = 'London');

The equivalent relational algebra queries corresponding to this SQL statement are:
[Figure: the three different strategies]

Cost Comparison
• Costs (in disk accesses) are:
  (1) (1000 + 50) + 2*(1000 * 50) = 101 050
  (2) 2*1000 + (1000 + 50) = 3 050
  (3) 1000 + 2*50 + 5 + (50 + 5) = 1 160
• The third option significantly reduces the size of the relations being joined together.
• Cartesian product and join operations are much more expensive than selection.
• We will see shortly that one of the fundamental strategies in query processing is to perform the unary operations, Selection and Projection, as early as possible, thereby reducing the operands of any subsequent binary operations.

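The three totals above can be reproduced with a short sketch. The statistics are assumptions read off the numbers in the formulas (1000 Staff tuples, 50 Branch tuples, 50 managers, 5 London branches, one disk access per tuple), and the function names are hypothetical.

```python
# Assumed statistics, inferred from the numbers in the cost formulas.
N_STAFF = 1000     # tuples in Staff
N_BRANCH = 50      # tuples in Branch
N_MANAGERS = 50    # Staff tuples with position = 'Manager'
N_LONDON = 5       # Branch tuples with city = 'London'


def cost_product_then_select():
    # Read both relations, write the Cartesian product, read it back to select.
    return (N_STAFF + N_BRANCH) + 2 * (N_STAFF * N_BRANCH)


def cost_join_then_select():
    # Read both relations to join them, write the join result
    # (one tuple per Staff tuple), then read it back to select.
    return (N_STAFF + N_BRANCH) + 2 * N_STAFF


def cost_select_then_join():
    # Read Staff and write the managers, read Branch and write the London
    # branches, then read both reduced relations to join them.
    return (N_STAFF + N_MANAGERS) + (N_BRANCH + N_LONDON) + (N_MANAGERS + N_LONDON)


costs = [cost_product_then_select(), cost_join_then_select(), cost_select_then_join()]
print(costs)
```

Under these assumptions, selecting first is cheapest because the join only ever sees 50 + 5 tuples.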
Phases of query processing


Query Decomposition
• Transform the high-level query into a relational algebra (RA) query.
• Check that the query is syntactically and semantically correct.
• Typical stages are:
  ◦ Analysis,
  ◦ Normalization,
  ◦ Semantic analysis,
  ◦ Simplification,
  ◦ Query restructuring.

Analysis
• Analyze the query lexically and syntactically using compiler techniques.
• Verify that the relations and attributes exist.
• Verify that the operations are appropriate for the object type.
• Finally, the query is transformed into a query tree constructed as follows:
  ◦ A leaf node for each base relation.
  ◦ A non-leaf node for each intermediate relation produced by an RA operation.
  ◦ The root of the tree represents the query result.
  ◦ The sequence is directed from leaves to root.

Normalization
• Converts the query into a normalized form for easier manipulation.
• The predicate can be converted into one of two forms:
  ◦ Conjunctive normal form:
    (position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')
  ◦ Disjunctive normal form:
    (position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')

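That the two normal forms express the same predicate can be checked with a truth table. This is only a sketch: each atomic comparison is treated as an independent Boolean, and the function and parameter names are hypothetical.

```python
from itertools import product


def cnf(is_manager, high_salary, in_b003):
    # (position = 'Manager' OR salary > 20000) AND (branchNo = 'B003')
    return (is_manager or high_salary) and in_b003


def dnf(is_manager, high_salary, in_b003):
    # (position = 'Manager' AND branchNo = 'B003')
    #   OR (salary > 20000 AND branchNo = 'B003')
    return (is_manager and in_b003) or (high_salary and in_b003)


# Compare both forms on all 8 combinations of truth values.
equivalent = all(cnf(*v) == dnf(*v) for v in product([False, True], repeat=3))
print(equivalent)
```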
Semantic Analysis
• Rejects normalized queries that are incorrectly formulated or contradictory.
• A query is incorrectly formulated if components do not contribute to the generation of the result.
• A query is contradictory if its predicate cannot be satisfied by any tuple.
• Algorithms to determine correctness exist only for queries that do not contain disjunction and negation.
• To detect such queries, use a:
  ◦ connection graph (query graph)
  ◦ join graph

Relation connection graph
• The relation connection graph is not fully connected, so the query is not correctly formulated.
• The join condition (v.propertyNo = p.propertyNo) has been omitted.

Example 2
SELECT Ename, Resp FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno
AND Pname = 'CAD/CAM' AND Dur > 36
AND Title = 'Programmer'

If the query graph is connected, the query is semantically correct.

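The connectivity test can be sketched as a graph traversal. This is a simplified illustration, not a full semantic analyzer: only relations are nodes and only join predicates are edges (the result node of a complete query graph is left out), and `is_connected` is a hypothetical helper.

```python
def is_connected(relations, join_edges):
    """Return True if every relation is reachable from the first one via join edges."""
    if not relations:
        return True
    adjacency = {r: set() for r in relations}
    for a, b in join_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    stack = [relations[0]]
    while stack:  # depth-first traversal
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency[node] - seen)
    return seen == set(relations)


relations = ["Emp", "Works", "Project"]
# Join predicates: Emp.Eno = Works.Eno and Works.Pno = Project.Pno
joins = [("Emp", "Works"), ("Works", "Project")]

connected = is_connected(relations, joins)
# Dropping a join predicate disconnects Project from the rest of the graph.
missing_join = is_connected(relations, [("Emp", "Works")])
print(connected, missing_join)
```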
Simplification
• Detects redundant qualifications,
• Eliminates common sub-expressions,
• Transforms the query to a semantically equivalent but more easily and efficiently computed form.
• Applies well-known transformation rules of Boolean algebra.

Example
• SELECT TITLE FROM Emp E WHERE (NOT (TITLE = 'Programmer') AND TITLE = 'Programmer') OR (TITLE = 'Electrical Eng.' AND NOT (TITLE = 'Electrical Eng.')) OR ENAME = 'J.Doe';
is equivalent to
• SELECT TITLE FROM Emp E WHERE ENAME = 'J.Doe';

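The simplification relies on the Boolean identities p ∧ ¬p = false and false ∨ q = q. A brute-force check under two-valued logic is sketched below; it assumes the attributes are not NULL (SQL's three-valued logic is ignored), and the function names are hypothetical.

```python
from itertools import product


def original(is_programmer, is_elec_eng, is_jdoe):
    # (NOT p1 AND p1) OR (p2 AND NOT p2) OR p3
    return ((not is_programmer and is_programmer)
            or (is_elec_eng and not is_elec_eng)
            or is_jdoe)


def simplified(is_programmer, is_elec_eng, is_jdoe):
    # Only the ENAME = 'J.Doe' comparison survives.
    return is_jdoe


# Compare the two predicates on every combination of truth values.
same = all(original(*v) == simplified(*v)
           for v in product([False, True], repeat=3))
print(same)
```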
Restructuring
• Convert SQL to relational algebra
• Make use of query trees
• Example:
SELECT Ename FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno
AND Ename <> 'J. Doe' AND Pname = 'CAD/CAM'
AND (Dur = 12 OR Dur = 24)

• Query tree:
  ◦ A tree data structure that corresponds to a relational algebra expression.
  ◦ It represents the input relations of the query as leaf nodes of the tree, and the relational algebra operations as internal nodes.
• Query graph:
  ◦ A graph data structure that corresponds to a relational calculus expression.
  ◦ It does not indicate the order in which the operations are performed.
  ◦ There is only a single graph corresponding to each query.

Transformation Rules for RA Operations
1. Conjunctive Selection operations can cascade into individual Selection operations (and vice versa). This is sometimes referred to as cascade of Selection.
   σp∧q∧r(R) = σp(σq(σr(R)))
2. Commutativity of Selection.
   σp(σq(R)) = σq(σp(R))
3. In a sequence of Projection operations, only the last in the sequence is required.
   ∏Col_list1 (∏Col_list2 (… (∏Col_listN (T)) …)) = ∏Col_list1 (T)
   Example:
   ∏STD_ID, STD_NAME (∏STD_ID, STD_NAME, AGE, ADDRESS (∏STD_ID, STD_NAME, AGE, ADDRESS, CLASS_ID, SKILLS (STUDENT))) = ∏STD_ID, STD_NAME (STUDENT)
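4. Commutativity of Selection and Projection. If predicate p involves only attributes in the projection list, Selection and Projection operations commute:
   ∏A1,…,Am(σp(R)) = σp(∏A1,…,Am(R))
5. Commutativity of Theta join (and Cartesian product).
   R ⋈p S = S ⋈p R
   R × S = S × R
   The rule also applies to Equijoin and Natural join.
6. Commutativity of Selection and Theta join (or Cartesian product).
   • If the selection predicate involves only attributes of one of the join relations, Selection and Join (or Cartesian product) operations commute (here p involves only attributes of R):
     σp(R ⋈r S) = (σp(R)) ⋈r S
     σp(R × S) = (σp(R)) × S
   • If the selection predicate is a conjunctive predicate of the form (p ∧ q), where p involves only attributes of R and q only attributes of S, Selection and Theta join operations commute as:
     σp∧q(R ⋈r S) = (σp(R)) ⋈r (σq(S))
     σp∧q(R × S) = (σp(R)) × (σq(S))
7. Commutativity of Projection and Theta join (or Cartesian product). If the projection list is L = L1 ∪ L2, where L1 involves only attributes of R and L2 only attributes of S, and the join condition involves only attributes in L, then:
   ∏L1∪L2(R ⋈r S) = (∏L1(R)) ⋈r (∏L2(S))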
8. Commutativity of Union and Intersection (but not Set difference).
   R ∪ S = S ∪ R
   R ∩ S = S ∩ R
9. Commutativity of Selection and set operations (Union, Intersection, and Set difference).
   σp(R ∪ S) = σp(R) ∪ σp(S)
   σp(R ∩ S) = σp(R) ∩ σp(S)
   σp(R − S) = σp(R) − σp(S)
10. Commutativity of Projection and Union.
   ∏L(R ∪ S) = ∏L(R) ∪ ∏L(S)
11. Associativity of Union and Intersection (but not Set difference).
   (R ∪ S) ∪ T = S ∪ (R ∪ T)
   (R ∩ S) ∩ T = S ∩ (R ∩ T)
12. Associativity of Theta join (and Cartesian product).
   • Cartesian product and Natural join are always associative.

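Several of these rules can be checked on small relations with Python sets. This is a toy model rather than a DBMS implementation: the relations R and S, the predicates p and q, and the helpers `row`, `select` and `project` are hypothetical, and rows are modelled as frozensets of (attribute, value) pairs so that relations are true sets.

```python
def row(**attrs):
    # A tuple of the relation, hashable so it can be stored in a set.
    return frozenset(attrs.items())


def select(pred, rel):
    # Selection: keep the rows that satisfy the predicate.
    return {r for r in rel if pred(dict(r))}


def project(cols, rel):
    # Projection: keep only the listed attributes (duplicates collapse).
    return {frozenset((k, v) for k, v in r if k in cols) for r in rel}


R = {row(id=1, age=20, city="Oslo"), row(id=2, age=35, city="Rome")}
S = {row(id=2, age=35, city="Rome"), row(id=3, age=41, city="Oslo")}
p = lambda t: t["age"] > 30
q = lambda t: t["city"] == "Rome"

# Rule 1: cascade of Selection.
rule1 = select(lambda t: p(t) and q(t), R) == select(p, select(q, R))
# Rule 2: commutativity of Selection.
rule2 = select(p, select(q, R)) == select(q, select(p, R))
# Rule 3: only the last Projection in a sequence is required.
rule3 = project({"id"}, project({"id", "age"}, R)) == project({"id"}, R)
# Rule 9: Selection distributes over Union, Intersection and Set difference.
rule9 = (select(p, R | S) == select(p, R) | select(p, S)
         and select(p, R & S) == select(p, R) & select(p, S)
         and select(p, R - S) == select(p, R) - select(p, S))
# Rule 8 excludes Set difference: operand order matters.
difference_not_commutative = (R - S) != (S - R)

print(rule1, rule2, rule3, rule9, difference_not_commutative)
```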
2. Query Optimization
Optimization: not necessarily "optimal", but reasonably efficient.
Techniques:
• Heuristic rules
  ◦ Query tree (relational algebra) optimization
  ◦ Query graph optimization
• Cost-based (physical) optimization
  ◦ Cost estimation (comparing the costs of different plans)

a. Heuristic-Based Processing Strategies
► Perform Selection operations as early as possible.
► Keep predicates on the same relation together.
► Combine a Cartesian product with a subsequent Selection whose predicate represents a join condition into a Join operation.
► Use associativity of binary operations to rearrange leaf nodes so that the leaf nodes with the most restrictive Selection operations are executed first.
► Perform Projection as early as possible.
► Keep projection attributes on the same relation together.
► Compute common expressions once.
► If a common expression appears more than once, and the result is not too large, store the result and reuse it when required.

Examples
• What are the names of customers living on Elm Street who have checked out "Terminator"?
• SQL query:
SELECT Name FROM Customer CU, CheckedOut CH, Film F
WHERE Title = 'Terminator' AND F.FilmId = CH.FilmID AND
CU.CustomerID = CH.CustomerID AND CU.Street = 'Elm'

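The effect of applying selections early can be illustrated on in-memory lists. This is a toy sketch: the sample rows are invented for illustration, and the two functions are hypothetical stand-ins for an unoptimized and an optimized plan.

```python
from itertools import product

# Invented sample data.
customers = [(1, "Ann", "Elm"), (2, "Bob", "Oak"), (3, "Cy", "Elm")]  # (CustomerID, Name, Street)
films = [(10, "Terminator"), (11, "Alien")]                            # (FilmID, Title)
checked_out = [(1, 10), (2, 10), (3, 11)]                              # (CustomerID, FilmID)


def product_then_select():
    # Cartesian product of all three relations, then one big selection.
    inter = list(product(customers, checked_out, films))
    names = sorted(cu[1] for cu, ch, f in inter
                   if f[1] == "Terminator" and f[0] == ch[1]
                   and cu[0] == ch[0] and cu[2] == "Elm")
    return names, len(inter)


def select_then_join():
    # Apply the selections first, then join the reduced relations.
    elm = [cu for cu in customers if cu[2] == "Elm"]
    terminator = [f for f in films if f[1] == "Terminator"]
    ch_f = [(ch, f) for ch in checked_out for f in terminator if ch[1] == f[0]]
    joined = [(cu, ch, f) for cu in elm for ch, f in ch_f if cu[0] == ch[0]]
    names = sorted(cu[1] for cu, ch, f in joined)
    return names, max(len(ch_f), len(joined))


# Each call returns (result, size of the largest intermediate relation).
print(product_then_select())
print(select_then_join())
```

Both plans return the same names, but the second never builds the full Cartesian product.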
Optimization steps for this query:
1. Apply Selections Early
2. Apply More Restrictive Selections Early
3. Form Joins
4. Apply Projections Early

Cost-Based Optimization
• Statistics on the inputs to each operator are needed.
  ◦ Statistics on leaf relations are stored in the system catalog.
  ◦ Statistics on intermediate relations must be estimated; the most important are the relations' cardinalities.
• Cost can be CPU time, I/O time, communication time, main memory usage, or a combination.
• The candidate query tree with the least total cost is selected for execution.

Measures of Query Cost
• There are many possible ways to estimate cost, e.g., based on disk accesses, CPU time, or communication overhead.
• Disk access cost is the cost of block transfers from/to disks.
  ◦ Simplifying assumption: each block transfer has the same cost.
• The cost of an algorithm (e.g., for join or selection) depends on the database buffer size.
  ◦ More memory for the DB buffer reduces disk accesses.

Selectivity and Cost Estimates in Query Optimization
• Catalog information used in cost functions
  ◦ Information about the size of a file
    - number of records (tuples) (r)
    - record size (R)
    - number of blocks (b)
    - blocking factor (bfr)
  ◦ Information about indexes and indexing attributes of a file
    - number of levels (x) of each multilevel index
    - number of first-level index blocks (bI1)
    - number of distinct values (d) of an attribute
    - selectivity (sl) of an attribute
    - selection cardinality (s) of an attribute (s = sl * r)

Database Statistics
• For each base relation R:
  ◦ nTuples(R): the number of tuples (records) in relation R (that is, its cardinality).
  ◦ bFactor(R): the blocking factor of R (that is, the number of tuples of R that fit into one block).
  ◦ nBlocks(R): the number of blocks required to store R. If the tuples of R are stored physically together, then:
    nBlocks(R) = [nTuples(R)/bFactor(R)]
    We use [x] to indicate that the result of the calculation is rounded up to the smallest integer that is greater than or equal to x.
• For each attribute A of base relation R:
  ◦ nDistinctA(R): the number of distinct values that appear for attribute A in relation R.
  ◦ minA(R), maxA(R): the minimum and maximum possible values for attribute A in relation R.
  ◦ SCA(R): the selection cardinality of attribute A in relation R. This is the average number of tuples that satisfy an equality condition on attribute A.

• For each multilevel index I on attribute set A:
  ◦ nLevelsA(I): the number of levels in I.
  ◦ nLfBlocksA(I): the number of leaf blocks in I.

• The cost of the Selection operation (S = σp(R)) is estimated as follows:
[Table: cost estimates for the Selection strategies]

Example
• For the purposes of this example, we make the following assumptions about the Staff relation:
  ◦ There is a hash index with no overflow on the primary key attribute staffNo.
  ◦ There is a clustering index on the foreign key attribute branchNo.
  ◦ There is a B+-tree index on the salary attribute.
  ◦ The Staff relation has the following statistics stored in the system catalog:
    [Table: Staff statistics]

Q1
• The estimated cost of a linear search on the key attribute staffNo is 50 blocks, and the cost of a linear search on a non-key attribute is 100 blocks.
• Now we consider the following Selection operations, and use the above strategies to improve on these two costs:
  S1: σstaffNo='SG5'(Staff)
  S2: σposition='Manager'(Staff)
  S3: σbranchNo='B003'(Staff)
  S4: σsalary>20000(Staff)

Solution:
S1: This Selection operation contains an equality condition on the primary key. Therefore, as the attribute staffNo is hashed, we can use strategy 3 defined above to estimate the cost as 1 block. The estimated cardinality of the result relation is SCstaffNo(Staff) = 1.
S2: The attribute in the predicate is a non-key, non-indexed attribute, so we cannot improve on the linear search method, giving an estimated cost of 100 blocks. The estimated cardinality of the result relation is SCposition(Staff) = 300.
S3: The attribute in the predicate is a foreign key with a clustering index, so we can use strategy 6 to estimate the cost as 2 + [6/30] = 3 blocks. The estimated cardinality of the result relation is SCbranchNo(Staff) = 6.
S4: The predicate here involves a range search on the salary attribute, which has a B+-tree index, so we can use strategy 7 to estimate the cost as 2 + [50/2] + [3000/2] = 1527 blocks. However, this is significantly worse than the linear search strategy, so in this case we would use the linear search method. The estimated cardinality of the result relation is
SCsalary(Staff) = [3000*(50000-20000)/(50000-10000)] = 2250

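The arithmetic in the solution can be reproduced as follows. The catalog values are assumptions chosen to match the numbers that appear in the worked solution (3000 tuples, blocking factor 30, two index levels, 50 leaf blocks, salaries between 10000 and 50000); they are not taken from a statistics table, and the variable names are hypothetical.

```python
import math

# Assumed Staff statistics, read off the numbers used in the solution.
N_TUPLES = 3000
B_FACTOR = 30
N_BLOCKS = math.ceil(N_TUPLES / B_FACTOR)   # [3000/30] = 100
SC_BRANCH_NO = 6                            # selection cardinality of branchNo
N_LEVELS_BRANCH_NO = 2                      # levels of the clustering index
N_LEVELS_SALARY = 2                         # levels of the B+-tree index
N_LF_BLOCKS_SALARY = 50                     # leaf blocks of the B+-tree index
MIN_SALARY, MAX_SALARY = 10000, 50000

cost_s1 = 1                                 # equality on hash key, no overflow
cost_s2 = N_BLOCKS                          # linear search, non-key attribute
cost_s3 = N_LEVELS_BRANCH_NO + math.ceil(SC_BRANCH_NO / B_FACTOR)
cost_s4_index = (N_LEVELS_SALARY
                 + math.ceil(N_LF_BLOCKS_SALARY / 2)
                 + math.ceil(N_TUPLES / 2))
cost_s4 = min(cost_s4_index, N_BLOCKS)      # fall back to linear search
card_s4 = math.ceil(N_TUPLES * (MAX_SALARY - 20000) / (MAX_SALARY - MIN_SALARY))

print(cost_s1, cost_s2, cost_s3, cost_s4_index, cost_s4, card_s4)
```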
Selection Operation
• σA=a(R), where a is a constant value and A is an attribute of R.
• File scan: search algorithms that locate and retrieve records that satisfy a selection condition.
  ◦ S1 - Linear search: cost(S1) = BR
  ◦ S2 - Binary search, i.e., the file is ordered on attribute A (primary index)

Cost of Operations
• Cost = I/O cost + CPU cost
  ◦ I/O cost: # pages (reads & writes) or # operations (multiple pages)
  ◦ CPU cost: # comparisons or # tuples processed
  ◦ I/O cost dominates (for large databases)
• Cost depends on
  ◦ Types of query conditions
  ◦ Availability of fast access paths
• DBMSs keep statistics for cost estimation

Notations
• Used to describe the cost of operations.
• Relations: R, S
• nR: # tuples in R, nS: # tuples in S
• bR: # pages in R
• dist(R.A): # distinct values in R.A
• min(R.A): smallest value in R.A
• max(R.A): largest value in R.A
• HI: # index pages accessed (B+ tree height?)

Simple Selection
• Simple selection: σA op a(R)
  ◦ A is a single attribute, a is a constant, op is one of =, ≠, <, ≤, >, ≥.
  ◦ We do not discuss ≠ further because it requires a sequential scan of the table.
• How many tuples will be selected?
  ◦ Selectivity factor (SFA op a(R)): the fraction of tuples of R satisfying "A op a"
  ◦ 0 ≤ SFA op a(R) ≤ 1
• # tuples selected: NS = nR × SFA op a(R)

Options of Simple Selection
• Sequential (linear) scan
  ◦ General condition: cost = bR
  ◦ Equality on key: average cost = bR / 2
• Binary search
  ◦ Records are stored in sorted order
  ◦ Equality on key: cost = log2(bR)
  ◦ Equality on non-key (duplicates allowed): cost = log2(bR) + NS/bfR - 1 = sorted search time + selected blocks - first one

Example: Cost of Selection
• Relation: R(A, B, C)
• nR = 10000 tuples
• bfR = 20 tuples/page
• dist(A) = 50, dist(B) = 500
• B+ tree clustering index on A with order 25 (p = 25)
• B+ tree secondary index on B with order 25
• Query: select * from R where A = a1 and B = b1
• Relational algebra: σA=a1 ∧ B=b1 (R)

Example: Cost of Selection (cont.)
• Option 1: Sequential scan
  ◦ Have to go through the entire relation
  ◦ Cost = bR = 10000/20 = 500
• Option 2: Binary search using A = a1
  ◦ The file is sorted on A (why?)
  ◦ NS = 10000/50 = 200, assuming a uniform distribution
  ◦ Cost = log2(bR) + NS/bfR - 1 = log2(500) + 200/20 - 1 ≈ 18

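Both options can be computed directly from the statistics of R. In this sketch the function names are hypothetical, and rounding log2(bR) and NS/bfR up to whole page accesses is an assumption; it reproduces the value of about 18 given above.

```python
import math


def sequential_scan_cost(n_tuples, bf):
    # Pages in the relation: every page is read once.
    return math.ceil(n_tuples / bf)


def binary_search_cost(n_tuples, bf, n_distinct):
    pages = math.ceil(n_tuples / bf)
    n_selected = n_tuples / n_distinct          # NS, uniform distribution assumed
    # Find the first matching page, then read the remaining matching pages.
    return math.ceil(math.log2(pages)) + math.ceil(n_selected / bf) - 1


option1 = sequential_scan_cost(10000, 20)
option2 = binary_search_cost(10000, 20, 50)
print(option1, option2)
```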
Cost of Join
• Cost = # I/Os reading R & S + # I/Os writing the result
• Additional notation:
  ◦ M: # buffer pages available to the join operation
  ◦ LB: # leaf blocks in the B+ tree index
• Limitations of cost estimation
  ◦ Ignoring CPU costs
  ◦ Ignoring timing
  ◦ Ignoring double buffering requirements

Estimate Size of Join Result
• How many tuples are in the join result?
  ◦ Cross product (special case of join): NJ = nR × nS
  ◦ R.A is a foreign key referencing S.B: NJ = nR (assuming no null values)
  ◦ S.B is a foreign key referencing R.A: NJ = nS (assuming no null values)
  ◦ Both R.A and S.B are non-key:
    NJ = min(nR × nS / dist(R.A), nR × nS / dist(S.B))

Estimate Size of Join Result (cont.)
• How wide is a tuple in the join result?
  ◦ Natural join: W = W(R) + W(S) - W(S ∩ R)
  ◦ Theta join: W = W(R) + W(S)
• What is the blocking factor of the join result?
  ◦ bfJoin = block size / W
• How many blocks does the join result have?
  ◦ bJoin = NJ / bfJoin

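The estimates for the non-key case can be combined into a short sketch. All input numbers below (relation sizes, distinct counts, tuple widths in bytes, a 4096-byte block) are hypothetical, as are the function names; rounding the blocking factor down and the block count up is also an assumption.

```python
import math


def join_cardinality(n_r, n_s, dist_ra, dist_sb):
    # NJ when both R.A and S.B are non-key attributes.
    return min(n_r * n_s / dist_ra, n_r * n_s / dist_sb)


def join_blocks(n_j, width_r, width_s, block_size, shared_width=0):
    # shared_width is W(S ∩ R) for a natural join, 0 for a theta join.
    width = width_r + width_s - shared_width
    bf_join = block_size // width               # whole tuples per block
    return math.ceil(n_j / bf_join)


# Hypothetical statistics.
n_j = join_cardinality(n_r=10000, n_s=2000, dist_ra=50, dist_sb=100)
b_join = join_blocks(n_j, width_r=100, width_s=60, block_size=4096, shared_width=10)
print(n_j, b_join)
```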
Query Execution Plans
• An execution plan for a relational algebra query consists of a combination of the relational algebra query tree and information about the access methods to be used for each relation, as well as the methods to be used in computing the relational operators stored in the tree.
• Materialized evaluation: the result of an operation is stored as a temporary relation.
• Pipelined evaluation: as the result of an operator is produced, it is forwarded to the next operator in sequence.

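The difference between the two evaluation styles can be sketched with Python lists (materialized) and generators (pipelined). The operators and the sample Staff rows are hypothetical, and a real DBMS works on disk pages rather than Python objects.

```python
def scan(rows):
    # Leaf operator: produce the tuples of a base relation one at a time.
    for r in rows:
        yield r


def select(pred, source):
    # Forward each qualifying tuple as soon as it is seen.
    for r in source:
        if pred(r):
            yield r


def project(cols, source):
    # Keep only the requested column positions.
    for r in source:
        yield tuple(r[c] for c in cols)


# Invented sample rows: (staffNo, position, salary).
staff = [("SG5", "Manager", 24000),
         ("SG14", "Supervisor", 18000),
         ("SL21", "Manager", 30000)]

# Materialized evaluation: the selection result is stored as a temporary
# relation (a list) before the projection starts.
temp = list(select(lambda r: r[1] == "Manager", scan(staff)))
materialized = list(project([0], temp))

# Pipelined evaluation: operators are chained, and each tuple flows
# through the whole chain without an intermediate relation being stored.
pipelined = list(project([0], select(lambda r: r[1] == "Manager", scan(staff))))

print(materialized, pipelined)
```

Both styles produce the same result; pipelining avoids storing the temporary relation.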
Query Tuning
• Monitoring or revising the query to increase throughput or to lower response time for time-critical applications.
• Having to tune queries is a fact of life.
• Query tuning has a localized effect and is thus relatively attractive.
• It is a time-consuming and specialized task.
• It makes the queries harder to understand.
• However, it is often a necessity.
• This is not likely to change any time soon.

Assignment one
• Using the heuristic algorithm, optimize the following SQL query.
SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'AQUARIUS' AND PNUMBER = PNO AND ESSN = SSN
AND BDATE > '1957-12-31';