0% found this document useful (0 votes)
72 views49 pages

Ad Database All Slide

The document discusses query processing and optimization in database systems. It covers translating SQL queries to relational algebra, basic algorithms for executing query operations like selection, projection, join, and sorting. It explains sorting using an external sort-merge algorithm and discusses implementing selection using different search methods like linear search and binary search. Query optimization involves choosing the most efficient execution plan for a query.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
72 views49 pages

Ad Database All Slide

The document discusses query processing and optimization in database systems. It covers translating SQL queries to relational algebra, basic algorithms for executing query operations like selection, projection, join, and sorting. It explains sorting using an external sort-merge algorithm and discusses implementing selection using different search methods like linear search and binary search. Query optimization involves choosing the most efficient execution plan for a query.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 49

Advanced Database Management

System
Query processing and optimization
Database security and authorization
Transaction processing concepts
Concurrency control techniques
Database recovery techniques
Distributed database system
Query Processing and
Optimization

Database system features


Crash Recovery
Integrity Checking
Security
Concurrency Control
Query Processing and Optimization
File Organization and Optimization

2
Query Processing and Optimization
 Translating SQL Queries into Relational
Algebra
 Basic Algorithms for Executing Query
Operations
 Using Heuristic in Query Optimization
 Using Selectivity and Cost Estimates in
Query Optimization
 Semantic Query Optimization
•Query languages: Allow manipulation and
retrieval of data from a database.

3
Relational algebra-operations on a relation

Operations that could be applied on relations



Selection (  ) Selects a subset of rows from a relation.
Projection (  ) Deletes unwanted columns from a relation.

Renaming: assigning intermediate relation for a single operation

Cross-Product ( x ) Allows to concatenate a tuple from one


relation with all the tuples from the other relation.
Set-Difference ( - ) Tuples in relation R , but not in relation R .
1 2
Union ( ) Tuples in relation R , or in relation R .
1 2
Intersection () Tuples in relation R and in relation R2
1
Join Tuples joined from two relations based on a condition
Query processing :
The aim of query processing is to find information in one or

more databases and deliver it to the user.


4
Query Processing and Optimization
 A query expressed in a high-level query language such
as SQL must first be scanned, parsed, and validated.
 The scanner identifies the language tokens—such as
SQL keywords, attribute names, and relation names.
 The parser checks the query syntax based on the rules
of the query language.
 The query must also be validated, by checking that all
attribute and relation names are valid and semantically
meaningful names in the schema of the particular
database being queried.

5
Query Processing and Optimization
 An internal representation of the query is then
created, usually as a tree data structure called a
query tree.
 It is also possible to represent the query using a
graph data structure called a query graph.
 The DBMS must then devise an execution
strategy for retrieving the result of the query
from the database files. A query typically has
many possible execution strategies, and the
process of choosing a suitable one for
processing a query is known as query
optimization.
6
Query Processing and Optimization
 The query optimizer module has the task of
producing an execution plan.
 The code generator generates the code to
execute that plan.
 The runtime database processor has the task
of running the query code.

7
Query Processing (cont…)

8
Query Processing (cont…)
 Translating SQL queries into relational
algebra
 Query block:
 The basic unit that can be translated into the

algebraic operators
 A query block contains a single SELECT-FROM-
WHERE expression, as well as GROUP BY and
HAVING clause if these are part of the block.
 Nested queries within a query are identified as
separate query blocks.

9
Query Processing (cont…)
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);

SELECT LNAME, FNAME SELECT MAX (SALARY)


FROM EMPLOYEE FROM EMPLOYEE
WHERE SALARY > C WHERE DNO = 5

πLNAME, FNAME (σSALARY>C(EMPLOYEE)) ℱMAX SALARY (σDNO=5 (EMPLOYEE))

10
Basic algorithms for executing query
operations
 External Sorting
 Implementing the SELECT Operation
 Implementing the JOIN Operation
 Implementing PROJECT and Set Operations

11
Basic algorithms for executing query
operations
 External sorting:
 Refers to sorting algorithms that are suitable for large

files of records stored on disk that do not fit entirely in


main memory, such as most database files.
 Sorting is one of the primary algorithms used in query
processing. For example, whenever an SQL query
specifies an ORDER BY-clause, the query result must be
sorted. Sorting is also a key component in sort-merge
algorithms used for JOIN and other operations (such as
UNION and INTERSECTION), and in duplicate
elimination algorithms for the PROJECT operation (when
an SQL query specifies the DISTINCT option in the
SELECT clause).

12
External sorting:

13
External sorting:
 External sorting uses Sort-Merge strategy:
 Starts by sorting small subfiles (runs) of the main file

and then merges the sorted runs, creating larger sorted


subfiles that are merged in turn. 2 phases:
 Sorting phase:


Number of subfiles (runs) nR = (b/nB)
 Merging phase:

Degree of merging(dM) = Min (nB-1, nR); nP = (logdM(nR))

14
Sort-Merge strategy (cont…)
 nR: number of initial runs;
 b: number of file blocks;

 nB: available buffer space;

 dM: degree of merging;

 P: number of passes

 The size of a run and number of initial run depends on the


number of file blocks (b) and available buffer space (nB)
 Example: if nB=5 blocks and size of the file=1024 blocks,
 nR=(b/nB)= (1024/5) =205 runs each of size 5 blocks

except the last run which will have 4 blocks.


 Hence, after the sort phase, 205 sorted runs are stored

as temporary subfiles on disk

15
Sort-Merge strategy (cont…)
 In the merging phase, the sorted runs are merged during
one or more passes.
 The degree of merging (dM) is the number of runs that
can be merged in each pass.
 dM=min((nB-1) and nR))
 The number of passes=[(logdM (nR))]
 In each pass, one buffer block is needed to hold one
block from each of the runs being merged and one block
is needed for containing one block of the merge result

16
Sort-Merge strategy (cont…)
 In the above example, dM=4(four way merging)
 Hence, the 205 initial sorted runs would be merged into:
 52 at the end of the first pass
 13 at the end of the second pass
 4 at the end of the third pass
 1 at the end of the fourth pass
 The minimum dM of 2 gives the worst case performance
of the algorithm, which is
 2*b +2(*(b*(log2 (nR) )
 The first term represents the number of block access for
the sort phase and the second term represents the
number of block access for the merge phase

17
Exercise using the relation given below

 Assume available buffer


space is 3 blocks.
 Find DM?
 Find P?
 Find the number of
merges in each pass?

18
Merging phase
 Step 1 (only for run 1 and run 2)
 Read a (from first run) and b(from second run) and load to buffer

 a is written into an output buffer

 Read d(from first run) and load to buffer

 a is written to disk, and b is written into an output buffer

 Read c (from second run) and load to buffer

 b is written to disk, and c is written into an output buffer

 Read e(from second run) and load to buffer

 c is written to disk, d is written to an output buffer

 Read g (from first run) and load to buffer

 d is written to disk, e is written to an output buffer


 And write g to disk

Do for Step 1 (for run 3 and run 4) and the whole for step 2

19
Implementing the SELECT Operation
 There are many options for executing a SELECT operation
 Some options depend on the file having specific access

paths and may apply only to certain types of selection


conditions
Examples:
 (OP1):  SSN='123456789' (EMPLOYEE)
 (OP2):  DNUMBER>5(DEPARTMENT)
 (OP3):  DNO=5(EMPLOYEE)
 (OP4):  DNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE)

20
Implementing the SELECT Operation (cont…)
 Search Methods for implementing Simple Selection:
 S1 Linear search (brute force):
 Retrieve every record in the file, and test whether its attribute
values satisfy the selection condition.
 S2 Binary search:
 If the selection condition involves an equality comparison on a
key attribute on which the file is ordered, binary search (which
is more efficient than linear search) can be used. An example
is OP1 if SSN is the ordering attribute for EMPLOYEE file
 S3 Using a primary index or hash key to retrieve a
single record:
 If the selection condition involves an equality comparison on a
key attribute with a primary index (or a hash key), use the
primary index (or the hash key) to retrieve the record.
 For Example, OP1 use primary index to retrieve the record

21
Implementing the SELECT Operation (cont…)
 Search Methods for implementing Simple Selection
 S4 Using a primary index to retrieve multiple

records:
 If the comparison condition is >, ≥, <, or ≤ on a key

field with a primary index, use the index to find the


record satisfying the corresponding equality
condition, then retrieve all subsequent records in
the (ordered) file. (see OP2)
 S5 Using a clustering index to retrieve multiple

records:
 If the selection condition involves an equality

comparison on a non-key attribute with a clustering


index, use the clustering index to retrieve all the
records satisfying the selection condition. (See OP3)

22
Implementing the SELECT Operation (cont…)
• S6: using a secondary index on an equality comparison:
• This search method can be used to retrieve a single record if
the indexing field is a key (has unique values) or to retrieve
multiple records if the indexing field is not a key
• Search Methods for implementing complex Selection like OP4
• S7: Conjunctive selection using an individual index :
 If an attribute involved in any single simple condition in
the conjunctive condition has an access path that
permits the use of one of the methods S2 to S6, use that
condition to retrieve the records and then check whether
each retrieved record satisfies the remaining simple
conditions in the conjunctive condition

23
Implementing the SELECT Operation
(summarized)
 Whenever a single condition specifies the selection, we
can only check whether an access path exists on the
attribute involved in that condition.
 If an access path exists, the method corresponding to

that access path is used;


 Otherwise, the “brute force” linear search approach

of method S1 is used
 For conjunctive selection conditions, whenever more

than one of the attributes involved in the conditions have


an access path, query optimization should be done to
choose the access path that retrieves the fewest records
in the most efficient way

24
Implementing the SELECT Operation
(summarized)
 Disjunctive selection conditions: This is a situation where
simple conditions are connected by the OR logical
connective rather than AND.
 Compared to conjunctive selection, It is much harder to

process and optimize.


 Example: DNO=5 OR SALARY>3000 OR SEX=‘F’(EMPLOYEE)
 Little optimization can be done because the records
satisfying the disjunctive condition are the union of the
records satisfying the individual conditions.
 If any of the individual conditions does not have an
access path, we are compelled to use the brute force
approach.

25
Implementing the JOIN Operation:
 The join operation is one of the most time
consuming operation in query processing
 Join
 two–way join: a join on two files

e.g. R A=B S
 multi-way joins: joins involving more than two files

e.g. R A=B S C=D T
 Examples
 (OP5): EMPLOYEE DNO=DNUMBER DEPARTMENT
 (OP6): DEPARTMENT MGRSSN=SSN EMPLOYEE

26
Implementing the JOIN Operation(cont…)
 Methods for implementing joins:
 J1 Nested-loop join (brute force):
 For each record t in R (outer loop), retrieve every
record s from S (inner loop) and test whether the
two records satisfy the join condition t[A] = s[B]
 J2 Single-loop join (Using an access structure to
retrieve the matching records):
 If an index (or hash key) exists for one of the two
join attributes — say, B of S — retrieve each record
t in R, one at a time, and then use the access
structure to retrieve directly all matching records s
from S that satisfy s[B] = t[A].

27
Implementing the JOIN Operation (cont…)
 Methods for implementing joins:
 J3 Sort-merge join:
 If the records of R and S are physically sorted
(ordered) by value of the join attributes A and B,
respectively, we can implement the join in the most
efficient way possible.
 Both files are scanned in order of the join attributes,
matching the records that have the same values for
A and B.
 In this method, the records of each file are scanned
only once each for matching with the other file—
unless both A and B are non-key attributes

28
Implementing the JOIN Operation (cont’d)
 Other type of JOIN algorithms
 Partition hash join
 Partitioning phase:
 Each file (R and S) is first partitioned into M partitions using a
partitioning hash function on the join attributes: 
 R1 , R2 , R3 , ...... Rm and S1 , S2 , S3 , ...... Sm
 Minimum number of in-memory buffers needed for the
partitioning phase: M.
 A disk sub-file is created per partition to store the tuples for
that partition.  
 Joining or probing phase:
 Involves M iterations, one per partitioned file.
 Iteration i involves joining partitions Ri and Si.

29
Algorithms for PROJECT and SET Operations
 Algorithm for PROJECT operation
 <attribute list>(R)
1. If <attribute list> has a key of relation R, extract all tuples from
R with only the values for the attributes in <attribute list>.
2. If <attribute list> does NOT include a key of relation R,
duplicated tuples must be removed from the results.

 Methods to remove duplicate tuples


1. Sorting
2. Hashing
 Note: in SQL query, the default is not to remove duplicates.
To remove duplicates we have to use the key word
‘DISTINCT’.

30
Set operations
 UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN
PRODUCT
 CARTESIAN PRODUCT: too expensive. If R has n records and j
attributes and S has m records and k attributes, the result relation
will have n * m records and j + k attributes. Hence, it is important
to avoid the CARTESIAN PRODUCT operation and substitute
other equivalent operations during query optimization.
 UNION, INTERSECTION, and SET DIFFERENCE: apply with the
same number of attributes and the same attribute domains. The
customary way to implement these operations is to use:
 sort-merge technique: the two relations are sorted on the

same attributes, and, after sorting, a single scan through each


relation is sufficient to produce the result.

31
Query Optimization
 There are two main techniques for implementing query
optimization.
 Heuristic rules
 A heuristic is a rule that works well in most cases but is
not guaranteed to work well in every possible case. The
rules typically reorder the operations in a query tree.
 Cost estimation
 Determining the cost of different execution strategies and
choosing the execution plan with the lowest cost estimate.
 The two techniques are usually combined in a query
optimizer

32
General Transformation rules for
relational algebra

1. Cascade of : A conjunctive selection condition can be


broken up into a cascade (sequence) of individual 
operations:

 c1 AND c2 AND ... AND cn(R) = c1 (c2 (...(cn(R))...) )
2. Commutativity of :

The  operation is commutative:

c1 (c2(R)) = c2 (c1(R))
3. Cascade of : In a cascade (sequence) of  operations, all
but the last one can be ignored:

List1 (List2 (...(Listn(R))...) ) = List1(R)

33
Transformation Rules (cont…)

4. Commuting  with :
 If the selection condition c involves only the
attributes A1, ..., An in the projection list, the two
operations can be commuted:
 A1, A2, ..., An (c (R)) = c (A1, A2, ..., An (R))

5. Commuting of and X: both are commutative


R cS = S c R, RXS=SXR

34
Transformation Rules (cont…)
6. Commuting  with : If projection list is L = {A1, ..., An,
B1, ..., Bm}, where A1, ..., An are attributes of R and
B1, ..., Bm are attributes of S and the join condition c
involves only attributes in L, the two operations can be
commuted as follows:
 L ( R C S ) = (A1, ..., An (R)) C ( B1, ..., Bm (S))

7. Converting a (, x) sequence into : If the condition c of a


 that follows a x Corresponds to a join condition, convert
the (, x) sequence into a as follows:

(C (R x S)) = (R C S)

35
Transformation Rules (cont…)
8. Commutativity of the Set Operations: UNION and
INTERSECTION but not SET DIFFERENCE
RS=SR and RS=SR
9. Associativity of the THETA JOIN, CARTESIAN PRODUCT,
UNION and INTERSECTION.
(R S) T=R (S T) where is one of the operations
10. Commuting SELECTION with SET OPERATIONS
c (R S)= (c(R)  c(S)) where is one of the operations 
11. Commuting PROJECTION with UNION
L1 (SR )= L1 (S )  L1 (R )

36
Using Heuristics in Query Optimization
 The heuristic approach of optimization will make use of:
 Properties of individual operators

 Association between operators

 Query Tree: a graphical representation of the operators, relations,

attributes and predicates and processing sequence during query


processing.
 Query tree is composed of three main parts:
The Leafs: the base relations used for processing the query/ extracting
the required information
 The Root: the final result/relation as an output based on the operation

on the relations used for query processing


 Nodes: intermediate results or relations before reaching the final result.

 Sequence of execution of operation in a query tree will start from the leaves
and continues to the intermediate nodes and ends at the root

37
Using Heuristics in Query Optimization (cont…)
 Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query
 The main heuristic is to apply first the operations that
reduce the size of intermediate results
 E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations

38
Using Heuristics in Query Optimization (cont…)
 Heuristic Optimization of Query Trees:
 The same query could correspond to many different relational

algebra expressions — and hence many different query trees.


 The task of heuristic optimization of query trees is to find a final

query tree that is efficient to execute.


 Example:
 Project (Pname, Pnumber), Employee (Bdate, Ssn, Fname,

Lname) and Works_on (Pno, Essn). Project and Employee are


related by Works_on table.
 List Last Name of employees who are working in AQUARIUS

project and born after 1957-12-31.


Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = ‘AQUARIUS’ AND
PNMUBER=PNO AND ESSN=SSN
AND BDATE > ‘1957-12-31’;

39
Steps in converting a query tree during heuristic optimization:
Initial query tree for the query Q

40
Steps in converting a query tree during heuristic optimization:

Moving the select operation down the tree

41
Steps in converting a query tree during heuristic optimization:
Applying the more restrictive select operation first

42
Steps in converting a query tree during heuristic optimization:
Replacing Cartesian product and select with join

43
Steps in converting a query tree during heuristic optimization:
Moving project operations down the query tree

44
Using Heuristics in Query Optimization
 Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that
reduce the size of intermediate results
2. Perform select operations as early as possible to reduce
the number of tuples and perform project operations as
early as possible to reduce the number of attributes. (This
is done by moving select and project operations as far
down the tree as possible.)
3. The select and join operations that are most restrictive
should be executed before other similar operations. (This
is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree
appropriately.)

45
Exercise
 Consider the following schemas and the query, where the
EMPLOYEE and the PROJECT relations are related by the
WORKS_ON relation.
 EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)

 PROJECT (PProjID, PName, PLocation, PFund, PManagerID)

 WORKS_ON (WEmpID, WProjID

 Query: The manager of the company working on road construction


would like to view employees name born before January 1, 1965 who
are working on the project named Ring Road.
 Formulate SQL query, relational algebra and finally construct the
optimum query tree (show the steps you passed to reach the final
tree). You must justify to simplify your tree.

46
Using Selectivity and Cost Estimates in Query
Optimization
 Cost-based query optimization:
 Estimate and compare the costs of executing a query

using different execution strategies and choose the


strategy with the lowest cost estimate
 Cost Components for Query Execution
1. Access cost to secondary storage

2. Computation cost

3. Memory usage cost

4. Communication cost

 Note: Different database systems may focus on different


cost components.

47
Using Selectivity and Cost Estimates in
Query Optimization
 Catalog Information Used in Cost Functions
 Information about the size of a file
 number of records (tuples) (r),
 record size (R),
 number of blocks (b)
 blocking factor (bfr) –a factor of block size and record size
 Information about indexes and indexing attributes of a file
 Number of levels (x) of each multilevel index
 Number of first-level index blocks (bI1)
 Number of distinct values (d) of an attribute
 Selectivity (sl) of an attribute-r/d

48
Semantic Query Optimization
 Semantic Query Optimization:
 Uses constraints specified on the database schema in order to

modify one query into another query that is more efficient to


execute
 Consider the following SQL query,
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE E M
WHERE E.SUPERSSN=M.SSN AND E.SALARY>M.SALARY
 Explanation:
 Suppose that we had a constraint on the database schema that

stated that no employee can earn more than his or her direct
supervisor. If the semantic query optimizer checks for the
existence of this constraint, it need not execute the query at all
because it knows that the result of the query will be empty.

49

You might also like