Advanced Database

The document discusses query processing and optimization. It describes the steps involved in query processing like parsing, translating to relational algebra, and optimization. It also covers different operations like selection, evaluation of expressions, and transformation of relational expressions.

Uploaded by

suplexcity656
Database Management Systems (DBMS)

GTU # 3130703

Chapter 2
Query Processing
and Optimization
Outline
• Steps in query processing
• Measures of query cost
• Selection operation
• Evaluation of expressions
• Query optimization
• Transformation of relational expressions
• Sorting and join
Query Processing
} Query processing is the procedure of converting a query written in a high-level language
(e.g., SQL) into a correct and efficient execution plan expressed in a low-level language,
which is then used for data manipulation.
} A query expressed in a high-level query language such as SQL must first be
Ê Scanned
Ê Parsed and
Ê Validated
• The scanner identifies the query tokens, such as
  • SQL keywords
  • Attribute names and
  • Relation names that appear in the text of the query
• The parser checks the query syntax to determine whether it is formulated according to
the syntax rules (rules of grammar) of the query language
• The query is validated by checking that all attribute and relation names are valid and
semantically meaningful names in the schema of the particular database being queried
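
The scanning and validation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword list, the `SCHEMA` dictionary and the function names `scan` and `validate` are hypothetical, and the parsing (grammar checking) step is left out.

```python
import re

# Hypothetical keyword list and schema, chosen only for this illustration.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}
SCHEMA = {"student": {"rollno", "name", "branch", "spi"}}

# One token per match: a number, a word, a quoted string, or an operator/symbol.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\w+)|('[^']*')|(<>|<=|>=|[=<>,*();]))")

def scan(query):
    """Scanner: split the query text into (kind, text) tokens."""
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        number, word, string, symbol = match.groups()
        if number is not None:
            tokens.append(("NUMBER", number))
        elif word is not None:
            kind = "KEYWORD" if word.upper() in KEYWORDS else "NAME"
            tokens.append((kind, word))
        elif string is not None:
            tokens.append(("STRING", string))
        else:
            tokens.append(("SYMBOL", symbol))
        pos = match.end()
    return tokens

def validate(tokens, schema=SCHEMA):
    """Validator: every name must be a known relation or one of its attributes."""
    names = [text.lower() for kind, text in tokens if kind == "NAME"]
    relations = [n for n in names if n in schema]
    if not relations:
        raise NameError("no known relation in query")
    known_attributes = set().union(*(schema[r] for r in relations))
    unknown = [n for n in names if n not in schema and n not in known_attributes]
    if unknown:
        raise NameError(f"unknown names: {unknown}")
    return True

tokens = scan("SELECT Name FROM Student WHERE Branch = 'CE'")
is_valid = validate(tokens)
```

A query that mentions a name outside the schema, such as `SELECT Foo FROM Student`, scans without error but is rejected by `validate`.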
Section – 1
Steps in Query Processing
[Figure: steps in query processing]
} Flow: Query → Parser and translator → Relational algebra expression → Optimizer → Execution plan → Evaluation engine → Query output
} Parser: checks the syntax of the query and verifies the attribute names and relation names.
} Translator: translates the query into its internal form (relational algebra).
} Optimizer: chooses the best execution plan, using statistics about the data (database catalog).
} Evaluation engine: executes the query-evaluation plan on the data and returns the output.
Section – 2
Measures of Query Cost
} Cost is generally measured as the total time required to execute a statement/query.
} Factors that contribute to the time cost:
Ê Disk accesses (time to process a data request and retrieve the required data from the storage
device)
Ê CPU time to execute a query
Ê Network communication cost
} Disk access is the predominant (major) cost, since disk access is slow as
compared to in-memory operation.
} Cost to write a block is greater than cost to read a block because data is read back
after being written to ensure that the write was successful.
Section – 3
Selection Operator
} Symbol: σ (sigma)
} Notation: σ condition (Relation)
} Operation: Selects tuples from a relation that satisfy a given condition.
} Operators: =, <>, <, >, <=, >=, ∧ (AND), ∨ (OR)
} Example: Display the details of the students who belong to the “CE” branch.
} Answer: σBranch=‘CE’ (Student)

Student
RollNo  Name    Branch  SPI
101     Raju    CE      8
102     Mitesh  ME      9
103     Nilesh  CI      9
104     Meet    CE      9

Output
RollNo  Name    Branch  SPI
101     Raju    CE      8
104     Meet    CE      9
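
The selection in this example can be mimicked with a short Python sketch. Representing the relation as a list of dictionaries and the helper name `select` are assumptions made for illustration only.

```python
# The Student relation from the example, one dictionary per tuple.
student = [
    {"RollNo": 101, "Name": "Raju", "Branch": "CE", "SPI": 8},
    {"RollNo": 102, "Name": "Mitesh", "Branch": "ME", "SPI": 9},
    {"RollNo": 103, "Name": "Nilesh", "Branch": "CI", "SPI": 9},
    {"RollNo": 104, "Name": "Meet", "Branch": "CE", "SPI": 9},
]

def select(relation, condition):
    """Selection: keep only the tuples that satisfy the condition."""
    return [t for t in relation if condition(t)]

# Selection with the condition Branch = 'CE'.
output = select(student, lambda t: t["Branch"] == "CE")
selected_roll_numbers = [t["RollNo"] for t in output]
```

The condition is passed as a function, so comparisons combined with AND or OR can be expressed with ordinary Python `and` / `or`.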
Search algorithm for selection operation
} Linear search (A1)
} Binary search (A2)
Linear search (A1)
} It scans each block and tests all records to see whether they satisfy the selection
condition.
Ê Cost of linear search (worst case) = br
Ê br denotes the number of blocks containing records from relation r
} If the selection condition is on a (primary) key attribute, the system can stop
searching as soon as the required record is found.
Ê Cost of linear search (average case) = br / 2
} If the selection is on a non-key attribute, multiple blocks may contain the
required records, so the cost of scanning those blocks needs to be added to the
cost estimate.
} Linear search can be applied regardless of
Ê the selection condition or
Ê the ordering of records in the file (relation)
} This algorithm is generally slower than binary search, which however requires a sorted file.
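
A minimal sketch of linear search that counts block reads, assuming a file is modelled as a list of blocks and each block as a list of records. The two-records-per-block layout and the name `linear_search` are illustrative assumptions.

```python
# Hypothetical file layout: two (RollNo, Branch) records per block.
blocks = [
    [(101, "CE"), (102, "ME")],
    [(103, "CI"), (104, "CE")],
]

def linear_search(blocks, predicate, key_attribute=False):
    """Scan blocks in order; return (matching records, number of blocks read)."""
    matches, blocks_read = [], 0
    for block in blocks:
        blocks_read += 1
        for record in block:
            if predicate(record):
                matches.append(record)
                if key_attribute:
                    # On a key attribute at most one record matches: stop early.
                    return matches, blocks_read
    return matches, blocks_read

# Non-key attribute: every block must be read (cost = br).
ce_students, cost_non_key = linear_search(blocks, lambda r: r[1] == "CE")

# Key attribute: the scan stops at the block holding the record.
found, cost_key = linear_search(blocks, lambda r: r[0] == 101, key_attribute=True)
```

Here the non-key search reads both blocks, while the key search for RollNo 101 stops after the first block.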
Binary search (A2)
} Generally, this algorithm is used if the selection is an equality comparison on the
(primary) key attribute and the file (relation) is ordered (sorted) on that
attribute.
} Cost of binary search = ⌈log2(br)⌉
Ê br denotes the number of blocks containing records from relation r
} This algorithm is faster than linear search.
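
Binary search over the blocks of a sorted file can be sketched as follows. The file layout (eight blocks of two records, sorted on the key) and the name `binary_search_blocks` are assumptions for illustration. The loop halves the range of candidate blocks on every read, so it reads at most ⌊log2(br)⌋ + 1 blocks, which is in line with the logarithmic cost estimate above.

```python
import math

# Hypothetical sorted file: 8 blocks, 2 records per block, keys 1..16.
blocks = [[(k, f"rec{k}"), (k + 1, f"rec{k + 1}")] for k in range(1, 17, 2)]

def binary_search_blocks(blocks, key):
    """Return (record or None, number of blocks read) for an equality search."""
    lo, hi, blocks_read = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # reading the middle block costs one access
        blocks_read += 1
        if key < block[0][0]:        # key is smaller than the block's first key
            hi = mid - 1
        elif key > block[-1][0]:     # key is larger than the block's last key
            lo = mid + 1
        else:                        # key can only be inside this block
            for record in block:
                if record[0] == key:
                    return record, blocks_read
            return None, blocks_read
    return None, blocks_read

record, reads = binary_search_blocks(blocks, 16)
missing, _ = binary_search_blocks(blocks, 100)
max_reads = math.floor(math.log2(len(blocks))) + 1
```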
Section – 4
Evaluation of expressions
} An expression may contain more than one operation, which makes it more difficult
to evaluate.
} Example: ΠCust_Name ( σBalance<2500 (account) ⋈ (customer) )
} Expression tree (executed bottom to top):

            ΠCust_Name
                |
                ⋈
              /   \
  σBalance<2500   (customer)
        |
    (account)

} To evaluate such an expression, we need to evaluate the operations one by one in
an appropriate order.
} Two methods for evaluating an expression carrying multiple operations are:
Ê Materialization
Ê Pipelining
Materialization
} Materialization evaluates the expression tree of the relational algebra operation
from the bottom and performs the innermost or leaf-level operations first.
} The intermediate result of each operation is materialized (stored in a temporary
relation) and becomes the input for subsequent (next) operations.
} The cost of materialization is the sum of the costs of the individual operations plus
the cost of writing the intermediate results to disk.
} The problem with materialization is that
Ê it creates lots of temporary relations
Ê it performs lots of I/O operations
Pipelining
} In pipelining, operations form a queue, and results are passed from one operation to
another as they are calculated.
} To reduce the number of intermediate temporary relations, we pass the results of one
operation directly to the next operation in the pipeline.
} Combining operations into a pipeline eliminates the cost of reading and writing
temporary relations.
} Pipelines can be executed in two ways:
Ê Demand driven (System makes repeated requests for tuples from the operation at the top of
pipeline)
Ê Producer driven (Operations do not wait for request to produce tuples, but generate the tuples
eagerly.)
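
The difference between materialization and a demand-driven pipeline can be sketched with Python lists and generators. This is an in-memory analogy only (no disk I/O takes place), and all function names are hypothetical.

```python
account = [("A01", 3000), ("A02", 1000), ("A03", 2000), ("A04", 4000)]

def low_balance(t):
    return t[1] < 2500

# Materialization: each operation builds its full result before the next starts.
def select_materialized(relation, predicate):
    return [t for t in relation if predicate(t)]      # temporary relation

def project_materialized(relation, index):
    return [t[index] for t in relation]

temp = select_materialized(account, low_balance)
materialized_result = project_materialized(temp, 0)

# Demand-driven pipelining: each operation yields one tuple at a time, and
# only when the operation above it asks for the next tuple.
def select_pipelined(relation, predicate):
    for t in relation:
        if predicate(t):
            yield t

def project_pipelined(relation, index):
    for t in relation:
        yield t[index]

pipeline = project_pipelined(select_pipelined(account, low_balance), 0)
first = next(pipeline)     # pulls tuples through the pipeline until one qualifies
rest = list(pipeline)      # pulls the remaining tuples
```

In the pipelined version no intermediate list such as `temp` is ever built; the top of the pipeline drives the work by requesting tuples.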
Section – 5
Query optimization
} It is the process of selecting the most efficient query evaluation plan from the
available possible plans.
} Example:

Customer                Account
CID  ANO  Name          ANO  Balance
C01  A01  Raj           A01  3000
C02  A02  Meet          A02  1000
C03  A03  Jay           A03  2000
C04  A04  Ram           A04  4000

Ê Plan 1 (efficient plan): ΠCust_Name ( σBalance<2500 (account) ⋈ (customer) )
  The selection is applied first, so the join combines 2 records with 4 records.
Ê Plan 2: ΠCust_Name ( σBalance<2500 (account ⋈ customer) )
  The join is applied first, so it combines 4 records with 4 records.
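
The two plans can be compared with a small Python sketch that counts the records entering the join. The tuple layout and the helper `natural_join` are assumptions for illustration.

```python
customer = [("C01", "A01", "Raj"), ("C02", "A02", "Meet"),
            ("C03", "A03", "Jay"), ("C04", "A04", "Ram")]
account = [("A01", 3000), ("A02", 1000), ("A03", 2000), ("A04", 4000)]

def natural_join(customers, accounts):
    """Join on the shared ANO attribute."""
    return [(cid, ano, name, balance)
            for (cid, ano, name) in customers
            for (acc_no, balance) in accounts
            if acc_no == ano]

# Plan 1: select first, then join (2 account records enter the join).
selected_accounts = [a for a in account if a[1] < 2500]
plan1 = sorted({name for (_, _, name, _) in natural_join(customer, selected_accounts)})

# Plan 2: join first, then select (all 4 account records enter the join).
joined = natural_join(customer, account)
plan2 = sorted({name for (_, _, name, balance) in joined if balance < 2500})

join_input_sizes = (len(selected_accounts), len(account))
```

Both plans return the same names, but Plan 1 feeds only two account records into the join instead of four.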
Approaches to Query Optimization
} Exhaustive Search Optimization
Ê Generates all possible query plans and then selects the best plan.
Ê It provides the best solution.
} Heuristic Based Optimization
Ê Heuristic-based optimization uses rule-based approaches for query optimization.
Ê Performs select and project operations before join operations. This is done by moving the select
and project operations down the query tree, which reduces the number of tuples available for the join.
Ê Avoids cross-product operations because they result in very large intermediate tables.
Ê These algorithms do not necessarily produce the best query plan.
Section – 6
Transformation of relational expressions
} Two relational algebra expressions are said to be equivalent if the two expressions
generate the same set of tuples.
} Example:

Customer                Account
CID  ANO  Name          ANO  Balance
C01  A01  Raj           A01  3000
C02  A02  Meet          A02  1000
C03  A03  Jay           A03  2000
C04  A04  Ram           A04  4000

ΠName ( σBalance<2500 (Account) ⋈ (Customer) ) and ΠName ( σBalance<2500 (Account ⋈ Customer) )
both give:

Output
Name
Meet
Jay
Transformation of relational expressions
} A combined selection operation can be divided into a sequence of individual selections.
This transformation is called the cascade of σ.
} Example:

Customer
CID  ANO  Name  Balance
C01  1    Raj   3000
C02  2    Meet  1000
C03  3    Jay   2000
C04  4    Ram   4000

σANO<3 ∧ Balance<2000 (Customer) and σANO<3 (σBalance<2000 (Customer)) both give:

Output
CID  ANO  Name  Balance
C02  2    Meet  1000

} Rule: σθ1∧θ2 (E) = σθ1 (σθ2 (E))


Transformation of relational expressions
} Selection operations are commutative.
} Example:

Customer
CID  ANO  Name  Balance
C01  1    Raj   3000
C02  2    Meet  1000
C03  3    Jay   2000
C04  4    Ram   4000

σANO<3 (σBalance<2000 (Customer)) and σBalance<2000 (σANO<3 (Customer)) both give:

Output
CID  ANO  Name  Balance
C02  2    Meet  1000

} Rule: σθ1 (σθ2 (E)) = σθ2 (σθ1 (E))
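
The cascade and commutativity rules for selection can be checked on the Customer example with a short Python sketch. The tuple layout and helper names are assumptions for illustration.

```python
# Customer tuples: (CID, ANO, Name, Balance)
customer = [("C01", 1, "Raj", 3000), ("C02", 2, "Meet", 1000),
            ("C03", 3, "Jay", 2000), ("C04", 4, "Ram", 4000)]

def select(relation, predicate):
    return [t for t in relation if predicate(t)]

def ano_lt_3(t):
    return t[1] < 3

def balance_lt_2000(t):
    return t[3] < 2000

# One selection with the conjunctive condition.
combined = select(customer, lambda t: ano_lt_3(t) and balance_lt_2000(t))

# Cascade of two selections, in both possible orders.
cascade_a = select(select(customer, balance_lt_2000), ano_lt_3)
cascade_b = select(select(customer, ano_lt_3), balance_lt_2000)
```

All three evaluations return the single tuple for C02, as in the example tables.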


Transformation of relational expressions
} If more than one projection operation is nested in an expression, only the outermost
projection operation is required, so all the inner projection operations can be skipped.
} Example:

Customer
CID  ANO  Name  Balance
C01  1    Raj   3000
C02  2    Meet  1000
C03  3    Jay   2000
C04  4    Ram   4000

ΠName (ΠANO, Name (Customer)) and ΠName (Customer) both give:

Output
Name
Raj
Meet
Jay
Ram

} Rule: ΠL1 (ΠL2 (…(ΠLn (E))…)) = ΠL1 (E)
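
A brief Python check of the projection rule on the Customer example. The dictionary representation and the helper `project` are assumptions for illustration; the helper removes duplicate rows because a relation is a set of tuples.

```python
customer = [
    {"CID": "C01", "ANO": 1, "Name": "Raj", "Balance": 3000},
    {"CID": "C02", "ANO": 2, "Name": "Meet", "Balance": 1000},
    {"CID": "C03", "ANO": 3, "Name": "Jay", "Balance": 2000},
    {"CID": "C04", "ANO": 4, "Name": "Ram", "Balance": 4000},
]

def project(relation, attributes):
    """Projection: keep only the listed attributes and drop duplicate rows."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attributes}
        if row not in result:
            result.append(row)
    return result

nested = project(project(customer, ["ANO", "Name"]), ["Name"])
outer_only = project(customer, ["Name"])
names = [row["Name"] for row in outer_only]
```

The nested form only works because the inner list (ANO, Name) still contains every attribute the outer projection needs.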


Transformation of relational expressions
} A selection operation can be combined with a Cartesian product or a theta join.
} Example:

Customer                Account
CID  ANO  Name          ANO  Balance
C01  1    Raj           1    3000
C02  2    Meet          2    1000
C03  3    Jay           3    2000
C04  4    Ram           4    4000

σANO<3 (Customer ⋈ Account) and (Customer) ⋈ANO<3 (Account) both give:

Output
CID  ANO  Name  Balance
C01  1    Raj   3000
C02  2    Meet  1000

} Rules:
σθ (E1 × E2) = E1 ⋈θ E2
σθ1 (E1 ⋈θ2 E2) = E1 ⋈θ1∧θ2 E2


Transformation of relational expressions
} Theta join operations are commutative.
} Example:

Customer                Account
CID  ANO  Name          ANO  Balance
C01  1    Raj           1    3000
C02  2    Meet          2    1000
C03  3    Jay           3    2000
C04  4    Ram           4    4000

(Account) ⋈ANO<3 (Customer) and (Customer) ⋈ANO<3 (Account) both give:

Output
CID  ANO  Name  Balance
C01  1    Raj   3000
C02  2    Meet  1000

} Rule: E1 ⋈θ E2 = E2 ⋈θ E1
Transformation of relational expressions
} Natural join operations are associative.

(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)

} The selection operation distributes over the theta join operation under the following
conditions:
Ê When all the attributes in the selection condition θ0 involve only the attributes of one of the
expressions (say E1) being joined.

σθ0 (E1 ⋈θ E2) = (σθ0 (E1)) ⋈θ E2

Ê When the selection condition θ1 involves only the attributes of E1 and θ2 involves only the
attributes of E2.

σθ1∧θ2 (E1 ⋈θ E2) = (σθ1 (E1)) ⋈θ (σθ2 (E2))
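
The second distribution rule can be checked with a small Python sketch. The relations reuse the Customer and Account data of the earlier examples; the account number is renamed to `A_ANO` on the Account side so that the join condition can refer to both columns, and the helper names and the conditions θ1 (Name ≠ 'Ram') and θ2 (Balance < 2500) are assumptions chosen for illustration.

```python
customer = [
    {"CID": "C01", "ANO": 1, "Name": "Raj"},
    {"CID": "C02", "ANO": 2, "Name": "Meet"},
    {"CID": "C03", "ANO": 3, "Name": "Jay"},
    {"CID": "C04", "ANO": 4, "Name": "Ram"},
]
account = [
    {"A_ANO": 1, "Balance": 3000},
    {"A_ANO": 2, "Balance": 1000},
    {"A_ANO": 3, "Balance": 2000},
    {"A_ANO": 4, "Balance": 4000},
]

def select(relation, predicate):
    return [t for t in relation if predicate(t)]

def theta_join(left, right, theta):
    """Theta join: pair every left tuple with every right tuple, keep pairs satisfying theta."""
    return [{**l_row, **r_row} for l_row in left for r_row in right
            if theta({**l_row, **r_row})]

def theta(t):      # join condition
    return t["ANO"] == t["A_ANO"]

def theta1(t):     # uses attributes of customer only
    return t["Name"] != "Ram"

def theta2(t):     # uses attributes of account only
    return t["Balance"] < 2500

# Left-hand side: join first, then apply theta1 AND theta2.
lhs = select(theta_join(customer, account, theta),
             lambda t: theta1(t) and theta2(t))

# Right-hand side: push each selection down to its own input, then join.
rhs = theta_join(select(customer, theta1), select(account, theta2), theta)
```

The pushed-down form joins 3 customer tuples with 2 account tuples instead of 4 with 4, which is the reason heuristic optimizers apply selections as early as possible.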


Transformation of relational expressions
} Set operations union and intersection are commutative.

Customer        Employee
Cst_Name        Emp_Name
Raj             Meet
Meet            Suresh

Customer ∪ Employee and Employee ∪ Customer both give:

Output
Name
Raj
Meet
Suresh

Customer ∩ Employee and Employee ∩ Customer both give:

Output
Name
Meet

} Rules: E1 ∪ E2 = E2 ∪ E1 and E1 ∩ E2 = E2 ∩ E1
} Set difference is not commutative.
Transformation of relational expressions
} Set operations union and intersection are associative.

Customer        Employee        Student
Cst_Name        Emp_Name        Stu_Name
Raj             Meet            Raj
Meet            Suresh          Meet

(Customer ∪ Employee) ∪ Student and Customer ∪ (Employee ∪ Student) both give:

Output
Name
Raj
Meet
Suresh

(Customer ∩ Employee) ∩ Student and Customer ∩ (Employee ∩ Student) both give:

Output
Name
Meet

} Rules: (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3) and (E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)


Transformation of relational expressions
} The selection operation distributes over ∪, ∩ and –.

σθ (E1 ∪ E2) = σθ (E1) ∪ σθ (E2)

σθ (E1 ∩ E2) = σθ (E1) ∩ σθ (E2)

σθ (E1 – E2) = σθ (E1) – σθ (E2)
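
The set-operation rules on the preceding slides can be checked directly with Python sets, using the Customer, Employee and Student names from the examples. The condition θ ("name starts with M") and the helper `select` are assumptions for illustration.

```python
customer = {"Raj", "Meet"}
employee = {"Meet", "Suresh"}
student = {"Raj", "Meet"}

def select(relation, predicate):
    return {name for name in relation if predicate(name)}

def starts_with_m(name):
    return name.startswith("M")

checks = {
    "union commutative": customer | employee == employee | customer,
    "intersection commutative": customer & employee == employee & customer,
    "union associative":
        (customer | employee) | student == customer | (employee | student),
    "intersection associative":
        (customer & employee) & student == customer & (employee & student),
    "selection over union":
        select(customer | employee, starts_with_m)
        == select(customer, starts_with_m) | select(employee, starts_with_m),
    "selection over intersection":
        select(customer & employee, starts_with_m)
        == select(customer, starts_with_m) & select(employee, starts_with_m),
    "selection over difference":
        select(customer - employee, starts_with_m)
        == select(customer, starts_with_m) - select(employee, starts_with_m),
}

# Set difference is the exception: swapping the operands changes the result.
difference_commutes = (customer - employee) == (employee - customer)
```

Every entry in `checks` holds for this data, while `difference_commutes` is false because Customer – Employee contains Raj and Employee – Customer contains Suresh.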


Using Heuristics in Query Optimization
§ Process for heuristic optimization
1. The parser of a high-level query generates an initial internal
representation.
2. Heuristic rules are applied to optimize the internal representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query.
§ The main heuristic is to apply first the operations that reduce the
size of intermediate results
• E.g., apply SELECT and PROJECT operations before applying
the JOIN or other binary operations, because SELECT and PROJECT
reduce the size of a file.
Using Heuristics in Query Optimization (2)
§ Query tree and query graph can be used as the basis for the data
structures that are used for internal representation of queries
§ Query tree:
§ A tree data structure that corresponds to a relational algebra
expression
§ It represents the input relations of the query as leaf nodes of the tree,
and represents the relational algebra operations as internal nodes
§ An execution of the query tree consists of executing an internal node
operation whenever its operands are available and then replacing that
internal node by the relation that results from executing the operation.
§ The order of execution of operations starts at the leaf nodes, which
represents the input database relations for the query, and ends at
the root node, which represents the final operation of the query
§ Query graph:
§ A graph data structure that corresponds to a relational calculus
expression
§ It does not indicate an order on which operations to perform first
§ There is only a single graph corresponding to each query
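
A query tree and its bottom-up execution can be sketched in Python. The classes `Leaf` and `Node`, the function `evaluate` and the join helper (which hard-codes the shared ANO attribute) are hypothetical names used for illustration; the tree corresponds to ΠName ( σBalance<2500 (Account) ⋈ Customer ) from the earlier slides.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Leaf:
    """Leaf node: an input relation of the query."""
    name: str
    rows: List[dict]

@dataclass
class Node:
    """Internal node: a relational algebra operation applied to its children."""
    label: str
    apply: Callable[..., List[dict]]
    children: list

def evaluate(node):
    """Execute an internal node once all of its operands are available."""
    if isinstance(node, Leaf):
        return node.rows
    operands = [evaluate(child) for child in node.children]
    return node.apply(*operands)

account = [{"ANO": "A01", "Balance": 3000}, {"ANO": "A02", "Balance": 1000},
           {"ANO": "A03", "Balance": 2000}, {"ANO": "A04", "Balance": 4000}]
customer = [{"CID": "C01", "ANO": "A01", "Name": "Raj"},
            {"CID": "C02", "ANO": "A02", "Name": "Meet"},
            {"CID": "C03", "ANO": "A03", "Name": "Jay"},
            {"CID": "C04", "ANO": "A04", "Name": "Ram"}]

def natural_join(accounts, customers):
    return [{**c, **a} for a in accounts for c in customers if a["ANO"] == c["ANO"]]

tree = Node("project Name", lambda rows: [{"Name": r["Name"]} for r in rows], [
    Node("natural join", natural_join, [
        Node("select Balance<2500",
             lambda rows: [r for r in rows if r["Balance"] < 2500],
             [Leaf("account", account)]),
        Leaf("customer", customer),
    ]),
])

result = evaluate(tree)
```

Evaluation starts at the leaves and ends at the root, so the selection runs before the join and the projection runs last.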
Using Heuristics in Query Optimization (3)
§ Example:
§ For every project located in ‘Stafford’, retrieve the project number,
the controlling department number and the department manager’s
last name, address and birthdate.
• Relational algebra:
πPNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σPLOCATION='STAFFORD' (PROJECT))
⋈DNUM=DNUMBER (DEPARTMENT)) ⋈MGRSSN=SSN (EMPLOYEE))

§ SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME,
           E.ADDRESS, E.BDATE
    FROM PROJECT AS P, DEPARTMENT AS D,
         EMPLOYEE AS E
    WHERE P.DNUM = D.DNUMBER AND
          D.MGRSSN = E.SSN AND
          P.PLOCATION = 'STAFFORD';
Using Heuristics in Query Optimization (4)

Ayele G. : Department of Computer Science 35


Using Heuristics in Query Optimization (5)
§ Heuristic Optimization of Query Trees:
§ The same query could correspond to many different relational algebra
expressions and hence many different query trees.
§ The task of heuristic optimization of query trees is to find a final query tree
that is efficient to execute
§ Example:
Q: SELECT LNAME
   FROM EMPLOYEE, WORKS_ON, PROJECT
   WHERE PNAME = 'AQUARIUS' AND
         PNUMBER = PNO AND ESSN = SSN
         AND BDATE > '1957-12-31';

Using Heuristics in Query Optimization (6)
Using Heuristics in Query Optimization (7)
Using Heuristics in Query Optimization (8)
Using Selectivity and Cost Estimates in Query Optimization (1)

§ Cost-based query optimization:


§ Estimate and compare the costs of executing a query using
different execution strategies and choose the strategy with the
lowest cost estimate
§ Issues
§ Cost function
§ Number of execution strategies to be considered
Using Selectivity and Cost Estimates in Query Optimization (2)
Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost
Note: Different database systems may focus on different
cost components.
Cont’d…
• Access cost to secondary storage: This is the cost of transferring (reading and
writing) data blocks between secondary disk storage and main memory buffers.
• Disk storage cost: This is the cost of storing on disk any intermediate files that
are generated by an execution strategy for the query.
• Computation cost: This is the cost of performing in-memory operations on the
records within the data buffers during query execution. Such operations include
searching for and sorting records, merging records for a join or a sort operation,
and performing computations on field values. This is also known as CPU (central
processing unit) cost.
• Memory usage cost: This is the cost pertaining to the number of main memory
buffers needed during query execution
• Communication cost: This is the cost of shipping the query and its results from
the database site to the site or terminal where the query originated.
Section – 7
Sorting
} Several of the relational operations, such as joins, can be implemented efficiently if
the input relations are first sorted.
} We can sort a relation by building an index on the relation and then using that index
to read the relation in sorted order.
} Such a process orders the relation only logically rather than physically.
} Hence reading of tuples in the sorted order may lead to disk access for each record,
which can be very expensive.
} So it is desirable to order the records physically.
} For relations that fit in main memory, standard sorting techniques such as
quick-sort can be used.
} Sorting of relations that do not fit in main memory is called external sorting.
} The most commonly used algorithm for this type of sorting is the external sort-merge
algorithm.
External Sort-Merge (Example)
} Memory size M = 3 blocks (with one record per block)

Step                          Runs
initial relation              24 19 31 33 14 16 16 21 3 2 7 14
create runs                   [19 24 31] [14 16 33] [3 16 21] [2 7 14]
merge pass-1                  [14 16 19 24 31 33] [2 3 7 14 16 21]
merge pass-2 (sorted output)  [2 3 7 14 14 16 16 19 21 24 31 33]
External Sort-Merge (Algorithm)
} Let M denote the memory size (in pages).
1. Create sorted runs. Let i be 0 initially.
Ê Repeat the following until the end of the relation:
   1) Read M blocks of the relation into memory
   2) Sort the in-memory blocks
   3) Write the sorted data to run Ri, then increment i
Ê Let the final value of i be N
2. Merge the runs (N-way merge). We assume (for now) that N < M.
Ê Use N blocks of memory to buffer the input runs and 1 block to buffer the output. Read the first block of
each run into its buffer page.
Ê repeat
   Select the first record (in sort order) among all buffer pages.
   Write the record to the output buffer. If the output buffer is full, write it to disk.
   Delete the record from its input buffer page.
   If the buffer page becomes empty, read the next block (if any) of the run into the buffer.
Ê until all input buffer pages are empty
} If N ≥ M, several merge passes are required; each pass merges groups of M – 1 runs,
as in the example with M = 3.
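
The algorithm can be sketched in Python on the relation from the External Sort-Merge example. This is an in-memory simulation under stated assumptions: each record stands for one block, runs are plain lists instead of disk files, and `heapq.merge` plays the role of repeatedly selecting the smallest record among the input buffers. The function names are hypothetical.

```python
import heapq

def create_runs(relation, m):
    """Step 1: read m blocks at a time, sort them, and emit one run per chunk."""
    return [sorted(relation[i:i + m]) for i in range(0, len(relation), m)]

def merge_pass(runs, fan_in):
    """One merge pass: merge each group of `fan_in` runs into a longer run."""
    return [list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)]

def external_sort(relation, m):
    """Return (sorted relation, list of run sets after each step)."""
    if m < 3:
        raise ValueError("need at least 2 input buffers and 1 output buffer")
    runs = create_runs(relation, m)
    passes = [runs]
    while len(runs) > 1:
        # m - 1 buffers hold input runs, 1 buffer is reserved for output.
        runs = merge_pass(runs, m - 1)
        passes.append(runs)
    result = runs[0] if runs else []
    return result, passes

relation = [24, 19, 31, 33, 14, 16, 16, 21, 3, 2, 7, 14]
sorted_relation, passes = external_sort(relation, m=3)
```

With M = 3 the sketch creates four runs of three records and then needs two merge passes, the same steps as in the example.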
Questions asked
1. Explain query processing steps. OR Discuss various steps of query processing with
proper diagram.
2. Explain Heuristics in optimization.
3. Explain the purpose of sorting with example with reference to query optimization.
4. Explain the measures of finding out the cost of a query in query processing.
