Advanced Database
GTU # 3130703
Chapter 2
Query Processing and Optimization
Outline
• Steps in query processing
• Measures of query cost
• Selection operation
• Evaluation of expressions
• Query optimization
• Transformation of relational expressions
• Sorting and join
Query Processing
• Query processing is the procedure of converting a query written in a high-level language (e.g., SQL) into a correct and efficient execution plan expressed in a low-level language, which is used for data manipulation.
• A query expressed in a high-level query language such as SQL must first be
  • scanned,
  • parsed, and
  • validated.
• The scanner identifies the query tokens, such as
  • SQL keywords,
  • attribute names, and
  • relation names that appear in the text of the query.
• The parser checks the query syntax to determine whether it is formulated according to the syntax rules (rules of grammar) of the query language.
• The query is validated by checking that all attribute and relation names are valid and semantically meaningful names in the schema of the particular database being queried.
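The scanning and validation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SCHEMA` catalog, the keyword list, and the function names `scan` and `validate` are hypothetical, the grammar check done by a real parser is omitted, and the validator assumes a single relation named right after FROM.

```python
import re

# Hypothetical catalog: relation name -> set of attribute names.
SCHEMA = {"student": {"rollno", "name", "branch", "spi"}}
KEYWORDS = {"select", "from", "where", "and", "or"}

# One token per match: number, word, quoted string, or operator/punctuation.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|('[^']*')|(<=|>=|<>|[=<>,*();]))")


def scan(query):
    """Scanner: split the query text into (kind, text) tokens."""
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        number, word, string, symbol = match.groups()
        if number is not None:
            tokens.append(("number", number))
        elif word is not None:
            kind = "keyword" if word.lower() in KEYWORDS else "identifier"
            tokens.append((kind, word))
        elif string is not None:
            tokens.append(("string", string))
        else:
            tokens.append(("symbol", symbol))
        pos = match.end()
    return tokens


def validate(tokens):
    """Validator: check relation and attribute names against SCHEMA.

    Assumes a single relation named right after FROM.
    """
    words = [(kind, text.lower()) for kind, text in tokens]
    from_index = words.index(("keyword", "from"))  # ValueError if FROM is missing
    kind, relation = words[from_index + 1]
    if kind != "identifier" or relation not in SCHEMA:
        raise ValueError(f"unknown relation: {relation}")
    for kind, name in words:
        if kind == "identifier" and name != relation and name not in SCHEMA[relation]:
            raise ValueError(f"unknown attribute: {name}")
    return relation


tokens = scan("SELECT Name FROM Student WHERE SPI > 8")
relation = validate(tokens)  # passes: Name and SPI are attributes of Student
```

A query such as `SELECT Age FROM Student` would be tokenized without complaint but rejected by `validate`, because `Age` is not in the assumed schema.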
Section – 1
Steps in Query Processing
[Figure: steps in query processing]
Query → Parser and translator → Relational algebra expression → Optimizer → Execution plan → Evaluation engine → Query output
• Parser and translator: the parser checks the syntax of the query and verifies attribute names and relation names; the translator translates the query into its internal form (relational algebra).
• Optimizer: chooses the best execution plan, using statistics about the data from the database catalog.
• Evaluation engine: executes the query-evaluation plan on the data and returns the output.
Section – 2
Measures of Query Cost
• Cost is generally measured as the total time required to execute a statement/query.
• Factors that contribute to the time cost:
  • Disk accesses (time to process a data request and retrieve the required data from the storage device)
  • CPU time to execute a query
  • Network communication cost
• Disk access is the predominant (major) cost, since disk access is slow compared to in-memory operations.
• The cost to write a block is greater than the cost to read a block, because data is read back after being written to ensure that the write was successful.
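Since disk access dominates, a common textbook simplification estimates cost from the number of block transfers and the number of seeks alone. The sketch below assumes that model; the function name and the default timings (0.1 ms per block transfer, 4 ms per seek) are illustrative assumptions, not measurements.

```python
def disk_cost_ms(block_transfers, seeks, transfer_ms=0.1, seek_ms=4.0):
    """Estimated disk time in milliseconds (CPU and network cost ignored).

    The default timings are assumed values for illustration only.
    """
    return block_transfers * transfer_ms + seeks * seek_ms


# One seek followed by a sequential read of 100 blocks.
sequential = disk_cost_ms(block_transfers=100, seeks=1)
# The same 100 blocks read from scattered positions: one seek per block.
scattered = disk_cost_ms(block_transfers=100, seeks=100)
```

Under these assumed timings the scattered read is far more expensive than the sequential one, which is why access patterns matter as much as the number of blocks read.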
Section – 3
Selection Operator
• Symbol: σ (sigma)

Student
RollNo  Name    Branch  SPI
101     Raju    CE      8
102     Mitesh  ME      9
103     Nilesh  CI      9
104     Meet    CE      9

Output
RollNo  Name  Branch  SPI
101     Raju  CE      8
104     Meet  CE      9
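The selection can be reproduced with a small Python sketch. The selection condition is not stated with the tables; Branch = 'CE' is assumed here because it yields exactly the Output rows shown. The helper name `select` is hypothetical.

```python
student = [
    {"RollNo": 101, "Name": "Raju", "Branch": "CE", "SPI": 8},
    {"RollNo": 102, "Name": "Mitesh", "Branch": "ME", "SPI": 9},
    {"RollNo": 103, "Name": "Nilesh", "Branch": "CI", "SPI": 9},
    {"RollNo": 104, "Name": "Meet", "Branch": "CE", "SPI": 9},
]


def select(relation, predicate):
    """sigma: keep only the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


# Assumed condition: Branch = 'CE'
output = select(student, lambda row: row["Branch"] == "CE")
# output holds the rows for RollNo 101 and 104
```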
Search algorithms for the selection operation
• Linear search (A1)
• Binary search (A2)
Linear search (A1)
• It scans each block and tests all records to see whether they satisfy the selection condition.
  • Cost of linear search (worst case) = br, where br denotes the number of blocks containing records from relation r.
• If the selection condition is on a (primary) key attribute, the system can stop searching as soon as the required record is found.
  • Cost of linear search (average case) = br / 2
• If the selection is on a non-key attribute, multiple blocks may contain the required records, so the cost of scanning those blocks must be added to the cost estimate.
• Linear search can be applied regardless of
  • the selection condition or
  • the ordering of records in the file (relation).
• This algorithm is slower than the binary search algorithm.
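A minimal sketch of linear search that counts block reads, assuming a relation is modelled as a list of blocks and each block as a list of records. The function name and the `stop_at_first` flag (used for key attributes) are hypothetical.

```python
def linear_search(blocks, predicate, stop_at_first=False):
    """Scan blocks in order; return (matching records, blocks read)."""
    matches, blocks_read = [], 0
    for block in blocks:
        blocks_read += 1  # one disk access per block
        for record in block:
            if predicate(record):
                matches.append(record)
                if stop_at_first:  # key attribute: at most one match exists
                    return matches, blocks_read
    return matches, blocks_read


# Four blocks (br = 4) holding two record keys each.
blocks = [[101, 102], [103, 104], [105, 106], [107, 108]]

# Selection on a key: the scan stops at the second block.
key_hit = linear_search(blocks, lambda rec: rec == 103, stop_at_first=True)
# key_hit == ([103], 2)

# Selection on a non-key condition: every block must be read.
full_scan = linear_search(blocks, lambda rec: rec % 2 == 0)
# full_scan == ([102, 104, 106, 108], 4)
```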
Binary search (A2)
• Generally, this algorithm is used if the selection is an equality comparison on the (primary) key attribute and the file (relation) is ordered (sorted) on that attribute.
• Cost of binary search = ⌈log2(br)⌉, where br denotes the number of blocks containing records from relation r.
• This algorithm is faster than the linear search algorithm.
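A sketch of binary search over the blocks of a sorted file, assuming each block is a non-empty sorted list of keys and the blocks are globally ordered. The function name is hypothetical. Note that a plain halving loop like this one reads at most ⌊log2(br)⌋ + 1 blocks, which can be one more than the log2(br) estimate in unlucky cases.

```python
def binary_search_blocks(blocks, key):
    """Locate key in a file sorted on the key; return (found, blocks read)."""
    lo, hi, blocks_read = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]
        blocks_read += 1  # one disk access per probed block
        if key < block[0]:
            hi = mid - 1  # key can only be in an earlier block
        elif key > block[-1]:
            lo = mid + 1  # key can only be in a later block
        else:
            return key in block, blocks_read
    return False, blocks_read


# Eight blocks (br = 8) holding the keys 1..16, two per block.
blocks = [[2 * i + 1, 2 * i + 2] for i in range(8)]

found, reads = binary_search_blocks(blocks, 1)
# found is True after 3 block reads; a linear scan for a missing key would read all 8
```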
Section – 4
Evaluation of expressions
• An expression may contain more than one operation, which makes it harder to evaluate.
• Example: ΠCust_Name ( σBalance<2500 (account) ⋈ customer )
• Expression tree (executed bottom to top): account → σBalance<2500 → ⋈ (with customer) → ΠCust_Name
• To evaluate such an expression, we need to evaluate the operations one by one in an appropriate order.
• Two methods for evaluating an expression carrying multiple operations are:
  • Materialization
  • Pipelining
Materialization
• Materialization evaluates the expression tree of the relational algebra operation from the bottom and performs the innermost or leaf-level operations first.
• The intermediate result of each operation is materialized (stored in a temporary relation) and becomes the input for the subsequent (next) operation.
• The cost of materialization is the sum of the costs of the individual operations plus the cost of writing the intermediate results to disk.
• The problem with materialization is that
  • it creates lots of temporary relations and
  • it performs lots of I/O operations.
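Materialized evaluation of ΠCust_Name (σBalance<2500 (account) ⋈ customer) can be sketched as below. The sample rows, the attribute names, and the `materialize` helper are assumptions for illustration; a Python list stands in for each temporary relation that a real system would write to disk.

```python
# Hypothetical sample data.
account = [
    {"ANO": "A01", "Balance": 3000},
    {"ANO": "A02", "Balance": 1000},
    {"ANO": "A03", "Balance": 2000},
]
customer = [
    {"Cust_Name": "Raj", "ANO": "A01"},
    {"Cust_Name": "Meet", "ANO": "A02"},
    {"Cust_Name": "Jay", "ANO": "A03"},
]

temp_relations = []  # stands in for temporary relations stored on disk


def materialize(rows):
    """Store a complete intermediate result before the next operation runs."""
    stored = list(rows)
    temp_relations.append(stored)
    return stored


# Step 1 (leaf level): selection on account.
t1 = materialize(a for a in account if a["Balance"] < 2500)
# Step 2: join the stored result with customer on ANO.
t2 = materialize({**c, **a} for a in t1 for c in customer if c["ANO"] == a["ANO"])
# Step 3 (root): projection produces the final output.
result = [{"Cust_Name": row["Cust_Name"]} for row in t2]
```

Two temporary relations are created for a three-operation expression; each one would be written to disk and read back in a real system.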
Pipelining
• In pipelining, operations form a queue, and results are passed from one operation to the next as they are calculated.
• To reduce the number of intermediate temporary relations, the results of one operation are passed directly to the next operation in the pipeline.
• Combining operations into a pipeline eliminates the cost of reading and writing temporary relations.
• Pipelines can be executed in two ways:
  • Demand driven (the system makes repeated requests for tuples from the operation at the top of the pipeline)
  • Producer driven (operations do not wait for a request to produce tuples, but generate the tuples eagerly)
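Python generators give a compact sketch of a demand-driven pipeline: each operator produces a tuple only when the operator above it asks for one, and no intermediate relation is stored. The sample rows and the operator function names are assumptions for illustration.

```python
# Hypothetical sample data.
account = [
    {"ANO": "A01", "Balance": 3000},
    {"ANO": "A02", "Balance": 1000},
    {"ANO": "A03", "Balance": 2000},
]
customer = [
    {"Cust_Name": "Raj", "ANO": "A01"},
    {"Cust_Name": "Meet", "ANO": "A02"},
    {"Cust_Name": "Jay", "ANO": "A03"},
]


def select(rows, predicate):
    for row in rows:
        if predicate(row):
            yield row  # hand the tuple to the next operator immediately


def join(left_rows, right_relation, attr):
    for left in left_rows:
        for right in right_relation:
            if left[attr] == right[attr]:
                yield {**left, **right}


def project(rows, attrs):
    for row in rows:
        yield {a: row[a] for a in attrs}


# Nothing is computed yet: the pipeline is only wired together here.
pipeline = project(
    join(select(account, lambda a: a["Balance"] < 2500), customer, "ANO"),
    ["Cust_Name"],
)

first = next(pipeline)  # the top operator demands one tuple
rest = list(pipeline)   # demand the remaining tuples
```

A producer-driven pipeline would instead have each operator push tuples into a buffer as soon as they are ready; that variant is not shown here.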
Section – 5
Query optimization
• Query optimization is the process of selecting the most efficient query evaluation plan from the available plans.
Approaches to Query Optimization
• Exhaustive search optimization
  • Generates all possible query plans and then selects the best plan.
  • It provides the best solution.
• Heuristic-based optimization
  • Uses rule-based optimization approaches for query optimization.
  • Performs select and project operations before join operations. This is done by moving the select and project operations down the query tree, which reduces the number of tuples available for the join.
  • Avoids cross-product operations, because they result in very large intermediate tables.
  • These heuristics do not necessarily produce the best query plan.
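The effect of moving a selection below a join can be seen by counting intermediate tuples. The sketch below uses two hypothetical relations of 100 tuples each; the relation contents and the condition Balance < 3000 are assumptions for illustration.

```python
from itertools import product

# Hypothetical relations: customer(CID, ANO) and account(ANO, Balance).
customer = [(f"C{i:03d}", i) for i in range(1, 101)]
account = [(i, 1000 * i) for i in range(1, 101)]

# Plan A: cross product first, then apply the join and selection conditions.
cross = list(product(customer, account))  # 100 * 100 intermediate tuples
plan_a = [(c, a) for c, a in cross if c[1] == a[0] and a[1] < 3000]

# Plan B: push the selection down, then join only the surviving tuples.
small_account = [a for a in account if a[1] < 3000]  # 2 tuples survive
plan_b = [(c, a) for c in customer for a in small_account if c[1] == a[0]]

pairs_examined_a = len(cross)                          # 10000
pairs_examined_b = len(customer) * len(small_account)  # 200
```

Both plans return the same two tuples, but plan B examines 200 candidate pairs instead of 10000.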
Section – 6
Transformation of relational expressions
• Two relational algebra expressions are said to be equivalent if they generate the same set of tuples.
• Example:

Customer
CID  ANO  Name
C01  A01  Raj
C02  A02  Meet
C03  A03  Jay
C04  A04  Ram

Account
ANO  Balance
A01  3000
A02  1000
A03  2000
A04  4000

Output
Name
Meet
Jay
• A combined selection operation can be divided into a sequence of individual selections. This transformation is called the cascade of σ.
• Example:

Customer
CID  ANO  Name  Balance
C01  1    Raj   3000
C02  2    Meet  1000
C03  3    Jay   2000
C04  4    Ram   4000

σANO<3 ∧ Balance<2000 (Customer) = σANO<3 (σBalance<2000 (Customer))

Output
CID  ANO  Name  Balance
C02  2    Meet  1000

• Selection operations are commutative. On the same Customer relation:

σANO<3 (σBalance<2000 (Customer)) = σBalance<2000 (σANO<3 (Customer))

Output
CID  ANO  Name  Balance
C02  2    Meet  1000
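The equivalence of the combined, cascaded, and reordered selections in this example can be checked directly on the Customer rows. This is a small sketch; the helper name `select` and the tuple layout (CID, ANO, Name, Balance) are choices made for illustration.

```python
# Customer rows as (CID, ANO, Name, Balance).
customer = [
    ("C01", 1, "Raj", 3000),
    ("C02", 2, "Meet", 1000),
    ("C03", 3, "Jay", 2000),
    ("C04", 4, "Ram", 4000),
]


def select(relation, predicate):
    """sigma: keep only the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


# Combined condition: ANO < 3 AND Balance < 2000.
combined = select(customer, lambda t: t[1] < 3 and t[3] < 2000)
# Cascade of sigma: the same condition split into two selections.
cascade = select(select(customer, lambda t: t[3] < 2000), lambda t: t[1] < 3)
# The two selections applied in the opposite order.
swapped = select(select(customer, lambda t: t[1] < 3), lambda t: t[3] < 2000)

# All three hold the single tuple ("C02", 2, "Meet", 1000).
```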
• Theta join operations are commutative: E1 ⋈θ E2 = E2 ⋈θ E1
• Example, with Customer (CID, ANO, Name) joined to Account (ANO, Balance) on ANO (first rows shown: Customer C01, 1, Raj; Account 1, 3000):
  σANO<3 (Customer ⋈ Account) = (Account) ⋈ σANO<3 (Customer)
  Both sides produce an Output relation with the columns CID, ANO, Name, Balance.
• Natural join operations are associative: (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
• The selection operation distributes over the theta join operation under the following conditions:
  • When the selection condition θ0 involves only the attributes of one of the expressions (say E1) being joined: σθ0 (E1 ⋈θ E2) = (σθ0 (E1)) ⋈θ E2
  • When the selection condition θ1 involves only the attributes of E1 and θ2 involves only the attributes of E2: σθ1 ∧ θ2 (E1 ⋈θ E2) = (σθ1 (E1)) ⋈θ (σθ2 (E2))
• SQL query:
Q2: SELECT P.NUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
    FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
    WHERE P.DNUM = D.DNUMBER AND
          D.MGRSSN = E.SSN AND
          P.PLOCATION = 'STAFFORD';
Using Heuristics in Query Optimization (4)