Advanced Database Management System
CHAPTER ONE
Query processing and optimization
Outline
Translating SQL Queries into Relational Algebra
Basic Algorithms for Executing Query Operations
Using Heuristics in Query Optimization
Using Selectivity and Cost Estimates in Query Optimization
Semantic Query Optimization
Query
A query in a database is like asking a question of your
data. It is a set of instructions that tells the database what
information you want to retrieve.
Think of it like this:
You have a library of books. The library is your database,
and each book is like a row in a table.
You want to find a specific book. You wouldn't search
through every single book. Instead, you'd ask the librarian
(the database) for a book with a specific title, author, or
subject. This is the idea behind the queries we use to
access a database.
Query
A query is a statement written by the user in a high-level language
such as SQL.
Parser & Translator
Parser: checks the syntax and verifies that the relations named in the query exist.
Translator: translates the query into an equivalent relational algebra expression.
The parser and translator (query compiler) together:
• Check the syntax
• Check the schema elements
• Convert the query into a relational algebra (R.A) expression
Query Processing
Query processing is the procedure of transforming a high-level
query (such as SQL) into a correct and efficient execution plan that
is expressed in a low-level language.
Query processing is the set of activities performed to obtain
the tuples that satisfy a given query, that is, the activities
involved in retrieving data from the database.
Query Processing
Why It Matters:
Performance: Query processing directly impacts the speed and
efficiency of database queries.
Data Integrity: Accurate query processing ensures that the data
retrieved is consistent with the database state.
Scalability: Effective query processing is crucial for handling large
databases and complex queries.
Query processing goes through several phases:
The first phase is syntax checking: the system parses the query
and checks whether it follows the syntax rules.
It then matches the objects named in the query with the views, tables,
and columns listed in the system tables.
In the second phase the SQL query is translated into an algebraic
expression using various rules.
The process of transforming a high-level SQL query into a
relational algebra form is called Query Decomposition. The
relational algebra expression is then passed to the query optimizer.
Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
Query Analyzer
The syntax analyzer takes the query from the user, parses it into
tokens, and analyses the tokens and their order to make sure they
follow the rules of the language grammar.
A syntactically legal query is then validated against the system
catalog to ensure that all data objects (relations and attributes)
referred to by the query are defined in the database.
If an error is found in the query submitted by the user, the query is
rejected and an error code, together with an explanation of why the
query was rejected, is returned to the user.
Query Decomposition
Query decomposition aims to transform the high-level query into a
relational algebra query and to check whether that query is
syntactically and semantically correct.
Query decomposition therefore starts with a high-level query and
transforms it into a query graph of low-level operations that satisfy
the query.
The SQL query is decomposed into query blocks (low-level
operations), which form the basic units. Nested queries within a
query are identified as separate query blocks.
Parsing: The Language of Databases
The parsing stage is where the database management system
(DBMS) takes your SQL query, a human-readable language, and
transforms it into a structured representation that the DBMS can
understand and process.
Lexical Analysis: The query is broken down into individual tokens
(keywords, identifiers, operators, values) like a sentence being
divided into words.
Syntactic Analysis: The tokens are analyzed for their grammatical
structure (like the order of words in a sentence). This ensures the
query follows the SQL grammar rules.
Example
SELECT name, salary
FROM Employee
WHERE department = 'Sales';
This query is broken down into tokens:
SELECT, name, salary, FROM, Employee, WHERE, department, =,
'Sales', ;
The parser then checks if this sequence of tokens adheres to the
SQL grammar rules.
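The lexical analysis step can be sketched in Python. This is only an illustration, not how any particular DBMS tokenizes SQL: the pattern TOKEN_PATTERN and the function tokenize are hypothetical names, and the pattern covers just enough of SQL to handle the example query. Note that a real lexer also emits the comma between column names as a token.

```python
import re

# Hypothetical, deliberately small token pattern (not a full SQL lexer).
TOKEN_PATTERN = re.compile(
    r"""
    '(?:[^']|'')*'            # single-quoted string literal
  | \d+(?:\.\d+)?             # integer or decimal number
  | [A-Za-z_][A-Za-z_0-9]*    # keyword or identifier
  | <=|>=|<>|!=|[=<>]         # comparison operators
  | [,;().*]                  # punctuation
    """,
    re.VERBOSE,
)


def tokenize(sql):
    """Split an SQL string into a flat list of token strings.

    Whitespace (and any character the pattern does not know) is skipped.
    """
    return TOKEN_PATTERN.findall(sql)


query = "SELECT name, salary FROM Employee WHERE department = 'Sales';"
tokens = tokenize(query)
print(tokens)
```

The syntactic analysis step would then check this token list against the SQL grammar, which is beyond the scope of this sketch.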
Translation: From SQL to Execution Plan
After parsing, the query's meaning is understood. Now, it needs to be
translated into a plan that can be executed efficiently.
* Logical Query Plan: The DBMS creates a logical representation of
the query, showing the operations (like selection, projection, join) that
need to be performed. This plan focuses on the logical steps rather than
specific physical implementations.
* Physical Query Plan: The DBMS optimizes the logical plan by
selecting specific algorithms, indexes, and data access methods. This
plan details how the query will be executed on the physical database
system.
Example:
Logical Plan: The query might be represented as a tree structure, with:
* A SELECT node as the root.
* A FROM node representing the Employee table.
* A WHERE node with the condition department = 'Sales'.
* A PROJECT node to select the name and salary columns.
Physical Plan: The optimizer might decide:
To use an index on the department column to quickly filter the Employee table.
To choose a specific join algorithm based on the sizes and characteristics of the
involved tables.
To fetch the data from the disk in a particular order for efficient access.
Typical stages in query decomposition are
i. Analysis: lexical and syntactic analysis of the query
(correctness), based on attributes, data types, etc. A query
tree is built for the query, containing a leaf node for each base
relation, one or more non-leaf nodes for the relations produced by
relational algebra operations, and a root node for the result of the
query. Operations are executed from the leaves to the root.
SELECT * FROM Catalog c, Author a WHERE a.authorid =
c.authorid AND c.price > 200 AND a.country = 'USA';
ii. Normalization: convert the query into a normalized form. The
WHERE predicate is converted to conjunctive (∧) or
disjunctive (∨) normal form.
iii. Semantic Analysis: rejects normalized queries that are incorrectly
formulated or contradictory. A query is incorrect if some of its components
do not contribute to generating the result. It is contradictory if the
predicate cannot be satisfied by any tuple, for example
(Catalog = "BS" AND Catalog = "CS"), since a given book can be classified
in only one category at a time.
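The contradiction check can be sketched as follows. This is a minimal illustration under a strong assumption: the predicate is a conjunction of simple equality conditions, represented as (attribute, value) pairs. The function name is_contradictory is hypothetical.

```python
def is_contradictory(conjuncts):
    """Return True if two ANDed equality conditions bind the same
    attribute to different constants, so no tuple can satisfy them."""
    required = {}
    for attribute, value in conjuncts:
        if attribute in required and required[attribute] != value:
            return True
        required[attribute] = value
    return False


# Catalog = "BS" AND Catalog = "CS" can never hold for one tuple.
print(is_contradictory([("Catalog", "BS"), ("Catalog", "CS")]))
```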
iv. Simplification: detects redundant qualifications, eliminates
common sub-expressions, and transforms the query into a semantically
equivalent form that is easier and more efficient to compute. Access
restrictions are also checked here: if a user does not have the necessary
access to all of the objects in the query, the query should be rejected.
Translating SQL Queries into Relational Algebra
Key Relational Algebra Operators
1. Selection (σ): Filters rows based on a condition.
Example: σ[salary > 50000](Employee) (selects employees with
a salary greater than 50,000)
2. Projection (π): Extracts specific columns from a relation.
Example: π[name, dept](Employee) (selects the "name" and
"dept" columns)
3. Union (∪): Combines two relations, keeping unique rows.
Example: R ∪ S (combines the results of relations R and S,
removing duplicates)
4. Intersection (∩): Returns the rows that exist in both relations.
Example: R ∩ S (selects rows common to both R and S)
5. Difference (-) : Returns the rows that exist in the first
relation but not in the second.
Example: R - S (selects rows in R that are not in S)
6. Cartesian Product (×): Combines all possible pairs of
rows from two relations.
Example: R × S (creates a new relation with all possible
combinations of rows)
7. Join (⋈): Combines rows from two relations based on a
condition.
Example: Employee ⋈[dept=dno] Department (joins
employees and departments based on matching
departments)
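The operators above can be mimicked in Python by treating a relation as a list of dictionaries. This is an illustrative sketch only: the function names and the sample rows are made up for the example, and a real DBMS works on disk blocks rather than in-memory lists.

```python
def select(relation, predicate):
    """Selection (σ): keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection (π): keep the listed columns and drop duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


def join(left, right, condition):
    """Join (⋈): pair up rows from both relations that satisfy the condition."""
    return [{**l, **r} for l in left for r in right if condition(l, r)]


# Hypothetical sample data.
employee = [
    {"name": "Abebe", "salary": 60000, "dept": 1},
    {"name": "Sara", "salary": 45000, "dept": 2},
]
department = [{"dno": 1, "dname": "Sales"}, {"dno": 2, "dname": "HR"}]

high_paid = select(employee, lambda r: r["salary"] > 50000)
names = project(employee, ["name", "dept"])
joined = join(employee, department, lambda e, d: e["dept"] == d["dno"])
print(high_paid, names, joined, sep="\n")
```

The three calls correspond to σ[salary > 50000](Employee), π[name, dept](Employee), and Employee ⋈[dept=dno] Department.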
Examples of Translating SQL into Relational Algebra
[Figures: worked examples of translating queries into relational algebra using projection, selection, and join]
Optimizer
Finds the equivalent relational algebra (R.A) expressions.
Finds the R.A expression with the least cost.
Cost is measured in terms of CPU, block accesses, and time spent.
Creates a query evaluation plan, which tells which R.A expression and which
algorithms are used.
Query evaluation: the plan above is executed to get the result.
Query Optimization
It is the process of selecting the most efficient query evaluation
plan from among the many strategies usually possible for
processing a given query, especially if the query is complex.
The primary goal of query optimization is to choose an
efficient execution strategy for processing a query.
The query optimizer attempts to minimize the use of certain resources (mainly
the number of I/Os and CPU time) by selecting the best execution plan (access
plan).
Query optimization starts during the validation phase, when the system checks
that the user has the appropriate privileges.
Query Optimization: among all equivalent evaluation plans, choose the one
with the lowest cost.
Cost is estimated using statistical information from the database catalog
➨ e.g. number of tuples in each relation, size of tuples, etc.
Cost is generally measured as the total elapsed time for answering the query.
Storage statistics: Data about allocation of storage into table
spaces, index spaces, and buffer pools.
I/O and device performance statistics: Total read/write activity
(paging) on disk extents and disk hot spots.
Query/transaction processing statistics: Execution times of
queries and transactions, optimization times during query
optimization.
Locking/logging related statistics: Rates of issuing different
types of locks, transaction throughput rates, and log records
activity.
Cost Estimation Components
Cost of access to secondary storage
Storage cost – cost of storing intermediate results
Computation cost
Memory usage cost – usage of RAM buffers
Formulae for cost estimation of each operation
Estimation of relational algebra expression
Choosing the expression with the lowest cost
Typically disk access is the predominant cost, and is also
relatively easy to estimate.
This is measured by taking into account
Number of seeks * average-seek-cost
Number of blocks read * average-block-read-cost
Number of blocks written * average-block-write-cost
The cost to write a block is greater than the cost to read a block,
because data is read back after being written to ensure that the
write was successful.
Generally Measures of Query Cost:
The query costs are defined by the time to answer a query
(process the query execution plan)
Different factors contribute to the query costs like disk access
time, CPU time or even network communication time
The costs are often dominated by the disk access time
seek time (tS) (~4 ms)
transfer time (tT) (e.g. 0.1 ms per disk block)
The cost of query evaluation can be measured in
terms of different resources, including
disk accesses
CPU time to execute the query
in a distributed or parallel database system, the cost of
communication.
Statistical Information for Cost Estimation
nr: number of tuples in a relation r
br: number of blocks containing tuples of r.
sr: size of a tuple of r
fr: blocking factor of r — i.e., the number of tuples of r that fit into one
block.
SC(A, r): selection cardinality of attribute A of relation r; average
number of records that satisfy equality on A.
If tuples of r are stored together physically in a file, then:
br = ⌈nr / fr⌉
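A small numeric sketch of these statistics follows. It assumes that tuples do not span blocks, so the blocking factor is the block size divided by the tuple size, rounded down, and the number of blocks is nr divided by fr, rounded up. The block size of 4096 bytes and the other figures are made-up example values.

```python
import math


def blocking_factor(block_size, tuple_size):
    """fr: how many whole tuples fit into one block (unspanned records)."""
    return block_size // tuple_size


def blocks_needed(n_r, f_r):
    """br: number of blocks needed to hold n_r tuples stored together."""
    return math.ceil(n_r / f_r)


# Hypothetical figures: 4096-byte blocks, 100-byte tuples, 10,000 tuples.
f_r = blocking_factor(4096, 100)
b_r = blocks_needed(10_000, f_r)
print(f_r, b_r)  # 40 250
```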
Selection Operation
The lowest-level query processing operator for accessing data is
the file scan
search and retrieve records for a given selection condition
Linear search
Binary search
Index-based Selection Operation
Conjunctive Selection Operation
Disjunctive Selection Operation
Sorting
join Operation
Linear search
given a file with n blocks, we scan each block and
check if any records satisfy the condition
a selection on a candidate key attribute (unique) can be
terminated after a record has been found
average cost (selection on a key attribute): tS + n/2 * tT
worst-case cost: tS + n * tT
Note: tS = seek time
tT = transfer time
n = number of blocks
Binary search
An equality selection condition on a file that is ordered
on the selection attribute (n blocks) can be realized via a
binary search
note that this only works if we assume that the blocks of
the file are stored contiguously!
worst-case cost: ⌈log2(n)⌉ * (tS + tT)
Index-based Selection Operation
• A search algorithm that makes use of an index is called an index
scan, and the index structure is called an access path.
• Primary index and equality on a candidate key: a single
record is retrieved based on the index
costs for a B+-tree with height h: (h + 1) * (tS + tT)
• Primary index and equality on a non-candidate key: multiple
records might fulfil the condition (possibly spread over n
successive blocks)
costs for a B+-tree with height h: h * (tS + tT) + tS + n * tT
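The selection cost formulas for linear search, binary search, and primary-index access can be compared numerically. The sketch below uses the example timings quoted earlier (about 4 ms seek time and 0.1 ms transfer time per block); the function names, the file size, and the tree height are assumptions chosen only for illustration.

```python
import math

T_S = 4.0  # seek time in ms (example value)
T_T = 0.1  # transfer time per block in ms (example value)


def linear_search_cost(n, key_attribute=False):
    """tS + n * tT in the worst case; about n/2 blocks on average
    when the selection is on a key and can stop at the first match."""
    blocks = n / 2 if key_attribute else n
    return T_S + blocks * T_T


def binary_search_cost(n):
    """ceil(log2(n)) * (tS + tT) on a file ordered on the attribute."""
    return math.ceil(math.log2(n)) * (T_S + T_T)


def primary_index_key_cost(h):
    """(h + 1) * (tS + tT): traverse the B+-tree, then fetch one block."""
    return (h + 1) * (T_S + T_T)


def primary_index_nonkey_cost(h, n):
    """h * (tS + tT) + tS + n * tT: traverse, then read n successive blocks."""
    return h * (T_S + T_T) + T_S + n * T_T


# Hypothetical file of 1000 blocks and a B+-tree of height 3.
print("linear (worst):  ", linear_search_cost(1000))
print("linear (key avg):", linear_search_cost(1000, key_attribute=True))
print("binary search:   ", binary_search_cost(1000))
print("index, key:      ", primary_index_key_cost(3))
print("index, non-key:  ", primary_index_nonkey_cost(3, 5))
```

With these figures the index scan is cheapest and the full linear scan is most expensive, which is why the optimizer looks for a usable access path first.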
Conjunctive Selection Operation
A conjunctive selection has the form σ[θ1 ∧ θ2 ∧ ... ∧ θn](r)
Conjunctive selection using a single index
check if there is an access path available for an attribute in one of
the simple conditions
use one of the approaches described before (with minimal cost) to
retrieve the records and check the other conditions in memory
Conjunctive selection using a composite index
use the appropriate multi-key index if available
Sorting
Sorting in database systems is important for two reasons
1. A query may specify that the output should be sorted
2. the processing of some relational query operations can be
implemented more efficiently based on sorted relations
e.g. join operation
For relations that fit into memory, techniques like quicksort
can be used
For relations that do not fit into memory an external merge
sort algorithm can be used
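The idea behind external merge sort can be sketched in two phases: create sorted runs that each fit in memory, then merge the runs. In the sketch below, Python lists stand in for the run files that a real system would write to disk, and memory_capacity is a hypothetical parameter giving the number of records that fit in memory at once.

```python
import heapq


def external_merge_sort(records, memory_capacity):
    """Sort records without ever sorting more than memory_capacity at once."""
    # Phase 1: cut the input into chunks that fit in memory and sort each one.
    runs = [
        sorted(records[i:i + memory_capacity])
        for i in range(0, len(records), memory_capacity)
    ]
    # Phase 2: k-way merge of the sorted runs, reading one item per run at a time.
    return list(heapq.merge(*runs))


print(external_merge_sort([5, 2, 9, 1, 7, 3, 8], memory_capacity=3))
```

A production implementation may need several merge passes when there are more runs than available buffers; this sketch merges all runs in a single pass.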
Join Operation
Different algorithms for implementing join operations
nested-loop join
block nested-loop join
index nested-loop join
merge join
hash join
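Two of these algorithms, nested-loop join and hash join, can be contrasted in a short sketch for an equality join. Relations are modelled as lists of dictionaries, and the function names, parameter names, and sample rows are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(outer, inner, outer_key, inner_key):
    """Compare every outer row with every inner row: |outer| * |inner| checks."""
    result = []
    for o in outer:
        for i in inner:
            if o[outer_key] == i[inner_key]:
                result.append({**o, **i})
    return result


def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one relation, then probe it once per row
    of the other relation."""
    table = defaultdict(list)
    for b in build:
        table[b[build_key]].append(b)
    result = []
    for p in probe:
        for b in table.get(p[probe_key], []):
            result.append({**b, **p})
    return result


# Hypothetical sample data.
employee = [
    {"name": "Abebe", "dept": 1},
    {"name": "Sara", "dept": 2},
    {"name": "Lidya", "dept": 1},
]
department = [{"dno": 1, "dname": "Sales"}, {"dno": 2, "dname": "HR"}]

nl_result = nested_loop_join(employee, department, "dept", "dno")
hj_result = hash_join(department, employee, "dno", "dept")
print(len(nl_result), len(hj_result))
```

Both functions return the same set of joined rows. The hash join usually builds its table on the smaller relation, here department.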
Techniques for Query Optimization
The first technique is based on Heuristic Rules for ordering the
operations in a query execution strategy.
A heuristic optimization transforms the query expression tree by using
a set of rules that typically improve the execution performance
The second technique involves the systematic estimation of the cost of
the different execution strategies and choosing the execution plan with
the lowest cost.
Semantic query optimization is used in combination with the
heuristic query transformation rules.
Heuristic Rules
Cost-based optimization can be expensive, so a DBMS may use heuristics to
reduce the number of cost-based choices.
Typical heuristic rules:
perform selection as early as possible
- reduces the number of tuples
perform projection as early as possible
- reduces the number of attributes
perform the most restrictive selection and join operations (smallest
result size) before other operations
Heuristic Rules
The SELECT and PROJECT operations reduce the size of a file and
hence should be applied before a JOIN or other binary
operation.
Heuristic query optimizer transforms the initial (canonical)
query tree into final query tree using equivalence
transformation rules.
Using Heuristics in Query Optimization
Query tree:
• A tree data structure that corresponds to a relational algebra
expression.
• It represents the input relations of the query as leaf nodes of
the tree, and represents the relational algebra operations as
internal nodes.
• An execution of the query tree consists of executing an
internal node operation whenever its operands are available
and then replacing that internal node by the relation that
results from executing the operation.
Query graph:
• A graph data structure that corresponds to a relational
calculus expression.
• It does not indicate an order on which operations to perform
first. There is only a single graph corresponding to each
query.
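A query tree and its bottom-up execution can be sketched as follows. The Node class, the evaluate function, and the sample rows are hypothetical. The tree encodes a Catalog/Author query with the selections placed below the join, as the heuristic rules suggest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    op: str                        # "relation", "select", "project", or "join"
    arg: object = None             # rows, predicate, attribute list, or join condition
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def evaluate(node):
    """Execute an internal node once its operands (children) are evaluated."""
    if node.op == "relation":
        return node.arg
    if node.op == "select":
        return [row for row in evaluate(node.left) if node.arg(row)]
    if node.op == "project":
        return [{a: row[a] for a in node.arg} for row in evaluate(node.left)]
    if node.op == "join":
        left_rows, right_rows = evaluate(node.left), evaluate(node.right)
        return [{**l, **r} for l in left_rows for r in right_rows if node.arg(l, r)]
    raise ValueError(f"unknown operation: {node.op}")


# Hypothetical sample data.
catalog = [
    {"title": "DB Systems", "authorid": 1, "price": 250},
    {"title": "Intro SQL", "authorid": 2, "price": 150},
]
author = [
    {"authorid": 1, "aname": "Smith", "country": "USA"},
    {"authorid": 2, "aname": "Lee", "country": "UK"},
]

# Leaves are base relations; selections sit directly above them, below the join.
tree = Node(
    "project", ["title", "aname"],
    left=Node(
        "join", lambda c, a: c["authorid"] == a["authorid"],
        left=Node("select", lambda r: r["price"] > 200, left=Node("relation", catalog)),
        right=Node("select", lambda r: r["country"] == "USA", left=Node("relation", author)),
    ),
)
result = evaluate(tree)
print(result)
```

Because each selection runs before the join, the join only has to compare the rows that survive the filters.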
Using Heuristics in Query Optimization
Fig: Two query trees for Query 1 (Q1): (a) the query tree corresponding to the
relational algebra expression; (b) the initial query tree for Q1
Using Selectivity and Cost Estimates in Query Optimization
Selectivity refers to the fraction of rows in a table that satisfy a given predicate
(condition) in a query. For example, the selectivity of the predicate WHERE age
= 25 would be the number of rows in the table where the age column is 25,
divided by the total number of rows in the table.
Estimating Selectivity:
* Statistics: Databases maintain statistics about data distributions in tables, like
the number of distinct values in columns and their frequencies. These statistics
are used to estimate selectivity.
* Histograms: Some databases use histograms to represent the distribution of
values in a column, which can provide more accurate estimates of selectivity.
* Sampling: Sampling involves randomly selecting a subset of data from a table to
estimate its statistics and selectivity.
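Selectivity and two common ways of estimating it can be illustrated with a small sketch. The function names and the sample ages are made up, and the histogram here is the simplest possible kind, with one bucket per distinct value. Real systems typically use bucketed histograms and may combine several estimates.

```python
from collections import Counter


def exact_selectivity(rows, column, value):
    """Fraction of rows for which column = value (requires a full scan)."""
    matching = sum(1 for row in rows if row[column] == value)
    return matching / len(rows)


def uniform_estimate(distinct_values):
    """Estimate for an equality predicate, assuming values are spread evenly."""
    return 1 / distinct_values


def histogram_estimate(histogram, total_rows, value):
    """Estimate from stored per-value frequencies instead of scanning the table."""
    return histogram.get(value, 0) / total_rows


# Hypothetical table with a skewed age column.
rows = [{"age": a} for a in [25, 25, 30, 40, 25, 30, 50, 60]]
histogram = Counter(row["age"] for row in rows)

print(exact_selectivity(rows, "age", 25))
print(uniform_estimate(len(histogram)))
print(histogram_estimate(histogram, len(rows), 25))
```

On skewed data like this, the uniform assumption underestimates the selectivity of age = 25, while the histogram captures it.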
Cost Estimation Approach to Query Optimization
• The main idea is to minimize the cost of processing a query.
• The cost function is comprised of:
• I/O cost + CPU processing cost + communication cost + Storage cost
• These components might have different weights in different
processing environments
• The DBMS uses information stored in the system catalogue for the
purpose of estimating cost.
• The main target of query optimization is to minimize the size of the
intermediate relation. The size will have effect in the cost of:
• Access Cost of Secondary Storage
• Storage Cost
• Computation Cost
• Communication Cost
• Memory usage cost
Cost Estimation Approach to Query Optimization
1. Access Cost of Secondary Storage
• Data is going to be accessed from secondary storage. The disk access cost
can again be analyzed in terms of:
• Searching
• Reading, and
• Writing, data blocks used to store some portion of a relation.
• Remark: The disk access cost will vary depending on
• The file organization used and the access method implemented for the
file organization.
• whether the data is stored contiguously or in scattered manner, will
affect the disk access cost.
2. Storage Cost
• While processing a query, as any query would be composed of many
database operations, there could be one or more intermediate results before
reaching the final output. These intermediate results should be stored in
primary memory for further processing.
• The bigger the intermediate relation, the larger the memory requirement,
which will have impact on the limited available space. This will be
considered as a cost of storage.
3. Computation Cost
• Query is composed of many operations. The operations could be
database operations like reading and writing to a disk, or
mathematical and other operations like:
• Searching
• Sorting
• Merging
• Computation on field values
4. Communication Cost
• In most database systems the database resides at one station and is
accessed by queries originating from different terminals. This
affects the performance of the system and adds to the cost of
query processing. Thus, the cost of transporting data between the
database site and the terminal where the query originates should
be analyzed.
5. Memory usage cost: This is the cost pertaining to the number of
memory buffers needed during query execution.
What is Semantic Query Optimization?
Semantic Query Optimization (SQO) is a technique used in database systems to
improve the efficiency of queries by leveraging the meaning (semantics) of the
data and relationships within the database.
SQO Techniques:
• Predicate Pushdown: Moving filter conditions as close to the data source
(tables) as possible. This can reduce the amount of data that needs to be
scanned.
• View Merging: Combining multiple views into a single query to simplify the
execution plan.
• Constant Folding: Evaluating constant expressions during query optimization
to avoid redundant computations.
• Redundant Condition Elimination: Identifying and removing redundant
conditions from the WHERE clause.
Benefits of SQO:
Improved Query Performance: SQO can significantly
reduce query execution time by finding more efficient
execution plans.
Reduced Data I/O: By eliminating redundant conditions
and using data dependencies, SQO can minimize the amount
of data that needs to be accessed from disk.
Increased Scalability: SQO is particularly beneficial for
large and complex databases, where efficient query
processing is essential.