
Query Processing and Optimization
Query Processing in DBMS
Query processing is the activity of extracting data from the database. It involves a series of steps for fetching the requested data from the database. The steps involved are:
• Parsing and translation
• Optimization
• Evaluation
Query processing works in the following way:
Parsing and Translation
• Query processing begins with translating the user's query. Queries are written in a high-level database language such as SQL, and must be translated into expressions that can be used at the physical level of the file system.
• SQL, or Structured Query Language, is the best-suited choice for humans because it is readable and easy to understand, but it is not well suited as the internal representation of the query inside the system.
• Relational algebra is well suited for the internal representation of a query.
• When a user executes a query, the parser in the system generates the internal form of the query: it checks the syntax of the query, verifies the names of the relations in the database, and checks the tuples and the required attribute values.
• The parser creates a tree of the query, known as the 'parse tree', which is then translated into a relational-algebra expression. During this translation, all uses of views in the query are replaced by their definitions.
• After this translation, a variety of query-optimizing transformations are applied before the actual evaluation of the query takes place.
• Thus, the working of query processing can be understood from the example and diagram described below:
The SQL query

SELECT stu_name, stu_address FROM Student WHERE age <= 25;

is translated into the relational-algebra expression

σage<=25 (πstu_name,stu_address (Student))

which the parser represents as an operator tree: the selection σage<=25 at the root, the projection πstu_name,stu_address below it, and the relation Student at the leaf.
Suppose a user executes a query. As we have seen, there are various methods of extracting the data from the database. Suppose, in SQL, the user wants to fetch the records of the employees whose salary is greater than 10000. The following query is used:
• select emp_name from Employee where salary>10000;
To make the system understand the user query, it needs to be translated into relational algebra. This query can be brought into relational-algebra form in two equivalent ways:
• σsalary>10000 (πemp_name (Employee))
• πemp_name (σsalary>10000 (Employee))
After translating the given query, each relational-algebra operation can be executed using one of several different algorithms. This is how query processing begins its work.
Evaluation
In addition to translating the query into relational algebra, the translated expression must be annotated with instructions specifying how each operation is to be evaluated. After translating the user query, the system builds and executes a query evaluation plan.
Query Evaluation Plan
• In order to fully evaluate a query, the system needs to construct a query
evaluation plan.
• The annotations in the evaluation plan may specify the algorithm to be used for a particular operation, or the particular index or indexes to be used.
• Relational algebra annotated in this way is referred to as evaluation primitives. An evaluation primitive carries the instructions needed for evaluating the operation.
• Thus, a query evaluation plan defines a sequence of primitive operations used
for evaluating a query. The query evaluation plan is also referred to as the query
execution plan.
• The query execution engine is responsible for generating the output of the given query: it takes the query evaluation plan, executes it, and returns the result of the user query.
Optimization
• The cost of query evaluation can vary widely across the different evaluation plans for a query. The system is responsible for constructing an efficient evaluation plan, so the user does not need to write the query efficiently.
• Usually, the database system generates an efficient query evaluation plan that minimizes the evaluation cost. This task, performed by the database system, is known as query optimization.
• To optimize a query, the query optimizer needs an estimated cost for each operation, because the overall cost of a plan depends on the memory allocated to the operations, their execution costs, and so on.
• Finally, after selecting an evaluation plan, the system evaluates the query and
produces the output of the query.
Estimating Query Cost
• Though a system can create multiple plans for a query, the plan chosen should be the best of them.
• This is done by comparing the possible plans in terms of their estimated cost. To calculate the net estimated cost of a plan, the cost of each operation within the plan is determined and then combined to give the net estimated cost of the query evaluation plan.
• The cost estimation of a query evaluation plan is calculated in terms of various
resources that include:
– Number of disk accesses
– Execution time taken by the CPU to execute a query
– Communication costs in distributed or parallel database systems.
• To estimate the cost of a query evaluation plan, we use the number of blocks transferred from disk and the number of disk seeks.
• Suppose tS is the average block-access (seek) time in seconds and tT is the average time in seconds to transfer one data block. The block access time is the sum of the disk seek time and the rotational latency.
• If a plan transfers b blocks and performs S seeks, the time taken is b*tT + S*tS seconds. For example, with tT = 0.1 ms, tS = 4 ms, a block size of 4 KB, and a transfer rate of 40 MB per second, we can easily calculate the estimated cost of a given query evaluation plan (a small worked example follows this list).
• Generally, for estimating the cost, we consider the worst case: we assume that all of the data must initially be read from disk. In reality, some of the required blocks may already be present in main memory; because this effect is usually ignored, the actual cost of execution often comes out lower than the estimated value.
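The following is a minimal sketch, in Python, of the cost formula b*tT + S*tS using the parameter values quoted above (tT = 0.1 ms per block transfer, tS = 4 ms per seek). The example plan (a 100-block linear scan with a single seek) is a made-up illustration, not a figure from the text.

T_TRANSFER = 0.1e-3   # seconds to transfer one 4 KB block (40 MB/s)
T_SEEK = 4e-3         # seconds per disk seek

def plan_cost(blocks_transferred: int, seeks: int) -> float:
    """Estimated I/O time in seconds: b*tT + S*tS."""
    return blocks_transferred * T_TRANSFER + seeks * T_SEEK

if __name__ == "__main__":
    # e.g. a linear scan of a 100-block relation: one seek, 100 block transfers
    print(f"{plan_cost(100, 1) * 1000:.1f} ms")   # -> 14.0 ms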
The response time, i.e., the time required to execute the plan, could also be used to estimate the cost of a query evaluation plan. But for the following reasons, it is difficult to calculate the response time without actually executing the plan:
• Once the query begins execution, the response time depends on the contents of the buffer, which are hard to determine at optimization time and may not be available at all.
• On a system with multiple disks, the response time depends on how accesses are distributed among the disks, which is difficult to estimate without detailed knowledge of the data layout on disk.
• Consequently, instead of minimizing the response time of a query evaluation plan, optimizers find it better to reduce the total resource consumption of the plan. Thus, to estimate the cost of a query evaluation plan, it is good to minimize the resources used for disk access and other resources.
Selection Operation
Selection Using File Scans
The selection operation can be performed by a file scan. File scans are search algorithms used to locate and access the data; a file scan is the lowest-level operator used in query processing.
Let's see how selection using a file scan is performed.
Linear Search
In a linear search, each record is read from the beginning of the file until the required record is reached, checking the filter condition on each record one after the other. Records can be fetched using a linear search irrespective of the filter condition, sorting, or indexes.
Suppose tS is the seek time (the number of seeks is usually one, to reach the beginning of the file), tT is the transfer time for one block, and B is the number of blocks to be transferred. Then the cost is calculated as:
tS + (B*tT)
This is the cost of fetching records based on a non-key attribute. If the search is on a key attribute, the scan can stop as soon as the record is found, so the average cost of the query is tS + (B*tT)/2, and in the worst case it is tS + (B*tT).
Binary Search
This method of selection is applicable only when the records are sorted on the search-key value and the condition is an equality comparison; it is not suitable for range or other kinds of conditions.
The filter condition must have the form ‘search key column = value’, as in CLASS_NAME = ‘DESIGN_01’.
If the blocks of records are stored contiguously, the cost of fetching the first matching record is ⌈log2(B)⌉ * (tS + tT), since each probe of the binary search reads one block.
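Below is a minimal sketch of binary search over the blocks of a file sorted on the search key. The file is modeled as a Python list of blocks, each block a sorted list of (key, record) pairs; this structure and the function name are illustrative assumptions, not an API from the text.

from bisect import bisect_left

def binary_search_file(blocks, key):
    """Return the first record with the given key, probing about log2(B) blocks."""
    lo, hi = 0, len(blocks) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]                  # one block access, costing roughly tS + tT
        if key < block[0][0]:
            hi = mid - 1                     # key lies in an earlier block
        elif key > block[-1][0]:
            lo = mid + 1                     # key lies in a later block
        else:
            keys = [k for k, _ in block]
            i = bisect_left(keys, key)
            if i < len(block) and keys[i] == key:
                return block[i][1]
            return None                      # key falls in this block's range but is absent
    return None

Each iteration touches one block, so at most ⌈log2(B)⌉ block accesses are made, matching the cost formula above.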
Selection Operation with Indexes
Index-based search algorithms are known as index scans. The index structures they use are known as access paths, since they provide a path for locating and accessing the data in the file. The following algorithms use an index in query processing:
• Primary index, equality on a key: We use the index to retrieve the single record that satisfies the equality condition. The equality comparison is performed on the key attribute on which the primary index is built.
• Primary index, equality on nonkey: The difference from equality on a key is that here multiple records may be fetched. We can fetch multiple records through a primary index when the selection criteria specify an equality comparison on a non-key attribute.
• Secondary index, equality on key or nonkey: The selection that specifies an
equality condition can use the secondary index. Using secondary index strategy,
we can either retrieve a single record when equality is on key or multiple records
when the equality condition is on nonkey. When retrieving a single record, the
time cost is equal to the primary index. In the case of multiple records, they may
reside on different blocks. This results in one I/O operation per fetched record,
and each I/O operation requires a seek and a block transfer.
Selection Operations with Comparisons
A selection based on a comparison can be performed either by a linear search or by using indices, in the following ways:
• Primary index, comparison: When the selection condition given by the user is a comparison, a primary ordered index, such as a primary B+-tree index, can be used. For example, for a condition of the form A > v on attribute A of relation R, the primary index on A is used to locate the first tuple satisfying the condition, and the file is then scanned sequentially from that point to the end, outputting all tuples that satisfy the selection (for a condition of the form A < v, a simple file scan from the beginning of the file suffices, without using the index).
• Secondary index, comparison: A secondary ordered index can be used for selection operations involving <, ≤, >, or ≥. In this case the scan searches the blocks of the lowest-level index:
(<, ≤): the index is scanned from the smallest value up to the given value v.
(>, ≥): the index is scanned from the given value v up to the maximum value.
However, the use of the secondary index should be limited to selecting a few records. Such an index holds a pointer to each record, so records can be fetched through these pointers, but each retrieved record may require a separate I/O operation because the records may be stored on different blocks of the file. So, if the number of fetched records is large, selection using the secondary index becomes expensive.
Implementing Complex Selection Operations
Working with more complex selections involves three kinds of selection predicates: conjunction, disjunction, and negation.
Conjunction: A conjunctive selection is a selection of the form:
• σ θ1∧θ2∧…∧θn (r)
The conjunction is the intersection of all records that satisfy each of the conditions θi.
Disjunction: A disjunctive selection is a selection of the form:
• σ θ1∨θ2∨…∨θn (r)
The disjunction is the union of all records that satisfy at least one of the conditions θi.
Negation: The result of the selection σ¬θ(r) is the set of tuples of relation r for which the condition θ evaluates to false. In the absence of nulls, this is simply the set of tuples of r that are not in σθ(r).
Using these discussed selection predicates, we can implement the selection
operations by using the following algorithms:
• Conjunctive selection using one index: We first determine whether an access path (index) is available for one of the simple conditions. If one is found, an index-based selection algorithm from those described earlier is used to retrieve the records satisfying that condition; the selection is then completed by testing whether each retrieved record satisfies the remaining simple conditions. The cost of this algorithm is the cost of the chosen index-based selection.
• Conjunctive selection using a composite index: A composite index is one built on multiple attributes. Such an index may be usable for some conjunctive selections: if the selection specifies equality conditions on two or more attributes and a composite index exists on these combined attribute fields, the index can be searched directly, using the appropriate index-based algorithm.
• Conjunctive selection by intersection of identifiers: This implementation uses record pointers (record identifiers) and requires indices with record pointers on the fields involved in the individual conditions. Each index is scanned for pointers to tuples satisfying its individual condition; the intersection of all the retrieved pointer sets is the set of pointers to the tuples that satisfy the conjunctive condition. The algorithm then uses these pointers to fetch the actual records. If indices are not available on all of the individual conditions, the retrieved records are additionally tested against the remaining conditions (a sketch of this approach follows this list).
• Disjunctive selection by union of identifiers: This algorithm applies only if access paths are available on all the disjunctive selection conditions. Each index is scanned for pointers to tuples that satisfy its individual condition; the union of all retrieved pointer sets gives the pointers to all tuples that satisfy the disjunctive condition, and these pointers are then used to fetch the actual records. If an access path is missing for even one condition, a linear scan of the relation is needed to find the satisfying tuples, and in that case it is better to use a single linear scan for the whole test.
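The following is a minimal sketch of conjunctive selection by intersection of record identifiers. Each index is modeled as a plain dict mapping an attribute value to a set of record IDs (RIDs), and the heap file as a dict from RID to record; these structures, and the helper name conjunctive_select, are illustrative assumptions rather than a real DBMS API.

def conjunctive_select(indexes, conditions, heap_file):
    """indexes: {attr: {value: set of RIDs}}; conditions: {attr: required value}."""
    rid_sets = []
    for attr, value in conditions.items():
        if attr in indexes:                       # index scan for this simple condition
            rid_sets.append(indexes[attr].get(value, set()))
    if not rid_sets:
        return []                                 # no usable index: a linear scan would be needed
    rids = set.intersection(*rid_sets)            # pointers satisfying all indexed conditions
    result = []
    for rid in rids:                              # fetch the actual records
        record = heap_file[rid]
        # re-test every condition, covering attributes that had no index
        if all(record.get(a) == v for a, v in conditions.items()):
            result.append(record)
    return result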
Sorting
Sorting in a database system is important for two reasons:
1. A query may specify that its output should be sorted.
2. Some relational query operations, e.g. the join operation, can be implemented more efficiently on sorted relations.
For relations that fit in memory, techniques like quick-sort can be used; for relations that do not fit in memory, an external sort-merge algorithm is used.
External Sort – Merge Algorithm
Sorting relations that do not fit in memory, because their size is larger than the available memory, is known as external sorting. The external sort-merge algorithm is the most widely used method for external sorting.
Let M denote the memory size (in blocks).
1. Create sorted runs. Let i be 0 initially. Repeatedly do the following until the end of the relation is reached:
  (a) Read M blocks of the relation into memory.
  (b) Sort the in-memory blocks.
  (c) Write the sorted data to run Ri.
  (d) Increment i.
Let N be the number of runs created.
2. Merge the runs (N-way merge). We assume (for now) that N < M.
  (a) Use N blocks of memory to buffer the input runs and 1 block to buffer the output. Read the first block of each run into its buffer page.
  (b) Repeat until all input buffer pages are empty:
    i. Select the first record (in sort order) among all buffer pages.
    ii. Write the record to the output buffer; if the output buffer is full, write it to disk.
    iii. Delete the record from its input buffer page; if the buffer page becomes empty, read the next block (if any) of that run into the buffer.
• If N ≥ M, several merge passes are required.
– In each pass, contiguous groups of M - 1 runs are merged.
– A pass reduces the number of runs by a factor of M - 1 and creates runs longer by the same factor.
– E.g. if M = 11 and there are 90 runs, one pass reduces the number of runs to 9, each 10 times the size of the initial runs.
• Repeated passes are performed till all runs have been merged into one.
Example: sort the values 18, 11, 16, 13, 12, 17, 21, 15, 19, 20, 14 with M = 3 (one value per block).
• Run creation (3 values per run) gives four sorted runs: (11, 16, 18), (12, 13, 17), (15, 19, 21), (14, 20).
• Since N = 4 ≥ M = 3, runs are merged M - 1 = 2 at a time:
– Pass 1: (11, 16, 18) + (12, 13, 17) → 11, 12, 13, 16, 17, 18 and (15, 19, 21) + (14, 20) → 14, 15, 19, 20, 21.
– Pass 2: merging the two remaining runs gives the fully sorted output 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21.
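A minimal Python sketch of external sort-merge follows. For simplicity the runs are kept as in-memory lists rather than disk files, heapq.merge plays the role of the N-way merge with one buffer page per run, and memory_size stands for the number of values (blocks) that fit in memory at once; all of these simplifications are illustrative assumptions.

import heapq

def external_sort_merge(values, memory_size):
    # Step 1: create sorted runs of at most memory_size values each.
    runs = [sorted(values[i:i + memory_size])
            for i in range(0, len(values), memory_size)]
    # Step 2: merge at most memory_size - 1 runs per pass until one run remains.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), memory_size - 1):
            group = runs[i:i + memory_size - 1]
            merged.append(list(heapq.merge(*group)))
        runs = merged
    return runs[0] if runs else []

if __name__ == "__main__":
    data = [18, 11, 16, 13, 12, 17, 21, 15, 19, 20, 14]
    print(external_sort_merge(data, 3))   # reproduces the worked example above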
JOIN OPERATION
Several different algorithms to implement joins
• Nested-loop join
• Block nested-loop join
• Indexed nested-loop join
• Merge-join
• Hash-join
Choice based on cost estimate
NESTED-LOOP JOIN
To compute the theta join r ⋈θ s:
for each tuple tr in r do begin
  for each tuple ts in s do begin
    test the pair (tr, ts) to see if they satisfy the join condition θ
    if they do, add tr • ts to the result.
  end
end
• r is called the outer relation and s the inner relation of the join.
• Requires no indices and can be used with any kind of join condition.
• Expensive since it examines every pair of tuples in the two relations.
BLOCK NESTED-LOOP JOIN
Variant of nested-loop join in which every block of inner relation is paired
with every block of outer relation.
for each block Br of r do begin
  for each block Bs of s do begin
    for each tuple tr in Br do begin
      for each tuple ts in Bs do begin
        check if (tr, ts) satisfy the join condition
        if they do, add tr • ts to the result.
      end
    end
  end
end
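The sketch below is a direct Python rendering of the block nested-loop pseudocode above. Relations are modeled as lists of dict records and a block is simply a fixed-size slice of that list; block_size and the predicate-style join condition are illustrative choices.

def blocks(relation, block_size):
    for i in range(0, len(relation), block_size):
        yield relation[i:i + block_size]

def block_nested_loop_join(r, s, theta, block_size=100):
    """Join r and s on the predicate theta(tr, ts), reading both relations block by block."""
    result = []
    for br in blocks(r, block_size):          # each block of the outer relation
        for bs in blocks(s, block_size):      # paired with each block of the inner relation
            for tr in br:
                for ts in bs:
                    if theta(tr, ts):         # test the join condition
                        result.append({**tr, **ts})
    return result

# e.g. block_nested_loop_join(emp, dept, lambda a, b: a["dept_id"] == b["dept_id"])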
INDEXED NESTED-LOOP JOIN
Index lookups can replace file scans if
• join is an equijoin or natural join and
• an index is available on the inner relation’s join attribute
– Can construct an index just to compute a join.
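As a sketch of the idea, an indexed nested-loop join for an equi-join can be written as below; the index on the inner relation's join attribute is modeled as a dict built up front (or, as noted above, constructed just to compute the join). The structure is an illustrative assumption, not a prescribed implementation.

from collections import defaultdict

def indexed_nested_loop_join(r, s, join_attr):
    index = defaultdict(list)               # index on the inner relation's join attribute
    for ts in s:
        index[ts[join_attr]].append(ts)
    result = []
    for tr in r:                             # one index lookup per tuple of the outer relation
        for ts in index.get(tr[join_attr], []):
            result.append({**tr, **ts})
    return result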
MERGE-JOIN
• Sort both relations on their join attribute (if not already sorted on the join
attributes).
• Merge the sorted relations to join them
– Join step is similar to the merge stage of the sort-merge algorithm.
– Main difference is handling of duplicate values in join attribute — every pair
with same value on join attribute must be matched
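A minimal merge-join sketch on a single join attribute is given below, including the duplicate handling mentioned above: every pair of tuples with the same join-attribute value is matched. Relations are lists of dicts; the names are illustrative.

def merge_join(r, s, attr):
    r = sorted(r, key=lambda t: t[attr])     # sort both relations on the join attribute
    s = sorted(s, key=lambda t: t[attr])
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][attr] < s[j][attr]:
            i += 1
        elif r[i][attr] > s[j][attr]:
            j += 1
        else:
            v = r[i][attr]
            i2 = i                           # find the group of equal values on each side
            while i2 < len(r) and r[i2][attr] == v:
                i2 += 1
            j2 = j
            while j2 < len(s) and s[j2][attr] == v:
                j2 += 1
            for tr in r[i:i2]:               # match every pair within the two groups
                for ts in s[j:j2]:
                    result.append({**tr, **ts})
            i, j = i2, j2
    return result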
HASH-JOIN
The hash join algorithm is used to perform natural-join or equi-join operations. The idea behind the hash join algorithm is to partition the tuples of each given relation into sets, where the partitioning is done on the basis of the hash value of the join attributes, computed by a hash function. The main goal of using the hash function in the algorithm is to reduce the number of comparisons and so increase the efficiency of the join operation on the relations.
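A minimal hash-join sketch for an equi-join is shown below. Both relations are partitioned with the same hash function on the join attribute; each pair of matching partitions is then joined with a build phase and a probe phase. The number of partitions, the dict-based structures, and the choice of build side are illustrative assumptions.

from collections import defaultdict

def hash_join(r, s, attr, n_partitions=8):
    # Partition phase: route tuples of both relations by the hash of the join attribute.
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for t in r:
        r_parts[hash(t[attr]) % n_partitions].append(t)
    for t in s:
        s_parts[hash(t[attr]) % n_partitions].append(t)
    result = []
    for p in range(n_partitions):
        # Build phase: an in-memory hash table on this partition of s.
        table = defaultdict(list)
        for ts in s_parts[p]:
            table[ts[attr]].append(ts)
        # Probe phase: look up each tuple from the matching partition of r.
        for tr in r_parts[p]:
            for ts in table.get(tr[attr], []):
                result.append({**tr, **ts})
    return result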
Evaluation of Expressions
To evaluate an expression that contains multiple operations, we can compute each operation one by one. In a query processing system, two methods are used for evaluating an expression that carries multiple operations:
• Materialization
• Pipelining
Let's discuss these methods briefly.
Materialization
In this method, the expression is evaluated one relational operation at a time, with the operations evaluated in an appropriate order. The result of each operation is materialized in a temporary relation for subsequent use.
The disadvantage of materialization is that these temporary relations must be constructed to hold the results of the evaluated operations, and they are written to disk unless they are small in size.
Pipelining
Pipelining is an alternative approach to materialization. In pipelining, several relational operations of the expression are evaluated simultaneously in a pipeline: the output of one operation is passed on to the next operation as it is produced, and the chain continues until all the relational operations have been evaluated.
Materialization
In this method, a query is broken into individual sub-queries, and their results are then used to obtain the final result. To be more specific, suppose there is a requirement to find the students who are studying in class ‘DESIGN_01’.
SELECT * FROM STUDENT s, CLASS c
WHERE s.CLASS_ID = c.CLASS_ID AND c.CLASS_NAME = ‘DESIGN_01’;
• Here we can observe two sub-queries: one selects the CLASS_ID of ‘DESIGN_01’, and the other selects the details of the students with the CLASS_ID retrieved by the first query.
The DBMS does the same.
• It breaks the query into the two parts mentioned above. It then evaluates the first query and stores its result in a temporary table in memory. This temporary table data is then used to evaluate the second query.
• This is an example of a two-level query in the materialization method. There can be any number of levels, with correspondingly many temporary tables.
Although this method looks simple, the cost of this type of evaluation is always higher: time is spent evaluating each level and writing its result into a temporary table, then reading that temporary table back to compute the next level of the result, and so on. Hence the cost of evaluation in this method is:
Cost = cost of the individual SELECTs + cost of writing into temporary tables
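As a minimal sketch of materialized evaluation for the query above, the inner sub-query's result is written to a temporary table before the outer sub-query runs. Here STUDENT and CLASS are lists of dicts and the temporary table is just a Python list, an illustrative stand-in for a real on-disk temporary relation.

def materialized_design01_students(student, class_):
    # Level 1: evaluate the CLASS sub-query and materialize its result.
    temp_class_ids = [c["CLASS_ID"] for c in class_
                      if c["CLASS_NAME"] == "DESIGN_01"]
    # Level 2: evaluate the STUDENT query against the materialized temporary result.
    return [s for s in student if s["CLASS_ID"] in temp_class_ids]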
Pipelining
• In this method, the DBMS does not store intermediate records in temporary tables. Instead, it evaluates each sub-query and passes its result on to the next sub-query for processing, and so on; each sub-query uses the result of the previous one as it is produced.
• In the example above, the CLASS_ID of DESIGN_01 is passed to the STUDENT query to get the student details.
• In this method there is no extra cost of writing into temporary tables; there is only the cost of evaluating the individual sub-queries, so it gives better performance than materialization.
There are two types of pipelining:
Demand Driven or Lazy evaluation
In this method, the results of lower-level queries are not passed to the higher level automatically. They are passed to the higher level only when the higher level requests them: the lower level retains its result and state, and transfers the next value only on request.
In our example above, the CLASS_ID for DESIGN_01 is retrieved, but it is passed to the STUDENT query only when that query requests it. Once the request arrives, it is passed to the student query, and that query is then processed (a sketch of this pull-style pipelining follows the next subsection).
Producer Driven or Eager Pipelining
In this method, the lower-level queries eagerly pass their results up to the higher-level queries, without waiting for the higher level to request them. The lower-level query places its results in a buffer, and the higher-level query pulls results from the buffer for its use; if the buffer is full, the lower-level query waits for the higher-level query to empty it. Demand-driven and producer-driven pipelining are therefore also called pull and push pipelining, respectively.
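Below is a minimal sketch of demand-driven (pull) pipelining using Python generators, one common way to realize the iterator style described above. Each operator pulls one tuple at a time from its child only when asked; the operator names, table layout, and example plan are illustrative assumptions, not the text's API.

def scan(relation):
    for t in relation:
        yield t                          # a tuple is produced only when the consumer asks

def select(child, predicate):
    for t in child:                      # pulls one tuple at a time from the child operator
        if predicate(t):
            yield t

def project(child, attrs):
    for t in child:
        yield {a: t[a] for a in attrs}

# Pipeline for: SELECT stu_name FROM STUDENT WHERE CLASS_ID = 1 (hypothetical data)
# plan = project(select(scan(STUDENT), lambda t: t["CLASS_ID"] == 1), ["stu_name"])
# next(plan) pulls a single result through the whole pipeline on demand.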
What is Query Optimization?
• Query optimization is of great importance for the performance of a relational
database, especially for the execution of complex SQL statements.
• For any given query, there may be a number of different ways to execute it. The
process of choosing a suitable one for processing a query is known as query
optimization.
• A query optimizer decides the best methods for implementing each query.
• The query optimizer selects, for instance, whether or not to use indexes for a
given query, and which join methods to use when joining multiple tables.
• These decisions have a tremendous effect on SQL performance, and query optimization is a key technology for every application, from operational systems to data warehouse and analytical systems to content management systems.
Understand how your database is executing your query −
• The first phase of query optimization is understanding what the database is doing with your query.
• Different databases have different commands for this.
• For example, in MySQL, one can use the “EXPLAIN [SQL Query]” keyword to see
the query plan.
• In Oracle, one can use the “EXPLAIN PLAN FOR [SQL Query]” to see the query
plan.
Retrieve as little data as possible −
• The more data returned by the query, the more resources the database needs to expend to process and return these records.
• For example, if you only need to fetch one column from a table, do not use ‘SELECT *’.
Store intermediate results −
• Sometimes logic for a query can be quite complex.
• It is possible to produce the desired outcome through the use of subqueries, inline views, and UNION-type statements.
• With these methods, the intermediate results are not saved in the database but are used directly within the query.
• This can lead to performance issues, particularly when the intermediate results contain a huge number of rows.
There are various query optimization strategies, as follows −
• Use Index − Using an index is the first strategy one should apply to speed up a query.
• Aggregate Table − Pre-populate tables at higher levels of aggregation so that less information needs to be parsed.
• Vertical Partitioning − Partition the table by columns. This method reduces the amount of information a SQL query needs to process.
• Horizontal Partitioning − Partition the table by data value, most often by time. This method also reduces the amount of information a SQL query needs to process.
The two forms of query optimization are as follows −
• Heuristic optimization − The query execution plan is refined based on heuristic rules for reordering the individual operations.
• Cost-based optimization − The overall cost of executing the query is systematically reduced by estimating the costs of several different execution plans.
Example
Select name from customer, account where customer.name=account.name and
account.balance>2000;
There are two evaluation plans −
Πcustomer.name (σcustomer.name=account.name ∧ account.balance>2000 (customer × account))
Πcustomer.name (σcustomer.name=account.name (customer × σaccount.balance>2000 (account)))
• Cost evaluator evaluates the cost of different evaluation plans and chooses the
evaluation plan with lowest cost.
• Disk access time, CPU time, number of operations, number of tuples, size of
tuples are considered for cost calculations.
• The heuristic approach is also called rule-based optimization.
– There are three common rules for transforming relational-algebra queries −
o Perform the selection operations as early as possible in the query. This should be the first step for any SQL table, since it decreases the number of records to be handled in the rest of the query, rather than carrying all the tuples of the tables through the query.
o Perform all projections as early as achievable in the query. Somewhat like selection, this method helps by decreasing the number of columns carried through the query.
o Perform the most restrictive joins and selection operations first. This means choosing the tables and/or views that result in a relatively small number of records and are strictly necessary in the query; obviously, any query executes better when tables with few records are joined.
o Hence we see that throughout these approaches, the motive is cost optimization (a small sketch of the benefit of the first rule follows this list).
o Cost-based optimization is expensive.
o Heuristics are used to reduce the number of choices that must be made in a cost-based approach.
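The sketch below illustrates, in Python, why the second evaluation plan above (with the selection pushed below the product) is preferred by heuristic optimization. Relations are lists of dicts; the function names and record layout are illustrative assumptions.

def plan_select_late(customer, account):
    # Plan 1: build the full Cartesian product first, then filter.
    product = [(c, a) for c in customer for a in account]
    return [c["name"] for c, a in product
            if c["name"] == a["name"] and a["balance"] > 2000]

def plan_select_early(customer, account):
    # Plan 2: push sigma(balance > 2000) down before the product.
    rich = [a for a in account if a["balance"] > 2000]
    return [c["name"] for c in customer for a in rich if c["name"] == a["name"]]

# Both plans return the same names, but plan 2 builds a much smaller
# intermediate result, which is exactly what the selection-first heuristic exploits.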
Thank You !!!!!!!!
