Unit 5 Query Processing Detail
Unit 5 Query Processing Detail
INTRODUCTION TO DBMS
MITCOM
MIT ADT UNIVERSITY
Unit 5
QUERY PROCESSING
Compiled by
Dr Ashwin Tomar
[MCA, MBA, PhD(Computer Sc.)]
[email protected]
1
MITCOM MCA DS
Syllabus
Query Processing & Query Optimization 2 Hours
9.1 Overview, measures of query cost, selection operation, sorting, Join
9.2 Evaluation of expressions, transformation of relational expressions,
estimating statistics of expression results, evaluation plans, materialized views
UNIT - 4
Query Processing
Query Processing includes translations on high level Queries into low level
expressions that can be used at physical level of file system, query optimization
and actual execution of query to get the actual result.
2
MITCOM MCA DS
Method of Optimising the query by choosing the strategy that results in minimum cost is
called as cost based query optimisation. The cost of algorithm depends on cardinality of
input, output. The cost of executing query includes Cost of access to secondary storage,
Cost of storage, Cost of computation, Cost of memory usage, Cost of communication.
Minimisation of cost depends on the size and type of database application.
Parser: During parse call, the database performs the following checks-
Syntax check, Semantic check and Shared pool check, after converting the query
into relational algebra.
Parser performs the following checks as (refer detailed diagram):
1. Syntax check – concludes SQL syntactic validity.
Example:
SELECT * FORM employee
Here error of wrong spelling of FROM is given by this check.
3
MITCOM MCA DS
If there is a fresh query and its hash code does not exist in shared pool then
that query has to pass through from the additional steps known as hard
parsing otherwise if hash code exists then query does not passes through
additional steps. It just passes directly to execution engine. This is known as
Hard Parse includes following steps Optimizer and Row source generation.
Step-2:
soft parsing.
Optimizer: During optimization stage, database must perform a hard
parse at least for one unique DML statement and perform optimization during this
parse. This database never optimizes DDL unless it includes a DML component
such as sub query that require optimization.
It is a process in which multiple query execution plan for satisfying a query are
examined and most efficient query plan is satisfied for execution.
Database catalog stores the execution plans and then optimizer passes the lowest
cost plan for execution.
Row Source Generation –
The Row Source Generation is a software that receives a optimal execution
plan from the optimizer and produces an iterative execution plan that is
usable by the rest of the database. The iterative plan is the binary program
that when executes by the sql engine produces the result set.
Step-3:
Execution Engine: Finally runs the query and display the required result.
4
MITCOM MCA DS
Query Plans: A query plan (or query execution plan) is an ordered set of steps
used to access data in a SQL relational database management system.
First, it provides the user with faster results, which makes the application
seem faster to the user.
Secondly, it allows the system to service more queries in the same amount
of time, because each request takes less time than unoptimized queries.
Thirdly, query optimization ultimately reduces the amount of wear on the
hardware (e.g. disk drives), and allows the server to run more efficiently (e.g.
lower power consumption, less memory usage).
5
MITCOM MCA DS
Relational Algebra
1. Select Operation:
Notation: σ p(r)
Where:
6
MITCOM MCA DS
Input:
Output:
2. Project Operation:
o This operation shows the list of those attributes that we wish to appear in the result.
Rest of the attributes are eliminated from the table.
o It is denoted by ∏.
Where
7
MITCOM MCA DS
Input:
Output:
NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn
3. Union Operation:
o Suppose there are two tuples R and S. The union operation contains all the tuples that
are either in R or S or both in R & S.
o It eliminates the duplicate tuples. It is denoted by ∪.
Notation: R ∪ S
Example:
DEPOSITOR RELATION
CUSTOMER_NAME ACCOUNT_NO
Johnson A-101
Smith A-121
Mayes A-321
Turner A-176
Johnson A-273
Jones A-472
Lindsay A-284
8
MITCOM MCA DS
BORROW RELATION
CUSTOMER_NAME LOAN_NO
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17
Input:
Output:
CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Mayes
4. Set Intersection:
o Suppose there are two tuples R and S. The set intersection operation contains all tuples
that are in both R & S.
o It is denoted by intersection ∩.
Notation: R ∩ S
Input:
Output:
CUSTOMER_NAME
Smith
Jones
9
MITCOM MCA DS
5. Set Difference:
o Suppose there are two tuples R and S. The set intersection operation contains all tuples
that are in R but not in S.
o It is denoted by intersection minus (-).
Notation: R - S
Input:
Output:
CUSTOMER_NAME
Jackson
Hayes
Willians
Curry
6. Cartesian product
o The Cartesian product is used to combine each row in one table with each row in the
other table. It is also known as a cross product.
o It is denoted by X.
Notation: E X D
Example:
EMPLOYEE
DEPARTMENT
DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal
Input:
EMPLOYEE X DEPARTMENT
10
MITCOM MCA DS
Output:
7. Rename Operation:
The rename operation is used to rename the output relation. It is denoted by rho (ρ).
Example: We can use the rename operator to rename STUDENT relation to STUDENT1.
ρ(STUDENT1, STUDENT)
Relational model can represent as a table with columns and rows. Each row is known as a tuple.
Each table of the column has a name or attribute.
Attribute: It contains the name of a column in a particular table. Each attribute Ai must have
a domain, dom(Ai)
Relational instance: In the relational database system, the relational instance is represented
by a finite set of tuples. Relation instances do not have duplicate tuples.
Relational schema: A relational schema contains the name of the relation and name of all
columns or attributes.
Relational key: In the relational key, each row has one or more attributes. It can identify the
row in the relation uniquely.
11
MITCOM MCA DS
o In the given table, NAME, ROLL_NO, PHONE_NO, ADDRESS, and AGE are the
attributes.
o The instance of schema STUDENT has 5 tuples.
o t3 = <Laxman, 33289, 8583287182, Gurugram, 20>
Properties of Relations
Integrity Constraints
o Integrity constraints are a set of rules. It is used to maintain the quality of information.
o Integrity constraints ensure that the data insertion, updating, and other processes have
to be performed in such a way that data integrity is not affected.
o Thus, integrity constraint is used to guard against accidental damage to the database.
1. Domain constraints
o Domain constraints can be defined as the definition of a valid set of values for an
attribute.
12
MITCOM MCA DS
o The data type of domain includes string, character, integer, time, date, currency, etc.
The value of the attribute must be available in the corresponding domain.
Example:
2. Entity integrity constraints [primary key value can't be null – identifies rows]
o The entity integrity constraint states that primary key value can't be null.
o This is because the primary key value is used to identify individual rows in relation and
if the primary key has a null value, then we can't identify those rows.
o A table can contain a null value other than the primary key field.
Example:
Example:
13
MITCOM MCA DS
4. Key constraints
o Keys are the entity set that is used to identify an entity within its entity set uniquely.
o An entity set can have multiple keys, but out of which one key will be the primary key.
A primary key can contain a unique and null value in the relational table.
Example:
14
MITCOM MCA DS
Transformation Rules:
15
MITCOM MCA DS
16
MITCOM MCA DS
17
MITCOM MCA DS
Minimality
A set of equivalence rules is said to be minimal if no rule can be derived from any
combination of the others. A query is said to be optimal when it is minimal.
Examples
instructor(ID, name, dept_name, salary)
teaches(ID, course_id, sec_id, semester, year)
course(course_id, title, dept_name, credits)
18
MITCOM MCA DS
19
MITCOM MCA DS
There are various factors affecting the performance of the query like
When a SQL query is submitted to DB, it can be evaluated in number of ways. For
example, consider the below case:
Above query selects the EMP_ID and DEPT_NAME from EMP and DEPT table for
DEPT_ID = 10. But when it is given to the DBMS, it divides the query into tokens
and sees how it can be put together so that performance will be better. This is the
duty of query optimizer. But altering the order of tokens in the query should not
change the result. In either way it should give same result. Order of records can
change and are least important. This is called equivalent query. There is set of rules
to put tokens in the query. This is called equivalence rule.
Select the records of EMP with DEPT_ID = 10 first then join them with DEPT
table to get all matching records of EMP and DEPT. Then select only the
columns EMP_ID and DEPT_NAME to display the result.
20
MITCOM MCA DS
Select all matching records from EMP and DEPT, from which filter on
DEPT_ID = 10 and select only EMP_ID and DEPT_NAME to display.
Both the steps above are same irrespective of how it is performed. Hence both are
called equivalent query. These are not written in SQL, but using relational algebra,
graph or tree.
or
21
MITCOM MCA DS
Above relational algebra and tree shows how DBMS depicts the query inside it. But
the cost of both of them may vary. This is because the number of records in each
step changes depending on the join and filters we use, and the algorithms used to
evaluate them. For example we may have huge number of records in second case
tree above to filter. But in the first case we are filtering the record first; hence
number of records to join would have been reduced. This makes lots of difference
and query optimizer calculates this difference and selects the optimal tree for
query evaluation.
The query optimizer uses these two techniques to determine which process or
expression to consider for evaluating the query.
This is based on the cost of the query. The query can use different paths based on
indexes, constraints, sorting methods etc. This method mainly uses the statistics
like record size, number of records, number of records per block, number of
blocks, table size, whether whole table fits in a block, organization of tables,
uniqueness of column values, size of columns etc.
T1 ∞ T2 ∞ T3 ∞ T4∞ T5 ∞ T6
22
MITCOM MCA DS
For above query we can have any order of evaluation. We can start taking any two
tables in any order and start evaluating the query. Ideally, we can have join
combinations in (2(n-1))! / (n-1)! ways.
For example, suppose we have 5 tables involved in join, then we can have 8! / 4! =
1680 combinations. But when query optimizer runs, it does not evaluate in all
these ways always. It uses dynamic programming where it generates the costs for
join orders of any combination of tables. It is calculated and generated only once.
This least cost for all the table combination is then stored in the database and
is used for future use. i.e.; say we have a set of tables, T = { T1 , T2 , T3 .. Tn}, then
it generates least cost combination for all the tables and stores it.
Dynamic Programming
The least cost for the joins of any combination of table is generated here. These
values are stored in the database and when those tables are used in the query, this
combination is selected for evaluating the query.
Suppose we have set of tables, T = {T1 , T2 , T3 .. Tn}, in a DB. It picks the first table,
and computes cost for joining with rest of the tables in set T. It calculates cost for
each of the tables and then chooses the best cost. It continues doing the same with
rest of the tables in set T. It will generate 2n – 1 cases and it selects the lowest cost
and stores it. When a query uses those tables, it checks for the costs here and that
combination is used to evaluate the query. This is called dynamic programming.
In this method, time required to find optimized query is in the order of 3n, where
n is the number of tables. Suppose we have 5 tables, then time required in 35 =
243, which is lesser than finding all the combination of tables and then deciding
the best combination (1680). Also, the space required for computing and storing
the cost is also less and is in the order of 2n. In above example, it is 25 = 32.
This is another method of determining the cost of the joins. Here, the tables and
joins are represented in the form of trees. The joins always form the root of the
tree and table is kept at the right side of the root. LHS of the root always point to
the next join. Hence it gets deeper and deeper on LHS. Hence it is called as left
23
MITCOM MCA DS
deep tree.
Here instead of calculating the best join cost for set of tables, best join cost for
joining with each table is calculated. In this method, time required to find
optimized query is in the order of n2n, where n is the number of tables. Suppose
we have 5 tables, then time required in 5*25 =160, which is lesser than dynamic
programming. Also, the space required for computing storing the cost is also less
and is in the order of 2n. In above example, it is 25 = 32, same as dynamic
programming.
(T1 ∞T2) ∞ T3 ∞… ∞ Tn
This method uses hash join or merge join method to calculate the cost. Hash Join
will simply join the tables. We get sorted output in merge join method, but it is
costlier than hash join. Even though merge join is costlier at this stage, when it
moves to join with third table, the join will have less effort to sort the tables. This
is because first table is the sorted result of first two tables. Hence it will reduce the
total cost of the query.
24
MITCOM MCA DS
But the number of tables involved in the join would be relatively less and this
cost/space difference will be hardly noticeable.
All these cost based optimizations are expensive and are suitable for large number
of data. There is another method of optimization called heuristic optimization,
which is better compared to cost based optimization.
This method is also known as rule based optimization. This is based on the
equivalence rule on relational expressions; hence the number of combination of
queries get reduces here. Hence the cost of the query too reduces.
This method creates relational tree for the given query based on the equivalence
rules. These equivalence rules by providing an alternative way of writing and
evaluating the query, gives the better path to evaluate the query. This rule need
not be true in all cases. It needs to be examined after applying those rules. The
most important set of rules followed in this method is listed below:
Perform all the selection operation as early as possible in the query. This
should be first and foremost set of actions on the tables in the query. By
performing the selection operation, we can reduce the number of records
involved in the query, rather than using the whole tables throughout the
query.
Suppose we have a query to retrieve the students with age 18 and studying in class
DESIGN_01. We can get all the student details from STUDENT table, and class
details from CLASS table. We can write this query in two different ways.
The queries will return same result. But when we observe them closely we can see
that first query will join the two tables first and then applies the filters. That
means, it traverses whole table to join, hence the number of records involved is
more. But he second query, applies the filters on each table first. This reduces the
number of records on each table (in class table, the number of record reduces to
25
MITCOM MCA DS
one in this case!). Then it joins these intermediary tables. Hence the cost in this
case is comparatively less.
Instead of writing query the optimizer creates relational algebra and tree for
above case.
Perform all the projection as early as possible in the query. This is similar to
selection but will reduce the number of columns in the query.
Suppose for example, we have to select only student name, address and class name
of students with age 18 from STUDENT and CLASS tables.
Here again, both the queries look alike, results alike. But when we compare the
number of records and attributes involved at each stage, second query uses less
records and hence more efficient.
26
MITCOM MCA DS
Next step is to perform most restrictive joins and selection operations. When we
say most restrictive joins and selection means, select those set of tables and views
which will result in comparatively less number of records. Any query will have
better performance when tables with few records are joined. Hence throughout
heuristic method of optimization, the rules are formed to get less number of
records at each stage, so that query performance is better. So is the case here too.
Suppose we have STUDENT, CLASS and TEACHER tables. Any student can attend
only one class in an academic year and only one teacher takes a class. But a class
can have more than 50 students. Now we have to retrieve STUDENT_NAME,
ADDRESS, AGE, CLASS_NAME and TEACHER_NAME of each student in a school.
Not So efficient
Efficient
In the first query, it tries to select the records of students from each class. This
result in a very huge intermediary table. This table is joined with another small
table. Hence the traversing of number of records is also more.
But in the second query, CLASS and TEACHER are joined first, which has one to
one relation here. Hence the number of resulting record is STUDENT table give the
final result. Hence this second method is more efficient.
All these methods need not be always true. It also depends on the table size,
column size, type of selection, projection, join sort, constraints, indexes, statistics
etc. Above optimization describes the best way of optimizing the queries.
27
MITCOM MCA DS
Evaluation Plans
When a query is submitted to DB, it is parsed and translated to relational algebra.
It is verified for its validity and correctness. Once it passes this stage, different
ways of evaluating the query is generated. It is checked for various factors and
its execution plan is generated. It may be based on cost of the query or based on
the equivalence rules. Once cost based execution and rule based execution plans
are generated, optimizer has to decide, which plan to be selected for evaluation.
This is the most important step in processing a query.
As we have seen in other articles, the cost or the heuristic execution plan may not
be always effective in all the tables with same type of query. They are all general
guidelines to evaluate a query. There are lot many factors affecting the
performance of a query. The evaluation plans exactly defines the system which
plan / algorithm is to be used to evaluate, which index to be used etc.
Above query selects the EMP_ID and DEPT_NAME from EMP and DEPT table for
DEPT_ID = 10 with employee’s last name as ‘Joseph’. But when it is given to the
DBMS, it divides the query into tokens and sees how it can be put together so
that performance will be better. This is the duty of query optimizer. But
altering the order of tokens in the query should not change the result. In either
way it should give same result. Order of records can change and are least
important. This is called equivalent query. There is set of rules to put tokens in
the query. This is called equivalence rule.
Above query can be broken down by the DBMS in either ways below :
28
MITCOM MCA DS
Select all matching records from EMP and DEPT, from which filter on DEPT_ID =
10 and EMP_LAST_NAME = ‘Joseph’ and then select only EMP_ID and
DEPT_NAME to display.
Select all matching records from EMP and DEPT, from which select only EMP_ID
and DEPT_NAME and then filter on DEPT_ID = 10 and EMP_LAST_NAME =
‘Joseph’ and then.
Both the steps above are same irrespective of how it is performed. Hence both
are called equivalent query. These are not written in SQL, but using relational
algebra, or tree.
Or
Or
29
MITCOM MCA DS
The optimizer can produce relational expression and tree in above three
formats. According to evaluation rule
The optimizer can produce relational expression and tree in above three
formats. According to evaluation rule, the first query seems to be the best
one. But considering other factors of tables, other plans may also be efficient.
30
MITCOM MCA DS
It collects all the statistics, costs, access/ evaluation paths, relational trees etc.
It then analyses them and chooses the best evaluation path.
Or
Or
31
MITCOM MCA DS
First tree reduces the number of records for joining and seems to be efficient. But
what happens if we have index on DEPT_ID? Then the join between EMP and EMP
can also be more efficient. But we see the filter condition on EMP table, we have
DEPT_ID = 10, which is index column. Hence first applying selection condition and
then join will reduce the number of records as well as make the join more efficient
than without index. Next are the projected columns – EMP_ID and DEPT_NAME.
they are all distinct values. There cannot be duplicate values for them. But we are
selecting those values for DEPT_ID = 10, hence DEPT_NAME has only one value.
Hence their selectivity is same as number of employees working for DEPT_ID = 10.
But we are selecting only those employees whose last name is ‘Joseph’. Hence the
selectivity is min (distinct (employee (DEPT_10)), distinct (employee (DEPT_10,
JOSEPH)). Obliviously distinct (employee (DEPT_10, JOSEPH)) would have lesser
value. The optimizer decides all these factors for above 3 trees and then decides
first tree would be more efficient. Hence it evaluates the query using first tree.
32
MITCOM MCA DS
Materialization
In this method, queries are broken into individual queries and then the results of
which are used to get the final result. To be more specific, suppose there is a
requirement to find the students who are studying in class ‘DESIGN_01’.
Here we can observe two queries: one is to select the CLASS_ID of ‘DESIGN_01’ and
another is to select the student details of the CLASS_ID retrieved in the first query.
The DBMS also does the same. It breaks the query into two as mentioned above.
Once it is broken, it evaluates the first query and stores it in the temporary table
in the memory. This temporary table data will be then used to evaluate the second
query.
This is the example of two level queries in materialization method. Any number
of levels and so many numbers of temporary tables.
Although this method looks simple, the cost of this type of evaluation is always
more. It takes the time to evaluate and write into temporary table, then retrieve
33
MITCOM MCA DS
from this temporary table and query to get the next level of result and so on.
Hence cost of evaluation in this method is:
Pipelining
In this method, DBMS do not store the records into temporary tables. Instead, it
queries each query and result of which will be passed to next query to process and
so on. It will process the query one after the other and each will use the result of
previous query for its processing.
In this method no extra cost of writing into temporary tables. It has only cost of
evaluation of individual queries; hence it has better performance than
materialization.
In this method, the result of lower level queries are not passed to the higher level
automatically. It will be passed to higher level only when it is requested by the
higher level. In this method, it retains the result value and state with it and it will
be transferred to the next level only when it is requested.
34
MITCOM MCA DS
In our example above, CLASS_ID for DESIGN_01 will be retrieved, but it will not be
passed to STUDENT query only when it is requested. Once it gets the request, it is
passed to student query and that query will be processed.
In this method, the lower level queries eagerly pass the results to higher level
queries. It does not wait for the higher level queries to request for the results. In
this method, lower level query creates a buffer to store the results and the higher
level queries pulls the results for its use. If the buffer is full, then the lower level
query waits for the higher level query to empty it. Hence it is also called as PULL
and PUSH pipelining.
There are still more methods of pipelining like Linear and non-linear methods of
35
MITCOM MCA DS
36
MITCOM MCA DS
37
MITCOM MCA DS
38