unit-4-database-design-query-processing
DBMS
Unit - 4
Normalization sets out rules for deciding whether a relational table satisfies a given
normal form. A normal form is a set of criteria against which each relation is
evaluated; bringing a relation into a normal form eliminates undesirable functional,
multivalued and join dependencies from it. In a normalized design, modifying,
removing or adding data does not create anomalies, which helps improve the
consistency of the relational tables.
Objectives of Normalization
● Normalization analyzes the attributes stored in a table in order to reduce
inconsistency and complexity.
https://www.goseeko.com/reader/notes/savitribai-phule-pune-university-maharashtra/engineering/information-technology/second-year/sem-2/dat… 1/39
3/1/23, 7:31 PM https://www.goseeko.com/reader/notes/savitribai-phule-pune-university-maharashtra/engineering/information-technology/seco…
● Dividing a large database table into smaller tables and connecting them
through relationships helps reduce redundancy.
Key takeaway:
Data redundancy
❏ Data redundancy means that the same piece of data is held in two places. Within a
single database this may mean two different fields, or it may mean two different
spots in various software environments or platforms. Whenever data is replicated,
it effectively constitutes data redundancy.
❏ Data redundancy can occur by mistake, but it is often introduced deliberately for
backup and recovery.
❏ There are various classifications under the general concept of data redundancy,
depending on what is considered acceptable in database management and what is
considered unnecessary or inefficient. Wasteful data duplication normally happens
when, due to inefficient coding or process complexity, a given piece of data does
not need to be repeated but ends up being duplicated.
For instance, wasteful data redundancy is present when inconsistent duplicates of the
same entry are found in the same database. Accidental data duplication caused by
inefficient coding or overcomplicated data storage processes poses a problem in
terms of efficiency and cost.
Anomalies
A relation that has redundant data may have update anomalies. These anomalies are
classified as:
1. Insertion Anomalies.
2. Deletion Anomalies.
3. Modification Anomalies.
Insertion Anomalies
To insert a new employee tuple into Emp_dept, we must enter values for the department
attributes, or NULLs if the employee does not yet work for any department.
Case I: -
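Suppose Dno=21 already exists in the dept relation. While inserting a new employee of
this department into Emp_dept, the department values must be entered again and must be
consistent with the values already stored for Dno=21.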
Case II: -
This problem does not occur if we use separate employee and department relations.
For an employee we enter only dno=21; the remaining information about the
department is recorded only once, in the dept relation. So, there is no problem
of consistency.
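A department that has no employees as yet (it can be inserted into Emp_dept only by
placing NULLs in the employee attributes):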
Case I: -
Ename=NULL
Ssn=NULL
Address=NULL
Dno=51
Dname=Civil
Dmgr_Ssn=321
Case II: -
This problem does not occur if we use separate Employee and Department relations,
because a department is entered in the Department relation whether or not any employee
works for it.
Deletion Anomalies
Case I: -
Consider the first row of Emp_Dept, where Dno=21. Digvijay is the only employee
working for Dno=21. So, if we delete this row, the information regarding Dno=21 will
be lost.
Case II: -
This is not the case with separate Employee and Department relations. If we delete
Digvijay's record (Dno=21) from Employee, the department information remains
in the Department table.
Modification Anomalies
If we change the manager of dept_no. 31, i.e. we change Dmgr_Ssn from 456 to 765,
then we must update the tuples of all employees who work in that dept; otherwise the
database will become inconsistent.
A database without any anomalies will work correctly and remain
consistent.
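The deletion and modification anomalies described above can be reproduced in a few lines. The sketch below is only an illustration: the attribute names, Dno=21, Digvijay and the manager change 456 to 765 follow the text, while every other value (the Ssn numbers, the second and third employees, the department names) is a hypothetical placeholder.

```python
# Single combined relation Emp_dept (values other than Dno=21/Digvijay are hypothetical).
emp_dept = [
    {"Ename": "Digvijay", "Ssn": 101, "Dno": 21, "Dname": "Comp", "Dmgr_Ssn": 123},
    {"Ename": "Asha", "Ssn": 102, "Dno": 31, "Dname": "IT", "Dmgr_Ssn": 456},
    {"Ename": "Ravi", "Ssn": 103, "Dno": 31, "Dname": "IT", "Dmgr_Ssn": 456},
]

# Deletion anomaly: deleting the only employee of Dno=21 also loses the department.
after_delete = [row for row in emp_dept if row["Ename"] != "Digvijay"]
dept_21_survives = any(row["Dno"] == 21 for row in after_delete)

# Modification anomaly: updating the manager in only one Dno=31 tuple
# leaves two different managers recorded for the same department.
partial = [dict(row) for row in emp_dept]
partial[1]["Dmgr_Ssn"] = 765
managers_of_31 = {row["Dmgr_Ssn"] for row in partial if row["Dno"] == 31}

# Decomposed design: Employee keeps only Dno, Department stores each dept once.
employee = [{"Ename": r["Ename"], "Ssn": r["Ssn"], "Dno": r["Dno"]} for r in emp_dept]
department = {r["Dno"]: {"Dname": r["Dname"], "Dmgr_Ssn": r["Dmgr_Ssn"]} for r in emp_dept}

employee = [e for e in employee if e["Ename"] != "Digvijay"]  # delete the employee
department[31]["Dmgr_Ssn"] = 765                              # one update is enough

print(dept_21_survives)        # False
print(sorted(managers_of_31))  # [456, 765]
print(21 in department)        # True
```

In the decomposed design the department survives the deletion and the manager is changed in exactly one place, which is the behaviour Case II describes.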
Key takeaway:
- Data redundancy can occur by mistake, but it is often done purposely for the
purposes of backup and recovery.
- A database without any anomaly will work correctly and it will be consistent.
The three most important rules for functional dependencies (Armstrong's axioms) are given below:
● Reflexivity law: If y is a subset of x, then x -> y holds.
● Augmentation law: If x -> y holds, then xz -> yz also holds for any set of attributes z.
● Transitivity law: If x -> y holds and y -> z holds, then x -> z also holds. This rule is very
similar to the transitive rule in algebra. x -> y is read as "x functionally
determines y".
Example:
ID → Name,
Name → DOB
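As a small illustration of the transitivity rule, the following sketch computes the closure of a set of attributes under the two dependencies of the example. The function name `closure` and the representation of a dependency as a pair of sets are assumptions made for this sketch, not notation from the notes.

```python
def closure(attributes, fds):
    """Return every attribute functionally determined by `attributes`.

    `fds` is a list of (lhs, rhs) pairs of attribute sets. A dependency is
    applied whenever its left side is already contained in the result, which
    is how transitivity shows up in practice.
    """
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


fds = [({"ID"}, {"Name"}), ({"Name"}, {"DOB"})]
id_closure = closure({"ID"}, fds)
print(sorted(id_closure))  # ['DOB', 'ID', 'Name']
```

Since DOB appears in the closure of ID, the dependency ID → DOB follows from the two given dependencies.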
Key takeaway:
Normalization is usually carried out as a series of steps, each corresponding to a normal
form with its own properties. As normalization proceeds, the relations become progressively
more restricted in format, and also less vulnerable to update anomalies. For the
relational data model, only first normal form (1NF) is strictly required in order to
create relations. All the remaining forms are optional.
A relation R is said to be normalized if it does not create any anomaly for three basic
operations: insert, delete and update.
A relation r is in 1NF if and only if every tuple contains only atomic attribute values,
that is, exactly one value for each attribute. As per the rule of first normal form, an
attribute of a table cannot hold multiple values. It should hold only atomic values.
Example: Suppose a company has created an employee table to store the name, address
and mobile number of employees.
In the above table, two employees, John and Mary, have two mobile numbers each. So, the
attribute emp_mobile is not an atomic attribute, as it holds more than one value for these
employees. This table is therefore not in 1NF, as the rule says, “Each attribute of a table
must have atomic values”.
● Once every emp_mobile entry holds a single number, the table is in 1NF.
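One common way to reach 1NF is to store each value of the multi-valued attribute in its own row. The sketch below assumes that approach; the phone numbers, the third employee and the helper name `to_1nf` are hypothetical, and only the names John and Mary and the attribute emp_mobile come from the text.

```python
# Non-1NF table: emp_mobile holds a list of values (all numbers are made up).
employees = [
    {"emp_name": "John", "emp_mobile": ["9000000001", "9000000002"]},
    {"emp_name": "Mary", "emp_mobile": ["9000000003", "9000000004"]},
    {"emp_name": "Ravi", "emp_mobile": ["9000000005"]},
]


def to_1nf(rows, multi_valued):
    """Emit one row per value of the multi-valued attribute."""
    flat = []
    for row in rows:
        for value in row[multi_valued]:
            atomic = dict(row)             # copy the other attributes unchanged
            atomic[multi_valued] = value   # exactly one value per row
            flat.append(atomic)
    return flat


flat_rows = to_1nf(employees, "emp_mobile")
print(len(flat_rows))  # 5
```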
An attribute that is not part of any candidate key is known as a non-prime attribute.
Example: Suppose a school wants to store data of teachers and the subjects they
teach. Since a teacher can teach more than one subject, the table can have multiple
rows for the same teacher.
The table is in 1 NF because each attribute has atomic values. However, it is not in
2NF because non-prime attribute Teacher_Age is dependent on Teacher_Id alone
which is a proper subset of candidate key. This violates the rule for 2NF as the rule
says “no non-prime attribute is dependent on the proper subset of any candidate key
of the table”.
To bring the above table into 2NF we can break it into two tables (Teacher_Details and
Teacher_Subject):
Teacher_Details table:
Teacher_id Teacher_age
111 28
222 35
333 38
Teacher_Subject table:
Teacher_id Subject
111 DSF
111 DBMS
222 CNT
333 OOPL
333 FDS
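The effect of this decomposition can be checked with a short sketch. It joins the two tables back on Teacher_id and tests which functional dependencies hold in the joined rows. The helper name `holds` and the tuple layout are assumptions made for illustration; the data values are the ones listed in the two tables above.

```python
teacher_details = {111: 28, 222: 35, 333: 38}  # Teacher_id -> Teacher_age
teacher_subject = [
    (111, "DSF"), (111, "DBMS"), (222, "CNT"), (333, "OOPL"), (333, "FDS"),
]

# Natural join on Teacher_id rebuilds the single-table design:
# (Teacher_id, Subject, Teacher_age), with the age repeated per subject.
joined = [(tid, subject, teacher_details[tid]) for tid, subject in teacher_subject]


def holds(rows, lhs_index, rhs_index):
    """True if the column at lhs_index functionally determines rhs_index."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs_index], row[rhs_index]) != row[rhs_index]:
            return False
    return True


print(holds(joined, 0, 2))  # True: Teacher_id -> Teacher_age (partial dependency)
print(holds(joined, 0, 1))  # False: Teacher_id alone does not determine Subject
print(len(joined), len(teacher_details))  # 5 3
```

The joined table stores five age values where the decomposed design stores three, which is the redundancy that 2NF removes.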
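So it can be stated that a table is in 3NF if it is in 2NF and, for each functional
dependency X->Y, at least one of the following conditions holds: X is a super key of
the table, or Y is a prime attribute of the table.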
An attribute that is a part of one of the candidate keys is known as prime attribute.
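For example, if
1) P->Q and
2) Q->R
hold, then P->R also holds, and R is transitively dependent on P (assuming Q does not
determine P).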
Example: suppose a company wants to store information about employees. Then the
table Employee_Details looks like this:
Non-prime attributes: All attributes except emp_id are non-prime, as they are not
part of any candidate key.
To bring this table into 3NF we have to break it into two tables to remove the transitive
dependency.
Employee_Details table:
E1 Hary M1
E2 John M1
E3 Nil M2
E4 Mery M3
E5 Steve M2
Manager_Details table:
Boyce-Codd Normal Form (BCNF) is an advanced version of 3NF and is stricter than 3NF.
A table complies with BCNF if it is in 3NF and, for every non-trivial functional
dependency X->Y, X is a super key of the table.
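The BCNF condition can be tested mechanically: for each non-trivial dependency, compute the closure of its left side and check whether it covers the whole relation. The sketch below is one possible way to do this. It uses only the dependency emp_id -> emp_nationality and the attributes named in the example that follows in these notes; the function names and the reduced three-attribute relation are assumptions made for illustration.

```python
def closure(attributes, fds):
    """Attributes functionally determined by `attributes` under `fds`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation, fds):
    """Return the non-trivial dependencies whose left side is not a super key."""
    violations = []
    for lhs, rhs in fds:
        trivial = rhs <= lhs
        is_super_key = closure(lhs, fds) >= relation
        if not trivial and not is_super_key:
            violations.append((lhs, rhs))
    return violations


# Reduced version of the employee example (an assumption for this sketch).
relation = {"emp_id", "emp_nationality", "emp_dept"}
fds = [({"emp_id"}, {"emp_nationality"})]

violations = bcnf_violations(relation, fds)
print(len(violations))  # 1

# After decomposition, emp_id is a key of the Emp_nationality table.
print(bcnf_violations({"emp_id", "emp_nationality"}, fds))  # []
```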
Example: Suppose there is a company wherein employees work in more than one department.
Emp_id ->emp_nationality
The table is not in BCNF, as neither emp_id nor emp_dept alone is a key. To bring
this table into BCNF we can break it into three tables:
Emp_nationality table:
Emp_id Emp_nationality
101 Indian
102 Japanese
Emp_dept table:
Emp_dept_mapping table:
Emp_id Emp_dept
101 Planning
101 Accounting
102 Technical support
102 Sales
Functional dependencies:
Candidate keys:
It means a relation R is in 4NF if and only if, whenever there exist subsets A and B of the
attributes of R such that the multivalued dependency A →→ B is satisfied, all
attributes of R are also functionally dependent on A.
Multi-Valued Dependency
● Suppose a student can have more than one subject and more than one activity.
A table is in 5NF if it is in 4NF and if, for every join dependency (JD) of (R1, R2, R3, ...,
Rm) in R, every Ri is a super key for R. It means:
Example
1. This table lists agents, the companies they work for and the products they sell for
those companies. The agents do not necessarily sell all the products supplied by the
companies they do business with.
AGENT_COMPANY_PRODUCT
Agent Company Product
Suneet ABC Nut
Raj ABC Bolt
Raj ABC Nut
Suneet CDE Bolt
Suneet ABC Bolt
2. Suppose that the table is decomposed into three relations, P1, P2 and P3.
i. If we perform a natural join of the above three relations, no spurious
(extra) rows are added, so this decomposition is called a lossless decomposition.
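The lossless claim can be checked directly on the AGENT_COMPANY_PRODUCT rows. The sketch below assumes that P1, P2 and P3 are the projections on (Agent, Company), (Agent, Product) and (Company, Product); the notes do not say which projection carries which name, so that assignment is an assumption.

```python
table = {
    ("Suneet", "ABC", "Nut"),
    ("Raj", "ABC", "Bolt"),
    ("Raj", "ABC", "Nut"),
    ("Suneet", "CDE", "Bolt"),
    ("Suneet", "ABC", "Bolt"),
}

# Three binary projections of the table (naming is an assumption).
p1 = {(a, c) for a, c, p in table}  # Agent, Company
p2 = {(a, p) for a, c, p in table}  # Agent, Product
p3 = {(c, p) for a, c, p in table}  # Company, Product

# Joining only two projections on Agent can introduce spurious rows.
join_p1_p2 = {(a, c, p) for a, c in p1 for a2, p in p2 if a == a2}

# Joining the third projection on (Company, Product) removes them again.
join_all = {(a, c, p) for a, c, p in join_p1_p2 if (c, p) in p3}

print(sorted(join_p1_p2 - table))  # [('Suneet', 'CDE', 'Nut')]
print(join_all == table)           # True
```

The join of two projections contains one extra row, while the join of all three reproduces the original table exactly.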
Key takeaway:
- A relation r is in 1NF if and only if every tuple contains only atomic attribute values,
that is, exactly one value for each attribute.
4.2.1 Overview
Query processing comprises the steps involved in retrieving data from a database.
These activities include the translation of queries written in a high-level database language
into expressions that can be used at the physical level of the file system. It also
includes a variety of query-optimizing transformations and the actual evaluation of
queries.
1. Parsing and translation
2. Optimization
3. Evaluation
At the beginning of query processing, the query must be translated into a form
that can be used at the physical level of the system. A language like SQL is
not suitable for the internal representation of a query. Relational algebra is a good
option for the internal representation of a query.
So, the first step in query processing is to translate a given query into its internal
representation.
The system constructs a parse tree of the query. Then this tree is translated into
relational-algebra expressions. If the query was expressed in terms of a view, the
translation phase replaces all uses of the view by the relational-algebra expression
that defines the view.
This query can be translated into one of the following relational-algebra expressions:
To specify how to evaluate a query, we need not only to provide the relational-algebra
expression, but also to annotate it with instructions specifying the procedure
for evaluation of each operation.
[Tables: Account relation; Customer relation (columns Cust_id, Cust_Name)]
[Figure: query evaluation plan applying Πacc_balance to the Account relation]
Different evaluation plans have different costs. It is the responsibility of the system to
construct a query evaluation plan that minimizes the cost of query evaluation; this
task is called query optimization. Once the query plan with minimum cost is chosen,
the query is evaluated with that plan, and the result of the query is output. In order
to optimize a query, a query optimizer must know the cost of each operation.
Although the exact cost is hard to compute, it is possible to get a rough estimate of the
execution cost for each operation. The exact cost is hard to compute because it
depends on many parameters, such as the actual memory available to execute the
operation.
Key takeaway:
- The first step in query processing is to translate a given query into its internal
representation.
Moreover, CPU speeds have been improving much faster than disk speeds. The
CPU time taken for a task is harder to estimate, since it depends on low-level details
of the execution.
The time taken by the CPU is negligible in most systems when compared with the
time taken for disk accesses.
Rotational latency – the time taken for the disk to rotate until the required data is under the
read-write head.
Seek time – the time taken to position the read-write head over the required
track or cylinder.
Sequential I/O – reading data that are stored in contiguous blocks of the disk.
Random I/O – reading data that are stored in different blocks of the disk that are not
contiguous.
Key takeaway:
- For simplicity, we use only the number of block transfers from the disk as the cost
measure.
File scan – search algorithms that locate and retrieve records that fulfill a selection
condition.
Selection Operation:
A linear search scans each file block and tests all records to see whether they satisfy the
selection condition.
A binary search assumes that the blocks of the relation are stored contiguously and are
ordered on the selection attribute.
Cost estimate (number of disk blocks to be scanned): log2(br), the cost of locating the
first tuple by a binary search on the blocks, plus the number of blocks containing records
that satisfy the selection condition. Here br denotes the number of blocks of the relation.
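The two cost estimates can be compared numerically. The sketch below follows the estimate as stated above, counting blocks only; the function names and the sample figures (1024 blocks, 3 matching blocks) are assumptions chosen for illustration, and the logarithm is rounded up because a block access cannot be fractional.

```python
import math


def linear_scan_cost(br):
    """A linear search reads every block of the relation."""
    return br


def binary_search_cost(br, matching_blocks):
    """Blocks read to locate the first match, plus the blocks holding matches."""
    return math.ceil(math.log2(br)) + matching_blocks


br = 1024  # hypothetical number of blocks in the relation
print(linear_scan_cost(br))       # 1024
print(binary_search_cost(br, 3))  # 13
```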
Join Operations:
There are several different algorithms that can be used to implement joins (natural-,
equi-, condition-join) like:
● Nested-Loop Join
● Sort-Merge Join
● Hash-Join
Choice of a particular algorithm is based on cost estimate. For this, join size
estimates are required and in particular cost estimates for outer-level operations in a
relational algebra expression.
Example:
Assume the query CUSTOMERS ✶ ORDERS (with join attribute only being CName)
FCUSTOMERS = 20
V(CName, ORDERS) = 2,500, meaning that in this relation, on average, each customer
has four orders. Also assume that CName in ORDERS is a foreign key on
CUSTOMERS.
● If schema(R) ∩ schema(S) = primary key for R, then a tuple of S will match with
at most one tuple from R. Therefore, the number of tuples in R ✶ S is not greater than
NS.
● If schema(R) ∩ schema(S) = foreign key in S referencing R, then the number of
tuples in R ✶ S is exactly NS. Other cases are symmetric.
● If schema(R) ∩ schema(S) = {A} is not a key for R or S, assume that every tuple in
R produces tuples in R ✶ S. Then the number of tuples in R ✶ S is estimated to be
(NR ∗ NS) / V(A, S). If the reverse is true, the estimate is (NR ∗ NS) / V(A, R), and the
lower of the two estimates is probably the more accurate one.
Size estimates for CUSTOMERS ✶ ORDERS without using information about foreign
keys:
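A worked version of these two estimates might look as follows. Only V(CName, ORDERS) = 2,500 and the average of four orders per customer come from the example; the CUSTOMERS figures (5,000 tuples, each with a distinct CName) are assumptions made so that the formulas can be evaluated.

```python
def join_size_estimates(n_r, n_s, v_a_r, v_a_s):
    """Return the two estimates (NR * NS) / V(A, S) and (NR * NS) / V(A, R)."""
    return n_r * n_s // v_a_s, n_r * n_s // v_a_r


v_cname_orders = 2500          # from the example
n_orders = v_cname_orders * 4  # four orders per customer on average -> 10,000
n_customers = 5000             # assumed number of CUSTOMERS tuples
v_cname_customers = 5000       # assumed: every customer has a distinct CName

est_via_orders, est_via_customers = join_size_estimates(
    n_customers, n_orders, v_cname_customers, v_cname_orders
)
print(est_via_orders)     # 20000
print(est_via_customers)  # 10000
print(min(est_via_orders, est_via_customers) == n_orders)  # True
```

Under these assumptions the lower estimate equals the number of ORDERS tuples, which agrees with the foreign-key rule stated in the list above.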
Key takeaway:
- A linear file scan reads each file block and tests all records to see whether they satisfy the
selection condition.
- Join size estimates are required, in particular for cost estimates of outer-level
operations in a relational-algebra expression.
An expression can be evaluated in two ways: one is
materialization and the other is pipelining.
1. Materialization:
This is one method of query evaluation. In this method a query is divided into
sub-queries, and the results of the sub-queries are combined to get the final result.
For example, suppose we want to find the names of customers who have a
balance of less than 5000 in their bank account. The relational algebra for this query can
be written as:
Π cust_name (σ acc_balance<5000 (Account) ✶ Customer)
Here we can observe that the query has two sub-queries: one selects the
accounts having a balance of less than 5000; the other selects the customer details
of the accounts whose ids are retrieved by the first sub-query. The database system
performs the same task. It breaks the query into the two sub-queries mentioned
above. It then evaluates the first sub-query and stores the
result in a temporary table in memory. The data in this temporary table is then
used to evaluate the second sub-part of the query.
            Π cust_name
                 |
                 ✶
              ↙     ↘
σ acc_balance<5000   Customer
         |
      Account
This is an example of a query that is divided into two levels in the
materialization method. In this method, we start from the lowest-level operations in the
expression. In our example, there is only one such operation: the selection on the account
relation. The result of this operation is stored in a temporary relation. We can use this
temporary relation to execute the operations at the next level up in the expression tree.
The inputs for the join operation are the customer relation and the temporary relation
created by the selection on account.
Although this method looks simple, the cost of this type of evaluation is higher,
because it involves writing to a temporary table. It takes time to evaluate an operation
and write its result into a temporary table, then to read from this temporary table in order
to compute the next level of the result, and so on. Hence the cost of evaluation in this method is:
Cost = cost of individual SELECT operation + cost of write operation into temporary
table
2. Pipelining algorithm
As seen in the example of materialization, the query evaluation cost is increased due
to temporary relations. This increased cost can be minimized by reducing the
number of temporary files that are produced. This reduction can be achieved by
combining several relational operations into a pipeline of operations, in which the
results of one operation are passed along to the next operation in the pipeline. This kind
of evaluation is called pipelined evaluation.
For example, in the previous expression tree, there is no need to store the result of the
intermediate select operation in a temporary relation; instead, tuples are passed directly to
the join operation.
Similarly, the result of the join is not stored; its tuples are passed directly to the project
operation.
For pipelining to be effective, use evaluation algorithms that generate output tuples
even as tuples are received for the inputs to the operation. Pipelines can be executed in
two ways:
● demand driven
● producer driven
Demand Driven:
In a demand driven pipeline, the system makes repeated requests for the tuples from
the operation at the top of the pipeline. Each time that an operation receives a
request for tuples, it computes the next tuple to be returned, and then returns that
tuple. If the inputs of the operations are not pipelined, the next tuple to be returned
can be computed from the input relations. If it has some pipelined inputs, the
operation also makes requests for tuples from its pipelined inputs. Using the tuples
received from its pipelined inputs, the operation computes tuples for its output, and
passes them up to parent.
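Python generators behave much like a demand-driven pipeline: each operator computes its next tuple only when the operator above it asks for one. The sketch below applies this idea to the select, join and project operations of the earlier expression tree. The sample rows and the function names are hypothetical, and the nested-loop join is just one simple choice of join algorithm.

```python
# Hypothetical relations; attribute names follow the expression tree.
account = [
    {"cust_id": 1, "acc_balance": 3000},
    {"cust_id": 2, "acc_balance": 8000},
    {"cust_id": 3, "acc_balance": 1200},
]
customer = [
    {"cust_id": 1, "cust_name": "Asha"},
    {"cust_id": 2, "cust_name": "Ravi"},
    {"cust_id": 3, "cust_name": "Meena"},
]


def select(rows, predicate):
    """Yield matching tuples one at a time, on request."""
    for row in rows:
        if predicate(row):
            yield row


def join(left, right, key):
    """Nested-loop join; `left` is pipelined, `right` is a stored relation."""
    right_rows = list(right)
    for l in left:
        for r in right_rows:
            if l[key] == r[key]:
                yield {**l, **r}


def project(rows, column):
    for row in rows:
        yield row[column]


# No intermediate result is written to a temporary relation.
pipeline = project(
    join(select(account, lambda r: r["acc_balance"] < 5000), customer, "cust_id"),
    "cust_name",
)

first = next(pipeline)  # the top operator requests a single tuple
rest = list(pipeline)   # further requests drain the pipeline
print(first, rest)      # Asha ['Meena']
```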
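Producer Driven:
In a producer-driven pipeline, an operation does not wait for requests; it generates output
tuples and puts them in its output buffer until the buffer is full. It then waits until its
parent operation removes tuples from the buffer, so that
the buffer has space for more tuples. At this point, the operation generates more
tuples, until the buffer is full again. The operation repeats this process until all the
output tuples have been generated.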
Key takeaway:
- In a pipeline, the result of one operation is passed on to the next operation
without the need to store intermediate results in a temporary relation.
4.3.1 Estimation
A single query can be executed using different algorithms or re-written in different forms and
structures. Hence, the question of query optimization comes into the picture. The query
optimizer helps to determine the most efficient way to execute a given query by considering
the possible query plans.
The optimizer tries to generate the optimal execution plan for a SQL statement. It selects
the plan with the lowest execution cost among all available plans. The optimizer uses
available statistics to calculate cost. For a specific query in a given environment, the cost
computation accounts for factors of query execution such as I/O, CPU, and communication.
For example, consider a query that selects information about employees who are managers. If the
optimizer statistics indicate that 80% of employees are managers, then the optimizer may
decide that a full table scan is most efficient. However, if statistics indicate that very few
employees are managers, then reading an index followed by a table access by row id may be
more efficient than a full table scan.
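The trade-off in this example can be illustrated with a deliberately simplified cost model. Everything in the sketch is an assumption made for illustration: a full scan is charged one access per block, an index lookup is charged the index height plus one block access per matching row, and the table sizes are invented. Real optimizers use far more detailed statistics.

```python
def choose_access_path(total_rows, rows_per_block, selectivity, index_height=3):
    """Pick the cheaper access path under a toy block-access cost model."""
    full_scan_cost = -(-total_rows // rows_per_block)  # ceiling division
    matching_rows = round(total_rows * selectivity)
    index_cost = index_height + matching_rows          # one block per matching row
    return "full table scan" if full_scan_cost <= index_cost else "index access"


# 80% of employees are managers: the index would touch far more blocks.
print(choose_access_path(10_000, 50, 0.80))   # full table scan

# Very few employees are managers: the index wins.
print(choose_access_path(10_000, 50, 0.005))  # index access
```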
Because the database has many internal statistics and tools at its disposal, the optimizer is
usually in a better position than the user to determine the optimal method of statement
execution. For this reason, all SQL statements use the optimizer.
1. Analyze and transform equivalent relational expressions: Try to minimize the tuple and
column counts of the intermediate and final query results.
2. Use different algorithms for each operation: These underlying algorithms determine how
tuples are accessed from the data structures they are stored in (indexing, hashing, data
retrieval) and hence influence the number of disk and block accesses.
➔ First, it provides the user with faster results, which makes the application seem faster to the
user.
➔ Secondly, it allows the system to service more queries in the same amount of time, because
each request takes less time than un-optimized queries.
➔ Thirdly, query optimization ultimately reduces the amount of wear on the hardware (e.g.,
disk drives), and allows the server to run more efficiently (e.g., lower power consumption, less
memory usage).
Cost estimation:
➢ The result size must also be calculated for each operation in the tree!
❏ Reduction factor
➢ The ratio of the (expected) result size to the input size, considering just the
selection that the term represents.
➢ Column = value
● 1/Nkeys(I), where I is an index on the column
● 1/10: arbitrary default when no index is available
➢ Column1 = column2
● 1/10: no index
Size Estimation:
➢ Column > value
● At most half
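The reduction factors listed above can be turned into a small estimator. The sketch below is an illustration only: the function names, the use of `Fraction` for exact arithmetic, and the sample figures (10,000 input rows, 50 distinct keys) are assumptions, and the "at most half" rule is applied as an upper bound.

```python
from fractions import Fraction

DEFAULT_RF = Fraction(1, 10)  # default used when no index is available


def rf_column_equals_value(nkeys=None):
    """1/Nkeys(I) when an index I on the column exists, else the default."""
    return Fraction(1, nkeys) if nkeys else DEFAULT_RF


def rf_column_greater_than_value():
    """Upper bound: at most half of the input rows are assumed to qualify."""
    return Fraction(1, 2)


def estimated_result_size(input_rows, reduction_factor):
    """Expected result size = input size * reduction factor."""
    return int(input_rows * reduction_factor)


rows = 10_000  # hypothetical input size
print(estimated_result_size(rows, rf_column_equals_value(nkeys=50)))    # 200
print(estimated_result_size(rows, rf_column_equals_value()))            # 1000
print(estimated_result_size(rows, rf_column_greater_than_value()))      # 5000
```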
Key takeaway:
- A single query can be executed using different algorithms or re-written in different forms
and structures.
- The optimizer tries to generate the most optimal execution plan for a SQL statement.
- The query optimizer helps to determine the most efficient way to execute a given query by
considering the possible query plans.
Generating a query-evaluation plan for an expression of the relational algebra involves two
steps:
1. Generate logically equivalent expressions. Use equivalence rules to transform a relational
algebra expression into an equivalent one.
2. Annotate these evaluation plans with specific algorithms and access structures to get
alternative query plans.
Based on estimated cost, the most cost-effective annotated plan is selected
for evaluation.
7. E1 [∪, ∩] E2 ≡ E2 [∪, ∩] E1
(E1 ∪ E2) ∪ E3 ≡ E1 ∪ (E2 ∪ E3) (the analogous holds for ∩)
8. E1 × E2 ≡ πA1, A2(E2 × E1)
(E1 × E2) × E3 ≡ E1 × (E2 × E3)
(E1 × E2) × E3 ≡ π((E1 × E3) × E2)
9. E1 ✶ E2 ≡ E2 ✶ E1
(E1 ✶ E2) ✶ E3 ≡ E1 ✶ (E2 ✶ E3)
The application of equivalence rules to a relational algebra expression is also sometimes called
algebraic optimization.
Examples:
Selection:
– Find the name of all customers who have ordered a product for more than $5,000 from a
supplier located in Davis.
(OFFERS ✶ SUPPLIERS))))
– Perform selection as early as possible (but take existing indexes on relations into account)
'%Davis%'(SUPPLIERS)))))
Projection:
– πCName(σProdname='CD-ROM'(ORDERS)))
Projection should not be shifted before selections, because minimizing the number of tuples in
general leads to more efficient plans than reducing the size of tuples.
Join Ordering
• If (R2 ✶ R3) is quite large and (R1 ✶ R2) is small, we choose (R1 ✶ R2) ✶ R3 so that a smaller
temporary relation is computed and materialized.
Example:
List the name of all customers who have ordered a product from a supplier located in Davis.
ORDERS ✶ CUSTOMERS is likely to be a large relation. Because it is likely that only a small
fraction of suppliers is from Davis, we compute the join σSAddress like
Key takeaway:
- The optimizer's first step is to generate expressions that are logically equivalent to the
given expression.
- A legal database instance is one that satisfies all the integrity
constraints specified in the database schema.