Unit 3
Query processing refers to the following activities:
i. Translation of queries in high-level database languages into expressions that can be used at
the physical level of the file system,
ii. A variety of query optimizing transformations and
iii. Actual evaluation of queries.
Example
Suppose a user executes a query. As we have learned, there are various methods of
extracting data from the database. Suppose that, in SQL, a user wants to fetch the salaries of
the employees whose salary is greater than 10000. For this, the following query
is issued:
select salary from Employee where salary>10000;
• Thus, to make the system understand the user query, it needs to be translated into
relational algebra. This query can be written in relational algebra in either of two forms:
• σsalary>10000 (πsalary (Employee))
• πsalary (σsalary>10000 (Employee))
• After translating the given query, each relational algebra operation can be executed
using one of several algorithms. This is how query processing begins.
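The following is a minimal, hypothetical sketch (in Python, with an assumed in-memory
Employee relation) showing that the two relational algebra forms above produce the same
result when evaluated:

# Assumed sample relation, for illustration only
Employee = [
    {"name": "A", "salary": 8000},
    {"name": "B", "salary": 12000},
    {"name": "C", "salary": 15000},
]

def select(predicate, relation):        # sigma: keep tuples satisfying the predicate
    return [t for t in relation if predicate(t)]

def project(attributes, relation):      # pi: keep only the listed attributes
    return [{a: t[a] for a in attributes} for t in relation]

# Form 1: sigma_salary>10000 ( pi_salary (Employee) )
plan1 = select(lambda t: t["salary"] > 10000, project(["salary"], Employee))
# Form 2: pi_salary ( sigma_salary>10000 (Employee) )
plan2 = project(["salary"], select(lambda t: t["salary"] > 10000, Employee))

print(plan1 == plan2)                   # True: both forms give the same answer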
There are two approaches for evaluating an expression containing multiple operations:
i. Materialization: In materialization, the result of each intermediate operation is created
(materialized) as a temporary relation, which is then used as input for the subsequent
operations.
ii. Pipelining: In pipelining, the results of one operation are passed on to the next
operation without being stored in temporary relations, thereby saving the cost of writing
the temporary relations after an operation and reading the results back for the next
operation.
Materialization
Materialization is an easy approach for evaluating the multiple operations of a given query and
storing the results in temporary relations. The result can be the output of any join
condition, selection condition, and so on. Thus, materialization is the process of
creating and saving the results of the evaluated operations for the user query. It is
similar to cache memory, where the searched data is held temporarily. We can easily
understand the working of materialization through a pictorial representation of the
expression: an operator tree is used for representing an expression.
The process of estimating the cost of the materialized evaluation is different from the process
of estimating the cost of an algorithm. It is because in analyzing the cost of an algorithm, we
do not include the cost of writing the results on to the disks. But in the evaluation of an
expression, we not only compute the cost of all operations but also include the cost of writing
the result of currently evaluated operation to disk.
To estimate the cost of the materialized evaluation, we consider that results are stored in the
buffer, and when the buffer fills completely, the results are stored to the disk.
Suppose a total of br blocks are written. We can estimate br as:
br = ⌈nr / fr⌉
Here, nr is the estimated number of tuples in the result relation r, and fr is the number of
records of relation r that fit in a block; that is, fr is the blocking factor of the result relation r.
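A small worked example (the numbers below are assumed purely for illustration): if the
result relation has nr = 10,000 tuples and fr = 40 records fit in one block, then the
materialized result occupies br = ⌈10,000 / 40⌉ = 250 blocks, and the cost of writing those
250 blocks to disk is added to the cost of the evaluation.

import math

nr, fr = 10_000, 40          # assumed values for illustration
br = math.ceil(nr / fr)      # blocks written when the result is materialized
print(br)                    # 250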
Pipelining
Advantages of Pipeline
There are the following advantages of creating a pipeline of operations:
o It reduces the cost of query evaluation by eliminating the cost of reading and writing
the temporary relations, unlike the materialization process.
o If we combine the root operator of a query evaluation plan in a pipeline with its
inputs, the process of generating query results becomes quick. As a result, it is
beneficial for the users, as they can view the results of their queries as soon as
the outputs are generated. Otherwise, the users would have to wait a long time before
any query results become available.
Implementation of Pipeline
The system can use any of the following ways for executing a pipeline:
Demand-driven Pipeline: In a demand-driven pipeline, the system repeatedly requests tuples
from the operation at the top of the pipeline. Each time the operation receives a request for
tuples, it computes the next tuples to be returned and then returns them. The operation
repeats this process each time it receives a request from the system. If the inputs of the
operation are not pipelined, the next tuples to be returned are computed directly from the
input relations, and the system keeps track of which tuples have been returned so far. If the
operation has pipelined inputs, it in turn requests tuples from those pipelined inputs. After
receiving tuples from its pipelined inputs, the operation uses them to compute tuples for its
own output and passes them to its parent at the level above. So, in the demand-driven
pipeline, the pipeline is driven by requests (demands) for tuples made from the top.
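The following is a minimal sketch of a demand-driven (pull-based) pipeline using Python
generators; the relation and operator names are assumptions made for illustration. Each
operator produces its next tuple only when the consumer above it asks for one, so no
temporary relation is written:

def scan(relation):                       # leaf of the pipeline: reads the stored relation
    for t in relation:
        yield t

def select_op(predicate, child):          # sigma: pulls tuples from its child on demand
    for t in child:
        if predicate(t):
            yield t

def project_op(attributes, child):        # pi: pulls tuples from its child on demand
    for t in child:
        yield {a: t[a] for a in attributes}

# pi_salary ( sigma_salary>10000 (Employee) ) as a pipeline
Employee = [{"name": "A", "salary": 8000}, {"name": "B", "salary": 12000}]
pipeline = project_op(["salary"],
                      select_op(lambda t: t["salary"] > 10000, scan(Employee)))
for row in pipeline:                      # each iteration pulls one tuple through the pipeline
    print(row)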
A single query can be executed through different algorithms or re-written in different forms
and structures. Hence the question of query optimization arises: which of these forms or
pathways is the most efficient? The query optimizer attempts to determine the
most efficient way to execute a given query by considering the possible query plans.
The goal of query optimization is to reduce the system resources required to fulfill a query,
and ultimately provide the user with the correct result set faster. First, it provides the user
with faster results, which makes the application seem faster to the user. Secondly, it allows
the system to service more queries in the same amount of time, because each request takes
less time than un-optimized queries. Thirdly, query optimization ultimately reduces the
amount of wear on the hardware (e.g. disk drives), and allows the server to run more
efficiently (e.g. lower power consumption, less memory usage).
Equivalence Rules
An equivalence rule says that expressions of two forms are equivalent because both
expressions produce the same output on every legal database instance. This means that we
can replace an expression of the first form with an expression of the second form, and
vice versa. Thus, the optimizer of the query-evaluation plan uses such equivalence rules to
transform expressions into logically equivalent ones.
Rule 1: Cascade of σ
This rule states the deconstruction of the conjunctive selection operations into a sequence of
individual selections. Such a transformation is known as a cascade of σ.
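In the notation used earlier, this rule can be written as:
σθ1 ∧ θ2 (E) = σθ1 (σθ2 (E))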
Rule 3: Cascade of ∏
This rule states that, in a sequence of projection operations, only the final (outermost)
projection is needed; the other projections can be omitted. Such a transformation is referred
to as a cascade of ∏.
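In the same notation, the rule can be written as:
∏L1 (∏L2 (… (∏Ln (E)) …)) = ∏L1 (E), where L1 ⊆ L2 ⊆ … ⊆ Ln.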
Rule 4: We can combine the selections with Cartesian products as well as theta joins
Theta-joins are associative. In the associativity of theta-joins, θ2 involves attributes
from E2 and E3 only. Since any of these join conditions may be empty, it follows that the
Cartesian product is also associative.
Under the following two conditions, the selection operation distributes over the theta-join
operation:
a) When all attributes in the selection condition θ0 involve only the attributes of one of the
expressions (say, E1) being joined.
b) When the selection condition θ1 involves only the attributes of E1, and θ2 involves only
the attributes of E2.
The set operations union and intersection are commutative:
E1 ∪ E2 = E2 ∪ E1
E1 ∩ E2 = E2 ∩ E1
Rule 10: Distribution of selection operation on the intersection, union, and set difference
operations.
For example, the distribution over the set difference operation is:
σθ (E1 − E2) = σθ (E1) − σθ (E2)
Rule 11: Distribution of the projection operation over the union operation.
This rule states that we can distribute the projection operation over the union operation for
the given expressions:
∏L (E1 ∪ E2) = (∏L (E1)) ∪ (∏L (E2))
Apart from these discussed equivalence rules, there are various other equivalence rules also.
The main techniques for implementing the join operation are:
1. Nested loop
2. Merge join
3. Hash join
1. Nested loop
Nested loop is the processing technique that works by "brute force." In other words, for each
row of the outer table, each row from the inner table is retrieved and compared. The pseudo-
code in Algorithm 1 demonstrates the nested loop processing technique for two tables.
ALGORITHM 1
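The original pseudo-code is not reproduced here; the following is a minimal Python sketch of
the nested loop technique, using hypothetical in-memory tables and column names:

def nested_loop_join(outer, inner, outer_key, inner_key):
    # For each row of the outer table, every row of the inner table is
    # retrieved and compared ("brute force").
    result = []
    for o in outer:
        for i in inner:
            if o[outer_key] == i[inner_key]:
                result.append({**o, **i})
    return result

# Hypothetical example tables
emp  = [{"emp_id": 1, "dept_id": 10}, {"emp_id": 2, "dept_id": 20}]
dept = [{"dept_id": 10, "dept_name": "Sales"}, {"dept_id": 20, "dept_name": "HR"}]
print(nested_loop_join(emp, dept, "dept_id", "dept_id"))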
2. Merge join
The merge join technique provides a cost-effective alternative to constructing an index for the
nested loop technique. The rows of the joined tables must be physically sorted using the values of the
join column. Both tables are then scanned in order of the join columns, matching the rows
with the same value for the join columns.
The pseudo-code in Algorithm 2 demonstrates the merge join processing technique for two
tables.
ALGORITHM 2
Merge Join
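The original pseudo-code is not reproduced here; the following is a minimal Python sketch of
the merge join technique. It assumes both inputs are already sorted on the join column and
uses hypothetical table and column names:

def merge_join(left, right, key):
    # Both inputs must already be sorted on the join column.
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Collect the run of equal join values on the right, then pair it
            # with every left row carrying the same join value.
            value, j_start = left[i][key], j
            while j < len(right) and right[j][key] == value:
                j += 1
            while i < len(left) and left[i][key] == value:
                for r in right[j_start:j]:
                    result.append({**left[i], **r})
                i += 1
    return result

# Hypothetical tables, both sorted on dept_id
emp  = [{"emp_id": 1, "dept_id": 10}, {"emp_id": 2, "dept_id": 10}, {"emp_id": 3, "dept_id": 20}]
dept = [{"dept_id": 10, "dept_name": "Sales"}, {"dept_id": 20, "dept_name": "HR"}]
print(merge_join(emp, dept, "dept_id"))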
3. Hash join
Hash joins are used when joining large tables or when the join requires most of the rows of
the joined tables. Hash joins are used for equality joins only.
Build phase: The optimizer uses the smaller of the two tables to build a hash table in memory
on the join key. The smaller table is called the build table.
Probe phase: The system then scans the larger table and compares the hash value of each of
its rows with this hash table to find the joined rows. The larger table is called the probe
table.
For each row in the probe (large) table:
    Calculate the hash value of the join key
    Probe the hash table for that hash value
    If a match is found, output the joined row
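A minimal Python sketch of the hash join technique described above, using hypothetical
in-memory tables; in a real system the build table would be hashed into memory and the probe
table scanned from disk:

def hash_join(build, probe, key):
    # Build phase: hash the smaller (build) table on the join key.
    hash_table = {}
    for row in build:
        hash_table.setdefault(row[key], []).append(row)
    # Probe phase: scan the larger (probe) table and look up matching rows.
    result = []
    for row in probe:
        for match in hash_table.get(row[key], []):
            result.append({**match, **row})
    return result

dept = [{"dept_id": 10, "dept_name": "Sales"}, {"dept_id": 20, "dept_name": "HR"}]  # build (smaller)
emp  = [{"emp_id": 1, "dept_id": 10}, {"emp_id": 2, "dept_id": 20}]                 # probe (larger)
print(hash_join(dept, emp, "dept_id"))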
For a given query, the Optimizer assigns a numerical cost to each step of a possible plan and
then adds these values together to get a cost estimate for the plan, i.e., for the possible
strategy. After calculating the costs of all possible plans, the Optimizer
chooses the plan with the lowest cost estimate. For that reason, the
Optimizer is sometimes referred to as the Cost-Based Optimizer. Below are some of the
features of cost-based optimization −
1. The cost-based optimization is based on the cost of the query that is to be optimized.
2. The query can use a lot of paths based on the value of indexes, available sorting
methods, constraints, etc.
3. The aim of query optimization is to choose the most efficient path of implementing
the query, at the lowest possible cost, in the form of an algorithm.
4. The cost of executing the algorithm needs to be provided by the query Optimizer so
that the most suitable query can be selected for an operation.
5. The cost of an algorithm also depends upon the cardinality (number of rows used by a
query) of the input.
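As a toy illustration of the idea (all plan names and cost numbers below are assumed, not
taken from any real optimizer), a cost-based optimizer sums the estimated cost of each step
of every candidate plan and keeps the cheapest one:

# Candidate plans: each is a list of (step, estimated cost) pairs; the values are assumed.
plans = {
    "index scan + nested loop join": [("index scan", 40), ("nested loop join", 300)],
    "full scan + hash join":         [("full table scan", 120), ("hash join", 90)],
}

costs = {name: sum(cost for _, cost in steps) for name, steps in plans.items()}
best = min(costs, key=costs.get)
print(costs)   # {'index scan + nested loop join': 340, 'full scan + hash join': 210}
print(best)    # the plan with the lowest estimated cost is chosen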
Rules
Heuristic optimization transforms the expression-tree by using a set of rules which improve
the performance. These rules are as follows −
i. Perform the selection operations as early as possible in the query. This should be the
first step for any SQL table. By doing so, we decrease the number of records that have to
be processed in the query, rather than carrying all the rows of the tables through the query.
ii. Perform all the projections as early as possible in the query. Somewhat like
selection, this method helps in decreasing the number of columns carried through the query.
iii. Perform the most restrictive selection and join operations first. This means
selecting only those combinations of tables and/or views that result in relatively fewer
records and are strictly necessary in the query. Obviously, any query will
execute better when tables with fewer records are joined.
Some systems use only heuristics and the others combine heuristics with partial cost-based
optimization.
Let us see the steps involved in heuristic optimization through the example below −
Example
"Find the names of all customers who have an account at any branch located in Brooklyn."
Query Tree
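The query tree figure is not reproduced here. As a sketch only, assuming the classic bank
schema branch(branch_name, branch_city, assets), account(account_number, branch_name,
balance), depositor(customer_name, account_number), the query can first be written as:

∏customer_name (σbranch_city = "Brooklyn" (branch ⋈ (account ⋈ depositor)))

Applying the heuristic rules above, the selection on branch_city is pushed down to the branch
relation, so that the restrictive selection is performed before the joins:

∏customer_name ((σbranch_city = "Brooklyn" (branch)) ⋈ (account ⋈ depositor))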
An index is a data structure that speeds up certain operations on a file. An index for a file in
a database system works in the same way as the index in any textbook. If we want to learn
about a particular topic (specified by a word or a phrase), we can search for the topic in the
index at the back of the book. Indexes provide faster access to data.
Types of Indexes
I. Single-level ordered indexes
a. Primary indexes
b. Secondary indexes
c. Clustering indexes
II. Multi-level Indexes
III. Dynamic Multi-level indexes using B-trees and B+-trees
Dense: A dense index has an index entry for every search key value (and hence every
record) in the data file.
Sparse: A sparse (or nondense) index, on the other hand, has index entries for only
some of the search values.
Index structure
The first column of the index is the search key, which contains a copy of the primary key or candidate
key of the table. These values are stored in sorted order so that the corresponding
data can be accessed easily. The second column of the index is the data reference. It contains a set
of pointers holding the address of the disk block where the value of the particular key can be found.
Primary indexes: If the index is created on the basis of the primary key of the table, then it is
known as primary indexing. As primary keys are stored in sorted order, the performance of
the searching operation is quite efficient. A primary index is hence a nondense (sparse) index,
since it includes an entry for each disk block of the data file rather than for every search value
(or every record).
I. A major problem with a primary index—as with any ordered file—is insertion and
deletion of records.
II. With a primary index, the problem is compounded because, if we attempt to insert a
record in its correct position in the data file, we have to not only move records to
make space for the new record but also change some index entries, since moving
records will change the anchor records of some blocks.
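A minimal sketch of a lookup through a sparse primary index: the index holds one entry per
data block, namely (anchor key of the block, block number). The sample keys and block numbers
below are assumptions made for illustration:

import bisect

index = [(1, 0), (40, 1), (85, 2), (130, 3)]        # one entry per block, sorted by anchor key
anchors = [entry[0] for entry in index]

def find_block(search_key):
    # Binary search for the last block whose anchor key <= search_key;
    # if the record exists, it must lie in that block.
    pos = bisect.bisect_right(anchors, search_key) - 1
    return index[pos][1] if pos >= 0 else None

print(find_block(90))                               # block 2 covers keys 85..129 in this example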
Clustering Index: If records of a file are physically ordered on a non-key field (one which
does not have a distinct value for each record), that field is called the clustering field. A
clustering index is an index created on such a clustering field; it is used to speed up
retrieval of all the records that have the same value for the clustering field.
Secondary Index: A Secondary Index is an ordered file with two fields. The first is of the
same data type as some nonordering field and the second is either a block or a record pointer.
If the entries in this nonordering field must be unique, this field is sometimes referred to as a
Secondary Key. This results in a dense index.
B-trees and B+-trees are special cases of the well-known tree data structure. A tree is formed
of nodes. Each node in the tree, except for a special node called the root, has one parent node
and several—zero or more—child nodes. The root node has no parent. A node that does not
have any child nodes is called a leaf node; a nonleaf node is called an internal node. The level
of a node is always one more than the level of its parent, with the level of the root node being
zero. A subtree of a node consists of that node and all its descendant nodes—its child nodes,
the child nodes of its child nodes, and so on.
B tree
A B-tree of order m (the maximum number of children for each node) is a tree which satisfies
the following properties:
• Every node has at most m children.
• Every node (except the root and the leaves) has at least ⌈m/2⌉ children.
• The root has at least two children if it is not a leaf node.
• All leaves appear in the same level, and carry information.
• A non-leaf node with k children contains k–1 keys.
Insertion algorithm
• All insertions start at a leaf node. To insert a new element, search the tree to find the
leaf node where the new element should be added.
• Insert the new element into that node with the following steps:
1. If the node contains fewer than the maximum legal number of elements, then there is
room for the new element. Insert the new element in the node, keeping the node's
elements ordered.
2. Otherwise the node is full, so evenly split it into two nodes.
– A single median is chosen from among the leaf's elements and the new
element.
– Values less than the median are put in the new left node and values greater
than the median are put in the new right node, with the median acting as a
separation value.
– Insert the separation value in the node's parent, which may cause it to be split,
and so on. If the node has no parent (i.e., the node was the root), create a new
root above this node (increasing the height of the tree).
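The following is a minimal Python sketch of this insertion procedure for an order-5 B-tree
(at most 4 keys per node); the class and constant names are assumptions for illustration. The
sketch inserts the key into the leaf first and then splits any node that overflows, promoting
the median to the parent, which gives the same result as choosing the median from the node's
elements together with the new element:

import bisect

class Node:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

MAX_KEYS = 4                              # order-5 B-tree: at most 4 keys per node

def _split(node):
    # Split an overfull node around its median key; return (median, right sibling).
    mid = len(node.keys) // 2
    right = Node(leaf=node.leaf)
    median = node.keys[mid]
    right.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
    if not node.leaf:
        right.children, node.children = node.children[mid + 1:], node.children[:mid + 1]
    return median, right

def _insert(node, key):
    # Descend to the correct leaf; splits propagate back up as (median, right) pairs.
    if node.leaf:
        bisect.insort(node.keys, key)
    else:
        i = bisect.bisect_right(node.keys, key)
        promoted = _insert(node.children[i], key)
        if promoted:
            median, right = promoted
            node.keys.insert(i, median)
            node.children.insert(i + 1, right)
    return _split(node) if len(node.keys) > MAX_KEYS else None

def insert(root, key):
    promoted = _insert(root, key)
    if promoted:                          # the root itself split: the tree grows in height
        median, right = promoted
        new_root = Node(leaf=False)
        new_root.keys, new_root.children = [median], [root, right]
        return new_root
    return root

root = Node()
for k in [10, 20, 5, 6, 12, 30, 7, 17]:
    root = insert(root, k)
print(root.keys)                          # keys left in the (possibly new) root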
For a huge database structure, it can be difficult to search through all the index values
across all the levels of the index and then reach the destination data block to get the
desired data. Hashing is used to index and retrieve items in a database, as it is faster
to search for a specific item using the shorter hashed key instead of using its original
value.
Hashing is an ideal method to calculate the direct location of a data record on the disk
without using an index structure.
It is also a helpful technique for implementing dictionaries.
There are two types of hashing techniques:
1. Static Hashing
2. Dynamic Hashing
Static Hashing
In the static hashing, the resultant data bucket address will always remain the same.
Therefore, in this static hashing method, the number of data buckets in memory always
remains constant.
Inserting a record: When a new record needs to be inserted into the table, an address for the
new record is generated using its hash key. When the address is
generated, the record is automatically stored in that location.
Searching: When you need to retrieve the record, the same hash function is used
to retrieve the address of the bucket where the data is stored.
Deleting a record: Using the hash function, you first fetch the record which you
want to delete. Then you remove the record stored at that address in memory.
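A minimal Python sketch of static hashing: the number of buckets is fixed in advance, and the
same hash function is used for insertion, searching, and deletion. The bucket count, key, and
helper names are assumptions made for illustration:

NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]          # fixed number of data buckets

def bucket_address(key):
    return hash(key) % NUM_BUCKETS                  # resultant bucket address never changes

def insert(key, record):
    buckets[bucket_address(key)].append((key, record))

def search(key):
    return [rec for k, rec in buckets[bucket_address(key)] if k == key]

def delete(key):
    addr = bucket_address(key)
    buckets[addr] = [(k, rec) for k, rec in buckets[addr] if k != key]

insert(101, "record for employee 101")
print(search(101))                                  # the record is found via its hash address
delete(101)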
When a hash address is already occupied (a bucket overflow), it can be handled in two ways:
1. Open hashing
2. Close hashing.
Open Hashing
In the open hashing method, instead of overwriting the older one, the next available data
bucket is used to enter the new record. This method is also known as linear probing.
For example, suppose A2 is a new record which you want to insert. The hash function generates
the address 222, but that bucket is already occupied by some other value. That is why the
system looks for the next free data bucket, 501, and assigns A2 to it.
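A minimal Python sketch of linear probing as described above; the table size and keys below
are assumptions made for illustration (the sketch also assumes the table never becomes
completely full):

TABLE_SIZE = 7
table = [None] * TABLE_SIZE

def insert(key, record):
    addr = hash(key) % TABLE_SIZE
    while table[addr] is not None:          # address occupied: try the next bucket
        addr = (addr + 1) % TABLE_SIZE
    table[addr] = (key, record)

def search(key):
    addr = hash(key) % TABLE_SIZE
    while table[addr] is not None:
        if table[addr][0] == key:
            return table[addr][1]
        addr = (addr + 1) % TABLE_SIZE      # keep probing past other keys
    return None

insert(1, "A1")
insert(8, "A2")       # keys 1 and 8 collide when TABLE_SIZE is 7, so A2 goes to the next bucket
print(search(8))      # "A2" is found one bucket past its home address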
Close Hashing
In the close hashing method, when a bucket is full, a new bucket is allocated for the same
hash result and is linked after the previous one.
Dynamic Hashing
Dynamic hashing offers a mechanism in which data buckets are added and removed
dynamically and on demand. In this hashing, the hash function helps you to create a large
number of values.
What is Collision?
A hash collision occurs when two or more items in the data set hash to the same place in the
hash table.