
UNIT 3

Query processing and optimization


Syllabus

1. Introduction
2. Query processing stages
3. Evaluation of relational algebra expressions
4. Query optimizer
5. Query equivalence
6. Join strategies
7. Query optimization algorithms
8. Storage strategies: indices, B-trees, hashing
1. Introduction
Query processing refers to the range of activities involved in extracting data from a database. The
activities include

i. Translation of queries in high level database languages into expressions that can be used at
the physical level of the file system,
ii. A variety of query optimizing transformations and
iii. Actual evaluation of queries.

2. Query processing stages



• Scanning, Parsing and Validating: A query expressed in a high-level language such
as SQL must first be scanned, parsed, and validated.
• Scanner: The scanner identifies the language tokens – such as SQL keywords,
attribute names, and relation names – in the text of the query.
• Parser: The parser checks the query syntax to determine whether the query is formulated
according to the syntax rules of the query language.
• Validation: The query must also be validated by checking that all attribute and
relation names are valid and semantically meaningful names in the schema of the
particular database being queried. An internal representation of the query is then
created, usually as a tree data structure called a query tree. It is also possible to
represent the query using a graph data structure called a query graph.
• Query Optimizer: The DBMS must then devise an execution strategy for retrieving
the result of the query from the database files. A query has many possible execution
strategies, and the process of choosing a suitable one for processing a query is known as
query optimization.
• Query Code Generator: The query optimizer module has the task of producing an
execution plan, and the code generator generates the code to execute that plan.
• Runtime Database Processor: The runtime database processor has the task of
running the query code, whether in compiled or interpreted mode, to produce the query result. If a
runtime error results, an error message is generated by the runtime database processor.

Example
Suppose a user executes a query. As we have learned, there are various methods of
extracting data from the database. Suppose that, in SQL, the user wants to fetch the salaries of
the employees whose salary is greater than 10000. For this, the following query is issued:
select salary from Employee where salary > 10000;
• Thus, to make the system understand the user query, it needs to be translated into
relational algebra. This query can be written in relational algebra in two equivalent forms:
• σsalary>10000 (πsalary (Employee))
• πsalary (σsalary>10000 (Employee))
• After translating the given query, we can execute each relational algebra operation
using different algorithms. This is how query processing begins.
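As a quick illustration (not part of the original notes), both translations can be evaluated over a small in-memory relation and yield the same set of tuples, so the optimizer is free to pick whichever is cheaper. The Employee rows and helper functions below are made-up, simplified stand-ins for the relational operators.

# Evaluating both relational-algebra translations of
#   select salary from Employee where salary > 10000;
# over an in-memory relation represented as a list of dicts.
employee = [
    {"name": "A", "salary": 8000},
    {"name": "B", "salary": 12000},
    {"name": "C", "salary": 15000},
]

def select(rows, predicate):          # sigma: keep rows satisfying the predicate
    return [r for r in rows if predicate(r)]

def project(rows, attrs):             # pi: keep only the listed attributes (set semantics)
    return [dict(t) for t in {tuple((a, r[a]) for a in attrs) for r in rows}]

# sigma_salary>10000 ( pi_salary (Employee) )
plan1 = select(project(employee, ["salary"]), lambda r: r["salary"] > 10000)
# pi_salary ( sigma_salary>10000 (Employee) )
plan2 = project(select(employee, lambda r: r["salary"] > 10000), ["salary"])

assert sorted(r["salary"] for r in plan1) == sorted(r["salary"] for r in plan2)
print(plan1, plan2)                   # both contain the salaries 12000 and 15000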

3. Evaluation of relational algebra expressions


For evaluating an expression that carries multiple operations in it, we can perform the
computation of each operation one by one. However, in the query processing system, we use
two methods for evaluating an expression carrying multiple operations. These methods are:

i. Materialization: In this approach the output of one operation is stored in a temporary
relation for processing by the next operation.

ii. Pipelining: In pipelining, the results of one operation are passed on to the next
operation without storing them in temporary relations, thereby saving the cost of writing
the temporary relations after an operation and reading the results back for the next
operation.



Materialization

Materialization is an easy approach for evaluating multiple operations of the given query and
storing the results in the temporary relations. The result can be the output of any join
condition, selection condition, and many more. Thus, materialization is the process of
creating and storing a view of the results of the evaluated operations for the user query. It is
similar to a cache, where the searched data is kept temporarily. The working of
materialization can be understood through a pictorial representation of the expression: an
operator tree is used for representing an expression.

Cost Estimation of Materialized Evaluation

The process of estimating the cost of the materialized evaluation is different from the process
of estimating the cost of an algorithm. It is because in analyzing the cost of an algorithm, we
do not include the cost of writing the results on to the disks. But in the evaluation of an
expression, we not only compute the cost of all operations but also include the cost of writing
the result of currently evaluated operation to disk.

To estimate the cost of materialized evaluation, we assume that the results are first collected
in the buffer and, when the buffer fills completely, are written to disk.

Let br be the total number of blocks written. We can estimate br as:

br = ⌈nr / fr⌉

Here, nr is the estimated number of tuples in the result relation r and fr is the number of
records of relation r that fit in a block; that is, fr is the blocking factor of the result relation r.
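For example (with made-up figures), the number of blocks written for a materialized result follows directly from the formula above:

import math

n_r = 10000    # estimated number of tuples in the result relation r
f_r = 40       # blocking factor: tuples of r that fit in one block

b_r = math.ceil(n_r / f_r)    # br = ceil(nr / fr)
print(b_r)                    # 250 blocks written to disk, in addition to the
                              # cost of computing the operation itself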

Pipelining

Pipelining helps in improving the efficiency of query evaluation by reducing the number of
temporary files produced. We reduce the construction of temporary files by combining
multiple operations into a pipeline. The result of the currently
executed operation passes to the next operation for its execution, and the chain continues till



all operations are completed and we get the final output of the expression. This type of
evaluation is known as pipelined evaluation.

Advantages of Pipeline
There are following advantages of creating a pipelining of operations:
o It reduces the cost of query evaluation by eliminating the cost of reading and writing
the temporary relations, unlike the materialization process.
o If we combine the root operator of a query evaluation plan in a pipeline with its
inputs, the process of generating query results becomes quicker. As a result, it
benefits the users, as they can start viewing the results of their queries as soon as
the first outputs are generated. Otherwise, the users would have to wait a long time
before seeing any query results.

Implementation of Pipeline
The system can use any of the following ways for executing a pipeline:
Demand-driven Pipeline: In the demand-driven pipeline, the system repeatedly requests
tuples from the operation at the top of the pipeline. Whenever an operation receives such a
request, it first computes the next tuples to be returned and then returns them. The operation
repeats this process each time it receives a request from above. If the inputs of the operation
are not pipelined, the next tuples to be returned are computed directly from the input
relations, and the system keeps track of the tuples that have been returned so far. If some
pipelined inputs are present, the operation in turn requests tuples from its pipelined inputs,
uses them to compute tuples for its own output, and passes these to its parent at the
upper level. Thus, in the demand-driven pipeline, the pipeline is driven by the requests for
tuples made from the top.
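As an illustrative sketch (not from the notes), the demand-driven approach maps naturally onto iterators: each operator produces its next tuple only when the operator above it asks for one. Python generators and the sample data are used here purely for illustration.

def scan(table):                     # leaf of the pipeline: reads base tuples
    for row in table:
        yield row

def select_op(child, predicate):     # sigma: pulls a tuple from its child only on demand
    for row in child:
        if predicate(row):
            yield row

def project_op(child, attrs):        # pi: pulls from its child only on demand
    for row in child:
        yield {a: row[a] for a in attrs}

employee = [{"name": "A", "salary": 8000}, {"name": "B", "salary": 12000}]

# pi_name,salary ( sigma_salary>10000 (Employee) ), with no temporary relations:
pipeline = project_op(select_op(scan(employee), lambda r: r["salary"] > 10000),
                      ["name", "salary"])

for result_row in pipeline:          # the top-level "system" pulls tuples one by one
    print(result_row)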



Producer-driven Pipeline: The producer-driven pipeline is different from the demand-
driven pipeline. In the producer-driven pipeline, the operations do not wait for the system to
request tuples; instead, the operations eagerly produce tuples. Each operation is modeled as
a separate thread or process within the system. Each operation receives a stream of tuples
from its pipelined inputs and generates a stream of tuples for its output, pushing results
upward as they are produced.
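A corresponding sketch of the producer-driven approach (again illustrative only, with made-up data): the selection operator runs as its own thread and eagerly pushes qualifying tuples into a bounded buffer, which its parent consumes as the tuples arrive.

import queue
import threading

END = object()                                   # sentinel marking the end of the stream

def select_producer(table, predicate, out_buffer):
    for row in table:
        if predicate(row):
            out_buffer.put(row)                  # blocks if the buffer is full
    out_buffer.put(END)

employee = [{"name": "A", "salary": 8000}, {"name": "B", "salary": 12000}]
buffer = queue.Queue(maxsize=2)                  # small buffer between the two operators

producer = threading.Thread(
    target=select_producer,
    args=(employee, lambda r: r["salary"] > 10000, buffer))
producer.start()

while True:                                      # the parent operator consumes the stream
    row = buffer.get()
    if row is END:
        break
    print(row)
producer.join()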



4. Query Optimizer

A single query can be executed through different algorithms or re-written in different forms
and structures. Hence the question of query optimization arises: which of these forms or
pathways is the most efficient? The query optimizer attempts to determine the most efficient
way to execute a given query by considering the possible query plans.

Goal of query optimization

The goal of query optimization is to reduce the system resources required to fulfill a query,
and ultimately provide the user with the correct result set faster. First, it provides the user
with faster results, which makes the application seem faster to the user. Secondly, it allows
the system to service more queries in the same amount of time, because each request takes
less time than un-optimized queries. Thirdly, query optimization ultimately reduces the
amount of wear on the hardware (e.g. disk drives), and allows the server to run more
efficiently (e.g. lower power consumption, less memory usage).

Ways a query can be optimized

There are broadly two ways a query can be optimized:


1. Analyze and transform equivalent relational expressions: Try to minimize the tuple
and column counts of the intermediate and final results of the query.
2. Use different algorithms for each operation: These underlying algorithms determine
how tuples are accessed from the data structures in which they are stored (indexing,
hashing, and other retrieval methods) and hence influence the number of disk block accesses.



5. Query equivalence
The first step of the optimizer is to generate expressions that are logically equivalent to the
given expression. For this, we use equivalence rules that describe how to transform the
generated expression into a logically equivalent one. A query can be expressed in different
ways, with different costs; instead of working only with the given expression, we therefore
learn to create alternative, equivalent expressions of it. Two relational-algebra expressions
are equivalent if both expressions produce the same set of tuples on every legal database
instance. A legal database instance is a database that satisfies all the integrity constraints
specified in the database schema. The order of the generated tuples may differ between the
two expressions, but they are considered equivalent as long as they produce the same set of
tuples.

Equivalence Rules

The equivalence rule says that expressions of two forms are the same or equivalent because
both expressions produce the same outputs on any legal database instance. It means that we
can possibly replace the expression of the first form with that of the second form and replace
the expression of the second form with an expression of the first form. Thus, the optimizer of
the query-evaluation plan uses such an equivalence rule or method for transforming
expressions into the logically equivalent one.

The optimizer uses various equivalence rules on relational-algebra expressions for


transforming the relational expressions. For describing each rule, we will use the following
symbols:

θ, θ1, θ2 … : Used for denoting the predicates.


L1, L2, L3 … : Used for denoting the list of attributes.
E, E1, E2 …. : Represents the relational-algebra expressions.
Let's discuss a number of equivalence rules:

Rule 1: Cascade of σ

This rule states the deconstruction of the conjunctive selection operations into a sequence of
individual selections. Such a transformation is known as a cascade of σ.

σθ1 ∧ θ2 (E) = σθ1 (σθ2 (E))

Rule 2: Commutative Rule

a) This rule states that selections operations are commutative.

σθ1 (σθ2 (E)) = σ θ2 (σθ1 (E))

b) Theta Join (θ) is commutative.

E1 ⋈ θ E2 = E2 ⋈ θ E1 (θ is written as a subscript on the join symbol)



However, in the case of the theta join, the equivalence holds only if we ignore the order of
the attributes in the result. Natural join is a special case of theta join, and natural join is also
commutative.

Rule 3: Cascade of ∏

This rule states that in a sequence of projection operations, only the outermost (final)
projection is needed; the others can be omitted. Such a transformation is referred to as a cascade
of ∏.

∏L1 (∏L2 (. . . (∏Ln (E)) . . . )) = ∏L1 (E)

Rule 4: We can combine the selections with Cartesian products as well as theta joins

1. σθ (E1 × E2) = E1 ⋈ θ E2


2. σθ1 (E1 ⋈ θ2 E2) = E1 ⋈ θ1ᴧθ2 E2

Rule 5: Associative Rule

a) This rule states that natural join operations are associative.

(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)

b) Theta joins are associative for the following expression:

(E1 ⋈ θ1 E2) ⋈ θ2ᴧθ3 E3 = E1 ⋈ θ1ᴧθ3 (E2 ⋈ θ2 E3)

In this theta-join associativity, θ2 involves attributes from E2 and E3 only. Any of these
conditions may be empty; hence it follows that the Cartesian product (a theta join with an
empty condition) is also associative.

Rule 6: Distribution of the Selection operation over the Theta join.

Under the following two conditions, the selection operation distributes over the theta-join
operation:

a) When the selection condition θ0 involves only the attributes of one of the
expressions (say E1) being joined.

σθ0 (E1 ⋈ θ E2) = (σθ0 (E1)) ⋈ θ E2

b) When the selection condition θ1 involves the attributes of E1 only, and θ2 includes the
attributes of E2 only.

σθ1 ∧ θ2 (E1 ⋈ θ E2) = (σθ1 (E1)) ⋈ θ (σθ2 (E2))

Rule 7: Distribution of the projection operation over the theta join.

Under the following condition, the projection operation distributes over the theta-join
operation:



a) Assume that L1 and L2 are attributes of E1 and E2, respectively, and that the join condition θ
involves only attributes in L1 ∪ L2. Then we get the following expression:

∏L1 ∪ L2 (E1 ⋈ θ E2) = (∏L1 (E1)) ⋈ θ (∏L2 (E2))

Rule 8: The union and intersection set operations are commutative.

E1 ∪ E2 = E2 ∪ E1

E1 ∩ E2 = E2 ∩ E1

However, set difference operations are not commutative.

Rule 9: The union and intersection set operations are associative.

(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)

(E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)

Rule 10: Distribution of selection operation on the intersection, union, and set difference
operations.

The below expression shows the distribution performed over the set difference operation.

σ θ (E1 − E2) = σ θ (E1) − σ θ (E2)

Rule 11: Distribution of the projection operation over the union operation.

This rule states that we can distribute the projection operation on the union operation for the
given expressions.

∏L (E1 ∪ E2) = (∏L (E1)) ∪ (∏L (E2))

Apart from these discussed equivalence rules, there are various other equivalence rules also.
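As a small demonstration of how such rules are applied (this example is not from the notes), Rule 6(a) allows a selection whose condition mentions only the attributes of E1 to be pushed below a theta join; both expressions return the same tuples, but the rewritten form joins a smaller intermediate relation. The relations and column names below are made up.

dept = [{"dno": 1, "dname": "Sales"}, {"dno": 2, "dname": "HR"}]
emp  = [{"eno": 10, "dno": 1}, {"eno": 11, "dno": 2}, {"eno": 12, "dno": 1}]

def theta_join(r1, r2, theta):
    return [{**a, **b} for a in r1 for b in r2 if theta(a, b)]

def select(rows, predicate):
    return [r for r in rows if predicate(r)]

theta  = lambda d, e: d["dno"] == e["dno"]     # join condition
theta0 = lambda row: row["dname"] == "Sales"   # involves attributes of dept (E1) only

lhs = select(theta_join(dept, emp, theta), theta0)   # sigma_theta0 (E1 join_theta E2)
rhs = theta_join(select(dept, theta0), emp, theta)   # (sigma_theta0 (E1)) join_theta E2

as_set = lambda rows: {tuple(sorted(r.items())) for r in rows}
assert as_set(lhs) == as_set(rhs)              # same result; the right-hand plan is cheaper
print(lhs)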



6. Join Strategies
The Join operation is the most time-consuming operation in query processing. The Database
Engine supports the following three different join processing techniques, so the optimizer can
choose one of them depending on the statistics for both tables:

1. Nested loop
2. Merge join
3. Hash join

The following subsections describe these techniques.

1. Nested loop

Nested loop is the processing technique that works by "brute force." In other words, for each
row of the outer table, each row from the inner table is retrieved and compared. The pseudo-
code in Algorithm 1 demonstrates the nested loop processing technique for two tables.

ALGORITHM 1

Nested Loop Join



In Algorithm 1, every row selected from the outer table (table Dept) causes the access of all
rows of the inner table (table Emp). After that, the comparison of the values in the join
columns is performed and the row is added to the result set if the values in both columns are
equal.
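A minimal sketch of this nested loop idea (with made-up Dept and Emp rows joined on an assumed department-number column dno):

dept = [{"dno": 1, "dname": "Sales"}, {"dno": 2, "dname": "HR"}]
emp  = [{"eno": 10, "ename": "Ann", "dno": 1}, {"eno": 11, "ename": "Bob", "dno": 2}]

result = []
for d in dept:                     # outer table: Dept
    for e in emp:                  # inner table: Emp, rescanned for every outer row
        if d["dno"] == e["dno"]:   # compare the values in the join columns
            result.append({**d, **e})

print(result)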

2. Merge join

The merge join technique provides a cost-effective alternative to constructing an index for
nested loop. The rows of the joined tables must be physically sorted using the values of the
join column. Both tables are then scanned in order of the join columns, matching the rows
with the same value for the join columns.

The pseudo-code in Algorithm 2 demonstrates the merge join processing technique for two
tables.

ALGORITHM 2



The merge join processing technique has a high overhead if the rows from both tables are
unsorted. However, this method is preferable when the values of both join columns are sorted
in advance.

[Figure: Merge Join]
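A minimal merge join sketch, assuming both inputs are already sorted on an assumed join column dno; each input is scanned once, and rows with equal join values are matched group against group:

dept = [{"dno": 1, "dname": "Sales"}, {"dno": 2, "dname": "HR"}]                 # sorted on dno
emp  = [{"eno": 10, "dno": 1}, {"eno": 12, "dno": 1}, {"eno": 11, "dno": 2}]     # sorted on dno

def merge_join(left, right, key):
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Collect the group of equal join values on each side, then combine the groups.
            value = left[i][key]
            i_end = i
            while i_end < len(left) and left[i_end][key] == value:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][key] == value:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    result.append({**l, **r})
            i, j = i_end, j_end
    return result

print(merge_join(dept, emp, "dno"))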

3. Hash join

Hash joins are used when joining large tables or when the join requires most of the joined
tables' rows. Hash joins can be used for equality joins only.

Algorithm for Hash Join

1) The optimizer uses the smaller of the two tables to build a hash table in memory. The smaller
table is called the build table.

Build phase

For each row in small table loop
    Calculate hash value on join key
    Insert row in appropriate hash bucket
End loop;

2) The optimizer then scans the larger table and compares the hash values (of rows from the
large table) with this hash table to find the joined rows. The larger table is called the probe table.

Probe Phase

For each row in big table loop
    Calculate the hash value on join key
    Probe the hash table for the hash value
    If match found
        Return rows
End loop;
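The build and probe phases above can be sketched as follows (with Dept as the smaller build table and Emp as the larger probe table; the table contents and column names are made up):

dept = [{"dno": 1, "dname": "Sales"}, {"dno": 2, "dname": "HR"}]              # build table
emp  = [{"eno": 10, "dno": 1}, {"eno": 11, "dno": 2}, {"eno": 12, "dno": 1}]  # probe table

# Build phase: hash each row of the small table on its join key.
hash_table = {}
for d in dept:
    hash_table.setdefault(d["dno"], []).append(d)   # one bucket per join-key value

# Probe phase: hash each row of the big table and look it up in the hash table.
result = []
for e in emp:
    for d in hash_table.get(e["dno"], []):          # matching bucket, if any
        result.append({**d, **e})

print(result)     # equality join only: rows are matched on equal dno values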

[Figure: Hash Join]

7. Query optimization algorithms


There are two methods of query optimization
1. Cost based Optimization (Physical)
2. Heuristic Optimization (Logical)

1. Cost based Optimization (Physical)

For a given query, the Optimizer assigns a numerical cost to each step of a possible plan and
then adds these values together to get a cost estimate for the plan or strategy. After calculating
the costs of all possible plans, the Optimizer chooses the plan with the lowest estimated cost.
For that reason, the Optimizer is sometimes referred to as the Cost-Based Optimizer. Below
are some of the features of cost-based optimization:

1. Cost-based optimization is based on the cost of the query that is to be optimized.
2. The query can use many access paths depending on the available indexes, sorting
methods, constraints, etc.
3. The aim of query optimization is to choose the most efficient way of implementing
the query, i.e. the algorithm with the lowest possible cost.
4. The cost of executing each candidate algorithm is estimated by the query Optimizer so
that the most suitable algorithm can be selected for an operation.
5. The cost of an algorithm also depends upon the cardinality (number of rows processed by the
query) of its input.



Parameters that affects the Query Costs in the process of Query Optimization
i. Secondary storage access cost: this is the cost of accessing, reading, searching for
and writing data blocks that reside in the secondary storage. The cost of searching
depends on the file structures, presence of indexes and so on. The ways in which the
records are physically stored affect the access cost.
ii. Storage cost: this is the cost of storing any intermediate files that are generated by an
execution strategy for the query.
iii. Computation cost: this is the cost of performing in-memory operations on the data
buffers during query execution. Some examples are sorting records, merging records,
searching for records, performing computations on field values and so on.
iv. Memory usage cost: this is the cost pertaining to the number of memory buffers
needed during query execution.
v. Communication cost: this is the cost of communicating the query from the source to
the database and then the query results back to the terminal where the query
originated.

Issues in Cost-Based Optimization:

i. In cost-based optimization, the number of execution strategies that can be considered
is not fixed; it may vary with the situation.
ii. The process can be very time-consuming, because costing many strategies is itself
costly, and it still does not always guarantee finding the optimal strategy.
iii. It is an expensive process.

2. Heuristic Optimization (Logical)


Cost-based optimization is expensive. Heuristics are used to reduce the number of choices
that must be made in a cost-based approach.

Rules

Heuristic optimization transforms the expression tree by using a set of rules that typically
improve performance. These rules are as follows:

i. Perform the SELECTION operations as early as possible in the query. This should be
the first action performed on any table in the query. By doing so, we decrease the
number of records to be processed, rather than carrying all the rows of the tables
through the query.
ii. Perform all PROJECTION operations as early as possible in the query. Like
selection, this helps by decreasing the number of columns carried through the query.
iii. Perform the most restrictive joins and selection operations first. This means selecting
those tables and/or views that result in a relatively small number of records and are
strictly necessary in the query. Obviously, any query will execute better when tables
with few records are joined.

Some systems use only heuristics and the others combine heuristics with partial cost-based
optimization.



Steps in heuristic optimization

The steps involved in heuristic optimization are explained below:

i. Deconstruct the conjunctive selections into a sequence of single selection operations.
ii. Move the selection operations down the query tree for the earliest possible execution.
iii. Execute first those selection and join operations that will produce the smallest
relations.
iv. Replace a Cartesian product operation followed by a selection operation with a join
operation.
v. Deconstruct the projection operations and move them down the tree as far as possible.
vi. Identify those subtrees whose operations can be pipelined.

Example
"Find the names of all customers who have an account at any branch located in Brooklyn."
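Assuming the usual bank schema that accompanies this example, with relations branch(branch_name, branch_city, assets), account(account_number, branch_name, balance) and depositor(customer_name, account_number), the query can first be written with the selection above the joins and then, after pushing the selection down to the branch relation (steps i and ii above), in the equivalent form:

∏customer_name (σbranch_city = 'Brooklyn' (branch ⋈ account ⋈ depositor))

∏customer_name ((σbranch_city = 'Brooklyn' (branch)) ⋈ account ⋈ depositor)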

[Figure: Query tree for the example query]



8. Storage strategies: Indices, B-trees, hashing
Index

An index is a data structure that speeds up certain operations on a file. An index for a file in
a database system works in the same way as the index in any textbook. If we want to learn
about a particular topic (specified by a word or a phrase), we can search for the topic in the
index at the back of the book. Indexes provide faster access to data.

Types of Indexes
I. Single-level ordered indexes
a. Primary indexes
b. Secondary indexes
c. Clustering indexes
II. Multi-level Indexes
III. Dynamic Multi-level indexes using B-trees and B+-trees

Indexes can also be characterized as

• Dense: A dense index has an index entry for every search key value (and hence every
record) in the data file.
• Sparse: A sparse (or nondense) index, on the other hand, has index entries for only
some of the search values.

Index structure

Indexes can be created on one or more columns of a table.

The first column of the index is the search key, which contains a copy of the primary key or candidate
key of the table. These values are stored in sorted order so that the corresponding
data can be accessed easily. The second column of the index is the data reference: a set
of pointers holding the address of the disk block where the value of the particular key can be found.

Primary indexes: If the index is created on the basis of the primary key of the table, then it is
known as primary indexing. As primary keys are stored in sorted order, the performance of
the searching operation is quite efficient. A primary index is hence a nondense (sparse) index,
since it includes an entry for each disk block of the data file rather than for every search value
(or every record).



[Figure: Primary index (dense)]

[Figure: Primary index (sparse)]


Problem with a primary index

I. A major problem with a primary index—as with any ordered file—is insertion and
deletion of records.
II. With a primary index, the problem is compounded because, if we attempt to insert a
record in its correct position in the data file, we have to not only move records to
make space for the new record but also change some index entries, since moving
records will change the anchor records of some blocks.

Clustering Index: If records of a file are physically ordered on a non-key field—which does
not have a distinct value for each record—that field is called the clustering field. A



clustering index is also an ordered file with two fields; the first field is of the same type as the
clustering field of the data file, and the second field is a block pointer. There is one entry in
the clustering index for each distinct value of the clustering field, containing the value and a
pointer to the first block in the data file that has a record with that value for its clustering
field.

[Figure: Clustering Index]

Secondary Index: A Secondary Index is an ordered file with two fields. The first is of the
same data type as some nonordering field and the second is either a block or a record pointer.
If the entries in this nonordering field must be unique, this field is sometimes referred to as a
Secondary Key. This results in a dense index.



[Figure: Secondary Index]
Multilevel Indexes
• A Multilevel Index is built by constructing a second-level index on a first-level
index.
• This process is continued until the entire top level of the index fits in a single file block.



Dynamic Multilevel Indexes Using B-Trees and B+-Trees

B-trees and B+-trees are special cases of the well-known tree data structure. A tree is formed
of nodes. Each node in the tree, except for a special node called the root, has one parent node
and several—zero or more—child nodes. The root node has no parent. A node that does not
have any child nodes is called a leaf node; a nonleaf node is called an internal node. The level
of a node is always one more than the level of its parent, with the level of the root node being
zero. A subtree of a node consists of that node and all its descendant nodes—its child nodes,
the child nodes of its child nodes, and so on.

B tree

A B-tree of order m (the maximum number of children for each node) is a tree which satisfies
the following properties:
• Every node has at most m children.
• Every node (except the root and the leaves) has at least ⌈m⁄2⌉ children.
• The root has at least two children if it is not a leaf node.
• All leaves appear in the same level, and carry information.
• A non-leaf node with k children contains k–1 keys.



Figure 1.33: Structure of a B-tree

Figure 1.34: Example of a B-tree of order 3

Insertion algorithm

• All insertions start at a leaf node. To insert a new element, search the tree to find the
leaf node where the new element should be added.
• Insert the new element into that node with the following steps:
1. If the node contains fewer than the maximum legal number of elements, then there is
room for the new element. Insert the new element in the node, keeping the node's
elements ordered.
2. Otherwise the node is full, so evenly split it into two nodes.
– A single median is chosen from among the leaf's elements and the new
element.
– Values less than the median are put in the new left node and values greater
than the median are put in the new right node, with the median acting as a
separation value.
– Insert the separation value in the node's parent, which may cause it to be split,
and so on. If the node has no parent (i.e., the node was the root), create a new
root above this node (increasing the height of the tree).
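A compact sketch of this insertion procedure (insertion only; order here means the maximum number of children per node, as defined above, and the class and function names are made up for illustration):

import bisect

class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []          # empty for leaf nodes
        self.leaf = leaf

class BTree:
    def __init__(self, order=3):    # order = maximum number of children per node,
        self.order = order          # so a node holds at most order-1 keys
        self.root = BTreeNode(leaf=True)

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:                       # the root itself was split:
            median, right = split                   # create a new root above it
            new_root = BTreeNode(leaf=False)
            new_root.keys = [median]
            new_root.children = [self.root, right]
            self.root = new_root                    # the tree grows in height

    def _insert(self, node, key):
        if node.leaf:
            bisect.insort(node.keys, key)           # insert, keeping keys ordered
        else:
            i = bisect.bisect_right(node.keys, key) # child subtree to descend into
            split = self._insert(node.children[i], key)
            if split is not None:                   # a child was split: absorb its median
                median, right = split
                node.keys.insert(i, median)
                node.children.insert(i + 1, right)
        if len(node.keys) > self.order - 1:         # overflow: split this node evenly
            return self._split(node)
        return None

    def _split(self, node):
        mid = len(node.keys) // 2
        median = node.keys[mid]                     # the median key moves up to the parent
        right = BTreeNode(leaf=node.leaf)
        right.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
        if not node.leaf:
            right.children, node.children = node.children[mid + 1:], node.children[:mid + 1]
        return median, right

tree = BTree(order=3)
for k in [10, 20, 5, 6, 12, 30, 7, 17]:
    tree.insert(k)
print(tree.root.keys)               # root keys after the repeated splits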



Hashing
In DBMS, hashing is a technique to directly compute the location of desired data on the disk
without using an index structure. The hashing method is used to index and retrieve items in a
database because it is faster to search for a specific item using the short hashed key than
using its original value. Data is stored in the form of data blocks whose addresses are generated
by applying a hash function; the memory location where these records are stored is known as
a data block or data bucket.

Why do we need Hashing?


Here are the situations in a DBMS where you need to apply the hashing method:

• For a huge database structure, it is difficult to search through all the index levels and
then reach the destination data block to get the desired data.
• The hashing method is used to index and retrieve items in a database because it is
faster to search for a specific item using the short hashed key than using its original
value.
• Hashing is an ideal method to calculate the direct location of a data record on the disk
without using an index structure.
• It is also a helpful technique for implementing dictionaries.

Important Terminologies in Hashing


Here are the important terms used in hashing:



• Data bucket – Data buckets are the memory locations where the records are stored. A
bucket is also known as a unit of storage.
• Key – A DBMS key is an attribute or set of attributes that helps you to identify a
row (tuple) in a relation (table). Keys also allow you to establish relationships between
tables.
• Hash function – A hash function is a mapping function that maps the set of
search keys to the addresses where the actual records are placed.
• Linear probing – Linear probing uses a fixed interval between probes. In this method,
the next available data block is used to enter the new record, instead of overwriting
the older record.
• Quadratic probing – Quadratic probing determines the new bucket address by adding
the successive outputs of a quadratic polynomial to the starting value given by the
original hash computation.
• Hash index – The hash index is the address of the data block. A hash function can be a
simple mathematical function or a complex mathematical function.
• Double hashing – Double hashing is a method used in hash tables to resolve hash
collisions by applying a second hash function.
• Bucket overflow – The condition of bucket overflow is called a collision. This is a fatal
state for any static hash function.

Types of Hashing Techniques


There are mainly two types of hashing techniques:

1. Static Hashing
2. Dynamic Hashing

Static Hashing
In static hashing, the resultant data bucket address always remains the same.

Therefore, if you generate an address for, say, Student_ID = 10 using the hash function
mod(3), the resultant bucket address will always be 1, so you will not see any change in the
bucket address.

Therefore, in the static hashing method, the number of data buckets in memory always
remains constant.

Static Hash Functions

• Inserting a record: When a new record needs to be inserted into the table, you
generate an address for the new record using its hash key. When the address is
generated, the record is automatically stored in that location.
• Searching: When you need to retrieve a record, the same hash function is used to
obtain the address of the bucket where the data is stored.
• Deleting a record: Using the hash function, you first fetch the record that you want
to delete. Then you remove the record from that address in memory.



Static hashing is further divided into

1. Open hashing
2. Close hashing.

Open Hashing
In the open hashing method, instead of overwriting the older record, the next available data
block is used to enter the new record. This method is also known as linear probing.

For example, suppose A2 is a new record that you want to insert. The hash function generates
address 222, but that bucket is already occupied by some other value. The system therefore
looks for the next available data bucket, 501, and assigns A2 to it.
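A small sketch of this idea (the number of buckets and the keys are made up): the hash function gives a home bucket, and if that bucket is occupied the next available bucket is used instead of overwriting the existing record.

NUM_BUCKETS = 3
buckets = [None] * NUM_BUCKETS          # a fixed set of data buckets (static hashing)

def hash_address(key):
    return key % NUM_BUCKETS            # static hash function: the address never changes

def insert(key, record):
    addr = hash_address(key)
    for step in range(NUM_BUCKETS):     # linear probing: try successive buckets
        probe = (addr + step) % NUM_BUCKETS
        if buckets[probe] is None:
            buckets[probe] = (key, record)
            return probe
    raise OverflowError("all buckets are full")

print(insert(10, "rec-10"))   # 10 mod 3 = 1, as in the Student_ID example above
print(insert(13, "rec-13"))   # 13 mod 3 = 1 as well: collision, so bucket 2 is used
print(buckets)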

[Figure: How Open Hashing Works]

Close Hashing
In the close hashing method, when a bucket is full, a new bucket is allocated for the same
hash result and is linked after the previous one.

Dynamic Hashing
Dynamic hashing offers a mechanism in which data buckets are added and removed
dynamically and on demand. In this hashing, the hash function helps you to create a large
number of values.

What is Collision?
A hash collision is a state in which two or more items in the data set hash to the same place
in the hash table.

How to deal with Hashing Collision?


There are two techniques you can use to handle a hash collision:

1. Rehashing: This method invokes a secondary hash function, which is applied
repeatedly until an empty slot is found where the record can be placed.
2. Chaining: The chaining method builds a linked list of items whose keys hash to the
same value. This method requires an extra link field at each table position.
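A minimal sketch of the chaining approach (data made up): each table position holds a chain, represented here as a Python list, of all records whose keys hash to that position.

NUM_SLOTS = 3
table = [[] for _ in range(NUM_SLOTS)]     # one chain (linked list) per table position

def insert(key, record):
    slot = key % NUM_SLOTS
    table[slot].append((key, record))      # colliding keys simply extend the chain

def lookup(key):
    slot = key % NUM_SLOTS
    return [rec for k, rec in table[slot] if k == key]

insert(10, "rec-10")
insert(13, "rec-13")                       # collides with key 10: both hash to slot 1
print(lookup(13))                          # ['rec-13']
print(table)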

