Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purposes of
RMK Group of Educational Institutions. If you have received this document through
email in error, please notify the system manager. This document contains proprietary
information and is intended only for the respective group / learning community. If
you are not the addressee, you should not disseminate, distribute or copy it through
e-mail. Please notify the sender immediately by e-mail if you have received this
document by mistake and delete it from your system. If you are not the intended
recipient, you are notified that disclosing, copying, distributing or taking any action
in reliance on the contents of this information is strictly prohibited.

22CB304
Database Management Systems

Department: Computer Science and Business Systems

Batch/Year: 2023-2027 / II Year

Created by:

Dr. V. R. Kavitha AsP/CSBS


Ms. C. Mary Shiba AP/CSBS

Date: 09.07.2024
Table of Contents

Sl. No.  Topic  Page No.
1.  Contents  5
2.  Course Objectives  6
3.  Pre Requisites (Course Name with Code)  7
4.  Syllabus (With Subject Code, Name, LTPC details)  8
5.  Course Outcomes (6)  9
6.  CO-PO/PSO Mapping  10
7.  Lecture Plan (S.No., Topic, No. of Periods, Proposed Date, Actual Lecture Date, Pertaining CO, Taxonomy Level, Mode of Delivery)  12
8.  Activity Based Learning  13
9.  Lecture Notes (with links to videos, e-book references, PPTs, quizzes and other learning materials)  15
10. Assignments (for higher-level learning and evaluation; examples: case study, comprehensive design, etc.)  70
11. Part A Q & A (with K level and CO)  71
12. Part B Qs (with K level and CO)  75
13. Supportive Online Certification Courses (NPTEL, Swayam, Coursera, Udemy, etc.)  76
14. Real-Time Applications in Day-to-Day Life and in Industry  77
15. Contents Beyond the Syllabus (COE-related value-added courses)  78
16. Assessment Schedule (Proposed Date & Actual Date)  80
17. Prescribed Text Books & Reference Books  81
18. Mini Project  82

Course Objectives

1. Understand the basic concepts of databases, why they are required, and what their benefits and advantages are.

2. Apply effective relational database design concepts.

3. Know the fundamental concepts of transaction processing, concurrency control techniques and recovery procedures.

4. Efficiently model and design various database objects and entities.

5. Implement efficient data querying and updates, with the needed configuration.
Pre Requisites

21CB303 - Object Oriented Programming


21IT201 – Data Structures and Algorithms
Syllabus

22CB304  DATABASE MANAGEMENT SYSTEMS + LAB    L T P C: 3 0 2 4
UNIT I INTRODUCTION 9
Introduction: Introduction to Database. Hierarchical, Network and Relational Models.
Database system architecture: Data Abstraction, Data Independence, Data Definition
Language (DDL), Data Manipulation Language (DML). Data models: Entity-relationship
model, network model, relational and object oriented data models, integrity constraints,
data manipulation operations.

UNIT II RELATIONAL QUERY LANGUAGE 9


Relational query languages: Relational algebra, Tuple and domain relational calculus,
SQL3, DDL and DML constructs, Open source and Commercial DBMS - MYSQL, ORACLE,
DB2, SQL server. Relational database design: Domain and data dependency,
Armstrong's axioms, Functional Dependencies, Normal forms, Dependency preservation,
Lossless design.

UNIT III QUERY PROCESSING AND STORAGE 9


Query processing and optimization: Evaluation of relational algebra expressions,
Query equivalence, Join strategies, Query optimization algorithms. Storage strategies:
Indices, B-trees, Hashing.

UNIT IV TRANSACTION PROCESSING 9


Transaction processing: Concurrency control, ACID property, Serializability of
scheduling, Locking and timestamp based schedulers, Multi-version and optimistic
Concurrency Control schemes, Database recovery.

UNIT V DATA BASE SECURITY AND ADVANCED DATABASES 9


Database Security: Authentication, Authorization and access control, DAC, MAC and
RBAC models, Intrusion detection, SQL injection. Object oriented and object relational
databases, Logical databases, Web databases, Distributed databases, Data warehousing
and data mining.
Course Outcomes

CO1: Able to design and deploy an efficient and scalable data storage node for varied kinds of application requirements.
CO2: Map the ER model to the relational model to perform database design effectively.
CO3: Write queries using normalization criteria and optimize queries.
CO4: Compare and contrast various indexing strategies in different database systems.
CO5: Appraise how advanced databases differ from traditional databases.
CO-PO/PSO Mapping

COs   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
CO1    3    3    3    -    1    -    -    -    2    -     -     3     2     2     2
CO2    3    3    3    3    1    -    -    -    -    -     -     3     2     2     3
CO3    3    3    3    3    2    -    -    -    -    -     -     3     2     2     3
CO4    3    3    3    3    3    -    -    -    -    -     -     3     2     2     3
CO5    3    3    3    3    2    -    -    -    -    -     -     3     2     2     3
Unit III
QUERY PROCESSING
AND STORAGE
Lecture Plan
UNIT - III

S.No.  Topic  CO  Mode of Delivery
1.  Query Processing  CO3  PPT
2.  Evaluation of relational algebra expressions  CO3  PPT
3.  Query equivalence  CO3  PPT
4.  Join strategies  CO3  PPT
5.  Using heuristics in query optimization  CO3  Chalk and Talk
6.  Using selectivity and cost estimates in query optimization  CO3  Chalk and Talk
7.  Storage strategies: Indices  CO3  Quiz, Chalk and Talk
8.  B-trees, B+ trees  CO3  Chalk and Talk
9.  Hashing  CO3  Chalk and Talk
8. ACTIVITY BASED LEARNING

COMPLETE THE CROSSWORD PUZZLES GIVEN BELOW


Test Yourself HINTS:

ACROSS:
2. Management Studio
5. The database that is used as a template when creating a new database
6. Using a backup
7. Reporting Services
9. An index that rearranges the structure of a table
12. The underlying structure of an index
13. Common Table Expression
20. Making queries run faster
21. An index that doesn’t change the structure of the table
22. The often misunderstood go-faster switch

Down
1. A data type to hold international strings
2. A type of join where the left and right tables are the same table
3. The option used in a CTE that needs to recurse more than 100 levels
4. A type of join that produces a Cartesian product
8. A derived table
10. Not the consistency checker
11. The concept of everything a query needs being available in the index
14. Sometimes called the upsert
15. High Availability
16. Binary large object
17. Transact SQL
18. When two transactions collide
19. Codename for SQL Server 2012
Print the crossword puzzle and fill it in on paper.

3.1 QUERY PROCESSING

Definition:

Query processing refers to the range of activities involved in extracting data from a
database.

The basic steps involved in query processing are:

1. Parsing and translation
2. Optimization
3. Evaluation

Fig: Query Processing

Step 1: Input the High Level Query:


SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > ( SELECT MAX (SALARY)
FROM EMPLOYEE WHERE DNO=5);
The above query can be decomposed into two blocks:

Inner block:
( SELECT MAX(SALARY) FROM EMPLOYEE WHERE DNO=5 )

Outer block:
SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > c

where c is the value returned by the inner block.

Step 2: scanning, parsing and validating:


The scanner identifies the query tokens, such as SQL keywords, attribute names, and relation names, that appear in the text of the query.

The parser checks the query syntax to determine whether it is formulated according to
the syntax rules (rules of grammar) of the query language.

The query must also be validated by checking that all attribute and relation
names are valid and semantically meaningful names in the schema of the
particular database being queried

Step 3: Convert the parsed query into relational algebra:

The query is translated into a relational algebra expression, which is typically represented as a query tree.

An internal representation of the query is then created, usually as a tree data


structure called a query tree.

It is also possible to represent the query using a graph data structure called a
query graph.

Step 4: Query Optimizer:


A query typically has many possible execution strategies, and the process of
choosing a suitable one for processing a query is known as query optimization.

Step 5: Execution Plan:


The DBMS must then devise an execution strategy or query plan for retrieving
the results of the query from the database files.

Step 6: Runtime Database Processor:


The runtime database processor has the task of running (executing) the query
code, whether in compiled or interpreted mode, to produce the query result.

If a runtime error results, an error message is generated by the runtime database


processor.

Parsing and translation
Translate the query into its internal form. This is then translated into
relational algebra

Parser checks syntax, verifies relations

Evaluation
The query-execution engine takes a query-evaluation plan, executes that

plan, and returns the answers to the query.

Optimization

A relational algebra expression may have many equivalent expressions.

E.g., σ_balance<2500 (Π_balance (account)) is equivalent to Π_balance (σ_balance<2500 (account)).

Each relational algebra operation can be evaluated using one of several different algorithms; correspondingly, a relational-algebra expression can be evaluated in many ways.

An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.

Query Optimization:

Amongst all equivalent evaluation plans choose the one with lowest cost.

Cost is estimated using statistical information from the database catalog

E.g. number of tuples in each relation, size of tuples, etc.

Measures of Query Cost

Cost is generally measured as the total elapsed time for answering a query. Many factors contribute to time cost: disk accesses, CPU time, or even network communication.

Typically disk access is the predominant cost, and it is also relatively easy to estimate.

Query cost is calculated using the following formula. For simplicity we use the number of block transfers from disk and the number of seeks as the cost measures:

tT – time to transfer one block

tS – time for one seek

Cost for b block transfers plus S seeks = b * tT + S * tS

The cost to write a block is greater than the cost to read a block, because the data is read back after being written to ensure that the write was successful.

For simplicity, we often use just the number of block transfers from disk as the cost measure, ignore the difference in cost between sequential and random I/O, and ignore CPU costs.

Cost depends on the size of the buffer in main memory.

Having more memory reduces need for disk access.

Amount of real memory available to buffer depends on other concurrent OS


processes, and hard to determine ahead of actual execution.
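As a rough illustration (not from the source), the sketch below plugs hypothetical numbers into the cost formula above; the block counts and device timings are invented assumptions.

    # Estimate plan cost as b * tT + S * tS (block transfers plus seeks).
    def query_cost(b, S, tT=0.1e-3, tS=4e-3):
        # b: block transfers; S: seeks; tT, tS: assumed per-block transfer
        # time and per-seek time in seconds (hypothetical device figures).
        return b * tT + S * tS

    # Hypothetical plan: 10,000 block transfers and 40 seeks.
    print(query_cost(10_000, 40))   # 1.16 seconds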

3.2 Evaluation of relational algebra expressions

The obvious way to evaluate an expression is simply to evaluate one operation at a time, in an appropriate order. There are two ways to evaluate an expression:

1. Materialization
2. Pipelining

Materialization Evaluation: evaluate one operation at a time, starting at the


lowest-level. Use intermediate results materialized into temporary relations to
evaluate next-level operations.
E.g., in the figure below, compute and store

σ_building="Watson" (department)

then compute and store its join with instructor, and finally compute the projection on name.
Materialized evaluation is always applicable


Cost of writing results to disk and reading them back can be quite high
Our cost formulas for operations ignore cost of writing results to disk, so
Overall cost = Sum of costs of individual operations +
cost of writing intermediate results to disk
Double buffering: use two output buffers for each operation, when one is full write
it to disk while the other is getting filled
Allows overlap of disk writes with computation and reduces execution time

Pipelined evaluation : evaluate several operations simultaneously, passing the
results of one operation on to the next.
E.g., in the previous expression tree, do not store the result of

σ_building="Watson" (department)

instead, pass tuples directly to the join. Similarly, don't store the result of the join; pass tuples directly to the projection.
Much cheaper than materialization: no need to store a temporary relation to disk.
Pipelining may not always be possible – e.g., sort, hash-join.
For pipelining to be effective, use evaluation algorithms that generate output tuples
even as tuples are received for inputs to the operation.
Pipelines can be executed in two ways: demand driven and producer driven
Demand driven or lazy evaluation
system repeatedly requests next tuple from top level operation
Each operation requests next tuple from children operations as required, in
order to output its next tuple
In between calls, operation has to maintain “state” so it knows what to
return next
Producer-driven or eager pipelining
Operators produce tuples eagerly and pass them up to their parents
Buffer maintained between operators, child puts tuples in buffer,
parent removes tuples from buffer
if buffer is full, child waits till there is space in the buffer, and then
generates more tuples
System schedules operations that have space in output buffer and can
process more input tuples
Alternative name: pull and push models of pipelining

Implementation of demand-driven pipelining
Each operation is implemented as an iterator implementing the following
operations
open()
E.g. file scan: initialize file scan
state: pointer to beginning of file
E.g.merge join: sort relations;
state: pointers to beginning of sorted relations
next()
E.g. for file scan: Output next tuple, and advance and store file
pointer
E.g. for merge join: continue with merge from earlier state till
next output tuple is found. Save pointers as iterator state.
close()
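To make the open()/next()/close() model concrete, here is a minimal demand-driven pipeline sketched with Python generators (the relation, predicate, and data are invented; generators play the role of the iterator state):

    # Minimal demand-driven (pull) pipeline: each operator supports
    # open()/next()/close(); Python generators keep the iterator state.
    def file_scan(records):
        # open(): position at the start; next(): emit one tuple at a time.
        for t in records:
            yield t

    def select(pred, child):
        # Pulls from its child only when asked for its own next tuple.
        for t in child:
            if pred(t):
                yield t

    def project(attrs, child):
        for t in child:
            yield {a: t[a] for a in attrs}

    # Hypothetical department relation; the top-level loop drives the pipeline.
    department = [{"dept_name": "Physics", "building": "Watson"},
                  {"dept_name": "Music", "building": "Packard"}]
    plan = project(["dept_name"],
                   select(lambda t: t["building"] == "Watson",
                          file_scan(department)))
    for row in plan:
        print(row)    # {'dept_name': 'Physics'}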

Evaluation Algorithms for Pipelining


Some algorithms are not able to output results even as they get input tuples
E.g. merge join, or hash join
intermediate results written to disk and then read back
Algorithm variants to generate (at least some) results on the fly, as input tuples are
read in
E.g. hybrid hash join generates output tuples even as probe relation tuples
in the in-memory partition (partition 0) are read in
Double-pipelined join technique: Hybrid hash join, modified to buffer
partition 0 tuples of both relations in-memory, reading them as they become
available, and output results of any matches between partition 0 tuples
When a new r0 tuple is found, match it with existing s0 tuples, output
matches, and save it in r0
Symmetrically for s0 tuples

3.3 Query equivalence
Cost difference between evaluation plans for a query can be enormous
E.g. seconds vs. days in some cases
Steps in cost-based query optimization
1. Generate logically equivalent expressions using equivalence rules
2. Annotate resultant expressions to get alternative query plans
3. Choose the cheapest plan based on estimated cost
Estimation of plan cost based on:
Statistical information about relations. Examples: number of tuples, number
of distinct values for an attribute
Statistics estimation for intermediate results to compute cost of complex
expressions
Cost formulae for algorithms, computed using statistics

Transformation of Relational Expressions

Two relational algebra expressions are said to be equivalent if the two


expressions generate the same set of tuples on every legal database
instance
Note: order of tuples is irrelevant
we don’t care if they generate different results on databases that violate
integrity constraints
In SQL, inputs and outputs are multisets of tuples
Two expressions in the multiset version of the relational algebra are said to be
equivalent if the two expressions generate the same multiset of tuples on
every legal database instance.
An equivalence rule says that expressions of two forms are equivalent
Can replace expression of first form by second, or vice versa
Equivalence Rules

1. Conjunctive selection operations can be deconstructed into a sequence of individual selections:

   σ_θ1∧θ2 (E) = σ_θ1 (σ_θ2 (E))

2. Selection operations are commutative:

   σ_θ1 (σ_θ2 (E)) = σ_θ2 (σ_θ1 (E))

3. Only the last in a sequence of projection operations is needed; the others can be omitted:

   Π_L1 (Π_L2 (... (Π_Ln (E)) ...)) = Π_L1 (E)

4. Selections can be combined with Cartesian products and theta joins:

   a. σ_θ (E1 × E2) = E1 ⋈_θ E2
   b. σ_θ1 (E1 ⋈_θ2 E2) = E1 ⋈_θ1∧θ2 E2

5. Theta-join operations (and natural joins) are commutative:

   E1 ⋈_θ E2 = E2 ⋈_θ E1

6. (a) Natural join operations are associative:

   (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)

   (b) Theta joins are associative in the following manner:

   (E1 ⋈_θ1 E2) ⋈_θ2∧θ3 E3 = E1 ⋈_θ1∧θ3 (E2 ⋈_θ2 E3)

   where θ2 involves attributes from only E2 and E3.

7. The selection operation distributes over the theta-join operation under the following two conditions:

   (a) When all the attributes in θ0 involve only the attributes of one of the expressions (E1) being joined:

   σ_θ0 (E1 ⋈_θ E2) = (σ_θ0 (E1)) ⋈_θ E2

   (b) When θ1 involves only the attributes of E1 and θ2 involves only the attributes of E2:

   σ_θ1∧θ2 (E1 ⋈_θ E2) = (σ_θ1 (E1)) ⋈_θ (σ_θ2 (E2))

Fig: Pictorial Representation of Equivalence Rules

8. The projection operation distributes over the theta-join operation as follows:

   (a) If θ involves only attributes from L1 ∪ L2:

   Π_L1∪L2 (E1 ⋈_θ E2) = (Π_L1 (E1)) ⋈_θ (Π_L2 (E2))

   (b) Consider a join E1 ⋈_θ E2.
   Let L1 and L2 be sets of attributes from E1 and E2, respectively.
   Let L3 be attributes of E1 that are involved in join condition θ but are not in L1 ∪ L2, and
   let L4 be attributes of E2 that are involved in join condition θ but are not in L1 ∪ L2. Then:

   Π_L1∪L2 (E1 ⋈_θ E2) = Π_L1∪L2 ((Π_L1∪L3 (E1)) ⋈_θ (Π_L2∪L4 (E2)))

9. The set operations union and intersection are commutative:

   E1 ∪ E2 = E2 ∪ E1
   E1 ∩ E2 = E2 ∩ E1

   (Set difference is not commutative.)

10. Set union and intersection are associative:

    (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
    (E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)

11. The selection operation distributes over ∪, ∩ and −:

    σ_θ (E1 − E2) = σ_θ (E1) − σ_θ (E2)
    and similarly for ∪ and ∩ in place of −.
    Also: σ_θ (E1 − E2) = σ_θ (E1) − E2
    and similarly for ∩ in place of −, but not for ∪.

12. The projection operation distributes over union:

    Π_L (E1 ∪ E2) = (Π_L (E1)) ∪ (Π_L (E2))
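As a quick sanity check (an illustration added here, not part of the source), the following sketch verifies rules 1 and 2 on a small made-up relation, representing relations as Python sets of tuples:

    # Relation as a set of tuples: (id, dept, marks). Data is invented.
    R = {(1, "CS", 95), (2, "EE", 50), (3, "CS", 40)}

    def select(pred, E):                 # sigma_pred(E)
        return {t for t in E if pred(t)}

    theta1 = lambda t: t[1] == "CS"
    theta2 = lambda t: t[2] > 45

    # Rule 1: a conjunctive selection equals a cascade of selections.
    lhs = select(lambda t: theta1(t) and theta2(t), R)
    rhs = select(theta1, select(theta2, R))
    assert lhs == rhs == {(1, "CS", 95)}

    # Rule 2: selections commute.
    assert select(theta2, select(theta1, R)) == rhs
    print("equivalence rules 1 and 2 hold on the sample relation")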
3.4 JOIN STRATEGIES

Joins:
Outer join:
The outer join operation is an extension of the join operation to deal with missing information.

Example: Consider the relations employee and ft-works as below.

To generate a single relation from the above two relations, a possible approach is to use the natural join operation. The expression is given below.

employee ⋈ ft-works

The result of this expression is given below.

In the above relation the street and city information about Smith is lost, since the tuple describing Smith is absent from the ft-works relation. Similarly, the branch name and salary information about Gates is lost, since the tuple describing Gates is absent from the employee relation. The outer join operation can be used to avoid this loss of information. There are three forms of the outer join operation:

(i) Left outer join  (ii) Right outer join  (iii) Full outer join

(i) The left outer join: This takes all tuples in the left relation that did not match with any tuple in the right relation, pads the tuples with null values for all other attributes from the right relation, and adds them to the result of the natural join. The result of employee ⟕ ft-works is given below.

(ii) The right outer join: It is symmetric with the left outer join. It pads tuples from the right relation that did not match any from the left relation with nulls and adds them to the result of the natural join. The result of employee ⟖ ft-works is given below.

(iii) The full outer join: It does both of the above operations, padding tuples from the left relation that did not match any from the right relation, as well as tuples from the right relation that did not match any from the left relation, and adding them to the result of the join. The relation below shows the result of employee ⟗ ft-works.
3.4.1 IMPLEMENTING THE JOIN OPERATION

The JOIN operation is one of the most time-consuming operations in query


processing.

Many of the join operations encountered in queries are of the EQUIJOIN and NATURAL JOIN variety.

Joins involving more than two files are called multiway joins. The number of
possible ways to execute multiway joins grows very rapidly.
Methods for Implementing Joins:

J1 - Nested-loop join (or nested-block join)


This is the default (brute force) algorithm, as it does not require any special access paths on either file in the join. For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition t[A] = s[B].
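A minimal sketch of J1 in Python (relation contents and attribute names are invented):

    # J1: nested-loop join of R and S on R.A = S.B (brute force).
    R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}]
    S = [{"B": 2, "y": "s1"}, {"B": 3, "y": "s2"}]

    result = []
    for t in R:                      # outer loop over every record of R
        for s in S:                  # inner loop over every record of S
            if t["A"] == s["B"]:     # join condition t[A] = s[B]
                result.append({**t, **s})
    print(result)                    # [{'A': 2, 'x': 'r2', 'B': 2, 'y': 's1'}]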

J2 - Single-loop join (using an access structure to retrieve the matching

records)

If an index (or hash key) exists for one of the two join attributes, attribute B of file
S- retrieve each record t in R (loop over file R), and then use the access structure to
retrieve directly all matching records s from S that satisfy s [B ] = t [ A ].

J3 - Sort-merge join
If the records of R and S are physically sorted (ordered) by value of the join attributes

A and B, respectively, we can implement the join in the most efficient way
possible.

Both files are scanned concurrently in order of the join attributes, matching the
records that have the same values for A and B. If the files are not sorted, they may
be sorted first by using external sorting
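A sketch of J3, assuming both inputs are already sorted on the join attribute (the tuples are invented):

    # J3: sort-merge join; both inputs must be sorted on the join attribute.
    R = [(1, "r1"), (2, "r2"), (4, "r3")]     # sorted on first field (A)
    S = [(2, "s1"), (2, "s2"), (3, "s3")]     # sorted on first field (B)

    i = j = 0
    result = []
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            # Match: pair this R tuple with every S tuple sharing the key.
            k = j
            while k < len(S) and S[k][0] == R[i][0]:
                result.append(R[i] + S[k])
                k += 1
            i += 1
    print(result)    # [(2, 'r2', 2, 's1'), (2, 'r2', 2, 's2')]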
J4 - Partition-hash join
The records of files R and S are partitioned into smaller files. The partitioning of
each file is done using the same hashing function h on the join attribute A of R
(for partitioning file R) and B of S (for partitioning file S).

First, a single pass through the file with fewer records (say, R) hashes its records
to the various partitions of R; this is called the partitioning phase, since the
records of R are partitioned into the hash buckets.

The collection of records with the same value of h(A) are placed in the same
partition, which is a hash bucket in a hash table in main memory. In the second
phase, called the probing phase, a single pass through the other file (S) then
hashes each of its records using the same hash function h(B) to probe the
appropriate bucket, and that record is combined with all matching records from R
in that bucket.
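A sketch of J4 in Python; the number of partitions M and the sample data are assumptions, and all partitions are kept in memory for simplicity:

    # J4: partition-hash join on R.A = S.B using h(k) = k mod M.
    M = 3
    R = [(1, "r1"), (4, "r2"), (2, "r3")]
    S = [(4, "s1"), (2, "s2"), (7, "s3")]

    # Partitioning phase: hash the smaller file R into M buckets.
    buckets = [[] for _ in range(M)]
    for t in R:
        buckets[t[0] % M].append(t)

    # Probing phase: hash each S record with the same function and
    # combine it with the matching R records in that bucket.
    result = []
    for s in S:
        for t in buckets[s[0] % M]:
            if t[0] == s[0]:
                result.append(t + s)
    print(result)    # [(4, 'r2', 4, 's1'), (2, 'r3', 2, 's2')]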

HYBRID HASH-JOIN
The hybrid hash-join algorithm is a variation of partition hash-join, where the
joining phase for one of the partitions is included in the partitioning phase.

To illustrate this, let us assume that the size of a memory buffer is one disk block;
that nB such buffers are available; and that the partitioning hash function used is
h(K) = K mod M, so that M partitions are being created, where M < nB.

This simplified description of partition-hash join assumes that the smaller of the
two files fits entirely into memory buckets after the first phase.

3.5 HEURISTICS IN QUERY OPTIMIZATION

A query tree is used to represent a relational algebra or extended relational algebra expression, whereas a query graph is used to represent a relational calculus expression.

Heuristic optimization techniques apply rules that modify the internal representation of a query (usually a query tree or a query graph data structure) to improve its expected performance.

The scanner and parser of an SQL query first generate a data structure that corresponds to an initial query representation, which is then optimized according to heuristic rules.

This leads to an optimized query representation, which corresponds to the


query execution strategy.

A query execution plan is generated to execute groups of operations based on


the access paths available on the files involved in the query.

One of the main heuristic rules is to apply SELECT and PROJECT operations
before applying the JOIN or other binary operations, because the size
of the file resulting from a binary operation - such as JOIN - is
usually a multiplicative function of the sizes of the input files.

The SELECT and PROJECT operations reduce the size of a file and hence
should be applied before a join or other binary operation.

NOTATION FOR QUERY TREES AND QUERY GRAPHS

Notation for Query Trees and Query Graphs:

A query tree is a tree data structure that corresponds to a relational algebra expression.

It represents the input relations of the query as leaf nodes of the tree, and represents
the relational algebra operations as internal nodes.

An execution of the query tree consists of executing an internal node operation
whenever its operands are available and then replacing that internal node by the
relation that results from executing the operation. The order of execution of
operations starts at the leaf nodes, which represents the input database relations
for the query, and ends at the root node, which represents the final operation of the
query.

The execution terminates when the root node operation is executed and produces
the result relation for the query.

From Figure:
a. query tree corresponding to relational algebra for the query Q2

b. initial query tree for SQL query Q2

c. query graph for Q2


The leaf nodes P, D, and E represent the three relations PROJECT, DEPARTMENT, and
EMPLOYEE, respectively, and the internal tree nodes represent the relational algebra
operations of the expression.

When this query tree is executed, the node marked (1) in Figure must begin execution
before node (2) because some resulting tuples of operation (1) must be available before
we can begin executing operation (2).

The query tree represents a specific order of operations for executing a query.

A more neutral data structure for representation of a query is the query graph
notation.
Relations in the query are represented by relation nodes, which are displayed as single
circles. Constant values, typically from the query selection conditions, are represented by
constant nodes, which are displayed as double circles or ovals. Selection and join
conditions are represented by the graph edges.
Fig: Notation for Query Trees and Query Graphs

Heuristic Optimization of Query Trees


The query parser will typically generate a standard initial query tree to
correspond to an SQL query, without doing any optimization.

The heuristic query optimizer will transform this initial query tree into an
equivalent final query tree that is efficient to execute.
The optimizer must include rules for equivalence among relational algebra
expressions that can be applied to transform the initial tree into the final,
optimized query tree.

How a query tree is transformed by using heuristics, and then general


transformation rules and show how they can be used in an algebraic heuristic
optimizer.

Steps in converting a query tree during heuristic optimization
a. Initial (canonical) query tree for SQL query Q.

b. Moving SELECT operations down the query tree.

c. Applying the more restrictive SELECT operation first.

d. Replacing CARTESIAN PRODUCT and SELECT with JOIN operations.

e. Moving PROJECT operations down the query tree.

a. The initial query tree for Q is shown below. Executing this tree directly first
creates a very large file containing the CARTESIAN PRODUCT of the entire
EMPLOYEE, WORKS_ON, and PROJECT files. That is why the initial query tree is
never executed, but is transformed into another equivalent tree that is efficient
to execute.

Fig: The initial query tree for Q


b. The above fig shows an improved query tree that first applies the SELECT
operations to reduce the number of tuples that appear in the CARTESIAN
PRODUCT.

Fig: Cartesian Product

c. A further improvement is achieved by switching the positions of the EMPLOYEE and


PROJECT relations in the tree.

This uses the information that Pnumber is a key attribute of the PROJECT relation,
and hence the SELECT operation on the PROJECT relation will retrieve a single record
only

Fig: Employee and Project relations in the tree
d) We can further improve the query tree by replacing any CARTESIAN PRODUCT operation that is followed by a join condition with a JOIN operation.

Fig: JOIN Operation

e) Another improvement is to keep only the attributes needed by subsequent


operations in the intermediate relations, by including PROJECT operations as early
as possible in the query tree.

This reduces the attributes (columns) of the intermediate relations, whereas the
SELECT operations reduce the number of tuples (records).

Fig: SELECT Operation

General Transformation Rules for Relational Algebra Operations.
There are many rules for transforming relational algebra operations into
equivalent ones. For query optimization purposes, we are interested in the
meaning of the operations and the resulting relations.

Hence, if two relations have the same set of attributes in a different order but the
two relations represent the same information, we consider the relations to be
equivalent.

Transformation rules that are useful in query optimization, without proving them:

Converting Query Trees into Query Execution Plans:

An execution plan for a relational algebra expression represented as a query tree


includes information about the access methods available for each relation as well
as the algorithms to be used in computing the relational operators represented in
the tree.

Fig: Converting Query Trees into Query Execution Plans

To convert this into an execution plan, the optimizer might choose an index
search for the SELECT operation on DEPARTMENT (assuming one exists), a single-
loop join algorithm that loops over the records in the result of the SELECT
operation on DEPARTMENT for the join operation (assuming an index exists on
the Dno attribute of EMPLOYEE), and a scan of the JOIN result for input to the
PROJECT operator.

Additionally, the approach taken for executing the query may specify a
materialized or a pipelined evaluation, although in general a pipelined evaluation
is preferred whenever feasible.

With materialized evaluation, the result of an operation is stored as a


temporary relation (that is, the result is physically materialized).

For instance, the JOIN operation can be computed and the entire result stored as
a temporary relation, which is then read as input by the algorithm that computes
the PROJECT operation, which would produce the query result table.

On the other hand, with pipelined evaluation, as the resulting tuples of an


operation are produced, they are forwarded directly to the next operation in the
query sequence.

3.6 COST OPTIMIZATION IN QUERY OPTIMIZATION
A query optimizer does not depend solely on heuristic rules; it also estimates and
compares the costs of executing a query using different execution strategies and
algorithms, and it then chooses the strategy with the lowest cost estimate.

In addition, the optimizer must limit the number of execution strategies to be


considered; otherwise, too much time will be spent making cost estimates for the
many possible execution strategies.

Hence, this approach is more suitable for compiled queries where the
optimization is done at compile time and the resulting execution strategy code is
stored and executed directly at runtime.

For interpreted queries, where the entire process occurs at runtime, a full-scale optimization may slow down the response time. A more elaborate optimization is indicated for compiled queries, whereas a partial, less time-consuming optimization works best for interpreted queries.

This approach is generally referred to as cost-based query optimization


It uses traditional optimization techniques that search the solution space to a
problem for a solution that minimizes an objective (cost) function.

The cost functions used in query optimization are estimates and not exact cost
functions, so the optimization may select a query execution strategy that is not
the optimal (absolute best) one.

Cost Components for Query Execution:

The cost of executing a query includes the following components:

1. ACCESS COST TO SECONDARY STORAGE


This is the cost of transferring (reading and writing) data blocks between
secondary disk storage and main memory buffers.

This is also known as disk I/O (input/output) cost. The cost of searching for
records in a disk file depends on the type of access structures on that file, such as
ordering, hashing, and primary or secondary indexes.

In addition, factors such as whether the file blocks are allocated contiguously on
the same disk cylinder or scattered on the disk affect the access cost.

2. DISK STORAGE COST


This is the cost of storing on disk any intermediate files that are generated by an
execution strategy for the query.

3. COMPUTATION COST
This is the cost of performing in-memory operations on the records within the
data buffers during query execution. Such operations include searching for and
sorting records, merging records for a join or a sort operation, and performing
computations on field values.

This is also known as CPU (central processing unit) cost.

4. MEMORY USAGE COST


This is the cost pertaining to the number of main memory buffers needed during
query execution.

5. COMMUNICATION COST
This is the cost of shipping the query and its results from the database site to the
site or terminal where the query originated.

In distributed databases, it would also include the cost of transferring tables and
results among various computers during query evaluation.

Simple cost functions ignore other factors and compare different query execution
strategies in terms of the number of block transfers between disk and main
memory buffers.

For smaller databases, where most of the data in the files involved in the query
can be completely stored in memory, the emphasis is on minimizing computation
cost. In distributed databases, where many sites are involved, communication
cost must be minimized also.

It is difficult to include all the cost components in a (weighted) cost function


because of the difficulty of assigning suitable weights to the cost components.
That is why some cost functions consider a single factor only—disk access

Catalog Information Used in Cost Functions:


To estimate the costs of various execution strategies, we must keep track of any
information that is needed for the cost functions.

This information may be stored in the DBMS catalog, where it is accessed by the
query optimizer.

First, we must know the size of each file. For a file whose records are all of the
same type, the number of records (tuples) (r), the (average) record size
(R), and the number of file blocks (b) (or close estimates of them) are
needed. The blocking factor (bfr) for the file may also be needed.

We must also keep track of the primary file organization for each file.
The primary file organization records may be unordered, ordered by an attribute
with or without a primary or clustering index, or hashed (static hashing or one of
the dynamic hashing methods) on a key attribute. Information is also kept on all
primary, secondary, or clustering indexes and their indexing attributes.

The number of levels (x) of each multilevel index (primary, secondary, or


clustering) is needed for cost functions that estimate the number of block
accesses that occur during query execution. In some cost functions the number
of first-level index blocks (bI1) is needed.

Another important parameter is the number of distinct values (d) of an


attribute and the attribute selectivity (sl), which is the fraction of records
satisfying an equality condition on the attribute.

This allows estimation of the selection cardinality (s = sl*r) of an attribute,


which is the average number of records that will satisfy an equality selection
condition on that attribute. For a key attribute, d = r, sl = 1/r and s = 1.

For a nonkey attribute, by making the assumption that the d distinct values are uniformly distributed among the records, we estimate sl = (1/d) and so s = (r/d).
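A small illustration of these catalog-based estimates (the statistics are invented):

    # Selectivity sl and selection cardinality s = sl * r from catalog statistics.
    r = 10_000      # number of records in the file
    d = 50          # number of distinct values of a nonkey attribute
    sl = 1 / d      # selectivity under the uniform-distribution assumption
    s = sl * r      # expected records satisfying an equality condition
    print(sl, s)    # 0.02 200.0

    # For a key attribute: d = r, sl = 1/r and s = 1.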

Cost Functions for SELECT
We now give cost functions for the selection algorithms S1 to S8 in terms of number of
block transfers between memory and disk.

These cost functions are estimates that ignore computation time, storage cost, and
other factors.

The cost for method Si is referred to as CSi block accesses.

S1 - Linear search (brute force) approach.

We search all the file blocks to retrieve all records satisfying the selection
condition; hence, CS1a = b.

For an equality condition on a key attribute, only half the file blocks are searched on the average before finding the record, so a rough estimate is CS1b = (b/2) if the record is found; if no record is found that satisfies the condition, CS1b = b.

S2 - Binary Search

This search accesses approximately CS2 = log2(b) + ⌈s/bfr⌉ − 1 file blocks, where s is the selection cardinality and bfr the blocking factor.

This reduces to log2(b) if the equality condition is on a unique (key) attribute, because s = 1 in this case.

S3a - Using a primary index to retrieve a single record

For a primary index, retrieve one disk block at each index level, plus one disk block
from the data file. Hence, the cost is one more disk block than the number of
index levels: CS3a = x + 1.

S3b - Using a hash key to retrieve a single record
For hashing, only one disk block needs to be accessed in most cases. The
cost function is approximately CS3b = 1 for static hashing or linear hashing,
and it is 2 disk block accesses for extendible hashing

S4 - Using an ordering index to retrieve multiple records

If the comparison condition is >, >=, <, or <= on a key field with an
ordering index, roughly half the file records will satisfy the condition.

This gives a cost function of CS4 = x + (b/2). This is a very rough estimate,
and although it may be correct on the average, it may be quite inaccurate in
individual cases. A more accurate estimate is possible if the distribution of
records is stored in a histogram.

S5 - Using a clustering index to retrieve multiple records


One disk block is accessed at each index level, which gives the address of
the first file disk block in the cluster.

Given an equality condition on the indexing attribute, s records will satisfy


the condition, where s is the selection cardinality of the indexing attribute.

This means that ⌈s/bfr⌉ file blocks will be in the cluster of file blocks that hold all the selected records, giving CS5 = x + ⌈s/bfr⌉.

S6 - Using a secondary (B+-tree) index

For a secondary index on a key (unique) attribute, the cost is x + 1 disk


block accesses. For a secondary index on a nonkey (non unique)
attribute, s records will satisfy an equality condition, where s is the
selection cardinality of the indexing attribute.

If the index is non clustering, each of the records may
reside on a different disk block, so the (worst case) cost
estimate is CS6a = x + 1 + s.
If the comparison condition is >, >=, <, or <= and half the file
records are assumed to satisfy the condition, then (very roughly) half
the first-level index blocks are accessed, plus half the file records via
the index.

The cost estimate for this case, approximately, is CS6b = x + (bI1/2) +


(r/2). The r/2 factor can be refined if better selectivity estimates are
available through a histogram. The latter method CS6b can be very
costly.

S7 - Conjunctive selection
We can use either S1 or one of the methods S2 to S6 discussed above.
In the latter case, we use one condition to retrieve the records and
then check in the main memory buffers whether each retrieved record
satisfies the remaining conditions in the conjunction.
If multiple indexes exist, the search of each index can produce a set of
record pointers (record ids) in the main memory buffers.

S8 - Conjunctive selection using a composite index

Same as S3a, S5, or S6a, depending on the type of index.
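To make the trade-offs concrete, the following sketch (not from the source) compares a few of the SELECT cost estimates above for one hypothetical file:

    import math

    # Hypothetical catalog values: file blocks, index levels,
    # selection cardinality, blocking factor.
    b, x, s, bfr = 2000, 2, 20, 10

    CS1 = b                                                   # S1: linear search
    CS2 = math.ceil(math.log2(b)) + math.ceil(s / bfr) - 1    # S2: binary search
    CS3a = x + 1                                              # S3a: primary index, one record
    CS5 = x + math.ceil(s / bfr)                              # S5: clustering index
    print(CS1, CS2, CS3a, CS5)                                # 2000 12 3 4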

COST FUNCTIONS FOR JOIN:

The join operations are of the form: R ⋈_A=B S

where A and B are domain-compatible attributes of R and S, respectively. Assume that R has bR blocks and that S has bS blocks.

J1 - Nested-loop join:
If nB main memory buffers are available to perform the join, the cost formula becomes:

CJ1 = bR + ( ⌈bR/(nB − 2)⌉ * bS ) + ( (js * |R| * |S|) / bfrRS )

where js is the join selectivity and bfrRS is the blocking factor of the result file; the last term is the cost of writing the result.
J2 - Single-loop join:

If an index exists for the join attribute B of S with index levels xB, we can retrieve
each record s in R and then use the index to retrieve all the matching records

t from S that satisfy t[B] = s[A]. The cost depends on the type of index.

J3 - Sort-merge join:
If the files are already sorted on the join attributes, the cost function for this method is:

CJ3 = bR + bS + ( (js * |R| * |S|) / bfrRS )

If the files must be sorted first, the cost of external sorting is added.

J4 - Partition-hash join:
Assuming the partitioning phase reads and writes each file once and the probing phase reads each file once, the cost estimate is:

CJ4 = 3 * (bR + bS) + ( (js * |R| * |S|) / bfrRS )
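A sketch (not from the source) comparing these join cost estimates, ignoring the result-writing term; all parameters are invented:

    import math

    bR, bS, nB = 1000, 5000, 12    # blocks of R and S, available buffers

    CJ1 = bR + math.ceil(bR / (nB - 2)) * bS   # nested-loop join, R as outer
    CJ3 = bR + bS                              # sort-merge join, inputs sorted
    CJ4 = 3 * (bR + bS)                        # partition-hash join
    print(CJ1, CJ3, CJ4)                       # 501000 6000 18000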
3.6 INDEXING TECHNIQUES:

Introduction:
Database system indices play the same role as book indices or card catalogs in
the libraries. For example, to retrieve an account record given the account
number, the database system would look up an index to find on which disk block
the corresponding record resides, and then fetch the disk block, to get the
account record.

There are two basic kinds of indices:

Ordered indices:

Based on a sorted ordering of the values.

Hash indices:
Based on a uniform distribution of values across a range of buckets. The bucket
to which a value is assigned is determined by a function called a hash function.

Several techniques exist for both ordered indexing and hashing. No one technique is
the best. Rather, each technique is best suited to particular database applications.

Each technique must be evaluated on the basis of these factors:

Access types:

Access types can include finding records with a specified attribute value and finding records whose attribute values fall in a specified range.

Access time:
The time it takes to find a particular data item, or set of items using the technique in
question.

Insertion time:

The time it takes to insert a new data item.

Deletion time:

The time it takes to delete a data item.

Space overhead:

The additional space occupied by an index structure.

Search Key:

An attribute or set of attributes used to look up records in a file. An index file consists of records (called index entries) of the form:

search-key | pointer

ORDERED INDICES
To gain fast random access to records in a file, an index structure is used. Each
index structure is associated with a particular search key. Just like the index of a
book or a library catalog an ordered index stores the values of the search keys in
sorted order, and associates with each search key the records that contain it.
Ordered indices can be categorized as primary index and secondary index.

If the file containing the records is sequentially ordered, a primary index is an


index whose search key also defines the sequential order of the file. (The term
primary index is sometimes used to mean an index on a primary key).

Primary indices are also called clustering indices. The search key of a primary
index is usually the primary key, although that is not necessarily so.

Indices whose search key specifies an order different from the sequential order of
the file are called secondary indices, or non clustering indices.

PRIMARY INDEX
In this index, it is assumed that all files are ordered sequentially on some search
key. Such files, with a primary index on the search key, are called index-sequential
files. They represent one of the oldest index schemes used in database systems.
They are designed for applications that require both sequential processing of the
entire file and random access to individual records.

The figure shows a sequential file of account records taken from the banking example. In the figure, the records are stored in search-key order, with branch-name used as the search key.

Figure: – Sequential file for account records

Dense and Sparse Indices

An index record, or index entry, consists of a search-key value, and pointers to


one or more records with that value as their search-key value. The pointer to a
record consists of the identifier of a disk block and an offset within the disk block to
identify the record within the block.

There are two types of ordered indices that can be used:

DENSE INDEX
Dense index: an index record appears for every search-key value in the file. In a
dense primary index, the index record contains the search-key value and a pointer
to the first data record with that search-key value.

Implementations may store a list of pointers to all records with the same search-
key value; doing so is not essential for primary indices. The figure below shows the dense index for the account file.

Figure: Dense Index.

SPARSE INDEX:
An index record appears for only some of the search-key values. To locate a
record we find the index entry with the largest search-key value that is less than
or equal to the search key value for which we are looking. We start at the record
pointed to by that index entry, and follow the pointers in the file until we find the
desired record.

The figure below shows the sparse index for the account file.

Figure: Sparse Index
Suppose that we are looking up records for the Perryridge branch. Using the dense

index we follow the pointer directly to the first Perryridge record. We process this

record and follow the pointer in that record to locate the next record in search-key

(branch-name) order. We continue processing records until we encounter a record

for a branch other than Perryridge. If we are using the sparse index, we do not find

an index entry for "Perryridge". Since the last entry (in alphabetic order) before
"Perryridge" is "Mianus", we follow that pointer. We then read the account file in

sequential order until we find the first Perryridge record, and begin processing at

that point.

Thus, it is generally faster to locate a record in a dense index; rather than a sparse

index. However, sparse indices have advantages over dense indices in that they

require less space and they impose less maintenance overhead for insertions and

deletions.
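A toy version of the sparse-index lookup just described (block layout and branch names are invented; one index entry per block):

    import bisect

    # Sparse index: one (first search-key in block, block number) entry per block.
    blocks = [["Brighton", "Downtown"],
              ["Mianus", "Perryridge"],
              ["Redwood", "Round Hill"]]
    index = [(blk[0], i) for i, blk in enumerate(blocks)]   # sorted by search key

    def lookup(key):
        # Find the largest index entry whose key is <= the target key...
        keys = [k for k, _ in index]
        pos = bisect.bisect_right(keys, key) - 1
        if pos < 0:
            return None
        # ...then scan the file sequentially from the block it points to.
        for blk in blocks[index[pos][1]:]:
            for record in blk:
                if record == key:
                    return record
                if record > key:
                    return None
        return None

    print(lookup("Perryridge"))   # Perryridge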

There is a trade-off that the system designer must make between access time and space overhead. Although the decision regarding this trade-off depends on the specific application, a good compromise is to have a sparse index with one index entry per block. The reason this design is a good trade-off is that the dominant cost in processing a database request is the time that it takes to bring a block from disk into main memory. Once we have brought in the block, the time to scan the entire block is negligible. Using this sparse index, we locate the block containing the record that we are seeking. Thus, unless the record is on an overflow block, we minimize block accesses while keeping the size of the index as small as possible.

MULTI LEVEL INDICES


Even if the sparse index is used, the index itself may become too large for efficient
processing. It is not unreasonable, in practice, to have a file with 100,000 records,

with 10 records stored in each block. If we have one index record per block, the

index has 10,000 records. Index records are smaller than data records, so let us

assume that 100 index records fit on a block. Thus, our index occupies 100 blocks.

Such large indices are stored as sequential files on disk.

If an index is sufficiently small to be kept in main memory, the search time to find

an entry is low. However, if the index is so large that it must be kept on disk, a

search for an entry requires several disk block reads. Binary search can be used on

the index file to locate an entry, but the search still has a large cost. If overflow

blocks have been used, binary search will not be possible. In that case, a sequential

search is typically used, and that requires b block reads, which will take even longer.

Thus, the process of searching a large index may be costly.

To deal with this problem, we treat the index just as we would treat any other

sequential file, and construct a sparse index on the primary index, as in the below

figure To locate a record, we first use binary search on the outer index to find the

record for the largest search-key value less than or equal to the one that we desire.

The pointer points to a block of the inner index. We scan this block until we find the

record that has the largest search-key value less than or equal to the one that we

desire. The pointer in this record points to the block of the file that contains the

record for which we are looking

Figure: Two-level Sparse Index
Using the two levels of indexing, we have read only one index block, rather than
the seven we read with binary search, if we assume that the outer index is
already in main memory. If our file is extremely large, even the outer index may
grow too large to fit in main memory. In such a case, we can create yet another
level of index. Indices with two or more levels are called multilevel indices.
Searching for records with a multilevel index requires significantly fewer I/O
operations than does searching for records by binary search.

A typical dictionary is an example of a multilevel index in the non database world.


The header of each page lists the first word alphabetically on that page. Such a book
index is a multilevel index: The words at the top of each page of the book index
form a sparse index on the contents of the dictionary pages.

INDEX UPDATE
Regardless of what form of index is used, every index must be updated whenever
a record is either inserted into or deleted from the file. These are the algorithms
used for updating single level indices.

INSERTION:
First, the system performs a lookup using the search-key value that appears in
the record to be inserted. Again, the actions the system takes next depend on
whether the index is dense or sparse:

DENSE INDICES:
If the search-key value does not appear in the index, the system inserts an index
record with the search-key value in the index at the appropriate position.

Otherwise the following actions are taken:


If the index record stores pointers to all records with the same search- key value,
the system adds a pointer to the new record to the index record.

Otherwise, the index record stores a pointer to only the first record with the
search-key value. The system then places the record being inserted after the
other records with the same search-key values.

SPARSE INDICES:
We assume that the index stores an entry for each block. If the system creates a
new block, it inserts the first search-key value (in search-key order) appearing in
the new block into the index. On the other hand, if the new record has the least
search-key value in its block, the system updates the index entry pointing to the
block; if not, the system makes no change to the index.

DELETION.
To delete a record, the system first looks up the record to be deleted. The actions
the system takes next depend on whether the index is dense or sparse.

DENSE INDICES:
1. If the deleted record was the only record with its particular search-key
value, then the system deletes the corresponding index record from the
index.

2. Otherwise the following actions are taken:
If the index record stores pointers to all records with the same search-
key value, the system deletes the pointer to the deleted record from
the index record.
Otherwise, the index record stores a pointer to only the first record with
the search-key value.

SPARSE INDICES:
1. If the index does not contain an index record with the search-key value of
the deleted record, nothing needs to be done to the index.

2. Otherwise the system takes the following actions:


If the deleted record was the only record with its search key, the
system replaces the corresponding index record with an index record
for the next search-key value (in search-key order).
Otherwise, if the index record for the search-key value points to record
being deleted, the system updates the index record to point to the next
record with the same search-key value.

SECONDARY INDICES
Secondary indices must be dense, with an index entry for every search-key value,
and, a pointer to every record in the file. A primary index may be sparse, storing
only some of the search-key values, since it is always possible to find records with
intermediate, search-key values by a sequential access to a part of the file. If a
secondary index stores only some of the search-key values, records with
intermediate search-key values may be anywhere in the file and, in general, we
cannot find them without searching the entire file.

The pointers in such a secondary index do not point directly to the file. Instead,
each points to a bucket that contains pointers to the file. The below figure
shows the structure of a secondary index that uses an extra level of indirection on
the account file, on the search key balance.

SQL on INDEX:

Create an index:
create index <index-name> on <relation-name> (<attribute-list>)
E.g.: create index b-index on branch(branch_name)

Drop an index:
drop index <index-name>

3.8 B+ TREE AND B TREE
B+ Trees
The main disadvantage of an index-sequential file is that performance degrades as the file grows, and frequent reorganizations are undesirable.

B+ trees are the most widely used index structure and maintain their efficiency despite insertions and deletions. A B+ tree is a balanced tree: all leaves are at the same level.

Fig: B+ tree is a balanced tree

ADVANTAGE OF B+-TREE INDEX FILES:


Automatically reorganizes itself with small, local, changes, in the face of insertions
and deletions.

Reorganization of entire file is not required to maintain performance.

(MINOR) DISADVANTAGE OF B+ TREES:

Extra insertion and deletion overhead, space overhead

PROPERTIES OF B+ TREE:

All paths from the root to a leaf are of the same length.

Each node that is neither a root nor a leaf has between ⌈n/2⌉ and n children. A leaf node holds between ⌈(n−1)/2⌉ and n−1 values.

A B+-tree is a rooted tree satisfying the following properties.
TWO TYPES OF NODES:
Leaf nodes: Store keys and pointers to data
Index nodes: Store keys and pointers to other nodes Leaf nodes are linked to
each other.
Keys may be duplicated: Every key to the right of a particular key is >= to that
key.
Typical structure of the Node

Ki are the search-key values


Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets
of records (for leaf nodes).
The search-keys in a node are ordered:
K1 < K2 < K3 < ... < Kn−1

PROPERTIES OF LEAF NODE:

For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search -key
value Ki, or to a bucket of pointers to file records, each record having search-key
value Ki.

If Li and Lj are leaf nodes and i < j, then Li's search-key values are less than Lj's search-key values.

Pn points to the next leaf node in search-key order. The search-keys in a leaf node are ordered:

K1 < K2 < K3 < ... < Kn−1
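A compact sketch of searching a B+ tree (the node layout is simplified and the tree is hand-built for illustration):

    # B+ tree search: descend internal nodes by key comparison, then scan a leaf.
    # Node layout: {"keys", "children"} for internal, {"keys", "data"} for leaf.
    leaf1 = {"keys": [5, 10], "data": ["a", "b"]}
    leaf2 = {"keys": [20, 30], "data": ["c", "d"]}
    root = {"keys": [20], "children": [leaf1, leaf2]}

    def search(node, key):
        while "children" in node:          # internal node: choose the subtree
            i = 0
            while i < len(node["keys"]) and key >= node["keys"][i]:
                i += 1
            node = node["children"][i]
        for k, v in zip(node["keys"], node["data"]):   # leaf: scan for the key
            if k == key:
                return v
        return None

    print(search(root, 20))   # 'c'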

Example For B+ Tree:

UPDATES ON B+TREE

1. Find the leaf node in which the search-key value would appear
2. If the search-key value is already present in the leaf node

Add record to the file

If necessary add a pointer to the bucket.

3. If the search-key value is not present, then.

add the record to the main file (and create a bucket if necessary)
If there is room in the leaf node, insert (key-value, pointer) pair in the
leaf node

Otherwise, split the node (along with the new (key-value, pointer) entry

4. Splitting a leaf node:

Take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.

Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.

If the parent is full, split it and propagate the split further up.

5. Splitting of nodes proceeds upwards till a node that is not full is found.

In the worst case the root node may be split, increasing the height of the tree by 1.

Fig: Splitting a Leaf Node

Result of splitting the node containing Brighton and Downtown on inserting Clearview.

Next step: insert the entry (Downtown, pointer-to-new-node) into the parent.

UPDATING A B+ TREE: INSERTION

Fig: B+ tree before and after insertion of "Clearview"

UPDATING A B+ TREE: DELETION

Find the record to be deleted, and remove it from the main file and from the
bucket (if present)

Remove (search-key value, pointer) from the leaf node if there is no bucket or if
the bucket has become empty

If the node has too few entries due to the removal, and the entries in the node
and a sibling fit into a single node, then mergesiblings:

Insert all the search-key values in the two nodes into a single node (the one on
the left), and delete the other node.

Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its
parent, recursively using the above procedure.

Fig: Before and after deleting―Downtown

B TREE:
Similar to a B+ tree, but a B-tree allows search-key values to appear only once, eliminating redundant storage of search keys.
Search keys in non-leaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a non-leaf node must be included.

GENERALIZED B-TREE LEAF NODE

Fig: Generalized B-tree leaf node

In a non-leaf node, the pointers Bi are the bucket or file-record pointers.

B-tree example:

Fig: B-tree

Advantages of B-tree indices:

May use fewer tree nodes than a corresponding B+ tree.

It is sometimes possible to find a search-key value before reaching a leaf node.

Disadvantages of B-tree indices:

Only a small fraction of all search-key values are found early.

Non-leaf nodes are larger, so fan-out is reduced; thus B-trees typically have greater depth than the corresponding B+ tree.

Insertion and deletion are more complicated than in B+ trees.

Implementation is harder than for B+ trees.

3.8 HASHING TECHNIQUES:
Hashing is a type of primary file organization that provides very fast access to records on certain search conditions. This organization is called a hash file.

The idea behind hashing is to provide a function h, called a hash function or randomizing function, that is applied to the hash field value of a record and yields the address of the disk block in which the record is stored. A search for the record within the block can be carried out in a main-memory buffer.

A simple hash function is:

h(K) = K mod M

where
h(K) – the hash function (the bucket address)
K – the search-key value
M – the number of buckets
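
For instance (a minimal illustration, not from the source), with M = 7 buckets:

M = 7  # number of buckets

def h(k: int) -> int:
    # bucket address of search-key value k
    return k % M

print([h(k) for k in (18, 12, 13)])  # -> [4, 5, 6]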

A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).

In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.

A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.

The hash function is used to locate records for access, insertion and deletion. Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.

DISTRIBUTION OF HASH FUNCTIONS:

An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.

An ideal hash function is also random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.

Handling of Bucket Overflows

Bucket overflow can occur because of:

Insufficient buckets.

Skew in the distribution of records. Skew can occur for two reasons: multiple records have the same search-key value, or the chosen hash function produces a non-uniform distribution of key values.

Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.

Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list. This scheme is called closed hashing.

An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.

DEFICIENCIES OF STATIC HASHING

In static hashing, the function h maps search-key values to a fixed set B of bucket addresses, yet databases grow or shrink with time.

If the initial number of buckets is too small and the file grows, performance will degrade due to too many overflows.

If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).

If the database shrinks, again space will be wasted.

One solution: periodic reorganization of the file with a new hash function. This is expensive and disrupts normal operations.

Better solution: allow the number of buckets to be modified dynamically.


DYNAMIC HASHING:
USE OF AN EXTENDABLE HASH STRUCTURE

Each bucket j stores a value ij.

All the entries that point to the same bucket have the same values in their first ij bits.

To locate the bucket containing search-key Kj:

Compute h(Kj) = X.

Use the first i high-order bits of X as an index into the bucket address table, and follow the pointer to the appropriate bucket.

To insert a record with search-key value Kj:

Follow the same procedure as look-up and locate the bucket, say j.

If there is room in bucket j, insert the record in the bucket.

Else the bucket must be split and the insertion re-attempted. (See the lookup sketch below.)

Fig: Hash structure after insertion of one Brighton and two Downtown records
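
A minimal sketch of the directory lookup in Python (illustrative only; global_depth stands for the number i of high-order bits currently in use, and table is a bucket address table with 2**global_depth entries):

def lookup_bucket(table, global_depth, h_value, hash_bits=32):
    """Use the first `global_depth` high-order bits of h(K) as the index."""
    index = h_value >> (hash_bits - global_depth)
    return table[index]

# e.g. with global_depth = 2, a 32-bit hash value whose top bits are 10...
# selects entry 2 of the 4-entry bucket address table.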

The main advantage of extendible hashing is that performance does not degrade as the file grows, as opposed to static external hashing, where collisions increase and the corresponding chaining causes additional accesses. In addition, no space is allocated in extendible hashing for future growth; additional buckets are allocated dynamically as needed.

Other hashing techniques are given below:

Folding involves applying an arithmetic function such as addition, or a logical function such as exclusive-or, to different portions of the hash field value to calculate the hash address. Another technique involves picking some digits of the hash field value – for example, the third, fifth and eighth digits – to form the hash address.
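
As a small illustration of folding (hypothetical, not from the source), a numeric key can be folded by adding its 3-digit chunks:

def fold_hash(key: int, m: int = 1000) -> int:
    """Folding: add 3-digit chunks of the key, then map into the address space."""
    total = 0
    while key > 0:
        total += key % 1000   # take the low 3 digits
        key //= 1000          # drop them
    return total % m

print(fold_hash(123456789))   # 123 + 456 + 789 = 1368 -> address 368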

The problem with most hashing functions is that they do not guarantee that distinct values will hash to distinct addresses, because the hash field space (the number of possible values a hash field can take) is usually much larger than the address space (the number of available addresses for records).

A collision occurs when the hash field value of a record that is being inserted hashes to an address that already contains a different record. In this situation the new record must be inserted in some other position, since its hash address is occupied. The process of finding another position is called collision resolution.

There are numerous methods for collision resolution, as given below:

Open addressing: proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found. The algorithm below (shown in Python) may be used for this purpose.

Algorithm:

def open_addressing_insert(table, k):
    """Insert key k using linear probing over a table of M slots."""
    M = len(table)
    i = k % M                    # i <- hash_address(k)
    a = i                        # a <- i, the starting position
    while table[i] is not None:  # location i is occupied
        i = (i + 1) % M          # probe the next position, wrapping around
        if i == a:               # back at the start: every slot probed
            raise OverflowError("all positions are full")
    table[i] = k                 # new_hash_address <- i
    return i

CHAINING:
In this method, various overflow locations are kept, typically by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location. (A simple in-memory sketch follows.)
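
A minimal in-memory sketch of chaining in Python (illustrative; each slot simply holds a list that plays the role of its overflow chain):

M = 10
table = [[] for _ in range(M)]    # each slot's list is its overflow chain

def chain_insert(k: int) -> None:
    table[k % M].append(k)        # colliding keys go onto the same chain

for k in (18, 12, 13, 2, 3, 23):  # 13, 3 and 23 all hash to slot 3
    chain_insert(k)
print(table[3])                   # -> [13, 3, 23]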

MULTIPLE HASHING:
The program applies a second hash function if the first results in a collision; if another collision results, the program uses open addressing, or applies a third hash function and then uses open addressing if necessary. (A sketch follows.)
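
A sketch of this idea in Python, with a hypothetical second hash function h2 and a fall-back to the open-addressing routine shown earlier (illustrative only):

def h1(k: int, M: int) -> int:
    return k % M                  # first hash function

def h2(k: int, M: int) -> int:
    return (k // 10) % M          # hypothetical second hash function

def multiple_hash_insert(table, k):
    """Try h1, then h2; on a second collision fall back to open addressing."""
    M = len(table)
    for i in (h1(k, M), h2(k, M)):
        if table[i] is None:
            table[i] = k
            return i
    return open_addressing_insert(table, k)  # from the open addressing sketch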

10. ASSIGNMENTS

1. Consider a B+ tree of order two (d = 2); thus the maximum number of pointers per node is 5 and the maximum number of entries is 4. Show the result of entering, one by one, the keys given by the three-letter strings (era, ban, bat, kin, day, log, rye, max, won, ace, ado, bug, cop, gas, let, fax), in that order, into an initially empty B+ tree. Assume lexicographic ordering to compare the strings. Show the state of the tree after every 4 insertions. (CO4, K4)

2. Answer the following questions: (i) What is the minimum space utilization for a B+ tree index? (ii) If your database system supported both a static and a dynamic tree index, would you ever consider using the static index in preference to the dynamic index? (CO4, K4)

3. A B-tree of order 4 is built from scratch by 10 successive insertions. What is the maximum number of node-splitting operations that may take place? (CO4, K4)

4. The keys 18, 12, 13, 2, 3, 23, 9 and 11 are inserted into an initially empty hash table of length 10 using open addressing with hash function h(k) = k mod 10 and linear probing. What is the resultant hash table? (CO4, K4)

5. The following key values are organized in an extendable hashing technique: 1 3 5 8 9 12 17 28. Show the extendable hash structure for this file if the hash function is h(x) = x mod 8 and buckets can hold three records. Show how the extendable hash structure changes as the result of each of the following steps: (CO4, K3)
INSERT 2
INSERT 4
DELETE 5
DELETE 12
11. PART A QUESTIONS & ANSWERS
UNIT IV – IMPLEMENTATION TECHNIQUES

1. List the possible ways of organizing records in files. (CO4, K1)
The possible ways of organizing records in files are:
Heap file organization
Sequential file organization
Hashing file organization
Clustering file organization

2. Explain (i) heap file organization and (ii) sequential file organization. (CO4, K1)
In heap file organization, any record can be placed anywhere in the file where there is space for the record. There is no ordering of records, and there is a single file for each relation.
In sequential file organization, records are stored in sequential order according to the value of a "search key" of each record.

3. What are the two types of indices? (CO4, K1)
The two basic kinds of indices are:
Ordered indices – based on a sorted ordering of the values.
Hash indices – based on a uniform distribution of values across a range of buckets. The bucket to which a value is assigned is determined by a function, called a hash function.

4. Define query tree and query graph. (CO4, K1)
A query tree is a tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes. The query tree represents a specific order of operations for executing a query. A more neutral data structure for representing a query is the query graph notation. Relations in the query are represented by relation nodes, displayed as single circles. Constant values, typically from the query selection conditions, are represented by constant nodes, displayed as double circles or ovals. Selection and join conditions are represented by the graph edges.

5. Define query optimization. (CO4, K1)
Query optimization refers to the process of finding the lowest-cost method of evaluating a given query. It is done by the query evaluation engine.
6. List the factors used to evaluate indexing and hashing techniques. (CO4, K1)
The factors that must be evaluated for an indexing or hashing technique are: access types, access time, insertion time, deletion time and space overhead.

7. What are index-sequential files? What are the two types of ordered indices? (CO4, K1)
Files that are ordered sequentially with a primary index on the search key are called index-sequential files.
The two types of ordered indices are:
Dense index – an index record appears for every search-key value in the file.
Sparse index – an index record appears for only some of the search-key values.

8. Explain multilevel indices. (CO4, K2)
Indices with two or more levels are called multilevel indices. Searching for records with a multilevel index requires significantly fewer I/O operations than searching for records by binary search.

9. What is a B+ tree index? (CO4, K1)
A B+ tree index is a balanced tree in which every path from the root of the tree to a leaf of the tree is of the same length.

10. What is the advantage of B-tree index files over B+ tree index files? (CO4, K1)
B-trees eliminate the redundant storage of search-key values, as they allow each search-key value to appear only once, whereas B+ trees store search-key values redundantly.

11. Write notes on hashing. (CO4, K1)
Hashing provides very fast access to records on certain search conditions. The search condition must be an equality condition on a single field, called the hash field of the file. If the hash field is also a key field of the file, it is called the hash key. Hashing is used to search within a program whenever a group of records is accessed exclusively by using the value of one field.

12. Name the two types of hashing and any two hashing functions. (CO4, K1)
The two types of hashing are:
Internal hashing
External hashing
Hashing functions include:
Folding
Picking some digits of the hash field value
13. What is called query processing? What are the steps involved? (CO4, K1)
Query processing refers to the range of activities involved in extracting data from a database. The basic steps are:
Parsing and translation
Optimization
Evaluation

14. What are a query evaluation plan and the query-execution engine? (CO4, K1)
A sequence of primitive operations that can be used to evaluate a query is a query evaluation plan (or query execution plan). The query execution engine takes a query evaluation plan, executes that plan, and returns the answers to the query.

15. How do we measure the cost of query evaluation? (CO4, K1)
The cost of query evaluation is measured in terms of a number of different resources, including disk accesses, CPU time to execute a query, and, in a distributed database system, the cost of communication.

16. List the cost components for query execution. (CO4, K1)
Access cost to secondary storage
Disk storage cost
Computation cost
Memory usage cost
Communication cost

17. Define left-deep tree. (CO4, K2)
A left-deep tree is a binary tree in which the right child of each non-leaf node is always a base relation. The optimizer would choose the particular left-deep tree with the lowest estimated cost.

18. What are the advantages and disadvantages of an indexed sequential file? (CO4, K1)
Advantage: quick accessing of records.
Disadvantage: an insertion requires rewriting at least everything after the insertion point, which makes inserts very expensive unless they are done at the end of the file.

19. Define bit-interleaved parity. (CO4, K1)
When writing data, corresponding parity bits must also be computed and written to a parity-bit disk. To recover the data on a damaged disk, compute the XOR of the bits from the other disks.

20. What are the disadvantages of a B-tree over a B+ tree? (CO4, K1)
Only a small fraction of all search-key values are found early. Non-leaf nodes are larger, so B-trees typically have greater depth than the corresponding B+ tree. Insertion and deletion are more complicated than in B+ trees, and implementation is harder.

21. Differentiate static and dynamic hashing. (CO4, K1)
In static hashing, the hash function maps search-key values to a fixed set of bucket addresses, so performance degrades as the file grows or shrinks. In dynamic (extendible) hashing, the number of buckets is allowed to change dynamically, so performance does not degrade as the file grows.
12. PART B QUESTIONS

1. Explain in detail query optimization using cost estimation techniques, with examples. (CO4, K4)
2. Explain join strategies in detail. (CO4, K2)
3. Explain the different properties of indexes in detail, and explain the structure of file indices. (CO4, K2)
4. Explain the various indexing schemes used in a database environment. (CO4, K2)
5. Discuss primary file storage systems. (CO4, K1)
6. Explain static and dynamic hashing techniques with examples. (CO4, K4)
7. Briefly describe the B+ tree index file structure. (CO4, K2)
8. Explain in detail B-tree index files. (CO4, K1)
9. With a neat diagram, explain the steps involved in query processing. (CO4, K2)
10. Give a detailed description of query processing and optimization. (CO4, K1)
11. Explain the cost estimation of query optimization. (CO4, K2)
12. Explain the equivalence rules with examples. (CO4, K3)
13. When does a collision occur in hashing? Illustrate various collision resolution techniques. (CO4, K3)
14. Summarize in detail heuristic optimization algorithms. (CO4, K2)
15. Illustrate with an example cost estimation in query optimization. (CO4, K2)
13. SUPPORTIVE ONLINE CERTIFICATION COURSES

1. NPTEL – Data Base Management System
https://nptel.ac.in/noc/courses/noc18/SEM1/noc18-cs15/

2. Coursera – Database Management Essentials
https://www.coursera.org/learn/database-management

3. Coursera – Optimization of SQL Query Tuning and Performance
https://www.coursera.org/projects/optimization-of-sql-query-tuning-and-performance

4. Udemy – Database Management Final Part (5): Indexing, B Trees, B+ Trees
https://www.udemy.com/course/database-management-indexing-course-btree/

5. Udemy – How does hashing work?
https://www.udemy.com/tutorial/data-structures-and-algorithms-bootcamp/how-does-hashing-work/

6. Udemy – SQL Tuning
https://www.udemy.com/course/sql-tuning/
14. REAL TIME APPLICATIONS IN DAY TO DAY LIFE AND TO INDUSTRY

Applications and uses of database management systems (DBMS):

Online shopping systems
Railway reservation systems
Library management systems
Banking systems
University and college management systems
Credit card transactions
Social media sites
Finance applications
15. CONTENT BEYOND SYLLABUS
DRIVE ROAMING
Complemented by hot swapping capabilities, drive roaming makes moving disks
between systems much easier by eliminating the need to keep track of which drives
were connected to which RAID controllers. With drive roaming, the system keeps
track of this for the user. This is extremely useful for applications that require
moving large amounts of data between systems quickly. In such applications, it is
much faster to move drives than to copy the data over the network.

CONTROLLER SPANNING
Controller spanning allows an array to span disks attached to multiple RAID
controllers and, in doing so, allows the creation of very large arrays and provides
high throughput rates. For example, if we have four RAID controllers and each
controller has eight drives attached, controller spanning allows the creation of an
array that spans all thirty-two disk drives. Since performance scales linearly, this
allows an extremely high I/O transfer performance with thirty-two spindles in a
single array.

DISTRIBUTED SPARING
The final enterprise-class RAID feature is distributed sparing, which creates a spare
failover drive without the need to actually include an extra drive. Essentially,
distributed sparing reserves enough disk space on all the included disks so that the
sum of the reserved space is equal to the largest drive in the array. A key advantage
of this approach versus global or dedicated sparing is that all of the drives are
actively used, resulting in significantly better performance. This is different from the
standard, non distributed approach, which sets aside one drive for failover, allowing
the possibility of a silent failure when that drive is not in use.

SUMMARY
There is some serious power in the enterprise-class RAID features that have
traditionally been available only to those with large IT budgets. Now, with the
convergence of RAID and lower priced storage technologies, this same power can be

enjoyed by any small-to-medium sized organization with critical data.


ADVANCED RAID

N-WAY MIRRORING, SPLITTING AND HIDING

Another collection of features, when used in conjunction with one another, delivers extra data protection from system threats derived from malicious users, accidental deletion or viruses. These data protection features include N-way
mirroring, array splitting and array hiding. N-way mirroring goes beyond simple two-
way mirroring, allowing additional mirrors of a data set to be created; array splitting
allows one of those mirrors to be removed from the active array; and array hiding
makes an array invisible to users and the operating system, and accessible only to a
privileged administrator.
When combined, these features allow a system administrator to create a secure
backup of the active data array. To start, the admin creates a three-way mirror using
N-way mirroring. Next, he removes one of the mirrors from the active array through
array splitting. (Note that the active array retains the data integrity provided by a
two-way mirror after one of the mirrors is removed.) Finally, he hides the split mirror
so neither users nor the operating system can see the data. Threats can't attack
what they can't see.
Should a disaster occur in the active array, the fix is simple and local. The
admin deletes the data in the corrupted array, unhides the hidden array and
transforms the active array using ORLM to include the good mirror. Please note that
the data was valid as of the time it was split off, so any modifications since that time
would not be reflected in the previously hidden array.
16. ASSESSMENT II SCHEDULE

Proposed date 4.10.2024

Actual date 4.10.2024


17. PRESCRIBED TEXT BOOKS & REFERENCE BOOKS

TEXT BOOKS:

1. Elmasri R. and S. Navathe, "Fundamentals of Database Systems", Pearson Education, 7th Edition, 2016.

2. Abraham Silberschatz, Henry F. Korth, "Database System Concepts", Tata McGraw Hill, 7th Edition, 2021.

3. Elmasri R. and S. Navathe, "Database Systems: Models, Languages, Design and Application Programming", Pearson Education, 2013.

REFERENCES:
1. Raghu Ramakrishnan, Gehrke, "Database Management Systems", McGraw Hill, 3rd Edition, 2014.
2. Plunkett T., B. Macdonald, "Oracle Big Data Handbook", McGraw Hill, First Edition, 2013.
3. Gupta G K, "Database Management Systems", Tata McGraw Hill Education Private Limited, New Delhi, 2011.
4. C. J. Date, A. Kannan, S. Swamynathan, "An Introduction to Database Systems", Eighth Edition, Pearson Education, 2015.
5. Maqsood Alam, Aalok Muley, Chaitanya Kadaru, Ashok Joshi, "Oracle NoSQL Database: Real-Time Big Data Management for the Enterprise", McGraw Hill Professional, 2013.
6. Thomas Connolly, Carolyn Begg, "Database Systems: A Practical Approach to Design, Implementation and Management", Pearson, 6th Edition, 2015.
18. MINI PROJECT SUGGESTIONS

Design a relational database for the following:

1) Insurance Management System
The insurance management project deals with adding new insurance schemes and managing the clients for the insurance. The project has complete access to the CRUD operations, that is, create, read, update and delete database entries. First add a branch and the staff members for the branch, then add a user to the database; now you can add an insurance scheme and finally make the payments for the client against the added insurance.

2) Inventory Management
The project starts by adding a seller and the details of customers. The user can then purchase new products from the desired seller and sell them to the customers; the purchasing and selling of products is reflected in the inventory section. The main aim of this mini DBMS project is to add new products, sell them, and keep an inventory to manage them.

3) Pharmacy Management System
The project starts by adding a dealer and the details of customers. The user can then purchase new medicines from the desired dealer and sell them to the customers; the purchasing and selling of medicines is reflected in the inventory section.

4) Library Management System
There will be an admin who is responsible for managing the system. The admin will track how many books are available in the library and can update the catalogue if any new books are brought to the library. Perform operations like adding a book to the database, viewing all books that were added, searching for a specific book, issuing books, and retrieving books from users.

5) Hotel Management System
Add features like providing the menu of the food items that are prepared in the hotel, along with their prices. You can add an online food-ordering facility, which will give a good impression of the project. Tables can also be booked through the project. Add a feature that shows the collection of the day, viewable only by the admin. Online booking of rooms can also be provided. Improve the project with other features.
Thank you
