
UNIT-V: Transaction Concept: Transaction State, ACID properties,
Concurrent Executions, Serializability, Recoverability, Implementation of
Isolation, Testing for Serializability, lock-based, timestamp-based and optimistic
concurrency protocols, Deadlocks, Failure Classification, Storage, Recovery and
Atomicity, Recovery algorithm.
Physical Database Design: Database File Structures, RAID
Introduction to Indexing Techniques: B+ Trees, operations on B+ Trees, Hash
Based Indexing
 Transaction: Transaction is a single logical unit of work formed by a set of
operations.
Operations in Transaction-
The main operations in a transaction are-
 Read Operation
 Write Operation
1. Read Operation-
Read operation reads the data from the database and then stores it in the buffer in
main memory.
Example- Read(A) instruction will read the value of A from the database and
will store it in the buffer in main memory.
2. Write Operation-
Write operation writes the updated data value back to the database from the buffer.
Example- Write (A) will write the updated value of A from the buffer to the
database.
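These two operations can be pictured with a small sketch in Python (an illustration written for these notes; the dictionaries standing in for the database and the main-memory buffer are invented for the example):

```python
# Minimal sketch of Read/Write moving a data item between the database
# (persistent store) and a buffer in main memory.
database = {"A": 100}   # persistent store (initial value assumed for the example)
buffer = {}             # buffer in main memory

def read(item):
    """Read(item): copy the value from the database into the buffer."""
    buffer[item] = database[item]
    return buffer[item]

def write(item):
    """Write(item): copy the updated value from the buffer back to the database."""
    database[item] = buffer[item]

read("A")           # Read(A) brings A into the buffer
buffer["A"] -= 50   # the transaction updates A in the buffer
write("A")          # Write(A) pushes the new value back to the database
print(database)     # {'A': 50}
```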
Transaction States-
A transaction goes through many different states throughout its life cycle. These
states are called as transaction states.

Transaction states are as follows-
1. Active state
2. Partially committed state
3. Committed state
4. Failed state
5. Aborted state
6. Terminated state

1. Active State-
 This is the first state in the life cycle of a transaction.
 A transaction is called in an active state as long as its instructions are
getting executed.
 All the changes made by the transaction now are stored in the buffer in
main memory.
2. Partially Committed State-
 After the last instruction of transaction has executed, it enters into
a partially committed state.
 After entering this state, the transaction is considered to be partially
committed.

 It is not considered fully committed because all the changes made by the
transaction are still stored in the buffer in main memory.
3. Committed State-
 After all the changes made by the transaction have been successfully
stored into the database, it enters into a committed state.
 Now, the transaction is considered to be fully committed.
 After a transaction has entered the committed state, it is not possible to
roll back the transaction.
 This is because the system is updated into a new consistent state.
 The only way to undo the changes is by carrying out another transaction
called as compensating transaction that performs the reverse
operations.
4. Failed State-
 When a transaction is getting executed in the active state or partially
committed state and some failure occurs due to which it becomes
impossible to continue the execution, it enters into a failed state.
5. Aborted State-
 After the transaction has failed and entered into a failed state, all the
changes made by it have to be undone.
 To undo the changes made by the transaction, it becomes necessary to
roll back the transaction.
 After the transaction has rolled back completely, it enters into
an aborted state.
6. Terminated State-
 This is the last state in the life cycle of a transaction.

 After entering the committed state or aborted state, the transaction
finally enters into a terminated state where its life cycle finally comes
to an end.

ACID Properties-
 It is important to ensure that the database remains consistent before and
after the transaction.
 To ensure the consistency of database, certain properties are followed by
all the transactions occurring in the system.
These properties are called as ACID Properties of a transaction

1. Atomicity-
 This property ensures that either the transaction occurs completely or it does not
occur at all.
 In other words, it ensures that no transaction occurs partially.
 That is why, it is also referred to as “All or nothing rule“.
 It is the responsibility of Transaction Control Manager to ensure atomicity of the
transactions.
2. Consistency-
 This property ensures that integrity constraints are maintained.

 In other words, it ensures that the database remains consistent before and after
the transaction.
 It is the responsibility of DBMS and application programmer to ensure
consistency of the database.
3. Isolation-
 This property ensures that multiple transactions can occur simultaneously
without causing any inconsistency.
 During execution, each transaction feels as if it is getting executed alone in the
system.
 A transaction is not aware that other transactions are executing in parallel.
 Changes made by a transaction become visible to other transactions only after they are written to memory.
 The resultant state of the system after executing all the transactions is same as
the state that would be achieved if the transactions were executed serially one
after the other.
 It is the responsibility of concurrency control manager to ensure isolation for all
the transactions.
4. Durability-
 This property ensures that all the changes made by a transaction after its
successful execution are written successfully to the disk.
 It also ensures that these changes exist permanently and are never lost even if
there occurs a failure of any kind.
 It is the responsibility of recovery manager to ensure durability in the database
Advantages of ACID Properties in DBMS:
 1. Data Consistency: ACID properties ensure that the data remains consistent
and accurate after any transaction execution.
 2. Data Integrity: ACID properties maintain the integrity of the data by ensuring
that any changes to the database are permanent and cannot be lost.
 3. Concurrency Control: ACID properties help to manage multiple transactions
occurring concurrently by preventing interference between them.
 4. Recovery: ACID properties ensure that in case of any failure or crash, the
system can recover the data up to the point of failure or crash.
Disadvantages of ACID Properties in DBMS:
 1. Performance: The ACID properties can cause a performance overhead in the
system, as they require additional processing to ensure data consistency and
integrity.
 2. Scalability: The ACID properties may cause scalability issues in large
distributed systems where multiple transactions occur concurrently.
 3. Complexity: Implementing the ACID properties can increase the complexity of the system and require significant expertise and resources.
Overall, the advantages of ACID properties in DBMS outweigh the disadvantages. They provide a reliable and consistent approach to data management, ensuring data integrity, accuracy, and reliability. However, in some cases, the overhead of implementing ACID properties can cause performance and scalability issues. Therefore, it’s important to balance the benefits of ACID properties against the specific needs and requirements of the system.

Concurrent Executions:
• In a multi-user system, multiple users can access and use the same database at
one time, which is known as the concurrent execution of the database. It means
that the same database is executed simultaneously on a multi-user system by
different users. While working on the database transactions, there occurs the
requirement of using the database by multiple users for performing different
operations, and in that case, concurrent execution of the database is performed.

The advantages of concurrency control are as follows −
1. Waiting time will be decreased.
2. Response time will decrease.
3. Resource utilization will increase.
4. System performance & Efficiency is increased.

Concurrency Problems in DBMS-


 When multiple transactions execute concurrently in an uncontrolled or
unrestricted manner, then it might lead to several problems.
 Such problems are called as concurrency problems.
Problems with Concurrent Execution:
1. Dirty Read Problems (W-R Conflict)
2. Unrepeatable Read Problem (R-W Conflict)
3. Lost Update Problems (W - W Conflict)
4. Phantom Read Problem

Dirty Read Problems (W-R Conflict):


• Reading the data written by an uncommitted transaction is called as dirty read.
This read is called a dirty read because the data may later be rolled back, i.e. it may never actually be committed to the database.
 Thus, an uncommitted transaction might make other transactions read a value that does not even exist.
 This leads to inconsistency of the database.

Example:

Here:
1. T1 reads the value of A.
2. T1 updates the value of A in the buffer.
3. T2 reads the value of A from the buffer.
4. T2 writes the updated value of A.
5. T2 commits.
6. T1 fails in later stages and rolls back.
In this example-
1. T2 reads the dirty value of A written by the uncommitted transaction
T1.
2. T1 fails in later stages and roll backs.
3. Thus, the value that T2 read now stands to be incorrect.
4. Therefore, database becomes inconsistent.

2. Unrepeatable Read Problem (R-W Conflict):
• This problem occurs when a transaction reads different values of the same variable in its different read operations, even though it has not updated the value itself.
• Example:

Here:
 T1 reads the value of X (= 10 say).
 T2 reads the value of X (= 10).
 T1 updates the value of X (from 10 to 15 say) in the buffer.
 T2 again reads the value of X (but = 15).
In this example-
 T2 gets to read a different value of X in its second reading.
 T2 wonders how the value of X got changed because according to it, it
is running in isolation.
3. Lost Update Problems (W - W Conflict):
 This problem occurs when multiple transactions execute concurrently
and updates from one or more transactions get lost.

Example:

Here:
 T1 reads the value of A (= 10 say).
 T1 updates the value of A (to 15 say) in the buffer.
 T2 does a blind write A = 25 (write without read) in the buffer.
 T2 commits.
 When T1 commits, it writes A = 15 to the database.
In this example-
 T1 overwrites the value of A that was already written by T2.
 Thus, the update made by T2 gets lost.
NOTE-
 This problem occurs whenever there is a write-write conflict.
In a write-write conflict, there are two writes, one by each transaction, on the same data item without any read in between.
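The lost update can also be reproduced in code: two concurrent workers doing an unguarded read-modify-write on the same item overwrite each other's changes. The sketch below is only an illustration (the shared balance and the thread workloads are invented); without any locking, some updates are usually lost to the write-write conflict.

```python
import sys
import threading

sys.setswitchinterval(1e-6)   # force very frequent thread switches to expose the race

balance = {"A": 100}          # shared data item (value invented for the example)

def add_one(times):
    for _ in range(times):
        current = balance["A"]        # read A
        balance["A"] = current + 1    # write A based on the (possibly stale) read

t1 = threading.Thread(target=add_one, args=(100_000,))
t2 = threading.Thread(target=add_one, args=(100_000,))
t1.start(); t2.start()
t1.join(); t2.join()

# If no update were lost the result would be 200100; with the unguarded
# write-write conflict the printed value is usually smaller.
print(balance["A"])
```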

4. Phantom Read Problem:
 This problem occurs when a transaction reads some variable from the
buffer and when it reads the same variable later, it finds that the variable
does not exist.
 Example:

Here:
 T1 reads X.
 T2 reads X.
 T1 deletes X.
 T2 tries reading X but does not find it.
In this example-
 T2 finds that there does not exist any variable X when it tries reading
X again.
 T2 wonders who deleted the variable X because according to it, it is
running in isolation.
Avoiding Concurrency Problems-
 To ensure consistency of the database, it is very important to prevent
the occurrence of above problems.

 Concurrency Control Protocols help to prevent the occurrence of
above problems and maintain the consistency of the database.

SERIALIZABILITY:

 A schedule where the operations of each transaction are executed consecutively, without any interference from other transactions, is called a serial schedule.
 If interleaving of operations is allowed, the result is a non-serial schedule.
SERIALIZABILITY:
 Some non-serial schedules may lead to inconsistency of the database.
 Serializability is a concept that helps to identify which non-serial schedules are
correct and will maintain the consistency of the database.
 A non-serial schedule is called a serializable schedule if it can be converted to an equivalent serial schedule. In simple words, if a non-serial schedule and a serial schedule produce the same result, then the non-serial schedule is called a serializable schedule.

Types of Serializability-
Serializability is mainly of two types-
1. Conflict Serializability
2. View Serializability
Conflict Serializability-
 If a given non-serial schedule can be converted into a serial schedule by
swapping its non-conflicting operations, then it is called as a conflict
serializable schedule.
 Conflicting Operations-
Two operations are called as conflicting operations if all the following
conditions hold true for them-
 Both the operations belong to different transactions
 Both the operations are on the same data item
 At least one of the two operations is a write operation
Example-
Consider the following schedule-

In this schedule,
W1 (A) and R2 (A) are called as conflicting operations.
This is because all the above conditions hold true for them.

Example:
Check whether the given schedule S is conflict serializable or not-
S : R1(A) , R2(A) , R1(B) , R2(B) , R3(B) , W1(A) , W2(B)
Solution-
Step-01:
List all the conflicting operations and determine the dependency between the
transactions-
R2(A) , W1(A) (T2 → T1)
R1(B) , W2(B) (T1 → T2)
R3(B) , W2(B) (T3 → T2)
Step-02:
Draw the precedence graph-
Note: R-W, W-R and W-W pairs of operations on the same data item lead to conflicts in a schedule.

Clearly, there exists a cycle in the precedence graph.


Therefore, the given schedule S is not conflict serializable.
NOTE: If a schedule is conflict serializable, then it is surely serializable and hence a consistent schedule. On the other hand, a schedule that is not conflict serializable may or may not be serializable. To check it further, we use the concept of View Serializability.
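This test is easy to mechanize. The sketch below (written for these notes; the encoding of a schedule as (transaction, operation, data item) tuples is an assumption) builds the precedence graph from the conflicting pairs and checks it for a cycle, reproducing the result for schedule S above.

```python
# Conflict-serializability test: build a precedence graph from conflicting pairs
# (R-W, W-R, W-W on the same item, different transactions) and look for a cycle.
def precedence_graph(schedule):
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                edges.add((ti, tj))   # Ti's operation conflicts with and precedes Tj's
    return edges

def has_cycle(edges):
    # Simple DFS-based cycle detection over the transaction nodes.
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    visiting, done = set(), set()
    def dfs(u):
        visiting.add(u)
        for v in graph.get(u, []):
            if v in visiting or (v not in done and dfs(v)):
                return True
        visiting.discard(u)
        done.add(u)
        return False
    return any(dfs(u) for u in graph if u not in done)

# S : R1(A), R2(A), R1(B), R2(B), R3(B), W1(A), W2(B)
S = [("T1", "R", "A"), ("T2", "R", "A"), ("T1", "R", "B"), ("T2", "R", "B"),
     ("T3", "R", "B"), ("T1", "W", "A"), ("T2", "W", "B")]
edges = precedence_graph(S)
print(edges)             # includes ('T2', 'T1'), ('T1', 'T2'), ('T3', 'T2')
print(has_cycle(edges))  # True -> S is not conflict serializable
```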

VIEW SERIALIZABILITY-
 If a given schedule is found to be view equivalent to some serial schedule, then
it is called as a view serializable schedule.
 Two schedules S1 and S2 are said to be view equivalent if the following conditions are satisfied-
1) Initial Read: If a transaction T1 reads data item A from the initial database in S1, then in S2 also T1 should read A from the initial database.
2) Updated Read: If Ti reads A which was updated by Tj in S1, then in S2 also Ti should read A which was updated by Tj.
3) Final Write: If a transaction T1 performs the final write on A in S1, then in S2 also T1 should perform the final write on A.
NOTE:
Checking Whether a Schedule is View Serializable Or Not-
Method-01:
 Check whether the given schedule is conflict serializable or not.
 If the given schedule is conflict serializable, then it is surely view serializable.
 If the given schedule is not conflict serializable, then it may or may not be view
serializable.
Method-02:
 Check if there exists any blind write operation.
(Writing without reading is called as a blind write.)
 If the schedule is not conflict serializable and there does not exist any blind write, then the schedule is surely not view serializable. Stop and report your answer.
 If there exists any blind write, then the schedule may or may not be view serializable. Go and check using other methods.
Method-03:
 In this method, try finding a view equivalent serial schedule.
 By using the above three conditions, write all the dependencies.
 Then, draw a graph using those dependencies.

 If there exists no cycle in the graph, then the schedule is view serializable
otherwise not.
Example:
Check whether the given schedule S is view serializable or not-

Solution-
We know, if a schedule is conflict serializable, then it is surely view serializable.
So, let us check whether the given schedule is conflict serializable or not.
Checking Whether S is Conflict Serializable Or Not-
Step-01:
List all the conflicting operations and determine the dependency between the
transactions-
W1(B) , W2(B) (T1 → T2)
W1(B) , W3(B) (T1 → T3)
W1(B) , W4(B) (T1 → T4)
W2(B) , W3(B) (T2 → T3)
W2(B) , W4(B) (T2 → T4)
W3(B) , W4(B) (T3 → T4)

 Clearly, there exists no cycle in the precedence graph.
 Therefore, the given schedule S is conflict serializable.
 Thus, we conclude that the given schedule is also view serializable.
Non-Serializable Schedules-
• A non-serial schedule which is not serializable is called as a non-serializable
schedule.
• A non-serializable schedule is not guaranteed to produce the same effect as produced by some serial schedule on any consistent database.
Characteristics-
Non-serializable schedules-
 may or may not be consistent
 may or may not be recoverable
Irrecoverable Schedules-
 If in a schedule,
a transaction performs a dirty read operation from an uncommitted transaction and commits before the transaction from which it has read the value, then such a schedule is known as an Irrecoverable Schedule.
Example-
Consider the following schedule-

Here,
T2 performs a dirty read operation.
T2 commits before T1.
T1 fails later and rolls back.
The value that T2 read now stands to be incorrect.
T2 cannot recover since it has already committed.
Recoverable Schedules-
 If in a schedule,
a transaction performs a dirty read operation from an uncommitted transaction and its commit operation is delayed till the uncommitted transaction either commits or rolls back, then such a schedule is known as a Recoverable Schedule.
Here,
 The commit operation of the transaction that performs the dirty read is
delayed.
 This ensures that it still has a chance to recover if the uncommitted transaction
fails later.
 Example-
 Consider the following schedule-

Here,
T2 performs a dirty read operation.
The commit operation of T2 is delayed till T1 commits or rolls back.
T1 commits later.
T2 is now allowed to commit.
In case T1 had failed, T2 would still have had a chance to recover by rolling back.
Types of Recoverable Schedules-
A recoverable schedule may be any one of these kinds-

1. Cascading Schedule
2. Cascadeless Schedule
3. Strict Schedule
CASCADING SCHEDULE-
 If in a schedule, failure of one transaction causes several other
dependent transactions to rollback or abort, then such a schedule is
called as a Cascading Schedule or Cascading Rollback or Cascading
Abort.
 It simply leads to the wastage of CPU time.

Example-

Here,
 Transaction T2 depends on transaction T1.
 Transaction T3 depends on transaction T2.
 Transaction T4 depends on transaction T3.
In this schedule,
 The failure of transaction T1 causes the transaction T2 to rollback.

 The rollback of transaction T2 causes the transaction T3 to rollback.
 The rollback of transaction T3 causes the transaction T4 to rollback.
 Such a rollback is called as a Cascading Rollback.
NOTE-
If the transactions T2, T3 and T4 had committed before the failure of transaction T1, then the schedule would have been irrecoverable.

Cascadeless Schedule-
If in a schedule, a transaction is not allowed to read a data item until the last
transaction that has written it is committed or aborted, then such a schedule
is called as a Cascadeless Schedule.
In other words,
 Cascadeless schedule allows only committed read operations.
 Therefore, it avoids cascading roll back and thus saves CPU time.
Example-

NOTE-
Cascadeless schedule allows only committed read operations.
However, it allows uncommitted write operations.
STRICT SCHEDULE-
 If in a schedule, a transaction is neither allowed to read nor write a data
item until the last transaction that has written it is committed or aborted,
then such a schedule is called as a Strict Schedule.
In other words,
 Strict schedule allows only committed read and write operations.
 Clearly, strict schedule implements more restrictions than cascadeless
schedule.
Example-

Implementation of Isolation :
• Isolation is one of the core ACID properties of a database transaction, ensuring
that the operations of one transaction remain hidden from other transactions until
completion. It means that no two transactions should interfere with each other
and affect the other's intermediate state.
• Isolation Levels

Isolation levels define the degree to which a transaction must be isolated from the data modifications made by any other transaction in the database system. There are four levels of transaction isolation defined by SQL –
1. Serializable
● The highest isolation level.
● Guarantees full serializability and ensures complete isolation of transaction
operations.
2. Repeatable Read
● This is the most restrictive isolation level.
● The transaction holds read locks on all rows it references.
● It holds write locks on all rows it inserts, updates, or deletes.
● Since other transaction cannot read, update or delete these rows, it avoids non
repeatable read.
3. Read Committed
● This isolation level allows only committed data to be read.
● Thus it does not allow a dirty read (i.e. one transaction reading data written by another transaction that has not yet committed).
● The transaction holds a read or write lock on the current row, and thus prevents other transactions from reading, updating or deleting it.
4. Read Uncommitted
● It is the lowest isolation level.
● In this level, one transaction may read not yet committed changes made by
another transaction.
● This level allows dirty reads.

The proper isolation level or concurrency control mechanism to use depends on the
specific requirements of a system and its workload. Some systems may prioritize
high throughput and can tolerate lower isolation levels, while others might require
strict consistency and higher isolation.
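In SQL, the isolation level is normally chosen per transaction (or per session) with the statement SET TRANSACTION ISOLATION LEVEL. The sketch below wraps that statement behind a small Python DB-API helper; the helper name and the commented-out connection details are invented for illustration, and which levels are actually honoured depends on the particular DBMS and driver.

```python
# Sketch: choosing an isolation level through the Python DB-API.
LEVELS = {"READ UNCOMMITTED", "READ COMMITTED", "REPEATABLE READ", "SERIALIZABLE"}

def set_isolation(conn, level):
    """Issue the standard SET TRANSACTION ISOLATION LEVEL statement."""
    if level.upper() not in LEVELS:
        raise ValueError(f"unknown isolation level: {level}")
    with conn.cursor() as cur:
        cur.execute(f"SET TRANSACTION ISOLATION LEVEL {level.upper()}")

# Hypothetical usage with a PostgreSQL connection (connection details invented):
# import psycopg2
# conn = psycopg2.connect("dbname=test user=test")
# set_isolation(conn, "REPEATABLE READ")
# ... run the transaction's queries ...
# conn.commit()
```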

Testing for Serializability:


To test the serializability of a schedule, we can use Serialization Graph or
Precedence Graph. A serialization Graph is nothing but a Directed Graph of the
entire transactions of a schedule.
It can be defined as a graph G(V, E) consisting of a set of vertices V = {V1, V2, V3, ..., Vn}, one per transaction, and a set of directed edges E = {E1, E2, E3, ..., En}. An edge is added for every conflicting pair of operations (R-W, W-R or W-W on the same data item) performed by two different transactions.

Ti -> Tj means that transaction Ti performs a conflicting read or write before transaction Tj.
NOTE: If there is a cycle present in the serialization graph, then the schedule is non-serializable, because the cycle indicates that one transaction is dependent on the other transaction and vice versa. It also means that there are one or more conflicting pairs of operations in the transactions. On the other hand, no cycle means that the non-serial schedule is serializable.

Benefits of Serializability in DBMS


1. Predictable Executions:
Since all threads are executed one at a time, there are no surprises. All variables are
updated as expected, and no data is lost or corrupted.

2. Easier to Reason about & Debug:


As each transaction is executed in isolation, it is easier to reason about what each transaction is doing and why. This can make debugging much easier, since you don't have to worry about concurrency issues.

3. Reduced Costs:
Serialization in DBMS can help reduce hardware costs by allowing fewer resources
to be used for a given computation (e.g., only one CPU instead of two). Additionally,
it can help reduce software development costs by making it easier to reason about
code and reducing the need for extensive testing with multiple threads running
concurrently.

4. Increased Performance:
In some cases, serializable executions can perform better than their non-serializable
counterparts since they allow the developer to optimize their code for performance.

Concurrency Control Protocols:


• To avoid concurrency control problems and to maintain consistency and
serializability during the execution of concurrent transactions some

rules are made. These rules are known as Concurrency Control
Protocols.

Lock-Based Protocol:
Why Do We Need Locks?
Locks are essential in a database system to ensure:
 Consistency: Without locks, multiple transactions could modify the same data
item simultaneously, resulting in an inconsistent state.
 Isolation: Locks ensure that the operations of one transaction are isolated from
other transactions, i.e., they are invisible to other transactions until the transaction
is committed.
 Concurrency: While ensuring consistency and isolation, locks also allow
multiple transactions to be processed simultaneously by the system, optimizing
system throughput and overall performance.
 Avoiding Conflicts: Locks help in avoiding data conflicts that might arise due
to simultaneous read and write operations by different transactions on the same
data item.
 Preventing Dirty Reads: With the help of locks, a transaction is prevented from
reading data that hasn't yet been committed by another transaction
Types of Locks:
There are two types of locks used -
 Shared Lock (S-lock)
This lock allows a transaction to read a data item. Multiple transactions can hold
shared locks on the same data item simultaneously. It is denoted by ’S’. This is
also called as read lock.

 Exclusive Lock (X-lock):
This lock allows a transaction to read and write a data item. If a transaction holds
an exclusive lock on an item, no other transaction can hold any kind of lock on
the same item. It is denoted as ’X’. This is also called as write lock.
Lock Compatibility Matrix :
A vital point to remember when using lock-based protocols in a Database Management System is that a Shared Lock can be held by any number of transactions. On the other hand, an Exclusive Lock can only be held by one transaction; this is because a shared lock only reads data and does not perform any other activity, whereas an exclusive lock performs read as well as write activities.
The compatibility matrix below shows that when two transactions both seek to read a specific data item, the request is granted and no conflict occurs; but when one transaction intends to write the data item and another transaction attempts to read or write it simultaneously, the request is rejected.
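That compatibility matrix can be written down directly. The sketch below simply encodes it in Python, together with a small helper (names invented for the example) that grants a requested lock only when it is compatible with every lock already held on the item.

```python
# Lock compatibility matrix: a shared lock (S) is compatible only with other
# shared locks; an exclusive lock (X) is compatible with nothing.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

def can_grant(requested, held_locks):
    """Grant the requested lock only if it is compatible with every held lock."""
    return all(COMPATIBLE[(held, requested)] for held in held_locks)

print(can_grant("S", ["S", "S"]))  # True  - many readers may share the item
print(can_grant("X", ["S"]))       # False - a writer must wait for readers
print(can_grant("S", ["X"]))       # False - readers must wait for the writer
```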

The two methods outlined below can be used to convert between the locks:
1. Conversion from a read lock to a write lock is an upgrade.
2. Conversion from a write lock to a read lock is a downgrade.
There are four types of lock protocols available:
1. Simplistic lock protocol:
 It is the simplest way of locking data during a transaction. Simplistic lock-based protocols allow all the transactions to get a lock on the data before an insert, delete or update on it. The data item is unlocked after the transaction completes.
2.Pre-claiming Lock Protocol :
 Pre-claiming Lock Protocols evaluate the transaction to list all the data items
on which they need locks.
 Before initiating an execution of the transaction, it requests DBMS for all the
locks on all those data items.
 If all the locks are granted then this protocol allows the transaction to begin.
When the transaction is completed then it releases all the locks.
 If all the locks are not granted then this protocol allows the transaction to roll
back and waits until all the locks are granted.

3.Two-phase locking (2PL)


○ The two-phase locking protocol divides the execution phase of the
transaction into three parts.
○ In the first part, when the execution of the transaction starts, it seeks
permission for the lock it requires.
○ In the second part, the transaction acquires all the locks. The third phase is
started as soon as the transaction releases its first lock.
 In the third phase, the transaction cannot demand any new locks. It only releases
the acquired locks.
There are two phases of 2PL:

 Growing phase: In the growing phase, a new lock on the data item may be
acquired by the transaction, but none can be released.
 Shrinking phase: In the shrinking phase, existing locks held by the transaction
may be released, but no new locks can be acquired.
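The two-phase rule itself can be illustrated with a toy sketch (written for these notes, not a real lock manager): once a transaction releases its first lock it enters the shrinking phase, and any later lock request is refused.

```python
# Toy illustration of the 2PL rule for a single transaction:
# growing phase = locks may be acquired, shrinking phase = only releases allowed.
class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False   # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: no new locks in the shrinking phase")
        self.locks.add(item)

    def unlock(self, item):
        self.shrinking = True    # the first release ends the growing phase
        self.locks.discard(item)

t = TwoPhaseTransaction("T1")
t.lock("A")      # growing phase
t.lock("B")      # growing phase
t.unlock("A")    # shrinking phase begins
t.unlock("B")    # allowed: further releases are fine
# t.lock("C")    # would raise: no new lock after the first release
```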

Limitations of 2PL:
Deadlock- It is a situation where transactions (T1, T2, etc) are waiting indefinitely
for locks held by each other. A circular dependency will be created in a situation
where the transactions hold locks on resources and wait for other resources that are
held by different transactions, so this needs to be avoided.
Cascading rollback- It is a phenomenon in which a single transaction failure leads
to a series of transaction rollbacks.
Locking overhead- The overhead associated with acquiring and releasing locks
can impact the overall system performance in times when there is a need for high
conflict for resources in the transaction.
Advantages of Two-Phase Locking (2PL):
● Ensures Serializability: 2PL guarantees conflict-serializability, ensuring the
consistency of the database.
● Concurrency: By allowing multiple transactions to acquire locks and release
them, 2PL increases the concurrency level, leading to better system throughput and
overall performance.

● Avoids Cascading Rollbacks: Since a transaction cannot read a value modified
by another uncommitted transaction, cascading rollbacks are avoided, making
recovery simpler.
Disadvantages of Two-Phase Locking (2PL)
● Deadlocks: The main disadvantage of 2PL is that it can lead to deadlocks, where
two or more transactions wait indefinitely for a resource locked by the other.
● Reduced Concurrency (in certain cases): Locking can block transactions,
which can reduce concurrency. For example, if one transaction holds a lock for a
long time, other transactions needing that lock will be blocked.
 Overhead: Maintaining locks, especially in systems with a large number of
items and transactions, requires overhead. There's a time cost associated with
acquiring and releasing locks, and memory overhead for maintaining the lock
table.
● Starvation: It's possible for some transactions to get repeatedly delayed if other
transactions are continually requesting and acquiring locks.

Starvation
Starvation is the situation when a transaction needs to wait for an indefinite period
to acquire a lock.
Following are the reasons for Starvation:
● When waiting scheme for locked items is not properly managed
● In the case of resource leak
● The same transaction is selected as a victim repeatedly

Deadlock: Deadlock refers to a specific situation where two or more processes are
waiting for each other to release a resource or more than two processes are waiting
for the resource in a circular chain.
Categories of Two-Phase Locking in DBMS
1. Strict Two-Phase Locking
2. Rigorous Two-Phase Locking
3. Conservative (or Static) Two-Phase Locking:
Strict Two-phase locking (Strict-2PL) :
○ The first phase of Strict-2PL is similar to 2PL. In the first phase, after acquiring
all the locks, the transaction continues to execute normally.
○ The only difference between 2PL and strict 2PL is that Strict-2PL does not release
a lock after using it.
o Strict-2PL waits until the whole transaction commits, and then it releases all the locks at once.
○ Strict-2PL protocol does not have a shrinking phase of lock release

Limitations of Strict 2PL:


 Increased lock hold time
 Lead to Deadlock
 Improper resource utilization

Rigorous Two-Phase Locking :
● In rigorous two-phase locking, a transaction holds all of its locks (both shared and exclusive) until it commits or aborts; no lock is released before the transaction ends.
● Like strict 2PL, rigorous 2PL ensures serializability and avoids cascading rollbacks, but it does not by itself prevent deadlocks.
Conservative or Static Two-Phase Locking :
● A transaction must request all the locks it will ever need before it begins
execution. If any of the requested locks are unavailable, the transaction is delayed
until they are all available.
● This approach can avoid deadlocks since transactions only start when all their
required locks are available.
Timestamp-based Protocols:
• The most commonly used concurrency protocol is the timestamp based protocol.
This protocol uses either system time or logical counter as a timestamp.
• Every transaction has a timestamp associated with it, and the ordering is
determined by the age of the transaction. A transaction created at 0002 clock
time would be older than all other transactions that come after it. For example,
any transaction 'y' entering the system at 0004 is two seconds younger and the
priority would be given to the older one.
Timestamp Ordering Protocol:
• The timestamp-ordering protocol ensures serializability among transactions in
their conflicting read and write operations. This is the responsibility of the
protocol system that the conflicting pair of tasks should be executed according
to the timestamp values of the transactions.
• If a transaction Ti issues a read(X) operation −
o If TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back.
o If TS(Ti) >= W-timestamp(X), the operation is executed and R-timestamp(X) is updated to max(R-timestamp(X), TS(Ti)).
• If a transaction Ti issues a write(X) operation −
o If TS(Ti) < R-timestamp(X), the operation is rejected and Ti is rolled back.
o If TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back.
o Otherwise, the operation is executed and W-timestamp(X) is updated to TS(Ti).
Thomas' Write Rule :
o Thomas' Write Rule modifies the timestamp-ordering check for writes: if TS(Ti) < W-timestamp(X), the write operation is simply ignored (the outdated write is skipped) rather than rejecting it and rolling Ti back.
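These rules translate almost directly into code. The following sketch (data structures and function names invented for illustration) applies the timestamp-ordering checks for read(X) and write(X), with Thomas' write rule available as an option:

```python
# Sketch of timestamp-ordering checks for read(X) / write(X).
# R_ts[x] and W_ts[x] hold the largest read/write timestamps seen so far.
R_ts, W_ts = {}, {}

def read(ts, x):
    if ts < W_ts.get(x, 0):
        return "rollback"                 # a younger transaction already wrote X
    R_ts[x] = max(R_ts.get(x, 0), ts)
    return "read executed"

def write(ts, x, thomas=False):
    if ts < R_ts.get(x, 0):
        return "rollback"                 # a younger transaction already read X
    if ts < W_ts.get(x, 0):
        # Basic protocol rolls Ti back; Thomas' write rule just ignores the write.
        return "write ignored" if thomas else "rollback"
    W_ts[x] = ts
    return "write executed"

print(write(ts=5, x="X"))                # write executed, W-timestamp(X) = 5
print(read(ts=3, x="X"))                 # rollback: TS(Ti) = 3 < W-timestamp(X) = 5
print(write(ts=4, x="X", thomas=True))   # write ignored under Thomas' write rule
```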
Optimistic:
• Validation-based protocols, also known as Optimistic Concurrency Control
(OCC), are a set of techniques that aim to increase system concurrency and
performance by assuming that conflicts between transactions will be rare. Unlike
other concurrency control methods, which try to prevent conflicts proactively
using locks or timestamps, OCC checks for conflicts only at transaction commit
time.
Here’s how a typical validation-based protocol operates:
1.Read Phase
• The transaction reads from the database but does not write to it.
• All updates are made to a local copy of the data items.
2.Validation Phase
• Before committing, the system checks to ensure that this transaction's local
updates won't cause conflicts with other transactions.

• The validation can take many forms, depending on the specific protocol.
3.Write Phase
• If the transaction passes validation, its updates are applied to the database.
• If it doesn't pass validation, the transaction is aborted and restarted.
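The three phases can be sketched as follows. This is a simplified illustration using backward validation (a committing transaction is checked against the write sets of transactions that committed after it started); the class and variable names are invented, and real systems differ in many details.

```python
# Sketch of optimistic (validation-based) concurrency control.
committed = []   # (commit_time, write_set) of transactions that have committed
clock = 0        # logical clock advanced at each commit

class OptimisticTxn:
    def __init__(self, db):
        self.db = db
        self.start = clock
        self.reads, self.local = set(), {}   # read set and local copies of writes

    def read(self, item):                    # Read phase: read db or the local copy
        self.reads.add(item)
        return self.local.get(item, self.db.get(item))

    def write(self, item, value):            # Read phase: updates stay local
        self.local[item] = value

    def commit(self):
        global clock
        # Validation phase: conflict if a transaction that committed after our
        # start wrote an item that we have read.
        for commit_time, write_set in committed:
            if commit_time > self.start and write_set & self.reads:
                return False                 # abort; the caller may restart it
        # Write phase: apply local updates and record our write set.
        self.db.update(self.local)
        clock += 1
        committed.append((clock, set(self.local)))
        return True

db = {"A": 10}
t1, t2 = OptimisticTxn(db), OptimisticTxn(db)
t1.write("A", t1.read("A") + 5)
t2.write("A", t2.read("A") + 1)
print(t1.commit())   # True  - first to validate
print(t2.commit())   # False - T1 wrote A after T2 started, and T2 read A
print(db)            # {'A': 15}
```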
Deadlock:
• A deadlock is a condition where two or more transactions are waiting indefinitely
for one another to give up locks. Deadlock is said to be one of the most feared
complications in DBMS as no task ever gets finished and is in waiting state
forever.
Example: In the student table, transaction T1 holds a lock on some rows and
needs to update some rows in the grade table. Simultaneously, transaction T2
holds locks on some rows in the grade table and needs to update the rows in the
Student table held by Transaction T1.

Following are the deadlock conditions,


1. Mutual Exclusion
2. Hold and Wait

3. No Preemption
4. Circular Wait
A deadlock may occur, if all the above conditions hold true.
Mutual Exclusion states that at least one resource cannot be used by more than one process at a time; such resources cannot be shared between processes.
Hold and Wait states that a process is holding a resource, requesting for additional
resources which are being held by other processes in the system.
No Preemption states that a resource cannot be forcibly taken from a process. Only
a process can release a resource that is being held by it.
Circular Wait states that one process is waiting for a resource which is being held
by second process and the second process is waiting for the third process and so on
and the last process is waiting for the first process. It makes a circular chain of
waiting.
Deadlock Prevention
Deadlock Prevention ensures that the system never enters a deadlock state.
Following are the requirements to free the deadlock:
1. No Mutual Exclusion: No mutual exclusion means making the resources sharable, so that no resource has to be held exclusively by a single process at a time.
2. No Hold and Wait: The hold and wait condition can be removed if a process acquires all the resources it needs before starting its execution.
3. Allow Preemption: A resource can be forcibly taken away from a process; the state of the preempted resource only needs to be saved and restored later for the preempted process.
4. Removing Circular Wait: The circular wait can be removed only if the resources are maintained in a hierarchy and every process requests resources in increasing order of precedence.

Deadlock Avoidance
 Deadlock avoidance helps in avoiding the rolling back of conflicting transactions.
 It is not a good approach to abort a transaction only after a deadlock has occurred.
 Rather, deadlock avoidance should be used to detect any potential deadlock situation in advance (for example, by checking a wait-for graph for a cycle, as sketched below).
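The wait-for graph mentioned above can be built and checked in a few lines: add an edge Ti -> Tj whenever Ti is waiting for a lock held by Tj, and a cycle means deadlock. The sketch below is only an illustration, with edges corresponding to the student/grade example from earlier.

```python
# Wait-for graph: an edge (Ti, Tj) means Ti is waiting for a lock held by Tj.
# A cycle in this graph corresponds to the circular-wait condition, i.e. deadlock.
def deadlocked(wait_for):
    graph = {}
    for waiter, holder in wait_for:
        graph.setdefault(waiter, []).append(holder)
    visiting, done = set(), set()
    def dfs(node):
        visiting.add(node)
        for nxt in graph.get(node, []):
            if nxt in visiting or (nxt not in done and dfs(nxt)):
                return True
        visiting.discard(node)
        done.add(node)
        return False
    return any(dfs(n) for n in graph if n not in done)

# T1 waits for grade-table rows locked by T2; T2 waits for student-table rows locked by T1.
print(deadlocked([("T1", "T2"), ("T2", "T1")]))   # True  - circular wait, deadlock
print(deadlocked([("T1", "T2"), ("T2", "T3")]))   # False - just a waiting chain
```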
Failure Classification:
• Failure in terms of a database can be defined as its inability to execute the
specified transaction or loss of data from the database. A DBMS is vulnerable to
several kinds of failures and each of these failures needs to be managed
differently. There are many reasons that can cause database failures such as
network failure, system crash, natural disasters, carelessness, sabotage(corrupting
the data intentionally), software errors, etc.
• A failure in DBMS can be classified as:

Transaction Failure:
 If a transaction is not able to execute or it comes to a point from where the
transaction becomes incapable of executing further then it is termed as a failure
in a transaction.
Reason for a transaction failure in DBMS:
 Logical error: A logical error occurs if a transaction is unable to execute because
of some mistakes in the code or due to the presence of some internal faults.
 System error: Where the termination of an active transaction is done by the
database system itself due to some system issue or because the database
management system is unable to proceed with the transaction. For example– The
system ends an operating transaction if it reaches a deadlock condition or if there
is an unavailability of resources.
System Crash:
 A system crash usually occurs when there is some sort of hardware or software
breakdown. Some other problems which are external to the system and cause
the system to abruptly stop or eventually crash include failure of the transaction,
operating system errors, power cuts, main memory crash, etc.
 These types of failures are often termed soft failures and are responsible for the
data losses in the volatile memory. It is assumed that a system crash does not
have any effect on the data stored in the non-volatile storage and this is known
as the fail-stop assumption.
Data-transfer Failure:
 When a disk failure occurs amid data-transfer operation resulting in loss of
content from disk storage then such failures are categorized as data-transfer
failures. Some other reason for disk failures includes disk head crash, disk
unreachability, formation of bad sectors, read-write errors on the disk, etc.
 In order to quickly recover from a disk failure caused amid a data-transfer
operation, the backup copy of the data stored on other tapes or disks can be used.
Thus it’s a good practice to backup your data frequently.
Storage:
• A database system provides an ultimate view of the stored data. However, data in
the form of bits, bytes get stored in different storage devices.
Types of Data Storage:
• For storing the data, there are different types of storage options available. These
storage types differ from one another as per the speed and accessibility. There are
the following types of storage devices used for storing the data:
1. Primary Storage
2. Secondary Storage
3. Tertiary Storage

Primary Storage:
 It is the primary area that offers quick access to the stored data. We also know the
primary storage as volatile storage. It is because this type of memory does not
permanently store the data. As soon as the system leads to a power cut or a crash,
the data also get lost. Main memory and cache are the types of primary storage.

o Main Memory: It holds the data that is currently being operated on and handles each instruction of the computer. This type of memory can store gigabytes of data on a system, but it is generally too small to hold an entire database. The main memory loses its whole content if the system shuts down because of a power failure or other reasons.

o Cache: It is one of the costliest storage media but also the fastest one. A cache is a tiny storage medium that is usually maintained by the computer hardware. While designing algorithms and query processors for data structures, designers take cache effects into account.

Secondary Storage:
 Secondary storage is also called as Online storage. It is the storage area that allows the user to save and store data permanently. This type of memory does not lose the data due to any power failure or system crash; that is why it is also called non-volatile storage. Examples are magnetic disks, optical disks (DVD, CD, etc.), hard disks, flash drives, and magnetic tapes.
 Flash Memory: Flash memory stores data in devices such as USB (Universal Serial Bus) keys, which are plugged into the USB slots of a computer system. These USB keys help transfer data to a computer system and come in various capacities. Unlike main memory, flash memory retains the stored data even after a power cut or other failure. This type of memory storage is most commonly used in server systems for caching frequently used data; this leads the systems towards high performance, and it is capable of storing larger amounts of data than the main memory.
 Magnetic Disk Storage: This type of storage media is also known as online
storage media. A magnetic disk is used for storing the data for a long time. It is
capable of storing an entire database. It is the responsibility of the computer system
to make availability of the data from a disk to the main memory for further
accessing. Also, if the system performs any operation over the data, the modified
data should be written back to the disk. The tremendous capability of a magnetic

disk is that it does not affect the data due to a system crash or failure, but a disk
failure can easily ruin as well as destroy the stored data.
Tertiary Storage:
 It is the storage type that is external from the computer system. It has the slowest
speed. But it is capable of storing a large amount of data. It is also known as Offline
storage. Tertiary storage is generally used for data backup. Optical disks and
magnetic tapes are widely used as tertiary storage.
 Optical Storage: An optical storage can store megabytes or gigabytes of data. A
Compact Disk (CD) can store 700 megabytes of data with a playtime of around 80
minutes. On the other hand, a Digital Video Disk or a DVD can store 4.7 or 8.5
gigabytes of data on each side of the disk.
 Tape Storage: It is the cheapest storage medium than disks. Generally, tapes are
used for archiving or backing up the data. It provides slow access to data as it
accesses data sequentially from the start. Thus, tape storage is also known as
sequential-access storage. Disk storage is known as direct-access storage as we
can directly access the data from any location on disk.
Storage Hierarchy
 Besides the above, various other storage devices reside in the computer system.
These storage media are organized on the basis of data accessing speed, cost per
unit of data to buy the medium, and by medium's reliability. Thus, we can create
a hierarchy of storage media on the basis of its cost and speed.
 Thus, on arranging the above-described storage media in a hierarchy according to
its speed and cost, we conclude the below-described image:

 In this hierarchy, the higher levels are expensive but fast. On moving down, the cost per bit decreases and the access time increases. Also, the storage media from the main memory upwards are volatile in nature, while below the main memory all devices are non-volatile.
Recovery and Atomicity:
Introduction: Data may be monitored, stored, and changed rapidly and effectively using a DBMS (Database Management System). A database possesses atomicity, consistency, isolation, and durability qualities. The ability of a system to preserve data and changes made to data defines its durability. A database could fail for any of the following reasons:
○ System breakdowns occur as a result of hardware or software issues in the system.
○ Transaction failures arise when a certain process dealing with data updates cannot be completed.
○ Disk crashes may occur as a result of the system's failure to read the disk.
○ Physical damages include issues such as power outages or natural disasters.
The data in the database must be recoverable to the state it was in prior to the system failure, even if the database system fails. In such situations, database recovery procedures in DBMS are employed to retrieve the data.

The recovery procedures in DBMS ensure the database's atomicity and durability.
If a system crashes in the middle of a transaction and all of its data is lost, it is not
regarded as durable. If just a portion of the data is updated during the transaction,
it is not considered atomic. Data recovery procedures in DBMS make sure that the
data is always recoverable to protect the durability property and that its state is
retained to protect the atomic property. The procedures listed below are used to
recover data from a DBMS,
○ Recovery based on logs.
○ Recovery through Deferred Update
○ Immediate Recovery via Immediate Update
The atomicity attribute of DBMS safeguards the data state. If a data modification
is performed, the operation must be completed entirely, or the data's state must be
maintained as if the manipulation never occurred. This characteristic may be
impacted by DBMS failure brought on by transactions, but DBMS recovery
methods will protect it.
 Recovery using Log records
Log-based recovery is a method used in database systems to restore the database
to a consistent state after a crash or failure. The process uses a transaction log,
which keeps a record of all operations performed on the database, including
updates, inserts, deletes, and transaction states (start, commit, or abort).
 Recovery through Deferred Update :
Deferred Update, changes are not immediately applied to the database. Instead,
updates are written to the log first, and only after a transaction commits, are those
changes applied to the database.
 Immediate Recovery via Immediate Update:
In Immediate Update, the changes are applied to the database immediately as
the transaction executes, but the changes are still logged for recovery purposes.
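A toy version of log-based recovery with immediate updates makes the idea concrete. The sketch below is an illustration only (the log format and values are invented, and real algorithms such as ARIES are far more involved): committed transactions are redone from the new values in the log, and uncommitted ones are undone from the old values.

```python
# Sketch of log-based recovery (immediate update with write-ahead logging).
# Each update record carries the old value (for undo) and the new value (for redo).
log = [
    ("start", "T1"),
    ("update", "T1", "A", 100, 150),   # (txn, item, old_value, new_value)
    ("commit", "T1"),
    ("start", "T2"),
    ("update", "T2", "B", 50, 80),
    # crash happens here: T2 never committed
]
database = {"A": 150, "B": 80}          # on-disk state at the time of the crash

def recover(log, db):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Redo pass: reapply updates of committed transactions, in log order.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, txn, item, old, new = rec
            db[item] = new
    # Undo pass: roll back updates of uncommitted transactions, in reverse order.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, txn, item, old, new = rec
            db[item] = old
    return db

print(recover(log, database))   # {'A': 150, 'B': 50} - T1 redone, T2 undone
```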

Recovery Algorithm:
• a) Analysis Pass: This pass determines which transactions to undo, which pages
were dirty at the time of the crash, and the LSN(Log Sequence Numbers) from
which the redo pass should start.
• b) Redo Pass: This pass starts from a position determined during analysis, and
performs a redo, repeating history, to bring the database to a state it was in before
the crash.
• c) Undo Pass: This pass rolls back all transactions that were incomplete at the
time of the crash.

 When the system is restarted, the Analysis phase identifies T1 and T3 as transactions that were active at the time of the crash, and therefore to be undone; T2 as a
committed transaction, and all its actions, therefore, to be written to disk; and P1,
P3, and P5 as potentially dirty pages. All the updates (including those of T1 and
T3) are reapplied in the order shown during the Redo phase. Finally, the actions
of T1 and T3 are undone in reverse order during the Undo phase; that is, T3's write
of P3 is undone, T3's write of P1 is undone, and then T1's write of P5 is undone.
There are three main principles behind the ARIES recovery algorithm:
 Write-ahead logging: Any change to a database object is first recorded in the log;
the record in the log must be written to stable storage before the change to the
database object is written to disk.

 Repeating history during Redo: Upon restart following a crash, ARIES retraces
all actions of the DBMS before the crash and brings the system back to the exact
state that it was in at the time of the crash. Then, it undoes the actions of
transactions that were still active at the time of the crash (effectively aborting
them).
 Logging changes during Undo: Changes made to the database while undoing a
transaction are logged in order to ensure that such an action is not repeated in
the event of repeated (failures causing) restarts.

Physical Database Design:


DATABASE FILE STRUCTURES:
 A database consists of a huge amount of data. The data is grouped within a table
in RDBMS, and each table has related records. A user can see that the data is
stored in the form of tables, but in actuality, this huge amount of data is stored in
physical memory in the form of files.
 A file is a named collection of related information that is recorded on secondary storage such as magnetic disks, magnetic tapes, and optical disks.
Objective of File Organization:
 It helps in the faster selection of records i.e. it makes the process faster.
 Different Operations like inserting, deleting, and updating different records are
faster and easier.
 It prevents us from inserting duplicate records via various operations.
 It helps in storing the records or the data very efficiently at a minimal cost
Types of File Organizations
1. Sequential File Organization
2. Heap File Organization
3. Hash File Organization
4. B+ Tree File Organization
5. Clustered File Organization
6. ISAM (Indexed Sequential Access Method)
1.SEQUENTIAL FILE ORGANIZATION
 The easiest method for file organization is the sequential method. In this method, the records are stored one after another in a sequential manner. There are two ways to implement this method:
1.Pile File Method:
 This method is quite simple, in which we store the records in a sequence i.e. one
after the other in the order in which they are inserted into the tables.

Insertion of a new record: Suppose the sequence already contains the records R1, R3, R5 and R4 (here, a record is nothing but a row in a table). If a new record R2 has to be inserted, it is simply placed at the end of the file.

2. Sorted File Method:
In this method, As the name itself suggests whenever a new record has to be
inserted, it is always inserted in a sorted (ascending or descending) manner. The
sorting of records may be based on any primary key or any other key.

Insertion of a new record: Let us assume that there is a pre-existing sorted sequence of records R1, R3, R7 and R8. Suppose a new record R2 has to be inserted; it will first be inserted at the end of the file and then the sequence will be sorted again.

Advantages of Sequential File Organization


 Fast and efficient method for huge amounts of data.
 Simple design.
 Files can be easily stored in magnetic tapes, i.e. a cheaper storage mechanism.
Disadvantages of Sequential File Organization
 Time wastage as we cannot jump on a particular record that is required, but we
have to move in a sequential manner which takes our time.
 The sorted file method is inefficient as it takes time and space for sorting
records.

2.HEAP FILE ORGANIZATION:
Heap File Organization works with data blocks. In this method, records are
inserted at the end of the file, into the data blocks. No Sorting or Ordering is
required in this method. If a data block is full, the new record is stored in some
other block, Here the other data block need not be the very next data block, but it
can be any block in the memory. It is the responsibility of DBMS to store and
manage the new records.

Insertion of a new record: Suppose the heap contains the records R1, R5, R6, R4, and R3, and a new record R2 has to be inserted. Since the last data block, i.e. data block 3, is full, R2 will be inserted into any data block selected by the DBMS, let's say data block 1.

If we want to search, delete or update data in the heap file Organization we will
traverse the data from the beginning of the file till we get the requested record.
Thus if the database is very huge, searching, deleting, or updating the record will
take a lot of time.
Advantages of Heap File Organization:
 Fetching and retrieving records is faster than sequential records but only in the
case of small databases.
 When there is a huge amount of data that needs to be loaded into the database at a time, then this method of file organization is best suited.
Disadvantages of Heap File Organization:
 The problem of unused memory blocks.
 Inefficient for larger databases.
3.HASH FILE ORGANIZATION:
Hash File Organization uses the computation of hash function on some fields of the
records. The hash function's output determines the location of disk block where the
records are to be placed.

When a record has to be received using the hash key columns, then the address is
generated, and the whole record is retrieved using that address. In the same way,
when a new record has to be inserted, then the address is generated using the hash
key and record is directly inserted. The same process is applied in the case of delete
and update.

In this method, there is no need to search or sort the entire file; each record is stored at the location computed by the hash function rather than in sequential order.
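The placement rule can be expressed in a few lines of Python: the hash of the key column chooses the data block, and the same computation is reused for search, insert, update and delete. The block count and sample records below are invented for the example.

```python
# Sketch of hash file organization: the hash of the key decides the data block.
NUM_BLOCKS = 4
blocks = [[] for _ in range(NUM_BLOCKS)]    # each block holds a list of records

def block_for(key):
    return hash(key) % NUM_BLOCKS           # hash function -> block address

def insert(record):
    blocks[block_for(record["id"])].append(record)

def search(key):
    # Only the one block selected by the hash function has to be scanned.
    for record in blocks[block_for(key)]:
        if record["id"] == key:
            return record
    return None

insert({"id": 101, "name": "A"})
insert({"id": 205, "name": "B"})
print(search(101))   # found directly in its block, with no full-file scan
```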

4.B+ FILE ORGANIZATION:

o B+ tree file organization is the advanced method of an indexed sequential access


method. It uses a tree-like structure to store records in File.
o It uses the same concept of key-index where the primary key is used to sort the
records. For each primary key, the value of the index is generated and mapped
with the record.
o The B+ tree is similar to a binary search tree (BST), but it can have more than two
children. In this method, all the records are stored only at the leaf node.
Intermediate nodes act as a pointer to the leaf nodes. They do not contain any
records.

The above B+ tree shows that:

o There is one root node of the tree, i.e., 25.


o There is an intermediary layer with nodes. They do not store the actual record.
They have only pointers to the leaf node.
o The nodes to the left of the root node contain values smaller than the root and the nodes to the right contain values greater than the root, i.e., 15 and 30 respectively.
o The leaf nodes contain only the actual values, i.e., 10, 12, 17, 20, 24, 27 and 29.
o Searching for any record is easier as all the leaf nodes are balanced.
o In this method, searching any record can be traversed through the single path and
accessed easily.
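Searching such a tree follows a single root-to-leaf path: internal nodes only route the search and the values are found in the leaves. The sketch below is a simplified, hand-built version of the example tree (one extra leaf is invented to keep the shape valid); it is an illustration, not a full B+ tree implementation.

```python
# Simplified B+ tree nodes: internal nodes route the search with their keys,
# leaf nodes hold the actual values in sorted order.
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children      # None for a leaf node
    @property
    def is_leaf(self):
        return self.children is None

def search(node, key):
    while not node.is_leaf:
        # Choose the child whose key range contains the search key.
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    return key in node.keys           # values live only in the leaves

# Hand-built version of the example: root 25, intermediate 15 / 30, and the
# values 10, 12, 17, 20, 24, 27, 29 in the leaves (leaf [31, 35] is invented).
tree = Node([25], [
    Node([15], [Node([10, 12]), Node([17, 20, 24])]),
    Node([30], [Node([27, 29]), Node([31, 35])]),
])
print(search(tree, 20))   # True  - found by following one root-to-leaf path
print(search(tree, 23))   # False
```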

Advantages:

o In this method, searching becomes very easy as all the records are stored only in the leaf nodes, which are kept sorted and linked in a sequential linked list.
o Traversing through the tree structure is easier and faster.
o The size of the B+ tree has no restrictions, so the number of records can increase
or decrease and the B+ tree structure can also grow or shrink.
o It is a balanced tree structure, and any insert/update/delete does not affect the
performance of the tree.

Disadvantages:

o This method is inefficient for static tables, where the extra tree structure only adds overhead.



5.Indexed sequential access method (ISAM):
ISAM method is an advanced sequential file organization. In this method,
records are stored in the file using the primary key. An index value is
generated for each primary key and mapped with the record. This index
contains the address of the record in the file.

If any record has to be retrieved based on its index value, then the address of
the data block is fetched and the record is retrieved from the memory.

Advantages:

o In this method, since each record has the address of its data block, searching for a
record in a huge database is quick and easy.

o This method supports range retrieval and partial retrieval of records.


Since the index is based on the primary key values, we can retrieve the
data for a given range of values. In the same way, partial values can
also be easily searched, e.g., student names starting with 'JA' can be
easily found.



Disadvantages:

o This method requires extra space on the disk to store the index values.
o When new records are inserted, the file has to be reconstructed to
maintain the sequence.
o When a record is deleted, the space used by it needs to be
released. Otherwise, the performance of the database will slow down.

RAID:
RAID, or Redundant Array of Independent Disks, is a technology to connect multiple
secondary storage devices and use them as a single storage medium.
RAID consists of an array of disks in which multiple disks are connected together to
achieve different goals. RAID levels define the use of disk arrays.
RAID 0
In this level, a striped array of disks is implemented. The data is broken down into
blocks and the blocks are distributed among disks. Each disk receives a block of data
to write/read in parallel. It enhances the speed and performance of the storage device.
There is no parity and backup in Level 0.
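A small sketch (an illustration, not the notes' own material) of block-level striping: the disk count is an assumption, and each data block simply goes to the next disk in round-robin order.

```python
# Hypothetical RAID 0 striping sketch; the number of disks is assumed.
NUM_DISKS = 3

def stripe(data_blocks):
    """Distribute blocks round-robin across the disks (no parity, no copies)."""
    disks = [[] for _ in range(NUM_DISKS)]
    for i, block in enumerate(data_blocks):
        disks[i % NUM_DISKS].append(block)
    return disks

print(stripe(["B0", "B1", "B2", "B3", "B4", "B5"]))
# -> [['B0', 'B3'], ['B1', 'B4'], ['B2', 'B5']]
# Each disk can read/write its blocks in parallel, but losing any one disk loses data.
```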

 RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it
sends a copy of data to all the disks in the array. RAID level 1 is also
called mirroring and provides 100% redundancy in case of a failure.



 RAID 2
RAID 2 records Error Correction Codes using Hamming distance for its data,
striped across different disks. As in level 0, each data bit in a word is recorded on a
separate disk, and the ECC codes of the data words are stored on a different set of
disks. Due to its complex structure and high cost, RAID 2 is not commercially available.

 RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for each data
word is stored on a dedicated parity disk. This technique makes it possible to recover
from single-disk failures.

 RAID 4
In this level, an entire block of data is written onto data disks and then the parity
is generated and stored on a different disk. Note that level 3 uses byte-level
striping, whereas level 4 uses block-level striping. Both level 3 and level 4 require
at least three disks to implement RAID.



 RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity bits
generated for a data block stripe are distributed among all the data disks rather than
being stored on a separate dedicated disk.
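The parity used in levels 3-6 is typically an XOR over the blocks of a stripe; the short sketch below (an assumption-laden illustration, not part of the notes) shows how XOR parity lets a missing block be rebuilt from the surviving blocks.

```python
from functools import reduce

def parity(blocks):
    """XOR the equally sized blocks of a stripe to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

# One stripe of three data blocks (contents are made up).
d0, d1, d2 = b"\x0a\x10", b"\x03\x20", b"\x0c\x01"
p = parity([d0, d1, d2])

# If the disk holding d1 fails, XORing the survivors with the parity rebuilds it.
assert parity([d0, d2, p]) == d1
# In RAID 5 the parity block's disk rotates from stripe to stripe instead of
# living on one dedicated disk (as in RAID 4), avoiding a parity bottleneck.
```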

 RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are
generated and stored in distributed fashion among multiple disks. Two parities
provide additional fault tolerance. This level requires at least four disk drives to
implement RAID.



B+ Tree
o The B+ tree is a balanced search tree in which a node can have more than two
children. It follows a multi-level index format.
o In the B+ tree, leaf nodes denote actual data pointers. The B+ tree ensures that all
leaf nodes remain at the same level.
o In the B+ tree, the leaf nodes are linked using a link list. Therefore, a B+ tree can
support random access as well as sequential access.

Structure of B+ Tree

o In the B+ tree, every leaf node is at an equal distance from the root node. The B+
tree is of order n, where n is fixed for a given B+ tree.
o It contains an internal node and leaf node.



Internal node

o An internal node of the B+ tree contains at least n/2 pointers to child nodes, except
the root node.
o At most, an internal node of the tree contains n pointers.

Leaf node

o The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key
values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to the next leaf
node.

Searching a record in B+ Tree

Suppose we have to search for 55 in the below B+ tree structure. First, we will look
at the intermediary node, which will direct us to the leaf node that can contain the
record for 55.

So, in the intermediary node, we will find the branch between 50 and 75. Then, at
the end, we will be redirected to the third leaf node. Here the DBMS will perform a
sequential search to find 55.
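The sketch below (illustrative only; the node layout is an assumption chosen to match the example) shows that a search is one pass down the routing keys followed by a sequential scan inside a leaf.

```python
# Hypothetical one-level B+ tree resembling the example above.
root = {
    "keys": [50, 75],                   # routing keys in the intermediary node
    "children": [
        [30, 40, 45],                   # leaf: values < 50
        [50, 55, 60, 65, 70],           # leaf: 50 <= values < 75
        [75, 80, 85],                   # leaf: values >= 75
    ],
}

def bplus_search(node, key):
    # Pick the branch between the routing keys, then scan that leaf sequentially.
    for i, k in enumerate(node["keys"]):
        if key < k:
            return key in node["children"][i]
    return key in node["children"][-1]

print(bplus_search(root, 55))           # -> True (takes the branch between 50 and 75)
```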

B+ Tree Insertion

Suppose we want to insert a record 60 in the below structure. It will go to the 3rd
leaf node, after 55. It is a balanced tree, and that leaf node is already full, so we
cannot insert 60 there.



In this case, we have to split the leaf node, so that it can be inserted into tree
without affecting the fill factor, balance and order.

The 3rd leaf node has the values (50, 55, 60, 65, 70), and its parent routing key is
50. We will split the leaf node in the middle so that its balance is not altered. So we
can group (50, 55) and (60, 65, 70) into two leaf nodes.

If these two have to be leaf nodes, the intermediate node cannot branch on 50 alone.
It should have 60 added to it, and then it can have a pointer to the new leaf node.

This is how we can insert an entry when there is overflow. In a normal scenario,
it is very easy to find the node where it fits and then place it in that leaf node.
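A minimal sketch of such a leaf split (assuming an order where a leaf holds at most four keys; not the notes' own code):

```python
# Hypothetical leaf split; MAX_KEYS is an assumed order for the illustration.
MAX_KEYS = 4

def insert_into_leaf(leaf, key):
    leaf = sorted(leaf + [key])
    if len(leaf) <= MAX_KEYS:
        return leaf, None, None          # no overflow, nothing to promote
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    # The first key of the right leaf is copied up into the intermediate node.
    return left, right, right[0]

left, right, promoted = insert_into_leaf([50, 55, 65, 70], 60)
print(left, right, promoted)             # -> [50, 55] [60, 65, 70] 60
```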

B+ Tree Deletion

Suppose we want to delete 60 from the above example. In this case, we have to
remove 60 from the intermediate node as well as from the 4th leaf node. If we
remove it from the intermediate node, then the tree will not satisfy the rule of the
B+ tree. So we need to modify it to have a balanced tree.

After deleting node 60 from above B+ tree and re-arranging the nodes, it will
show as follows:

Hash-Based Indexing:
In hash-based indexing, a hash function is used to convert a key into a hash code.
This hash code serves as an index where the value associated with that key is
stored. The goal is to distribute the keys uniformly across an array, so that access
time is, on average, constant.
Let's break down some of these elements to further understand how hash-based
indexing works in practice:

Buckets
In hash-based indexing, the data space is divided into a fixed number of slots
known as "buckets." A bucket usually contains a single page (also known as a
block), but it may have additional pages linked in a chain if the primary page
becomes full. This is known as overflow.

Hash Function
The hash function is a mapping function that takes the search key as an input and
returns the bucket number where the record should be located. Hash functions aim
to distribute records uniformly across buckets to minimize the number of collisions
(two different keys hashing to the same bucket).



Disk I/O Efficiency
Hash-based indexing is particularly efficient when it comes to disk I/O operations.
Given a search key, the hash function quickly identifies the bucket (and thereby
the disk page) where the desired record is located. This often requires only one or
two disk I/Os, making the retrieval process very fast.

Insert Operations
When a new record is inserted into the dataset, its search key is hashed to find the
appropriate bucket. If the primary page of the bucket is full, an additional overflow
page is allocated and linked to the primary page. The new record is then stored on
this overflow page.

Search Operations
To find a record with a specific search key, the hash function is applied to the
search key to identify the bucket. All pages (primary and overflow) in that bucket
are then examined to find the desired record.

Limitations
Hash-based indexing is not suitable for range queries or when the search key is not
known. In such cases, a full scan of all pages is required, which is resource-
intensive.

Hash-Based Indexing Example


Let's consider a simple example using employee names as the search key.
Employee Records



| Name  | Age | Salary |
|-------|-----|--------|
| Alice | 28  | 50000  |
| Bob   | 35  | 60000  |
| Carol | 40  | 70000  |
Hash Function: H(x) = ASCII value of first letter of the name mod 3

 Alice: 65 mod 3 = 2
 Bob: 66 mod 3 = 0
 Carol: 67 mod 3 = 1
Buckets:
Bucket 0: Bob
Bucket 1: Carol
Bucket 2: Alice
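The same calculation can be reproduced in a few lines of Python (the function and data simply restate the example above):

```python
# H(x) = ASCII value of the first letter of the name mod 3
def h(name):
    return ord(name[0]) % 3

buckets = {0: [], 1: [], 2: []}
for name in ["Alice", "Bob", "Carol"]:
    buckets[h(name)].append(name)

print(buckets)   # -> {0: ['Bob'], 1: ['Carol'], 2: ['Alice']}
```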

Advantages:

 Extremely fast for exact match queries.


 Well-suited for equality comparisons.

Disadvantages:

 Not suitable for range queries (e.g., "SELECT * FROM table WHERE age
BETWEEN 20 AND 30").
 Performance can be severely affected by poor hash functions or a large number of
collisions.



Hashing:
In a huge database structure, it is very inefficient to search all the index
values and reach the desired data. Hashing technique is used to calculate
the direct location of a data record on the disk without using index structure.

In this technique, data is stored in the data blocks whose address is generated by
using the hashing function. The memory location where these records are stored is
known as a data bucket or data block.

In this technique, the hash function can use any column value to generate the
address. Most of the time, the hash function uses the primary key to generate the
address of the data block. The hash function can be anything from a simple
mathematical function to a complex one. We can even consider the primary key
itself as the address of the data block; that means each row is stored in the data
block whose address is the same as its primary key value.

The above diagram shows data block addresses that are the same as the primary
key values. The hash function can also be a simple mathematical function such as
mod, exponential, cos, sin, etc. Suppose we have a mod(5) hash function to
determine the address of the data block. In this case, it applies the mod(5) hash
function to the primary keys and generates 3, 3,
1, 4 and 2 respectively, and records are stored in those data block
addresses.

Types of Hashing:

STATIC HASHING:
In static hashing, the resultant data bucket address will always be the same. That
means if we generate an address for EMP_ID = 103 using the hash function mod(5),
then it will always result in the same bucket address, 3. Here, there will be no
change in the bucket address.

Hence in this static hashing, the number of data buckets in memory remains
constant throughout. In this example, we will have five data buckets in the memory
used to store the data.
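A short sketch of this behaviour (the EMP_ID values are assumptions chosen so that mod(5) produces the addresses 3, 3, 1, 4 and 2 mentioned above):

```python
# Static hashing with a fixed number of buckets and the mod(5) hash function.
NUM_BUCKETS = 5
buckets = {i: [] for i in range(NUM_BUCKETS)}

def address(emp_id):
    return emp_id % NUM_BUCKETS          # always the same bucket for the same key

for emp_id in (103, 108, 106, 104, 107):
    buckets[address(emp_id)].append(emp_id)

print(address(103))                      # -> 3, every time it is computed
print(buckets)                           # five buckets, fixed for the life of the file
```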



Operations of Static Hashing

o Searching a record
When a record needs to be searched, then the same hash function retrieves the
address of the bucket where the data is stored.

o Insert a Record
When a new record is inserted into the table, an address is generated for the new
record based on the hash key, and the record is stored at that location.

o Delete a Record
To delete a record, we will first fetch the record which is supposed to be deleted.
Then we will delete the record at that address in memory.

o Update a Record
To update a record, we will first search it using a hash function, and then the data
record is updated.

If we want to insert a new record into the file, but the address of the data bucket
generated by the hash function is not empty (data already exists at that address),
this situation in static hashing is known as bucket overflow. This is a critical
situation in this method.

To overcome this situation, there are various methods. Some commonly used
methods are as follows:
1. Open Hashing

When a hash function generates an address at which data is already stored, the
next bucket is allocated to the new record. This mechanism is called Linear
Probing.

For example: suppose R3 is a new record which needs to be inserted, and the hash
function generates address 112 for R3. But the generated address is already full, so
the system searches the next available data bucket, 113, and assigns R3 to it.
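A compact sketch of linear probing (the existing occupants of buckets 110-112 are assumptions; only bucket 112 being full matters for the example):

```python
# Hypothetical linear-probing sketch; bucket contents are assumed.
buckets = {110: "R1", 111: "R7", 112: "R9", 113: None, 114: None}

def insert_linear_probe(addr, record):
    # If the generated address is occupied, try the following buckets in order.
    while buckets.get(addr) is not None:
        addr += 1
    buckets[addr] = record
    return addr

print(insert_linear_probe(112, "R3"))    # -> 113: bucket 112 was full
```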

2. Close Hashing

When a bucket is full, a new data bucket is allocated for the same hash result
and is linked after the previous one. This mechanism is known as Overflow
chaining.

For example: Suppose R3 is a new record which needs to be inserted into the
table, and the hash function generates address 110 for it. But this bucket is already
full. In this case, a new bucket is allocated after bucket 110 and is linked to it.
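A matching sketch of overflow chaining (the page capacity and existing records are assumptions): when the primary page of bucket 110 is full, a new page is linked behind it.

```python
# Hypothetical overflow-chaining sketch; capacity and contents are assumed.
PAGE_CAPACITY = 2
buckets = {110: [["R8", "R6"]]}          # bucket 110 already has one full page

def insert_chained(addr, record):
    pages = buckets.setdefault(addr, [[]])
    if len(pages[-1]) >= PAGE_CAPACITY:
        pages.append([])                 # allocate an overflow page linked after the last one
    pages[-1].append(record)

insert_chained(110, "R3")
print(buckets[110])                      # -> [['R8', 'R6'], ['R3']]
```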



Dynamic Hashing

o The dynamic hashing method is used to overcome the problems of static hashing,
such as bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or
decreases. This method is also known as the Extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without
resulting in poor performance.

How to search a key

o First, calculate the hash address of the key.


o Check how many bits are used in the directory; this number of bits is called i.
o Take the least significant i bits of the hash address. This gives an index into the
directory.
o Now, using this index, go to the directory and find the bucket address where the
record might be.

How to insert a new record

o Firstly, you have to follow the same procedure for retrieval, ending up in some
bucket.
o If there is still space in that bucket, then place the record in it.
o If the bucket is full, then we will split the bucket and redistribute the records.

For example:

Consider the following grouping of keys into buckets, depending on the last bits
of their hash addresses:



The last two bits of 2 and 4 are 00. So it will go into bucket B0. The last two bits
of 5 and 6 are 01, so it will go into bucket B1. The last two bits of 1 and 3 are 10,
so it will go into bucket B2. The last two bits of 7 are 11, so it will go into B3.

Insert key 9 with hash address 10001 into the above structure:

o Since key 9 has hash address 10001, and its last two bits are 01, it must go into
bucket B1. But bucket B1 is full, so it will get split.
o The split separates 5 and 9 from 6: the last three bits of the hash addresses of 5 and
9 are 001, so they go into bucket B1, while the last three bits of 6 are 101, so it
goes into bucket B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100
directory entries, because the last two bits of both entries are 00.



o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110
directory entries, because the last two bits of both entries are 10.
o Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 directory
entries, because the last two bits of both entries are 11.
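The directory lookup itself is just "take the last i bits of the hash address and index the directory"; the sketch below (illustrative, with only part of the example's directory filled in) shows key 9 landing in bucket B1 once i has grown to 3.

```python
# Hypothetical extendible-hashing lookup; the directory entries shown are assumptions.
def last_bits(hash_address, i):
    """Least significant i bits of the binary hash address."""
    return hash_address & ((1 << i) - 1)

directory = {0b000: "B0", 0b100: "B0", 0b001: "B1", 0b101: "B5"}  # partial, i = 3

key9 = 0b10001                           # hash address of key 9 from the example
print(bin(last_bits(key9, 3)))           # -> 0b1 (i.e., the bits 001)
print(directory[last_bits(key9, 3)])     # -> B1
```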

Advantages:

o In this method, the performance does not decrease as the data grows in the
system. It simply increases the memory size to accommodate the data.
o In this method, memory is well utilized as it grows and shrinks with the data.
There will not be any unused memory lying around.
o This method is good for the dynamic database where data grows and shrinks
frequently.

Disadvantages:

o In this method, if the data size increases, then the number of buckets also increases.
The addresses of the data are maintained in the bucket address table because the
data addresses keep changing as buckets grow and shrink. If there is a huge
increase in data, maintaining the bucket address table becomes tedious.



o In this case, the bucket overflow situation will also occur, but it takes much longer
to reach this situation than in static hashing.
