
UNIT-5:

LC1: Types of indexes (clustered index, unclustered index, primary index, secondary index); tree-based versus hash-based indexes.
What is an Index?
An index is a data structure that provides fast access to particular rows of data in a table, optimising database query performance. Using indexes, data in a database can be found and accessed without searching through every row of a table each time a query is executed. The working of an index is simple: it stores a copy of the indexed data in a particular order, which makes it easy to locate particular data rows.
Structure of an Index:

Types of Indexes
Indexes in DBMS are categorised mainly into three types i.e., Primary index (Clustered
Indexing), Secondary index (Non-clustered Indexing), and Multilevel index. Selecting the
appropriate indexing approach for a given use case is made easier by having a thorough
understanding of the various index types. We will cover these in detail with examples.

Primary Index (Clustered Index)

Primary index refers to the index created using the primary key of the relational table.
Primary indexing is a type of clustered indexing that contains the sorted data and a primary
key. The primary index offers immediate access to records because each record is uniquely
identified by its primary key. Due to this, the search operation performance increases.

The primary index is typically a clustered index, which means that the index entries and the
physical order of the data in the table match.

Features:

 Because a table’s rows can be kept in only one physical order, each table can have only one clustered index.

 Since the data is stored in the index’s order, searching, retrieving, and sorting it can
be done more quickly.

For example, if an employee database contains a clustered index on a column called EmpID,
the table’s rows will be arranged according to EmpID in ascending order.

Primary index is categorised into two types i.e., Sparse Index and Dense Index.

Dense Index

Every record in the database table has an index entry in a dense index. This means that every unique search-key value has a corresponding index entry that points straight to the relevant disk record. Dense indexes take up more storage space but are more effective for lookups.
 Sparse Index

Sparse indexing offers a more space-efficient approach, as it creates index entries only for some of the records. Instead of storing a search key for every record, a sparse index stores one search key per block, pointing to the block that contains a set of records.

When sparse indexing is used, the size of the mapping still grows along with the table. These mappings are typically kept in main memory to make address fetches faster. Sparse indexes have the advantage of using less space, but it can take extra lookups within a block to locate the requested record.
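As a rough sketch of the difference (the records, block size, and key values below are made-up assumptions, not from the text), the dense index keeps one entry per record while the sparse index keeps one anchor entry per block and then scans inside the located block:

from bisect import bisect_right

# Sorted data file, split into fixed-size blocks (assumed block size of 3 records).
records = [(5, "A"), (12, "B"), (19, "C"), (23, "D"), (31, "E"), (44, "F")]
BLOCK_SIZE = 3
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

# Dense index: one (key -> block number) entry per record.
dense_index = {key: b for b, block in enumerate(blocks) for key, _ in block}

# Sparse index: one entry per block, holding the first (anchor) key of the block.
sparse_index = [(block[0][0], b) for b, block in enumerate(blocks)]

def sparse_lookup(key):
    # Find the last anchor key <= search key, then scan that block.
    pos = bisect_right([anchor for anchor, _ in sparse_index], key) - 1
    if pos < 0:
        return None
    _, block_no = sparse_index[pos]
    for k, value in blocks[block_no]:      # extra lookups inside the block
        if k == key:
            return value
    return None

print(dense_index[23])     # the dense index points straight at block 1
print(sparse_lookup(23))   # the sparse index finds block 1, then scans it -> "D"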

Secondary Index (Non-clustered Index)

The Secondary index is built using a non-primary key attribute. By offering a different way to
access information, secondary indexes facilitate data retrieval based on non-primary key
columns. Secondary indexes are usually non-clustered, in contrast to primary indexes.

We saw that in the sparse index, which is a type of primary index, the size of the mapping grows as the table grows. These mappings are stored in primary memory, which makes address fetching fast, but fetching the actual data from secondary memory through those addresses is slower, and for lookups on non-key columns this becomes an inefficient approach.

A secondary index addresses this by creating another level of index entries on the non-primary-key column, which reduces the size of the mappings and makes the search process faster. Compared to a clustered index, however, retrieval takes longer, since extra work is needed to follow the pointer to the actual data row. When an index is clustered, the data is stored along with the index itself.

Features:

 A table may have several non-clustered indexes, which increases query flexibility.

 A copy of the indexed columns and references to the real data rows are present in
non-clustered indexes.

For example, if a table in your database has a non-clustered index on its FName column, you can easily search for records by first name using the index structure, without changing the row order of the table.
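A minimal sketch of the same idea in Python, assuming a hypothetical employee table kept in EmpID order; the FName-to-row-position mapping plays the role of the non-clustered index:

# Rows stored in clustered (EmpID) order; positions play the role of row pointers.
employees = [
    {"EmpID": 1, "FName": "Asha"},
    {"EmpID": 2, "FName": "Ravi"},
    {"EmpID": 3, "FName": "Asha"},
    {"EmpID": 4, "FName": "Meena"},
]

# Secondary (non-clustered) index: FName -> list of row positions.
fname_index = {}
for pos, row in enumerate(employees):
    fname_index.setdefault(row["FName"], []).append(pos)

# Lookup by first name follows the pointers instead of scanning every row.
def find_by_fname(name):
    return [employees[pos] for pos in fname_index.get(name, [])]

print(find_by_fname("Asha"))   # two matching rows, table order untouched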

Multilevel Index
Multi-level indexing makes it possible to manage big indexes more effectively. In this index type, the single large index is broken up into smaller blocks, so that the outermost index block becomes short enough to fit in main memory.

Features:

 It has several tiers of indexes, with the first tier pointing to the second, and so on,
until it reaches the real data.

 By using this method, fewer disk requests are needed to locate a specific record.

For example, consider a B+ tree where the leaf nodes holding pointers to the real data are
reached through intermediate nodes that are pointed to by the top-level nodes.
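A small illustrative sketch of a two-level index (the key values and block layout are assumptions): the outer index, small enough for main memory, selects an inner index block, and only that block is searched for the record address.

from bisect import bisect_right

# Inner index blocks: sorted (key, record_address) pairs, assumed already built.
inner_blocks = [
    [(5, "r5"), (12, "r12"), (19, "r19")],
    [(23, "r23"), (31, "r31"), (44, "r44")],
]

# Outer (top-level) index: first key of each inner block -> block number.
outer_index = [(block[0][0], b) for b, block in enumerate(inner_blocks)]

def multilevel_lookup(key):
    # Step 1: the outer index picks the inner block to search.
    pos = bisect_right([anchor for anchor, _ in outer_index], key) - 1
    if pos < 0:
        return None
    # Step 2: search only that inner block for the record address.
    for k, addr in inner_blocks[outer_index[pos][1]]:
        if k == key:
            return addr
    return None

print(multilevel_lookup(31))   # -> "r31" after touching just two index blocks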

Hashing:
Hash Function

A hash function is used to compute an index from the original value or key, and the same function is used later each time the data associated with that value or key is to be retrieved. Hashing is a one-way operation: there is no need to "reverse engineer" the hash function by analysing the hashed values.

Characteristics of Good Hash Function:

1. The hash value is fully determined by the data being hashed.

2. The hash Function uses all the input data.

3. The hash function "uniformly" distributes the data across the entire set of possible
hash values.

4. The hash function generates complicated hash values for similar strings.

Some popular hash functions are:

1. Division Method:

Choose a number m smaller than the number n of keys in K. (The number m is usually chosen to be a prime number or a number without small divisors, since this tends to minimise the number of collisions.)

The hash function is:

h(k) = k mod m

For example, if the hash table has size m = 12 and the key is k = 100, then h(k) = 100 mod 12 = 4. Since it requires only a single division operation, hashing by division is quite fast.

2. Multiplication Method:

The multiplication method for creating hash functions operates in two steps. First, we multiply the key k by a constant A in the range 0 < A < 1 and extract the fractional part of kA. Then, we multiply this value by m and take the floor of the result.

The hash function is:

h(k) = ⌊m (k A mod 1)⌋

where "k A mod 1" means the fractional part of kA, that is, kA - ⌊kA⌋.

3. Mid Square Method:

The key k is squared. Then the function H is defined by

H(k) = L

where L is obtained by deleting digits from both ends of k². We emphasise that the same digit positions of k² must be used for all of the keys.

4. Folding Method:

The key k is partitioned into a number of parts k1, k2, ..., kn, where each part, except possibly the last, has the same number of digits as the required address.

The parts are then added together, ignoring the last carry:

H(k) = k1 + k2 + ... + kn

Example: A company has 68 employees, and each is assigned a unique four-digit employee number. Suppose the address space L consists of the 2-digit addresses 00, 01, 02, ..., 99. We apply the above hash functions to the employee numbers 3205, 7148, and 2345.

(a) Division Method: Choose a prime number m close to 99, such as m = 97. Then

H(3205) = 4, H(7148) = 67, H(2345) = 17.

That is, dividing 3205 by 97 gives a remainder of 4, dividing 7148 by 97 gives a remainder of 67, and dividing 2345 by 97 gives a remainder of 17.

(b) Mid-Square Method:

k:        3205      7148     2345
k²:   10272025  51093904  5499025
h(k):       72        93       99

Observe that the fourth and fifth digits of k², counting from the right, are chosen as the hash address.

(c) Folding Method: Dividing each key k into two 2-digit parts and adding them yields the following hash addresses:

H(3205) = 32 + 50 = 82,  H(7148) = 71 + 84 = 155 → 55 (ignoring the carry),  H(2345) = 23 + 45 = 68.

(For the first two keys the rightmost part is reversed before adding: 05 → 50 and 48 → 84.)
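The worked values above can be checked with a short sketch; the reversal of the rightmost two digits in the folding step is an assumption made so that the results 82 and 55 are reproduced:

def h_division(k, m=97):
    # Division method: remainder on dividing by the prime m.
    return k % m

def h_mid_square(k):
    # Mid-square method: take the 4th and 5th digits of k*k, counting from the right.
    return (k * k // 1000) % 100

def h_folding(k, reverse_last=True):
    # Folding method: split the 4-digit key into two 2-digit parts and add them,
    # optionally reversing the rightmost part, ignoring any carry beyond 2 digits.
    first, last = divmod(k, 100)
    if reverse_last:
        last = int(str(last).zfill(2)[::-1])
    return (first + last) % 100

for k in (3205, 7148, 2345):
    print(k, h_division(k), h_mid_square(k), h_folding(k))
# division:   3205 -> 4,  7148 -> 67, 2345 -> 17
# mid-square: 3205 -> 72, 7148 -> 93, 2345 -> 99
# folding:    3205 -> 82, 7148 -> 55 (2345 gives 77 with the reversal, 68 without)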

Indexed sequential access method (ISAM)

ISAM method is an advanced sequential file organization. In this method, records are stored
in the file using the primary key. An index value is generated for each primary key and
mapped with the record. This index contains the address of the record in the file.
If any record has to be retrieved based on its index value, then the address of the data block
is fetched and the record is retrieved from the memory.

Pros of ISAM:

o In this method, each index entry stores the address of its record's data block, so searching for a record in a huge database is quick and easy.

o This method supports range retrieval and partial retrieval of records. Since the index is based on the primary key values, we can retrieve the data for a given range of values. In the same way, partial values can also be searched easily; for example, student names starting with 'JA' can be found easily.

Cons of ISAM

o This method requires extra space in the disk to store the index value.

o When new records are inserted, the file has to be reconstructed to maintain the sequence.

o When the record is deleted, then the space used by it needs to be released.
Otherwise, the performance of the database will slow down.
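A minimal sketch of the ISAM idea, assuming a tiny in-memory index of (primary key, block address) pairs; because the index is kept sorted on the primary key, exact and range retrieval both reduce to binary search:

from bisect import bisect_left, bisect_right

# Sorted index over the primary key: (key, block_address) pairs (values assumed).
isam_index = [(101, "blk-7"), (105, "blk-2"), (110, "blk-9"), (120, "blk-4")]
keys = [k for k, _ in isam_index]

def lookup(key):
    # Exact match: fetch the block address for this primary key.
    i = bisect_left(keys, key)
    return isam_index[i][1] if i < len(keys) and keys[i] == key else None

def range_lookup(lo, hi):
    # Range retrieval: all block addresses for lo <= key <= hi.
    return [addr for k, addr in isam_index[bisect_left(keys, lo):bisect_right(keys, hi)]]

print(lookup(110))             # "blk-9"
print(range_lookup(104, 119))  # ["blk-2", "blk-9"]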

What is the B tree?

A B tree is a self-balancing m-way tree, where m defines the order of the tree. The B tree is a generalisation of the binary search tree in which a node can have more than one key and more than two children, depending on the value of m. In a B tree, keys are kept in sorted order, with lower values in the left subtree and higher values in the right subtree.

Properties of B tree

The following are the properties of the B tree:

o In the B tree, all the leaf nodes must be at the same level, whereas, in the case of a
binary tree, the leaf nodes can be at different levels.

o B trees maintain balance by redistributing keys when nodes become full or underfull.

o Nodes in a B tree can hold a variable number of keys in a defined order, allowing
storage efficiency.

o B trees are a type of data structure that are highly useful for performing range
queries. This is due to their balanced structure and the fact that they store keys in a
sorted order. The balanced nature of B trees ensures that search operations are
efficient.

o B trees are a type of data structure that allow for quick access to elements. This is
because they are balanced, meaning the tree is structured in a way that ensures
efficient searching and retrieving of information. As a result, B trees are commonly
used in databases and file systems, where rapid access to data is crucial.

o B trees are frequently used in databases and file systems to index and structure large datasets effectively. They help to manage and access sizeable collections of information efficiently. These tree-like data structures allow quick searches, insertions, and deletions, making them a popular choice for numerous datasets.
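Because the keys inside each node are sorted and the children sit between them, searching a B tree is a repeated "locate the position among the node's keys, then descend". A minimal sketch follows; the node layout and the tiny example tree are assumptions, not tied to any particular order m:

from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                    # sorted keys within the node
        self.children = children or []      # empty for leaf nodes

def btree_search(node, key):
    i = bisect_left(node.keys, key)         # position of key among this node's keys
    if i < len(node.keys) and node.keys[i] == key:
        return True                         # found in this node
    if not node.children:
        return False                        # leaf reached without finding the key
    return btree_search(node.children[i], key)   # descend into the i-th subtree

# Small example tree: root [20] with children [5, 10] and [30, 40].
root = BTreeNode([20], [BTreeNode([5, 10]), BTreeNode([30, 40])])
print(btree_search(root, 30), btree_search(root, 7))   # True False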

B + Tree:
A B+ tree is a self-balancing, multi-way search tree data structure that stores keys in
internal nodes and all data (or pointers to data) in leaf nodes, which are also linked
together, making it efficient for range queries and sequential access.
 Self-Balancing:
B+ trees are designed to remain balanced, ensuring efficient search, insertion, and
deletion operations, with a time complexity of O(log n).
 Multi-Way Search Tree:
Unlike binary search trees, B+ trees allow each node to have multiple children (up to a
certain order), leading to a wider tree and reduced height.
 Internal Nodes:
Internal nodes store keys, which are used to guide the search process, and pointers to
child nodes.
 Leaf Nodes:
All data (or pointers to data) is stored in the leaf nodes, which are also linked together in a
sorted order, enabling efficient range queries.
 Advantages:
 Efficient Range Queries: The linked leaf nodes make it easy to find all
values within a specific range.
 Good for Disk-Based Storage: B+ trees are well-suited for databases and file
systems because they minimize disk I/O by organizing data in a way that
allows for efficient sequential access.
 Scalability: The structure of B+ trees allows them to scale well with large
datasets.
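The linked leaf nodes are what make range queries cheap: once the leaf containing the lower bound is reached, the scan simply follows the next-leaf pointers. A minimal sketch, assuming the leaves are already built and linked (internal nodes are omitted):

class LeafNode:
    def __init__(self, keys):
        self.keys = keys        # sorted keys (stand-ins for key/record pairs)
        self.next = None        # pointer to the next leaf in key order

# Build and link three leaves: 1,3 | 5,7 | 9,11 (values assumed for illustration).
leaves = [LeafNode([1, 3]), LeafNode([5, 7]), LeafNode([9, 11])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b

def range_query(start_leaf, lo, hi):
    # In a full B+ tree the internal nodes would locate start_leaf; here we are
    # given it directly and simply walk the linked leaf chain.
    result, leaf = [], start_leaf
    while leaf is not None and (not leaf.keys or leaf.keys[0] <= hi):
        result.extend(k for k in leaf.keys if lo <= k <= hi)
        leaf = leaf.next
    return result

print(range_query(leaves[0], 3, 9))   # [3, 5, 7, 9]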
B + tree insertion:
(Figure omitted: a small example B+ tree with the value 4 for d.)

B+ Tree Characteristics
1. Only leaf nodes can store data points.
2. Keys are found in internal nodes.
3. We utilize keys in B+ trees to perform direct element searches.
4. For a tree of order m, each node holds at least ⌈m/2⌉ - 1 keys and at most m - 1 keys.
5. The root node has at least two children and at least one key.
6. Every node other than the root has at least ⌈m/2⌉ children and at most m children.
Insertion on a B+ Tree
Inserting an element into a B+ tree involves three basic steps: finding the correct leaf node, adding the element, and balancing (splitting) the tree if necessary. Let's examine these steps in detail below.
Operation of Insertion
Before adding an element to a B+ tree, keep these properties in mind:
1. The root has a minimum of two children.
2. Each node, excluding the root, may have at least ⌈m/2⌉ children and at most m children.
3. Each node may contain at least ⌈m/2⌉ - 1 keys and at most m - 1 keys.
The steps for inserting an element are as follows:
1. Since every element is inserted into a leaf node, go to the appropriate leaf node.
2. Insert the key into the leaf node.
Case I
If the leaf is not full, insert the key into the leaf node in ascending order.
Case II
o If the leaf is full, insert the key into the leaf node in ascending order and then balance the tree as follows.
o Break the node at the ⌈m/2⌉-th position.

o Copy the ⌈m/2⌉-th key up into the parent node.

o If the parent node is already full, repeat the previous two steps on the parent.

Example:
Show the tree after the following insertions. Assume that each B+ tree node may store up to 4 pointers and 3 keys (order m = 4), so that:

o every leaf node holds at least two entries, and

o every non-leaf node has at least two pointers and one key.

o Insert 1, 3, 5, 7, and 9, in that order.

o Insert 1
o Insert 3, 5
o Insert 7
o Insert 9
(The intermediate trees after each insertion are shown in figures omitted here.)
This is the final B+ tree.
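A compact sketch of Case II from the insertion steps above: the key is inserted into the full leaf in ascending order, the leaf is broken at the middle position, and the first key of the new right half is copied up to the parent. The 3-keys-per-leaf limit follows the order-4 assumption used in the example; the list-based representation is illustrative only.

import bisect

MAX_KEYS = 3   # order m = 4 -> at most m - 1 = 3 keys per leaf (as in the example)

def insert_into_leaf(leaf, key):
    # Insert key into a sorted leaf; split the leaf if it overflows.
    # Returns (left_leaf, None, None) when no split is needed, otherwise
    # (left_leaf, separator_key_for_parent, right_leaf).
    bisect.insort(leaf, key)                 # keep the leaf in ascending order
    if len(leaf) <= MAX_KEYS:
        return leaf, None, None
    mid = len(leaf) // 2                     # break the node at the middle position
    left, right = leaf[:mid], leaf[mid:]
    return left, right[0], right             # copy right[0] up into the parent

leaf = [1, 3, 5]
left, sep, right = insert_into_leaf(leaf, 7)  # overflow: 4 keys in an order-4 leaf
print(left, sep, right)                       # [1, 3] 5 [5, 7]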

B + tree deletion:
Deleting an element on a B+ tree consists of three main events: searching the node where
the key to be deleted exists, deleting the key and balancing the tree if
required. Underflow is the situation in which a node contains fewer keys than the minimum number it should hold.

Deletion Operation
Before going through the steps below, one must know these facts about a B+ tree of
degree m.
1. A node can have a maximum of m children. (i.e. 3)
2. A node can contain a maximum of m - 1 keys. (i.e. 2)

3. A node should have a minimum of ⌈m/2⌉ children. (i.e. 2)

4. A node (except root node) should contain a minimum of ⌈m/2⌉ - 1 keys. (i.e. 1)
While deleting a key, we also have to take care of the copies of that key present in the internal nodes (i.e., the index entries), because key values are duplicated in a B+ tree. Search for the key to be deleted, then follow the steps below.
Case I
The key to be deleted is present only at the leaf node not in the indexes (or internal
nodes). There are two cases for it:
1. There is more than the minimum number of keys in the node. Simply delete the key.

(Figure omitted: deleting 40 from the B+ tree.)
2. There is an exact minimum number of keys in the node. Delete the key and borrow a
key from the immediate sibling. Add the median key of the sibling node to the parent.
(Figure omitted: deleting 5 from the B+ tree.)
Case II
The key to be deleted is present in the internal nodes as well. Then we have to remove
them from the internal nodes as well. There are the following cases for this situation.
1. If there is more than the minimum number of keys in the node, simply delete the key
from the leaf node and delete the key from the internal node as well.
Fill the empty space in the internal node with the inorder successor.
(Figure omitted: deleting 45 from the B+ tree.)
2. If there is an exact minimum number of keys in the node, then delete the key and
borrow a key from its immediate sibling (through the parent).
Fill the empty space created in the index (internal node) with the borrowed key.
(Figure omitted: deleting 35 from the B+ tree.)
3. This case is similar to Case II(1) but here, empty space is generated above the
immediate parent node.
After deleting the key, merge the empty space with its sibling.
Fill the empty space in the grandparent node with the inorder successor.
Case III
In this case, the height of the tree shrinks, which makes it a little more complicated. Deleting 55 from the tree below leads to this condition; it can be understood in the illustration below.
(Figure omitted: deleting 55 from the B+ tree.)
LC3: Transaction concept, Transaction states, ACID properties of transaction.

In database management, a transaction is a unit of work that ensures data integrity and
consistency, characterized by the ACID properties (Atomicity, Consistency, Isolation, and
Durability), and it progresses through distinct states.

Transaction Concept:

 A transaction is a sequence of database operations treated as a single, indivisible unit of work.

 It ensures that either all operations within the transaction are completed
successfully, or none of them are, maintaining data integrity.

 Transactions are used to manage concurrent access to data, preventing conflicts and
ensuring data consistency.

Transaction States:

 Active: The transaction is currently executing its operations.

 Failed: The transaction has encountered an error and cannot continue.

 Partially Committed: Some operations have been completed, but the transaction is
not yet fully committed.

 Committed: The transaction has successfully completed all operations and the
changes are permanently stored in the database.

 Aborted: The transaction has been rolled back, and any changes made are
discarded.

ACID Properties:

 Atomicity:

Ensures that a transaction is treated as a single, indivisible unit of work; either all operations
within the transaction are completed successfully, or none of them are.

 Consistency:

Guarantees that a transaction brings the database from one valid state to another, ensuring
that all data constraints are maintained.

 Isolation:

Ensures that concurrent transactions do not interfere with each other, preventing data
corruption or inconsistencies.

 Durability:
Guarantees that once a transaction is committed, the changes are permanently stored in the
database and will survive system failures.

In DBMS, a transaction is a logical unit of work that accesses and potentially modifies
database data, ensuring data integrity through ACID properties (Atomicity, Consistency,
Isolation, and Durability)

Transaction states:
 Active:
The transaction is currently executing and modifying the database.
 Partially Committed:
The transaction has completed all its operations but hasn't yet been permanently saved.
 Committed:
The transaction has successfully completed all operations and its changes are permanently
saved in the database.
 Failed:
The transaction encountered an error and cannot continue, potentially requiring rollback.
 Aborted:
The transaction is explicitly canceled, and its changes are rolled back to the database's
previous state.
 Terminated:
The transaction has completed its execution, either successfully or after being aborted or
failed.
ACID Properties:

A transaction is a single logical unit of work that interacts with the database, potentially
modifying its content through read and write operations. To maintain database consistency
both before and after a transaction, specific properties, known as ACID properties must be
followed.

This article focuses on the ACID properties in DBMS, which are essential for ensuring data
consistency, integrity, and reliability during database transactions.

Atomicity:
By this, we mean that either the entire transaction takes place at once or doesn’t happen at
all. There is no midway i.e. transactions do not occur partially. Each transaction is considered
as one unit and either runs to completion or is not executed at all. It involves the following
two operations.
— Abort : If a transaction aborts, changes made to the database are not visible.
— Commit : If a transaction commits, changes made are visible.
Atomicity is also known as the ‘All or nothing rule’.

Consider the following transaction T, consisting of T1 and T2: transfer of 100 from account X to account Y.

Example: T1 deducts 100 from X (read(X); X := X - 100; write(X)), and T2 adds 100 to Y (read(Y); Y := Y + 100; write(Y)).

If the transaction fails after completion of T1 but before completion of T2 ( say, after write(X)
but before write(Y) ), then the amount has been deducted from X but not added to Y . This
results in an inconsistent database state. Therefore, the transaction must be executed in its
entirety in order to ensure the correctness of the database state.
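A small, hedged illustration of the all-or-nothing rule using Python's built-in sqlite3 module; the table name and balances are made up to match the running example (X = 500, Y = 200), and a real DBMS would behave analogously:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("X", 500), ("Y", 200)])
conn.commit()

try:
    # Transaction T: T1 deducts 100 from X, T2 adds 100 to Y.
    conn.execute("UPDATE account SET balance = balance - 100 WHERE name = 'X'")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE name = 'Y'")
    conn.commit()                      # both writes become permanent together
except sqlite3.Error:
    conn.rollback()                    # a failure after write(X) undoes it: nothing is half-done

print(conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# [('X', 400), ('Y', 300)] -- the total of 700 is preserved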

Consistency:

Consistency ensures that a database remains in a valid state before and after a transaction. It
guarantees that any transaction will take the database from one consistent state to another,
maintaining the rules and constraints defined for the data.
Referring to the example above,
The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700 .
Total after T occurs = 400 + 300 = 700 .
Therefore, the database is consistent . Inconsistency occurs in case T1 completes but T2 fails.

Isolation:

This property ensures that multiple transactions can occur concurrently without leading to
the inconsistency of the database state. Transactions occur independently without
interference. Changes occurring in a particular transaction will not be visible to any other
transaction until that particular change in that transaction is written to memory or has been
committed. This property ensures that when multiple transactions run at the same time, the
result will be the same as if they were run one after another in a specific order.
Let X = 500 and Y = 500, and consider two transactions T and T''. Transaction T reads X, sets X := X × 100 and writes it, then reads Y, sets Y := Y - 50 and writes it; transaction T'' simply reads X and Y and displays their sum.

Suppose T has been executed up to Read(Y) and then T'' starts. Because of this interleaving of operations, T'' reads the correct (updated) value of X but the not-yet-updated value of Y, so the sum computed by T'' is
T'': X + Y = 50,000 + 500 = 50,500,
which is not consistent with the sum at the end of transaction T:
T: X + Y = 50,000 + 450 = 50,450.
This results in database inconsistency, due to an apparent loss of 50 units. Hence, transactions must take place in isolation, and changes should be visible only after they have been written to main memory.

Durability:

This property ensures that once the transaction has completed execution, the updates and
modifications to the database are stored in and written to disk and they persist even if a
system failure occurs. These updates now become permanent and are stored in non-volatile
memory. The effects of the transaction, thus, are never lost.

Some important points:

Property        Responsible for maintaining the property
Atomicity       Transaction Manager
Consistency     Application programmer
Isolation       Concurrency Control Manager
Durability      Recovery Manager

LC4: Transactions and Schedules, Concurrent executions of transactions (anomalies).


A schedule, as the name suggests, is a process of lining up transactions and executing them one by one. When multiple transactions run concurrently and the order of operations needs to be set so that the operations do not overlap, scheduling is brought into play and the transactions are timed accordingly. The basics of transactions and schedules are discussed in the Concurrency Control and Transaction Isolation Levels in DBMS articles. Here we will discuss the various types of schedules.

1. Serial Schedules: Schedules in which the transactions are executed non-interleaved, i.e., a serial schedule is one in which no transaction starts until the currently running transaction has ended. Example: consider the following schedule involving two transactions T1 and T2.

T1: R(A)
T1: W(A)
T1: R(B)
T1: W(B)
T2: R(A)
T2: R(B)
Here R(A) denotes that a read operation is performed on data item A. This is a serial schedule, since the transactions execute serially in the order T1 → T2.
2. Non-Serial Schedule: This is a type of scheduling where the operations of multiple transactions are interleaved. This can give rise to concurrency problems. The transactions are executed in a non-serial manner, while keeping the end result correct and the same as that of a serial schedule. Unlike a serial schedule, where one transaction must wait for another to complete all its operations, in a non-serial schedule a transaction proceeds without waiting for the previous transaction to complete, so the benefit of concurrent execution is actually obtained. A non-serial schedule can be divided further into Serializable and Non-Serializable schedules.
1. Serializable: This is used to maintain the consistency of the database. It is
mainly used in the Non-Serial scheduling to verify whether the scheduling will
lead to any inconsistency or not. On the other hand, a serial schedule does not
need the serializability because it follows a transaction only when the previous
transaction is complete. The non-serial schedule is said to be in a serializable
schedule only when it is equivalent to the serial schedules, for an n number of
transactions. Since concurrency is allowed in this case thus, multiple
transactions can execute concurrently. A serializable schedule helps in
improving both resource utilization and CPU throughput. These are of two
types:
1. Conflict Serializable: A schedule is called conflict serializable if it can
be transformed into a serial schedule by swapping non-conflicting
operations. Two operations are said to be conflicting if all conditions
satisfy:
 They belong to different transactions
 They operate on the same data item
 At Least one of them is a write operation
2. View Serializable: A schedule is called view serializable if it is view equivalent to a serial schedule (one with no overlapping transactions). Every conflict-serializable schedule is view serializable, but a view-serializable schedule that contains blind writes may not be conflict serializable.
2. Non-Serializable: The non-serializable schedule is divided into two types,
Recoverable and Non-recoverable Schedule.
1. Recoverable Schedule: Schedules in which transactions commit only after all transactions whose changes they read have committed are called recoverable schedules. In other words, if some transaction Tj is reading a value updated or written by some other transaction Ti, then the commit of Tj must occur after the commit of Ti. Example: consider the following schedule involving two transactions T1 and T2.

T1: R(A)
T1: W(A)
T2: W(A)
T2: R(A)
T1: commit
T2: commit

This is a recoverable schedule, since T1 commits before T2, which makes the value read by T2 correct. There can be three types of recoverable schedule:
 Cascading Schedule: When a failure in one transaction leads to the rolling back or aborting of other dependent transactions, such scheduling is referred to as cascading rollback or cascading abort. Example: (figure omitted)

 Cascadeless Schedule: Also called Avoids Cascading Aborts/rollbacks (ACA). Schedules in which transactions read values only after all transactions whose changes they are going to read have committed are called cascadeless schedules. This avoids a single transaction abort leading to a series of transaction rollbacks. The strategy to prevent cascading aborts is to disallow a transaction from reading uncommitted changes of another transaction in the same schedule. In other words, if some transaction Tj wants to read a value updated or written by some other transaction Ti, then Tj must read it only after Ti has committed. Example: consider the following schedule involving two transactions T1 and T2.

T1: R(A)
T1: W(A)
T2: W(A)
T1: commit
T2: R(A)
T2: commit

 This schedule is cascadeless, since the updated value of A is read by T2 only after the updating transaction, i.e. T1, commits. Example: consider the following schedule involving two transactions T1 and T2.

T1: R(A)
T1: W(A)
T2: R(A)
T2: W(A)
T1: abort
T2: abort

 This is a recoverable schedule, but it does not avoid cascading aborts: if T1 aborts, T2 will have to be aborted too in order to maintain the correctness of the schedule, since T2 has already read the uncommitted value written by T1.
 Strict Schedule: A schedule is strict if, for any two transactions Ti and Tj, whenever a write operation of Ti precedes a conflicting operation of Tj (either a read or a write), the commit or abort event of Ti also precedes that conflicting operation of Tj. In other words, Tj can read or write a value updated or written by Ti only after Ti commits or aborts. Example: consider the following schedule involving two transactions T1 and T2.

T1: R(A)
T2: R(A)
T1: W(A)
T1: commit
T2: W(A)
T2: R(A)
T2: commit

 This is a strict schedule, since T2 reads and writes A, which is written by T1, only after the commit of T1.
2. Non-Recoverable Schedule: Example: consider the following schedule involving two transactions T1 and T2.

T1: R(A)
T1: W(A)
T2: W(A)
T2: R(A)
T2: commit
T1: abort

T2 read the value of A written by T1 and committed. T1 later aborted, so the value read by T2 is wrong; but since T2 has already committed, this schedule is non-recoverable.
Concurrent execution of transactions:
In a multi-user system, several users can access and work on the same database at the same
time. This is known as concurrent execution, where the database is used simultaneously by
different users for various operations. For instance, one user might be updating data while
another is retrieving it.
When multiple transactions are performed on the database simultaneously, their operations are executed in an interleaved manner. It is important that the actions of one user do not interfere with or affect the actions of another, as this helps in maintaining the consistency of the database. However, managing such simultaneous operations can be challenging, and certain problems may arise if they are not handled properly. These challenges need to be addressed to ensure smooth and error-free concurrent execution.
Concurrency control is an essential aspect of database management systems (DBMS) that
ensures transactions can execute concurrently without interfering with each other. However,
concurrency control can be challenging to implement, and without it, several problems can
arise, affecting the consistency of the database. In this article, we will discuss some of the
concurrency problems that can occur in DBMS transactions and explore solutions to prevent
them.
When multiple transactions execute concurrently in an uncontrolled or unrestricted manner,
then it might lead to several problems. These problems are commonly referred to as
concurrency problems in a database environment.
The five concurrency problems that can occur in the database are:
 Temporary Update Problem
 Incorrect Summary Problem
 Lost Update Problem
 Unrepeatable Read Problem
 Phantom Read Problem
Concurrency control ensures the consistency and integrity of data in databases when multiple
transactions are executed simultaneously. Understanding issues like lost updates, dirty reads,
and non-repeatable reads is crucial when studying DBMS.
These are explained as following below.
Temporary Update Problem:
Temporary update or dirty read problem occurs when one transaction updates an item and
fails. But the updated item is used by another transaction before the item is changed or
reverted back to its last value.
Example:
In the above example, if transaction 1 fails for some reason then X will revert back to its
previous value. But transaction 2 has already read the incorrect value of X.
Incorrect Summary Problem:
Consider a situation, where one transaction is applying the aggregate function on some
records while another transaction is updating these records. The aggregate function may
calculate some values before the values have been updated and others after they are updated.
Example:
In the above example, transaction 2 is calculating the sum of some records while transaction
1 is updating them. Therefore the aggregate function may calculate some values before they
have been updated and others after they have been updated.
Lost Update Problem:
In the lost update problem, an update done to a data item by a transaction is lost as it is
overwritten by the update done by another transaction.
Example:

In the above example, transaction 2 changes the value of X but it will get overwritten by the
write commit by transaction 1 on X (not shown in the image above). Therefore, the update
done by transaction 2 will be lost. Basically, the write commit done by the last
transaction will overwrite all previous write commits.
Unrepeatable Read Problem:
The unrepeatable problem occurs when two or more read operations of the same transaction
read different values of the same variable.
Example:
In the above example, once transaction 2 reads the variable X, a write operation in transaction
1 changes the value of the variable X. Thus, when another read operation is performed by
transaction 2, it reads the new value of X which was updated by transaction 1.
Phantom Read Problem:
The phantom read problem occurs when a transaction reads a variable once but when it tries
to read that same variable again, an error occurs saying that the variable does not exist.
Example:
In the above example, once transaction 2 reads the variable X, transaction 1 deletes the
variable X without transaction 2’s knowledge. Thus, when transaction 2 tries to read X, it is
not able to do it.

Solution :
To prevent concurrency problems in DBMS transactions, several concurrency control
techniques can be used, including locking, timestamp ordering, and optimistic concurrency
control.
Locking involves acquiring locks on the data items used by transactions, preventing other
transactions from accessing the same data until the lock is released. There are different types
of locks, such as shared and exclusive locks, and they can be used to prevent Dirty Read and
Non-Repeatable Read.
Timestamp ordering assigns a unique timestamp to each transaction and ensures that
transactions execute in timestamp order. Timestamp ordering can prevent Non-Repeatable
Read and Phantom Read.
Optimistic concurrency control assumes that conflicts between transactions are rare and
allows transactions to proceed without acquiring locks initially. If a conflict is detected, the
transaction is rolled back, and the conflict is resolved. Optimistic concurrency control can
prevent Dirty Read, Non-Repeatable Read, and Phantom Read.
In conclusion, concurrency control is crucial in DBMS transactions to ensure data
consistency and prevent concurrency problems such as Dirty Read, Non-Repeatable Read,
and Phantom Read. By using techniques like locking, timestamp ordering, and optimistic
concurrency control, developers can build robust database systems that support concurrent
access while maintaining data consistency.
LC5: Serializability, Testing for serializability, 2PL.
Serializability ensures concurrent transactions appear as if they executed sequentially,
maintaining database consistency. Testing for serializability uses a precedence graph, where
edges represent conflicting operations, and a schedule is serializable if the graph is
acyclic. Two-Phase Locking (2PL) is a concurrency control protocol that helps achieve
serializability by controlling when locks can be acquired and released.
What is a serializable schedule, and what is it used for?
If a non-serial schedule can be transformed into an equivalent serial schedule, it is said to be serializable. Simply put, a non-serial schedule is referred to as a serializable schedule if it yields the same results as some serial schedule.
Non-serial Schedule
A non-serial schedule is one in which the transactions overlap or interleave. Since schedules are used to carry out actual database operations, multiple transactions run at once, and these transactions may be working on the same data items. Therefore, it is crucial that non-serial schedules be serializable in order for the database to be consistent both before and after the transactions are executed.
Example:

Transaction-1: R(a)
Transaction-1: W(a)
Transaction-2: R(b)
Transaction-2: W(b)
Transaction-1: R(b)
Transaction-2: R(a)
Transaction-1: W(b)
Transaction-2: W(a)

We can observe that Transaction-2 begins its execution before Transaction-1 has finished, and both are working on the same data items, "a" and "b", interchangeably (R = read, W = write).
Serializability testing
We can use the serialization graph, or precedence graph, to examine a schedule's serializability. A serialization graph organises all of the schedule's transactions into a directed graph.

Precedence Graph
It can be described as a graph G(V, E) with vertices V = {V1, V2, V3, ..., Vn} and directed edges E = {E1, E2, E3, ..., En}. An edge is added for a pair of conflicting READ or WRITE operations performed by two transactions: an edge Ti → Tj means that transaction Ti performs a conflicting read or write on some data item before transaction Tj does.
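The test can be written directly from this definition: add an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj, then check the resulting graph for a cycle. A small sketch follows; the encoding of a schedule as (transaction, action, item) triples is an assumption:

from itertools import combinations

def precedence_graph(schedule):
    # schedule: list of (transaction, action, item) triples in time order.
    edges = set()
    for (t1, a1, x1), (t2, a2, x2) in combinations(schedule, 2):
        if t1 != t2 and x1 == x2 and "W" in (a1, a2):
            edges.add((t1, t2))      # earlier conflicting op's txn -> later one's txn
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visited, on_stack = set(), set()
    def dfs(u):
        visited.add(u)
        on_stack.add(u)
        for v in graph.get(u, ()):
            if v in on_stack or (v not in visited and dfs(v)):
                return True
        on_stack.discard(u)
        return False
    return any(dfs(u) for u in graph if u not in visited)

# Classic non-serializable interleaving: each transaction writes what the other read.
s = [("T1", "R", "A"), ("T2", "R", "B"), ("T1", "W", "B"), ("T2", "W", "A")]
g = precedence_graph(s)
print(g, "-> serializable" if not has_cycle(g) else "-> not serializable")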
Types of Serializability
There are two ways to check whether any non-serial schedule is serializable.
1. Conflict serializability
Conflict serializability is a form of serializability that focuses on maintaining the consistency of a database by looking at the order in which conflicting operations on the same data items are executed. In a DBMS each transaction is unique, and this uniqueness ensures that two conflicting operations are never treated as occurring at the same instant; for example, in a database with an order table and a customer table, each order is associated with exactly one customer even though a single customer may place many orders. For two operations to conflict (and hence matter for conflict serializability), the following conditions must hold:
1. The two operations belong to different transactions.
2. Both operations access the same data item.
3. At least one of the two operations is a write operation.
Example
Three transactions, t1, t2, and t3, are active on a schedule S at once. Let's create a precedence graph.

t1: R(a)
t1: R(b)
t3: R(b)
t3: W(b)
t1: W(a)
t3: W(a)
t2: R(a)
t2: W(a)

This is a conflict-serializable schedule, because the precedence graph is a DAG (it has no cycles). Since it is conflict serializable, we can also determine an equivalent serial order of the transactions.

(Figure omitted: the precedence graph (DAG) of the transactions.)
As there is no incoming edge on transaction t1, t1 is executed first. t3 runs second, because it depends only on t1. Finally t2 is executed, due to its dependence on both t1 and t3.
Therefore, the equivalent serial order is: t1 → t3 → t2.
Note: a schedule that is conflict serializable is unquestionably serializable and hence consistent. A schedule that is not conflict serializable, on the other hand, might or might not be serializable; we use the idea of view serializability to examine its serial behaviour further.
2. View Serializability
View serializability is a form of serializability in which each transaction must produce the same results as it would if the data items were processed in a proper sequential (serial) execution. Like conflict serializability, view serializability is concerned with avoiding database inconsistency, but it is a weaker, more permissive test.
To understand view serializability in DBMS, consider two schedules S1 and S2 established over the same transactions, say T1 and T2. A schedule is view serializable when it is view equivalent to a serial schedule, and S1 and S2 are view equivalent only if the following three conditions hold:
1. The same set of transactions appears in both schedules. If one schedule commits a transaction that does not appear, or does not commit, in the other schedule, the schedules are not equal to one another.
2. The same read and write operations must appear in both schedules, and each transaction must read its values from the same source in both: a transaction that reads the initial value of a data item in S1 must also read the initial value in S2, and a transaction that reads a value written by some transaction in S1 must read the value written by that same transaction in S2. The schedules are not equivalent if, say, S1 has two write operations on an item while S2 has only one; a difference in the number of read operations alone does not matter.
3. The final write on each data item must be performed by the same transaction in both schedules. For instance, if data item A receives its final write from T1 in schedule S1 but from T2 in schedule S2, the schedules are not equivalent; they are equivalent only when the final write on every data item matches.
The Two-Phase Locking (2PL) Protocol is an essential concept in database management systems, used to maintain data consistency and ensure smooth operation when multiple transactions are happening simultaneously. It helps to prevent issues such as data conflicts, where two or more transactions try to access or modify the same data at the same time, potentially causing errors.
Two-phase locking is widely used to ensure serializability, meaning transactions take effect in a sequence that maintains data accuracy. This section explores the workings of the 2PL protocol, its types, its advantages, and its role in maintaining a reliable database system.
Two Phase Locking
The Two-Phase Locking (2PL) Protocol is a key technique used in database management
systems to manage how multiple transactions access and modify data at the same time. When
many users or processes interact with a database, it’s important to ensure that data remains
consistent and error-free. Without proper management, issues like data conflicts or corruption
can occur if two transactions try to use the same data simultaneously.
The Two-Phase Locking Protocol resolves this issue by defining clear rules for managing
data locks. It divides a transaction into two phases:
1. Growing Phase: In this step, the transaction gathers all the locks it needs to access
the required data. During this phase, it cannot release any locks.
2. Shrinking Phase: Once a transaction starts releasing locks, it cannot acquire any new
ones. This ensures that no other transaction interferes with the ongoing process.
Types of Lock
Shared Lock (S): A shared lock, also called a read-only lock, allows multiple transactions to read the same data item at the same time. However, transactions holding this lock cannot make changes to the data. A shared lock is requested using the lock-S instruction.
Exclusive Lock (X): An exclusive lock allows a transaction to both read and modify a data item. This lock is exclusive, meaning no other transaction can access the same data item while this lock is held. An exclusive lock is requested using the lock-X instruction.
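A minimal sketch of the two-phase discipline with these two lock modes (the lock-table layout, transaction names, and the choice to raise an error instead of blocking a waiting transaction are all simplifying assumptions):

class TwoPhaseLockError(Exception):
    pass

class LockManager:
    def __init__(self):
        self.locks = {}         # item -> (mode, set of holding transactions)
        self.shrinking = set()  # transactions that have started releasing locks

    def acquire(self, txn, item, mode):          # mode: "S" (lock-S) or "X" (lock-X)
        if txn in self.shrinking:
            raise TwoPhaseLockError(f"{txn} is in its shrinking phase; cannot acquire")
        held_mode, holders = self.locks.get(item, ("S", set()))
        if holders and (mode == "X" or held_mode == "X") and holders != {txn}:
            raise TwoPhaseLockError(f"{txn} must wait: {item} is locked {held_mode}")
        self.locks[item] = (mode if mode == "X" else held_mode, holders | {txn})

    def release(self, txn, item):
        self.shrinking.add(txn)                  # the first release starts phase 2
        mode, holders = self.locks[item]
        holders.discard(txn)
        if not holders:
            del self.locks[item]

lm = LockManager()
lm.acquire("T1", "A", "S")     # growing phase: T1 reads A
lm.acquire("T2", "A", "S")     # shared locks are compatible
lm.release("T1", "A")          # T1 enters its shrinking phase
# lm.acquire("T1", "B", "S")   # would raise TwoPhaseLockError: no new locks now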
Lock Conversions
In the Two-Phase Locking Protocol, lock conversion means changing the type of lock on data
while a transaction is happening. This process is carefully controlled to maintain consistency
in the database.
1. Upgrading a Lock: This means changing a shared lock (S) to an exclusive lock (X).
For example, if a transaction initially only needs to read data (S) but later decides it
needs to update the same data, it can request an upgrade to an exclusive lock (X).
However, this can only happen during the Growing Phase, where the transaction is
still acquiring locks.
 Example: A transaction reads a value (S lock) but then realizes it needs to
modify the value. It upgrades to an X lock during the Growing Phase.
2. Downgrading a Lock: This means changing an exclusive lock (X) to a shared lock
(S). For instance, if a transaction initially planned to modify data (X lock) but later
decides it only needs to read it, it can downgrade the lock. However, this must happen
during the Shrinking Phase, where the transaction is releasing locks.
 Example: A transaction modifies a value (X lock) but later only needs to read
the value, so it downgrades to an S lock during the Shrinking Phase.

LC6: Strict 2PL, Deadlocks, timestamp-based protocols

In database management systems (DBMS), a deadlock occurs when two or more transactions are unable to proceed because each transaction is waiting for another to release locks on resources. This situation creates a cycle of dependencies in which no transaction can continue, bringing the system to a standstill. Deadlocks can severely impact the performance and reliability of a DBMS, making it crucial to understand and manage them effectively.
What is Deadlock?
A deadlock is a condition in a multi-user database environment where transactions are unable to complete because each is waiting for resources held by other transactions. This results in a cycle of dependencies in which no transaction can proceed.
Characteristics of Deadlock
 Mutual Exclusion: Only one transaction can hold a particular resource at a time.
 Hold and Wait: Transactions holding resources may request additional resources held by others.
 No Preemption: Resources cannot be forcibly taken from the transaction holding them.
 Circular Wait: A cycle of transactions exists in which each transaction is waiting for a resource held by the next transaction in the cycle.
In a database management system (DBMS), a deadlock occurs when two or more transactions
are waiting for each other to release resources, such as locks on database objects, that they
need to complete their operations. As a result, none of the transactions can proceed, leading
to a situation where they are stuck or “deadlocked.”
Deadlocks can happen in multi-user environments when two or more transactions are running
concurrently and try to access the same data in a different order. When this happens, one
transaction may hold a lock on a resource that another transaction needs, while the second
transaction may hold a lock on a resource that the first transaction needs. Both transactions
are then blocked, waiting for the other to release the resource they need.
DBMSs often use various techniques to detect and resolve deadlocks automatically. These
techniques include timeout mechanisms, where a transaction is forced to release its locks
after a certain period of time, and deadlock detection algorithms, which periodically scan the
transaction log for deadlock cycles and then choose a transaction to abort to resolve the
deadlock.
It is also possible to prevent deadlocks by careful design of transactions, such as always
acquiring locks in the same order or releasing locks as soon as possible. Proper design of the
database schema and application can also help to minimize the likelihood of deadlocks.
In a database, a deadlock is an unwanted situation in which two or more transactions are
waiting indefinitely for one another to give up locks. Deadlock is said to be one of the most
feared complications in DBMS as it brings the whole system to a Halt.
Example: to understand the concept of deadlock, suppose Transaction T1 holds a lock on some rows in the Students table and needs to update some rows in the Grades table. Simultaneously, Transaction T2 holds locks on those very rows in the Grades table (which T1 needs to update) but needs to update the rows in the Students table held by Transaction T1.
Now, the main problem arises. Transaction T1 will wait for transaction T2 to give up the
lock, and similarly, transaction T2 will wait for transaction T1 to give up the lock. As a
consequence, All activity comes to a halt and remains at a standstill forever unless the DBMS
detects the deadlock and aborts one of the transactions.

(Figure omitted: deadlock in DBMS.)
What is Deadlock Avoidance?
When a database is prone to deadlock, it is always better to avoid the deadlock than to restart or abort the database. The deadlock avoidance method is suitable for smaller databases, whereas the deadlock prevention method is suitable for larger databases.
One method of avoiding deadlock is using application-consistent logic. In the above-given
example, Transactions that access Students and Grades should always access the tables in the
same order. In this way, in the scenario described above, Transaction T1 simply waits for
transaction T2 to release the lock on Grades before it begins. When transaction T2 releases
the lock, Transaction T1 can proceed freely.
Another method for avoiding deadlock is to apply both the row-level locking mechanism and
the READ COMMITTED isolation level. However, It does not guarantee to remove
deadlocks completely.
What is Deadlock Detection?
When a transaction waits indefinitely to obtain a lock, the database management system should detect whether the transaction is involved in a deadlock or not.
Wait-for-graph is one of the methods for detecting the deadlock situation. This method is
suitable for smaller databases. In this method, a graph is drawn based on the transaction and
its lock on the resource. If the graph created has a closed loop or a cycle, then there is a
deadlock. For the above-mentioned scenario, the Wait-For graph is drawn below:
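The wait-for graph for the T1/T2 scenario (figure omitted here) can be represented as edges from each waiting transaction to the transaction holding the lock it needs, and deadlock detection is then a cycle search. A sketch, with the edge encoding assumed:

def find_deadlock(wait_for):
    # wait_for: dict mapping a transaction to the set of transactions it waits for.
    # Returns one deadlocked cycle as a list, or None if the graph is acyclic.
    def dfs(txn, path, seen):
        if txn in path:
            return path[path.index(txn):]        # cycle found: the deadlocked set
        if txn in seen:
            return None
        seen.add(txn)
        for holder in wait_for.get(txn, ()):
            cycle = dfs(holder, path + [txn], seen)
            if cycle:
                return cycle
        return None
    seen = set()
    for txn in wait_for:
        cycle = dfs(txn, [], seen)
        if cycle:
            return cycle
    return None

# T1 waits for locks on Grades held by T2; T2 waits for locks on Students held by T1.
wait_for = {"T1": {"T2"}, "T2": {"T1"}}
print(find_deadlock(wait_for))    # ['T1', 'T2'] -- a cycle, so a deadlock exists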

What is Deadlock Prevention?


For a large database, the deadlock prevention method is suitable. A deadlock can be prevented if the resources are allocated in such a way that a deadlock never occurs. The DBMS analyses the operations to determine whether they can create a deadlock situation; if they can, that transaction is never allowed to be executed.
Deadlock prevention mechanism proposes two schemes:
 Wait-Die Scheme: In this scheme, If a transaction requests a resource that is locked
by another transaction, then the DBMS simply checks the timestamp of both
transactions and allows the older transaction to wait until the resource is available for
execution.
Suppose there are two transactions T1 and T2, and let the timestamp of any transaction T be TS(T). If T2 holds a lock on some resource and T1 requests that resource, the DBMS performs the following actions:
It checks whether TS(T1) < TS(T2). If so, T1 is the older transaction, and it is allowed to wait until the resource is available: when a younger transaction has locked a resource and an older transaction is waiting for it, the older transaction waits until the resource is free. Otherwise, if T1 is the older transaction holding a resource and the younger transaction T2 requests it, then T2 is killed and restarted later, after a small random delay, but with the same timestamp. In other words, when the older transaction holds a resource and a younger transaction requests it, the younger transaction is killed and restarted with the same timestamp.
This scheme allows the older transaction to wait but kills the younger one.
 Wound-Wait Scheme: In this scheme, if an older transaction requests a resource held by a younger transaction, the older transaction forces the younger transaction to be killed, so that it releases the resource. The younger transaction is restarted with a small delay but with the same timestamp. If, instead, a younger transaction requests a resource held by an older one, the younger transaction is asked to wait until the older one releases it.
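The two schemes differ only in who waits and who is rolled back, so their decision logic can be sketched directly from the rules above (the integer timestamps are illustrative; smaller means older):

def wait_die(ts_requester, ts_holder):
    # Older requester (smaller timestamp) is allowed to wait;
    # a younger requester is killed (dies) and restarted with the same timestamp.
    return "wait" if ts_requester < ts_holder else "abort requester (die)"

def wound_wait(ts_requester, ts_holder):
    # Older requester wounds (preempts) the younger holder;
    # a younger requester simply waits for the older holder to finish.
    return "abort holder (wound)" if ts_requester < ts_holder else "wait"

# T1 is older (timestamp 5) than T2 (timestamp 9).
print(wait_die(5, 9), "|", wait_die(9, 5))      # wait | abort requester (die)
print(wound_wait(5, 9), "|", wound_wait(9, 5))  # abort holder (wound) | wait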
The following table lists the differences between the Wait-Die and Wound-Wait prevention schemes:

Wait-Die: based on a non-preemptive technique; older transactions must wait for younger ones to release their data items; the number of aborts and rollbacks is higher.
Wound-Wait: based on a preemptive technique; older transactions never wait for younger transactions; the number of aborts and rollbacks is lower.

Effects of Deadlocks
1. Delayed Transactions: Deadlocks can cause transactions to be delayed, as the
resources they need are being held by other transactions. This can lead to slower
response times and longer wait times for users.
2. Lost Transactions: In some cases, deadlocks can cause transactions to be lost or
aborted, which can result in data inconsistencies or other issues.
3. Reduced Concurrency: Deadlocks can reduce the level of concurrency in the system,
as transactions are blocked waiting for resources to become available. This can lead to
slower transaction processing and reduced overall throughput.
4. Increased Resource Usage: Deadlocks can result in increased resource usage, as
transactions that are blocked waiting for resources to become available continue to
consume system resources. This can lead to performance degradation and increased
resource contention.
5. Reduced User Satisfaction: Deadlocks can lead to a perception of poor system
performance and can reduce user satisfaction with the application. This can have a
negative impact on user adoption and retention.
Features of Deadlock in a DBMS
1. Mutual Exclusion: Each resource can be held by only one transaction at a time, and
other transactions must wait for it to be released.
2. Hold and Wait: Transactions can request resources while holding on to resources
already allocated to them.
3. No Preemption: Resources cannot be taken away from a transaction forcibly, and the
transaction must release them voluntarily.
4. Circular Wait: Transactions are waiting for resources in a circular chain, where each
transaction is waiting for a resource held by the next transaction in the chain.
5. Indefinite Blocking: Transactions are blocked indefinitely, waiting for resources to
become available, and no transaction can proceed.
6. System Stagnation: Deadlock leads to system stagnation, where no transaction can
proceed, and the system is unable to make any progress.
7. Inconsistent Data: Deadlock can lead to inconsistent data if transactions are unable to
complete and leave the database in an intermediate state.
8. Difficult to Detect and Resolve: Deadlock can be difficult to detect and resolve, as it
may involve multiple transactions, resources, and dependencies.
Disadvantages
1. System downtime: Deadlock can cause system downtime, which can result in loss of
productivity and revenue for businesses that rely on the DBMS.
2. Resource waste: When transactions are waiting for resources, these resources are not
being used, leading to wasted resources and decreased system efficiency.
3. Reduced concurrency: Deadlock can lead to a decrease in system concurrency, which
can result in slower transaction processing and reduced throughput.
4. Complex resolution: Resolving deadlock can be a complex and time-consuming
process, requiring system administrators to intervene and manually resolve the
deadlock.
5. Increased system overhead: The mechanisms used to detect and resolve deadlock,
such as timeouts and rollbacks, can increase system overhead, leading to decreased
performance.
Conclusion
Deadlock is like a traffic jam in a database where transactions are stopped because they are
waiting for each other to move. To overcome this, we need to design transactions carefully,
use smart strategies to detect deadlocks, and have a plan to untangle things when deadlocks
occur. By doing this, we can keep our database running smoothly and avoid traffic jams.
In Database Management Systems (DBMS), deadlocks occur when two or more transactions
are waiting for each other to release a resource, leading to an indefinite wait. Deadlocks are a
common issue in concurrency control, especially when multiple transactions try to access the
same data. To avoid this problem, different techniques like Conservative Two-Phase
Locking (2PL) and Graph-Based Protocols are used, but they have some limitations.
A more effective approach is Timestamp-Based Deadlock Prevention. A timestamp (TS) is a
unique identifier assigned to each transaction by the DBMS. It helps determine the execution
order of transactions and ensures that older transactions are given priority, reducing the
chances of deadlocks. Each transaction is assigned a unique timestamp based on the system
clock.
 When a transaction requests access to a data item, the system compares the timestamp
of the transaction with the last transaction that accessed that data.
 If the requesting transaction’s timestamp is older than the last transaction’s
timestamp, the system rolls back the requesting transaction. This ensures that the
transactions execute in a way that preserves their serializability (meaning the result is
the same as if the transactions were executed one by one).
There are different ways to generate timestamps:
1. Using a Counter – A simple number that increases for every new transaction.
2. Using System Clock – Assigning the transaction a timestamp based on the current
date and time.
Timestamp-based scheduling helps in maintaining serializability, ensuring transactions
execute in a consistent order and avoiding conflicts. It is widely used in modern DBMS for
better transaction management and deadlock prevention.
Deadlock Prevention Schemes based on Timestamps
To prevent any deadlock situation in the system, the DBMS aggressively inspects all the
operations, where transactions are about to execute. The DBMS inspects the operations and
analyzes if they can create a deadlock situation. If it finds that a deadlock situation might
occur, then that transaction is never allowed to be executed. There are deadlock prevention
schemes that use timestamp ordering mechanism of transactions in order to predetermine a
deadlock situation.
 Wait-Die: An older transaction is allowed to wait for a younger transaction, whereas a younger transaction requesting an item held by an older transaction is aborted and restarted. That is, if TS(Ti) < TS(Tj) (Ti is older than Tj), Ti is allowed to wait; otherwise (Ti is younger than Tj), Ti is aborted and restarted later with the same timestamp.
 Wound_Wait: It is just the opposite of the Wait_Die technique. Here, a younger
transaction is allowed to wait for an older one, whereas if an older transaction
requests an item held by the younger transaction, we preempt the younger transaction
by aborting it.
That is, if TS(Ti) < TS(Tj) (Ti is older than Tj), Tj is aborted (i.e., Ti wounds Tj) and restarted later with the same timestamp; otherwise (Ti is younger than Tj), Ti is allowed to wait.
Thus, both schemes end up aborting the younger of the two transactions that may be involved in a deadlock, on the assumption that aborting the younger transaction wastes less processing, which is logical. In either case no cycle can arise, because waiting always follows a single direction of the timestamp order.
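To make the two rules concrete, here is a minimal sketch in Python of the decision a lock manager might take when a requesting transaction finds an item held by another transaction; the function names, the string return values, and the example timestamps are illustrative assumptions, not part of any particular DBMS.

# Minimal sketch of the Wait-Die and Wound-Wait decisions.
# A transaction with the smaller timestamp is the older one.

def wait_die(requester_ts, holder_ts):
    """Older requester waits; younger requester dies (is aborted and restarted later)."""
    if requester_ts < holder_ts:          # requester is older
        return "WAIT"
    return "ABORT_REQUESTER"              # restarted later with the same timestamp

def wound_wait(requester_ts, holder_ts):
    """Older requester wounds (aborts) the holder; younger requester waits."""
    if requester_ts < holder_ts:          # requester is older
        return "ABORT_HOLDER"             # holder is restarted with its old timestamp
    return "WAIT"

# Example: T1 (timestamp 5) requests an item held by T2 (timestamp 9).
print(wait_die(5, 9))     # WAIT          (older transaction waits)
print(wound_wait(5, 9))   # ABORT_HOLDER  (older transaction wounds the younger)

In both schemes only the younger transaction of the pair is ever aborted, which is why no wait-for cycle can form.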
Deadlock Prevention Without Timestamps
Some deadlock prevention techniques do not require timestamps. Two such methods are:
1. No-Wait Algorithm
 If a transaction cannot get a lock, it is immediately aborted and restarted after some
time.
 This method completely eliminates waiting, so deadlocks cannot occur.
 However, it is not practical, as transactions may restart too often, leading to
unnecessary delays.
2. Cautious Waiting
 When a transaction Ti tries to lock a resource X, but X is already locked by another
transaction Tj, one of two things happens:
1. If Tj is not waiting for any other resource, Ti is allowed to wait.
2. If Tj is already waiting for another resource, Ti is aborted to prevent a possible
deadlock.
This method reduces unnecessary aborts compared to the No-Wait Algorithm while still
preventing deadlocks.
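As an illustration of the cautious-waiting rule, here is a small Python sketch; the Transaction class and its is_waiting flag are assumed for the example and do not correspond to a real DBMS API.

# Sketch of Cautious Waiting: Ti requests an item X that is locked by Tj.

class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.is_waiting = False   # True while this transaction is blocked on a resource

def cautious_waiting(ti, tj):
    """Decide what happens when ti requests an item locked by tj."""
    if not tj.is_waiting:         # tj is running, not blocked: ti may safely wait
        ti.is_waiting = True
        return "WAIT"
    return "ABORT_TI"             # tj is itself waiting: abort ti to avoid a deadlock

t1, t2 = Transaction(1), Transaction(2)
t2.is_waiting = True              # T2 is already blocked on another resource
print(cautious_waiting(t1, t2))   # ABORT_TI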
Deadlock Detection Using Wait-for Graph
In a database, when a transaction keeps waiting forever to get a lock, the DBMS should check
if the transaction is involved in a deadlock. To do this, the lock manager uses a Wait-for
Graph to detect deadlocks.
Wait-for Graph
 This graph shows transactions and the locks they are waiting for.
 If the graph has a cycle (a loop where transactions are waiting on each other), it
means there is a deadlock.
In simple terms, if the transactions are stuck in a cycle, the system knows that a deadlock has
occurred and can take action to fix it.
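A minimal sketch of deadlock detection on a wait-for graph is shown below; the graph is assumed to be a dictionary mapping each transaction to the transactions it waits for, and the cycle check is an ordinary depth-first search.

# Wait-for graph: an edge Ti -> Tj means "Ti is waiting for a lock held by Tj".
# A cycle in this graph means the transactions on the cycle are deadlocked.

def has_deadlock(wait_for):
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in wait_for.get(node, []):
            if nxt in on_stack:                  # back edge -> cycle -> deadlock
                return True
            if nxt not in visited and dfs(nxt):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(t) for t in wait_for if t not in visited)

# T1 waits for T2, T2 waits for T3, T3 waits for T1: a deadlock cycle.
print(has_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))  # True
print(has_deadlock({"T1": ["T2"], "T2": []}))                    # False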

LC7: Recoverability, Introduction to Log based recovery, check pointing and shadow
paging.
Recoverability is a critical feature of database systems that ensures the database can return to a
consistent and reliable state after a failure or error. It guarantees that the effects of committed
transactions are saved permanently, while uncommitted transactions are rolled back to
maintain data integrity. This process relies on transaction logs, which record all changes made
during transaction processing. These logs enable the system to either undo the changes of
uncommitted transactions or redo the committed ones when a failure occurs.
In database systems, multiple transactions often run simultaneously, with some being
independent and others interdependent. When a dependent transaction fails, it can have a
cascading impact on other transactions. Recoverability in DBMS addresses these challenges
by focusing on minimizing the effects of such failures and ensuring the database remains
consistent.
There are several levels of recoverability that can be supported by a database system:
No-undo logging: This level of recoverability only guarantees that committed transactions
are durable, but does not provide the ability to undo the effects of uncommitted transactions.
Undo logging: This level of recoverability provides the ability to undo the effects of
uncommitted transactions but may result in the loss of updates made by committed
transactions that occur after the failed transaction.
Redo logging: This level of recoverability provides the ability to redo the effects of
committed transactions, ensuring that all committed updates are durable and can be recovered
in the event of failure.
Undo-redo logging: This level of recoverability provides both undo and redo capabilities,
ensuring that the system can recover to a consistent state regardless of whether a transaction
has been committed or not.
In addition to these levels of recoverability, database systems may also use techniques such as
checkpointing and shadow paging to improve recovery performance and reduce the overhead
associated with logging.
Overall, recoverability is a crucial property of database systems, as it ensures that data is
consistent and durable even in the event of failures or errors. It is important for database
administrators to understand the level of recoverability provided by their system and to
configure it appropriately to meet their application’s requirements.

Recoverable Schedules
A recoverable schedule is a type of transaction schedule in a Database Management System
(DBMS) where committed transactions do not violate the rules of consistency, even in the
event of a failure. In other words, a transaction in a recoverable schedule only commits after
all the transactions it depends on have committed. This ensures that the database can maintain
integrity and recover to a consistent state.
Example 1:
Consider the following schedule involving two transactions T1 and T2.
T1          T2
R(A)
W(A)
            W(A)
            R(A)
commit
            commit
This is a recoverable schedule, since T1 commits before T2, which makes the value read by T2 correct.
Example 2:
S1: R1(x), W1(x), R2(x), R1(y), R2(y),W2(x), W1(y), C1, C2;

 T2 reads uncommitted data from T1 (R2(x) depends on W1(x)), but T2 commits only
after T1 commits.
 The schedule is recoverable.
Log based recovery:
Log-based recovery in DBMS ensures data can be maintained or restored in the event of a
system failure. The DBMS records every transaction on stable storage, allowing for easy
data recovery when a failure occurs. For each operation performed on the database, a log
file is created. Transactions are logged and verified before being applied to the database,
ensuring data integrity.
Log in DBMS
A log is a sequence of records that document the operations performed during database
transactions. Logs are stored in a log file for each transaction, providing a mechanism to
recover data in the event of a failure. For every operation executed on the database, a
corresponding log record is created. It is critical to store these logs before the actual
transaction operations are applied to the database, ensuring data integrity and consistency
during recovery processes.
For example, consider a transaction to modify a student’s city. This transaction generates
the following logs:
Start Log: When the transaction begins, a log is created to indicate the start of the
transaction.
Format:<Tn, Start>
 Here, Tn represents the transaction identifier.
 Example: <T1, Start> indicates that Transaction 1 has started.
Operation Log: When the city is updated, a log is recorded to capture the old and new
values of the operation.
Format:<Tn, Attribute, Old_Value, New_Value>
 Example: <T1, City, 'Gorakhpur', 'Noida'> shows that in Transaction 1, the value of
the City attribute has changed from 'Gorakhpur' to 'Noida'.
Commit Log: Once the transaction is successfully completed, a final log is created to
indicate that the transaction has been completed and the changes are now permanent.
Format:<Tn, Commit>
 Example: <T1, Commit> signifies that Transaction 1 has been successfully
completed.
These logs play a crucial role in ensuring that the database can recover to a consistent
state after a system crash. If a failure occurs, the DBMS can use these logs to either roll
back incomplete transactions or redo committed transactions to maintain data consistency.
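As a rough illustration of these record formats, the following Python sketch builds an in-memory log for the City update described above; the tuple layout and helper names are assumptions made for the example, not a real logging API.

# Illustrative in-memory write-ahead log matching the record formats above.

log = []

def log_start(tid):
    log.append(("START", tid))                                    # <Tn, Start>

def log_update(tid, attribute, old_value, new_value):
    log.append(("UPDATE", tid, attribute, old_value, new_value))  # <Tn, Attr, Old, New>

def log_commit(tid):
    log.append(("COMMIT", tid))                                   # <Tn, Commit>

# Transaction T1 changes City from 'Gorakhpur' to 'Noida'.
log_start("T1")
log_update("T1", "City", "Gorakhpur", "Noida")
log_commit("T1")
print(log)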
Key Operations in Log-Based Recovery
Undo Operation
The undo operation reverses the changes made by an uncommitted transaction, restoring
the database to its previous state.
Example of Undo:
Consider a transaction T1 that updates a bank account balance but fails before
committing:
Initial State:
 Account balance = 500.
Transaction T1:
 Update balance to 600.
 Log entry:
<T1, Balance, 500, 600>
Failure:
 T1 fails before committing.
Undo Process:
 Use the old value from the log to revert the change.
 Set balance back to 500.
 Final log entry after undo:
<T1, Abort>
Redo Operation
The redo operation re-applies the changes made by a committed transaction to ensure
consistency in the database.
Example of Redo:
Consider a transaction T2 that updates an account balance but the database crashes before
changes are permanently reflected:
Initial State:
 Account balance = 300.
Transaction T2:
 Update balance to 400.
 Log entries:
<T2, Start><T2, Balance, 300, 400><T2, Commit>
Crash:
 Changes are not reflected in the database.
Redo Process:
 Use the new value from the log to reapply the committed change.
 Set balance to 400.
Undo-Redo Example:
Assume two transactions:
 T1: Failed transaction (requires undo).
 T2: Committed transaction (requires redo).
Log File:
<T1, Start><T1, Balance, 500, 600><T2, Start><T2, Balance, 300, 400><T2,
Commit><T1, Abort>
Recovery Steps:
Identify Committed and Uncommitted Transactions:
 T1: Not committed → Undo.
 T2: Committed → Redo.
Undo T1:
 Revert balance from 600 to 500.
Redo T2:
 Reapply balance change from 300 to 400.
Operation   Trigger                               Action
Undo        For uncommitted/failed transactions   Revert changes using the old values in the log.
Redo        For committed transactions            Reapply changes using the new values in the log.
These operations ensure data consistency and integrity in the event of system failures.
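The two operations can be combined into a simple recovery pass over the log. The sketch below (Python) models the database as a plain dictionary and the log as a list of tuples; both are illustrative assumptions, but the redo-then-undo logic follows the steps described above.

# Simplified undo/redo recovery over a log of (type, tid, attr, old, new) records.
# The dictionary "database" stands in for the real stored data (illustrative only).

def recover(log, database):
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}

    # Redo pass: forward through the log, reapply updates of committed transactions.
    for rec in log:
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, tid, attr, old, new = rec
            database[attr] = new

    # Undo pass: backward through the log, revert updates of uncommitted transactions.
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, tid, attr, old, new = rec
            database[attr] = old
    return database

log = [("START", "T1"), ("UPDATE", "T1", "Balance_A", 500, 600),
       ("START", "T2"), ("UPDATE", "T2", "Balance_B", 300, 400),
       ("COMMIT", "T2")]
print(recover(log, {"Balance_A": 500, "Balance_B": 300}))
# {'Balance_A': 500, 'Balance_B': 400}  -> T1 undone, T2 redone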
Checkpoint:
In DBMS, checkpoints mark points of consistent database state, while shadow paging is a
recovery technique that uses a "shadow copy" of the database to enable crash recovery by
allowing updates to occur in a separate area without affecting the original data.
Checkpoints:
 Purpose:
Checkpoints ensure that in case of a system failure or crash, the database can be
recovered to a consistent state before the checkpoint.
 Mechanism:
At a checkpoint, all modified data in the buffer pool is written to disk, and a special
checkpoint record is added to the transaction log.
 Benefit:
Reduces recovery time by allowing the system to restart from the last checkpoint, rather
than having to replay all transactions from the beginning.
Shadow Paging:
 Purpose:
Shadow paging is a technique for maintaining data consistency and enabling crash
recovery, particularly useful for databases stored on disk.
 Mechanism:
 A "shadow copy" or snapshot of the entire database is created, which serves as
a consistent version for transaction execution.
 During a transaction, updates are made to the shadow copy, leaving the
original data untouched.
 Once the transaction is committed, the shadow copy becomes the new main
copy, ensuring consistency.
 Key Components:
 Page Table: Maps logical addresses to physical addresses of shadow pages.
 Shadow Pages: Duplicate database pages where modifications are made
during transaction execution.
 Transaction Log: Records all modifications made during a transaction,
including both original and shadow pages, enabling recovery in case of a
system failure.
 Advantages:
 Atomicity: Updates are made to the shadow copy, ensuring that if a
transaction fails, the original data remains consistent.
 Simplicity: Shadow paging can be simpler to implement than log-based
recovery in some cases.
 Disadvantages:
 Space Overhead: Requires extra storage space for the shadow copy.
 Performance: Can be slower for transactions that modify a large number of
pages, as it involves copying pages.

Shadow Paging:
Shadow paging is a fundamental recovery technique used in database management systems
(DBMS) to ensure the reliability and consistency of data. It plays a crucial role in maintaining
atomicity and durability which are the two core properties of transaction management.
Unlike log-based recovery mechanisms that rely on recording detailed logs of changes,
shadow paging offers a simpler, log-free approach by maintaining two versions of the
database state: the shadow page table and the current page table. This technique is also
known as out-of-place updating, since changes are written to new pages rather than overwriting the originals.
This technique ensures that a database can recover seamlessly from failures without losing
data integrity. During a transaction, updates are made to a new version of the database pages
tracked by the current page table, while the shadow page table preserves the pre-transaction
state. This dual-table approach allows for efficient crash recovery and simplifies the commit
and rollback processes.
Page Table : A page table is a data structure that maps logical pages (a logical division of
data) to physical pages (actual storage on disk).
 Each entry in the page table corresponds to a physical page location on the disk.
 The database uses the page table to retrieve or modify data.
How Shadow Paging Works ?
Shadow paging is a recovery technique that views the database as a collection of fixed-sized
logical storage units, known as pages, which are mapped to physical storage blocks using a
structure called the page table. The page table enables the system to efficiently locate and
manage database pages.
Here’s how shadow paging works in detail:
Start of Transaction:
 The shadow page table is created by copying the current page table.
 The shadow page table represents the original, unmodified state of the database.
 This table is saved to disk and remains unchanged throughout the transaction.
Logical Page   Shadow Page Table (Disk)   Current Page Table
P1             Address_1                  Address_1
P2             Address_2                  Address_2
P3             Address_3                  Address_3

Transaction Execution:
 Updates are made to the database by creating new pages.
 The current page table reflects these changes, while the shadow page table remains
unchanged.
Page Modification:
If a logical page (e.g. P2) needs to be updated:
 A new version of the page (P2’) is created in memory and written to a new physical
storage block.
 The current page table entry for P2 is updated to point to P2’.
 The shadow page table still points to the original page P2, ensuring it is unaffected by
the changes.
Logical Page   Shadow Page Table (Disk)   Current Page Table
P1             Address_1                  Address_1
P2             Address_2                  Address_4 (P2′)
P3             Address_3                  Address_3

Commit:
 If the transaction is successful, the shadow page table is replaced by the current page
table.
 This replacement makes the changes permanent.
Logical Page   Shadow Page Table (Disk)   Current Page Table
P1             Address_1                  Address_1
P2             Address_4 (P2′)            Address_4 (P2′)
P3             Address_3                  Address_3

Abort:
 If the transaction is aborted, the current page table is discarded, leaving the shadow
page table intact.
 Since the shadow page table still points to the original pages, no changes are reflected
in the database.
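The page-table bookkeeping described above can be summarised in a short Python sketch; the dictionaries standing in for the disk and the two page tables are assumptions made for illustration, and the point is only the copy-then-swap behaviour at commit.

# Sketch of shadow paging: updates go to new pages via the current page table;
# the shadow page table keeps pointing at the original pages until commit.

disk = {"Address_1": "data1", "Address_2": "data2", "Address_3": "data3"}
shadow_table = {"P1": "Address_1", "P2": "Address_2", "P3": "Address_3"}
current_table = dict(shadow_table)             # copied at the start of the transaction

def update_page(logical_page, new_data, new_address):
    disk[new_address] = new_data               # write a new physical page (e.g. P2')
    current_table[logical_page] = new_address  # only the current page table is changed

def commit():
    global shadow_table
    shadow_table = dict(current_table)         # current table becomes the permanent one

def abort():
    global current_table
    current_table = dict(shadow_table)         # discard changes; shadow table is intact

update_page("P2", "data2_new", "Address_4")
commit()
print(shadow_table["P2"])   # Address_4 -> the update is now permanent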
LC8: ARIES algorithm:
Algorithms for Recovery and Isolation Exploiting Semantics, or ARIES, is a recovery algorithm designed to work with a no-force, steal database approach; it is used by IBM Db2, Microsoft SQL Server and many other database systems. IBM Fellow Chandrasekaran Mohan is the primary inventor of the ARIES family of algorithms.
Three main principles lie behind ARIES:
 Write-ahead logging: Any change to an object is first recorded in the log, and the log
must be written to stable storage before changes to the object are written to disk.
 Repeating history during Redo: On restart after a crash, ARIES retraces the actions of
a database before the crash and brings the system back to the exact state that it was in
before the crash. Then it undoes the transactions still active at crash time.
 Logging changes during Undo: Changes made to the database while undoing
transactions are logged to ensure such an action isn't repeated in the event of repeated
restarts.
The ARIES algorithm relies on logging of all database operations with ascending
Sequence Numbers. Usually the resulting logfile is stored on so-called "stable storage",
that is a storage medium that is assumed to survive crashes and hardware failures.
To gather the necessary information for the logs, two data structures have to be
maintained: the dirty page table (DPT) and the transaction table (TT).
The dirty page table keeps record of all the pages that have been modified, and not yet
written to disk, and the first Sequence Number that caused that page to become dirty. The
transaction table contains all currently running transactions and the Sequence Number of
the last log entry they created.
We create log records of the form (Sequence Number, Transaction ID, Page ID, Redo,
Undo, Previous Sequence Number). The Redo and Undo fields keep information about
the changes this log record saves and how to undo them. The Previous Sequence Number
is a reference to the previous log record that was created for this transaction. In the case
of an aborted transaction, it's possible to traverse the log file in reverse order using the
Previous Sequence Numbers, undoing all actions taken within the specific transaction.
Every transaction implicitly begins with the first "Update" type of entry for the given Transaction ID, and is committed with an "End of Log" (EOL) entry for that transaction.
During a recovery, or while undoing the actions of an aborted transaction, a special kind
of log record is written, the Compensation Log Record (CLR), to record that the action
has already been undone. CLRs are of the form (Sequence Number, Transaction ID, Page
ID, Redo, Previous Sequence Number, Next Undo Sequence Number). The Redo field
contains application of Undo field of reverted action, and the Undo field is omitted
because CLR is never reverted.
Recovery
The recovery works in three phases. The first phase, Analysis, computes all the necessary
information from the logfile. The Redo phase restores the database to the exact state at the
crash, including all the changes of uncommitted transactions that were running at that
point in time. The Undo phase then undoes all uncommitted changes, leaving the database
in a consistent state.
Analysis
During the Analysis phase we restore the DPT and the TT as they were at the time of the
crash.
We run through the logfile (from the beginning or the last checkpoint) and add all
transactions for which we encounter Begin Transaction entries to the TT. Whenever an
End Log entry is found, the corresponding transaction is removed. The last Sequence
Number for each transaction is also maintained.
During the same run we also fill the dirty page table by adding a new entry whenever we
encounter a page that is modified and not yet in the DPT. This however only computes a
superset of all dirty pages at the time of the crash, since we don't check the actual
database file whether the page was written back to the storage.
Redo
From the DPT, we can compute the minimal Sequence Number of a dirty page. From
there, we have to start redoing the actions until the crash, in case they weren't persisted
already.
Running through the log file, we check for each entry, whether the modified page P on the
entry exists in the DPT. If it doesn't, then we do not have to worry about redoing this
entry since the data persists on the disk. If page P exists in the DPT table, then we see
whether the Sequence Number in the DPT is smaller than the Sequence Number of the
log record (i.e. whether the change in the log is newer than the last version that was
persisted). If it isn't, then we don't redo the entry since the change is already there. If it is,
we fetch the page from the database storage and compare the Sequence Number stored on the page with the Sequence Number of the log record. If the former is smaller than the latter, the change still needs to be applied and written to the disk. That check is necessary because the recovered DPT is only a conservative superset of the pages that really need changes to be reapplied. Lastly, if all the above checks indicate that the change has not yet been persisted, we reapply the redo action and store the new Sequence Number on the page. Storing the Sequence Number on the page also matters for recovery from a crash during the Redo phase itself, as it ensures the redo isn't applied twice to the same page.
Undo
After the Redo phase, the database reflects the exact state at the crash. However the
changes of uncommitted transactions have to be undone to restore the database to a
consistent state.
For that we run backwards through the log for each transaction in the TT (those runs can
of course be combined into one) using the Previous Sequence Number fields in the
records. For each record we undo the changes (using the information in the Undo field)
and write a compensation log record to the log file. If we encounter a Begin Transaction
record we write an End Log record for that transaction.
The compensation log records make it possible to recover during a crash that occurs
during the recovery phase. That isn't as uncommon as one might think, as it is possible for
the recovery phase to take quite long. CLRs are read during the Analysis phase and
redone during the Redo phase.
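The redo decision described above can be expressed as a short sketch. The Python below uses an illustrative Page class, a dictionary for the dirty page table, and dictionary log records; these structures are assumptions for the example, not ARIES data structures from any specific system.

# Sketch of the ARIES redo-phase decision for a single log record.

class Page:
    def __init__(self, page_id, page_lsn):
        self.page_id, self.page_lsn, self.data = page_id, page_lsn, {}

def should_redo(rec, dpt, pages):
    """Return True if the update in `rec` must be reapplied to its page."""
    pid, lsn = rec["page_id"], rec["lsn"]
    if pid not in dpt:                 # page not dirty at the crash: change already on disk
        return False
    if lsn < dpt[pid]:                 # change is older than the first dirtying update
        return False
    return pages[pid].page_lsn < lsn   # redo only if the stored page predates this record

# One dirty page P1: first dirtied at sequence number 10, on-disk version reflects 9.
pages = {"P1": Page("P1", page_lsn=9)}
dpt = {"P1": 10}
log = [{"type": "UPDATE", "lsn": 10, "page_id": "P1", "redo": ("x", 42)}]

for rec in log:                        # forward scan: reapply and stamp the page LSN
    if rec["type"] in ("UPDATE", "CLR") and should_redo(rec, dpt, pages):
        key, value = rec["redo"]
        pages[rec["page_id"]].data[key] = value
        pages[rec["page_id"]].page_lsn = rec["lsn"]   # prevents redoing it twice

print(pages["P1"].data, pages["P1"].page_lsn)   # {'x': 42} 10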
WAL (Write-Ahead Logging):
The Write-Ahead Logging (WAL) protocol is a technique used in databases and other
systems to ensure data is never lost in case of a crash. It does this by always logging or
writing changes to a secure place before making the changes effective.
Write-ahead logging is a sophisticated solution to the problem of file system inconsistency in operating systems. Inspired by database management systems, this method first writes a summary of the actions to be performed into a "log" before actually writing them to the disk, hence the name "write-ahead logging". In the case of a crash, the OS can simply check this log and pick up from where it left off. This saves multiple disk scans to fix inconsistencies, as is the case with FSCK. Good examples of systems that implement data journaling include the Linux ext3 and ext4 file systems, and Windows NTFS.
Data Journaling: The log is stored in a simple data structure called the journal, whose structure comprises three components.

1. TxB (Transaction Begin Block): This contains the transaction ID, or the TID.
2. Inode, Bitmap and Data Blocks (Metadata): These three blocks contain a copy of
the contents of the blocks to be updated in the disk.
3. TxE (Transaction End Block) This simply marks the end of the transaction
identified by the TID.
As soon as an update is requested, it is written onto the log, and thereafter onto the file system. Once all these writes are successful, we can say that we have reached the checkpoint and the update is complete. What if a crash occurs during journaling? One could argue that journaling itself is not atomic, so how does the system handle an un-checkpointed write? To handle this scenario, journaling happens in two steps: a simultaneous write of TxB and the following three blocks, and then the write of TxE. The overall process can be summarized as follows.
1. Journal Write: Write TxB, inode, bitmap and data block contents to the journal (log).
2. Journal Commit: Write TxE to the journal (log).
3. Checkpoint: Write the contents of the inode, bitmap and data block onto the disk.
A crash may occur at different points during the process of journaling. If a crash occurs at
step 1, i.e. before the TxE, we can simply skip this transaction altogether and the file system
stays consistent. If a crash occurs at step 2, it means that although the transaction has been
logged, it hasn’t been written onto the disk completely. We cannot be sure which of the three
blocks (inode, bitmap and data block) were actually updated and which ones suffered a crash.
In this case, the system scans the log for recent transactions, and performs the last transaction
again. This does lead to redundant disk writes, but ensures consistency. This process is
called redo logging. Using the Journal as a Circular Buffer: Since many transactions are
made, the journal log might get used up. To address this issue, we can use the journal log as a
circular buffer wherein newer transactions keep replacing the old ones in a circular manner.
The figure below shows an overall view of the journal, with tr1 as the oldest transaction and
tr5 the newest.

The super block maintains pointers to the oldest and the newest transactions. As soon as the
transaction is complete, it is marked as “free” and the super block is updated to the next
transaction.
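The journal write, journal commit, and checkpoint steps, together with redo logging on recovery, can be sketched as follows; the Python lists and dictionaries standing in for the journal and the disk are assumptions made purely for illustration.

# Sketch of data journaling (write-ahead logging) with redo on recovery.
# "journal" and "disk" are plain in-memory structures standing in for real storage.

journal, disk = [], {}

def journaled_update(tid, blocks):
    journal.append(("TxB", tid))                  # 1. journal write: TxB + block copies
    for addr, data in blocks.items():
        journal.append(("BLOCK", tid, addr, data))
    journal.append(("TxE", tid))                  # 2. journal commit: TxE makes it durable
    for addr, data in blocks.items():             # 3. checkpoint: write blocks to the disk
        disk[addr] = data

def recover():
    """Redo logging: replay only transactions whose TxE made it into the journal."""
    committed = {rec[1] for rec in journal if rec[0] == "TxE"}
    for rec in journal:
        if rec[0] == "BLOCK" and rec[1] in committed:
            _, tid, addr, data = rec
            disk[addr] = data                     # reapplying is safe even if already done

journaled_update("tr1", {"inode_7": "new inode", "data_42": "new data"})
recover()
print(disk)   # {'inode_7': 'new inode', 'data_42': 'new data'}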
The benefits of journaling, or write-ahead logging, in file systems, are as follows:
 Improved Recovery Time: It ensures quick recovery after a crash, since all actions are logged before being written to disk and can be examined during recovery. This eliminates the need to perform lengthy disk scans or consistency checks.
 Enhanced Data Integrity: It ensures data integrity by maintaining the consistency of
the file system. By writing the actions to the journal before committing them to the
disk, the system can ensure that updates are complete and recoverable. In case of a
crash, the system can recover by referring to the journal and redoing any incomplete
transactions.
 Reduced Disk Scans: It minimizes the need for full disk scans to fix file system
inconsistencies. Instead of scanning the entire disk to identify and repair
inconsistencies, the system can rely on the journal to determine the state of the file
system and apply the necessary changes. This leads to faster recovery and reduced
overhead.
NOTE:
B+ tree insertion and deletion examples: go through the notes and practice well.
Serializability and testing for serializability examples: go through the notes and practice well.
Topics to be read thoroughly:
Dirty read, anomalies of concurrent execution, serializability and testing, B+ tree insertion and deletion, characteristics of B+ trees, recoverability, write-ahead logging, ARIES algorithm, and all the learning concepts.
Unit 5: all the learning concepts should be read thoroughly; don't leave out any single topic.
Units 3 and 4: all topics should also be prepared thoroughly.
All the best