UNIT-V Final
1. Active State-
This is the first state in the life cycle of a transaction.
A transaction is said to be in an active state as long as its instructions are
being executed.
All the changes made by the transaction now are stored in the buffer in
main memory.
2. Partially Committed State-
After the last instruction of the transaction has executed, it enters
a partially committed state.
After entering this state, the transaction is considered to be partially
committed.
ACID Properties-
It is important to ensure that the database remains consistent before and
after the transaction.
To ensure the consistency of the database, certain properties are followed by
all the transactions occurring in the system.
These properties are called the ACID Properties of a transaction.
1. Atomicity-
This property ensures that either the transaction occurs completely or it does not
occur at all.
In other words, it ensures that no transaction occurs partially.
That is why, it is also referred to as “All or nothing rule“.
It is the responsibility of the Transaction Control Manager to ensure atomicity of the
transactions.
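The "all or nothing" rule can be sketched in a few lines: changes are staged in a buffer and applied only if every step succeeds. This is an illustrative toy, not a real DBMS implementation; the function and account names are assumptions.

```python
# A minimal sketch of atomicity: stage changes in a buffer (a copy of the
# database state) and apply them all together, or not at all.
def transfer(accounts, src, dst, amount):
    buffer = dict(accounts)          # stage changes in a copy (the "buffer")
    buffer[src] -= amount
    if buffer[src] < 0:              # any failed step aborts the transaction
        return accounts              # rollback: original state is untouched
    buffer[dst] += amount
    return buffer                    # commit: all changes applied together

state = {"A": 100, "B": 50}
state = transfer(state, "A", "B", 30)   # succeeds: A = 70, B = 80
state = transfer(state, "A", "B", 500)  # fails: state remains unchanged
```

Note that the failed transfer leaves no partial effect behind: the debit from A is never visible without the matching credit to B.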
2. Consistency-
This property ensures that integrity constraints are maintained.
Concurrent Executions:
• In a multi-user system, multiple users can access and use the same database at
the same time, which is known as concurrent execution of the database. It means
that transactions from different users run against the same database
simultaneously. While working with database transactions, multiple users often
need to use the database to perform different operations, and in that case
concurrent execution of the transactions takes place.
1. Dirty Read Problem (W-R Conflict):
Here:
1. T1 reads the value of A.
2. T1 updates the value of A in the buffer.
3. T2 reads the value of A from the buffer.
4. T2 writes the updated value of A.
5. T2 commits.
6. T1 fails in later stages and rolls back.
Example:
1. T2 reads the dirty value of A written by the uncommitted transaction
T1.
2. T1 fails in later stages and rolls back.
3. Thus, the value that T2 read now stands to be incorrect.
4. Therefore, database becomes inconsistent.
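The dirty read above can be simulated in a few lines. This is a toy illustration, not real concurrency control; the shared "buffer" dictionary stands in for the uncommitted state in main memory.

```python
# Simulating a dirty read: T2 reads A from the shared buffer before T1
# commits; when T1 rolls back, T2 is left holding an incorrect value.
database = {"A": 10}
buffer = dict(database)

buffer["A"] = 15                 # T1 updates A in the buffer (uncommitted)
t2_value = buffer["A"]           # T2 reads the dirty value 15
buffer = dict(database)          # T1 fails and its buffered change is rolled back

print(t2_value, database["A"])   # T2 holds 15, but the database still says 10
```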
2. Unrepeatable Read Problem (R-W Conflict):
Here:
T1 reads the value of X (= 10 say).
T2 reads the value of X (= 10).
T1 updates the value of X (from 10 to 15 say) in the buffer.
T2 again reads the value of X (but now = 15).
Example:
T2 gets to read a different value of X in its second reading.
T2 wonders how the value of X got changed because according to it, it
is running in isolation.
3. Lost Update Problem (W-W Conflict):
This problem occurs when multiple transactions execute concurrently
and updates from one or more transactions get lost.
Here:
T1 reads the value of A (= 10 say).
T1 updates the value of A (= 15 say) in the buffer.
T2 does a blind write A = 25 (write without read) in the buffer.
T2 commits.
When T1 commits, it writes A = 15 in the database.
Example:
T1 overwrites, in the database, the value of A that T2 wrote.
Thus, the update from T2 gets lost.
NOTE-
This problem occurs whenever there is a write-write conflict.
In write-write conflict, there are two writes one by each transaction on the same
data item without any read in the middle.
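The write-write conflict can be simulated directly: two writes on the same item with no read in between, where whichever transaction commits last silently wins. A toy sketch, not a real DBMS:

```python
# Simulating a lost update: T2 blind-writes and commits first, then T1's
# later commit overwrites T2's value, so T2's update is lost.
database = {"A": 10}

t1_write = 15                    # T1 stages A = 15 in its buffer
t2_write = 25                    # T2 blind-writes A = 25
database["A"] = t2_write         # T2 commits first
database["A"] = t1_write         # T1 commits later: T2's update is lost

print(database["A"])             # 15 -- the value 25 has disappeared
```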
4. Phantom Read Problem (R-W Conflict):
Here:
T1 reads X.
T2 reads X.
T1 deletes X.
T2 tries reading X but does not find it.
Example:
T2 finds that there does not exist any variable X when it tries reading
X again.
T2 wonders who deleted the variable X because according to it, it is
running in isolation.
Avoiding Concurrency Problems-
To ensure consistency of the database, it is very important to prevent
the occurrence of above problems.
SERIALIZABILITY:
In this schedule,
W1(A) and R2(A) are called conflicting operations.
This is because all the above conditions hold true for them.
Solution-
We know, if a schedule is conflict serializable, then it is surely view serializable.
So, let us check whether the given schedule is conflict serializable or not.
Checking Whether S is Conflict Serializable Or Not-
Step-01:
List all the conflicting operations and determine the dependency between the
transactions-
W1(B) , W2(B) (T1 → T2)
W1(B) , W3(B) (T1 → T3)
W1(B) , W4(B) (T1 → T4)
W2(B) , W3(B) (T2 → T3)
W2(B) , W4(B) (T2 → T4)
W3(B) , W4(B) (T3 → T4)
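The dependencies listed above form a precedence graph, and the schedule is conflict serializable exactly when that graph has no cycle. The sketch below builds the graph from the edges and runs a standard depth-first cycle check; it is a generic illustration, not tied to any particular DBMS.

```python
# Check a precedence graph for cycles: no cycle => conflict serializable.
def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visited, on_stack = set(), set()
    def dfs(node):
        visited.add(node); on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True              # back edge found: cycle
        on_stack.discard(node)
        return False
    return any(dfs(n) for n in graph if n not in visited)

# The dependencies from the conflicting operations above:
edges = [("T1", "T2"), ("T1", "T3"), ("T1", "T4"),
         ("T2", "T3"), ("T2", "T4"), ("T3", "T4")]
print(has_cycle(edges))   # False -> the schedule is conflict serializable
```

The serialization order can then be read off by topologically sorting the acyclic graph (here T1 → T2 → T3 → T4).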
1. Cascading Schedule
2. Cascadeless Schedule
3. Strict Schedule
CASCADING SCHEDULE-
A.RAMESH DEPT OF CSE RGMCET 19
If in a schedule, failure of one transaction causes several other
dependent transactions to rollback or abort, then such a schedule is
called as a Cascading Schedule or Cascading Rollback or Cascading
Abort.
It simply leads to the wastage of CPU time.
Example-
Here,
Transaction T2 depends on transaction T1.
Transaction T3 depends on transaction T2.
Transaction T4 depends on transaction T3.
In this schedule,
The failure of transaction T1 causes transaction T2 to roll back, which in
turn causes T3 and T4 to roll back as well.
Cascadeless Schedule-
If in a schedule, a transaction is not allowed to read a data item until the last
transaction that has written it is committed or aborted, then such a schedule
is called as a Cascadeless Schedule.
In other words,
Cascadeless schedule allows only committed read operations.
Therefore, it avoids cascading roll back and thus saves CPU time.
Example-
Implementation of Isolation :
• Isolation is one of the core ACID properties of a database transaction, ensuring
that the operations of one transaction remain hidden from other transactions until
completion. It means that no two transactions should interfere with each other
and affect the other's intermediate state.
• Isolation Levels
Ti -> Tj means that transaction Ti performs its read or write on a data item
before transaction Tj does.
NOTE: If there is a cycle present in the precedence graph, then the schedule is non-
serializable, because the cycle indicates that one transaction is dependent on the
other transaction and vice versa. It also means that there are one or more conflicting
pairs of operations in the transactions. On the other hand, no cycle means that the
non-serial schedule is serializable.
3. Reduced Costs:
Serialization in DBMS can help reduce hardware costs by allowing fewer resources
to be used for a given computation (e.g., only one CPU instead of two). Additionally,
it can help reduce software development costs by making it easier to reason about
code and reducing the need for extensive testing with multiple threads running
concurrently.
4. Increased Performance:
In some cases, serializable executions can perform better than their non-serializable
counterparts since they allow the developer to optimize their code for performance.
Lock-Based Protocol:
Why Do We Need Locks?
Locks are essential in a database system to ensure:
Consistency: Without locks, multiple transactions could modify the same data
item simultaneously, resulting in an inconsistent state.
Isolation: Locks ensure that the operations of one transaction are isolated from
other transactions, i.e., they are invisible to other transactions until the transaction
is committed.
Concurrency: While ensuring consistency and isolation, locks also allow
multiple transactions to be processed simultaneously by the system, optimizing
system throughput and overall performance.
Avoiding Conflicts: Locks help in avoiding data conflicts that might arise due
to simultaneous read and write operations by different transactions on the same
data item.
Preventing Dirty Reads: With the help of locks, a transaction is prevented from
reading data that hasn't yet been committed by another transaction.
Types of Locks:
There are two types of locks used -
Shared Lock (S-lock)
This lock allows a transaction to read a data item. Multiple transactions can hold
shared locks on the same data item simultaneously. It is denoted by 'S'. This is
also called a read lock.
Exclusive Lock (X-lock)
This lock allows a transaction to both read and write a data item. While a
transaction holds an exclusive lock on a data item, no other transaction can hold
any lock on it. It is denoted by 'X'. This is also called a write lock.
The two methods outlined below can be used to convert between the locks:
1. Conversion from a read lock to a write lock is an upgrade.
2. Conversion from a write lock to a read lock is a downgrade.
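The compatibility and upgrade rules can be sketched as a toy lock table: many transactions may share an S-lock, an X-lock excludes everyone else, and an upgrade succeeds only when the requester is the sole S-holder. This is an illustrative sketch (class and method names are assumptions), not a real lock manager with waiting or deadlock handling.

```python
# A toy lock table illustrating S/X compatibility and lock upgrade.
class LockTable:
    def __init__(self):
        self.locks = {}   # item -> (mode, set of holder transaction ids)

    def acquire(self, txn, item, mode):
        held = self.locks.get(item)
        if held is None:                          # item unlocked: grant
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":      # shared with shared: ok
            holders.add(txn)
            return True
        if mode == "X" and holders == {txn}:      # upgrade: sole holder only
            self.locks[item] = ("X", {txn})
            return True
        return False                              # conflict: caller must wait

lt = LockTable()
print(lt.acquire("T1", "A", "S"))  # True
print(lt.acquire("T2", "A", "S"))  # True  (S-locks are compatible)
print(lt.acquire("T1", "A", "X"))  # False (T2 also holds an S-lock)
```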
There are four types of lock protocols available:
1. Simplistic lock protocol:
It is the simplest way of locking the data during a transaction. Simplistic
lock-based protocols allow all the transactions to get the lock on the
data before an insert, delete or update on it. The data item is unlocked
after the transaction completes.
2. Pre-claiming Lock Protocol:
Pre-claiming Lock Protocols evaluate the transaction to list all the data items
on which they need locks.
Before initiating an execution of the transaction, it requests DBMS for all the
locks on all those data items.
If all the locks are granted then this protocol allows the transaction to begin.
When the transaction is completed then it releases all the locks.
If all the locks are not granted, then the transaction rolls back and waits
until all the locks can be granted.
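The all-or-nothing lock acquisition of pre-claiming can be sketched as follows. The function name and lock-set representation are illustrative assumptions; the point is that the transaction either gets every lock it listed or takes none of them.

```python
# Sketch of pre-claiming: request all needed locks up front; run only if
# every lock is granted, otherwise take none and wait.
def preclaim(locked_items, needed):
    if any(item in locked_items for item in needed):
        return False                    # some lock unavailable: take nothing
    locked_items.update(needed)         # all locks granted atomically
    return True

locked = {"B"}                          # item B is already locked elsewhere
print(preclaim(locked, ["A", "C"]))     # True  -- both locks granted
print(preclaim(locked, ["A", "D"]))     # False -- A is now held
print("D" in locked)                    # False -- no partial acquisition
```

Because a transaction never starts while holding only some of its locks, pre-claiming avoids deadlock, at the cost of knowing the full lock set in advance.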
Limitations of 2PL:
Deadlock- It is a situation where transactions (T1, T2, etc) are waiting indefinitely
for locks held by each other. A circular dependency will be created in a situation
where the transactions hold locks on resources and wait for other resources that are
held by different transactions, so this needs to be avoided.
Cascading rollback- It is a phenomenon in which a single transaction failure leads
to a series of transaction rollbacks.
Locking overhead- The overhead associated with acquiring and releasing locks
can impact the overall system performance in times when there is a need for high
conflict for resources in the transaction.
Advantages of Two-Phase Locking (2PL):
● Ensures Serializability: 2PL guarantees conflict-serializability, ensuring the
consistency of the database.
● Concurrency: By allowing multiple transactions to acquire locks and release
them, 2PL increases the concurrency level, leading to better system throughput and
overall performance.
Starvation
Starvation is the situation when a transaction needs to wait for an indefinite period
to acquire a lock.
Following are the reasons for Starvation:
● When waiting scheme for locked items is not properly managed
● In the case of resource leak
● The same transaction is selected as a victim repeatedly
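A common remedy for the first cause is a first-come-first-served wait queue: lock requests are granted strictly in arrival order, so no transaction can be bypassed forever. A simplified sketch of the idea:

```python
# FIFO granting of queued lock requests avoids starvation: the oldest
# waiter is always served first, so no request is skipped repeatedly.
from collections import deque

wait_queue = deque(["T3", "T1", "T2"])   # arrival order of lock requests

granted = []
while wait_queue:
    granted.append(wait_queue.popleft()) # always serve the oldest waiter

print(granted)   # ['T3', 'T1', 'T2']
```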
Primary Storage:
It is the primary storage area that offers quick access to the stored data. We
also know primary storage as volatile storage, because this type of memory does
not store data permanently. As soon as the system suffers a power cut or a crash,
the data is lost. Main memory and cache are types of primary storage.
1. Cache: It is one of the costliest storage media, but also the fastest. A cache
is a tiny storage medium usually maintained by the computer hardware. While
designing algorithms and query processors for data structures, designers take
cache effects into account.
Secondary Storage:
Secondary storage is also called as Online storage. It is the storage area that allows
the user to save and store data permanently. This type of memory does not lose the
data due to any power failure or system crash. That's why we also call it non-
volatile storage. For example: magnetic disks, optical disks (DVD, CD, etc.), hard
disks, flash drives, and magnetic tapes.
Flash Memory: Flash memory stores data in USB (Universal Serial Bus) keys which
are plugged into the USB slots of a computer system. These USB keys help transfer
data to a computer system, and they come in various size limits. Unlike main
memory, the stored data survives a power cut or system crash. This type of memory
storage is most commonly used in server systems for caching frequently used data,
which leads the systems towards high performance, and it is capable of storing
larger amounts of data than main memory.
Magnetic Disk Storage: This type of storage media is also known as online
storage media. A magnetic disk is used for storing the data for a long time. It is
capable of storing an entire database. It is the responsibility of the computer
system to make the data on a disk available in main memory for further access.
Also, if the system performs any operation over the data, the modified data
should be written back to the disk. The tremendous capacity of a magnetic disk
makes it well suited for storing large databases at low cost.
1. SEQUENTIAL FILE ORGANIZATION:
Pile File Method - Insertion of the new record: Let R1, R3, and so on up to R5
and R4 be four records in the sequence. Here, records are nothing but rows in a
table. Suppose a new record R2 has to be inserted in the sequence; then it is
simply placed at the end of the file.
Sorted File Method - Insertion of the new record: Let us assume that there is a
pre-existing sorted sequence of four records R1, R3, and so on up to R7 and R8.
Suppose a new record R2 has to be inserted in the sequence; then it will be
inserted at the end of the file and the sequence will then be re-sorted.
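The two insertion strategies just described, append-at-end versus append-then-sort, can be shown side by side. Record names follow the examples above; this is only an in-memory illustration of the file layouts.

```python
# Pile method: the new record is simply placed at the end of the file.
pile = ["R1", "R3", "R5", "R4"]
pile.append("R2")

# Sorted method: the new record is appended, then the file is re-sorted.
sorted_file = ["R1", "R3", "R7", "R8"]
sorted_file.append("R2")
sorted_file.sort()

print(pile)          # ['R1', 'R3', 'R5', 'R4', 'R2']
print(sorted_file)   # ['R1', 'R2', 'R3', 'R7', 'R8']
```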
2. HEAP FILE ORGANIZATION:
Insertion of the new record: Suppose we have four records in the heap, R1, R5,
R6, R4, and R3, and a new record R2 has to be inserted into the heap. Since the
last data block, i.e. data block 3, is full, R2 will be inserted into whichever
data block the DBMS selects, let's say data block 1.
If we want to search, delete, or update data in heap file organization, we have
to traverse the data from the beginning of the file till we get the requested
record. Thus, if the database is very huge, searching, deleting, or updating a
record will take a lot of time.
Advantages of Heap File Organization:
Fetching and retrieving records is faster than with sequential records, but only
in the case of small databases.
When a huge amount of data needs to be loaded into the database at one time,
this method of file organization is best suited.
Disadvantages of Heap File Organization:
The problem of unused memory blocks.
Inefficient for larger databases.
3.HASH FILE ORGANIZATION:
Hash File Organization uses the computation of hash function on some fields of the
records. The hash function's output determines the location of disk block where the
records are to be placed.
When a record has to be retrieved using the hash key columns, the address is
generated, and the whole record is retrieved using that address. In the same way,
when a new record has to be inserted, then the address is generated using the hash
key and record is directly inserted. The same process is applied in the case of delete
and update.
Advantages of B+ Tree index files:
o In this method, searching becomes very easy as all the records are stored only
in the leaf nodes and sorted in a sequential linked list.
o Traversing through the tree structure is easier and faster.
o The size of the B+ tree has no restrictions, so the number of records can increase
or decrease and the B+ tree structure can also grow or shrink.
o It is a balanced tree structure, and any insert/update/delete does not affect the
performance of tree.
4. INDEXED SEQUENTIAL ACCESS METHOD (ISAM):
If any record has to be retrieved based on its index value, then the address of
the data block is fetched and the record is retrieved from memory.
Advantages:
o In this method, each record has the address of its data block, so searching a
record even in a huge database is quick and easy.
Disadvantages:
o This method requires extra space in the disk to store the index values.
o When new records are inserted, these files have to be
reconstructed to maintain the sequence.
o When a record is deleted, the space used by it needs to be
released. Otherwise, the performance of the database will slow down.
RAID:
RAID or Redundant Array of Independent Disks, is a technology to connect multiple
secondary storage devices and use them as a single storage media.
RAID consists of an array of disks in which multiple disks are connected together to
achieve different goals. RAID levels define the use of disk arrays.
RAID 0
In this level, a striped array of disks is implemented. The data is broken down into
blocks and the blocks are distributed among disks. Each disk receives a block of data
to write/read in parallel. It enhances the speed and performance of the storage device.
There is no parity and backup in Level 0.
RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it
sends a copy of data to all the disks in the array. RAID level 1 is also
called mirroring and provides 100% redundancy in case of a failure.
RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for each
data word is stored on a different disk. This technique makes it possible to
recover from single disk failures.
RAID 4
In this level, an entire block of data is written onto data disks and then the parity
is generated and stored on a different disk. Note that level 3 uses byte-level
striping, whereas level 4 uses block-level striping. Both level 3 and level 4 require
at least three disks to implement RAID.
RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are
generated and stored in distributed fashion among multiple disks. Two parities
provide additional fault tolerance. This level requires at least four disk drives to
implement RAID.
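The parity used by levels 3 to 6 is simply the XOR of the data blocks, which is why any single missing block can be rebuilt from the survivors. The block values below are illustrative:

```python
# Parity-based recovery: parity = XOR of all data blocks, so a lost block
# equals the XOR of the remaining blocks and the parity.
from functools import reduce

disks = [0b1010, 0b0110, 0b1100]               # data blocks on three disks
parity = reduce(lambda a, b: a ^ b, disks)     # stored on the parity disk

# Suppose disk 1 fails: rebuild its block from the other disks plus parity.
rebuilt = disks[0] ^ disks[2] ^ parity
print(rebuilt == disks[1])   # True
```

With two independent parities, as in RAID 6, the same idea tolerates two simultaneous disk failures.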
Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+
tree is of the order n where n is fixed for every B+ tree.
o It contains an internal node and leaf node.
o An internal node of the B+ tree can contain at least n/2 child pointers,
except the root node.
o At most, an internal node of the tree contains n pointers.
Leaf node
o The leaf node of the B+ tree can contain at least n/2 record pointers and n/2
key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to the
next leaf node.
Suppose we have to search for 55 in the below B+ tree structure. First, we will
look in the intermediary node, which will direct us to the leaf node that can
contain the record for 55.
So, in the intermediary node, we will find the branch between the 50 and 75
nodes. Then, at the end, we will be redirected to the third leaf node. Here the
DBMS will perform a sequential search to find 55.
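The search path just described can be sketched on a toy two-level structure: route by the separator keys in the intermediary node, then scan the chosen leaf sequentially. The key values below are illustrative, not the exact tree from the figure.

```python
# A toy two-level "B+ tree" search: separators route the search to a leaf,
# then the leaf is scanned sequentially.
intermediary = [50, 75]                      # separator keys in the root
leaves = [[10, 20, 35], [50, 55, 65], [75, 80, 90]]

def bplus_search(key):
    child = 0
    for sep in intermediary:                 # pick the branch: key < 50,
        if key >= sep:                       # 50 <= key < 75, or key >= 75
            child += 1
    return key in leaves[child]              # sequential search in the leaf

print(bplus_search(55))   # True  -- routed to the second leaf here
print(bplus_search(60))   # False
```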
B+ Tree Insertion
Suppose we want to insert a record 60 in the below structure. It will go to the 3rd
leaf node after 55. It is a balanced tree, and a leaf node of this tree is already full,
so we cannot insert 60 there.
The 3rd leaf node would then hold the values (50, 55, 60, 65, 70), and its
entry in the intermediate node is 50. We will split the leaf node in the middle
so that the tree's balance is not altered. So we can group (50, 55) and
(60, 65, 70) into two leaf nodes.
If these two have to be leaf nodes, the intermediate node cannot branch from 50
alone. It should have 60 added to it, and then we can have a pointer to the new
leaf node.
This is how we can insert an entry when there is overflow. In a normal scenario,
it is very easy to find the node where it fits and then place it in that leaf node.
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to
remove 60 from the intermediate node as well as from the 4th leaf node.
After deleting 60 from the above B+ tree and re-arranging the nodes, it will
appear as follows:
Hash-Based Indexing:
In hash-based indexing, a hash function is used to convert a key into a hash code.
This hash code serves as an index where the value associated with that key is
stored. The goal is to distribute the keys uniformly across an array, so that access
time is, on average, constant.
Let's break down some of these elements to further understand how hash-based
indexing works in practice:
Buckets
In hash-based indexing, the data space is divided into a fixed number of slots
known as "buckets." A bucket usually contains a single page (also known as a
block), but it may have additional pages linked in a chain if the primary page
becomes full. This is known as overflow.
Hash Function
The hash function is a mapping function that takes the search key as an input and
returns the bucket number where the record should be located. Hash functions aim
to distribute records uniformly across buckets to minimize the number of collisions
(two different keys hashing to the same bucket).
Insert Operations
When a new record is inserted into the dataset, its search key is hashed to find the
appropriate bucket. If the primary page of the bucket is full, an additional overflow
page is allocated and linked to the primary page. The new record is then stored on
this overflow page.
Search Operations
To find a record with a specific search key, the hash function is applied to the
search key to identify the bucket. All pages (primary and overflow) in that bucket
are then examined to find the desired record.
Limitations
Hash-based indexing is not suitable for range queries or when the search key is not
known. In such cases, a full scan of all pages is required, which is resource-
intensive.
Using the ASCII code of each name's first letter as the key and the hash
function h(key) = key mod 3:
Alice: 65 mod 3 = 2
Bob: 66 mod 3 = 0
Carol: 67 mod 3 = 1
Buckets:
Bucket 0: Bob
Bucket 1: Carol
Bucket 2: Alice
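The bucket assignment above can be reproduced in a few lines, hashing on the ASCII code of the first letter modulo the number of buckets (3), the same h(key) = code mod 3 used in the example:

```python
# Assign each name to a bucket using h(key) = (ASCII of first letter) mod 3.
def bucket_for(name, num_buckets=3):
    return ord(name[0]) % num_buckets   # 'A' = 65, 'B' = 66, 'C' = 67

buckets = {0: [], 1: [], 2: []}
for name in ["Alice", "Bob", "Carol"]:
    buckets[bucket_for(name)].append(name)

print(buckets)   # {0: ['Bob'], 1: ['Carol'], 2: ['Alice']}
```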
Advantages:
Equality searches on the hash key are very fast: the bucket address is computed
directly from the key, in constant time on average.
Insertions and deletions are similarly direct, since the same hash computation
locates the record's bucket.
Disadvantages:
Not suitable for range queries (e.g., "SELECT * FROM table WHERE age
BETWEEN 20 AND 30").
Performance can be severely affected by poor hash functions or a large number of
collisions.
In this method, the hash function can use any column value to generate the
address. Most of the time, the hash function uses the primary key to generate
the address of the data block. The hash function can be any simple or complex
mathematical function. We can even consider the primary key itself as the
address of the data block: each row is then stored in the data block whose
address is the same as its primary key.
The above diagram shows data block addresses that are the same as the primary
key values. The hash function can also be a simple mathematical function such
as mod, exponential, cos, sin, etc. Suppose we have a mod(5) hash function to
determine the address of the data block. In this case, it applies mod(5) to the
primary keys and generates 3, 3, and so on as the data block addresses at which
the respective records are stored.
Types of Hashing:
STATIC HASHING:
In static hashing, the resultant data bucket address will always be the same.
That means if we generate an address for EMP_ID = 103 using the hash function
mod(5), then it will always result in the same bucket address, 3. The bucket
address does not change.
Hence in this static hashing, the number of data buckets in memory remains
constant throughout. In this example, we will have five data buckets in the memory
used to store the data.
o Searching a record
When a record needs to be searched, then the same hash function retrieves the
address of the bucket where the data is stored.
o Insert a Record
When a new record is inserted into the table, then we will generate an address for
a new record based on the hash key and record is stored in that location.
o Delete a Record
To delete a record, we will first fetch the record which is supposed to be deleted.
Then we will delete the records for that address in memory.
o Update a Record
To update a record, we will first search it using a hash function, and then the data
record is updated.
Sometimes when we want to insert a new record into the file, the address of the
data bucket generated by the hash function is not empty; data already exists at
that address. This situation in static hashing is known as bucket overflow, and
it is a critical situation in this method.
To overcome this situation, there are various methods. Some commonly used
methods are as follows:
1. Open Hashing
When the hash function generates an address at which data is already stored,
the next available bucket is allocated to the record. This mechanism is called
Linear Probing.
For example: suppose R3 is a new record which needs to be inserted, and the hash
function generates address 112 for it. But the generated address is already full,
so the system searches for the next available data bucket, 113, and assigns R3
to it.
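The example above can be sketched directly: if the hashed bucket (112) is full, probe the following addresses until a free one is found. The bucket contents are illustrative.

```python
# Sketch of linear probing: on a full bucket, try the next address.
buckets = {112: "R1", 113: None, 114: None}   # bucket 112 is already occupied

def insert_linear(address, record):
    while buckets.get(address) is not None:   # bucket full: probe the next one
        address += 1
    buckets[address] = record
    return address

print(insert_linear(112, "R3"))   # 113 -- R3 lands in the next free bucket
```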
2. Close Hashing
When buckets are full, then a new data bucket is allocated for the same hash result
and is linked after the previous one. This mechanism is known as Overflow
chaining.
For example: suppose R3 is a new record which needs to be inserted into the
table, and the hash function generates address 110 for it. But this bucket is
full. In this case, a new bucket is inserted at the end of bucket 110's chain
and is linked to it.
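Overflow chaining can be sketched as a list of pages per bucket: when the current page of bucket 110 is full, a new overflow page is linked after it instead of probing other buckets. Capacity and record names are illustrative.

```python
# Sketch of overflow chaining: each bucket is a chain of fixed-size pages;
# a full page gets a new overflow page linked after it.
BUCKET_CAPACITY = 2
chains = {110: [["R1", "R2"]]}        # bucket 110: one page, already full

def insert_chained(address, record):
    chain = chains.setdefault(address, [[]])
    if len(chain[-1]) >= BUCKET_CAPACITY:
        chain.append([])              # allocate and link an overflow page
    chain[-1].append(record)

insert_chained(110, "R3")
print(chains[110])   # [['R1', 'R2'], ['R3']] -- R3 went to the overflow page
```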
o The dynamic hashing method is used to overcome the problems of static hashing
like bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases
or decreases. This method is also known as the Extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without
resulting in poor performance.
o Firstly, you have to follow the same procedure for retrieval, ending up in some
bucket.
o If there is still space in that bucket, then place the record in it.
o If the bucket is full, then we will split the bucket and redistribute the records.
For example:
Consider the following grouping of keys into buckets, depending on the prefix
of their hash address:
Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001, it must go into the first bucket. But bucket
B1 is full, so it will get split.
o The splitting will separate 5, 9 from 6 since last three bits of 5, 9 are 001, so it
will go into bucket B1, and the last three bits of 6 are 101, so it will go into bucket
B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and
100 entries, because the last two bits of both entries are 00.
Advantages:
o In this method, the performance does not decrease as the data grows in the
system. It simply increases the size of memory to accommodate the data.
o In this method, memory is well utilized as it grows and shrinks with the data.
There will not be any unused memory lying around.
o This method is good for the dynamic database where data grows and shrinks
frequently.
Disadvantages:
o In this method, if the data size increases then the number of buckets also
increases. These data addresses are maintained in the bucket address table, and
because the bucket addresses keep changing as buckets grow and shrink, a huge
increase in data makes maintaining the bucket address table tedious.