DBMS Unit 4 & 5 Notes

TRANSACTION

A transaction is a set of logically related operations; it contains a group of tasks.

 A transaction is an action, or a series of actions, performed by a single user to access the contents of the database.

Following are the main operations of a transaction:

Read(X): reads the value of X from the database and stores it in a buffer in main memory.

Write(X): writes the value of X from the buffer back to the database.
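
A minimal Python sketch of these two operations, assuming a simple in-memory dictionary as a stand-in for the database (illustrative only, not a real DBMS API):

database = {"A": 600, "B": 300}   # hypothetical stored values
buffer = {}                       # main-memory buffer local to the transaction

def read(item):
    # Read(X): load the value of X from the database into the buffer
    buffer[item] = database[item]

def write(item):
    # Write(X): write the buffered value of X back to the database
    database[item] = buffer[item]

read("A")              # buffer["A"] == 600
buffer["A"] -= 100     # computation happens on the buffered copy
write("A")             # database["A"] == 500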

A transaction has four properties, which are used to maintain consistency in the database
before and after the transaction.

Properties of a Transaction
1. Atomicity
2. Consistency
3. Isolation
4. Durability

Atomicity

It states that either all operations of the transaction take place or none do; if not, the transaction is aborted.
There is no midway, i.e., the transaction cannot occur partially. Each transaction is treated as one
unit and either runs to completion or is not executed at all.

Atomicity involves the following two operations:

Abort: If a transaction aborts, then none of the changes it made are visible.

Commit: If a transaction commits then all the changes made are visible.

Example: Let's assume the following transaction T consists of two parts, T1 and T2. Account A holds Rs 600 and account B
holds Rs 300, and T transfers Rs 100 from account A to account B.

T1                T2
Read(A)
A := A - 100
Write(A)
                  Read(B)
                  B := B + 100
                  Write(B)

After completion of the transaction, A consists of Rs 500 and B consists of Rs 400.

If the transaction T fails after the completion of T1 but before the completion of T2, then the amount
will be deducted from A but not added to B, leaving the database in an inconsistent state. To ensure
correctness of the database state, the transaction must be executed in its entirety.
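
A minimal Python sketch of atomicity for this transfer, reusing the in-memory dictionary idea from above: either both updates are applied (commit) or the previous state is restored (abort), so no partial result remains visible.

def transfer(db, src, dst, amount):
    snapshot = dict(db)            # remember the state before the transaction
    try:
        db[src] -= amount          # T1: Read(A), A := A - 100, Write(A)
        if db[src] < 0:
            raise ValueError("insufficient funds")
        db[dst] += amount          # T2: Read(B), B := B + 100, Write(B)
    except Exception:
        db.clear()
        db.update(snapshot)        # abort: none of the changes remain visible
        raise

accounts = {"A": 600, "B": 300}
transfer(accounts, "A", "B", 100)
print(accounts)                    # {'A': 500, 'B': 400}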

Consistency
The integrity constraints are maintained so that the database is consistent before and after
the transaction. The execution of a transaction leaves the database in either its prior stable state or a
new stable state.
The consistency property states that every transaction sees a consistent database
instance. A transaction transforms the database from one consistent state to another
consistent state.

For example: The total amount must be the same before and after the transaction.

Total before T occurs = 600+300=900


Total after T occurs= 500+400=900

Therefore, the database is consistent. In the case when T1 is completed but T2 fails, then
inconsistency will occur.

Isolation

It means that the data used during the execution of a transaction cannot be used by
a second transaction until the first one is completed.
In isolation, if transaction T1 is being executed and is using the data item X, then that data item
can't be accessed by any other transaction T2 until transaction T1 ends.
The concurrency control subsystem of the DBMS enforces the isolation property.

Durability

The durability property states that once a transaction commits, its changes are permanent.
They cannot be lost by the erroneous operation of a faulty transaction or by a system failure.
When a transaction is completed, the database reaches a consistent state, and that consistent
state cannot be lost, even in the event of a system failure.
The recovery subsystem of the DBMS is responsible for enforcing durability.
States of Transaction

Active state

 The active state is the first state of every transaction. In this state, the transaction is being
executed.
 For example: Insertion or deletion or updating a record is done here. But all the records are
still not saved to the database.

Partially committed

 In the partially committed state, a transaction executes its final operation, but the data is still
not saved to the database.
 In the total mark calculation example, a final display of the total marks step is executed in
this state.

Committed

A transaction is said to be in a committed state if it executes all its operations successfully. In


this state, all the effects are now permanently saved on the database system.

Failed state

 If any of the checks made by the database recovery system fails, then the transaction is said
to be in the failed state.
 In the example of total mark calculation, if the database is not able to fire a query to fetch
the marks, then the transaction will fail to execute.

Aborted

 If any of the checks fail and the transaction has reached a failed state then the database
recovery system will make sure that the database is in its previous consistent state. If not
then it will abort or roll back the transaction to bring the database into a consistent state.
 If the transaction fails in the middle of its execution, then all the operations executed so far are
rolled back so that the database returns to its previous consistent state.
 After aborting the transaction, the database recovery module will select one of the two
operations:
1. Re-start the transaction
2. Kill the transaction

Schedule
A schedule is a series of operations from one or more transactions. It
preserves the order of the operations within each individual transaction.

1. Serial Schedule

The serial schedule is a type of schedule where one transaction is executed completely before
starting another transaction. In the serial schedule, when the first transaction completes its
cycle, then the next transaction is executed.

For example: Suppose there are two transactions T1 and T2 which have some operations. If
there is no interleaving of operations, then there are the following two possible outcomes:

1. Execute all the operations of T1 followed by all the operations of T2.
2. Execute all the operations of T2 followed by all the operations of T1.

2. Non-serial Schedule

 If interleaving of operations is allowed, then the schedule is a non-serial schedule.


 It contains many possible orders in which the system can execute the individual operations of
the transactions.
 Schedules in which the operations of different transactions are interleaved (such as Schedules C and D
in the source figure) are non-serial schedules.
3. Serializable schedule

 The serializability of schedules is used to find non-serial schedules that allow the transaction
to execute concurrently without interfering with one another.
 It identifies which schedules are correct when executions of the transaction have interleaving
of their operations.
 A non-serial schedule will be serializable if its result is equal to the result of its transactions
executed serially.
Testing of Serializability
Serialization Graph is used to test the Serializability of a schedule.

Assume a schedule S. For S, we construct a graph known as the precedence graph. This graph
is a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set
of vertices contains all the transactions participating in the schedule. The set of
edges contains an edge Ti → Tj whenever one of the following three conditions holds:

1. Add an edge Ti → Tj if Ti executes write(Q) before Tj executes read(Q).
2. Add an edge Ti → Tj if Ti executes read(Q) before Tj executes write(Q).
3. Add an edge Ti → Tj if Ti executes write(Q) before Tj executes write(Q).

 If a precedence graph contains a single edge Ti → Tj, then all the instructions of Ti
are executed before the first instruction of Tj is executed.
 If the precedence graph for schedule S contains a cycle, then S is non-serializable. If the
precedence graph has no cycle, then S is serializable.
For example:

Read(A): In T1, no subsequent writes to A, so no new edges


Read(B): In T2, no subsequent writes to B, so no new edges
Read(C): In T3, no subsequent writes to C, so no new edges
Write(B): B is subsequently read by T3, so add edge T2 → T3
Write(C): C is subsequently read by T1, so add edge T3 → T1
Write(A): A is subsequently read by T2, so add edge T1 → T2
Write(A): In T2, no subsequent reads to A, so no new edges
Write(C): In T1, no subsequent reads to C, so no new edges
Write(B): In T3, no subsequent reads to B, so no new edges
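
A minimal Python sketch of the precedence-graph test, assuming a schedule is given as a list of (transaction, operation, data item) tuples in execution order. The small example schedule at the end is hypothetical, not the one from the figure above.

from itertools import combinations

def precedence_graph(schedule):
    edges = set()
    # every earlier operation is compared with every later one
    for (ti, op_i, x), (tj, op_j, y) in combinations(schedule, 2):
        if ti != tj and x == y and "W" in (op_i, op_j):
            edges.add((ti, tj))    # conflicting pair: edge Ti -> Tj
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    visited, stack = set(), set()
    def dfs(node):
        visited.add(node)
        stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in stack or (nxt not in visited and dfs(nxt)):
                return True
        stack.discard(node)
        return False
    return any(dfs(n) for n in list(graph) if n not in visited)

# Hypothetical schedule: R1(A), W2(A), R2(B), W1(B)
S = [("T1", "R", "A"), ("T2", "W", "A"), ("T2", "R", "B"), ("T1", "W", "B")]
edges = precedence_graph(S)        # {('T1', 'T2'), ('T2', 'T1')}
print("non-serializable" if has_cycle(edges) else "serializable")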

Conflict Serializable Schedule


 A schedule is called conflict serializable if, after swapping its non-conflicting operations, it
can be transformed into a serial schedule.
 The schedule will be a conflict serializable if it is conflict equivalent to a serial schedule.

Conflicting Operations

Two operations conflict if all of the following conditions are satisfied (a small predicate sketch follows the list):

1. They belong to separate transactions.
2. They operate on the same data item.
3. At least one of them is a write operation.
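
A minimal Python predicate capturing these three conditions, with operations represented as (transaction, kind, item) tuples (purely illustrative):

def is_conflicting(op1, op2):
    (t1, kind1, item1), (t2, kind2, item2) = op1, op2
    return t1 != t2 and item1 == item2 and "W" in (kind1, kind2)

print(is_conflicting(("T1", "W", "A"), ("T2", "R", "A")))   # True  (W-R on the same item)
print(is_conflicting(("T1", "R", "A"), ("T2", "R", "A")))   # False (two reads never conflict)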

Example:

Two adjacent operations can be swapped only if the resulting schedules S1 and S2 remain logically equal.

If S1 = S2 after the swap, the two operations are non-conflicting.

If S1 ≠ S2 after the swap, the two operations conflict.

Conflict Equivalent

Two schedules are conflict equivalent if one can be transformed into the other by swapping
non-conflicting operations. In the given example, S2 is conflict equivalent to S1 (S1 can be converted to S2
by swapping non-conflicting operations).

Two schedules are said to be conflict equivalent if and only if:

1. They contain the same set of transactions.
2. Every pair of conflicting operations is ordered in the same way in both schedules.
Checkpoint
 The checkpoint is a type of mechanism where all the previous logs are removed from the
system and permanently stored in the storage disk.
 The checkpoint is like a bookmark. During the execution of transactions, such checkpoints are
marked, and as the transactions execute, log files are created from their steps.
 When a checkpoint is reached, all the updates recorded in the log are written to the database, and
the log file up to that point is removed. The log file is then updated with the steps of new
transactions until the next checkpoint, and so on.
 The checkpoint is used to declare a point before which the DBMS was in the consistent state,
and all transactions were committed.

Recovery using Checkpoint

A recovery system recovers the database from a failure in the following manner:
 The recovery system reads the log files from the end to the start, i.e., from T4 back to T1.
 The recovery system maintains two lists: a redo-list and an undo-list.
 A transaction is put into the redo-list if the recovery system sees a log with <Tn, Start> and
<Tn, Commit>, or just <Tn, Commit>. All the transactions in the redo-list are redone, and their
logs are retained.
 For example: In the log file, transactions T2 and T3 will have <Tn, Start> and <Tn, Commit>.
The T1 transaction will have only <Tn, Commit> in the log file, because it committed after the
checkpoint was crossed. Hence T1, T2 and T3 are put into the redo-list.
 A transaction is put into the undo-list if the recovery system sees a log with <Tn, Start> but
no commit or abort record. All the transactions in the undo-list are undone, and their logs
are removed.
 For example: Transaction T4 will have only <Tn, Start>, so T4 is put into the undo-list, since this
transaction is not yet complete and failed in the middle.
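
A minimal Python sketch of how the redo-list and undo-list can be built from the log records seen after the last checkpoint (the log format and transaction names here are hypothetical, chosen to mirror the example above):

def build_recovery_lists(log):
    started, committed = set(), set()
    for txn, event in log:
        if event == "START":
            started.add(txn)
        elif event == "COMMIT":
            committed.add(txn)
    redo = committed               # committed after the checkpoint: redo
    undo = started - committed     # started but never committed: undo
    return redo, undo

# T1 commits just after the checkpoint, T2 and T3 start and commit, T4 never commits.
log = [("T1", "COMMIT"), ("T2", "START"), ("T3", "START"),
       ("T2", "COMMIT"), ("T3", "COMMIT"), ("T4", "START")]
redo, undo = build_recovery_lists(log)
print(sorted(redo))   # ['T1', 'T2', 'T3']
print(sorted(undo))   # ['T4']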

Deadlock in DBMS
A deadlock is a condition where two or more transactions are waiting indefinitely for one
another to give up locks. Deadlock is said to be one of the most feared complications in
DBMS as no task ever gets finished and is in waiting state forever.

For example: In the student table, transaction T1 holds a lock on some rows and needs to
update some rows in the grade table. Simultaneously, transaction T2 holds locks on some
rows in the grade table and needs to update the rows in the Student table held by Transaction
T1.

Now, the main problem arises. Now Transaction T1 is waiting for T2 to release its lock and
similarly, transaction T2 is waiting for T1 to release its lock. All activities come to a halt state
and remain at a standstill. It will remain in a standstill until the DBMS detects the deadlock
and aborts one of the transactions.
Deadlock Avoidance

 When a database can get stuck in a deadlock state, it is better to avoid the deadlock rather
than abort or restart the transactions afterwards, since that wastes time and resources.
 A deadlock avoidance mechanism is used to detect any deadlock situation in advance. A
method like the "wait-for graph" is used for detecting deadlock situations, but this method is
suitable only for smaller databases. For larger databases, the deadlock prevention method
can be used.

Deadlock Detection

In a database, when a transaction waits indefinitely to obtain a lock, the DBMS should
detect whether the transaction is involved in a deadlock or not. The lock manager maintains a
wait-for graph to detect deadlock cycles in the database.

Wait for Graph

 This is the suitable method for deadlock detection. In this method, a graph is created based
on the transaction and their lock. If the created graph has a cycle or closed loop, then there
is a deadlock.
 The wait for the graph is maintained by the system for every transaction which is waiting for
some data held by the others. The system keeps checking the graph if there is any cycle in
the graph.

The wait for a graph for the above scenario is shown below:

Deadlock Prevention

 The deadlock prevention method is suitable for a large database. If the resources are allocated in
such a way that a deadlock never occurs, then the deadlock can be prevented. The database
management system analyzes the operations of a transaction to check whether they can create a
deadlock situation or not. If they can, then the DBMS never allows that transaction to be
executed.

Wait-Die scheme

In this scheme, if a transaction requests a resource that is already held with a conflicting
lock by another transaction, the DBMS simply compares the timestamps of both transactions
and allows only the older transaction to wait until the resource becomes available.

Let's assume there are two transactions Ti and Tj, and let TS(T) be the timestamp of a
transaction T. If Tj holds a lock on some resource and Ti requests that resource, then the
DBMS performs the following actions:

1. If TS(Ti) < TS(Tj), i.e., the requesting transaction Ti is older than the holding transaction Tj, then Ti is
allowed to wait until the data item is available. That is, if an older transaction is waiting for a
resource locked by a younger transaction, the older transaction is allowed to wait.
2. If TS(Ti) > TS(Tj), i.e., the requesting transaction Ti is younger than the holding transaction Tj, then Ti is
killed (it "dies") and restarted later with a random delay but with the same
timestamp.

Wound wait scheme

 In the wound-wait scheme, if an older transaction requests a resource held by a
younger transaction, then the older transaction forces the younger one to abort ("wounds" it) and
release the resource. After a small delay, the younger transaction is restarted with
the same timestamp.
 If the older transaction holds a resource requested by the younger transaction,
then the younger transaction is asked to wait until the older one releases it.
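
A minimal Python sketch contrasting the two schemes. Here ts maps each transaction to its timestamp (smaller means older), and the requester asks for a lock currently held by holder; the function names are illustrative only.

def wait_die(ts, requester, holder):
    # older requester waits; younger requester dies (is rolled back)
    return "WAIT" if ts[requester] < ts[holder] else "DIE"

def wound_wait(ts, requester, holder):
    # older requester wounds (aborts) the younger holder; younger requester waits
    return "WOUND_HOLDER" if ts[requester] < ts[holder] else "WAIT"

ts = {"T1": 7, "T2": 9}                 # T1 is older than T2
print(wait_die(ts, "T1", "T2"))         # WAIT  (older waits for younger)
print(wait_die(ts, "T2", "T1"))         # DIE   (younger is rolled back)
print(wound_wait(ts, "T1", "T2"))       # WOUND_HOLDER (older preempts younger)
print(wound_wait(ts, "T2", "T1"))       # WAIT  (younger waits for older)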

DBMS Concurrency Control

Concurrency Control is the management procedure that is required for controlling concurrent
execution of the operations that take place on a database.

But before knowing about concurrency control, we should know about concurrent execution.

Concurrent Execution in DBMS

 In a multi-user system, multiple users can access and use the same database at the same time,
which is known as concurrent execution of the database. It means that the same
database is used simultaneously by different users on a multi-user system.
 While working with database transactions, multiple users often need to use the
database to perform different operations, and in that case concurrent execution of the
database is performed.
 The thing is that the simultaneous execution that is performed should be done in an
interleaved manner, and no operation should affect the other executing operations, thus
maintaining the consistency of the database. Thus, on making the concurrent execution of
the transaction operations, there occur several challenging problems that need to be solved.
Problems with Concurrent Execution

In a database transaction, the two main operations are READ and WRITE. These
operations need to be managed during the concurrent execution of transactions,
because if their interleaving is not controlled, the data may become
inconsistent. The following problems occur with the concurrent execution of
operations:

Problem 1: Lost Update Problems (W - W Conflict)

The problem occurs when two different database transactions perform the read/write
operations on the same database items in an interleaved manner (i.e., concurrent execution)
that makes the values of the items incorrect hence making the database inconsistent.

Dirty Read Problems (W-R Conflict)

The dirty read problem occurs when one transaction updates an item of the database, the
transaction then fails, and before the data is rolled back, the updated database item
is accessed by another transaction. This is the write-read conflict between the two
transactions.
Unrepeatable Read Problem (R-W Conflict)

Also known as Inconsistent Retrievals Problem that occurs when in a transaction, two
different values are read for the same database item.
Concurrency Control

Concurrency Control is the working concept that is required for controlling and managing the
concurrent execution of database operations and thus avoiding the inconsistencies in the
database. Thus, for maintaining the concurrency of the database, we have the concurrency
control protocols.

Concurrency Control Protocols

The concurrency control protocols ensure the atomicity, consistency, isolation, durability and
serializability of the concurrent execution of the database transactions. Therefore, these
protocols are categorized as:

 Lock Based Concurrency Control Protocol


 Time Stamp Concurrency Control Protocol
 Validation Based Concurrency Control Protocol

Lock-Based Protocol

In this type of protocol, a transaction cannot read or write data until it acquires an
appropriate lock on it. There are two types of lock:

1. Shared lock:

 It is also known as a read-only lock. With a shared lock, the data item can only be read by the
transaction.
 It can be shared between transactions, because a transaction holding a shared lock
can't update the data item.

2. Exclusive lock:

 With an exclusive lock, the data item can be both read and written by the transaction.
 This lock is exclusive: it prevents multiple transactions from modifying the same data item
simultaneously.

There are four types of lock protocols available:


1. Simplistic lock protocol

It is the simplest way of locking data during a transaction. Simplistic lock-based protocols
require every transaction to obtain a lock on the data before inserting, deleting, or updating it, and to
unlock the data item after completing the transaction.

2. Pre-claiming Lock Protocol

 Pre-claiming lock protocols evaluate the transaction to list all the data items on which it
needs locks.
 Before initiating execution of the transaction, it requests the DBMS for locks on all of those
data items.
 If all the locks are granted, then this protocol allows the transaction to begin. When the
transaction is completed, it releases all the locks.
 If all the locks are not granted, then the transaction rolls back and
waits until all the locks are granted.

Two-phase locking (2PL)

 The two-phase locking protocol divides the execution phase of the transaction into three
parts.
 In the first part, when the execution of the transaction starts, it seeks permission for the lock
it requires.
 In the second part, the transaction acquires all the locks. The third phase is started as soon
as the transaction releases its first lock.
 In the third phase, the transaction cannot demand any new locks. It only releases the
acquired locks.
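
A minimal Python sketch of the two-phase pattern described above, assuming a toy lock manager with acquire()/release(): every lock is acquired before any lock is released (growing phase), and once the first lock is released no new lock is requested (shrinking phase). This is an illustration, not a production lock manager.

import threading

class LockManager:
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def acquire(self, item):
        with self._guard:
            lock = self._locks.setdefault(item, threading.Lock())
        lock.acquire()

    def release(self, item):
        self._locks[item].release()

def transfer_2pl(lm, db, src, dst, amount):
    lm.acquire(src)          # growing phase: acquire every lock needed
    lm.acquire(dst)
    db[src] -= amount        # all reads/writes happen while the locks are held
    db[dst] += amount
    lm.release(src)          # shrinking phase: from here on, only releases
    lm.release(dst)          # (strict 2PL would release only after commit)

accounts = {"A": 600, "B": 300}
transfer_2pl(LockManager(), accounts, "A", "B", 100)
print(accounts)              # {'A': 500, 'B': 400}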
Strict Two-phase locking (Strict-2PL)

 The first phase of Strict-2PL is similar to 2PL. In the first phase, after acquiring all the locks,
the transaction continues to execute normally.
 The only difference between 2PL and strict 2PL is that strict 2PL does not release a lock immediately
after using it.
 Strict 2PL waits until the whole transaction commits, and then it releases all the locks at
once.
 The strict 2PL protocol therefore does not have a gradual shrinking phase of lock release.

Timestamp Ordering Protocol


 The timestamp ordering protocol is used to order the transactions based on their
timestamps. The order of the transactions is simply the ascending order of their
creation.
 The older transaction has higher priority, so it executes first. To determine
the timestamp of a transaction, this protocol uses system time or a logical counter.
 The lock-based protocol is used to manage the order between conflicting pairs among
transactions at the execution time. But Timestamp based protocols start working as
soon as a transaction is created.
 Let's assume there are two transactions T1 and T2. Suppose transaction T1
entered the system at time 007 and transaction T2 entered the system at time
009. T1 has the higher priority, so it executes first, as it entered the system first.
 The timestamp ordering protocol also maintains the timestamp of last 'read' and 'write'
operation on a data.

Basic Timestamp ordering protocol works as follows:

1. Check the following condition whenever a transaction Ti issues a Read (X) operation:

 If W_TS(X) > TS(Ti), then the operation is rejected.
 If W_TS(X) <= TS(Ti), then the operation is executed.
 The read timestamp of X is updated, i.e., R_TS(X) is set to max(R_TS(X), TS(Ti)).
2. Check the following condition whenever a transaction Ti issues a Write(X) operation:

 If TS(Ti) < R_TS(X), then the operation is rejected and Ti is rolled back.
 If TS(Ti) < W_TS(X), then the operation is rejected and Ti is rolled back; otherwise the
operation is executed and W_TS(X) is set to TS(Ti).

Where,

TS(Ti) denotes the timestamp of the transaction Ti.

R_TS(X) denotes the Read time-stamp of data-item X.

W_TS(X) denotes the Write time-stamp of data-item X.
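
A minimal Python sketch of these checks, assuming per-item read/write timestamps kept in dictionaries r_ts and w_ts and per-transaction timestamps in ts (all names are illustrative):

def to_read(ts, r_ts, w_ts, ti, x):
    if w_ts.get(x, 0) > ts[ti]:
        return "REJECT"                      # a younger transaction already wrote X
    r_ts[x] = max(r_ts.get(x, 0), ts[ti])    # record the latest read timestamp
    return "EXECUTE"

def to_write(ts, r_ts, w_ts, ti, x):
    if ts[ti] < r_ts.get(x, 0) or ts[ti] < w_ts.get(x, 0):
        return "REJECT"                      # the write arrives too late: roll Ti back
    w_ts[x] = ts[ti]
    return "EXECUTE"

ts, r_ts, w_ts = {"T1": 7, "T2": 9}, {}, {}
print(to_write(ts, r_ts, w_ts, "T2", "X"))   # EXECUTE, W_TS(X) becomes 9
print(to_read(ts, r_ts, w_ts, "T1", "X"))    # REJECT, since W_TS(X) = 9 > TS(T1) = 7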

Validation Based Protocol


The validation-based protocol is also known as the optimistic concurrency control technique. In the
validation-based protocol, a transaction is executed in the following three phases:

1. Read phase: In this phase, the transaction T is read and executed. It is used to read
the value of various data items and stores them in temporary local variables. It can
perform all the write operations on temporary variables without an update to the
actual database.
2. Validation phase: In this phase, the temporary variable value will be validated
against the actual data to see if it violates the serializability.
3. Write phase: If the transaction passes validation, then the temporary
results are written to the database; otherwise the transaction is rolled back.

Each transaction has the following timestamps associated with its phases:

Start(Ti): the time when Ti started its execution.

Validation(Ti): the time when Ti finished its read phase and started its validation
phase.

Finish(Ti): the time when Ti finished its write phase.
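
A minimal Python sketch of the three phases. For simplicity it validates with per-item version counters instead of the Start/Validation/Finish timestamps described above, but the structure (read on local copies, validate, then write) is the same.

def run_optimistic(db, versions, txn):
    # Read phase: copy values and remember the version of everything read.
    local = {x: db[x] for x in txn["reads"]}
    seen = {x: versions.get(x, 0) for x in txn["reads"]}
    writes = txn["compute"](local)          # all updates go to local variables

    # Validation phase: abort if any item read has changed since it was read.
    if any(versions.get(x, 0) != v for x, v in seen.items()):
        return "ROLLBACK"

    # Write phase: apply the buffered writes to the real database.
    for x, value in writes.items():
        db[x] = value
        versions[x] = versions.get(x, 0) + 1
    return "COMMIT"

db, versions = {"A": 600, "B": 300}, {}
txn = {"reads": ["A", "B"],
       "compute": lambda v: {"A": v["A"] - 100, "B": v["B"] + 100}}
print(run_optimistic(db, versions, txn), db)   # COMMIT {'A': 500, 'B': 400}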

Thomas write Rule

The Thomas write rule provides a guarantee of serializable order for the protocol and
improves the Basic Timestamp Ordering algorithm by ignoring obsolete writes.

The basic Thomas write rules are as follows:

 If TS(T) < R_TS(X), then transaction T is aborted and rolled back, and the operation is
rejected.
 If TS(T) < W_TS(X), then the Write(X) operation of the transaction is not executed (the
write is obsolete and is ignored) and processing continues.
 If neither condition holds, then the WRITE operation of transaction T is executed
and W_TS(X) is set to TS(T).
If we use the Thomas write rule, then some schedules can be permitted that are serializable but
not conflict serializable, as illustrated by the schedule in the figure of the source notes:

In that schedule, T1's read of a data item precedes T1's write of the same data item; the schedule
is not conflict serializable.

The Thomas write rule recognizes that T2's write is never seen by any transaction. If we delete the
write operation in transaction T2, then a conflict serializable schedule is obtained, as shown in the
second figure of the source notes.
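
A minimal Python sketch of the rule applied to a Write(X) by transaction T, with ts, r_ts and w_ts as in the timestamp ordering sketch above. The only difference from basic timestamp ordering is the middle case, where an obsolete write is silently ignored instead of causing a rollback.

def thomas_write(ts, r_ts, w_ts, t, x):
    if ts[t] < r_ts.get(x, 0):
        return "ABORT"       # a younger transaction already read X: roll T back
    if ts[t] < w_ts.get(x, 0):
        return "IGNORE"      # obsolete write: skip it and continue processing
    w_ts[x] = ts[t]
    return "EXECUTE"

ts, r_ts, w_ts = {"T1": 7, "T2": 9}, {}, {"X": 9}
print(thomas_write(ts, r_ts, w_ts, "T1", "X"))   # IGNORE (T2 already wrote X)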

Multiple Granularity
Let's start by understanding the meaning of granularity.

Granularity: the size of the data item that is allowed to be locked.

Multiple Granularity:

 It can be defined as hierarchically breaking up the database into blocks which can be
locked.
 The Multiple Granularity protocol enhances concurrency and reduces lock overhead.
 It keeps track of what to lock and how to lock.
 It makes it easy to decide whether to lock or unlock a data item. This type of
hierarchy can be represented graphically as a tree.

For example: Consider a tree which has four levels of nodes.

 The first (highest) level represents the entire database.

 The second level consists of nodes of type area. The database consists of
exactly these areas.
 Each area has child nodes known as files. No file can be present in
more than one area.
 Finally, each file has child nodes known as records. A file contains exactly those
records that are its child nodes. No record is present in more than one file.
 Hence, the levels of the tree starting from the top level are as follows:
1. Database
2. Area
3. File
4. Record
File Organization
 The File is a collection of records. Using the primary key, we can access the records. The type
and frequency of access can be determined by the type of file organization which was used
for a given set of records.
 File organization is a logical relationship among various records. This method defines how file
records are mapped onto disk blocks.
 File organization is used to describe the way in which the records are stored in terms of
blocks, and the blocks are placed on the storage medium.
 The first approach to mapping the database to files is to use several files and store only
fixed-length records in any given file. An alternative approach is to structure the files so
that they can contain records of multiple (variable) lengths.
 Files of fixed length records are easier to implement than the files of variable length records.

Objective of file organization

 Optimal selection of records, i.e., records can be selected as quickly as possible.
 Insert, delete, and update operations on the records should be quick and easy.
 Duplicate records should not be induced as a result of insert, update, or delete operations.
 For the minimal cost of storage, records should be stored efficiently.

Types of file organization:

File organization contains various methods. These particular methods have pros and cons on
the basis of access or selection. In the file organization, the programmer decides the best-
suited file organization method according to his requirement.

Types of file organization are as follows:

Sequential File Organization


This method is the easiest method for file organization. In this method, files are stored
sequentially. This method can be implemented in two ways:
1. Pile File Method:

 It is a quite simple method. In this method, we store the record in a sequence, i.e., one
after another. Here, the record will be inserted in the order in which they are inserted
into tables.
 In case of updating or deleting of any record, the record will be searched in the
memory blocks. When it is found, then it will be marked for deleting, and the new
record is inserted.

Insertion of a new record:

Suppose we have records R1, R3 and so on up to R9 and R8 in a sequence (a record is simply a row in a
table). If we want to insert a new record R2 into this sequence, it will be placed at the end of the file.

2. Sorted File Method:

Insertion of a new record:

Suppose there is a pre-existing sorted sequence of records R1, R3 and so on up to R6 and
R7. If a new record R2 has to be inserted into the sequence, it is inserted at the
end of the file, and then the sequence is sorted.
Pros of sequential file organization

 It is a fast and efficient method for handling huge amounts of data.
 In this method, files can easily be stored on cheaper storage media like magnetic tapes.
 It is simple in design and requires little effort to store the data.
 This method is used when most of the records have to be accessed like grade calculation of a
student, generating the salary slip, etc.
 This method is used for report generation or statistical calculations.

Cons of sequential file organization

 It wastes time, because we cannot jump directly to a required record; we have to
move through the file sequentially, which takes time.
 Sorted file method takes more time and space for sorting the records.

Heap file organization


 It is the simplest and most basic type of organization. It works with data blocks. In
heap file organization, the records are inserted at the file's end. When the records are
inserted, it doesn't require the sorting and ordering of records.
 When a data block is full, the new record is stored in some other block. This new
data block need not be the very next data block; the DBMS can select any data block in
the memory to store new records. The heap file is also known as an unordered file.
 In the file, every record has a unique id, and every page in a file is of the same size. It
is the DBMS responsibility to store and manage the new records.
Insertion of a new record

Suppose we have five records R1, R3, R6, R4 and R5 in a heap, and we want to insert
a new record R2 into the heap. If data block 3 is full, then R2 will be inserted into any data block
selected by the DBMS, say data block 1.

Pros of Heap file organization

 It is a very good method of file organization for bulk insertion. If there is a large number of
data which needs to load into the database at a time, then this method is best suited.
 In the case of a small database, fetching and retrieving records is faster than in sequential
file organization.

Cons of Heap file organization

 This method is inefficient for large databases, because it takes time to search for or modify
a record.
B+ File Organization

 B+ tree file organization is the advanced method of an indexed sequential access


method. It uses a tree-like structure to store records in File.
 It uses the same concept of key-index where the primary key is used to sort the
records. For each primary key, the value of the index is generated and mapped with
the record.
 The B+ tree is similar to a binary search tree (BST), but it can have more than two
children. In this method, all the records are stored only at the leaf node. Intermediate
nodes act as a pointer to the leaf nodes. They do not contain any records.

The example B+ tree in the source figure shows that:

 There is one root node of the tree, i.e., 25.
 There is an intermediary layer of nodes. These nodes do not store the actual records; they have
only pointers to the leaf nodes.
 The nodes to the left of the root contain values smaller than the root, and the nodes to the
right contain values larger than the root, i.e., 15 and 30 respectively.
 The leaf nodes contain only values, i.e., 10, 12, 17, 20, 24, 27 and 29.
 Searching for any record is easier, as all the leaf nodes are at the same (balanced) level.
 In this method, any record can be reached by traversing a single path and accessed
easily.

Pros of B+ tree file organization

 In this method, searching becomes very easy, as all the records are stored only in the leaf
nodes and are sorted in a sequential linked list.
 Traversing through the tree structure is easier and faster.
 The size of the B+ tree has no restrictions, so the number of records can increase or decrease
and the B+ tree structure can also grow or shrink.
 It is a balanced tree structure, and any insert/update/delete does not affect the performance
of tree.
Cons of B+ tree file organization

 This method is inefficient for static tables.

Indexed sequential access method (ISAM)

ISAM method is an advanced sequential file organization. In this method, records are stored
in the file using the primary key. An index value is generated for each primary key and
mapped with the record. This index contains the address of the record in the file.

Pros of ISAM:

 In this method, since each record has the address of its data block, searching for a record in a huge
database is quick and easy.
 This method supports range retrieval and partial retrieval of records. Since the index is based
on the primary key values, we can retrieve the data for the given range of value. In the same
way, the partial value can also be easily searched, i.e., the student name starting with 'JA' can
be easily searched.

Cons of ISAM

 This method requires extra space in the disk to store the index value.
 When the new records are inserted, then these files have to be reconstructed to maintain
the sequence.
 When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.
Cluster file organization
 When records of two or more tables are stored in the same file, it is known as a cluster. These files
will have two or more tables in the same data block, and the key attributes which are used to
map these tables together are stored only once.
 This method reduces the cost of searching for various records in different files.
 The cluster file organization is used when there is a frequent need for joining the tables with
the same condition. These joins will give only a few records from both tables. In the given
example, we are retrieving the record for only particular departments. This method can't be
used to retrieve the record for the entire department.

In this method, we can directly insert, update or delete any record. Data is sorted based on the
key with which searching is done. Cluster key is a type of key with which joining of the table
is performed.

Types of Cluster file organization:

Cluster file organization is of two types:


1. Indexed Clusters:

In an indexed cluster, records are grouped based on the cluster key and stored together. The
EMPLOYEE and DEPARTMENT relationship mentioned above is an example of an indexed cluster:
all the records are grouped based on the cluster key DEP_ID and stored together.

2. Hash Clusters:

It is similar to the indexed cluster. In hash cluster, instead of storing the records based on the
cluster key, we generate the value of the hash key for the cluster key and store the records
with the same hash key value.

Pros of Cluster file organization

 The cluster file organization is used when there is a frequent request for joining the tables
with same joining condition.
 It provides the efficient result when there is a 1:M mapping between the tables.

Cons of Cluster file organization

 This method has low performance for very large databases.
 If there is any change in the joining condition, then this method cannot be used; if the
joining condition changes, traversing the file takes a lot of time.
 This method is not suitable for a table with a 1:1 relationship.

Indexing in DBMS
Indexing is used to optimize the performance of a database by minimizing the number of
disk accesses required when a query is processed.
 The index is a type of data structure. It is used to locate and access the data in a database
table quickly.

Index structure:

Indexes can be created using some database columns.

 The first column of the index is the search key, which contains a copy of the primary key or
candidate key of the table. These values are stored in sorted order so that
the corresponding data can be accessed easily.
 The second column of the index is the data reference. It contains a set of pointers holding
the address of the disk block where the value of the particular key can be found.

Indexing Methods

Ordered indices

The indices are usually sorted to make searching faster. The indices which are sorted are
known as ordered indices.

Example: Suppose we have an employee table with thousands of records, each of which is
10 bytes long. If the IDs start at 1, 2, 3 and so on, and we have to search for the employee with
ID 543:

 In the case of a database with no index, we have to scan the disk blocks from the start until we
reach 543; the DBMS will have read 543 * 10 = 5430 bytes before finding the record.
 In the case of an index, we search the index instead; assuming each index entry is only 2 bytes,
the DBMS reads 542 * 2 = 1084 bytes before finding the record, which is far less than in the previous case.
Primary Index

 If the index is created on the basis of the primary key of the table, then it is known as
primary indexing. These primary keys are unique to each record and contain 1:1 relation
between the records.
 As primary keys are stored in sorted order, the performance of the searching operation is
quite efficient.
 The primary index can be classified into two types: Dense index and Sparse index.

Dense index

 The dense index contains an index record for every search key value in the data file. It makes
searching faster.
 In this, the number of records in the index table is same as the number of records in the
main table.
 It needs more space to store index record itself. The index records have the search key and a
pointer to the actual record on the disk.

Sparse index

 In a sparse index, index records appear only for some of the items in the data file. Each entry points to a block.
 Instead of pointing to each record in the main table, the index points to records in
the main table at intervals (with gaps).

Clustering Index

 A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary key columns which may not be unique for each record.
 In this case, to identify the record faster, we will group two or more columns to get the
unique value and create index out of them. This method is called a clustering index.
 The records which have similar characteristics are grouped, and indexes are created for these
group.
Example: suppose a company contains several employees in each department. Suppose we
use a clustering index, where all employees which belong to the same Dept_ID are
considered within a single cluster, and index pointers point to the cluster as a whole. Here
Dept_Id is a non-unique key.


The previous scheme is a little confusing, because one disk block may be shared by records which
belong to different clusters. Using a separate disk block for each cluster is a better
technique.
Secondary Index

In sparse indexing, as the size of the table grows, the size of the mapping also grows. These
mappings are usually kept in primary memory so that address fetches are faster; the
actual data is then fetched from secondary memory using the address obtained from the mapping. If the
mapping size grows, fetching the address itself becomes slower, and the sparse
index is no longer efficient. To overcome this problem, secondary indexing is introduced.

In secondary indexing, to reduce the size of mapping, another level of indexing is introduced.
In this method, the huge range for the columns is selected initially so that the mapping size of
the first level becomes small. Then each range is further divided into smaller ranges. The
mapping of the first level is stored in the primary memory, so that address fetch is faster. The
mapping of the second level and actual data are stored in the secondary memory (hard disk).
B+ Tree
 The B+ tree is a balanced search tree (not a binary tree; a node may have many children). It follows a multi-level index format.
 In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf nodes
remain at the same height.
 In the B+ tree, the leaf nodes are linked together in a linked list. Therefore, a B+ tree can support
random access as well as sequential access.

Structure of B+ Tree

 In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the
order n where n is fixed for every B+ tree.
 It contains an internal node and leaf node.
Internal node

 An internal node of the B+ tree can contain at least n/2 child pointers, except the root node.
 At most, an internal node of the tree contains n pointers.

Leaf node

 The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
 At most, a leaf node contains n record pointer and n key values.
 Every leaf node of the B+ tree contains one block pointer P to point to next leaf node.

Searching a record in B+ Tree

Suppose we have to search for 55 in the B+ tree structure below. First, we look in the
intermediary node, which will direct us to the leaf node that can contain the record for 55.

So, in the intermediary node, we will find a branch between 50 and 75 nodes. Then at the
end, we will be redirected to the third leaf node. Here DBMS will perform a sequential search
to find 55.
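
A minimal Python sketch of the node structure and of this root-to-leaf search. The keys 50, 75 and 55 mirror the example; everything else, including the class and field names, is a hypothetical illustration rather than a full B+ tree implementation.

class Node:
    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys              # separator keys (internal) or key values (leaf)
        self.children = children      # child pointers, internal nodes only
        self.records = records        # record pointers, leaf nodes only
        self.next_leaf = next_leaf    # linked list between leaves for sequential access

    def is_leaf(self):
        return self.children is None

def search(node, key):
    while not node.is_leaf():
        # follow the child whose key range covers the search key
        i = sum(1 for k in node.keys if key >= k)
        node = node.children[i]
    return node.records[node.keys.index(key)] if key in node.keys else None

# A tiny tree: root with keys [50, 75] over three leaf nodes.
leaf3 = Node([75, 80], records=["r75", "r80"])
leaf2 = Node([50, 55], records=["r50", "r55"], next_leaf=leaf3)
leaf1 = Node([10, 20], records=["r10", "r20"], next_leaf=leaf2)
root = Node([50, 75], children=[leaf1, leaf2, leaf3])
print(search(root, 55))     # r55 (found via the branch between 50 and 75)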

B+ Tree Insertion

Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node
after 55. It is a balanced tree, and a leaf node of this tree is already full, so we cannot insert
60 there.

In this case, we have to split the leaf node, so that it can be inserted into tree without affecting
the fill factor, balance and order.
After adding 60, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and its key in the intermediate (parent) node is 50. We will
split the leaf node in the middle so that the tree's balance is not altered. So we can group
(50, 55) and (60, 65, 70) into two leaf nodes.

If these two are to be leaf nodes, the intermediate node cannot branch only on 50. It must have
60 added to it, and then we can have a pointer to the new leaf node.

This is how we can insert an entry when there is overflow. In a normal scenario, it is very
easy to find the node where it fits and then place it in that leaf node.

B+ Tree Deletion

Suppose we want to delete 60 from the above example. In this case, we have to remove 60
from the intermediate node as well as from the 4th leaf node too. If we remove it from the
intermediate node, then the tree will not satisfy the rule of the B+ tree. So we need to modify
it to have a balanced tree.

After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as
follows:
Hashing in DBMS
In a huge database structure, it is very inefficient to search all the index values and reach the
desired data. Hashing technique is used to calculate the direct location of a data record on the
disk without using index structure.

In this technique, data is stored at the data blocks whose address is generated by using the
hashing function. The memory location where these records are stored is known as data
bucket or data blocks.

In this technique, the hash function can use any column value to generate the address. Most of
the time, the hash function uses the primary key to generate the address of the data block. The
hash function can be any simple or complex mathematical function. We can
even consider the primary key itself as the address of the data block, i.e., each
row is stored in the data block whose address is the same as its primary key.

In that case, the data block addresses are the same as the primary key values. The hash
function can also be a simple mathematical function such as mod.
Suppose we use a mod(5) hash function to determine the address of the data block. In this
case, applying the mod(5) hash function to the primary keys generates 3, 3, 1, 4 and 2
respectively, and the records are stored at those data block addresses (see the sketch below).
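
A minimal Python sketch of static hashing with a mod(5) hash function on the primary key. The primary keys below are hypothetical, chosen so that mod(5) yields the bucket addresses 3, 3, 1, 4 and 2 mentioned above.

NUM_BUCKETS = 5
buckets = {i: [] for i in range(NUM_BUCKETS)}

def bucket_for(primary_key):
    return primary_key % NUM_BUCKETS     # hash function: mod(5)

def insert(primary_key, record):
    buckets[bucket_for(primary_key)].append((primary_key, record))

def lookup(primary_key):
    # go straight to the bucket; no index structure is consulted
    for key, record in buckets[bucket_for(primary_key)]:
        if key == primary_key:
            return record
    return None

for pk in (103, 108, 106, 104, 107):     # hashes to buckets 3, 3, 1, 4, 2
    insert(pk, "record-" + str(pk))
print(bucket_for(103), lookup(103))      # 3 record-103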
Types of Hashing:
