DBMS Unit-5
Transaction
A database transaction is a set of logically related operations. It contains a group of
tasks. A transaction is an action, or series of actions, performed by a single user to
access or modify the contents of the database.
All database access operations that are held between the beginning and end of a
transaction statement are considered a single logical transaction in DBMS. During
the transaction the database may be inconsistent; only once the transaction is committed
does the database move from one consistent state to another.
Example: Suppose an employee of bank transfers Rs 800 from X's account to Y's account.
This small transaction contains several low-level tasks:
Debit from X's account:
1. Open_Account(X)
2. Old_Balance = X.balance
3. New_Balance = Old_Balance - 800
4. X.balance = New_Balance
5. Close_Account(X)
Credit to Y's account:
1. Open_Account(Y)
2. Old_Balance = Y.balance
3. New_Balance = Old_Balance + 800
4. Y.balance = New_Balance
5. Close_Account(Y)
Operations of Transaction:
Read(X): Read operation is used to read the value of X from the database and stores it in
a buffer in main memory.
Write(X): Write operation is used to write the value back to the database from the
buffer.
Let's take an example of a debit transaction on an account X (whose initial value in the
database is 4000) which consists of the following operations:
1. R(X);
2. X = X - 500;
3. W(X);
The first operation reads X's value from database and stores it in a buffer.
The second operation will decrease the value of X by 500. So, buffer will contain
3500.
The third operation will write the buffer's value to the database. So, X's final value
will be 3500.
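To make the R(X)/W(X) mechanics concrete, here is a minimal Python sketch; the in-memory dictionaries, helper names and the starting value 4000 are illustrative assumptions, not any DBMS API:

database = {"X": 4000}           # assumed stored balance of X
buffer = {}                      # main-memory buffer used by the transaction

def read(item):                  # R(X): copy the value from the database into the buffer
    buffer[item] = database[item]

def write(item):                 # W(X): copy the buffered value back to the database
    database[item] = buffer[item]

read("X")                        # buffer now holds 4000
buffer["X"] -= 500               # X = X - 500 is performed on the buffer (3500)
write("X")                       # the database now stores 3500
print(database["X"])             # 3500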
But it may be possible that, because of a hardware, software or power failure, the
transaction fails before finishing all the operations in the set.
For example: If in the above transaction the debit transaction fails after executing
operation 2, then X's value will remain 4000 in the database, which is not acceptable to
the bank.
Transaction State
A transaction passes through many different states in its life cycle. These states
are known as transaction states. A transaction can be in one of the following states in the
database:
1. Active state
2. Partially committed state
3. Committed state
4. Failed state
5. Aborted state
6. Terminated state
Active State:
When the instructions of the transaction are running then the transaction is in
active state. If all the ‘read and write’ operations are performed without any error then
it goes to the “partially committed state”; if any instruction fails, it goes to the “failed
state”.
Partially Committed
After completion of all the read and write operations, the changes are made in main
memory or the local buffer. If the changes are then made permanent on the database, the
state changes to the “committed state”; in case of failure, it goes to the “failed state”.
Failed State
When any instruction of the transaction fails, or a failure occurs while making the change
of data permanent on the database, the transaction goes to the “failed state”.
Aborted State
After any type of failure, the transaction moves from the “failed state” to the
“aborted state”. Since in the previous states the changes were made only to the local
buffer or main memory, these changes are deleted or rolled back.
Committed State
It is the state when the changes are made permanent on the database; the transaction is
then complete and moves on to the “terminated state”.
Terminated State
If there isn’t any roll-back, or the transaction comes from the “committed state”,
then the system is consistent and ready for a new transaction, and the old transaction is
terminated.
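The life cycle above can be summarised as a tiny state machine. The following Python sketch is only an illustration of the allowed transitions (the state names and the helper function are assumptions, not part of any DBMS):

# Allowed transitions between transaction states.
ALLOWED = {
    "active":              {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed":              {"aborted"},
    "committed":           {"terminated"},
    "aborted":             {"terminated"},
    "terminated":          set(),
}

def move(current, nxt):
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt

state = "active"
state = move(state, "partially committed")   # all reads/writes succeeded
state = move(state, "committed")             # changes made permanent
state = move(state, "terminated")
print(state)                                  # terminated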
ACID Properties
Atomicity:
The term atomicity defines that the data remains atomic. It means that if any operation
is performed on the data, it should either be executed completely or not be executed at
all; the operation should not break in between or execute partially. When operations are
executed on a transaction, they must be completed fully, not partially.
Example:
If Remo has account ‘A’ having $30 in his account from which he wishes to send
$10 to Sheero's account, which is ‘B’. In account ‘B’, a sum of $100 is already present.
When $10 will be transferred to account ‘B’, the sum will become $110. Now, there will
be two operations that will take place. One is the amount of $10 that Remo wants to
transfer will be debited from his account ‘A’, and the same amount will get credited to
account ‘B’, i.e., into Sheero's account. Now suppose the first operation of debit
executes successfully, but the credit operation fails. Then, in Remo's account ‘A’, the
value becomes $20, while Sheero's account ‘B’ remains at $100 as it was previously.
In the above diagram, it can be seen that after crediting $10, the amount is still
$100 in account ‘B’. So, it is not an atomic transaction.
The below image shows that both debit and credit operations are done
successfully. Thus the transaction is atomic.
Thus, when a transfer loses atomicity, it becomes a serious issue in banking systems,
and so atomicity is a main focus in such systems.
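A hedged Python sketch of an all-or-nothing transfer follows; the account names, balances and the fail_credit flag are assumptions used only to simulate the failure described above:

accounts = {"A": 30, "B": 100}               # Remo's and Sheero's balances

def transfer(src, dst, amount, fail_credit=False):
    snapshot = dict(accounts)                # remember the last consistent state
    try:
        accounts[src] -= amount              # debit step
        if fail_credit:
            raise RuntimeError("credit step failed")
        accounts[dst] += amount              # credit step
    except Exception:
        accounts.clear()
        accounts.update(snapshot)            # roll back: restore the old balances
        raise

try:
    transfer("A", "B", 10, fail_credit=True)
except RuntimeError:
    pass
print(accounts)    # {'A': 30, 'B': 100} -- unchanged, so atomicity is preserved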
Consistency:
The word consistency means that the value should remain preserved always. In
DBMS, the integrity of the data should be maintained, which means if a change in the
database is made, it should remain preserved always. In the case of transactions, the
integrity of the data is very essential so that the database remains consistent before and
after the transaction. The data should always be correct.
Example:
In the transfer above, the total of the two balances is $130 before the transaction
($30 + $100) and must still be $130 after it ($20 + $110). If the total changes, the
integrity of the data is violated and the database is no longer consistent.
Isolation:
The term isolation means separation. In DBMS, isolation ensures that multiple
transactions can execute concurrently without interfering with one another; the
intermediate results of one transaction must not be visible to any other transaction.
Example:
If two operations are concurrently running on two different accounts, then the
value of both accounts should not get affected. The value should remain persistent. As
you can see in the below diagram, account A is making T1 and T2 transactions to account
B and C, but both are executing independently without affecting each other. It is known
as Isolation.
Durability:
In DBMS, the term durability ensures that the data, after the successful execution of an
operation, becomes permanent in the database. The durability of the data should be such
that even if the system fails or crashes, the database still survives. However, if the
data gets lost, it becomes the responsibility of the recovery manager to ensure the
durability of the database. For committing the values, the COMMIT command must be used
every time we make changes.
Therefore, the ACID properties of DBMS play a vital role in maintaining the
consistency and availability of data in the database.
Concurrent Executions
In a multi-user system, multiple users can access and use the same database at one
time, which is known as the concurrent execution of the database. It means that the
same database is executed simultaneously on a multi-user system by different users.
While working on the database transactions, there occurs the requirement of using
the database by multiple users for performing different operations, and in that case,
concurrent execution of the database is performed.
The simultaneous execution should be performed in an interleaved manner, and no
operation should affect the other executing operations, thus maintaining the consistency
of the database. Thus, when concurrent execution is controlled in this way, several users
can work on the database at once without making it inconsistent.
In a database transaction, the two main operations are READ and WRITE. These
operations need to be managed carefully during the concurrent execution of transactions,
because if they are interleaved incorrectly the data may become inconsistent. The
following problems occur with the concurrent execution of operations:
Problem 1: Lost Update Problem (W-W Conflict)
The problem occurs when two different database transactions perform read/write
operations on the same database items in an interleaved manner (i.e., concurrent
execution) in a way that makes the values of the items incorrect, hence making the
database inconsistent.
For example:
Consider the below diagram where two transactions TX and TY, are performed
on the same account A where the balance of account A is $300.
At time t2, transaction TX deducts $50 from account A that becomes $250 (only
deducted and not updated/write).
Alternately, at time t3, transaction TY reads the value of account A that will be
$300 only because TX didn't update the value yet.
At time t4, transaction TY adds $100 to account A that becomes $400 (only added
but not updated/write).
At time t6, transaction TX writes the value of account A that will be updated as
$250 only, as TY didn't update the value yet.
Similarly, at time t7, transaction TY writes the values of account A, so it will write
as done at time t4 that will be $400. It means the value written by TX is lost, i.e.,
$250 is lost.
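The interleaving above can be replayed with a short Python sketch; the local variables stand in for the private buffers of TX and TY (an assumption made purely for illustration):

A = 300               # balance of account A in the database

tx_local = A          # t1/t2: TX reads A and deducts $50 in its buffer
tx_local -= 50        # 250, not yet written

ty_local = A          # t3/t4: TY reads the still-unmodified A and adds $100
ty_local += 100       # 400, not yet written

A = tx_local          # t6: TX writes 250
A = ty_local          # t7: TY writes 400 -- the 250 written by TX is lost
print(A)              # 400, instead of the expected 350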
Problem 2: Dirty Read Problem (W-R Conflict)
The dirty read problem occurs when one transaction updates an item of the database,
the transaction then fails, and before the data is rolled back, the updated database
item is accessed by another transaction. This creates a write-read conflict between
the two transactions.
For example:
At time t3, transaction TX writes the updated value in account A, i.e., $350.
Then at time t4, transaction TY reads account A that will be read as $350.
Then at time t5, transaction TX rolls back due to a server problem, and the value of
account A changes back to $300 (as initially).
But transaction TY has already read the uncommitted value $350; this is the dirty
read, and the situation is therefore known as the Dirty Read Problem.
Problem 3: Unrepeatable Read Problem (W-R Conflict)
For example:
At time t1, transaction TX reads the value from account A, i.e., $300.
At time t2, transaction TY reads the value from account A, i.e., $300.
At time t3, transaction TY updates the value of account A to $400 and writes it.
After that, at time t5, transaction TX again reads the value of account A, and this
time it reads $400.
It means that within the same transaction TX, two different values of account A are
read, i.e., $300 initially and, after the update made by transaction TY, $400. This is
an unrepeatable read and is therefore known as the Unrepeatable Read Problem.
DBMS Serializability
When multiple transactions are running concurrently then there is a possibility that
the database may be left in an inconsistent state. Serializability is a concept that helps us
to check which schedules are serializable. A serializable schedule is one that always
leaves the database in a consistent state.
Types of Serializability
1. Conflict Serializability
2. View Serializability
Conflict Serializability
In DBMS there are two types of schedules: serial and non-serial. A serial schedule
doesn't support concurrent execution of transactions, while a non-serial schedule does.
Since a non-serial schedule may leave the database in an inconsistent state, we need to
check such non-serial schedules for serializability.
Conflicting operations
The two operations are called conflicting operations, if all the following three
conditions are satisfied:
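These three conditions can be expressed directly in code. Below is a small Python sketch with an assumed representation of schedule operations as (transaction, action, data item) tuples; it is an illustration, not a standard API:

def conflicts(op1, op2):
    t1, a1, x1 = op1
    t2, a2, x2 = op2
    return (t1 != t2                # condition 1: different transactions
            and x1 == x2            # condition 2: same data item
            and "W" in (a1, a2))    # condition 3: at least one is a write

print(conflicts(("T1", "R", "A"), ("T2", "W", "A")))   # True
print(conflicts(("T1", "R", "A"), ("T2", "R", "A")))   # False -- two reads never conflict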
Conflict Equivalent
Two schedules are said to be conflict equivalent if one schedule can be transformed into
the other by swapping non-conflicting operations. A schedule is conflict serializable if
it is conflict equivalent to some serial schedule.
View Serializability
It is a type of serializability that can be used to check whether a given schedule is
view serializable or not. A schedule is called view serializable if it is view equivalent
to a serial schedule.
View Equivalent
Two schedules S1 and S2 are said to be view equivalent if they satisfy the following
conditions:
Initial Read
Updated Read
Final Write
Initial Read
The initial read of both schedules must be the same. Suppose there are two schedules S1
and S2. If in schedule S1 a transaction T1 reads the data item A first, then in S2,
transaction T1 should also read A first.
Updated Read
In schedule S1, if transaction Ti reads a value of data item A that was updated by
transaction Tj, then in schedule S2, Ti should also read the value of A written by Tj.
The two schedules referred to here are not view equal because, in S1, T3 is reading A
updated by T2, while in S2, T3 is reading A updated by T1.
Final Write
The final write must be the same in both schedules. In schedule S1, if transaction T1
performs the final update of A, then in S2 the final write operation on A should also be
done by T1.
The two schedules here are view equal because the final write operation in S1 is done by
T3, and in S2 the final write operation is also done by T3.
Recoverability
Sometimes a transaction may not execute completely due to a software issue,
system crash or hardware failure. In that case, the failed transaction has to be rolled back.
The above table 1 shows a schedule with two transactions. T1 reads and writes the
value of A, and that value is read and written by T2. T2 commits, but later on T1 fails.
Due to the failure, we have to roll back T1. T2 should also be rolled back because it
read the value written by T1, but T2 cannot be rolled back because it has already
committed. This type of schedule is known as an irrecoverable schedule.
The above table shows a schedule with two transactions. Transaction T1 reads and
writes A, and that value is read and written by transaction T2. But later on, T1 fails.
Due to this, we have to roll back T1. T2 should also be rolled back because T2 has read
the value written by T1. As T2 has not committed yet, we can roll back transaction T2 as
well. So this schedule is recoverable with cascading rollback.
The above table shows a schedule with two transactions. Transaction T1 reads and writes
A and commits, and only then is that value read and written by T2. So this is a
cascadeless recoverable schedule.
Testing of Serializability (Precedence Graph)
Assume a schedule S. For S, we construct a graph known as a precedence graph. This graph
is a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of
vertices contains all the transactions participating in the schedule. The set of edges
contains all edges Ti → Tj for which one of the following three conditions holds:
1. Ti executes write(Q) before Tj executes read(Q).
2. Ti executes read(Q) before Tj executes write(Q).
3. Ti executes write(Q) before Tj executes write(Q).
For example:
Explanation:
The precedence graph for schedule S1 contains a cycle that's why Schedule S1 is non-
serializable.
Explanation:
The precedence graph for schedule S2 contains no cycle that's why Schedule S2 is
serializable.
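A minimal sketch of this serializability test is given below: it builds the precedence graph from a schedule using the three edge conditions listed earlier and then checks the graph for a cycle. The tuple representation of operations and the toy schedule are assumptions for illustration:

def precedence_edges(schedule):
    # Add an edge Ti -> Tj for every conflicting pair where Ti's operation comes first.
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (ai, aj):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set())
    visiting, done = set(), set()
    def dfs(u):
        visiting.add(u)
        for v in graph[u]:
            if v in visiting or (v not in done and dfs(v)):
                return True
        visiting.discard(u)
        done.add(u)
        return False
    return any(dfs(u) for u in graph if u not in done)

# Toy schedule: T1 writes A before T2 reads/writes A, and T2 writes A before T1 reads A.
s = [("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"), ("T1", "R", "A")]
print(has_cycle(precedence_edges(s)))   # True -> this schedule is not conflict serializable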
Implementation of Isolation
The execution of every transaction must be done in an isolated manner, such that
execution of a transaction is not known to any other transaction i.e., every transaction
must execute independently. The intermediate results generated by the transactions
should not be available to other transactions.
One simple way to enforce isolation is to lock the entire database so that only one
transaction executes at a time. This locking policy decreases the performance of the
system: since only a single transaction is executed at a time, only serial schedules are
generated. Such a concurrency control mechanism provides a poor level of concurrency.
Failure Classification
A failure in DBMS is categorized into the following three classifications to ease the process of
determining the exact nature of the problem:
Transaction failure
Disk failure
System crash
Transaction failure
A transaction failure occurs when a transaction fails to execute or reaches a point from
where it can't go any further. If a transaction or a process fails in the middle of its
execution, this is called a transaction failure.
1. Logical errors: If a transaction cannot complete due to some code error or an internal
error condition, then the logical error occurs.
2. Syntax error: It occurs when the DBMS itself terminates an active transaction because
the database system is not able to execute it. For example, the system aborts an active
transaction in case of deadlock or resource unavailability.
System Crash
A system crash can occur due to a power failure or another hardware or software failure.
Fail-stop assumption: In a system crash, non-volatile storage is assumed not to be corrupted.
Disk Failure
Disk failures were common in the early days of technology evolution, when hard-disk drives
or storage drives used to fail frequently.
Disk failure occurs due to the formation of bad sectors, a disk head crash, unreachability
of the disk, or any other failure which destroys all or part of the disk storage.
Storage System
A database system provides an ultimate view of the stored data. However, the data, in
the form of bits and bytes, gets stored in different storage devices.
We will take an overview of various types of storage devices that are used for
accessing and storing data.
For storing the data, there are different types of storage options available. These
storage types differ from one another as per the speed and accessibility. There are the
following types of storage devices used for storing the data:
Primary Storage
Secondary Storage
Tertiary Storage
Primary Storage
It is the storage area that offers quick access to the stored data. We also know primary
storage as volatile storage, because this type of memory does not store the data
permanently. As soon as the system suffers a power cut or a crash, the data is lost. Main
memory and cache are the types of primary storage.
Main Memory: It is the one that is responsible for operating on the data that is
made available by the storage medium. The main memory handles each instruction of a
computer machine. This type of memory can hold gigabytes of data on a system, but it
is usually too small to store an entire database. Finally, the main memory loses its
whole content if the system shuts down because of a power failure or other reasons.
Cache: It is one of the most expensive storage media, but it is also the fastest. A
cache is a tiny storage medium which is usually maintained by the computer hardware.
While designing algorithms and query processors for data structures, designers take
the cache effects into account.
Secondary Storage
Secondary storage is also called online storage. It is the storage area that allows
the user to save and store data permanently. This type of memory does not lose the data
due to a power failure or system crash; that's why we also call it non-volatile storage.
There are some commonly described secondary storage media which are available in
almost every type of computer system:
o Flash Memory: A flash memory stores data in USB (Universal Serial Bus) keys which
are plugged into the USB slots of a computer system. These USB keys help transfer
data to a computer system, but they vary in size limits. Unlike main memory, flash
memory does not lose the stored data due to a power cut or other failures. This type
of memory storage is most commonly used in server systems for caching frequently used
data; this leads the systems towards high performance, and it is capable of storing
larger amounts of data than the main memory.
o Magnetic Disk Storage: This type of storage media is also known as online storage
media. A magnetic disk is used for storing data for a long time. It is capable of
storing an entire database, and its contents survive power failures and system crashes.
Tertiary Storage
It is the storage type that is external from the computer system. It has the slowest
speed. But it is capable of storing a large amount of data. It is also known as Offline
storage. Tertiary storage is generally used for data backup. There are following tertiary
storage devices available:
o Tape Storage: It is a cheaper storage medium than disks. Generally, tapes are
used for archiving or backing up data. A tape provides slow access to data as it
accesses data sequentially from the start; thus, tape storage is also known as
sequential-access storage. Disk storage, by contrast, is known as direct-access storage,
as we can directly access the data from any location on the disk.
Storage Hierarchy
Besides the above, various other storage devices reside in the computer system.
These storage media are organized on the basis of data accessing speed, cost per unit of
data to buy the medium, and by medium's reliability. Thus, we can create a hierarchy of
storage media on the basis of its cost and speed.
In this hierarchy, the higher levels are expensive but fast. On moving down, the cost per
bit decreases and the access time increases. Also, the storage media from the main memory
upwards are volatile in nature, while everything below the main memory is non-volatile.
Recovery and Atomicity
When a system crashes, it may have several transactions being executed and various files
opened for them to modify data items. When the DBMS recovers from a crash:
It should check the states of all the transactions which were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
There are two types of techniques, which can help a DBMS in recovering as well as
maintaining the atomicity of a transaction −
Maintaining the logs of each transaction, and writing them onto some stable storage
before actually modifying the database.
Maintaining shadow paging, where the changes are done on a volatile memory, and
later, the actual database is updated.
Log-based Recovery
A log is a sequence of records that keeps track of the actions performed by a transaction;
log records are written to stable storage before the actual modification is applied.
When a transaction enters the system and starts execution, it writes a log about it:
<Tn, Start>
When the transaction modifies an item X, changing its value from V1 to V2, it writes a log
record of the form <Tn, X, V1, V2>. When the transaction finishes, it logs:
<Tn, commit>
The database can be modified using one of two approaches:
Deferred database modification − All logs are written on to the stable storage and
the database is updated only when the transaction commits.
Immediate database modification − Each log record is followed by the actual database
modification, i.e., the database is modified immediately after every operation.
When more than one transaction is being executed in parallel, the logs are
interleaved. At the time of recovery, it would become hard for the recovery system to
backtrack through all the logs and then start recovering; to ease this, most modern
DBMSs use the concept of checkpoints.
Checkpoint
Keeping and maintaining logs in real time and in real environment may fill out all the
memory space available in the system. As time passes, the log file may grow too big to be
handled at all. Checkpoint is a mechanism where all the previous logs are removed from
the system and stored permanently in a storage disk. Checkpoint declares a point before
which the DBMS was in consistent state, and all the transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the
following manner –
The recovery system reads the logs backwards from the end to the last checkpoint.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit>, or just <Tn,
Commit>, it puts the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log
found, it puts the transaction in undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the
transactions in the redo-list and their previous logs are removed and then redone before
saving their logs.
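The redo/undo classification can be sketched in a few lines of Python; the log format (tuples) and the transaction names are assumptions chosen only to mirror the rules above:

log = [
    ("checkpoint",),
    ("T1", "start"), ("T1", "commit"),
    ("T2", "start"),                      # T2 never committed before the crash
    ("T3", "start"), ("T3", "commit"),
]

redo_list, undo_list = set(), set()
for record in reversed(log):              # read the log backwards, up to the checkpoint
    if record == ("checkpoint",):
        break
    txn, action = record
    if action == "commit":
        redo_list.add(txn)
    elif action == "start" and txn not in redo_list:
        undo_list.add(txn)

print(redo_list)   # {'T1', 'T3'} -> these transactions are redone
print(undo_list)   # {'T2'}       -> this transaction is undone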
Indexing in DBMS
Indexing is a data structure technique which allows you to quickly retrieve records
from a database file. An index is a small table having only two columns. The first column
comprises a copy of the primary or candidate key of a table. The second column contains a
set of pointers holding the address of the disk block where that specific key value is
stored.
Index structure:
Indexes can be created using some database columns.
The first column of the database is the search key that contains a copy of the
primary key or candidate key of the table. The values of the primary key are
stored in sorted order so that the corresponding data can be accessed easily.
The second column of the database is the data reference. It contains a set of
pointers holding the address of the disk block where the value of the particular key
can be found.
Types of Indexing
Indexing is defined based on its indexing attributes. Indexing can be of the
following types −
Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are
known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of which is
10 bytes long. If the IDs start from 1, 2, 3, and so on, and we have to search for the
record with ID 543:
In the case of a database with no index, we have to read the disk blocks from the
start until we reach 543. The DBMS will reach the record after reading
543*10 = 5430 bytes.
In the case of an index, we search using the index, and (assuming each index entry
is 2 bytes) the DBMS reaches the record after reading 542*2 = 1084 bytes, which is
far less than in the previous case.
Primary Index
If the index is created on the basis of the primary key of the table, then it is
known as primary indexing. These primary keys are unique to each record and have a
1:1 relation with the records.
A primary index is an ordered file of fixed-length records with two fields. The first
field is the same as the primary key, and the second field points to that specific data
block.
As primary keys are stored in sorted order, the performance of the searching
operation is quite efficient.
Dense Index
Sparse Index
Dense Index
The dense index contains an index record for every search key value in the data
file. It makes searching faster.
In this, the number of records in the index table is same as the number of records
in the main table.
It needs more space to store index record itself. The index records have the search
key and a pointer to the actual record on the disk.
Sparse Index
It is an index record that appears for only some of the values in the file. Sparse
Index helps you to resolve the issues of dense Indexing in DBMS. In this method of
indexing technique, a range of index columns stores the same data block address, and
when data needs to be retrieved, the block address will be fetched.
However, a sparse index stores index records for only some of the search-key values. It
needs less space and less maintenance overhead for insertions and deletions, but it is
slower than a dense index for locating records.
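The dense/sparse distinction can be illustrated with a small sketch of a sparse-index lookup: only the first key of each block is indexed, so we binary-search for the largest indexed key not greater than the target and then scan that block. The keys and block layout below are assumptions:

import bisect

index_keys = [100, 200, 300, 400]                            # first key of each data block
blocks = [[100, 120, 150], [200, 230], [300, 310, 390], [400, 450]]

def sparse_lookup(key):
    pos = bisect.bisect_right(index_keys, key) - 1           # largest index entry <= key
    if pos < 0:
        return None
    return key if key in blocks[pos] else None               # sequential scan inside the block

print(sparse_lookup(310))   # 310  (found in the third block)
print(sparse_lookup(125))   # None (no such record)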
Clustering Index
A clustered index can be defined as an ordered data file. Sometimes the index is
created on non-primary key columns which may not be unique for each record.
In this case, to identify the record faster, we will group two or more columns to
get the unique value and create index out of them. This method is called a
clustering index.
The records which have similar characteristics are grouped together, and indexes are
created for these groups.
Example:
Suppose a company contains several employees in each department. Suppose we
use a clustering index, where all employees which belong to the same Dept_ID are
considered within a single cluster, and index pointers point to the cluster as a whole.
Here Dept_Id is a non-unique key.
Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also
grows. These mappings are usually kept in primary memory so that address fetching is
faster; the actual data is then searched in secondary memory using the address obtained
from the mapping. If the mapping size grows, fetching the address itself becomes slower,
and the sparse index is no longer efficient. To overcome this problem, secondary indexing
(multilevel indexing) is introduced.
For example:
If you want to find the record of roll 111 in the diagram, then it will search the
highest entry which is smaller than or equal to 111 in the first level index. It will
get 100 at this level.
Then, in the second-level index, it again finds the largest entry that is less than or
equal to 111, which is 110. Now, using the address associated with 110, it goes to the
data block and searches each record sequentially until it finds 111.
Introduction of B+ Trees
The B+ tree is a balanced multi-way search tree (not a binary tree). It follows a
multi-level index format.
In the B+ tree, the leaf nodes hold the actual data pointers. The B+ tree ensures that
all leaf nodes remain at the same level.
In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can
support sequential access as well as random access.
In the B+ tree, every leaf node is at an equal distance from the root node. The B+
tree is of order n, where n is fixed for every B+ tree.
Internal node
An internal node of the B+ tree can contain at least n/2 record pointers except the
root node.
Leaf node
The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key
values.
Every leaf node of the B+ tree contains one block pointer P to point to next leaf
node.
Consider the STUDENT table below. This can be stored in B+ tree structure as shown
below. We can observe here that it divides the records into two and splits into left node
and right node.
The values shown in the intermediary nodes are only the pointers to next level. All the
leaf nodes will have the actual records in a sorted order.
If we have to search for any record, they are all found at leaf node. Hence searching any
record will take same time because of equidistance of the leaf nodes. Also they are all
sorted. Hence searching a record is like a sequential search and does not take much time.
Suppose we want to search 65 in the below B+ tree structure. First we will fetch for the
intermediary node which will direct to the leaf node that can contain record for 65. So
we find branch between 50 and 75 nodes in the intermediary node. Then we will be
redirected to the third leaf node at the end. Here DBMS will perform sequential search to
find 65. Suppose, instead of 65, we have to search for 60. What will happen in this case?
We will not be able to find it in any leaf node, so the search simply fails. Note that no
insertion, update or delete is allowed while a search is in progress in the B+ tree.
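The search just described can be sketched for a two-level B+ tree as follows; the internal keys and leaf contents are assumptions that mimic the example (65 is found, 60 is not):

import bisect

internal_keys = [50, 75]                                      # keys in the intermediary node
leaves = [[10, 20, 40], [50, 55, 65, 70], [75, 80, 95]]       # sorted records in the leaf nodes

def bplus_search(key):
    child = bisect.bisect_right(internal_keys, key)           # pick the branch: <50, 50..74, >=75
    return key in leaves[child]                               # sequential search within the leaf

print(bplus_search(65))   # True
print(bplus_search(60))   # False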
Insertion in B+ tree
Suppose we have to insert a record 60 in below structure. It will go to 3rd leaf node after
55. Since it is a balanced tree and that leaf node is already full, we cannot insert the
record there. But it should be inserted there without affecting the fill factor, balance and
order. So the only option here is to split the leaf node. But how do we split the nodes?
The 3rd leaf node would then have the values (50, 55, 60, 65, 70), while its current key
in the intermediary node is 50. We split the leaf node in the middle so that its balance
is not altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes. If these two have
to be leaf nodes, the intermediary node cannot branch from 50 alone; 60 has to be added
to it, and then we can have a pointer to the new leaf node.
Delete in B+ tree
Suppose we have to delete 60 from the above example. In this case, we have to remove 60
from the 4th leaf node as well as from the intermediary node. If we remove it only from
the intermediary node, the tree will not satisfy the B+ tree rules, so we need to modify
it to keep the tree balanced. After deleting 60 from the above B+ tree and re-arranging
the nodes, it will appear as below.
Suppose we have to delete 15 from the above tree. We will traverse to the 1st leaf node
and simply delete 15 from that node. There is no need for any re-arrangement, as the tree
remains balanced and 15 does not appear in the intermediary node.
File Organization
A file is a collection of records. Using the primary key, we can access the records. The
type and frequency of access are determined by the type of file organization used for a
given set of records.
File organization is used to describe the way in which the records are stored in
terms of blocks, and the blocks are placed on the storage medium.
The first approach to mapping the database to files is to use several files and store
only fixed-length records of a single type in any given file. An alternative approach is
to structure our files so that they can accommodate records of multiple lengths.
Files of fixed length records are easier to implement than the files of variable
length records.
Performing insert, delete or update transactions on the records should be quick and
easy.
Various methods have been introduced to Organize files. These particular methods have
advantages and disadvantages on the basis of access or selection. Thus, it is all upon the
programmer to decide the best suited file Organization method according to his
requirements.
Some types of file organization are:
Sequential File Organization
Heap File Organization
Hash File Organization
B+ Tree File Organization
ISAM (Indexed Sequential Access Method)
Cluster File Organization
Sequential File Organization
This method is the easiest method of file organization. In this method, files are stored
sequentially. It can be implemented in two ways:
1. Pile File Method:
In this method, records are stored one after another in the order in which they are
inserted. In case of updating or deleting a record, the record is first searched in the
memory blocks; when it is found, it is marked for deletion and the new record is inserted.
2. Sorted File Method:
In this method, the new record is always inserted at the file's end, and then the
sequence is sorted in ascending or descending order. Sorting of the records is based on
a primary key or some other key.
In the case of modification of any record, the record is updated, the file is then
sorted, and the updated record is placed in the right position.
Suppose there is a preexisting sorted sequence of records R1, R3 and so on up to R6
and R7. Suppose a new record R2 has to be inserted into the sequence; then it will be
inserted at the end of the file, and then the sequence will be sorted.
Pros of sequential file organization:
It is a fast and efficient method for handling huge amounts of data.
In this method, files can be easily stored on cheaper storage mechanisms like
magnetic tapes.
This method is used when most of the records have to be accessed, such as grade
calculation of students or generating salary slips.
Cons of sequential file organization:
It wastes time, as we cannot jump directly to the required record but have to move
through the records sequentially.
The sorted file method takes more time and space for sorting the records.
Heap File Organization
It is the simplest and most basic type of organization. It works with data blocks. In
heap file organization, the records are inserted at the file's end. When the records
are inserted, it doesn't require sorting or ordering of the records.
When the data block is full, the new record is stored in some other block. This new
data block need not to be the very next data block, but it can select any data
block in the memory to store new records. The heap file is also known as an
unordered file.
In the file, every record has a unique id, and every page in a file is of the same
size. It is the DBMS responsibility to store and manage the new records.
Suppose we have five records R1, R3, R6, R4 and R5 in a heap, and we want to insert a
new record R2 into the heap. If data block 3 is full, then R2 will be inserted into any
of the data blocks selected by the DBMS, let's say data block 1.
If the database is very large then searching, updating or deleting of record will be time-
consuming because there is no sorting or ordering of records. In the heap file
organization, we need to check all the data until we get the requested record.
Pros of heap file organization:
It is a very good method of file organization for bulk insertion. If a large amount of
data needs to be loaded into the database at one time, this method is best suited.
In the case of a small database, fetching and retrieving records is faster than in a
sequential file organization.
Cons of heap file organization:
This method is inefficient for a large database, because it takes time to search for or
modify a record.
Hash File Organization
Hash file organization uses the computation of a hash function on some fields of the
records. The hash function's output determines the location of the disk block where the
records are to be placed.
In this method, there is no effort for searching and sorting the entire file. In this method,
each record will be stored randomly in the memory.
B+ File Organization
It uses the same concept of key-index where the primary key is used to sort the
records. For each primary key, the value of the index is generated and mapped
with the record.
The B+ tree is similar to a binary search tree (BST), but it can have more than two
children. In this method, all the records are stored only at the leaf node.
Intermediate nodes act as a pointer to the leaf nodes. They do not contain any
records.
There is an intermediary layer with nodes. They do not store the actual record.
They have only pointers to the leaf node.
The nodes to the left of the root node contain the prior value of the root and
nodes to the right contain next value of the root, i.e., 15 and 30 respectively.
There is only one leaf node which has only values, i.e., 10, 12, 17, 20, 24, 27 and
29.
Searching for any record is easier as all the leaf nodes are balanced.
In this method, searching any record can be traversed through the single path and
accessed easily.
In this method, searching becomes very easy, as all the records are stored only in
the leaf nodes and are sorted in a sequential linked list.
The size of the B+ tree has no restrictions, so the number of records can increase
or decrease and the B+ tree structure can also grow or shrink.
It is a balanced tree structure, and any insert/update/delete does not affect the
performance of tree.
ISAM (Indexed Sequential Access Method)
ISAM is an advanced sequential file organization method. In this method, records are
stored in the file using the primary key. An index value is generated for each primary
key and mapped with the record. This index contains the address of the record in the file.
If any record has to be retrieved based on its index value, then the address of the data
block is fetched and the record is retrieved from the memory.
Pros of ISAM:
Since each record has the address of its data block in this method, searching for a
record in a huge database is quick and easy.
This method supports range retrieval and partial retrieval of records. Since the
index is based on the primary key values, we can retrieve the data for the given
range of value. In the same way, the partial value can also be easily searched, i.e.,
the student’s name starting with 'JA' can be easily searched.
Cons of ISAM
This method requires extra space in the disk to store the index value.
When the new records are inserted, then these files have to be reconstructed to
maintain the sequence.
When the record is deleted, then the space used by it needs to be released.
Otherwise, the performance of the database will slow down.
Cluster File Organization
When two or more records are stored in the same file, it is known as clustering. These
files will have two or more tables in the same data block, and the key attributes which
are used to map these tables together are stored only once.
This method reduces the cost of searching for various records in different files.
The cluster file organization is used when there is a frequent need for joining the
tables with the same condition. These joins will give only a few records from both
tables. In the given example, we are retrieving the record for only particular
departments. This method can't be used to retrieve the record for the entire
department.
1. Indexed Clusters:
In indexed cluster, records are grouped based on the cluster key and stored together. The
above EMPLOYEE and DEPARTMENT relationship is an example of an indexed cluster.
Here, all the records are grouped based on the cluster key- DEP_ID and all the records are
grouped.
2. Hash Clusters:
It is similar to the indexed cluster. In hash cluster, instead of storing the records based on
the cluster key, we generate the value of the hash key for the cluster key and store the
records with the same hash key value.
The cluster file organization is used when there is a frequent request for joining
the tables with same joining condition.
It provides the efficient result when there is a 1:M mapping between the tables.
This method gives low performance for very large databases.
If there is any change in the joining condition, then this method cannot be used; if we
change the joining condition, traversing the file takes a lot of time.
Hashing in DBMS
In this technique, data is stored in data blocks whose address is generated by using a
hashing function. The memory location where these records are stored is known as a data
bucket or data block.
In this method, the hash function can use any column value to generate the address. Most
of the time, the hash function uses the primary key to generate the address of the data
block. A hash function can be any simple or complex mathematical function. We can even
consider the primary key itself as the address of the data block; that means each row
will be stored in the data block whose address is the same as its primary key.
The above diagram shows data block addresses same as primary key value. This
hash function can also be a simple mathematical function like exponential, mod, cos, sin,
etc. Suppose we have mod (5) hash function to determine the address of the data block.
In this case, it applies mod (5) hash function on the primary keys and generates 3, 3, 1, 4
and 2 respectively, and records are stored in those data block addresses.
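A short Python sketch of this placement rule follows; the employee IDs and names are assumptions picked so that mod(5) produces 3, 3, 1, 4 and 2, as in the description above:

def hash_block(primary_key, buckets=5):
    return primary_key % buckets                 # mod(5) hash function on the primary key

records = {103: "Remo", 108: "Sheero", 106: "Asha", 104: "Ben", 102: "Lee"}
blocks = {}
for key, row in records.items():
    blocks.setdefault(hash_block(key), []).append((key, row))

print(hash_block(103))   # 3 -> record 103 is placed in data block 3
print(blocks)            # records grouped by the data block address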
Types of Hashing:
Static Hashing
Dynamic Hashing
Static Hashing
In static hashing, the resultant data bucket address will always be the same. That
means if we generate an address for EMP_ID =103 using the hash function mod (5) then it
will always result in same bucket address 3. Here, there will be no change in the bucket
address.
Hence in this static hashing, the number of data buckets in memory remains
constant throughout. In this example, we will have five data buckets in the memory used
to store the data.
Searching a record
When a record needs to be searched, then the same hash function retrieves the
address of the bucket where the data is stored.
Insert a Record
When a new record is inserted into the table, then we will generate an address for
a new record based on the hash key and record is stored in that location.
Delete a Record
To delete a record, we will first fetch the record which is supposed to be deleted.
Then we will delete the records for that address in memory.
Update a Record
To update a record, we will first search it using a hash function, and then the data
record is updated.
Bucket Overflow
Suppose we want to insert a new record into the file, but the address of the data bucket
generated by the hash function is not empty (data already exists at that address). This
situation in static hashing is known as bucket overflow.
To overcome this situation, there are various methods. Some commonly used methods are
as follows:
1. Open Hashing
When a hash function generates an address at which data is already stored, then
the next bucket will be allocated to it. This mechanism is called as Linear Probing.
For example: Suppose R3 is a new record which needs to be inserted, and the hash function
generates the address 112 for R3. But the generated address is already full, so the
system searches for the next available data bucket, 113, and assigns R3 to it.
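A hedged sketch of this linear-probing step is shown below; the bucket addresses 110 to 114 and their contents are assumptions matching the example:

buckets = {110: "R1", 111: "R2", 112: "R5", 113: None, 114: None}

def insert_linear_probe(record, address):
    addresses = sorted(buckets)
    start = addresses.index(address)
    for addr in addresses[start:]:               # probe forward from the hashed address
        if buckets[addr] is None:
            buckets[addr] = record
            return addr
    raise OverflowError("no free bucket found")

print(insert_linear_probe("R3", 112))   # 113 -- bucket 112 was full, so R3 goes to 113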
2. Close Hashing
When buckets are full, then a new data bucket is allocated for the same hash
result and is linked after the previous one. This mechanism is known as Overflow
chaining.
For example: Suppose R3 is a new record which needs to be inserted into the table, and
the hash function generates the address 110 for it. But this bucket is full and cannot
store the new data. In this case, a new bucket is inserted at the end of bucket 110 and
is linked to it.
Dynamic Hashing
The dynamic hashing method is used to overcome the problems of static hashing
like bucket overflow.
In this method, data buckets grow or shrink as the records increases or decreases.
This method is also known as Extendable hashing method.
This method makes hashing dynamic, i.e., it allows insertion or deletion without
resulting in poor performance.
How to search a key:
First, check how many bits are used in the directory; these bits are called i.
Take the least significant i bits of the hash address. This gives an index into the
directory (a small sketch of this directory lookup appears after the insertion steps
below).
Now, using that index, go to the directory and find the bucket address where the record
might be.
How to insert a new record:
First, follow the same procedure as for retrieval, ending up in some bucket.
If there is still space in that bucket, place the record in it.
If the bucket is full, then we will split the bucket and redistribute the records.
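The directory lookup used in both procedures can be sketched as follows; the global depth i = 2 and the directory contents are assumptions matching the example below:

i = 2                                             # number of bits used by the directory
directory = {0b00: "B0", 0b01: "B1", 0b10: "B2", 0b11: "B3"}

def bucket_for(hash_address):
    index = hash_address & ((1 << i) - 1)         # take the least significant i bits
    return directory[index]

print(bucket_for(0b11010))   # B2 -- the last two bits are 10
print(bucket_for(0b10001))   # B1 -- the last two bits are 01 (key 9 in the example below)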
For example:
Consider the following grouping of keys into buckets, depending on the prefix of
their hash address:
The last two bits of 2 and 4 are 00. So it will go into bucket B0. The last two bits of
5 and 6 are 01, so it will go into bucket B1. The last two bits of 1 and 3 are 10, so it will
go into bucket B2. The last two bits of 7 are 11, so it will go into B3.
Insert key 9 with hash address 10001 into the above structure:
Since key 9 has hash address 10001, whose last two bits are 01, it must go into bucket
B1. But bucket B1 is full, so it will get split.
The splitting will separate 5, 9 from 6 since last three bits of 5, 9 are 001, so it will
go into bucket B1, and the last three bits of 6 are 101, so it will go into bucket B5.
Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100
directory entries, because the last two bits of both entries are 00.
Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110
directory entries, because the last two bits of both entries are 10.
Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 directory
entries, because the last two bits of both entries are 11.
Advantages of dynamic hashing:
In this method, the performance does not decrease as the data in the system grows; the
structure simply acquires more memory to accommodate the data.
In this method, memory is well utilized, as it grows and shrinks with the data; there
will not be any unused memory lying around.
This method is good for dynamic databases where data grows and shrinks frequently.
Disadvantages of dynamic hashing:
In this method, if the data size increases, then the number of buckets also increases.
The addresses of the data are maintained in the bucket address table, and because the
data addresses keep changing as buckets grow and shrink, maintaining the bucket address
table becomes tedious when there is a huge increase in data.
A bucket overflow situation can also occur here, but it may take longer to reach this
situation than in static hashing.