DBMS Unit5
Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and
Secondary Indexes, Index Data Structures, Hash-Based Indexing, Tree-Based Indexing,
Comparison of File Organizations, Indexes and Performance Tuning, Intuitions for Tree
Indexes, Indexed Sequential Access Methods (ISAM), B+ Trees: A Dynamic Index
Structure
Storage Hierarchy
A computer system contains several kinds of storage devices. These storage media are
classified by data access speed, by the cost per unit of data to buy the medium, and by the
medium's reliability. Arranging the storage media according to speed and cost gives the storage
hierarchy shown in the image.
In the image, the higher levels are expensive but fast. Moving down the hierarchy, the cost per
bit decreases and the access time increases. The storage media from main memory upward are
volatile, while all the media below main memory are non-volatile.
https://www.javatpoint.com/storage-system-in-dbms
Sorted File Method: In this method, a new record is first inserted at the end of the file, and
the file is then sorted in ascending or descending order. Records are sorted on the primary key
or on any other key. When a record is modified, the record is updated first and the file is then
sorted again, so that the updated record ends up in its correct position.
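The sorted file method can be sketched in a few lines. This is only an illustration under stated assumptions: the file is modelled as an in-memory list, and the record layout (an `id` key plus a `name` field) and the function names are hypothetical.

```python
def insert_record(sorted_file, record):
    """Append the record at the end of the file, then re-sort on the key."""
    sorted_file.append(record)
    sorted_file.sort(key=lambda r: r["id"])  # ascending order on the key


def update_key(sorted_file, old_id, new_id):
    """Update the record first, then sort so it lands in the right place."""
    for record in sorted_file:
        if record["id"] == old_id:
            record["id"] = new_id
            break
    sorted_file.sort(key=lambda r: r["id"])


# Hypothetical student file, already sorted on id.
students = [{"id": 1, "name": "A"}, {"id": 4, "name": "B"}, {"id": 7, "name": "C"}]

insert_record(students, {"id": 5, "name": "D"})
ids_after_insert = [r["id"] for r in students]

update_key(students, 1, 9)
ids_after_update = [r["id"] for r in students]
```

Re-sorting after every change is what makes insertion and update expensive in this organization; a real system would work on disk blocks rather than a Python list.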
Hash file organization - When a record has to be retrieved using the hash key columns, the
address is generated from the hash key and the whole record is retrieved from that address. In
the same way, when a new record has to be inserted, its address is generated using the hash key
and the record is inserted directly. The same process applies to delete and update.
In this method, there is no need to search or sort the entire file. The records are not kept in
any key order: each record is stored at the address that its hash key produces.
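A minimal sketch of this address generation is given here, assuming a modulo hash function and eight block addresses. Both choices, as well as the record fields and function names, are illustrative assumptions and do not come from the notes.

```python
NUM_BLOCKS = 8  # assumed number of data block addresses


def block_address(hash_key):
    """Generate the data block address from the hash key (assumed: modulo)."""
    return hash_key % NUM_BLOCKS


# The "file": one list of records per block address.
blocks = {address: [] for address in range(NUM_BLOCKS)}


def insert(record):
    # The address is computed and the record is stored there directly.
    blocks[block_address(record["id"])].append(record)


def retrieve(hash_key):
    # Only the one addressed block is examined, never the whole file.
    for record in blocks[block_address(hash_key)]:
        if record["id"] == hash_key:
            return record
    return None


def delete(hash_key):
    address = block_address(hash_key)
    blocks[address] = [r for r in blocks[address] if r["id"] != hash_key]


insert({"id": 21, "name": "Ravi"})   # 21 % 8 = 5
insert({"id": 13, "name": "Meena"})  # 13 % 8 = 5, same block as 21
```

Insert, retrieve and delete all begin with the same address computation, which is why none of them needs to scan or sort the file.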
B+ file organization - B+ tree file organization is an advanced form of the indexed
sequential access method. It uses a tree-like structure to store records in a file. It uses the
same key-index concept, where the primary key is used to sort the records. For each
primary key, an index value is generated and mapped to the record. The B+
tree is similar to a binary search tree (BST), but a node can have more than two children. In
this method, all the records are stored only at the leaf nodes. Intermediate nodes act as
pointers to the leaf nodes. They do not contain any records.
Indexed sequential access method (ISAM) - If a record has to be retrieved based on its index
value, the address of its data block is fetched from the index and the record is retrieved from
that block.
Pros of ISAM:
Since each record has the address of its data block, searching for a record in
a huge database is quick and easy.
This method supports range retrieval and partial retrieval of records. Since the
index is based on the primary key values, we can retrieve the data for a given
range of values. In the same way, a partial value can also be searched easily, e.g.,
student names starting with 'JA'.
Cons of ISAM:
This method requires extra space on the disk to store the index values. When
new records are inserted, the files have to be reconstructed to maintain
the sequence.
When a record is deleted, the space used by it needs to be released.
Otherwise, the performance of the database will degrade.
Cluster file organization - When records from two or more tables are stored in the same file,
it is known as a cluster. These files hold two or more tables in the same data block, and the
key attributes that are used to map these tables together are stored only once. This
method reduces the cost of searching for related records in different files. The cluster file
organization is used when there is a frequent need to join the tables on the same
condition. These joins return only a few records from both tables. In the given
example, we are retrieving the records for only particular departments. This method can't
be used to retrieve the record for the entire department.
Cluster Indexes
Primary and Secondary Indexes
Indexing is used to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed.
The index is a type of data structure. It is used to locate and access the data in a database table
quickly.
Index structure: An index is built on one or more columns of a table and has two columns of its
own.
The first column of the index is the search key, which contains a copy of the primary key or
candidate key of the table. The values of the primary key are stored in sorted order so that the
corresponding data can be accessed easily.
The second column of the index is the data reference. It contains a set of pointers holding the
address of the disk block where the value of the particular key can be found.
Indexing Methods
Ordered indices - The indices are usually sorted to make searching faster. The indices which are
sorted are known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of which is 10
bytes long. The IDs start with 1, 2, 3, ... and so on, and we have to find the employee with ID 543.
In the case of a database with no index, we have to scan the disk blocks from the start until
we reach 543. The DBMS will read the record after reading 543*10 = 5430 bytes. In the case of an
index, assuming each index entry is 2 bytes long, the DBMS will read the record after reading
542*2 = 1084 bytes, which is far less than in the previous case.
Primary Index - If the index is created on the basis of the primary key of the table, it is known
as primary indexing. The primary key is unique for each record, so there is a 1:1 relation
between key values and records.
As primary keys are stored in sorted order, the performance of the searching operation is quite
efficient.
The primary index can be classified into two types: Dense index and Sparse index.
Dense index - The dense index contains an index record for every search key value in the data
file. It makes searching faster.
In this, the number of records in the index table is the same as the number of records in the
main table.
It needs more space to store the index records themselves. Each index record holds the search
key and a pointer to the actual record on the disk.
Sparse index
In a sparse index, index records appear only for some of the items in the data file. Each index
record points to a block. Instead of pointing to every record in the main table, the index points
to records in the main table at intervals.
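The difference between the two kinds of primary index can be sketched as follows. The data file contents, block size and function names are made up for illustration; the sparse index here is assumed to hold the first key of each block.

```python
import bisect

# Hypothetical data file: three blocks of (key, record) pairs, sorted by key.
data_blocks = [
    [(10, "r10"), (20, "r20"), (30, "r30")],
    [(40, "r40"), (50, "r50"), (60, "r60")],
    [(70, "r70"), (80, "r80"), (90, "r90")],
]

# Dense index: one entry per search key value -> (block number, slot).
dense_index = {
    key: (b, s)
    for b, block in enumerate(data_blocks)
    for s, (key, _) in enumerate(block)
}

# Sparse index: one entry per block, holding the first key in that block.
sparse_keys = [block[0][0] for block in data_blocks]


def dense_lookup(key):
    position = dense_index.get(key)
    if position is None:
        return None
    b, s = position
    return data_blocks[b][s][1]


def sparse_lookup(key):
    # Find the last index entry whose key is <= the search key ...
    b = bisect.bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    # ... then scan that block sequentially.
    for k, record in data_blocks[b]:
        if k == key:
            return record
    return None
```

The dense index has nine entries for nine records and jumps straight to the record, while the sparse index has only three entries but must scan inside the block it points to.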
Clustering Index
A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary key columns, which may not be unique for each record.
In this case, to identify the records faster, we group two or more columns to get a unique
value and create an index out of them. This method is called a clustering index. The records
which have similar characteristics are grouped, and indexes are created for these groups.
Example: Suppose a company has several employees in each department. With a
clustering index, all employees who belong to the same Dept_ID are considered to be within a
single cluster, and index pointers point to the cluster as a whole. Here Dept_ID is a non-unique
key.
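The Dept_ID example might look like the sketch below. The employee rows and the function name are hypothetical, and the index is assumed to store the position of the first record of each cluster in a file ordered by Dept_ID.

```python
# Hypothetical employee records: (emp_id, name, dept_id).
# The data file is kept ordered on the non-unique key dept_id.
employees = sorted(
    [
        (101, "Asha", 20),
        (102, "Bala", 10),
        (103, "Chitra", 20),
        (104, "Dev", 30),
        (105, "Esha", 10),
    ],
    key=lambda e: e[2],
)

# Clustering index: one entry per distinct dept_id, pointing at the position
# of the first record of that cluster.
cluster_index = {}
for position, (_, _, dept_id) in enumerate(employees):
    cluster_index.setdefault(dept_id, position)


def employees_in(dept_id):
    """Follow the index pointer, then read the cluster sequentially."""
    start = cluster_index.get(dept_id)
    if start is None:
        return []
    cluster = []
    for record in employees[start:]:
        if record[2] != dept_id:
            break  # the next cluster has started
        cluster.append(record)
    return cluster
```

One index entry serves the whole department, which is the sense in which the pointer refers to the cluster as a whole.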
Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows. These
mappings are usually kept in primary memory so that the address fetch is fast. The actual data
is then read from secondary memory using the address obtained from the mapping. If the
mapping grows too large, fetching the address itself becomes slow, and the sparse
index is no longer efficient. To overcome this problem, secondary indexing is introduced. In
secondary indexing, another level of indexing is introduced to reduce the size of the mapping. In
this method, large ranges of the column values are selected initially so that the mapping of the
first level stays small. Each range is then divided into smaller ranges. The mapping of
the first level is stored in primary memory, so that the address fetch is fast. The mapping of the
second level and the actual data are stored in secondary memory (hard disk).
For example:
If you want to find the record of roll 111 in the diagram, the search looks for the highest entry
that is smaller than or equal to 111 in the first-level index. It gets 100 at this level. Then,
in the second-level index, it again looks for the highest entry that is smaller than or equal to
111 and gets 110. Now, using the address 110, it goes to the data block and scans the records
until it finds 111. This is how a search is performed in this method. Inserting, updating, or
deleting is done in the same manner.
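The roll 111 search can be traced with a small sketch. Since the diagram is not reproduced here, the index entries (steps of 100 at the first level, steps of 10 at the second) and the block contents are assumptions chosen to agree with the values 100 and 110 mentioned in the text.

```python
import bisect


def floor_entry(sorted_keys, key):
    """Return the highest entry that is <= key, or None if there is none."""
    i = bisect.bisect_right(sorted_keys, key) - 1
    return sorted_keys[i] if i >= 0 else None


# Assumed first-level index (small, kept in primary memory).
first_level = [100, 200, 300]

# Assumed second-level index (on disk): finer ranges under each entry.
second_level = {
    100: [100, 110, 120],
    200: [200, 210, 220],
    300: [300, 310, 320],
}

# Assumed data blocks: each block holds ten consecutive roll numbers.
data_blocks = {
    start: list(range(start, start + 10))
    for starts in second_level.values()
    for start in starts
}


def find_roll(roll):
    """Return (first-level entry, second-level entry, roll), or None."""
    outer = floor_entry(first_level, roll)
    if outer is None:
        return None
    inner = floor_entry(second_level[outer], roll)
    for stored in data_blocks[inner]:  # sequential scan inside the block
        if stored == roll:
            return (outer, inner, roll)
    return None
```

With these assumed entries, `find_roll(111)` passes through 100 at the first level and 110 at the second level before scanning the data block, matching the walk-through above.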
The above diagram shows data block addresses that are the same as the primary key values. The
hash function can also be a simple mathematical function such as exponential, mod, cos, sin, etc.
Suppose we have a mod (5) hash function to determine the address of the data block. In this case,
it applies the mod (5) hash function to the primary keys and generates 3, 3, 1, 4 and 2
respectively, and the records are stored at those data block addresses.
Types of Hashing:
Static Hashing - In static hashing, the resultant data bucket address is always the
same. That means that if we generate an address for EMP_ID = 103 using the hash function mod (5),
it always results in the same bucket address, 3. The bucket address never changes.
Hence, in static hashing, the number of data buckets in memory remains constant throughout.
In this example, we have five data buckets in memory to store the data.
Closed Hashing - When a bucket is full, a new data bucket is allocated for the same hash
result and is linked after the previous one. This mechanism is known as overflow chaining. For
example: Suppose R3 is a new record which needs to be inserted into the table, and the hash
function generates address 110 for it. But this bucket is too full to store the new data. In this
case, a new bucket is added at the end of bucket 110 and is linked to it.
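Static hashing with overflow chaining can be sketched as below, reusing the mod (5) hash function from the text. The bucket capacity of two records, the class and function names, and the keys 108 and 113 are assumptions made for the illustration; only EMP_ID 103 comes from the notes.

```python
NUM_BUCKETS = 5      # fixed for the lifetime of the file (static hashing)
BUCKET_CAPACITY = 2  # assumed capacity, kept small to force an overflow


class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None  # link to the next bucket in the chain


def bucket_address(key):
    return key % NUM_BUCKETS  # the mod (5) hash function


table = [Bucket() for _ in range(NUM_BUCKETS)]


def insert(key):
    bucket = table[bucket_address(key)]
    # Walk the chain until a bucket with free space is found,
    # allocating and linking a new overflow bucket when needed.
    while len(bucket.records) >= BUCKET_CAPACITY:
        if bucket.overflow is None:
            bucket.overflow = Bucket()
        bucket = bucket.overflow
    bucket.records.append(key)


def search(key):
    bucket = table[bucket_address(key)]
    while bucket is not None:
        if key in bucket.records:
            return True
        bucket = bucket.overflow
    return False


def chain_length(address):
    length, bucket = 0, table[address]
    while bucket is not None:
        length += 1
        bucket = bucket.overflow
    return length


for emp_id in (103, 108, 113):  # all three hash to bucket address 3
    insert(emp_id)
```

The third key does not fit in bucket 3, so a second bucket is linked behind it. Long chains like this are the performance problem that dynamic hashing is meant to avoid.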
Dynamic Hashing - The dynamic hashing method is used to overcome the problems of static
hashing, such as bucket overflow.
In this method, data buckets grow or shrink as the number of records increases or decreases.
This method is also known as the extendible hashing method.
This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting in poor
performance.
How to search a key
First, calculate the hash address of the key.
Check how many bits are used in the directory; call this number i.
Take the least significant i bits of the hash address. This gives an index into the directory.
Using this index, go to the directory and find the address of the bucket where the record might
be.
How to insert a new record
First, follow the same procedure as for retrieval, ending up in some
bucket. If there is still space in that bucket, place the record in it.
If the bucket is full, split the bucket and redistribute the
records.
For example:
Consider the following grouping of keys into buckets, depending on the last two bits of their
hash address:
The last two bits of 2 and 4 are 00, so they go into bucket B0. The last two bits of 5 and 6 are
01, so they go into bucket B1. The last two bits of 1 and 3 are 10, so they go into bucket B2.
The last two bits of 7 are 11, so it goes into bucket B3.
Insert key 9 with hash address 10001 into the above structure:
Since the last two bits of the hash address of key 9 are 01, it must go into bucket B1. But
bucket B1 is full, so it gets split, and the directory now uses three bits.
The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go
into bucket B1, and the last three bits of 6 are 101, so it goes into bucket B5.
Keys 2 and 4 are still in B0. Bucket B0 is pointed to by the 000 and 100 entries because the
last two bits of both entries are 00.
Keys 1 and 3 are still in B2. Bucket B2 is pointed to by the 010 and 110 entries because the
last two bits of both entries are 10.
Key 7 is still in B3. Bucket B3 is pointed to by the 111 and 011 entries because the last two
bits of both entries are 11.
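The search and insert procedures above can be put together in a small sketch of extendible hashing. Several things here are assumptions: the 5-bit hash addresses for keys 1 to 7 are illustrative values chosen to agree with the bit suffixes stated in the text (only the address of key 9 is given explicitly), the bucket capacity is taken to be two, and the class and attribute names are hypothetical.

```python
HASH_BITS = 5
CAPACITY = 2  # assumed bucket capacity

# Assumed hash addresses; their last bits agree with the example in the text.
HASH = {1: 0b11010, 2: 0b00000, 3: 0b11110, 4: 0b00000,
        5: 0b01001, 6: 0b10101, 7: 0b10111, 9: 0b10001}


class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # bits this bucket distinguishes
        self.keys = []


class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1  # i: number of bits used by the directory
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        # Take the least significant i bits of the hash address.
        return HASH[key] & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < CAPACITY:
            bucket.keys.append(key)
            return
        if bucket.local_depth == HASH_BITS:
            raise OverflowError("bucket cannot be split any further")
        if bucket.local_depth == self.global_depth:
            # Double the directory: entries i and i + old_size share a bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the full bucket on one more bit and redistribute its keys.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        for i, entry in enumerate(self.directory):
            if entry is bucket and i & new_bit:
                self.directory[i] = sibling
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._index(k)].keys.append(k)
        self.insert(key)  # retry now that there may be room


table = ExtendibleHash()
for key in (1, 2, 3, 4, 5, 6, 7):
    table.insert(key)
depth_before = table.global_depth
table.insert(9)
```

Under these assumptions the directory uses two bits before key 9 arrives and three bits afterwards, with 5 and 9 sharing the 001 bucket and 6 alone in the 101 bucket. Only the overflowing bucket is split; the other buckets are simply referenced by two directory entries each.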
File organization | Description | Advantage | Disadvantage
Sequential File | Records are stored in order based on a designated search key. | Good for range queries and sequential access. | Slow for insertion and updating.
Hash File | Records are distributed among a fixed number of buckets using a hash function. | Very fast for point queries. | Not suitable for range queries.
Parameter | B+ Tree | B-Tree
Structure | Separate leaf nodes for data storage and internal nodes for indexing | Nodes store both keys and data values
Leaf Nodes | Leaf nodes form a linked list for efficient range-based queries | Leaf nodes do not form a linked list
Key Duplication | Typically allows key duplication in leaf nodes | Usually does not allow key duplication
Disk Access | Better disk access due to sequential reads in a linked list structure | More disk I/O due to non-sequential reads in internal nodes
Diagram-II: Using the Pnext pointer, it is possible to traverse all the leaf nodes just like a
linked list, thereby achieving ordered access to the records stored on the disk.
Searching a Record in B+ Trees
Suppose we have to find 58 in the B+ tree. We start at the root node and follow the pointers
down to the leaf node that might contain a record for 58. In the image given
above, 58 lies between 50 and 70, so the search leads to the
third leaf node, and 58 is found there. If the key is not present in that leaf node, a 'record
not found' message is returned.
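The search for 58 can be sketched on a small hand-built tree. The image is not reproduced here, so the node contents below are assumptions: a root with keys 25, 50 and 70 over four leaves, arranged so that 58 sits in the third leaf. The class names are hypothetical, and `next` plays the role of the Pnext pointer.

```python
import bisect


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # Pnext: link to the following leaf


class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # len(children) == len(keys) + 1


def find_leaf(node, key):
    """Descend from the root to the leaf that might contain key."""
    while isinstance(node, Internal):
        # Keys equal to a separator are stored in the right-hand subtree.
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node


def search(root, key):
    return key in find_leaf(root, key).keys


def range_scan(root, low, high):
    """Collect keys in [low, high] by walking the leaf-level linked list."""
    leaf, found = find_leaf(root, low), []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return found
            if k >= low:
                found.append(k)
        leaf = leaf.next
    return found


# Assumed tree contents.
leaves = [Leaf([10, 20]), Leaf([25, 40]), Leaf([50, 58, 65]), Leaf([70, 85])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal([25, 50, 70], leaves)
```

Here `search(root, 58)` reaches the third leaf because 58 falls between the separators 50 and 70, and `range_scan` shows how the leaf links give ordered access without going back to the root.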
Insertion in B+ Trees
Every element in the tree has to be inserted into a leaf node. Therefore, it is necessary to
go to the proper leaf node first.
Insert the key into the leaf node in increasing order if there is no
overflow.
Deletion in B+ Trees
Deletion in B+ trees is not just deletion; it is a combined process of searching,
deletion, and balancing. In the last step of the deletion process, it is mandatory to
rebalance the B+ tree; otherwise, it no longer satisfies the properties of a B+ tree.
Advantages of B+Trees
A B+ tree with 'l' levels can store more entries in its internal nodes than a B-tree
with the same 'l' levels. This significantly improves the
search time for any given key. Having fewer levels and the Pnext pointers
means that the B+ tree is very quick and efficient at accessing records from disk. Data
stored in a B+ tree can be accessed both sequentially and directly.
It takes an equal number of disk accesses to fetch any record, since all the leaf nodes are at
the same level.
B+ trees store redundant search keys, whereas in a B-tree storing search keys repeatedly is not
possible.
Disadvantages of B+Trees
The major drawback of B-tree is the difficulty of traversing the keys sequentially. The B+
tree retains the rapid random access property of the B-tree while also allowing rapid
sequential access.
Application of B+ Trees
Multilevel Indexing
Faster operations on the tree (insertion, deletion, search)
Database indexing
https://www.geeksforgeeks.org/introduction-of-b-tree/