DBMS Unit-5
Indexing in DBMS is a technique that uses data structures to optimize the searching time of a
database query. It gives faster query results and quicker data retrieval from the database,
which improves database performance. Because an index is much smaller than the table it
describes, searching it also takes less space in main memory than scanning the table.
Indexing is used to quickly retrieve particular data from the database. It reduces the number
of disk accesses required to reach a particular piece of data by internally creating an index
table.
An index usually consists of two columns that form a key-value pair: the Search Key and the
Data Reference. The two columns of the index table contain copies of selected columns of the
tabular data of the database.
Here, Search Key contains the copy of the Primary Key or the Candidate Key of the database
table. Generally, we store the selected Primary or Candidate keys in a sorted manner so that
we can reduce the overall query time or search time (from a linear search to a binary search).
Data Reference contains a set of pointers that holds the address of the disk block. The
pointed disk block contains the actual data referred to by the Search Key. Data Reference is
also called Block Pointer because it uses block-based addressing.
Types of Indexes
According to the attributes defined above, we divide indexing into three types:
Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are
known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of which is
10 bytes long. The IDs start with 1, 2, 3 and so on, and we have to search for the employee
with ID 543.
o In the case of a database with no index, we have to search the disk block from the start
until we reach 543. The DBMS will read the record after reading 543*10 = 5430 bytes.
o In the case of an index, we search the index instead, and the DBMS will read the record
after reading 542*2 = 1084 bytes (assuming each index entry is 2 bytes long), which is far
less than in the previous case.
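The arithmetic in this example can be checked with a short Python sketch. The 2-byte index
entry size is an assumption implied by the 542*2 figure, and the function names are
illustrative only.

```python
RECORD_SIZE = 10      # bytes per record, as stated in the example
INDEX_ENTRY_SIZE = 2  # bytes per index entry (assumed from the 542*2 figure)

def bytes_read_without_index(target_id: int) -> int:
    # Sequential scan: every record up to and including the target is read.
    return target_id * RECORD_SIZE

def bytes_read_with_index(target_id: int) -> int:
    # Index scan: the entries that precede the target's entry are read.
    return (target_id - 1) * INDEX_ENTRY_SIZE

print(bytes_read_without_index(543))  # 5430
print(bytes_read_with_index(543))     # 1084
```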
Primary Index
o If the index is created on the basis of the primary key of the table, then it is known as
primary indexing. Primary keys are unique to each record, so there is a 1:1 relation
between index entries and records.
o As primary keys are stored in sorted order, the searching operation is quite
efficient.
o The primary index can be classified into two types: Dense index and Sparse index.
Dense index
o The dense index contains an index record for every search key value in the data file. It
makes searching faster.
o In this, the number of records in the index table is the same as the number of records in
the main table.
o It needs more space to store the index records themselves. Each index record has the search
key and a pointer to the actual record on the disk.
Sparse index
o In the data file, an index record appears only for some of the items. Each index record
points to a block.
o Here, instead of pointing to every record in the main table, the index points to records
in the main table at intervals.
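The difference between the two kinds of primary index can be sketched in Python as follows.
The data file, its keys and the helper names are hypothetical; the sparse lookup assumes one
index entry per block holding that block's first key.

```python
from bisect import bisect_right

# Hypothetical data file: blocks of records sorted by search key.
blocks = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H"), (90, "I")],
]

# Dense index: one entry per search key -> (block number, slot in block).
dense_index = {key: (b, s)
               for b, block in enumerate(blocks)
               for s, (key, _) in enumerate(block)}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [block[0][0] for block in blocks]

def dense_lookup(key):
    pos = dense_index.get(key)
    if pos is None:
        return None
    b, s = pos
    return blocks[b][s][1]

def sparse_lookup(key):
    # Find the last block whose first key is <= key, then scan that block.
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, value in blocks[b]:
        if k == key:
            return value
    return None

print(dense_lookup(50), sparse_lookup(50))  # E E
print(len(dense_index), len(sparse_keys))   # 9 3
```

The dense index holds nine entries for nine records, while the sparse index holds only three,
at the cost of a short scan inside the block.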
Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is
created on non-primary key columns which may not be unique for each record.
o In this case, to identify the record faster, we group two or more columns to get
a unique value and create an index out of them. This method is called a clustering
index.
o Records that have similar characteristics are grouped, and indexes are created
for these groups.
Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows. These
mappings are usually kept in primary memory so that address fetches are fast. The actual data
is then looked up in secondary memory using the address obtained from the mapping. If the
mapping grows too large, fetching the address itself becomes slower, and the sparse index is
no longer efficient. To overcome this problem, secondary indexing is introduced.
In secondary indexing, to reduce the size of the mapping, another level of indexing is
introduced. In this method, large ranges of the column values are selected initially so that
the mapping of the first level stays small. Each range is then further divided into smaller
ranges. The mapping of the first level is stored in primary memory, so that address fetches
are fast. The mapping of the second level and the actual data are stored in secondary memory
(hard disk).
For example:
o If you want to find the record of roll 111 in the diagram, the search first looks for the
highest entry that is smaller than or equal to 111 in the first-level index. It gets 100
at this level.
o Then, in the second-level index, it again looks for the highest entry that is smaller than
or equal to 111 and gets 110. Using the address 110, it goes to the data block and scans
each record until it finds 111.
o This is how a search is performed in this method. Inserting, updating or deleting is
also done in the same manner.
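The two-level lookup described above can be sketched in Python. The index contents below are
hypothetical and chosen only so that the search for roll 111 passes through 100 and 110 as in
the example; `floor_entry` and `find` are illustrative names.

```python
from bisect import bisect_right

# First level (kept in main memory): start key of each second-level range.
first_level = [1, 100, 200]
# Second level (on disk): for each first-level key, the start keys of data blocks.
second_level = {
    1: [1, 50],
    100: [100, 110, 120],
    200: [200, 250],
}
# Data blocks keyed by their start key (only a few are filled in here).
data_blocks = {
    100: [100, 103, 107],
    110: [110, 111, 115],
    120: [120, 125],
}

def floor_entry(sorted_keys, key):
    # Largest entry that is smaller than or equal to key, or None.
    i = bisect_right(sorted_keys, key)
    return sorted_keys[i - 1] if i else None

def find(roll):
    level1 = floor_entry(first_level, roll)
    if level1 is None:
        return None
    level2 = floor_entry(second_level[level1], roll)
    # Scan the data block record by record until the roll number is found.
    for record in data_blocks.get(level2, []):
        if record == roll:
            return level1, level2, record
    return None

print(find(111))  # (100, 110, 111)
```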
Advantages of Indexing
Indexing helps in faster query results and quick data retrieval.
Various methods have been introduced to organize files. Each method has advantages and
disadvantages depending on how the data is accessed or selected. It is therefore up to the
programmer to choose the file organization method best suited to the requirements.
Some types of file organization are:
Sequential File Organization
Heap File Organization
Hash File Organization
B+ Tree File Organization
Clustered File Organization
The sections that follow discuss several of these file organization methods.
The easiest method of file organization is the sequential method. In this method, the records
are stored one after another in a sequential manner. There are two ways to implement this
method:
1. Pile File Method – This method is quite simple: we store the records in a sequence,
i.e. one after the other, in the order in which they are inserted into the tables.
Insertion of a new record –
Let R1, R3, R5 and R4 be four records in the sequence. Here, a record is nothing but a
row in a table. Suppose a new record R2 has to be inserted in the sequence; it is simply
placed at the end of the file.
2. Sorted File Method – In this method, as the name itself suggests, whenever a new record
has to be inserted, it is always inserted in sorted (ascending or descending) order.
Sorting of records may be based on the primary key or any other key.
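A minimal Python sketch of the two insertion behaviours, assuming records can be compared by
their names (R1 < R2 < ...). The function names are illustrative.

```python
from bisect import insort

def pile_insert(file_records, record):
    # Pile file: the new record is simply placed at the end of the file.
    file_records.append(record)

def sorted_insert(file_records, record):
    # Sorted file: the new record goes to its position in key order.
    insort(file_records, record)

pile = ["R1", "R3", "R5", "R4"]
pile_insert(pile, "R2")
print(pile)         # ['R1', 'R3', 'R5', 'R4', 'R2']

sorted_file = ["R1", "R3", "R4", "R5"]
sorted_insert(sorted_file, "R2")
print(sorted_file)  # ['R1', 'R2', 'R3', 'R4', 'R5']
```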
Heap File Organization works with data blocks. In this method, records are inserted at the end
of the file, into the data blocks. No sorting or ordering is required. If a data block is
full, the new record is stored in some other block. This other data block need not be the very
next data block; it can be any block in memory. It is the responsibility of the DBMS to store
and manage the new records.
Hashing
Hashing is a DBMS technique for locating the needed data on disk without using an index
structure. Hashing is used to index and retrieve items in a database because searching for a
specific item using a shorter hashed key is faster than using the original value.
In a large database structure, it can be very expensive to search the index values through
all levels and then reach the target data block to obtain the needed data. Hashing is a
method for calculating the direct location of a data record on the disk without the use of an
index structure.
To generate the actual address of a data record, hash functions containing search keys as
parameters are used.
Properties of Hashing in DBMS
In this technique, data is kept in data blocks whose addresses are produced by the hash
function. The memory locations where these records are stored are called data buckets or data
blocks.
A hash function can produce the address from any column value, but the primary key is most
often used to generate the data block’s address. The hash function can be anything from a
simple to a complex mathematical function. The primary key itself can even be used as the
data block’s address, i.e. each row is stored in the data block whose address is the same as
its primary key.
The data block addresses are the same as the primary key value in the picture above. This
hash function could alternatively be a simple mathematical function, such as exponential,
mod, cos, sin, and so on. Assume we’re using the mod (5) hash function to find the data
block’s address. In this scenario, the primary keys are hashed with the mod (5) function,
yielding 3, 3, 1, 4, and 2, respectively, and records are saved at those data block locations.
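The mod (5) computation can be reproduced with a few lines of Python. The primary keys below
are hypothetical, chosen only because they hash to the addresses 3, 3, 1, 4 and 2 quoted in
the text.

```python
def hash_address(primary_key: int, buckets: int = 5) -> int:
    # mod (5) hash function: the remainder is the data block address.
    return primary_key % buckets

# Hypothetical primary keys that reproduce the addresses in the text.
keys = [98, 103, 101, 104, 102]
print([hash_address(k) for k in keys])  # [3, 3, 1, 4, 2]
```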
Hash Organization
Bucket – A bucket is a type of storage container. Data is stored in bucket format in a hash
file. Typically, a bucket stores one entire disc block, which can then store one or more
records.
Types of Hashing
Static Hashing
Whenever a search-key value is given in static hashing, the hash function always returns the
same address. If the mod-5 hash function is employed, for example, only 5 values (0 to 4) will
be generated, and the output address for a given key will always be the same. The total number
of buckets available remains constant at all times.
Dynamic Hashing
The disadvantage of static hashing is that it doesn’t expand or contract dynamically as the
database size grows or diminishes. Dynamic hashing is a technique that allows data buckets
to be added and removed on the fly. Extendible hashing is another name for dynamic
hashing.
In dynamic hashing, the hash function is designed to output a huge number of values, but
only a few are used at first.
The resultant data bucket address with static hashing will always be the same. That is, if we
use the hash function mod (5) to obtain an address for EMP ID =103, we will always get the
same bucket address 3. The bucket address will not change in this case.
As a result, the total number of data buckets present in the memory remains constant
throughout the process of static hashing. In this case, the memory utilised to hold the data
will include five data buckets.
When a record is needed, the very same hash function is used in order to get the address of
that bucket in which the data is kept.
Insert a Record
When a new record is entered into the table, the hash function is applied to its key to
generate an address for the new record, and the record is placed there.
Delete a Record
To delete a record, we first locate it using the hash function. The record at that address
is then deleted from memory.
Update a Record
To edit a record, we’ll use a hash function to find it first, then change the data record.
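The four operations above can be sketched with a small Python class. This is an illustrative
model rather than a real storage engine: `StaticHashFile` is a hypothetical name, buckets are
modelled as in-memory dictionaries with no capacity limit, and the hash function is assumed to
be mod (5).

```python
class StaticHashFile:
    """Fixed number of buckets; each bucket is a dict of key -> record."""

    def __init__(self, bucket_count=5):
        self.bucket_count = bucket_count
        self.buckets = [dict() for _ in range(bucket_count)]

    def _address(self, key):
        # The same key always hashes to the same bucket address.
        return key % self.bucket_count

    def insert(self, key, record):
        self.buckets[self._address(key)][key] = record

    def search(self, key):
        return self.buckets[self._address(key)].get(key)

    def update(self, key, record):
        bucket = self.buckets[self._address(key)]
        if key not in bucket:
            raise KeyError(key)
        bucket[key] = record

    def delete(self, key):
        # Locate the bucket through the hash function, then remove the record.
        return self.buckets[self._address(key)].pop(key, None)

emp = StaticHashFile()
emp.insert(103, "Asha")
print(emp._address(103), emp.search(103))  # 3 Asha
emp.update(103, "Asha K.")
emp.delete(103)
print(emp.search(103))                     # None
```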
If we wish to add a new record to the file but the data bucket address generated by the hash
function is not empty, i.e. data already exists at that address, we cannot add the record
there. In static hashing this situation is called bucket overflow, and it is a critical
condition for this method.
There are a number of options for dealing with this scenario. The following are some of the
most widely used methods:
Open Hashing
Whenever the hash function generates an address that already contains data, the next available
bucket is assigned to the record. This process is called linear probing.
For instance, suppose R3 is a new record that needs to be inserted and the hash function
generates 112 as R3’s address. That address is already full, so the system selects 113, the
next available data bucket, and assigns R3 to it.
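A minimal sketch of linear probing in Python, assuming one record per bucket and a file that
wraps around at its end. The bucket contents and the base address 110 are hypothetical; only
the addresses 112 and 113 come from the example.

```python
def insert_linear_probing(buckets, home, record):
    """Place record at its home address, or at the next free bucket after it.

    buckets: list where None marks a free bucket (one record per bucket here).
    home: index produced by the hash function.
    """
    n = len(buckets)
    for step in range(n):
        slot = (home + step) % n   # wrap around at the end of the file
        if buckets[slot] is None:
            buckets[slot] = record
            return slot
    raise OverflowError("all buckets are full")

BASE = 110  # hypothetical address of the first bucket
buckets = ["R5", None, "R1", None, "R2"]   # addresses 110..114
slot = insert_linear_probing(buckets, 112 - BASE, "R3")
print(BASE + slot)  # 113
```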
Closed Hashing
When a data bucket is full, a new bucket is allocated for the same hash result and linked
after the old one. This method is called overflow chaining.
For example, suppose R3 is a new record that has to be added to the database and the hash
function assigns it the address 110. This bucket is too full to accommodate the additional
record, so a new bucket is allocated and linked to the end of bucket 110.
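Overflow chaining can be sketched in Python as a linked chain of buckets. The `Bucket` class
and the capacity of two records per bucket are assumptions made for illustration.

```python
class Bucket:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.records = []
        self.overflow = None   # next bucket in the chain, if any

def insert_with_chaining(bucket, record):
    """Insert into the first bucket of the chain that has room.

    Returns the position in the chain (0 = primary bucket) where it landed.
    """
    depth = 0
    while len(bucket.records) >= bucket.capacity:
        if bucket.overflow is None:
            # Bucket is full: allocate a new one and link it after the old one.
            bucket.overflow = Bucket(bucket.capacity)
        bucket = bucket.overflow
        depth += 1
    bucket.records.append(record)
    return depth

bucket_110 = Bucket(capacity=2)
for r in ["R1", "R2"]:
    insert_with_chaining(bucket_110, r)
print(insert_with_chaining(bucket_110, "R3"))  # 1
print(bucket_110.overflow.records)             # ['R3']
```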
The dynamic hashing approach is used to solve problems such as bucket overflow that can occur
with static hashing. In this method, data buckets grow or shrink as the number of records
increases or decreases. This makes hashing dynamic, allowing insertion and deletion without
causing performance issues. This technique is also known as the extendible hashing method.
Searching a Key
o Calculate the key’s hash address first.
o Determine the number of bits used in the directory; this number is referred to as i.
o Take the i least significant bits of the hash address. This gives the directory’s index.
o Using the index, go to the directory and find the bucket address at which the record may
be located.
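Taking the i least significant bits amounts to a bit mask. The sketch below uses the hash
address 10001 that the insertion example further down gives for key 9; `directory_index` is a
hypothetical helper name.

```python
def directory_index(hash_address: int, i: int) -> int:
    # Keep only the i least significant bits of the hash address.
    return hash_address & ((1 << i) - 1)

addr = 0b10001  # hash address 10001
print(format(directory_index(addr, 2), "02b"))  # 01
print(format(directory_index(addr, 3), "03b"))  # 001
```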
Example
Consider the following classification of keys into buckets based on the last bits of their
hash address:
Since the last two bits of 2 and 4 are 00, they are placed in bucket B0. Because the last two
bits of 5 and 6 are 01, they are placed in bucket B1. Since the last two bits of 1 and 3 are
10, they are placed in bucket B2, and as the last two bits of 7 are 11, it is placed in B3.
Inserting a key 9 into the above structure with hash address 10001:
o Key 9 must go into bucket B1 because the last two bits of its hash address 10001 are 01.
However, because bucket B1 is full, it will be split.
o Because the last three bits of 5 and 9 are 001, they go into bucket B1, whereas the last
three bits of 6 are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. Because the last two bits of the directory entries 000 and
100 are both 00, both entries point to B0.
o Keys 1 and 3 are still in B2. Because the last two bits of the directory entries 010 and
110 are both 10, both entries point to B2.
o Key 7 is still in B3. Because the last two bits of the directory entries 111 and 011 are
both 11, both entries point to B3.
Advantages of Dynamic Hashing
o The performance of this method does not degrade as the amount of data in the system
grows. It simply increases the memory size to accommodate the data.
o Memory is well utilised in this method, since it grows and shrinks with the data. No
allocated memory is left unused.
o This method is well suited to dynamic databases in which the data grows and shrinks
frequently.
Disadvantages of Dynamic Hashing
o In this method, as the amount of data grows, the bucket size grows as well. The bucket
address table keeps track of the data addresses, because these addresses change as the
buckets grow and shrink. Maintaining the bucket address table becomes difficult when the
data grows significantly.
o Bucket overflow can also occur in this case, although it may take longer to reach this
state than with static hashing.
The indexed sequential access method, also known as ISAM, is an upgrade to the conventional
sequential file organization method; it can be regarded as an advanced version of it. In this
method, the primary key of each record is stored together with an address, and this address
maps to the address of a data block in memory. This address field works as an index of the
file.
In this method, reading and fetching a record is done using the index of the file. The index
field contains the address of a data record in memory, which can be used to read and fetch
the record quickly.
Advantages of ISAM
Disadvantages of ISAM
B Tree
B Tree is a specialized m-way tree that is widely used for disk access. A node of a B tree of
order m can have at most m-1 keys and m children. One of the main reasons for using a B tree
is its capability to store a large number of keys in a single node, and large key values,
while keeping the height of the tree relatively small.
A B tree of order m has all the properties of an m-way tree. In addition, it has the
following properties.
It is not necessary that all the nodes contain the same number of children, but each node
other than the root and the leaves must have at least m/2 (rounded up) children.
Operations
Searching :
Searching in a B tree is similar to searching in a binary search tree. For example, suppose we
search for the item 49 in the following B tree. The process is as follows:
1. Compare item 49 with the root node 78. Since 49 < 78, move to its left sub-tree.
2. Since 40 < 49 < 56, traverse the right sub-tree of 40.
3. Since 49 > 45, move to the right and compare with 49.
4. Match found; return.
Searching in a B tree depends upon the height of the tree. The search algorithm takes
O(log n) time to search for any element in a B tree.
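The search steps can be sketched in Python. Since the figure is not reproduced here, the tree
below is hypothetical: only the keys 78, 40, 56, 45 and 49 come from the steps above, and
`BTreeNode` and `btree_search` are illustrative names.

```python
class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys in this node
        self.children = children or []    # empty for a leaf

def btree_search(node, key, path=None):
    """Return the keys of every node visited if key is found, else None."""
    path = (path or []) + [node.keys]
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == key:
        return path
    if not node.children:
        return None
    return btree_search(node.children[i], key, path)

# Hypothetical tree consistent with the search steps above.
root = BTreeNode([78], [
    BTreeNode([40, 56], [
        BTreeNode([35]),
        BTreeNode([45, 49]),
        BTreeNode([60, 70]),
    ]),
    BTreeNode([90], [
        BTreeNode([80, 85]),
        BTreeNode([95, 99]),
    ]),
])

print(btree_search(root, 49))  # [[78], [40, 56], [45, 49]]
print(btree_search(root, 50))  # None
```

Each recursive call descends one level, so the number of nodes visited is bounded by the
height of the tree.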
Inserting
Insertions are done at the leaf node level. The following algorithm is used to insert an item
into a B tree.
1. Traverse the B tree to find the appropriate leaf node into which the key can be
inserted.
2. If the leaf node contains fewer than m-1 keys, insert the element in increasing
order.
3. Otherwise, if the leaf node already contains m-1 keys, follow these steps.
o Insert the new element in increasing order of elements.
o Split the node into two nodes at the median.
o Push the median element up to its parent node.
o If the parent node also contains m-1 keys, split it too by following the same
steps.
Example:
Insert the key 8 into the B tree of order 5 shown in the following image.
8 will be inserted to the right of 5, therefore insert 8.
The node now contains 5 keys, which is greater than the maximum of 5 - 1 = 4 keys. Therefore,
split the node at the median, i.e. 8, and push it up to its parent node, shown as follows.
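The leaf-level part of this insertion can be sketched in Python. The leaf contents are
hypothetical (the figure is not reproduced here); they are chosen so that 8 lands to the right
of 5 and becomes the median. Propagating the median into the parent is left out.

```python
from bisect import insort

def insert_into_leaf(keys, key, order):
    """Insert key into a sorted leaf; split at the median if it overflows.

    Returns (left_keys, median, right_keys); median and right_keys are None
    when no split is needed. The median is meant to be pushed up to the parent.
    """
    keys = list(keys)
    insort(keys, key)
    if len(keys) <= order - 1:
        return keys, None, None
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Hypothetical full leaf of an order-5 tree (4 keys) in which 8 lands right of 5.
print(insert_into_leaf([3, 5, 9, 12], 8, order=5))  # ([3, 5], 8, [9, 12])
print(insert_into_leaf([3, 5], 8, order=5))         # ([3, 5, 8], None, None)
```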
Deletion
Deletion is also performed at the leaf nodes. The key to be deleted can be in either a leaf
node or an internal node. The following algorithm is used to delete a key from a B tree.
If the key to be deleted is in an internal node, replace it with its in-order successor or
predecessor. Since the successor or predecessor is always in a leaf node, the process then
continues as if the key were being deleted from the leaf node.
Example 1
Delete the node 53 from the B Tree of order 5 shown in the following figure.
A B tree is used to index data and provides fast access to the actual data stored on disk,
since accessing a value in a large database stored on disk is a very time-consuming
process.
Searching an un-indexed and unsorted database containing n key values needs O(n) running
time in the worst case. However, if we use a B tree to index this database, it can be
searched in O(log n) time in the worst case.
B+ Tree
B+ Tree is an extension of B Tree which allows efficient insertion, deletion and search
operations.
In a B tree, both keys and records can be stored in internal as well as leaf nodes, whereas
in a B+ tree, records (data) can only be stored in the leaf nodes, while internal nodes can
only store key values.
The leaf nodes of a B+ tree are linked together in the form of a singly linked list to make
search queries more efficient.
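One way to see why the linked leaves help is a range query, sketched below in Python. The leaf
contents are hypothetical, and for brevity the scan starts at the leftmost leaf; a real B+
tree would first descend through the internal nodes to the leaf containing the lower bound.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys    # sorted keys stored in this leaf
        self.next = None    # link to the next leaf on the right

def link(leaves):
    # Chain the leaves from left to right and return the first one.
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]

def range_scan(start_leaf, low, high):
    """Collect keys in [low, high] by following the leaf links."""
    result = []
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result

# Hypothetical leaf level of a small B+ tree.
first = link([Leaf([10, 20]), Leaf([30, 40]), Leaf([50, 60])])
print(range_scan(first, 20, 50))  # [20, 30, 40, 50]
```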
B+ trees are used to store large amounts of data that cannot fit in main memory. Because the
size of main memory is always limited, the internal nodes (keys to access records) of the
B+ tree are stored in main memory, whereas the leaf nodes are stored in secondary memory.
The internal nodes of a B+ tree are often called index nodes. A B+ tree of order 3 is shown in
the following figure.
Advantages of B+ Tree
SN | B Tree | B+ Tree
1 | Search keys cannot be stored repeatedly. | Redundant search keys can be present.
2 | Data can be stored in leaf nodes as well as internal nodes. | Data can only be stored in the leaf nodes.
3 | Searching for some data is a slower process, since data can be found in internal nodes as well as in the leaf nodes. | Searching is comparatively faster, as data can only be found in the leaf nodes.
4 | Deletion of internal nodes is complicated and time consuming. | Deletion is never a complex process, since elements are always deleted from the leaf nodes.
5 | Leaf nodes cannot be linked together. | Leaf nodes are linked together to make search operations more efficient.
Insertion in B+ Tree
Step 3: If the index node doesn't have required space, split the node and copy the middle
element to the next index page.
Example :
Insert the value 195 into the B+ tree of order 5 shown in the following figure.
195 will be inserted in the right sub-tree of 120 after 190. Insert it at the desired position.
The node now contains more than the maximum number of elements, i.e. 4; therefore, split it
and move the median up to the parent.
Now the index node contains 6 children and 5 keys, which violates the B+ tree properties;
therefore, we need to split it, shown as follows.
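A leaf split in a B+ tree can be sketched in Python as follows. The sketch assumes the common
convention that the middle key is copied up, so it also stays in the right-hand leaf; the
figure may place keys differently. The leaf contents are hypothetical apart from 190 and 195,
and `bplus_leaf_insert` is an illustrative name.

```python
from bisect import insort

def bplus_leaf_insert(keys, key, max_keys=4):
    """Insert into a B+ tree leaf; on overflow split it and copy the middle key up.

    Returns (left_leaf, separator, right_leaf); separator and right_leaf are
    None when the leaf still fits. Unlike a B tree, the separator also stays
    in the right leaf, because all data lives in the leaves.
    """
    keys = list(keys)
    insort(keys, key)
    if len(keys) <= max_keys:
        return keys, None, None
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid:]

# Hypothetical full leaf; only 190 and 195 come from the example above.
print(bplus_leaf_insert([154, 179, 190, 200], 195))
# ([154, 179], 190, [190, 195, 200])
```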
Deletion in B+ Tree
Step 2: If the leaf node contains fewer than the minimum number of elements, merge the
node with its sibling and delete the key in between them.
Step 3: If the index node contains fewer than the minimum number of elements, merge the node
with its sibling and move down the key in between them.
Example
Delete the key 200 from the B+ tree shown in the following figure.
200 is present in the right sub-tree of 190, after 195. Delete it.
Merge the two nodes by using 195, 190, 154 and 129.
Now, element 120 is the single element present in its node, which violates the B+ tree
properties. Therefore, we need to merge it by using 60, 78, 108 and 120.