DBMS Unit5

UNIT - V

Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and
Secondary Indexes, Index data Structures, Hash Based Indexing, Tree base Indexing,
Comparison of File Organizations, Indexes and Performance Tuning, Intuitions for tree
Indexes, Indexed Sequential Access Methods (ISAM), B+ Trees: A Dynamic Index
Structure

Data on External Storage


A database system presents users with an abstract view of the stored data; physically, however, the data is ultimately stored as bits and bytes on various storage devices.
Types of Data Storage
Different types of storage options are available for storing data, and they differ from one another in speed and accessibility. The following types of storage devices are used for storing data:
Primary Storage - It is the storage area that offers the quickest access to data. Primary storage is also known as volatile storage because it does not retain data permanently: if the system suffers a power cut or a crash, the data is lost.
Main Memory: It holds the data currently being operated on and handles each instruction of the machine. Main memory can store gigabytes of data, but it is generally too small to hold an entire database. Its contents are lost if the system shuts down because of a power failure or other reason.
Cache: It is one of the costliest storage media, but also the fastest. A cache is a tiny storage area usually managed by the computer hardware. When designing data structures, algorithms, and query processors, designers take cache effects into account.
Secondary Storage - Secondary storage is also called online storage. It allows the user to save and store data permanently, and it does not lose data on a power failure or system crash, which is why it is also called non-volatile storage. The following secondary storage media are available in almost every type of computer system:
Flash Memory: Flash memory stores data in devices such as USB (Universal Serial Bus) keys, which plug into the USB slots of a computer system and come in varying capacities. Unlike main memory, flash memory retains its contents through a power cut or other failure. It is commonly used in server systems for caching frequently used data, which improves performance while holding more data than main memory can.
Magnetic Disk Storage: This type of storage is also known as online storage. A magnetic disk stores data for the long term and can hold an entire database. The computer system is responsible for bringing data from disk into main memory for access, and for writing modified data back to disk after any operation on it. A great strength of magnetic disks is that the data survives a system crash or failure; a disk failure, however, can easily ruin or destroy the stored data.
Tertiary Storage - This storage is external to the computer system and has the slowest speed, but it can store a large amount of data. It is also known as offline storage and is generally used for data backup. Common tertiary storage devices include:
Optical Storage: An optical storage can store megabytes or gigabytes of data. A
Compact Disk (CD) can store 700 megabytes of data with a playtime of around 80
minutes. On the other hand, a Digital Video Disk or a DVD can store 4.7 or 8.5 gigabytes
of data on each side of the disk.
Tape Storage: Tape is a cheaper storage medium than disk. Tapes are generally used for archiving or backing up data. Access is slow because data is read sequentially from the start, so tape storage is also known as sequential-access storage. Disk storage, by contrast, is known as direct-access storage because data can be read directly from any location on the disk.

Storage Hierarchy
Besides the above, various other storage devices reside in a computer system. Storage media can be organized by data access speed, cost per unit of data, and reliability, which yields a hierarchy of storage media based on cost and speed.
In this hierarchy, the higher levels (cache, main memory) are fast but expensive. Moving down, the cost per bit decreases while the access time increases. The media from main memory upward are volatile; everything below main memory is non-volatile.
https://www.javatpoint.com/storage-system-in-dbms

File Organization and Indexing


File Organization - A file is a collection of records. Using the primary key, we can access the records. The type and frequency of access are determined by the file organization used for a given set of records.
File organization is a logical relationship among various records; it defines how file records are mapped onto disk blocks. It describes the way records are stored in terms of blocks, and how those blocks are placed on the storage medium.
One approach to mapping a database to files is to use several files, each storing only fixed-length records of a single type. An alternative is to structure the files so that they can hold records of multiple lengths. Files of fixed-length records are easier to implement than files of variable-length records.
Objectives of file organization:
Records can be selected as fast as possible.
Insert, delete, and update transactions on records should be quick and easy.
Duplicate records should not be introduced by an insert, update, or delete.
Records should be stored efficiently, at minimal storage cost.
Types of file organization:
Sequential file organization - This is the easiest file organization method. In this method, files are stored sequentially.
Pile File Method: A quite simple method in which records are stored in sequence, one after another, in the order in which they are inserted into the table. To update or delete a record, the record is searched for in the memory blocks; when found, it is marked for deletion and the new record is inserted.

Insertion of the new record:


Suppose we have records R1, R3, and so on up to R9 and R8 in a sequence; each record is simply a row in the table. If we want to insert a new record R2 into the sequence, it is placed at the end of the file.

Sorted File Method: In this method, a new record is always inserted at the end of the file, and the sequence is then sorted in ascending or descending order, based on the primary key or some other key. When a record is modified, it is updated, the file is sorted, and the updated record ends up in the right place.

Insertion of the new record:


Suppose there is an existing sorted sequence of records R1, R3, and so on up to R6 and R7. If a new record R2 has to be inserted into the sequence, it is appended at the end of the file and the sequence is then re-sorted.

Pros of sequential file organization


It is fast and efficient for huge amounts of data.
Files can be stored on cheaper storage mechanisms such as magnetic tape.
The design is simple, and storing data requires little effort.
This method suits workloads where most records must be accessed, such as computing student grades or generating salary slips.
It is also used for report generation and statistical calculations.
Cons of sequential file organization
It wastes time: we cannot jump directly to a particular record but must move through the file sequentially.
The sorted file method takes extra time and space to sort the records.
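The sorted file method described above can be sketched in a few lines. This is a minimal, hypothetical illustration (record layout and key choice are assumptions, not part of the original notes): the new record is appended at the end of the file, then the whole sequence is re-sorted on the key.

```python
# Hypothetical sketch of the sorted file method: append the new record at
# the end of the file, then re-sort the sequence on its key (here, field 0).
def insert_sorted(file_records, record, key=lambda r: r[0]):
    file_records.append(record)   # new record goes to the end of the file
    file_records.sort(key=key)    # then the sequence is sorted again
    return file_records

records = [(1, "R1"), (3, "R3"), (6, "R6"), (7, "R7")]
insert_sorted(records, (2, "R2"))
# records is now ordered R1, R2, R3, R6, R7
```

The re-sort after every insert is exactly why the notes list extra time and space as a drawback of this method.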
Heap file organization - This is the simplest and most basic type of organization. It works with data blocks. In heap file organization, records are inserted at the end of the file, without any sorting or ordering. When a data block is full, the new record is stored in some other block, which need not be the very next data block: the DBMS can choose any data block in memory to store new records. The heap file is also known as an unordered file. Every record in the file has a unique id, and every page in the file is the same size. It is the DBMS's responsibility to store and manage the new records.

Insertion of a new record


Suppose we have five records R1, R3, R6, R4, and R5 in a heap, and we want to insert a new record R2. If data block 3 is full, R2 is inserted into any data block selected by the DBMS, say data block 1.
To search, update, or delete data in heap file organization, we must traverse the file from the start until we reach the requested record, checking every record along the way, because there is no sorting or ordering. For a very large database, this makes searching, updating, and deleting records time-consuming.
Pros of Heap file organization
It is a very good file organization for bulk insertion: if a large amount of data must be loaded into the database at once, this method is best suited.
For a small database, fetching and retrieving records is faster than in sequential organization.
Cons of Heap file organization
It is inefficient for large databases, because searching or modifying a record requires scanning the whole file.
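The heap behaviour above can be sketched as follows. This is an illustrative toy (block capacity, record shape, and the "first block with space" placement rule are assumptions; a real DBMS may pick any block with free space, as the notes say):

```python
# Toy heap file: records go into the first block with free space, or a new
# block is allocated; search must scan every block, since nothing is ordered.
BLOCK_SIZE = 2  # assumed capacity per data block

def heap_insert(blocks, record):
    for block in blocks:
        if len(block) < BLOCK_SIZE:
            block.append(record)
            return
    blocks.append([record])            # all blocks full: allocate a new one

def heap_search(blocks, key):
    for block in blocks:               # linear scan: no ordering to exploit
        for rec in block:
            if rec[0] == key:
                return rec
    return None

blocks = [[(1, "R1"), (3, "R3")], [(6, "R6"), (4, "R4")], [(5, "R5")]]
heap_insert(blocks, (2, "R2"))         # lands in the third block, which has room
```

Insertion is cheap, but `heap_search` touches every record in the worst case — the exact trade-off the pros and cons above describe.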
Hash file organization - Hash file organization computes a hash function on some fields of each record; the hash function's output determines the disk block where the record is placed.

When a record has to be retrieved using the hash key columns, the address is generated and the whole record is fetched from that address. In the same way, when a new record has to be inserted, its address is generated from the hash key and the record is stored there directly. The same process applies to delete and update.

With this method there is no need to search or sort the entire file; records are effectively scattered across memory by the hash function.

B+ tree file organization - B+ tree file organization is an advanced form of the indexed sequential access method. It uses a tree-like structure to store records in a file, with the same key-index concept: the primary key is used to sort the records, and for each primary key an index value is generated and mapped to the record. A B+ tree is similar to a binary search tree (BST), but a node can have more than two children. In this method, all records are stored only at the leaf nodes; intermediate nodes act as pointers to the leaf nodes and do not contain any records.

Consider, for example, a B+ tree in which:

The root node holds the value 25.
There is an intermediate layer of nodes that do not store actual records; they hold only pointers to the leaf nodes.
Nodes to the left of the root hold values smaller than the root, and nodes to the right hold larger values; say 15 and 30 respectively.
The leaf level holds the actual values, e.g., 10, 12, 17, 20, 24, 27, and 29.
Because all leaf nodes are at the same depth (the tree is balanced), searching for any record follows a single root-to-leaf path and is easy and fast.
Pros of B+ tree file organization
Searching is very easy because all records are stored only in the leaf nodes, which are linked together in a sorted sequential list.
Traversing the tree structure is easy and fast.
There is no restriction on the size of a B+ tree: the number of records can grow or shrink, and the B+ tree structure grows or shrinks with it.
It is a balanced tree structure, so inserts, updates, and deletes do not degrade the performance of the tree.
Cons of B+ tree file organization
This method carries unnecessary overhead for static tables whose data rarely changes.
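The key property named above — all values living in a sorted, linked chain of leaf nodes — is what makes range scans cheap. A minimal sketch (the leaf contents are taken from the example values above; the hard-coded list-of-lists stands in for real linked leaf pages):

```python
# Toy model of a B+ tree's leaf level: leaves hold all values in sorted
# order and are chained together, so a range scan just walks the chain.
leaves = [[10, 12], [17, 20], [24, 27, 29]]   # linked leaf pages (simplified)

def range_scan(leaves, lo, hi):
    out = []
    for leaf in leaves:          # follow the leaf chain left to right
        for v in leaf:
            if lo <= v <= hi:
                out.append(v)    # values arrive already sorted
    return out
```

A real B+ tree would first descend from the root to the leaf containing `lo` and then follow sibling pointers; the scan itself is the same idea.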
Indexed sequential access method (ISAM) - ISAM is an advanced form of sequential file organization. In this method, records are stored in the file using the primary key: an index value is generated for each primary key and mapped to the record, and this index entry contains the address of the record in the file.

When a record has to be retrieved by its index value, the address of its data block is fetched from the index and the record is read from memory.
Pros of ISAM:
Because each index entry holds the address of its record's data block, searching a record even in a huge database is quick and easy.
This method supports range retrieval and partial retrieval of records. Since the index is based on primary key values, we can retrieve the data for a given range of values. In the same way, partial values can be searched easily; for example, student names starting with 'JA' can be found quickly.
Cons of ISAM
This method requires extra disk space to store the index values. When new records are inserted, the files have to be reconstructed to maintain the sequence.
When a record is deleted, the space it used must be released; otherwise, the performance of the database slows down.
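An ISAM-style lookup — sorted index entries mapping each primary key to a data block address — can be sketched as below. The keys and block addresses are invented for illustration; a real ISAM index would live on disk alongside the data file.

```python
# Hedged sketch of an ISAM lookup: a sorted index maps each primary key to
# the address of its data block; binary search finds the entry in O(log n).
import bisect

index_keys  = [101, 205, 309, 412]        # sorted primary keys (illustrative)
block_addrs = ["b0", "b1", "b2", "b3"]    # data-block address for each key

def isam_lookup(key):
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return block_addrs[i]             # fetch the record from this block
    return None
```

Because the index is sorted, the same `bisect` step also supports the range retrieval the pros mention: find the first key >= the range start, then walk forward.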
Cluster file organization - When records from two or more tables are stored in the same file, they are known as clusters. Such files hold two or more tables in the same data block, and the key attributes used to map the tables together are stored only once. This reduces the cost of searching for related records across different files. Cluster file organization is used when tables are frequently joined on the same condition and the join yields only a few records from each table, for example retrieving the records of one particular department. It is not suited to retrieving the records of every department at once.

In this method, we can directly insert, update, or delete any record. Data is sorted on the key used for searching; the cluster key is the key on which the tables are joined.
Types of Cluster file organization:
Indexed Clusters: In an indexed cluster, records are grouped on the cluster key and stored together. For example, EMPLOYEE and DEPARTMENT records can be clustered on the key DEP_ID, so that all records with the same DEP_ID are stored together.
Hash Clusters: Similar to an indexed cluster, except that instead of storing records by the cluster key directly, we compute a hash of the cluster key and store together all records with the same hash value.
Pros of Cluster file organization
Cluster file organization suits workloads with frequent join requests on the same joining condition.
It gives efficient results when there is a 1:M mapping between the tables.
Cons of Cluster file organization
Performance is low for very large databases.
If the joining condition changes, this method cannot be used: traversing the file under a different condition takes a lot of time.
It is not suitable for tables with a 1:1 relationship.
https://www.javatpoint.com/dbms-file-organization
https://www.javatpoint.com/dbms-sequential-file-organization
https://www.javatpoint.com/dbms-heap-file-organization
https://www.javatpoint.com/dbms-hash-file-organization
https://www.javatpoint.com/dbms-b-plus-file-organization
https://www.javatpoint.com/dbms-indexed-sequential-access-method
https://www.javatpoint.com/dbms-cluster-file-organization

Cluster Indexes
Primary and Secondary Indexes
Indexing is used to optimize the performance of a database by minimizing the number of disk accesses required to process a query.
An index is a data structure used to locate and access the data in a database table quickly.
Index structure: Indexes can be created on one or more database columns.

The first column of the index is the search key, containing a copy of the primary key or a candidate key of the table. These key values are stored in sorted order so that the corresponding data can be accessed easily.
The second column of the index is the data reference: a set of pointers holding the addresses of the disk blocks where the value of each key can be found.
Indexing Methods
Ordered indices - Indices are usually kept sorted to make searching faster; sorted indices are known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each 10 bytes long, with IDs starting at 1, 2, 3, and so on, and we must find the employee with ID 543. With no index, we must scan the disk blocks from the start: the DBMS reads 543 * 10 = 5430 bytes before finding the record. With an index (say, 2 bytes per entry), the DBMS reads only 542 * 2 = 1084 bytes of index before reaching the record, far less than in the previous case.
Primary Index - An index built on the primary key of the table is known as a primary index. Primary keys are unique to each record, giving a 1:1 relation between index entries and records.
Because primary keys are stored in sorted order, the searching operation is quite efficient.
The primary index can be classified into two types: dense index and sparse index.
Dense index - A dense index contains an index record for every search-key value in the data file, which makes searching faster. The number of records in the index is the same as the number of records in the main table, so it needs more space to store the index itself. Each index record holds the search key and a pointer to the actual record on disk.

Sparse index - Index records appear only for some of the search-key values; each entry points to a block. Instead of pointing to every record in the main table, the index points to records in the main table at intervals.
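The dense/sparse contrast can be made concrete with a small sketch. The data, block size, and keys here are invented for illustration; the point is that the dense index has one entry per record while the sparse index has one entry per block, traded against an extra scan inside the block.

```python
# Dense vs. sparse index over a sorted data file (illustrative values).
data   = [(k, f"rec{k}") for k in [10, 20, 30, 40, 50, 60]]
blocks = [data[0:2], data[2:4], data[4:6]]        # 2 records per block

dense  = {k: r for k, r in data}                  # one entry per record
sparse = [(blk[0][0], blk) for blk in blocks]     # one entry per block (anchor key)

def sparse_lookup(key):
    target = None
    for anchor, blk in sparse:                    # find last anchor <= key
        if anchor <= key:
            target = blk
    for k, r in target:                           # then scan inside the block
        if k == key:
            return r
```

The dense index answers in one probe at the cost of a large index; the sparse index stays small but pays with the in-block scan.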

Clustering Index
A clustered index is defined over an ordered data file. Sometimes the index is created on non-primary-key columns, which may not be unique for each record. In that case, to identify records faster, we group two or more columns together to obtain a unique value and build the index on them; this is called a clustering index. Records with similar characteristics are grouped together, and indexes are created for these groups.
Example: Suppose a company has several employees in each department. With a clustering index, all employees belonging to the same Dept_ID are kept within a single cluster, and index pointers point to the cluster as a whole; here Dept_ID is a non-unique key.

Secondary Index
With sparse indexing, as the table grows, the mapping also grows. These mappings are usually kept in primary memory so that address lookup is fast; the secondary memory is then searched for the actual data using the address obtained from the mapping. If the mapping itself grows too large, fetching the address becomes slow, and the sparse index is no longer efficient. To overcome this problem, secondary indexing introduces another level of indexing to reduce the size of the mapping: a wide range of column values is chosen for the first level so that its mapping stays small, and each range is then divided into smaller sub-ranges at the second level. The first-level mapping is stored in primary memory for fast address lookup; the second-level mapping and the actual data are stored in secondary memory (hard disk).
For example:
To find the record with roll number 111, we first search the first-level index for the highest entry that is smaller than or equal to 111, obtaining 100. In the second-level index we again take the largest entry <= 111 and get 110. Using address 110, we go to the data block and scan each record until we reach 111. This is how a search is performed in this method; inserting, updating, and deleting are done in the same manner.
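The two-level walk just described can be sketched directly. The index contents below mirror the roll-111 example (first level entry 100, second level entry 110, then a block scan); the page and block names are invented labels.

```python
# Illustrative two-level index search for roll 111: largest first-level
# entry <= key, then largest second-level entry <= key, then scan the block.
first_level  = {1: "L2a", 100: "L2b"}               # key -> second-level page
second_level = {"L2b": {100: "D1", 110: "D2"}}      # key -> data-block address
data_blocks  = {"D2": [110, 111, 112]}              # block contents

def lookup(key):
    a1 = max(k for k in first_level if k <= key)    # first level: 100
    page = second_level[first_level[a1]]
    a2 = max(k for k in page if k <= key)           # second level: 110
    return key in data_blocks[page[a2]]             # scan the data block
```

Only the small first-level dictionary needs to stay in primary memory; the second level and the data blocks can live on disk, which is the whole point of the scheme.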
Primary Index vs. Secondary Index

Definition: A primary index is an index on a set of fields that includes the unique primary key and is guaranteed not to contain duplicates. A secondary index is an index that is not a primary index and may have duplicates.
Order: A primary index requires the rows in data blocks to be ordered on the index key. A secondary index does not require the rows to be ordered on the index key.
Number of indexes: There is only one primary index, but there can be multiple secondary indexes.
Duplicates: There are no duplicates in a primary index; there can be duplicates in a secondary index.
https://www.javatpoint.com/indexing-in-dbms
https://pediaa.com/what-is-the-difference-between-primary-and-secondary-index/

Index data Structures


In a database management system (DBMS), an index is a data structure that improves the speed of data retrieval operations on a database table. Indexes provide a way to quickly locate rows based on the values of one or more columns. Without indexes, the DBMS would need to scan the entire table to find the desired data, which can be very inefficient for large tables. Indexes work much like the index in a book: they provide a way to look up information more quickly. When you create an index on a column or a set of columns, the DBMS creates a separate data structure that stores the indexed column's values along with references to the corresponding rows in the table. This allows the DBMS to perform lookups and queries more efficiently.

There are several types of index data structures commonly used in a DBMS:
B-Tree Index: The most common type of index in relational database systems. B-Trees are balanced trees that store data in sorted order, allowing efficient range queries, point queries, and insertions. They are well suited to disk-based storage.
Hash Index: Hash indexes use a hash function to map index keys to locations in the index. They are effective for point queries but less efficient for range queries or ordered retrieval.
Bitmap Index: Bitmap indexes keep a bitmap for each unique value in the indexed column, recording whether each row contains that value. They are useful for low-cardinality columns (columns with few distinct values) and are efficient for certain queries such as boolean operations.
Sparse Index: Sparse indexes are used in databases with many null values. Instead of indexing every row, they index only non-null values, which reduces the size of the index and speeds up lookups.
Clustered Index: In databases like SQL Server and InnoDB in MySQL, the clustered index
determines the physical order of data rows in the table. A table can have only one clustered
index, and it greatly affects the way data is stored on disk.
Non-Clustered Index: Non-clustered indexes are separate structures from the actual table data.
They include the indexed column's values and a reference to the corresponding rows in the table.
A table can have multiple non-clustered indexes.
Covering Index: A covering index includes all the columns needed for a query, so the DBMS
doesn't need to access the actual table to retrieve the required data. This can significantly
improve query performance.
Indexing involves a trade-off between query performance and the overhead of maintaining the
index during data modifications (inserts, updates, and deletes). While indexes can greatly speed
up read operations, they can slightly slow down write operations due to the additional index
maintenance overhead.
Hash Based Indexing
In a huge database, searching through all the index values to reach the desired data is very inefficient. The hashing technique calculates the location of a data record on disk directly, without using an index structure.
In this technique, data is stored in data blocks whose addresses are generated by a hash function. The memory locations where these records are stored are known as data buckets or data blocks.
The hash function can use any column value to generate the address; most of the time, it uses the primary key. A hash function can be anything from a simple to a complex mathematical function. We can even use the primary key itself as the address of the data block, so that each row is stored at the block whose address equals its primary key.

For instance, data block addresses may simply equal the primary key values, or the hash function can be a mathematical function such as mod, exponential, cos, or sin. Suppose we use a mod(5) hash function to determine the address of a data block: applying mod(5) to a set of primary keys might generate the addresses 3, 3, 1, 4, and 2, and the records are stored at those data block addresses.
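The mod(5) example above amounts to a one-line hash function. A minimal sketch (the sample EMP_IDs are assumptions; only 103 appears in the notes, in the static hashing section below):

```python
# Mod-based hash function assigning records to 5 data buckets.
def bucket_for(primary_key, n_buckets=5):
    return primary_key % n_buckets   # data-block address = key mod bucket count

# EMP_ID 103 always maps to bucket 3: 103 % 5 == 3
```

Note that different keys can collide on the same bucket (e.g., 103 and 108 both map to 3) — this is exactly the bucket overflow problem discussed under static hashing.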
Types of Hashing:
Static Hashing - In static hashing, the resultant data bucket address is always the same: if we generate an address for EMP_ID = 103 using the hash function mod(5), the result is always bucket address 3. The bucket address never changes.
Hence, in static hashing, the number of data buckets in memory remains constant throughout; in this example, five data buckets in memory are used to store the data.

Operations of Static Hashing


Searching a record - When a record needs to be searched, the same hash function retrieves the address of the bucket where the data is stored.
Insert a Record - When a new record is inserted into the table, an address is generated for it from the hash key, and the record is stored at that location.
Delete a Record - To delete a record, we first fetch the record to be deleted, then delete the record at that address in memory.
Update a Record - To update a record, we first find it using the hash function, then update the data record.
If we want to insert a new record but the data bucket address generated by the hash function is not empty (data already exists at that address), we have what is known in static hashing as bucket overflow. This is a critical situation in this method, and there are various ways to overcome it. Some commonly used methods are as follows:
Open Hashing - When the hash function generates an address at which data is already stored, the next bucket is allocated to the record. This mechanism is called linear probing. For example, suppose a new record R3 needs to be inserted and the hash function generates address 112 for it, but that bucket is already full. The system then searches for the next available data bucket, 113, and assigns R3 to it.
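Linear probing can be sketched in a few lines. This is a toy with a 5-slot table and wrap-around (both assumptions for illustration; the 112/113 example above is the same idea with real block addresses):

```python
# Minimal linear-probing sketch: if the home bucket is taken, try the next
# bucket (wrapping around the table) until a free one is found.
def probe_insert(buckets, addr, record):
    n = len(buckets)
    for step in range(n):
        slot = (addr + step) % n
        if buckets[slot] is None:
            buckets[slot] = record
            return slot                 # bucket actually used
    raise RuntimeError("hash table full")

buckets = [None] * 5
probe_insert(buckets, 2, "R1")          # home bucket 2 is free
probe_insert(buckets, 2, "R3")          # bucket 2 full -> lands in bucket 3
```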

Close Hashing - When a bucket is full, a new data bucket is allocated for the same hash result and linked after the previous one. This mechanism is known as overflow chaining. For example, suppose a new record R3 needs to be inserted into the table and the hash function generates address 110 for it, but that bucket is too full to store the new data. A new bucket is then allocated at the end of bucket 110's chain and linked to it.
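Overflow chaining can be sketched like this. The bucket capacity of 2 and the list-of-lists chain are illustrative stand-ins for linked disk buckets:

```python
# Overflow-chaining sketch: each hash address holds a chain of buckets;
# when the last bucket in the chain is full, a new bucket is linked after it.
BUCKET_CAP = 2  # assumed records per bucket

def chain_insert(table, addr, record):
    chain = table.setdefault(addr, [[]])   # chain of buckets at this address
    if len(chain[-1]) >= BUCKET_CAP:
        chain.append([])                   # link a new overflow bucket
    chain[-1].append(record)

table = {}
for r in ("R1", "R2", "R3"):
    chain_insert(table, 110, r)            # R3 lands in the overflow bucket
```

Unlike linear probing, records with the same hash address always stay on the same chain, at the cost of following chain links during lookup.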

Dynamic Hashing - The dynamic hashing method is used to overcome the problems of static
hashing like bucket overflow.
In this method, data buckets grow or shrink as the records increases or decreases. This method is
also known as Extendable hashing method.
This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting in poor
performance.
How to search a key
First, calculate the hash address of the key.
Check how many bits the directory uses; call this number i.
Take the least significant i bits of the hash address; this gives an index into the directory.
Using that index, go to the directory and find the address of the bucket where the record might be.
How to insert a new record
First, follow the same procedure as retrieval, ending up at some bucket. If there is still space in that bucket, place the record in it.
If the bucket is full, split the bucket and redistribute the records.
For example:
Consider the following grouping of keys into buckets, based on the last bits of their hash addresses:

Suppose the last two bits of the hash addresses of keys 2 and 4 are 00, so they go into bucket B0. The last two bits for keys 5 and 6 are 01, so they go into bucket B1. The last two bits for keys 1 and 3 are 10, so they go into bucket B2. The last two bits for key 7 are 11, so it goes into B3.

Insert key 9 with hash address 10001 into the above structure:
Since key 9 has hash address 10001, whose last two bits are 01, it must go into bucket B1. But B1 is full, so it is split.
The split separates 5 and 9 from 6: the last three bits of the hash addresses of 5 and 9 are 001, so they go into bucket B1, while the last three bits for 6 are 101, so it goes into a new bucket B5.
Keys 2 and 4 remain in B0; the directory entries 000 and 100 point to B0 because the last two bits of both entries are 00.
Keys 1 and 3 remain in B2; the directory entries 010 and 110 point to B2 because the last two bits of both entries are 10.
Key 7 remains in B3; the directory entries 111 and 011 point to B3 because the last two bits of both entries are 11.
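The directory lookup at the heart of this example — take the least significant i bits of the hash address and use them to index the directory — can be sketched as follows. The directory contents are invented labels; key 9's hash address 10001 comes from the example above.

```python
# Extendable-hashing directory lookup: the last i bits of the hash address
# index a directory that points to buckets (bucket names are illustrative).
def last_bits(hash_addr, i):
    return hash_addr & ((1 << i) - 1)    # least significant i bits

directory = {0b00: "B0", 0b01: "B1", 0b10: "B2", 0b11: "B3"}  # i = 2

# Key 9 has hash address 10001; its last two bits are 01, so it maps to B1 —
# the bucket that the worked example then has to split.
```

When a bucket splits, the directory doubles (i grows by one bit) and entries are repointed, which is how the structure grows without rehashing everything.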

Advantages of dynamic hashing


Performance does not degrade as the data in the system grows; the memory simply grows to accommodate the data.
Memory is well utilized, since it grows and shrinks with the data; there is no unused memory lying around.
This method is good for dynamic databases where data grows and shrinks frequently.
Disadvantages of dynamic hashing
As the data size increases, so does the number of buckets, and their addresses must be maintained in the bucket address table, because bucket addresses keep changing as buckets grow and shrink. With a huge increase in data, maintaining the bucket address table becomes tedious.
Bucket overflow can still occur, although it takes longer to reach that situation than in static hashing.
https://www.javatpoint.com/dbms-hashing
https://www.javatpoint.com/dbms-static-hashing
https://www.javatpoint.com/dbms-dynamic-hashing

Tree base Indexing


Tree-based indexing is a common approach used in database management systems to efficiently
organize and retrieve data from a database table. It involves creating a hierarchical data structure
(usually a tree) that allows for quick access to rows based on the values of indexed columns. The
two most prevalent types of tree-based indexing are B-Tree (Balanced Tree) and B+Tree
(Balanced Plus Tree).
B-Tree (Balanced Tree): B-Trees are self-balancing tree structures that maintain sorted
data. They are commonly used in file systems and databases to index data.
Each node in a B-Tree can hold multiple keys and pointers to child nodes. The tree is kept
balanced by redistributing keys when nodes become too full or too empty. B-Trees are
designed for disk-based storage systems and are suitable for both point queries and range
queries.
They are used for both clustered and non-clustered indexes in various database systems.
B+Tree (Balanced Plus Tree): B+Trees are an extension of B-Trees and are widely used
for indexing in most modern database systems.
Like B-Trees, B+Trees are self-balancing and maintain sorted data.
In a B+Tree, keys are stored in the leaf nodes, and leaf nodes are linked together in a linked list.
Non-leaf nodes in a B+Tree only contain pointers to child nodes, making the tree more
compact. B+Trees are optimized for disk-based storage systems and work well with sequential
I/O operations.
They are particularly suitable for range queries due to the linked list structure of leaf nodes,
which enables efficient range scans.
Both B-Trees and B+Trees have logarithmic height, meaning the number of levels in the tree
grows slowly as the number of elements (rows) increases. This property ensures efficient lookup
times, as the number of nodes to traverse remains manageable even for large datasets. Tree-
based indexes improve query performance by reducing the number of disk accesses required to
locate specific rows. When a query involves filtering or searching based on indexed columns,
the DBMS can use the index to navigate the tree structure and quickly locate the desired rows.
This significantly speeds up data retrieval compared to a full table scan.
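The logarithmic-height property can be made concrete with a small back-of-the-envelope calculation. The fan-out of 200 and the entry counts below are made-up illustrative figures, not values from the text.

```python
import math

# Intuition in numbers: a tree index with fan-out f over n entries has roughly
# ceil(log_f(n)) levels, so a lookup touches only that many index pages.
def index_height(n_entries, fanout):
    return max(1, math.ceil(math.log(n_entries, fanout)))

print(index_height(1_000_000, 200))       # 3 levels for a million entries
print(index_height(1_000_000_000, 200))   # still only 4 levels for a billion
```

Growing the data a thousandfold adds only one level, which is why tree lookups stay fast even for very large tables.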

Comparison of File Organizations

Heap File: Records are placed randomly within the file, in no specific order.
Advantages: simple insertion; no need for sorting. Disadvantages: slow for retrieval
and range queries.

Sequential File: Records are stored in order based on a designated search key.
Advantages: good for range queries and sequential access. Disadvantages: slow for
insertion and updating.

Indexed Sequential File: Similar to a sequential file, but includes an index for
quicker access. Advantages: improved retrieval with the index. Disadvantages: slower
insertion due to index maintenance.

B-Tree File: Records are organized using a B-Tree data structure. Advantages:
efficient for point and range queries. Disadvantages: overhead of index maintenance.

B+Tree File: An extension of the B-Tree file, optimized for disk-based systems.
Advantages: efficient for range queries. Disadvantages: overhead of index maintenance.

Hash File: Records are distributed among a fixed number of buckets using a hash
function. Advantages: very fast for point queries. Disadvantages: not suitable for
range queries.

Clustered File: Records are physically stored in the same order as a specified
clustering key. Advantages: efficient for specific queries. Disadvantages: may require
reorganization for new queries.

Partitioned File: Data is divided into partitions based on a range of values in a
partitioning key. Advantages: parallel processing for queries. Disadvantages: uneven
distribution may lead to imbalance.
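The trade-offs in the comparison above can be sketched as rough per-operation I/O costs, measured in page reads, for a file of B pages. The formulas below follow the usual textbook cost model and are deliberate simplifications (uniform data, no caching; "range" counts only the search step, not the matching pages themselves).

```python
import math

# Approximate I/O cost, in page reads, of three basic file organizations.
def io_costs(B):
    return {
        "heap":   {"scan": B, "equality": B / 2,        "range": B},
        "sorted": {"scan": B, "equality": math.log2(B), "range": math.log2(B)},
        "hashed": {"scan": B, "equality": 1,            "range": B},  # a hash file cannot narrow a range
    }

c = io_costs(1024)
print(c["heap"]["equality"], c["sorted"]["equality"], c["hashed"]["equality"])  # 512.0 10.0 1
```

Even at a modest 1024 pages, an equality search costs hundreds of reads on a heap file, about ten on a sorted file, and roughly one on a hash file, matching the qualitative entries in the table.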

Indexes and Performance Tuning


Indexes and performance tuning are crucial aspects of database management systems (DBMS) to
optimize query execution and overall system performance.
Indexes - Indexes are data structures that accelerate data retrieval by providing a quick way to
locate rows in a database table. They allow the DBMS to locate rows based on the values in one
or more indexed columns.
Here's how indexes impact performance:
Faster Data Retrieval
Reduced I/O
Query Optimization
Trade-offs
Performance Tuning - Performance tuning involves optimizing the database and queries to
achieve better overall system performance.
Key strategies:
Indexing Strategy: Choose the right columns to index based on the types of queries your
application performs most frequently. Over-indexing can lead to increased overhead, while
under-indexing can result in slow query performance.
Query Optimization: Craft efficient SQL queries by using appropriate joins,
aggregations, and filtering conditions. Analyze query execution plans to identify performance
bottlenecks.
Normalization and Denormalization: Properly normalize your database to minimize
redundancy and data anomalies. However, consider denormalization for frequently queried tables
to reduce join operations.
Caching: Implement caching mechanisms to store frequently accessed data in memory,
reducing the need for repeated disk reads.
Partitioning: For large tables, consider partitioning the data into smaller, manageable
chunks. This improves both query and maintenance performance.
Hardware Optimization: Configure the database server's hardware parameters, such as
memory allocation and CPU usage, to match the workload and maximize performance.
Query and Index Statistics: Regularly update statistics on tables and indexes. The
DBMS uses these statistics to make informed decisions about query execution plans.
Monitoring and Profiling: Continuously monitor the database's performance using tools
and collect metrics. Identify and address performance issues as they arise.
Database Maintenance: Perform routine maintenance tasks like index rebuilding, data
purging, and database backups to ensure optimal performance.
Connection Pooling: Use connection pooling to efficiently manage database connections
and reduce the overhead of opening and closing connections.
Parallelism and Concurrency: Utilize parallel processing and proper concurrency
control mechanisms to make efficient use of multi-core processors and allow multiple
users to access the database concurrently.
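As a small, concrete illustration of the indexing strategy above, the following sketch uses the sqlite3 module from Python's standard library. The table, column, and index names are invented for the demo; other DBMSs expose similar plan inspection through their own EXPLAIN commands.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, dept TEXT, salary REAL)")
con.executemany("INSERT INTO emp (dept, salary) VALUES (?, ?)",
                [("d%d" % (i % 50), 1000.0 + i) for i in range(5000)])

# Without an index, the equality filter below forces a full table scan.
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM emp WHERE dept = 'd7'").fetchall()
print(plan)   # plan detail shows a scan of emp

# After indexing the filtered column, the optimizer can use the index instead.
con.execute("CREATE INDEX idx_emp_dept ON emp(dept)")
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM emp WHERE dept = 'd7'").fetchall()
print(plan)   # plan detail shows a search using idx_emp_dept
```

The before/after plans make the trade-off visible: the index turns a scan of every page into a search that touches only the matching entries, at the cost of maintaining idx_emp_dept on every insert and update.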
Intuitions for tree Indexes
Hierarchy and Organization
Balancing and Logarithmic Height
Efficient Data Retrieval
Application and Performance

Indexed Sequential Access Methods (ISAM)


Indexed Sequential Access Method (ISAM) is a data access method used in computer systems
for managing and accessing data stored on disk or other storage devices. It combines the features
of sequential access and random access methods to provide efficient retrieval and update
operations.
ISAM organizes data into fixed-length blocks or records and stores them in sequential order on
the storage device. Each block is assigned a unique identifier called a record number or block
number. Additionally, ISAM maintains an index structure that allows for direct access to specific
records based on key values.
The index structure typically consists of an index file or index table, which contains key values
and corresponding pointers to the physical location of the records. The index can be organized in
various ways, such as a B-tree or a hash table, depending on the specific implementation. To
access a record using ISAM, the system performs a search operation on the index to locate the
appropriate block or blocks containing the desired record. Once the block is located, sequential
access is used within the block to retrieve or update the desired record.
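The lookup procedure just described — a search on the index to locate the right block, then a sequential scan inside that block — can be sketched in Python. The block size of four and the (key, payload) record layout are assumptions made for the demo, not part of any particular ISAM implementation.

```python
import bisect

BLOCK_SIZE = 4   # assumed number of fixed-length records per block

def build_isam(records):
    # Sort the (key, payload) records and cut them into fixed-size blocks;
    # the sparse index stores the first key of each block.
    records = sorted(records)
    blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]
    index = [blk[0][0] for blk in blocks]
    return index, blocks

def isam_lookup(index, blocks, key):
    # Binary-search the index for the last block whose first key is <= key,
    # then scan sequentially within that block.
    pos = bisect.bisect_right(index, key) - 1
    if pos < 0:
        return None
    for k, payload in blocks[pos]:
        if k == key:
            return payload
    return None

index, blocks = build_isam([(k, "rec%d" % k) for k in range(1, 21)])
print(isam_lookup(index, blocks, 7))    # rec7
print(isam_lookup(index, blocks, 99))   # None
```

Because the index is built once over static blocks, lookups are cheap, but this is also the source of ISAM's limitations below: inserts into a full block need overflow handling, and the index must eventually be rebuilt.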
Advantages:
Efficient retrieval
Sequential processing
Data integrity
Support for concurrent access
Limitations:
Fixed record length
Index maintenance
Limited flexibility
Lack of data independence
https://www.javatpoint.com/dbms-indexed-sequential-access-method
https://www.w3spoint.com/indexed-sequential-access-method-isam

B+ Trees: A Dynamic Index Structure


B+ Tree is a variation of the B-tree data structure. In a B+ tree, data pointers are stored only at
the leaf nodes of the tree. In a B+ tree, the structure of a leaf node differs from the structure of
the internal nodes. The leaf nodes have an entry for every value of the search field, along with a
data pointer to the record (or to the block that contains this record). The leaf nodes of the B+
tree are linked together to provide ordered access on the search field to the records. Internal
nodes of a B+ tree are used to guide the search. Some search field values from the leaf nodes are
repeated in the internal nodes of the B+ tree.
Features of B+ Trees
• Balanced: B+ Trees are self-balancing, which means that as data is added or removed from the
tree, it automatically adjusts itself to maintain a balanced structure. This ensures that the search
time remains relatively constant, regardless of the size of the tree.
• Multi-level: B+ Trees are multi-level data structures, with a root node at the top and one or more
levels of internal nodes below it. The leaf nodes at the bottom level contain the actual data.
• Ordered: B+ Trees maintain the order of the keys in the tree, which makes it easy to
perform range queries and other operations that require sorted data.
• Fan-out: B+ Trees have a high fan-out, which means that each node can have many child
nodes. This reduces the height of the tree and increases the efficiency of searching and
indexing operations.
• Cache-friendly: B+ Trees are designed to be cache-friendly, which means that they can take
advantage of the caching mechanisms in modern computer architectures to improve
performance.
• Disk-oriented: B+ Trees are often used for disk-based storage systems because they are
efficient at storing and retrieving data from disk.
Why Use B+ Tree?
1. B+ Trees are the best choice for storage systems with sluggish data access because they
minimize I/O operations while facilitating efficient disc access.
2. B+ Trees are a good choice for database systems and applications needing quick data
retrieval because of their balanced structure, which guarantees predictable performance for a
variety of activities and facilitates effective range-based queries.
Difference between B+ Tree and B Tree

Structure: A B+ tree has separate leaf nodes for data storage and internal nodes for
indexing; in a B tree, every node stores both keys and data values.

Leaf Nodes: B+ tree leaf nodes form a linked list for efficient range-based queries;
B tree leaf nodes do not form a linked list.

Order: A B+ tree has a higher order (more keys per node); a B tree has a lower order
(fewer keys per node).

Key Duplication: A B+ tree typically allows key duplication (some keys from the leaf
nodes are repeated in the internal nodes); a B tree usually does not allow key
duplication.

Disk Access: A B+ tree gives better disk access due to sequential reads along the
linked leaf list; a B tree incurs more disk I/O due to non-sequential reads in internal
nodes.

Applications: B+ trees suit database systems and file systems where range queries are
common; B trees suit in-memory data structures, databases, and general-purpose use.

Performance: A B+ tree gives better performance for range queries and bulk data
retrieval; a B tree gives balanced performance for search, insert, and delete
operations.

Memory Usage: A B+ tree requires more memory for internal nodes; a B tree requires
less memory, as keys and values are stored in the same node.
Implementation of B+ Tree
In order to implement dynamic multilevel indexing, B-trees and B+ trees are generally employed.
The drawback of the B-tree used for indexing, however, is that it stores the data pointer (a pointer
to the disk file block containing the key value), corresponding to a particular key value, along with
that key value in the node of a B-tree. This technique greatly reduces the number of entries that
can be packed into a node of a B-tree, thereby contributing to the increase in the number of levels
in the B-tree, hence increasing the search time of a record. B+ tree eliminates the above drawback
by storing data pointers only at the leaf nodes of the tree. Thus, the structure of the leaf nodes of a
B+ tree is quite different from the structure of the internal nodes of the B tree. It may be noted here
that, since data pointers are present only at the leaf nodes, the leaf nodes must necessarily store all
the key values along with their corresponding data pointers to the disk file block, in order to access
them.
Moreover, the leaf nodes are linked together, providing ordered access to the records. The leaf
nodes therefore form the first level of the index, with the internal nodes forming the other levels of a
multilevel index. Some of the key values of the leaf nodes also appear in the internal nodes, to
simply act as a medium to control the searching of a record. From the above discussion, it is
apparent that a B+ tree, unlike a B-tree, has two orders, ‘a’ and ‘b’, one for the internal nodes and
the other for the external (or leaf) nodes.
Structure of B+ Trees

B+ Trees contain two types of nodes:


• Internal Nodes: the internal nodes (all nodes other than the root and the leaves) each hold at
least ⌈n/2⌉ tree pointers; only the root is exempt from this minimum.
• Leaf Nodes: the leaf nodes hold the data entries and have at most n pointers.
The Structure of the Internal Nodes of a B+ Tree of Order ‘a’ is as Follows
1. Each internal node is of the form: <P1, K1, P2, K2, ….., Pc-1, Kc-1, Pc> where c <= a, each Pi is a tree
pointer (i.e., it points to another node of the tree), and each Ki is a key value (see Diagram I for
reference).
2. Within every internal node: K1 < K2 < …. < Kc-1.
3. For each search field value ‘X’ in the sub-tree pointed at by Pi, the following condition
holds: Ki-1 < X <= Ki for 1 < i < c, and Ki-1 < X for i = c (see Diagram I for reference).
4. Each internal node has at most ‘a’ tree pointers.
5. The root node has at least two tree pointers, while each other internal node has at least
⌈a/2⌉ tree pointers.
6. If an internal node has ‘c’ pointers, c <= a, then it has ‘c – 1’ key values.
The Structure of the Leaf Nodes of a B+ Tree of Order ‘b’ is as Follows
1. Each leaf node is of the form: <<K1, D1>, <K2, D2>, ….., <Kc-1, Dc-1>, Pnext> where c <= b,
each Di is a data pointer (i.e., it points to the actual record on disk whose key value is Ki, or to a
disk file block containing that record), each Ki is a key value, and Pnext points to the next leaf
node in the B+ tree (see Diagram II for reference).
2. Every leaf node has: K1 < K2 < …. < Kc-1, c <= b.
3. Each leaf node has at least ⌈b/2⌉ values.
4. All leaf nodes are at the same level.

Diagram II: Using the Pnext pointer it is possible to traverse all the leaf nodes, just like a linked
list, thereby achieving ordered access to the records stored on disk.
Searching a Record in B+ Trees

Let us suppose we have to find 58 in the B+ Tree. We start at the root node and move down
to the leaf node that might contain 58. In the image given above, 58 lies between 50 and 70,
so the search leads to the third leaf node, where 58 is found. If the key cannot be found in
the leaf node reached, a ‘record not found’ message is returned.
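A hand-built sketch of this search, plus a range scan over the linked leaves. The tree below is constructed directly rather than by an insertion algorithm, and names like Leaf and root_keys are invented for the demo; separator-key conventions also vary between B+ tree presentations.

```python
import bisect

class Leaf:
    def __init__(self, entries):
        self.keys = [k for k, _ in entries]
        self.recs = dict(entries)
        self.next = None                      # the Pnext pointer

# Three linked leaves and one internal (root) node guiding the search.
leaves = [Leaf([(10, "a"), (20, "b")]),
          Leaf([(30, "c"), (40, "d")]),
          Leaf([(50, "e"), (58, "f"), (70, "g")])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right

# Child i holds keys k with root_keys[i-1] <= k < root_keys[i].
root_keys = [30, 50]

def search(key):
    # Descend from the root to the one leaf that could contain the key.
    leaf = leaves[bisect.bisect_right(root_keys, key)]
    return leaf.recs.get(key)                 # None plays the role of "record not found"

def range_scan(lo, hi):
    # Descend once for the lower bound, then follow Pnext links through the leaf level.
    leaf = leaves[bisect.bisect_right(root_keys, lo)]
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

print(search(58))            # f
print(range_scan(20, 55))    # [20, 30, 40, 50]
```

Searching for 58 descends past the separators 30 and 50 into the third leaf, just as in the walkthrough above, and the range scan shows why the leaf linked list makes range queries cheap: only one descent is needed, after which the leaves are read in order.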
Insertion in B+ Trees
Every element in the tree has to be inserted into a leaf node. Therefore, it is necessary to
go to a proper leaf node.
Insert the key into the leaf node in increasing order if there is no
overflow. Deletion in B+Trees
Deletion in B+ Trees is not just deletion but a combined process of searching,
deleting, and balancing. In the last step of the deletion process it is mandatory to
rebalance the B+ tree, otherwise the B+ tree properties are violated.
Advantages of B+Trees
A B+ tree with ‘l’ levels can store more entries in its internal nodes compared to a B-tree
having the same ‘l’ levels. This significantly improves the search time for any given
key. Having fewer levels, together with the presence of the Pnext pointers, makes B+
trees very quick and efficient at accessing records from disk. Data stored in a B+ tree
can be accessed both sequentially and directly.
Every lookup takes the same number of disk accesses, because all leaf nodes are at the
same level.
B+ trees store some search keys redundantly: a search key value can appear both in an
internal node and in a leaf node.
Disadvantages of B+Trees
The major drawback of the B-tree is the difficulty of traversing the keys sequentially.
The B+ tree retains the rapid random access property of the B-tree while also allowing
rapid sequential access, but pays for this with the extra space needed to repeat some
key values in the internal nodes.
Application of B+ Trees
Multilevel Indexing
Faster operations on the tree (insertion, deletion, search)
Database indexing
https://www.geeksforgeeks.org/introduction-of-b-tree/
