
DBMS Unit-5

Uploaded by

kusuridivyasri

UNIT-5

FILE ORGANIZATION AND INDEXING IN DBMS

Indexing in DBMS is a technique that uses data structures to reduce the search time of a
database query, which gives faster query results and quicker data retrieval. Because an index
is much smaller than the table it describes, it also needs less space in main memory.

What is Indexing in DBMS?

Indexing is used to quickly retrieve particular data from the database. Formally, we can define
indexing as a technique that uses data structures to optimize the search time of a database
query in a DBMS. Indexing reduces the number of disk accesses required to reach a particular
piece of data by internally creating an index table.

Indexing is achieved by creating an index table, or simply an index.

An index usually consists of two columns that form a key-value pair. The two columns of the
index table contain copies of selected columns of the tabular data of the database.

The first column, the Search Key, contains a copy of the Primary Key or the Candidate Key of
the database table. Generally, we store the selected keys in sorted order so that a lookup can
use binary search instead of a linear scan, which reduces the overall query time.

The second column, the Data Reference, contains a set of pointers that hold the addresses of
disk blocks. The pointed-to disk block contains the actual data referred to by the Search Key.
The Data Reference is also called a Block Pointer because it uses block-based addressing.

Types of Indexes

According to the attributes defined above, we divide indexing into three types: primary,
clustering and secondary indexes.

Ordered indices

The indices are usually sorted to make searching faster. Indices that are sorted are known as
ordered indices.

Example: Suppose we have an employee table with thousands of records, each of which is 10 bytes
long. The IDs start at 1, 2, 3 and so on, and we have to search for the employee with ID 543.

o In the case of a database with no index, we have to scan the disk blocks from the start until
we reach 543. The DBMS will read the record after reading 543*10 = 5430 bytes.
o In the case of an index, the DBMS searches the index instead and will read the record after
reading 542*2 = 1084 bytes (taking each index entry as 2 bytes), which is far less than in the
previous case.
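
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch:
the 2-byte index entry size is an assumption read off the 542*2 figure, and the variable names
are illustrative.

```python
RECORD_SIZE = 10       # bytes per record, from the example
INDEX_ENTRY_SIZE = 2   # bytes per index entry (assumed from the 542*2 figure)
TARGET_ID = 543

# Without an index: records 1..543 are read in sequence.
bytes_without_index = TARGET_ID * RECORD_SIZE

# With an index: only the 542 index entries before the target are read.
bytes_with_index = (TARGET_ID - 1) * INDEX_ENTRY_SIZE

print(bytes_without_index)  # 5430
print(bytes_with_index)     # 1084
```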

Primary Index

o If the index is created on the basis of the primary key of the table, then it is known as
primary indexing. Primary keys are unique to each record, so there is a 1:1 relation between
index entries and records.
o As primary keys are stored in sorted order, the searching operation is quite efficient.
o The primary index can be classified into two types: dense index and sparse index.

Dense index

o The dense index contains an index record for every search key value in the data file, which
makes searching faster.
o The number of records in the index table is the same as the number of records in the main
table.
o It needs more space to store the index records themselves. Each index record holds the search
key and a pointer to the actual record on the disk.

Sparse index

o In a sparse index, index records appear only for some of the items in the data file. Each
index record points to a block.
o Instead of pointing to each record in the main table, the index points to records at
intervals, and the remaining records are reached by scanning forward from the nearest indexed
record.
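
The difference between the two kinds of index can be sketched as follows. This is a minimal
illustration, not a real storage engine: the block size, the sample keys and the function names
are all assumptions made for the example.

```python
from bisect import bisect_right

BLOCK_SIZE = 3  # records per block (assumed for illustration)

# A sorted data file with hypothetical keys 10, 20, ..., 90, cut into blocks.
records = [(k, f"row-{k}") for k in range(10, 100, 10)]
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

# Dense index: one entry per record, pointing at (block number, slot).
dense = {key: (b, s)
         for b, block in enumerate(blocks)
         for s, (key, _) in enumerate(block)}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [block[0][0] for block in blocks]

def dense_lookup(key):
    pos = dense.get(key)
    if pos is None:
        return None
    b, s = pos
    return blocks[b][s][1]

def sparse_lookup(key):
    # Find the last block whose first key is <= key, then scan inside it.
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, row in blocks[b]:
        if k == key:
            return row
    return None

print(len(dense), len(sparse_keys))  # 9 3
print(sparse_lookup(50))             # row-50
```

The dense index needs nine entries for nine records, while the sparse index needs only three but
pays for it with a short scan inside the block.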

Clustering Index

o A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary key columns, which may not be unique for each record.
o In this case, to identify the records faster, we group two or more columns to get a unique
value and create an index out of them. This method is called a clustering index.
o Records which have similar characteristics are grouped together, and indexes are created for
these groups.

Example: Suppose a company has several employees in each department. If we use a clustering
index, all employees who belong to the same Dept_ID are considered to be within a single
cluster, and the index pointers point to the cluster as a whole. Here Dept_ID is a non-unique
key.

This scheme is a little confusing when one disk block is shared by records that belong to
different clusters. Using a separate disk block for each cluster is considered the better
technique.

Secondary Index

In sparse indexing, as the size of the table grows, the size of the mapping also grows. These
mappings are usually kept in primary memory so that address fetches are fast; the actual data
is then read from secondary memory using the address obtained from the mapping. If the mapping
grows too large, fetching the address itself becomes slower and the sparse index is no longer
efficient. To overcome this problem, secondary indexing is introduced.


In secondary indexing, to reduce the size of the mapping, another level of indexing is
introduced. In this method, wide ranges of the column values are selected initially so that the
first-level mapping stays small. Each range is then further divided into smaller ranges. The
first-level mapping is stored in primary memory so that address fetches are fast. The
second-level mapping and the actual data are stored in secondary memory (hard disk).

For example:

o If you want to find the record of roll 111 in the diagram, the search first looks in the
first-level index for the largest entry that is smaller than or equal to 111. It gets 100 at
this level.
o Then, in the second-level index, it again looks for the largest entry that is smaller than or
equal to 111 and gets 110. Using the address stored with 110, it goes to the data block and
scans the records until it finds 111.
o This is how a search is performed in this method. Inserting, updating or deleting is also
done in the same manner.
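
The two-level lookup for roll 111 can be sketched as below. The diagram is not reproduced here,
so the index entries and block contents are hypothetical values chosen only to be consistent
with the steps above (100 at the first level, 110 at the second); the function names are
illustrative as well.

```python
from bisect import bisect_right

def floor_entry(keys, target):
    """Return the largest key <= target from a sorted list, or None."""
    i = bisect_right(keys, target) - 1
    return keys[i] if i >= 0 else None

# Hypothetical index contents (not taken from the original diagram).
first_level = [1, 100, 200]                       # kept in primary memory
second_level = {1: [1, 50],                       # kept on disk
                100: [100, 110, 120],
                200: [200, 250]}
data_blocks = {1: [1, 20], 50: [50, 75],
               100: [100, 105], 110: [110, 111, 115], 120: [120, 150],
               200: [200, 210], 250: [250, 300]}

def find(roll):
    top = floor_entry(first_level, roll)          # step 1: first-level index
    if top is None:
        return None
    mid = floor_entry(second_level[top], roll)    # step 2: second-level index
    for position, r in enumerate(data_blocks[mid]):  # step 3: scan the block
        if r == roll:
            return top, mid, position
    return None

print(find(111))  # (100, 110, 1)
```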

Advantages of Indexing

o Indexing helps in faster query results and quick data retrieval.
o Indexing helps in faster sorting and grouping of records.
o Some indexes use sorted and unique keys, which helps to answer sorted queries even faster.
o Index tables are smaller than the tables they index, so they require less memory.
o Because index tables are small, they can often be kept in main memory.
o Since CPU speed and secondary memory speed differ greatly, an index table held in main memory
helps to bridge the gap in speeds.
o Indexing helps in better CPU utilization and better performance.

File Organization in DBMS


A database consists of a huge amount of data. In an RDBMS the data is grouped into tables, and
each table holds related records. A user sees the data in the form of tables, but in actuality
this huge amount of data is stored in physical memory in the form of files.

File – A file is a named collection of related information that is recorded on secondary
storage such as magnetic disks, magnetic tapes and optical disks.

What is File Organization?

File Organization refers to the logical relationships among the various records that constitute
the file, particularly with respect to the means of identification and access to any specific
record. In simple terms, storing the files in a certain order is called file organization. File
Structure refers to the format of the label and data blocks and of any logical control record.

o A file is a collection of records. Using the primary key, we can access the records. The type
and frequency of access are determined by the type of file organization used for a given set of
records.
o File organization is a logical relationship among various records. It defines how file
records are mapped onto disk blocks.
o File organization describes the way in which the records are stored in terms of blocks, and
how the blocks are placed on the storage medium.
o The first approach to mapping the database to files is to use several files and store records
of only one fixed length in any given file. An alternative approach is to structure the files
so that they can hold records of multiple lengths.
o Files of fixed-length records are easier to implement than files of variable-length records.

Objective of file organization

o It should allow optimal selection of records, i.e., records can be selected as fast as
possible.
o Insert, delete or update transactions on the records should be quick and easy.
o Insert, update or delete operations should not introduce duplicate records.
o Records should be stored efficiently, at minimal cost of storage.

Types of File Organizations –

Various methods have been introduced to organize files. Each method has advantages and
disadvantages with respect to access or selection, so it is up to the programmer to decide
which file organization method best suits the requirements. Some types of file organization
are:
o Sequential File Organization
o Heap File Organization
o Hash File Organization
o B+ Tree File Organization
o Clustered File Organization

Sequential File Organization –

The easiest method of file organization is the sequential method. In this method the records
are stored one after another in a sequential manner. There are two ways to implement this
method:
1. Pile File Method – This method is quite simple: we store the records in a sequence, i.e. one
after the other, in the order in which they are inserted into the table.
Insertion of new record –
Let R1, R3 and so on up to R5 and R4 be the records in the sequence. Here, a record is nothing
but a row in a table. Suppose a new record R2 has to be inserted in the sequence; it is simply
placed at the end of the file.

2. Sorted File Method – In this method, as the name suggests, the records are always kept in
sorted (ascending or descending) order, also after a new record is inserted. Sorting of records
may be based on the primary key or any other key.

Insertion of new record –

Let us assume that there is a preexisting sorted sequence of records R1, R3 and so on up to R7
and R8. Suppose a new record R2 has to be inserted in the sequence; it is first inserted at the
end of the file, and then the sequence is sorted again.
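
The two insertion strategies can be compared in a short sketch. The lists stand in for files,
the records elided by "and so on" in the examples are simply left out, and the function names
are illustrative.

```python
def record_number(record):
    # "R12" -> 12, so that records sort numerically rather than as text.
    return int(record[1:])

def pile_insert(file, record):
    """Pile file: the new record simply goes to the end."""
    file.append(record)

def sorted_insert(file, record):
    """Sorted file: append at the end, then restore the sort order."""
    file.append(record)
    file.sort(key=record_number)

pile_file = ["R1", "R3", "R5", "R4"]
pile_insert(pile_file, "R2")
print(pile_file)    # ['R1', 'R3', 'R5', 'R4', 'R2']

sorted_file = ["R1", "R3", "R7", "R8"]
sorted_insert(sorted_file, "R2")
print(sorted_file)  # ['R1', 'R2', 'R3', 'R7', 'R8']
```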

Pros and Cons of Sequential File Organization –

Pros –
o Fast and efficient method for huge amounts of data.
o Simple design.
o Files can easily be stored on magnetic tapes, i.e. a cheaper storage mechanism.
Cons –
o Time is wasted because we cannot jump directly to the required record; we have to move
through the file sequentially.
o The sorted file method is inefficient because sorting the records takes time and space.

Heap File Organization –

Heap file organization works with data blocks. In this method records are inserted at the end
of the file, into the data blocks. No sorting or ordering is required. If a data block is full,
the new record is stored in some other block. This other data block need not be the very next
data block; it can be any block in memory. It is the responsibility of the DBMS to store and
manage the new records.

Insertion of new record –

Suppose we have five records in the heap, R1, R5, R6, R4 and R3, and a new record R2 has to be
inserted. Since the last data block, i.e. data block 3, is full, R2 will be inserted into any
data block selected by the DBMS, let's say data block 1.

If we want to search, delete or update data in heap file organization, we have to traverse the
data from the beginning of the file until we get the requested record. Thus, if the database is
very large, searching, deleting or updating a record will take a lot of time.
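
A heap file can be sketched as a list of fixed-capacity blocks. The block capacity, the initial
layout of the five records and the placement policy (last block first, then the first block with
free space) are assumptions for illustration; a real DBMS chooses the block by its own rules.

```python
BLOCK_CAPACITY = 2  # records per data block (assumed)

def heap_insert(blocks, record):
    """Try the last block first, then any block with room, else a new block.
    Returns the index of the block that received the record."""
    last = len(blocks) - 1
    for i in [last] + list(range(last)):
        if i >= 0 and len(blocks[i]) < BLOCK_CAPACITY:
            blocks[i].append(record)
            return i
    blocks.append([record])
    return len(blocks) - 1

def heap_search(blocks, record):
    """Scan from the beginning; return (block index, records examined)."""
    examined = 0
    for i, block in enumerate(blocks):
        for r in block:
            examined += 1
            if r == record:
                return i, examined
    return None, examined

# Hypothetical layout: data block 3 (index 2) is full, data block 1 has room.
blocks = [["R1"], ["R5", "R6"], ["R4", "R3"]]
placed_in = heap_insert(blocks, "R2")
print(placed_in + 1)              # 1  (data block 1)
print(heap_search(blocks, "R3"))  # (2, 6)
```

Finding R3 examines all six records, which is why heap files become slow as the file grows.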

Pros and Cons of Heap File Organization –

Pros –
o Fetching and retrieving records is faster than in sequential file organization, but only for
small databases.
o When a huge amount of data needs to be loaded into the database at one time, this method of
file organization is best suited.
Cons –
o Problem of unused memory blocks.
o Inefficient for larger databases.

Hashing

Hashing is a DBMS technique for locating the needed data on disk without using an index
structure. Hashing is also used to index and retrieve items in a database, because searching
for an item by a shorter hashed key is faster than searching by the original value.

What is Hashing in DBMS?

In a large database structure it can be impractical to search all the index values through all
levels and then reach the target data block to obtain the needed data. Hashing is a method for
calculating the direct position of a data record on the disk without the use of an index
structure.

To generate the actual address of a data record, a hash function that takes the search key as
its parameter is used.

Properties of Hashing in DBMS

In this technique, data is kept in data blocks whose addresses are produced by the hash
function. The memory locations where these records are stored are called data buckets or data
blocks.

A hash function can produce the address from any column value, but the primary key is most
frequently used to generate the data block's address. The hash function can be anything from a
simple to a complex mathematical function. In the simplest case the primary key itself is used
as the address of the data block, i.e. each row is stored in the data block whose address is
the same as its primary key.

The data block addresses are the same as the primary key value in the picture above. This
hash function could alternatively be a simple mathematical function, such as exponential,
mod, cos, sin, and so on. Assume we’re using the mod (5) hash function to find the data
block’s address. In this scenario, the primary keys are hashed with the mod (5) function,
yielding 3, 3, 1, 4, and 2, respectively, and records are saved at those data block locations.
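
A mod (5) hash function is easy to sketch. The keys below are hypothetical employee IDs picked
for illustration; they are not the keys of the example above, whose figure is not reproduced.

```python
NUM_BUCKETS = 5

def h(key):
    """mod (5) hash function: maps a key to a bucket address 0..4."""
    return key % NUM_BUCKETS

keys = [103, 104, 105, 106]  # hypothetical primary keys

buckets = {}
for k in keys:
    buckets.setdefault(h(k), []).append(k)

print([h(k) for k in keys])  # [3, 4, 0, 1]
```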

Hash Organization

Bucket – A bucket is a unit of storage. In a hash file, data is stored in buckets. Typically, a
bucket corresponds to one entire disk block, which can store one or more records.

Hash Function – A hash function, abbreviated as h, is a mapping function that maps every search
key K to the address at which the actual records are stored. It is a function from the set of
search keys to the set of bucket addresses.

Types of Hashing

Hashing is of the following types:

Static Hashing

In static hashing, the hash function always returns the same address for a given search-key
value. If the mod-5 hash function is employed, for example, only 5 values (0 to 4) will be
generated, and the output address for a given key never changes. The total number of buckets
available remains constant at all times.

Dynamic Hashing

The disadvantage of static hashing is that it does not expand or contract dynamically as the
database grows or shrinks. Dynamic hashing is a technique that allows data buckets to be
created and removed on the fly. Extendible hashing is another name for dynamic hashing.

In dynamic hashing, the hash function is designed to output a huge number of values, but only a
few are used at first.

What is Static Hashing in DBMS?

In static hashing, the hash function always returns the same address for a given search-key
value, and the number of buckets available remains constant at all times. For example, the
mod-5 hash function can generate only 5 values.

That is, if we use the hash function mod (5) to obtain an address for EMP ID = 103, we will
always get the same bucket address 3. The bucket address will not change.

As a result, the total number of data buckets present in memory remains constant throughout
static hashing. In this case, the memory used to hold the data will consist of five data
buckets.

Static Hashing Operations


Search a Record

When a record is needed, the very same hash function is used in order to get the address of
that bucket in which the data is kept.

Insert a Record

When a new record is entered into the table, the hash key is used to construct an address for
the new record, and the record is placed there.

Delete a Record

To delete a record, we first locate it with the hash function. The record stored at that
address is then deleted from memory.

Update a Record

To edit a record, we’ll use a hash function to find it first, then change the data record.

If we wish to add a new record to the file, but the data bucket address produced by the hash
function is not empty, i.e. data already exists at that address, we cannot add the record
there. This occurrence is called bucket overflow in static hashing, and it is a critical
condition in this strategy.

There are a number of options for dealing with this scenario. The following are some of the
most widely utilised methods:

Open Hashing

Whenever the hash function generates an address that already contains data, the next available
bucket is assigned to the record. This process is called linear probing.

For instance, suppose R3 is a new record that needs to be inserted and the hash function
generates 112 as R3's address. That address is already full, so the system selects 113 as the
next available data bucket and assigns R3 to it.
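
Linear probing can be sketched as below. The address range 110..114, the records already stored
and the assumption that each bucket holds a single record are made up for the example; the hash
function itself is left out and its result is passed in as `home`.

```python
BASE, SIZE = 110, 5  # bucket addresses 110..114 (assumed for illustration)

def probe_sequence(home):
    """Addresses to try, starting at the home address and wrapping around."""
    for step in range(SIZE):
        yield BASE + (home - BASE + step) % SIZE

def insert(table, home, record):
    for addr in probe_sequence(home):
        if table[addr] is None:
            table[addr] = record
            return addr
    raise RuntimeError("all buckets are full")

def search(table, home, record):
    for addr in probe_sequence(home):
        if table[addr] is None:      # an empty bucket ends the probe
            return None
        if table[addr] == record:
            return addr
    return None

table = {addr: None for addr in range(BASE, BASE + SIZE)}
table[110] = "R1"
table[112] = "R2"                    # the home bucket of R3 is already taken

print(insert(table, 112, "R3"))      # 113
print(search(table, 112, "R3"))      # 113
```

Deletion is not shown: with linear probing a deleted slot normally has to be marked rather than
emptied, otherwise later searches would stop too early.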

Closed Hashing

When a data bucket is full, a new bucket is allocated for the same hash result and linked after
the old one. This method is called overflow chaining.

For example, suppose R3 is a new record that has to be added to the database and the hash
function assigns it the address 110. This bucket is already full and cannot accommodate the
additional data. In this scenario, a new bucket is added and linked to the end of bucket 110.
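
Overflow chaining can be sketched with a small linked structure. The bucket capacity of two
records and the class and function names are assumptions for illustration.

```python
BUCKET_CAPACITY = 2  # records per bucket (assumed)

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None  # next bucket in the chain, if any

def chain_insert(bucket, record):
    """Insert into the first bucket of the chain that has room.
    Returns how many overflow links were followed."""
    hops = 0
    while len(bucket.records) >= BUCKET_CAPACITY:
        if bucket.overflow is None:
            bucket.overflow = Bucket()   # link a new bucket after the full one
        bucket = bucket.overflow
        hops += 1
    bucket.records.append(record)
    return hops

def chain_search(bucket, record):
    while bucket is not None:
        if record in bucket.records:
            return True
        bucket = bucket.overflow
    return False

buckets = {110: Bucket()}
chain_insert(buckets[110], "R1")
chain_insert(buckets[110], "R2")         # bucket 110 is now full
hops = chain_insert(buckets[110], "R3")  # goes to a new overflow bucket

print(hops)                              # 1
print(buckets[110].overflow.records)     # ['R3']
```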

What is Dynamic Hashing in DBMS?

The dynamic hashing approach is used to solve problems such as bucket overflow that can occur
with static hashing. In this method, data buckets are added or removed as the number of records
increases or decreases. This makes hashing dynamic, allowing insertion and deletion without
causing poor performance. This method is also known as the extendible hashing method.

Searching a Key

o Calculate the key's hash address first.
o Determine the number of bits used in the directory; this number is referred to as i.
o Take the least significant i bits of the hash address. This gives the index into the
directory.
o Using the index, go to the directory and find the address of the bucket in which the record
may be located.
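
Taking the least significant i bits is a single bit-mask operation. In this sketch the directory
contents are placeholder bucket names and the hash address 10001 is used only as a sample input.

```python
def directory_index(hash_address, i):
    """Keep only the least significant i bits of the hash address."""
    return hash_address & ((1 << i) - 1)

directory = ["B0", "B1", "B2", "B3"]   # a directory that uses i = 2 bits

sample_hash = 0b10001
idx = directory_index(sample_hash, 2)
print(format(idx, "02b"), directory[idx])  # 01 B1
```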

Inserting a New Record

o To begin, follow the same procedure as for searching, ending up in some bucket.
o Place the record in the bucket if there is still room in it.
o If the bucket is completely full, split it and redistribute the records.

Example

Consider the following classification of keys into buckets based on the last bits of their hash
addresses:

Since the last two bits of the hash addresses of 2 and 4 are 00, they are placed in bucket B0.
Because the last two bits for 5 and 6 are 01, they are placed in bucket B1. Since the last two
bits for 1 and 3 are 10, they are placed in bucket B2, and as the last two bits for 7 are 11,
it is placed in B3.

Inserting a key 9 with hash address 10001 into the above structure:

o Key 9 must go into bucket B1 because the last two bits of its hash address 10001 are 01.
However, because bucket B1 is full, it will be split.
o The split uses the last three bits: for 5 and 9 they are 001, so these keys go into bucket
B1, whereas for 6 they are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. Bucket B0 is pointed to by the directory entries 000 and 100,
because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. Bucket B2 is pointed to by the directory entries 010 and 110,
because the last two bits of both entries are 10.
o Key 7 is still in B3. Bucket B3 is pointed to by the directory entries 111 and 011, because
the last two bits of both entries are 11.
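
The split in this example can be reproduced with a compact sketch of extendible hashing. It
assumes a bucket capacity of two keys and takes the hash address as an argument instead of
computing it. Only the address 10001 for key 9 comes from the text; the other hash addresses are
hypothetical values whose trailing bits match the example.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # number of bits this bucket depends on
        self.items = {}                  # key -> hash address

class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=2):
        self.global_depth = global_depth
        self.capacity = capacity
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _index(self, h):
        return h & ((1 << self.global_depth) - 1)   # least significant bits

    def insert(self, key, h):
        bucket = self.directory[self._index(h)]
        if key in bucket.items or len(bucket.items) < self.capacity:
            bucket.items[key] = h
            return
        # The bucket is full. Double the directory if the bucket already
        # uses as many bits as the directory does.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = new_bucket
        # Redistribute the old keys, then retry the pending insert.
        # (Keys with identical hash addresses would keep splitting forever;
        # a real implementation needs an overflow strategy for that case.)
        old_items, bucket.items = bucket.items, {}
        for k, hk in old_items.items():
            self.directory[self._index(hk)].items[k] = hk
        self.insert(key, h)

eh = ExtendibleHash()
hypothetical = {2: 0b00100, 4: 0b01000, 5: 0b01001, 6: 0b10101,
                1: 0b11010, 3: 0b11110, 7: 0b10111}
for key, h in hypothetical.items():
    eh.insert(key, h)

eh.insert(9, 0b10001)                    # B1 is full, so it is split
print(eh.global_depth)                   # 3
print(sorted(eh.directory[0b001].items)) # [5, 9]
print(sorted(eh.directory[0b101].items)) # [6]
```

After the split, entries 000 and 100 still refer to the same bucket object, as do 010 and 110,
and 011 and 111.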


Dynamic Hashing Pros

o The performance of this method does not degrade as the amount of data in the system grows. It
simply increases the memory size to accommodate the data.
o Memory is well utilised because it grows and shrinks with the data, so little memory is left
unused.
o This strategy is well suited to a dynamic database in which the data grows and shrinks
frequently.

Dynamic Hashing Cons

o In this strategy, the number of buckets changes as the amount of data changes. Because data
addresses change as buckets are split and merged, the bucket address table must keep track of
them, and maintaining this table becomes difficult when there is a significant increase in
data.
o Bucket overflow can also occur in this case, although it usually takes longer to reach this
state than with static hashing.

Indexed Sequential Access Method (ISAM)

The indexed sequential access method, also known as ISAM, is an upgrade to the conventional
sequential file organization method; you can think of it as an advanced version of that method.
In this method, the primary key of each record is stored together with an address, and this
address maps to the address of a data block in memory. This address field works as an index of
the file.

Reading and fetching a record is done using the index of the file. The index field contains the
address of a data record in memory, which can be used to read and fetch the record quickly.

Advantages of ISAM

1. Searching for a record is faster in ISAM file organization than in other file organization
methods, because the primary key identifies the record and is stored together with the address
of the record, so the data can be read and fetched from memory directly.
2. This method is more flexible than other methods because it allows an index field (address
field) to be generated for any column of the record. This makes searching easier and more
efficient, as searches can be done on multiple column fields.
3. It allows range retrieval of records: since the address field is stored with the primary key
of the record, we can retrieve the records whose primary key lies in a certain range.
4. This method allows partial searches as well. For example, the prefix “St” can be used to
search for all employees whose name starts with the letters “St”. This returns all the records
where the employee name begins with “St”.
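
Range retrieval and partial (prefix) search both follow from keeping the index sorted. The
sketch below uses a hypothetical index on an employee name column with made-up addresses; the
function names are illustrative.

```python
from bisect import bisect_left, bisect_right

# Sorted index of (key, address) pairs; names and addresses are hypothetical.
index = sorted([("Stacy", 40), ("Adam", 10), ("Steve", 50),
                ("Bob", 20), ("Sam", 30), ("Tom", 60)])
keys = [k for k, _ in index]

def range_lookup(low, high):
    """All index entries with low <= key <= high."""
    return index[bisect_left(keys, low):bisect_right(keys, high)]

def prefix_lookup(prefix):
    """All index entries whose key starts with the given prefix."""
    result = []
    for k, addr in index[bisect_left(keys, prefix):]:
        if not k.startswith(prefix):
            break                      # sorted order: no later key can match
        result.append((k, addr))
    return result

print(prefix_lookup("St"))         # [('Stacy', 40), ('Steve', 50)]
print(range_lookup("Bob", "Sam"))  # [('Bob', 20), ('Sam', 30)]
```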

Disadvantages of ISAM

1. Requires additional space in memory to store the index field.
2. After a record is added to the file, the file needs to be reorganized to maintain the
sequence based on the primary key column.
3. Requires memory cleanup, because when a record is deleted the space used by that record
needs to be released so that it can be used by another record.
4. Performance suffers if records are deleted frequently, as every deletion needs a memory
cleanup and optimization.

B Tree

B Tree is a specialized m-way tree that is widely used for disk access. A node of a B Tree of
order m can have at most m-1 keys and m children. One of the main reasons for using a B Tree is
its capability to store a large number of keys in a single node, and large key values, while
keeping the height of the tree relatively small.

A B Tree of order m has all the properties of an m-way tree. In addition, it has the following
properties.

1. Every node in a B Tree contains at most m children.
2. Every node in a B Tree, except the root node and the leaf nodes, contains at least ceil(m/2)
children.
3. The root node must have at least 2 children, unless it is a leaf.
4. All leaf nodes must be at the same level.

It is not necessary that all the nodes contain the same number of children, but each internal
node other than the root must have at least ceil(m/2) children.


A B tree of order 4 is shown in the following image.


While performing operations on a B Tree, a property of the B Tree may be violated, such as the
minimum number of children a node can have. To maintain the properties of the B Tree, nodes may
be split or joined.

Operations

Searching :

Searching in a B Tree is similar to searching in a binary search tree. For example, suppose we
search for the item 49 in the following B Tree. The process will be something like the
following:

1. Compare item 49 with the root node 78. Since 49 < 78, move to its left sub-tree.
2. Since 40 < 49 < 56, traverse the right sub-tree of 40.
3. Since 49 > 45, move to the right and compare with 49.
4. Match found, return.

Searching in a B Tree depends upon the height of the tree. The search algorithm takes O(log n)
time to search for any element in a B Tree.
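
The search steps can be sketched in Python. The tree built below is hypothetical: only the keys
78, 40, 56, 45 and 49 come from the walkthrough, and the remaining keys are filler chosen to give
a valid B Tree of order 4. The class and function names are illustrative.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys of this node
        self.children = children or []    # empty list for a leaf

def btree_search(node, key):
    """Return the node that contains key, or None if the key is absent."""
    while node is not None:
        i = bisect_left(node.keys, key)   # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return node
        if not node.children:             # reached a leaf without a match
            return None
        node = node.children[i]           # descend into the i-th sub-tree
    return None

left = BTreeNode([40, 56], [BTreeNode([10, 20]),
                            BTreeNode([45, 49]),
                            BTreeNode([60, 70])])
right = BTreeNode([90], [BTreeNode([80, 85]), BTreeNode([95, 99])])
root = BTreeNode([78], [left, right])

found = btree_search(root, 49)
print(found.keys)              # [45, 49]
print(btree_search(root, 50))  # None
```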

Inserting

Insertions are done at the leaf node level. The following algorithm needs to be followed in
order to insert an item into a B Tree.

1. Traverse the B Tree to find the appropriate leaf node at which the key can be inserted.
2. If the leaf node contains fewer than m-1 keys, insert the element in increasing order.
3. Otherwise, if the leaf node already contains m-1 keys, follow these steps.
o Insert the new element in the increasing order of elements.
o Split the node into two nodes at the median.
o Push the median element up to its parent node.
o If the parent node also contains m-1 keys, split it too by following the same steps.

Example:

Insert the key 8 into the B Tree of order 5 shown in the following image.

8 will be inserted to the right of 5.

The node now contains 5 keys, which is greater than the maximum of 5 - 1 = 4 keys. Therefore,
split the node at the median, i.e. 8, and push the median up to its parent node, shown as
follows.
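
The leaf-level part of this insertion can be sketched as a small function. The leaf contents
[3, 5, 9, 10] are hypothetical, chosen so that 8 lands to the right of 5 and becomes the median;
the function name is illustrative.

```python
from bisect import insort

def insert_into_leaf(keys, new_key, order):
    """Insert new_key into a leaf of a B Tree of the given order.
    Returns (keys, None, None) if the leaf still fits, otherwise
    (left_keys, median, right_keys) after splitting at the median."""
    keys = list(keys)
    insort(keys, new_key)                 # keep the keys in increasing order
    if len(keys) <= order - 1:            # at most m-1 keys per node
        return keys, None, None
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

left, median, right = insert_into_leaf([3, 5, 9, 10], 8, order=5)
print(left, median, right)  # [3, 5] 8 [9, 10]
```

The median is pushed up to the parent; if the parent then holds more than m-1 keys, the same
split is applied to it.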

Deletion

Deletion is also performed at the leaf nodes. The key to be deleted can be in either a leaf
node or an internal node. The following algorithm needs to be followed in order to delete a key
from a B Tree. Here "minimum" means the minimum number of keys a node may hold, which is
ceil(m/2) - 1.

1. Locate the leaf node.
2. If the leaf node holds more than the minimum number of keys, delete the desired key from the
node.
3. If deleting the key leaves the leaf node with fewer than the minimum number of keys,
complete the keys by taking an element from the right or left sibling.
o If the left sibling holds more than the minimum number of keys, push its largest element up
to the parent and move the intervening element down to the node where the key was deleted.
o If the right sibling holds more than the minimum number of keys, push its smallest element up
to the parent and move the intervening element down to the node where the key was deleted.
4. If neither sibling holds more than the minimum number of keys, create a new leaf node by
joining the two leaf nodes and the intervening element of the parent node.
5. If the parent is left with fewer than the minimum number of keys, apply the above process to
the parent too.

If the key to be deleted is in an internal node, replace it with its in-order successor or
predecessor. Since the successor or predecessor will always be in a leaf node, the process then
continues as if the key were being deleted from that leaf node.

Example 1

Delete the key 53 from the B Tree of order 5 shown in the following figure.

53 is present in the right child of element 49. Delete it.

Now 57 is the only element left in the node, whereas the minimum number of elements that must
be present in a node of a B Tree of order 5 is 2. The node is below that minimum, and its left
and right siblings do not have spare elements either. Therefore, merge it with the left sibling
and the intervening element of the parent, i.e. 49.

The final B Tree is shown as follows.


Application of B tree

A B Tree is used to index data and provides fast access to the actual data stored on disk,
since accessing a value stored in a large database on disk is a very time-consuming process.

Searching an un-indexed and unsorted database containing n key values needs O(n) running time
in the worst case. However, if we use a B Tree to index this database, it can be searched in
O(log n) time in the worst case.

B+ Tree

B+ Tree is an extension of B Tree which allows efficient insertion, deletion and search
operations.

In a B Tree, keys and records can both be stored in the internal nodes as well as the leaf
nodes. In a B+ Tree, by contrast, records (data) can only be stored in the leaf nodes, while
internal nodes can only store key values.

The leaf nodes of a B+ Tree are linked together in the form of a singly linked list to make
search queries more efficient.
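
The benefit of the linked leaves shows up in range queries: once the first leaf is reached, the
scan simply follows the links. The sketch below models only the leaf level with hypothetical
keys and, for brevity, starts at the leftmost leaf instead of descending through index nodes.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys    # sorted keys stored in this leaf
        self.next = None    # link to the next leaf (singly linked list)

def link(leaves):
    """Chain the leaves together and return the leftmost one."""
    for a, b in zip(leaves, leaves[1:]):
        a.next = b
    return leaves[0]

def range_scan(start_leaf, low, high):
    """Collect all keys with low <= key <= high by walking the leaf chain."""
    result = []
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result

first = link([Leaf([1, 4]), Leaf([7, 10]), Leaf([17, 19]), Leaf([20, 25])])
print(range_scan(first, 5, 19))  # [7, 10, 17, 19]
```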

B+ Trees are used to store large amounts of data that cannot be held in main memory. Because
the size of main memory is always limited, the internal nodes (the keys used to access records)
of the B+ Tree are stored in main memory, whereas the leaf nodes are stored in secondary
memory.

The internal nodes of B+ tree are often called index nodes. A B+ tree of order 3 is shown in
the following figure.

Advantages of B+ Tree

1. Records can be fetched in an equal number of disk accesses.
2. The tree remains balanced and its height is smaller than that of a comparable B Tree.
3. We can access the data stored in a B+ Tree sequentially as well as directly.
4. Keys are used for indexing.
5. Search queries are faster because the data is stored only in the leaf nodes.

B Tree VS B+ Tree

SN | B Tree | B+ Tree
1 | Search keys cannot be stored repeatedly. | Redundant search keys can be present.
2 | Data can be stored in leaf nodes as well as internal nodes. | Data can only be stored in the leaf nodes.
3 | Searching for some data is a slower process, since data can be found in internal nodes as well as in the leaf nodes. | Searching is comparatively faster, as data can only be found in the leaf nodes.
4 | Deletion of internal nodes is complicated and time consuming. | Deletion is never a complex process, since elements are always deleted from the leaf nodes.
5 | Leaf nodes are not linked together. | Leaf nodes are linked together to make search operations more efficient.

Insertion in B+ Tree

Step 1: Insert the new key into the appropriate leaf node.

Step 2: If the leaf does not have the required space, split the node and copy the middle key to
the next index node.

Step 3: If the index node does not have the required space, split the node and copy the middle
element to the next index page.

Example :

Insert the value 195 into the B+ Tree of order 5 shown in the following figure.

195 will be inserted in the right sub-tree of 120, after 190. Insert it at the desired
position.

The node now contains more than the maximum number of elements, i.e. 4. Therefore, split it and
place the median key up in the parent.

Now the index node contains 6 children and 5 keys, which violates the B+ Tree properties.
Therefore, we need to split it, shown as follows.
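
A leaf split in a B+ Tree differs from a B Tree split in that the middle key is copied up, so it
also stays in a leaf. The sketch below assumes a leaf holding [154, 179, 190, 200], which is a
hypothetical reading of the missing figure, and keeps the middle key in the right half; which
half keeps it is a convention that varies between textbooks.

```python
from bisect import insort

def split_leaf(keys, new_key, max_keys):
    """Insert new_key into a B+ Tree leaf.
    Returns (keys, None, None) if it fits, otherwise
    (left_leaf, separator, right_leaf), where the separator is COPIED
    into the parent and also remains the first key of the right leaf."""
    keys = list(keys)
    insort(keys, new_key)
    if len(keys) <= max_keys:
        return keys, None, None
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right[0], right

left, separator, right = split_leaf([154, 179, 190, 200], 195, max_keys=4)
print(left, separator, right)  # [154, 179] 190 [190, 195, 200]
```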

Deletion in B+ Tree

Step 1: Delete the key and data from the leaves.

Step 2: If the leaf node contains fewer than the minimum number of elements, merge the node
with its sibling and delete the key in between them.

Step 3: If the index node contains fewer than the minimum number of elements, merge the node
with its sibling and move down the key in between them.

Example

Delete the key 200 from the B+ Tree shown in the following figure.

200 is present in the right sub-tree of 190, after 195. Delete it.

Merge the two nodes by using 195, 190, 154 and 129.

Now element 120 is the only element present in its node, which violates the B+ Tree properties.
Therefore, we need to merge it by using 60, 78, 108 and 120.

Now the height of the B+ Tree is decreased by 1.
