Module-6
By:
Dr. Nagendra Panini Challa
Assistant Professor, Senior Grade 2
SCOPE, VIT-AP University, India
AGENDA
Storage and file structure: Memory
Hierarchies and Storage Devices,
Placing File Records on Disk
Hashing Techniques
Indexing Techniques
Primary Storage: This is the storage area that offers quick access to the stored data. Primary
storage is volatile because this type of memory does not store data permanently: as soon as
the system suffers a power cut or a crash, the data is lost.
Optical Storage: Optical media can store megabytes or gigabytes of data. A Compact
Disk (CD) can store 700 megabytes of data with a playtime of around 80 minutes. On the
other hand, a Digital Video Disk (DVD) can store 4.7 or 8.5 gigabytes of data on each
side of the disk.
Tape Storage: Tape is a cheaper storage medium than disk. Generally, tapes are used for
archiving or backing up data. Tape provides slow access because data is accessed
sequentially from the start; tape storage is therefore known as sequential-access
storage. Disk storage, by contrast, is known as direct-access storage because we can directly
access the data at any location on the disk.
RAID 0
In this level, a striped array of disks is implemented. The data is broken down into
blocks, and the blocks are distributed among the disks. Each disk receives a block of
data to write/read in parallel, which enhances the speed and performance of the
storage device. There is no parity and no redundancy in level 0, so a single disk failure loses data.
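The block-to-disk mapping behind RAID 0 striping can be sketched in a few lines of Python. This is a minimal illustration of the idea, not any real controller's layout; the function name and zero-based block numbering are assumptions for the example.

```python
def stripe_location(block_num, num_disks):
    """Map a logical block number to (disk index, stripe row) under
    RAID 0 block-level striping: consecutive blocks go to consecutive
    disks, so they can be read or written in parallel."""
    return block_num % num_disks, block_num // num_disks

# With 3 disks, blocks 0,1,2 land on disks 0,1,2 (stripe 0),
# blocks 3,4,5 again on disks 0,1,2 (stripe 1), and so on.
layout = [stripe_location(b, 3) for b in range(6)]
```

Because consecutive logical blocks sit on different disks, a sequential read of the file keeps all three disks busy at once.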
RAID 2 records an Error Correction Code (ECC), computed using a Hamming code, for its data, striped
across different disks. Each data bit in a word is recorded on a separate
disk, and the ECC codes of the data words are stored on a different set of disks. Due to its
complex structure and high cost, RAID 2 is not commercially available.
RAID 4
In this level, an entire block of data is written onto data disks and then the parity is
generated and stored on a different disk. Note that level 3 uses byte-level striping,
whereas level 4 uses block-level striping. Both level 3 and level 4 require at least three
disks to implement RAID.
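The parity idea behind levels 3 to 5 can be illustrated with byte-wise XOR: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors plus the parity. This is a teaching sketch in Python, not a real RAID controller.

```python
def parity_block(blocks):
    """Byte-wise XOR of equal-sized data blocks: the parity block used
    (conceptually) by RAID levels 3-5. XOR-ing the surviving blocks
    with the parity reconstructs any single missing block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data disks
p = parity_block(data)               # stored on the dedicated parity disk
# Suppose disk 1 fails: XOR the surviving blocks with the parity
# to rebuild its block.
recovered = parity_block([data[0], data[2], p])
```

The same XOR works regardless of which single disk fails, which is why one parity disk protects against any one-disk failure.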
RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are generated and stored in a distributed fashion among multiple disks. The two parities provide additional fault tolerance: the array survives the failure of any two disks. This level requires at least four disk drives to implement RAID.
Database Management Systems (DBMS), SCOPE, VIT-AP University, India 02/03/2025 12
DBMS - FILE STRUCTURE
Here, records are stored one after the other in a sequential manner, in the
order they are inserted into the tables. This method is called the pile file method.
When a new record is inserted, it is placed at the end of the file. For any
modification or deletion, the record is first searched for in the memory blocks;
once found, it is marked as deleted, and for a modification the updated
record is written into a new block.
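The insert/delete behaviour of the pile file method can be sketched as an in-memory Python toy. The class name, record layout, and field names here are assumptions for the illustration, not part of any real DBMS.

```python
class PileFile:
    """Minimal sketch of a pile (heap) file: append on insert,
    linear search and mark-as-deleted on delete."""
    def __init__(self):
        self.records = []

    def insert(self, rec):
        self.records.append(rec)          # new records go at the end

    def delete(self, key):
        for i, rec in enumerate(self.records):
            if rec is not None and rec["id"] == key:
                self.records[i] = None    # mark the slot; space is reclaimed later
                return True
        return False                      # every record may have to be scanned

    def search(self, key):
        return next((r for r in self.records
                     if r is not None and r["id"] == key), None)
```

Note that both `delete` and `search` are linear scans: that is exactly the weakness the later file organizations (sorted, hashed, indexed) try to fix.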
The sorted file method always involves the extra effort of keeping the records sorted.
Since all the records are stored randomly, they are scattered in memory; hence, memory is not
used efficiently.
This method is not suitable for searching a range of data, because each record is stored
at a random address; a range search cannot map to a contiguous address range, so the search is
inefficient. For example, searching for employees with salaries from 20K to 30K will not be efficient.
Searching for records with an exact name or value is efficient, but a search such as finding
student names starting with ‘B’ is not, because it does not supply the exact value the hash function needs.
If the search is on a column that is not the hash column, the search will not be efficient.
This method is efficient only when the search is done on the hash column; otherwise, it will not be
able to find the correct address of the data.
If multiple hash columns, say the name and phone number of a person, are combined to generate the
address, then searching by phone number or name alone will not give correct results.
If these hash columns are frequently updated, the data block address also changes accordingly:
each update generates a new address. This is also not acceptable.
The hardware and software required for memory management are costlier in this case, and complex
programs need to be written to make this method efficient.
CLUSTER FILE ORGANIZATION
Indexed Clusters: - Here records are grouped based on the cluster key and stored
together. Our example above to illustrate STUDENT-COURSE cluster is an indexed
cluster. The records are grouped based on the cluster key – COURSE_ID and all the
related records are stored together. This method is followed when data is retrieved for
a range of cluster key values, or when there is huge data growth in a cluster;
for example, when we have to select the students attending courses with
COURSE_ID from 230 to 240, or when a large number of students, say 250,
attend the same course.
Hash Clusters: - This is similar to an indexed cluster, but instead of storing the
records based on the cluster key directly, we generate a hash value for the cluster key and
store the records with the same hash value together on the disk.
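The hash cluster idea can be sketched as: records sharing a cluster key hash land in the same bucket, so rows that join on that key sit together on disk. The bucket count and the modulo hash below are assumptions for this sketch, as are the example names.

```python
# A toy hash cluster: STUDENT-COURSE records are placed in the bucket
# chosen by hashing the cluster key (COURSE_ID), so all records that
# share a course land together.
NUM_BUCKETS = 4

def bucket_for(course_id):
    return course_id % NUM_BUCKETS    # a trivial hash of the cluster key

buckets = {b: [] for b in range(NUM_BUCKETS)}
for student, course_id in [("Amy", 230), ("Ben", 230), ("Cid", 231)]:
    buckets[bucket_for(course_id)].append((student, course_id))
# Both COURSE_ID 230 records end up in the same bucket,
# so a join on COURSE_ID = 230 reads a single disk area.
```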
Advantages of Clustered File Organization
This method is best suited when there are frequent requests for joining
the tables with the same joining condition.
When there is a 1:M mapping between the tables, it performs efficiently.
Disadvantages of Clustered File Organization
This method is not suitable for very large databases, since its
performance on them is poor.
These clusters cannot be used if the joining condition changes;
when it does, traversing the file takes a lot of time.
This method is not suitable for less frequently joined tables or for tables
with 1:1 relationships.
PLACING FILE RECORDS ON DISK
The techniques for placing file records on disk:
In many cases, all records in a file are of the same record type.
If every record in the file has exactly the same size (in bytes), the file is said to be made up
of fixed-length records.
If different records in the file have different sizes, the file is said to be made up of variable-
length records.
A file may have variable-length records for several reasons:
The file records are of the same record type, but one or more of the fields are of varying
size (variable-length fields). For example, the Name field of EMPLOYEE can be a variable-length
field.
The file records are of the same record type, but one or more of the fields may have
multiple values for individual records; such a field is called a repeating field and a group of
values for the field is often called a repeating group.
The file records are of the same record type, but one or more of the fields are optional;
that is, they may have values for some but not all of the file records (optional fields).
The file contains records of different record types and hence of varying size (a mixed file).
This would occur if related records of different types were clustered (placed together) on disk
blocks.
3) Record Blocking and Spanned versus Unspanned Records:
The records of a file must be allocated to disk blocks because a block is the unit of data
transfer between disk and memory.
When the block size is larger than the record size, each block will contain numerous
records, although some files may have unusually large records that cannot fit in one
block. In an unspanned organization, records are not allowed to cross block boundaries,
so each record must fit entirely within one block; in a spanned organization, a record
may be split across blocks, with a pointer at the end of the first block pointing to the
block containing the remainder of the record.
Suppose that the block size is B bytes. For a file of fixed-length records of size R bytes,
with B ≥ R, we can fit bfr = ⌊B/R⌋ records per block, where ⌊x⌋ (the floor function) rounds
the number x down to an integer.
The value bfr is called the blocking factor for the file. In general, R may not
divide B exactly, so we have some unused space in each block equal to
B − (bfr * R) bytes
To store a file of r records, we need
b = ⌈r/bfr⌉ blocks
where ⌈x⌉ (the ceiling function) rounds the value x up to the next
integer.
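The formulas above translate directly into code. A small sketch, with the 512-byte block and 100-byte record sizes chosen purely as example values:

```python
import math

def blocking(B, R, r):
    """Blocking factor, unused bytes per block, and blocks needed for a
    file of r fixed-length unspanned records of R bytes in B-byte blocks."""
    bfr = B // R                  # floor(B / R): records per block
    unused = B - bfr * R          # wasted bytes in each block
    b = math.ceil(r / bfr)        # ceil(r / bfr): blocks for the whole file
    return bfr, unused, b

# Example: 512-byte blocks, 100-byte records, a file of 30 records.
bfr, unused, b = blocking(512, 100, 30)   # bfr = 5, unused = 12, b = 6
```

Each block wastes 12 bytes here; a spanned organization would use those bytes by letting a record straddle two blocks.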
4. Allocating File Blocks on Disk
There are several standard techniques for allocating the blocks of a file on disk.
In contiguous allocation, the file blocks are allocated to consecutive disk blocks.
This makes reading the whole file very fast using double buffering, but it makes
expanding the file difficult.
In linked allocation, each file block contains a pointer to the next file block. This
makes it easy to expand the file but makes it slow to read the whole file.
A combination of the two allocates clusters of consecutive disk blocks, and the clusters
are linked.
Clusters are sometimes called file segments or extents.
Another possibility is to use indexed allocation, where one or more index
blocks contain pointers to the actual file blocks.
It is also common to use combinations of these techniques.
A file header or file descriptor contains information about a file that is needed by the system
programs that access the file records.
The header includes information to determine the disk addresses of the file blocks, as well as
record format descriptions. For fixed-length unspanned records, these may include field lengths
and the order of fields within a record; for variable-length records, they may include field type
codes, separator characters, and record type codes.
To search for a record on disk, one or more blocks are copied into main memory buffers.
Programs then search for the desired record or records within the buffers, using the
information in the file header.
If the address of the block that contains the desired record is not known, the search
programs must do a linear search through the file blocks.
Each file block is copied into a buffer and searched until the record is located or all the file
blocks have been searched unsuccessfully.
This can be very time-consuming for a large file.
The goal of a good file organization is to locate the block that contains a desired record with
a minimal number of block transfers.
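The block-by-block linear search described above can be sketched as follows. The record layout, key field name, and list-of-lists block representation are assumptions for the illustration.

```python
def linear_search(file_blocks, key, key_field="id"):
    """Scan the file block by block: copy each block into a buffer and
    search it, stopping as soon as the record is found. Returns
    (block_index, record), or None after all blocks have been transferred."""
    for i, block in enumerate(file_blocks):
        buffer = list(block)          # stands in for one block transfer
        for record in buffer:
            if record[key_field] == key:
                return i, record
    return None                       # worst case: every block was read

# A tiny file of two blocks, two records each.
blocks = [[{"id": 1}, {"id": 2}], [{"id": 3}, {"id": 4}]]
```

On average half the blocks are transferred for a successful search, and all of them for an unsuccessful one, which is why good file organizations aim to locate the right block directly.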
HASHING TECHNIQUES
Common hashing and collision-resolution techniques include:
separate chaining,
linear and quadratic probing,
double hashing,
extendible hashing,
rehashing
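As an illustration of one of the techniques listed above, here is a minimal linear-probing hash table in Python. It is a teaching sketch, not production code: there is no resizing or deletion, and it assumes the table never fills up.

```python
class LinearProbingTable:
    """Open-addressing hash table with linear probing: on a collision,
    step to the next slot (wrapping around) until a free slot or the
    matching key is found."""
    def __init__(self, size=8):
        self.slots = [None] * size

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)   # linear step on collision
        return i

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else None
```

Quadratic probing and double hashing differ only in how the next slot is chosen; extendible hashing and rehashing instead grow the table as it fills.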
Primary Index
Secondary Index
Clustering Index
Primary Index − Primary index is defined on an ordered data file.
The data file is ordered on a key field. The key field is generally the
primary key of the relation.
Secondary Index − A secondary index may be generated from a field
which is a candidate key and has a unique value in every record, or from a
non-key field with duplicate values.
Clustering Index − A clustering index is defined on an ordered data file,
where the file is ordered on a non-key field.
Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is
stored on the disk along with the actual database files. As the size of the database
grows, so does the size of the indices. There is an immense need to keep the index
records in the main memory so as to speed up the search operations. If single-level
index is used, then a large size index cannot be kept in memory which leads to
multiple disk accesses.
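A two-level index lookup can be sketched as follows: a small outer index, kept in main memory, narrows the search to a single inner index block, which is then scanned for the key. All keys and pointers here are made-up example values.

```python
from bisect import bisect_right

# Outer (first-level) index: the smallest key in each inner index block.
# Keeping this small outer index in main memory avoids extra disk accesses.
outer = [100, 200, 300, 400]
inner_blocks = [
    [(100, "blk0"), (150, "blk1")],     # (search key, data pointer) pairs
    [(200, "blk2"), (250, "blk3")],
    [(300, "blk4"), (350, "blk5")],
    [(400, "blk6"), (450, "blk7")],
]

def lookup(key):
    """Two-level index lookup: binary-search the outer index to pick one
    inner block, then scan only that block for the key."""
    i = bisect_right(outer, key) - 1    # last anchor key <= search key
    if i < 0:
        return None
    for k, ptr in inner_blocks[i]:
        if k == key:
            return ptr
    return None
```

Only one inner block is read per lookup, no matter how many inner blocks exist; adding further levels gives the multilevel structure described above.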
INDEXED SEQUENTIAL ACCESS METHOD (ISAM)
An extra cost has to be paid to maintain the index; i.e., we need extra
space on the disk to store the index values. When there are multiple key-index
combinations, the required disk space also increases.
As new records are inserted, these files have to be restructured to maintain
the sequence. Similarly, when a record is deleted, the space used by it needs
to be released; otherwise, the performance of the database will degrade.
B+ TREE FILE ORGANIZATION
There is one main node called root of the tree – 105 is the root here.
There is an intermediary layer with nodes. They do not have actual records stored. They
are all pointers to the leaf node. Only the leaf node contains the data in sorted order.
The nodes to the left of the root hold values smaller than the root, and the nodes to the right
hold values larger than the root, i.e., 102 and 108 respectively.
The final level consists of leaf nodes, which hold only values, i.e., 100, 101, 103,
104, 106 and 107.
All the leaf nodes are balanced: every leaf node is at the same distance from the root node.
Hence searching any record is easier.
Searching any record follows a single path from the root down to a leaf, so any record
can be traversed and accessed easily.
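The leaf level of a B+ tree forms a sorted linked list, which is what makes range retrieval cheap: find the first qualifying leaf, then walk the chain. A minimal sketch using the example values above (the `Leaf` class and two-leaf split are assumptions for the illustration):

```python
class Leaf:
    """A B+ tree leaf: sorted keys plus a link to the next leaf."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

# Leaves holding the example values; data lives only at the leaf level.
l1, l2 = Leaf([100, 101, 103]), Leaf([104, 106, 107])
l1.next = l2

def range_scan(first_leaf, lo, hi):
    """Walk the leaf chain and collect every key in [lo, hi]."""
    out, leaf = [], first_leaf
    while leaf:
        out.extend(k for k in leaf.keys if lo <= k <= hi)
        if leaf.keys and leaf.keys[-1] > hi:
            break             # leaves are sorted, so we can stop early
        leaf = leaf.next
    return out
```

No internal nodes are touched after the first leaf is located; the scan cost is proportional to the number of leaves in the range.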
ADVANTAGES OF B+ TREES
Since all records are stored only in the leaf nodes, which form a sorted
sequential linked list, searching becomes very easy.
Using B+ trees, we can perform range retrieval or partial retrieval;
traversing the tree structure makes this easy and quick.
As the number of records increases or decreases, the B+ tree structure
grows or shrinks. There is no restriction on B+ tree size, as there is
in ISAM.
Since it is a balanced tree structure, inserts, deletes, and updates do
not degrade performance.
Since all the data is stored in the leaf nodes, greater branching in the
internal nodes makes the height of the tree shorter. This
reduces disk I/O; hence B+ trees work well on secondary storage devices.
B+ TREE ORDER PROPERTIES
For a B+ tree of order m (the maximum number of children per node):
If order m = 4: maximum children = m = 4; minimum children = ⌈m/2⌉ = 2; maximum keys = m − 1 = 3; minimum keys = ⌈m/2⌉ − 1 = 1.
If order m = 3: maximum children = 3; minimum children = ⌈3/2⌉ = 2; maximum keys = 2; minimum keys = ⌈3/2⌉ − 1 = 1.
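The node limits above can be computed for any order m with a small helper that applies the same ⌈m/2⌉ formulas (the function and key names are assumptions for this sketch):

```python
import math

def bplus_order_limits(m):
    """Node limits for a B+ tree of order m, where the order is the
    maximum number of children an internal node may have."""
    return {
        "max_children": m,
        "min_children": math.ceil(m / 2),
        "max_keys": m - 1,
        "min_keys": math.ceil(m / 2) - 1,
    }

limits_4 = bplus_order_limits(4)   # matches the m = 4 example above
limits_3 = bplus_order_limits(3)   # matches the m = 3 example above
```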
B+ TREE MATERIALS
Notes:
https://www.studytonight.com/advanced-data-structures/b-plus-trees-data-structure
Animation:
https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html
https://goneill.co.nz/btree-demo.php
Example:
DB2_4_BplusTreeExample.pdf