DBMS Unit-5
File Organization and Index Structure: File & Record Concept, Placing file records on Disk, Types
of Records, Types of Single-Level Index, Multilevel Indexes, Dynamic Multilevel Indexes using
B tree and B+ tree. MongoDB, NoSQL types, Features and tools.
What is a File?
A file is a named collection of related information that is recorded on secondary storage such
as magnetic disks, magnetic tapes, and optical disks.
• File organization helps in the faster selection of records, which speeds up processing.
• Operations such as inserting, deleting, and updating records become faster and easier.
Various methods have been introduced to organize files. Each method has advantages and
disadvantages with respect to access or selection, so it is up to the programmer to choose the
file organization method best suited to the requirements at hand.
Each file organization method is discussed below, along with its advantages and
disadvantages.
The easiest method for file organization is the sequential method. In this method, records are stored
one after another in a sequential manner. There are two ways to implement this method: the pile
file method and the sorted file method.
Pile file method: Insertion of a new record: Let R1, R3, and so on up to R5 and R4 be the records in the
sequence. Here, a record is nothing but a row in a table. Suppose a new record R2 has to be
inserted in the sequence; it is simply placed at the end of the file.
New Record Insertion
Sorted file method: Insertion of a new record: Let us assume that there is a preexisting sorted sequence of
records R1, R3, and so on up to R7 and R8. Suppose a new record R2 has to be inserted in the
sequence; it is first placed at the end of the file, and the sequence is then sorted again.
New Record Insertion
• Files can be easily stored on magnetic tapes, i.e. a cheaper storage mechanism.
• Time is wasted because we cannot jump directly to the required record; we have to
move through the file sequentially.
• The sorted file method is inefficient, as it takes time and space to sort the records.
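The two insertion strategies described above (append at the end, or keep the file sorted) can be sketched in Python as follows. This is a minimal in-memory illustration, not how a DBMS lays out disk pages. The function names pile_insert and sorted_insert are hypothetical, and records are represented by integer keys (2 stands for R2).

```python
import bisect

def pile_insert(records, key):
    """Unsorted (pile) file: append the new record at the end."""
    records.append(key)

def sorted_insert(records, key):
    """Sorted file: place the new record so the sequence stays ordered.

    bisect.insort finds the position by binary search and shifts later
    entries, giving the same result as appending and then re-sorting.
    """
    bisect.insort(records, key)

pile = [1, 3, 5, 4]          # R1, R3, R5, R4 in arrival order
pile_insert(pile, 2)         # R2 goes to the end

ordered = [1, 3, 7, 8]       # R1, R3, R7, R8 already sorted
sorted_insert(ordered, 2)    # R2 lands between R1 and R3

print(pile)     # [1, 3, 5, 4, 2]
print(ordered)  # [1, 2, 3, 7, 8]
```

The append is cheap, while the sorted insert has to move every later record, which is the cost the last bullet above refers to.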
Heap file organization works with data blocks. In this method, records are inserted at the end
of the file, into the data blocks. No sorting or ordering is required. If a data block
is full, the new record is stored in some other block. This other block need not be the very
next data block; it can be any block in the memory. It is the responsibility of the DBMS to store
and manage the new records.
Heap File Organization
Insertion of a new record: Suppose we have five records in the heap, R1, R5, R6, R4, and R3,
and a new record R2 has to be inserted. Since the last data block, i.e. data
block 3, is full, R2 will be inserted in any data block selected by the DBMS, let’s say data
block 1.
If we want to search, delete, or update data in heap file organization, we traverse the data
from the beginning of the file until we reach the requested record. Thus, if the database is very large,
searching, deleting, or updating a record will take a lot of time.
• Fetching and retrieving records is faster than in sequential files, but only in the case of
small databases.
• When a large amount of data needs to be loaded into the database at one time,
this method of file organization is best suited.
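The heap insertion and search behaviour described above can be sketched as below. This is an illustrative model only: blocks are Python lists, BLOCK_CAPACITY is an assumed block size, and choosing the first block with free space is just one possible policy (the text leaves the choice to the DBMS). The names heap_insert and heap_search are hypothetical.

```python
BLOCK_CAPACITY = 2  # assumed number of records per data block

def heap_insert(blocks, record):
    """Store the record in any block with free space; add a block if none."""
    for block in blocks:
        if len(block) < BLOCK_CAPACITY:
            block.append(record)
            return
    blocks.append([record])

def heap_search(blocks, record):
    """Linear scan from the first block; returns (block_no, slot) or None."""
    for block_no, block in enumerate(blocks):
        for slot, stored in enumerate(block):
            if stored == record:
                return block_no, slot
    return None

blocks = [["R1"], ["R5", "R6"], ["R4", "R3"]]  # the last block is full
heap_insert(blocks, "R2")
print(blocks)                     # [['R1', 'R2'], ['R5', 'R6'], ['R4', 'R3']]
print(heap_search(blocks, "R2"))  # (0, 1)
```

Because no ordering is kept, heap_search has to look at every record in the worst case, which is why large heap files are slow to search.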
When placing file records on disk, several strategies can be employed to optimize performance
and storage efficiency:
1. Sequential Storage: Records are stored one after another in a contiguous block of disk
space. This method is simple and efficient for sequential access but can lead to
fragmentation over time.
2. Indexed Storage: An index is created to map record keys to their physical locations on
disk. This allows for faster search and retrieval of records but requires additional storage
for the index itself.
3. Hashed Storage: Records are placed on disk based on a hash function applied to their key
values. This method provides quick access but can suffer from collisions, requiring
collision resolution techniques.
4. Heap Files: Records are stored in no particular order, and new records are placed in the
first available space. This method is simple but can lead to inefficient access patterns.
5. Sorted Files: Records are stored in a sorted order based on one or more fields. This
facilitates efficient range queries and ordered access but requires maintenance to keep the
records sorted.
6. Clustered Files: Records that are frequently accessed together are stored close to each
other on disk. This improves access times for related records but requires careful planning
and maintenance.
By understanding these concepts, you can better appreciate how data is organized and managed
within a DBMS, leading to more efficient database design and operation.
File Organization
File Organization defines how file records are mapped onto disk blocks. We have four types of
File Organization to organize file records −
Heap File Organization
When a file is created using Heap File Organization, the operating system allocates a memory area
to that file without any further accounting details. File records can be placed anywhere in that
memory area. It is the responsibility of the software to manage the records. A heap file does not
support any ordering, sequencing, or indexing on its own.
Sequential File Organization
Every file record contains a data field (attribute) to uniquely identify that record. In sequential file
organization, records are placed in the file in some sequential order based on the unique key field
or search key. Practically, it is not possible to store all the records sequentially in physical form.
Hash File Organization
Hash File Organization uses Hash function computation on some fields of the records. The output
of the hash function determines the location of disk block where the records are to be placed.
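As a rough sketch of the idea, the snippet below maps integer keys to buckets that stand in for disk blocks. The modulo hash function, the NUM_BUCKETS value, and the sample rows are all assumptions made for illustration; a real DBMS would use its own hash function and overflow handling.

```python
NUM_BUCKETS = 4  # assumed number of disk blocks (buckets)

def bucket_for(key):
    """Hash function: maps an integer key to a bucket number."""
    return key % NUM_BUCKETS

def hash_insert(buckets, key, row):
    """Place the record in the bucket chosen by the hash function."""
    buckets[bucket_for(key)].append((key, row))

def hash_lookup(buckets, key):
    """Only the one bucket chosen by the hash function is scanned."""
    for stored_key, row in buckets[bucket_for(key)]:
        if stored_key == key:
            return row
    return None

buckets = [[] for _ in range(NUM_BUCKETS)]
hash_insert(buckets, 10, "Asha")
hash_insert(buckets, 14, "Ravi")   # 14 % 4 == 2, same bucket as key 10
hash_insert(buckets, 7, "Meena")

print(bucket_for(10), bucket_for(14), bucket_for(7))  # 2 2 3
print(hash_lookup(buckets, 14))                       # Ravi
```

Keys 10 and 14 collide in bucket 2; here both are simply kept in the same bucket and found by a short scan.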
Clustered File Organization
Clustered file organization is not considered good for large databases. In this mechanism, related
records from one or more relations are kept in the same disk block, that is, the ordering of records
is not based on primary key or search key.
File Operations
Operations on database files can be broadly classified into two categories −
• Update Operations
• Retrieval Operations
Update operations change the data values by insertion, deletion, or update. Retrieval operations,
on the other hand, do not alter the data but retrieve them after optional conditional filtering. In both
types of operations, selection plays a significant role. Other than creation and deletion of a file,
there could be several operations, which can be done on files.
• Open − A file can be opened in one of the two modes, read mode or write mode. In read
mode, the operating system does not allow anyone to alter data. In other words, data is read
only. Files opened in read mode can be shared among several entities. Write mode allows
data modification. Files opened in write mode can be read but cannot be shared.
• Locate − Every file has a file pointer, which tells the current position where the data is to
be read or written. This pointer can be adjusted accordingly. Using find (seek) operation, it
can be moved forward or backward.
• Read − By default, when files are opened in read mode, the file pointer points to the
beginning of the file. There are options where the user can tell the operating system where
to locate the file pointer at the time of opening a file. The very next data to the file pointer
is read.
• Write − The user can choose to open a file in write mode, which enables them to edit its
contents. The edit can be a deletion, insertion, or modification. The file pointer can be positioned at
the time of opening or can be dynamically changed if the operating system allows it.
• Close − This is the most important operation from the operating system's point of view.
When a request to close a file is generated, the operating system
o removes all the locks (if in shared mode),
o saves the data (if altered) to the secondary storage media, and
o releases all the buffers and file handlers associated with the file.
The organization of data inside a file plays a major role here. The process of moving the file pointer
to a desired record inside a file varies based on whether the records are arranged sequentially or
clustered.
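The open, locate, read, write, and close operations above map closely onto ordinary file I/O. The sketch below uses Python's built-in file objects on a temporary file. The fixed RECORD_SIZE and the file name records.dat are assumptions chosen so that a record's offset can be computed directly.

```python
import os
import tempfile

RECORD_SIZE = 8  # assumed fixed record length in bytes

# Create a small file of three fixed-length records.
path = os.path.join(tempfile.mkdtemp(), "records.dat")
with open(path, "wb") as f:               # open in write mode
    for name in (b"R1", b"R2", b"R3"):
        f.write(name.ljust(RECORD_SIZE))  # write at the file pointer

f = open(path, "rb")                      # open in read mode
start = f.tell()                          # pointer starts at offset 0
f.seek(2 * RECORD_SIZE)                   # locate: jump to the third record
third = f.read(RECORD_SIZE).strip()       # read the data at the pointer
f.close()                                 # close: release the file handle

print(start, third)  # 0 b'R3'
```

With fixed-length records stored sequentially, locating record k is a single seek to k * RECORD_SIZE; other organizations need an index or a scan to find the offset.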
DBMS - Indexing
We know that data is stored in the form of records. Every record has a key field, which helps it to
be recognized uniquely.
Indexing is a data structure technique to efficiently retrieve records from the database files based
on some attributes on which the indexing has been done. Indexing in database systems is similar
to what we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following types −
• Primary Index − Primary index is defined on an ordered data file. The data file is ordered
on a key field. The key field is generally the primary key of the relation.
• Secondary Index − Secondary index may be generated from a field which is a candidate
key and has a unique value in every record, or a non-key with duplicate values.
• Clustering Index − Clustering index is defined on an ordered data file. The data file is
ordered on a non-key field.
Ordered indexing is of two types −
• Dense Index
• Sparse Index
Dense Index
In a dense index, there is an index record for every search key value in the database. This makes
searching faster but requires more space to store the index records themselves. Index records contain a search
key value and a pointer to the actual record on the disk.
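A dense index can be pictured with the small sketch below. It is only an in-memory analogy: list positions stand in for disk addresses, a Python dict stands in for the index structure, and the sample rows and the name dense_lookup are made up for illustration.

```python
# Data file: the list position stands in for a disk address (an assumption).
data_file = [(30, "Ravi"), (10, "Asha"), (20, "Meena")]

# Dense index: one entry per search-key value, pointing at the record.
dense_index = {key: pos for pos, (key, _) in enumerate(data_file)}

def dense_lookup(key):
    """Follow the index pointer straight to the record, if the key exists."""
    pos = dense_index.get(key)
    return None if pos is None else data_file[pos]

print(len(dense_index) == len(data_file))  # True
print(dense_lookup(20))                    # (20, 'Meena')
```

The index has as many entries as the data file has records, which is the extra space cost mentioned above.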
Sparse Index
In a sparse index, index records are not created for every search key. An index record here contains
a search key and an actual pointer to the data on the disk. To search for a record, we first follow the
index record to reach the actual location of the data. If the data we are looking for is not where
we directly reach by following the index, then the system starts a sequential search until the desired
data is found.
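The lookup procedure just described (follow the index, then search sequentially) might look like the following sketch. It assumes a sorted data file split into blocks with one index entry per block. The block size, sample data, and the name sparse_lookup are illustrative choices, not part of the original notes.

```python
import bisect

# Sorted data file split into blocks of two records each (assumed sizes).
blocks = [[(10, "A"), (20, "B")], [(30, "C"), (40, "D")], [(50, "E"), (60, "F")]]

# Sparse index: only the first key of each block is indexed.
index_keys = [block[0][0] for block in blocks]   # [10, 30, 50]

def sparse_lookup(key):
    # Find the last index entry whose key is <= the search key.
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None
    # Sequential search inside the block the index entry points to.
    for stored_key, value in blocks[i]:
        if stored_key == key:
            return value
    return None

print(sparse_lookup(40))  # D
print(sparse_lookup(35))  # None
```

Key 40 has no index entry of its own; the entry for 30 leads to the right block and a short scan finds it.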
Multilevel Index
Index records comprise search-key values and data pointers. A multilevel index is stored on the disk
along with the actual database files. As the size of the database grows, so does the size of the
indices. It is highly desirable to keep the index records in main memory so as to speed up
search operations. If a single-level index is used, a large index cannot be kept in
memory, which leads to multiple disk accesses.
A multilevel index breaks the index down into several smaller indices in order to make
the outermost level so small that it can be saved in a single disk block, which can easily be
accommodated anywhere in the main memory.
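A two-level version of this idea is sketched below. It assumes the inner index is a sorted list of keys split into blocks of three entries, with an outer index holding the first key of each inner block. The block size, keys, and the name find_entry are assumptions for illustration.

```python
import bisect

# Inner (first-level) index: sorted keys, split into blocks of 3 entries.
inner_blocks = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

# Outer (second-level) index: first key of each inner block.
# It is small enough to be held in a single block in memory.
outer_index = [block[0] for block in inner_blocks]  # [10, 40, 70]

def find_entry(key):
    """Return (inner_block_no, slot) for key, or None if it is absent."""
    b = bisect.bisect_right(outer_index, key) - 1   # search the outer level
    if b < 0:
        return None
    block = inner_blocks[b]                         # fetch one inner block
    s = bisect.bisect_left(block, key)              # search the inner level
    if s < len(block) and block[s] == key:
        return b, s
    return None

print(find_entry(50))  # (1, 1)
print(find_entry(55))  # None
```

Only one inner block is touched per lookup, instead of scanning the whole single-level index.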
B+ Tree
A B+ tree is a balanced multiway search tree (not a binary tree) that follows a multi-level index format. The leaf nodes
of a B+ tree hold the actual data pointers. A B+ tree ensures that all leaf nodes remain at the same
height, so the tree stays balanced. Additionally, the leaf nodes are linked using a linked list; therefore, a B+ tree
can support random access as well as sequential access.
Structure of B+ Tree
Every leaf node is at equal distance from the root node. A B+ tree is of the order n where n is fixed
for every B+ tree.
Internal nodes −
• Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node.
Leaf nodes −
• Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
• At most, a leaf node can contain n record pointers and n key values.
• Every leaf node contains one block pointer P to point to next leaf node and forms a linked
list.
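The leaf-level rules above can be illustrated with a small sketch. The Leaf class, min_leaf_keys, and range_scan are hypothetical names. Record pointers are omitted, and only the ⌈n/2⌉ minimum and the linked list of leaves are modelled.

```python
import math

class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted key values (record pointers omitted)
        self.next = None      # block pointer P to the next leaf

def min_leaf_keys(n):
    """Minimum number of key values in a leaf of an order-n B+ tree."""
    return math.ceil(n / 2)

def range_scan(leaf, low, high):
    """Follow the leaf linked list, collecting keys in [low, high]."""
    found = []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return found
            if k >= low:
                found.append(k)
        leaf = leaf.next
    return found

a, b, c = Leaf([5, 10]), Leaf([15, 20]), Leaf([25, 30])
a.next, b.next = b, c

print(min_leaf_keys(4))       # 2
print(range_scan(a, 10, 25))  # [10, 15, 20, 25]
```

The range scan never goes back up to internal nodes; it walks the next pointers, which is what makes sequential access cheap in a B+ tree.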
B+ Tree Insertion
• B+ trees are filled from the bottom, and each entry is made at a leaf node.
• If a leaf node overflows, it is split into two parts.
o Partition at i = ⌊(m+1)/2⌋.
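The partition rule i = ⌊(m+1)/2⌋ can be tried out with the sketch below. The notes do not define m, so the code assumes m is the maximum number of entries a leaf may hold, meaning an overflowing leaf arrives with m + 1 sorted entries. The name split_leaf is hypothetical, and updating the parent node is left out.

```python
def split_leaf(entries, m):
    """Split an overflowing leaf at i = floor((m + 1) / 2).

    Assumption: m is the maximum number of entries a leaf may hold,
    so an overflowing leaf arrives here with m + 1 sorted entries.
    """
    assert len(entries) == m + 1, "leaf is not overflowing"
    i = (m + 1) // 2
    return entries[:i], entries[i:]   # the first i stay, the rest move

left, right = split_leaf([5, 10, 15, 20, 25], m=4)
print(left, right)  # [5, 10] [15, 20, 25]
```

Under this assumption both halves keep at least ⌊(m+1)/2⌋ entries, so neither new leaf is left nearly empty.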
B+ Tree Deletion
• B+ tree entries are deleted at the leaf nodes. The target entry is searched for and deleted.
o If it is an internal node, delete and replace with the entry from the left position.
• After deletion, underflow is tested.
o If underflow occurs, redistribute entries from the node to its left.
MongoDB
MongoDB is a document-oriented NoSQL database developed by MongoDB Inc. It stores data
in a JSON-like format called BSON (Binary JSON), making it highly flexible and scalable.
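To give a feel for the document model, the sketch below represents two documents as plain Python dictionaries and prints them as JSON. It does not use a MongoDB driver or a running server. The "students" collection, field names, and values are invented for illustration, and the standard json module only approximates BSON, which is a binary encoding.

```python
import json

# Two documents in the same (hypothetical) "students" collection.
# They do not share an identical set of fields.
students = [
    {"_id": 1, "name": "Asha", "branch": "CSE"},
    {"_id": 2, "name": "Ravi", "skills": ["SQL", "Python"],
     "address": {"city": "Pune"}},   # array and nested document
]

for doc in students:
    print(json.dumps(doc))

# A simple filter, similar in spirit to querying on one field.
cse = [d["name"] for d in students if d.get("branch") == "CSE"]
print(cse)  # ['Asha']
```

The second document carries an array and a nested document that the first one lacks, which a fixed relational table would not allow without a schema change.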
• Indexing: Supports various types of indexes (single, compound, geospatial, text, etc.).
Column Store: Data stored in columns rather than rows. Examples: Apache Cassandra, HBase.
• Flexible schema
• Horizontal scaling
Tool Purpose