DBMS-U5 Notes

The document discusses different methods for organizing and indexing data in a database management system. It describes how data is stored on external storage devices like disks and tapes and brought into memory as needed. It then covers various file organizations like heap files, sorted files, and indexes. It also discusses different types of indexes including clustered indexes, primary and secondary indexes, hash-based indexing, and tree-based indexing.

Uploaded by

Madhulatha 786
DBMS – Unit - 5

Data on External Storage


A DBMS stores vast quantities of data, and the data must persist across program
executions. Therefore, data is stored on external storage devices such as disks and tapes,
and fetched into main memory as needed for processing.
• Disks are the most important external storage devices. They allow us to retrieve any
page at a (more or less) fixed cost per page. However, if we read several pages in the
order that they are stored physically, the cost can be much less than the cost of reading
the same pages in a random order.
• Tapes are sequential access devices and force us to read data one page after the other.
They are mostly used to archive data that is not needed on a regular basis.
• Each record in a file has a unique identifier called a record id, or rid for short. An rid
has the property that we can identify the disk address of the page containing the record
by using the rid.

Data is read into memory for processing, and written to disk for persistent storage, by a
layer of software called the buffer manager. When the files and access methods layer
(which we often refer to as just the file layer) needs to process a page, it asks the buffer
manager to fetch the page, specifying the page's rid. The buffer manager fetches the page
from disk if it is not already in memory.
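The fetch behaviour described above can be sketched in a few lines of Python. This is only a
toy model: the class name BufferManager, the dict used as a stand-in for the disk, and the
disk_reads counter are assumptions made for illustration, and a real buffer manager would also
need a replacement policy for when memory fills up.

```python
class BufferManager:
    """Toy buffer manager: keeps fetched pages in an in-memory dict."""

    def __init__(self, disk):
        self.disk = disk        # hypothetical "disk": page id -> page contents
        self.pool = {}          # pages currently held in memory
        self.disk_reads = 0     # number of simulated disk I/Os

    def fetch(self, page_id):
        # Read from disk only if the page is not already in memory.
        if page_id not in self.pool:
            self.pool[page_id] = self.disk[page_id]
            self.disk_reads += 1
        return self.pool[page_id]


disk = {1: ["r1", "r2"], 2: ["r3"]}
bm = BufferManager(disk)
bm.fetch(1)
bm.fetch(1)              # served from memory, no second disk read
bm.fetch(2)
print(bm.disk_reads)     # 2
```

The second request for page 1 costs no I/O, which is the point of placing the buffer manager
between the file layer and the disk.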
Space on disk is managed by the disk space manager, according to the DBMS software
architecture. When the files and access methods layer needs additional space to hold new
records in a file, it asks the disk space manager to allocate an additional disk page for the
file; it also informs the disk space manager when it no longer needs one of its disk pages.
The disk space manager keeps track of the pages in use by the file layer; if a page is freed
by the file layer, the space manager tracks this, and reuses the space if the file layer requests
a new page later on.
------------------------------------------------------------------------------------------------------------
File Organizations and Indexing
File organization refers to the logical relationships among the various records that constitute
the file, particularly with respect to the means of identifying and accessing any specific
record. In simple terms, storing the records of a file in a certain order is called file
organization. File structure refers to the format of the label and data blocks and of any
logical control record.
The file layer stores the records in a file in a collection of disk pages. It keeps track of
pages allocated to each file, and as records are inserted into and deleted from the file, it
also tracks available space within pages allocated to the file.
• Heap (random order) files: Suitable when typical access is a file scan retrieving all
records.
• Sorted files: Best if records must be retrieved in some order, or only a "range" of
records is needed.
• Indexes: Data structures that organize records to optimize certain kinds of retrieval
operations.
  • They speed up searches for a subset of records, based on values in certain ("search
  key") fields.
  • Updates are much faster than in sorted files.

------------------------------------------------------------------------------------------------------------
Clustered Indexes
An index is a structure built on one or more columns of a table or view that speeds up
fetching rows. It helps a database system such as Oracle, SQL Server, or MySQL
find the rows associated with given key values quickly.
A clustered index is a type of index that sorts the data rows in the table on their key values.
Because a clustered index defines the order in which data is stored in the table, and the rows
can be stored in only one order, there can be only one clustered index per table. In an
RDBMS, the clustered index is usually created on the primary key column.
Characteristics of a Clustered Index
• Default and sorted data storage
• Can be built on one column or on several columns
• Stores data and index together
• Fragmentation
• Operations
• Clustered index scan and index seek
• Key lookup

If the index is clustered, i.e., we are using the search key of a clustered file, the rids in
qualifying data entries point to a contiguous collection of records, and we need to retrieve
only a few data pages. If the index is unclustered, each qualifying data entry could contain
a rid that points to a distinct data page, leading to as many data page I/Os as the number of
data entries that match the range selection.
Primary and Secondary Indexes
An index on a set of fields that includes the primary key is called a primary index; other
indexes are called secondary indexes.
The two terms are sometimes used with a different meaning, based on what the index stores
as the data entry for a search key value k. Under Alternative (1) the data entry is the actual
data record; under Alternative (2) it is a (k, rid) pair; under Alternative (3) it is a
(k, rid-list) pair. In this usage, an index that uses Alternative (1) is called a primary
index, and one that uses Alternative (2) or (3) is called a secondary index.
Two data entries are said to be duplicates if they have the same value for the search key
field associated with the index. A primary index is guaranteed not to contain duplicates,
but an index on other (collections of) fields can contain duplicates. In general, a secondary
index contains duplicates. If we know that no duplicates exist, that is, we know that the
search key contains some candidate key, we call the index a unique index.
Index Data Structures
Indexing is a data structure technique that allows you to quickly retrieve records from a
database file. In its simplest form, an index is a small table with two columns. The first
column holds a copy of the primary or candidate key of a table. The second column holds a
set of pointers to the disk blocks where each key value is stored.
An index:
• Takes a search key as input
• Efficiently returns a collection of matching records.
------------------------------------------------------------------------------------------------------------
Hash-Based Indexing
In a DBMS, hashing is a technique for computing the location of the desired data on disk
directly from the search key, without searching through an index structure. A hash function is
applied to the search key, and its result is the address of the data block, also called a
data bucket, in which the record is stored.
In this approach, the records in a file are grouped in buckets, where a bucket consists of a
primary page and, possibly, additional pages linked in a chain. The bucket to which a record
belongs can be determined by applying a special function, called a hash function, to the
search key. Given a bucket number, a hash-based index structure allows us to retrieve the
primary page for the bucket in one or two disk I/Os.
• Inserting a record: When a new record has to be inserted into the table, an address for
the new record is generated by applying the hash function to its hash key. The record is
then stored at that location.
• Searching: To retrieve a record, the same hash function is applied to the search key to
obtain the address of the bucket where the record should be stored.
• Deleting a record: Using the hash function, first locate the record to be deleted, then
remove it from the bucket at that address.

Hash indexing is illustrated in the figure, where the data is stored in a file that is hashed
on age; the data entries in this first index file are the actual data records. Applying the
hash function to the age field identifies the page that the record belongs to. The hash
function h for this example is quite simple; it converts the search key value to its binary
representation and uses the two least significant bits as the bucket identifier.
The figure also shows an index with search key sal that contains (sal, rid) pairs as data
entries. The rid (short for record id) component of a data entry in this second index is a
pointer to a record with search key value sal.
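A minimal sketch of the hash function h described above, assuming integer ages. The employee
names and ages are made up for illustration and are not the records from the figure.

```python
def h(age):
    """Bucket id = the two least significant bits of the key's binary form."""
    return age & 0b11


# Four buckets, numbered 0 to 3, each holding the records hashed to it.
buckets = {b: [] for b in range(4)}
for name, age in [("Ann", 22), ("Raj", 25), ("Lee", 30), ("Mia", 33)]:  # made-up data
    buckets[h(age)].append((name, age))


def find(age):
    """Equality search: only one bucket has to be examined."""
    return [rec for rec in buckets[h(age)] if rec[1] == age]


print(h(22), h(25))   # 2 1   (22 = 10110, 25 = 11001 in binary)
print(find(25))       # [('Raj', 25)]
```

Records with ages 22 and 30 share bucket 2 because both end in the bits 10, so a search must
still compare keys inside the bucket.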
------------------------------------------------------------------------------------------------------------
Tree-Based Indexing
An alternative to hash-based indexing is to organize records using a treelike data structure.
The data entries are arranged in sorted order by search key value, and a hierarchical search
data structure is maintained that directs searches to the correct page of data entries.
The figure shows the employee records organized in a tree-structured index with
search key age. Each node in this figure (e.g., nodes labeled A, B, L1, L2) is a physical
page, and retrieving a node involves a disk I/O.
The lowest level of the tree, called the leaf level, contains the data entries; in our example,
these are employee records. To illustrate the ideas better, the figure is drawn as if there
were additional employee records, some with age less than 22 and some with age greater
than 50. Additional records with age less than 22 would appear in leaf pages to the left of
page L1, and records with age greater than 50 would appear in leaf pages to the right of
page L3.
This structure allows us to efficiently locate all data entries with search key values in a
desired range.
• All searches begin at the topmost node, called the root, and the contents of pages in
non-leaf levels direct searches to the correct leaf page.
• Non-leaf pages contain node pointers separated by search key values.
• The node pointer to the left of a key value k points to a subtree that contains only
data entries less than k.
• The node pointer to the right of a key value k points to a subtree that contains only
data entries greater than or equal to k.
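The rules above can be sketched as a range search over a small two-level tree. The separator
keys and leaf contents below are hypothetical values chosen for illustration (they are not
read off the figure), and the data entries are reduced to bare age values.

```python
from bisect import bisect_right

# Hypothetical tree: one non-leaf page (the root) over three sorted leaf pages.
root_keys = [30, 44]                              # separator keys in the root
leaves = [[22, 25, 29], [30, 33, 40], [44, 50]]   # data entries (ages)


def find_leaf(key):
    # Keys less than k go left of k; keys greater than or equal to k go right.
    return bisect_right(root_keys, key)


def range_search(lo, hi):
    """Return all data entries e with lo <= e <= hi."""
    result = []
    for leaf in leaves[find_leaf(lo):]:   # start at the leaf that may hold lo
        for entry in leaf:
            if entry > hi:
                return result             # sorted order: nothing further qualifies
            if entry >= lo:
                result.append(entry)
    return result


print(range_search(25, 40))   # [25, 29, 30, 33, 40]
```

Only the leaves that overlap the range are read; the scan stops at the first entry beyond the
upper bound.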
------------------------------------------------------------------------------------------------------------
File Organization
Sequential File Organization:
It is one of the simplest methods of file organization. Here the records of a file are stored
one after the other in a sequential manner. This can be achieved in two ways:
Pile file method: Records are stored one after the other in the order in which they are
inserted into the table. When a new record is inserted, it is placed at the end of the file.
To modify or delete a record, the record is first searched for in the data blocks. Once it
is found, it is marked for deletion and, in the case of a modification, the new version of
the record is entered.

Inserting a new record:

In the diagram above, R1, R2, R3, etc. are the records. Each contains all the attributes of a
row; for example, a student record holds the student's id, name, address, course, DOB, etc.
Similarly, each of R1, R2, R3, etc. can be considered one full set of attributes.

Sorted file method: Records are kept in sorted order (ascending or descending) as they
are inserted into the system. Sorting of records may be based on the primary key or on any
other column. Whenever a new record is inserted, it is first placed at the end of the file,
and the file is then sorted (ascending or descending on the key value) so that the record
ends up at its correct position. In the case of an update, the record is updated and the
file is sorted again to place the updated record in the right place. The same applies to a
delete.
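The difference between the two methods can be sketched with Python lists standing in for
files. The integer keys are made-up sample values. Note one assumption: sorted_insert uses
bisect.insort to place the record directly at its position, which gives the same final order
as appending and then re-sorting but is not literally the append-then-sort procedure
described above.

```python
from bisect import insort


def pile_insert(records, key):
    """Pile file: the new record is appended at the end of the file."""
    records.append(key)


def sorted_insert(records, key):
    """Sorted file: the new record ends up at its position in key order."""
    insort(records, key)


pile = [1, 3, 6, 4, 5]          # records in arrival order
sorted_file = [1, 3, 4, 5, 6]   # the same records kept in key order

pile_insert(pile, 2)
sorted_insert(sorted_file, 2)
print(pile)          # [1, 3, 6, 4, 5, 2]
print(sorted_file)   # [1, 2, 3, 4, 5, 6]
```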

Inserting a new record:

Advantages of Sequential File Organization


• The design is very simple compared to other file organizations. Little effort is
involved in storing the data.
• When large volumes of data must be processed, this method is fast and efficient. It
is helpful when most of the records have to be accessed, as in calculating student
grades or generating salary slips, where all the records are used in the
calculation.
• This method is good for report generation and statistical calculations.
• These files can be stored on magnetic tapes, which are comparatively cheap.
Disadvantages of Sequential File Organization
• The sorted file method always involves the effort of sorting the records. Each time an
insert, update, or delete is performed, the file is sorted. Hence locating the
record, inserting, updating, or deleting it, and then sorting the file takes
time and may make the system slow.
-----------------------------------------------------------------------------------------------------------
Heap File Organization:
Suitable when typical access is a file scan retrieving all records.
Here records are inserted at the end of the file as they arrive. There is
no sorting or ordering of the records. Once a data block is full, the next record is
stored in a new block. This new block need not be the very next block; the DBMS
can select any available block to store the new records. This is similar to the pile file
method in sequential organization, except that data blocks are not selected sequentially.
It is the responsibility of the DBMS to store the records and manage them.

Insertion of a new record


Suppose we have five records R1, R3, R6, R4 and R5 in a heap file and we want to
insert a new record R2. If data block 3 is full, R2 will be inserted in any data block
selected by the DBMS, say data block 1.

If we want to search, update or delete data in heap file organization, we need to
traverse the data from the start of the file until we reach the requested record.
If the database is very large, searching, updating or deleting a record is time-consuming
because there is no sorting or ordering of records; we need to check every record until we
find the requested one.
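A small sketch of heap file insertion and search, using lists of records as data blocks. The
block layout and the capacity of two records per block are assumptions chosen so that the
example matches the text (block 3 full, R2 placed in block 1); the "first block with free
space" rule is just one possible choice a DBMS could make.

```python
def heap_insert(blocks, rec, capacity=2):
    """Place rec in the first block with free space, else allocate a new block."""
    for block in blocks:
        if len(block) < capacity:
            block.append(rec)
            return
    blocks.append([rec])


def heap_search(blocks, rec):
    """Linear scan; returns (block index or None, number of records examined)."""
    examined = 0
    for i, block in enumerate(blocks):
        for r in block:
            examined += 1
            if r == rec:
                return i, examined
    return None, examined


# Assumed layout: data block 1 has free space, data blocks 2 and 3 are full.
blocks = [["R1"], ["R3", "R6"], ["R4", "R5"]]
heap_insert(blocks, "R2")
print(blocks[0])                   # ['R1', 'R2']
print(heap_search(blocks, "R5"))   # (2, 6)
```

Finding R5 examines all six records, which illustrates why search cost grows linearly with
the size of a heap file.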
Advantages of Heap File Organization
• A very good method of file organization for bulk insertion: when a large amount of
data needs to be loaded into the database at one time, this method of file
organization is best suited. Records are simply inserted one after the other into the
data blocks.
• It is suited to very small files, as fetching records from them is fast. As the file
size grows, linear search for a record becomes time-consuming.
Disadvantages of Heap File Organization
• This method is inefficient for large databases, as it takes time to search for or modify
a record.
• Proper storage management is required to maintain performance. Otherwise there
will be many unused data blocks lying around, and the file will simply keep
growing.
------------------------------------------------------------------------------------------------------------
Indexes and Performance Tuning
The choice of indexes has a tremendous impact on system performance, and must be made
in the context of the expected workload, or typical mix of queries and update operations.
Impact of the Workload
• The first thing to consider is the expected workload and the common operations.
• Different file organizations and indexes support different operations well.
• In general, an index supports efficient retrieval of data entries that satisfy a given
selection condition.
• There are two important kinds of selections: equality selection and range selection.
• Hash-based indexing techniques are optimized only for equality selections and fare
poorly on range selections, where they are typically worse than scanning the entire file
of records.
• Tree-based indexing techniques support both kinds of selection conditions efficiently.
• Both tree and hash indexes can support inserts, deletes, and updates quite efficiently.
Tree-based indexes, in particular, offer a superior alternative to maintaining fully sorted
files of records.
They have two important advantages over sorted files:
• We can handle inserts and deletes of data entries efficiently.
• Finding the correct leaf page when searching for a record by search key value is much
faster than binary search of the pages in a sorted file.

Clustered Index Organization


• Clustered indexes, while less expensive to maintain than a fully sorted file, are
nonetheless expensive to maintain.
• When a new record has to be inserted into a full leaf page, a new leaf page must be
allocated and some existing records have to be moved to the new page.
• If records are identified by a combination of page id and slot, as is typically the case in
current database systems, all places in the database that point to a moved record must
also be updated to point to the new location. Locating all such places and making these
additional updates can involve several disk I/Os.
• Clustering must be used sparingly and only when justified by frequent queries that
benefit from clustering.
• In particular, there is no good reason to build a clustered file using hashing, since range
queries cannot be answered using hash indexes.

Composite Search Keys


• The search key for an index can contain several fields; such keys are called composite
search keys or concatenated keys.
• As an example, consider a collection of employee records, with fields name, age, and
sal, stored in sorted order by name.
• The figure illustrates the difference between a composite index with key (age, sal), a
composite index with key (sal, age), an index with key age, and an index with key sal.
• All indexes shown in the figure use Alternative (2) for data entries.
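The effect of field order in a composite key can be sketched by building the data entries for
each index. The four employee records are sample values assumed for illustration, the helper
build_index is hypothetical, and a record's rid is simplified to its position in the file
sorted by name. Each data entry is a (search key, rid) pair, as in Alternative (2).

```python
# Sample employee file, stored in sorted order by name (values are assumed).
employees = [
    {"name": "bob", "age": 12, "sal": 10},
    {"name": "cal", "age": 11, "sal": 80},
    {"name": "joe", "age": 12, "sal": 20},
    {"name": "sue", "age": 13, "sal": 75},
]


def build_index(records, key_fields):
    """Return the sorted (search key, rid) data entries for the given key fields."""
    entries = [(tuple(rec[f] for f in key_fields), rid)
               for rid, rec in enumerate(records)]
    return sorted(entries)


age_sal = build_index(employees, ("age", "sal"))
sal_age = build_index(employees, ("sal", "age"))

print([key for key, _ in age_sal])   # [(11, 80), (12, 10), (12, 20), (13, 75)]
print([key for key, _ in sal_age])   # [(10, 12), (20, 12), (75, 13), (80, 11)]
```

The same records appear in a different order in each index, which is why an index on
(age, sal) and an index on (sal, age) support different queries.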

Trade-offs in Choosing Composite Keys


• A composite key index can support a broader range of queries because it matches more
selection conditions.
• Further, since data entries in a composite index contain more information about the data
record (i.e., more fields than a single-attribute index), the opportunities for index-only
evaluation strategies are increased.
• On the negative side, a composite index must be updated in response to any operation
(insert, delete, or update) that modifies any field in the search key.
• A composite index is also likely to be larger than a single-attribute index
because the size of entries is larger.
• For a composite B+ tree index, this also means a potential increase in the number of
levels, although key compression can be used to alleviate this problem.
------------------------------------------------------------------------------------------------------------
Indexed Sequential Access Method (ISAM)
ISAM is an advanced sequential file organization method. The data entries of the ISAM
index are in the leaf pages of the tree and additional overflow pages chained to some leaf
page. Database systems carefully organize the layout of pages so that page boundaries
correspond closely to the physical characteristics of the underlying storage device. The
ISAM structure is completely static and facilitates such low-level optimizations. The
ISAM data structure is illustrated in Figure.

• Each tree node is a disk page, and all the data resides in the leaf pages.
• This corresponds to an index that uses Alternative (1) for data entries.
• The user can create an index with Alternative (2) by storing the data records in a
separate file and storing (key, rid) pairs in the leaf pages of the ISAM index.
• When the file is created, all leaf pages are allocated sequentially and sorted on the
search key value.
• The non-leaf level pages are then allocated. If there are several inserts to the file
subsequently, so that more entries are inserted into a leaf than will fit onto a single page,
additional pages are needed because the index structure is static. These additional pages
are allocated from an overflow area.
The allocation of pages is illustrated in the figure.

• The basic operations of insertion, deletion, and search are all quite straightforward.
• For an equality selection search, we start at the root node and determine which subtree
to search by comparing the value in the search field of the given record with the key
values in the node.
• For a range query, the starting point in the data (or leaf) level is determined similarly,
and data pages are then retrieved sequentially.
• For inserts and deletes, the appropriate page is determined as for a search, and the
record is inserted or deleted with overflow pages added if necessary.
Example:
The following example illustrates the ISAM index structure.

• Here, all searches begin at the root. For example, to locate a record with the key value
27, we start at the root and follow the left pointer, since 27 < 40. We then follow the
middle pointer, since 20 <= 27 < 33.
• For a range search, we find the first qualifying data entry as for an equality selection
and then retrieve primary leaf pages sequentially.
• Assume that each leaf page can contain two entries.
• If we now insert a record with key value 23, the entry 23* belongs in the second data
page, which already contains 20* and 27* and has no more space.
• We deal with this situation by adding an overflow page and putting 23* in the overflow
page. Chains of overflow pages can easily develop.
• For instance, inserting 48*, 41*, and 42* leads to an overflow chain of two pages.
• All these insertions are shown in the figure.
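The overflow behaviour in this example can be sketched as follows. Only the key 40 at the
root, the keys 20 and 33 below it, and the page holding 20* and 27* come from the text; the
remaining leaf contents and separator keys are assumed sample values, and the two non-leaf
levels are flattened into a single list of separators to keep the sketch short. The class
Leaf and the function find_leaf are hypothetical names.

```python
from bisect import bisect_right

CAPACITY = 2   # entries per page, as assumed in the example


class Leaf:
    def __init__(self, entries):
        self.entries = list(entries)   # primary page, allocated at file creation
        self.overflow = []             # chain of overflow pages (lists of entries)

    def insert(self, key):
        if len(self.entries) < CAPACITY:
            self.entries.append(key)
            self.entries.sort()
            return
        # Primary page is full: use the last overflow page, or chain a new one.
        if not self.overflow or len(self.overflow[-1]) >= CAPACITY:
            self.overflow.append([])
        self.overflow[-1].append(key)

    def contains(self, key):
        return key in self.entries or any(key in page for page in self.overflow)


# Assumed sample file with six primary leaf pages; the index itself never changes.
separators = [20, 33, 40, 51, 63]   # flattened view of the static non-leaf levels
leaves = [Leaf(p) for p in ([10, 15], [20, 27], [33, 37],
                            [40, 46], [51, 55], [63, 97])]


def find_leaf(key):
    return leaves[bisect_right(separators, key)]


for k in (23, 48, 41, 42):
    find_leaf(k).insert(k)

print(leaves[1].overflow)   # [[23]]
print(leaves[3].overflow)   # [[48, 41], [42]]
```

The leaf holding 40* and 46* ends up with an overflow chain of two pages, while the separator
keys stay exactly as they were when the file was created.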

------------------------------------------------------------------------------------------------------------
B+ Trees
A B+ tree is a (key, value) storage method with a tree-like structure. It has one root, any
number of intermediate nodes, and a level of leaf nodes.
• The leaf nodes store the actual records (the data entries).
• Intermediate nodes hold only keys and pointers that lead toward the leaf nodes; they do
not hold any data.
• Every node except the root holds between d and 2d entries, where d is the order of the
tree.

• The B+ tree search structure, which is widely used, is a balanced tree in which the
internal nodes direct the search and the leaf nodes contain the data entries.
• Since the tree structure grows and shrinks dynamically, it is not feasible to allocate the
leaf pages sequentially as in ISAM, where the set of primary leaf pages was static.
• To retrieve all leaf pages efficiently, we have to link them using page pointers. By
organizing them into a doubly linked list, we can easily traverse the sequence of leaf
pages in either direction.

The reasons for using B+ Tree:

• Keys are primarily used to aid the search by directing it to the proper leaf.
• A B+ tree uses a "fill factor" to manage the growth and shrinkage of the tree.
• In B+ trees, many keys fit on one page because the interior nodes carry no data
records. Therefore a search reaches the data on the leaf nodes quickly.
• A full scan of all the elements in the tree needs just one linear pass, because all the
leaf nodes of a B+ tree are linked to each other.

Search:
In a B+ tree, search is one of the simplest operations and returns results quickly.
The following search algorithm is applicable:
• To find the required record, start at the root and perform a binary search on the keys
in each node to choose the child pointer to follow, until a leaf node is reached.
• If the leaf contains an exact match for the search key, the corresponding record is
returned to the user.
• If the search key is not found in the leaf node, a "not found" message is displayed to
the user.
• Because the tree is balanced, every search follows a path of the same length from the
root to a leaf.

Consider the sample B+ tree shown in the figure. This B+ tree is of order d = 2; that is, each
node except the root contains between 2 and 4 entries. Each non-leaf entry is a (key value,
node pointer) pair; at the leaf level, the entries are data records that we denote by k*. To
search for entry 5*, we follow the left-most child pointer, since 5 < 13. To search for the
entries 14* or 15*, we follow the second pointer, since 13 ≤ 14 < 17 and 13 ≤ 15 < 17. To
find 24*, we follow the fourth child pointer, since 24 ≤ 24 < 30.
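The pointer choices in this example can be checked with a short sketch. The root keys 13, 17,
24 and 30 follow from the comparisons in the text; the leaf contents are assumed sample
values, since the figure is not reproduced here, and the function name search is
hypothetical.

```python
from bisect import bisect_right

root_keys = [13, 17, 24, 30]
leaves = [                      # assumed leaf contents, one list per leaf page
    [2, 3, 5, 7],
    [14, 16],
    [19, 20, 22],
    [24, 27, 29],
    [33, 34, 38, 39],
]


def search(key):
    """Return (child pointer followed, counted from 1; whether key* was found)."""
    i = bisect_right(root_keys, key)   # binary search among the root's keys
    return i + 1, key in leaves[i]


print(search(5))    # (1, True)   left-most pointer, since 5 < 13
print(search(14))   # (2, True)   second pointer, since 13 <= 14 < 17
print(search(15))   # (2, False)  right leaf reached, but 15* is not stored
print(search(24))   # (4, True)   fourth pointer, since 24 <= 24 < 30
```

bisect_right sends a key equal to a separator to the right of it, which matches the rule that
the pointer to the right of k covers entries greater than or equal to k.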

Insert:
The algorithm for insertion takes an entry, finds the leaf node where it belongs, and inserts
it there. The basic idea behind the algorithm is that we recursively insert the entry by calling
the insert algorithm on the appropriate child node.
• Usually, this procedure results in going down to the leaf node where the entry belongs,
placing the entry there, and returning all the way back to the root node.
• Occasionally a node is full and it must be split.
• When the node is split, an entry pointing to the node created by the split must be inserted
into its parent; this entry is pointed to by the pointer variable newchildentry.
• If the (old) root is split, a new root node is created and the height of the tree increases
by 1.

Suppose we have to insert a record 60 in the structure below. It belongs in the 3rd leaf
node, after 55. That leaf node is already full, so we cannot simply insert the record
there, yet it must be inserted without violating the fill factor, balance and order of the
tree. So the only option is to split the leaf node. But how do we split it?

With the new entry, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and the key
in its parent node is 50. We split the leaf node in the middle so that the balance is not
altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes. For these two to be leaf
nodes, the intermediate node cannot branch from 50 alone. The key 60 must be added to it,
along with a pointer to the new leaf node.

This is how we insert a new entry when there is an overflow. In the normal case, we simply
find the leaf node where the entry fits and place it there.
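The split in this example can be sketched as below. The function name split_leaf is
hypothetical, and the parent is reduced to a list holding only the key 50 mentioned in the
text; any other keys of the parent node are not shown.

```python
from bisect import insort


def split_leaf(entries):
    """Split an overfull, sorted leaf.

    Returns (left leaf, right leaf, key to copy up into the parent).
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]


overfull = [50, 55, 60, 65, 70]    # the 3rd leaf after 60 has been added
left, right, up = split_leaf(overfull)

parent_keys = [50]                 # only the key named in the text is shown
insort(parent_keys, up)            # the parent gains a key for the new leaf

print(left, right)    # [50, 55] [60, 65, 70]
print(parent_keys)    # [50, 60]
```

The smallest key of the new right leaf is copied into the parent, so 60 appears both in the
leaf and in the intermediate node.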

Delete:
The following algorithm is applicable while deleting an element from the B+ Tree:

• First, locate the leaf entry in the tree that holds the key and pointer, and delete that
entry from the leaf.
• If the leaf is still at least half full after the deletion, the operation is complete;
otherwise the leaf has fewer than the minimum number of entries and must be repaired.
• If a neighboring leaf on the right or left can spare entries, move some of them to the
underfull leaf. If neither neighbor can spare an entry, merge the leaf with a neighboring
leaf.
• After a leaf is merged with its neighbor on the right or left, the entry in the parent
node that pointed to the removed leaf is deleted.
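The leaf-level part of these steps can be sketched as follows, assuming a tree of order d
(each leaf must keep at least d entries) and considering only the right-hand neighbor. The
function name and return values are hypothetical, and updating the parent node's keys, which
a full implementation must also do, is left to the caller.

```python
def delete_from_leaf(leaf, sibling, key, d):
    """Delete key from leaf (a sorted list); sibling is the adjacent leaf to its right.

    Returns "done", "redistributed" or "merged". The caller is expected to
    adjust the parent's separator key afterwards.
    """
    leaf.remove(key)
    if len(leaf) >= d:
        return "done"                  # still at least half full
    if len(sibling) > d:
        leaf.append(sibling.pop(0))    # borrow the sibling's smallest entry
        return "redistributed"
    leaf.extend(sibling)               # merge: the sibling's entries move into leaf
    sibling.clear()
    return "merged"


a, b = [50, 55], [60, 65, 70]
print(delete_from_leaf(a, b, 55, d=2), a, b)   # redistributed [50, 60] [65, 70]

c, e = [50, 55], [60, 65]
print(delete_from_leaf(c, e, 50, d=2), c, e)   # merged [55, 60, 65] []
```

After a redistribution the parent's separator must change to the sibling's new smallest key
(65 in the first call), and after a merge the parent's entry for the emptied sibling must be
removed.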

Suppose we have to delete 60 from the above example. What will happen in this case? We
have to remove 60 from the 4th leaf node as well as from the intermediate node. If we
simply remove it from the intermediate node, the tree will no longer satisfy the B+ tree
rules, so we need to rearrange the nodes to keep the tree balanced. After deleting 60 from
the above B+ tree and rearranging the nodes, it will appear as below.

Suppose we have to delete 15 from the above tree. We traverse to the 1st leaf node and
simply delete 15 from that node. There is no need for any rearrangement, as the tree remains
balanced and 15 does not appear in the intermediate node.

------------------------------------------------------------------------------------------------------------
