UNIT – V
Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and Secondary Indexes, Index Data
Structures, Hash Based Indexing, Tree Based Indexing, Comparison of File Organizations, Indexes and Performance Tuning,
Intuitions for Tree Indexes, Indexed Sequential Access Method (ISAM), B+ Trees: A Dynamic Index Structure.
2. FILE ORGANIZATION
The database is stored as a collection of files. Each file contains a set of records. Each record
is a collection of fields. For example, a student table (or file) contains many records and each
record belongs to one student with fields (attributes) such as Name, Date of birth, class,
department, address, etc.
File organization defines how file records are mapped onto disk blocks.
The records of a file are stored in the disk blocks because a block is the unit of data transfer
between disk and memory. When the block size is larger than the record size, each block will
contain more than one record. Sometimes, some of the files may have large records that cannot
fit in one block. In this case, we can store part of a record on one block and the rest on another. A
pointer at the end of the first block points to the block containing the remainder of the record.
[Figure: Types of file organization - heap, sequential, hash and clustered]
Heap File Organization: When a file is created using the heap file organization mechanism, the
records are stored in the file in the order in which they are inserted, so new records are
inserted at the end of the file. In this type of organization, inserting new records is efficient,
but searching requires a linear scan of the file.
Sequential File Organization: When a file is created using Sequential File Organization
mechanism, all the records are ordered (sorted) as per the primary key value and placed in the
file. In this type of organization, inserting new records is more difficult because the records need
to be kept in sorted order after every insertion. It uses binary search to search records.
Hash File Organization: When a file is created using the hash file organization mechanism, a
hash function is applied to some field of each record to calculate a hash value. Based on the hash
value, the corresponding record is placed in the file.
Clustered File Organization: Clustered file organization is not considered good for large
databases. In this mechanism, related records from one or more relations are kept in the same disk
block; that is, the ordering of records is not based on the primary key or a search key.
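To make the difference between these organizations concrete, the following Python sketch contrasts searching a heap file (linear scan over records in insertion order) with searching a sequential file (binary search over records sorted by key). The records, field names and layout here are illustrative assumptions, not data from the notes above.

```python
from bisect import bisect_left

# Records are (key, data) tuples; the values are illustrative only.
heap_file = [(17, "rec A"), (3, "rec B"), (42, "rec C"), (8, "rec D")]   # insertion order
sorted_file = sorted(heap_file, key=lambda r: r[0])                       # ordered by key

def heap_search(records, key):
    """Heap file: scan every record until the key is found (linear search)."""
    for r in records:
        if r[0] == key:
            return r
    return None

def sequential_search(records, key):
    """Sequential file: records are sorted by key, so binary search works."""
    keys = [r[0] for r in records]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[i]
    return None

print(heap_search(heap_file, 42))          # (42, 'rec C') after scanning 3 records
print(sequential_search(sorted_file, 42))  # (42, 'rec C') found by binary search
```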
3. INDEXING
If the records in a file are in sorted order, then searching becomes very fast. But in
most cases, records are placed in the file in the order in which they are inserted, so new
records are inserted at the end of the file; this means the records are not in sorted order. In
order to make searching faster in files with unsorted records, indexing is used.
Indexing is a data structure technique which allows you to quickly retrieve records from a
database file. An index is a small table having only two columns: the first column contains a
copy of the primary or candidate key of the table, and the second column contains the set of disk
block addresses where the record with that specific key value is stored.
[Figure: Classification of indexes - primary, secondary / multi-level, and clustering index]
i. Primary Index
If the index is created by using the primary key of the table, then it is known as primary
indexing.
As primary keys are unique and are stored in a sorted manner, the performance of the
searching operation is quite efficient.
The primary index can be classified into two types: dense index and sparse index.
Dense index
If every record in the table has one index entry in the index table, then it is called a dense
index.
In this case, the number of records (rows) in the index table is the same as the number of records
(rows) in the main table.
As every record has one index entry, searching becomes faster.
Index    Main table (records)
TS       TS   Hyderabad    KCR
AP       AP   Amaravathi   Jagan
TN       TN   Madras       Palani Swamy
MH       MH   Bombay       Thackray
Sparse index
If only a few records in the table have index entries in the index table, then it is called a
sparse index.
In this case, the number of records (rows) in the index table is less than the number of records
(rows) in the main table.
As not all records have index entries, searching is slower for records that do not have their
own index entry.
Index    Main table (records)
TS       TS   Hyderabad    KCR
TN       AP   Amaravathi   Jagan
MH       TN   Madras       Palani Swamy
         MH   Bombay       Thackray
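The lookup logic behind the two index types can be sketched as follows. This is an illustrative Python sketch, not part of the notes: the blocks, keys and the bisect-based search are assumptions chosen for the example. A dense index locates the target block directly from the key, while a sparse index finds the closest preceding index entry and then scans that block.

```python
from bisect import bisect_right

# Main file: blocks of records ordered by the key (state code); data is illustrative.
blocks = [
    [("AP", "Amaravathi"), ("MH", "Bombay")],
    [("TN", "Madras"),     ("TS", "Hyderabad")],
]

# Dense index: one (key, block number) entry per record.
dense = [("AP", 0), ("MH", 0), ("TN", 1), ("TS", 1)]

# Sparse index: one entry per block, holding the first key stored in that block.
sparse = [("AP", 0), ("TN", 1)]

def dense_lookup(key):
    for k, b in dense:                       # every key is present in the index
        if k == key:
            return next(r for r in blocks[b] if r[0] == key)
    return None

def sparse_lookup(key):
    keys = [k for k, _ in sparse]
    i = bisect_right(keys, key) - 1          # last index entry whose key <= search key
    if i < 0:
        return None
    _, b = sparse[i]
    return next((r for r in blocks[b] if r[0] == key), None)   # scan that block

print(dense_lookup("TS"))    # ('TS', 'Hyderabad')
print(sparse_lookup("MH"))   # ('MH', 'Bombay'), reached via the 'AP' index entry
```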
ii. Secondary Index
In secondary indexing, to reduce the size of the mapping, another level of indexing is introduced.
It contains two levels: in the first level, each record in the main table has one entry in the
first-level index table.
The index entries in the first-level index table are divided into different groups. For each
group, one index entry is created and added to the second-level index table.
Multi-level Index: When the main table becomes too large, creating a second-level index
improves the search process. If the search process is still slow, we can add one more level of
indexing, and so on. This type of indexing is called a multi-level index.
Clustering Index
Sometimes the index is created on non-primary key columns which may not be unique
for each record.
In this case, to identify the records faster, we group two or more columns to get a
unique value and create an index out of them. This method is called a clustering index.
The records which have similar characteristics are grouped, and indexes are created for
these groups.
Example: Consider a college that contains many students in each department. All the students
belonging to the same Dept_ID are grouped together and treated as a single cluster. One index
pointer points to one cluster as a whole: the index pointer points to the first record in each
cluster. Here, Dept_ID is a non-unique key.
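A clustering index on Dept_ID can be sketched as below. This is an illustrative Python sketch: the student records, department codes and roll numbers are assumptions, not data from the notes. It only shows the idea that one index pointer leads to the first record of each cluster, and the remaining records of the cluster are read by scanning forward.

```python
# Student records clustered (grouped) by the non-unique key Dept_ID; data is illustrative.
students = [
    ("CSE", "S101"), ("CSE", "S102"), ("CSE", "S103"),
    ("ECE", "S201"), ("ECE", "S202"),
    ("MEC", "S301"),
]

# Clustering index: one entry per distinct Dept_ID, pointing to the
# position of the first record of that cluster.
cluster_index = {}
for pos, (dept, _) in enumerate(students):
    cluster_index.setdefault(dept, pos)

def records_of(dept):
    """Start at the first record of the cluster and scan while Dept_ID matches."""
    start = cluster_index.get(dept)
    if start is None:
        return []
    out = []
    for d, roll in students[start:]:
        if d != dept:
            break
        out.append(roll)
    return out

print(cluster_index)       # {'CSE': 0, 'ECE': 3, 'MEC': 5}
print(records_of("ECE"))   # ['S201', 'S202']
```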
4. HASH BASED INDEXING
In hash based indexing, a hash function is applied to the search key value of a record to calculate
the address of the bucket (disk block) in which the record is stored. The concept of hashing and
the hash table is illustrated in the figures below.
STATIC HASHING
In static hashing, the hash function produces only a fixed number of hash values. For
example, consider the hash function
f(x) = x mod 7
For any value of x, the above function produces one of the hash values from {0, 1, 2, 3, 4, 5, 6}. It
means static hashing maps search-key values to a fixed set of bucket addresses. Suppose the values
10, 21, 16 and 12 are inserted; they go to the slots 3, 0, 2 and 5 respectively.

Hash Value   Data Record
0            21*
1
2            16*
3            10*
4
5            12*
6

Figure 5.1: Hash table after inserting 10, 21, 16 and 12

Suppose later we want to insert 23; it produces the hash value 2 (23 mod 7 = 2). But in the above
hash table, the slot with hash value 2 is not empty (it contains 16*), so a collision occurs. To
resolve this collision, the following techniques are used.
o Open addressing
o Separate Chaining or Closed addressing
i. Open Addressing:
Open addressing is a collision resolving technique which stores all the keys inside the
hash table. No key is stored outside the hash table. Techniques used for open addressing are:
o Linear Probing
o Quadratic Probing
o Double Hashing
Linear Probing:
In linear probing, when there is a collision, we scan forward for the next empty slot and
place the key's record there. If the last slot is reached, the scan wraps around to the beginning
of the table.
Example: Consider Figure 5.1. When we try to insert 23, its hash value is 2, but slot 2 is
not empty. We move to the next slot (hash value 3); it is also full, so we move once more to
the slot with hash value 4. As it is empty, 23 is stored there. This is shown in the diagram
below.
Hash Value   Data Record
0            21*
1
2            16*
3            10*
4            23*
5            12*
6

Figure 5.2: Linear Probing
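A minimal Python sketch of linear probing with the hash function f(x) = x mod 7 is given below. The table size of 7 and the inserted keys come from the example above; the function and variable names are illustrative assumptions.

```python
TABLE_SIZE = 7
table = [None] * TABLE_SIZE           # one slot per hash value 0..6

def h(x):
    return x % 7                      # the static hash function f(x) = x mod 7

def insert_linear(x):
    """Insert x, scanning forward (with wrap-around) from h(x) until an empty slot is found."""
    i = h(x)
    for step in range(TABLE_SIZE):
        slot = (i + step) % TABLE_SIZE
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("hash table is full")

for key in (10, 21, 16, 12):          # initial contents, as in Figure 5.1
    insert_linear(key)
print(insert_linear(23))              # h(23) = 2 is taken, 3 is taken, so 23 goes to slot 4
print(table)                          # [21, None, 16, 10, 23, 12, None]
```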
Quadratic Probing:
In quadratic probing, when a collision occurs, a new hash value is computed by taking the
original hash value and adding successive squares until an open slot is found. If there is a
collision, the following hash function is used: h(x) = (f(x) + i²) mod n, where i = 1, 2, 3, 4, ...
and f(x) is the initial hash value.
Example: Consider Figure 5.1. When we try to insert 23, its hash value is 2, but the slot
with hash value 2 is not empty. We compute a new hash value as (2 + 1²) mod 7 = 3; it is also
full, so we compute another new hash value as (2 + 2²) mod 7 = 6. As it is empty, 23 is stored
there. This is shown in the diagram below.
Hash Value   Data Record
0            21*
1
2            16*    (f(23) = 23 mod 7 = 2 is occupied)
3            10*
4
5            12*
6            23*

Figure 5.3: Quadratic Probing
Double Hashing
In double hashing, there are two hash functions. The second hash function is used to
provide an offset value in case the first function causes a collision. The following
function is an example of double hashing: (firstHash(key) + i * secondHash(key)) %
tableSize. Use i = 1, 2, 3, …
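The probe sequences produced by quadratic probing and double hashing can be sketched as follows. This is an illustrative Python sketch: the second hash function f2 chosen here is only an assumption for the example, since the notes do not prescribe a particular one.

```python
TABLE_SIZE = 7

def f(x):
    return x % TABLE_SIZE                       # first hash function

def f2(x):
    return 5 - (x % 5)                          # second hash function (illustrative choice)

def quadratic_probes(x, tries=4):
    """Slots examined for key x under quadratic probing: (f(x) + i^2) mod n."""
    return [(f(x) + i * i) % TABLE_SIZE for i in range(tries)]

def double_probes(x, tries=4):
    """Slots examined under double hashing: (f(x) + i * f2(x)) mod n."""
    return [(f(x) + i * f2(x)) % TABLE_SIZE for i in range(tries)]

print(quadratic_probes(23))   # [2, 3, 6, 4] -> slots 2 and 3 are full, so 23 lands in slot 6
print(double_probes(23))      # [2, 4, 6, 1] with f2(23) = 5 - 3 = 2
```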
ii. Separate Chaining or Closed addressing:
To handle a collision, this technique creates a linked list at the slot for which the collision
occurs. The new key is then inserted into that linked list. These linked lists attached to the slots
look like chains, so this technique is called separate chaining. It is also called closed addressing.
Example: Inserting 10, 21, 16, 12, 23, 19, 28, 30 in hash table.
f(x) = x mod 7: f(10) = 3, f(21) = 0, f(16) = 2, f(12) = 5, f(23) = 2, f(19) = 5, f(28) = 0, f(30) = 2

Slot   Chain
0      21 -> 28
1
2      16 -> 23 -> 30
3      10
4
5      12 -> 19
6
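A short Python sketch of separate chaining for the same insertions is given below; the list-of-lists representation of the chains is an illustrative choice.

```python
TABLE_SIZE = 7
chains = [[] for _ in range(TABLE_SIZE)]      # one (initially empty) chain per slot

def insert_chained(x):
    slot = x % TABLE_SIZE                     # f(x) = x mod 7
    chains[slot].append(x)                    # colliding keys simply extend the chain

for key in (10, 21, 16, 12, 23, 19, 28, 30):
    insert_chained(key)

for slot, chain in enumerate(chains):
    print(slot, chain)
# 0 [21, 28]
# 1 []
# 2 [16, 23, 30]
# 3 [10]
# 4 []
# 5 [12, 19]
# 6 []
```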
Dynamic Hashing
The drawback of static hashing is that it does not expand or shrink dynamically as the size of the database
grows or shrinks. In dynamic hashing, data buckets grow or shrink (are added or removed dynamically) as
the records increase or decrease. Dynamic hashing is also known as extendible hashing. In dynamic
hashing, the hash function is made to produce a large number of values. For example, suppose there are
three data records D1, D2 and D3.
The hash function generates the three addresses 1001, 0101 and 1010 respectively. This method of storing
considers only part of the address (say, only the first bit) to store the data. So it tries to place all
three of them at addresses 0 and 1.
But the problem is that no bucket address remains for D3. The buckets have to grow dynamically to
accommodate D3. So the address is changed to use 2 bits rather than 1 bit, the existing data is updated
to 2-bit addresses, and then D3 is accommodated.
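The growth mechanism can be sketched as follows. This is a simplified, illustrative Python sketch: it rebuilds all buckets instead of splitting only the overflowing one, and the record names, addresses and bucket capacity are assumptions rather than the D1-D3 values above. It only shows how the directory keeps doubling the number of address bits until no bucket overflows.

```python
BUCKET_CAPACITY = 1
records = {"R1": "00", "R2": "01", "R3": "10"}   # record -> binary hash address (illustrative)

def distribute(bits):
    """Place each record in the bucket named by the first `bits` bits of its address."""
    buckets = {format(i, f"0{bits}b"): [] for i in range(2 ** bits)}
    for name, addr in records.items():
        buckets[addr[:bits]].append(name)
    return buckets

bits = 1
buckets = distribute(bits)
while any(len(b) > BUCKET_CAPACITY for b in buckets.values()):
    bits += 1                                     # a bucket overflowed: double the directory
    buckets = distribute(bits)

print(bits)      # 2 -> one address bit was not enough, two bits are
print(buckets)   # {'00': ['R1'], '01': ['R2'], '10': ['R3'], '11': []}
```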
5. TREE BASED INDEXING
We can use tree-like structures as indexes as well. For example, a binary search tree can
also be used as an index. If we want to find a particular record from a binary search tree, we
have the added advantage of the binary search procedure, which makes searching even
faster. A binary search tree can be considered a 2-way search tree, because it has two pointers in
each of its nodes and can therefore guide the search in two distinct directions. Remember that for a
node storing 2 pointers, the number of values stored in the node is one less than the number of
pointers, i.e. each node contains 1 value.
The abovementioned concept can be further expanded with the notion of the m-Way
Search Tree, where m represents the number of pointers in a particular node. If m = 3, then each
node of the search tree contains 3 pointers, and each node would then contain 2 values.
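A minimal sketch of searching in such an m-way search tree (m = 3, so 2 values and 3 pointers per node) is shown below. The class name, keys and tree shape are illustrative assumptions, not taken from the notes.

```python
# One node of an m-way search tree (here m = 3): at most m - 1 keys and m child pointers.
class MWayNode:
    def __init__(self, keys, children=None):
        self.keys = keys                   # at most m - 1 sorted keys
        self.children = children or []     # at most m child nodes

def search(node, key):
    """Walk down the tree: in each node, find the subtree whose key range contains `key`."""
    if node is None:
        return False
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and key == node.keys[i]:
        return True
    child = node.children[i] if i < len(node.children) else None
    return search(child, key)

leaf1 = MWayNode([2, 5])
leaf2 = MWayNode([12, 16])
leaf3 = MWayNode([30, 40])
root = MWayNode([10, 20], [leaf1, leaf2, leaf3])

print(search(root, 16))   # True  (10 < 16 <= 20, so the middle subtree is searched)
print(search(root, 7))    # False
```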
Indexed Sequential Access Method (ISAM)
ISAM is a static, tree-structured index: the non-leaf (index) pages are allocated when the file is
created and do not change afterwards; later insertions go into the leaf pages, or into overflow
pages chained to the leaves when a leaf is full.
Advantages of ISAM:
Faster retrieval compared to pure sequential methods.
Suitable for applications with a mix of sequential and random access.
Disadvantages of ISAM :
Index maintenance can add overhead in terms of storage and update operations.
Not as efficient as fully indexed methods for random access.
Insertion operation: First locate the leaf node where the insertion is to take place (use binary
search). After finding the leaf node, insert the record in that leaf node if space is available;
otherwise create an overflow node, insert the record index in it, and link the overflow node to
the leaf node.
Deletion operation: First locate the leaf node where the deletion is to take place (use binary
search). After finding the leaf node, if the value to be deleted is in the leaf node or in an
overflow node, remove it. If the overflow node becomes empty after removing the deleted value,
then delete the overflow node as well.
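The two operations can be sketched as follows. This is an illustrative Python sketch, assuming a fixed set of sorted leaf pages with a small capacity and one overflow list per leaf; the leaf contents echo part of the example that follows, while the names and the capacity of 2 are assumptions.

```python
from bisect import bisect_right

LEAF_CAPACITY = 2

# Static, sorted leaf pages (allocated once), each with its own overflow chain.
leaves = [[10, 20], [23, 27], [31, 35]]
overflow = [[] for _ in leaves]
first_keys = [leaf[0] for leaf in leaves]       # used to locate the right leaf

def locate(key):
    """Binary search over the first key of each leaf to find the target leaf."""
    return max(0, bisect_right(first_keys, key) - 1)

def insert(key):
    i = locate(key)
    if len(leaves[i]) < LEAF_CAPACITY:
        leaves[i].append(key)
        leaves[i].sort()
    else:
        overflow[i].append(key)                 # leaf full: chain into overflow

def delete(key):
    i = locate(key)
    if key in leaves[i]:
        leaves[i].remove(key)
    elif key in overflow[i]:
        overflow[i].remove(key)

insert(24)                                      # leaf [23, 27] is full -> goes to overflow
insert(33)
print(leaves, overflow)   # [[10, 20], [23, 27], [31, 35]] [[], [24], [33]]
delete(24)
print(overflow)           # [[], [], [33]]
```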
Example: Insert 10, 23, 31, 20, 68, 35, 42, 61, 27, 71, 46 and 59
Root:        [31]
Index nodes: [23]   [42 59 68]
Leaves:      [10 20] [23 27] [31 35] [42 46] [59 61] [68 71]
After inserting 24, 33, 36, and 39 in the above tree, it looks like
Root:        [31]
Index nodes: [23]   [42 59 68]
Leaves:      [10 20] [23 27] [31 35] [42 46] [59 61] [68 71]
Overflow:    [24] (chained to [23 27])   [33 36] -> [39] (chained to [31 35])
Deletion: From the above figure, after deleting 42, 71, 24 and 36
Root:        [31]
Index nodes: [23]   [42 59 68]
Leaves:      [10 20] [23 27] [31 35] [46] [59 61] [68]
Overflow:    [33] -> [39] (chained to [31 35])
6. B+ TREE
A B+ tree is a balanced m-way search tree that allows efficient insertion, deletion and search
operations. It is used to implement indexing in a DBMS. In a B+ tree, data records (or pointers to
them) are stored only in the leaf nodes, while the internal nodes store only search key values.
Example: Searching for 35 in the B+ tree given below. The search follows the path [18] -> [31 64] -> [42 59] -> leaf [31 35].
Root:    [18]
Level 2: [11]   [31 64]
Level 3: [8] [15]   [23] [42 59] [68]
Leaves:  [2 5] [8 9] [11 12] [15 16] [18 20] [23 27] [31 35] [42 46] [59 61] [64 66] [68 71]
B+ Insertion:
1. Apply the search operation on the B+ tree to find the leaf node where the new value has to be inserted.
2. If the leaf node is not full, then insert the value in the leaf node.
3. If the leaf node is full, then
a. Split that leaf node, including the newly inserted value, into two nodes such that each
contains half of the values (if the count is odd, the 2nd node gets the extra value).
b. Insert the smallest value from the new right leaf node (the 2nd node) into the parent node.
Add pointers from these new leaf nodes to their parent node.
c. If the parent is full, split it too. Add the middle key (if the count is even, the 1st value
of the 2nd half) of this parent node to its parent node.
d. Repeat until a parent is found that does not need to split.
4. If the root splits, create a new root which has one key and two pointers.
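Step 3 (splitting a full leaf and pushing the smallest key of the new right node up) can be sketched as follows, assuming a leaf holds at most 3 keys as in the worked example below; the function name and list representation are illustrative assumptions.

```python
def split_leaf(keys, new_key):
    """Insert new_key into a full leaf (given as a sorted list), split the result in half,
    and return (left_leaf, right_leaf, key_copied_to_parent)."""
    keys = sorted(keys + [new_key])
    mid = len(keys) // 2           # if the count is odd, the 2nd (right) node gets the extra value
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]   # the smallest value of the new right leaf goes to the parent

print(split_leaf([1, 3, 5], 7))    # ([1, 3], [5, 7], 5)  -> matches "After inserting 7" below
print(split_leaf([1, 2, 3], 4))    # ([1, 2], [3, 4], 3)  -> matches "After inserting 4" below
```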
Initially: the tree is empty.

After inserting 1:
Leaf: [1]

After inserting 5:
Leaf: [1 5]

After inserting 3:
Leaf: [1 3 5]

After inserting 7: the leaf [1 3 5] is full, so it is split into [1 3] and [5 7], and 5 is copied up into a new root.
Root:   [5]
Leaves: [1 3] [5 7]

After inserting 9:
Root:   [5]
Leaves: [1 3] [5 7 9]

After inserting 2:
Root:   [5]
Leaves: [1 2 3] [5 7 9]
After inserting 4: the leaf [1 2 3] is full, so it is split into [1 2] and [3 4], and 3 is copied up.
Root:   [3 5]
Leaves: [1 2] [3 4] [5 7 9]

After inserting 6: the leaf [5 7 9] is full, so it is split into [5 6] and [7 9], and 7 is copied up.
Root:   [3 5 7]
Leaves: [1 2] [3 4] [5 6] [7 9]

After inserting 8:
Root:   [3 5 7]
Leaves: [1 2] [3 4] [5 6] [7 8 9]

After inserting 10: the leaf [7 8 9] is full, so it is split into [7 8] and [9 10], and 9 is copied up. The root [3 5 7 9] is now over-full, so it also splits and 7 moves up into a new root:
Root:     [7]
Internal: [3 5]   [9]
Leaves:   [1 2] [3 4] [5 6] [7 8] [9 10]
B+ Deletion
Identify the leaf node L from where the deletion should take place.
Remove the data value to be deleted from the leaf node L.
If L still meets the "half full" criteria, then we are done.
If L does not meet the "half full" criteria, then
o If L's right sibling can give a data value, then move the smallest value in the right sibling
to L (after giving a data value, the right sibling must still satisfy the half full criteria;
otherwise it should not give).
o Else, if L's left sibling can give a data value, then move the largest value in the left
sibling to L (after giving a data value, the left sibling must still satisfy the half full
criteria; otherwise it should not give).
o Else, merge L and a sibling.
o If any internal node (including the root) contains a key value equal to the deleted value,
then delete that value and replace it with its successor. This deletion may propagate up to
the root (if the changes propagate up to the root, the tree height decreases).
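The underflow handling (borrow from a sibling if possible, otherwise merge) can be sketched as follows. This is an illustrative Python sketch that only looks at a leaf and its right sibling; the "half full" threshold of 2 keys and the data values mirror the deletion example that follows, while the function name is an assumption.

```python
MIN_KEYS = 2   # "half full" for a leaf that holds at most 4 keys, as in the example below

def fix_underflow(leaf, right_sibling):
    """Return (leaf, right_sibling, new_separator_key) after borrowing or merging."""
    if len(leaf) >= MIN_KEYS:
        return leaf, right_sibling, right_sibling[0]       # no underflow: nothing to do
    if len(right_sibling) > MIN_KEYS:
        # Borrow: move the smallest key of the right sibling into the leaf.
        leaf = leaf + [right_sibling[0]]
        right_sibling = right_sibling[1:]
        return leaf, right_sibling, right_sibling[0]        # parent key becomes the new smallest
    # Merge: the right sibling's keys move into the leaf and the sibling disappears.
    return leaf + right_sibling, [], None

# Delete 20 from leaf [20, 22]: the leaf underflows and borrows 24 from [24, 27, 29].
print(fix_underflow([22], [24, 27, 29]))   # ([22, 24], [27, 29], 27)
# Delete 24 from leaf [22, 24]: the sibling cannot give, so the two leaves merge.
print(fix_underflow([22], [27, 29]))       # ([22, 27, 29], [], None)
```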
Root:     [19]
Internal: [5 14]   [24 33]
Leaves:   [2 3] [5 7] [14 16] [19 20 22] [24 27 29] [33 34 38 39]
Delete 19: The half full criteria is still satisfied after deleting 19, so 19 is simply removed from the leaf node.
Root:     [19]
Internal: [5 14]   [24 33]
Leaves:   [2 3] [5 7] [14 16] [20 22] [24 27 29] [33 34 38 39]
Delete 20: The half full criteria is not satisfied after deleting 20, so 24 is borrowed from the right
sibling and the key values in the internal nodes are changed.
Root:     [19]
Internal: [5 14]   [27 33]
Leaves:   [2 3] [5 7] [14 16] [22 24] [27 29] [33 34 38 39]
Delete 24: The half full criteria is not satisfied after deleting 24, and borrowing a value from a
sibling is not possible. Therefore the leaf is merged with its right sibling and the key values in
the internal nodes are changed.
Root:     [19]
Internal: [5 14]   [33]
Leaves:   [2 3] [5 7] [14 16] [22 27 29] [33 34 38 39]
Delete 5: The half full criteria is not satisfied after deleting 5, and borrowing a value from a
sibling is not possible. Therefore the leaf is merged with its left sibling (merging with the right
sibling is also possible) and the key values in the internal nodes are changed.
Root:     [19]
Internal: [14]   [33]
Leaves:   [2 3 7] [14 16] [22 27 29] [33 34 38 39]
Delete 7: The half full criteria is still satisfied after deleting 7, so 7 is simply removed from the leaf node.
Root:     [19]
Internal: [14]   [33]
Leaves:   [2 3] [14 16] [22 27 29] [33 34 38 39]
Delete 2: The half full criteria is not satisfied after deleting 2, and borrowing a value from a
sibling is not possible. Therefore the leaf is merged with its right sibling and the key values in
the internal nodes are changed; the changes propagate up to the root, so the tree height decreases.
Root:   [22 33]
Leaves: [3 14 16] [22 27 29] [33 34 38 39]
7. INDEXES AND PERFORMANCE TUNING
While indexes can improve query execution speed, the price we pay is index maintenance.
Update and insert operations need to update the index with new data, which means that writes
will slow down slightly with each index we add to a table. We also need to monitor index usage
and identify when an existing index is no longer needed. This allows us to keep our indexing
relevant and trim enough to ensure that we don't waste disk space and I/O on write operations
to unnecessary indexes. To improve the performance of the system, we need to do the following: