Unit-6 Storage Strategies
Index (Indexing)
What is a database index?
Indexes are special lookup tables that the database
search engine can use to speed up data retrieval.
A database index is a data structure that improves the
speed of data retrieval operations on a database table.
An index in a database is very similar to an index in
the back of a book.
Indexes are used to retrieve data from the database
very quickly. Users cannot see the indexes; they are
used only to speed up searches and queries.
Updating a table with indexes takes more time than
updating a table without (because the indexes also
need an update).
Indexing
Indexing is used to optimize the performance of a database by
minimizing the number of disk accesses required when a query is
processed.
The index is a type of data structure. It is used to locate and access
the data in a database table quickly.
Indexes can be created using one or more database columns.
The first column of the index is the search key, which contains a
copy of the primary key or candidate key of the table. The key values
are stored in sorted order so that the corresponding
data can be accessed easily.
The second column of the index is the data reference. It contains
a set of pointers holding the address of the disk block where the
value of the particular key can be found.
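The two-column structure described above can be sketched in a few lines of Python. This is only an illustration: the key values and block numbers are made up, and `lookup` is a hypothetical helper, not part of any DBMS.

```python
from bisect import bisect_left

# Hypothetical index: each entry is (search key, data reference).
# The data reference stands in for a disk block address.
index = [
    (101, 7),
    (102, 3),
    (105, 9),
    (110, 3),
]  # kept sorted by search key

def lookup(index, key):
    """Return the block address for key, or None if the key is not indexed."""
    keys = [k for k, _ in index]
    pos = bisect_left(keys, key)          # binary search on the sorted keys
    if pos < len(index) and index[pos][0] == key:
        return index[pos][1]
    return None
```

Because the search keys are sorted, the lookup can use binary search instead of reading every entry.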
Indexing methods
1) Ordered Index
The indices are usually sorted to make searching faster.
The indices which are sorted are known as ordered
indices.
Suppose we have an employee table with thousands of
records, each 10 bytes long. The IDs
start with 1, 2, 3, and so on, and we have to find the
employee with ID 543.
Without an index, we have to
scan the disk blocks from the start until we reach record 543.
With an index, we look up 543 in the sorted index and follow its pointer to the record.
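To make the difference concrete, the sketch below counts how many entries are examined when looking for ID 543 with a sequential scan and with a binary search over a sorted index. The table size of 5000 is an assumption, and the counts are entries examined, not bytes read from disk.

```python
def scan_steps(ids, target):
    """Sequential scan: count records read until target is found."""
    for steps, emp_id in enumerate(ids, start=1):
        if emp_id == target:
            return steps
    return len(ids)

def binary_search_steps(sorted_ids, target):
    """Binary search over a sorted index: count entries probed."""
    lo, hi, steps = 0, len(sorted_ids) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_ids[mid] == target:
            return steps
        if sorted_ids[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

ids = list(range(1, 5001))   # hypothetical table: employee IDs 1..5000
```

Here `scan_steps(ids, 543)` reads 543 records, while `binary_search_steps(ids, 543)` needs at most 13 probes for a table of this size.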
2) Primary Index
If the index is created on the basis of the primary key of the
table, then it is known as primary indexing. Primary
keys are unique to each record and have a 1:1 relation
with the records.
As primary keys are stored in sorted order, the performance of
the searching operation is quite efficient.
The primary index can be classified into two types:
1) Dense index
2) Sparse index.
1) Dense index
The dense index contains an index record for every search key
value in the data file. It makes searching faster.
In this, the number of records in the index table is the same as the
number of records in the main table.
It needs more space to store the index records themselves. Each index record
holds the search key and a pointer to the actual record on the disk.
2) Sparse index.
In a sparse index, index records are not created for every search
key. An index record appears only for a few
items in the data file, and each item points to a block.
To locate a record, we find the index entry with the largest search key
that does not exceed the key we want, then scan the block it points to.
It requires less space and less maintenance
overhead for insertions and deletions,
but it is slower than the
dense index for locating records.
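The contrast between the two primary index types can be sketched as follows. The data file, its block size of three records, and the helper names are assumptions made for illustration only.

```python
from bisect import bisect_right

# Hypothetical data file: sorted records split into blocks of 3.
blocks = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H"), (90, "I")],
]

# Dense index: one entry per search key -> (block number, slot in block).
dense = {key: (b, s)
         for b, block in enumerate(blocks)
         for s, (key, _) in enumerate(block)}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [block[0][0] for block in blocks]

def dense_lookup(key):
    """Jump straight to the record through its own index entry."""
    loc = dense.get(key)
    if loc is None:
        return None
    b, s = loc
    return blocks[b][s][1]

def sparse_lookup(key):
    """Pick the block via the largest indexed key <= key, then scan it."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, value in blocks[b]:
        if k == key:
            return value
    return None
```

The dense index holds nine entries for nine records, while the sparse index holds only three, at the cost of a short scan inside the chosen block.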
3) Clustered Index
A clustered index can be defined as an ordered data file.
Sometimes the index is created on non-primary key columns
which may not be unique for each record.
In this case, to identify the record faster, we group two or
more columns to get a unique value and create an index out of
them. This method is called a clustering index.
Records that have similar
characteristics are grouped, and indexes
are created for these groups.
Example: suppose a company has
several employees in each department.
If we use a clustering index,
all employees that belong to
the same Dept_ID are placed
within a single cluster, and the index
pointers point to the cluster as a
whole. Here Dept_ID is a non-unique
key. We use a separate disk block for
each cluster.
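The department example can be sketched like this. The employee rows are invented, and each Python list stands in for the disk block that holds one cluster.

```python
from collections import defaultdict

# Hypothetical employee records: (emp_id, name, dept_id).
employees = [
    (1, "Asha", 10),
    (2, "Ben", 20),
    (3, "Chen", 10),
    (4, "Dia", 30),
    (5, "Eli", 20),
]

# Group records by department; each list plays the role of one disk block.
clusters = defaultdict(list)
for emp in employees:
    clusters[emp[2]].append(emp)

# The clustering index maps the non-unique key Dept_ID to its cluster.
cluster_index = {dept_id: clusters[dept_id] for dept_id in sorted(clusters)}

def employees_in(dept_id):
    """Return every record in the cluster for dept_id (empty if none)."""
    return cluster_index.get(dept_id, [])
```

A single index entry per department is enough, because the pointer leads to the whole cluster rather than to one record.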
4) Secondary Index
In the sparse indexing, as the size of the table grows, the size of
mapping also grows.
These mappings are usually kept in primary memory so
that address fetches are faster.
The actual data is then read from secondary memory using
the address obtained from the mapping. If the mapping grows too large,
fetching the address itself becomes slower. In this case, the
sparse index is no longer efficient. To overcome this problem,
secondary indexing is introduced.
In secondary indexing, another level of indexing is introduced
to reduce the size of the mapping.
In this method, large ranges of column values are selected
first so that the mapping size of the first level stays
small.
Then each range is further divided into smaller ranges.
The mapping of the first level is stored in the primary memory,
so that address fetch is faster. The mapping of the second level
and actual data are stored in the secondary memory (hard disk).
Example
To find the record with roll number 111 in the diagram, the search
first looks for the largest entry that is less than or equal to
111 in the first-level index.
It gets 100 at this level. Then, in the second-level index, it again
looks for the largest entry that is less than or equal to 111 and gets 110.
Now, using the address 110, it goes to the data block and
scans the records sequentially until it reaches 111.
This is how a search is performed
in this method.
Inserting, updating or deleting is
also done in the same manner.
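The search for roll number 111 can be traced with a small two-level sketch. Only the values 100, 110, and 111 come from the example; the rest of the layout and the helper names are assumptions.

```python
from bisect import bisect_right

def floor_entry(entries, key):
    """Return the largest (k, ref) in sorted entries with k <= key, else None."""
    keys = [k for k, _ in entries]
    pos = bisect_right(keys, key) - 1
    return entries[pos] if pos >= 0 else None

# Hypothetical layout on secondary memory.
data_blocks = {
    100: [100, 101, 105],
    110: [110, 111, 113],
    120: [120, 125],
    200: [200, 204],
}
second_level = {
    100: [(100, 100), (110, 110), (120, 120)],   # (smallest roll, data block)
    200: [(200, 200)],
}
# First level, assumed to be kept in primary memory.
first_level = [(100, 100), (200, 200)]           # (smallest roll, second-level block)

def find(roll):
    """Return (first-level key, second-level key, roll) or None if absent."""
    top = floor_entry(first_level, roll)
    if top is None:
        return None
    mid = floor_entry(second_level[top[1]], roll)
    if mid is None:
        return None
    for r in data_blocks[mid[1]]:                # sequential scan of the data block
        if r == roll:
            return (top[0], mid[0], r)
    return None
```

Calling `find(111)` passes through 100 at the first level and 110 at the second level before the scan of the data block reaches 111.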
B-tree
A B-tree is a data structure that stores data in its nodes in sorted
order. We can represent a sample B-tree as follows.
A B-tree stores data in such a way that each node contains keys in
ascending order.
Each of these keys has two references to two child nodes.
The keys in the left child node are less than the current key, and the
keys in the right child node are greater than the current key.
Searching a record in B-tree
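A search can be sketched under the ordering rule stated above: at each node, either the key is found, or the search descends into the one child whose key range can contain it. The `Node` class and the sample tree are assumptions for illustration, and insertion and node splitting are not shown.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                  # keys in ascending order
        self.children = children or []    # empty for a leaf, else len(keys) + 1

def search(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)   # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:             # reached a leaf without finding key
            return False
        node = node.children[i]           # child i lies between keys[i-1] and keys[i]
    return False

# Hypothetical sample tree with one root and three leaves.
root = Node([20, 40], [Node([5, 10]), Node([25, 30]), Node([45, 50, 60])])
```

For example, searching for 30 compares against the root keys 20 and 40, follows the middle child, and finds 30 there.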
Hashing
In a huge database structure, it is very inefficient to search all the
index values and reach the desired data. Hashing technique is used to
calculate the direct location of a data record on the disk without using
index structure.
In this technique, data is stored in data blocks whose addresses are
generated by a hash function. The memory locations where
these records are stored are known as data buckets or data blocks.
The hash function can use any column value to
generate the address.
Most of the time, the hash function uses the primary key to generate
the address of the data block.
A hash function can be anything from a simple mathematical function to a
complex one.
We can even use the primary key itself as the address of the data
block.
The above diagram shows data block addresses that are the same as the primary key
values.
The hash function can also be a simple mathematical function such as
exponential, mod, cos, or sin.
Suppose we use a mod (5) hash function to determine the address of the
data block.
In this case, it applies the mod (5) hash function to the primary keys and
generates 3, 3, 1, 4, and 2 respectively, and the records are stored at those data
block addresses.
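The mod (5) computation is easy to check in code. The primary keys below are hypothetical: they were chosen only because they reproduce the addresses 3, 3, 1, 4, and 2 listed above.

```python
NUM_BUCKETS = 5

def h(key):
    """mod (5) hash function: map a primary key to a data block address."""
    return key % NUM_BUCKETS

# Hypothetical primary keys (not given in the text).
keys = [98, 103, 101, 104, 102]
addresses = [h(k) for k in keys]
print(addresses)  # [3, 3, 1, 4, 2]
```

Note that the first two keys map to the same address, which is the kind of clash that collision handling has to resolve.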
Why do we need hashing?
These are the situations in a DBMS where you need to apply the
hashing method:
For a huge database structure, it is tough to search all the index
values through all their levels and then reach the
destination data block to get the desired data.
Hashing method is used to index and retrieve items in a
database as it is faster to search that specific item using the
shorter hashed key instead of using its original value.
Hashing is an ideal method to calculate the direct location of a
data record on the disk without using index structure.
It is also a helpful technique for implementing dictionaries.
Important terminology used in hashing
Data bucket: Data buckets are memory locations where the
records are stored. A data bucket is also known as a unit of storage.
Key: A DBMS key is an attribute or set of attributes that helps
you to identify a row (tuple) in a relation (table). This allows you to
find the relationship between two tables.
Hash function: A hash function is a mapping function that maps
the set of search keys to the addresses where the actual records are
placed.
Linear probing: Linear probing uses a fixed interval between probes.
In this method, the next available data block is used to enter the
new record, instead of overwriting the older record.
Quadratic probing: Quadratic probing determines the new bucket
address by adding the successive outputs of a quadratic polynomial
to the starting value given by the original hash computation, so the
interval between probes grows.
Hash index: It is the address of the data block. A hash
function could be anything from a simple mathematical function to
a complex one.
Double hashing: Double hashing is a computer
programming method used in hash tables to resolve
hash collisions.
Bucket overflow: The condition of bucket overflow is
called a collision. This is a fatal state for any static hash
function.
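The three collision-handling terms above differ only in how the next slot to try is computed. The sketch below shows one common formulation of each; the table size of 7 and the second hash function `h2` are assumptions, not something the text specifies.

```python
TABLE_SIZE = 7   # hypothetical; a prime size suits the double-hash step

def h1(key):
    return key % TABLE_SIZE

def h2(key):
    # Second hash for double hashing; never 0, so the probe always moves.
    return 1 + (key % (TABLE_SIZE - 1))

def probe_sequence(key, method):
    """Yield the slots examined for key under the given probing method."""
    for i in range(TABLE_SIZE):
        if method == "linear":
            yield (h1(key) + i) % TABLE_SIZE          # fixed interval of 1
        elif method == "quadratic":
            yield (h1(key) + i * i) % TABLE_SIZE      # interval grows as i squared
        elif method == "double":
            yield (h1(key) + i * h2(key)) % TABLE_SIZE
        else:
            raise ValueError(method)

def insert(table, key, method="linear"):
    """Place key in the first free slot of its probe sequence; return the slot."""
    for slot in probe_sequence(key, method):
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free slot found for key")
```

With an empty table, inserting 10 and then 17 (both hash to slot 3) places 17 in slot 4 under linear or quadratic probing and in slot 2 under double hashing.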
Types of hashing:
1) Static hashing
In static hashing, the resultant data bucket address will always
remain the same.
That means if we generate an address for EMP_ID = 103 using the
hash function mod (5), it will always result in the same bucket
address, 3. Here, there will be no change in the bucket address.
Therefore, in the static hashing method, the number of data buckets
in memory always remains constant.
Operations of Static hashing
1) Searching a record
When a record needs to be searched, then the same hash function retrieves the
address of the bucket where the data is stored.
2) Insert a Record
When a new record is inserted into the table, then we will generate an address
for a new record based on the hash key and record is stored in that location.
3) Delete a Record
To delete a record, we first fetch the record that is to be
deleted. Then we delete the record at that address in memory.
4) Update a Record
To update a record, we will first search it using a hash function, and then the
data record is updated.
If we want to insert a new record into the file but the data
bucket address generated by the hash function is not empty, that is, data already exists at
that address, the situation is known as bucket
overflow. This is a critical situation in static hashing.
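The four operations and the overflow condition can be sketched with a fixed set of buckets. The bucket capacity of one record mirrors the description above ("data already exists at that address"); the class and function names are hypothetical.

```python
NUM_BUCKETS = 5        # fixed for the lifetime of the file
BUCKET_CAPACITY = 1    # assumption: one record per bucket

class BucketOverflow(Exception):
    """Raised when the target bucket has no room for a new record."""

buckets = [dict() for _ in range(NUM_BUCKETS)]   # each bucket: key -> record

def address(key):
    return key % NUM_BUCKETS                     # same address every time

def insert(key, record):
    bucket = buckets[address(key)]
    if key not in bucket and len(bucket) >= BUCKET_CAPACITY:
        raise BucketOverflow(f"bucket {address(key)} is full")
    bucket[key] = record

def search(key):
    return buckets[address(key)].get(key)

def update(key, record):
    bucket = buckets[address(key)]
    if key not in bucket:
        raise KeyError(key)
    bucket[key] = record

def delete(key):
    return buckets[address(key)].pop(key, None)
```

Inserting a record with key 103 fills bucket 3; a later insert of key 98, which also hashes to 3, raises `BucketOverflow` because the number of buckets cannot grow.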
Dynamic hashing
The dynamic hashing method is used to overcome the problems of
static hashing like bucket overflow.
In this method, data buckets grow or shrink as the number of records increases
or decreases. This method is also known as the extendible hashing
method.
This method makes hashing dynamic, i.e., it allows insertion or
deletion without resulting in poor performance.
How to search a key
First, calculate the hash address of the key.
Check how many bits are used in the directory; call this
number i.
Take the least significant i bits of the hash address. This gives an index
of the directory.
Now using the index, go to the directory and find bucket address
where the record might be.
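The search steps above reduce to a bit mask followed by a directory lookup. In this sketch the directory contents and the hash address are made-up values, and `directory_index` is a hypothetical helper.

```python
def directory_index(hash_address, i):
    """Return the least significant i bits of hash_address as an integer."""
    return hash_address & ((1 << i) - 1)

# Hypothetical directory that uses i = 2 bits, so it has 4 entries.
i = 2
directory = ["B0", "B1", "B2", "B3"]

hash_address = 0b10110                    # assumed hash address of some key
bucket = directory[directory_index(hash_address, i)]
print(bucket)  # B2, because the last two bits are 10
```

The record, if present, is then looked for inside the bucket that the directory entry names.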
How to insert a new record
Firstly, you have to follow the same procedure for retrieval, ending
up in some bucket.
If there is still space in that bucket, then place the record in it.
If the bucket is full, then we will split the bucket and redistribute the
records.
Example
Consider the following grouping of keys into buckets, depending on the
last bits of their hash addresses:
The last two bits of the hash addresses of keys 2 and 4 are 00, so they go into bucket B0.
The last two bits of the hash addresses of keys 5 and 6 are 01, so they go into bucket B1.
The last two bits of the hash addresses of keys 1 and 3 are 10, so they go into bucket B2.
The last two bits of the hash address of key 7 are 11, so it goes into bucket B3.
Insert key 9 with hash address 10001 into the above
structure:
Since key 9 has hash address 10001, whose last two bits are 01, it must go into
bucket B1. But bucket B1 is full, so it is split.
The split separates 5 and 9 from 6: the last three bits of 5 and
9 are 001, so they go into bucket B1, and the last three bits of
6 are 101, so it goes into bucket B5.
Keys 2 and 4 are still in B0. B0 is pointed to by the
000 and 100 entries because the last two bits of both entries are
00.
Keys 1 and 3 are still in B2. B2 is pointed to by the
010 and 110 entries because the last two bits of both entries are
10.
Key 7 is still in B3. B3 is pointed to by the 111 and
011 entries because the last two bits of both entries are 11.
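The split described above can be reproduced with a minimal extendible hashing sketch. Several things here are assumptions: the class names, a bucket capacity of two, and the full hash addresses of keys 1 to 7, which were chosen only to agree with the bit suffixes quoted in the example (only the address of key 9 is given in full). The sketch does not guard against more keys sharing one identical hash address than a bucket can hold.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # number of bits this bucket distinguishes
        self.items = {}                  # key -> hash address

class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=2):
        self.global_depth = global_depth
        self.capacity = capacity
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _index(self, address):
        # Least significant global_depth bits select the directory entry.
        return address & ((1 << self.global_depth) - 1)

    def insert(self, key, address):
        bucket = self.directory[self._index(address)]
        if key in bucket.items or len(bucket.items) < self.capacity:
            bucket.items[key] = address
            return
        self._split(bucket)
        self.insert(key, address)        # retry after the split

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory; each new entry mirrors its old twin.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        # Entries whose newly used bit is 1 now point at the new bucket.
        for j, b in enumerate(self.directory):
            if b is bucket and j & high_bit:
                self.directory[j] = new_bucket
        # Redistribute the old records between the two buckets.
        old_items, bucket.items = bucket.items, {}
        for k, addr in old_items.items():
            self.directory[self._index(addr)].items[k] = addr

# Assumed hash addresses for keys 1..7; 9 -> 10001 is from the example.
addresses = {1: 0b11010, 2: 0b00000, 3: 0b11110, 4: 0b00000,
             5: 0b01001, 6: 0b10101, 7: 0b10111, 9: 0b10001}

table = ExtendibleHash()
for key in (1, 2, 3, 4, 5, 6, 7, 9):
    table.insert(key, addresses[key])
```

After the last insert the directory uses three bits: entry 001 holds keys 5 and 9, entry 101 holds key 6, and the pairs 000/100, 010/110, and 011/111 each share one unsplit bucket.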