
What is Hashing in DBMS?

In a DBMS, hashing is a technique for locating desired data on disk directly, without using an index structure. Hashing is used to index and retrieve items in a database because searching for a specific item with the shorter hashed key is faster than searching with its original value. Data is stored in data blocks (also known as data buckets), and the address of the block where a record is stored is generated by applying a hash function to the record's key.
Why do we need Hashing?
Here are the situations in a DBMS where you need to apply the hashing method:
For a huge database structure, it is impractical to search through every level of an index and then reach the destination data block to get the desired data.
Hashing indexes and retrieves items in a database faster, because searching with the shorter hashed key is quicker than searching with the original value.
Hashing is an ideal method for calculating the direct location of a data record on disk without using an index structure.
It is also a helpful technique for implementing dictionaries.
Important Terminologies in Hashing
Here are the important terms used in hashing:
Data bucket – Data buckets are the memory locations where records are stored. A bucket is also known as a unit of storage.
Key – A DBMS key is an attribute or set of attributes that helps you identify a row (tuple) in a relation (table). This allows you to find the relationship between two tables.
Hash function – A hash function is a mapping function that maps the set of search keys to the addresses where the actual records are placed.
Linear probing – Probing with a fixed interval between probes. In this method, the next available data block is used to enter the new record, instead of overwriting the older record (see the probe-sequence sketch after this list).
Quadratic probing – It helps you determine a new bucket address by adding successive outputs of a quadratic polynomial to the starting value given by the original hash computation.
Hash index – The address of a data block. A hash function can be anything from a simple to a complex mathematical function.
Double hashing – A method used in hash tables to resolve hash collisions, in which a second hash function determines the interval between probes.
Bucket overflow – The condition of bucket overflow is called a collision. This is a fatal state for any static hash function.
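To make the three probing strategies concrete, here is a minimal Python sketch (our own illustration, not from the original text) that prints the sequence of bucket addresses each strategy examines after a collision. The table size of 11, the constants c1 and c2, and the second hash function are illustrative assumptions.

    TABLE_SIZE = 11  # assumed bucket count, chosen prime so probe cycles cover the table

    def linear_probe(home, i):
        # i-th probe: fixed interval of 1 between probes
        return (home + i) % TABLE_SIZE

    def quadratic_probe(home, i, c1=1, c2=1):
        # i-th probe: add the output of a quadratic polynomial to the home address
        return (home + c1 * i + c2 * i * i) % TABLE_SIZE

    def double_hash_probe(home, i, key):
        # i-th probe: the interval comes from a second hash function of the key
        step = 1 + key % (TABLE_SIZE - 1)  # assumed second hash; never zero
        return (home + i * step) % TABLE_SIZE

    key = 22
    home = key % TABLE_SIZE  # home bucket 0
    print([linear_probe(home, i) for i in range(4)])            # [0, 1, 2, 3]
    print([quadratic_probe(home, i) for i in range(4)])         # [0, 2, 6, 1]
    print([double_hash_probe(home, i, key) for i in range(4)])  # [0, 3, 6, 9]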
Types of Hashing Techniques

There are mainly two types of hashing techniques in a DBMS:

Static Hashing

Dynamic Hashing

Static Hashing

In static hashing, the resultant data bucket address always remains the same.

Therefore, if you generate an address for, say, Student_ID = 10 using the hash function mod(3), the resultant bucket address will always be 1 (10 mod 3 = 1). So you will not see any change in the bucket address.

Therefore, in the static hashing method, the number of data buckets in memory always remains constant.
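A minimal sketch of this fixed mapping in Python (the function and constant names are ours):

    N_BUCKETS = 3  # static hashing: the number of buckets never changes

    def bucket_address(key):
        # The address depends only on the key and the fixed bucket count,
        # so the same key always maps to the same bucket.
        return key % N_BUCKETS

    print(bucket_address(10))  # Student_ID = 10 -> bucket 1, every time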

Static Hash Functions

Inserting a record: When a new record needs to be inserted into the table, you can generate an address for the new record using its hash key. Once the address is generated, the record is stored in that location.

Searching: When you need to retrieve a record, the same hash function is used to compute the address of the bucket where the data is stored.

Delete a record: Using the hash function, you can first fetch the record that you want to delete. Then you can remove the record at that address in memory.

Static hashing is further divided into

Open hashing

Closed hashing.

Open Hashing

In the open hashing method, instead of overwriting the older record, the next available data block is used to enter the new record. This method is also known as linear probing.

For example, suppose A2 is a new record that you want to insert. The hash function generates address 222, but that address is already occupied by some other value, so the system looks for the next available data bucket, 501, and assigns A2 to it.

How Open Hash Works
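Below is a minimal sketch of open-hashing insertion, under the simplifying assumptions that each bucket holds one record, the file is modeled as a Python list, and the table never fills completely; all names are illustrative.

    table = [None] * 1000  # one record per bucket, for illustration

    def insert_open_hashing(key, record, hash_fn):
        addr = hash_fn(key) % len(table)     # home address from the hash function
        for _ in range(len(table)):          # scan at most one full cycle
            if table[addr] is None:
                table[addr] = (key, record)  # next available bucket gets the new
                return addr                  # record; the older one is untouched
            addr = (addr + 1) % len(table)   # probe the next bucket
        raise RuntimeError("all buckets are occupied")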

Closed Hashing

In the closed hashing method, when a bucket is full, a new bucket is allocated for the same hash result and linked after the previous one.

(i) Open Hashing: If the hash function generates an address at which data is already stored, the next bucket is automatically allocated for the record. This mechanism is termed the linear probing technique.

For example, if R3 is a fresh record that needs to be inserted, and the hash function generates address 102 for it, but that bucket is full, the system searches for the next available data bucket, 113, and assigns R3 to it.

(ii) Closed Hashing: When a bucket is completely full, a new bucket is allocated for that particular hash result and linked right after the previously filled one; this method is therefore called the overflow chaining technique.

For example, suppose R3 is a fresh record that needs to be put in the table, and the hash function generates address 110 for it. That bucket is full and cannot receive new data, so a fresh bucket is appended and chained after it.
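A minimal sketch of this overflow chaining scheme, assuming a bucket capacity of 4 records and a simple linked Bucket class of our own devising:

    BUCKET_CAPACITY = 4

    class Bucket:
        def __init__(self):
            self.records = []     # up to BUCKET_CAPACITY records
            self.overflow = None  # bucket chained after this one

    def insert_chained(buckets, key, record, hash_fn):
        b = buckets[hash_fn(key) % len(buckets)]  # primary bucket for this hash
        while len(b.records) >= BUCKET_CAPACITY:  # bucket completely full:
            if b.overflow is None:
                b.overflow = Bucket()             # allocate a fresh bucket and
            b = b.overflow                        # link it after the full one
        b.records.append((key, record))

    buckets = [Bucket() for _ in range(4)]              # e.g. four primary buckets
    insert_chained(buckets, 43, "record", lambda k: k)  # 43 mod 4 -> bucket 3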

LINEAR HASHING

1 INTRODUCTION

Hashing is one of the techniques used to organize records in a file for faster access to records given a key. Remember that a key is a set of fields whose values uniquely identify a record in the file. A hash function maps key values to a number that indicates the page/block of the file where the record having that key value is stored. Files in database systems used to store data such as employee records are usually organized in terms of one or more disk blocks or pages, which can be referred to with serial numbers such as 0, 1, 2, etc. For example, if the file is expected to store records of employees and the key is the employee name, then a name such as 'Akhil' is mapped to a number using an appropriate hash function, and Akhil's record is stored in the page having that number. Hashing techniques refer to these pages as buckets. Based on the key values, you can come up with a maximum number, N, of buckets for your file. An example of hashing for storing data is shown below:

The hash function specified above returns a bucket number for a record based on the number of letters in the employee name, after subtracting 4 from it. Of course, your hash function could be different from this one. The record with key value 'Akhil' maps to bucket 1, and hence the record is stored in bucket 1. The record with key value 'Sparsh' maps to bucket 2, and hence that record is stored in bucket 2.

Note that the above diagram shows only key values. A record may contain other fields, such as grade and roll number, and these are also stored as part of the record in the same bucket. However, this does not preclude you from having a different scheme for storing records. For example, you may want to store just the key value and a pointer to the actual record in the primary bucket, and the actual records in some other page of the file. For now, we assume that the record is stored in the primary bucket itself.

The hash function maps the record with key 'Piyush' to the same bucket 2. This is called a collision. Here we assumed that the bucket capacity is one record. Since there is no room in bucket 2, we allocate a new bucket and chain it to bucket 2. This new page is called an overflow page. The records with the next three names all map to bucket 3, so we need two overflow pages there. In the usual practical case where bucket capacity is more than one record, colliding records can be stored in the primary bucket until it is full; any further colliding records must be stored in overflow pages. One has to choose a hash function that gives the fewest collisions for better performance.

Searching for equality is very fast with this organization. Given a key value, apply the hash function, get its bucket number, check whether the bucket contains a record with the key value, and return the record if it is in the bucket; otherwise, check the overflow pages one by one. This hashing technique does not read all pages of the file and is hence faster. Of course, the technique is faster only if you have to match key values for equality.
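The equality search just described can be sketched as follows. The name-length hash function and the toy records mirror the Akhil/Sparsh/Piyush example above, while the data layout (each bucket as a chain of pages, one record per page) is our own illustrative choice.

    # Each bucket is a chain of pages: the primary page first, then overflow pages.
    def search_equal(buckets, hash_fn, key):
        for page in buckets[hash_fn(key)]:  # go straight to one bucket's chain
            for k, record in page:          # primary page, then overflow pages
                if k == key:
                    return record
        return None                         # no record with this key

    hash_fn = lambda name: len(name) - 4    # the hash function from the text
    buckets = {1: [[("Akhil", "record 1")]],
               2: [[("Sparsh", "record 2")],    # primary page (capacity 1)
                   [("Piyush", "record 3")]]}   # overflow page chained to it
    print(search_equal(buckets, hash_fn, "Piyush"))  # record 3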
However, performance degrades if the chosen hash function is not efficient. In that case, it gives rise to too many collisions, and the file will have lengthy overflow page chains. Another reason for many collisions is the fixed number of primary buckets. Due to this fixed size, the hashing technique is called static hashing. The problem can be alleviated using one of the dynamic hashing techniques, where the number of primary buckets is not fixed but changes as records are inserted. Linear hashing is one such technique. The advantage of the scheme is the absence of long chains of overflow pages, which degrade the performance of record insertion and retrieval. Of course, the scheme can still lead to overflow pages if the data is very skewed; in that case, the overflow pages are attached to the main bucket.

Though an example of a file for storing data on disk is given in this article, you can use the hashing scheme for data stored in main memory as well.

2 LINEAR HASHING

Linear hashing is a dynamic hashing technique. It allows a file to extend or shrink its number of buckets without a directory, as used in extendible hashing. Suppose you start with a number of buckets N and put records in buckets 0 to N-1. Let this be round i, which will be 0 initially. You start with an initial mod hash function h_i(K) = K mod (2^i * N). When there is a collision, the first bucket, i.e., bucket 0, is split into two buckets: bucket 0 and a new bucket N at the end of the file. The records in bucket 0 are redistributed between the two buckets using another hash function, h_{i+1}(K) = K mod (2^{i+1} * N). An important property of this pair of hash functions is that a record hashed to bucket 0 by h_i will be hashed to either bucket 0 or bucket N by h_{i+1}. There can still be overflow buckets, which are attached to the main buckets as usual. As more records are inserted, the rest of the buckets 1 to N-1 are split in linear order; the file will then have 2N buckets, and at that point the hash function h_{i+1} can be used to search for any record. This is the start of round i+1 (initially, round 1).

In order to keep track of which hash function needs to be used, we maintain a split pointer s that points to the next bucket to be split. It is initially set to 0 and incremented every time a split occurs. When s becomes N after incrementing, this signals that all buckets have been split, and the hash function h_{i+1} applies to all buckets; at this point the split pointer s is reset to 0. After this, the hash function to be used on the next collision is h_{i+2}(K) = K mod (2^{i+2} * N).

Searching for a bucket with hash key K is done as follows: apply the current hash function h_i. If bucket b = h_i(K) is less than s, then apply hash function h_{i+1} instead, because bucket b has already been split.
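A minimal Python sketch of this addressing rule (the function name is ours; the sample calls anticipate the example in the next section):

    def bucket_for_key(K, N, i, s):
        # N initial buckets, current round i, split pointer s
        b = K % (2**i * N)              # h_i(K)
        if b < s:                       # bucket b was already split this round,
            b = K % (2**(i + 1) * N)    # so use h_{i+1}(K) instead
        return b

    # With N = 4, round i = 0 and split pointer s = 1 (bucket 0 already split):
    print(bucket_for_key(43, 4, 0, 1))  # h0(43) = 3, and 3 >= s, so bucket 3
    print(bucket_for_key(44, 4, 0, 1))  # h0(44) = 0 < s, h1(44) = 4, so bucket 4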

Here is a simple example of using linear hashing to store 14 records, with an initial number of buckets N = 4.

3 AN EXAMPLE

This example shows various aspects of linear hashing with more data. Initially the file (at least logically) is empty. As records are inserted into the file, buckets are added to it. The key of each record could consist of one or more fields of the record. We assume that a hash function is first used to turn the key into a number, before the hash functions specified by linear hashing are applied.

Let each bucket have a capacity of 4 records.

Let the initial number of buckets N be 4.

Record key values could be numbers, names, or a combination of these, depending on the fields you select as the key for the file. If a key contains more than one field, or a key field is not a number, we have to use a hash function to get a number first, because we are going to use further hash functions to map record keys to bucket numbers.

Note: The example is taken directly from the third edition of the book Database Management Systems by Ramakrishnan and Gehrke.

3.1 INSERTING RECORDS

1) Insert records with keys 32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7 and 11.

Current round i = 0.

Split pointer s = 0.

Hash function to be used h0(K) = K mod 4.

h0(32) = 32 mod 4 = 0. Insert record in bucket 0.

h0(44) = 44 mod 4 = 0. Insert record in bucket 0.

Similarly, you can find the hash values for the other keys and insert them into the appropriate buckets, as shown in the following diagram:
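The bucket assignments in that diagram can be reproduced with a few lines of Python:

    keys = [32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7, 11]
    for k in keys:
        print(k, "-> bucket", k % 4)  # h0(K) = K mod 4
    # bucket 0: 32, 44, 36          bucket 1: 9, 25, 5
    # bucket 2: 14, 18, 10, 30 (full)    bucket 3: 31, 35, 7, 11 (full)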

2) Insert record with key 43.

Current round i = 0.

Hash function to be used h0(K) = K mod 4

h0(43) = 43 mod 4 = 3. Insert record in bucket 3. But the bucket is full.

Add an overflow page and chain it to bucket 3.


Because there is an overflow, split bucket 0, the bucket pointed to by s, and add a new bucket 4.

Redistribute bucket 0's contents between buckets 0 and 4 using the next hash function h1(K) = K mod (2 * 4) = K mod 8.

Increment the split pointer.
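A quick check of this redistribution in Python:

    # Redistribute bucket 0's records {32, 44, 36} with h1(K) = K mod 8.
    for k in (32, 44, 36):
        print(k, "-> bucket", k % 8)  # 32 -> 0; 44 -> 4; 36 -> 4
    # As promised, every record lands either in bucket 0 or in bucket 0 + N = 4.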

3) Insert record with key 37.

Current round i = 0.

Hash function to be used h0(K) = K mod 4

h0(37) = 37 mod 4 = 1. Insert record in bucket 1.

4) Insert record with key 29.


Current round i = 0.

Hash function to be used h0(K) = K mod 4

h0(29) = 29 mod 4 = 1. Insert record in bucket 1. But the bucket is full.

Split bucket 1 pointed to by s.

Add new bucket 5.

Redistribute bucket 1's contents and 29 between buckets 1 and 5 with the new hash function h1(K) = K mod (2 * 4) = K mod 8.

Increment the split pointer.

No overflow page is needed because we are splitting the same bucket to which the key was mapped with the
original hash function.

5) Insert record with key 22.

Current round i = 0.

Hash function to be used: h0(K) = K mod 4.

h0(22) = 22 mod 4 = 2. Insert record in bucket 2. But the bucket is full.

Split bucket 2 pointed to by s.

Add new bucket 6.

Redistribute bucket 2's contents and 22 between buckets 2 and 6 with the new hash function h1(K) = K mod (2 * 4) = K mod 8.

Increment the split pointer.

No overflow page is needed because we are splitting the same bucket to which the key was mapped with the
original hash function.
6) Insert record with key 66.

Current round i = 0.

Hash function to be used h0(K) = K mod 4

h0(66) = 66 mod 4 = 2, which points to bucket 2.

Since 2 < s, we have to use the h1 hash function.

h1(66) = 66 mod 8 = 2. The same bucket number, so insert the record in bucket 2.

7) Insert record with key 34.

Current round i = 0.

Hash function to be used h0(K) = K mod 4

h0(34) = 34 mod 4 = 2, which points to bucket 2.

Since 2 < s, we need to use the h1 hash function.

h1(34) = 34 mod 8 = 2. Again the same bucket, so insert the record in bucket 2.


8) Insert record with key 50.

Current round i = 0.

Hash function to be used h0(K) = K mod 4

h0(50) = 50 mod 4 = 2, which points to bucket 2.

As 2 < s, use h1 for hashing.

h1(50) = 50 mod 8 = 2. The same bucket, bucket 2.

But the bucket is full.

Split bucket 3, the bucket pointed to by s.

Add new bucket 7.

Redistribute bucket 3's contents between buckets 3 and 7 with the new hash function h1(K) = K mod (2 * 4) = K mod 8.

Add an overflow bucket, insert 50 in it, and chain it to bucket 2.

Increment the split pointer. It will be 4.

Since this is equal to the initial number of buckets N = 4, reset it to 0.

Use the hash function h1(K) = K mod (2 * 4) = K mod 8 for further insertion of records.

Set the round i to i + 1. Now the round i will be 1.


9) Insert record with key 45.

Current round i = 1.

Hash function to be used: h1(K) = K mod (2 * 4) = K mod 8.

h1(45) = 45 mod 8 = 5. Insert record in bucket 5.

10) Insert record with key 53.

Current round i = 1.

Hash function to be used: h1(K) = K mod (2^1 * 4) = K mod 8.

h1(53) = 53 mod 8 = 5. Insert record in bucket 5. But the bucket is full.

Therefore, split bucket 0 as pointed to by split pointer.

Add new bucket 8.

Redistribute its contents between buckets 0 and 8 with the new hash function h2(K) = K mod (2^2 * 4) = K mod 16.

h2(32) = 32 mod 16 = 0. Hence 32 remains in the same bucket.

Add an overflow bucket, insert 53 in it, and chain it to bucket 5.


Increment split pointer. It will be 1.

11) Insert records with keys 48, 64 and 80.

Current round i = 1.

Hash function to be used: h1(K) = K mod (2^1 * 4) = K mod 8.

h1(48) = 48 mod 8 = 0. Since 0 < s, apply h2: h2(48) = 48 mod (2^2 * 4) = 48 mod 16 = 0. Insert record in bucket 0.

h1(64) = 64 mod 8 = 0. Since 0 < s, apply h2: h2(64) = 64 mod 16 = 0. Insert record in bucket 0.

h1(80) = 80 mod 8 = 0. Since 0 < s, apply h2: h2(80) = 80 mod 16 = 0. Insert record in bucket 0.

12) Insert record with key 40

Current round i = 1.

Hash function to be used: h1(K) = K mod (2^1 * 4) = K mod 8.

h1(40) = 40 mod 8 = 0.

Since 0 < s, apply h2: h2(40) = 40 mod (2^2 * 4) = 40 mod 16 = 8. Insert record in bucket 8.

3.2 SEARCHING

Searching is done as follows. If the current hash function is h_i, let bucket b = h_i(K). If b < s, then let b = h_{i+1}(K). Search for the record with key K in bucket b and, if necessary, in its overflow pages.

In the above example, the current hash function is h1(K) = K mod (2^1 * 4) = K mod 8.
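Applying the rule in code, for the state reached at the end of the insertions above (round i = 1, split pointer s = 1, N = 4):

    def search_bucket(K, N=4, i=1, s=1):
        b = K % (2**i * N)              # h1(K) = K mod 8
        if b < s:
            b = K % (2**(i + 1) * N)    # h2(K) = K mod 16 for already-split buckets
        return b

    print(search_bucket(40))  # h1(40) = 0 < s, so h2(40) = 8: look in bucket 8
    print(search_bucket(45))  # h1(45) = 5 >= s: look in bucket 5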

Extendible Hashing (Dynamic approach to DBMS)

Extendible hashing is a dynamic hashing method wherein directories and buckets are used to hash data. It is an aggressively flexible method in which the hash function also experiences dynamic changes.

Main features of Extendible Hashing: The main features of this hashing technique are:

Directories: The directories store the addresses of the buckets as pointers. An id is assigned to each directory, and this id may change each time directory expansion takes place.

Buckets: The buckets are used to hash the actual data.

Basic Structure of Extendible Hashing:


Frequently used terms in Extendible Hashing:

Directories: These containers store pointers to buckets. Each directory is given a unique id, which may change each time an expansion takes place. The hash function returns a directory id, which is used to navigate to the appropriate bucket. Number of directories = 2^(global depth).

Buckets: They store the hashed keys. Directories point to buckets. A bucket may have more than one pointer to it if its local depth is less than the global depth.

Global Depth: It is associated with the directories. It denotes the number of bits the hash function uses to categorize the keys. Global depth = number of bits in the directory id.

Local Depth: It is the same as the global depth, except that local depth is associated with the buckets rather than the directories. The local depth, in conjunction with the global depth, is used to decide the action to be performed when an overflow occurs. Local depth is always less than or equal to the global depth.

Bucket Splitting: When the number of elements in a bucket exceeds a particular size, then the bucket is split into
two parts.

Directory Expansion: Directory expansion takes place when a bucket overflows and the local depth of the overflowing bucket is equal to the global depth.

Basic Working of Extendible Hashing:


Step 1 – Analyze Data Elements: Data elements may exist in various forms, e.g., integer, string, float, etc. For now, let us consider data elements of type integer, e.g., 49.

Step 2 – Convert into binary format: Convert the data element into binary form. For string elements, consider the ASCII-equivalent integer of the starting character and then convert that integer into binary form. Since we have 49 as our data element, its binary form is 110001.

Step 3 – Check the Global Depth of the directory: Suppose the global depth of the hash directory is 3.

Step 4 – Identify the Directory: Consider the 'global depth' number of LSBs in the binary number and match them to the directory id.
E.g., the binary obtained is 110001 and the global depth is 3. So the hash function returns the 3 LSBs of 110001, viz. 001.

Step 5 – Navigation: Now, navigate to the bucket pointed by the directory with directory-id 001.

Step 6 – Insertion and Overflow Check: Insert the element and check whether the bucket overflows. If an overflow is encountered, go to Step 7 followed by Step 8; otherwise, go to Step 9.

Step 7 – Tackling the Overflow Condition during Data Insertion: Often, while inserting data into the buckets, a bucket overflows. In such cases, we need to follow an appropriate procedure to avoid mishandling the data.
First, check whether the local depth is less than or equal to the global depth. Then choose one of the cases below.

Case 1: If the local depth of the overflowing bucket is equal to the global depth, then directory expansion as well as a bucket split needs to be performed. Then increment the global depth and the local depth by 1, and assign appropriate pointers.
Directory expansion doubles the number of directories present in the hash structure.

Case 2: If the local depth is less than the global depth, then only a bucket split takes place. Increment only the local depth by 1, and assign appropriate pointers.

Step 8 – Rehashing of Split Bucket Elements: The Elements present in the overflowing bucket that is split are
rehashed w.r.t the new global depth of the directory.

Step 9 – The element is successfully hashed.
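Steps 2 to 5 amount to extracting the global-depth LSBs of the key as the directory id; a minimal Python sketch (the function name is ours):

    def directory_id(key, global_depth):
        # Keep only the 'global depth' least significant bits of the key
        return key & ((1 << global_depth) - 1)

    print(bin(49))                             # 0b110001
    print(format(directory_id(49, 3), "03b"))  # '001': follow directory 001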

Example based on Extendible Hashing: Now, let us consider a prominent example: hashing the elements 16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26.
Bucket Size: 3 (Assume)
Hash Function: Suppose the global depth is X. Then the Hash Function returns X LSBs.

Solution: First, calculate the binary forms of each of the given numbers.
16- 10000
4- 00100
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 11010

Initially, the global depth and the local depth are always 1. Thus, the hashing frame looks like this:

Inserting 16:
The binary format of 16 is 10000 and the global depth is 1. The hash function returns the 1 LSB of 10000, which is 0. Hence, 16 is mapped to the directory with id = 0.

Inserting 4 and 6:
Both 4 (00100) and 6 (00110) have 0 as their LSB. Hence, they are hashed as follows:

Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed to by directory 0 is already full. Hence, overflow occurs.

As directed by Step 7, Case 1: since local depth = global depth, the bucket splits and directory expansion takes place. Rehashing of the numbers present in the overflowing bucket also takes place after the split. And since the global depth is incremented by 1, the global depth is now 2. Hence, 16, 4, 6, 22 are now rehashed w.r.t. their 2 LSBs. [16 (10000), 4 (00100), 6 (00110), 22 (10110)]
*Notice that the bucket which did not overflow has remained untouched. But since the number of directories has doubled, we now have two directories, 01 and 11, pointing to the same bucket. This is because the local depth of that bucket has remained 1, and any bucket having a local depth less than the global depth is pointed to by more than one directory.

Inserting 24 and 10: 24 (11000) and 10 (01010) are hashed into the buckets pointed to by the directories with ids 00 and 10. Here, we encounter no overflow condition.

Inserting 31, 7, 9: All of these elements [31 (11111), 7 (00111), 9 (01001)] have either 01 or 11 as their 2 LSBs. Hence, they are mapped to the buckets pointed to by 01 and 11. We do not encounter any overflow condition here.

Inserting 20: Insertion of data element 20 (10100) again causes an overflow.
20 is inserted into the bucket pointed to by 00. As directed by Step 7, Case 1: since the local depth of the bucket = global depth, directory expansion (doubling) takes place along with bucket splitting. The elements present in the overflowing bucket are rehashed with the new global depth. Now, the new hash table looks like this:

Inserting 26: The global depth is 3. Hence, the 3 LSBs of 26 (11010) are considered. Therefore, 26 best fits in the bucket pointed to by directory 010.
That bucket overflows, and as directed by Step 7, Case 2: since the local depth of the bucket < global depth (2 < 3), the directories are not doubled; only the bucket is split and its elements are rehashed.
Finally, the output of hashing the given list of numbers is obtained.

The hashing of the 11 numbers is thus completed.

Key Observations:

A bucket will have more than one pointer pointing to it if its local depth is less than the global depth.

When an overflow condition occurs in a bucket, all the entries in the bucket are rehashed with a new local depth.

If the local depth of the overflowing bucket is equal to the global depth, the directory is doubled as well; otherwise, only the bucket is split.

The size of a bucket cannot be changed after the data insertion process begins.

Advantages:
Data retrieval is less expensive (in terms of computing).

No problem of Data-loss since the storage capacity increases dynamically.

With dynamic changes in hashing function, associated old values are rehashed w.r.t the new hash function.

Limitations Of Extendible Hashing:

The directory size may increase significantly if several records hash to the same directory entry while the record distribution remains non-uniform.

The size of every bucket is fixed.

Memory is wasted in pointers when the difference between the global depth and the local depth becomes drastic.

This method is complicated to code.
