CO3 Notes Hashing
Hashing
Another type of primary file organization is based on hashing, which provides very fast access
to records under certain search conditions. This organization is usually called a hash file. The
search condition must be an equality condition on a single field, called the hash field. In most
cases, the hash field is also a key field of the file, in which case it is called the hash key. The
idea behind hashing is to provide a function h, called a hash function or randomizing function,
which is applied to the hash field value of a record and yields the address of the disk block in
which the record is stored. A search for the record within the block can be carried out in a main
memory buffer. For most records, we need only a single-block access to retrieve that record.
• In a huge database, it is very inefficient to search through all the index values to reach
the desired data.
• The hashing technique is used to compute the direct location of a data record on the disk
without using an index structure.
• Data is stored in the data blocks whose addresses are generated by the hash
function.
• The memory locations where these records are stored are known as data buckets or data
blocks.
Formally, let K denote the set of all search-key values, and let B denote the set of all bucket
addresses. A hash function h is a function from K to B.
Static Hashing:
◼ A bucket is a unit of storage containing one or more records (a bucket is typically a
disk block).
◼ In a hash file organization we obtain the bucket of a record directly from its search-
key value using a hash function.
◼ Hash function h is a function from the set of all search-key values K to the set of all
bucket addresses B.
◼ In most cases, the hash field is also a key field of the file, in which case it is called the
hash key.
◼ Hash function is used to locate records for access, insertion as well as deletion.
◼ One common hash function is h(K) = K mod M, which returns the
remainder of an integer hash field value K after division by M; this value is then used
as the record address.
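As a quick illustration, here is a minimal sketch of the mod-M function (the value M = 7 is an arbitrary choice for illustration):

M = 7                 # number of buckets (arbitrary illustrative choice)

def h(K):
    return K % M      # remainder after division by M is the bucket address

print(h(49))          # 49 mod 7 = 0, so the record goes to bucket 0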
Example: hash file organization of an account file, using branch-name as the key.
◼ There are 10 buckets.
◼ The ith letter of the alphabet (ignoring case) is represented by the integer i.
◼ The hash function returns the sum of the numeric representations of the characters
modulo 10.
◼ E.g., h(Perryridge) = 125 % 10 = 5
◼ h(Redwood) = 84 % 10 = 4
◼ h(Brighton) = 93 % 10 = 3
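A small sketch of this example (assuming letters are valued by their position in the alphabet, a = 1 through z = 26, ignoring case):

def h(branch_name):
    # sum the alphabet positions of the letters, then take mod 10 (10 buckets)
    total = sum(ord(c) - ord('a') + 1 for c in branch_name.lower())
    return total % 10

print(h("Perryridge"))   # 125 % 10 = 5
print(h("Redwood"))      # 84 % 10 = 4
print(h("Brighton"))     # 93 % 10 = 3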
Other hashing functions can be used. One technique, called folding, involves applying an
arithmetic function such as addition or a logical function such as exclusive or to different
portions of the hash field value to calculate the hash address. For example, with an address
space from 0 to 999 to store 1,000 keys, a 6-digit key 235469 may be folded by splitting it into
235 and 469, reversing the second part to 964, and storing the record at the
address (235 + 964) mod 1000 = 199.
Another technique involves picking some digits of the hash field value (for instance, the third,
fifth, and eighth digits) to form the hash address. For example, storing 1,000 employees with
Social Security numbers of 9 digits into a hash file with 1,000 positions would give the Social
Security number 301-67-8923 a hash value of 172 by this hash function.
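Both techniques can be sketched as follows (a rough illustration, not from the text; the function names are invented):

def fold_hash(key):
    # folding: split the 6-digit key into two 3-digit parts,
    # reverse the second part, add, and take mod 1000
    first, second = divmod(key, 1000)                   # 235469 -> (235, 469)
    reversed_second = int(str(second).zfill(3)[::-1])   # 469 -> 964
    return (first + reversed_second) % 1000             # (235 + 964) mod 1000 = 199

def digit_pick_hash(ssn):
    # pick the third, fifth and eighth digits of a 9-digit SSN
    digits = ssn.replace("-", "")                       # "301-67-8923" -> "301678923"
    return int(digits[2] + digits[4] + digits[7])       # -> 172

print(fold_hash(235469))               # 199
print(digit_pick_hash("301-67-8923"))  # 172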
So far, we have assumed that, when a record is inserted, the bucket to which it is mapped has
space to store the record. If the bucket does not have enough space, a bucket overflow is said
to occur. Bucket overflow can occur for several reasons:
• Insufficient buckets. The number of buckets, which we denote nB, must be chosen such
that nB > nr / fr, where nr denotes the total number of records that will be stored and fr
denotes the number of records that will fit in a bucket. For example, to store nr = 10,000
records in buckets that each hold fr = 25 records, more than 10,000 / 25 = 400 buckets are
needed. This relation, of course, assumes that the total number of records is known when
the hash function is chosen.
• Skew. Some buckets are assigned more records than are others, so a bucket may overflow
even when other buckets still have space. This situation is called bucket skew.
◼ There are numerous methods for collision resolution, including the following:
➢ Closed Hashing (Overflow Chaining)
➢ Open Hashing (Open Addressing)
Despite allocation of a few more buckets than required, bucket overflow can still occur. We
handle bucket overflow by using overflow buckets. If a record must be inserted into a bucket
b, and b is already full, the system provides an overflow bucket for b, and inserts the record
into the overflow bucket. If the overflow bucket is also full, the system provides another
overflow bucket, and so on. All the overflow buckets of a given bucket are chained together in
a linked list, as in Figure 11.24. Overflow handling using such a linked list is called overflow
chaining or closed hashing.
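A minimal sketch of overflow chaining, assuming integer keys hashed by key mod NUM_BUCKETS and a small illustrative bucket capacity (all names here are invented for illustration):

BUCKET_CAPACITY = 2
NUM_BUCKETS = 4

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None           # next overflow bucket in the chain

    def insert(self, key):
        if len(self.records) < BUCKET_CAPACITY:
            self.records.append(key)
        elif self.overflow is not None:
            self.overflow.insert(key)  # walk down the existing chain
        else:
            self.overflow = Bucket()   # provide a new overflow bucket for b
            self.overflow.insert(key)

buckets = [Bucket() for _ in range(NUM_BUCKETS)]

def insert(key):
    buckets[key % NUM_BUCKETS].insert(key)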
Under an alternative approach, called open hashing, the set of buckets is fixed, and there are
no overflow chains. Instead, if a bucket is full, the system inserts records in some other bucket
in the initial set of buckets B. One policy is to use the next bucket (in cyclic order) that has
space; this policy is called linear probing. Other policies, such as computing further hash
functions, are also used. Open hashing has been used to construct symbol tables for compilers
and assemblers, but closed hashing is preferable for database systems. Thus, open hashing is
of only minor importance in database implementation.
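A comparable sketch of open hashing with linear probing, assuming single-record slots, integer keys, and an invented table size:

NUM_SLOTS = 8
table = [None] * NUM_SLOTS

def insert(key):
    home = key % NUM_SLOTS
    for i in range(NUM_SLOTS):          # probe each slot at most once
        slot = (home + i) % NUM_SLOTS   # next slot in cyclic order
        if table[slot] is None:
            table[slot] = key
            return
    raise RuntimeError("all slots are full")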
Deficiencies or disadvantages of Static Hashing
In static hashing, the function h maps search-key values to a fixed set B of bucket addresses.
➢ Databases grow with time. If the initial number of buckets is too small, performance
will degrade due to too many overflows.
➢ If the file size at some point in the future is anticipated and the number of buckets
is allocated accordingly, a significant amount of space will be wasted initially.
➢ If the database shrinks, again space will be wasted.
➢ One option is periodic reorganization of the file with a new hash function, but
this is very expensive.
These problems can be avoided by using techniques that allow the number of buckets to be
modified dynamically.
Dynamic Hashing:
The main problem with Static Hashing is that the number of buckets is fixed. If a file shrinks
greatly, a lot of space is wasted; more important, if a file grows a lot, long overflow chains
develop, resulting in poor performance.
The hashing scheme described so far is called static hashing because a fixed number of buckets
M is allocated. The hash function performs a fixed key-to-address mapping, so the address
space is fixed in advance. This can be a serious drawback for dynamic files.
Newer dynamic file organizations based on hashing allow the number of buckets to vary
dynamically with only localized reorganization.
◼ Good for database that grows and shrinks in size.
◼ Allows the hash function to be modified dynamically.
◼ Extendible hashing – one form of dynamic hashing that copes with changes in
database size by splitting and coalescing buckets as the database grows and shrinks.
➢ The hash function in extendible hashing method generates values over a
relatively large range—namely, b-bit binary integers. A typical value for b is
32.
➢ At any time, only a prefix of the hash value (i bits) is used to index into a table of
bucket addresses. (The example below uses the i least significant bits instead;
either convention works.)
➢ Bucket address table size = 2^i.
➢ Initially i = 0
➢ Value of i grows and shrinks as the size of the database grows and shrinks.
➢ Multiple entries in the bucket address table may point to a bucket.
➢ Thus, the actual number of buckets is at most 2^i.
Step 1 – Analyse Data Elements: Data elements may exist in various forms, e.g., integer, string,
float, etc. Here, let us consider data elements of type integer, e.g., 49.
Step 2 – Convert into Binary Format: Convert the data element into binary form. For string
elements, consider the ASCII value of the starting character and then convert that
integer into binary form. Since we have 49 as our data element, its binary form is 110001.
Step 3 – Check the Global Depth of the Directory: Suppose the global depth of the hash directory
is 3.
Step 4 – Identify the Directory: Consider the 'global depth' number of LSBs in the binary
number and match it to the directory id. E.g., the binary obtained is 110001 and the global
depth is 3, so the hash function returns the 3 LSBs of 110001, viz. 001.
Step 5 – Navigation: Now, navigate to the bucket pointed by the directory with directory-id
001.
Step 6 – Insertion and Overflow Check: Insert the element and check if the bucket overflows.
If an overflow is encountered, go to step 7 followed by Step 8, otherwise, go to step 9.
Step 7 – Tackling the Overflow Condition during Data Insertion: While inserting
data into the buckets, it may happen that a bucket overflows. In such cases, we need to follow
an appropriate procedure to avoid mishandling the data. First, check whether the local depth is
less than or equal to the global depth; then choose one of the cases below.
➢ Case 1: If the local depth of the overflowing bucket is equal to the global depth,
then both directory expansion and a bucket split need to be performed.
Increment the global depth and the local depth by 1, and assign the
appropriate pointers. Directory expansion doubles the number of directory
entries present in the hash structure.
➢ Case 2: If the local depth is less than the global depth, then only a bucket
split takes place. Increment only the local depth by 1, and assign the
appropriate pointers.
Step 8 – Rehashing of Split Bucket Elements: The elements present in the overflowing bucket
that is split are rehashed w.r.t. the new depth (the new global depth in Case 1, or the new
local depth in Case 2).
Step 9 – The element is successfully hashed.
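The insertion procedure above can be condensed into a small sketch. This is an illustrative toy, not code from the notes (the class name ExtendibleHashTable and its helpers are invented), assuming integer keys, a bucket size of 3 as in the example below, and a directory that starts at global depth 1:

BUCKET_SIZE = 3

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHashTable:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]  # indexed by the LSBs of the key

    def _index(self, key):
        # Step 4: take the 'global depth' number of LSBs as the directory id
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < BUCKET_SIZE:       # Step 6: room available
            bucket.keys.append(key)
            return
        # Step 7: overflow
        if bucket.local_depth == self.global_depth:
            # Case 1: directory expansion (doubling) plus bucket split
            self.directory = self.directory * 2
            self.global_depth += 1
        # Case 1 and Case 2 both split the overflowing bucket
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        for i in range(len(self.directory)):     # repoint directory entries
            if self.directory[i] is bucket and (i & high_bit):
                self.directory[i] = new_bucket
        # Step 8: rehash the split bucket's keys, then retry the new key
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys + [key]:
            self.insert(k)

Inserting the keys of the example below (16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26) into this table reproduces the directory expansions and bucket splits described next.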
Example: Now, let us work through hashing the following elements:
16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26.
Bucket size: 3 (assumed)
1) First, calculate the binary forms of each of the given numbers.
16- 10000
4- 00100
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 11010
2) Initially, the global depth and all local depths are 1, so the hashing frame has a two-entry
directory (ids 0 and 1), each pointing to an empty bucket of size 3.
Inserting 16:
The binary format of 16 is 10000 and global-depth is 1. The hash function returns 1 LSB of
10000 which is 0. Hence, 16 is mapped to the directory with id=0.
Inserting 4 and 6:
Both 4 (100) and 6 (110) have 0 as their LSB. Hence, they are also hashed to the bucket pointed
to by directory 0, which now holds 16, 4 and 6.
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed to by directory 0
is already full. Hence, an overflow occurs.
As directed by Step 7, Case 1, since local depth = global depth, the bucket splits and
directory expansion takes place. Rehashing of the numbers present in the overflowing bucket
takes place after the split, and since the global depth is incremented by 1, the global
depth is now 2. Hence, 16, 4, 6 and 22 are now rehashed w.r.t. their 2 LSBs
[16 (10000), 4 (00100), 6 (00110), 22 (10110)]: 16 and 4 map to directory 00, while 6 and 22
map to directory 10.
Inserting 24 and 10: 24 (11000) and 10 (01010) are hashed to the buckets with directory ids 00
and 10 respectively. Here, we encounter no overflow condition.
Inserting 31, 7, 9: All of these elements [31 (11111), 7 (00111), 9 (01001)] have either 01 or 11
as their 2 LSBs. Hence, they are mapped to the bucket pointed to by directories 01 and 11
(which still share a single bucket of local depth 1). We do not encounter any overflow
condition here.
Inserting 20: Insertion of data element 20 (10100) again causes an overflow:
20 hashes to the bucket pointed to by directory 00, which already holds 16, 4 and 24. As
directed by Step 7, Case 1, since the local depth of the bucket = the global depth, directory
expansion (doubling) takes place along with a bucket split, and the elements present in the
overflowing bucket are rehashed with the new global depth of 3: 16 and 24 now map to
directory 000, while 4 and 20 map to directory 100.
Inserting 26: The global depth is 3. Hence, the 3 LSBs of 26 (11010) are considered, so 26
maps to the bucket pointed to by directory 010.
That bucket (holding 6, 22 and 10) overflows, and, as directed by Step 7, Case 2, since the
local depth of the bucket < the global depth (2 < 3), the directory is not doubled; only the
bucket is split and its elements are rehashed. This completes the hashing of the given list of
numbers.
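Tracing the rehash through (a summary reconstructed by following the steps above, since the
original figures are not reproduced here), the final structure has global depth 3: directory 000
points to {16, 24}, 100 to {4, 20}, 010 to {10, 26} and 110 to {6, 22}, each of local depth 3,
while directories 001, 011, 101 and 111 all share the bucket {31, 7, 9} of local depth 1.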