DS Module-X


DEPARTMENT OF

COMPUTER SCIENCE & ENGINEERING


Data Structures
(DS)
B.Tech. – CSE
Academic Year : 2023-24
Term – III
Module - X
Contents
Introduction

Hash Tables

Hash functions

Pros and Cons of Hashing

Applications
Hash Table Representation

A hash table is one of the most important data structures. It uses a special function, known as a hash function, that maps a given value to a key so that elements can be accessed faster.
(or)


Hashing is the process of mapping a large amount of data to a smaller table with the help of a hash function.

(or)


Hashing is a technique that is used to uniquely identify a specific object from a group of similar objects.

(or)


Hashing is an important data structure technique designed around a special function, called the hash function, which maps a given value to a particular key for faster access to elements.
Why hashing?(Need of Hashing)

Suppose we want to design a system for storing employee records keyed using phone numbers, and we want the following queries to be performed efficiently:

Insert a phone number and corresponding information.

Search a phone number and fetch the information.

Delete a phone number and related information.

We can think of using the following data structures to maintain information about different phone numbers.

Array of phone numbers and records.

Linked List of phone numbers and records.

Balanced binary search tree with phone numbers as keys.

Direct Access Table.

For arrays and linked lists, we need to search in a linear fashion, which can be costly in practice.

With a balanced binary search tree, we get moderate search, insert, and delete times: all of these operations can be guaranteed to run in O(log n) time.

Another solution one can think of is to use a direct access table, where we make a big array and use phone numbers as indexes into the array.

The first problem with this solution is that the extra space required is huge.

For example, if a phone number is n digits, we need O(m * 10^n) space for the table, where m is the size of a pointer to a record.

Another problem is that an integer type in a programming language may not be able to store an n-digit number.

Due to above limitations Direct Access Table cannot always be used.

Hashing is the solution that can be used in almost all such situations, and in practice it performs extremely well compared to the above data structures (array, linked list, balanced BST).

With hashing we get O(1) search time on average (under reasonable assumptions) and O(n) in worst case.

Hashing is an improvement over Direct Access Table
A Hash table is a data structure that stores some information, and the information has basically two
main components, i.e., key and value. The hash table can be implemented with the help of an
associative array. The efficiency of mapping depends upon the efficiency of the hash function used
for mapping.

Hashing is also known as Hashing Algorithm or Message Digest Function.

Hashing allows us to update and retrieve any data entry in constant time, O(1).

Constant time O(1) means the operation does not depend on the size of the
data.

It minimizes the number of comparisons while performing the search.

Hashing is used with a database to enable items to be retrieved more quickly.

It is used in the encryption and decryption of digital signatures.

In hashing, large keys are converted into small keys by using hash functions.

The values are then stored in a data structure called a hash table.

The idea of hashing is to distribute entries (key/value pairs) uniformly across
an array.
Hashing Implementation

Hashing is implemented using two important concepts:


Hash Function -- An element (key) is converted into an integer by using a
hash function.


Hash Table -- The integer generated by the hash function is used as an
index at which the original element is stored in the
hash table.
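As a quick illustration, here is a minimal C sketch of these two concepts, assuming integer keys, a table of 10 slots, and no collision handling:

```c
#include <stdio.h>

#define TABLE_SIZE 10

/* Hash function: converts an element (key) into an integer index. */
int hash(int key) {
    return key % TABLE_SIZE;
}

int main(void) {
    int table[TABLE_SIZE] = {0};   /* hash table; 0 marks an empty slot here */
    int key = 84;

    table[hash(key)] = key;        /* store the element at its hashed index */
    printf("%d stored at index %d\n", key, hash(key));
    return 0;
}
```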
Hash Table

Hash table is a type of data structure which is used for storing and accessing data very quickly.

Insertion of data in a table is based on a key value. Hence every entry in the hash table is defined with some
key.

Hash table or hash map is a data structure used to store key-value pairs

It is a collection of items stored to make it easy to find them later.

It uses a hash function to compute an index into an array of buckets or slots from which the desired value can
be found.

It is an array of lists, where each list is known as a bucket.

It contains values based on keys.

In Java, the Hashtable class implements the Map interface and extends the Dictionary class; it is synchronized and contains only unique keys.
Hash Function

A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

A hash function is a function which, applied to a key, produces an integer that can be used as an
address in the hash table.

Hash function is a function that maps any big number or string to a small integer value.

A fixed process that converts a key to a hash key is known as a hash function.

This function takes a key and maps it to a value of a certain length which is called a Hash value or Hash.

The hash value represents the original string of characters, but it is normally smaller than the original.

The values returned by a hash function are called hash values, hash codes, hash sums, digests, or simply hashes.

The values are used to index a fixed-size table called a hash table.

Use of a hash function to index a hash table is called hashing or scatter storage addressing.
Characteristics of good hash function

To achieve a good hashing mechanism, it is important to have a good hash function with
the following basic requirements:

The hash function should generate different hash values for similar strings.

The hash function is easy to understand and simple to compute.

The hash function should produce keys which get distributed uniformly over the
array.

The number of collisions while placing the data in the hash table should be small.

The hash function is a perfect hash function when it uses all the input data.
Types of hash function

There are various types of hash functions available such as-

Division Hash Function

Mid Square Hash Function

Multiplicative Hash Function

Digit Folding Hash Function

Digit Analysis
Division method

The division method, or remainder method, takes an item, divides it by the table size, and returns the remainder as
its hash value.

h(key) = key mod table_size

i.e. key % table_size

Since it requires only a single division operation, hashing by division is quite fast.

It has been found that the best results with the division method are achieved when the table size is prime

After computing the hash values, we can insert each item into the hash table at the designated position

It is easy to search for an item using hash function where it computes the slot name for the item and then
checks the hash table to see if it is present.

The hash function is dependent upon the remainder of a division.

For example, suppose the records 52, 68, 99, 84 are to be placed in a hash table,
and let us take the table size to be 10.

Then:


h(key) = record % table_size

h(52) = 52 % 10 = 2

h(68) = 68 % 10 = 8

h(99) = 99 % 10 = 9

h(84) = 84 % 10 = 4

In the hash table, 4 of the 10 slots are occupied. This ratio is referred to as the load factor and
denoted by λ = number of items / table size.

For example, λ = 4/10 = 0.4.

With this hash function, many keys can have the same hash value. This is called a collision.

A prime not too close to an exact power of 2 is often a good choice for table_size.


Suppose r = 256 and table_size = 17, so that r % table_size = 256 % 17 = 1. Because 256 ≡ 1 (mod 17), the hash of a key then depends only on the sum of its base-256 digits, so many different keys collide.

For key = 37599, the hash is

37599 % 17 = 12

But for key = 573, the hash is also

573 % 17 = 12
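A small C sketch of the division method, reproducing the earlier example (records 52, 68, 99, 84, table size 10; collision handling is omitted):

```c
#include <stdio.h>

#define TABLE_SIZE 10

/* Division (remainder) method: hash is the remainder of key / table size. */
int h(int key) {
    return key % TABLE_SIZE;
}

int main(void) {
    int keys[] = {52, 68, 99, 84};
    int n = sizeof keys / sizeof keys[0];

    for (int i = 0; i < n; i++)
        printf("h(%d) = %d\n", keys[i], h(keys[i]));     /* 2, 8, 9, 4 */

    printf("load factor = %d/%d = %.1f\n", n, TABLE_SIZE,
           (double)n / TABLE_SIZE);                      /* 0.4 */
    return 0;
}
```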
Mid square method

In this method, the key is first squared, and then the middle part of the result is taken
as the index.

The mid square method is a very good hash function. It involves squaring
the value of the key and then extracting the middle r digits as the hash
value. The value of r can be decided according to the size of the hash table.

Suppose the hash table has 100 memory locations. So r = 2, because two
digits are required to map the key to a memory location.

k = 50

k² = 2500

h(50) = 50 (the middle two digits of 2500)

The hash value obtained is 50.
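A possible C sketch of the mid-square method with r = 2, assuming for simplicity that the square has four digits, as in the example above:

```c
#include <stdio.h>

/* Mid-square method: square the key, then extract the middle r = 2 digits.
 * For a 4-digit square, the middle two digits are (square / 10) % 100. */
int mid_square(int key) {
    long square = (long)key * key;       /* 50 * 50 = 2500 */
    return (int)((square / 10) % 100);   /* drop last digit, keep middle two */
}

int main(void) {
    printf("h(50) = %d\n", mid_square(50));   /* 2500 -> 50 */
    return 0;
}
```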
Multiplicative Hash Function

A faster but often misused alternative is multiplicative hashing, in which the hash index is computed

as

H(key)= floor(M * frac(KA)).


where A is a real constant suggested by Donald Knuth, A = (√5 − 1)/2 ≈ 0.6180339887, K is the key of the record to be inserted, and M is an integer constant (typically the table size).


Here K is an integer key, A is a real number, and frac is the function that returns the

fractional part of a real number.


Multiplicative hashing sets the hash index from the fractional part of multiplying K by the constant A.

Multiplicative hashing is cheaper than modular hashing because multiplication is usually
considerably faster than division (or mod).

It also works well with a bucket array of size m = 2^p, which is convenient.

K = 45, M = 25, A = 0.6180339887

H(45) = floor(M * frac(K * A))

= floor(25 * frac(45 * 0.6180339887))

= floor(25 * frac(27.8115294915))

= floor(25 * 0.8115294915)

= floor(20.2882...)

H(45) = 20
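The same computation as a C sketch (M = 25 and the key 45 are taken from the example above; link with -lm for floor):

```c
#include <stdio.h>
#include <math.h>

#define M 25                 /* table size */
#define A 0.6180339887       /* Knuth's constant, (sqrt(5) - 1) / 2 */

/* Multiplicative hashing: H(key) = floor(M * frac(key * A)). */
int mult_hash(int key) {
    double product = key * A;
    double frac = product - floor(product);   /* fractional part of K*A */
    return (int)floor(M * frac);
}

int main(void) {
    printf("H(45) = %d\n", mult_hash(45));    /* 25 * 0.8115... -> 20 */
    return 0;
}
```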
Digit Folding Hash Function

In this method the key is interpreted as an integer using some radix (say 10).


The integer is divided into segments, each segment except possibly the last having the same number of digits.


These segments are then added to obtain the home address.


Two folding methods are used. They are:


Fold shift


Fold boundary


Fold Shift


In fold shift the key value is divided into parts whose size matches the size of the required address. Then the

left and right parts are shifted and added with the middle part.
Fold Boundary

In fold boundary the left and right numbers are folded on a fixed boundary
between them and the center number. The two outside values are thus
reversed.


If the need is to generate a 5-digit address, then make units of 4 digits and
add them, which will produce a 4- or 5-digit address.
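As a worked sketch, take the hypothetical key 123456789 split into three 3-digit parts, 123 | 456 | 789; the following C program computes both folding variants:

```c
#include <stdio.h>

int main(void) {
    int left = 123, middle = 456, right = 789;

    /* Fold shift: add the parts as they are. */
    int fold_shift = left + middle + right;        /* 123 + 456 + 789 = 1368 */

    /* Fold boundary: the two outside parts are reversed before adding. */
    int fold_boundary = 321 + middle + 987;        /* 321 + 456 + 987 = 1764 */

    printf("fold shift:    %d\n", fold_shift);
    printf("fold boundary: %d\n", fold_boundary);
    return 0;
}
```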
Digit Analysis

Digit analysis is used with static files. A static file is one in which all the identifiers are known in advance.

Using this method, we first transform the identifiers into numbers using some radix, r.

We then examine the digits of each identifier, deleting those digits that have the most skewed distributions.

We continue deleting digits until the number of remaining digits is small enough to give an address in the range of
the hash table.

The digits used to calculate the hash address must be the same for all identifiers and must not have abnormally
high peaks or valleys (the standard deviation must be small).

For example, consider the identifiers Abrd, gftr, poid, werd, first transformed into the numbers:

2465 2685 2975 2535

Deleting the most skewed digit positions one at a time gives:

246 268 297 253

46 68 97 53

6 8 7 3
Collision

A situation where the resulting hashes for two or more data elements in the data set U map to the
same location in the hash table is called a hash collision.

In such a situation two or more data elements would qualify to be stored/mapped to the same location
in the hash table.

In computer science, a collision or clash is a situation that occurs when two distinct pieces of data
have the same hash value

Hash collisions are practically unavoidable when hashing a random subset of a large set of possible
keys.

For example, if 2,450 keys are hashed into a million buckets, then even with a perfectly uniform distribution, according to the birthday problem there is approximately a 95% chance of at least two of the keys
being hashed to the same slot.

Almost all hash table implementations therefore have some collision resolution strategy to handle such events.

The impact of collisions depends on the application.

In practice, it is never possible to find a hash function which computes unique hashes for each element in the data set U.
Collision Resolution Techniques

Collision resolution techniques are the techniques used for resolving or
handling collisions.

Collision resolution techniques are classified as separate chaining (open hashing) and open addressing (closed hashing).
Separate Chaining

This technique creates a linked list for the slot at which a
collision occurs.

The new key is then inserted in the linked list.

These linked lists hanging off the slots look like chains.

That is why this technique is called separate chaining.
Let us consider a simple hash function as “key mod 7” and sequence of keys as 50, 700, 76, 85, 92, 73, 101.
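A minimal C sketch of separate chaining for this hash function and key sequence follows (colliding keys are pushed onto the front of the slot's linked list; malloc error checking is omitted):

```c
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE 7

struct node {
    int key;
    struct node *next;
};

struct node *table[TABLE_SIZE];   /* each slot is the head of a chain */

void insert(int key) {
    int i = key % TABLE_SIZE;
    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[i];           /* push onto the front of the chain */
    table[i] = n;
}

int search(int key) {
    for (struct node *n = table[key % TABLE_SIZE]; n != NULL; n = n->next)
        if (n->key == key) return 1;
    return 0;
}

int main(void) {
    int keys[] = {50, 700, 76, 85, 92, 73, 101};
    for (int i = 0; i < 7; i++) insert(keys[i]);
    printf("search(700) = %d, search(13) = %d\n", search(700), search(13));
    return 0;
}
```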
Advantages:

1) Simple to implement.

2) The hash table never fills up; we can always add more elements to a chain.

3) Less sensitive to the hash function or load factors.

4) It is mostly used when it is unknown how many and how frequently keys may be inserted or
deleted.
Disadvantages:

1) The cache performance of chaining is poor, because keys are stored in linked lists. Open addressing
provides better cache performance, as everything is stored in the same table.

2) Wastage of Space (Some Parts of hash table are never used)

3) If the chain becomes long, then search time can become O(n) in worst case.

4) Uses extra space for links.
Time Complexity

For Searching-

In worst case, all the keys might map to the same bucket of the hash
table.

In such a case, all the keys will be present in a single linked list.

Sequential search will have to be performed on the linked list to perform
the search.

So, time taken for searching in worst case is O(n).
For Deletion-

In worst case, the key might have to be searched first and
then deleted.

In worst case, time taken for searching is O(n).

So, time taken for deletion in worst case is O(n).
Open Addressing

Like separate chaining, open addressing is a method for handling
collisions.

In Open Addressing, all elements are stored in the hash table itself.

So at any point, the size of the table must be greater than or equal to the total
number of keys. Open addressing is done in the following ways:

Linear Probing

Quadratic Probing

Double Hashing
Linear Probing

In linear probing,

When a collision occurs, we linearly probe for the next bucket.

We keep probing until an empty bucket is found.

Let us consider a simple hash function as “key mod 7” and sequence of
keys as 50, 700, 76, 85, 92, 73, 101.
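A minimal C sketch of linear probing with the same hash function and keys (-1 marks an empty slot; “deleted” markers are not handled here):

```c
#include <stdio.h>

#define TABLE_SIZE 7
#define EMPTY (-1)

int table[TABLE_SIZE] = {EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY};

/* Probe h, h+1, h+2, ... (mod table size) until an empty bucket is found. */
void insert(int key) {
    int i = key % TABLE_SIZE;
    while (table[i] != EMPTY)
        i = (i + 1) % TABLE_SIZE;   /* linear probe to the next bucket */
    table[i] = key;
}

int main(void) {
    int keys[] = {50, 700, 76, 85, 92, 73, 101};
    for (int k = 0; k < 7; k++) insert(keys[k]);
    for (int i = 0; i < TABLE_SIZE; i++)
        printf("slot %d: %d\n", i, table[i]);
    return 0;
}
```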
Advantage-


It is easy to compute.


Linear probing has the best cache performance

Disadvantages-


Primary Clustering: One of the problems with linear probing is primary clustering: many consecutive

elements form groups, and it starts taking longer to find a free slot or to search for an element.


Secondary Clustering: Secondary clustering is less severe; two records only have the same collision

chain (probe sequence) if their initial position is the same.


Time Complexity-

The worst-case time to search for an element with linear probing is O(table size).

This is because even if only one element is present and all other elements
have been deleted, the “deleted” markers left in the hash table force the
search to scan the entire table.
Quadratic Probing:

Quadratic probing is an open-addressing scheme where we look for the (hash(x) + i²)-th slot in the i-th iteration if the given hash

value x collides in the hash table.


Quadratic Probing is similar to Linear probing.


How Quadratic Probing is done?


Let hash(K) be the slot index computed using the hash function.


If the slot hash(K) % S is full, then we try (hash(K) + 1*1) % S.


If (hash(K) + 1*1) % S is also full, then we try (hash(K) + 2*2) % S.


If (hash(K) + 2*2) % S is also full, then we try (hash(K) + 3*3) % S.


This process is repeated for all the values of i until an empty slot is found.

With linear probing we know that we will always find an open spot if one
exists (It might be a long search but we will find it).


However, this is not the case with quadratic probing unless you take care in
the choosing of the table size.


In order to guarantee that our quadratic probes will eventually hit every single available
slot, our table size must meet these requirements:


Be a prime number


never be more than half full (even by one element)

For example, with table_size = 7, the probe sequences for the keys 48, 5, 55, and 7 begin:

48 % 7 = 6, then (48 + 1²) % 7 = 0, then (48 + 2²) % 7 = 3, ...


5 % 7 = 5, then (5 + 1²) % 7 = 6, then (5 + 2²) % 7 = 2, ...


55 % 7 = 6, then (55 + 1²) % 7 = 0, then (55 + 2²) % 7 = 3, ...


7 % 7 = 0, then (7 + 1²) % 7 = 1, then (7 + 2²) % 7 = 4, ...
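These probe sequences can be reproduced with a short C sketch (the keys 48, 5, 55, 7 are inserted in that order; -1 marks an empty slot):

```c
#include <stdio.h>

#define TABLE_SIZE 7
#define EMPTY (-1)

int table[TABLE_SIZE];

/* Try (hash + i*i) % TABLE_SIZE for i = 0, 1, 2, ... until a slot is free. */
int insert(int key) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        int slot = (key % TABLE_SIZE + i * i) % TABLE_SIZE;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;   /* no free slot found along this probe sequence */
}

int main(void) {
    for (int i = 0; i < TABLE_SIZE; i++) table[i] = EMPTY;
    int keys[] = {48, 5, 55, 7};
    for (int k = 0; k < 4; k++)
        printf("%d -> slot %d\n", keys[k], insert(keys[k]));
    return 0;
}
```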
Advantages


Quadratic probing is a very simple and fast way to avoid the clustering problem of linear probing.


It's normally used only when table size is prime (which may also be good for other reasons)


Though quadratic probing isn't perfect, it does offer some advantages over alternatives:


Quadratic probing can be a more efficient algorithm in a closed hash table, since it better avoids the

clustering problem that can occur with linear probing, although it is not immune


simpler logic for storage management


reduced storage requirement in general

Disadvantages


It has secondary clustering. Two keys have the same probe sequence when they hash to the same location.
Double hashing

Double hashing is a collision resolving technique in Open Addressed Hash tables.


Double hashing uses the idea of applying a second hash function to the key when a collision occurs.


The result of the second hash function will be the number of positions from the point of collision to

insert.


There are a couple of requirements for the second function:


it must never evaluate to 0


must make sure that all cells can be probed


First hash function is typically hash1(key) = key % TABLE_SIZE


A popular second hash function is: Hash2(key) = R - ( key % R ) where R is a prime number that is

smaller than the size of the table.
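A C sketch of double hashing built from these two functions, assuming TABLE_SIZE = 7 and R = 5 purely for illustration:

```c
#include <stdio.h>

#define TABLE_SIZE 7
#define R 5          /* a prime smaller than the table size */
#define EMPTY (-1)

int table[TABLE_SIZE];

int hash1(int key) { return key % TABLE_SIZE; }
int hash2(int key) { return R - (key % R); }   /* ranges over 1..R, never 0 */

/* Probe h1, h1 + h2, h1 + 2*h2, ... (mod table size). */
int insert(int key) {
    int h1 = hash1(key), h2 = hash2(key);
    for (int i = 0; i < TABLE_SIZE; i++) {
        int slot = (h1 + i * h2) % TABLE_SIZE;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}

int main(void) {
    for (int i = 0; i < TABLE_SIZE; i++) table[i] = EMPTY;
    int keys[] = {50, 700, 76, 85, 92};
    for (int k = 0; k < 5; k++)
        printf("%d -> slot %d\n", keys[k], insert(keys[k]));
    return 0;
}
```

Because the table size here is prime, every step size 1..5 is co-prime with it, so all cells can be probed.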


Advantages


It drastically reduces clustering.


It requires fewer comparisons.


Double hashing is useful if an application requires a smaller hash table since it effectively

finds a free slot.


Double hashing can find the next free slot faster than the linear probing approach.

Disadvantages


As the table fills up, the performance degrades.


The computational cost may be high, since two hash functions have to be computed.
Rehashing

As the name suggests, rehashing means hashing again


Once the hash table gets too full, the running time for operations will start to take too long and may fail.


To solve this problem, a table at least twice the size of the original will be built and the elements will be transferred to the new

table.


Basically, when the load factor increases to more than its pre-defined value (default value of load factor is 0.75), the complexity

increases.


So to overcome this, the size of the array is increased (doubled) and all the values are hashed again and stored in the new double

sized array to maintain a low load factor and low complexity.


The new size of the hash table:


should also be prime


will be used to calculate the new insertion spot (hence the name rehashing)

This is a very expensive operation! O(N) since there are N elements to
rehash and the table size is roughly 2N.

This is ok though since it doesn't happen that often.

The question becomes when should the rehashing be applied?

Some possible answers:

once the table becomes half full

once an insertion fails

once a specific load factor has been reached, where load factor is the ratio
of the number of elements in the hash table to the table size

To reduce the load factor and the time complexity.
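A simplified C sketch of rehashing on top of linear probing (the 0.75 load-factor trigger follows the note above; for brevity the new size is 2*old + 1 rather than the next prime):

```c
#include <stdio.h>
#include <stdlib.h>

#define EMPTY (-1)

int *table;
int table_size = 7;
int count = 0;        /* number of keys currently stored */

void put(int key);    /* forward declaration, used by rehash() */

/* Grow the table and reinsert every element: an O(N) operation. */
void rehash(void) {
    int *old = table;
    int old_size = table_size;

    table_size = 2 * old_size + 1;   /* ideally: next prime >= 2 * old_size */
    table = malloc(table_size * sizeof *table);
    for (int i = 0; i < table_size; i++) table[i] = EMPTY;

    count = 0;
    for (int i = 0; i < old_size; i++)
        if (old[i] != EMPTY) put(old[i]);   /* every key gets a new position */
    free(old);
}

/* Linear-probing insert that rehashes once the load factor would pass 0.75. */
void put(int key) {
    if ((double)(count + 1) / table_size > 0.75) rehash();
    int i = key % table_size;
    while (table[i] != EMPTY) i = (i + 1) % table_size;
    table[i] = key;
    count++;
}

int main(void) {
    table = malloc(table_size * sizeof *table);
    for (int i = 0; i < table_size; i++) table[i] = EMPTY;

    int keys[] = {50, 700, 76, 85, 92, 73, 101};
    for (int k = 0; k < 7; k++) put(keys[k]);
    printf("table size after inserts: %d\n", table_size);   /* grew from 7 to 15 */
    return 0;
}
```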
Extendible Hashing

Extendible Hashing is a dynamic hashing method wherein directories, and buckets are

used to hash data.


It is an aggressively flexible method in which the hash function also experiences dynamic

changes.


Main features of Extendible Hashing: The main features in this hashing technique are:


Directories: The directories store addresses of the buckets in pointers. An id is assigned to

each directory which may change each time when Directory Expansion takes place.


Buckets: The buckets are used to hash the actual data.
Frequently used terms in Extendible Hashing:

Directories: These containers store pointers to buckets. Each directory is given a unique id which

may change each time when expansion takes place. The hash function returns this directory id

which is used to navigate to the appropriate bucket. Number of Directories = 2^Global Depth.


Buckets: They store the hashed keys. Directories point to buckets. A bucket may have more

than one pointer to it if its local depth is less than the global depth.


Global Depth: It is associated with the Directories. They denote the number of bits which are

used by the hash function to categorize the keys. Global Depth = Number of bits in directory id.

Local Depth: It is the same as the Global Depth except that Local Depth is
associated with the buckets rather than the directories. The local depth, in
accordance with the global depth, is used to decide the action to be performed
when an overflow occurs. Local Depth is always less than or equal to the Global Depth.


Bucket Splitting: When the number of elements in a bucket exceeds a particular size, then
the bucket is split into two parts.


Directory Expansion: Directory Expansion Takes place when a bucket overflows.
Directory Expansion is performed when the local depth of the overflowing bucket is equal
to the global depth.
Basic Working of Extendible Hashing:
Steps to Follow

Step 1 – Analyze Data Elements: Data elements may exist in various forms, e.g., integer, string, float, etc. Here, let us

consider data elements of type integer, e.g., 49.


Step 2 – Convert into binary format: Convert the data element in Binary form. For string elements, consider the ASCII

equivalent integer of the starting character and then convert the integer into binary form. Since we have 49 as our data element, its

binary form is 110001.


Step 3 – Check Global Depth of the directory. Suppose the global depth of the Hash-directory is 3.


Step 4 – Identify the Directory: Consider the ‘Global-Depth’ number of LSBs in the binary number and match it to the directory

id.


Eg. The binary obtained is: 110001 and the global-depth is 3. So, the hash function will return 3 LSBs of 110001 viz. 001.


Step 5 – Navigation: Now, navigate to the bucket pointed by the directory with directory-id 001.


Step 6 – Insertion and Overflow Check: Insert the element and check if the bucket overflows. If an overflow is encountered, go

to step 7 followed by Step 8, otherwise, go to step 9.



Step 7 – Tackling the Overflow Condition during Data Insertion: Many times, while inserting data into the buckets, it might
happen that a bucket overflows. In such cases, we need to follow an appropriate procedure to avoid mishandling of data.


First, check whether the local depth is less than or equal to the global depth, then choose one of the cases below.


Case1: If the local depth of the overflowing Bucket is equal to the global depth, then Directory Expansion, as well as Bucket
Split, needs to be performed. Then increment the global depth and the local depth value by 1. And, assign appropriate pointers.


Directory expansion will double the number of directories present in the hash structure.


Case2: In case the local depth is less than the global depth, then only Bucket Split takes place. Then increment only the local
depth value by 1. And, assign appropriate pointers.


Step 8 – Rehashing of Split Bucket Elements: The Elements present in the overflowing bucket that is split are rehashed w.r.t
the new global depth of the directory.


Step 9 – The element is successfully hashed
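The directory id used in Steps 3–5 is just the Global-Depth least significant bits of the key, which can be computed with a bit mask; a small C sketch:

```c
#include <stdio.h>

/* Directory index = the 'global depth' least significant bits of the key.
 * For key 49 (binary 110001) and global depth 3, the index is 001 = 1. */
int directory_index(int key, int global_depth) {
    return key & ((1 << global_depth) - 1);   /* keep only the low bits */
}

int main(void) {
    printf("directory of 49 at depth 3: %d\n", directory_index(49, 3));  /* 001 -> 1 */
    printf("directory of 26 at depth 3: %d\n", directory_index(26, 3));  /* 010 -> 2 */
    return 0;
}
```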
Handling the Overflow Condition
Example

let us consider a prominent example of hashing the following elements: 16,4,6,22,24,10,31,7,9,20,26.


Bucket Size: 3 (Assume)


Hash Function: Suppose the global depth is X. Then the Hash Function returns X LSBs.


First, calculate the binary forms of each of the given numbers.


4- 00100


16- 10000


6- 00110


22- 10110


24- 11000


10- 01010


31- 11111


7- 00111


9- 01001


20- 10100


26- 11010

Initially, the global depth and local depth are both 1. Thus, the hashing
frame looks like this:


Inserting 16:


The binary format of 16 is 10000 and the global depth is 1. The hash function
returns 1 LSB of 10000, which is 0. Hence, 16 is mapped to the directory with id 0.

Inserting 4 and 6:


Both 4 (00100) and 6 (00110) have 0 as their LSB. Hence, they are hashed as
follows:

Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket
pointed to by directory 0 is already full. Hence, an overflow occurs.

Inserting 24 and 10: 24 (11000) and 10 (01010) can be hashed based on
directories with ids 00 and 10. Here, we encounter no overflow condition.

Inserting 31, 7, 9: All of these elements [31 (11111), 7 (00111), 9 (01001)] have
either 01 or 11 as their two LSBs.


Hence, they are mapped to the buckets pointed to by directories 01 and 11. We do not
encounter any overflow condition here.

Inserting 20: Insertion of data element 20 (10100) will again
cause the overflow problem.

20 is inserted in bucket pointed out by 00. As directed by Step 7-
Case 1, since the local depth of the bucket = global-depth, directory
expansion (doubling) takes place along with bucket splitting.
Elements present in overflowing bucket are rehashed with the new
global depth. Now, the new Hash table looks like this:

Inserting 26: Global depth is 3. Hence, 3 LSBs of 26(11010) are considered.
Therefore 26 best fits in the bucket pointed out by directory 010.

The bucket overflows, and, as directed by Step 7-Case 2, since the local
depth of bucket < Global depth (2<3), directories are not doubled but,
only the bucket is split and elements are rehashed.


Finally, the output of hashing the given list of numbers is obtained.
Advantages:


Data retrieval is less expensive (in terms of computing).


No problem of Data-loss since the storage capacity increases dynamically.


With dynamic changes in hashing function, associated old values are rehashed w.r.t the new hash function.

Limitations Of Extendible Hashing:


The directory size may increase significantly if several records hash to the same directory, i.e.,

when the record distribution is non-uniform.


Size of every bucket is fixed.


Memory is wasted in pointers when the global depth and local depth difference becomes drastic.


This method is complicated to code.
Applications of Extendible hashing


Extendible hashing can be used in applications where exact-match queries are the most important kind of query, such as hash joins.
BIBLIOGRAPHY
TEXT BOOKS:

1. Pradip Dey, Manas Ghosh, Reema Thareja, “Computer Programming and Data Structures” – [ Module-1]

2. E. Balagurusamy, “Computer Programming and Data Structures”, 4th Edition, Tata McGraw Hill Education Private Limited – [ Module-2, 3, 5]

3. Narasimha Karumanchi, “Data Structures and Algorithms Made Easy”, 5th Edition, CareerMonk Publications – [ Module-4]

4. Dr. N.B. Venkateswarlu, Dr. E.V. Prasad, “C and Data Structures”, S. Chand – [ Module-6]

5. Reema Thareja, “Data Structures using C”, 2nd Edition, Oxford – [ Module-7, 8, 9, 10]
BIBLIOGRAPHY
REFERENCE BOOKS:
1. Mark Allen Weiss, “Data Structures and Algorithm Analysis in C++”, Third Edition, Pearson

2. Michael T. Goodrich, Roberto Tamassia, David Mount, “Data Structures and Algorithms in C++”, Wiley India Edition

3. Herbert Schildt, “The Complete Reference”, Fourth Edition

4. Aaron M. Tenenbaum, Yedidyah Langsam, Moshe J. Augenstein, “Data Structures using C”, Pearson Education

5. Sahni, “Data Structures, Algorithms and Applications in C++”, McGraw-Hill International Edition

6. Gilberg, Forouzan, “Data Structures: A Pseudocode Approach with C++”, Cengage Learning

7. Behrouz A. Forouzan, Richard F. Gilberg, “C Programming and Data Structures”, Cengage
