
Hashing

Uploaded by

l227437
Copyright
© All Rights Reserved

DSA CS2002

Hashing
Ref.: Sec. 8.2, Data Structures in C++;
Sec. 10.5 in the textbook
Purpose:
Hashing makes data retrieval much faster. Assume we have keys
ranging from 1…N and a table of size M, where M is smaller than N.
Define a function h: {1..N} → {0..M-1} and let h(x) be the location
where a record with key x should be stored. This function h is the
hash function.

In static hashing the keys are stored in a fixed-size table
called the hash table, ht. The hash table is partitioned into b
buckets, ht[0], …, ht[b-1] (b plays the role of the table size M
above). Each bucket can store s records, so a bucket is said to
consist of s slots. A hash function h(x) performs an identifier
transformation on x: it maps the set of possible identifiers (keys)
onto the integers 0 through b-1.
Hashing
Identifier density = n/T, where n is the number of
identifiers in the table and T is the total number
of possible identifiers.
Loading factor: α = n/(sM), where s is the number of
slots per bucket and M is the number of buckets.
Overflow: occurs when a new identifier is hashed by
h into a full bucket.
Collision: occurs when two nonidentical identifiers
are hashed into the same bucket. When the bucket
size s is 1, collision and overflow occur
simultaneously.
Example
Consider a hash table ht with b = 26 buckets and s = 2.
Assume that there are n = 10 unique identifiers and that
each identifier begins with a letter. The loading factor
for this table is 10/52 = 0.19. The hash function h must
map each of the possible identifiers into one of the
numbers 0 to 25. Suppose the internal binary representation
of the letters A to Z corresponds to the numbers 0 to 25,
and take the hash function:
h(x) = first character of x
The identifiers GA, D, A, G, L, A2, A1, A3, A4 and E
are then hashed into buckets 6, 3, 0, 6, 11, 0, 0, 0, 0, 4
respectively by this function.
At address 0, an overflow occurs when identifier A1 is
hashed into bucket ht[0], because A and A2 already
occupy its two slots.
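
As a rough illustration of the example above, the following Python sketch hashes the ten identifiers by first letter and records which ones overflow when each bucket has s = 2 slots. The names first_char_hash and insert_all, and the list-of-lists table layout, are assumptions made for this sketch and do not come from the slides.

```python
B = 26  # number of buckets (one per letter)
S = 2   # slots per bucket

def first_char_hash(identifier):
    """Map an identifier to 0..25 using its first letter (A -> 0, ..., Z -> 25)."""
    return ord(identifier[0].upper()) - ord("A")

def insert_all(identifiers, b=B, s=S):
    """Place identifiers into buckets; return (table, list of overflowed identifiers)."""
    table = [[] for _ in range(b)]
    overflowed = []
    for ident in identifiers:
        bucket = table[first_char_hash(ident)]
        if len(bucket) < s:
            bucket.append(ident)
        else:
            # The home bucket is full: this is an overflow.
            overflowed.append(ident)
    return table, overflowed

identifiers = ["GA", "D", "A", "G", "L", "A2", "A1", "A3", "A4", "E"]
addresses = [first_char_hash(x) for x in identifiers]
table, overflowed = insert_all(identifiers)
loading_factor = len(identifiers) / (S * B)

print(addresses)
print(overflowed)
print(round(loading_factor, 2))
```

The sketch only reports overflows; it does not resolve them. Resolving them is the job of the overflow handling methods discussed later.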

When no overflows occur, the time required
to enter or search for identifiers using
hashing depends only on the time required
to compute the hash function h and the time to
search one bucket. Since the bucket size, s, is
usually small, the search for an identifier
within a bucket is carried out using
sequential search. This time is independent
of n.
Hash function
• Desired properties of a hash function
o It should be easy to compute
o It should minimize the number of collisions
o It should depend on all characters in the
identifier
o For random inputs, it should not result in biased
use of the hash table, i.e., for an identifier x the
probability that h(x) = i should be 1/b for all
buckets i. Such a hash function is called a
uniform hash function.
Hash Function
A simple type of hash function uses the modulo
operator (%):
h(x) = x % M
This function gives bucket addresses in the
range 0 to (M-1). The choice of M is critical.
If M is a power of 2, then h(x) depends only
on the least significant bits of x.
E.g., if we use 5 bits to represent each character of a key:
AKEY
00001 01011 00101 11001 = 1*32^3 + 11*32^2 + 5*32 + 25 = 44217
Now with M = 2^i, i <= 5, all identifiers ending with the character
'Y' will have the same bucket address, i.e., the bucket address
depends on only one part of the key. The use of the hash table is
thus biased.
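
The AKEY calculation can be checked with a short Python sketch. It assumes the 5-bit code A = 1, B = 2, …, Z = 26 implied by the bit patterns above. The helper names encode and bucket are hypothetical.

```python
def encode(key):
    """Pack a key of capital letters into an integer, 5 bits per character (A -> 1, ..., Z -> 26)."""
    value = 0
    for ch in key:
        value = value * 32 + (ord(ch) - ord("A") + 1)
    return value

def bucket(key, m):
    """Modulo hash h(x) = x % M applied to the encoded key."""
    return encode(key) % m

print(encode("AKEY"))

# With M = 32 (a power of 2) only the last character matters:
for key in ("AKEY", "KEY", "BY"):
    print(key, bucket(key, 32))

# With a prime M the earlier characters also influence the address:
for key in ("AKEY", "BY"):
    print(key, bucket(key, 13))
```

Every key ending in 'Y' lands in bucket 25 when M = 32, whereas with M = 13 the two sample keys fall into different buckets.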
Hash Function
Similarly, if M is divisible by 2, odd keys are
mapped to odd buckets and even keys are mapped to
even buckets.

These difficulties can be avoided by making M a
prime number, since the only factors of M are then M and 1.
In practice it has been observed that it is sufficient to
choose M such that it has no prime divisors less than
20.
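
The two observations above (parity is preserved when M is even, and M should be prime or at least free of small prime divisors) can be tried out with the sketch below. The function names and the trial-division primality test are assumptions of this sketch, chosen for brevity rather than speed.

```python
SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19)  # all primes less than 20

def is_prime(m):
    """Trial-division primality test; adequate for table-size-sized numbers."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def is_reasonable_table_size(m):
    """True if m is prime, or if m has no prime divisor less than 20."""
    if is_prime(m):
        return True
    return m > 1 and all(m % p != 0 for p in SMALL_PRIMES)

def parity_preserved(m, keys):
    """True if every key lands in a bucket with the same parity as the key."""
    return all((x % m) % 2 == x % 2 for x in keys)

print(parity_preserved(12, range(1000)))   # even M keeps odd keys in odd buckets
print(parity_preserved(13, range(1000)))   # a prime M does not
print(is_reasonable_table_size(26), is_reasonable_table_size(13))
```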
Overflow Handling
Two ways to handle overflows :
• Open addressing
• Chaining
Overflow Handling :Open Addressing

In open addressing, we assume the hash table is an array and
that s = 1. The hash table, ht, is initialized so that each slot
contains the null identifier. When a new identifier is hashed into a
full bucket, we need to find another bucket for this identifier. The
simplest solution is to find the closest unfilled bucket, scanning
forward from the home bucket and wrapping around at the end of the
table.
This is called linear probing or linear open addressing.
Example 8.3: Assume we have M = 26 and s = 1 and the
following identifiers, inserted in this order: GA, D, A, G, L, A2,
A1, A3, A4, Z, ZA, E
h(x) = first character of x
Overflow Handling :Open Addressing
When linear open addressing is used to handle overflows, a
hash table search for identifier x proceeds as follows:

(1) Compute h(x)
(2) Examine the identifiers at positions ht[h(x)], ht[h(x)+1],
…, ht[h(x)+j], in this order (indices wrap around modulo M),
until one of the following happens:
(a) ht[h(x)+j] = x; in this case x is found.
(b) ht[h(x)+j] is null; x is not in the table.
(c) We return to the starting position ht[h(x)]; the table is
full and x is not in the table.
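
The search steps above, together with the matching insertion, could be sketched in Python as follows. The class name LinearProbingTable, its method names, and the use of None as the null identifier are assumptions of this sketch, not something the slides prescribe.

```python
class LinearProbingTable:
    """Hash table with bucket size s = 1 and linear open addressing."""

    def __init__(self, m, h):
        self.m = m              # number of buckets
        self.h = h              # hash function returning a value in 0..m-1
        self.ht = [None] * m    # None plays the role of the null identifier

    def insert(self, x):
        """Store x in the first free bucket at or after h(x); return its position."""
        start = self.h(x)
        for j in range(self.m):
            pos = (start + j) % self.m
            if self.ht[pos] is None or self.ht[pos] == x:
                self.ht[pos] = x
                return pos
        raise OverflowError("hash table is full")

    def search(self, x):
        """Return the position of x, or None if x is not in the table."""
        start = self.h(x)
        for j in range(self.m):
            pos = (start + j) % self.m
            if self.ht[pos] == x:      # case (a): found
                return pos
            if self.ht[pos] is None:   # case (b): empty bucket, x is absent
                return None
        return None                    # case (c): back at the start, table full

def first_char_hash(x):
    return ord(x[0].upper()) - ord("A")

table = LinearProbingTable(26, first_char_hash)
for ident in ["GA", "D", "A", "G", "L", "A2", "A1", "A3", "A4", "Z", "ZA", "E"]:
    table.insert(ident)

print(table.search("ZA"), table.search("B"))
```

With the identifiers of example 8.3, ZA ends up far from its home bucket 25 because the probe sequence wraps around to the start of the table.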
Overflow Handling :Open Addressing - Analysis
For example 8.3:
Number of buckets examined is 1 for A, 2 for A2, 3 for
A1, 1 for D, 5 for A3, …, 10 for ZA. This gives a total of 39
buckets examined for 12 identifiers, an average of 3.25
buckets per identifier.

Expected average number of identifier probes, P, to
look for an identifier: P = (2-α)/(2-2α), where α is the
loading density.
For the above example, α = 12/26 ≈ 0.46 and P ≈ 1.43.
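
The probe counts for example 8.3 can be reproduced with a small simulation. The sketch below assumes the identifiers are inserted in the order listed in the example. The helper names build, probes and expected_probes are hypothetical.

```python
def first_char_hash(x):
    return ord(x[0].upper()) - ord("A")

def build(identifiers, m):
    """Insert identifiers with linear probing (assumes the table never fills up)."""
    ht = [None] * m
    for x in identifiers:
        pos = first_char_hash(x) % m
        while ht[pos] is not None:
            pos = (pos + 1) % m
        ht[pos] = x
    return ht

def probes(ht, x):
    """Number of buckets examined to find x (assumes x is in the table)."""
    m = len(ht)
    pos = first_char_hash(x) % m
    count = 1
    while ht[pos] != x:
        pos = (pos + 1) % m
        count += 1
    return count

def expected_probes(alpha):
    """P = (2 - alpha) / (2 - 2 * alpha) for a uniform hash function."""
    return (2 - alpha) / (2 - 2 * alpha)

identifiers = ["GA", "D", "A", "G", "L", "A2", "A1", "A3", "A4", "Z", "ZA", "E"]
ht = build(identifiers, 26)
counts = {x: probes(ht, x) for x in identifiers}
total = sum(counts.values())

print(counts)
print(total, total / len(identifiers))
print(round(expected_probes(12 / 26), 2))
```

The measured average is well above the value of P, which is expected here: P assumes a uniform hash function, while the first-character hash sends five of the twelve identifiers to bucket 0.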
Overflow Handling :Open Addressing - Analysis
• The problem with linear open addressing is that it tends to
create clusters of identifiers. Moreover, these clusters tend to
merge as more identifiers are entered, leading to bigger
clusters.
• Worst-case performance is poor: a search may have to examine
every bucket in the table.
• Example of average behaviour (average number of comparisons
per search):
M = 11113, n = 10000, α = 0.9
Hashing using linear probing: 5.5
Binary search: 12.29
Linear search: 5000
• Some improvement in the growth of clusters, and hence
in the average number of probes needed for a search, can
be obtained by quadratic probing or by using rehashing
techniques.
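
The three figures in the example of average behaviour can be recomputed as below. The linear probing value uses the formula P = (2-α)/(2-2α). The binary search value is taken here to be log2(n) - 1 and the linear search value n/2, which are assumptions about how the slide's numbers were obtained, although they do reproduce them.

```python
import math

def linear_probing_probes(alpha):
    """Expected probes for a successful search with linear probing."""
    return (2 - alpha) / (2 - 2 * alpha)

M, n = 11113, 10000
alpha = n / M                       # about 0.9

hashing = linear_probing_probes(alpha)
binary = math.log2(n) - 1           # assumed average for binary search
linear = n / 2                      # assumed average for sequential search

print(round(alpha, 2), round(hashing, 1), round(binary, 2), linear)
```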
Homework
Q1. Using the simple linear probing method, show how the
following sequence of numbers will be inserted into a
hash table, using the modulo function as the hash
function and a table size of 13:
13 17 21 12 9 8 6 26 0 4 34 47
Overflow Handling : Chaining
In linear probing, the search for an identifier
involves comparisons with identifiers that have
different hash values.
Using separate chaining, we reduce the
number of comparisons carried out by
maintaining lists of identifiers:
• One list is kept per bucket, each list containing all the
synonyms for that bucket.
• A search involves computing the hash function h(x) and
examining only those identifiers in the list for h(x).
• As the sizes of these lists are not known in advance, the
best way to maintain them is as linked chains.
Overflow Handling : Chaining
• Each chain has a head node
• The size of head node is small as it only
retains a link
• Since the lists are to be accessed at
random, the head nodes should be stored
sequentially (e.g., in an array). We assume
they are numbered 0 to b-1 if the hash
function h has range 0 to b-1.
Overflow Handling : Chaining
Using chaining for example 8.3, which had M = 26 and s = 1 and the
following identifiers: GA, D, A, G,
L, A2, A1, A3, A4, Z, ZA, E
h(x) = first character of x

In this example new identifiers
are inserted at the front of the
chains. The number of probes
needed for A4, D, E, G, L
and ZA is 1; for A3, GA and
Z it is 2; for A1, 3; for A2, 4;
for A, 5. This gives a total of 24
and an average of 2, which is
less than the average of 3.25 for
linear probing.
Overflow Handling : Chaining - Search
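
A chain search might look like the Python sketch below, which also rebuilds the chains of example 8.3 with insertion at the front. The Node class, the function names, and the array of head links represented as a Python list are assumptions made for illustration.

```python
class Node:
    """One element of a linked chain."""
    def __init__(self, key, link=None):
        self.key = key
        self.link = link

def first_char_hash(x):
    return ord(x[0].upper()) - ord("A")

def insert_front(heads, x):
    """Insert x at the front of the chain for bucket h(x)."""
    i = first_char_hash(x)
    heads[i] = Node(x, heads[i])

def search(heads, x):
    """Return (found, number of identifier comparisons made)."""
    node = heads[first_char_hash(x)]
    comparisons = 0
    while node is not None:
        comparisons += 1
        if node.key == x:
            return True, comparisons
        node = node.link
    return False, comparisons

identifiers = ["GA", "D", "A", "G", "L", "A2", "A1", "A3", "A4", "Z", "ZA", "E"]
heads = [None] * 26               # head links numbered 0 to b-1
for ident in identifiers:
    insert_front(heads, ident)

total = sum(search(heads, x)[1] for x in identifiers)
print(total, total / len(identifiers))
```

Only identifiers that hash to the same bucket are ever compared, so searching for an absent identifier such as B costs no comparisons at all when its chain is empty.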
Overflow Handling : Chaining - Analysis

• The expected number of identifier comparisons = (1 + α/2),
where α is the loading density n/b
• Deletion is possible and easy

As long as a uniform hash function is used, the
performance of a hash table depends only on
the method used to handle overflows.
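
To see how much the overflow handling method matters, the two expected-cost formulas from these slides can be tabulated side by side. This is a small sketch with hypothetical function names, and the loading densities are chosen arbitrarily.

```python
def chaining_comparisons(alpha):
    """Expected identifier comparisons with chaining: 1 + alpha / 2."""
    return 1 + alpha / 2

def linear_probing_probes(alpha):
    """Expected probes with linear open addressing: (2 - alpha) / (2 - 2 * alpha)."""
    return (2 - alpha) / (2 - 2 * alpha)

rows = [(a, linear_probing_probes(a), chaining_comparisons(a))
        for a in (0.5, 0.75, 0.9)]

for alpha, probing, chaining in rows:
    print(f"alpha={alpha:<5} linear probing={probing:.2f} chaining={chaining:.3f}")
```

As α approaches 1 the linear probing cost grows without bound, while the chaining cost stays below 1.5.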
Homework
Q1. Given M=11 and the following sequence of numbers,
show the final hash table using separate chaining and
the simple modulo function as our hash function:
112 41 34 2 98 16 35 74 77 0 32
