
CT 363 – Design and Analysis of Algorithms
Lecture# 05: Hashing

Instructor: Engr. Vanesh Kumar

Introduction
• Searching methods studied so far:
• Linear search (O(n) search time)
• Binary search (O(lg n) search time)

• A map, symbol table, or dictionary is an ADT composed of a collection of (key, value) pairs, such that each key is distinct
• Example: the collection of student records in a class (SID, Student Data)

• Typical dictionary operations include Insert, Search, and Delete

• The storage size depends on the number of distinct keys
• Is it possible to improve search time?

Hash Table
• A hash table is a data structure that stores elements and allows insertions, lookups, and deletions to be performed in O(1) average time.

• A hash table is an alternative method for representing a dictionary.

• In a hash table, a hash function is used to map keys into positions in a table. This act is called hashing.

• Hash Table Operations
• Search: compute h(k) and see if a pair exists at that position
• Insert: compute h(k) and place the pair in that position
• Delete: compute h(k) and delete the pair at that position

• In the ideal situation, a hash table search, insert, or delete takes Θ(1)
• No comparisons required

Hash Table Example
• Internet routers are a good example of why hash tables are required.

• A router table (especially in the routers in the backbone networks of internet operators) may contain hundreds of thousands or millions of entries.

• When a packet has to be routed to a specific IP address, the router has to determine the best route by querying the router table efficiently. Hash tables are used as an efficient lookup structure, with the IP address as key and the path to be followed for that address as value.

How does it work
• The table part is just an ordinary array; it is the hash that we are interested in.

• The hash is a function that transforms a key into the address or index of the array (table) where the record will be stored.
• A key could be an integer, a string, etc.
• e.g. a name or ID that is part of a large employee structure

• The size of the array is TableSize.
• Each key is mapped to some number in the range 0 to TableSize – 1.

• If h is a hash function and k is a key, then h(k) is called the hash of the key and is the index at which a record with the key k should be placed.

• The hash function generates this address by performing some simple arithmetic or logical operations on the key.
Example (ideal) Hash Function
• Pairs are: (22,a), (33,c), (3,d), (72,e), (85,f)
• (key, value) pairs

• Hash table is HT[0:7], m = 8 (where m is the number of positions in the hash table)

• Hash function h is k % m = k % 8
• Where are the pairs stored?

[0]     [1]     [2]  [3]    [4]  [5]     [6]     [7]
(72,e)  (33,c)       (3,d)       (85,f)  (22,a)
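The placement above can be checked with a minimal sketch: each key's slot is just k % 8, and in this ideal case no two keys share a slot.

```python
# Ideal-case placement: m = 8 slots, h(k) = k % 8, no collisions occur.

def h(k, m=8):
    return k % m

pairs = [(22, 'a'), (33, 'c'), (3, 'd'), (72, 'e'), (85, 'f')]
table = [None] * 8
for key, value in pairs:
    table[h(key)] = (key, value)

print(table)
# Slot 0 holds (72,'e'), slot 1 (33,'c'), slot 3 (3,'d'),
# slot 5 (85,'f'), slot 6 (22,'a'); slots 2, 4, 7 stay empty.
```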


Characteristics of a Hash Function
• Distributes data "uniformly" across the entire set of possible hash values.

• Generates very different hash values for similar strings
• Strings such as pt and pts should hash to different slots

• When applied to equal objects, returns the same number for each.

• When applied to unequal objects, is very unlikely to return the same number for each.

• Derives a hash value that is independent of any patterns that may exist in the distribution of the keys.
Finding the Hash Function
• How can we come up with this magic function?
• In general, we cannot: there is no such magic function ☹
• In a few specific cases, where all the possible values are known in advance, it has been possible to compute a perfect hash function

• What is the next best thing?
• A perfect hash function would tell us exactly where to look
• In general, the best we can do is a function that tells us where to start looking!

Hash Function Methods
Division Hash Method: Map a key k into one of the m slots by taking the remainder of k divided by m.
h(k) = k mod m

Advantages:
• fast, requires only one operation

Disadvantages:
• Certain values of m are bad, e.g.,
• Powers of 2
• Non-prime numbers

Generally a prime number is the best choice to spread keys evenly.
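A minimal sketch of the division method, illustrating why a power-of-2 table size is a bad choice: with m a power of 2, h(k) depends only on the low-order bits of k, so keys sharing those bits all collide, while a prime m spreads them out.

```python
# Division method: h(k) = k % m.

def division_hash(k, m):
    return k % m

keys = [8, 16, 24, 32, 40]                   # all multiples of 8
print([division_hash(k, 8) for k in keys])   # m = 8: every key lands in slot 0
print([division_hash(k, 7) for k in keys])   # m = 7 (prime): slots 1, 2, 3, 4, 5
```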
Hash Function Methods
The Folding Method: The key K is partitioned into a number of parts, each of which has the same length as the required address, with the possible exception of the last part.

• These parts are then added together, ignoring the final carry, to form an address.

Example:
• If key = 356942781 is to be transformed into a three-digit address
• P1 = 356, P2 = 942, P3 = 781 add up to 2079; ignoring the carry yields 079
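The folding example above can be sketched as follows; the modulo by 10³ plays the role of "ignoring the final carry" for a three-digit address.

```python
# Folding method: split the decimal key into address-sized parts,
# add them, and drop any carry beyond the address length.

def folding_hash(key, digits=3):
    s = str(key)
    parts = [int(s[i:i + digits]) for i in range(0, len(s), digits)]
    return sum(parts) % 10 ** digits   # modulo drops the final carry

print(folding_hash(356942781))  # 356 + 942 + 781 = 2079 -> address 079 (printed as 79)
```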

Hash Function Methods
The Mid-Square Method: The key K is multiplied by itself and the address is obtained by selecting an appropriate number of digits from the middle of the square.

The number of digits selected depends on the size of the table.

Example:
• If key = 123456 is to be transformed
• (123456)² = 15241383936
• If a 3-digit address is required, positions 5 to 7 are chosen, giving 138
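The mid-square example above can be sketched directly; the digit positions are 1-indexed as on the slide.

```python
# Mid-square method: square the key, then take `width` digits of the
# square starting at (1-indexed) position `start`.

def mid_square_hash(key, start=5, width=3):
    square = str(key * key)
    return int(square[start - 1:start - 1 + width])

print(mid_square_hash(123456))  # 123456^2 = 15241383936, digits 5-7 give 138
```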

Hashing a String Key
• Table size [0..99]
• A..Z ---> 1, 2, ..., 26
• 0..9 ---> 27, ..., 36
• Key: CS1 ---> 3, 19, 28 (concatenated) = 31,928
• (31,928)² = 1,019,397,184 (10 digits)
• Extract the middle 2 digits (5th and 6th), as the table size is 0..99. We get 39, so: H(CS1) = 39
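The string-key scheme above can be sketched as follows: letters map to 1..26, digits to 27..36, the codes are concatenated, and mid-square picks the 5th and 6th digits of the square for a table of size 100.

```python
# String-key hashing as on the slide: encode, concatenate, mid-square.

def char_code(c):
    if c.isalpha():
        return ord(c.upper()) - ord('A') + 1   # A..Z -> 1..26
    return int(c) + 27                          # 0..9 -> 27..36

def string_hash(key):
    concat = int(''.join(str(char_code(c)) for c in key))  # 'CS1' -> 31928
    square = str(concat * concat)                           # '1019397184'
    return int(square[4:6])                                 # 5th and 6th digits

print(string_hash('CS1'))  # 39
```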

Example (Imperfect) Hash Function
• Pairs are: (22,a), (33,c), (3,d), (72,e), (85,f)
• (key, value) pairs

• Hash table is HT[0:7], m = 8 (where m is the number of positions in the hash table)
• Hash function h is k % m = k % 8
• Where would a new pair (57, g) be stored?
• 57 % 8 = 1, but slot 1 already holds (33,c). Now what?

[0]     [1]     [2]  [3]    [4]  [5]     [6]     [7]
(72,e)  (33,c)       (3,d)       (85,f)  (22,a)


Collisions
• Two or more keys hashing to the same slot leads to a collision.

• Collisions are normally treated as "first come, first served": the first value that hashes to the location gets it.

• We have to find something to do with the second and subsequent values that hash to this same location.

• There are two methods to resolve collisions:
• Separate chaining
• Open addressing
  • Linear probing
  • Quadratic probing
  • Double hashing

Separate Chaining
• The idea is to keep a list of all elements that hash to the same value.
• The array elements are pointers to the first nodes of the lists.
• A new item is inserted at the front of the list.

• Choosing the size of the table
• Small enough not to waste space
• Large enough such that the lists remain short
• Typically 1/5 or 1/10 of the total number of elements

• Advantages:
• Better space utilization for a large number of items.
• Simple collision handling: searching a linked list.
• Overflow: we can store more items than the hash table size.
• Deletion is quick and easy: deletion from the linked list.

Separate Chaining Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10

0: 0
1: 81 → 1
2: (empty)
3: (empty)
4: 64 → 4
5: 25
6: 36 → 16
7: (empty)
8: (empty)
9: 49 → 9

Separate Chaining Operations
• Initialization: all entries are set to NULL

• Find:
• Locate the cell using the hash function.
• Do a sequential search on the linked list in that cell.

• Insertion:
• Locate the cell using the hash function.
• (If the item does not exist) insert it as the first item in the list.

• Deletion:
• Locate the cell using the hash function.
• Delete the item from the linked list.
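The operations above can be sketched with a minimal chained table. As an assumption for the sketch, Python lists stand in for the slide's linked lists; new items still go to the front of the chain.

```python
# Minimal separate-chaining hash table; each bucket is a front-inserted chain.

class ChainedHashTable:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]   # all chains start empty

    def _hash(self, key):
        return key % self.size

    def insert(self, key, value):
        bucket = self.buckets[self._hash(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                   # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.insert(0, (key, value))     # new item at the front of the chain

    def search(self, key):
        for k, v in self.buckets[self._hash(key)]:   # sequential search in chain
            if k == key:
                return v
        return None

    def delete(self, key):
        i = self._hash(key)
        self.buckets[i] = [(k, v) for (k, v) in self.buckets[i] if k != key]

t = ChainedHashTable()
for k in [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]:
    t.insert(k, str(k))
print(t.search(81))   # found in bucket 1, which also chains key 1
t.delete(81)
print(t.search(81))   # None after deletion
```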

Separate Chaining Analysis
• We hope that the buckets are roughly equal in size, so that the lists will be short.

• Load factor λ definition:
• Ratio of the number of elements (N) in a hash table to the hash TableSize.
• i.e. λ = N/TableSize
• The average length of a list is also λ.
• For chaining, λ is not bounded by 1; it can be > 1.
• If we can estimate N and choose TableSize to be roughly as large, then the average bucket will have only one or two members.

Separate Chaining Analysis
Average time per dictionary operation:
• TableSize buckets, N elements in the dictionary 🡺 average N/TableSize elements per bucket

• Insert, search, and remove operations take O(1 + λ) time each
• If inserted at the head of the list, insert takes O(1) time

• Average successful search cost = 1 + λ/2

• How long does it take to search for an element with a given key?
• Worst case: Θ(n)
• If we can choose TableSize to be about N, search time is constant
• Assuming each element is equally likely to hash to any bucket, the running time is constant, independent of N

Open Addressing
• If we have enough contiguous memory to store all the keys (TableSize > N) 🡺 store the keys in the table itself
• No need to use linked lists anymore
• Generally the load factor should be below 0.5.

• Basic idea:
• Insertion: if a slot is full, try another one, until you find an empty one
• Search: follow the same sequence of probes
• Deletion: more difficult ... (we'll see why)

• There are three common collision resolution strategies:
• Linear probing
• Quadratic probing
• Double hashing

Linear Probing: Inserting a key
• In linear probing, collisions are resolved by sequentially scanning the array (with wraparound) until an empty cell is found.

• Example:
• Insert items with keys 89, 18, 49, 58, 9 into an empty hash table.
• Table size is 10.
• Hash function is hash(x) = x mod 10.

[Figure 20.4: Linear probing hash table after each insertion]
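The insertion example above can be sketched as follows; the comment traces the probe sequence each key follows.

```python
# Linear-probing insertion: scan sequentially (with wraparound) for an
# empty cell, starting at hash(x) = x % table size.

def linear_probe_insert(table, key):
    m = len(table)
    i = key % m
    while table[i] is not None:   # occupied: try the next cell, wrapping around
        i = (i + 1) % m
    table[i] = key
    return i

table = [None] * 10
for k in [89, 18, 49, 58, 9]:
    linear_probe_insert(table, k)

print(table)
# 89 -> slot 9; 18 -> slot 8; 49 collides at 9 and wraps to 0;
# 58 collides at 8, 9, 0 and lands in 1; 9 collides at 9, 0, 1 and lands in 2.
```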

Linear Probing: Searching a key
• The search operation follows the same probe sequence as the insert algorithm.
• A find for 58 would involve 4 probes.
• A find for 19 would involve 5 probes.

• Three cases at each probed position:
• Position in table is occupied by an element with an equal key (found)
• Position in table is empty (not found)
• Position in table is occupied by a different element

• For Case 3, probe the next higher index until the element is found or an empty position is reached
• The process wraps around to the beginning of the table

Linear Probing: Deleting a key
• Problems
• Cannot simply mark the slot as empty
• Doing so would make it impossible to retrieve keys inserted after that slot was occupied
• Solution
• Mark the slot with a sentinel value DELETED
• The deleted slot can later be used for insertion
• Searching will still be able to find all the keys
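Lazy deletion with a sentinel can be sketched as follows, reusing the table from the earlier insertion example; the tombstone keeps later keys reachable.

```python
# Lazy deletion in a linear-probing table: a DELETED sentinel keeps the
# probe sequence intact, so keys inserted past it are still found.

DELETED = object()   # sentinel distinct from None (None means truly empty)

def lp_find(table, key):
    m = len(table)
    i = key % m
    for _ in range(m):
        if table[i] is None:               # truly empty: the key is absent
            return -1
        if table[i] is not DELETED and table[i] == key:
            return i
        i = (i + 1) % m                    # DELETED slots do not stop the search
    return -1

def lp_delete(table, key):
    i = lp_find(table, key)
    if i >= 0:
        table[i] = DELETED                 # tombstone, not empty

# Table after inserting 89, 18, 49, 58, 9 (hash(x) = x % 10):
table = [49, 58, 9, None, None, None, None, None, 18, 89]
lp_delete(table, 49)                       # slot 0 becomes DELETED
print(lp_find(table, 9))                   # still found at slot 2, past the tombstone
```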

Primary Clustering Problem
• As long as the table is big enough, a free cell can always be found, but the time to do so can get quite large.

• Worse, even if the table is relatively empty, blocks of occupied cells start forming.
• Search time increases

• This effect is known as primary clustering.

• Any key that hashes into the cluster will require several attempts to resolve the collision, and then it will add to the cluster.

Linear Probing Analysis
• Initialization: O(b), where b is the number of buckets

• Insert and search: O(n) in the worst case, where n is the number of elements in the table; this occurs when all n key values have the same home bucket

• In that case, no better than a linear list for maintaining a dictionary!

Linear Probing Analysis -- Example
• What is the average number of probes for a successful search and an unsuccessful search in this hash table?
• Hash function: h(x) = x mod 11, table size 11
• We assume that the hash function uniformly distributes the keys.

Table contents:
[0] 9   [1] --   [2] 2   [3] 13   [4] 25   [5] 24   [6] --   [7] --   [8] 30   [9] 20   [10] 10

Successful search (slots probed per key):
• 20: 9 -- 30: 8 -- 2: 2 -- 13: 2,3 -- 25: 3,4 -- 24: 2,3,4,5 -- 10: 10 -- 9: 9,10,0
• Avg. probes for SS = (1+1+1+2+2+4+1+3)/8 = 15/8

Unsuccessful search (slots probed from each home slot):
• 0: 0,1 -- 1: 1 -- 2: 2,3,4,5,6 -- 3: 3,4,5,6 -- 4: 4,5,6 -- 5: 5,6 -- 6: 6 -- 7: 7 -- 8: 8,9,10,0,1 -- 9: 9,10,0,1 -- 10: 10,0,1
• Avg. probes for US = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11
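The probe counts above can be verified with a short sketch over the same table (slots 1, 6, 7 empty).

```python
# Verify the average-probe example: table size 11, h(x) = x % 11.

table = [9, None, 2, 13, 25, 24, None, None, 30, 20, 10]

def probes_success(key):
    i, count = key % 11, 1
    while table[i] != key:            # linear probing until the key is found
        i, count = (i + 1) % 11, count + 1
    return count

def probes_unsuccess(slot):
    i, count = slot, 1
    while table[i] is not None:       # probe until an empty slot stops the search
        i, count = (i + 1) % 11, count + 1
    return count

ss = sum(probes_success(k) for k in [20, 30, 2, 13, 25, 24, 10, 9])
us = sum(probes_unsuccess(s) for s in range(11))
print(ss, us)   # 15 and 31, i.e. averages 15/8 and 31/11
```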

Quadratic Probing
• Quadratic probing eliminates the primary clustering problem of linear probing.

• The collision function is quadratic.
• The popular choice is f(i) = i².

• If the hash function evaluates to h and a search in cell h is inconclusive, we try cells h + 1², h + 2², ..., h + i².
• i.e. it examines cells 1, 4, 9, and so on away from the original probe.

• Remember that subsequent probe points are a quadratic number of positions from the original probe point.
[Figure 20.6: A quadratic probing hash table after each insertion (note that the table size was poorly chosen because it is not a prime number).]

Quadratic Probing
• Problem:
• We may not be sure that we will probe all locations in the table (i.e. there is no guarantee of finding an empty cell if the table is more than half full.)
• If the hash table size is not prime, this problem will be much more severe.
• However, there is a theorem stating that:
• If the table size is prime and the load factor is not larger than 0.5, all probes will be to different locations and an item can always be inserted.

Some Considerations
• How efficient is calculating the quadratic probes?
• Linear probing is easily implemented. Quadratic probing appears to require * and % operations.
• However, this is overcome by the following trick (since i² − (i−1)² = 2i − 1):
• Hᵢ = Hᵢ₋₁ + 2i − 1 (mod M)
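The trick above can be sketched by computing the probe positions both ways and confirming they agree; the incremental form needs only an addition per probe.

```python
# Incremental quadratic probing: since i^2 - (i-1)^2 = 2i - 1, the next
# probe position H_i = H_{i-1} + 2i - 1 (mod M) needs no squaring.

def quadratic_probes(h, M, count):
    positions, pos = [], h % M
    for i in range(1, count + 1):
        positions.append(pos)
        pos = (pos + 2 * i - 1) % M   # H_i = H_{i-1} + 2i - 1 (mod M)
    return positions

# Same positions as computing (h + i*i) % M directly:
print(quadratic_probes(3, 11, 5))              # [3, 4, 7, 1, 8]
print([(3 + i * i) % 11 for i in range(5)])    # [3, 4, 7, 1, 8]
```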

Some Considerations
• What happens if the load factor gets too high?
• Dynamically expand the table as soon as the load factor reaches 0.5; this is called rehashing.
• Always double to a prime number.
• When expanding the hash table, reinsert all items into the new table using the new hash function.

Analysis of Quadratic Probing
• Quadratic probing has not yet been mathematically analyzed.

• Although quadratic probing eliminates primary clustering, elements that hash to the same location will probe the same alternative cells. This is known as secondary clustering.

• Techniques that eliminate secondary clustering are available.
• The most popular is double hashing.

Double Hashing
• Use one hash function to determine the first slot
• Use a second hash function to determine the increment for the probe sequence
• h(k,i) = (h1(k) + i·h2(k)) mod m, i = 0, 1, ...

• Initial probe: h1(k)
• The second probe is offset by h2(k) mod m, and so on ...

• Advantage: avoids clustering

• Disadvantage: harder to delete an element
• Can generate at most m² probe sequences
Double Hashing Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k,i) = (h1(k) + i·h2(k)) mod 13

The table (size 13) already contains: 79 in slot 1, 69 in slot 4, 98 in slot 5, 72 in slot 7, 50 in slot 11.

• Insert key 14:
• h(14,0) = 14 mod 13 = 1 → occupied
• h(14,1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5 → occupied
• h(14,2) = (h1(14) + 2·h2(14)) mod 13 = (1 + 8) mod 13 = 9 → empty, so 14 is placed in slot 9

[Figure: The relative efficiency of four collision-resolution methods]
Hashing Applications
• Compilers use hash tables to implement the symbol table (a data structure to keep track of declared variables).
• Game programs use hash tables to keep track of positions they have encountered (transposition table).
• Online spelling checkers.

Summary
• Hash tables can be used to implement the insert and find operations in constant average time.
• The time depends on the load factor, not on the number of items in the table.

• It is important to have a prime TableSize and a correct choice of load factor and hash function.

• For separate chaining, the load factor should be close to 1.

• For open addressing, the load factor should not exceed 0.5 unless this is completely unavoidable.

• Rehashing can be implemented to grow (or shrink) the table.
