
Hashing

Lecturer: Dr. Keyhanipour


Outline
• Dictionary ADT
• Hashing:
• The Concept
• Collision Resolution
• Choosing A Hash Function
• Advanced Collision Resolution

2
Dictionary
• Dictionary ADT – a dynamic set with methods:
• Search(S, k) – an access operation that returns a pointer 𝑥 to an
element where 𝑥. 𝑘𝑒𝑦 = 𝑘
• Insert(S, x) – a manipulation operation that adds the element
pointed to by 𝑥 to 𝑆
• Delete(S, x) – a manipulation operation that removes the element
pointed to by 𝑥 from 𝑆
• Each element has a key part
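As a minimal illustration, the three operations above can be sketched in Python with an (intentionally inefficient) list-backed store; the class and field names here are illustrative, not part of the lecture:

```python
class Dictionary:
    """Minimal Dictionary ADT sketch backed by an (inefficient) list.

    Later slides replace this backing store with a hash table so that
    the operations run in O(1) average time instead of O(n).
    """

    def __init__(self):
        self._elements = []   # each element is an object with a .key attribute

    def search(self, k):
        """Return a pointer (reference) to an element x with x.key == k, else None."""
        for x in self._elements:
            if x.key == k:
                return x
        return None

    def insert(self, x):
        """Add the element pointed to by x to the set."""
        self._elements.append(x)

    def delete(self, x):
        """Remove the element pointed to by x from the set."""
        self._elements.remove(x)


class Account:
    """Illustrative element type: a bank account identified by its number."""
    def __init__(self, key, name):
        self.key, self.name = key, name


d = Dictionary()
jens = Account(96358904, "Jens Jensen")
d.insert(jens)
```

Search and delete both scan the list, which is why this realization is listed as inefficient below.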

3
Dictionaries
• Dictionaries store elements so that they can be located quickly
using keys
• Example:
• A dictionary may hold bank accounts
• Each account is an object that is identified by an account number
• Each account stores a wealth of additional information
• An application wishing to operate on an account would have to
provide the account number as a search key.

4
Dictionaries (2)
• Supporting order (methods such as min, max, successor and
predecessor) is not required, thus it is enough that keys are
comparable for equality

5
Dictionaries (3)
• Different data structures to realize dictionaries:
• Arrays, linked lists (inefficient)
• Hash tables
• Binary trees
• Red/Black trees
• AVL trees
• B-trees

6
An Example
• Suppose that a large phone company wants to provide the
caller ID capability:
• given a phone number, return the caller’s name
• phone numbers range from 0 to 𝑟 = 10⁸ − 1
• want to do this as efficiently as possible

7
The Problem
• A few suboptimal ways to design this dictionary
• Direct addressing: an array indexed by key
• takes 𝑂(1) time,
• needs 𝑂(𝑟) space - huge amount of wasted space

value:  (null)      (null)      Jens Jensen  (null)      (null)
index:  0000-0000   0000-0000   9635-8904    0000-0000   0000-0000

• A linked list: takes 𝑂(𝑟) time, 𝑂(𝑟) space

(9635-8904, Jens Jensen) → (9635-9999, Ole Olsen)

8
Another Solution
• We can do better, with a Hash table
• Like an array, but come up with a function to map the large
range into one which we can manage
• e.g., take the original key, modulo the (relatively small) size of the
array, and use that as an index
• Insert (9635-8904, Jens Jensen) into a hashed array with, say,
five slots
• 96358904 𝑚𝑜𝑑 5 = 4

(null) (null) (null) (null) Jens Jensen


0 1 2 3 4
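The mapping above can be sketched in a few lines of Python; the table size of five and the example key are the slide's values:

```python
m = 5                    # table size, as in the slide's example
table = [None] * m

def h(key):
    """Map a key from the huge phone-number range into 0..m-1."""
    return key % m       # 96358904 mod 5 = 4

# Insert (9635-8904, Jens Jensen) into the hashed array
table[h(96358904)] = "Jens Jensen"
```

A lookup repeats the same computation: hash the query key and check that one slot.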
9
Another Solution (2)
• A lookup uses the same process: hash the query key, then
check the array at that slot

(null) (null) (null) (null) Jens Jensen


0 1 2 3 4

10
Direct-address Tables
• Assumptions:
• The universe 𝑈 of keys is reasonably small:
𝑈 = {0, 1, 2, … , 𝑚 − 1}, for some small 𝑚
• No two elements have the same key
• Implementation:
• Allocate an array of size 𝑚
• Insert the kth element into the kth slot in the array

11
Direct-address Tables
• How to implement a dynamic set by a direct-address table T :
• Each key in the universe 𝑈 = {0, 1, … , 9} corresponds to an index in the table.
• The set 𝐾 = {2, 3, 5, 8} of
actual keys determines the
slots in the table that
contain pointers to
elements. The other slots,
heavily shaded, contain
NIL.
• If the set contains no
element with key k, then:
T[k] = NIL.

12
Direct-address Tables
• An Array is an example of a direct-address table:
• Given a key 𝑘, the corresponding element is stored in 𝑇[𝑘]
• Assume no two elements have the same key

• Operations of direct-address tables: all 𝑂(1) time


• DIRECT-ADDRESS-SEARCH(𝑇, 𝑘): return 𝑇[𝑘]
• DIRECT-ADDRESS-INSERT(𝑇, 𝑥): 𝑇[𝑥.𝑘𝑒𝑦] = 𝑥
• DIRECT-ADDRESS-DELETE(𝑇, 𝑥): 𝑇[𝑥.𝑘𝑒𝑦] = 𝑁𝐼𝐿
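A minimal Python sketch of the three operations above; the class and element names are illustrative:

```python
class DirectAddressTable:
    """Direct-address table over a small universe U = {0, ..., m-1}.

    All three operations are O(1); the storage cost is O(m) regardless
    of how many elements are actually present.
    """

    def __init__(self, m):
        self.T = [None] * m   # None plays the role of NIL

    def search(self, k):
        return self.T[k]      # DIRECT-ADDRESS-SEARCH

    def insert(self, x):
        self.T[x.key] = x     # DIRECT-ADDRESS-INSERT

    def delete(self, x):
        self.T[x.key] = None  # DIRECT-ADDRESS-DELETE


class Element:
    """Illustrative element with a key part, as the Dictionary ADT requires."""
    def __init__(self, key, value):
        self.key, self.value = key, value
```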

13
Direct-address Tables
• Advantage:
• O(1) time for all operations
• Disadvantages:
• Wasteful if the number of elements actually inserted is significantly
smaller than the size of the universe (m)
• Only applicable for small values of m , i.e. a limited range of keys

14
Hash Tables
• Performance is almost as good as the direct-address table, but without
the limitations
• The universe 𝑈 may be very large
• The storage requirement is 𝑂(|𝐾|), where 𝐾 is the set of keys actually
used
• Disadvantage – 𝑂(1) performance is now average case, not the
worst case

15
Hash Tables
• We need a hash function to map keys from the universe 𝑈 into
the hash table
ℎ: 𝑈 → {0, 1, … , 𝑚 − 1}
• For each key 𝑘, the hash function computes a hash value ℎ(𝑘)
• If two keys hash to the same value:
ℎ(𝑘1) = ℎ(𝑘2), we call this a collision

16
Hash Tables
• Using a hash function ℎ
to map keys to hash-table
slots.
• Because keys 𝑘2 and 𝑘5
map to the same slot,
they collide.

17
Collisions
• Can we avoid collisions altogether?
• No. Since |𝑈| > 𝑚, some keys must have the same hash value (based
on the Pigeonhole principle)
• A good hash function will be as ‘random’ as possible

18
Collision Resolution
• Chaining (also called open hash)
• Elements stored in their ‘correct’ slot
• Collisions resolved by creating linked lists

• Open addressing (also called


closed hash)
• All elements stored inside
the table
• Maybe rehashed if their slot
is full

19
Collision Resolution: Chaining (Open Hash)

20
Chaining (Open Hash)
• Collision resolution by chaining. Each hash-table slot 𝑇[𝑗] contains a
linked list of all the keys whose hash value is 𝑗 .
• For example, ℎ(𝑘1) = ℎ(𝑘4), ℎ(𝑘5) = ℎ(𝑘2) = ℎ(𝑘7), and ℎ(𝑘8) = ℎ(𝑘6).
• The linked list can be either singly or doubly linked; we show it as a
doubly-linked list because deletion is faster that way.

21
Collision resolution – Chaining
• All keys that have the same hash value are placed in a linked list
• Insertion:
• CHAINED-HASH-INSERT(T, x) insert x at the head of list T[h(key[x])]
• Can be done at the beginning of the list in 𝑂(1) time
• Searching:
• CHAINED-HASH-SEARCH(T, k) search for an element with key k in list T[h(k)].
• Running time is proportional to the length of the list of elements in slot h(k).
• Deletion:
• CHAINED-HASH-DELETE(T, x) delete x from the list T[h(key[x])]
• Given pointer x to the element to delete, so no search is needed to find this element
(𝑂(1) time).
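The chained operations above can be sketched in Python; here a Python list stands in for the linked list in each slot, and the division-method hash is an illustrative choice:

```python
class ChainedHashTable:
    """Collision resolution by chaining: slot T[j] holds a list standing in
    for the linked list of all (key, value) pairs whose hash value is j."""

    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]

    def _h(self, k):
        return k % self.m   # division-method hash, used here for illustration

    def insert(self, k, v):
        # CHAINED-HASH-INSERT: O(1) insertion at the head of the list
        self.T[self._h(k)].insert(0, (k, v))

    def search(self, k):
        # CHAINED-HASH-SEARCH: time proportional to the length of list T[h(k)]
        for key, v in self.T[self._h(k)]:
            if key == k:
                return v
        return None

    def delete(self, k):
        # With a true doubly-linked list and a pointer to the node this is
        # O(1); with a Python list we must scan the chain first.
        chain = self.T[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                del chain[i]
                return
```

With m = 5, the keys 10 and 15 both hash to slot 0 and end up chained together.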

22
Analysis of Hashing
• The hash function h maps the universe U of keys into the slots
of hash table T[0...m-1]
ℎ: 𝑈 → {0, 1, … , 𝑚 − 1}
• Assumptions:
• Each key is equally likely to be hashed into any slot (bucket); Simple
Uniform Hashing
• Assume time to compute ℎ(𝑘) is 𝑂(1)
• Given hash table 𝑇 with 𝑚 slots holding 𝑛 elements, the load
factor is defined as: 𝛼 = 𝑛/𝑚

23
Analysis of Hashing
• Given a key, how long does it take to find an element with that key, or
to determine that there is no element with that key?
• Analysis is in terms of the Load factor: 𝛼 = 𝑛/𝑚.
• n = # elements in the table.
• m = # slots in the table = # (possibly empty) linked lists.
• Load factor is average number of elements per linked list.
• Can have α < 1, α > 1 or α = 1.
• Worst case is when all n keys hash to the same slot ⇒ get a single list of length
n ⇒ worst-case time to search is Θ(𝑛), plus time to compute the hash function.
• Average case depends on how well the hash function distributes the keys
among the slots.

24
Analysis of Hashing
• To find an element:
• using ℎ , look up its position in table 𝑇
• search for the element in the linked list of the hashed slot
• Unsuccessful search:
• element is not in the linked list
• uniform hashing yields an average list length 𝛼 = 𝑛/𝑚
• expected number of elements to be examined 𝛼
• search time 𝑂(1 + 𝛼) (this includes computing the hash value)

25
Analysis of Hashing
• Successful search:
• assume that a new element is inserted at the end of the linked list
• The expected time for a successful search is also Θ(1 + 𝛼).

26
Analysis of Hashing: Summary
• Searching takes:
• constant time on average
• 𝑂(𝑛) worst-case
• Insertion takes 𝑂(1) worst-case time
• Deletion takes 𝑂(1) worst-case time when the lists are doubly-
linked

27
Hash Function Design

28
Hash Function Properties
• Properties of a good hash function:
• Easy to evaluate – ℎ(𝑥) can be computed very quickly
• Uniform distribution over all the table slots
• Different keys are mapped to different slots (as much as possible)
• Good hash functions are very rare
• Often use heuristics, based on the domain of the keys, to create
a hash function that performs well.

29
Hash Functions
• Hash functions assume that the keys are natural numbers.
• How to deal with hashing non-integer keys:
• Find some way of turning the keys into integers
• In our example, remove the hyphen in 9635-8904 to get 96358904!
• For a character string, add up the ASCII values of the characters
• Example: Interpret a character string as an integer expressed in some radix
notation. Suppose the string is “CLRS”. ASCII values: C = 67, L = 76, R = 82,
S = 83. So interpret CLRS as: 67 × 10³ + 76 × 10² + 82 × 10¹ + 83 × 10⁰ = 75503.
• Then use a standard hash function on the integers.
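A small sketch of the radix interpretation described above, using the slide's radix of 10 (a real implementation would typically use a radix of at least 128 so the ASCII "digits" cannot overlap):

```python
def string_to_int(s, radix=10):
    """Interpret string s as an integer in the given radix, treating each
    character's ASCII code as one 'digit' (the slide's convention)."""
    n = 0
    for ch in s:
        n = n * radix + ord(ch)   # shift previous digits left, append ord(ch)
    return n
```

For "CLRS" this yields 67 × 10³ + 76 × 10² + 82 × 10¹ + 83 × 10⁰ = 75503, matching the slide.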

30
Hash Functions: Division Method
• Use the remainder:
• ℎ(𝑘) = 𝑘 𝑚𝑜𝑑 𝑚
• 𝑘 is the key, 𝑚 the size of the table
• Need to choose 𝑚
• 𝑚 = 2ᵉ (bad)
• if 𝑚 is a power of 2, ℎ(𝑘) gives the 𝑒 least significant bits of 𝑘
• all keys with the same ending go to the same place
• 𝑚 prime (good)
• helps ensure uniform distribution
• primes not too close to exact powers of 2
31
Hash Functions: Division Method
• Example 1:
• hash table for 𝑛 = 2000 character strings
• we don’t mind examining 3 elements
• 𝑚 = 701, a prime near 2000/3, but not near any power of 2
• → ℎ(𝑘) = 𝑘 𝑚𝑜𝑑 701
• Further examples:
• 𝑚 = 13
• ℎ(3) = 3 𝑚𝑜𝑑 13 = 3
• ℎ(12) = 12 𝑚𝑜𝑑 13 = 12
• ℎ(13) = 13 𝑚𝑜𝑑 13 = 0
• ℎ(47) = 47 𝑚𝑜𝑑 13 = 8
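The examples above can be checked with a one-line Python sketch of the division method:

```python
def h_division(k, m=13):
    """Division method: h(k) = k mod m; m = 13 as in the examples above."""
    return k % m
```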
32
Hash Functions: Multiplication Method
• Use:
• ℎ(𝑘) = ⌊𝑚 (𝑘𝐴 𝑚𝑜𝑑 1)⌋
• 𝑘 is the key, 𝑚 the size of the table, and 𝐴 is a constant: 0 < 𝐴 < 1
• The steps involved:
• map 0 … 𝑘𝑚𝑎𝑥 into 0 … 𝑘𝑚𝑎𝑥 A
• take the fractional part (𝑚𝑜𝑑 1)
• map it into 0 … 𝑚 − 1
• Disadvantage: Slower than division method.
• Advantage: Value of m is not critical.

33
Hash Functions: Multiplication Method
• Choice of 𝑚 and 𝐴:
• value of 𝑚 is not critical, typically use 𝑚 = 2ᵖ
• optimal choice of 𝐴 depends on the characteristics of the data
• Knuth says use: 𝐴 = (√5 − 1)/2 ≈ 0.6180339887… (conjugate of the golden
ratio) – Fibonacci hashing

34
Hash Functions: Multiplication Method
• Example:
• 𝑚 = 8 (implies 𝑝 = 3), 𝐴 = 0.62. ℎ(𝑘) = ⌊𝑚(𝑘𝐴 𝑚𝑜𝑑 1)⌋
• If 𝑘 = 72, then to compute ℎ(𝑘):
• 𝑘𝐴 = 44.64,
• 𝑘𝐴 𝑚𝑜𝑑 1 = 0.64,
• 𝑚(𝑘𝐴 𝑚𝑜𝑑 1) = 5.12,
• ℎ(𝑘) = ⌊5.12⌋ = 5
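A sketch of the multiplication method matching this worked example; the values m = 8 and A = 0.62 are the slide's:

```python
import math

def h_mult(k, m=8, A=0.62):
    """Multiplication method: h(k) = floor(m * (k*A mod 1))."""
    return math.floor(m * ((k * A) % 1))   # take fractional part, scale by m
```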

35
Universal Hashing
• For any choice of hash function, there exists a bad set of
identifiers
• A malicious adversary could choose keys to be hashed such
that all go into the same slot (bucket)
• Average retrieval time is Θ(𝑛)
• Solution:
• a random hash function
• choose hash function independently of keys!
• create a set of hash functions 𝐻 , from which ℎ can be randomly
selected

36
Universal Hashing (2)

• A collection 𝐻 of hash functions is universal if, for any two distinct keys
𝑘 and 𝑙 and a hash function 𝑓 chosen uniformly at random from 𝐻:

Pr{𝑓(𝑘) = 𝑓(𝑙)} ≤ 1/𝑚

• Using chaining and universal hashing, the expected time for
each SEARCH operation is 𝑂(1).
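One well-known universal family (not named on the slide, so this is an illustrative sketch) picks a prime 𝑝 larger than every key and draws constants 𝑎 and 𝑏 at random, giving ℎ(𝑘) = ((𝑎𝑘 + 𝑏) 𝑚𝑜𝑑 𝑝) 𝑚𝑜𝑑 𝑚; the values of 𝑝 and 𝑚 below are assumptions:

```python
import random

p = 2_147_483_647   # a prime larger than any key in the universe (assumption)
m = 701             # number of table slots (assumption)

def draw_hash_function():
    """Choose h from the family H at random, independently of the keys."""
    a = random.randrange(1, p)   # a in {1, ..., p-1}
    b = random.randrange(0, p)   # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m
```

Because 𝑎 and 𝑏 are chosen after the adversary fixes the keys, no fixed set of keys can force all of them into one slot.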

37
Collision Resolution: Open Addressing

38
More on Collisions
• A key is mapped to an already occupied table location
• what to do?
• Use a collision handling technique
• We’ve seen Chaining
• Can also use Open Addressing
• Probing
• Double Hashing

39
Open Addressing
• All elements are stored in the hash table (can fill up!), i.e., 𝑛 ≤ 𝑚
• Each table entry contains either an element or null
• When searching for an element, systematically probe table slots
• Modify hash function to take the probe number 𝑖 as the second
parameter:
ℎ: 𝑈 × {0,1, … , 𝑚 − 1} → {0,1, … , 𝑚 − 1}
• Hash function, ℎ, determines the sequence of slots examined for a
given key

40
Open Addressing
• Probe sequence for a given key 𝑘 given by:
ℎ(𝑘, 0), ℎ(𝑘, 1), . . . , ℎ(𝑘, 𝑚 − 1), a permutation of 0, 1, . . . , 𝑚 − 1
• Each element occupies a single slot in the hash table – no
chaining is done
• To insert an element, we probe the table according to the hash
function until an empty slot is found.
• Thus, in open addressing, the hash table can “fill up” so that no
further insertions can be made; one consequence is that the
load factor 𝛼 can never exceed 1.
41
Open Addressing
• Insertion: HASH-INSERT(𝑇, 𝑘)
• The HASH-INSERT procedure takes as input a hash table 𝑇 and a key 𝑘.
• It either returns the slot number where it stores key 𝑘 or flags an error
because the hash table is already full.

1  i ← 0
2  repeat j ← h(k, i)
3      if T[j] == NIL
4          then T[j] ← k
5               return j
6          else i ← i + 1
7  until i == m
8  error “hash table overflow”
42
Open Addressing
• Search: HASH-SEARCH(𝑇, 𝑘)
• The procedure HASH-SEARCH takes as input a hash table 𝑇 and a key 𝑘,
returning 𝑗 if it finds that slot 𝑗 contains key 𝑘, or NIL if key 𝑘 is not
present in table 𝑇.

1  i ← 0
2  repeat j ← h(k, i)
3      if T[j] == k
4          return j
5      else i ← i + 1
6  until T[j] == NIL or i == m
7  return NIL

43
Open Addressing
• Open addressing techniques:
• Linear probing
• Quadratic probing
• Double hashing

44
Linear Probing
• A hash value is computed using any hash function ℎ′ , and then
the number of the current attempt is added to it:
ℎ(𝑘, 𝑖) = (ℎ′(𝑘) + 𝑖) 𝑚𝑜𝑑 𝑚

• Slots are examined sequentially, until an empty one is found

45
Linear Probing: An Example
• Example:
• You are given a hash table 𝐻 with 11 slots (𝑚)
• Suppose ℎ′(𝑘) = 𝑘; then use linear probing and a hash function:
ℎ(𝑘, 𝑖) = (𝑘 + 𝑖) 𝑚𝑜𝑑 𝑚
To hash the following elements: 10, 22, 31, 4, 15, 28, 17, 88, 59.

46
Linear Probing
Solution:
• ℎ(10, 0) = (10 + 0) 𝑚𝑜𝑑 11 = 10
• ℎ(22, 0) = (22 + 0) 𝑚𝑜𝑑 11 = 0
• ℎ(31, 0) = (31 + 0) 𝑚𝑜𝑑 11 = 9
• ℎ(4, 0) = (4 + 0) 𝑚𝑜𝑑 11 = 4
• ℎ(15, 0) = (15 + 0) 𝑚𝑜𝑑 11 = 4
• ℎ(15, 1) = (15 + 1) 𝑚𝑜𝑑 11 = 5
• ℎ(28, 0) = (28 + 0) 𝑚𝑜𝑑 11 = 6
• ℎ(17, 0) = (17 + 0) 𝑚𝑜𝑑 11 = 6

0 1 2 3 4 5 6 7 8 9 10
22 4 15 28 31 10
47
Linear Probing
Solution (cont.):
• ℎ(17, 1) = (17 + 1) 𝑚𝑜𝑑 11 = 7
• ℎ(88, 0) = (88 + 0) 𝑚𝑜𝑑 11 = 0
• ℎ(88, 1) = (88 + 1) 𝑚𝑜𝑑 11 = 1
• ℎ(59, 0) = (59 + 0) 𝑚𝑜𝑑 11 = 4
• ℎ(59, 1) = (59 + 1) 𝑚𝑜𝑑 11 = 5
• ℎ(59, 2) = (59 + 2) 𝑚𝑜𝑑 11 = 6
• ℎ(59, 3) = (59 + 3) 𝑚𝑜𝑑 11 = 7
• ℎ(59, 4) = (59 + 4) 𝑚𝑜𝑑 11 = 8

0 1 2 3 4 5 6 7 8 9 10
22 88 4 15 28 17 59 31 10
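The probe sequence above can be replayed in a short Python sketch that reproduces the final table:

```python
m = 11
T = [None] * m

def linear_probe_insert(T, k):
    """Linear probing with h(k, i) = (h'(k) + i) mod m and h'(k) = k."""
    for i in range(len(T)):
        j = (k + i) % len(T)
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

for k in [10, 22, 31, 4, 15, 28, 17, 88, 59]:
    linear_probe_insert(T, k)
```

The run of occupied slots 4 through 10 that 59 had to walk past is exactly the primary clustering discussed on the next slide.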
48
Linear Probing
• Linear probing is easy to implement, but it suffers from a
problem known as Primary clustering: filling consecutive slots.
• Long runs of occupied slots build up, increasing the average
search time.

49
Quadratic Probing
• In this case, the second attempt is a more complex function
of 𝑖:
ℎ(𝑘, 𝑖) = (ℎ′(𝑘) + 𝑐1𝑖 + 𝑐2𝑖²) 𝑚𝑜𝑑 𝑚

• Tries to avoid primary clustering


• However, it suffers from secondary clustering; moreover, to make
full use of the hash table, the values of 𝑐1, 𝑐2, and 𝑚
are constrained.

50
Quadratic Probing: An Example
• Example:
• You are given a hash table 𝐻 with 11 slots
• Suppose ℎ′(𝑘) = 𝑘; then use quadratic probing and a hash function
ℎ(𝑘, 𝑖) = (𝑘 + 𝑖 + 3𝑖²) 𝑚𝑜𝑑 11
To hash the following elements: 10, 22, 31, 4, 15, 28

51
Quadratic Probing
Solution: ℎ(𝑘, 𝑖) = (𝑘 + 𝑖 + 3𝑖²) 𝑚𝑜𝑑 11
• ℎ(10, 0) = (10 𝑚𝑜𝑑 11) = 10
• ℎ(22, 0) = (22 𝑚𝑜𝑑 11) = 0
• ℎ(31, 0) = (31 𝑚𝑜𝑑 11) = 9
• ℎ 4, 0 = 4 𝑚𝑜𝑑 11 = 4
• ℎ(15, 0) = (15 𝑚𝑜𝑑 11) = 4
• ℎ(15, 1) = ((15 + 1 + 3) 𝑚𝑜𝑑 11) = 8
• ℎ(28, 0) = (28 𝑚𝑜𝑑 11) = 6

0 1 2 3 4 5 6 7 8 9 10
22 4 28 15 31 10
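Likewise, a Python sketch reproduces the quadratic-probing table above:

```python
m = 11
T = [None] * m

def quadratic_probe_insert(T, k):
    """Quadratic probing with h(k, i) = (k + i + 3*i^2) mod 11."""
    for i in range(len(T)):
        j = (k + i + 3 * i * i) % len(T)
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("no empty slot found along this probe sequence")

for k in [10, 22, 31, 4, 15, 28]:
    quadratic_probe_insert(T, k)
```

Only 15 collides (slot 4 is taken by 4), and its second probe jumps to slot 8 instead of sliding to slot 5 as linear probing would.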

52
Double Hashing
• Given two hash functions:
ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖 × ℎ2(𝑘)) 𝑚𝑜𝑑 𝑚

• One of the best methods for open addressing collision


resolution
• Permutations are almost random
• For the entire hash table to be searched, 𝑚 and ℎ2(𝑘) must be
relatively prime

53
Double Hashing
• Possible selections of ℎ2(𝑘):
1. Select 𝑚 to be a power of 2, and design ℎ2(𝑘) to produce odd
numbers
2. Select 𝑚 to be prime, and 𝑚′ to be 𝑚 − 1
• ℎ1(𝑘) = 𝑘 𝑚𝑜𝑑 𝑚
• ℎ2(𝑘) = 1 + (𝑘 𝑚𝑜𝑑 𝑚′)

54
Double Hashing: Example
ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖 × ℎ2(𝑘)) 𝑚𝑜𝑑 𝑚
• Suppose we have a hash table of size 13 with:
ℎ1(𝑘) = 𝑘 𝑚𝑜𝑑 13 and ℎ2(𝑘) = 1 + (𝑘 𝑚𝑜𝑑 11).
• Since 14 ≡ 1 (𝑚𝑜𝑑 13) and 14 ≡ 3 (𝑚𝑜𝑑 11), we insert the key 14 into
empty slot 9, after examining slots 1 and 5 and finding
them to be occupied.
• ℎ1(14) = 1, ℎ2(14) = 4
• → ℎ(14, 0) = (1 + 0 × 4) 𝑚𝑜𝑑 13 = 1 (Collision)
• → ℎ(14, 1) = (1 + 1 × 4) 𝑚𝑜𝑑 13 = 5 (Collision)
• → ℎ(14, 2) = (1 + 2 × 4) 𝑚𝑜𝑑 13 = 9
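The probe sequence for key 14 can be replayed in Python; a placeholder string stands in for the elements already sitting in slots 1 and 5:

```python
m = 13
T = [None] * m
T[1] = "occupied"   # slots 1 and 5 are already occupied, as in the example
T[5] = "occupied"

def h1(k):
    return k % 13

def h2(k):
    return 1 + (k % 11)

def double_hash_insert(T, k):
    """Double hashing: probe (h1(k) + i * h2(k)) mod m for i = 0, 1, ..."""
    probes = []
    for i in range(len(T)):
        j = (h1(k) + i * h2(k)) % len(T)
        probes.append(j)
        if T[j] is None:
            T[j] = k
            return j, probes
    raise OverflowError("hash table overflow")

slot, probes = double_hash_insert(T, 14)
```

Since h2(14) = 4 is relatively prime to m = 13, the probe sequence would visit every slot before giving up.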

55
