AST20105 Data Structure and Algorithms: Chapter 9 - Hash Table
2
Hash Function
● The hash function is a very important part of hash table design.
● A hash function, denoted h, provides a method to find the table index from a key
/ data item
○ In other words, it maps keys to locations in a hash table
● A key k hashes to location h(k), where h(k) is the hash value of k
● Hashing refers to the process of inserting keys / data into a hash table with the help of
a hash function
3
Hash Function
● A hash function is considered good if it distributes hash values uniformly.
● The hash function is a principal concern because poor hash functions cause
collisions and other unwanted effects, which badly affect the overall performance
of the hash table.
4
Hashing
● Hashing is a technique to convert a range of key values into a range of indexes
of an array.
5
Load Factor
● The basic underlying data structure used to store a hash table is an array.
● The load factor is the ratio of the number of stored items to the array's size.
Load factor = number of elements in the array / size of array
● Some say the load factor should ideally be kept below 0.75.
● A hash table can either have a constant size or be dynamically resized when the
load factor exceeds some threshold.
● Resizing is done before the table becomes full, to keep the number of collisions
under a certain amount and prevent performance degradation.
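The load factor formula and the resize rule above can be sketched in a few lines. The function names and the default threshold of 0.75 (taken from the guideline above) are illustrative assumptions, not part of any particular library.

```python
def load_factor(num_items, capacity):
    """Ratio of stored items to the size of the underlying array."""
    return num_items / capacity


def needs_resize(num_items, capacity, threshold=0.75):
    """True when the load factor exceeds the chosen threshold (assumed 0.75)."""
    return load_factor(num_items, capacity) > threshold


print(load_factor(10, 16))   # 0.625
print(needs_resize(10, 16))  # False
print(needs_resize(13, 16))  # True, since 13 / 16 = 0.8125
```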
6
Collision
● What happens if the hash function returns the same hash value for different keys?
● The result is called a collision.
● Solutions:
○ Design a better hash function that can be computed efficiently and minimizes
the number of collisions.
○ Design a collision resolution algorithm.
7
Collision
● Collisions are practically unavoidable and must be considered when implementing
a hash table.
● Because of collisions, keys are also stored in the table, so that key-value pairs
with the same hash value can be distinguished.
● There are various ways of resolving collisions. Basically, there are two different
strategies:
○ Closed addressing (the key is stored at its original hash location)
○ Open addressing (the key may be stored at another location)
8
Open Hashing (Closed Addressing)
● Each slot of the hash table contains a link to another data structure (e.g. a
linked list), which stores the key-value pairs with the same hash.
● When a collision occurs, this data structure is searched for the key-value
pair that matches the key.
9
Closed Hashing (Open Addressing)
● Each slot directly contains a key-value pair.
● When a collision occurs, the open addressing algorithm calculates another location
until it finds a free slot.
● Hash tables based on the open addressing strategy experience a drastic performance
decrease when the table is tightly filled (load factor of 0.75 or more).
10
Closed Hashing vs Open Hashing
11
Simple Example of Hash Table
12
Example 1
● Put “ALGORITHMS” into a hash table, using the ASCII value of each letter as
its key.
Value Key
A 65
L 76
G 71
O 79
R 82
I 73
T 84
H 72
M 77
S 83
13
Example 1
● Suppose the size of hash table is 16.
● Putting “A” into the hash table:
○ Key % arrSize = 65 % 16 = 1
○ “A” is put at index 1
● Putting “L” into the hash table:
○ Key % arrSize = 76 % 16 = 12
○ “L” is put at index 12
● And so forth...
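A minimal sketch of Example 1, assuming a plain Python list stands in for the table and that `ord()` supplies the ASCII value of each letter. The name `hash_index` is hypothetical.

```python
def hash_index(key, table_size):
    """Division hashing: Key % arrSize."""
    return key % table_size


TABLE_SIZE = 16
table = [None] * TABLE_SIZE

for letter in "ALGORITHMS":
    index = hash_index(ord(letter), TABLE_SIZE)
    table[index] = letter
    print(letter, ord(letter), "-> index", index)
```

For this particular word every letter lands in a different slot, so no collision handling is needed.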
14
Example 2
● Put “COMPUTERS” into a hash table, using the ASCII value of each letter as
its key.
Value Key
C 67
O 79
M 77
P 80
U 85
T 84
E 69
R 82
S 83
15
Example 2
● Suppose the size of hash table is 16.
● Putting “C” into the hash table:
○ Key % arrSize = 67 % 16 = 3
○ “C” is put at index 3
● And so forth...
○ “O” @ 15
○ “M” @ 13
○ “P” @ 0
○ “U” @ 5
○ “T” @ 4
● When putting “E” into the hash table:
○ Key % arrSize = 69 % 16 = 5
○ Slot 5 is already occupied! Collision handling is required.
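The walk-through above can be reproduced with a short sketch that records a collision instead of resolving it. The function name `find_collisions` and the tuple layout are assumptions made for illustration.

```python
def find_collisions(word, table_size):
    """Place each letter at ord(letter) % table_size and record any collisions."""
    table = [None] * table_size
    collisions = []
    for letter in word:
        index = ord(letter) % table_size
        if table[index] is None:
            table[index] = letter
        else:
            # Slot already occupied: remember (new letter, slot, current occupant).
            collisions.append((letter, index, table[index]))
    return table, collisions


table, collisions = find_collisions("COMPUTERS", 16)
for letter, index, occupant in collisions:
    print(f"{letter!r} hashes to slot {index}, already holding {occupant!r}")
```

Besides “E” colliding with “U” at slot 5, the sketch also reports that “S” (83 % 16 = 3) collides with “C” at slot 3.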
16
Hash Functions
17
Hashing Function
● The number of hash functions that can be used to assign positions to n items in a
table of m positions (for n <= m) is equal to m^n.
○ Most of these functions are too unwieldy for practical applications and
cannot be represented by a concise formula.
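As a quick sanity check of the m^n count, the sketch below enumerates every possible assignment of n items to m positions for small values. The function name is hypothetical, and brute-force enumeration is only feasible for tiny n and m.

```python
from itertools import product


def count_hash_functions(n_items, m_positions):
    """Count all ways to assign each of n items to one of m positions."""
    # Each tuple in the product gives one position per item, i.e. one function.
    return sum(1 for _ in product(range(m_positions), repeat=n_items))


print(count_hash_functions(3, 4))  # 64, which equals 4 ** 3
```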
18
Hashing Function - Division
● A hash function must guarantee that the number it returns is a valid index to one
of the table cells.
● The simplest way to accomplish this is division modulo the table size:
○ TSize = sizeof(table), i.e. the number of cells in the table
○ h(K) = K mod TSize, if K is a number.
19
Hashing Function - Folding
● In this method, the key is divided into several parts.
● These parts are combined or folded together and are often transformed in a
certain way to create the target address.
● There are two types of folding:
○ Shift folding
○ Boundary folding
20
Hashing Function - Shift Folding
● In shift folding, the parts of the key are put underneath one another and then processed.
● For example, a social security number (SSN)
○ 123456789 can be divided into three parts,
○ 123, 456, 789, and then these parts can be added.
● The resulting number, 1,368,
○ can be divided modulo TSize or,
○ if the size of the table is 1,000, the first three digits can be used for the
address.
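The SSN example can be sketched as follows, assuming the key is a non-negative integer split into 3-digit parts from the left. The name `shift_fold` and the `part_len` parameter are illustrative.

```python
def shift_fold(key, part_len=3):
    """Split the decimal digits of key into parts and add the parts."""
    digits = str(key)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts)


folded = shift_fold(123456789)   # 123 + 456 + 789
print(folded)                    # 1368
print(folded % 1000)             # 368, dividing modulo TSize = 1000
print(int(str(folded)[:3]))      # 136, taking the first three digits instead
```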
21
Hashing Function - Boundary Folding
● With boundary folding, the key is seen as being written on a piece of paper that is
folded on the borders between different parts of the key.
● In this way, every other part will be put in the reverse order.
● Consider the same three parts of the SSN: 123, 456 and 789.
○ The first part, 123, is taken in the same order,
○ then the piece of paper with the second part is folded underneath it so that
123 is aligned with 654, which is the second part, 456, in reverse order.
○ When the folding continues, 789 is aligned with the two previous parts.
● The result, 123 + 654 + 789 = 1,566,
○ can be divided modulo TSize or,
○ if the size of the table is 1,000, the first three digits can be used for the
address.
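A comparable sketch for boundary folding, under the same assumptions (integer key, 3-digit parts taken from the left). Every other part, starting with the second, is reversed before the parts are added. The name `boundary_fold` is hypothetical.

```python
def boundary_fold(key, part_len=3):
    """Split the digits of key into parts, reverse every other part, then add."""
    digits = str(key)
    parts = [digits[i:i + part_len] for i in range(0, len(digits), part_len)]
    total = 0
    for position, part in enumerate(parts):
        if position % 2 == 1:
            part = part[::-1]  # parts at odd positions are folded, i.e. reversed
        total += int(part)
    return total


print(boundary_fold(123456789))  # 1566, from 123 + 654 + 789
```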
22
Hashing Function - Mid-Square Method
● In the mid-square method, the key is squared and the middle or mid part of the
result is used as the address.
● If the key is a string, it has to be preprocessed to produce a number by using, for
instance, folding.
● In a mid-square hash function, the entire key participates in generating the
address so that there is a better chance that different addresses are generated for
different keys.
● For example,
○ if the key is 3,121, then 3,121^2 = 9,740,641,
○ and for the 1,000-cell table, h(3,121) = 406, which is the middle part of
3,121^2.
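The worked example above can be sketched like this. How the "middle" is chosen when the leftover digits cannot be split evenly is an assumption here (the extra digit is left on the right); the name `mid_square` is illustrative.

```python
def mid_square(key, num_digits=3):
    """Square the key and return the middle num_digits digits of the result."""
    squared = str(key * key)
    # Assumption: when the remaining digits are odd in number,
    # the extra digit stays on the right-hand side.
    start = (len(squared) - num_digits) // 2
    return int(squared[start:start + num_digits])


print(3121 * 3121)       # 9740641
print(mid_square(3121))  # 406
```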
23
Collision Handling
24
Collision Handling
● In the small number of cases where multiple keys map to the same integer,
elements with different keys may be stored in the same "slot" of the hash table.
● Consequently, when the hash function is used to locate a potential match, the key
of that element must be compared with the search key.
● There may be more than one element that should be stored in a single slot of
the table.
● Various techniques are used to manage this problem:
○ separate chaining (or chaining)
○ probing (linear and quadratic) and
○ re-hashing, etc.
25
Separate Chaining (Chaining)
● One simple scheme is to chain all collisions in
lists attached to the appropriate slot.
● This allows an unlimited number of collisions to
be handled and doesn't require a priori
knowledge of how many elements are contained
in the collection.
● The tradeoff is the same as with linked list
versus array implementations of collections:
linked list overhead in space and, to a lesser
extent, in time.
26
Separate Chaining (Chaining)
● To insert key k into the hash table (worst case: O(1))
○ Compute h(k) to determine where to insert the element
○ If T[h(k)] is NULL, make this table cell point to a node containing k
○ Otherwise, add a node containing k to the beginning of the list
● To search for a key k (worst case: O(n))
○ Compute h(k) and search within the list at T[h(k)]
● To delete a key k from the hash table T (worst case: O(n))
○ Compute h(k) to determine where to remove the element
○ Search within the list at T[h(k)] and delete the node containing k if it is found
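The three operations above might look like the following sketch. It assumes integer keys, uses h(k) = k mod size as the hash function, and does not check for duplicates on insertion (which is what keeps insertion O(1)). The class and method names are illustrative.

```python
class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedHashTable:
    def __init__(self, size=16):
        self.size = size
        self.table = [None] * size  # T[i] is the head of the list for slot i

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        """O(1): put a new node at the beginning of the list at T[h(k)]."""
        index = self._hash(key)
        # Works for both cases: an empty slot (None) or an existing list.
        self.table[index] = Node(key, self.table[index])

    def search(self, key):
        """O(n) worst case: walk the list at T[h(k)]."""
        node = self.table[self._hash(key)]
        while node is not None:
            if node.key == key:
                return True
            node = node.next
        return False

    def delete(self, key):
        """O(n) worst case: unlink the first node containing key, if any."""
        index = self._hash(key)
        prev, node = None, self.table[index]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.table[index] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False


t = ChainedHashTable(16)
t.insert(85)  # "U" -> slot 5
t.insert(69)  # "E" -> slot 5 as well, chained in front of 85
print(t.search(85), t.search(69))  # True True
```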
27
Separate Chaining (Chaining)
Pros:
● The number of keys in each linked list is a small constant (assuming the
hash function is well defined), which gives constant time, i.e. O(1), for
searching, insertion and deletion of elements on average
● Deletion is easy
Cons:
● More space is needed as linked lists are used
● Memory allocation of nodes and manipulation of pointers slow down
the program
28
Probing
● Probing refers to finding another available place if a collision occurs
● The hash function for open addressing is as follows:
h(k, i) = ( h(k) + f(i) ) mod size, where h(k) is the original hash function, i is the
probe number, and f(i) is the collision resolution function
○ Typically f(0) = 0
● Hash functions for different open addressing schemes:
○ Linear probing: f(i) = i
○ Quadratic probing: f(i) = i^2
○ Re-hashing: f(i) = i * h2(k), where h2(k) is another hash function.
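To compare the three schemes, the sketch below lists the first few slots each one probes from the same home slot. The home slot 5, table size 16, and the fixed step of 3 standing in for h2(k) are arbitrary values chosen for illustration.

```python
def probe_sequence(home, size, f, attempts):
    """Slots visited by (home + f(i)) mod size for i = 0 .. attempts - 1."""
    return [(home + f(i)) % size for i in range(attempts)]


def linear(i):
    return i


def quadratic(i):
    return i * i


def rehash_with_step(step):
    """f(i) = i * h2(k); step plays the role of h2(k) for one particular key."""
    return lambda i: i * step


print(probe_sequence(5, 16, linear, 5))               # [5, 6, 7, 8, 9]
print(probe_sequence(5, 16, quadratic, 5))            # [5, 6, 9, 14, 5]
print(probe_sequence(5, 16, rehash_with_step(3), 5))  # [5, 8, 11, 14, 1]
```

All three sequences start at the home slot because f(0) = 0.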
29
Probing
● To insert key k into the hash table
○ Probe the hash table until an empty slot is found
● To search for a key k
○ Probe the hash table until the key is found or it is confirmed that the key is
not present
● To delete a key k from the hash table T
● Problem:
○ If the key is simply removed, but other keys that hash to the same location are
stored in other locations in the table, then searches for those keys will wrongly
be treated as unsuccessful
● Deletion must be “lazy”
○ Keep the key in the table, but mark it as deleted.
○ A new key can overwrite a location marked as deleted
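A minimal sketch of lazy deletion, using linear probing as the probing scheme. It assumes integer keys, h(k) = k mod size, and that the caller never inserts the same key twice. The sentinel objects `EMPTY` and `DELETED` and the class name are illustrative choices.

```python
EMPTY = object()    # slot has never held a key
DELETED = object()  # slot held a key that was lazily deleted


class LinearProbingTable:
    def __init__(self, size=16):
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key):
        home = key % self.size
        for i in range(self.size):
            yield (home + i) % self.size

    def insert(self, key):
        """Store key in the first empty or deleted slot on its probe sequence."""
        for index in self._probe(key):
            if self.slots[index] is EMPTY or self.slots[index] is DELETED:
                self.slots[index] = key
                return index
        raise RuntimeError("hash table is full")

    def search(self, key):
        """Return the slot index of key, or None if it is not in the table."""
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is EMPTY:
                return None  # a never-used slot ends the search
            if slot is not DELETED and slot == key:
                return index
            # A DELETED marker does not stop the search; keep probing.
        return None

    def delete(self, key):
        """Lazy delete: mark the slot instead of emptying it."""
        index = self.search(key)
        if index is None:
            return False
        self.slots[index] = DELETED
        return True


t = LinearProbingTable(16)
t.insert(85)         # "U" -> slot 5
t.insert(69)         # "E" -> slot 5 is taken, so it goes to slot 6
t.delete(85)         # slot 5 is marked as deleted, not emptied
print(t.search(69))  # 6, still found because the search skips the marker
print(t.insert(21))  # 5, the new key reuses the slot marked as deleted
```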
30
Probing: Linear Probing
● One of the simplest re-hashing functions is +1 (or -1), i.e. on a
collision, look in the neighboring slot in the table.
● It calculates the new address extremely quickly and may be
extremely efficient.
31
Probing: Linear Probing
● Clustering
○ Linear probing is subject to a clustering phenomenon.
○ Re-hashes from one location occupy a block of slots in the table which
"grows" towards slots to which other keys hash.
○ This exacerbates the collision problem, and the number of re-hashes can
become large.
32
Probing: Linear Probing
Pros:
● Easy to implement.
● Uses less memory than separate chaining.
● Fast when the table is sparse.
Cons:
● Hash table has fixed size.
● Likely to form blocks of contiguously occupied entries (clustering), which
causes bad performance since:
○ It increases the chance of collisions
○ It increases the searching time of elements
33
Probing: Quadratic Probing
Quadratic probing is an open-addressing scheme where, in the i-th iteration, we look
at the slot i^2 positions after the home slot if the slot for the given key x is
already occupied.
How is quadratic probing done?
Let hash(x) be the slot index computed using the hash function, and let S be the
table size.
● If the slot hash(x) % S is full, then we try (hash(x) + 1*1) % S.
● If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S.
● If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S.
● This process is repeated for increasing values of i until an empty slot is
found.
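The steps above can be sketched as an insertion routine. It assumes integer keys, hash(x) = x, a Python list with `None` marking empty slots, and a cap of S probes, after which it gives up. The function name is hypothetical.

```python
def quadratic_insert(table, key):
    """Insert key using (hash + i*i) % S probing; return the slot used or None."""
    size = len(table)
    home = key % size
    for i in range(size):
        index = (home + i * i) % size
        if table[index] is None:
            table[index] = key
            return index
    # Quadratic probing does not visit every slot, so it can fail to find
    # a free slot even when the table is not completely full.
    return None


table = [None] * 16
print(quadratic_insert(table, 85))  # 5, the home slot is free
print(quadratic_insert(table, 69))  # 6, from (5 + 1*1) % 16
print(quadratic_insert(table, 21))  # 9, from (5 + 2*2) % 16
print(quadratic_insert(table, 37))  # 14, from (5 + 3*3) % 16
```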
34
Probing: Quadratic Probing
Pros:
● Easy to implement.
● Resolves the primary clustering issue.
Cons:
● Keys that hash to the same initial location will probe the same
alternative cells, which causes clustering around the probe
sequences (called secondary clustering)
35
Probing: Re-hashing
● Re-hashing schemes use a second hashing operation when there is a collision.
● If there is a further collision, we re-hash until an empty "slot" in the table is found.
● The re-hashing function can either be a new function or a re-application of the
original one.
● As long as the functions are applied to a key in the same order, then a sought key
can always be located.
36
Probing: Re-hashing
37
Probing: Re-hashing
● How to choose the second hash function?
○ It should never evaluate to zero
○ Its value should be relatively prime to the size of the table
○ Otherwise, only a fraction of the table entries will be examined
● Pros:
○ Eliminates secondary clustering
● Cons:
○ Computing two hash functions is time consuming
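The two requirements on the second hash function can be checked with a small sketch. The choice h2(k) = 1 + (k mod (size - 1)) and the prime table size 13 are assumptions made for illustration; they are one common way to guarantee a non-zero step that is relatively prime to the table size.

```python
from math import gcd


def h1(key, size):
    return key % size


def h2(key, size):
    """Assumed second hash: always in 1 .. size - 1, so it is never zero."""
    return 1 + key % (size - 1)


def slots_visited(home, step, size):
    """Distinct slots reached by (home + i * step) % size for i = 0 .. size - 1."""
    return {(home + i * step) % size for i in range(size)}


size = 13  # prime, so every step in 1 .. 12 is relatively prime to it
key = 69
step = h2(key, size)
print(step, gcd(step, size))                          # 10 1
print(len(slots_visited(h1(key, size), step, size)))  # 13, every slot is examined

# A step that shares a factor with the table size reaches only some slots.
print(gcd(4, 16))                     # 4
print(sorted(slots_visited(5, 4, 16)))  # [1, 5, 9, 13]
```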
38
Q&A
39