CSC508 Hashing
TOPIC 6B
Hashing
Zulaile Mabni
CHAPTER OBJECTIVES
▪ Learn about hashing
▪ Learn about hash methods
▪ Mid-square
▪ Folding
▪ Division
▪ Learn about collision
▪ Open addressing
▪ Chaining
2
▪ Need a data structure in which finds/searches are very fast
▪ Insertion and deletion should be fast too
▪ Objects have unique keys
▪ A key may be a single property/attribute value
▪ Or may be created from multiple properties/values
3
▪ Maximize efficiency: implement the operations Insert(),
Delete() and Search()/Find() efficiently.
▪ Arrays:
▪ not space efficient (assumes we leave empty space for keys
not currently in the structure)
▪ Linked List
▪ space efficient
▪ Insert(), Delete() and Search()/Find() not too efficient
▪ Hash Tables:
▪ Better than the above in terms of space and efficiency
4
▪ Very useful data structure
▪ Good for storing and retrieving key-value pairs
▪ Not good for iterating through a list of items
▪ Example applications:
▪ Storing objects according to ID numbers
▪ When the ID numbers are widely spread out
▪ When you don’t need to access items in ID order
5
▪ A hash value or hash index is used to index
the hash table (array)
▪ A hash function takes a key and returns a
hash value/index
▪ The hash index is an integer (to index an array)
6
▪ You want a hash function/algorithm that:
▪ Is fast
▪ Is easy to compute
▪ Minimizes the number of collisions
▪ Creates a good distribution of hash values so that the items
(based on their keys) are distributed evenly through the
array
▪ Hash functions can use as input
▪ Integer key values
▪ String key values
▪ Multipart key values
▪ Multipart fields, and/or
▪ Multiple fields
7
▪ The performance of the hash table depends on having a hash
function that evenly distributes the keys: uniform hashing is the
ideal target
▪ Choosing a good hash function requires taking into account the
kind of data that will be used.
▪ E.g., choosing the first letter of a last name will likely cause
lots of collisions, because some initial letters are far more
common than others in any given population.
▪ Most programming languages (including Java) have hash
functions built in.
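As an illustration of the built-in hash functions mentioned above, the following minimal sketch uses Java's `Object.hashCode()` and reduces it to an array index. The class name `BuiltInHashDemo`, the helper `indexFor` and the table size 11 are assumptions made for this example only.

```java
class BuiltInHashDemo {
    // Maps an object's built-in hash code to an index in [0, tableSize).
    // Math.floorMod is used because hashCode() may return a negative value.
    static int indexFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        int tableSize = 11; // hypothetical table size
        System.out.println(indexFor("Smith", tableSize)); // String key
        System.out.println(indexFor(2024, tableSize));    // Integer key (autoboxed)
    }
}
```

Using `Math.floorMod` instead of `%` is a design choice here: `%` can yield a negative result in Java when the hash code is negative, which would not be a valid array index.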
8
▪ Division (Modular arithmetic)
▪ key mod m
▪ m is the array size; in general, it should be a prime number
▪ Key X is converted into an integer i_X
▪ This integer is divided by the size of the hash table (HT); the
remainder gives the address of X in HT
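The division method for integer keys can be sketched as follows. The class name `DivisionHash` and the table size 13 are assumptions chosen for illustration; 13 is used only because it is a small prime.

```java
class DivisionHash {
    // Division method: the hash address is the remainder of key / m.
    // Math.floorMod keeps the result in [0, m) even for negative keys.
    static int hash(int key, int m) {
        return Math.floorMod(key, m);
    }

    public static void main(String[] args) {
        int m = 13; // hypothetical prime table size
        int[] keys = {4, 36, 100};
        for (int key : keys) {
            System.out.println(key + " -> " + hash(key, m));
        }
    }
}
```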
9
▪ mod stands for modulo
▪ When you divide x by y, you get a result and a remainder
▪ Mod is the remainder
▪ 8 mod 5 = 3
▪ 9 mod 5 = 4
▪ 10 mod 5 = 0
▪ 15 mod 5 = 0
10
Hash Tables – Conceptual View
h(key) = key mod 8
[Figure: hash table with slots 0 to 5 and buckets b2 to b4; the objects shown are Obj2 (key=36), Obj3 (key=4), Obj4 (key=2) and Obj5 (key=1)]
11
Suppose that each key is a string. The following
Java method uses the division method to compute
the address of the key:
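A minimal sketch of such a method, assuming the string is converted to an integer by summing its character codes. The class name `StringDivisionHash` and the table size `HT_SIZE = 101` are assumptions for this example.

```java
class StringDivisionHash {
    static final int HT_SIZE = 101; // hypothetical prime table size

    // Converts the key to an integer by adding up its character codes,
    // then applies the division method.
    static int hashFunction(String insertKey) {
        int sum = 0;
        for (int j = 0; j < insertKey.length(); j++) {
            sum += insertKey.charAt(j);
        }
        return sum % HT_SIZE;
    }
}
```

Summing character codes is simple but weak: any two strings made of the same characters in a different order (for example "abc" and "cba") collide.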
12
▪ Mid-Square
▪ Hash method, h, computed by squaring the
identifier
▪ Using appropriate number of bits from the middle
of the square to obtain the bucket address
▪ Because the middle bits of a square usually depend on all the
characters of the key, different keys are expected to
yield different hash addresses with high
probability, even if some of the characters are the
same
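One possible way to take bits from the middle of the square is sketched below. It assumes an integer key and a table of 2^r buckets; the class name `MidSquareHash` and the choice to centre the window on the significant bits of the square are assumptions of this example, not something the slides specify.

```java
class MidSquareHash {
    // Squares the key and returns r bits taken from the middle of the
    // square's significant bits. The result lies in [0, 2^r).
    // Assumes 1 <= r <= 30.
    static int hash(int key, int r) {
        long square = (long) key * key;                        // cannot overflow a long
        int bitLength = 64 - Long.numberOfLeadingZeros(square); // significant bits
        int shift = Math.max(0, (bitLength - r) / 2);          // bits to drop on the right
        return (int) ((square >>> shift) & ((1L << r) - 1));   // keep the r middle bits
    }
}
```

For example, key 12 has square 144 (binary 10010000); with r = 4 the two lowest bits are dropped and the next four bits, 0100, give bucket 4.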
13
14
▪ Folding
▪ Key X is partitioned into parts such that all the parts, except
possibly the last part, are of equal length
▪ The parts are then added, in some convenient way, to obtain the hash address
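Folding can be sketched as follows, assuming the key is given as a string of decimal digits, the parts are simply added, and the sum is reduced modulo the table size. The class name `FoldingHash` and the parameters `partLen` and `tableSize` are assumptions for this example.

```java
class FoldingHash {
    // Splits the digit string into parts of partLen digits (the last part
    // may be shorter), adds the parts, and reduces the sum modulo tableSize.
    // Assumes the key contains decimal digits only and partLen <= 9.
    static int fold(String digits, int partLen, int tableSize) {
        int sum = 0;
        for (int i = 0; i < digits.length(); i += partLen) {
            int end = Math.min(i + partLen, digits.length());
            sum += Integer.parseInt(digits.substring(i, end));
        }
        return sum % tableSize;
    }
}
```

For the key 123456789 with parts of three digits, the parts are 123, 456 and 789; their sum is 1368, and 1368 % 1000 = 368.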
15
= 105 % 100
16
▪ Usage summary:
int hashValue = hashFunction(key);    // key is an int
▪ Or hashValue = hashFunction(key);   // key is a String
▪ Or hashValue = hashFunction(item);  // item is an itemType
▪ Insert method:
public void insert(int key, itemType item) {
    int hashValue = hashFunction(key);  // compute the array index
    table[hashValue] = item;            // no collision handling yet
}
17
For example, if we hash keys 0…1000 into a hash table with 5
entries and use h(key) = key mod 5 , we get the following
sequence of events:
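[Figure: three snapshots of the 5-entry table (entries 0 to 4). Key 21 is stored in entry 1 and key 34 in entry 4; a later key then hashes to the occupied entry 4. There is a collision at array entry #4.]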
18
▪ Algorithms to handle collisions
▪ Two categories of collision resolution techniques
▪ Open addressing (closed hashing)
▪ Chaining (open hashing)
19
▪ A problem arises when two keys hash to the
same array entry; this is called a collision.
▪ There are two ways to resolve collisions: open addressing and chaining.
20
The problem is that keys 34 and 54 hash to the same entry (4). We
solve this collision by placing all keys that hash to the same hash table
entry in a chain (linked list) or bucket (array) pointed to by this entry.
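Chaining can be sketched as an array of linked lists, one chain per table entry. This is a minimal illustration that stores integer keys only; the class name `ChainedHashTable` and its methods are assumptions for this example.

```java
import java.util.LinkedList;

class ChainedHashTable {
    private final LinkedList<Integer>[] table; // one chain per array entry

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            table[i] = new LinkedList<>();
        }
    }

    private int hash(int key) {
        return Math.floorMod(key, table.length);
    }

    // Appends the key to the chain of its entry unless it is already there.
    void insert(int key) {
        LinkedList<Integer> chain = table[hash(key)];
        if (!chain.contains(key)) {
            chain.add(key);
        }
    }

    boolean contains(int key) {
        return table[hash(key)].contains(key);
    }

    int chainLength(int index) {
        return table[index].size();
    }
}
```

With a table of 5 entries, inserting 34 and then 54 leaves both keys in the chain of entry 4, so `chainLength(4)` is 2 and neither key is lost.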
21
22
Collisions are resolved by systematically examining other table
indexes, i_0, i_1, i_2, …, until an empty slot is located.
▪ The key is first mapped to an array cell using the hash function (e.g.
key % array-size)
▪ If there is a collision, find an available array cell
▪ There are different algorithms to find (to probe for) the next array cell
▪ Linear probing
▪ Quadratic probing
▪ Random probing
▪ Double Hashing
23
▪ Suppose that an item with key X is to be inserted in HT
▪ Use hash function to compute index h(X) of item in HT
▪ Suppose h(X) = t.
▪ If HT[t] is empty, store the item in this array slot.
▪ If HT[t] is already occupied by another item, a collision
occurs
▪ Linear probing: starting at location t, search array sequentially to
find next available array slot:
▪ (t + 1) % HTSize, (t + 2) % HTSize,…,(t + j) % HTSize
▪ Be sure to wrap around the end of the array!
▪ Stop when you have tried all possible array indices
▪ If the array is full, you need to throw an exception or, better
yet, resize the array
24
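Pseudocode implementing linear probing:
hIndex = hashmethod(insertKey);
found = false;
probes = 0;                          // guards against looping forever on a full table
while (probes < HTSize && HT[hIndex] != emptyKey && !found) {
    if (HT[hIndex].key == insertKey)
        found = true;                // the key is already in the table
    else {
        hIndex = (hIndex + 1) % HTSize;  // wrap around the end of the array
        probes++;
    }
}
if (found)
    System.out.println("Duplicate items not allowed");
else if (probes == HTSize)
    System.out.println("Hash table is full");
else
    HT[hIndex] = newItem;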
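The pseudocode above can be turned into a small runnable class. This sketch stores integer keys only and uses `null` as the empty marker; the class name `LinearProbingTable`, the return convention (slot index, or -1 for a duplicate) and the exception on a full table are assumptions of this example.

```java
class LinearProbingTable {
    private final Integer[] table; // null marks an empty slot

    LinearProbingTable(int size) {
        table = new Integer[size];
    }

    // Returns the slot where the key was stored, or -1 if the key is a
    // duplicate. Throws if every slot was probed without finding room.
    int insert(int key) {
        int start = Math.floorMod(key, table.length);
        for (int j = 0; j < table.length; j++) {
            int index = (start + j) % table.length; // wraps around the end
            if (table[index] == null) {
                table[index] = key;
                return index;
            }
            if (table[index] == key) {
                return -1; // duplicate items not allowed
            }
        }
        throw new IllegalStateException("hash table is full");
    }
}
```

With 5 slots, key 34 goes to slot 4; key 54 also hashes to 4, finds it occupied, wraps around and lands in slot 0.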
25
26
▪ Random probing uses a random number generator to find the
next available slot
▪ The ith slot in the probe sequence is (h(X) + r_i) %
HTSize, where r_i is the ith value in a random
permutation of the numbers 1 to HTSize – 1
▪ Suppose HTSize = 101, h(X) = 26, and r_1 = 2, r_2 =
5, r_3 = 8.
▪ The probe sequence of X then starts with 26,
28, 31, 34
▪ All insertions and searches use the same
sequence of random numbers
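Random probing can be sketched by generating the permutation once, from a fixed seed, so that insertions and searches follow the same sequence. The class name `RandomProbingTable`, the seed parameter and the use of `Collections.shuffle` to build the permutation are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class RandomProbingTable {
    private final Integer[] table;                          // null marks an empty slot
    private final List<Integer> offsets = new ArrayList<>(); // r_1 .. r_(size-1)

    RandomProbingTable(int size, long seed) {
        table = new Integer[size];
        for (int r = 1; r < size; r++) {
            offsets.add(r);
        }
        // One fixed random permutation of 1 .. size-1, shared by all operations.
        Collections.shuffle(offsets, new Random(seed));
    }

    // Slot number i of the probe sequence: h(X) for i = 0, else (h(X) + r_i) % size.
    int probe(int key, int i) {
        int home = Math.floorMod(key, table.length);
        return i == 0 ? home : (home + offsets.get(i - 1)) % table.length;
    }

    // Returns the slot used, or -1 for a duplicate key.
    int insert(int key) {
        for (int i = 0; i < table.length; i++) {
            int index = probe(key, i);
            if (table[index] == null) {
                table[index] = key;
                return index;
            }
            if (table[index] == key) {
                return -1;
            }
        }
        throw new IllegalStateException("hash table is full");
    }

    boolean contains(int key) {
        for (int i = 0; i < table.length; i++) {
            int index = probe(key, i);
            if (table[index] == null) return false; // probe sequence ends here
            if (table[index] == key) return true;
        }
        return false;
    }
}
```

Because the offsets are a permutation of 1 to size - 1, the probe sequence of every key visits each slot exactly once.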
27
▪ In quadratic probing, starting at position t,
check the array locations (t + 1²) % HTSize, (t
+ 2²) % HTSize, …, (t + i²) % HTSize.
▪ There is no guarantee that it probes all the positions
in the table
▪ When HTSize is prime, quadratic probing
probes about half the table before repeating
the probe sequence
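The probe sequence and the "about half the table" behaviour can be checked with a short sketch. The class name `QuadraticProbing` and the helper `reachableSlots` are assumptions for this example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class QuadraticProbing {
    // i-th location of the quadratic probe sequence starting at position t.
    static int probe(int t, int i, int htSize) {
        return (int) ((t + (long) i * i) % htSize);
    }

    // Collects the distinct slots visited by the first htSize probes
    // (i = 0 is the home position t itself).
    static Set<Integer> reachableSlots(int t, int htSize) {
        Set<Integer> slots = new LinkedHashSet<>();
        for (int i = 0; i < htSize; i++) {
            slots.add(probe(t, i, htSize));
        }
        return slots;
    }

    public static void main(String[] args) {
        System.out.println(reachableSlots(0, 11));
    }
}
```

For HTSize = 11, only 6 of the 11 slots are ever reached from a given start position, so an insertion can fail even when free slots remain.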
28
29
▪ Apply a second hash function after the first
▪ The second hash function, like the first, is dependent on the key
▪ The secondary hash function must
▪ Be different from the first
▪ Never evaluate to zero (a step of zero would probe the same slot forever)
▪ Good algorithm:
▪ arrayIndex = (arrayIndex + stepSize) % arraySize;
▪ Where stepSize = constant – (key % constant)
▪ And constant is a prime less than the array size
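The step-size formula above can be sketched as follows. The class name `DoubleHashingTable`, the integer-only keys and the return convention are assumptions for this example; the sketch also assumes the array size is prime, so that every step size eventually reaches every slot.

```java
class DoubleHashingTable {
    private final Integer[] table; // null marks an empty slot
    private final int constant;    // a prime smaller than the array size

    DoubleHashingTable(int arraySize, int constant) {
        this.table = new Integer[arraySize];
        this.constant = constant;
    }

    // Second hash function: always in the range 1 .. constant, never zero.
    int stepSize(int key) {
        return constant - Math.floorMod(key, constant);
    }

    // Returns the slot used, or -1 for a duplicate key.
    int insert(int key) {
        int arrayIndex = Math.floorMod(key, table.length); // first hash function
        int step = stepSize(key);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[arrayIndex] == null) {
                table[arrayIndex] = key;
                return arrayIndex;
            }
            if (table[arrayIndex] == key) {
                return -1;
            }
            arrayIndex = (arrayIndex + step) % table.length;
        }
        throw new IllegalStateException("no free slot found");
    }
}
```

With an array size of 11 and constant 7, keys 22, 33 and 44 all hash to slot 0, but their step sizes differ (6, 2 and 5), so the colliding keys spread out instead of clustering.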
30
31
32
33
▪ http://www.cs.auckland.ac.nz/software/AlgAnim/hash_tables.html
34
▪ Malik D.S., Nair P.S., Data Structures Using Java, Course
Technology, 2003.
35