Basic Algorithms 6
Hashing (Ch. 5)
Morgan Ericsson
Algorithms/ Hashing 1
Today
» Hashing
» Separate chaining
» Probing
» Double hashing
» Rehashing
Hashing
Question
» Assume we have a list of integers
» We can find a given integer in
» O(n) time if we use sequential search
» O(log₂ n) time if we use a balanced binary search tree
» Can we do better?
» Think back to union find/disjoint set
Binary list
» Let’s try to adopt the idea from union find
» But now we only care if a number is in the list or
not
» So, if we want to insert a number
» lst[6433] = True
» If we want to check
» lst[2137] is not None
» Reasonable idea, but how large should the list be?
» How much space is wasted?
Flawed but good idea
» We use the integer value to map a key to a position
» The problem is the range of the random integers
» If we wanted to store all numbers 1 to 100 it would
not be a problem
» But in the general case, the numbers might be too
large and the list would be very sparse
Flawed but good idea
import numpy as np

# One boolean slot per possible key, up to the largest key.
lst = np.zeros(7630 + 1, dtype=bool)

for v in [7630, 3275, 6433, 5913, 2137]:
    lst[v] = True

print(len(lst[lst == True]))
print(len(lst[lst == False]))
5
7626
General idea
» We use the key to map to an element
» With unlimited space, we can just use the key's value as the index
» With limited space, we need a “better” mapping function
» What if we use %?
Another way to map
lst_sz = 5
lst = np.zeros(lst_sz, dtype=bool)

for v in [7630, 3275, 6433, 5913, 2137]:
    lst[v % lst_sz] = True

print(len(lst[lst == True]))
print(len(lst[lst == False]))
3
2
Hash
» The mapping is called a hashing function or hash code
» The problem we observed is a collision
» When two (or more) keys map to the same
position
» A perfect hashing function should never produce
collisions
» but it can be difficult to define
» Especially since it should also be efficient to compute
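To make the collisions visible, the following minimal sketch groups the sample keys from the modulo example by the slot they map to. The helper name `slots_for` is hypothetical and not part of the lecture code.

```python
from collections import defaultdict

def slots_for(keys, size):
    """Group keys by the slot that `key % size` maps them to."""
    slots = defaultdict(list)
    for k in keys:
        slots[k % size].append(k)
    return dict(slots)

keys = [7630, 3275, 6433, 5913, 2137]
slots = slots_for(keys, 5)

# Keep only the slots that received more than one key.
collisions = {s: ks for s, ks in slots.items() if len(ks) > 1}
print(collisions)  # {0: [7630, 3275], 3: [6433, 5913]}
```

With a table of size 5, four of the five keys share a slot with another key, which is why only three slots were marked as used.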
Another try
lst_sz = 31
lst = np.zeros(lst_sz, dtype=bool)

def hashf(v):
    return v % lst_sz

for v in [7630, 3275, 6433, 5913, 2137]:
    lst[hashf(v)] = True

print(len(lst[lst == True]))
5
More problems
» Without perfect hashing, we cannot recover the key from its hash value
» Hashing is a one-way function
» 99 % 10 == 9 % 10
» There is no order
» What if the keys are not integers?
Another example
lst_sz = 31
lst = np.zeros(lst_sz, dtype=bool)

def hashf(v:str) -> int:
    pass  # how should a string be mapped to an integer?

for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
    lst[hashf(v)] = True
Options
» Use the first (or last) k letters
» Can be a bad idea, many domains have common pre-
and suffixes
» Names
» Phone numbers
» …
» Use the whole key?
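A short sketch of why hashing only a prefix can go wrong. The function `prefix_hash` and the sample names are assumptions chosen for illustration; they are not taken from the lecture.

```python
def prefix_hash(key: str, k: int = 3) -> int:
    """Hash only the first k characters (sum of their code points)."""
    return sum(ord(c) for c in key[:k])

# Hypothetical names that share the prefix 'Cha'.
names = ['Charlotte', 'Charles', 'Charlie', 'Chandler']
values = {n: prefix_hash(n) for n in names}
print(values)
print(len(set(values.values())))  # 1: every name gets the same hash value
```

All four names collide because the part of the key that distinguishes them is never looked at.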
Example
def hashf(key:str) -> int:
    # Sum the code points of all characters in the key.
    hv = 0
    for c in key:
        hv += ord(c)

    return hv
Example
lst_sz = 31
lst = np.zeros(lst_sz, dtype=bool)

for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
    lst[hashf(v) % lst_sz] = True

print(len(lst[lst == True]))
All is good?
print(hashf('abc') == hashf('acb'))
True
What can we do?
def hashf(key:str) -> int:
    # Weight each character by its position so that order matters.
    hv = 0
    for ix,c in enumerate(key, start=2):
        hv += ix * ord(c)

    return hv
Solved?
print(hashf('abc') == hashf('acb'))
print(hashf('abc') == hashf('-10['))
False
True
Keep in mind
» Try to avoid repetition and “round” numbers
» It is generally a bad idea to use table sizes that are multiples of 10
» Or powers of two
» Use prime numbers to break patterns
» Preferably close to powers of two
» Use as much of the key as possible
» More bits means more variation
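The effect of the table size can be illustrated with patterned keys. The sketch below counts how many distinct slots are used when 100 keys that are all multiples of 10 are mapped with `%`; the helper `used_slots` and the chosen sizes are assumptions for illustration.

```python
def used_slots(keys, size):
    """Number of distinct slots hit when keys are mapped with key % size."""
    return len({k % size for k in keys})

keys = range(0, 1000, 10)      # 100 keys, all multiples of 10

print(used_slots(keys, 100))   # 10: a "round" size keeps the pattern
print(used_slots(keys, 64))    # 32: a power of two still shares a factor with 10
print(used_slots(keys, 97))    # 97: a prime size spreads the keys out
```

A prime size shares no factor with the step between the keys, so the pattern in the keys does not carry over to the slots.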
Hashing in Python (and Java)
» Types define a hash value
» hash('Olivia')
» hash((1, 3))
» Similar in Java
» Required for certain things to work, e.g., sets
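A brief sketch of how the built-in `hash` behaves. Note that string hashes are randomized per interpreter process by default, so the printed values differ between runs; only equality within one run can be relied on.

```python
# Equal values hash equally within one run of the interpreter.
assert hash((1, 3)) == hash((1, 3))
assert hash('Olivia') == hash('Oliv' + 'ia')

# Sets rely on hashing to test membership.
seen = {(1, 3), 'Olivia'}
print((1, 3) in seen)   # True

# Mutable built-ins such as list are unhashable.
try:
    hash([1, 3])
except TypeError as err:
    print('unhashable:', err)
```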
Simple example
class Person:
    def __init__(self, n:str, a:int) -> None:
        self.name = n
        self.age = a

    def __str__(self) -> str:
        return f'{self.name} ({self.age})'

p1 = Person('Olivia', 34)
print(hash(p1))
382262012
For free?
p1 = Person('Olivia', 34)
p2 = Person('Olivia', 34)

print(hash(p1) == hash(p2))
False
For free?
from fastcore.basics import patch

@patch
def __hash__(self:Person) -> int:
    # Combine the hashes of the fields, multiplying by a prime each step.
    hv = 17
    hv = 31 * hv + hash(self.name)
    hv = 31 * hv + hash(self.age)
    return hv
New try
p1 = Person('Olivia', 34)
p2 = Person('Olivia', 34)

print(hash(p1) == hash(p2))
True
Using our function
plst = [None] * 31

p1 = Person('Olivia', 34)
p2 = Person('Mia', 11)

plst[hash(p1) % 31] = p1
plst[hash(p2) % 31] = p2

print(plst[hash(p1) % 31])
Olivia (34)
Suppose Olivia had a birthday …
p1.age += 1

# The hash depends on age, so the lookup now goes to a different slot.
print(plst[hash(p1) % 31])
None
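One common way to avoid this problem is to make keys immutable. The sketch below uses a frozen dataclass, which generates `__eq__` and `__hash__` from the fields and rejects later changes; the class name `FrozenPerson` is hypothetical and only stands in for `Person`.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class FrozenPerson:          # hypothetical immutable variant of Person
    name: str
    age: int

p1 = FrozenPerson('Olivia', 34)
p2 = FrozenPerson('Olivia', 34)
print(p1 == p2, hash(p1) == hash(p2))  # True True

try:
    p1.age += 1              # assignment to a field is not allowed
except FrozenInstanceError:
    print('fields cannot be changed after creation')
```

Because the fields cannot change, the hash value of an object stays the same for as long as it is stored in a table.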
Simple hash table
class HT:
    def __init__(self):
        self.sz = 31
        self.table = [None] * 31

    def insert(self, key):
        # A colliding key silently overwrites the previous one.
        self.table[hash(key) % self.sz] = key

    def contains(self, key):
        return self.table[hash(key) % self.sz] == key

    def __len__(self):
        return len([v for v in self.table if v is not None])
Testing it
import random

ht = HT()
for i in range(10):
    v = random.randint(1, 100_000)
    ht.insert(v)

print(len(ht))
Separate chaining
Collisions
» We have seen that some hashing functions can result
in collisions
» Can we manage the collisions?
Hash functions
» We want the hash function to distribute keys uniformly across the integer interval
» Bins and balls
» Assume that each key is a ball and each position is a bin
» If we toss a ball at random, it should be equally likely to end up in any of the bins
» If we have m bins and toss n balls, we would expect there to be about n / m balls in each bin after a while
Uniformity
Tossing 1 000 000 balls into 31 bins
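The experiment can be sketched as follows. This is an assumed reconstruction, not the code behind the original figure: it uses a fixed seed and fewer balls (100 000) to keep the run short.

```python
import random
from collections import Counter

random.seed(1)            # fixed seed so the run is repeatable
m, n = 31, 100_000        # bins and balls (fewer balls than on the slide)

# Toss each ball into a bin chosen uniformly at random and count per bin.
bins = Counter(random.randrange(m) for _ in range(n))

expected = n / m
print(f'expected per bin: {expected:.0f}')
print(f'min: {min(bins.values())}, max: {max(bins.values())}')
```

With a uniform choice, every bin ends up close to n / m balls, and the relative spread shrinks as n grows.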
We also know
» We can expect two balls in the same bin after ∼ $\sqrt{\pi m / 2}$ tosses
» Every bin has ≥ 1 balls after ∼ $m \ln m$ tosses
» After m tosses, the most loaded bin has $\Theta(\log m / \log \log m)$ balls
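The first estimate, about √(πm/2) tosses until two balls share a bin, can be checked with a small simulation. The function name, the seed, and the choice m = 365 are assumptions for illustration; the estimate is asymptotic, so the simulated mean is only expected to be close to it.

```python
import math
import random

def tosses_until_collision(m: int) -> int:
    """Toss balls into m bins until one lands in an occupied bin."""
    seen = set()
    tosses = 0
    while True:
        tosses += 1
        b = random.randrange(m)
        if b in seen:
            return tosses
        seen.add(b)

random.seed(1)
m = 365
trials = 2000
mean = sum(tosses_until_collision(m) for _ in range(trials)) / trials
estimate = math.sqrt(math.pi * m / 2)
print(f'simulated: {mean:.1f}, estimate: {estimate:.1f}')
```

For a hash table this means the first collision shows up long before the table is full.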
So, how can we deal with collisions?
» Separate chaining
» We make each bin a linked list
» And place keys that collide in the same bin
A simple linked list
from dataclasses import dataclass

@dataclass
class LLNode:
    key: object  # any hashable key; ints and strings are both used later
    nxt: 'LLNode|None' = None
Separate chaining
class HTSC:
    def __init__(self, m=5):
        self.sz = m
        self.table = [None] * self.sz  # one chain (linked list) per slot
Inserting
@patch
def insert(self:HTSC, key):
    hv = hash(key) % self.sz
    self.table[hv] = self._inschain(self.table[hv], key)

@patch
def _inschain(self:HTSC, n:LLNode|None, key) -> LLNode:
    # Recursively walk to the end of the chain and append the key there.
    if n is None:
        return LLNode(key)
    n.nxt = self._inschain(n.nxt, key)
    return n
Finding
@patch
def find(self:HTSC, key) -> bool:
    hv = hash(key) % self.sz
    # Walk the chain in slot hv; an empty slot skips the loop.
    p = self.table[hv]
    while p is not None:
        if p.key == key:
            return True
        p = p.nxt
    return False
Len
@patch
def __len__(self:HTSC) -> int:
    l = 0
    for t in self.table:
        l += self._lchain(t)
    return l

@patch
def _lchain(self:HTSC, n:LLNode|None) -> int:
    # Count the nodes in one chain.
    l = 0
    while n is not None:
        l += 1
        n = n.nxt
    return l
Testing it
ht = HTSC()

for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
    ht.insert(v)

print(len(ht))
5
Testing it some more
ht = HTSC()
for i in range(200):
    v = random.randint(1, 100_000)
    ht.insert(v)

print(len(ht))
200
Linear again?
» With separate chaining, we need to search the list to determine whether the value exists
» We know that each list holds n / m keys on average
» Where n is the number of keys and m is the size of the table
» So, it can be significantly better than O(n)
How should we pick m?
» If m is too small, the lists will be too long
» If m is too large, we will waste space
» A good rule of thumb is to set m to n / 5
» Then access should be O(1)
» Since we expect around 5 elements per bin /
bucket
Testing the idea
n = 63 * 5
m = 63

rv = []
for _ in range(n):
    rv.append(random.randint(0, 62))

ht = HTSC(m)
for v in rv:
    ht.insert(v)

# Length of every chain in the table.
ll = [ht._lchain(n) for n in ht.table]
Uniform?
Faking uniformity
n = 63 * 5
m = 63

ht = HTSC(m)
for i in range(n):
    ht.insert(i)

ll = [ht._lchain(n) for n in ht.table]
Increasing the range
n = 63 * 5
m = 63

ht = HTSC(m)
for i in range(n):
    v = random.randint(0, 100_000)
    ht.insert(v)

ll = [ht._lchain(n) for n in ht.table]
Linear probing
» Separate chaining works, but:
» Introduces a second data structure
» Has overhead in creating nodes
» What if we “chain” in the existing list?
» If a slot is taken, find the next empty one
» If hash(v) = i and i is taken, try i + 1, i + 2, …
until an empty slot is found
» Must repeat the same when searching…
» The list must be larger than the number of keys
Linear probing
class HTLP:
    def __init__(self, m=5):
        self.sz = m
        self.table = [None] * self.sz
Inserting
@patch
def insert(self:HTLP, key):
    hv = hash(key) % self.sz
    if self.table[hv] is None:
        self.table[hv] = key
    else:
        # Step forward (with wrap-around) until an empty slot is found.
        while self.table[hv] is not None:
            hv = (hv + 1) % self.sz
        self.table[hv] = key
Finding
@patch
def find(self:HTLP, key) -> bool:
    hv = hash(key) % self.sz
    # Follow the same probe sequence as insert; an empty slot ends the search.
    while self.table[hv] is not None:
        if self.table[hv] == key:
            return True
        hv = (hv + 1) % self.sz
    return False
Len
@patch
def __len__(self:HTLP) -> int:
    return len([v for v in self.table if v is not None])
Testing it…
ht = HTLP(7)

for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
    ht.insert(v)

assert ht.find('Liam')
assert not ht.find('John')
print(len(ht))
5
Testing it
@patch
def insert(self:HTLP, key) -> int:
    # Same as before, but return how far the key was displaced.
    hv = hash(key) % self.sz
    if self.table[hv] is None:
        self.table[hv] = key
        return 0
    else:
        off = 0
        while self.table[hv] is not None:
            hv = (hv + 1) % self.sz
            off += 1
        self.table[hv] = key
        return off
Testing it
ht = HTLP(7)

for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
    print(ht.insert(v))
0
0
0
0
0
Breaking it
ht = HTLP(100)

# 1001, 1101, ..., 1701 all hash to slot 1.
for i in range(1001, 1800, 100):
    ht.insert(i)

print(ht.insert(1002))
What is going on?
0: [ ]
1: [ 1001 ]
2: [ 1101 ]
3: [ 1201 ]
4: [ 1301 ]
5: [ 1401 ]
6: [ 1501 ]
7: [ 1601 ]
8: [ 1701 ]
9: [ 1002 ]
10: [ ]
Clustering
» A cluster is a contiguous block of items
» Collisions create clusters, since we keep adding after
the expected position
» So, new keys are likely to hash into the middle of big
clusters
» Which will increase the displacement (offset)
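Cluster sizes can be measured directly on the table. The helper below is a hypothetical addition, not part of the lecture's classes; it takes a plain list in which `None` marks an empty slot and returns the length of the longest run of occupied slots, treating the table as circular.

```python
def longest_cluster(table) -> int:
    """Length of the longest run of occupied (non-None) slots, with wrap-around."""
    m = len(table)
    if all(v is not None for v in table):
        return m
    # Start scanning just after an empty slot so a wrapped cluster is counted once.
    start = next(i for i, v in enumerate(table) if v is None)
    best = run = 0
    for step in range(1, m + 1):
        if table[(start + step) % m] is not None:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

table = [None, 1001, 1101, 1201, None, None, 7, None]
print(longest_cluster(table))  # 3
```

A key that hashes anywhere into a cluster of length k is placed at its end, so the cluster grows to k + 1 and becomes an even larger target.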
Knuth’s parking problem
» Cars arrive at a (one-way) street with m parking
spaces
» Each car desires a specific space i, but will try i + 1,
i + 2, … if i is taken
» What is the average displacement?
» with m / 2 cars, ∼ 3 / 2
» with m cars, ∼ $\sqrt{\pi m / 8}$
Analysis of linear probing
» Assume we have a list of size m and n = αm keys
» We can then determine the average number of probes
if we have a search hit
$$\frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right)$$
» And if we miss/insert
$$\frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)$$
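To get a feel for how quickly the cost grows with the load factor, the two estimates, ½(1 + 1/(1 − α)) for a hit and ½(1 + 1/(1 − α)²) for a miss/insert, can be tabulated. The function names are assumptions introduced for this sketch.

```python
def probes_hit(alpha: float) -> float:
    """Estimated probes for a search hit at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha))

def probes_miss(alpha: float) -> float:
    """Estimated probes for a search miss or an insert at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

for alpha in (0.25, 0.5, 0.75, 0.9):
    print(f'{alpha:.2f}  hit: {probes_hit(alpha):5.2f}  miss: {probes_miss(alpha):6.2f}')
```

At α = 0.5 the estimates are 1.5 and 2.5 probes; at α = 0.9 they are 5.5 and 50.5, so the cost of a miss grows much faster than the cost of a hit.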
Does it make sense?
m = 100

disp = []
for t in range(1000):
    ht = HTLP(m)

    # Fill the table to a load factor of 1/2.
    for _ in range(50):
        ht.insert(random.randint(0, 100_000))

    # Record the displacement of one more insert.
    for _ in range(1):
        disp.append(ht.insert(random.randint(0, 100_000)))

print(np.mean(disp), 'Expect:', 3/2)
1.479 Expect: 1.5
Does it make sense?
m = 100
n = 75

disp = []
for t in range(1000):
    ht = HTLP(m)

    for _ in range(n):
        ht.insert(random.randint(0, 100_000))

    for _ in range(1):
        disp.append(ht.insert(random.randint(0, 100_000)))

print(np.mean(disp), 'Expect:', 0.5*(1+1/(1-n/m)**2))
5.716 Expect: 8.5
Analysis of linear probing
» If m is too large, too much wasted space
» If m is too small, search time blows up
» Rule of thumb: α = n / m ≈ 1 / 2
» Probes for a hit: about 3 / 2
» Probes for a miss/insert: about 5 / 2
Dealing with clusters
» How can we deal with clusters?
» Attack linearity?
» Longer steps to avoid clusters?
A new class
class HTQP:
    def __init__(self, m=5):
        self.sz = m
        self.table = [None] * self.sz
Quadratic probing
@patch
def insert(self:HTQP, key):
    hv = hash(key) % self.sz
    if self.table[hv] is None:
        self.table[hv] = key
        return 0
    else:
        # Note: the step doubles on every collision (1, 2, 4, ...), so the
        # probed slots are h+1, h+3, h+7, ...; textbook quadratic probing
        # uses h + i**2 instead.
        offs = 0
        while self.table[hv] is not None:
            hv = (hv + 2**offs) % self.sz
            offs += 1
        self.table[hv] = key
        return offs
Better?
ht = HTQP(100)

for i in range(1001, 1800, 100):
    ht.insert(i)

print(ht.insert(1002))
1
Better?
0: [ ]
1: [ 1001 ]
2: [ 1101 ]
3: [ 1002 ]
4: [ 1201 ]
5: [ ]
6: [ ]
7: [ ]
8: [ 1301 ]
9: [ ]
10: [ ]
Guidelines
» Try to keep n / m to about 0.5
» Critical for quadratic probing, can fail otherwise
» Good rule for performance for linear probing
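For comparison, here is a minimal sketch of quadratic probing in its textbook form, where the i-th probe looks at slot (h + i²) mod m. The function `qp_insert` and the sample keys are assumptions for illustration; this is a different variant from the `HTQP.insert` shown in the lecture code. With a prime table size and a table that is less than half full, this probe sequence is known to always reach a free slot; without those conditions it can fail even though free slots remain.

```python
def qp_insert(table, key) -> int:
    """Insert key using offsets 0, 1, 4, 9, ... from the home slot.

    Returns the number of extra probes, or raises if no free slot is found.
    """
    m = len(table)
    home = hash(key) % m
    for i in range(m):
        slot = (home + i * i) % m
        if table[slot] is None:
            table[slot] = key
            return i
    raise RuntimeError('no free slot found')

table = [None] * 11            # prime size, kept less than half full
for key in (1, 12, 23, 34):    # all four keys have home slot 1
    print(key, '->', qp_insert(table, key), 'extra probes')
```

The four colliding keys end up in slots 1, 2, 5 and 10, so they are spread out instead of forming one contiguous cluster.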
Deleting
» Works as usual in separate chaining
» But not in linear probing
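For separate chaining, deleting is ordinary linked-list removal: the node is unlinked from its chain. A minimal sketch, assuming a node type like the one used for the chains; `Node`, `chain_delete` and `chain_keys` are hypothetical names, not part of the lecture code.

```python
from dataclasses import dataclass

@dataclass
class Node:                       # hypothetical stand-in for the chain node
    key: object
    nxt: 'Node | None' = None

def chain_delete(head, key):
    """Return the head of the chain with the first node holding key removed."""
    if head is None:
        return None
    if head.key == key:
        return head.nxt           # unlink this node
    head.nxt = chain_delete(head.nxt, key)
    return head

def chain_keys(head):
    """Collect the keys of a chain into a list (for printing)."""
    out = []
    while head is not None:
        out.append(head.key)
        head = head.nxt
    return out

chain = Node(1001, Node(1101, Node(1201)))
chain = chain_delete(chain, 1101)
print(chain_keys(chain))  # [1001, 1201]
```

In a chained table, the slot would simply be reassigned to the returned head. Nothing else in the table is affected, which is why deletion is unproblematic there.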
Remember
ht = HTLP(100)

for i in range(1001, 1800, 100):
    ht.insert(i)

ht.insert(1002)
Deleting
# Naive delete: empty the slot that holds 1301.
ix = ht.table.index(1301)
ht.table[ix] = None

print(ht.find(1001))
print(ht.find(1101))
print(ht.find(1002))  # the search stops at the emptied slot
True
True
False
Deleting
» Use a separate list to indicate deleted
» Similar to one of the initial list implementations
» When inserting, use this to check if free
» When searching, use this to check if active
Adding list
@patch
def __init__(self:HTLP, m=5):
    self.sz = m
    self.table = [None] * self.sz
    self.active = [False] * self.sz  # True while the slot holds a live key
Delete
@patch
def delete(self:HTLP, key) -> bool:
    hv = hash(key) % self.sz
    while self.table[hv] is not None:
        if self.table[hv] == key:
            was_active = self.active[hv]
            self.active[hv] = False  # keep the key in place so probing continues past it
            return was_active
        hv = (hv + 1) % self.sz
    return False
New insert
@patch
def insert(self:HTLP, key) -> int:
    hv = hash(key) % self.sz
    off = 0
    # Skip active slots; empty and deleted slots can both be reused.
    while self.active[hv]:
        hv = (hv + 1) % self.sz
        off += 1
    self.table[hv] = key
    self.active[hv] = True
    return off
New find
@patch
def find(self:HTLP, key) -> bool:
    hv = hash(key) % self.sz
    while self.table[hv] is not None:
        if self.table[hv] == key:
            return self.active[hv]  # a deleted key counts as not found
        hv = (hv + 1) % self.sz
    return False
Testing
ht = HTLP(100)
for i in range(1001, 1800, 100):
    ht.insert(i)

ht.insert(1002)
ht.delete(1301)

print(ht.find(1001))
print(ht.find(1301))
print(ht.find(1002))
True
False
True
Reading instructions
» Ch. 5.1 - 5.5
» Interesting, but not required
» 5.6 discusses hashing in Java
» 5.7 discusses more advanced versions of hashing
» 5.8 discusses universal hash functions
» 5.9 discusses hashing to secondary storage