
Basic Algorithim 6

basic algorithim 6

Uploaded by

Osama Rashdan
Copyright
© All Rights Reserved

Algorithms

Hashing (Ch. 5)
Morgan Ericsson

Algorithms/ Hashing 1
Today
» Hashing
» Separate chaining
» Probing
» Double hashing
» Rehashing

Hashing

Question
» Assume we have a list of integers
» We can find a given integer
» O(n) if we use sequential search
» O(log₂ n) if we use a balanced binary search tree
» Can we do better?
» Think back to union find/disjoint set

Binary list
» Let’s try to adopt the idea from union find
» But now we only care if a number is in the list or
not
» So, if we want to insert a number
» lst[6433] = True
» If we want to check
» lst[2137] is not None
» Reasonable idea, but how large should the list be?
» How much space is wasted?

Flawed but good idea
» We use the integer value to map a key to a position
» The problem is the range of the random integers
» If we wanted to store all numbers 1 to 100 it would
not be a problem
» But in the general case, the numbers might be too
large and the list would be very sparse

Flawed but good idea
    import numpy as np

    # One slot per possible value, up to the largest key
    lst = np.zeros(7630 + 1, dtype=bool)

    for v in [7630, 3275, 6433, 5913, 2137]:
        lst[v] = True

    print(np.count_nonzero(lst))    # prints 5
    print(np.count_nonzero(~lst))   # prints 7626

General idea
» We use the key to map to an element
» If we had unlimited space, we could just use the value as the index
» If space is limited, we need a “better” mapping function
» What if we use %?

Another way to map
    lst_sz = 5
    lst = np.zeros(lst_sz, dtype=bool)

    for v in [7630, 3275, 6433, 5913, 2137]:
        lst[v % lst_sz] = True   # fold every key into 0..lst_sz-1

    print(np.count_nonzero(lst))    # prints 3
    print(np.count_nonzero(~lst))   # prints 2

Hash
» The mapping is called a hashing function or hash code
» The problem we observed is a collision
» When two (or more) keys map to the same position
» A perfect hashing function should never produce collisions
» But it can be difficult to define
» Especially since it should also be efficient to compute

Another try
    lst_sz = 31
    lst = np.zeros(lst_sz, dtype=bool)

    def hashf(v):
        return v % lst_sz

    for v in [7630, 3275, 6433, 5913, 2137]:
        lst[hashf(v)] = True

    print(np.count_nonzero(lst))   # prints 5
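To see which keys collide for a given table size, one option is to group the keys by the slot they map to. The sketch below reuses the five integers from the slides; `slots_for` is a hypothetical helper name, not part of the lecture code.

```python
from collections import defaultdict

def slots_for(keys, size):
    """Group keys by the slot that key % size maps them to."""
    slots = defaultdict(list)
    for key in keys:
        slots[key % size].append(key)
    return dict(slots)

keys = [7630, 3275, 6433, 5913, 2137]

# Size 5: several keys share a slot, so there are collisions.
print(slots_for(keys, 5))   # {0: [7630, 3275], 3: [6433, 5913], 2: [2137]}

# Size 31: every key gets its own slot.
print(slots_for(keys, 31))  # {4: [7630], 20: [3275], 16: [6433], 23: [5913], 29: [2137]}
```

Any slot whose list holds more than one key is a collision, which matches the counts of 3 and 5 occupied positions seen in the two examples.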

More problems
» Without perfect hashing, we cannot recover the key from its hash value
» Hashing is a one-way function
» 99 % 10 == 9 % 10
» There is no order
» What if the keys are not integers?
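Both points can be checked in a few lines. The sketch below is only an illustration; the table size of 10 comes from the `99 % 10` example, while the list `keys` is made up for the demonstration.

```python
size = 10

# Many different keys share the hash value of 99, so the value alone
# cannot tell us which key produced it.
target = 99 % size
preimages = [k for k in range(100) if k % size == target]
print(preimages)  # [9, 19, 29, 39, 49, 59, 69, 79, 89, 99]

# Keys that are sorted do not stay sorted once mapped to slots.
keys = [12, 25, 31, 48]
print([k % size for k in keys])  # [2, 5, 1, 8]
```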

Another example
    lst_sz = 31
    lst = np.zeros(lst_sz, dtype=bool)

    def hashf(v: str) -> int:
        pass  # placeholder: how should a string map to an integer?

    for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
        lst[hashf(v)] = True

Options
» Use the first (or last) k letters
» Can be a bad idea, since many domains have common prefixes and suffixes
» Names
» Phone numbers
» …
» Use the whole key?
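A small experiment shows why hashing only a prefix can go wrong. This is a sketch under assumptions: `prefix_hash` is a hypothetical function, and the names other than 'Charlotte' are made up to share a prefix.

```python
def prefix_hash(key: str, k: int = 3) -> int:
    """Hash only the first k characters by summing their code points."""
    return sum(ord(c) for c in key[:k])

names = ['Charlotte', 'Charles', 'Charlie', 'Chara']

# All four names start with 'Cha', so they get the same hash value.
print({name: prefix_hash(name) for name in names})
print(len({prefix_hash(name) for name in names}))  # 1
```

Every key with the same first three letters ends up in the same position, no matter how different the rest of the key is.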

Example
    def hashf(key: str) -> int:
        # Sum the code points of all characters in the key
        hv = 0
        for c in key:
            hv += ord(c)

        return hv

Example
    lst_sz = 31
    lst = np.zeros(lst_sz, dtype=bool)

    for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
        lst[hashf(v) % lst_sz] = True

    print(np.count_nonzero(lst))

All is good?
    print(hashf('abc') == hashf('acb'))   # prints True

Problem? It depends on our use, but since the sum ignores character order, it is not great for phone numbers, for example…

What can we do?
    def hashf(key: str) -> int:
        # Weight each character by its position (counting from 2)
        hv = 0
        for ix, c in enumerate(key, start=2):
            hv += ix * ord(c)

        return hv

Solved?
    print(hashf('abc') == hashf('acb'))    # prints False
    print(hashf('abc') == hashf('-10['))   # prints True

It is not easy to create a perfect (or even a good) hash function.

Keep in mind
» Try to avoid repetition and “round” numbers
» It is generally a bad idea to use table sizes that are multiples of 10
» Or powers of two
» Use prime numbers to break patterns
» Preferably close to powers of two
» Use as much of the key as possible
» More bits means more variation
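The effect of the table size can be illustrated by counting how many slots a patterned key set actually uses. The keys (multiples of 10) and the helper `used_slots` are assumptions made for this sketch.

```python
def used_slots(keys, size):
    """Count how many distinct slots the keys occupy with key % size."""
    return len({key % size for key in keys})

keys = range(0, 1000, 10)  # 100 keys that all end in 0

print(used_slots(keys, 10))  # 1: every key lands in slot 0
print(used_slots(keys, 16))  # 8: only the even slots are used
print(used_slots(keys, 31))  # 31: a prime size spreads the keys over all slots
```

With a size that shares a factor with the pattern in the keys, part of the table is never used; a prime size has no such common factor.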

Hashing in Python (and Java)
» Types define a hash value
» hash('Olivia')
» hash((1, 3))
» Similar in Java
» Required for certain things to work, e.g., sets
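In Python, the built-in `hash` works for immutable values but not for mutable containers. The following sketch shows the difference and why sets depend on it; the variable name `seen` is only for illustration.

```python
# Immutable built-in values are hashable.
print(hash((1, 3)) == hash((1, 3)))  # True: equal tuples have equal hashes

# Mutable containers such as lists are not hashable.
try:
    hash([1, 3])
except TypeError as err:
    print('TypeError:', err)

# Sets rely on hashing, so they can hold tuples but not lists.
seen = {(1, 3), (2, 4)}
print((1, 3) in seen)  # True
```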

Simple example
    class Person:
        def __init__(self, n: str, a: int) -> None:
            self.name = n
            self.age = a

        def __str__(self) -> str:
            return f'{self.name} ({self.age})'

    p1 = Person('Olivia', 34)
    print(hash(p1))   # e.g. 382262012 (differs between runs)

For free?
    p1 = Person('Olivia', 34)
    p2 = Person('Olivia', 34)

    print(hash(p1) == hash(p2))   # prints False

By default, the hash is based on object identity. Problem? Depends on our use…

For free?
    from fastcore.basics import patch

    @patch
    def __hash__(self: Person) -> int:
        # Combine the field hashes, starting from a non-zero seed
        hv = 17
        hv = 31 * hv + hash(self.name)
        hv = 31 * hv + hash(self.age)
        return hv

We can define our own hash function

New try
    p1 = Person('Olivia', 34)
    p2 = Person('Olivia', 34)

    print(hash(p1) == hash(p2))   # prints True

Based on object values rather than identity

Using our function
    plst = [None] * 31

    p1 = Person('Olivia', 34)
    p2 = Person('Mia', 11)

    plst[hash(p1) % 31] = p1
    plst[hash(p2) % 31] = p2

    print(plst[hash(p1) % 31])   # prints Olivia (34)

Suppose Olivia had a birthday …
    p1.age += 1

    # The new age changes the hash, so we now look in a different slot
    print(plst[hash(p1) % 31])   # prints None

Combining custom hash functions with mutable objects is a problem…
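One common way to sidestep this is to make objects that are used as keys immutable. The sketch below is an assumption about how that could look, not part of the lecture code: `FrozenPerson` is a hypothetical class built with a frozen dataclass, which generates `__eq__` and `__hash__` from the fields.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class FrozenPerson:
    """Immutable value object; equality and hash are derived from the fields."""
    name: str
    age: int

plst = [None] * 31

p1 = FrozenPerson('Olivia', 34)
plst[hash(p1) % 31] = p1

# A birthday creates a new object instead of mutating the stored one.
older = replace(p1, age=p1.age + 1)

print(plst[hash(p1) % 31])                            # FrozenPerson(name='Olivia', age=34)
print(hash(p1) == hash(FrozenPerson('Olivia', 34)))   # True
```

Because the stored object can no longer change, its hash stays valid for as long as it sits in the table; the older Olivia is a separate key that has to be inserted explicitly.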

Simple hash table
    class HT:
        def __init__(self):
            self.sz = 31
            self.table = [None] * self.sz

        def insert(self, key):
            # A colliding key overwrites whatever was stored in the slot
            self.table[hash(key) % self.sz] = key

        def contains(self, key):
            return self.table[hash(key) % self.sz] == key

        def __len__(self):
            return len([v for v in self.table if v is not None])

Testing it
    import random

    ht = HT()
    for i in range(10):
        v = random.randint(1, 100_000)
        ht.insert(v)

    print(len(ht))   # e.g. 9: fewer than 10 when keys collide

Separate chaining

Collisions
» We have seen that some hashing functions can result
in collisions
» Can we manage the collisions?

Hash functions
» We want the hash function to distribute keys uniformly across the integer interval
» Bins and balls
» Assume that each key is a ball and each position is a bin
» If we toss a random ball, it should be equally likely to end up in any of the bins
» If we have m bins and toss n balls, we would expect about n / m balls in each bin

Uniformity
[Figure: tossing 1 000 000 balls into 31 bins]
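The experiment named here can be reproduced with a short simulation. This is a sketch, assuming a uniform random generator stands in for a good hash function; the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter

m = 31          # number of bins
n = 1_000_000   # number of balls

rng = random.Random(1)  # fixed seed so the run is repeatable
counts = Counter(rng.randrange(m) for _ in range(n))

expected = n / m
print(f'expected per bin: {expected:.0f}')
print(f'min: {min(counts.values())}, max: {max(counts.values())}')
```

With uniform tosses, the smallest and largest bin should both end up close to n / m.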

We also know
» We can expect two balls in the same bin after $\sim \sqrt{\pi m / 2}$ tosses
» Every bin has $\geq 1$ balls after $\sim m \ln m$ tosses
» After $m$ tosses, the most loaded bin has $\Theta\left(\frac{\log m}{\log \log m}\right)$ balls
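To get a feeling for these estimates, they can be evaluated for a concrete m and the first one compared against a simulation. The sketch below assumes m = 31 and uniform random tosses; `tosses_until_collision` is a hypothetical helper, and the max-load value only indicates the order of growth, since the Θ bound hides a constant.

```python
import math
import random

m = 31  # number of bins

# The three estimates, evaluated for m bins.
first_collision = math.sqrt(math.pi * m / 2)
all_bins_hit = m * math.log(m)
max_load = math.log(m) / math.log(math.log(m))  # order of growth only

print(f'first collision after ~{first_collision:.1f} tosses')
print(f'every bin hit after ~{all_bins_hit:.0f} tosses')
print(f'max load after m tosses grows like {max_load:.1f}')

def tosses_until_collision(m, rng):
    """Toss balls into m bins until one bin receives its second ball."""
    seen = set()
    while True:
        b = rng.randrange(m)
        if b in seen:
            return len(seen) + 1
        seen.add(b)

rng = random.Random(1)
trials = [tosses_until_collision(m, rng) for _ in range(10_000)]
print(f'simulated first collision: {sum(trials) / len(trials):.1f}')
```

The estimate is asymptotic, so for a small m the simulated average is expected to be close to it rather than identical.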

So, how can we deal with collisions?
» Separate chaining
» We make each bin a linked list
» And place keys that collide in the same bin

A simple linked list
    from dataclasses import dataclass

    @dataclass
    class LLNode:
        key: int
        nxt: 'LLNode|None' = None

Separate chaining
    class HTSC:
        def __init__(self, m=5):
            self.sz = m
            self.table = [None] * self.sz   # one chain (or None) per bin

Inserting
    @patch
    def insert(self: HTSC, key):
        hv = hash(key) % self.sz
        self.table[hv] = self._inschain(self.table[hv], key)

    @patch
    def _inschain(self: HTSC, n: LLNode|None, key) -> LLNode:
        # Walk to the end of the chain and append the new node there
        if n is None:
            return LLNode(key)
        n.nxt = self._inschain(n.nxt, key)
        return n

Finding
    @patch
    def find(self: HTSC, key) -> bool:
        hv = hash(key) % self.sz
        # Walk the chain in the bin; an empty bin skips the loop
        p = self.table[hv]
        while p is not None:
            if p.key == key:
                return True
            p = p.nxt
        return False

Len
    @patch
    def __len__(self: HTSC) -> int:
        l = 0
        for t in self.table:
            l += self._lchain(t)
        return l

    @patch
    def _lchain(self: HTSC, n: LLNode|None) -> int:
        # Count the nodes in one chain
        l = 0
        while n is not None:
            l += 1
            n = n.nxt
        return l

Testing it
    ht = HTSC()

    for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
        ht.insert(v)

    print(len(ht))   # prints 5

Testing it some more
    ht = HTSC()
    for i in range(200):
        v = random.randint(1, 100_000)
        ht.insert(v)

    print(len(ht))   # prints 200

Linear again?
» With separate chaining, we need to search the list to determine if the value exists or not
» We know that each list holds n / m keys on average
» Where n is the number of keys and m is the size of the table
» So, it can be significantly better than O(n)

How should we pick m?
» If m is too small, the lists will be too long
» If m is too large, we will waste space
» A good rule of thumb is to set m to n / 5
» Then access should be O(1)
» Since we expect around 5 elements per bin /
bucket
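The rule of thumb can be combined with the earlier advice to prefer prime table sizes. The sketch below is one possible way to do that, not the lecture's method; `is_prime` and `pick_table_size` are hypothetical helpers.

```python
def is_prime(k: int) -> bool:
    """Trial division; fine for the small table sizes used here."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def pick_table_size(n: int, per_bin: int = 5) -> int:
    """Smallest prime m with n / m <= per_bin."""
    m = max(2, -(-n // per_bin))  # ceiling of n / per_bin
    while not is_prime(m):
        m += 1
    return m

n = 315
m = pick_table_size(n)
print(m, n / m)  # 67 and about 4.7 keys per bin
```

For n = 315 the plain rule gives m = 63, which is not prime, so the sketch moves up to 67 and the expected chain length drops slightly below 5.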

Testing the idea
    n = 63 * 5
    m = 63

    rv = []
    for _ in range(n):
        rv.append(random.randint(0, 62))

    ht = HTSC(m)
    for v in rv:
        ht.insert(v)

    # Chain length per bin
    ll = [ht._lchain(chain) for chain in ht.table]

[Figure: Testing the idea]

[Figure: Uniform?]

Faking uniformity
    n = 63 * 5
    m = 63

    ht = HTSC(m)
    for i in range(n):
        ht.insert(i)   # consecutive integers give exactly 5 keys per bin

    ll = [ht._lchain(chain) for chain in ht.table]

[Figure: Faking uniformity]

Increasing the range
    n = 63 * 5
    m = 63

    ht = HTSC(m)
    for i in range(n):
        v = random.randint(0, 100_000)
        ht.insert(v)

    ll = [ht._lchain(chain) for chain in ht.table]

[Figure: Increasing the range]

Linear probing

Linear probing
» Separate chaining works, but:
» Introduces a second data structure
» Has overhead in creating nodes
» What if we “chain” in the existing list?

Linear probing
» If a slot is taken, find the next empty one
» If hash(v) = i and i is taken, try i + 1, i + 2, …
until an empty slot is found
» Must repeat the same when searching…
» The list must be larger than the number of keys

Linear probing
    class HTLP:
        def __init__(self, m=5):
            self.sz = m
            self.table = [None] * self.sz

Inserting
    @patch
    def insert(self: HTLP, key):
        hv = hash(key) % self.sz
        # Step forward (wrapping around) until a free slot is found;
        # this never terminates if the table is full
        while self.table[hv] is not None:
            hv = (hv + 1) % self.sz
        self.table[hv] = key

Finding
    @patch
    def find(self: HTLP, key) -> bool:
        hv = hash(key) % self.sz
        # Follow the same probe sequence as insert until an empty slot
        while self.table[hv] is not None:
            if self.table[hv] == key:
                return True
            hv = (hv + 1) % self.sz
        return False

Len
    @patch
    def __len__(self: HTLP) -> int:
        return len([v for v in self.table if v is not None])

Testing it…
    ht = HTLP(7)

    for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
        ht.insert(v)

    assert ht.find('Liam')
    assert not ht.find('John')
    print(len(ht))   # prints 5

Testing it
    @patch
    def insert(self: HTLP, key) -> int:
        # Same as before, but return how far the key ended up from its home slot
        hv = hash(key) % self.sz
        off = 0
        while self.table[hv] is not None:
            hv = (hv + 1) % self.sz
            off += 1
        self.table[hv] = key
        return off

Testing it
    ht = HTLP(7)

    for v in ['Liam', 'Olivia', 'Charlotte', 'Lucas', 'Mia']:
        print(ht.insert(v))   # printed 0 five times in this run

[Figure: Testing it some more]

Breaking it
    ht = HTLP(100)

    # 1001, 1101, …, 1701 all hash to slot 1
    for i in range(1001, 1800, 100):
        ht.insert(i)

    print(ht.insert(1002))   # prints 7

What is going on?
0: [ ]
1: [ 1001 ]
2: [ 1101 ]
3: [ 1201 ]
4: [ 1301 ]
5: [ 1401 ]
6: [ 1501 ]
7: [ 1601 ]
8: [ 1701 ]
9: [ 1002 ]
10: [ ]

Clustering
» A cluster is a contiguous block of items
» Collisions create clusters, since we keep adding after the expected position
» So, new keys are likely to hash into the middle of big clusters
» Which will increase the displacement (offset)
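Cluster sizes can be measured directly on the table. The sketch below is an illustration, not lecture code: `longest_cluster` is a hypothetical helper that works on a plain list such as the `table` attribute used in the slides, and the example table is filled by hand to match the listing under "What is going on?".

```python
def longest_cluster(table):
    """Length of the longest run of occupied slots, treating the table as circular."""
    m = len(table)
    if all(slot is not None for slot in table):
        return m
    best = run = 0
    # Walk the table twice so a run that wraps past the end is counted in full.
    for i in range(2 * m):
        if table[i % m] is not None:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

# Slots 1..8 hold 1001..1701 and slot 9 holds 1002.
table = [None] * 100
keys = [1001, 1101, 1201, 1301, 1401, 1501, 1601, 1701, 1002]
for slot, key in enumerate(keys, start=1):
    table[slot] = key

print(longest_cluster(table))  # 9
```

Any new key that hashes to slots 1 through 9 now has to walk to the end of this cluster, which makes the cluster grow by one more slot.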

Knuth’s parking problem
» Cars arrive at a (one-way) street with m parking spaces
» Each car desires a specific space i, but will try i + 1, i + 2, … if i is taken
» What is the average displacement?
» With m / 2 cars, $\sim 3/2$
» With m cars, $\sim \sqrt{\pi m / 8}$

Analysis of linear probing
» Assume we have a list of size m and n = αm keys
» We can then determine the average number of probes if we have a search hit

$$\frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right)$$

Analysis of linear probing
» And if we miss/insert

$$\frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)$$
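The two estimates are easy to tabulate for a few load factors. This is a minimal sketch of the formulas for a search hit and a miss/insert; the function names `probes_hit` and `probes_miss` are made up for the example.

```python
def probes_hit(alpha: float) -> float:
    """Average probes for a successful search at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha))

def probes_miss(alpha: float) -> float:
    """Average probes for an unsuccessful search or an insert at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

for alpha in (0.25, 0.5, 0.75, 0.9):
    print(f'alpha={alpha:.2f}  hit={probes_hit(alpha):.2f}  miss={probes_miss(alpha):.2f}')
```

At α = 0.5 this gives 1.5 probes for a hit and 2.5 for a miss; at α = 0.75 a miss already costs 8.5 probes, and at α = 0.9 it costs 50.5.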

Does it make sense?
    m = 100

    disp = []
    for t in range(1000):
        ht = HTLP(m)

        # Fill the table to a load factor of 1/2 …
        for _ in range(50):
            ht.insert(random.randint(0, 100_000))

        # … and record the displacement of one more insert
        disp.append(ht.insert(random.randint(0, 100_000)))

    print(np.mean(disp), 'Expect:', 3/2)   # e.g. 1.479 Expect: 1.5

Does it make sense?
    m = 100
    n = 75

    disp = []
    for t in range(1000):
        ht = HTLP(m)

        for _ in range(n):
            ht.insert(random.randint(0, 100_000))

        # Displacement of one more insert at load factor n / m
        disp.append(ht.insert(random.randint(0, 100_000)))

    print(np.mean(disp), 'Expect:', 0.5*(1+1/(1-n/m)**2))   # e.g. 5.716 Expect: 8.5

[Figure: Does it make sense?]

Analysis of linear probing
» If m is too large, too much space is wasted
» If m is too small, search time blows up
» Rule of thumb: $\alpha = n / m \sim 1/2$
» Probes for a hit is about 3 / 2
» Probes for a miss/insert is about 5 / 2

Dealing with clusters
» How can we deal with clusters?
» Attack linearity?
» Longer steps to avoid clusters?

A new class
    class HTQP:
        def __init__(self, m=5):
            self.sz = m
            self.table = [None] * self.sz

Quadratic probing
    @patch
    def insert(self: HTQP, key):
        hv = hash(key) % self.sz
        offs = 0
        # The step doubles on each collision (1, 2, 4, …), so the probed
        # slots lie 1, 3, 7, … past the home slot; textbook quadratic
        # probing would use 1, 4, 9, … instead
        while self.table[hv] is not None:
            hv = (hv + 2**offs) % self.sz
            offs += 1
        self.table[hv] = key
        return offs

Better?
    ht = HTQP(100)

    for i in range(1001, 1800, 100):
        ht.insert(i)

    print(ht.insert(1002))   # prints 1

Better?
0: [ ]
1: [ 1001 ]
2: [ 1101 ]
3: [ 1002 ]
4: [ 1201 ]
5: [ ]
6: [ ]
7: [ ]
8: [ 1301 ]
9: [ ]
10: [ ]
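The `HTQP.insert` method from the slides doubles its step on every collision, which is why 1201 and 1301 end up in slots 4 and 8. Textbook quadratic probing instead tries the home slot plus 1, 4, 9, and so on. A standalone sketch of that variant follows; `qp_insert` is a hypothetical function, not part of the lecture code, and it gives up after m probes rather than looping forever.

```python
def qp_insert(table, key):
    """Insert key with quadratic probing; return the number of extra probes."""
    m = len(table)
    home = hash(key) % m
    for i in range(m):
        slot = (home + i * i) % m  # home, home + 1, home + 4, home + 9, …
        if table[slot] is None:
            table[slot] = key
            return i
    raise RuntimeError('no free slot found on the probe sequence')

table = [None] * 100
for key in range(1001, 1800, 100):
    qp_insert(table, key)

print(qp_insert(table, 1002))                                  # 1
print([i for i, v in enumerate(table) if v is not None][:6])   # [1, 2, 3, 5, 10, 17]
```

Both variants spread the colliding keys out so that 1002 finds a free slot after a single extra probe. Neither is guaranteed to visit every slot, which is one reason to keep the table at most about half full.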

Guidelines
» Try to keep n / m to about 0.5
» Critical for quadratic probing, can fail otherwise
» Good rule for performance for linear probing

Deleting
» Works as usual in separate chaining (unlink the node from its list)
» But not in linear probing
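For separate chaining, deleting is ordinary linked-list removal within one bin. The sketch below is standalone and makes some assumptions: it redefines a node type like `LLNode`, and `delete_from_chain` and `keys_in` are hypothetical helpers rather than methods from the slides.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLNode:
    key: int
    nxt: Optional['LLNode'] = None

def delete_from_chain(head, key):
    """Return the head of the chain with the first node holding key unlinked."""
    if head is None:
        return None
    if head.key == key:
        return head.nxt
    head.nxt = delete_from_chain(head.nxt, key)
    return head

def keys_in(head):
    """Collect the keys of a chain into a list, front to back."""
    out = []
    while head is not None:
        out.append(head.key)
        head = head.nxt
    return out

# Three keys that share slot 7 when m = 31.
chain = LLNode(7, LLNode(38, LLNode(69)))
chain = delete_from_chain(chain, 38)
print(keys_in(chain))  # [7, 69]
```

In a table like `HTSC`, this would be applied to the bin the key hashes to, for example `self.table[hv] = delete_from_chain(self.table[hv], key)`. The other keys in the chain stay reachable, which is exactly what fails with linear probing.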

Remember
    ht = HTLP(100)

    for i in range(1001, 1800, 100):
        ht.insert(i)

    ht.insert(1002)   # returns 7

Deleting
    # Naive delete: simply empty the slot that holds 1301
    ix = ht.table.index(1301)
    ht.table[ix] = None

    print(ht.find(1001))   # prints True
    print(ht.find(1101))   # prints True
    print(ht.find(1002))   # prints False: the probe stops at the emptied slot

Deleting
» Use a separate list to indicate deleted
» Similar to one of the initial list implementations
» When inserting, use this to check if free
» When searching, use this to check if active

Adding list
    @patch
    def __init__(self: HTLP, m=5):
        self.sz = m
        self.table = [None] * self.sz
        self.active = [False] * self.sz   # True while the slot holds a live key

Delete
    @patch
    def delete(self: HTLP, key) -> bool:
        hv = hash(key) % self.sz
        while self.table[hv] is not None:
            if self.table[hv] == key:
                # Keep the key in the table so probe chains stay intact
                was_active = self.active[hv]
                self.active[hv] = False
                return was_active
            hv = (hv + 1) % self.sz
        return False

New insert
    @patch
    def insert(self: HTLP, key) -> int:
        hv = hash(key) % self.sz
        off = 0
        # Inactive slots (never used or deleted) are free to reuse
        while self.active[hv]:
            hv = (hv + 1) % self.sz
            off += 1
        self.table[hv] = key
        self.active[hv] = True
        return off

New find
    @patch
    def find(self: HTLP, key) -> bool:
        hv = hash(key) % self.sz
        # Deleted keys stay in the table, so the probe does not stop early
        while self.table[hv] is not None:
            if self.table[hv] == key:
                return self.active[hv]
            hv = (hv + 1) % self.sz
        return False

Testing
    ht = HTLP(100)
    for i in range(1001, 1800, 100):
        ht.insert(i)

    ht.insert(1002)
    ht.delete(1301)

    print(ht.find(1001))   # prints True
    print(ht.find(1301))   # prints False
    print(ht.find(1002))   # prints True

Reading instructions

Reading instructions
» Ch. 5.1 - 5.5
» Interesting, but not required
» 5.6 discusses hashing in Java
» 5.7 discusses more advanced versions of hashing
» 5.8 discusses universal hash functions
» 5.9 discusses hashing to secondary storage
