06 - APS - Hash Table
Hash table
Damjan Strnad
Hash table
●
a hash table is a data structure that stores key-data
pairs (k,r), where k is the key and r is the associated
data of a table element
●
a hash table allows direct access to data through key
values, therefore a natural implementation of a hash
table uses an array:
– the element index in the array is calculated from the key
Hash table
●
if the array is large enough and the keys are integers,
each key maps to a unique array location and we can use
direct addressing (array location k belongs to key k):
– the set of active keys K is a subset of the set of all keys U and determines the
locations with valid pointers to stored elements; other locations have value NIL
– such a table is not yet ...
[Figure: direct-address table T — slots are indexed by the keys in U; slots for the
active keys in K hold pointers to key/data pairs, all other slots hold NIL]
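As a minimal sketch of direct addressing in C (the universe size, key range and
element type below are assumptions for illustration), slot k of the array belongs
to key k and unused slots hold NULL:

#include <stddef.h>

#define U_SIZE 10                 /* size of the key universe U = {0,...,9} (assumed) */

typedef struct {
    int key;                      /* key k */
    const char *data;             /* associated data r */
} Element;

Element *T[U_SIZE];               /* direct-address table: slot k belongs to key k, NULL = empty */

void da_insert(Element *x)  { T[x->key] = x; }
Element *da_search(int k)   { return T[k]; }        /* NULL means key k is not stored */
void da_delete(Element *x)  { T[x->key] = NULL; }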
Hash table
●
when the number of possible keys is larger than the array
size, we calculate the element address from the key value
using a hash function h : U → {0,...,m-1}, which maps the
set of keys U into the slots of a hash table T:
– h(k) is the address of element with key k in a hash table
[Figure: the hash function h maps the universe of keys U (with active keys
k1,...,k5 in K) into slots 0,...,m-1 of hash table T; h(k2) = h(k3) shows two
keys colliding in the same slot]
Hash table
●
the advantage of a hash table over a table with direct
addressing is smaller memory consumption (O(|K|) instead
of O(|U|))
●
average access time is still O(1), but not in the worst case
●
the disadvantage of a hash table is that two keys can map
to the same slot, which is called a collision
●
the number of collisions can be reduced with a good hash
function that maps keys uniformly across the table
addresses
●
because collisions are unavoidable when |U| > m, we
must use one of the techniques for collision resolution
Hash functions
●
a good hash function maps each of the keys with equal
probability into one of m slots in a hash table
●
we will assume that keys are natural numbers; when they
are not, we have to transform them into natural numbers:
– example: let the key be the string CLRS. The ASCII values of the
individual letters are: C=67, L=76, R=82, S=83. There are
128 values in 7-bit ASCII, therefore the string CLRS can
be uniquely transformed into a natural number as (see the sketch after this list):
(67 · 128³) + (76 · 128²) + (82 · 128¹) + (83 · 128⁰) = 141 764 947
●
two methods for construction of good hash functions:
– division method
– multiplication method
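A minimal sketch of the key transformation from the example above in C, assuming
7-bit ASCII input (the function name is illustrative); the string is read as a
radix-128 number:

#include <stdint.h>

/* Interpret a 7-bit ASCII string as a radix-128 natural number.
 * For "CLRS": ((67*128 + 76)*128 + 82)*128 + 83 = 141 764 947. */
uint64_t string_to_key(const char *s)
{
    uint64_t key = 0;
    while (*s != '\0')
        key = key * 128 + (uint64_t)(unsigned char)*s++;   /* append the next radix-128 digit */
    return key;
}

For long strings the value overflows a machine word, so in practice the reduction
modulo m would be interleaved with the accumulation.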
Division method
●
uses the following equation for a hash function:
h(k) = k mod m
●
example: hash table size is m=12, the key is k=100
– h(100) = 100 mod 12 = 4
– values 4,16,28,... (k=4+12i, i=0,1,2,...) map into the
same slot
●
using powers of 2 for m is not always good:
– operation k mod m, where m = 2^p, returns the bottom p bits
of the key; if those are not uniformly distributed among
all possible keys (e.g., postal codes), it will cause poor
dispersion of h(k) and consequently many collisions
●
in practice a good value for m is a prime number that is
not too close to a power of 2
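A minimal sketch of the division method in C; nothing beyond the formula above is
assumed:

/* Division method: h(k) = k mod m */
unsigned long hash_div(unsigned long k, unsigned long m)
{
    return k % m;        /* e.g. hash_div(100, 12) == 4, as in the example above */
}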
Multiplication method
●
uses the following equation for a hash function:
h(k) = ⌊m(kA mod 1)⌋
●
A is a constant from the range (0,1); a recommended value is:
A = (√5 − 1)/2 ≈ 0.618034
●
(kA mod 1) means we only keep the fractional part of the
product
●
the advantage of the method is that the value of m is not critical;
the disadvantage is that it is slow compared to the division method
●
example: m=8, A=13/32, k=21:
h(k) = ⌊8 · (21 · 13/32 mod 1)⌋
     = ⌊8 · (8.53125 mod 1)⌋
     = ⌊8 · 0.53125⌋
     = ⌊4.25⌋
     = 4
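A minimal floating-point sketch of the multiplication method in C (real
implementations often use fixed-point arithmetic instead, which is outside the
scope of this slide):

#include <math.h>

/* Multiplication method: h(k) = floor(m * (k*A mod 1)) */
unsigned long hash_mul(unsigned long k, unsigned long m, double A)
{
    double prod = (double)k * A;
    double frac = prod - floor(prod);          /* kA mod 1: the fractional part of k*A */
    return (unsigned long)((double)m * frac);
}

With the values from the example, hash_mul(21, 8, 13.0/32.0) returns 4.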
[Figure: collision resolution by chaining — keys that hash to the same slot of T
(e.g., k1 and k8, or k4 and k5) are kept in a linked list attached to that slot;
unused slots contain NIL]
Open addressing
●
open addressing is the second technique for resolving
collisions
●
all elements are stored in the same hash table; each slot
contains either a key or NIL
●
searching is done by systematic inspection of slots until
the sought element is found or it is determined that the
element is not in the table
●
the advantage of open addressing is that we avoid
pointers; the saved memory can be used to enlarge the
hash table
●
hash table size m must be greater than the expected
number of elements n, therefore open addressing is only
used when the latter is known
Probe sequences
●
uniform hashing is a generalization of simple uniform
hashing in which each of the m! possible permutations of
〈0, 1, ... , m-1〉 is equally likely to be used as the probe
sequence
●
uniform hashing is difficult to achieve in practice, so we
approximate it using methods that guarantee at least that
the probe sequence is a permutation of 〈0, 1, ... , m-1〉:
– linear probing
– quadratic probing
– double hashing
Linear probing
●
uses the following hash function:
h(k,i) = (h'(k) + i) mod m; i = 0, 1, ..., m-1
●
h'(k) is an auxiliary hash function which determines the
first inspected slot T[h'(k)]
●
the probe sequence is 〈T[h'(k)], T[h'(k)+1], ... , T[m-1],
T[0], T[1], ..., T[h'(k)-1]〉
●
implementation of linear probing is simple, but the
problem is emergence of long clusters of occupied slots in
the hash table (i.e., primary clustering), which increases
the average search time
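A minimal self-contained sketch of open addressing with linear probing in C; the
table size, the EMPTY sentinel and integer keys are assumptions for illustration,
and h'(k) = k mod m uses the division method:

#include <stdio.h>

#define M 13                         /* table size (assumed): a small prime for illustration */
#define EMPTY (-1)                   /* sentinel marking an unused slot (assumed) */

static int T[M];                     /* the hash table of integer keys */

static int h_aux(int k)    { return k % M; }                /* auxiliary hash h'(k) */
static int h(int k, int i) { return (h_aux(k) + i) % M; }   /* linear probing: (h'(k) + i) mod m */

/* Insert key k; returns its slot, or -1 if the table is full. */
int lp_insert(int k)
{
    for (int i = 0; i < M; i++) {
        int j = h(k, i);
        if (T[j] == EMPTY) { T[j] = k; return j; }
    }
    return -1;
}

/* Search for key k; returns its slot, or -1 if k is not in the table. */
int lp_search(int k)
{
    for (int i = 0; i < M; i++) {
        int j = h(k, i);
        if (T[j] == k)     return j;
        if (T[j] == EMPTY) return -1;            /* an empty slot ends the probe sequence */
    }
    return -1;
}

int main(void)
{
    for (int j = 0; j < M; j++) T[j] = EMPTY;
    lp_insert(5); lp_insert(18); lp_insert(31);  /* all three keys hash to slot 5 */
    printf("key 18 found in slot %d\n", lp_search(18));   /* probes slots 5 and 6 */
    return 0;
}

The three inserted keys end up in the contiguous slots 5-7, a small example of the
primary clustering described above.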
Quadratic probing
●
uses the following hash function:
h(k,i) = (h'(k) + c1·i + c2·i²) mod m; i = 0, 1, ..., m-1
● c1 and c2 are non-zero constants, which must be chosen
so that all hash table slots are addressed:
– for m = 2^n a good choice is c1 = c2 = 0.5
– for arbitrary m and c1 = c2 = 0.5, instead of "mod m" we use
the next higher power of 2 and skip values h(k,i) ≥ m
●
the first probed slot is T[h'(k)]; the index of subsequently probed
slots changes according to a quadratic function of i
●
in practice quadratic probing performs better than linear
probing
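A minimal sketch of the probe-index computation for quadratic probing with
c1 = c2 = 0.5 and m = 2^n (both values assumed for illustration); with these
constants c1·i + c2·i² = i(i+1)/2, so the offset stays an integer:

#define M 16                              /* table size m = 2^4 (assumed) */

static int h_aux(int k) { return k % M; } /* auxiliary hash h'(k) */

/* Quadratic probing with c1 = c2 = 0.5: h(k,i) = (h'(k) + i(i+1)/2) mod m */
int h_quad(int k, int i)
{
    return (h_aux(k) + i * (i + 1) / 2) % M;
}

For k = 5 the first probed slots are 5, 6, 8, 11, 15, 4, ..., so consecutive probes
spread out instead of forming one contiguous run.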
Double hashing
●
uses the following hash function:
h(k,i) = (h1(k) + i·h2(k)) mod m; i=0, 1, ..., m-1
● h1 and h2 are auxiliary hash functions; the values of h2 should
be non-zero and coprime to m
– for m = 2^n this is achieved if h2 only returns odd values
● the first probed slot is T[h1(k)]; addresses of subsequently
probed slots depend on h2(k)
●
example:
– h1(k)=k mod m, h2(k)=1 + (k mod (m-1)), k=123456, m=701
– h1(k) = 123456 mod 701 = 80, h2(k) = 1 + (123456 mod 700) = 257
– h(k,i) = [80 + i · 257] mod 701
– the probe sequence is 〈80, 337, 594, 150, 407, ...〉
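A minimal sketch of the double-hashing probe computation in C, using the example's
functions h1(k) = k mod m and h2(k) = 1 + (k mod (m-1)) with m = 701:

#include <stdio.h>

#define M 701                                            /* table size from the example */

static int h1(long k) { return (int)(k % M); }           /* h1(k) = k mod m */
static int h2(long k) { return (int)(1 + k % (M - 1)); } /* h2(k) = 1 + (k mod (m-1)) */

/* Double hashing: h(k,i) = (h1(k) + i*h2(k)) mod m */
int h_double(long k, int i)
{
    return (h1(k) + i * h2(k)) % M;
}

int main(void)
{
    for (int i = 0; i < 5; i++)                          /* prints 80 337 594 150 407 */
        printf("%d ", h_double(123456, i));
    printf("\n");
    return 0;
}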