
06 - APS - Hash Table

The document provides an overview of hash tables, detailing their structure, advantages, and methods for resolving collisions. It discusses hash functions, including division and multiplication methods, and techniques for collision resolution such as chaining and open addressing. Additionally, it covers probing strategies like linear, quadratic, and double hashing, along with their respective implementations and complexities.


Algorithms & data structures

Hash table
Damjan Strnad

Hash table

● a hash table is a data structure that stores key-data pairs (k,r), where k is the key and r is the associated data of a table element
● a hash table allows direct access to data through key values, therefore a natural implementation of a hash table uses an array:
  – the element index in the array is calculated from the key

● if the array is large enough and the keys are integers, each key maps to a unique array location and we can use direct addressing (array location k belongs to key k):
  – the set of active keys K is a subset of the set of all keys U and determines the locations with valid pointers to the stored elements; the other locations have value NIL
  – such a table is not yet a hash table
[figure: direct-address table T; slots whose index is an active key from K point to stored key-data elements, all other slots are NIL]
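The direct-addressing idea above can be sketched as a small array-backed table (a minimal sketch; the class and method names are illustrative, not from the slides):

```python
# Direct-address table: every possible key in U = {0, ..., size-1} owns
# its own slot, so memory use is O(|U|) even if only a few keys are active.

class DirectAddressTable:
    def __init__(self, size):
        self.slots = [None] * size      # None plays the role of NIL

    def insert(self, key, data):
        self.slots[key] = data          # the slot index equals the key itself

    def search(self, key):
        return self.slots[key]          # O(1); returns None if absent

    def delete(self, key):
        self.slots[key] = None

t = DirectAddressTable(10)
t.insert(2, "two")
t.insert(7, "seven")
```

Every operation is a single array access, which is exactly why the scheme only works while the universe of keys stays small.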

● when the number of possible keys is bigger than the array size, we calculate the element address from the key value using a hash function h : U → {0,...,m-1}, which maps from the set of keys U into the slots of a hash table T:
  – h(k) is the address of the element with key k in the hash table
[figure: keys k1,...,k5 from U hashed into table T with slots 0,...,m-1; h(k2) = h(k3) illustrates two keys colliding in one slot]

● the advantage of a hash table over a table with direct addressing is smaller memory consumption (O(|K|) instead of O(|U|))
● average access time is still O(1), but not in the worst case
● the disadvantage of a hash table is that two keys can map to the same slot, which is called a collision
● the number of collisions can be reduced with a good hash function that maps keys uniformly across the table addresses
● because collisions can occur whenever |U| > m, we must use one of the techniques for collision resolution

Hash functions

● a good hash function maps each of the keys with equal probability into one of the m slots of the hash table
● we will assume that keys are natural numbers; when they are not, we have to transform them into natural numbers first:
  – example: let the key be the string CLRS. The ASCII values of the individual letters are C=67, L=76, R=82, S=83. There are 128 values in 7-bit ASCII, therefore the string CLRS can be uniquely transformed into a natural number as:
    (67 · 128³) + (76 · 128²) + (82 · 128¹) + (83 · 128⁰) = 141 764 947
● two methods for the construction of good hash functions:
  – division method
  – multiplication method
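The CLRS-string transformation above is just radix-128 positional notation; a minimal sketch (the helper name is illustrative, not from the slides):

```python
# Interpret a 7-bit ASCII string as a natural number in base 128.

def string_to_key(s):
    k = 0
    for ch in s:
        k = k * 128 + ord(ch)   # shift previous digits left, append new one
    return k

print(string_to_key("CLRS"))    # 141764947, matching the worked example
```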

Division method

● uses the following equation for the hash function:
    h(k) = k mod m
● example: the hash table size is m=12, the key is k=100
  – h(100) = 100 mod 12 = 4
  – values 4, 16, 28, ... (k = 4 + 12i, i = 0, 1, 2, ...) map into the same slot
● using powers of 2 for m is not always good:
  – the operation k mod m, where m = 2^p, returns the bottom p bits of the key; if those bits are not uniformly distributed among all possible keys (e.g., postal codes), this causes poor dispersion of h(k) and consequently many collisions
● in practice a good value for m is a prime number that is not very close to a power of 2
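The division method is a single modulo operation; a minimal sketch reproducing the slide's example:

```python
# Division-method hash: h(k) = k mod m.

def h_div(k, m):
    return k % m

m = 12
print(h_div(100, m))                        # 4
print([h_div(k, m) for k in (4, 16, 28)])   # [4, 4, 4] - all collide in slot 4
```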

Multiplication method

● uses the following equation for the hash function:
    h(k) = ⌊m(kA mod 1)⌋
● A is a constant from the range (0,1); a recommended value is:
    A = (√5 − 1)/2 ≈ 0.618034
● (kA mod 1) means we only keep the fractional part of the product kA
● the advantage of the method is that the value of m is not critical; the disadvantage is that it is slow compared to the division method
● example: m=8, A=13/32, k=21:
    h(k) = ⌊8 · (21 · 13/32 mod 1)⌋
         = ⌊8 · (8.53125 mod 1)⌋
         = ⌊8 · 0.53125⌋
         = ⌊4.25⌋
         = 4
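The multiplication method translates directly to code; a minimal sketch (function name illustrative), using the recommended A as the default:

```python
import math

# Multiplication-method hash: h(k) = floor(m * (k*A mod 1)).
def h_mul(k, m, A=(math.sqrt(5) - 1) / 2):
    frac = (k * A) % 1.0        # keep only the fractional part of k*A
    return math.floor(m * frac)

print(h_mul(21, 8, A=13/32))    # 4, matching the worked example
```

Note that A = 13/32 is exactly representable in binary floating point, so the example computes without rounding error.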

Resolving collisions by chaining



● in chaining, elements that hash to the same slot are stored in a linked list at that slot
● each slot contains a pointer to the head of the linked list; if the slot is empty the pointer is NIL
[figure: table T with chained slots; colliding keys from U (e.g., k2 and k3, or k4 and k5) end up in the same linked list, empty slots hold NIL]

● example: a phone book
[figure: phone-book keys hashed to slot indices, each slot storing a chain of key-value pairs]

Chained hash table operations


● insertion: INSERT-INTO-CHAINED-SLOT(T,x)
  – inserts element x at the head of the list T[h(key[x])]
  – time complexity is O(1) if we assume that x is not yet in T; if we need to verify that assumption, we must execute the search first
● searching: FIND-IN-CHAINED-SLOT(T,k)
  – finds the element with key k in the list T[h(k)]
  – worst-case time complexity is proportional to the length of the linked list in slot h(k)
● removal: REMOVE-FROM-CHAINED-SLOT(T,x)
  – removes element x from the list T[h(key[x])]
  – if the lists are doubly linked, the time complexity is O(1); if they are singly linked, we must first find the predecessor of x, for which the time complexity equals that of searching
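The three chained operations above can be sketched as a small class; Python lists stand in for the linked lists, and all names are illustrative rather than taken from the slides:

```python
# Minimal chained hash table with division-method hashing.

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, k):
        return k % self.m                     # division-method hash

    def insert(self, k, data):
        self.slots[self._h(k)].insert(0, (k, data))   # O(1) insert at chain head

    def find(self, k):
        for key, data in self.slots[self._h(k)]:      # scan the chain in slot h(k)
            if key == k:
                return data
        return None

    def remove(self, k):
        chain = self.slots[self._h(k)]
        self.slots[self._h(k)] = [(key, d) for key, d in chain if key != k]

t = ChainedHashTable(5)
t.insert(7, "a")
t.insert(12, "b")   # 12 mod 5 = 2 = 7 mod 5, so both keys share one chain
```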

Chained hashing analysis



● the load factor α of a hash table storing n elements in m slots is defined as:
    α = n / m
● α is the average number of elements stored in a chain (list)
● the worst case occurs when all elements are in the same slot; in this case the time complexity of search is O(n), to which we must add the time of computing the hash function value
● the average time complexity depends on how well the hash function h distributes the keys among the m slots

● suppose that each key is equally likely to be mapped to any of the m slots; this assumption is called simple uniform hashing
● theorem: with chained hashing, both successful and unsuccessful search have average time complexity Θ(1+α) if we assume simple uniform hashing
● if the hash table does not contain an element with key k, the search is unsuccessful, otherwise it is successful
● in the expression Θ(1+α), the constant 1 is the time for computing the hash function and α is the average search time in a list of length α = n/m

Open addressing

● open addressing is the second technique for resolving collisions
● all elements are stored in the hash table itself; each slot contains either a key or NIL
● searching is done by systematic inspection of slots until the sought element is found or it is determined that the element is not in the table
● the advantage of open addressing is that we avoid pointers; the saved memory can be used to enlarge the hash table
● the hash table size m must be greater than the expected number of elements n, therefore open addressing is only used when the latter is known

Open addressing – insertion



● insertion is performed by probing the hash table until we find a free slot
● the hash function is extended with an additional argument, the probe trial counter:
    h : U × {0, 1, ..., m-1} → {0, ..., m-1}
● for each key k a probe sequence 〈h(k,0), h(k,1), ..., h(k,m-1)〉 is determined:
  – h(k,0) is the first possible slot for key k, h(k,1) is the second possible slot for key k, etc.
● the probe sequence is a permutation of 〈0, 1, ..., m-1〉, which means that it visits all m slots of the hash table

● procedure INSERT-INTO-HASH-TABLE(T,k) returns the slot index into which key k was written, or signals an overflow error if there is no free slot

INSERT-INTO-HASH-TABLE(T,k)
  i ← 0
  repeat
    j ← h(k,i)
    if T[j] = NIL or T[j] = DELETED then  % slot j is free
      T[j] ← k                            % insert k into slot j
      return j                            % return slot number
    else                                  % slot is occupied, try the next one
      i ← i + 1
  until i = m
  error "hash table overflow"

Open addressing – search



● the procedure for finding key k probes the same sequence of slots as the insertion procedure
● the search is unsuccessful if a truly empty slot is encountered or if all slots have been probed but none contains key k
● worst-case time complexity is O(n), where n is the number of elements stored in the hash table

SEARCH-IN-HASH-TABLE(T,k)
  i ← 0
  repeat
    j ← h(k,i)
    if T[j] = k then  % key k found in slot j
      return j        % return slot number
    i ← i + 1
  until T[j] = NIL or i = m
  return NIL          % key not found

Open addressing – removal



● when removing an element from a hash table with open addressing, we write a special value (DELETED) at the location of the removed element; the special value signals that the table location was not empty the whole time
● this allows subsequent searches to continue to the first truly empty slot, stepping over once-occupied locations (if we stored NIL at the removed element's location, all elements behind it in the probe sequence would become inaccessible)
● when inserting new elements, locations containing the special value can be overwritten

DELETE-IN-HASH-TABLE(T,k)
  i ← 0
  repeat
    j ← h(k,i)
    if T[j] = k then   % key found
      T[j] ← DELETED   % mark as deleted
      return
    i ← i + 1
  until T[j] = NIL or i = m
  error "key not found"
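The three open-addressing procedures above can be combined in one sketch. Linear probing is assumed here as the probe sequence (probing strategies are only introduced in the next section), and all names are illustrative:

```python
# Open addressing with a DELETED sentinel, following the INSERT/SEARCH/DELETE
# pseudocode. NIL slots end a search; DELETED slots are stepped over on search
# but may be reused on insert.

NIL, DELETED = None, "<deleted>"

class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.T = [NIL] * m

    def _h(self, k, i):
        return (k + i) % self.m            # linear probing with h'(k) = k mod m

    def insert(self, k):
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] is NIL or self.T[j] == DELETED:
                self.T[j] = k              # free slot found: store the key
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):
        for i in range(self.m):
            j = self._h(k, i)
            if self.T[j] == k:
                return j
            if self.T[j] is NIL:           # truly empty slot: key cannot be later
                return None
        return None

    def delete(self, k):
        j = self.search(k)
        if j is None:
            raise KeyError("key not found")
        self.T[j] = DELETED                # mark the slot, do not empty it

t = OpenAddressTable(7)
print(t.insert(5))    # 5: home slot is free
print(t.insert(12))   # 6: 12 mod 7 = 5 is taken, probing moves on
t.delete(5)
print(t.search(12))   # 6: the search steps over the DELETED slot
```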

Probe sequences

● uniform hashing is a generalization of simple uniform hashing in which each of the m! possible permutations of 〈0, 1, ..., m-1〉 is selected as a probe sequence with equal probability
● uniform hashing is difficult to achieve in practice, so we approximate it using methods that guarantee at least that the probe sequence is a permutation of 〈0, 1, ..., m-1〉:
  – linear probing
  – quadratic probing
  – double hashing

Linear probing

● uses the following hash function:
    h(k,i) = (h′(k) + i) mod m;  i = 0, 1, ..., m-1
● h′(k) is an auxiliary hash function which determines the first inspected slot T[h′(k)]
● the probe sequence is 〈T[h′(k)], T[h′(k)+1], ..., T[m-1], T[0], T[1], ..., T[h′(k)-1]〉
● the implementation of linear probing is simple, but the problem is the emergence of long runs of occupied slots in the hash table (primary clustering), which increases the average search time
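The linear probe sequence, including its wrap-around, can be generated in one line (a sketch; the function name is illustrative):

```python
# Linear probing: start at h'(k) and step through all m slots, wrapping
# around from m-1 back to 0.

def linear_probes(h0, m):
    return [(h0 + i) % m for i in range(m)]

print(linear_probes(5, 8))   # [5, 6, 7, 0, 1, 2, 3, 4]
```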

Quadratic probing

● uses the following hash function:
    h(k,i) = (h′(k) + c1·i + c2·i²) mod m;  i = 0, 1, ..., m-1
● c1 and c2 are non-zero constants, which must be chosen so that all hash table slots are addressed:
  – for m = 2ⁿ a good choice is c1 = c2 = 1/2
  – for arbitrary m and c1 = c2 = 1/2, instead of "mod m" we use the first power of 2 greater than m and skip the values h(k,i) ≥ m
● the first probed slot is T[h′(k)]; the index of subsequently probed slots changes according to a quadratic function of i
● in practice quadratic probing performs better than linear probing
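For c1 = c2 = 1/2 the increment h′(k) + i/2 + i²/2 simplifies to h′(k) + i(i+1)/2, so the probe offsets are triangular numbers; for m a power of 2 these visit every slot exactly once. A sketch checking that claim (function name illustrative):

```python
# Quadratic probing with c1 = c2 = 1/2: offsets are triangular numbers
# i*(i+1)/2, which form a permutation of the slots when m is a power of 2.

def quadratic_probes(h0, m):
    return [(h0 + i * (i + 1) // 2) % m for i in range(m)]

seq = quadratic_probes(0, 8)
print(seq)                             # [0, 1, 3, 6, 2, 7, 5, 4]
print(sorted(seq) == list(range(8)))   # True: all 8 slots are addressed
```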

Double hashing

● uses the following hash function:
    h(k,i) = (h1(k) + i·h2(k)) mod m;  i = 0, 1, ..., m-1
● h1 and h2 are auxiliary hash functions; the values of h2 should be non-zero and coprime to m
  – for m = 2ⁿ this is achieved if h2 only returns odd values
● the first probed slot is T[h1(k)]; the addresses of subsequently probed slots depend on h2(k)
● example:
  – h1(k) = k mod m, h2(k) = 1 + (k mod (m-1)), k = 123456, m = 701
  – h1(k) = 123456 mod 701 = 80, h2(k) = 1 + (123456 mod 700) = 257
  – h(k,i) = [80 + i · 257] mod 701
  – the probe sequence is 〈80, 337, 594, 150, 407, ...〉
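The double-hashing probe sequence from the example can be reproduced directly (a sketch; the function name is illustrative):

```python
# Double hashing with h1(k) = k mod m and h2(k) = 1 + (k mod (m-1)).

def double_hash_probes(k, m, count):
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    return [(h1 + i * h2) % m for i in range(count)]

print(double_hash_probes(123456, 701, 5))   # [80, 337, 594, 150, 407]
```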

Double hashing – example



● insert key k=14 into a hash table of size m=13 using open addressing with double hashing; the hash functions are h1(k) = k mod 13 and h2(k) = 1 + (k mod 11), and slots 1, 3, 5 and 10 are already occupied by keys 79, 68, 96 and 49
  – h1(14) = 14 mod 13 = 1
  – h2(14) = 1 + (14 mod 11) = 1 + 3 = 4
  – h(14,i) = [h1(14) + i·h2(14)] mod 13 = [1 + 4i] mod 13
  – the probe sequence for i = 0, 1, ..., m-1 is 〈1, 5, 9, 0, 4, 8, 12, 3, 7, 11, 2, 6, 10〉
  – we try to store the key into slot 1, which is occupied (by 79)
  – next we try to store the key into slot 5, which is also occupied (by 96)
  – in the third trial we successfully store the key into slot 9, which is still free
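The insertion walk above can be replayed in a few lines (a sketch with the slide's occupied slots hard-coded):

```python
# Replaying the example: m=13, slots {1, 3, 5, 10} already occupied,
# probe sequence h(14,i) = (1 + 4*i) mod 13.

occupied = {1, 3, 5, 10}
probes = [(1 + 4 * i) % 13 for i in range(13)]
print(probes)   # [1, 5, 9, 0, 4, 8, 12, 3, 7, 11, 2, 6, 10]

slot = next(j for j in probes if j not in occupied)
print(slot)     # 9: the first free slot, found on the third trial
```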

Double hashing – example



● we have removed key 96 from the hash table above and marked the removed element's location with a special symbol ×
● we want to verify whether key 14 is stored in the hash table:
  – the probe sequence is identical to the one used at insertion, i.e. 〈1, 5, 9, 0, 4, 8, 12, 3, 7, 11, 2, 6, 10〉
  – the first inspected slot is 1, where the key is not found
  – the next inspected slot is 5, where the special symbol is encountered, which means we must continue the search
  – the next inspected slot is 9, where the key is finally found

Open addressing analysis



● the average number of trials for an unsuccessful search is at most 1/(1−α) if we assume uniform hashing and a load factor α < 1
● element insertion requires at most 1/(1−α) trials if uniform hashing is assumed
● if a load factor α < 1, uniform hashing, and equal probability of searching for every key in the hash table are assumed, then the average number of trials for a successful search is at most:
    (1/α) · ln(1/(1−α))
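The two bounds above are easy to evaluate numerically; a sketch showing how quickly the expected cost grows as the table fills (function names are illustrative):

```python
import math

# Upper bounds on the expected number of probes under uniform hashing.

def unsuccessful_bound(alpha):
    return 1 / (1 - alpha)                        # unsuccessful search / insert

def successful_bound(alpha):
    return (1 / alpha) * math.log(1 / (1 - alpha))   # successful search

for alpha in (0.5, 0.9):
    print(alpha,
          round(unsuccessful_bound(alpha), 3),
          round(successful_bound(alpha), 3))
# At alpha = 0.5 an unsuccessful search needs at most ~2 probes; at
# alpha = 0.9 the bound jumps to 10, while a successful search stays ~2.56.
```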

Disadvantages and applications



● disadvantages of a hash table:
  – it cannot be a dynamic data structure owing to the fixed hash function
  – it cannot be adapted to a changed element distribution
  – it does not allow efficient implementation of operations like minimum or maximum element extraction, predecessor or successor search, sorting, ...
● applications:
  – dictionaries and associative arrays
  – symbol tables in compilers
  – memory pages in operating systems