
DATA STRUCTURES (AND ALGORITHMS)

Hash Tables: Introduction, Separate Chaining

2023 - 2024

Babeş-Bolyai University


Faculty of Mathematics and Computer Science
In Lecture 6...

ADT List

ADT Stack

ADT Queue

Today

Hash Tables

Direct-address table

Introduction to hash tables

Hash tables with separate chaining

Direct-address table

Consider the following problem:

We have to store data where every element has a key (a natural number)

No two elements have the same key

The universe of keys is relatively small, U = {0, 1, 2, ..., m − 1}

We have to support the basic dictionary operations:

INSERT

DELETE and

SEARCH
Direct-address table

Solution:

Use an array T with m positions (since the keys belong to {0, 1, 2, ..., m − 1})

The element with key k will be stored in slot T[k]

Slots not corresponding to existing elements will contain the value NIL
Operations for a direct-address table - search

Searching in a direct-address table:


function search(T, k) is:
//pre: T is an array (the direct-address table), k is a key
search ← T[k]
end-function

Operations for a direct-address table - insert

Inserting in a direct-address table:


subalgorithm insert(T, x) is:
//pre: T is an array (the direct-address table), x is an element
T[key(x)] ← x //key(x) returns the key of an element
end-subalgorithm

Operations for a direct-address table - delete

Deleting from a direct-address table:


subalgorithm delete(T, x) is:
//pre: T is an array (the direct-address table), x is an element
T[key(x)] ← NIL
end-subalgorithm

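Taken together, the three operations map directly onto an array. Below is a minimal Python sketch (an illustrative class, not from the lecture), assuming every element exposes its key as x.key:

class DirectAddressTable:
    def __init__(self, m):
        self.T = [None] * m          # one slot per possible key in {0, ..., m-1}; None plays the role of NIL

    def search(self, k):
        return self.T[k]             # Theta(1)

    def insert(self, x):
        self.T[x.key] = x            # Theta(1); assumes x.key is the element's key

    def delete(self, x):
        self.T[x.key] = None         # Theta(1)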
Direct-address table - Advantages and disadvantages

Advantages of direct-address tables:

They are simple

They are time-efficient - all operations run in Θ(1) time.

Disadvantages of direct-address tables:

The keys have to be natural numbers

The keys have to come from a small universe (interval)

The number of actual keys << the cardinality of the universe ⇒ storage space is wasted
Hash tables

A hash table is a generalization of a direct-address table that represents a time-space trade-off.

Searching for an element still takes Θ(1) time, but as average-case complexity (worst-case complexity is higher).
Hash tables - main idea

There is still a table T of size m (but m is not the number of possible keys, |U|).

There is also a function h, called a hash function, that maps a key k to an index in the table T:

h : U → {0, 1, ..., m − 1}

Remarks:

In case of direct-address tables, an element with key k is stored in T[k].

In case of hash tables, an element with key k is stored in T[h(k)].
Hash tables - main idea

The aim of the hash function is to reduce the range of array indexes that are needed ⇒ instead of |U| indexes, we only need m indexes.

Two keys may hash to the same index ⇒ a collision ⇒ we need techniques for resolving the conflict created by collisions.

The two main points of discussion for hash tables are:

How to define the hash function?

How to resolve collisions?
A good hash function

A good hash function:

is deterministic

can be computed in Θ(1) time

can minimize the number of collisions

satisfies (approximately) the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where any other key has hashed to

P(h(k) = j) = 1/m, ∀j = 0, ..., m − 1, ∀k ∈ U
Examples of bad hash functions

h(k) = constant number

does not reduce the number of collisions

h(k) = random number

it is not deterministic

assuming that a key is a Personal Numeric Code, a hash function considering just parts of it (first digit, birth year/date, county code, etc.)

favors the collisions

h(k) = k % m, when m = 16

favors the collisions, by considering only the last four bits of the key
Hash function

The simple uniform hashing assumption is hard to satisfy.

In practice, we use heuristic techniques to create hash functions that perform well.
The division method

The division method

h(k) = k mod m

For example:
m = 13
k = 24 ⇒ h(k) = 11
k = 26 ⇒ h(k) = 0
k = 131 ⇒ h(k) = 1

Requires only one division, so it is quite fast

Experiments show that good values for m are primes not too close to exact powers of 2
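As a quick sanity check, the worked values above can be reproduced with a one-line Python version of the division method (a sketch, with m = 13 as in the example):

def h_div(k, m=13):
    return k % m                     # division method: h(k) = k mod m

assert h_div(24) == 11 and h_div(26) == 0 and h_div(131) == 1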
The division method

Interestingly, Java uses the division method with a table size which is a power of 2 (initially 16).

To avoid the resulting problem (a power-of-2 modulus keeps only the low-order bits of the key), a second function is applied before the mod:
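In Java 8's HashMap, for instance, this second function is h ^ (h >>> 16): it XORs the upper 16 bits of the hash code into the lower ones, so that high-order bits still influence the slot index chosen by the power-of-2 mod.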

The multiplication method

The multiplication method:

h(k) = floor(m ∗ frac(k ∗ A)), where
m - the hash table size
A - constant, 0 < A < 1
frac(k ∗ A) - fractional part of k ∗ A

Some values for A work better than others. Knuth suggests A = (√5 − 1)/2 ≈ 0.6180339887 (the reciprocal of the golden ratio)

For example:
m = 13, A = 0.6180339887
k = 63 ⇒ h(k) = floor(13 ∗ frac(63 ∗ A)) = floor(12.16984) = 12
k = 52 ⇒ h(k) = floor(13 ∗ frac(52 ∗ A)) = floor(1.790976) = 1
k = 129 ⇒ h(k) = floor(13 ∗ frac(129 ∗ A)) = floor(9.442999) = 9

The value of m is not critical, typically m = 2^p
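A direct Python transcription of the formula (a sketch; math.floor and a modulo-1 fractional part implement floor and frac):

import math

def h_mul(k, m=13, A=(math.sqrt(5) - 1) / 2):
    frac = (k * A) % 1.0             # frac(k * A), the fractional part
    return math.floor(m * frac)      # h(k) = floor(m * frac(k * A))

assert h_mul(63) == 12 and h_mul(52) == 1 and h_mul(129) == 9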
Universal hashing

If we know the exact hash function used by a hash table, we can always generate a set of keys that will collide. This reduces the performance of the table.

Example:
m = 13
h(k) = k mod m
k = 1, 14, 27, 40, 53, 66, etc. (all hash to slot 1)
Universal hashing

Instead of having one hash function, we have a collection H of hash functions that map a given universe U of keys into the range {0, 1, ..., m − 1}.

Such a collection is universal if for each pair of distinct keys x, y ∈ U the number of hash functions from H for which h(x) = h(y) is |H|/m.

With a hash function randomly chosen from H, the chance of a collision between x and y, where x ≠ y, is 1/m.
Universal hashing

Example 1:
Fix a prime number p > the maximum possible value for a key from U.
For every a ∈ {1, ..., p − 1} and b ∈ {0, ..., p − 1} we can define a hash function h_a,b(k) = ((a ∗ k + b) mod p) mod m.

For example:
h_3,7(k) = ((3 ∗ k + 7) mod p) mod m
h_4,1(k) = ((4 ∗ k + 1) mod p) mod m
h_8,0(k) = ((8 ∗ k) mod p) mod m

There are p ∗ (p − 1) possible hash functions that can be chosen for any p.
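A Python sketch of drawing one member of this family at random (the prime p = 10007 is an arbitrary illustrative choice; p must exceed every possible key):

import random

def make_hash(p, m):
    # pick h_a,b from the universal family: a in {1, ..., p-1}, b in {0, ..., p-1}
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda k: ((a * k + b) % p) % m

h = make_hash(p=10007, m=13)         # usable for keys smaller than 10007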
Universal hashing

Example 2:
If the key k is an array <k_1, k_2, ..., k_r> such that k_i < m (or it can be transformed into such an array, by writing k in base m), let <x_1, x_2, ..., x_r> be a fixed sequence of random numbers, such that x_i ∈ {0, ..., m − 1} (another number in base m with the same length).

h(k) = (∑_{i=1}^{r} k_i ∗ x_i) mod m
Universal hashing

Example 3 (the matrix method):

Suppose the keys are u bits long and m = 2^b.
Pick a random b-by-u matrix (called h) with 0 and 1 values only.
Opt for h(k) = h ∗ k (mod 2) (we do addition mod 2).

[1 0 0 0]   [1]   [1]
[0 1 1 1] ∗ [0] = [1]
[1 1 1 0]   [1]   [0]
            [0]
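A bit-level Python sketch of the matrix method (illustrative parameters; each row of the random matrix is stored as a u-bit integer, and the dot product mod 2 is the parity of a bitwise AND):

import random

def make_matrix_hash(u, b):
    rows = [random.getrandbits(u) for _ in range(b)]   # random b-by-u 0/1 matrix
    def h(k):
        value = 0
        for row in rows:
            parity = bin(row & k).count("1") % 2       # row * k (mod 2)
            value = (value << 1) | parity              # append the output bit
        return value
    return h

h = make_matrix_hash(u=4, b=3)       # maps 4-bit keys into {0, ..., 7}, m = 2^3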
Using keys that are not natural numbers

The majority of the previously presented hash functions assume that the keys are natural numbers.

If this is not true, there are two options:

Define special hash functions that work with the given keys
For example, for real numbers from the [0, 1) interval, h(k) = floor(k ∗ m) can be used

Use a function that converts the key to a natural number
hashCode in Java, hash in Python
Using keys that are not natural numbers

If the key is a string s:

we can consider the ASCII codes for every letter

we can use 1 for a, 2 for b, etc.

Possible implementations for hashCode

s[0] + s[1] + ... + s[n − 1]

Anagrams have the same sum (SAUCE and CAUSE)

Assuming a maximum length of 10 for a word (and the 1 for a, 2 for b representation), hashCode values range from 1 (the word a) to 260 (the word zzzzzzzzzz). Considering a dictionary of about 50,000 words, we would have on average about 192 words per hashCode value.
Using keys that are not natural numbers

s[0] ∗ 26^(n−1) + s[1] ∗ 26^(n−2) + ... + s[n − 1], where n is the length of the string

Generates a much larger interval of hashCode values.

Instead of 26 (chosen because we have 26 letters) we can use a prime number as well (Java uses 31, for example).
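A Python sketch of this polynomial hashCode (Horner's rule avoids computing the powers explicitly; base 31 follows the Java example):

def hash_code(s, base=31):
    h = 0
    for ch in s:
        h = h * base + ord(ch)       # Horner: s[0]*base^(n-1) + ... + s[n-1]
    return h

print(hash_code("SAUCE") == hash_code("CAUSE"))   # False: anagrams now differ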
Collisions

When two keys, x and y, have the same value for the hash function, h(x) = h(y), we have a collision.

A good hash function can reduce the number of collisions, but it cannot eliminate them entirely:

Try fitting m + 1 keys into a table of size m

There are different collision resolution methods:

Separate chaining

Coalesced chaining

Open addressing
Separate chaining

Collision resolution by separate chaining: each slot from the hash table T contains a linked list with all the elements that hash to that slot.
Separate chaining - Example

m = 10
h(k) = k % m

Separate chaining - Operations

The operations are performed on the corresponding linked list:

insert(T, x) - insert a new node at the beginning of the list T[h(key(x))]

search(T, k) - search for an element with key k in the list T[h(k)]

delete(T, x) - delete x from the list T[h(key(x))]
Hash table with separate chaining - representation

A hash table with separate chaining is represented in the following way:

Representation of a node:
Node:
  key: TKey
  next: ↑ Node

Representation of a hash table with separate chaining:
HashTable:
  T: ↑Node[] //an array of pointers to nodes
  m: Integer
  h: TFunction: TKey → {0, 1, ..., m-1} //the hash function

For simplicity, we keep only the keys in the nodes
Hash table with separate chaining - search

Searching in a hash table with separate chaining:

function search(ht, k) is:
//pre: ht is a HashTable, k is a TKey
//post: function returns True if k is in ht, False otherwise
  position ← ht.h(k)
  currentNode ← ht.T[position]
  while currentNode ≠ NIL and [currentNode].key ≠ k execute
    currentNode ← [currentNode].next
  end-while
  if currentNode ≠ NIL then
    search ← True
  else
    search ← False
  end-if
end-function

Usually search returns the value associated with the key k
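Putting the representation and the operations together, here is a compact Python sketch of a separate-chaining hash table (illustrative names; like the lecture's Node, it stores only keys):

class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

class HashTable:
    def __init__(self, m, h):
        self.m = m
        self.h = h                           # hash function: key -> {0, ..., m-1}
        self.T = [None] * m                  # one (possibly empty) chain per slot

    def insert(self, k):
        pos = self.h(k)
        self.T[pos] = Node(k, self.T[pos])   # new node at the head: Theta(1)

    def search(self, k):
        node = self.T[self.h(k)]
        while node is not None and node.key != k:
            node = node.next
        return node is not None

    def delete(self, k):
        pos = self.h(k)
        node, prev = self.T[pos], None
        while node is not None and node.key != k:
            prev, node = node, node.next
        if node is not None:                 # unlink the node, if found
            if prev is None:
                self.T[pos] = node.next
            else:
                prev.next = node.next

ht = HashTable(m=10, h=lambda k: k % 10)
ht.insert(12); ht.insert(22)
print(ht.search(22), ht.search(7))           # True False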
Analysis of hashing with chaining

The average time-performance depends on the quality of the hash function.

Simple Uniform Hashing (SUH) assumption: each element is equally likely to hash to any of the m slots, independently of where any other elements have hashed to.

The load factor, α, of the table T with m slots containing n elements is n/m and represents the average number of elements stored in a chain.

In case of separate chaining, α can be less than, equal to, or greater than 1.
Analysis of hashing with chaining - Search

There are two cases:

unsuccessful search

successful search

We assume that:

the hash value is computed in Θ(1)

the time required to search by key k depends linearly on the length of the list T[h(k)]
Analysis of hashing with chaining - Search

Theorem: In a hash table with separate chaining, an unsuccessful search takes Θ(1 + α) time, on the average, under the assumption of SUH.

Theorem: In a hash table with separate chaining, a successful search takes Θ(1 + α) time, on the average, under the assumption of SUH.

Proof idea: Θ(1) is needed to compute the hash value, while Θ(α) is the average time needed to search in one of the m lists.
Analysis of hashing with chaining - Search

If n = O(m):

α = n/m = O(m)/m = O(1)

searching takes constant time on average

Worst-case time complexity is Θ(n)

when all the elements collide ⇒ they are in the same list and we are searching this list

In practice, hash tables are pretty fast.
Analysis of hashing with chaining - Insert

We create a new node and add it to the beginning of the list at index h(key(x))

The worst-case time complexity is Θ(1)

If we have to check whether the key already exists in the table, then the complexity of searching should be taken into account, too.
Analysis of hashing with chaining - Delete

We have to search for the node containing the element to be deleted and remove the node (if found)

We can also find the node previous to the one to be deleted, while searching, in order to facilitate deletion

The time complexity is given by the searching part.
Conclusions

All dictionary operations can be supported in Θ(1) time on average.

In theory, we can keep any number of elements in a hash table with separate chaining, but the complexity is proportional to α. If α is too large ⇒ we resize and rehash.
Example

Assume we have a hash table that uses separate chaining for collision resolution, with:
m = 6
the following resizing policy: if α ≥ 0.7 ⇒ we double the size of the table

Using the division method, insert the following elements, in the given order, in the hash table: 38, 11, 8, 72, 55, 29, 2.
Example

h(38) = 2 (load factor will be 1/6)
h(11) = 5 (load factor will be 2/6)
h(8) = 2 (load factor will be 3/6)
h(72) = 0 (load factor will be 4/6)
h(55) = 1 (load factor will be 5/6 - greater than 0.7)

The table after the first five elements were added:
Example

Is it OK if, after the resize, the hash table with m = 12 is the following?
Example

The hash value depends on the size of the hash table. If the size of the hash table changes, the value of the hash function changes as well.

search and remove operations might not find the element.

After resizing, we have to rehash the elements by adding them again to the resized hash table.
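Continuing the Python sketch from above, a possible resize-and-rehash step (a hypothetical helper; it assumes the division method, so the new hash is simply k mod the doubled m):

def resize(ht):
    old_T = ht.T
    ht.m = ht.m * 2                       # double the size of the table
    ht.h = lambda k, m=ht.m: k % m        # the hash function must change with m
    ht.T = [None] * ht.m
    for chain in old_T:                   # rehash: re-add every key
        node = chain
        while node is not None:
            ht.insert(node.key)
            node = node.next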
Example

After rehashing and adding the other two elements:

Iterator

For the exemplified hash table, the easiest order in which the elements can be iterated is: 2, 32, 5, 72, 55, 8, 11.
Iterator

The iterator for a hash table with separate chaining is a combination of:
an iterator on an array (the table)
an iterator on linked lists

We need a compound cursor consisting of:
the current index in the table
the pointer to the current node in the current linked list

Representation of the iterator over a hash table with separate chaining:
IteratorHT:
  ht: HashTable
  currentPos: Integer
  currentNode: ↑ Node
Iterator - init

How can we implement the init operation?

The constructor of an iterator over a hash table with separate chaining:

subalgorithm init(ith, ht) is: //pre: ith is an IteratorHT, ht is a HashTable
  ith.ht ← ht
  ith.currentPos ← 0
  while ith.currentPos < ht.m and ht.T[ith.currentPos] = NIL execute
    ith.currentPos ← ith.currentPos + 1
  end-while
  if ith.currentPos < ht.m then
    ith.currentNode ← ht.T[ith.currentPos]
  else
    ith.currentNode ← NIL
  end-if
end-subalgorithm

What is the time complexity? O(m)
Iterator - other operations

How can we implement the getCurrent operation?

How can we implement the next operation?

How can we implement the valid operation?

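One possible answer, sketched in Python over the same compound cursor (illustrative; valid and getCurrent take Θ(1), while next may have to skip empty slots, mirroring init):

def valid(ith):
    return ith.currentNode is not None

def getCurrent(ith):
    return ith.currentNode.key                 # precondition: valid(ith) is True

def next(ith):
    ith.currentNode = ith.currentNode.next     # advance inside the current chain
    if ith.currentNode is None:                # chain exhausted: find the next
        ith.currentPos += 1                    # non-empty slot, as init does
        while ith.currentPos < ith.ht.m and ith.ht.T[ith.currentPos] is None:
            ith.currentPos += 1
        if ith.currentPos < ith.ht.m:
            ith.currentNode = ith.ht.T[ith.currentPos]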
Sorted containers

How can we define a sorted container on a hash table with separate chaining?

We can store the individual lists in sorted order and, for the iterator, we can merge them.

Hash tables are in general not very suitable for sorted containers.
Containers represented using hash tables

Hash tables are used for representing the following containers:

ADT Map (Sorted Map)
Python's dictionaries ( {:} ), HashMap in Java, unordered_map in C++ STL

ADT MultiMap (Sorted MultiMap)
HashMultimap in Guava (Google Core Libraries for Java), unordered_multimap in C++ STL

ADT Set
HashSet in Java Collections API, Python's sets ( {} )

ADT Bag
HashMultiset in Guava (for Java)
Hash table - Applications

Real-world applications of hash tables:

Programming languages
Implementation of built-in data types (dict in Python, HashMap in Java)

Compilers
For storing the programming language's keywords and for mapping the variable names to memory locations

File systems
For mapping file names to the file path and to the physical location of that file on the disk

Password verification
For storing hashed passwords

Data integrity checks
To generate checksums on data files
Bibliography

David M. Mount, Lecture notes for the course Data Structures (CMSC 420), Dept. of Computer Science, University of Maryland, College Park, 2001

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, Introduction to Algorithms, Third Edition, The MIT Press, 2009

Narasimha Karumanchi, Data Structures and Algorithms Made Easy: Data Structures and Algorithmic Puzzles, Fifth Edition, 2016

Clifford A. Shaffer, A Practical Introduction to Data Structures and Algorithm Analysis, Third Edition, 2010
Thank you
