Chapter10_HashTables

This lecture covers hash tables, focusing on their structure, hash functions, and collision-handling methods. It discusses the importance of efficient hash functions, the inevitability of collisions, and various strategies for collision resolution such as separate chaining and open addressing. The lecture also highlights the performance implications of different probing techniques, including linear probing, quadratic probing, and double hashing.

Lecture 10

Hash Tables

1
Main Points
 Introduction
 Hash Table, Hash Function, Collisions
 Handling Collisions
 Exercise

2
Let’s Look at this

3
Hash Tables: An introduction
 Goal: Is it possible to design a search of O(1): a
constant search time no matter where the element is
located in the list?
 Example: Consider a list of employees in a small
company. Each of 100 employees has an ID number in
the range 0-99. If we store the elements (employee
records) in an array, then each employee’s ID
number will be the index of the array element where
the record is stored
 There is a one-to-one correspondence between the
element key and the array index
 What if the company wants to use a 5-digit ID
number as the primary key?
 Compare the size of the array with the number of
employees
4
One solution to the previous
problem
 What if we keep the array size down to the
size that we will actually be using (100
elements) and use just the last two
digits of the key to identify each employee?
 For example, the employee with key 45678
will be stored in the array element with
index 78, and the employee with key 23456 will be
stored at array index 56
 We are looking for a way to convert a 5-digit
number into a two-digit array index: some
function needs to do the transformation: a hash
function, used with a hash table (an array)
5
Hash Table, Hash Function,
Collisions
 A Hash Table is a data structure in which
keys are mapped to array positions by a
hash function
 A Hash Function is a function which,
when applied to the key, produces an
integer that can be used as an address
(index) in the hash table
 For our previous example:

int HashFunction(int id_number) {
    return id_number % table_size;
}
6
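As a concrete sketch of the slide's function (the 100-slot table, one per two-digit index, follows the employee example; the name hash_function is mine):

```c
#define TABLE_SIZE 100  /* one slot per possible two-digit index */

/* Maps a 5-digit ID to an array index in the range 0..99 */
int hash_function(int id_number) {
    return id_number % TABLE_SIZE;
}
```

Here hash_function(45678) gives 78 and hash_function(23456) gives 56, matching the example; hash_function(34878) also gives 78, which is exactly the collision discussed on the next slide.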
Hash Table, Hash Function,
Collisions: continue…
 Suppose that the hash table contains the
records of two employees with IDs 45678
and 23456, respectively.
 We need to add another employee whose ID is
34878!!
 The array element with index 78 already holds a
value.
 When more than one element tries to occupy the
same array position, we have a collision
 A collision is the condition resulting when two or
more keys produce the same hash location

7
Issues surrounding Hashing
 Which HF to use?
 The big challenge:
 Easy to compute: if the hash algorithm is too
inefficient, its cost will overshadow the advantages of the
technique
 Should distribute entries uniformly through the HT
slots
 Should minimize collisions
 Can we avoid collisions?
 No, collisions are inherent! Regardless of the
type of the HF, we will likely experience collisions
because the domain of keys is usually larger than the
number of buckets

8
Other examples
 Implementing a Dictionary:
 HF: sum the ASCII codes of the letters, then mod n (n is the HT
size!)
 raw and war would have the same key!
 How to implement a ‘Spelling Checker’?
 Create a HT for all the words in a dictionary
 When you encounter a word whose spelling you want to
check, just hash it and see if it exists in the table.
 If it does, then you've spelled it correctly.
 If not, then you haven't.
 This lets you look up a word in O(1) time rather than O(n)
time, which, in a dictionary on the order of 800,000 words, is a
big time saver.
 Cryptography: use a hash function to protect stored
passwords (store the hash rather than the password itself)

9
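The anagram collision above can be checked with a small sketch (the function name and the table size 101 are illustrative, not from the slides):

```c
/* Sum-of-ASCII hash from the dictionary example: add the codes, then mod n */
int ascii_sum_hash(const char *word, int n) {
    int sum = 0;
    while (*word)
        sum += *word++;
    return sum % n;
}
```

"raw" and "war" contain the same letters, so their ASCII sums are equal (114 + 97 + 119 = 330) and they hash to the same slot for any choice of n.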
Hash Function: Examples
 HFs generally take records whose key values come
from a large range and store those records in a HT
with a relatively small number of slots – depends a lot
on the key set
 Some HF examples:
 Division:
 F(x) = x mod m; the best value for m is a prime
 Mid-square:
 Take the middle k digits of x²
 Folding:
 Given a key x1x2…xr
 F1(x1x2…xr) = x1x2 + x3x4 + … + xr-1xr
 F2(x1x2…xr) = x2x1 + x4x3 + … + xrxr-1
 Example: x = 251367
 F1(x) = 25 + 13 + 67 = 105
 F2(x) = 52 + 31 + 76 = 159
 Truncating:
 F(x) = the last k digits of x (or the first k digits)

10
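These families can be sketched as follows (the function names are mine; the folding helpers assume the key splits into two-digit pairs, as in the worked example x = 251367):

```c
/* Division method: F(x) = x mod m, with m ideally prime */
int division_hash(int x, int m) {
    return x % m;
}

/* Folding F1: add the two-digit pairs of x (25|13|67 -> 25 + 13 + 67) */
int fold_hash(int x) {
    int sum = 0;
    while (x > 0) {
        sum += x % 100;  /* take the next pair from the right */
        x /= 100;
    }
    return sum;
}

/* Folding F2: reverse the digits of each pair before adding (52 + 31 + 76) */
int fold_reverse_hash(int x) {
    int sum = 0;
    while (x > 0) {
        int pair = x % 100;
        sum += (pair % 10) * 10 + pair / 10;
        x /= 100;
    }
    return sum;
}

/* Truncating: keep the last k digits of x */
int truncate_hash(int x, int k) {
    int mod = 1;
    while (k-- > 0)
        mod *= 10;
    return x % mod;
}
```

For the slide's example, fold_hash(251367) = 25 + 13 + 67 = 105 and fold_reverse_hash(251367) = 52 + 31 + 76 = 159; truncate_hash(45678, 2) recovers the earlier employee index 78.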
Summarize what we learnt so far…
 To add/retrieve an element from the hash table:

Algorithm to add a value:
    Add(key, value) {
        index = hash(key);
        hashTable[index] = value;
    }

Algorithm to get a value:
    dataT getValue(key) {
        index = hash(key);
        return hashTable[index];
    }

Will these algorithms always work? We can make
them work if we know all possible search keys, an
appropriate table size, and a perfect hash function
11
Collision, Handling Collision
 The two ways of dealing with
collisions are:
 Chaining: use linked list
 Open Addressing: Linear probing,
Quadratic probing, and Double Hashing

12
Problem
 Consider the following hash table and
the following hash function:
 H(x) = x² % 10
 We want to insert the values 1 through 9
 Collisions! Because of the hash function
 Finish inserting the other numbers using linear probing

The table after inserting 1-5:

    Index: 0  1  2  3  4  5  6  7  8  9
    Value: -  1  -  -  2  5  4  -  -  3
13
Solution 1: Use Separate
Chaining
 Colliding records are chained together in
separate linked lists
 HT slots don’t hold data; rather, they store
pointers to the synonyms’ linked lists
 If a collision happens, insert into the
corresponding linked list – O(1) (always insert
at the head)
 Search/Delete?
 Drawback: use of another data structure, and
linear search through the linked lists

14
Advantages of Separate
chaining
 Simple collision handling
 No Overflow: we can store more
elements than the hash table size
 Deletion is done from the linked list

15
Example

16
Separate Chaining an
illustration
 Assume that we want to add a list of
students to a hash table using their
IDs. The following program represents
how collisions are resolved using separate
chaining.
 We are using a table of 10 cells
 We are using ID % Size as the hash function
 Implement this code

17
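A minimal sketch of the program the slide describes, under its stated assumptions (a 10-cell table, hash = ID % Size); the type and function names, and the search helper, are my additions:

```c
#include <stdlib.h>

#define SIZE 10  /* table of 10 cells, as on the slide */

/* One student record in a synonym chain */
typedef struct Node {
    int id;
    struct Node *next;
} Node;

Node *table[SIZE];  /* each cell heads a linked list of synonyms */

int hash(int id) {
    return id % SIZE;  /* ID % Size, as on the slide */
}

/* Insert at the head of the chain at hash(id): O(1) */
void insert(int id) {
    Node *node = malloc(sizeof(Node));
    node->id = id;
    node->next = table[hash(id)];
    table[hash(id)] = node;
}

/* Linear search through the chain at hash(id); returns 1 if found */
int search(int id) {
    for (Node *p = table[hash(id)]; p != NULL; p = p->next)
        if (p->id == id)
            return 1;
    return 0;
}
```

Inserting 45678 and 34878 puts both records in the chain at cell 8 (both IDs end in 8); search simply walks that chain, so collisions never overflow the table.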
(Slides 18-24 stepped through the separate-chaining code as images; they are not reproduced here.)
Solution 2: Open Addressing

 All data go inside the table itself
 Works well when the load factor is below 0.5
 Load factor?? In later slides
 If a collision happens:
 Alternative cells are tried until an
empty cell is found
25
Open Addressing
 No linked lists – all items are stored in the HT itself
 Alternative cells (h0(k), h1(k), …, hn(k)) are tried until an
empty cell is found. Each try is called a probe
 hi(k) = hash(k) + fi(k)
 The function fi(k) is the collision resolution strategy
 Since each cell in the HT can hold only one item, a
bigger table is needed than in chaining
 Generally, HT size >= 2N
 Several Methods:
 Linear Probing
 Quadratic Probing
 Double Hashing

26
Linear Probing
 In this method, fi(k) is linear: fi(k) = i
 Linear probing insert algorithm:
 If the table is full: error
 probe = h(k)
 While table[probe] is occupied:
 probe = (probe + 1) mod m
 table[probe] = k
 Search(k) algorithm:
 Compute h(k)
 Look at HT[h(k)]:
 If empty: the element does not exist
 If full:
 Compare to k; if equal, return it, else:
 Loop (a ‘circular linear search’) through successive slots
 If found, return it
 If an empty slot is found, the element does not exist
 Drawback  Clustering
 Elements tend to cluster around full slots,
resulting in very long probes. A solution  Quadratic Probing
27
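The insert and search algorithms above can be sketched as follows (a 10-slot table with -1 as the empty-slot sentinel are my assumptions for illustration):

```c
#define M 10          /* table size */
#define EMPTY (-1)    /* sentinel: slot unoccupied */

int table[M] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
                 EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

int hash(int k) { return k % M; }

/* Linear probing insert; returns the slot used, or -1 if the table is full */
int insert(int k) {
    int probe = hash(k);
    for (int tries = 0; tries < M; tries++) {
        if (table[probe] == EMPTY) {
            table[probe] = k;
            return probe;
        }
        probe = (probe + 1) % M;  /* circular: wrap past the last slot */
    }
    return -1;
}

/* Linear probing search; returns k's slot, or -1 if k is absent */
int search(int k) {
    int probe = hash(k);
    for (int tries = 0; tries < M; tries++) {
        if (table[probe] == EMPTY)
            return -1;            /* empty slot ends the cluster: not present */
        if (table[probe] == k)
            return probe;
        probe = (probe + 1) % M;
    }
    return -1;
}
```

Inserting 3, 13, 23 (all hashing to 3) fills slots 3, 4, 5 in a cluster; a search for an absent key that hashes into the cluster must scan until it reaches the empty slot after it, which is exactly the clustering cost the slide warns about.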
Consider the following example
 H(x) = x % 10
 F(i) = i
 Find 58 (# tries?)
 Insert 19
 Find 19 (# tries?)
28
Linear probing: drawbacks
 As long as the table is big enough, an
empty cell can always be found, but
the time to do so can get quite large
 Moreover, even if the table is relatively
empty, blocks of occupied cells start
forming
 Primary clustering

29
Linear Searching Analysis
 We want to compute the average number of probes for a
successful and an unsuccessful search in this hash table
 H(x) = x mod 11; keys inserted in order: 20, 30, 2, 13, 25, 24, 10, 9

    Index: 0  1  2  3   4   5   6  7  8   9   10
    Value: 9  -  2  13  25  24  -  -  30  20  10

 Case 1: successful search for 20, 30, 2, 13, 25, 24, 10, 9
    Avg = (1+1+1+2+2+4+1+3)/8 = 15/8
    (< 2 probes per search)
 Case 2: unsuccessful search, for keys hashing to 0, 1, 2, …, 10
    (count the cells examined until an empty cell is reached)
    Avg = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11
30
Solution 3: Quadratic Probing
 Eliminates clustering by probing separated slots
 fi(k) is quadratic: fi(k) = i²
 If a collision happens at HT[h(k)], look successively at
h(k)+1², h(k)+2², … until an empty cell is found
 Solves the ‘clustering’ problem BUT can lead to ‘secondary
clustering’:
 I.e., colliding elements will follow the same probe sequence
31
Example
 H(x) = x mod 10
 Insert: 3, 5, 13, 24, 33, 45, 54
 54?

    Index: 0  1  2  3  4   5  6   7   8   9
    Value: -  -  -  3  13  5  45  33  24  -
32
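A sketch of quadratic-probing insertion for the example above (the 10-slot table and the -1 empty-slot sentinel are my illustrative assumptions):

```c
#define M 10
#define EMPTY (-1)

int table[M] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
                 EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

int hash(int k) { return k % M; }

/* Quadratic probing: the i-th probe examines (hash(k) + i*i) mod M */
int insert(int k) {
    int home = hash(k);
    for (int i = 0; i < M; i++) {
        int probe = (home + i * i) % M;
        if (table[probe] == EMPTY) {
            table[probe] = k;
            return probe;
        }
    }
    return -1;  /* gave up: quadratic probing can miss empty slots */
}
```

Inserting 3, 5, 13, 24, 33, 45 fills slots 3, 5, 4, 8, 7, 6 as in the table, and the "54?" question resolves to slot 0: home slot 4 is full, then 4+1²=5, 4+2²=8, and 4+3²=13 mod 10=3 are full, and 4+4²=20 mod 10=0 is empty. Note that quadratic probing is only guaranteed to find an empty slot when the table size is prime and the load factor stays below 0.5.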
Solution 4: Double Hashing
 Avoids both Primary and Secondary
Clustering
 Idea:
 The probe should depend on the key instead of
being the same for all keys
 Use another hash function; the
increment is defined by this second function
 The second HF should:
 Depend on the key
 Be different from the first! Why?
 Never return zero

33
Double Hashing Insert
Algorithm
 If the table is full: error
 probe = h1(k), offset = h2(k)
 While table[probe] is occupied:
 probe = (probe + offset) mod m
 table[probe] = k
 The probe sequence is: probe, probe + offset,
probe + 2*offset, probe + 3*offset, …

34
Double Hashing (Cont.)
 Ideal functions are of this form:
 h2(Key) = Const – (Key % Const)
 where Const is a prime number less than the
HT size
 Example: (Const = 5)

35
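A sketch of the double-hashing insert with the slide's second function, h2(k) = Const - (k % Const) with Const = 5 (the 10-slot table and the -1 sentinel are illustrative assumptions):

```c
#define M 10
#define STEP_CONST 5   /* prime less than the table size, as on the slide */
#define EMPTY (-1)

int table[M] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
                 EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

int h1(int k) { return k % M; }
int h2(int k) { return STEP_CONST - (k % STEP_CONST); }  /* in 1..5: never zero */

/* Double hashing: the step between probes depends on the key itself */
int insert(int k) {
    int probe = h1(k);
    int offset = h2(k);
    for (int tries = 0; tries < M; tries++) {
        if (table[probe] == EMPTY) {
            table[probe] = k;
            return probe;
        }
        probe = (probe + offset) % M;
    }
    return -1;  /* no empty slot reached */
}
```

For example, 3 lands in slot 3; 13 also hashes to 3, but its step h2(13) = 2 moves it to slot 5; 23 (step 2 as well) probes 3 and 5, then settles in slot 7. Keys that collide at the same home slot can still follow different probe sequences, which is what defeats secondary clustering.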
Illustration of linear probing and
double hashing

36
Load Factor and Hash Tables

43
Analysis of Separate Chaining
 Load factor λ definition:
 The ratio of the number of elements (N) in a hash
table to the hash table size
 i.e., λ = N/TableSize
 The average length of a chain is also λ
 For chaining, λ is not bounded by 1; it can
be > 1 (e.g., hash table size 10 but N = 100)
 So to search/delete an element: time to
compute the hash function + time to scan the
chain, whose average length is λ; search and
delete are therefore O(1 + λ)
44
Separate Chaining Performance
 Search cost is proportional to length
of chain
 Worst case, all keys hash to same chain
 When size of hash table is too large
 Many empty slots
 When size of hash table is small
 You end up having long chains

45
Linear Probing Performance
 Insert and search depend on the length of the
cluster
 Average length of the cluster is λ
 Worst case: all keys hashed to same cluster
 When size of the array is too large:
 You will have many HT entries empty
 When the size is too small
 clusters!
 Typical choice: size=2*N_elements

46
Analysis of double hashing and
quadratic hashing
 Remember, the load factor λ = (number of elements in the HT) / (HT
size)
 This means 1 - λ represents the fraction of empty
locations in the HT
 So the expected number of probes to find an empty
location (i.e., an unsuccessful search) is 1/(1 - λ)
 Even though double hashing avoids the clustering of
linear probing and quadratic probing, its
efficiency was proved to be the same as quadratic
probing's.

47
Hash Tables: A Summary
 A hash table is based on an array
 The range of key values is usually greater than the
size of the array
 A key value is hashed to an array index by a hash
function
 The hashing of a key to an already filled cell is called a
collision
 Collisions can be handled using open addressing or
separate chaining
 In open addressing, data items that hash to a full
array cell are placed in another cell in the array
 In separate chaining, each array element consists of a
linked list

48
Hash Tables: A Summary
 In linear probing the step size is always 1
 The number of tries required to find an item is
called the probe length
 In linear probing, contiguous sequences of
filled cells appear: primary clusters
 Quadratic probing eliminates primary
clustering but suffers from less severe
secondary clustering
 In double hashing the step depends on the
key and is obtained from a second hash
function

49
