
Computer Science and Engineering| Indian Institute of Technology Kharagpur

cse.iitkgp.ac.in

Algorithms – I (CS29003/203)

Autumn 2022, IIT Kharagpur

Hashing

Resources
• Apart from the book
• UC Davis ECS 36C Course by Prof. Joël Porquet-Lupine

Oct 14, 19 2022 CS21003/CS21203 / Algorithms - I | Hashing 2



Introduction
• Symbol table
• Association of a value with a key
• table[key] = value
• Typical operations
• Insert(), Remove(), Contains()/Get()
• Other possible operations
• Min(), Max(), Rank(), SortedList(), etc.
• Example #1: frequency of letters
• Compute the frequency of all the (alphabetical) characters in a text

Source: UC Davis, ECS 36C course, Spring 2020

Naïve Implementations
• Unordered array
• Linear search: Sequentially check each item until match is found

• Ordered/sorted array
• Binary search: Can perform binary search to find item


Advanced Implementations
• Binary search tree: Organization of data items in BST, following
certain order
• Any node's key is greater than all keys in its left subtree
• Any node's key is smaller than all keys in its right subtree
• At most, explore height of tree

• AVL tree
• Balanced BST
• Height of tree is kept to $O(\log N)$ for $N$ items


Ideal Data Structure


• Can we achieve constant time on all three main operations?

• Using data as index


• Keys are small integers
• ASCII is 128 characters
• Can directly be used as index
• table[key] = value

• Limitations
• Doesn't (efficiently) support other operations – min(),max() etc.
• Wasted memory for unused keys – e.g., ASCII defines about 30 control
characters
• What if the key is not a small integer?
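
As a minimal sketch of using the key itself as the index, the letter-frequency example can be written with a 128-slot array indexed by ASCII code. The function name `letter_frequency` and the sample text are illustrative assumptions, not part of the lecture.

```python
def letter_frequency(text):
    """Count ASCII letters by using each character code directly as the array index."""
    table = [0] * 128              # one slot per ASCII code; most slots stay unused
    for ch in text:
        code = ord(ch)
        if code < 128 and ch.isalpha():
            table[code] += 1       # table[key] = value, with no search at all
    return table


counts = letter_frequency("Hashing is fun")
print(counts[ord("s")], counts[ord("n")], counts[ord("z")])
```

Insert, lookup and removal are all a single array access here, which is the constant-time behaviour the slide asks for. The cost is the unused slots and the restriction to small integer keys.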

Example #2: frequency of words


• Compute the frequency of all the words in a text

• Issues
• Very large number of words in English
• Would need a huge table to index by word
• int table[171476];
• Cannot index an array with a string
• table["time"] is not a proper C construct
• Can only index arrays with an integer


Hashing
• Compute an array index from any key
• With direct addressing, an element with key $k$ is stored in slot $k$
• With hashing, this element is stored in slot $h(k)$
• We use a hash function $h$ to compute the slot from the key $k$
• $h$ maps the universe $U$ of keys into the slots of a hash table $T[0..M-1]$

• The size $M$ of the hash table is typically much less than $|U|$


• We say that an element with key $k$ hashes to slot $h(k)$
• And $h(k)$ is the hash value of key $k$


Hashing
• We will see how to provide hashing for any type of key, e.g.,
Boolean, integers, floating-point numbers, strings, user-defined
objects, etc.
• Two keys may hash to the same slot. Such a situation is called a
collision
• How to resolve potential collisions properly?


Hashing
• Before going to collisions, let's discuss some hashing techniques
• Required properties:
• Simple to compute and fast
• Hash of a key has to be deterministic
• Equal keys must produce the same hash value
• assert(!(a == b) || hash(a) == hash(b));
• Uniform distribution of the keys across the index range

• Two implicit steps:


• Transform any potentially non-integer key into an integer hash code
• Reduce this hash code to an index range
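
The two implicit steps can be sketched as two small functions. This is only an illustration: it assumes Python's built-in `hash()` as the hash-code step, and the helper names `hash_code` and `to_index` are hypothetical.

```python
def hash_code(key):
    """Step 1: turn any hashable key into a non-negative integer hash code."""
    # Assumption: Python's built-in hash() stands in for the hash-code step.
    return hash(key) & 0xFFFFFFFF   # keep the low 32 bits so the code is non-negative


def to_index(code, table_size):
    """Step 2: reduce the hash code to the index range 0..table_size-1."""
    return code % table_size


M = 97
for key in (True, 42, 3.14, ("x", 1)):
    print(key, to_index(hash_code(key), M))
```

Note that Python randomizes string hashes per process by default, so indices for string keys are only deterministic within a single run.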

Some Pitfalls
• Worst (yet functional) hash function

• Problem
• Can only insert one key
• All the following keys will collide, regardless of their value

• Solution
• Need to distinguish keys by value
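
The slide's own code is not preserved in this text. As an assumed stand-in, the sketch below uses a constant hash function, which is deterministic and fast but ignores the key, and compares it with a function that at least looks at the key. The names `constant_hash` and `spread` are hypothetical.

```python
def constant_hash(key):
    """A deliberately poor hash: valid, deterministic, fast, but independent of the key."""
    return 42


def spread(hash_fn, keys, table_size):
    """Return the set of distinct table indices that the keys land on."""
    return {hash_fn(k) % table_size for k in keys}


keys = ["apple", "banana", "cherry", "date"]
print(spread(constant_hash, keys, 97))   # a single index for every key
print(spread(len, keys, 97))             # using the length already separates some keys
```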


Key Subset Selection


• Possible scenario
• Keys are phone numbers: e.g., (530) 752-1011
• Hash table is of size 1000
• Idea: use first three digits as a direct index?
• hash("(530) 752-1011") = 530

• Problem
• Phone numbers with the same area code will collide
• Solution
• Select another set of 3 digits that shows better "random" properties (e.g.,
the last three)
• Or consider the entire phone number instead
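
A small sketch of the three choices above. Apart from the number given on the slide, the phone numbers are made up, and the function names and the prime 997 used for the whole-number variant are assumptions for illustration.

```python
def digits(phone):
    """Keep only the digit characters of a formatted phone number."""
    return "".join(ch for ch in phone if ch.isdigit())


def hash_area_code(phone):
    """First three digits as a direct index (collides for a shared area code)."""
    return int(digits(phone)[:3])


def hash_last_three(phone):
    """Last three digits as a direct index."""
    return int(digits(phone)[-3:])


def hash_whole_number(phone, modulus=997):
    """Whole number reduced modulo a prime just below the table size of 1000."""
    return int(digits(phone)) % modulus


phones = ["(530) 752-1011", "(530) 752-2034", "(530) 754-8888"]
print([hash_area_code(p) for p in phones])
print([hash_last_three(p) for p in phones])
print([hash_whole_number(p) for p in phones])
```

Reducing the whole number modulo 1000 would give exactly the last three digits again, which is why the sketch uses a prime modulus for that variant.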


Hashing Positive Integers


• Key is a simple unsigned integer in the range 0..K-1
• Array index is always an unsigned integer in the range 0..M-1
• Hashing is done via modulo operation


Consideration about table size (1/2)


• Does the size of the hash table matter?
• Same scenario as previous slide
• Keys are random numbers between 0 and
999
• Distribution of keys is uniform
• Hashing of 25 keys
• But hash table can either be of size 100 or
size 97

• Conclusion
• Table size is not critical
• If distribution of keys is uniform


Consideration about table size (2/2)


• Does the size of the hash table matter?
• Same scenario as previous slide
• Keys are random numbers between 0 and
999
• Distribution of keys is non-uniform
• Hashing of 25 keys
• But hash table can either be of size 100 or
size 97

• Conclusion
• Table size becomes critical
• If distribution of keys is non-uniform
• Prime numbers exhibit good properties
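
The effect of a non-uniform key distribution can be checked with a tiny experiment. The key set below (25 multiples of 40 between 0 and 999) is an assumed example of a non-uniform distribution, not the data behind the slide's plots.

```python
def distinct_slots(keys, table_size):
    """Number of different indices used when hashing with key % table_size."""
    return len({k % table_size for k in keys})


keys = list(range(0, 1000, 40))      # 25 keys sharing the factor 40: non-uniform
print(len(keys))
print(distinct_slots(keys, 100))     # 40 and 100 share a factor, so few slots are hit
print(distinct_slots(keys, 97))      # 97 is prime, so the pattern is broken up
```

With table size 100 the 25 keys fall into only 5 slots, while with the prime size 97 every key gets its own slot.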


Hashing Strings
• Naïve approach: Sum ASCII value of each character and then
reduce to table size

• Observations:
• Average word length is 4.5. 4.5*127 ≈ 572
• Longest word is 45 letters long. 45*127 = 5715
• Issues:
• Assuming a hash table of size 10,000
• Indices up to 500 will be crowded
• Indices above 5000 will never be used
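
A minimal sketch of the naive additive approach, with the hypothetical name `additive_hash`. Besides crowding the low indices, it cannot tell anagrams apart, because the sum ignores character order.

```python
def additive_hash(word, table_size):
    """Naive string hash: sum of the character codes, reduced to the table size."""
    return sum(ord(ch) for ch in word) % table_size


for word in ("listen", "silent", "time"):
    print(word, additive_hash(word, 10000))
```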


Hashing Strings
• Experiment


Hashing Strings
• Better approach: For each character, multiply the running hash by 31 and add
the ASCII value of the character, and then reduce to table size

• Horner’s method
• Consider string $s$ of length $L$ as a polynomial evaluated at 31:
$h = s[0] \cdot 31^{L-1} + s[1] \cdot 31^{L-2} + \dots + s[L-1] \cdot 31^{0}$

• Makes the distribution more uniform


• If unsigned hash value gets too big, it wraps around, so overflow is controlled
• 31 is an interesting prime number (a Mersenne prime, like 127 or 8191)
• Multiplying item by 31 can be done by multiplying by 32 (shifting <<5) and subtracting item
• Experiments have shown good distribution of indices overall
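
A short sketch of the multiply-by-31 scheme evaluated with Horner's method. The function name `horner_hash` is hypothetical, and the 32-bit mask is an assumption that emulates the unsigned wraparound a fixed-width integer would give (Python integers do not overflow on their own).

```python
def horner_hash(s, table_size, base=31):
    """Polynomial string hash evaluated with Horner's method, then reduced."""
    h = 0
    for ch in s:
        # Same as ((h << 5) - h) + ord(ch); the mask emulates 32-bit unsigned wraparound.
        h = (base * h + ord(ch)) & 0xFFFFFFFF
    return h % table_size


for word in ("ab", "ba", "time"):
    print(word, horner_hash(word, 10000))
```

Unlike the additive hash, character order now matters, so "ab" and "ba" get different hash codes.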

Collisions
• Uniform Hashing:
• We assume that the hash function has a uniform distribution
• Keys are mapped as evenly as possible over the index range
• Collision Probability:
• Assuming an array index in the range 0..M-1
• After how many hashed keys will there be the first collision?

• Experiments:
• Example: output range of 97
• 10 hashes to observe the first collision
• Now, what if the output range was 223?
• Need to restart a simulation.
• Or need for better mathematical tools!
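
The experiment can be repeated with a small simulation. This sketch assumes that a uniform hash function can be modelled by drawing uniformly random indices; the function names, seed and trial counts are arbitrary illustrative choices.

```python
import random


def keys_until_first_collision(table_size, rng):
    """Draw random indices until one repeats; return how many draws that took."""
    seen = set()
    count = 0
    while True:
        count += 1
        index = rng.randrange(table_size)
        if index in seen:
            return count
        seen.add(index)


def average_first_collision(table_size, trials, seed):
    """Average number of hashed keys until the first collision over many trials."""
    rng = random.Random(seed)
    total = sum(keys_until_first_collision(table_size, rng) for _ in range(trials))
    return total / trials


for size in (97, 223):
    print(size, average_first_collision(size, trials=2000, seed=2022))
```

A single run (such as the 10 hashes observed on the slide) varies a lot, so averaging over many trials gives a steadier picture for each output range.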


Collisions
• Birthday paradox:
• In a group of random people, what is the probability that two people
have the same birthdate, assuming a year consists of 365 days?
• (pigeonhole principle) 100% if group counts 366 people
• Surprisingly it is 50% if group consists of 23 people only
• And 99.9% if number of persons is 70
• Collision Probability:
• Probability of at least two people having the same birthdate = 1 – probability
that all birthdates are distinct
• Consider $M$ containers and assume $n$ balls are to be randomly placed in
these containers
• There are $M$ options for the first ball to sit
• For the second ball, $M-1$ options for 2 distinct containers
• For the third ball, $M-2$ options for 3 distinct containers
• Continuing, for the $n$-th ball, $M-n+1$ options for $n$ distinct containers
• Without considering separate containers, the number of possibilities is $M^n$

Collisions
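• Collision probability for $n$ keys hashed uniformly into $M$ slots:
$p(n) = 1 - \frac{M (M-1) \cdots (M-n+1)}{M^n} = 1 - \prod_{i=1}^{n-1} \left(1 - \frac{i}{M}\right)$

• Approximation
• Remember that $e^{x} \approx 1 + x$, for small $x$
• Hence $1 - \frac{i}{M} \approx e^{-i/M}$ and $p(n) \approx 1 - e^{-n(n-1)/(2M)} \approx 1 - e^{-n^2/(2M)}$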

Collisions
• Collision probability:
• Number of hashed keys to first collision is on the order of $\sqrt{M}$
• For an output range of 97:
• Our experiments showed first collision at 10th record
• For an output range of 223:
• Observation:
• Unless table size is quadratically bigger than necessary it is impossible to
avoid collisions
• Conclusion:
• Need to deal with collisions gracefully
• Two options:
• Separate chaining: couple the hash table with another container
• Open addressing: put the key at another index in case of collision


Separate Chaining
• Principle:
• Hash table is an array of linked-lists
• Keys are still hashed to array index in the range 0..M-1

• Insertion: key pushed in the chain


• Search: only need to search key in the chain
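
The principle can be sketched in a few lines of Python, using a list of lists for the chains. This is not the lecture's code (which is not preserved in this text); the class name `ChainedHashTable`, its method names and the default size are hypothetical.

```python
class ChainedHashTable:
    """Hash table with separate chaining; each slot holds a Python list (the chain)."""

    def __init__(self, size=11, hash_fn=hash):
        self.size = size
        self.hash_fn = hash_fn                      # a custom hash function can be attached
        self.chains = [[] for _ in range(size)]

    def _chain(self, key):
        # Python's % with a positive modulus always yields an index in 0..size-1.
        return self.chains[self.hash_fn(key) % self.size]

    def insert(self, key):
        chain = self._chain(key)
        if key not in chain:                        # avoid storing duplicates
            chain.append(key)

    def contains(self, key):
        return key in self._chain(key)              # only one chain is searched

    def remove(self, key):
        chain = self._chain(key)
        if key in chain:
            chain.remove(key)
            return True
        return False


table = ChainedHashTable()
for key in (5, 16, 27, 3):
    table.insert(key)
print(table.chains)
```

Keys 5, 16 and 27 all hash to index 5 in a table of size 11, so they end up in the same chain, and a search for any of them only walks that chain.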


Separate Chaining
• Hash table of chains, as a
list of lists in python

• Python’s internal hash functionality


Separate Chaining
• Defining a custom hash function

• Attaching it to the Hash Table object

• Inserting a key

• Finding a key


Separate Chaining
• Deleting a key from the hash table

• Helper code to print the hash table


Separate Chaining
• Driver code (1/2)


Separate Chaining
• Driver code (2/2)


Separate Chaining - Performance analysis


• Hashing key to array index: $O(1)$
• Searching key in chain: proportional to the length of the chain
• If using poor hash function, such as GalaxyHashCode()
• One chain can end up containing all the $N$ items
• Runtime complexity of operations: $O(N)$!
• If using hash function providing uniform distribution
• On average, chains will be of length $N/M$
• $N/M$ is also known as the load factor
• Runtime complexity of operations: $O(N/M)$
• If load factor is maintained to be constant, then complexity also becomes
constant
• Number of items in hash table cannot be controlled
• But number of chains can be ...


Separate Chaining - Resizing


Resizing - Rehashing
• Increase the size of hash table
• Rehash every item into new table

• Load factor = 7/5=1.4

• Load factor = 7/11 ≈ 0.64



Separate Chaining - Resizing


• Implementation


Separate Chaining - Resizing


• Implementation

Complexity
• Resizing happens infrequently
• Amortized time complexity: O(1)
• Ideally, resize to another prime
number
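
Resizing with rehashing might look as follows. The slide's implementation is not preserved here, so this is only a sketch under assumptions: the table grows when the load factor exceeds 1.0, the new size is the first prime at or above twice the old size plus one, and all names (`next_prime`, `ResizingChainedTable`) are hypothetical.

```python
def next_prime(n):
    """Smallest prime >= n, by trial division (adequate for small table sizes)."""
    def is_prime(x):
        if x < 2:
            return False
        i = 2
        while i * i <= x:
            if x % i == 0:
                return False
            i += 1
        return True

    while not is_prime(n):
        n += 1
    return n


class ResizingChainedTable:
    def __init__(self, size=5, max_load=1.0):
        self.size = size
        self.max_load = max_load
        self.count = 0
        self.chains = [[] for _ in range(size)]

    def load_factor(self):
        return self.count / self.size

    def insert(self, key):
        chain = self.chains[hash(key) % self.size]
        if key in chain:
            return
        chain.append(key)
        self.count += 1
        if self.load_factor() > self.max_load:
            self._resize(next_prime(2 * self.size + 1))

    def contains(self, key):
        return key in self.chains[hash(key) % self.size]

    def _resize(self, new_size):
        """Allocate a bigger table and rehash every stored key into it."""
        old_chains = self.chains
        self.size = new_size
        self.chains = [[] for _ in range(new_size)]
        for chain in old_chains:
            for key in chain:
                self.chains[hash(key) % new_size].append(key)


table = ResizingChainedTable(size=5)
for key in range(0, 21, 3):          # 7 keys
    table.insert(key)
print(table.size, round(table.load_factor(), 2))
```

Every key must be rehashed because its index depends on the table size. A single resize costs time proportional to the number of keys, but it happens rarely enough that the cost per insertion stays constant on average.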


Separate Chaining - Conclusion


Complexity
• Proportional to load factor. Typically load factor is maintained at 1.0

• Pros
• Very effective if keys are uniformly distributed
• Cons
• Relies on another data structure, such as lists, to implement the chains
• Note that the chains could also be implemented with balanced trees!
• Cost associated with this additional data structure
• Memory management, traversal, etc.


Open Addressing
Principle
• We are not going to chain collisions
• Maximum number of keys that can fit is same as the hash table size
• Collided keys are inserted in the same flat hash table, but at the next
available index
• Formally: Key $k$ is inserted at the first index available among $h_0(k), h_1(k), h_2(k), \dots$

• where $h_i(k) = (\mathrm{hash}(k) + f(i)) \bmod M$, with $f(0) = 0$
• Function $f$ is the collision resolution strategy
• Linear probing, quadratic probing etc.
• Operations
• Insert: Keep probing until an empty index is found
• Search: Keep probing until key is found
• Or until encountering an empty available index (i.e., key is not found)

Linear Probing
Principle
• The collision resolution function $f$ is a linear function of the probe number $i$
• Typically $f(i) = i$
• In case of collision, try next indices sequentially
Example
• Hash table of size M = 17
• Insert sequence of characters
Observations
• Keys can form clusters of contiguous blocks
• Caused by multiple collisions
• Keys hashed into a cluster might require multiple attempts to be
inserted
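
Insertion and search with linear probing can be sketched as below. This is not the lecture's implementation; the class name `LinearProbingTable` is hypothetical, and `None` is assumed to mark an empty slot, so `None` itself cannot be stored as a key.

```python
class LinearProbingTable:
    """Open addressing with linear probing; None marks an empty slot."""

    def __init__(self, size=17):
        self.size = size
        self.slots = [None] * size
        self.count = 0

    def _probe(self, key):
        """Yield hash(key), hash(key)+1, ... wrapping around the table once."""
        start = hash(key) % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def insert(self, key):
        for index in self._probe(key):
            if self.slots[index] is None:       # first free index: store the key here
                self.slots[index] = key
                self.count += 1
                return index
            if self.slots[index] == key:        # key already present
                return index
        raise RuntimeError("hash table is full")

    def contains(self, key):
        for index in self._probe(key):
            if self.slots[index] is None:       # reached an empty slot: key is absent
                return False
            if self.slots[index] == key:
                return True
        return False


table = LinearProbingTable(size=17)
print([table.insert(key) for key in (1, 18, 35, 2)])
```

Keys 1, 18 and 35 all hash to index 1 and form a cluster at indices 1 to 3, so key 2, whose home index is 2, is pushed to index 4 even though it never collided with an equal hash value.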


Linear Probing - Implementation

Same hash function as for separate chaining implementation


Linear Probing - Implementation

Try to maintain a load factor of at most 0.5


Deletion Principle
Scenario
• Assuming following hash table
• Cluster of three keys
• How to remove key U?
• Key V should still be reachable
Problems
• If the index of U is simply freed, V is not reachable anymore
• Hash(V) points to an available entry
• Same issue for key U if key 3 is removed
Solution
• Free the key's index upon removal
• But rehash every key of the cluster located directly after the removed key
• V would be reinserted at index 1, and stay reachable
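
The removal strategy above can be sketched with three helper functions over a plain Python list. The names are hypothetical, `None` is assumed to mark an empty slot, and the table is assumed to always keep at least one empty slot (as it does with a load factor of at most 0.5), otherwise the probing loops would not terminate.

```python
def lp_insert(slots, key):
    """Linear-probing insert; assumes at least one empty (None) slot exists."""
    m = len(slots)
    i = hash(key) % m
    while slots[i] is not None:
        if slots[i] == key:
            return i
        i = (i + 1) % m
    slots[i] = key
    return i


def lp_contains(slots, key):
    m = len(slots)
    i = hash(key) % m
    while slots[i] is not None:
        if slots[i] == key:
            return True
        i = (i + 1) % m
    return False


def lp_remove(slots, key):
    """Free the key's slot, then rehash the rest of the cluster that follows it."""
    m = len(slots)
    i = hash(key) % m
    while slots[i] is not None and slots[i] != key:
        i = (i + 1) % m
    if slots[i] is None:
        return False                    # key was not in the table
    slots[i] = None
    i = (i + 1) % m
    while slots[i] is not None:         # walk the remainder of the cluster
        displaced = slots[i]
        slots[i] = None
        lp_insert(slots, displaced)     # reinsert so it stays reachable
        i = (i + 1) % m
    return True


slots = [None] * 17
for key in (1, 18, 35):
    lp_insert(slots, key)
lp_remove(slots, 18)
print(slots[:5], lp_contains(slots, 35))
```

After removing 18, key 35 is reinserted one position earlier, so a search starting at its home index still reaches it without crossing an empty slot.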


Deletion Implementation


Linear Probing - Conclusion


Complexity
• Difficult complexity analysis
• Often simplified to $O(1)$

Pros
• No overhead for memory management
• Locality of reference, especially if multiple consecutive items can be
in the same cache line

Cons
• Requires low load factor (0.5) compared to separate chaining
• Bigger hash table
• Causes primary clustering


Open Addressing Strategies


Linear probing

• If index at $h$ is taken, try $h+1$, then $h+2$, … , $h+i$ (all modulo $M$)

Quadratic probing

• If index at $h$ is taken, try $h+1^2$, then $h+2^2$, … , $h+i^2$ (all modulo $M$)

Double hashing
• Use a second hash function to compute intervals between probes
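
The three strategies differ only in the offset added at the i-th probe, which the following sketch makes explicit. The function names are hypothetical, `h` is the home index, `m` the table size, and `step` stands for the value of an assumed second hash function (it must be non-zero, and should share no factor with `m` so that every slot can be reached).

```python
def linear_probes(h, m, count):
    """Indices h, h+1, h+2, ... (mod m)."""
    return [(h + i) % m for i in range(count)]


def quadratic_probes(h, m, count):
    """Indices h, h+1^2, h+2^2, ... (mod m)."""
    return [(h + i * i) % m for i in range(count)]


def double_hash_probes(h, step, m, count):
    """Indices h, h+step, h+2*step, ... (mod m), with step from a second hash."""
    return [(h + i * step) % m for i in range(count)]


print(linear_probes(5, 17, 5))
print(quadratic_probes(5, 17, 5))
print(double_hash_probes(5, 3, 17, 5))
```

Linear probing keeps colliding keys next to each other, quadratic probing spreads them with growing gaps, and double hashing gives keys with the same home index different probe sequences whenever their second hash values differ.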


Thank You
