Hashing 2

Uploaded by Zunaira Amjad

Hashing

Data Structure
Hashing
• Traditional Searching Methods
• Sequential Search: Searches elements one by one, using key comparison to find
the target, with a time complexity of O(n).
• Binary Search: Divides the search space in half each time and uses key comparison
to locate the element, achieving a time complexity of O(log n).
• Binary Search Tree: Decides the search direction based on key comparison at each
node.
• Hashing as a Different Approach:
• Direct Access: Hashing calculates the position of an element directly based on the
key’s value, reducing search time to O(1) ideally, without needing comparisons.
• A hash function h is used to transform a key K (e.g., string, number, or record)
into a table index.

Hashing
For Example
• All the letters of the variable name can be added together and the sum
can be used as an index. In this case, the table needs 3,782 cells (for a
variable K made out of 31 letters “z,” h(K) = 31 · 122 = 3,782).
• But even with this size, the function h does not return unique values.
For example, h(“abc”) = h(“cba”). This problem is called a collision.
• Sum of "abc" = 97 + 98 + 99 = 294 and sum of "cba" = 99 + 98 + 97 = 294.
• The worth of a hash function depends on how well it avoids collisions.
Avoiding collisions can be achieved by making the function more
sophisticated, but this sophistication should not go too far because the
computational cost in determining h(K) can be prohibitive, and less
sophisticated methods may be faster.
Hash Function
• Perfect Hash Function: A hash function that maps every distinct key to a
unique table position, avoiding collisions entirely.
• To create a perfect hash function, the table size must at least match the
number of unique elements; however, this isn’t always practical as the
number of elements may not be known in advance.
• A good hash function should have the following properties:
• Efficiently computable.
• Should uniformly distribute the keys (each table position is equally likely for each key).
• Should minimize collisions.
• Should have a low load factor (number of items in the table divided by the size of
the table).
Different types of hash functions
1. Division
• In the Division method for hashing, the key 𝐾 is divided by the table size (or a chosen divisor) and the
remainder is used as the hash index. This guarantees that the hash function returns a number within
the valid index range of the table.
• Formula: h(K) = K mod TSize, where TSize is the size of the hash table.
• Prime Divisor: It is generally recommended to use a prime number for TSize. This choice minimizes
patterns in the keys that could cause clustering (multiple keys hashing to the same index).
• Alternative for Non-Prime Table Sizes: If a non-prime TSize is used, choosing a prime number p > TSize
can be effective by computing h(K) = (K mod p) mod TSize.
• Why Primes are Preferred:
• Prime divisors reduce the risk of collisions, as they help spread out the indices generated by the hash function.
Non-prime divisors can work if they are carefully chosen, such as when they lack small prime factors (generally,
divisors with factors greater than 20 perform better).
• When to Use Division Hashing:
• The division method is preferred when there is minimal information about the distribution of keys. It is a simple,
effective approach with minimal computational overhead.
• Data (keys) = 100, 200, 300, 400, 500
• TSize = 10
• All keys map to index 0

• TSize = 7
• h(100) = 100 mod 7 = 2
• h(200) = 200 mod 7 = 4
• h(300) = 300 mod 7 = 6
• h(400) = 400 mod 7 = 1
• h(500) = 500 mod 7 = 3
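The division method can be sketched in Python with the keys and table sizes from the slide:

```python
def h_division(key, tsize):
    """Division-method hash: the remainder of the key divided by the table size."""
    return key % tsize

keys = (100, 200, 300, 400, 500)
# With TSize = 10, every key is a multiple of 10, so all of them collide at index 0.
print([h_division(k, 10) for k in keys])   # [0, 0, 0, 0, 0]
# With the prime TSize = 7, the same keys spread across the table.
print([h_division(k, 7) for k in keys])    # [2, 4, 6, 1, 3]
```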
2. Folding
• In the Folding method for hashing, the key is divided into multiple
parts, which are then "folded" together through a series of
transformations to form the hash index.
• This method is particularly useful for keys that are long numbers or
strings, like social security numbers (SSNs), by breaking them down
into manageable parts and then combining these parts to get a final
hash address.
• There are two types of folding:
• shift folding and
• boundary folding.
Shift Folding
• In shift folding, the parts of the key are simply added together, and the sum
(reduced modulo TSize) serves as the hash address.
• Example: For the SSN 123-45-6789 split into "123", "456", and "789", the sum is
123 + 456 + 789 = 1368.
Boundary Folding
• In this method, the key parts are alternately reversed to prevent
patterns that may emerge if the data has certain ordering structures.
• Example: Using the same SSN parts, "123", "456", and "789":
• Start with 123.
• Reverse the second part, making it 654.
• Add the third part normally, 789.
• The resulting sum is 123 + 654 + 789 = 1566, which can be processed modulo
TSize to get the final hash.
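Both folding variants can be sketched in Python, using the SSN parts from the example and a hypothetical table size of 1,000 for the final modulo step:

```python
def shift_fold(parts, tsize):
    # Shift folding: add the parts as-is, then reduce modulo the table size.
    return sum(int(p) for p in parts) % tsize

def boundary_fold(parts, tsize):
    # Boundary folding: reverse every second part before adding.
    total = sum(int(p[::-1]) if i % 2 else int(p) for i, p in enumerate(parts))
    return total % tsize

parts = ["123", "456", "789"]        # SSN 123-45-6789 split into three parts
print(shift_fold(parts, 1000))       # (123 + 456 + 789) mod 1000 = 368
print(boundary_fold(parts, 1000))    # (123 + 654 + 789) mod 1000 = 566
```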
• A bit-oriented version of shift folding is obtained by applying the
exclusive-or operation (^) to the parts of the key.
Bit-Oriented Folding
• For bit patterns instead of numeric values, bitwise operations like exclusive OR
(XOR) are used to combine parts. This is common with strings or binary data.
• String XOR Example: For a string like "abcd", each character is XOR’d to produce a
hash:
• For example, for the string “abcd,” h(“abcd”) = “a” ^ “b” ^ “c” ^ “d”.
• For larger strings, chunks equal to the number of bytes in an integer are XOR’d together
(e.g., "ab" XOR "cd").
• Advantages and Use Cases
• Flexibility: Folding allows flexibility in handling various key formats (e.g., numbers, strings).
• Collision Handling: Folding, especially boundary folding, helps reduce collision chances by
mixing key parts effectively.
• Efficiency: This method is fast, especially for fixed-size data, and can be enhanced with
bitwise operations for better performance.
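A sketch of the per-character form of XOR folding (for longer strings, the slide's chunked variant would XOR integer-sized chunks instead):

```python
def xor_fold(s, tsize):
    # Fold the character codes together with exclusive-or, then reduce mod TSize.
    acc = 0
    for ch in s:
        acc ^= ord(ch)
    return acc % tsize

print(xor_fold("abcd", 1024))   # "a" ^ "b" ^ "c" ^ "d" = 97 ^ 98 ^ 99 ^ 100 = 4
```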
3. Mid-Square Function
• The Mid-Square Hashing Method is a hashing technique where the
key is squared, and the middle portion of the resulting number is
used as the hash address.
• This method is effective because it involves all parts of the key in
generating the address, increasing the chances that unique keys will
generate unique hash values.
• Let’s go through the key points of this method and an example for
clarity.
Key Points of the Mid-Square Method
• Squaring the Key: The key (a number) is squared, resulting in a large
number.
• Extracting the Middle Part: The middle portion of this squared value is
taken as the hash address. This reduces the effect of any patterns in the
key's original format, making it more suitable for unique address
generation.
• Preprocessing for Strings: If the key is a string, it needs to be converted to
a numeric value (often using methods like folding) before applying the
mid-square approach.
• Table Size and Power of 2: It’s often more efficient to choose a power of
2 for the table size. This makes it easier to extract the middle part of the
binary representation of the squared key using bitwise operations.
Example of the Mid-Square Method
Suppose the key is 3,121, and we have a table with 1,000 cells.
1. Square the Key:
3,121^2 = 9,740,641
2. Extract the Middle Portion:
For a 1,000-cell table, we need three digits in the hash address. We take the middle three digits of
9,740,641, which are 406.
Therefore, h(3,121)=406.
This middle section gives us the hash address, and it falls within the range of our table.
3. Binary Extraction (for Power of 2 Table Size):
Now, let’s assume the table size is 1,024 (a power of 2). In binary, 9,740,641 is represented as:
100101001010000101100001

The middle section, 0101000010, corresponds to 322 in decimal. Thus: h(3,121)=322


By using bitwise operations like masking and shifting, we can efficiently extract this middle section.
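For a power-of-2 table size, the middle bits can be extracted with shifting and masking; a minimal sketch that reproduces the slide's numbers:

```python
def mid_square(key, tsize):
    # Mid-square hash; tsize is assumed to be a power of 2, and key**2 is assumed
    # to have at least as many bits as the index needs.
    sq = key * key
    index_bits = tsize.bit_length() - 1          # e.g. 10 bits for TSize = 1024
    shift = (sq.bit_length() - index_bits) // 2  # discard low-order bits on the right
    return (sq >> shift) & (tsize - 1)           # mask keeps the middle index_bits bits

print(mid_square(3121, 1024))   # 3121^2 = 9,740,641; its middle 10 bits give 322
```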
Advantages of the Mid-Square Method
● Unique Address Generation: Since every bit of the key participates in forming the
address, the mid-square method can help produce a wide range of addresses,
reducing collisions.
● Simplicity: The method is straightforward and easy to implement.

● Flexibility: It can be adapted to both numerical and string-based keys (once strings
are preprocessed).
Use Cases
The mid-square method is suitable in scenarios where:
● There’s a high probability of patterns in the keys, such as sequential or similar keys,
where mid-square can help spread out the addresses.
● You need a straightforward and fast hashing function that can be implemented using
bitwise operations for better efficiency.
Extraction Method
• The Extraction Hashing Method is a technique where only a selected
portion of the key is used to compute the hash address.
• This method is useful when the key contains redundant or
consistent portions that can be safely ignored without impacting the
uniqueness of the hash.
• By focusing on only a part of the key, this method simplifies the
hashing process while ensuring that the key’s uniqueness is still
captured effectively.
Key Points of the Extraction Method
Partial Key Use: Instead of using the entire key, only a part of it is used
to calculate the hash address.
Choosing the Right Portion: The chosen part of the key should be
enough to ensure unique addresses. In some cases, certain parts of the
key can be omitted because they are either constant or redundant for
the context of the dataset.
Handling Redundant Information: In certain contexts, such as
standardized IDs or codes, portions of the key that are the same for all
entries can be excluded from the hash function.
Example of the Extraction Method
Consider a Social Security Number (SSN): 123-45-6789
Here are several ways the extraction method can be applied:
● First Four Digits: You could use the first four digits, 1234, to generate
the hash address.
● Last Four Digits: You could alternatively use the last four digits, 6789,
for the hash address.
● Combined Portions: Another approach might be to combine the first
two digits with the last two digits, yielding 1289.
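The combined-portions variant can be sketched in Python (the table size of 10,000 is a hypothetical choice, not from the slide):

```python
def extract_hash(ssn, tsize):
    # Combine the first two and last two digits of the SSN, then reduce mod TSize.
    digits = ssn.replace("-", "")
    return int(digits[:2] + digits[-2:]) % tsize

print(extract_hash("123-45-6789", 10000))   # "12" and "89" combine to 1289
```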
Advantages of the Extraction Method
Efficiency: By using only a relevant portion of the key, the extraction
method can speed up hashing and reduce unnecessary calculations.
Simplicity: The method is easy to implement and doesn’t require
complex operations.
Contextual Relevance: It works well when certain parts of the key are
predictable or redundant, making it ideal for structured keys such as IDs
or codes.
Space Optimization: By omitting unnecessary portions, the hash
function can potentially reduce memory usage and speed up
computation.
Use Cases
The Extraction Method is particularly useful in situations where the key contains
repetitive or predictable portions. Here are some common use cases:
● Structured ID Systems: When dealing with structured IDs, such as university
student numbers, where some digits are common to all members of a group.
● Product Codes or ISBNs: In databases that store product codes or ISBNs from
the same publisher, where parts of the code are identical for all products.
● Employee IDs or National Identification Numbers: In organizations where
employee IDs or national identification numbers share common prefixes
based on regions, departments, or other groupings.
Radix Transformation
The radix transformation is a method where we transform the original key K into a
different numerical base (or radix) before hashing.

Example of Radix Transformation:
Given key K = 345 (decimal).
Converting K to base 9 (nonary) gives K = 423 in base 9.

Hashing with Transformed Key
The transformed base-9 value (423) is used in hashing.

Hash function:
h(K) = 423 mod TSize, where TSize is the hash table size.

Collisions can still occur even after radix transformation.
Choosing an appropriate radix and table size may reduce but not eliminate collisions.
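A sketch of radix transformation in Python (the table size of 100 is a hypothetical choice for the example):

```python
def radix_transform(n, base):
    # Write n in the given base, then read the digit string back as a decimal number.
    digits = ""
    while n:
        digits = str(n % base) + digits
        n //= base
    return int(digits) if digits else 0

def radix_hash(key, base, tsize):
    return radix_transform(key, base) % tsize

print(radix_transform(345, 9))   # 345 decimal is 423 in base 9
print(radix_hash(345, 9, 100))   # 423 mod 100 = 23
```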
Universal Hash Functions
• Used when minimal information is available about key distribution.
• A universal class of functions is expected to evenly distribute keys
with low collision probability.
• Randomized function selection ensures that collisions between
distinct keys are rare.
What is Collision?
• A collision in hashing occurs when two different keys map to the same
hash value. Hash collisions can be intentionally created for many
hash algorithms. The probability of a hash collision depends on the
size of the hash table, the distribution of hash values, and the
quality of the hash function.
Collision
• Let us consider a hash function:
• key mod Tsize
• Where Tsize=7
• Sequence of keys that are to be inserted: 34, 45, 56, 78, 92.
• Insert 34: 34 mod 7 = 6
• Insert 45: 45 mod 7 = 3
• Insert 56: 56 mod 7 = 0
• Insert 78: 78 mod 7 = 1
• Insert 92: 92 mod 7 = 1 (collision with 78)
Collision Resolution
There are mainly two methods to handle collision:
1. Separate Chaining
2. Open Addressing
Open Addressing
• In the open addressing method, when a key collides with another key,
the collision is resolved by finding an available table entry other than
the position (address) to which the colliding key is originally hashed.
• If position h(K) is occupied, then the positions in the probing sequence
are tried until either an available cell is found, the same positions have all
been tried, or the table is full:
norm(h(K) + p(1)), norm(h(K) + p(2)), . . . , norm(h(K) + p(i)), . . .
• Function p is a probing function, i is a probe, and norm is a
normalization function, most likely, division modulo the size of the
table.
Linear Probing
• The simplest method is linear probing, for which p(i) = i, and for the
ith probe, the position to be tried is (h(K) + i) mod TSize.
• Simple form where p(i) = i, so each probe checks the next position
sequentially.
• Searches continue in a loop if the end of the table is reached.
• Clustering Issue: Consecutive occupied cells create clusters, making
future insertions slower.
Collision resolution using linear probing
• Let us consider a hash function:
• key mod Tsize
• Where Tsize=7
• Sequence of keys that are to be inserted: 34, 45, 56, 78, 92.
• Insert 34: 34 mod 7 = 6
• Insert 45: 45 mod 7 = 3
• Insert 56: 56 mod 7 = 0
• Insert 78: 78 mod 7 = 1
• Insert 92: 92 mod 7 = 1 (collision with 78; linear probing places 92 at index 2)
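The insertion sequence from the slide can be sketched with a small linear-probing routine:

```python
def linear_probe_insert(table, key):
    # Try h(K), h(K)+1, h(K)+2, ... (mod TSize) until a free cell is found.
    tsize = len(table)
    home = key % tsize
    for i in range(tsize):
        slot = (home + i) % tsize
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")

table = [None] * 7
slots = [linear_probe_insert(table, k) for k in (34, 45, 56, 78, 92)]
print(slots)   # [6, 3, 0, 1, 2] -- 92 collides at index 1 and lands in index 2
```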
Linear Probing
Quadratic Probing
• When a collision occurs (i.e., two keys hash to the same index), quadratic probing
attempts to resolve the conflict by searching for the next available slot using a
quadratic function.
• The probing sequence is determined by the formula
• h(K) + i^2,
where i is the probe number (starting from 1). However, the sequence may be
adjusted to ensure that the probes do not cover only half of the table.
• For example, the sequence is symmetrical with probes like
h(K) + 1, h(K) + 4, h(K) + 9, …


• One issue with quadratic probing is secondary clustering, where keys hashed to the
same position follow the same probe sequence, potentially leading to clusters of
occupied positions. These clusters are less problematic than primary clusters (from
linear probing) but still exist.
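A sketch of the simple one-sided form of quadratic probing (the keys 1, 8, 15 are an illustrative choice, not from the slides; they all hash to the same home position, which exposes the secondary clustering described above):

```python
def quadratic_probe_insert(table, key):
    # Try h(K), h(K)+1, h(K)+4, h(K)+9, ... (mod TSize).
    tsize = len(table)
    home = key % tsize
    for i in range(tsize):
        slot = (home + i * i) % tsize
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free slot found in the probe sequence")

table = [None] * 7
# 1, 8 and 15 all hash to index 1, so they follow the same probe sequence.
slots = [quadratic_probe_insert(table, k) for k in (1, 8, 15)]
print(slots)   # [1, 2, 5]
```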
Double Hashing
• If a collision occurs after applying a hash function h(k), then another
hash function is calculated for finding the next slot.

• h(k, i) = (h1(k) + i·h2(k)) mod m, where m is the table size


Double Hashing
• Double hashing uses two hash functions.
• The first function h(K) finds the primary position,
• and the second function h_p(K) is used for resolving collisions.
• The probe sequence becomes
• h(K), h(K) + h_p(K), h(K) + 2·h_p(K), …
• This reduces secondary clustering because the second function
depends on the key, meaning different keys that hash to the same
position can follow different probe sequences.
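A sketch of double hashing; the secondary function h_p(K) = 1 + K mod (TSize − 1) is a common textbook choice, not one specified by the slides, and the keys 1, 8, 15 are illustrative:

```python
def double_hash_insert(table, key):
    tsize = len(table)
    h = key % tsize
    h_p = 1 + key % (tsize - 1)   # secondary hash; the "+ 1" keeps the step nonzero
    for i in range(tsize):
        slot = (h + i * h_p) % tsize
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free slot found in the probe sequence")

table = [None] * 7
# 1, 8 and 15 all hash to index 1, but their step sizes h_p differ (2, 3 and 4),
# so their probe sequences diverge after the first collision.
slots = [double_hash_insert(table, k) for k in (1, 8, 15)]
print(slots)   # [1, 4, 5]
```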
Load Factor
• The efficiency of these methods depends on the table size (denoted
as TSize) and the loading factor (LF), which is the ratio of the
number of elements in the table to the table size.
Separate Chaining
• Keys do not have to be stored in the table itself.
• In chaining, each position of the table is associated with a linked list
or chain of structures whose info fields store keys or references to
keys. This method is called separate chaining,
• and a table of references (pointers) is called a scatter table.
• In this method, the table can never overflow, because the linked
lists are extended only upon the arrival of new keys, as illustrated in
Figure 10.5 on the next slide.
Separate Chaining
Separate Chaining
• For short linked lists, this is a very fast method, but increasing the
length of these lists can significantly degrade retrieval performance.
• Performance can be improved by maintaining an order on all these
lists so that, for unsuccessful searches, an exhaustive search is not
required in most cases, or by using self-organizing linked lists.
• This method requires additional space for maintaining pointers.
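Separate chaining can be sketched with one list per table position, using the keys from the earlier collision example:

```python
class ChainedHashTable:
    """Separate chaining: every table position holds a (possibly empty) chain of keys."""

    def __init__(self, tsize):
        self.table = [[] for _ in range(tsize)]

    def insert(self, key):
        self.table[key % len(self.table)].append(key)

    def search(self, key):
        return key in self.table[key % len(self.table)]

t = ChainedHashTable(7)
for k in (34, 45, 56, 78, 92):
    t.insert(k)
print(t.table[1])     # [78, 92] -- colliding keys share one chain; no overflow
print(t.search(92))   # True
```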
Coalesced Hashing/Coalesced Chaining
• A variation of chaining called coalesced hashing combines the concept
of linear probing with chaining. Here, when a key collides with another
key in the table, the first available position for that key is found using
linear probing. This position is then linked to the previously collided
key.
• In coalesced hashing:
• Each table entry holds two members: info (which stores the key) and next
(which stores the index of the next key that is hashed to that position).
• The next field helps avoid a sequential search for elements in the linked list,
allowing direct access to the next colliding key.
• Available positions: Positions in the table that are available for insertion are
marked with a special value (such as -2), while -1 indicates the end of a chain.
Coalesced Hashing/Coalesced Chaining
• Coalesced hashing requires additional space for the next pointer in
each table entry. This adds TSize * sizeof(next) to the total space
required, which is less than the space needed for separate chaining
(since the scatter table only stores pointers to linked lists).
• Despite using less space than separate chaining, the size of the table
still limits the number of keys that can be hashed into it.
Overflow Area (Cellar)
• If the hash table cannot accommodate all the keys due to space limitations,
an overflow area (also known as a cellar) can be used. This area stores the
keys for which there is no room in the table.
• If implemented as a list of arrays, the overflow area can be dynamically
allocated to accommodate more keys as necessary.
Basic Operations
• Following are the primary operations of a hash table.
• Search − Searches for an element in a hash table.
• Insert − Inserts an element into a hash table.
• Delete − Deletes an element from a hash table.
Deletion in Hash Tables
• Deleting an element from a hash table depends on the collision
resolution method used. Let's explore the different scenarios,
focusing on chaining and open addressing (specifically linear
probing).
Chaining Deletion:
• In chaining, the data is stored in linked lists, so removing an element
involves deleting the corresponding node from the linked list.
• Steps for Deletion:
• Find the position using the hash function.
• If the element is found in the linked list at that position, delete the node
from the list.
• If the linked list is empty after deletion, that position can simply be marked
as empty.
• Advantage: Chaining makes deletion relatively straightforward, as no
further adjustments to the table are necessary, and it doesn't affect
other keys.
Open Addressing Deletion:
• In open addressing, elements are stored directly in the hash table,
and collisions are resolved by probing (e.g., linear probing, quadratic
probing).
• When deleting an element, extra care must be taken because the
probing sequence may be disrupted. The deletion of an element
could lead to incorrect results when searching for other elements
that were originally placed in the deleted slot.
Solution to Open Addressing Deletion:
• To avoid the problem of prematurely terminating searches due to deleted
elements, we use markers:
• When an element is deleted, instead of leaving the cell empty, we place a marker to
indicate that the slot was once occupied but is now free.
• This prevents the search from stopping prematurely, as the algorithm knows that it
should continue searching further, even if a deleted slot is encountered.
• Re-insertion: When a new element is inserted into the table and finds a slot
marked as deleted, it can overwrite the marker, effectively reusing the slot.
• Problem with Multiple Deletions: If many deletions occur and only a few new
elements are inserted, the table may become overloaded with deleted
markers. This can cause search times to increase, as each search must test
the deleted elements, leading to inefficient searching.
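The marker (tombstone) idea can be sketched for linear probing; the starting table state is the one produced by inserting 34, 45, 56, 78, 92 into a size-7 table, and the marker value is a hypothetical sentinel:

```python
DELETED = "<deleted>"   # marker for a slot that was occupied and later freed

def probe_search(table, key):
    # Linear-probe search that steps over deleted markers instead of stopping.
    tsize = len(table)
    for i in range(tsize):
        slot = (key % tsize + i) % tsize
        if table[slot] is None:       # a truly empty cell ends the search
            return None
        if table[slot] == key:
            return slot
    return None

def probe_delete(table, key):
    slot = probe_search(table, key)
    if slot is not None:
        table[slot] = DELETED         # keep the probe chain intact for other keys

table = [56, 78, 92, 45, None, None, 34]   # state after inserting 34, 45, 56, 78, 92
probe_delete(table, 78)
print(probe_search(table, 92))   # 2 -- still found, because the marker is skipped
```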
Purging the Table:
• To address the issue of numerous deleted markers, the table should
be purged after a certain number of deletions:
• During the purge operation, all undeleted elements are moved to the cells
that were previously occupied by deleted elements.
• After this process, the table is "cleaned," and the previously deleted cells
are marked as free (Figure 10.10d on the next slide).
• Goal of Purging: The goal is to reduce the overhead caused by
deleted elements and ensure that the table remains efficient in
terms of searching and insertion.
Rehashing
• As the name suggests, rehashing means hashing again. Basically,
when the load factor increases to more than its predefined value
(the default value of the load factor is 0.75), the complexity
increases. So to overcome this, the size of the array is increased
(doubled) and all the values are hashed again and stored in the new
double-sized array to maintain a low load factor and low complexity.
• Load factor:
Load Factor = Total elements in hash table / Size of hash table
Rehashing
• Rehashing is a technique used to resolve the issue of performance degradation in hash tables
when they become too full. When the table size is saturated, hash collisions increase, and
operations (insertion, deletion, and search) take longer. Rehashing involves the following:
• Table Expansion: When the table exceeds a certain load factor (a threshold indicating how full the table
is), a larger table is created. The size of the new table is typically doubled, though it can also be based on
a prime number or other methods.
• Hash Function Adjustment: When a table is resized, the hash function might need to be modified
(typically by increasing the table size), since the hash values depend on the table size.
• Rehashing Items: After resizing, all items in the old table are rehashed and inserted into the new table.
This process may involve recalculating the hash for each item to fit into the new table size.
• Dealing with Full Tables: When the table is too full and rehashing is needed, the elements are reinserted
into the new, larger table with the updated hash function. If the table fills up again, further rehashing is
required.
• Rehashing is essential to maintain efficient hash table operations even as the number of
elements grows
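The rehashing steps above can be sketched as a small open-addressed table that doubles when the load factor would exceed 0.75 (the initial size of 4 and the sample keys are illustrative choices):

```python
class RehashingTable:
    """Open-addressed table that doubles in size when the load factor passes 0.75."""
    MAX_LOAD = 0.75

    def __init__(self, tsize):
        self.table = [None] * tsize
        self.count = 0

    @staticmethod
    def _place(table, key):
        # Linear-probing insertion; the hash is key mod the current table size.
        tsize = len(table)
        for i in range(tsize):
            slot = (key % tsize + i) % tsize
            if table[slot] is None:
                table[slot] = key
                return

    def insert(self, key):
        if (self.count + 1) / len(self.table) > self.MAX_LOAD:
            new_table = [None] * (2 * len(self.table))   # table expansion: double the size
            for k in self.table:                         # rehash every old key under the new modulus
                if k is not None:
                    self._place(new_table, k)
            self.table = new_table
        self._place(self.table, key)
        self.count += 1

t = RehashingTable(tsize=4)
for k in (10, 21, 32, 43):
    t.insert(k)
print(len(t.table))   # 8 -- the fourth insert would exceed the threshold, so the table doubled
```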
Code
• I will upload code on GCR
Applications of Hash Table
• Hash tables are implemented where
• constant-time lookup and insertion are required
• hashing is needed for cryptographic applications
• data indexing is required
