Data Structure Unit II
HASHING
LAKSHMI.S,
ASSISTANT PROFESSOR IN COMPUTER SCIENCE,
SRI ADI CHUNCHANAGIRI WOMEN’S COLLEGE, CUMBUM
5 Marks :
Hashing is a technique used to efficiently store and retrieve data in computer science. It
transforms input data (keys) into a fixed-size value, which typically appears random. This
process is performed by a hash function, which takes an input and produces a hash value,
or hash code.
Key Concepts
1. Hash Function: A mathematical function that converts an input (or 'key') into a fixed-
size value. The ideal hash function minimizes collisions (when different inputs
produce the same hash value).
2. Hash Table: A data structure that uses hashing to store key-value pairs. It consists of
an array where each index corresponds to a hash value. When data is stored, its key
is hashed to determine its index in the table.
3. Collision Resolution: Since multiple keys can generate the same hash value, methods
like chaining (using linked lists) or open addressing (finding another slot) are
employed to resolve these collisions.
4. Load Factor: A measure of how full the hash table is, defined as the ratio of the
number of stored entries to the number of available slots. Maintaining an optimal load
factor improves performance.
5. Applications: Hashing is widely used in various applications, including databases for
indexing, implementing associative arrays, caching, and cryptography.
In summary, hashing provides a fast and efficient way to manage data, allowing for
quick access and modifications while minimizing the risk of collisions through effective
design strategies.
Static hashing is a method of implementing hash tables where the size of the hash table is
fixed and determined at the time of creation. This approach is straightforward and
efficient for certain use cases but comes with specific limitations.
Key Features
1. Fixed Size: The hash table has a predetermined number of slots. This means that once
the table is created, its size cannot be altered, making it suitable for applications with
a known upper limit on the number of entries.
2. Hash Function: A hash function maps keys to indices in the table. Ideally, this
function distributes keys uniformly across the available slots, minimizing collisions
(when multiple keys hash to the same index).
3. Collision Handling: In static hashing, collisions can be managed using methods such
as:
o Chaining: Each slot in the table points to a linked list (or another
data structure) that contains all entries hashing to that index.
o Open Addressing: Alternative slots are found for colliding entries based on
probing sequences (like linear or quadratic probing).
4. Performance: The performance of static hashing is influenced by the load factor. A
lower load factor (fewer entries relative to slots) typically results in faster
operations, while a higher load factor increases the chances of collisions, potentially
degrading performance.
5. Applications: Static hashing is effective in scenarios where the dataset is relatively
stable in size, such as implementing symbol tables in compilers or managing fixed-
size databases.
In summary, static hashing offers a simple and efficient way to store and retrieve data with
a fixed-size hash table, making it useful in various applications where data volume is
predictable. However, its rigidity in size can be a drawback in dynamic environments.
Dynamic hashing is an advanced technique for managing hash tables that allows for the size
of the table to grow or shrink as needed. This flexibility addresses some of the limitations of
static hashing, particularly in scenarios where the dataset can vary significantly.
Key Features
1. Expandable Structure: Unlike static hashing, dynamic hashing allows the hash
table to increase in size dynamically when the number of entries exceeds a certain
threshold. This prevents excessive collisions and maintains performance.
2. Hash Function: In dynamic hashing, the hash function may be adjusted or changed
as the table grows. This adaptability helps maintain an even distribution of keys
across the available slots.
3. Directory Structure: Dynamic hashing often employs a directory that maps hash
values to bucket locations. Each bucket can hold multiple entries, and as the
number of entries increases, new buckets can be added to the directory.
4. Collision Handling: Similar to static hashing, collisions are managed through
chaining or open addressing. However, because the structure can grow, collisions can
be more effectively resolved with additional buckets.
5. Performance: The ability to dynamically adjust the size of the hash table allows for
consistent performance in terms of insertion, deletion, and search operations. This
adaptability reduces the likelihood of performance degradation over time as the
dataset grows.
In summary, dynamic hashing provides a scalable solution for hash table management,
effectively handling varying data sizes while maintaining efficient operations. This makes it
particularly useful in applications like databases and memory management systems where
the volume of data can change frequently.
Bloom filters are a probabilistic data structure designed for efficiently testing whether an
element is a member of a set. They provide a space-efficient way to handle large datasets
while allowing for a certain rate of false positives.
Key Features
1. Bit Array: A Bloom filter is backed by an array of m bits, all initially set to 0.
2. Multiple Hash Functions: Each element is passed through k independent hash
functions, and the k corresponding bit positions are set to 1.
3. Membership Test: An element is reported as possibly present only if all k of its bit
positions are set; if any of them is 0, the element is definitely absent.
4. False Positives, No False Negatives: The filter may wrongly report that an element
is present, but it never misses an element that was actually added.
5. No Deletion: Bits cannot safely be cleared, since they may be shared with other
elements; counting Bloom filters extend the structure to support removal.
In summary, Bloom filters provide a highly efficient method for approximate membership
testing, balancing space efficiency with the trade-off of possible false positives. This makes
them a valuable tool in scenarios where speed and memory usage are critical considerations.
Fibonacci heaps are a type of heap data structure that offers efficient support for priority
queue operations. They are particularly notable for their amortized time complexity,
which makes them suitable for various applications in algorithm design.
Key Features
1. Structure: A Fibonacci heap is a collection of min-heap-ordered trees. The roots are
linked in a circular doubly linked list, and a pointer keeps track of the minimum root.
2. Amortized Time Complexity:
o Insertion: O(1) time
o Find Minimum: O(1) time
o Merge (Union): O(1) time
o Decrease Key: O(1) amortized time
o Extract Minimum: O(log n) amortized time
This efficiency is due to the lazy merging of trees and the delayed consolidation
process.
3. Lazy Consolidation: Unlike other heap structures that immediately merge trees after
every operation, Fibonacci heaps defer this consolidation. This allows operations like
decrease key to remain very efficient.
4. Potential Function: The amortized analysis leverages a potential function that
measures the “goodness” of the heap's structure. This helps ensure that despite the
occasional expensive operation, the overall sequence of operations remains
efficient.
5. Applications: Fibonacci heaps are particularly useful in network optimization
algorithms like Dijkstra’s and Prim’s algorithms due to their efficient decrease-key
operation. They are also used in applications that require frequent merging of
heaps.
In summary, Fibonacci heaps offer a powerful data structure for priority queue operations,
emphasizing efficiency through lazy consolidation and amortized analysis. Their
performance characteristics make them a strong choice for complex algorithms requiring
frequent updates to heap elements.
Pairing heaps are a type of heap data structure that offers a simple and efficient way to
implement priority queues. They are notable for their ease of implementation and
good amortized performance, especially for operations like merging and decreasing
keys.
Key Features
1. Structure: A pairing heap is a collection of trees, where each tree satisfies the min-
heap property (the parent node is less than or equal to its children). The trees are
not necessarily balanced but are combined in a simple manner.
2. Amortized Time Complexity:
o Insertion: O(1) time
o Find Minimum: O(1) time
o Extract Minimum: O(log n) amortized time
o Decrease Key: o(log n) amortized time (its exact complexity is an open
problem; it is conjectured to be close to O(1))
o Delete: O(log n) amortized time
This efficiency stems from the simple operations and the lazy merging of trees.
3. Pairing Operation: When extracting the minimum, the trees rooted at the children of
the minimum element are paired together, and the resulting trees are merged. This
pairing reduces the overall number of trees and keeps the structure manageable.
4. Simplicity: Pairing heaps are relatively easy to implement compared to other
complex heaps like Fibonacci heaps. Their operations are straightforward, which can
lead to easier maintenance and debugging.
5. Applications: Pairing heaps are effective in scenarios where frequent merging
of heaps is required, such as in graph algorithms and certain optimization
problems. They are a good alternative to Fibonacci heaps in applications that
prioritize simplicity.
In summary, pairing heaps provide an efficient and simple approach to implementing priority
queues, balancing good performance with ease of use. Their unique pairing operation and
favorable amortized complexities make them suitable for various algorithmic applications.
Symmetric min-max heaps are a specialized data structure that efficiently supports both
minimum and maximum retrieval operations, offering a balance between the two in a single
structure. They are particularly useful for scenarios where quick access to both the minimum
and maximum elements is required.
Key Features
1. Dual Structure: A symmetric min-max heap maintains both min and max properties.
The root contains the minimum element, and the children of the min nodes contain the
maximum elements, creating a balanced structure that allows efficient access to both
extremes.
2. Heap Properties:
o The even levels (starting from zero) follow the min-heap property, meaning
each parent is less than or equal to its children.
o The odd levels follow the max-heap property, meaning each parent is greater
than or equal to its children. This interleaving allows for efficient retrieval of
both minimum and maximum values.
3. Operations and Time Complexity:
o Insertion: O(log n) time
o Find Minimum: O(1) time
o Find Maximum: O(1) time
o Extract Minimum: O(log n) time
o Extract Maximum: O(log n) time
The structured organization allows for efficient operations while maintaining balance.
4. Complexity: While symmetric min-max heaps provide efficient access to both min
and max values, their implementation is more complex than standard min-heaps or
max-heaps. Careful attention is required to maintain the dual properties during
insertion and deletion operations.
5. Applications: These heaps are useful in applications that require frequent retrieval
of both the smallest and largest elements, such as in certain optimization problems,
scheduling, and decision-making algorithms.
Interval heaps are a specialized data structure designed to efficiently support a range of
operations on intervals. They are particularly useful in scenarios where elements have
associated priorities, and operations need to consider both the minimum and maximum values
of intervals.
Key Features
1. Structure: An interval heap is a complete binary tree in which each node (except
possibly the last) holds two elements a and b with a ≤ b, viewed as the interval [a, b].
2. Embedded Heaps: The left endpoints form a min-heap and the right endpoints form
a max-heap, and each node's interval is contained in its parent's interval.
3. Operations and Time Complexity:
o Find Minimum: O(1) time
o Find Maximum: O(1) time
o Insertion: O(log n) time
o Extract Minimum: O(log n) time
o Extract Maximum: O(log n) time
The efficient performance for these operations makes interval heaps suitable for
managing dynamic sets of intervals.
In summary, interval heaps provide a powerful and efficient data structure for managing
intervals, combining the properties of both min-heaps and max-heaps. Their unique structure
allows for fast retrieval and manipulation of interval endpoints, making them valuable in
various computational scenarios.
In data structures, a leftist tree (or leftist heap) is a type of binary tree designed
for efficient merging of two heaps, which is particularly useful in applications that require
frequent merge operations, like some priority queue implementations.
Insertion: Insert by creating a new node and merging it with the existing leftist tree.
Merge: The main operation, where two leftist trees are combined. This can be done in
O(log n) time because only nodes along the rightmost path need to be rearranged.
Deletion of Minimum (or Maximum): Similar to deleting the root in a heap. The
root is removed, and its left and right subtrees are merged to maintain the leftist and
heap properties.
Use Cases: Leftist trees are well suited to mergeable priority queues, for example when
two event queues or work queues must be combined quickly, since merge is their
cheapest primitive.
10 MARKS
Introduction:
A priority queue is a type of queue that arranges elements based on their priority
values. Elements with higher priority values are typically retrieved or removed before
elements with lower priority values. Each element has a priority value associated with it.
When we add an item, it is inserted in a position based on its priority value.
There are several ways to implement a priority queue, including using an array, linked list,
heap, or binary search tree, with the binary heap being the most common. The reason for
using a binary heap is simple: in binary heaps we have easy access to the minimum (in a
min heap) or maximum (in a max heap), and a binary heap, being a complete binary tree,
is easily implemented using arrays. Since we use arrays, we also gain the advantage of
cache friendliness.
Priority queues are often used in real-time systems, where the order in which elements are
processed is based not simply on who came first (or was inserted first), but on priority.
Priority queues are used in algorithms such as Dijkstra’s algorithm, Prim’s algorithm,
Kruskal’s algorithm and Huffman coding.
#include <iostream>
#include <climits>
using namespace std;

struct Item { int value, priority; };
Item pr[100000];
int idx = -1;   // index of the last element in the array

void enqueue(int value, int priority) { pr[++idx] = { value, priority }; }

// Returns the index of the highest-priority element (ties: larger value wins)
int peek() {
    int highestPriority = INT_MIN, ind = -1;
    for (int i = 0; i <= idx; i++) {
        if (highestPriority == pr[i].priority && ind > -1
            && pr[ind].value < pr[i].value) {
            highestPriority = pr[i].priority;
            ind = i;
        }
        else if (highestPriority < pr[i].priority) {
            highestPriority = pr[i].priority;
            ind = i;
        }
    }
    return ind;
}

// Function to remove the element with the highest priority
void dequeue() {
    int ind = peek();
    for (int i = ind; i < idx; i++) pr[i] = pr[i + 1];
    idx--;
}

// Driver Code
int main() {
    enqueue(10, 2); enqueue(14, 4); enqueue(16, 4); enqueue(12, 3);
    cout << pr[peek()].value << endl;
    dequeue();
    cout << pr[peek()].value << endl;
    return 0;
}
Output
16
14
2) Implement Priority Queue Using Linked List:
Binary Heap is generally preferred for priority queue implementation because heaps
provide better performance compared to arrays or LinkedList. Considering the properties of
a heap, The entry with the largest key is on the top and can be removed immediately. It will,
however, take time O(log n) to restore the heap property for the remaining keys. However if
another entry is to be inserted immediately, then some of this time may be combined with the
O(log n) time needed to insert the new entry. Thus the representation of a priority queue as a
heap proves advantageous for large n, since it is represented efficiently in contiguous storage
and is guaranteed to require only logarithmic time for both insertions and deletions.
Advantages of Priority Queue:
Efficient algorithms can be implemented. Priority queues are used in many algorithms
to improve their efficiency, such as Dijkstra’s algorithm for finding the shortest path in
a graph and the A* search algorithm for pathfinding.
Included in real-time systems. Because priority queues allow you to quickly retrieve
the highest-priority element, they are often used in real-time systems where time is of
the essence.
Disadvantages of Priority Queue:
High complexity. Priority queues are more complex than simple data structures like
arrays and linked lists, and may be more difficult to implement and maintain.
High consumption of memory. Storing the priority value for each element in a priority
queue can take up additional memory, which may be a concern in systems with limited
resources.
It is not always the most efficient data structure. In some cases, other data structures
like heaps or binary search trees may be more efficient for certain operations, such as
finding the minimum or maximum element in the queue.
At times it is less predictable. Because the order of elements in a priority queue is
determined by their priority values, the order in which elements are retrieved may be
less predictable than with other data structures like stacks or queues, which follow a
first-in, first-out (FIFO) or last-in, first-out (LIFO) order.
Single ended priority queue
A single ended priority queue allows elements to be inserted with a priority, but
examined and removed only at one end: only the minimum (or only the maximum)
element can be accessed and deleted.
Applications
Single-ended priority queues are widely used in algorithms like Dijkstra’s shortest path
algorithm and A* search algorithm, where elements need to be processed in order of priority
but are only removed one at a time from one end.
Double ended priority queue
A double ended priority queue supports operations of both max heap (a max
priority queue) and min heap (a min priority queue). The following operations are expected
from double ended priority queue.
We can try a different data structure, such as a linked list. In the case of a linked list, if
we maintain the elements in sorted order, the time complexity of every operation
becomes O(1) except insert(), which takes O(n) time.
We can try two heaps (min heap and max heap). We maintain a pointer of every max
heap element in min heap. To get minimum element, we simply return root. To get
maximum element, we return root of max heap. To insert an element, we insert in min heap
and max heap both. The main idea is to maintain one to one correspondence, so that
deleteMin() and deleteMax() can be done in O(Log n) time.
1. getMax() : O(1)
2. getMin() : O(1)
3. deleteMax() : O(Log n)
4. deleteMin() : O(Log n)
5. size() : O(1)
6. isEmpty() : O(1)
Another solution is to use self balancing binary search tree. A self balancing BST is
implemented as set in C++ and TreeSet in Java.
1. getMax() : O(1)
2. getMin() : O(1)
3. deleteMax() : O(Log n)
4. deleteMin() : O(Log n)
5. size() : O(1)
6. isEmpty() : O(1)
#include <iostream>
#include <set>
using namespace std;

// Double ended priority queue backed by a self-balancing BST (std::set)
struct DblEndedPQ {
    set<int> s;

    void insert(int x) { s.insert(x); }

    // Smallest and largest elements sit at the two ends of the BST
    int getMin() { return *(s.begin()); }
    int getMax() { return *(s.rbegin()); }

    int size() { return (int)s.size(); }
    bool isEmpty() { return s.empty(); }

    // Deletes minimum element. Works in O(Log n) time
    void deleteMin()
    {
        if (s.size() == 0)
            return;
        s.erase(s.begin());
    }
    // Deletes maximum element. Works in O(Log n) time
    void deleteMax()
    {
        if (s.size() == 0)
            return;
        auto it = s.end();
        it--;
        s.erase(it);
    }
};
// Driver code
int main()
{
    DblEndedPQ d;
    d.insert(10);
    d.insert(50);
    d.insert(40);
    d.insert(20);
    cout << d.getMin() << endl;
    cout << d.getMax() << endl;
    d.deleteMax();
    cout << d.getMax() << endl;
    d.deleteMin();
    cout << d.getMin() << endl;
    return 0;
}
Output
10
50
40
20
3. Static Hashing in data structure?
When a search key is specified in a static hash, the hashing algorithm always returns
the same address. For example, if you take the mod-5 hash function, only 5 values will be
produced (0 through 4). For this to work, the output address for a given key must always
be the same. The number of buckets at any given time is constant.
The data bucket address obtained by static hashing will always be the same. So, if we
use the mod(5) hash function to get the address of EmpId = 103, we always get the same data
bucket address 3. In this case, the data bucket position remains unchanged. Therefore, all
existing data buckets in memory remain unchanged while the entire hashing process remains
the same. In this case, there are five partitions in the memory used to hold data.
Searching the data: When data is needed, the same hash function is used to obtain
the address of the bucket where the data is stored.
Inserting Data: When new data is entered into the table, the hash key is used to create the
address of the new data and place the data there.
Deleting a Record: To delete a record, we must first provide the record to be destroyed. The
data of this address will be deleted from memory.
Updating a Record: To update a record, we first find it using the hash function and then
change its data. If we want to add a new record to the file, but the address generated
by the hash function is not empty (it already contains data), we cannot insert the record.
Bucket overflow is the term used to describe this situation in static hashing.
Open Hashing: When the hash function returns an address that already contains data,
the next free bucket is assigned to the new record through a process called linear probing.
For example, suppose a new record R4 must be inserted and the hash function produces
112 as its address. However, that bucket is already full, so the system selects 113 as
the next available bucket and assigns R4 to it.
Close Hashing: When a data bucket is full, a new bucket is allocated for the same hash
result and linked after the old one. This method is called overflow chaining. For example,
if the new record R4 to be added to the file is given address 112 by the hash function,
but that bucket is too full to hold any more data, a new bucket is allocated and chained
to the end of bucket 112.
Conclusion:
Static hashing is one of many hashing techniques; it shows that data need not be
stored sequentially, and it provides a fixed memory address for each value in the data.
Unlike other hashing schemes, static hashing keeps bucket addresses unchanged, which
suits objects and relational data in a DBMS whose volume does not change.
Binomial Tree: A Binomial Tree of order k has 2^k nodes and is defined recursively.

k = 0 (Single Node)

    o

k = 1 (2 nodes)
[We take two k = 0 order Binomial Trees, and make one a child of the other]

    o
    |
    o

k = 2 (4 nodes)
[We take two k = 1 order Binomial Trees, and make one a child of the other]

     o
    / \
   o   o
   |
   o

k = 3 (8 nodes)
[We take two k = 2 order Binomial Trees, and make one a child of the other]

       o
     / | \
    o  o  o
   /|  |
  o o  o
  |
  o
Binomial Heap:
A Binomial Heap is a set of Binomial Trees where each Binomial Tree follows the
Min Heap property. And there can be at most one Binomial Tree of any degree.
A Binomial Heap with 12 nodes: a collection of 2 Binomial Trees, of orders 2 and 3,
from left to right.

        10 ---------- 20
       /  \         / | \
     15    50     70  50  40
     |           / |   |
     30        80  85  65
               |
              100

Binary Representation of a number and Binomial Heaps:

A Binomial Heap with n nodes has a number of Binomial Trees equal to the number
of set bits in the binary representation of n. For example, let n be 13; there are 3 set bits
in the binary representation of n (00001101), hence 3 Binomial Trees. We can also
relate the degrees of these Binomial Trees to the positions of the set bits. With this
relation, we can conclude that there are O(log n) Binomial Trees in a Binomial Heap
with n nodes.

A Binomial Heap with 13 nodes: a collection of 3 Binomial Trees, of orders 0, 2, and 3,
from left to right.

  12 ---- 10 ---------- 20
         /  \         / | \
       15    50     70  50  40
       |           / |   |
       30        80  85  65
                 |
                100

Operations of Binomial Heap:

The main operation on a Binomial Heap is union(); all other operations mainly use
this operation. The union() operation combines two Binomial Heaps into one. Let us
first discuss the other operations; we will discuss union() later.
insert(H, k): Inserts a key ‘k’ to Binomial Heap ‘H’. This operation first creates a
Binomial Heap with a single key ‘k’, then calls union on H and the new Binomial heap.
getMin(H): A simple way to implement getMin() is to traverse the list of roots of the
Binomial Trees and return the minimum key. This implementation requires O(log n)
time. It can be optimized to O(1) by maintaining a pointer to the minimum-key root.
extractMin(H): This operation also uses union(). We first call getMin() to find the
minimum-key Binomial Tree, then we remove that node and create a new Binomial
Heap by connecting all subtrees of the removed minimum node. Finally, we call union()
on H and the newly created Binomial Heap. This operation requires O(log n) time.
delete(H): As in a Binary Heap, the delete operation first decreases the key to minus
infinity, then calls extractMin().
Operation             Binary Heap   Binomial Heap   Fibonacci Heap
Finding Minimum key   Θ(1)          O(log(n))       O(1)
Extract-Minimum key   Θ(log(n))     Θ(log(n))       O(log(n))