Unit1 Notes ADS
In computer science, a dictionary is a data structure that stores a collection of key-value pairs,
where each key must be unique. It is also known by other names such as a map, associative
array, or symbol table. The idea is to associate a value with a unique identifier (key), allowing
efficient retrieval and modification of values based on their keys.
An Abstract Data Type (ADT) is a high-level description of a set of operations that can be
performed on a data structure, without specifying how these operations are implemented. It
defines what operations are possible and what their semantics are, but it does not prescribe how
the operations should be carried out. In the case of a dictionary, the ADT would include
operations like insert, delete, and search.
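As a rough sketch (not part of the original notes; the class name DictionaryADT and the use of Python's built-in dict are illustrative choices only), the dictionary ADT and its three operations might look like this:

    class DictionaryADT:
        # Minimal sketch of the dictionary ADT: insert, delete, search.
        def __init__(self):
            self._store = {}              # backing structure; any implementation could be used here

        def insert(self, key, value):
            self._store[key] = value      # keys are unique, so inserting an existing key updates it

        def delete(self, key):
            self._store.pop(key, None)    # remove the pair if the key exists

        def search(self, key):
            return self._store.get(key)   # returns the value, or None if the key is absent

    d = DictionaryADT()
    d.insert("apple", 10)
    print(d.search("apple"))   # 10
    d.delete("apple")
    print(d.search("apple"))   # None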
Implementation of Dictionaries:
There are various ways to implement dictionaries, each with its own advantages and
disadvantages. Here are a few common implementations:
Hash Tables:
Idea: Use a hash function to map each key to an index in an array where its value is stored.
Pros: Efficient average-case time complexity for basic operations (insert, delete, search).
Cons: Possibility of collisions (two keys hashing to the same index), which requires collision
resolution strategies.
Binary Search Trees:
Idea: Store key-value pairs in a binary tree, where keys to the left are smaller, and keys to the
right are larger.
Pros: Maintains a sorted order of keys; easy to find the minimum and maximum keys.
Cons: The tree can become unbalanced, leading to inefficient operations in the worst case.
Skip Lists:
Idea: Linked lists with multiple layers of links, allowing for faster search operations.
Pros: Simpler than many other data structures, and has good average-case performance.
Cons: More complex than a simple linked list, and may not perform as well as hash tables in
certain scenarios.
Tries (Prefix Trees):
Idea: Organize keys in a tree structure where each node represents a character in a key.
Pros: Supports fast prefix-based lookups; search time depends on the key length rather than the
number of stored keys.
Cons: Can be memory-intensive, since each node may need to store references for many possible
characters.
Hashing:
Hashing is a technique used to map data of arbitrary size to fixed-size values, usually for the
purpose of quickly and efficiently locating a data record. It's commonly employed in data
structures like hash tables to achieve constant-time average complexity for basic operations like
insertion, deletion, and retrieval.
Hash Function:
A hash function takes an input (or "key") and produces a fixed-size value, typically called a hash
code. The goal is to distribute the keys uniformly across the range of possible hash values to
minimize collisions. An ideal hash function is fast to compute and minimizes the likelihood of
two different keys producing the same hash code.
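For illustration only (the function names hash_int and hash_string are placeholders, not from the notes), a hash function for integer keys can simply take the remainder after division by the table size, and a string can be hashed by combining its character codes:

    def hash_int(key, table_size):
        # simple remainder method: key mod table size
        return key % table_size

    def hash_string(s, table_size):
        # combine character codes so different strings tend to spread across the table
        h = 0
        for ch in s:
            h = (h * 31 + ord(ch)) % table_size
        return h

    print(hash_int(54, 11))        # 10
    print(hash_string("cat", 11))  # some slot in the range 0..10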
Collisions occur when two distinct keys hash to the same value. Various techniques are
employed to address collisions:
Collision Resolution
We now return to the problem of collisions. When two items hash to the same slot, we must have
a systematic method for placing the second item in the hash table. This process is called collision
resolution. If the hash function is perfect, collisions will never occur. However, since this is
often not possible, collision resolution becomes a very important part of hashing.
One method for resolving collisions looks into the hash table and tries to find another open slot
to hold the item that caused the collision. A simple way to do this is to start at the original hash
value position and then move in a sequential manner through the slots until we encounter the first
slot that is empty. Note that we may need to go back to the first slot (circularly) to cover the
entire hash table. This collision resolution process is referred to as open addressing in that it
tries to find the next open slot or address in the hash table. By systematically visiting each slot
one at a time, we are performing an open addressing technique called linear probing.
Figure 8 shows an extended set of integer items under the simple remainder method hash
function (54,26,93,17,77,31,44,55,20). Table 4 above shows the hash values for the original
items. Figure 5 shows the original contents. When we attempt to place 44 into slot 0, a collision
occurs. Under linear probing, we look sequentially, slot by slot, until we find an open position.
Again, 55 should go in slot 0 but must be placed in slot 2 since it is the next open position. The
final value of 20 hashes to slot 9. Since slot 9 is full, we begin to do linear probing. We visit slots
10, 0, 1, and 2, and finally find an empty slot at position 3.
Once we have built a hash table using open addressing and linear probing, it is essential that we
utilize the same methods to search for items. Assume we want to look up the item 93. When we
compute the hash value, we get 5. Looking in slot 5 reveals 93, and we can return True. What if
we are looking for 20? Now the hash value is 9, and slot 9 is currently holding 31. We cannot
simply return False since we know that there could have been collisions. We are now forced to
do a sequential search, starting at position 10, looking until either we find the item 20 or we find
an empty slot.
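The behavior described above can be traced with a short sketch, assuming the same remainder hash and a table of size 11 as in the figures (which are not reproduced here); the function names are placeholders:

    def build_table(items, size=11):
        table = [None] * size
        for item in items:
            slot = item % size
            while table[slot] is not None:     # linear probing: try the next slot, wrapping around
                slot = (slot + 1) % size
            table[slot] = item
        return table

    def probe_search(table, item):
        size = len(table)
        slot = item % size
        while table[slot] is not None:         # keep probing until an empty slot is reached
            if table[slot] == item:
                return True
            slot = (slot + 1) % size
        return False                           # assumes the table is not completely full

    table = build_table([54, 26, 93, 17, 77, 31, 44, 55, 20])
    print(table)                   # [77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]
    print(probe_search(table, 93)) # True: found directly in slot 5
    print(probe_search(table, 20)) # True: found after probing from slot 9 around to slot 3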
A disadvantage to linear probing is the tendency for clustering; items become clustered in the
table. This means that if many collisions occur at the same hash value, a number of surrounding
slots will be filled by the linear probing resolution. This will have an impact on other items that
are being inserted, as we saw when we tried to add the item 20 above. A cluster of values
hashing to 0 had to be skipped to finally find an open position. This cluster is shown in Figure 9.
One way to deal with clustering is to extend the linear probing technique so that instead of
looking sequentially for the next open slot, we skip slots, thereby more evenly distributing the
items that have caused collisions. This will potentially reduce the clustering that occurs. Figure
10 shows the items when collision resolution is done with a "plus 3" probe. This means that once
a collision occurs, we will look at every third slot until we find one that is empty.
The general name for this process of looking for another slot after a collision is rehashing. With
simple linear probing, the rehash function is newhashvalue = rehash(oldhashvalue), where
rehash(pos) = (pos + 1) % sizeoftable. The "plus 3" rehash can be defined as
rehash(pos) = (pos + 3) % sizeoftable. In general, rehash(pos) = (pos + skip) % sizeoftable.
It is important to note that the size of the "skip" must be such that all the slots in the table will
eventually be visited. Otherwise, part of the table will be unused. To ensure this, it is often
suggested that the table size be a prime number. This is the reason we have been using 11 in our
examples.
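A quick way to see why the table size matters is to trace which slots a given skip can actually reach; the small sketch below (illustrative only, the function name is a placeholder) does that for a skip of 3 with table sizes 9 and 11:

    def slots_reached(start, skip, size):
        # follow rehash(pos) = (pos + skip) % size until the sequence repeats
        visited, pos = set(), start
        while pos not in visited:
            visited.add(pos)
            pos = (pos + skip) % size
        return sorted(visited)

    print(slots_reached(0, 3, 9))    # [0, 3, 6] -- most of a size-9 table can never be probed
    print(slots_reached(0, 3, 11))   # all slots 0..10 -- a prime table size reaches every slot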
A variation of the linear probing idea is called quadratic probing. Instead of using a constant
"skip" value, we use a rehash function that increments the hash value by 1, 3, 5, 7, 9, and so on.
This means that if the first hash value is h, the successive values are h+1, h+4, h+9, h+16, and so
on. In other words, quadratic probing uses a skip consisting of
successive perfect squares. Figure 11 shows our example values after they are placed using this
technique.
Collision Resolution with Chaining:
An alternative method for handling the collision problem is to allow each slot to hold a reference
to a collection (or chain) of items. Chaining allows many items to exist at the same location in
the hash table. When collisions happen, the item is still placed in the proper slot of the hash
table. As more and more items hash to the same location, the difficulty of searching for the item
in the collection increases. Figure 12 shows the items as they are added to a hash table that uses
chaining to resolve collisions.
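A minimal sketch of chaining, assuming the same remainder hash and the integer items from the earlier example (the class name ChainedHashTable is a placeholder):

    class ChainedHashTable:
        def __init__(self, size=11):
            self.size = size
            self.slots = [[] for _ in range(size)]       # each slot holds a chain (list) of items

        def insert(self, item):
            self.slots[item % self.size].append(item)    # the item stays at its proper slot

        def search(self, item):
            return item in self.slots[item % self.size]  # only the chain at one slot is scanned

    t = ChainedHashTable()
    for item in (54, 26, 93, 17, 77, 31, 44, 55, 20):
        t.insert(item)
    print(t.slots[0])     # [77, 44, 55] -- all three items that hash to 0 share slot 0
    print(t.search(20))   # True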
The open addressing techniques described below are:
Linear Probing
Quadratic Probing
Double Hashing
1. Linear Probing:
Suppose a new record R with key k is to be added to the memory table T, but the memory
location with hash address H(k) = h is already filled.
One natural way to resolve the collision is to assign R to the first available location following
T[h]. We assume that the table T with m locations is circular, so that T[1] comes after T[m].
Linear probing is simple to implement, but it suffers from an issue known as primary clustering.
Long runs of occupied slots build up, increasing the average search time. Clusters arise because
an empty slot preceded by i full slots gets filled next with probability (i + 1)/m. Long runs of
occupied slots tend to get longer, and the average search time increases.
Given an ordinary hash function h': U → {0, 1, ..., m-1}, the method of linear probing uses the hash
function
1. h (k, i) = (h' (k) + i) mod m
where m is the size of the hash table, h' (k) = k mod m, and i = 0, 1, ..., m-1.
Given key k, the first slot probed is T[h'(k)]. We next probe slot T[h'(k) + 1], and so on up to slot
T[m-1]. Then we wrap around to slots T[0], T[1], ..., until finally slot T[h'(k) - 1]. Since the initial
probe position determines the entire probe sequence, only m distinct probe sequences are used with
linear probing.
Example: Consider inserting the keys 24, 36, 58, 65, 62, 86 into a hash table of size m = 11 using
linear probing, where the primary hash function is h' (k) = k mod m. The resulting placements are
traced in the sketch below.
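A sketch of these insertions, assuming the formula h(k, i) = (h'(k) + i) mod m given above (the printed table is derived from that formula, not copied from the notes):

    def linear_probe_insert(keys, m=11):
        table = [None] * m
        for k in keys:
            i = 0
            while table[(k % m + i) % m] is not None:   # h(k, i) = (h'(k) + i) mod m
                i += 1
            table[(k % m + i) % m] = k
        return table

    print(linear_probe_insert([24, 36, 58, 65, 62, 86]))
    # [None, None, 24, 36, 58, None, None, 62, None, 86, 65]
    # 58 collides with 36 at slot 3 and moves to the next free slot, slot 4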
2. Quadratic Probing:
Suppose a record R with key k has the hash address H(k) = h. Then instead of searching the
locations with addresses h, h+1, h+2, ..., we search the locations with addresses given by
h (k, i) = (h' (k) + c1 x i + c2 x i²) mod m
where (as in linear probing) h' is an auxiliary hash function, c1 and c2 ≠ 0 are auxiliary constants,
and i = 0, 1, ..., m-1. The initial position probed is T[h'(k)]; later positions probed are offset by
amounts that depend in a quadratic manner on the probe number i.
Example: Consider inserting the keys 74, 28, 36, 58, 21, 64 into a hash table of size m = 11 using
quadratic probing with c1 = 1 and c2 = 3. Further consider that the primary hash function is
h' (k) = k mod m. (A sketch tracing all the probe computations follows the insertion steps below.)
Insert 28.
Insert 36.
Insert 58.
Insert 21.
Insert 64.
h (64, 0) = (64 mod 11 + 1 x 0 + 3 x 0) mod 11
= (9 + 0 + 0) mod 11 = 9.
T [9] is free; insert key 64 at this place.
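The full set of probe computations for this example can be traced with a sketch of the formula h(k, i) = (h'(k) + c1 x i + c2 x i²) mod m stated above; the printed table is derived from that formula rather than copied from the notes:

    def quadratic_probe_insert(keys, m=11, c1=1, c2=3):
        table = [None] * m
        for k in keys:
            i = 0
            while table[(k % m + c1 * i + c2 * i * i) % m] is not None:
                i += 1                                   # collision: move to the next probe number
            table[(k % m + c1 * i + c2 * i * i) % m] = k
        return table

    print(quadratic_probe_insert([74, 28, 36, 58, 21, 64]))
    # [None, None, None, 36, None, None, 28, 58, 74, 64, 21]
    # only 58 collides (with 36 at slot 3); h(58, 1) = (3 + 1 + 3) mod 11 = 7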
3. Double Hashing:
Double hashing is one of the best techniques available for open addressing because the
permutations produced have many of the characteristics of randomly chosen permutations. It uses
the hash function
h (k, i) = (h1 (k) + i x h2 (k)) mod m
where h1 and h2 are auxiliary hash functions and m is the size of the hash table.
h1 (k) = k mod m and h2 (k) = k mod m'. Here m' is slightly less than m (say m-1 or m-2).
Example: Consider inserting the keys 76, 26, 37, 59, 21, 65 into a hash table of size m = 11 using
double hashing. Consider that the auxiliary hash functions are h1 (k) = k mod 11 and
h2 (k) = k mod 9.
1. Insert 76.
h1(76) = 76 mod 11 = 10
h2(76) = 76 mod 9 = 4
h (76, 0) = (10 + 0 x 4) mod 11
= 10 mod 11 = 10
T [10] is free, so insert key 76 at this place.
2. Insert 26.
h1(26) = 26 mod 11 = 4
h2(26) = 26 mod 9 = 8
h (26, 0) = (4 + 0 x 8) mod 11
= 4 mod 11 = 4
T [4] is free, so insert key 26 at this place.
3. Insert 37.
h1(37) = 37 mod 11 = 4
h2(37) = 37 mod 9 = 1
h (37, 0) = (4 + 0 x 1) mod 11 = 4 mod 11 = 4
T [4] is not free, the next probe sequence is
h (37, 1) = (4 + 1 x 1) mod 11 = 5 mod 11 = 5
T [5] is free, so insert key 37 at this place.
4. Insert 59.
h1(59) = 59 mod 11 = 4
h2(59) = 59 mod 9 = 5
h (59, 0) = (4 + 0 x 5) mod 11 = 4 mod 11 = 4
Since, T [4] is not free, the next probe sequence is
h (59, 1) = (4 + 1 x 5) mod 11 = 9 mod 11 = 9
T [9] is free, so insert key 59 at this place.
5. Insert 21.
h1(21) = 21 mod 11 = 10
h2(21) = 21 mod 9 = 3
h (21, 0) = (10 + 0 x 3) mod 11 = 10 mod 11 = 10
T [10] is not free, the next probe sequence is
h (21, 1) = (10 + 1 x 3) mod 11 = 13 mod 11 = 2
T [2] is free, so insert key 21 at this place.
6. Insert 65.
h1(65) = 65 mod 11 = 10
h2(65) = 65 mod 9 = 2
h (65, 0) = (10 + 0 x 2) mod 11 = 10 mod 11 = 10
T [10] is not free, the next probe sequence is
h (65, 1) = (10 + 1 x 2) mod 11 = 12 mod 11 = 1
T [1] is free, so insert key 65 at this place.
Thus, after insertion of all the keys, the final hash table is: T[1] = 65, T[2] = 21, T[4] = 26,
T[5] = 37, T[9] = 59, T[10] = 76, with all other slots empty.
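The same placements can be reproduced with a short sketch of the double hashing formula h(k, i) = (h1(k) + i x h2(k)) mod m used above; the sketch assumes h2(k) is never 0 for the given keys, which holds here:

    def double_hash_insert(keys, m=11, m2=9):
        table = [None] * m
        for k in keys:
            h1, h2 = k % m, k % m2                        # auxiliary hash functions from the example
            i = 0
            while table[(h1 + i * h2) % m] is not None:   # h(k, i) = (h1(k) + i * h2(k)) mod m
                i += 1
            table[(h1 + i * h2) % m] = k
        return table

    print(double_hash_insert([76, 26, 37, 59, 21, 65]))
    # [None, 65, 21, None, 26, 37, None, None, None, 59, 76]
    # i.e. T[1]=65, T[2]=21, T[4]=26, T[5]=37, T[9]=59, T[10]=76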
Rehashing
Rehashing is a technique that dynamically expands the size of a Map, Array, or Hashtable in order
to maintain O(1) complexity for the get and put operations.
It can also be defined as the process of re-calculating the hash codes of already stored entries and
moving them to a bigger hash table when the number of elements in the map reaches the maximum
threshold value.
In simple words, rehashing means hashing the stored entries again into a larger table, which
preserves the performance. In rehashing, the load factor plays a vital role.
Load Factor
The load factor is a measure that decides when to increase the HashMap or Hashtable capacity to
maintain the get() and put() operation of complexity O(1). The default value of the load factor of
HashMap is 0.75 (75% of the map size). In short, we can say that the load factor decides "when
to increase the number of buckets to store the key-value pair."
LARGER load factor: lower space consumption, but more collisions and therefore slower lookups.
SMALLER load factor: faster lookups, but larger space consumption compared to the required
number of elements.
Example:
With the default capacity of 16 and load factor of 0.75, the threshold is 16 x 0.75 = 12. This means
the HashMap keeps its size at 16 buckets up to the 12th key-value pair. As soon as the 13th element
(key-value pair) comes into the HashMap, it increases its size from the default 2^4 = 16 buckets to
2^5 = 32 buckets.
Load Factor Example
We know that the default bucket size of the HashMap is 16. We insert the first element, then
check whether we need to increase the HashMap capacity or not. It can be determined by the
formula:
(number of entries in the map) / (bucket capacity)
In this case, the size of the HashMap is 1, and the bucket capacity is 16. So, 1/16 = 0.0625. Now
compare the obtained value with the default load factor (0.75).
The value is smaller than the default value of the load factor. So, no need to increase the
HashMap size. Therefore, we do not need to increase the size of the HashMap up to the
12th element because
12/16 = 0.75
The obtained value is equal to the default load factor, i.e., 0.75.
As soon as we insert the 13th element in the HashMap, the size of the HashMap is increased because:
13/16 = 0.8125
Here, the obtained value is greater than the default load factor value.
In order to insert the 13th element into the HashMap, we need to increase the HashMap size.
If you want to keep get() and put() operation complexity O(1), it is advisable to have a load factor
around 0.75.
Rehashing is required when the load factor increases. The load factor increases as we insert
key-value pairs into the map, and a high load factor also increases the time complexity. Generally,
the time complexity of HashMap operations is O(1). In order to reduce the time complexity and the
load factor of the HashMap, we implement the rehashing technique. A small sketch of this
behavior is given below.
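The notes describe Java's HashMap; the sketch below expresses the same idea in Python (the class and method names are placeholders, and the 0.75 threshold and doubling of the bucket count mirror the description above):

    class RehashingMap:
        def __init__(self, capacity=16, load_factor=0.75):
            self.capacity, self.load_factor = capacity, load_factor
            self.buckets = [[] for _ in range(capacity)]
            self.size = 0

        def put(self, key, value):
            if (self.size + 1) / self.capacity > self.load_factor:
                self._rehash()                           # the 13th entry into 16 buckets triggers this
            bucket = self.buckets[hash(key) % self.capacity]
            for pair in bucket:
                if pair[0] == key:                       # existing key: just update the value
                    pair[1] = value
                    return
            bucket.append([key, value])
            self.size += 1

        def _rehash(self):
            entries = [pair for bucket in self.buckets for pair in bucket]
            self.capacity *= 2                           # grow from 2^4 = 16 to 2^5 = 32 buckets
            self.buckets = [[] for _ in range(self.capacity)]
            for key, value in entries:                   # re-calculate every entry's bucket
                self.buckets[hash(key) % self.capacity].append([key, value])

    hmap = RehashingMap()
    for i in range(13):
        hmap.put(i, i)
    print(hmap.capacity)    # 32 -- rehashing happened while inserting the 13th entry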
Dynamic Hashing
o The dynamic hashing method is used to overcome the problems of static hashing like
bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or decreases.
This method is also known as the Extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting
in poor performance.
For example:
Consider the following grouping of keys into buckets, depending on the last bits of their hash
address:
The last two bits of 2 and 4 are 00, so they will go into bucket B0. The last two bits of 5 and 6 are
01, so they will go into bucket B1. The last two bits of 1 and 3 are 10, so they will go into bucket
B2. The last two bits of 7 are 11, so it will go into B3.
Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001, whose last two bits are 01, it must go into bucket B1. But
bucket B1 is full, so it will get split.
o The splitting separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go into
bucket B1, and the last three bits of 6 are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 directory
entries, because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 directory
entries, because the last two bits of both entries are 10.
o Key 7 is still in B3. The records in B3 are pointed to by the 111 and 011 directory entries,
because the last two bits of both entries are 11. (A small sketch of this regrouping follows this
list.)
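As a rough sketch of the regrouping above (the leading bits of the 5-bit hash addresses for keys 1 through 7 are not given in the notes, so the values in the addresses dictionary below are made up for illustration; only the quoted last-bit suffixes come from the example):

    from collections import defaultdict

    def bucket_of(address, depth):
        return address[-depth:]          # a bucket is identified by the last `depth` bits

    # 5-bit hash addresses; only the suffixes quoted in the notes are certain
    addresses = {2: "00000", 4: "00100", 5: "01001", 6: "00101",
                 1: "00010", 3: "00110", 7: "00111", 9: "10001"}

    buckets = defaultdict(list)
    for key, addr in addresses.items():
        buckets[bucket_of(addr, 2)].append(key)           # group by the last two bits
    print(dict(buckets))   # '01' now holds 5, 6 and 9, so that bucket must split

    overfull = buckets.pop("01")
    for key in overfull:
        buckets[bucket_of(addresses[key], 3)].append(key) # regroup by the last three bits
    print(dict(buckets))   # '001' holds 5 and 9, '101' holds 6; the other buckets are unchanged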
Advantages of dynamic hashing
o In this method, the performance does not decrease as the data grows in the system. It simply
increases the size of memory to accommodate the data.
o In this method, memory is well utilized as it grows and shrinks with the data. There will
not be any memory lying unused.
o This method is good for the dynamic database where data grows and shrinks frequently.