Adsa Unit-I
DICTIONARIES
For example: Consider a data structure that stores bank account data,
where the account number serves as a key to identify the account details.
Abstract Data Type (ADT): It's a theoretical concept used in computer science
to define a data type purely by its behavior (operations and their properties)
rather than its implementation.
In an ADT, you specify:
1. Data: The type of data that the ADT can hold.
2. Operations: The operations that can be performed on that data (like
adding, removing, or accessing elements).
3. Properties: The expected behavior of these operations, such as their time
complexity.
Dictionary ADT:
1. Insert (key, value): Inserts the <key, value> pair into the dictionary. If the
key is already in use, the existing <key, value> pair is overwritten.
2. Delete (key): Removes the <key, value> pair associated with the
specified key.
3. Search (key): Retrieves the value associated with the specified key.
4. Update (key, value): Modifies the value for the specified key.
Implementations:
List or Array: Simple but less efficient, especially for large datasets
(O(n) for search).
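As a sketch, the Dictionary ADT operations above map directly onto Python's built-in dict (the account numbers and names below are illustrative, not from the notes):

```python
# Sketch of the Dictionary ADT using Python's built-in dict.
accounts = {}

# Insert(key, value): overwrites if the key already exists
accounts[1001] = "Alice"
accounts[1002] = "Bob"
accounts[1001] = "Alice M."   # same key: the old value is overwritten

# Search(key): retrieve the value for a key
print(accounts.get(1002))     # Bob

# Update(key, value): same syntax as insert
accounts[1002] = "Bobby"

# Delete(key): remove the pair
del accounts[1001]
print(1001 in accounts)       # False
```

Note that in this implementation Insert and Update are the same operation, which is consistent with the overwrite rule in point 1 above.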
Applications of Dictionaries:
Database indexing
Caching mechanisms
HASHING
HASH TABLE:
A hash table is a data structure used for storing and retrieving data very quickly.
Insertion of data into the hash table is based on a key value; hence every entry
in the hash table is associated with some key. Using the hash key, the required
piece of data can be found in the hash table with only a few key comparisons.
The searching time then depends upon the size of the hash table.
For example: Consider that we want to place some employee records in a hash
table. Each record is placed with the help of a key: the employee ID. The
employee ID is a 7-digit number. To place a record, the 7-digit number is
reduced to 3 digits by taking only the last three digits of the key. If the key is
4967000, the record is stored at position 0. For a second key 8421002, the
record is placed at position 2 in the array. Hence the hash function will be
H(key) = key % 1000, where key % 1000 is the hash function and the value
obtained from it is called the hash key.
Bucket and Home bucket: The hash function H(key) maps dictionary entries
to positions in the hash table. Each position of the hash table is called a
bucket. The bucket H(key) is the home bucket for the dictionary pair whose
key is key.
TYPES OF HASH FUNCTIONS: There are various types of hash functions that
are used to place the record in the hash table.
1. Division Method: The key is divided by the table size and the remainder
is used as the index: H(key) = key % tablesize.
For example, with table size 10 and keys 54, 72, 89, 37:
54 % 10 = 4
72 % 10 = 2
89 % 10 = 9
37 % 10 = 7

Index   Key
0
1
2       72
3
4       54
5
6
7       37
8
9       89
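The division method above can be sketched in a few lines of Python (the function name is illustrative):

```python
def division_hash(key, table_size):
    """Division method: H(key) = key % table_size."""
    return key % table_size

# Keys from the example above, table of size 10
for key in (54, 72, 89, 37):
    print(key, "->", division_hash(key, 10))
# 54 -> 4, 72 -> 2, 89 -> 9, 37 -> 7
```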
2. Mid Square: In the mid square method, the key is squared and the middle
or mid part of the result is used as the index. If the key is a string, it has to be
preprocessed to produce a number.
Consider that we want to place the record 3111. Then (3111)² = 9678321, and for a
hash table of size 1000, H(3111) = 783 (the middle 3 digits).
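A minimal sketch of the mid-square method, assuming we take the middle digits of the decimal representation of the square (the function name and the digit-slicing convention are illustrative):

```python
def mid_square_hash(key, num_digits=3):
    """Mid-square method: square the key, take the middle digits."""
    squared = str(key * key)          # e.g. 3111 -> "9678321"
    mid = len(squared) // 2           # position of the middle digit
    half = num_digits // 2
    return int(squared[mid - half : mid + half + 1])

print(mid_square_hash(3111))          # 783, matching the example above
```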
3. Multiplication Method: The key is multiplied by a constant (here the
golden-ratio fraction 0.61803398987) and the scaled result is used as the
index. For the key 107:
H(key) = floor(50*(107*0.61803398987))
= floor(3306.4818458045)
= 3306
At location 3306 in the hash table the record 107 will be placed.
4. Digit Folding: The key is divided into separate parts and using some simple
operation these parts are combined to produce the hash key.
For example, consider the record 12365412. It is divided into the parts
123, 654, 12, and these are added together: H(key) = 123 + 654 + 12 = 789. The
record will be placed at location 789.
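The digit-folding step can be sketched as follows (Python; the chunk size of 3 matches the example above, the function name is illustrative):

```python
def folding_hash(key, chunk=3):
    """Digit folding: split the key into groups of digits and add them."""
    digits = str(key)
    # Split "12365412" into ["123", "654", "12"], then sum the parts
    parts = [int(digits[i:i + chunk]) for i in range(0, len(digits), chunk)]
    return sum(parts)

print(folding_hash(12365412))   # 123 + 654 + 12 = 789
```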
5. Digit Analysis: Digit analysis is used in situations where all the
identifiers are known in advance. We first transform the identifiers into
numbers using some radix r, and then examine the digits of each identifier.
The digits with the most skewed distributions are deleted, and this deletion
continues until the number of remaining digits is small enough to give an
address in the range of the hash table. These remaining digits are then used to
calculate the hash address.
COLLISION:
The hash function returns the key value with which the
record can be placed in the hash table. This function thus helps us place
the record at an appropriate position in the hash table, and because of this we can
retrieve the record directly from that location. The function needs to be
designed very carefully, and ideally it should not return the same hash key address for
two different records, since that is an undesirable situation in hashing.
Definition: The situation in which the hash function returns the same hash
key (home bucket) for more than one record is called a collision, and two
different records that receive the same hash key are called synonyms. Similarly,
when there is no room for a new pair in the hash table, the situation is called
overflow. Sometimes handling a collision may lead to an overflow
condition. Collisions and overflows indicate a poor hash function.
The record keys to be placed are 131, 44, 43, 78, 19, 36, 57 and 77.
131 % 10 = 1
44 % 10 = 4
43 % 10 = 3
78 % 10 = 8
19 % 10 = 9
36 % 10 = 6
57 % 10 = 7

Index   Key
0
1       131
2
3       43
4       44
5
6       36
7       57
8       78
9       19
Now if we try to place 77 in the hash table then we get the hash key to be 7 and
at index 7 already the record key 57 is placed. This situation is called collision.
From the index 7 if we look for next vacant position at subsequent indices 8, 9
then we find that there is no room to place 77 in the hash table. This situation
is called overflow.
COLLISION RESOLUTION TECHNIQUES:
1. Separate Chaining
2. Open addressing
i) Linear probing
ii) Quadratic probing
iii) Double hashing
3. Rehashing
4. Extendible hashing
1. SEPARATE CHAINING:
Each slot (i.e. bucket) in the hash table points to a linked list (or another data
structure) that stores all the elements that hash to the same index. When a
collision occurs, the new element is simply added to the list.
For eg., Consider the keys to be placed in their home buckets are 131, 3, 4, 21,
61, 7, 97, 8, 9
then we apply the hash function H(key) = key % D, where D is the size of the
table. Here D = 10.
A chain is maintained for colliding elements. For instance, 131 has home
bucket 1; keys 21 and 61 also demand home bucket 1. Hence a
chain is maintained at index 1.
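The chaining scheme above can be sketched as follows (Python; each bucket is a plain list standing in for the linked list, and the class name is illustrative):

```python
class ChainedHashTable:
    """Minimal separate-chaining hash table with H(key) = key % size."""

    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def insert(self, key):
        bucket = self.buckets[key % self.size]
        if key not in bucket:          # avoid storing duplicate keys
            bucket.append(key)

    def search(self, key):
        return key in self.buckets[key % self.size]

table = ChainedHashTable()
for key in (131, 3, 4, 21, 61, 7, 97, 8, 9):
    table.insert(key)

print(table.buckets[1])   # [131, 21, 61] -- the chain at home bucket 1
print(table.buckets[7])   # [7, 97]
```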
Advantages:
1. Simplicity:
2. Dynamic Size:
o There’s no need to worry about resizing the table when the number
of entries grows, as each bucket can expand independently as
needed.
o Multiple keys can hash to the same index without losing data. The
linked list or similar structure can grow to accommodate additional
entries.
M.Tech-CSE & CSE(AI&ML) SREC, Nandyal Page 8
Advanced Data Structures and Algorithms
5. Ease of Deletion:
Disadvantages:
1. Memory Overhead:
2. Performance Degradation:
o If many keys hash to the same index (poor hash function), chains
can become long, leading to O(n) time complexity for search, insert,
and delete operations in the worst case.
3. Cache Performance:
4. Increased Complexity:
5. Rehashing:
o If the load factor becomes too high, rehashing the entire table (i.e.,
creating a larger table and redistributing the entries) may be
required, which can be expensive.
2. OPEN ADDRESSING:
Instead of using linked lists, this technique finds another free slot within the
hash table itself when a collision occurs. This process is known as probing,
and the sequence of slots examined is known as the probing sequence.
i) Linear Probing:
When a collision occurs, i.e. when two records demand the same home
bucket in the hash table, the collision can be resolved by placing the
second record in the next empty bucket found by moving linearly down the
table. When we use linear probing (open addressing), the hash table is
represented as a one-dimensional array with indices that range from 0 to
tablesize - 1. Before inserting any elements into this table, we must initialize
the table to represent the situation where all slots are empty. This allows us
to detect overflows and collisions when we insert elements into the table.
Then, using some suitable hash function, elements can be inserted into
the hash table.
This method uses the following formula to find the next empty slot in
sequence:
H(key, i) = (H(key) + i) % tablesize, for i = 1, 2, ...
Following keys are to be inserted in the hash table 131, 4, 8, 7, 21, 5, 31, 61,
9, 29. Initially, we will put the following keys in the hash table.
Assume tablesize=10
For instance the element 131 can be placed at H(key) = 131 % 10=1
Index 1 will be the home bucket for 131. Continuing in this fashion we will
place 4, 8, 7. Now the next key to be inserted is 21.
H(21) = 1
But index 1 is already occupied by 131, i.e. a collision occurs. To
resolve this collision we move linearly down and place the element at the next
empty location.
H(21) = (H(21) + 1) % 10 = (1 + 1) % 10 = 2
Therefore 21 will be placed at index 2. If the next element is 5, then its
home bucket is index 5, and since this bucket is empty we put the
element 5 at index 5.
The next record key is 9. According to the division hash function it demands
home bucket 9, hence we place 9 at index 9. The final record
key is 29, and it also hashes to key 9. But home bucket 9 is already occupied, and
there is no next empty bucket as the table size is limited to index 9, so an
overflow occurs. To handle it we wrap around to bucket 0, and since that
location is empty, 29 is placed at index 0.
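The walkthrough above can be sketched as a linear-probing insert that wraps around the end of the table (Python; the function name is illustrative):

```python
def linear_probe_insert(table, key):
    """Insert key with linear probing: index = (H(key) + i) % tablesize.
    Returns the index used, or None if the table is full."""
    size = len(table)
    home = key % size
    for i in range(size):
        index = (home + i) % size      # wraps past the last bucket to 0
        if table[index] is None:
            table[index] = key
            return index
    return None                        # overflow: no free slot anywhere

table = [None] * 10
for key in (131, 4, 8, 7, 21, 5, 31, 61, 9, 29):
    linear_probe_insert(table, key)

print(table)   # [29, 131, 21, 31, 4, 5, 61, 7, 8, 9]
```

Note how 29 hashes to the occupied bucket 9 and wraps around to land at index 0, exactly as described above.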
Advantages:
1. Simplicity:
2. Cache Performance:
4. Easy to Implement:
Disadvantages:
1. Clustering:
2. Performance Degradation:
3. Complex Deletion:
5. Rehashing Complexity:
ii) Quadratic Probing:
When a collision occurs, probing is done using the formula
(H(key) + i²) % tablesize, for i = 1, 2, ...
For example: Let us insert the following keys in the hash table:
37, 90, 55, 22, 11, 17, 49, 87
H(37) = 37 % 10 = 7
H(90) = 90 % 10 = 0
H(55) = 55 % 10 = 5
H(22) = 22 % 10 = 2
H(11) = 11 % 10 = 1

Index   Key
0       90
1       11
2       22
3       NULL
4       NULL
5       55
6       NULL
7       37
8       NULL
9       NULL

H(17) = 17 % 10 = 7, occupied (collision for 17 at index 7)
H(17) = (H(17) + 1²) % 10 = (7 + 1) % 10 = 8, so 17 is placed at index 8
H(49) = 49 % 10 = 9

Index   Key
0       90
1       11
2       22
3       NULL
4       NULL
5       55
6       NULL
7       37 (collision for 17)
8       17
9       49
H(87) = 87 % 10 = 7, occupied
H(87) = (H(87) + 1²) % 10 = (7 + 1) % 10 = 8, occupied
H(87) = (H(87) + 2²) % 10 = (7 + 4) % 10 = 1, occupied
H(87) = (H(87) + 3²) % 10 = (7 + 9) % 10 = 6, empty, so 87 is placed at index 6

Index   Key
0       90
1       11
2       22
3       NULL
4       NULL
5       55
6       87
7       37
8       17
9       49
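The quadratic-probing sequence above can be sketched in Python (the function name is illustrative):

```python
def quadratic_probe_insert(table, key):
    """Insert key with quadratic probing: index = (H(key) + i*i) % size."""
    size = len(table)
    home = key % size
    for i in range(size):
        index = (home + i * i) % size   # probe offsets 0, 1, 4, 9, ...
        if table[index] is None:
            table[index] = key
            return index
    return None   # no empty slot found within size probes

table = [None] * 10
for key in (37, 90, 55, 22, 11, 17, 49, 87):
    quadratic_probe_insert(table, key)

print(table)   # [90, 11, 22, None, None, 55, 87, 37, 17, 49]
```

The final array matches the table above: 87 collides at indices 7, 8, and 1 before the offset 3² lands it at index 6.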
Advantages:
1. Reduced Clustering:
2. Improved Performance:
3. Simple Implementation:
4. Cache Efficiency:
Disadvantages:
1. Complexity in Implementation:
2. Secondary Clustering:
5. Deletion Complexity:
iii) Double Hashing:
When a collision occurs, probing is done using the following formula for
i = 1, 2, ...
(H1(key) + i * H2(key)) % tablesize
where H2(key) = M - (key % M) and M is a prime number.
For example, consider a hash table of size 10 and insert the following keys
in the hash table. Assume tablesize = 10 and prime number M = 7.
37, 90, 45, 22, 17, 49, 55
H1(37) = 37 % 10 = 7
H1(90) = 90 % 10 = 0
H1(45) = 45 % 10 = 5
H1(22) = 22 % 10 = 2
H1(17) = 17 % 10 = 7, collision occurs at index 7, so calculate H2(key):
H2(17) = M - (key % M) = 7 - (17 % 7) = 7 - 3 = 4
Index = (H1(17) + 1 * H2(17)) % 10 = (7 + 4) % 10 = 1, so 17 is placed at index 1

Index   Key
0       90
1       17 (collision for 17 resolved here)
2       22
3
4
5       45
6
7       37
8
9

H1(49) = 49 % 10 = 9
H1(55) = 55 % 10 = 5, collision occurs at index 5, so calculate H2(key):
H2(55) = M - (key % M) = 7 - (55 % 7) = 7 - 6 = 1
Now perform probing to find the next index, for i = 1, 2, ...
Index = (H1(key) + i * H2(key)) % tablesize
Index = (5 + 1 * 1) % 10 = 6 % 10 = 6, so 55 is placed at index 6

Index   Key
0       90
1       17
2       22
3
4
5       45 (collision for 55)
6       55
7       37
8
9       49
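The double-hashing example above can be sketched in Python (function names are illustrative; M = 7 as in the example):

```python
M = 7   # prime used by the second hash, as in the example above

def h1(key, size):
    return key % size

def h2(key):
    return M - (key % M)

def double_hash_insert(table, key):
    """Insert key using index = (H1(key) + i * H2(key)) % tablesize."""
    size = len(table)
    for i in range(size):
        index = (h1(key, size) + i * h2(key)) % size
        if table[index] is None:
            table[index] = key
            return index
    return None

table = [None] * 10
for key in (37, 90, 45, 22, 17, 49, 55):
    double_hash_insert(table, key)

print(table)   # [90, 17, 22, None, None, 45, 55, 37, None, 49]
```

Because each key gets its own step size H2(key), colliding keys follow different probe sequences, which is what reduces clustering relative to linear probing.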
Advantages:
1. Reduced Clustering:
2. Better Distribution:
3. Flexibility:
4. Performance:
5. Efficiency in Space:
Disadvantages:
1. Complexity:
2. Performance Sensitivity:
4. Increased Computation:
5. Deletion Complexity:
3. REHASHING:
Rehashing is a technique in which the table is resized, i.e., the size of table is
doubled by creating a new table. It is preferable that the table size is a prime
number.
For example, consider the following keys to insert in the hash table.
H(49)=49%10=9
Now this table is almost full, and if we try to insert more elements,
collisions will occur and eventually further insertions will fail. Hence we
rehash by doubling the table size. The old table size is 10, so we
double it to 20 for the new table. But since 20 is not a
prime number, we prefer to make the table size 23.
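The resize step above can be sketched as follows (Python; linear probing is an illustrative choice for re-insertion, not specified by the notes, and the function names are hypothetical):

```python
def next_prime(n):
    """Smallest prime >= n (simple trial division; fine for table sizes)."""
    def is_prime(k):
        if k < 2:
            return False
        return all(k % d for d in range(2, int(k ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def rehash(old_table):
    """Double the size, round up to a prime, and re-insert every key."""
    new_size = next_prime(2 * len(old_table))
    new_table = [None] * new_size
    for key in old_table:
        if key is None:
            continue
        index = key % new_size             # hash with the NEW size
        while new_table[index] is not None:
            index = (index + 1) % new_size # linear probe on collision
        new_table[index] = key
    return new_table

print(next_prime(20))   # 23, as in the example above
```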
Advantages:
1. Improved Performance:
o By increasing the size of the hash table, rehashing can reduce the
load factor, which generally leads to improved performance for
insertion, deletion, and search operations. A lower load factor
minimizes collisions and clustering.
2. Better Distribution:
3. Dynamic Growth:
4. Efficiency Maintenance:
5. Flexibility:
Disadvantages:
1. High Overhead:
2. Memory Consumption:
3. Complexity:
Load Factor:
The load factor in a hash table is a measure of how full the hash table is, and
it plays a crucial role in the performance of hash table operations. It is defined
as the ratio of the number of entries (elements) stored in the hash table to the
total number of slots (buckets) available in the table.
Definition:
α = n / m
where n is the number of entries stored in the hash table and m is the total
number of buckets.
1. Performance:
o The load factor directly affects the performance of the hash table. A
higher load factor typically leads to more collisions, which can
degrade the average time complexity of operations such as
insertion, deletion, and search.
2. Collision Resolution:
3. Resizing:
4. Space Efficiency:
o A lower load factor can lead to wasted space, as many buckets may
remain empty. Balancing between load factor and space usage is
key to efficient hash table design.
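The definition α = n / m is a one-liner in code (illustrative names):

```python
def load_factor(num_entries, num_buckets):
    """Load factor: alpha = n / m."""
    return num_entries / num_buckets

print(load_factor(8, 10))   # 0.8 -- a fairly full table
```

Many implementations trigger rehashing once α crosses a threshold such as 0.7 or 0.75 (the exact threshold is an implementation choice, not fixed by the definition).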
4. EXTENDIBLE HASHING:
Structure
Key Concepts
1. Directory: An array of pointers to buckets, indexed using bits of the hash
value; its size is 2 raised to the global depth.
2. Global Depth: This is the number of bits in the hash value used to index
into the directory. The global depth determines the size of the directory.
3. Local Depth: Each bucket has its own local depth, which indicates how
many bits from the hash value are used for that particular bucket.
1. Insertion:
Step 6 – Insertion and Overflow Check: Insert the element and check if
the bucket overflows. If an overflow is encountered, go to step 7 followed
by Step 8, otherwise, go to step 9.
o Case2: In case the local depth is less than the global depth, then
only Bucket Split takes place. Then increment only the local depth
value by 1. And, assign appropriate pointers.
2. Search:
Use the global depth to find the appropriate bucket and retrieve
the key.
3. Deletion:
Hash Function: Suppose the global depth is N. Then the Hash Function
returns N LSBs.
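The "N LSBs" hash function above can be sketched with a bit mask (Python; the function name is illustrative):

```python
def lsb_hash(key, global_depth):
    """Return the global_depth least-significant bits of the key,
    used to index the extendible-hashing directory."""
    return key & ((1 << global_depth) - 1)   # mask off all but N low bits

# 22 is 10110 in binary
print(lsb_hash(22, 1))   # 0  (last bit)
print(lsb_hash(22, 2))   # 2  (last two bits: 10)
print(lsb_hash(16, 2))   # 0  (10000 -> 00)
```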
Inserting 16:
Inserting 4 and 6:
Both 4 (100) and 6 (110) have 0 in their LSB. Hence, they are hashed as follows:
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed
by directory 0 is already full. Hence, Over Flow occurs.
As directed by Step 7 - Case 1, since Local Depth = Global Depth, the bucket
splits and directory expansion takes place. Also, rehashing of the numbers present
in the overflowing bucket takes place after the split. And since the global depth
is incremented by 1, the global depth is now 2. Hence 16, 4, 6, 22 are now
rehashed w.r.t. 2 LSBs. [16 (10000), 4 (100), 6 (110), 22 (10110)]
Inserting 20: Insertion of data element 20 (10100) will again cause the
overflow problem.
The bucket overflows and, as directed by Step 7 - Case 2, since the local depth
of the bucket < global depth (2 < 3), the directory is not doubled; only the
overflowing bucket is split and its local depth is incremented by 1.
Advantages:
Flexible Size: The directory can grow or shrink based on the data,
allowing for efficient space utilization.
Disadvantages:
UNIVERSAL HASHING:
Definition:
A family of hash functions H is said to be universal if, for any two distinct keys
x and y, the probability that they collide (i.e., hash to the same value) when a
hash function h from this family is chosen at random is low. Formally, a family
of hash functions H is universal if:
P(h(x)=h(y))≤1/m
for any x≠y in the input set, where m is the number of possible hash values
(i.e., the size of the hash table).
A common choice of family is h(x) = ((a⋅x) mod p) mod m, where p is a prime
larger than any key and a is chosen uniformly at random from {1, ..., p-1}.
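Drawing a random member of this family can be sketched as follows (Python; the prime p = 2^31 - 1 and table size m = 100 are illustrative choices, not from the notes):

```python
import random

p = 2_147_483_647   # a large prime (2^31 - 1); must exceed every key
m = 100             # table size

def make_universal_hash():
    """Draw h(x) = ((a*x) mod p) mod m with a chosen at random."""
    a = random.randint(1, p - 1)
    return lambda x: ((a * x) % p) % m

h = make_universal_hash()
print(all(0 <= h(x) < m for x in (17, 55, 131)))   # True
```

Because a is chosen at random per table, an adversary cannot pick a worst-case key set in advance, which is the adversarial-resistance advantage listed below.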
Advantages:
2. Adversarial Resistance:
3. Simplicity:
4. Theoretical Foundations:
Disadvantages:
1. Randomness Requirement:
o Universal hashing often requires the use of random values (like the
random integer a), which can complicate implementation,
especially in deterministic settings.
2. Performance Overhead:
1. Insertion:
2. Deletion:
3. Search:
4. Update: