
UNIT V

SEARCHING, SORTING AND HASHING TECHNIQUES


Searching
Searching is the process of finding the position of a given value in a list of
values. It is an important part of many data structure algorithms, since an
operation can be performed on an element only after it has been found.
Various algorithms have been defined to determine whether an element is
present in a collection of items. These algorithms can be applied to both
internal and external data structures. Efficient searching improves the
efficiency of any algorithm that relies on it.
Searching Techniques
To search for an element in a given array, it can be done in the following ways:
1. Linear Search
2. Binary Search
1. Linear Search
● Sequential search is also called Linear Search.
● Sequential search starts at the beginning of the list and checks every
element of the list.
● It is a basic and simple search algorithm.
● Sequential search compares the target with each element in the list. If
the element is found, it returns its index; otherwise it returns -1.

The above figure shows how sequential search works. It scans the array
element by element until the desired value is found. To search for the
element 25, it proceeds step by step in sequential order. Sequential search
is applied on an unsorted or unordered list when there are fewer elements
in the list.

The following code snippet shows the sequential search operation:


function searchValue(value, target)
{
    // scan every element until the target is found
    for (var i = 0; i < value.length; i++)
    {
        if (value[i] == target)
        {
            return i;       // return the index of the match
        }
    }
    return -1;              // target not present
}
searchValue([10, 5, 15, 20, 25, 35], 25); // call the function with the array
                                          // and the number to be searched

Example: Program for Linear Search


#include <stdio.h>

int main()
{
    int arr[50], search, cnt, num;

    printf("Enter the number of elements in array\n");
    scanf("%d", &num);
    printf("Enter %d integer(s)\n", num);
    for (cnt = 0; cnt < num; cnt++)
        scanf("%d", &arr[cnt]);

    printf("Enter the number to search\n");
    scanf("%d", &search);

    for (cnt = 0; cnt < num; cnt++)
    {
        if (arr[cnt] == search) /* if required element found */
        {
            printf("%d is present at location %d.\n", search, cnt + 1);
            break;
        }
    }
    if (cnt == num)
        printf("%d is not present in array.\n", search);
    return 0;
}
Output (sample run):
Enter the number of elements in array
5
Enter 5 integer(s)
10 5 15 20 25
Enter the number to search
20
20 is present at location 4.

2. Binary Search
● Binary Search is used for searching an element in a sorted array.
● It is a fast search algorithm with run-time complexity of O(log n).
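The notes state only the idea; as a minimal sketch in C (assuming a sorted
ascending array; the driver values below are illustrative, not from the
notes):

#include <stdio.h>

/* Return the index of target in the sorted array arr[0..n-1], or -1. */
int binarySearch(int arr[], int n, int target)
{
    int low = 0, high = n - 1;
    while (low <= high)
    {
        int mid = low + (high - low) / 2;   /* avoids overflow of (low+high)/2 */
        if (arr[mid] == target)
            return mid;                     /* found */
        else if (arr[mid] < target)
            low = mid + 1;                  /* search the right half */
        else
            high = mid - 1;                 /* search the left half */
    }
    return -1;                              /* not present */
}

int main()
{
    int arr[] = {5, 10, 15, 20, 25, 35};
    int n = sizeof(arr) / sizeof(arr[0]);
    printf("%d\n", binarySearch(arr, n, 25));   /* prints 4 */
    return 0;
}

Each probe halves the remaining range, which is where the O(log n)
run-time comes from.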



Sorting
Sorting refers to arranging data in a particular format. A sorting algorithm
specifies the way to arrange data in a particular order. The most common
orders are numerical and lexicographical.
The importance of sorting lies in the fact that data searching can be
optimized to a very high level, if data is stored in a sorted manner. Sorting
is also used to represent data in more readable formats. Following are some
of the examples of sorting in real-life scenarios −
● Telephone Directory − The telephone directory stores the telephone
numbers of people sorted by their names, so that the names can be
searched easily.
● Dictionary − The dictionary stores words in an alphabetical order so
that searching of any word becomes easy.
In-place Sorting and Not-in-place Sorting
Sorting algorithms may require some extra space for comparison and
temporary storage of a few data elements. Algorithms that do not require
any extra space and sort the data within the array itself are said to sort
in-place. Bubble sort is an example of in-place sorting.
However, in some sorting algorithms, the program requires space which is
more than or equal to the number of elements being sorted. Sorting which
uses equal or more space is called not-in-place sorting. Merge sort is an
example of not-in-place sorting.
Stable and Not Stable Sorting
If a sorting algorithm, after sorting the contents, does not change the
relative order of elements with equal keys, it is called stable sorting.
If a sorting algorithm, after sorting the contents, changes the relative
order of elements with equal keys, it is called unstable sorting.

Bubble sort
Bubble sort is a simple sorting algorithm. This sorting algorithm is a
comparison-based algorithm in which each pair of adjacent elements is
compared and the elements are swapped if they are not in order. This
algorithm is not suitable for large data sets as its average and worst case
complexity are O(n²), where n is the number of items.
Because bubble sort takes O(n²) time, we keep the example short and
precise.
How Bubble Sort Works?
1. Starting from the first index, compare the first and the second elements.
If the first element is greater than the second element, they are swapped.
2. Now, compare the second and the third elements. Swap them if they are
not in order. This process goes on until the last element.
3. The same process goes on for the remaining iterations. After each
iteration, the largest element among the unsorted elements is placed at
the end.
4. In each iteration, the comparison takes place only up to the last
unsorted element.
5. The array is sorted when all the unsorted elements are placed at their
correct positions.
Bubble Sort Algorithm

bubbleSort(array)
  for pass <- 1 to length(array) - 1
    for i <- 1 to indexOfLastUnsortedElement - 1
      if leftElement > rightElement
        swap leftElement and rightElement
end bubbleSort
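A runnable C version of the pseudocode above (a minimal sketch; the array
values are illustrative):

#include <stdio.h>

void bubbleSort(int arr[], int n)
{
    int pass, i, temp;
    for (pass = 0; pass < n - 1; pass++)          /* one pass per element */
    {
        for (i = 0; i < n - 1 - pass; i++)        /* up to last unsorted element */
        {
            if (arr[i] > arr[i + 1])              /* adjacent pair out of order */
            {
                temp = arr[i];
                arr[i] = arr[i + 1];
                arr[i + 1] = temp;
            }
        }
    }
}

int main()
{
    int arr[] = {14, 33, 27, 35, 10};
    int n = sizeof(arr) / sizeof(arr[0]), i;
    bubbleSort(arr, n);
    for (i = 0; i < n; i++)
        printf("%d ", arr[i]);   /* prints: 10 14 27 33 35 */
    return 0;
}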

Selection Sort
Selection sort is a simple sorting algorithm. This sorting algorithm is an
in-place comparison-based algorithm in which the list is divided into two
parts, the sorted part at the left end and the unsorted part at the right end.
Initially, the sorted part is empty and the unsorted part is the entire list.
The smallest element is selected from the unsorted array and swapped with
the leftmost element, and that element becomes a part of the sorted array.
This process continues moving unsorted array boundary by one element to
the right.
This algorithm is not suitable for large data sets as its average and worst
case complexities are of O(n²), where n is the number of items.
Working of Selection sort
Consider the following depicted array as an example.

For the first position in the sorted list, the whole list is scanned
sequentially. 14 is currently stored at the first position; after searching
the whole list, we find that 10 is the lowest value.

So we swap 14 with 10. After one iteration 10, which happens to be the
minimum value in the list, appears in the first position of the sorted list.

For the second position, where 33 is residing, we start scanning the rest of
the list in a linear manner.

We find that 14 is the second lowest value in the list and it should appear at
the second place. We swap these values.

After two iterations, the two smallest values are positioned at the
beginning in sorted order.

The same process is applied to the rest of the items in the array.
Pseudocode
procedure selectionSort
   list : array of items
   n : size of list

   for i = 1 to n - 1
      /* set current element as minimum */
      min = i

      /* check the rest of the list for a smaller element */
      for j = i+1 to n
         if list[j] < list[min] then
            min = j
         end if
      end for

      /* swap the minimum element with the current element */
      if min != i then
         swap list[min] and list[i]
      end if
   end for

end procedure
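The pseudocode translates directly to C; here is a minimal runnable sketch
(array values are illustrative):

#include <stdio.h>

void selectionSort(int arr[], int n)
{
    int i, j, min, temp;
    for (i = 0; i < n - 1; i++)
    {
        min = i;                        /* assume current element is the minimum */
        for (j = i + 1; j < n; j++)     /* scan the unsorted part */
            if (arr[j] < arr[min])
                min = j;
        if (min != i)                   /* swap the minimum into position i */
        {
            temp = arr[i];
            arr[i] = arr[min];
            arr[min] = temp;
        }
    }
}

int main()
{
    int arr[] = {14, 33, 27, 10, 35, 19, 42, 44};
    int n = sizeof(arr) / sizeof(arr[0]), i;
    selectionSort(arr, n);
    for (i = 0; i < n; i++)
        printf("%d ", arr[i]);   /* prints: 10 14 19 27 33 35 42 44 */
    return 0;
}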

Insertion sort
This is an in-place comparison-based sorting algorithm. Here, a sub-list is
maintained which is always sorted. For example, the lower part of an array
is maintained to be sorted. An element which is to be inserted into this
sorted sub-list has to find its appropriate place and then be inserted
there. Hence the name, insertion sort.
The array is searched sequentially and unsorted items are moved and
inserted into the sorted sub-list (in the same array). This algorithm is not
suitable for large data sets as its average and worst case complexity are of
O(n²), where n is the number of items.
Working of Insertion sort
We take an unsorted array for our example.

Insertion sort compares the first two elements.

It finds that both 14 and 33 are already in ascending order. For now, 14 is
in the sorted sub-list.

Insertion sort moves ahead and compares 27 with 33.

And finds that 33 is not in the correct position.

It swaps 33 with 27. It also checks with all the elements of the sorted
sub-list. Here we see that the sorted sub-list has only one element, 14, and
27 is greater than 14. Hence, the sorted sub-list remains sorted after
swapping.

By now we have 14 and 27 in the sorted sub-list. Next, it compares 10 with 33.

These values are not in a sorted order.

So 33 is moved to position 3 and 10 is compared with 27.


However, swapping makes 27 and 10 unsorted.

So 27 is moved to position 2 and 10 is compared with 14.

Again we find 14 and 10 in unsorted order.

So 14 is moved to position 1 and 10 is inserted at position 0. By the end of
the third iteration, we have a sorted sub-list of 4 items.

This process goes on until all the unsorted values are covered in a sorted sub-list.
Now we shall see some programming aspects of insertion sort.
Pseudocode
procedure insertionSort( A : array of items )
   int holePosition
   int valueToInsert

   for i = 1 to length(A) - 1 do:

      /* select value to be inserted */
      valueToInsert = A[i]
      holePosition = i

      /* locate hole position for the element to be inserted */
      while holePosition > 0 and A[holePosition-1] > valueToInsert do:
         A[holePosition] = A[holePosition-1]
         holePosition = holePosition - 1
      end while

      /* insert the number at hole position */
      A[holePosition] = valueToInsert

   end for

end procedure
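In C, the same hole-shifting logic looks like this (a minimal sketch; array
values are illustrative):

#include <stdio.h>

void insertionSort(int arr[], int n)
{
    int i, holePosition, valueToInsert;
    for (i = 1; i < n; i++)
    {
        valueToInsert = arr[i];        /* value to place in the sorted sub-list */
        holePosition = i;
        /* shift larger sorted elements one place to the right */
        while (holePosition > 0 && arr[holePosition - 1] > valueToInsert)
        {
            arr[holePosition] = arr[holePosition - 1];
            holePosition--;
        }
        arr[holePosition] = valueToInsert;  /* drop the value into the hole */
    }
}

int main()
{
    int arr[] = {14, 33, 27, 10, 35, 19, 42, 44};
    int n = sizeof(arr) / sizeof(arr[0]), i;
    insertionSort(arr, n);
    for (i = 0; i < n; i++)
        printf("%d ", arr[i]);   /* prints: 10 14 19 27 33 35 42 44 */
    return 0;
}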
Quick Sort
The quick sort uses divide and conquer to gain the same advantages as the
merge sort, while not using additional storage. As a trade-off, however, it is
possible that the list may not be divided in half. When this happens, we will
see that performance is diminished.
A quick sort first selects a value, which is called the pivot value. Although
there are many different ways to choose the pivot value, we will simply use
the first item in the list. The role of the pivot value is to assist with splitting
the list. The actual position where the pivot value belongs in the final sorted
list, commonly called the split point, will be used to divide the list for
subsequent calls to the quick sort.
Figure 1 shows that 54 will serve as our first pivot value. Since we have
looked at this example a few times already, we know that 54 will eventually
end up in the position currently holding 31. The partition process will
happen next. It will find the split point and at the same time move other
items to the appropriate side of the list, either less than or greater than the
pivot value.

Figure 1: The First Pivot Value for a Quick Sort

Partitioning begins by locating two position markers, call
them leftmark and rightmark, at the beginning and end of the remaining
items in the list (positions 1 and 8 in Figure 2). The goal of the partition
process is to move items that are on the wrong side with respect to the
pivot value while also converging on the split point. Figure 2 shows this
process as we locate the position of 54.
Figure 2: Finding the Split Point for 54

We begin by incrementing leftmark until we locate a value that is greater
than the pivot value. We then decrement rightmark until we find a value
that is less than the pivot value. At this point we have discovered two items
that are out of place with respect to the eventual split point. For our
example, this occurs at 93 and 20. Now we can exchange these two items
and then repeat the process again.
At the point where rightmark becomes less than leftmark, we stop. The
position of rightmark is now the split point. The pivot value can be
exchanged with the contents of the split point and the pivot value is now in
place (Figure 3). In addition, all the items to the left of the split point are less
than the pivot value, and all the items to the right of the split point are
greater than the pivot value. The list can now be divided at the split point
and the quick sort can be invoked recursively on the two halves.

Figure 3: Completing the Partition Process to Find the Split Point for 54

Python implementation

def quickSort(alist, first, last):
    if first < last:
        splitpoint = partition(alist, first, last)
        quickSort(alist, first, splitpoint - 1)   # sort the left part
        quickSort(alist, splitpoint + 1, last)    # sort the right part

def partition(alist, first, last):
    pivotvalue = alist[first]          # use the first item as the pivot
    leftmark = first + 1
    rightmark = last
    done = False
    while not done:
        # move leftmark right past items less than or equal to the pivot
        while leftmark <= rightmark and alist[leftmark] <= pivotvalue:
            leftmark = leftmark + 1
        # move rightmark left past items greater than or equal to the pivot
        while alist[rightmark] >= pivotvalue and rightmark >= leftmark:
            rightmark = rightmark - 1
        if rightmark < leftmark:       # markers crossed: split point found
            done = True
        else:                          # exchange the two out-of-place items
            temp = alist[leftmark]
            alist[leftmark] = alist[rightmark]
            alist[rightmark] = temp
    # exchange the pivot with the item at the split point
    temp = alist[first]
    alist[first] = alist[rightmark]
    alist[rightmark] = temp
    return rightmark

alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
quickSort(alist, 0, len(alist) - 1)
print(alist)   # [17, 20, 26, 31, 44, 54, 55, 77, 93]

Hashing Techniques
General Idea
▪ The ideal hash table data structure is an array of some fixed size,
containing the items
▪ A search is performed based on key
▪ Each key is mapped into some position in the range 0 to TableSize-1
▪ The mapping is called hash function

A hash table

Usually, m << N, where m is the table size and N is the number of possible
keys.
h(Ki) = an integer in [0, ..., m-1], called the hash value of Ki.
Hashing:
A function that transforms a key into a table index is called a hash
function. This mapping process is called hashing.
Collision
– Two keys may hash to the same slot.
– Can we ensure that any two distinct keys get different cells?
• No, not if N > m, where m is the size of the hash table.
Solution
⮚ Task 1: Design a good hash function
– That is fast to compute and
– Can minimize the number of collisions
– Distribute the keys evenly among the cells

⮚ Task 2: Design a method to resolve the collisions when they occur

Design Hash Function
▪ A simple and reasonable strategy: h(k) = k mod m
▪ e.g. m=12, k=100, h(k)=4
▪ Requires only a single division operation (quite fast)
▪ Certain values of m should be avoided
▪ e.g. if the table size is 10 and the keys all end in zero, then
h(k) = k mod 10 maps every key to slot 0, so the standard hash
function is a bad choice
▪ It's a good practice to set the table size m to be a prime number
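As a minimal illustration of the division method (the function name below
is ours, not part of the original notes):

/* Division-method hash: h(k) = k mod m, where m is ideally prime */
int hash(int k, int m)
{
    return k % m;
}
/* e.g. hash(100, 12) returns 4, as in the example above */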
Deal with String-type Keys
Method 1: Add up the ASCII values of the characters in the string
(Sum of the ASCII values) % Tablesize
Problems:
▪ Different permutations of the same set of characters have the
same hash value
▪ e.g. the keys "maytas" and "satyam" give the same hash value
▪ If the table size is large, the keys are not distributed well.
e.g. Suppose m = 10007 and all the keys are eight or fewer characters long.
Since an ASCII value is <= 127, the hash function can only assume values
between 0 and 127 * 8 = 1016.

Method 2
Hash on the first three characters of the key (the key is assumed to have
at least two characters plus the NULL terminator).
– If the first 3 characters are random and the table size is
10,007, this gives a reasonably equitable distribution.
Problem
• English is not random.
• Only 28 percent of the table can actually be hashed to (assuming a
table size of 10,007).

Method 3
Involve all the characters in the key; this can be expected to distribute
well.
For KeySize = 5, compute
k[4] * 1 + k[3] * 32 + k[2] * 32^2 + k[1] * 32^3 + k[0] * 32^4
which, by Horner's rule, is evaluated as
((((k[0] * 32) + k[1]) * 32 + k[2]) * 32 + k[3]) * 32 + k[4]
Coding

int Hash(char *key, int Tablesize)
{
    unsigned int hashval = 0;
    /* hashval << 5 multiplies by 32; add the next character each step */
    while (*key != '\0')
        hashval = (hashval << 5) + *key++;
    return hashval % Tablesize;
}
Collision Resolution Policies
▪ Two classes:
▪ (1) Open hashing --- separate chaining
▪ (2) Closed hashing --- open addressing
▪ Open hashing --- separate chaining
▪ collisions are stored outside the table
▪ Closed hashing --- open addressing
▪ collisions result in storing one of the records at another slot in
the table
Collision Handling:
(1) Separate Chaining
▪ Instead of a hash table of elements, we use a table of linked lists
▪ keep a linked list of all keys that hash to the same value
Keys:
Set of squares,
e.g. 1, 4, 9, 16, 25, 36, 49, 64, 81
Hash function:
h(K) = K mod 10
Separate Chaining - Insertion
To insert a key K:
– Compute h(K) to determine which list to traverse
– If T[h(K)] contains a null pointer, initialize this entry to point to
a linked list that contains K alone.
– If T[h(K)] is a non-empty list, add K at the beginning of this
list.
Coding
HashTable ADT

struct listnode
{
    elementtype element;
    struct listnode *next;
};
typedef struct listnode *List;

struct hashtbl
{
    int Tablesize;
    List *Thelist;     /* array of list headers */
};
typedef struct hashtbl *Hashtable;
initialize()

Hashtable initializetable(int Tablesize)
{
    Hashtable H;
    int i;
    H = malloc(sizeof(struct hashtbl));              /* allocate table */
    H->Tablesize = nextprime(Tablesize);
    /* allocate array of lists */
    H->Thelist = malloc(sizeof(List) * H->Tablesize);
    for (i = 0; i < H->Tablesize; i++)               /* allocate list headers */
    {
        H->Thelist[i] = malloc(sizeof(struct listnode));
        H->Thelist[i]->next = NULL;
    }
    return H;
}
Find()

List Find(elementtype key, Hashtable H)
{
    List P, L;
    L = H->Thelist[ Hash(key, H->Tablesize) ];   /* header of the proper list */
    P = L->next;                                 /* first real node */
    while (P != NULL && P->element != key)
        P = P->next;
    return P;                                    /* NULL if key is not found */
}
Insertion()

void Insert(elementtype key, Hashtable H)
{
    List L, pos, newcell;
    pos = Find(key, H);
    if (pos == NULL)                             /* key not already present */
    {
        newcell = malloc(sizeof(struct listnode));
        L = H->Thelist[ Hash(key, H->Tablesize) ];
        newcell->next = L->next;                 /* insert at head of the list */
        newcell->element = key;
        L->next = newcell;
    }
}
Deletion
To delete a key K
compute h(K), then search for K within the list at T[h(K)]. Delete K if it is
found.
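The notes give no code for deletion; a minimal sketch consistent with the
ADT above (with its header nodes) might be:

void Delete(elementtype key, Hashtable H)
{
    List L, P, tmp;
    L = H->Thelist[ Hash(key, H->Tablesize) ];   /* header node of the list */
    P = L;
    while (P->next != NULL && P->next->element != key)
        P = P->next;                             /* stop just before the match */
    if (P->next != NULL)                         /* key found: unlink and free it */
    {
        tmp = P->next;
        P->next = tmp->next;
        free(tmp);
    }
}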
Load Factor
Load Factor: the load factor λ of a hash table is the ratio
λ = (number of elements) / (table size)
Analysis of search, with chaining:
Time for search = time to evaluate the hash function + time to traverse
the list
– Unsuccessful search: λ
(the average length of the list at hash(i))
– Successful search: 1 + (λ/2)
(one node, plus half the average length of the list,
not including the item itself)
Separate Chaining Features
▪ Table Size:
As large as the number of elements expected (so that λ ≈ 1)
Keep the table size prime
• Disadvantage:
Memory allocation in linked list manipulation will slow
down the program.
• Advantage: deletion is easy.
Collision Handling:
(2) Open Addressing
▪ Instead of following pointers, compute the sequence of slots to be
examined
▪ Open addressing: relocate the key K to be inserted if it collides with
an existing key.
▪ We store K at an entry different from T[h(K)].
▪ Two issues arise
▪ what is the relocation scheme?
▪ how to search for K later?
▪ Three common methods for resolving a collision in open addressing
▪ Linear probing
▪ Quadratic probing
▪ Double hashing
Open Addressing Strategy
To insert a key K, compute h0(K).
If T[h0(K)] is empty, insert it there.
If a collision occurs, probe alternative cells h1(K), h2(K), ... until an
empty cell is found.
Table size: a bigger table is needed than for separate chaining,
i.e. keep the load factor around λ = 0.5
Hash Function
hi(K) = (hash(K) + f(i)) mod m, with f(0) = 0,
where f is the collision resolution strategy
Linear Probing
f is a linear function of i: f(i) = i
Cells are probed sequentially (with wrap-around):
hi(K) = ( hash(K) + i ) mod m
Insertion:
Let K be the new key to be inserted. Compute hash(K).
For i = 0 to m-1:
compute L = ( hash(K) + i ) mod m
If T[L] is empty, then we put K there and stop.
If we cannot find an empty entry to put K, it means that the table is full and
we should report an error.
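A minimal sketch of this insertion loop in C (the array-based interface and
the EMPTY sentinel are our illustrative assumptions, not part of the
original notes):

#define EMPTY -1   /* assumed sentinel for an unoccupied slot */

/* Insert key into table T[0..m-1] by linear probing; return slot or -1 if full. */
int linearProbeInsert(int T[], int m, int key)
{
    int i;
    for (i = 0; i < m; i++)
    {
        int L = (key % m + i) % m;   /* hash(K) = K mod m, then step forward */
        if (T[L] == EMPTY)
        {
            T[L] = key;
            return L;
        }
    }
    return -1;   /* table full */
}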
Example
hi(K) = (hash(K) + i) mod m
E.g, inserting keys 89, 18, 49, 58, 69 with hash(K)=K mod 10

Primary Clustering
● A block of contiguously occupied table entries is a cluster.
● On the average, when we insert a new key K, we may hit the middle of
a cluster. Therefore, the time to insert K would be proportional to half
the size of a cluster. That is, the larger the cluster, the slower the
performance.
● Linear probing has the following disadvantages:
o Once h(K) falls into a cluster, this cluster will definitely grow in
size by one. Thus, this may worsen the performance of
insertion in the future.
o If two clusters are only separated by one entry, then inserting
one key into a cluster can merge the two clusters together.
Thus, the cluster size can increase drastically by a single
insertion. This means that the performance of insertion can
deteriorate drastically after a single insertion.
o Large clusters are easy targets for collisions.
Analysis of Linear Probing
Expected number of probes:
– Unsuccessful search & insertion: ≈ (1/2)(1 + 1/(1-λ)^2)
– Successful search: ≈ (1/2)(1 + 1/(1-λ))
The number of probes for a successful search equals the number of probes
that were required to insert the element, and the probes for insertion equal
the probes for an unsuccessful search.
Linear probing:
if λ = 0.5: 2.5 probes for unsuccessful search & insertion, 1.5 probes for
successful search
if λ = 0.75: 8.5 probes are expected for insertion
if λ = 0.9: 50 probes are expected for insertion
Random collision strategy (expected probes for insertion ≈ 1/(1-λ)):
if λ = 0.75: 4 probes are expected for insertion
if λ = 0.9: 10 probes are expected for insertion
Therefore, linear probing is a bad idea for λ greater than 0.5.
Quadratic Probing Example
f(i) = i^2
hi(K) = ( hash(K) + i^2 ) mod m
E.g., inserting keys 89, 18, 49, 58, 69 with hash(K) = K mod 10
To insert 58, probe T[8], T[9], T[(8+4) mod 10]
To insert 69, probe T[9], T[(9+1) mod 10], T[(9+4) mod 10]
Two keys with different home positions will have different probe sequences
e.g. m=101, h(k1)=30, h(k2)=29
probe sequence for k1: 30,30+1, 30+4, 30+9
If the table size is prime and the table is at least half empty, then a new
key can always be inserted.
Deletion is not easy:
if we remove 89, then all of the remaining finds will fail.
Theorem
If the table size is prime and the table is at least half empty, then a new
key can always be inserted.
Proof: Let the table size m be an odd prime greater than 3. We show that
the first ⌈m/2⌉ alternative locations are all distinct. Two of these locations
are
(h(x) + i^2) mod m and (h(x) + j^2) mod m, where 0 < i, j <= ⌊m/2⌋
Suppose, for the sake of contradiction, that these locations are the same
but i != j. Then
(h(x) + i^2) ≡ (h(x) + j^2) (mod m)
i^2 ≡ j^2 (mod m)
i^2 - j^2 ≡ 0 (mod m)
(i - j)(i + j) ≡ 0 (mod m)
▪ Since m is prime, it follows that either (i - j) or (i + j) must be
0 mod m.
▪ Since i and j are distinct, the first option is not possible.
▪ Since 0 < i, j <= ⌊m/2⌋, the second option is also impossible.
▪ Thus the first ⌈m/2⌉ alternative locations are distinct.
Secondary clustering
● Keys that hash to the same home position will probe the same
alternative cells
● Simulation results suggest that it generally causes less than an extra
half probe per search
● To avoid secondary clustering, the probe sequence needs to be a
function of the original key value, not just the home position.
enum kindofentry { Legitimate, Empty, Deleted };

typedef struct hashentry
{
    elementtype element;
    enum kindofentry Info;      /* slot state */
} cell;

typedef struct hashtbl
{
    int Tablesize;
    cell *Thecells;
} *Hashtable;

int Find(elementtype key, Hashtable H);   /* forward declaration */

Hashtable initializetable(int Tablesize)
{
    Hashtable H;
    int i;
    H = malloc(sizeof(struct hashtbl));            /* allocate table */
    H->Tablesize = nextprime(Tablesize);
    /* allocate array of cells */
    H->Thecells = malloc(sizeof(cell) * H->Tablesize);
    for (i = 0; i < H->Tablesize; i++)
        H->Thecells[i].Info = Empty;
    return H;
}

void Insert(elementtype key, Hashtable H)
{
    int pos;
    pos = Find(key, H);
    if (H->Thecells[pos].Info != Legitimate)       /* not already present */
    {
        H->Thecells[pos].Info = Legitimate;
        H->Thecells[pos].element = key;
    }
}

int Find(elementtype key, Hashtable H)
{
    int currentpos, collisionnum = 0;
    currentpos = Hash(key, H->Tablesize);
    while (H->Thecells[currentpos].Info != Empty &&
           H->Thecells[currentpos].element != key)
    {
        /* quadratic probing computed incrementally: f(i) = f(i-1) + 2i - 1 */
        currentpos += 2 * ++collisionnum - 1;
        if (currentpos >= H->Tablesize)
            currentpos -= H->Tablesize;
    }
    return currentpos;
}
hi(K) = ( hash(K) + i^2 ) mod m, computed incrementally since
f(i) = f(i-1) + 2i - 1
Double Hashing
● To alleviate the problem of clustering, the sequence of probes for a key
should be independent of its primary position => use two hash
functions: hash() and hash2()
● f(i) = i * hash2(K)
o E.g. hash2(K) = R - (K mod R), where R is a prime smaller than m
hi(K) = ( hash(K) + f(i) ) mod m; hash(K) = K mod m
f(i) = i * hash2(K); hash2(K) = R - (K mod R)
Example: m=10, R = 7 and insert keys 89, 18, 49, 58, 69
To insert 49, hash2(49)=7, 2nd probe is T[(9+7) mod 10]
To insert 58, hash2(58)=5, 2nd probe is T[(8+5) mod 10]
To insert 69, hash2(69)=1, 2nd probe is T[(9+1) mod 10]
hash2() must never evaluate to zero.
For any key K, hash2(K) must be relatively prime to the table size m.
Otherwise, we will only be able to examine a fraction of the table entries.
E.g., if hash(K) = 0 and hash2(K) = m/2, then we can only examine the
entries T[0] and T[m/2], and nothing else!
One solution is to make m prime, choose R to be a prime smaller than m,
and set
hash2(K) = R - (K mod R)
Quadratic probing, however, does not require the use of a second hash
function and is therefore likely to be simpler and faster in practice.
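A minimal sketch of the double-hashing probe in C (m = 10 and R = 7 match
the example above; the EMPTY sentinel and array interface are illustrative
assumptions):

#define EMPTY -1   /* assumed sentinel for an unoccupied slot */

/* Insert key into T[0..m-1] by double hashing; return slot or -1 if full. */
int doubleHashInsert(int T[], int m, int R, int key)
{
    int hash1 = key % m;
    int hash2 = R - (key % R);   /* step size: never zero */
    int i;
    for (i = 0; i < m; i++)
    {
        int pos = (hash1 + i * hash2) % m;
        if (T[pos] == EMPTY)
        {
            T[pos] = key;
            return pos;
        }
    }
    return -1;   /* table full */
}

For example, with m = 10 and R = 7, inserting 49 gives hash1 = 9 and
hash2 = 7, so the second probe is T[(9+7) mod 10], as in the text.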
Deletion in Open Addressing
Actual deletion cannot be performed in open addressing hash tables;
otherwise, it will isolate records further down the probe sequence.
Solution: add an extra state to each table entry and mark a deleted slot by
storing a special value DELETED (a "tombstone"); this is called lazy
deletion.
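Using the cell structure from the quadratic probing code above, a minimal
lazy-deletion sketch might be:

void Delete(elementtype key, Hashtable H)
{
    int pos = Find(key, H);                  /* probe to the key's slot */
    if (H->Thecells[pos].Info == Legitimate)
        H->Thecells[pos].Info = Deleted;     /* tombstone keeps probe chains intact */
}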
Extendible Hashing

Extendible hashing is a dynamically updateable disk-based index structure


which implements a hashing scheme utilizing a directory.
Extendible hashing uses a directory to access its buckets. This directory is
usually small enough to be kept in main memory and has the form of an
array with 2^d entries, each entry storing a bucket address (a pointer to a
bucket). The variable d is called the global depth of the directory. To decide
where a key k is stored, extendible hashing uses the last d bits of some
adopted hash function h(k) to choose the directory entry. Multiple directory
entries may point to the same bucket. Every bucket has a local depth
d' <= d. The difference between local depth and global depth affects
overflow handling.
An example of extendible hashing is shown in Fig. 1. Here there are four
directory entries and four buckets. The global depth and all four local
depths are 2. For simplicity, assume the adopted hash function is h(k) = k.
For instance, to search for record 15, one refers to directory entry
15 mod 4 = 3 (or 11 in binary format), which points to bucket D.
Fig 1 Illustration of the extendible hashing
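As a minimal sketch of the directory lookup just described (assuming, as in
the example, that the adopted hash function is h(k) = k):

/* Choose the directory entry for key k using the last d bits of h(k). */
unsigned int directoryIndex(unsigned int k, unsigned int d)
{
    unsigned int h = k;              /* adopted hash function: h(k) = k */
    return h & ((1u << d) - 1);      /* mask keeps the last d bits */
}
/* e.g. directoryIndex(15, 2) == 3 (binary 11), which points to bucket D */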

Fig 2 The directory doubles after inserting 63


Overflow Handling
If a bucket overflow happens, the bucket is split into two. The directory
may or may not double, depending on whether the local depth of the
overflown bucket was equal to the global depth before the split. If the local
depth was equal to the global depth, d bits are not enough to distinguish
the search values of the overflown bucket. Thus a directory doubling
occurs, which effectively uses one more bit from the hash value. The
directory size is then doubled (this does not mean that the number of
buckets is doubled, as buckets will share directory entries). As an example,
Fig. 2 illustrates extendible hashing after inserting a new record with key
63 into Fig. 1. Bucket D overflows and the records in it are redistributed
between D (where the last three bits of a record's hash value are 011) and
D' (where the last three bits of a record's hash value are 111). The
directory doubles. The global depth is increased by one. The local depths of
buckets D and D' are increased by one, while the local depth of the other
buckets remains two. Except for 111, which points to the new bucket D',
each of the new directory entries points to the existing bucket which
shares the last two bits. For instance, directory entry 101 points to the
bucket referenced by directory entry 001.
In general, if the local depth of a bucket is d', the number of directory
entries pointing to the bucket is 2^(d-d'). All these directory entries share
the last d' bits. To split an overflown bucket whose local depth is smaller
than the global depth, one does not need to double the size of the
directory. Instead, half of the 2^(d-d') directory entries will point to the
new bucket, and the local depths of both the overflown bucket and its split
image are increased by one. For instance, Fig. 3 illustrates the extendible
hashing after inserting 17 and 13 into Fig. 2. Bucket B overflows and a
split image, bucket B', is created. There are two directory entries (001 and
101) that pointed to B before the split. Half of them (101) now point to the
split image B'. The local depths of both buckets B and B' are increased by
one.

Fig 3. The directory does not double after inserting 17 and 13
