
UNIT III

Sorting and Searching-Bubble sort - selection sort - insertion sort - merge sort
- quick sort - linear search - binary search - hashing - hash functions - collision
handling - load factors, rehashing, and efficiency.
Sorting
Sorting is defined as the arrangement of data in a certain order. Sorting techniques
are used to arrange data (mostly numerical) in ascending or descending order. It is a
method used for representing data in a more comprehensible format.
Sorting a large amount of data can take a substantial amount of computing
resources if the methods we use to sort the data are inefficient. The running time of a
sorting algorithm grows with the number of items it must process.

Some of the real-life examples of sorting are:


 Telephone Directory: It is a book that contains telephone numbers and addresses of
people in alphabetical order.
 Dictionary: It is a huge collection of words along with their meanings in alphabetical
order.
 Contact List: It is a list of contact numbers of people in alphabetical order on a
mobile phone.

The different types of order are:

 Increasing Order: A set of values is said to be in increasing order when every
successive element is greater than its previous element. For example: 1, 2, 3, 4, 5.
Here, the given sequence is in increasing order.

 Decreasing Order: A set of values is said to be in decreasing order when every
successive element is less than the previous one. For example: 5, 4, 3, 2, 1.
Here, the given sequence is in decreasing order.

 Non-Increasing Order: A set of values is said to be in non-increasing order if every
ith element present in the sequence is less than or equal to its (i-1)th element. This
order occurs whenever some numbers are repeated. For example: 5, 4, 3, 2, 2, 1.
Here 2 is repeated two times.

 Non-Decreasing Order: A set of values is said to be in non-decreasing order if
every ith element present in the sequence is greater than or equal to its (i-1)th element.
This order occurs whenever some numbers are repeated. For example: 1, 2, 2,
3, 4, 5. Here 2 is repeated two times.

CATEGORIES OF SORTING

The techniques of sorting can be divided into two categories. These are:

 Internal Sorting
 External Sorting

Internal Sorting: If all the data to be sorted can fit into main memory at once, an internal
sorting method is used.

External Sorting: When the data to be sorted cannot be accommodated in memory all at the
same time and some of it has to be kept in auxiliary storage such as a hard disk, floppy disk,
or magnetic tape, external sorting methods are used.

COMPLEXITY OF SORTING ALGORITHM

The complexity of a sorting algorithm measures the running time as a function of the
number n of items to be sorted. The choice of a suitable sorting method for a problem
depends on several considerations. The most noteworthy of these are:

 The length of time spent by the programmer in writing a specific sorting program
 The amount of machine time necessary for running the program
 The amount of memory necessary for running the program

EFFICIENCY OF SORTING ALGORITHM

To estimate the amount of time required to sort an array of n elements by a particular
method, the normal approach is to analyze the method and find the number of comparisons (or
exchanges) it requires. Most sorting techniques are data sensitive, so these
metrics depend on the order in which the elements appear in the input array.

Sorting techniques are therefore analyzed in the following cases:

 Best case
 Worst case
 Average case
The result of these analyses is usually a formula giving the average time required for a
particular sort of size n. Most sort methods have time requirements that range from
O(n log n) to O(n²).

Sorting Techniques/Types

The different implementations of sorting techniques in Python are:


 Bubble Sort
 Selection Sort
 Insertion Sort
 Merge Sort
 Quick Sort

Bubble Sort

Bubble Sort is a simple sorting algorithm. It repeatedly
compares two adjacent elements and swaps them if they are in the wrong order. It is also
known as the sinking sort. It has a time complexity of O(n²) in the average and worst
cases and O(n) in the best case. Bubble sort can be visualized as a queue
where people arrange themselves by swapping with each other so that they all can stand in
ascending order of their heights. In other words, we compare two adjacent elements and
see if their order is wrong; if the order is wrong, we swap them (i.e., for an ascending
sort, we swap whenever a pair of adjacent elements satisfies arr[i] > arr[i + 1], and
vice versa for a descending sort).

Example
Here we sort the following sequence using bubble sort
Sequence: 2, 23, 10, 1
First Iteration
(2, 23, 10, 1) –> (2, 23, 10, 1): Here the first 2 elements are compared and remain the same
because they are already in ascending order.

(2, 23, 10, 1) –> (2, 10, 23, 1): Here the 2nd and 3rd elements are compared and swapped
(10 is less than 23) according to ascending order.

(2, 10, 23, 1) –> (2, 10, 1, 23): Here the 3rd and 4th elements are compared and swapped
(1 is less than 23) according to ascending order.
At the end of the first iteration, the largest element is at the rightmost position, which is
sorted correctly.

Second Iteration
(2, 10, 1, 23) –> (2, 10, 1, 23): Here again, the first 2 elements are compared and remain
the same because they are already in ascending order.
(2, 10, 1, 23) –> (2, 1, 10, 23): Here the 2nd and 3rd elements are compared and swapped
(1 is less than 10) in ascending order.
At the end of the second iteration, the second largest element is at the position adjacent to
the largest element.

Third Iteration
(2, 1, 10, 23) –> (1, 2, 10, 23): Here the first 2 elements are compared and swapped
according to ascending order.
The remaining elements were already sorted in the first and second iterations. After the
three iterations, the given array is sorted in ascending order. So the final result is 1, 2, 10,
23.
Example Program:1
def bubbleSort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]

arr = [2, 1, 10, 23]
bubbleSort(arr)

print("Sorted array is:")
for i in range(len(arr)):
    print("%d" % arr[i])

Output:
Sorted array is:
1
2
10
23
Example Program:2
List1 = [10, 15, 4, 23, 0]
print("Unsorted List:", List1)
for j in range(len(List1) - 1):
    for i in range(len(List1) - 1):
        if List1[i] > List1[i + 1]:
            List1[i], List1[i + 1] = List1[i + 1], List1[i]
print("Sorted list", List1)

OUTPUT:
Unsorted List: [10, 15, 4, 23, 0]
Sorted list [0, 4, 10, 15, 23]

Selection Sort

This sorting technique repeatedly finds the minimum element in the unsorted part and puts
it in order. Like bubble sort, selection sort does not occupy any extra memory space.
During the execution of this algorithm, two subarrays are maintained: the subarray which is
already sorted, and the remaining subarray which is unsorted. In every iteration of
selection sort, the minimum element of the unsorted subarray is moved into the sorted
subarray. Selection sort is more efficient than bubble sort because it performs far fewer
swaps. It has a time complexity of O(n²) in the average, worst, and best cases.

Example
Here we sort the following sequence using the selection sort
Sequence: 7, 2, 1, 6
(7, 2, 1, 6) –> (1, 7, 2, 6): In the first pass it finds the minimum element (i.e., 1), which is
placed at the 1st position.
(1, 7, 2, 6) –> (1, 2, 7, 6): In the second pass it finds the 2nd minimum element (i.e., 2),
which is placed at the 2nd position.
(1, 2, 7, 6) –> (1, 2, 6, 7): In the third pass it finds the next minimum element (i.e., 6),
which is placed at the 3rd position.
After the above iterations, the final array is in sorted order, i.e., 1, 2, 6, 7.
Example Program:1
def selectionSort(array, size):
    for s in range(size):
        min_idx = s
        for i in range(s + 1, size):
            if array[i] < array[min_idx]:
                min_idx = i
        (array[s], array[min_idx]) = (array[min_idx], array[s])

data = [7, 2, 1, 6]
size = len(data)
selectionSort(data, size)

print('Sorted Array in Ascending Order is :')
print(data)

Output:
Sorted Array in Ascending Order is :
[1, 2, 6, 7]
Insertion Sort
Insertion sort is a straightforward algorithm that is more efficient than the bubble sort
algorithm. The insertion sort concept is based on sorting a deck of cards, where we
sort the playing cards by inserting each card into its correct position. It has many
advantages, but there are more efficient algorithms available.

While playing cards, we compare the cards in our hand with each other. Most players
like to sort their cards in ascending order so they can quickly see which combinations they
have at their disposal.

The insertion sort implementation is easy and simple, which is why it is generally taught in
beginning programming lessons. It is an in-place and stable algorithm that is most beneficial
for nearly-sorted lists or lists with few elements.

The insertion sort algorithm is not very fast on large inputs because it uses nested loops to sort the elements.

What is the meaning of in-place and stable?

o In-place: An in-place algorithm requires only a constant amount of additional space,
regardless of the input size of the collection. After performing the sorting, it rewrites
the elements into the original memory locations of the collection.
o Stable: A stable algorithm preserves the relative order of equal elements from the
initial array.

More importantly, insertion sort doesn't need to know the array size in advance;
it can receive one element at a time.

The great thing about insertion sort is that if we insert more elements to be sorted, the
algorithm places each one in its proper place without performing a complete re-sort.

It is most efficient for small arrays (fewer than about 10 elements). Now, let's understand the concept
of insertion sort.

The Concept of Insertion Sort

In insertion sort, the array is virtually split into two parts - an unsorted
part and a sorted part.

The sorted part initially contains the first element of the array, and the unsorted part contains the
rest of the array. The first element of the unsorted part is compared against the sorted part so
that we can place it into its proper position in the sorted subarray.

It inserts each element by shifting all elements to its left that are greater than it one
position to the right.

This is repeated until every element has been inserted at its correct place.

The algorithm of insertion sort is as follows:
o Split the list into two parts - sorted and unsorted.
o Iterate from arr[1] to arr[n] over the given array.
o Compare the current element to its predecessor.
o If the current element is smaller than its predecessor, compare it to the elements before;
move the greater elements one position up to make space for the inserted element.

Example:

In the following array, we consider the first element as the initial sorted subarray.

[10, 4, 25, 1, 5]

The first step is to add 10 to the sorted subarray:

[10, 4, 25, 1, 5]

Now we take the first element from the unsorted part - 4. We store this value in a new
variable temp. We can see that 10 > 4, so we move the 10 to the right, which
overwrites the position where 4 was previously stored.

[10, 10, 25, 1, 5] (temp = 4)

Here 4 is less than all elements in the sorted subarray, so we insert it at the first index
position.

[4, 10, 25, 1, 5]

We now have two elements in the sorted subarray.

Next, check the number 25. We save it into the temp variable. 25 > 10 and also 25 > 4,
so we leave it in the third position and add it to the sorted subarray.

[4, 10, 25, 1, 5]

Again, we check the number 1. We save it in temp. 1 is less than 25, so 25 shifts right and
overwrites the slot where 1 was.

[4, 10, 25, 25, 5] (temp = 1). 10 > 1, so 10 shifts right as well:

[4, 10, 10, 25, 5]

4 > 1, so 4 also shifts right; now put the value of temp = 1 at the first position:

[1, 4, 10, 25, 5]

Now, we have 4 elements in the sorted subarray. Take 5 into temp. 5 < 25, so shift 25 to
the right:

[1, 4, 10, 25, 25] (temp = 5). 5 < 10, so 10 also shifts right:

[1, 4, 10, 10, 25]

5 > 4, so we get the sorted array by simply putting the temp value at that position:

[1, 4, 5, 10, 25]

The given array is sorted.

Example Program:1
def insertion_sort(list1):
    for i in range(1, len(list1)):
        value = list1[i]
        j = i - 1
        while j >= 0 and value < list1[j]:
            list1[j + 1] = list1[j]
            j -= 1
        list1[j + 1] = value
    return list1

# Driver code to test above
list1 = [10, 5, 13, 8, 2]
print("The unsorted list is:", list1)
print("The sorted list1 is:", insertion_sort(list1))

Output

The unsorted list is: [10, 5, 13, 8, 2]
The sorted list1 is: [2, 5, 8, 10, 13]
Explanation:

In the above code, we have created a function called insertion_sort(list1). Inside the function
-

o We defined a for loop to traverse the list from index 1 to len(list1).
o In the for loop, we assign list1[i] to value. Every time the loop iterates, the new
element is assigned to the value variable.
o Next, we move the elements of list1[0…i-1] that are greater than value one
position ahead of their current position.
o We use the while loop to check whether j is greater than or equal to 0 and
value is smaller than list1[j].
o If both conditions are true, we shift list1[j] one position to the right, reduce the
value of j, and so on; finally, value is placed at its correct index.
o After that, we call the function, pass the list, and print the result.

Merge Sort
Merge sort is similar to the quick sort algorithm as it works on the concept of divide and
conquer. It is one of the most popular and efficient sorting algorithms, and it is the best
example of the divide and conquer category of algorithms.

It divides the given list into two halves, calls itself for the two halves, and then merges the
two sorted halves. We define a merge() function used to merge the two halves.

The sublists are divided again and again into halves until we are left with only one element each.
Then we combine the pairs of one-element lists into two-element lists, sorting them in the
process. The sorted two-element pairs are merged into four-element lists, and so on, until we
get the fully sorted list.
Merge Sort Concept

We divide the given list into two halves. It does not matter if the list cannot be divided into
exactly equal parts.

Merge sort can be implemented in two ways - a top-down approach and a bottom-up
approach. We use the top-down approach in the example below, which is the way merge sort is
most often used.

The bottom-up approach provides more opportunity for optimization, which we will define later.

The main part of the algorithm is how we combine the two sorted sublists. Let's merge
the following two sorted lists:

o A : [2, 4, 7, 8]
o B : [1, 3, 11]
o sorted : empty

First, we observe the first element of both lists. We find the B's first element is smaller, so we
add this in our sorted list and move forward in the B list.

o A : [2, 4, 7, 8]
o B : [1, 3, 11]
o Sorted : 1

Now we look at the next pair of elements, 2 and 3. 2 is smaller, so we add it to our sorted list
and move forward in list A.

o A : [2, 4, 7, 8]
o B : [1, 3, 11]
o Sorted : 1 2

Continuing this process, we end up with the sorted list {1, 2, 3, 4, 7, 8, 11}. There can be
two special cases:

o What if both sublists have the same element? In such a case, we can move forward in
either one sublist and add the element to the sorted list. Technically, we could also move
forward in both sublists and add both elements to the sorted list.
o What if we have no elements left in one sublist? When we run out of elements in a sublist,
simply append the remaining elements of the second one, one after the other.

Implementation

The merge sort algorithm below is implemented using the top-down approach.


Example program:1
def merge_sort(list1, left_index, right_index):
    if left_index >= right_index:
        return

    middle = (left_index + right_index) // 2
    merge_sort(list1, left_index, middle)
    merge_sort(list1, middle + 1, right_index)
    merge(list1, left_index, right_index, middle)

def merge(list1, left_index, right_index, middle):
    left_sublist = list1[left_index:middle + 1]
    right_sublist = list1[middle + 1:right_index + 1]

    left_sublist_index = 0
    right_sublist_index = 0
    sorted_index = left_index

    while left_sublist_index < len(left_sublist) and right_sublist_index < len(right_sublist):
        if left_sublist[left_sublist_index] <= right_sublist[right_sublist_index]:
            list1[sorted_index] = left_sublist[left_sublist_index]
            left_sublist_index = left_sublist_index + 1
        else:
            list1[sorted_index] = right_sublist[right_sublist_index]
            right_sublist_index = right_sublist_index + 1
        sorted_index = sorted_index + 1

    while left_sublist_index < len(left_sublist):
        list1[sorted_index] = left_sublist[left_sublist_index]
        left_sublist_index = left_sublist_index + 1
        sorted_index = sorted_index + 1

    while right_sublist_index < len(right_sublist):
        list1[sorted_index] = right_sublist[right_sublist_index]
        right_sublist_index = right_sublist_index + 1
        sorted_index = sorted_index + 1

list1 = [44, 65, 2, 3, 58, 14, 57, 23, 10, 1, 7, 74, 48]
merge_sort(list1, 0, len(list1) - 1)
print(list1)

Quick Sort
Quick sort is also called the partition-exchange algorithm. It can be 2 or 3 times faster than
merge sort and heap sort.
QuickSort is a divide and conquer algorithm. It picks an element as a pivot and
partitions the given array around the picked pivot. There are many different versions of
quickSort that pick the pivot in different ways; an example implementation follows the list.
1. Always pick the first element as the pivot
2. Always pick the last element as the pivot
3. Pick a random element as the pivot
4. Pick the median as the pivot
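
Example Program:

The text above gives no program for quick sort, so here is a minimal sketch assuming the Lomuto partition scheme with the last element as the pivot (option 2 above); the function names are illustrative.

def partition(arr, low, high):
    pivot = arr[high]      # last element chosen as the pivot
    i = low - 1            # boundary of the region of elements smaller than the pivot
    for j in range(low, high):
        if arr[j] < pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]   # place the pivot in its final slot
    return i + 1

def quick_sort(arr, low, high):
    if low < high:
        p = partition(arr, low, high)   # pivot index after partitioning
        quick_sort(arr, low, p - 1)     # sort the part left of the pivot
        quick_sort(arr, p + 1, high)    # sort the part right of the pivot

arr = [10, 80, 30, 90, 40, 50, 70]
quick_sort(arr, 0, len(arr) - 1)
print(arr)   # [10, 30, 40, 50, 70, 80, 90]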

Searching Algorithms
 Searching is a very basic necessity when you store data in different data structures.
 The simplest approach is to go across every element in the data structure and match it
with the value you are searching for. This is known as linear search.
 It is inefficient and rarely used, but creating a program for it gives an idea about how we
can implement some advanced search algorithms.

There are two types of searching -

o Linear Search
o Binary Search
Linear Search
 Linear search is a method of finding elements within a list. It is also called a
sequential search.
 It is the simplest searching algorithm because it searches the desired element in a
sequential manner.

Concept of Linear Search

Let's understand the following steps to find the element key = 7 in the given list.

Step - 1: Start the search from the first element and compare key = 7 with each element of
the list.

Step - 2: If the element is found, return the index position of the key.

Step - 3: If the element is not found, return that the element is not present.


Linear Search Algorithm

There is list of n elements and key value to be searched.

LinearSearch(list, key)
    for each item in the list
        if item == key
            return its index position
    return -1

Python Program

def linear_Search(list1, n, key):
    # Searching list1 sequentially
    for i in range(0, n):
        if (list1[i] == key):
            return i
    return -1

list1 = [1, 3, 5, 4, 7, 9]
key = 7

n = len(list1)
res = linear_Search(list1, n, key)
if (res == -1):
    print("Element not found")
else:
    print("Element found at index: ", res)

Output:

Element found at index: 4

Explanation:

In the above code, we have created a function linear_Search(), which takes three arguments
- list1, the length of the list, and the number to search for. We defined a for loop that iterates
over each element and compares it to the key value. If the element is found, we return its
index; otherwise we return -1, which means the element is not present in the list.

Linear Search Complexity

Time complexity of linear search algorithm -

o Best Case - O(1)
o Average Case - O(n)
o Worst Case - O(n)

The linear search algorithm is suitable for smaller lists (< 100 elements) because it checks
every element to find the desired number. Suppose there is a list of 10,000 elements and the
desired element is at the last position; this will consume a lot of time comparing against
each element of the list.

Binary Search
A binary search is an algorithm to find a particular element in a list. Suppose we have a list
of a thousand elements, and we need to get the index position of a particular element. We can
find the element's index position very fast using the binary search algorithm.

There are many searching algorithms but the binary search is most popular among them.

The elements in the list must be sorted to apply the binary search algorithm. If elements are
not sorted then sort them first.

Concept of Binary Search

The recursive method follows the divide and conquer approach. In this
method, a function calls itself again and again until it finds the element in the list.

In the iterative method, a set of statements is repeated multiple times to find an element's
index position. The while loop is used to accomplish this task.

Binary search is more effective than linear search because we don't need to check each
list index. The list must be sorted to apply the binary search algorithm.

Let's have a step by step implementation of binary search.

We have a sorted list of elements, and we are looking for the index position of 45.

[12, 24, 32, 39, 45, 50, 54]

So, we are setting two pointers in our list. One pointer is used to denote the smaller value
called low and the second pointer is used to denote the highest value called high.

Next, we calculate the value of the middle element in the array.


mid = (low + high) // 2
Here, low is 0 and high is 6 (the index of the last element).
mid = (0 + 6) // 2
mid = 3 (integer)

Now, we compare the element being searched for to the mid index value. In this case,
list1[3] = 39 is not equal to 45. So we need to do further comparison to find the element.

If the number we are searching for is equal to the mid element, we return mid; otherwise we
move on to further comparison.

If the number to be searched is greater than the middle number, we compare n with the
middle element of the elements on the right side of mid, setting low = mid + 1.

Otherwise, we compare n with the middle element of the elements on the left side of mid,
setting high = mid - 1.

Repeat until the number that we are searching for is found.

Python Implementation

def binary_search(list1, n):
    low = 0
    high = len(list1) - 1
    mid = 0

    while low <= high:
        mid = (high + low) // 2

        if list1[mid] < n:
            low = mid + 1
        elif list1[mid] > n:
            high = mid - 1
        else:
            return mid

    # element was not present in the list, return -1
    return -1

list1 = [12, 24, 32, 39, 45, 50, 54]
n = 45

result = binary_search(list1, n)

if result != -1:
    print("Element is present at index", str(result))
else:
    print("Element is not present in list1")

Output:

Element is present at index 4

Explanation:

In the above program -

o We have created a function called binary_search(), which takes two
arguments - a sorted list and a number to be searched.
o We have declared two variables to store the lowest and highest index positions in the list.
low is assigned the initial value 0, high is len(list1) - 1, and mid is 0.
o Next, we have declared the while loop with the condition that low is less than or
equal to high. The while loop keeps iterating while the number has not been found
yet.
o In the while loop, we find the mid value and compare the element at that index to the
number we are searching for.
o If the value at the mid index is smaller than n, we assign mid + 1 to low; the
search moves to the right side.
o Otherwise, if it is greater, we assign mid - 1 to high; the search moves to the
left side.
o If n is equal to the mid value, then we return mid.
o This continues while low is less than or equal to high.
o If we reach the end of the function, then the element is not present in the list. We
return -1 to the calling function.
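
The program above uses the iterative method. Since the concept section also mentions the recursive method, here is a minimal recursive sketch under the same assumptions (sorted input, -1 for absence); the function name is illustrative.

def binary_search_recursive(list1, n, low, high):
    # base case: the search space is empty, so n is not present
    if low > high:
        return -1
    mid = (low + high) // 2
    if list1[mid] == n:
        return mid
    elif list1[mid] < n:
        # n can only be in the right half
        return binary_search_recursive(list1, n, mid + 1, high)
    else:
        # n can only be in the left half
        return binary_search_recursive(list1, n, low, mid - 1)

list1 = [12, 24, 32, 39, 45, 50, 54]
print(binary_search_recursive(list1, 45, 0, len(list1) - 1))   # prints 4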


HASHING
Hashing is a technique used to store a large amount of data such that it can be
accessed in O(1) time by operations such as search, insert and delete. Various applications
of hashing are:
 Indexing in database
 Cryptography
 Symbol Tables in Compiler/Interpreter
 Dictionaries, caches, etc.

Concept of Hashing, Hash Table and Hash Function

Hashing is an important technique designed to use a special function called
the hash function, which maps a given value to a particular key for faster access
of elements.
The efficiency of mapping depends on the efficiency of the hash function used.
Example:

h(large_value) = large_value % m
Here, h() is the required hash function and 'm' is the size of the hash table. For large
values, the hash function produces a value in the given range.
How Does a Hash Function Work?
 It should always map large keys to small keys.
 It should always generate values between 0 and m-1, where m is the size of the hash table.
 It should uniformly distribute large keys over the hash table slots.
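
For instance, a minimal sketch of such a division-method hash function in Python (m = 10 is an assumed table size for illustration):

m = 10   # assumed size of the hash table

def h(key):
    # maps an arbitrarily large integer key into the range 0 .. m-1
    return key % m

print(h(7))           # 7
print(h(123456789))   # 9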

Collision Handling

If we know the keys beforehand, then we can have perfect hashing. In perfect hashing,
we do not have any collisions. However, if we do not know the keys, then we can use the
following methods to handle collisions:
 Chaining
 Open Addressing (Linear Probing, Quadratic Probing, Double Hashing)
Chaining
While hashing, the hash function may lead to a collision, that is, two or more keys
mapped to the same value. Chained hashing handles such collisions. The idea is to make each cell of the
hash table point to a linked list of records that have the same hash function value.
Performance of Hashing

The performance of hashing is evaluated under the assumption that each key is equally likely to be
hashed to any slot of the hash table.

m = Length of Hash Table
n = Total keys to be inserted in the hash table
Load factor lf = n/m

Expected time to search = O(1 + lf)
Expected time to insert/delete = O(1 + lf)

The time complexity of search, insert and delete is O(1) if lf is O(1)
Python Implementation of Hashing
Example Program:
def display_hash(hashTable):
    for i in range(len(hashTable)):
        print(i, end=" ")
        for j in hashTable[i]:
            print("-->", end=" ")
            print(j, end=" ")
        print()

HashTable = [[] for _ in range(10)]

def Hashing(keyvalue):
    return keyvalue % len(HashTable)

def insert(Hashtable, keyvalue, value):
    hash_key = Hashing(keyvalue)
    Hashtable[hash_key].append(value)

insert(HashTable, 10, 'Allahabad')
insert(HashTable, 25, 'Mumbai')
insert(HashTable, 20, 'Mathura')
insert(HashTable, 9, 'Delhi')
insert(HashTable, 21, 'Punjab')
insert(HashTable, 21, 'Noida')
display_hash(HashTable)

Output:

0 --> Allahabad --> Mathura


1 --> Punjab --> Noida
2
3
4
5 --> Mumbai
6
7
8
9 --> Delhi
Separate Chaining Collision Handling Technique in Hashing

What is Collision?
Since a hash function gets us a small number for a key which is a big integer or string, there
is a possibility that two keys result in the same value. The situation where a newly inserted
key maps to an already occupied slot in the hash table is called collision and must be
handled using some collision handling technique.
What are the chances of collisions with the large table?
Collisions are very likely even if we have a big table to store keys. An important
observation is Birthday Paradox. With only 23 persons, the probability that two people
have the same birthday is 50%.

How to handle Collisions?


There are mainly two methods to handle collision:
 Separate Chaining
 Open Addressing

Separate Chaining:
Separate chaining is one of the most popular and commonly used techniques for handling
collisions.
The linked list data structure is used to implement this technique. When
multiple elements are hashed into the same slot index, these elements are inserted into
a singly-linked list which is known as a chain.

Here, all those elements that hash to the same slot index are inserted into a linked list.
We can then search for a key K in the linked list by just linearly traversing it. If the
intrinsic key for any entry is equal to K, it means that we have found our entry. If we
have reached the end of the linked list and haven't found our entry, it means that
the entry does not exist. Hence, the conclusion is that in separate chaining, if two different
elements have the same hash value, then we store both elements in the same linked list,
one after the other.
Example: Let us consider a simple hash function "key mod 7" and the sequence of keys
50, 700, 76, 85, 92, 73, 101. Then 700 goes to slot 0; 50, 85 and 92 chain at slot 1; 73 and 101
chain at slot 3; and 76 goes to slot 6. Separate chaining thus implements each array cell as the
head of a linked list called a chain.

Advantages:
 Simple to implement.
 Hash table never fills up, we can always add more elements to the chain.
 Less sensitive to the hash function or load factors.
 It is mostly used when it is unknown how many and how frequently keys may be
inserted or deleted.
Disadvantages:
 The cache performance of chaining is not good as keys are stored using a linked list.
Open addressing provides better cache performance as everything is stored in the same
table.
 Wastage of Space (Some Parts of the hash table are never used)
 If the chain becomes long, then search time can become O(n) in the worst case
 Uses extra space for links

Performance of Chaining:

Performance of hashing can be evaluated under the assumption that each key is equally
likely to be hashed to any slot of the table (simple uniform hashing).
m = Number of slots in hash table
n = Number of keys to be inserted in hash table
Load factor α = n/m
Expected time to search = O(1 + α)
Expected time to delete = O(1 + α)

Open Addressing:
 Like separate chaining, open addressing is a method for handling collisions.
 In open addressing, all elements are stored in the hash table itself.
 So at any point, the size of the table must be greater than or equal to the total
number of keys (note that we can increase the table size by copying the old data if
needed).
 This approach is also known as closed hashing.

 Insert(k): Keep probing until an empty slot is found. Once an empty slot is found,
insert k.
 Search(k): Keep probing until the slot's key becomes equal to k or an empty
slot is reached.
 Delete(k): The delete operation is interesting. If we simply delete a key, then a later
search may fail. So slots of deleted keys are marked specially as "deleted".
Insert can place an item in a deleted slot, but search doesn't stop at a
deleted slot.
Different ways of Open Addressing:

1. Linear Probing:

In linear probing, the hash table is searched sequentially, starting from the original
location of the hash. If the location that we get is already occupied, then we check
the next location.
The function used for probing is: rehash(key) = (n + 1) % table_size.
The typical gap between two probes is 1, as seen in the example below:

Let hash(x) be the slot index computed using a hash function and S be the table size.
If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
…
Let us consider a simple hash function "key mod 7" and the sequence of keys 50, 700,
76, 85, 92, 73, 101.
Challenges in Linear Probing :
 Primary Clustering:
One of the problems with linear probing is Primary clustering, many consecutive
elements form groups and it starts taking time to find a free slot or to search for an
element.
 Secondary Clustering:
Secondary clustering is less severe, two records only have the same collision chain
(Probe Sequence) if their initial position is the same.

Example:
Let us consider a simple hash function "key mod 5" and a sequence of keys to
be inserted: 50, 70, 76, 93.
 Step 1: First draw the empty hash table, which will have a possible range of hash values
from 0 to 4 according to the hash function provided.
 Step 2: Now insert all the keys into the hash table one by one. The first key is 50. It
maps to slot number 0 because 50 % 5 = 0, so insert it into slot number 0.

 Step 3: The next key is 70. It also maps to slot number 0 because 70 % 5 = 0, but 50 is
already at slot number 0, so search for the next empty slot (slot 1) and insert it there.
 Step 4: The next key is 76. It maps to slot number 1 because 76 % 5 = 1, but 70 is
already at slot number 1, so search for the next empty slot (slot 2) and insert it there.

 Step 5: The next key is 93. It maps to slot number 3 because 93 % 5 = 3, so insert it
into slot number 3.
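
A minimal sketch of linear-probing insertion in Python, reproducing the example above (the table size and keys are taken from the example; the function name is illustrative):

SIZE = 5
table = [None] * SIZE    # None marks an empty slot

def insert_linear(key):
    idx = key % SIZE
    for step in range(SIZE):
        probe = (idx + step) % SIZE    # try hash(x), hash(x)+1, hash(x)+2, ...
        if table[probe] is None:
            table[probe] = key
            return probe
    raise RuntimeError("hash table is full")

for k in [50, 70, 76, 93]:
    print(k, "->", insert_linear(k))
# 50 -> 0, 70 -> 1, 76 -> 2, 93 -> 3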
2. Quadratic Probing

 Quadratic probing is a method with the help of which we can solve the problem
of clustering.
 This method is also known as the mid-square method.
 In this method, we look for the i²-th slot in the ith iteration.
 We always start from the original hash location.
 If that location is occupied, then we check the other slots.

Let hash(x) be the slot index computed using the hash function.

If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
…
Example:
Let us consider table size = 7, the hash function Hash(x) = x % 7, and collision resolution
strategy f(i) = i². Insert = 22, 30, and 50.

 Step 1: Create a table of size 7.

 Step 2: Insert 22 and 30
 Hash(22) = 22 % 7 = 1. Since the cell at index 1 is empty, we can easily
insert 22 at slot 1.
 Hash(30) = 30 % 7 = 2. Since the cell at index 2 is empty, we can easily
insert 30 at slot 2.

 Step 3: Inserting 50
 Hash(50) = 50 % 7 = 1
 In our hash table slot 1 is already occupied. So, we search slot 1 + 1² = 2.
 Again, slot 2 is found occupied, so we search cell 1 + 2² = 5.
 Now, cell 5 is not occupied, so we place 50 in slot 5.
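
A minimal sketch of quadratic-probing insertion, reproducing the steps above (note that this simple version only tries SIZE probes, so it can miss a free slot in unfavorable cases):

SIZE = 7
table = [None] * SIZE    # None marks an empty slot

def insert_quadratic(key):
    idx = key % SIZE
    for i in range(SIZE):
        probe = (idx + i * i) % SIZE    # try hash(x), hash(x)+1, hash(x)+4, hash(x)+9, ...
        if table[probe] is None:
            table[probe] = key
            return probe
    raise RuntimeError("no free slot found within SIZE probes")

for k in [22, 30, 50]:
    print(k, "->", insert_quadratic(k))
# 22 -> 1, 30 -> 2, 50 -> 5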

Performance of Open Addressing:

Like chaining, the performance of hashing can be evaluated under the assumption that each
key is equally likely to be hashed to any slot of the table (simple uniform hashing).
m = Number of slots in the hash table
n = Number of keys to be inserted in the hash table
Load factor α = n/m ( < 1 )
Expected time to search/insert/delete < 1/(1 – α)
So search, insert and delete take O(1/(1 – α)) time

S.No. Separate Chaining vs. Open Addressing

1. Chaining is simpler to implement; open addressing requires more computation.
2. In chaining, the hash table never fills up; we can always add more elements to a chain. In open addressing, the table may become full.
3. Chaining is less sensitive to the hash function or load factors; open addressing requires extra care to avoid clustering and high load factors.
4. Chaining is mostly used when it is unknown how many and how frequently keys may be inserted or deleted; open addressing is used when the frequency and number of keys are known.
5. The cache performance of chaining is not good because keys are stored in a linked list; open addressing provides better cache performance as everything is stored in the same table.
6. Chaining wastes space (some parts of the hash table are never used); in open addressing, a slot can be used even if no input maps to it.
7. Chaining uses extra space for links; there are no links in open addressing.

LOAD FACTOR
Load factor is defined as (m/n), where n is the total size of the hash table and m is the
preferred number of entries which can be inserted before an increment in the size of the
underlying data structure is required.

Every data structure has a fixed size.

The hash table provides constant time complexity for insertion and searching, provided the
hash function distributes the input load evenly.

That's because if each element is at a different index, then we can directly calculate the hash
and locate it at that index; but in case of collision, the time complexity can go up to O(N) in
the worst case, as we might need to traverse over other elements, comparing against the
element we need to search for.

The load factor in hashing is basically a measure that decides when exactly to increase the
size of the hash table to maintain the O(1) time complexity.

Let's understand with an example with a HashTable of size = 16:

1. If there are 16 elements in the HashTable, the hash function will
distribute one element to each index. Searching for any item, in this case,
will take only one lookup.
2. If there are 32 elements in the HashTable, the hash function will
distribute two elements to each index. Searching for any item, in this case,
will take a maximum of two lookups.
3. Similarly, if there are 128 elements in the HashTable, the hash function
will distribute eight elements to each index. Searching for any item, in this
case, will take a maximum of eight lookups.


So each data structure (like the hash table) comes with 2 important properties:

 Initial Capacity

The initial capacity is the number of Indexes allocated in the HashTable. It is created when
the HashTable is initialized.
The capacity of the HashTable is doubled each time it reaches the threshold.
If you recall, chaining was one of the collision resolution techniques, and
HashTables usually use the chaining technique. So in case the same index is generated for
multiple keys, all these elements are stored against the same index in the form of a
chain. One index can therefore store multiple elements, and for this reason the chain of
elements at an index is also referred to as a "bucket".
Buckets are the groups (or chains) of elements whose hash indexes generated from the hash
function are the same.

E.g. if we have initialized the HashTable with an initial capacity of 16, then the hash function
will try to distribute the key-value pairs among the 16 indexes equally, so that each
bucket carries as few elements as possible.

 Load Factor in Hashing

The Load factor is a measure that decides when to increase the HashTable capacity to
maintain the search and insert operation complexity of O(1).
The default load factor of HashMap used in java, for instance, is 0.75f (75% of the map size).
That means if we have a HashTable with an array size of 100, then whenever we have 75
elements stored, we will increase the size of the array to double of its previous size i.e. to 200
now, in this case.

How is the Load Factor Used?

The Load Factor decides “when to increase the size of the hash Table.”
The load factor can be decided using the following formula:

Initial capacity of the HashTable * Load factor of the HashTable

E.g. If The initial capacity of HashTable is = 16


And the load factor of HashTable = 0.75

According to the formula mentioned above: 16 * 0.75 = 12

This means that up to the 12th key-value pair, the HashTable will keep its size at 16.
As soon as the 13th element (key-value pair) comes into the HashTable, it will increase its
size from 16 buckets to 16 * 2 = 32 buckets.

Another way to calculate size:

When the current load factor becomes equal to the defined load factor, i.e. 0.75, the hashmap
increases its capacity. The current load factor is:

current load factor = m / n

Where:
m – is the number of entries in the HashTable
n – is the total size of the HashTable

Example of Load Factor in Hashing

If we have the initial capacity of HashTable = 16.

We insert the first element, now check if we need to increase the size of the HashTable
capacity or not.

It can be determined by the formula:

Size of hashmap (m) / number of buckets (n)

In this case, the size of the hashmap is 1, and the number of buckets is 16. So, 1/16 = 0.0625. Now
compare this value with the default load factor.

0.0625<0.75

So, no need to increase the hashmap size.

We do not need to increase the size of hashmap up to 12th element, because

12/16=0.75

As soon as we insert the 13th element into the hashmap, the size of the hashmap is increased
because:

13/16 = 0.8125

which is greater than the default load factor:

0.8125 > 0.75
REHASHING
Rehashing means hashing again. Basically, when the load factor increases beyond its
pre-defined value (e.g. 0.75, as taken in the examples above), the time complexity for search and
insert increases.

So to overcome this, the size of the array is increased (usually doubled) and all the values
are hashed again and stored in the new, double-sized array, to maintain a low load factor and
low complexity.

This means if we had an array of size 100 earlier, and we have stored 75 elements into
it (given it has a load factor of 0.75), then when we need to store the 76th element, we double its
size to 200.

But that comes with a price:

With the new size, the hash function changes (it depends on the table size), which means the
75 elements we had stored earlier would now, under this new hash function, yield different
indexes to place them. So basically we rehash all those stored elements with the new hash
function and place them at new indexes of the newly resized, bigger HashTable.

Why Rehashing?

Rehashing is done because whenever key-value pairs are inserted into the map, the load
factor increases, which implies that the time complexity also increases as explained above.
This might not give the required time complexity of O(1). Hence, rehash must be done,
increasing the size of the bucketArray so as to reduce the load factor and the time complexity.

Let's try to understand the above with an example:

Say we had a HashTable with an initial capacity of 3.
We need to insert 4 keys: 100, 101, 102, 103

And say the hash function used was the division method: Key % ArraySize

So Hash(100) = 100 % 3 = 1, so Element1 is stored at index 1.
Hash(101) = 101 % 3 = 2, so Element2 is stored at index 2.
Hash(102) = 102 % 3 = 0, so Element3 is stored at index 0.

With the insertion of 3 elements, the load on the hash table = 3/3 = 1.0, which exceeds 0.75.

So before we can add the 4th element to this hash table, we need to increase its size to 6.

But after the size is increased, the hash of existing elements may no longer be the same.

E.g. the earlier hash function was Key % 3 and now it is Key % 6.

If the hash used at insertion time differs from the hash we would calculate now, then we
cannot search for the element.
E.g. 100 was inserted at index 1, but when we need to search for it in this new hash table
of size = 6, we will calculate its hash = 100 % 6 = 4.
But 100 is not at index 4; it is at index 1.

So we need the rehashing technique, which rehashes all the elements already stored using the
new hash function.

How Rehashing is Done?

Element1: Hash(100) = 100 % 6 = 4, so Element1 is rehashed and stored at index 4 in
the newly resized HashTable, instead of index 1 as in the previous HashTable.

Element2: Hash(101) = 101 % 6 = 5, so Element2 is rehashed and stored at index 5 in
the newly resized HashTable, instead of index 2 as in the previous HashTable.

Element3: Hash(102) = 102 % 6 = 0, so Element3 is rehashed and stored at index 0 in
the newly resized HashTable, the same index it had before.

Since the load factor now is 3/6 = 0.5, we can insert the 4th element.

Element4: Hash(103) = 103 % 6 = 1, so Element4 is stored at index 1 in the newly
resized HashTable.

Rehashing Steps –

1. For each addition of a new entry to the map, check the current load factor.
2. If it's greater than its pre-defined value, then rehash.
3. For rehashing, make a new array of double the previous size and make it the new
bucket array.
4. Then traverse each element in the old bucket array and insert it into the new,
larger bucket array.

However, it must be noted that if you are going to store a really large number of elements in
the HashTable then it is always good to create a HashTable with sufficient capacity upfront
as this is more efficient than letting it perform automatic rehashing.
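
To tie the steps together, here is a minimal sketch of a chained hash table that rehashes when the load factor would exceed 0.75; the class and method names are illustrative, not a standard library API.

class HashTable:
    def __init__(self, capacity=3, load_factor=0.75):
        self.capacity = capacity
        self.load_factor = load_factor
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0

    def _hash(self, key):
        return key % self.capacity       # division-method hash, as in the text

    def _rehash(self):
        old = self.buckets
        self.capacity *= 2               # double the bucket array
        self.buckets = [[] for _ in range(self.capacity)]
        for bucket in old:
            for key, value in bucket:    # re-insert every stored entry with the new hash
                self.buckets[self._hash(key)].append((key, value))

    def insert(self, key, value):
        if (self.count + 1) / self.capacity > self.load_factor:
            self._rehash()               # grow before the load factor is exceeded
        self.buckets[self._hash(key)].append((key, value))
        self.count += 1

ht = HashTable()
for k in [100, 101, 102, 103]:
    ht.insert(k, "value" + str(k))
print(ht.capacity)   # 6: the table doubled from 3 once the load factor passed 0.75
print(ht.buckets)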

------------------------------UNIT 3 COMPLETED----------------------
