
Module 5

SORTING TECHNIQUES
AND HASHING
Sorting Techniques
O(n^2): 1. Bubble sort, 2. Insertion sort, 3. Selection sort
O(n log n): 4. Quick sort, 5. Merge sort, 6. Heap sort
Sorting Techniques
The task of rearranging data in some order.

Or, the task of rearranging a set of records based on their key values, when the records are stored in a file.

Sorting Techniques - Terminologies
▪Internal Sort – done entirely within primary memory
▪External Sort – also needs slower external memory
▪Ascending Order – each element is less than or equal to the next
▪Descending Order – each element is greater than or equal to the next
▪Lexicographic Order – dictionary order
▪Collating Sequence – an ordering defined over a set of characters (higher, lower or same order)
Sorting Techniques - Terminologies
Random Order – no particular ordering
Swap: interchanging the data contents between two storage locations
Stable Sort: preserves the relative ordering of equal data values
In-Place Sort: sorts values within the data structure without the help of any external storage
Item: element to be sorted
Sorting Techniques

Sorting
• Internal
  ◦ By comparison – Insertion, Selection, Exchange, Enumeration
  ◦ By distribution
• External

Sorting Techniques

• Internal sorting
  ◦ By comparison
    ▪ Insertion – Straight Insertion, Binary Insertion, Two-way Insertion, List Insertion
    ▪ Selection – Straight Selection, Tree Selection, Heap
    ▪ Exchange – Bubble, Shell, Quick
    ▪ Merge – Simple Merge, Two-way Merge
  ◦ By distribution – Radix, Bucket, Counting
• External sorting
  ◦ Merge – Two-way Merge, Multi-way Merge, Polyphase Merge
Insertion Sort
Idea: like sorting a hand of playing cards
◦ Start with an empty left hand and the cards facing down on
the table.
◦ Remove one card at a time from the table, and insert it into
the correct position in the left hand
◦ compare it with each of the cards already in the hand, from right to left
◦ The cards held in the left hand are sorted
◦ these cards were originally the top cards of the pile on the table

Insertion Sort

To insert 12, we need to make room for it by moving first 36 and then 24.
Insertion Sort
Input array: 5 2 4 6 1 3

At each iteration, the array is divided in two sub-arrays:
• left sub-array – sorted
• right sub-array – unsorted
INSERTION-SORT
Alg.: INSERTION-SORT(A)
for j ← 2 to n
    do key ← A[j]
       i ← j − 1
       while i > 0 and A[i] > key
           do A[i + 1] ← A[i]
              i ← i − 1
       A[i + 1] ← key
Insertion Sort
#include <stdio.h>

int main()
{
    int n, array[1000], c, d, t;

    printf("Enter number of elements\n");
    scanf("%d", &n);
    printf("Enter %d integers\n", n);
    for (c = 0; c < n; c++) {
        scanf("%d", &array[c]);
    }
    /* insert element c into the sorted prefix array[0..c-1] by
       swapping it backwards while it is smaller than its left neighbour */
    for (c = 1; c <= n - 1; c++) {
        d = c;
        while (d > 0 && array[d] < array[d-1]) {
            t = array[d];
            array[d] = array[d-1];
            array[d-1] = t;
            d--;
        }
    }
    printf("Sorted list in ascending order:\n");
    for (c = 0; c <= n - 1; c++) {
        printf("%d\n", array[c]);
    }
    return 0;
}
Selection Sort
Idea:
◦ Find the smallest element in the array
◦ Exchange it with the element in the first position
◦ Find the second smallest element and exchange it with the element in the second
position
◦ Continue until the array is sorted

Example

8 4 6 9 2 3 1
1 4 6 9 2 3 8
1 2 6 9 4 3 8
1 2 3 9 4 6 8
1 2 3 4 9 6 8
1 2 3 4 6 9 8
1 2 3 4 6 8 9
1 2 3 4 6 8 9   (sorted)
Selection Sort
Alg.: SELECTION-SORT(A) 8 4 6 9 2 3 1
n ← length[A]
for j ← 1 to n - 1
do smallest ← j
for i ← j + 1 to n
do if A[i] < A[smallest]
then smallest ← i
exchange A[j] ↔ A[smallest]
Selection Sort
#include <stdio.h>

int main()
{
    int array[100], n, c, d, small, temp;

    printf("Enter number of elements\n");
    scanf("%d", &n);
    printf("Enter %d integers\n", n);
    for (c = 0; c < n; c++)
        scanf("%d", &array[c]);
    for (c = 0; c < (n - 1); c++) {
        small = c;                       /* index of the smallest so far */
        for (d = c + 1; d < n; d++) {
            if (array[small] > array[d])
                small = d;
        }
        if (small != c) {                /* exchange into position c */
            temp = array[c];
            array[c] = array[small];
            array[small] = temp;
        }
    }
    printf("Sorted list in ascending order:\n");
    for (c = 0; c < n; c++)
        printf("%d\n", array[c]);
    return 0;
}
Quicksort
Basic concept: divide and conquer.
Select a pivot and split the data into two groups:
• LEFT group: elements < pivot
• RIGHT group: elements > pivot
Recursively apply Quicksort to the subgroups.
Quicksort Start
Start with all data in an array, and consider it unsorted.
Quicksort Step 1
Step 1: select a pivot (it is arbitrary).

26 33 35 29 19 12 22
(pivot = 26)

We will select the first element, as presented in the original algorithm by C.A.R. Hoare in 1962.
Quicksort Step 2
Step 2: start the process of dividing the data into LEFT and RIGHT groups:

26 33 35 29 19 12 22
(pivot = 26; the left marker starts at 33, the right marker at 22)

The LEFT group will have elements less than the pivot.
The RIGHT group will have elements greater than the pivot.
Use the markers left and right.
Quicksort Step 3
Step 3:
If the left element belongs to the LEFT group, increment the left index.
If the right element belongs to the RIGHT group, decrement the right index.
Exchange when you find elements that belong to the other group.
Quicksort Step 4
Step 4: element 33 belongs to the RIGHT group and element 22 belongs to the LEFT group, so exchange the two elements:

before: 26 33 35 29 19 12 22
after:  26 22 35 29 19 12 33
Quicksort Step 5
Step 5: after the exchange, increment the left marker and decrement the right marker.

26 22 35 29 19 12 33
Quicksort Step 6
Step 6: element 35 belongs to the RIGHT group and element 12 belongs to the LEFT group. Exchange, increment left, and decrement right:

before: 26 22 35 29 19 12 33
after:  26 22 12 29 19 35 33
Quicksort Step 7
Step 7: element 29 belongs to RIGHT and element 19 belongs to LEFT. Exchange, increment left, decrement right:

before: 26 22 12 29 19 35 33
after:  26 22 12 19 29 35 33
(the right marker now lies to the left of the left marker)
Quicksort Step 8
Step 8: when the left and right markers pass each other, we are done with the partition task. Swap the element at the right marker with the pivot:

before: 26 22 12 19 29 35 33
after:  19 22 12  26  29 35 33
        (LEFT)  (pivot) (RIGHT)
Quicksort Step 8a
Step 8a: apply Quicksort over the left and right partitions.

Left partition (19 22 12), pivot 19:
• Step 1: interchange 12 and 22
• Step 2: interchange 12 and 19
• The pivot is now placed in its right location.

Similarly, apply Quicksort on the right partition (29 35 33) with 29 as pivot.
Quicksort Step 9
Step 9: apply Quicksort to the LEFT and RIGHT groups, recursively, and assemble the parts when done:

previous pivot: 26
LEFT: 19 22 12 → 12 19 22        RIGHT: 29 35 33 → 29 33 35

12 19 22 26 29 33 35
A second partition example, with pivot_index = 0:

pivot_index = 0     40 20 10 80 60 50 7 30 100

                    [0] [1] [2] [3] [4] [5] [6] [7] [8]

1. While data[too_big_index] <= data[pivot]
       ++too_big_index
2. While data[too_small_index] > data[pivot]
       --too_small_index
3. If too_big_index < too_small_index
       swap data[too_big_index] and data[too_small_index]
4. While too_small_index > too_big_index, go to 1.
5. Swap data[too_small_index] and data[pivot_index]

Trace (too_big_index scans from the left, too_small_index from the right):

40 20 10 80 60 50  7 30 100    too_big_index stops at 80, too_small_index at 30
40 20 10 30 60 50  7 80 100    after swapping 80 and 30 (step 3)
40 20 10 30  7 50 60 80 100    after swapping 60 and 7 (step 3); the indices then cross
 7 20 10 30 40 50 60 80 100    after step 5: pivot 40 swapped into place, pivot_index = 4
Partition Result

7 20 10 30 40 50 60 80 100

[0] [1] [2] [3] [4] [5] [6] [7] [8]

<= data[pivot] > data[pivot]


Recursion: Quicksort Sub-arrays

7 20 10 30 40 50 60 80 100

[0] [1] [2] [3] [4] [5] [6] [7] [8]

<= data[pivot] > data[pivot]


Quicksort Efficiency
The partitioning of an array into two parts is O(n)

The number of recursive calls to Quicksort depends on how many times we can split the array into two groups; on average this is O(log2 n).

The overall Quicksort efficiency is O(n log2 n).

What is the worst-case efficiency?


Quicksort: Worst Case
Assume first element is chosen as pivot.
Assume we get array that is already in order:

pivot_index = 0     2 4 10 12 13 50 57 63 100

                    [0] [1] [2] [3] [4] [5] [6] [7] [8]

1. While data[too_big_index] <= data[pivot]
       ++too_big_index
2. While data[too_small_index] > data[pivot]
       --too_small_index
3. If too_big_index < too_small_index
       swap data[too_big_index] and data[too_small_index]
4. While too_small_index > too_big_index, go to 1.
5. Swap data[too_small_index] and data[pivot_index]

Every element is greater than the pivot 2, so too_small_index walks all the way back to the pivot position, step 3 never swaps, and step 5 leaves the array unchanged:

2 4 10 12 13 50 57 63 100

<= data[pivot]: (empty)        > data[pivot]: the remaining n − 1 elements
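Each partition step in this case removes only the pivot: the recursion then works on sub-arrays of sizes n − 1, n − 2, …, 1, so the total work is proportional to n + (n − 1) + … + 1 = n(n + 1)/2. The worst-case efficiency of Quicksort is therefore O(n^2).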


#include <stdio.h>

/* quick sort function to sort an integer array */
void quicksort(int array[], int firstIndex, int lastIndex)
{
    int pivotIndex, temp, index1, index2;

    if (firstIndex < lastIndex)
    {
        /* assigning first element index as pivot element */
        pivotIndex = firstIndex;
        index1 = firstIndex;
        index2 = lastIndex;
        /* sorting in ascending order with quick sort */
        while (index1 < index2)
        {
            while (array[index1] <= array[pivotIndex] && index1 < lastIndex)
                index1++;
            while (array[index2] > array[pivotIndex])
                index2--;
            if (index1 < index2)
            {
                /* swapping operation */
                temp = array[index1];
                array[index1] = array[index2];
                array[index2] = temp;
            }
        }
        /* at the end of the partition, swap pivot element with index2 element */
        temp = array[pivotIndex];
        array[pivotIndex] = array[index2];
        array[index2] = temp;
        /* recursive calls for quick sort, with partitioning */
        quicksort(array, firstIndex, index2 - 1);
        quicksort(array, index2 + 1, lastIndex);
    }
}

int main()
{
    int array[100], n, i;

    printf("Enter the number of elements you want to sort: ");
    scanf("%d", &n);
    printf("Enter elements in the list: ");
    for (i = 0; i < n; i++)
        scanf("%d", &array[i]);
    quicksort(array, 0, n - 1);
    printf("Sorted elements: ");
    for (i = 0; i < n; i++)
        printf(" %d", array[i]);
    return 0;
}

Sorting Techniques - Comparison
Merge Sort
Step 1 − If there is only one element in the list, it is already sorted; return.
Step 2 − Divide the list recursively into two halves until it can no longer be divided.
Step 3 − Merge the smaller lists into a new list in sorted order.
Merge Sort
Apply divide-and-conquer to sorting problem
Problem: Given n elements, sort elements into non-
decreasing order
Divide-and-Conquer:
◦ If n=1 terminate (every one-element list is already sorted)
◦ If n>1, partition elements into two or more sub-collections;
sort each; combine into a single sorted list
How do we partition?
Partitioning - Choice 1
First n-1 elements into set A, last element set B
Sort A using this partitioning scheme recursively
◦ B already sorted
Combine A and B using method Insert() (= insertion into
sorted array)
Leads to recursive version of InsertionSort()
◦ Number of comparisons: O(n^2)
◦ Best case = n − 1
◦ Worst case = (n − 1) + (n − 2) + … + 1 = n(n − 1)/2
Partitioning - Choice 2
Put element with largest key in B, remaining elements in
A
Sort A recursively
To combine sorted A and B, append B to sorted A
◦ Use Max() to find largest element → recursive SelectionSort()
◦ Use bubbling process to find and move largest element to right-most position → recursive BubbleSort()
All O(n2)
Partitioning - Choice 3
Let’s try to achieve balanced partitioning
A gets n/2 elements, B gets the remaining half
Sort A and B recursively
Combine sorted A and B using a process called merge, which combines two
sorted lists into one
◦ How? We will see soon
Pseudo-code- Merge Sort
1. procedure mergesort( var a as array )
2. if ( n == 1 ) return a
3. var l1 as array = a[0] ... a[n/2]
4. var l2 as array = a[n/2+1] ... a[n]
5. l1 = mergesort( l1 )
6. l2 = mergesort( l2 )
7. return merge( l1, l2 )
8. end procedure
Pseudo-code- Merge Sort
1. procedure merge( var a as array, var b as array )
2.    var c as array
3.    while ( a and b have elements )
4.       if ( a[0] > b[0] )
5.          add b[0] to the end of c
6.          remove b[0] from b
7.       else
8.          add a[0] to the end of c
9.          remove a[0] from a
10.      end if
11.   end while
12.   while ( a has elements )
13.      add a[0] to the end of c
14.      remove a[0] from a
15.   end while
16.   while ( b has elements )
17.      add b[0] to the end of c
18.      remove b[0] from b
19.   end while
20.   return c
21. end procedure
Example
Partition into lists of size n/2

[10, 4, 6, 3, 8, 2, 5, 7]

[10, 4, 6, 3] [8, 2, 5, 7]

[10, 4] [6, 3] [8, 2] [5, 7]

[4] [10] [3][6] [2][8] [5][7]


Example Cont’d
Merge

[2, 3, 4, 5, 6, 7, 8, 10 ]

[3, 4, 6, 10] [2, 5, 7, 8]

[4, 10] [3, 6] [2, 8] [5, 7]

[4] [10] [3][6] [2][8] [5][7]


Merge Sort
#include <stdio.h>

#define max 10

int a[max] = { 10, 14, 19, 26, 27, 31, 33, 35, 42, 44 };
int b[max];

void merging(int low, int mid, int high) {
    int l1, l2, i;

    /* merge the sorted runs a[low..mid] and a[mid+1..high] into b */
    for (l1 = low, l2 = mid + 1, i = low; l1 <= mid && l2 <= high; i++) {
        if (a[l1] <= a[l2])
            b[i] = a[l1++];
        else
            b[i] = a[l2++];
    }
    while (l1 <= mid)
        b[i++] = a[l1++];
    while (l2 <= high)
        b[i++] = a[l2++];
    /* copy the merged run back into a */
    for (i = low; i <= high; i++)
        a[i] = b[i];
}

void sort(int low, int high) {
    int mid;

    if (low < high) {
        mid = (low + high) / 2;
        sort(low, mid);
        sort(mid + 1, high);
        merging(low, mid, high);
    }
}

int main() {
    int i;

    printf("List before sorting\n");
    for (i = 0; i < max; i++)
        printf("%d ", a[i]);

    sort(0, max - 1);

    printf("\nList after sorting\n");
    for (i = 0; i < max; i++)
        printf("%d ", a[i]);
    return 0;
}
Heap and Heap Sort
A heap is a specialized tree-based data structure that satisfies the heap property:
“If A is a parent node of B then the key (the value) of node A is ordered with
respect to the key of node B with the same ordering applying across the heap”
The Heap Data Structure
Def: A heap is a nearly complete binary tree with the following two properties:
◦ Structural property: all levels are full, except possibly the last one, which is filled from
left to right
◦ Order (heap) property: for any node x
Parent(x) ≥ x

A heap is a binary tree that is filled in order.

Example (max-heap):

        8
       / \
      7   4
     / \
    5   2

From the heap property, it follows that: "The root is the maximum element of the heap!"
Heap and Heap Sort
A binary heap is a complete binary tree which satisfies the heap ordering property.
The ordering can be one of two types:
1. the min-heap property: the value of each node is greater than or equal to the
value of its parent, with the minimum-value element at the root.
2. the max-heap property: the value of each node is less than or equal to the value
of its parent, with the maximum-value element at the root.
Heap and Heap Sort
MinHeap MaxHeap
Heap Types
Max-heaps (largest element at root), have the max-heap property:
◦ for all nodes i, excluding the root:
A[PARENT(i)] ≥ A[i]

Min-heaps (smallest element at root), have the min-heap property:


◦ for all nodes i, excluding the root:
A[PARENT(i)] ≤ A[i]

Heap and Heap Sort
The highest (or lowest) priority element is always stored at the root, hence the
name "heap".
A heap is not a sorted structure and can be regarded as partially ordered.
Since a heap is a complete binary tree, it has the smallest possible height – a heap with N nodes always has O(log N) height.
Heap and Heap Sort
Heap Representations
Linked Structure
Array
Array – more advantageous
1. No wastage of array space – complete binary tree
2. Null entries if any at the tail end only
3. No links needed for parent and descendants
Heap and Heap Sort
Insertion to a heap
◦ The new element is initially appended to the end of the heap
◦ The heap property is repaired by comparing the added element with its
parent and moving the added element up a level
◦ This process is called "percolation up".
◦ The comparison is repeated until the parent is larger than or equal to the
percolating element.
Heap and Heap Sort
MaxHeap - Insertion
Heap and Heap Sort
Algorithm InsertMaxHeap
Input: ITEM, the data to be inserted; N, the number of nodes in the heap
Output: ITEM inserted into the heap tree
DS: array A[1….Size]
Heap and Heap Sort
Steps
1. If (N >= SIZE) then
2.    Print "Heap Tree is saturated"
3.    Exit
4. Else
5.    N = N + 1
6.    A[N] = ITEM
7.    i = N
8.    p = i/2
9.    While (p > 0) and (A[p] < A[i]) do
10.      temp = A[i]
11.      A[i] = A[p]
12.      A[p] = temp
13.      i = p
14.      p = i/2
15.   EndWhile
16. EndIf
17. Stop
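A minimal C sketch of the same percolate-up logic (1-based indices as in the steps above; the function name and SIZE are illustrative, not part of the original algorithm):

#define SIZE 100

/* Insert item into a max-heap stored in A[1..*n]; returns 0 on overflow. */
int insert_max_heap(int A[], int *n, int item)
{
    if (*n >= SIZE - 1)
        return 0;                    /* heap tree is saturated */
    int i = ++(*n);                  /* append at the end of the heap */
    A[i] = item;
    int p = i / 2;                   /* parent index */
    while (p > 0 && A[p] < A[i]) {   /* percolate up */
        int temp = A[i];
        A[i] = A[p];
        A[p] = temp;
        i = p;
        p = i / 2;
    }
    return 1;
}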
Heap and Heap Sort
Delete from MinHeap
The minimum element can be found at the root, which is the first
element of the array.
Remove the root and replace it with the last element of the heap
Then restore the heap property by percolating down.
Heap and Heap Sort

• Delete from MaxHeap
• The maximum element can be found at the root, which is the first element of the array.
• Remove the root and replace it with the last element of the heap.
• Then restore the heap property by percolating down.
Heap and Heap Sort
Algorithm DeleteMaxHeap
Input: A heap tree with elements
Output: ITEM data to be deleted, and remaining tree
after deletion
DS: array A[1….Size]
Heap and Heap Sort
1. If (N = 0) then
2.    Print "Heap Tree is exhausted"
3.    Exit
4. EndIf
5. ITEM = A[1]
6. A[1] = A[N]
7. N = N − 1
8. flag = FALSE, i = 1
9. While (flag = FALSE) and (i < N) do
10.   lchild = 2*i, rchild = 2*i + 1
11.   If (lchild <= N) then
12.      x = A[lchild]
13.   Else
14.      x = −∞
15.   EndIf
16.   If (rchild <= N) then
17.      y = A[rchild]
18.   Else
19.      y = −∞
20.   EndIf
21.   If (A[i] >= x) and (A[i] >= y) then
22.      flag = TRUE
23.   Else
24.      If (x >= y) then                 { larger child is on the left }
25.         swap(A[i], A[lchild])
26.         i = lchild
27.      Else                             { larger child is on the right }
28.         swap(A[i], A[rchild])
29.         i = rchild
30.      EndIf
31.   EndIf
32. EndWhile
33. Stop
Array Representation of Heaps
A heap can be stored as an array A.
◦ Root of tree is A[1]
◦ Left child of A[i] = A[2i]
◦ Right child of A[i] = A[2i + 1]
◦ Parent of A[i] = A[⌊i/2⌋]
◦ Heapsize[A] ≤ length[A]

The elements in the subarray A[⌊n/2⌋ + 1 .. n] are leaves.
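In C, these index relations are one-liners (illustrative helpers, 1-based indexing):

/* 1-based heap stored in an array: index arithmetic */
int parent(int i)      { return i / 2; }     /* floor(i/2) via integer division */
int left_child(int i)  { return 2 * i; }
int right_child(int i) { return 2 * i + 1; }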
Operations on Heaps
Maintain/Restore the max-heap property
◦ MAX-HEAPIFY

Create a max-heap from an unordered array


◦ BUILD-MAX-HEAP

Sort an array in place


◦ HEAPSORT

Priority queues
Maintaining the Heap Property

Suppose a node is smaller than a child, while the left and right sub-trees of i are max-heaps.
To eliminate the violation:
◦ Exchange with larger child
◦ Move down the tree
◦ Continue until node is not smaller than children
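A minimal C sketch of this repair step (the MAX-HEAPIFY operation listed earlier; 1-based indices, n is the heap size, written iteratively rather than recursively):

/* Restore the max-heap property at index i, assuming the left and
   right sub-trees of i are already max-heaps (1-based array). */
void max_heapify(int A[], int n, int i)
{
    for (;;) {
        int l = 2 * i, r = 2 * i + 1;
        int largest = i;
        if (l <= n && A[l] > A[largest]) largest = l;
        if (r <= n && A[r] > A[largest]) largest = r;
        if (largest == i)
            break;                    /* node is not smaller than its children */
        int temp = A[i];              /* exchange with the larger child */
        A[i] = A[largest];
        A[largest] = temp;
        i = largest;                  /* continue down the tree */
    }
}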
Example
MAX-HEAPIFY(A, 2, 10)

A[2] ↔ A[4]: A[2] violates the heap property; exchange it with its larger child A[4]. Now A[4] violates the heap property.
A[4] ↔ A[9]: exchange again; the heap property is restored.


Hash Tables
The implementation of hash tables is called hashing.

Hashing is a technique used for performing insertions,


deletions and searches in constant average time (i.e. O(1))

This data structure, however, is not efficient in operations


that require any ordering information among the elements,
such as findMin, findMax and printing the entire table in
sorted order.
General Idea
The ideal hash table structure is merely an array of some fixed size, containing the items.

A stored item needs to have a data member, called key, that will be used in computing the index
value for the item.

◦ Key could be an integer, a string, etc

◦ e.g. a name or Id that is a part of a large employee structure

The size of the array is TableSize.

The items that are stored in the hash table are indexed by values from 0 to TableSize – 1.

Each key is mapped into some number in the range 0 to TableSize – 1.

The mapping is called a hash function.


Example Hash Table

Items: john 25000, phil 31250, dave 27500, mary 28200

key → Hash Function → index

0:
1:
2:
3: john 25000
4: phil 31250
5:
6: dave 27500
7: mary 28200
8:
9:
Hash Function
Hash function: function that governs the mapping of key
values to indices of hash table is called hash function

The hash function:


◦ must be simple, easy and quick to compute.

◦ must distribute the keys evenly among the cells.

If we know which keys will occur in advance we can write


perfect hash functions, but we don’t.
Hash function
Problems:

Keys may not be numeric.

Number of possible keys is much larger than the space


available in table.

Different keys may map into same location


◦ Hash function is not one-to-one => collision.
◦ If there are too many collisions, the performance of the hash
table will suffer dramatically.
Hash Functions
If the input keys are integers then simply Key mod TableSize is a
general strategy.
◦ Unless key happens to have some undesirable properties. (e.g. all keys end
in 0 and we use mod 10)

If the keys are strings, hash function needs more care.


◦ First convert it into a numeric value.
Some methods
Truncation:
◦ e.g. 123456789 map to a table of 1000 addresses by picking 3
digits of the key.
Folding:
◦ e.g. 123|456|789: add them and take mod.
Key mod N:
◦ N is the size of the table, better if it is prime.
Squaring:
◦ Square the key and then truncate
Radix conversion:
◦ e.g. 1 2 3 4: treat it as base 11, truncate if necessary.
Hash Function 1
Add up the ASCII values of all characters of the key.

int hash(const string &key, int tableSize)
{
    int hashVal = 0;

    for (int i = 0; i < key.length(); i++)
        hashVal += key[i];
    return hashVal % tableSize;
}

• Simple to implement and fast.
• However, if the table size is large, the function does not distribute the keys well.
• e.g. table size = 10000, key length <= 8: the hash function can assume values only between 0 and 1016 (= 127 · 8).
Hash Function 2
Examine only the first 3 characters of the key.

int hash(const string &key, int tableSize)
{
    return (key[0] + 27 * key[1] + 729 * key[2]) % tableSize;
}

• In theory, 26 * 26 * 26 = 17576 different words can be generated. However, English is not random: only 2851 different combinations are possible.
• Thus this function, although easily computable, is also not appropriate if the hash table is reasonably large.
Properties of Good Hash Functions

Must return a number in 0, …, TableSize − 1

Should be efficiently computable – O(1) time

Should not waste space unnecessarily
◦ For every index, there is at least one key that hashes to it
◦ Load factor λ = (number of keys) / TableSize

Should minimize collisions (= different keys hashing to the same index)
Integer Keys
Hash(x) = x % TableSize
Good idea to make TableSize prime. Why?
Integer Keys
Hash(x) = x % TableSize
Good idea to make TableSize prime. Why?
◦ Because keys are typically not randomly distributed, but
usually have some pattern
◦ mostly even
◦ mostly multiples of 10
◦ in general: mostly multiples of some k
◦ If k is a factor of TableSize, then only (TableSize/k) slots will
ever be used!
◦ Since the only factor of a prime number is itself, this phenomenon only hurts in the (rare) case where k = TableSize
Strings as Keys
If keys are strings, can get an integer by adding up ASCII
values of characters in key
for (i=0;i<key.length();i++)
hashVal += key.charAt(i);

Problem 1: What if TableSize is 10,000 and all keys are 8


or less characters long?

Problem 2: What if keys often contain the same


characters (“abc”, “bca”, etc.)?
Hashing Strings
Basic idea: consider the string to be an integer (base 128):
Hash("abc") = ('a'·128^2 + 'b'·128^1 + 'c') % TableSize

Problem: although a char can hold 128 values (7-bit ASCII), only a subset of these values are commonly used (26 letters plus some special characters)
◦ So just use a smaller "base"
◦ Hash("abc") = ('a'·32^2 + 'b'·32^1 + 'c') % TableSize

Making the String Hash Easy to Compute
Horner's Rule

int hash(String s) {
    int h = 0;
    for (int i = s.length() - 1; i >= 0; i--) {
        h = (s.charAt(i) + (h << 5)) % tableSize;   // h << 5 is h * 32
    }
    return h;
}

Advantages:
1. minimum number of multiplications (handled by shifts!)
2. avoids overflow, because it does mods during the computation
Optimal Hash Function
The best hash function would distribute keys as evenly as
possible in the hash table
“Simple uniform hashing”
◦ Maps each key to a (fixed) random number
◦ Idealized gold standard
◦ Simple to analyze
◦ Can be closely approximated by best hash functions
Example:
Hash Table of size 10
Keys: 10, 19, 35, 43, 62, 59, 31, 49, 77, 33
Hash function:
1. Add the 2 digits in the key
2. Take the digit at unit’s place as index
Example:
K  → H(K)        Hash Table
10 → 1           0: 19
19 → 0           1: 10
35 → 8           2:
43 → 7           3: 49
62 → 8           4: 59, 31, 77
59 → 4           5:
31 → 4           6: 33
49 → 3           7: 43
77 → 4           8: 35, 62
33 → 6           9:
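The same toy hash in C (a sketch assuming two-digit keys, as in this example):

/* Add the two digits of the key and take the digit at the unit's place. */
int toy_hash(int key)                 /* assumes 0 <= key <= 99 */
{
    int digit_sum = key / 10 + key % 10;
    return digit_sum % 10;            /* e.g. 59 -> 5 + 9 = 14 -> 4 */
}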
Division Method
Fast hashing method, widely accepted.
1. Choose a number h larger than N (usually a prime number)
2. N is the number of keys in K
3. Hash function H:
   H(k) = k mod h, if indices start from 0
   H(k) = (k mod h) + 1, if indices start from 1
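A C sketch for 0-based indices (h is assumed to be a prime larger than the number of keys, per point 1):

/* Division method: H(k) = k mod h */
int hash_division(int k, int h)
{
    return k % h;                     /* add 1 if indices start from 1 */
}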
MidSquare Method
Widely accepted.
1. The address x is obtained by selecting an appropriate number of bits/digits from the middle of the square of the key value
2. Depends on the size of the hash table
H(k) = x
Limitation: time-consuming computation
Advantage: good results and a uniform distribution of the keys
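A hedged C illustration of the mid-square idea (picking 3 middle digits of the square, and the 4-digit key range, are assumptions made for this example):

/* Mid-square method: square the key, then take digits from the
   middle of the result as the hash address. */
int hash_midsquare(int k)             /* assumes k <= 9999 */
{
    long sq = (long)k * k;            /* up to 8 digits */
    return (int)((sq / 100) % 1000);  /* drop 2 low digits, keep 3 -> 0..999 */
}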


Folding Method
1. Pure Folding

2. Fold Shifting

3. Fold Boundary

Chopping: dividing a key into pieces.

Folding also helps to convert a multi-word key into a single word so that another hash function can work on it.
Folding Method
Key: 3455677234

Partitioning: 003 | 455 |677 | 234

Adding: 003+455+677+234 = 1369

Address: 369 (the carry is ignored)
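A pure-folding sketch in C that matches this example (3-digit pieces; the final mod drops the carry):

/* Split the key into 3-digit pieces, add them, keep 3 digits. */
int hash_fold(long key)               /* e.g. 3455677234 -> 369 */
{
    int sum = 0;
    while (key > 0) {
        sum += key % 1000;            /* take the next 3-digit piece */
        key /= 1000;
    }
    return sum % 1000;                /* the carry is ignored */
}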


Digit Analysis Method
❑Form hash addresses by extracting and/or shifting
the extracted digits/bits of the original key

❑Decision for extraction and rearrangement is based


on some analysis

❑Position and rearrangement methods chosen


should be the same for all keys in the set

❑Eg: Extraction of digits at even position from the key


6732541 and reversing them gives the address 427
Digit Analysis Method
❑Analyzing the digits to be extracted

❑For each criterion, hash addresses are calculated


and then a graph is plotted

❑The criterion which produces the most uniform


distribution is chosen

❑Uniform -> graph with smallest peaks and valleys


Digit Analysis Method
❑Keys: 2234, 3452, 2784

❑Best distribution was seen when digits at second


and third positions were extracted

❑Addresses with no rearrangement are 23, 45, 62

❑Rearranged addresses: 32, 54, 26


Direct Addressing
❑Direct-address tables – the most elementary form of hashing.

❑Assumption – a direct one-to-one correspondence between the keys and the numbers 0, 1, …, m − 1.

❑Keys are not very large.

❑Searching is fast, but there is a cost – the size of the array we need is the size of the largest key.

❑Not very useful if only a few keys are widely distributed.

Collision – problem and resolution
❑When an inserted element hashes to the same value as an already inserted element, we have a collision and need to resolve it.

❑we must have a systematic method for placing the


second item in the hash table. This process is called
collision resolution.
Collision – problem and resolution
❑There are several methods for dealing with this:

1. Separate chaining or open hashing

2. Open addressing or closed hashing


a. Linear Probing

b. Quadratic Probing

c. Double Hashing
Separate Chaining or chaining
The idea is to keep a list of all elements that hash to
the same value.
◦ The array elements are pointers to the first nodes of the
lists.

◦ A new item is inserted to the front of the list.


Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.
0: 0
1: 81 → 1
2:
3:
4: 64 → 4
5: 25
6: 36 → 16
7:
8:
9: 49 → 9
Chaining Technique
Advantages:
• Better space utilization for large items.
• Simple collision handling: searching linked list.
• Overflow: we can store more items than the hash table
size.
• Deletion is quick and easy: deletion from the linked list.

Disadvantages:
• Cost of maintaining linked lists
• Extra storage space for link fields
Algorithm HashChaining

Input: K is the item


Output: if K is found in the hash table then return the pointer to the node which contains the key value; else insert K into the linked list
Data Structure: hash table H storing pointers to the singly linked lists
Algorithm HashChaining – steps
1. Index = HashFunction(K)
2. ptr = H[Index]
3. flag = FALSE
4. While (ptr ≠ NULL) and (flag = FALSE) do
5.    If (ptr->DATA = K) then
6.       flag = TRUE
7.       Return(ptr)
8.       Exit
9.    Else
10.      ptr = ptr->LINK
11.   EndIf
12. EndWhile
13. If (flag = FALSE) then
14.    Print "key does not exist"
15.    If (INSERT) then
16.       InsertEnd_SL(H[Index])
17.    EndIf
18. EndIf
19. Stop
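A minimal C sketch of the same search-or-insert logic (here a new key is inserted at the front of its list, per the earlier slide; the names Node, H and HSIZE are illustrative):

#include <stdlib.h>

#define HSIZE 10

typedef struct node {
    int data;
    struct node *link;
} Node;

Node *H[HSIZE];                       /* table of list heads, all NULL initially */

int hash_chain(int k) { return k % HSIZE; }

/* Return the node holding k; if absent, insert it at the front. */
Node *chain_search_insert(int k)
{
    int index = hash_chain(k);
    for (Node *ptr = H[index]; ptr != NULL; ptr = ptr->link)
        if (ptr->data == k)
            return ptr;               /* key found */
    Node *n = malloc(sizeof *n);      /* key does not exist: insert */
    n->data = k;
    n->link = H[index];
    H[index] = n;
    return n;
}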
Open Addressing
If we have enough contiguous memory to store all the keys (m > N), store the keys in the table itself (e.g., insert 14).
No need to use linked lists anymore.

Basic idea:
◦ Insertion: if a slot is full, try another one, until you find an empty one
◦ Search: follow the same sequence of probes

Search time depends on the length of the probe sequence!
Generalize hash function notation:
A hash function now contains two arguments: (i) the key value, and (ii) the probe number:

h(k, p), p = 0, 1, ..., m−1

Probe sequence: <h(k,0), h(k,1), ..., h(k,m−1)>
◦ Must be a permutation of <0, 1, ..., m−1>
◦ There are m! possible permutations
◦ Good hash functions should be able to produce all m! probe sequences

Example: inserting 14 might follow the probe sequence <1, 5, 9>.
Common Open Addressing Methods
Linear probing
Quadratic probing
Double hashing

Note: none of these methods can generate more than m^2 different probe sequences!
Linear Probing
In linear probing, collisions are resolved by sequentially scanning an array
(with wraparound) until an empty cell is found.
◦ i.e. f is a linear function of i, typically f(i)= i.

Example:
◦ Insert items with keys: 89, 18, 49, 58, 9 into an empty hash table.
◦ Table size is 10.
◦ Hash function is hash(x) = x mod 10.
◦ f(i) = i;
Figure: linear probing hash table after each insertion.
Search and Delete
The Search algorithm follows the same probe sequence as the insert
algorithm.
◦ A search for 58 would involve 4 probes.
◦ A search for 19 would involve 5 probes.

We must use lazy deletion (i.e. marking items as deleted)


◦ Standard deletion (i.e. physically removing the item) cannot be performed.
◦ e.g. remove 89 from hash table.
Clustering Problem
As long as table is big enough, a free cell can always be found, but the time
to do so can get quite large.
Worse, even if the table is relatively empty, blocks of occupied cells start
forming.
This effect is known as primary clustering.
Any key that hashes into the cluster will require several attempts to resolve
the collision, and then it will add to the cluster.
Algorithm LinearProbe
Input: K is the key value to be searched or inserted
Output: Return location if found, else insert K in the table if no overflow
occurs
DS: Hash table H of size HSIZE
Algorithm LinearProbe - Steps
1. flag = FALSE
2. index = HashFunction(K)
3. If (K = H[index]) then
4.    Return index
5.    Exit
6. Else
7.    i = index + 1
8.    While (i ≠ index) and (not flag) do
9.       If (H[i] is empty) then        { free cell: K is not in the table }
10.         If (INSERT) then
11.            H[i] = K
12.         EndIf
13.         flag = TRUE
14.      Else
15.         If (H[i] = K) then
16.            flag = TRUE
17.            Return i
18.            Exit
19.         Else
20.            i = (i mod h) + 1        { next cell, with wraparound }
21.         EndIf
22.      EndIf
23.   EndWhile
24.   If (flag = FALSE) and (i = index) then
25.      Print "The table is full (overflow)"
26.   EndIf
27. EndIf
28. Stop
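A minimal C sketch of the same probe loop (the EMPTY sentinel, the fixed table size and the combined search/insert flag are illustrative assumptions):

#define HSIZE 10
#define EMPTY (-1)                    /* sentinel marking a free cell */

/* Linear probing: search for k; if absent and insert is nonzero,
   place k in the first free cell (with wraparound). Returns the
   index of k, or -1 on "not found" / table overflow. */
int linear_probe(int H[], int k, int insert)
{
    int start = k % HSIZE;
    int i = start;
    do {
        if (H[i] == k)
            return i;                 /* found */
        if (H[i] == EMPTY) {          /* free cell: k is not in the table */
            if (insert) { H[i] = k; return i; }
            return -1;
        }
        i = (i + 1) % HSIZE;          /* scan sequentially, wrap around */
    } while (i != start);
    return -1;                        /* the table is full (overflow) */
}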
Quadratic Probing
Quadratic probing eliminates the primary clustering problem of linear probing.
The collision function is quadratic; the popular choice is f(i) = i^2.
If the hash function evaluates to h and a search in cell h is inconclusive, we try cells h + 1^2, h + 2^2, …, h + i^2:
(H(K) + i^2) mod h, i = 1, 2, 3, …
◦ i.e. it examines cells 1, 4, 9 and so on away from the original probe.
Remember that subsequent probe points are a quadratic number of positions from the original probe point.
Figure: a quadratic probing hash table after each insertion (note that the table size was poorly chosen because it is not a prime number).
Quadratic Probing

Problem:
◦ We may not be sure that we will probe all locations in the table (i.e.
there is no guarantee to find an empty cell if table is more than half full.)
◦ If the hash table size is not prime this problem will be much severe.

However, there is a theorem stating that:


◦ If the table size is prime and load factor is not larger than 0.5, all probes
will be to different locations and an item can always be inserted.
Quadratic Probing
Load Factor

The load factor α of a hash table with n elements is


given by the following formula: α = n / table.length
Some considerations
How efficient is calculating the quadratic probes?
◦ Linear probing is easily implemented; quadratic probing appears to require * and % operations.
◦ However, by use of the following equation, this is overcome:
      Hi = Hi−1 + 2i − 1 (mod M)
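The identity holds because h + i^2 = (h + (i − 1)^2) + (2i − 1): each probe just adds the next odd number. A C sketch:

/* i-th quadratic probe from the (i-1)-th one, with no multiplication by i*i:
   H_i = H_{i-1} + 2i - 1 (mod M) */
int next_quadratic_probe(int prev, int i, int M)
{
    return (prev + 2 * i - 1) % M;
}

Starting from H_0 = h, successive calls yield h + 1, h + 4, h + 9, … (mod M).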
Some Considerations
What happens if load factor gets too high?
◦ Dynamically expand the table as soon as the load factor
reaches 0.5, which is called rehashing.
◦ Always double to a prime number.
◦ When expanding the hash table, reinsert the new table by
using the new hash function.
Rehashing
Rehashing Algorithm:
1. Allocate a new hash table twice the size of the
original table.
2. Reinsert each element of the old table into the new
table (using the hash function).
3. Reference the new table as the hash table.
Analysis of Quadratic Probing
Quadratic probing has not yet been mathematically
analyzed.

Although quadratic probing eliminates primary clustering, elements that hash to the same location will probe the same alternative cells. This is known as secondary clustering.

Techniques that eliminate secondary clustering are


available.
◦ the most popular is double hashing.
Double Hashing
Idea: Spread out the search for an empty slot by using a second hash function
◦ No primary or secondary clustering

h(k,i) = (h1(k) + i h2(k) ) mod m, i=0,1,...

A good choice of Hash2(X) can guarantee that probing does not get "stuck", as long as λ < 1.
◦ Integer keys example:
  Hash2(X) = R − (X mod R)
  where R is a prime smaller than TableSize
Double Hashing

Disadvantage: harder to delete an element


Can generate at most m^2 probe sequences
Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k, i) = (h1(k) + i·h2(k)) mod 13

Table (size 13): slot 1: 79, slot 4: 69, slot 5: 98, slot 7: 72, slot 11: 50; other slots empty.

Insert key 14:
h(14, 0) = 14 mod 13 = 1                                     (occupied by 79)
h(14, 1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5     (occupied by 98)
h(14, 2) = (h1(14) + 2·h2(14)) mod 13 = (1 + 8) mod 13 = 9   → 14 is stored in slot 9
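A C sketch matching this example (the table size 13 and the two hash functions are taken from the slide):

#define M 13                             /* table size (prime) */

int h1(int k) { return k % M; }
int h2(int k) { return 1 + (k % 11); }   /* never 0, so the probe advances */

/* i-th probe position for key k; for k = 14 this yields 1, 5, 9, ...
   exactly as traced above. */
int double_hash_probe(int k, int i)
{
    return (h1(k) + i * h2(k)) % M;
}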
Hashing – performance factors
1. Hash function:
   ◦ should distribute the keys and entries evenly throughout the entire table
   ◦ should minimize collisions
2. Collision resolution strategy:
   a. Open Addressing: store the key/entry in a different position
   b. Separate Chaining: chain several keys/entries in the same position
3. Table size:
   a. Too large a table will cause a wastage of memory
   b. Too small a table will cause increased collisions and eventually force rehashing (creating a new hash table of larger size and copying the contents of the current hash table into it)
   c. The size should be appropriate to the hash function used and should typically be a prime number
Applications
Keeping track of customer account information at a
bank
◦ Search through records to check balances and perform
transactions
Keep track of reservations on flights
◦ Search to find empty seats, cancel/modify reservations
Search engine
◦ Looks for all documents containing a given word
Special Case: Dictionaries
Dictionary = data structure that supports mainly two basic
operations: insert a new item and return an item with a given key
Queries: return information about the set S:
◦ Search (S, k)
◦ Minimum (S), Maximum (S)
◦ Successor (S, x), Predecessor (S, x)

Modifying operations: change the set


◦ Insert (S, k)
◦ Delete (S, k) – not very often
