5.web Data Mining

Unit-4

Merge sort
Merge sort is a sorting technique based on the divide-and-conquer strategy. The unsorted list of
N elements is divided into N sublists, each containing one element, because a list of one element
is already sorted. The sublists are then merged repeatedly to produce new sorted sublists until a
single sorted list remains. Merge sort has a time complexity of O(n log n) in the best, average, and worst cases.

Conceptually, merge sort works as follows:


1. Divide the unsorted list into two sub lists of about half the size.
2. Divide each of the two sub lists recursively until we have list sizes of length 1, in which case
the list itself is returned.
3. Merge the two sub lists back into one sorted list.
#include<iostream>
using namespace std;
void merge(int a[ ],int low,int mid,int high)
{
int temp[100];
int i,j,k;
i=low;
j=mid+1;
k=low;
while((i<=mid)&&(j<=high))
{
if(a[i]<=a[j])
{
temp[k]=a[i];
++i;
}
else
{
temp[k]=a[j];
++j;
}
++k;
}
// Copy whatever remains in the half that is not yet exhausted.
if(i>mid)
{
while(j<=high)
{
temp[k]=a[j];
++j;
++k;
}
}
else
{
while(i<=mid)
{
temp[k]=a[i];
++i;
++k;
}
}
for(int i=low;i<=high;i++)
a[i]=temp[i];
}

void mergesort(int a[],int low,int high)


{
int mid;
if(low<high)
{
mid=(low+high)/2;
mergesort(a,low,mid);
mergesort(a,mid+1,high);
merge(a,low,mid,high);
}
}

int main()
{
int n,i;
int list[30];
cout<<"enter no of elements\n";
cin>>n;
cout<<"enter "<<n<<" numbers\n";
for(i=0;i<n;i++)
cin>>list[i];
mergesort (list,0,n-1);
cout<<" after sorting\n";
for(i=0;i<n;i++)
cout<<list[i]<<"\t";
return 0;
}
RUN 1:

enter no of elements 5
enter 5 numbers 44 33 55 11 -1
after sorting -1 11 33 44 55
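
The program above copies through a fixed temp[100] buffer, so it only works while all indices stay below 100. The following is a minimal sketch of the same algorithm with std::vector, where the temporary storage is sized to the range being merged. The names mergeRanges and mergeSortVec are illustrative and are not part of the original program.

```cpp
#include <cstddef>
#include <vector>

// Merge the sorted ranges a[low..mid] and a[mid+1..high] (inclusive bounds).
void mergeRanges(std::vector<int>& a, std::size_t low, std::size_t mid, std::size_t high)
{
    std::vector<int> temp;
    temp.reserve(high - low + 1);
    std::size_t i = low, j = mid + 1;
    while (i <= mid && j <= high)
        temp.push_back(a[i] <= a[j] ? a[i++] : a[j++]);  // <= keeps equal keys in order
    while (i <= mid)  temp.push_back(a[i++]);            // leftover of the left half
    while (j <= high) temp.push_back(a[j++]);            // leftover of the right half
    for (std::size_t k = 0; k < temp.size(); ++k)
        a[low + k] = temp[k];
}

// Sort a[low..high] recursively; the caller guarantees low <= high < a.size().
void mergeSortVec(std::vector<int>& a, std::size_t low, std::size_t high)
{
    if (low >= high) return;                  // one element: already sorted
    std::size_t mid = low + (high - low) / 2;
    mergeSortVec(a, low, mid);
    mergeSortVec(a, mid + 1, high);
    mergeRanges(a, low, mid, high);
}

// Convenience wrapper that also handles the empty list.
void mergeSortVec(std::vector<int>& a)
{
    if (!a.empty()) mergeSortVec(a, 0, a.size() - 1);
}
```

Calling mergeSortVec(v) on a vector holding 44 33 55 11 -1 leaves it as -1 11 33 44 55, matching RUN 1.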

Heap sort
A heap is a complete binary tree with the property that a parent is always greater than or equal
to each of its children, if they exist (a max-heap; in a min-heap the parent is less than or equal
to its children). Heap sort first builds a heap from the input and then sorts it by repeatedly
removing the root, as in a priority queue.

Steps Followed:

a) Start with just one element. One element will always satisfy heap property.
b) Insert next elements and make this heap.
c) Repeat step b until all elements are included in the heap.

Steps of Sorting:
a) Exchange the root and last element in the heap.
b) Make this heap again, but this time do not include the last node.
c) Repeat steps a and b until there is no element left.
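
The two lists of steps above can be sketched as follows. This is a minimal illustration, assuming a max-heap stored in a std::vector<int>. The names siftUp, siftDown and heapSortByInsertion are hypothetical. Unlike the program below, which builds the heap bottom-up with heapify, this sketch builds it by inserting one element at a time, as the steps describe.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Move the element at index child up until its parent is not smaller (max-heap).
void siftUp(std::vector<int>& a, std::size_t child)
{
    while (child > 0) {
        std::size_t parent = (child - 1) / 2;
        if (a[parent] >= a[child]) break;
        std::swap(a[parent], a[child]);
        child = parent;
    }
}

// Restore the heap property downward from the root within a[0..n-1].
void siftDown(std::vector<int>& a, std::size_t n)
{
    std::size_t i = 0;
    while (true) {
        std::size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && a[l] > a[largest]) largest = l;
        if (r < n && a[r] > a[largest]) largest = r;
        if (largest == i) break;
        std::swap(a[i], a[largest]);
        i = largest;
    }
}

void heapSortByInsertion(std::vector<int>& a)
{
    // Heap creation: a[0] alone is a heap; insert a[1], a[2], ... one at a time.
    for (std::size_t i = 1; i < a.size(); ++i)
        siftUp(a, i);
    // Sorting: exchange root and last element, then re-heapify without the last node.
    for (std::size_t last = a.size(); last > 1; --last) {
        std::swap(a[0], a[last - 1]);
        siftDown(a, last - 1);
    }
}
```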

C++ program for implementation of Heap Sort

#include <iostream>
#include <utility> // std::swap
using namespace std;
// To heapify a subtree rooted with node i which is
// an index in arr[]. n is size of heap
void heapify(int arr[], int n, int i)
{
int largest = i; // Initialize largest as root
int L= 2*i + 1; // left = 2*i + 1
int R= 2*i + 2; // right = 2*i + 2

// If left child is larger than root
if (L < n && arr[L] > arr[largest])
largest = L;
// If right child is larger than largest so far
if (R < n && arr[R] > arr[largest])
largest = R;
// If largest is not root
if (largest != i)
{
swap(arr[i], arr[largest]);
// Recursively heapify the affected sub-tree
heapify(arr, n, largest);
}
}

void heapSort(int arr[], int n)


{ int i;
// Build heap (rearrange array)
for ( i = n / 2 - 1; i >= 0; i--)
heapify(arr, n, i);

// One by one extract an element from heap
for ( i=n-1; i>=0; i--)
{
// Move current root to end
swap(arr[0], arr[i]);

// call max heapify on the reduced heap


heapify(arr, i, 0);
}
}

/* A utility function to print array of size n */


void printArray(int arr[], int n)
{
for (int i=0; i<n; ++i)
cout << arr[i] << " ";
cout << "\n";
}
int main()
{
int n,i;
int list[30];
cout<<"enter no of elements\n";
cin>>n;
cout<<"enter "<<n<<" numbers\n";
for(i=0;i<n;i++)
cin>>list[i];
heapSort(list, n);
cout << "Sorted array is \n";
printArray(list, n);
return 0;
}
RUN 1:
enter no of elements 5
enter 5 numbers 11 99 22 101 1
Sorted array is
1 11 22 99 101
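
The heap sort description mentions sorting through a priority queue. As an alternative sketch, the standard std::priority_queue (a max-heap by default) can manage the heap itself. The function name sortWithPriorityQueue is hypothetical. Unlike the program above, this version is not in-place: it copies the input into the queue and returns a new vector.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

std::vector<int> sortWithPriorityQueue(const std::vector<int>& input)
{
    // Build a max-heap from the whole input in one step.
    std::priority_queue<int> pq(input.begin(), input.end());
    std::vector<int> out(input.size());
    // The largest remaining element is always on top, so fill from the back.
    for (std::size_t i = out.size(); i > 0; --i) {
        out[i - 1] = pq.top();
        pq.pop();
    }
    return out;
}
```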

External Sort
Sorting is an important operation in any database system: it arranges data in ascending or
descending order. Sorting is used not only to produce ordered output but also as a building block
for other database algorithms. In query processing, several relational operations, such as joins,
can be implemented efficiently when their input is sorted. One way to read a relation in sorted
order is to build an index on the sort key and scan the relation through that index. However, an
index orders the relation only logically, not physically. Physical sorting is therefore
considered for two cases:

Case 1: Relations of small or medium size that fit in main memory.

Case 2: Relations that are larger than main memory.

In Case 1, the relation fits entirely in main memory, so standard in-memory sorting methods such
as quicksort or merge sort can be used.

For Case 2, the standard in-memory algorithms cannot be applied directly. Sorting relations that
do not fit in memory is known as external sorting, and the most commonly used technique for it is
the external sort-merge algorithm.
External Sort-Merge Algorithm

The stages of the external sort-merge algorithm are described below.

In the algorithm, M denotes the number of disk blocks available in the main-memory buffer for
sorting.

Stage 1: First, a number of sorted runs are created. Each run is sorted, but it contains only
some of the records of the relation.

1. i = 0;
2. repeat
3.     read M blocks of the relation, or the rest of the relation, whichever is smaller;
4.     sort the in-memory part of the relation;
5.     write the sorted data to run file Ri;
6.     i = i + 1;
7. until the end of the relation
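
Stage 1 can be sketched in memory as follows. This is only an illustration under stated assumptions: records are plain ints, every block holds recordsPerBlock records, and a "run file" is modelled as a std::vector rather than a file on disk. The name createRuns is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split the relation into sorted runs of at most M blocks each.
std::vector<std::vector<int>> createRuns(const std::vector<int>& relation,
                                         std::size_t M,
                                         std::size_t recordsPerBlock)
{
    std::vector<std::vector<int>> runs;
    const std::size_t capacity = M * recordsPerBlock;  // records that fit in memory
    if (capacity == 0) return runs;
    for (std::size_t start = 0; start < relation.size(); start += capacity) {
        // Read M blocks or the rest of the relation, whichever is smaller.
        std::size_t end = std::min(start + capacity, relation.size());
        std::vector<int> run(relation.begin() + start, relation.begin() + end);
        std::sort(run.begin(), run.end());             // sort the in-memory part
        runs.push_back(std::move(run));                // "write" run file Ri
    }
    return runs;
}
```

With M = 3 and one record per block, the hypothetical input 5 2 9 1 7 3 8 yields three runs: (2 5 9), (1 3 7) and (8).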

Stage 1 therefore sorts the relation one memory-load of blocks at a time. After completing the
steps of Stage 1, proceed to Stage 2.

Stage 2: In Stage 2, we merge the runs. Assume for now that the total number of runs, N, is less
than M, so that one buffer block can be allocated to each run with one block left over to hold the
output. The merge proceeds as follows:

1. read one block of each of the N files Ri into a buffer block in memory;
2. repeat
3.     select the first tuple (in sort order) among all buffer blocks;
4.     write the tuple to the output, and then delete it from the buffer block;
5.     if the buffer block of any run Ri is empty and not EOF(Ri)
6.         then read the next block of Ri into the buffer block;
7. until all input buffer blocks are empty
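
The merge loop above can be sketched in memory as follows. This is an illustration, not a disk-based implementation: each run is a sorted std::vector<int>, the buffer block of a run is modelled by a cursor into that vector, and a min-heap picks the smallest head tuple. The name mergeRuns is hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge N sorted runs into one sorted output.
// Each heap entry is (value, run index); cursor[r] is the next unread position in run r.
std::vector<int> mergeRuns(const std::vector<std::vector<int>>& runs)
{
    using Entry = std::pair<int, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heads;
    std::vector<std::size_t> cursor(runs.size(), 0);

    // Load the first tuple of every non-empty run.
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty())
            heads.push({runs[r][0], r});

    std::vector<int> output;
    while (!heads.empty()) {
        Entry smallest = heads.top();      // first tuple in sort order
        heads.pop();
        output.push_back(smallest.first);  // write it to the output
        std::size_t r = smallest.second;
        if (++cursor[r] < runs[r].size())  // run r not at EOF: fetch its next tuple
            heads.push({runs[r][cursor[r]], r});
    }
    return output;
}
```

Merging the hypothetical runs (2 5 9), (1 3 7) and (8) produces 1 2 3 5 7 8 9.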

After Stage 2 completes, the output is the sorted relation. The output file is buffered to reduce
the number of disk writes. Because the algorithm merges N runs at once, it is called an N-way
merge.

However, if the relation is much larger than memory, M or more runs may be generated in Stage 1,
and it is not possible to allocate a block to each run during Stage 2. In this case, the merge
proceeds in multiple passes. Since there is enough memory for M-1 input buffer blocks, each merge
can take M-1 runs as its input. The initial pass works as follows:

o It merges the first M-1 runs to get a single run for the next pass.
o Similarly, it merges the next M-1 runs, and so on, until it has processed
all the initial runs. At this point, the number of runs has been reduced by a
factor of M-1. If this reduced number is still greater than or equal to M,
another pass is needed, with the runs created by the first pass as its input.
o Each pass reduces the number of runs by a factor of M-1. The passes repeat
as many times as needed until the number of runs is less than M.
o A final pass then produces the sorted output.
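
The effect of repeated passes can be checked with a small counting sketch. It assumes that every pass merges the current runs in groups of M-1 and that M is at least 3, so each merge combines at least two runs. The name countMergePasses is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>

// Count the merge passes needed to reduce initialRuns runs to a single sorted run.
std::size_t countMergePasses(std::size_t initialRuns, std::size_t M)
{
    if (M < 3) throw std::invalid_argument("M must be at least 3");
    std::size_t runs = initialRuns;
    std::size_t passes = 0;
    while (runs > 1) {
        runs = (runs + (M - 2)) / (M - 1);  // ceil(runs / (M - 1)) runs remain
        ++passes;
    }
    return passes;
}
```

For example, with 10 initial runs and M = 4, each pass merges three runs at a time: 10 runs become 4, then 2, then 1, so three passes are needed.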

Estimating cost for External Sort-Merge Method


The cost analysis of external sort-merge is performed using the above-discussed stages in the
algorithm:

o Let $b_r$ denote the number of blocks containing records of relation r.
o The first stage reads every block of the relation and writes it back out,
for a total of $2b_r$ block transfers.
o The initial number of runs is $\lceil b_r/M \rceil$. Since each merge pass
reduces the number of runs by a factor of M-1, the total number of merge
passes required is $\lceil \log_{M-1}(b_r/M) \rceil$.

Each pass reads and writes every block of the relation exactly once, with two exceptions:

o The final pass can produce the sorted output without writing its result to disk.
o Some runs may not be read or written during a pass.

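Ignoring the saving from runs that are skipped during a pass, and not counting the final write,
the total is $2b_r$ transfers for the first stage plus $2b_r$ per merge pass, minus $b_r$ for the
unwritten final output. The total number of block transfers for external sorting is therefore:

$$b_r \left( 2 \lceil \log_{M-1}(b_r/M) \rceil + 1 \right)$$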
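
The block-transfer formula can be evaluated with a short sketch. To avoid floating-point logarithms, the number of merge passes is computed with integers as the smallest p such that (M-1)^p is at least the initial number of runs. This equals the ceiling of the logarithm whenever the relation is larger than memory. The names mergePasses and blockTransfers are hypothetical, and M is assumed to be at least 3.

```cpp
#include <cstdint>
#include <stdexcept>

// Merge passes: smallest p with (M - 1)^p >= ceil(br / M).
std::uint64_t mergePasses(std::uint64_t br, std::uint64_t M)
{
    if (M < 3) throw std::invalid_argument("M must be at least 3");
    std::uint64_t runs = (br + M - 1) / M;   // initial runs, ceil(br / M)
    std::uint64_t passes = 0;
    while (runs > 1) {
        runs = (runs + M - 2) / (M - 1);     // ceil(runs / (M - 1))
        ++passes;
    }
    return passes;
}

// Block transfers: br * (2 * passes + 1), final write not counted.
std::uint64_t blockTransfers(std::uint64_t br, std::uint64_t M)
{
    return br * (2 * mergePasses(br, M) + 1);
}
```

With the hypothetical values b_r = 12 and M = 3, there are 4 initial runs and 2 merge passes, so blockTransfers(12, 3) returns 12 * (2 * 2 + 1) = 60.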


Disk seeks must be added to the cost. Run generation requires seeks for reading the data of each
run and for writing it back. In the merge phase, if each run is allocated $b_b$ buffer blocks,
that is, $b_b$ blocks are read from a run at a time, then each merge pass needs about
$\lceil b_r/b_b \rceil$ seeks for reading the data. The output is written sequentially, but if it
is on the same disk as the input runs, the head may have moved away between the writes of
consecutive blocks. We therefore add a total of $2\lceil b_r/b_b \rceil$ seeks for each merge
pass, except the final one, whose output is not written back. The total number of seeks is:

$$2 \lceil b_r/M \rceil + \lceil b_r/b_b \rceil \left( 2 \lceil \log_{M-1}(b_r/M) \rceil - 1 \right)$$
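
The seek formula can be evaluated in the same way. This sketch follows the formula exactly as stated in the text, including its use of M-1 as the merge fan-in. The name diskSeeks is hypothetical. Returning only the run-generation seeks when no merge pass is needed is an assumption of this sketch, since the formula is intended for relations larger than memory.

```cpp
#include <cstdint>
#include <stdexcept>

// Seeks: 2 * ceil(br / M) + ceil(br / bb) * (2 * passes - 1).
std::uint64_t diskSeeks(std::uint64_t br, std::uint64_t M, std::uint64_t bb)
{
    if (M < 3 || bb == 0) throw std::invalid_argument("need M >= 3 and bb >= 1");
    std::uint64_t initialRuns = (br + M - 1) / M;     // ceil(br / M)
    std::uint64_t runs = initialRuns;
    std::uint64_t passes = 0;
    while (runs > 1) {                                // count merge passes
        runs = (runs + M - 2) / (M - 1);
        ++passes;
    }
    std::uint64_t runGenerationSeeks = 2 * initialRuns;
    if (passes == 0) return runGenerationSeeks;       // assumption: nothing to merge
    std::uint64_t seeksPerPass = (br + bb - 1) / bb;  // ceil(br / bb)
    return runGenerationSeeks + seeksPerPass * (2 * passes - 1);
}
```

With the hypothetical values b_r = 12, M = 3 and b_b = 1, this gives 2 * 4 + 12 * (2 * 2 - 1) = 44 seeks.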

Thus, both block transfers and disk seeks must be counted when analyzing the
cost of the external sort-merge algorithm.

Example of External Merge-sort Algorithm


Let's understand the working of the external merge-sort algorithm and also
analyze the cost of the external sorting with the help of an example.

Suppose that we perform external sort-merge on a relation R. Assume that one
block holds only one tuple and that memory holds at most three blocks. During
Stage 2, the merge stage, two blocks are used for input and one block for
output.
