Merge sort
Merge sort is a sorting technique based on the divide-and-conquer approach. In merge sort the
unsorted list is divided into N sublists, each having one element, because a list of one element is
considered sorted. It then repeatedly merges these sublists to produce new sorted sublists, until at
last a single sorted list is produced. Merge sort is quite fast, with a time complexity of O(n log n).
#include <iostream>
using namespace std;
// Merge the sorted halves a[low..mid] and a[mid+1..high] using temp[]
void merge(int a[], int low, int mid, int high)
{
    int temp[30];
    int i = low, j = mid + 1, k = low;
    while (i <= mid && j <= high)        // copy the smaller front element of the two halves
    {
        if (a[i] <= a[j])
        {
            temp[k] = a[i];
            ++i;
            ++k;
        }
        else
        {
            temp[k] = a[j];
            ++j;
            ++k;
        }
    }
    while (i <= mid)                     // copy any remaining elements of the left half
    {
        temp[k] = a[i];
        ++i;
        ++k;
    }
    while (j <= high)                    // copy any remaining elements of the right half
    {
        temp[k] = a[j];
        ++j;
        ++k;
    }
    for (int i = low; i <= high; i++)    // copy the merged result back into a[]
        a[i] = temp[i];
}
// Recursively split the array, sort each half and merge the sorted halves
void mergesort(int a[], int low, int high)
{
    if (low < high)
    {
        int mid = (low + high) / 2;
        mergesort(a, low, mid);
        mergesort(a, mid + 1, high);
        merge(a, low, mid, high);
    }
}
int main()
{
    int n, i;
    int list[30];
    cout << "enter no of elements\n";
    cin >> n;
    cout << "enter " << n << " numbers\n";
    for (i = 0; i < n; i++)
        cin >> list[i];
    mergesort(list, 0, n - 1);
    cout << "after sorting\n";
    for (i = 0; i < n; i++)
        cout << list[i] << "\t";
    return 0;
}
RUN 1:
enter no of elements 5
enter 5 numbers 44 33 55 11 -1
after sorting -1 11 33 44 55
Heap sort
Heap sort is based on a heap: a complete binary tree with the property that a parent is always
greater than or equal to either of its children, if they exist (a max heap; in a min heap the parent is
smaller than or equal to its children). First the heap (max or min) is built from the input elements,
and then the sorted output is produced by repeatedly removing the root, using the heap as a
priority queue.
Steps Followed:
a) Start with just one element. One element will always satisfy the heap property.
b) Insert the next element and make this a heap again.
c) Repeat step b until all elements are included in the heap.
Steps of Sorting:
a) Exchange the root and the last element in the heap.
b) Make this a heap again, but this time do not include the last node.
c) Repeat steps a and b until there is no element left.
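For example (using the same numbers as the merge sort run above, as an illustration rather than part of the original text), inserting 44, 33, 55, 11 and -1 one by one produces the max heap 55 33 44 11 -1 in array form. Exchanging the root with the last element and re-heapifying the remaining elements then gives 44 33 -1 11 | 55, then 33 11 -1 | 44 55, then 11 -1 | 33 44 55, and finally -1 11 33 44 55, where the elements after the bar are already in their final sorted positions.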
#include <iostream>
using namespace std;
// To heapify a subtree rooted with node i which is
// an index in arr[]. n is the size of the heap.
void heapify(int arr[], int n, int i)
{
    int largest = i;     // Initialize largest as root
    int L = 2*i + 1;     // left = 2*i + 1
    int R = 2*i + 2;     // right = 2*i + 2
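    // What follows is a sketch of the standard remainder of heapify; the
    // heapSort driver and main below are illustrative additions, not part
    // of the original listing.
    // If a child exists and is larger than the current largest, remember it
    if (L < n && arr[L] > arr[largest])
        largest = L;
    if (R < n && arr[R] > arr[largest])
        largest = R;
    // If the largest is not the root, swap them and heapify the affected subtree
    if (largest != i)
    {
        int t = arr[i]; arr[i] = arr[largest]; arr[largest] = t;
        heapify(arr, n, largest);
    }
}
// Build a max heap, then repeatedly exchange the root (the maximum) with the
// last element of the heap and re-heapify the remaining elements.
void heapSort(int arr[], int n)
{
    for (int i = n/2 - 1; i >= 0; i--)   // build the heap
        heapify(arr, n, i);
    for (int i = n - 1; i > 0; i--)      // move the current maximum to the end
    {
        int t = arr[0]; arr[0] = arr[i]; arr[i] = t;
        heapify(arr, i, 0);              // re-heapify the remaining i elements
    }
}
int main()
{
    int arr[] = {44, 33, 55, 11, -1};    // sample data (illustrative)
    int n = sizeof(arr) / sizeof(arr[0]);
    heapSort(arr, n);
    for (int i = 0; i < n; i++)
        cout << arr[i] << "\t";          // prints: -1  11  33  44  55
    return 0;
}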
External Sort
So far, we have seen that sorting is an important operation in any database system. It means
arranging the data in either ascending or descending order. We use sorting not only for generating
sequenced output but also for satisfying the conditions of various database algorithms. In query
processing, for example, the sorting method is used to perform various relational operations, such
as joins, efficiently, but this requires that a sorted input be provided to the system. To sort a
relation we can build an index on the sort key and then use that index to read the relation in sorted
order. However, an index sorts the relation only logically, not physically. Thus, explicit sorting is
performed for the following two cases:
Case 1: Relations whose size is small or medium compared with the main memory.
Case 2: Relations whose size is larger than the main memory.
In Case 1, the small or medium-sized relation does not exceed the size of the main memory, so we
can fit it entirely in memory and sort it with standard in-memory methods such as quicksort or
merge sort.
In Case 2, the standard in-memory algorithms do not work well. Thus, for relations whose size
exceeds the memory size, we use the External Sort-Merge algorithm. Sorting relations that do not
fit in memory because their size is larger than the memory size is known as external sorting, and
the external sort-merge algorithm is the most suitable method for it.
External Sort-Merge Algorithm
In the algorithm, M signifies the number of disk blocks available in the main memory buffer for
sorting.
Stage 1: Initially, we create a number of sorted runs; each run contains only a portion of the
records of the relation.
1. i = 0;
2. repeat
3. read either M blocks of the relation, or the rest of the relation if it is smaller;
4. sort the in-memory part of the relation;
5. write the sorted data to run file Ri;
6. i = i + 1;
7. until the end of the relation is reached.
In Stage 1, we can see that the sorting operation is performed on the blocks that have been read
into memory. After completing the steps of Stage 1, we proceed to Stage 2.
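As a rough C++ sketch of Stage 1 (an illustration rather than part of the text: the run file names, the createRuns helper and the runCapacity parameter standing in for "M blocks" are all assumed here), the run-creation step can be written as:
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>
using namespace std;
// Stage 1 (sketch): read up to runCapacity records at a time, sort them in
// memory, and write each sorted batch to its own run file run0.txt, run1.txt, ...
// Returns the number of runs created.
int createRuns(const string& inputFile, size_t runCapacity)
{
    ifstream in(inputFile);
    vector<int> buffer;
    int runs = 0;
    int value;
    auto flush = [&]() {
        sort(buffer.begin(), buffer.end());               // sort the in-memory part
        ofstream out("run" + to_string(runs++) + ".txt");  // write run file Ri
        for (int v : buffer) out << v << '\n';
        buffer.clear();
    };
    while (in >> value)
    {
        buffer.push_back(value);
        if (buffer.size() == runCapacity)   // "M blocks" worth of records read
            flush();
    }
    if (!buffer.empty())                    // the rest of the relation, if smaller
        flush();
    return runs;
}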
Stage 2: In Stage 2, we merge the runs. Assume for now that the total number of runs, N, is less
than M, so that we can allocate one buffer block to each run and still have space left to hold one
block of output. The merge then proceeds as follows: read the first block of each run into its buffer
block; repeatedly pick the smallest record (in sort order) among all the buffer blocks and move it
to the output block, writing the output block to disk whenever it becomes full and refilling a run's
buffer block from its run file whenever it becomes empty; continue until all the runs are exhausted.
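A matching C++ sketch of this merge step follows (again illustrative, not from the text: it assumes the run files run0.txt, run1.txt, ... created by the Stage 1 sketch above, and keeps a single buffered record per run in place of a full buffer block):
#include <fstream>
#include <string>
#include <vector>
using namespace std;
// Stage 2 (sketch): N-way merge of runCount sorted run files into outputFile.
void mergeRuns(int runCount, const string& outputFile)
{
    vector<ifstream> runs(runCount);
    vector<int> front(runCount);     // the buffered "first record" of each run
    vector<bool> alive(runCount);    // whether the run still has records left
    for (int i = 0; i < runCount; i++)
    {
        runs[i].open("run" + to_string(i) + ".txt");
        alive[i] = static_cast<bool>(runs[i] >> front[i]);
    }
    ofstream out(outputFile);
    while (true)
    {
        int best = -1;
        for (int i = 0; i < runCount; i++)   // pick the smallest buffered record
            if (alive[i] && (best == -1 || front[i] < front[best]))
                best = i;
        if (best == -1)                      // every run is exhausted
            break;
        out << front[best] << '\n';          // move the record to the output
        alive[best] = static_cast<bool>(runs[best] >> front[best]); // refill the buffer
    }
}
Running the Stage 1 sketch and then this merge, while the number of runs stays below M, corresponds to the two stages described here.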
After completing Stage 2, we get the sorted relation as output. The output is buffered to minimize
the number of disk-write operations. Because this step merges N runs, it is known as an N-way
merge.
However, if the relation is much larger than the memory, then M or more runs will be generated
in Stage 1, and it is not possible to allocate a buffer block to every run while processing Stage 2.
In such a case, the merge operation proceeds in multiple passes. Since M-1 input buffer blocks fit
in memory alongside one output block, each merge can take up to M-1 runs as input. The initial
pass therefore works in the following way:
o It merges the first M-1 runs to obtain a single, longer run for the next pass.
o Similarly, it merges the next M-1 runs, and so on, until all the initial runs have
been processed. At this point the number of runs has been reduced by a factor
of M-1. If the reduced number of runs is still greater than or equal to M,
another pass is needed, whose input is the runs created by the first pass.
o Each subsequent pass likewise reduces the number of runs by a factor of M-1,
and this is repeated as many times as needed until the number of runs is less
than M.
o A final pass then produces the sorted output.
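For a concrete illustration (the numbers here are assumed, not taken from the text): with M = 11 buffer blocks and 90 initial runs, the first pass merges the runs M-1 = 10 at a time and leaves 9 longer runs; since 9 is less than M, a single final pass then merges these 9 runs into the sorted output.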
Every pass reads and writes each block of the relation only once, with two exceptions:
o The final pass can produce the sorted output without writing its result to disk.
o Some runs may not be read or written at all during a particular pass.
Neglecting such small exceptions, the total number of block transfers for external sorting of a
relation r comes out to:
b_r (2⌈log_(M-1)(b_r / M)⌉ + 1)
where b_r denotes the number of disk blocks containing records of r. To analyse the full cost of
the external sort-merge algorithm we also need the total number of disk seeks. If b_b buffer blocks
are allocated to each run while it is read, the total number of disk seeks comes out to:
2⌈b_r / M⌉ + ⌈b_r / b_b⌉ (2⌈log_(M-1)(b_r / M)⌉ - 1)
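As an illustration (the values here are assumed, not taken from the text): suppose b_r = 10,000 blocks, M = 100 buffer blocks and b_b = 10. Stage 1 creates ⌈10,000/100⌉ = 100 runs, and ⌈log_(99)(100)⌉ = 2 merge passes are needed, so the number of block transfers is 10,000 × (2×2 + 1) = 50,000, and the number of disk seeks is 2 × 100 + ⌈10,000/10⌉ × (2×2 - 1) = 200 + 3,000 = 3,200.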