Parallel Algorithm For The Chameleon Clustering Algorithm Using Dynamic Modeling
Existing algorithms use only a static model of clustering and not the information of the individual clusters when they are merged. CURE (Clustering Using REpresentatives) and other algorithms ignore the aggregate interconnectivity of the item sets of two clusters [7]. Because CURE uses only representative points from each cluster and merges only the closest pair of representative points, it does not consider the other points (data objects) in the clusters. ROCK (RObust Clustering using linKs), however, has an advantage over CURE in this respect.

B) Partition the sparse graph using the multilevel graph partitioning algorithms predefined in the hMETIS library [13] to obtain smaller clusters based on their similarity.

C) In the third phase, the algorithm merges the smaller clusters to remove noisy data and produce clusters of significant size.

K-nearest neighbor is a supervised algorithm used to find the similarity between data. It assigns a class label to data values, finds the ones that belong to the same class, and connects them with an edge.
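For illustration, a brute-force Python sketch of building such a graph, connecting each point to its k nearest neighbours under the Euclidean metric (the function name, the quadratic search, and the inverse-distance edge weights are assumptions made for this example, not the paper's implementation):

import math

def knn_graph(points, k):
    # Connect each point to its k nearest neighbours; the edge weight
    # grows with similarity (smaller distance -> larger weight).
    edges = set()
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j)
                         for j, q in enumerate(points) if j != i)[:k]
        for d, j in nearest:
            edges.add((min(i, j), max(i, j), 1.0 / (1.0 + d)))
    return edges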
The third phase, merging of the clusters, is done on the basis of their similarity to get the final clusters. Similarity is measured with two parameters:

i) Inter-connectivity

ii) Closeness

Inter-connectivity refers to the connection between two nodes, and closeness is defined as the similarity between two nodes in two different clusters. Relative inter-connectivity [3] is the connection between two nodes of two different clusters. For two clusters Ci and Cj, the relative inter-connectivity and the relative closeness are calculated using the following formulas.

Relative inter-connectivity:

$$RI(C_i, C_j) = \frac{\mathrm{Absolute\_IC}(C_i, C_j)}{\left(\mathrm{internal\_IC}(C_i) + \mathrm{internal\_IC}(C_j)\right)/2}$$

where Absolute_IC(Ci, Cj) is the sum of the weights of the edges that connect Ci with Cj, and internal_IC(Ci) is the weighted sum of the edges that partition the cluster into roughly equal parts. The internal closeness of a cluster is the average weight of the edges in that particular cluster.
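As a small worked check of the relative inter-connectivity formula (the numbers are invented purely for illustration): if the edges connecting $C_i$ and $C_j$ have total weight 4, while the internal bisection weights of the two clusters are 2 and 6, then

$$RI(C_i, C_j) = \frac{4}{(2 + 6)/2} = 1,$$

i.e. the pair is about as strongly connected to each other as each cluster is internally.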
Relative closeness of two clusters is defined as

$$RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_i) + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_j)}$$

where $\bar{S}_{EC}(C_i, C_j)$ is the average weight of the edges that belong to the cut-set between the clusters $C_i$ and $C_j$, $\bar{S}_{EC}(C_i)$ is the average weight of the edges that belong to the minimum cut of the cluster $C_i$, and $|C_i|$ is the number of nodes in the cluster $C_i$.
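Once these edge-weight aggregates have been computed from the sparse graph, both measures reduce to simple arithmetic. A minimal Python sketch (the function and parameter names are chosen for this example, not taken from the paper):

def relative_interconnectivity(absolute_ic, internal_ic_i, internal_ic_j):
    # RI(Ci, Cj) = Absolute_IC(Ci, Cj) / ((internal_IC(Ci) + internal_IC(Cj)) / 2)
    return absolute_ic / ((internal_ic_i + internal_ic_j) / 2.0)

def relative_closeness(sec_ij, sec_i, sec_j, size_i, size_j):
    # RC(Ci, Cj): average cut-set weight, normalised by each cluster's
    # average min-cut weight scaled by its share of the nodes.
    total = size_i + size_j
    return sec_ij / ((size_i / total) * sec_i + (size_j / total) * sec_j)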
Two clusters are merged if they have a high value for the product of relative inter-connectivity and relative closeness. A user can assign a higher priority to one of the two parameters through the exponent α:

$$\beta = RI(C_i, C_j)^{\alpha} \times RC(C_i, C_j)$$

If this function yields a value greater than or equal to the threshold value for similarity, the two clusters are merged. If α > 1, more priority is given to relative inter-connectivity; if α < 1, relative closeness gets more priority; and in the case of α = 1 both have the same priority.

Pseudo code for merging algorithms for final clusters

1 for each pair of clusters Ci, Cj
2   Calculate RI for Ci and Cj
3   Calculate RC for Ci and Cj
4   Calculate β = RI^α x RC
5   if β greater than or equal to th
6     merge the two clusters Ci and Cj
7   else
8     Do not merge the two clusters
9   end if

Disadvantages of chameleon include:

a) Complexity: the algorithm performs the inter-connectivity and inter-closeness computations between all the data objects, which leads to a high computational complexity on a single processor.

b) It cannot recover from database corruptions.

3. PROPOSED SOLUTION
In this paper an algorithm is proposed to reduce the time complexity of the chameleon algorithm. The exhaustive search time of the K-nearest-neighbor step is also optimized. The parallel merging algorithm, which merges datasets based on similarity, distributes the computation across threads. This makes it scalable to larger datasets and gives a highly robust computation, as one part of the dataset does not affect the other. It also improves the complexity of sorting, as heap sort is used. Sorting helps to keep similar item sets together [7].

Phase 1: Parallel K-Nearest Neighbor
This algorithm reduces the time complexity of the analysis. It reduces the search time by finding similar data and grouping them together.

Given a repository, it analyses the data and compares each training data tuple with a test tuple to obtain a similarity metric; this metric is stored as the similarity value between the two tested tuples. For the parallel implementation the following procedure is followed:

1) Sort the dataset based on the Euclidean metric, $d(p, q) = \sqrt{\sum_{k=1}^{m} (p_k - q_k)^2}$.
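A minimal sketch of this sorting step, assuming the points are ordered by their Euclidean distance from a common reference point (the reference point and all names here are illustrative assumptions, not the paper's code):

import heapq
import math

def sort_by_euclidean(points, reference):
    # Heap-sort the points by Euclidean distance from a reference point,
    # so that similar (nearby) items end up adjacent in the result.
    keyed = [(math.dist(p, reference), p) for p in points]
    heapq.heapify(keyed)                                          # build the heap
    return [heapq.heappop(keyed)[1] for _ in range(len(keyed))]  # shrink it

Heap sort gives the O(n log n) sorting bound the proposed solution relies on, and the sorted order keeps similar item sets next to each other.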
In the sparse graph, an edge-cut between two points represents the similarity between them: the greater the weight of the edge-cut, the more similar the two points are.

Phase 2: Partitioning the sparse graph
Partition the graph such that the edge cut is minimized, by finding the minimum edge cuts and removing them from the graph. This divides the sparse graph into smaller clusters on the basis of their similarity.

Phase 3: Merging the clusters to get the final clustering
As the clusters obtained by partitioning the sparse graph are very small and might contain noisy data [9], the clusters are merged to form larger clusters on the basis of two factors, the relative inter-connectivity and the relative closeness of the clusters.

Parallel algorithm for merging the clusters:
Parallel threads are used to perform the merging of the two clusters. Parallelism is achieved using the work-pool analogy. The merging of two clusters on the basis of their similarity is a task assigned to the work pool. The work pool is the set of these independent tasks, functions or methods, which can be executed in parallel. Threads that are not executing any task are assigned one from the pool. Each task merges two clusters and is defined by the function merge(Ci, Cj).
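A sketch of this work-pool analogy using Python's standard thread pool (parallel_merge_pass, merge_fn, and the worker count are illustrative assumptions, not the paper's implementation):

from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def parallel_merge_pass(clusters, merge_fn, workers=4):
    # Every unordered pair (Ci, Cj) becomes an independent task in the
    # work pool; idle threads pick up the next pending task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(merge_fn, ci, cj)
                   for ci, cj in combinations(clusters, 2)]
        results = [f.result() for f in futures]
    # keep only the pairs the similarity test actually merged
    return [r for r in results if r is not None]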
PSEUDO CODE

Algorithm:

1 void heapsort(array_of_nos, int n)
2 {
3   buildHp(array_of_nos, n);
4   shrinkHp(array_of_nos, n);
5 }
6 void buildHp(array_of_nos, n)
7 {
8   loop the three steps below till all nodes are checked;
9   chld = i - 1;
10  prnt = (chld - 1) / 2;
11  make the maximum of the children the parent;
12 }
13 void shrinkHp(array_of_nos, n)
14 {
15  // here each thread is assigned to a particular parent node
16  prnt = 0; // start from the root
17  compare the left and right child and make the maximum the parent;
18  take the max heap from each thread, thereby getting each parent node,
19  i.e. the nodes having a right and a left child;
20  knowing the position of this set of nodes, construct the others;
21 }
22 levelorder()
23 {
24  traverse the heap in level order by dividing the levels among threads;
25  connect only siblings to form a graph;
26 }

Parallel K-NN clustering with heap sorting algorithm

Pseudo code of parallel merging algorithms for final clusters

1 RI – Relative inter-connectivity
2 RC – Relative closeness
3 α – user-defined parameter
4 β – RI^α x RC
5 th – threshold value to take the merging decision
6 n – number of clusters to be merged
7 Algorithm:
8 for i = 0 ... n // i and j are used for clusters
   a. for j = i+1 ... n
      i. Assign the task merge(i, j) to the work pool;
      ii. End for // iteration j
   b. End for // iteration i
Merge function used above can be implemented as

1 merge(Ci, Cj)
2 Calculate RI for Ci and Cj
3 Calculate RC for Ci and Cj
4 Calculate β = RI^α x RC
5 if β greater than or equal to th
   a. merge the two clusters Ci and Cj
6 else
   a. Do not merge the two clusters
7 end if
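A direct Python rendering of this merge test (a sketch under the same definitions as the RI/RC helpers above; clusters are modelled as node sets, and ri_fn/rc_fn stand in for whatever computes the two measures):

def merge(ci, cj, ri_fn, rc_fn, alpha=1.0, th=0.5):
    # merge(Ci, Cj): compute beta = RI**alpha * RC and merge only
    # when beta reaches the threshold th.
    beta = (ri_fn(ci, cj) ** alpha) * rc_fn(ci, cj)
    if beta >= th:
        return ci | cj   # merged cluster: union of the two node sets
    return None          # do not merge

Returning None lets the work-pool pass shown earlier discard the rejected pairs.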
4. EXPERIMENTAL WORK AND RESEARCH
An advantage of the chameleon algorithm is that it can easily be used to cluster two-dimensional as well as three-dimensional data sets. Experimental work was carried out on parallel chameleon, and the results show improved performance compared to the serial chameleon. This paper compared the parallel implementation with the serial chameleon algorithm on two data sets of 8,000 and 10,000 data points. Parallel chameleon produces the same clustering as the serial implementation, but it provides better performance on processors with a multi-core architecture. Parallel chameleon was analyzed on two different processors, an Intel Core i5 and an Intel Core i7. The algorithm distributed the points equally among the threads so as to balance the work.
[Figure: elapsed time]
[9] Bradford L. Chamberlain. Graph Partitioning Algorithms for Distributing Workloads of Parallel Computations (generals exam). University of Washington Technical Report UW-CSE-98-10-03, October 1998.

[10] Foti, D.; Lipari, D.; Pizzuti, C. & Talia, D. Scalable Parallel Clustering for Data Mining on Multicomputers. Proceedings of the 15th IPDPS 2000 Workshops on Parallel and Distributed Processing, Springer-Verlag, 2000, 390-398.

[11] K. P. Soman, Shyam Diwakar, V. Ajay. Insight into Data Mining: Theory and Practice. PHI Learning Pvt Ltd, 2006.

[12] Xu, X.; Jäger, J. & Kriegel, H.-P. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Min. Knowl. Discov., Kluwer Academic Publishers, 1999, 3, 263-290.

[13] George Karypis and Vipin Kumar. hMETIS: A Hypergraph Partitioning Package, Version 1.5.3. Army HPC Research Center, November 22, 1998.

[14] https://developer.nvidia.com/cublas

[15] Guha, S.; Rastogi, R. & Shim, K. CURE: An Efficient Clustering Algorithm for Large Databases. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, ACM, 1998.