
International Journal of Computer Applications (0975 – 8887)
Volume 79 – No. 8, October 2013

Parallel Algorithm for the Chameleon Clustering Algorithm using Dynamic Modeling

Rajnish Dashora, SCSE, VIT University, Vellore
Harsh Bajaj, SCSE, VIT University, Vellore
Akshat Dube, SELECT, VIT University, Vellore
Geetha Mary A., SCSE, VIT University, Vellore

ABSTRACT
With the increasing size of data-sets in application areas like bio-medical, hospitals, information systems, scientific data processing and predictions, finance analytics, communications, retail and marketing, it is becoming increasingly important to execute data mining tasks in parallel. At the same time, technological advancements have made shared-memory parallel computation machines commonly available to organizations and individuals. This paper analyzes a hierarchical clustering algorithm named chameleon clustering, which is based on dynamic modeling, and proposes a parallel algorithm for the same. The algorithm utilizes the parallel processors available and hence reduces the time to generate the final clusters.

General Terms
Enhancement, Utilization, Clustering, Similarity, Algorithms.

Keywords
Multicore Processors; Data Mining; Cluster Analysis; Hierarchical Clustering; Chameleon; Data Points; Shared Memory; Symmetric Multiprocessing (SMP); Dynamic Modeling; ParMETIS.

1. INTRODUCTION
Cluster analysis can be used in market research, business, land use, biology, atmospheric research, astronomy, web-based applications, plant observation and load analysis in power systems. Clustering is the process of grouping elements into classes or groups based on the similarity of their constituent data. Clustering in a hierarchical form is known as hierarchical clustering, and agglomerative clustering merges the most similar data sets into one cluster.

Existing algorithms use only a static model of clustering and ignore the information about individual clusters when they are merged. CURE (Clustering Using REpresentatives) and other such algorithms ignore the aggregate interconnectivity of the item sets of two clusters [7]. As CURE uses only representative points from each cluster and merges only the closest pair of representative points, it does not consider the other points (data objects) in the clusters. ROCK (RObust Clustering using linKs), however, has an advantage over CURE [15] in that it takes the aggregate interconnectivity of the two clusters into account, but it ignores the information about their inter-closeness [11].

Chameleon clustering is an algorithm that uses dynamic modeling in hierarchical clustering. It does not depend on user-supplied information; instead, it automatically adapts to the internal characteristics of the clusters being merged [3]. To capture these internal characteristics it uses the relative inter-connectivity and the relative closeness of the two clusters [4]. Furthermore, to make the algorithm behave dynamically, it uses a k-nearest-neighbor graph, in which the neighborhood radius of a data object is defined by the density of the region in which the object lies.

2. RELATED WORKS
PCURE [7] is a parallel implementation of CURE that makes effective use of shared-memory architectures, on small-scale symmetric multiprocessors as well as high-performance processors. PCURE uses static modeling in hierarchical clustering [6]. The algorithm keeps per-cluster information by storing the index of the closest cluster, the one at minimum distance from it; however, maintaining this cluster information for larger indexes needs more computation time.

An existing parallel k-NN algorithm [2] uses a truncated bitonic sort, which requires about n·log2(k)/4 comparisons. Since evaluating the Euclidean metric takes a large portion of the search time, the algorithm uses cuBLAS on GPUs. cuBLAS [14] is a library developed by NVIDIA for linear-algebra calculations such as dot products, matrix manipulations and vector products.

The chameleon algorithm for clustering works in three phases:

A) Get a sparse graph for the data using the k-nearest-neighbour algorithm.

B) Partition the sparse graph using multilevel graph-partitioning algorithms, predefined in the hMETIS library [13], to get smaller clusters depending upon their similarity.

C) In the third phase, merge the smaller clusters to remove the noisy data and form clusters of significant size.

k-nearest neighbor is a supervised algorithm used to find similarity between data. It assigns a class label to data values, then finds the ones that belong to the same class and connects them with an edge.
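For illustration, here is a minimal serial sketch of building such a k-nearest-neighbor graph in Python. The names (knn_graph, k) and the brute-force search are our own simplifications; the paper itself weights edges by similarity, while this toy uses raw Euclidean distances (smaller weight = more similar):

import math

def euclidean(p, q):
    # Euclidean distance between two equal-length tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_graph(points, k):
    # Return the k-nearest-neighbor graph as an adjacency dict:
    # each point is connected to its k closest points.
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted(
            (euclidean(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph symmetric
    return graph

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    print(knn_graph(pts, 2))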


The third phase, merging of the clusters, is done on the basis of their similarity to get the final clusters. Similarity is measured with two parameters:

i) Inter-connectivity
ii) Closeness

Inter-connectivity refers to the connection between two nodes, and closeness is defined as the similarity between two nodes in two different clusters. Relative inter-connectivity [3] is the connection between two nodes of two different clusters. For two clusters $C_i$ and $C_j$, the relative inter-connectivity and relative closeness are calculated using the formulas that follow.

Relative inter-connectivity:

$$RI(C_i, C_j) = \frac{\mathrm{Absolute\_IC}(C_i, C_j)}{\dfrac{\mathrm{internal\_IC}(C_i) + \mathrm{internal\_IC}(C_j)}{2}}$$

where $\mathrm{Absolute\_IC}(C_i, C_j)$ is the sum of the weights of the edges that connect $C_i$ with $C_j$, and $\mathrm{internal\_IC}(C_i)$ is the weighted sum of the edges that partition the cluster into roughly equal parts. The internal closeness of a cluster, in turn, is the average weight of the edges in that particular cluster.

Fig 1: Chameleon Clustering


Relative closeness of two clusters is defined as

$$RC(C_i, C_j) = \frac{\bar{S}_{EC\{C_i, C_j\}}}{\dfrac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \dfrac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}$$

where $\bar{S}_{EC\{C_i, C_j\}}$ is the average weight of the edges that belong to the cut-set between the clusters $C_i$ and $C_j$, $\bar{S}_{EC_{C_i}}$ is the average weight of the edges that belong to a minimum cut of the cluster $C_i$, and $|C_i|$ is the number of nodes in the cluster $C_i$.

Two clusters are merged if they have a high value for the product of relative inter-connectivity and relative closeness. A user can assign a higher priority to a particular parameter using the exponent α, giving the similarity function (β in the pseudocode below)

$$RI(C_i, C_j)^{\alpha} \cdot RC(C_i, C_j)$$

If, for two clusters, this function holds a value greater than or equal to the threshold value for similarity, then the clusters are merged. If α > 1, more priority is given to relative inter-connectivity; if α < 1, relative closeness gets more priority; in the case α = 1, both have the same priority.

Pseudo code for the merging algorithm for final clusters:

RI – relative inter-connectivity
RC – relative closeness
α – user-defined parameter
β – RI^α × RC
th – threshold value to take the merging decision
n – number of clusters to be merged

Algorithm:
for i = 0 ... n          // i and j index clusters
    for j = i+1 ... n
        merge(i, j);
    end for              // iteration j
end for                  // iteration i

The merge function used above can be implemented as:

merge(Ci, Cj)
    calculate RI for Ci and Cj
    calculate RC for Ci and Cj
    calculate β = RI^α × RC
    if β is greater than or equal to th
        merge the two clusters Ci and Cj
    else
        do not merge the two clusters
    end if
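To make the decision concrete, here is a minimal Python sketch of the merge test above. The function names and the example numbers are ours, and the caller is assumed to supply the aggregate graph quantities (cut weight, internal inter-connectivities, average edge weights) rather than computing them from the k-NN graph:

def relative_interconnectivity(abs_ic, internal_ic_i, internal_ic_j):
    # RI(Ci, Cj) = Absolute_IC(Ci, Cj) divided by the mean internal IC.
    return abs_ic / ((internal_ic_i + internal_ic_j) / 2.0)

def relative_closeness(s_cut, s_i, s_j, n_i, n_j):
    # RC(Ci, Cj): average cut-edge weight, normalized by the
    # size-weighted average of the clusters' internal edge weights.
    total = n_i + n_j
    return s_cut / ((n_i / total) * s_i + (n_j / total) * s_j)

def should_merge(ri, rc, alpha=1.0, th=1.0):
    # Merge decision from the pseudocode: beta = RI**alpha * RC >= th.
    return (ri ** alpha) * rc >= th

# Illustrative numbers only (not taken from the paper):
ri = relative_interconnectivity(12.0, 8.0, 10.0)   # ~1.33
rc = relative_closeness(0.6, 0.5, 0.7, 40, 60)     # ~0.97
print(should_merge(ri, rc, alpha=1.5, th=1.0))     # True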
The disadvantages of chameleon include:

a) Complexity: the algorithm involves highly complex computation because it evaluates inter-connectivity and inter-closeness between all the data objects, which leads to a high computational cost on a single processor.

b) It cannot recover from database corruptions.

3. PROPOSED SOLUTION
In this paper an algorithm is proposed to reduce the time complexity of the chameleon algorithm. The exhaustive search time of the k-nearest-neighbor step is also optimized. The parallel merging algorithm, which merges datasets based on similarity, distributes the computation across threads. This makes it scalable for larger datasets and gives a highly robust computation, as one part of the dataset does not affect the others. It also improves the complexity of sorting, as heap sort is used; sorting helps to keep similar item sets together [7].

Phase 1: Parallel K-Nearest Neighbor
This algorithm reduces the time complexity of the analysis. It reduces the search time by finding similar data and grouping them together.

Given a repository, it analyses the data and then compares the training data tuple with the test tuple to obtain a similarity value; this metric is stored as the similarity between the two tuples being compared. For the parallel implementation the following procedure is followed (a runnable sketch appears after this list):

1) Sort the dataset based on the Euclidean metric:
   a) Keep the root node as the training tuple.
   b) Based on the value of the Euclidean metric associated with a tuple, put it in a max-heap.
2) Generate a max-heap with similar data as siblings.
3) A single thread out of the thread pool searches the data that will be in a single group and divides these data among the other threads in the pool.
4) If the number of data chunks formed by the similarity sort in step 3 is larger than the number of threads, keep them in a queue.
5) Whenever threads become free, dequeue the elements.
6) Each thread calculates the Euclidean distance between the data tuples allocated to it and all other data tuples.
7) Find the Euclidean distance between all data tuples and connect them to form a graph, taking the value of the Euclidean distance from the local variable of the thread to which each tuple was allocated.
8) Repeat step 7 for all the different tuples.
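A rough, runnable sketch of this procedure using only the Python standard library: heapq stands in for the max-heap (keys are negated, since heapq is a min-heap) and ThreadPoolExecutor plays the role of the thread/work pool. The chunking scheme and all names are illustrative assumptions, not the authors' implementation, and CPython's GIL limits the real speed-up of pure-Python arithmetic; the sketch shows the structure only:

import heapq
import math
from concurrent.futures import ThreadPoolExecutor

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_max_heap(train, points):
    # Steps 1-2: order tuples by their Euclidean metric to the
    # training tuple; negated keys turn heapq into a max-heap.
    heap = [(-euclidean(train, p), i) for i, p in enumerate(points)]
    heapq.heapify(heap)
    return heap

def distances_for_chunk(chunk, points):
    # Steps 6-7: one thread computes the distances from its
    # allocated tuples to all other tuples (edges of the graph).
    edges = []
    for i in chunk:
        for j in range(len(points)):
            if i < j:
                edges.append((i, j, euclidean(points[i], points[j])))
    return edges

def parallel_knn_edges(points, n_threads=4):
    # Steps 3-5: divide the data among pool threads; extra chunks
    # simply queue up inside the executor until a thread is free.
    ids = list(range(len(points)))
    chunks = [ids[c::n_threads] for c in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(distances_for_chunk, chunks,
                           [points] * len(chunks))
    return [e for chunk_edges in results for e in chunk_edges]

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
    print(len(parallel_knn_edges(pts)))  # 10 = C(5, 2) edges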
Phase 2: Partition the sparse graph
After getting the sparse graph from the parallel k-NN method, the graph is partitioned using multilevel graph partitioning algorithms [8] (using ParMETIS – Parallel Graph Partitioning and Fill-reducing Matrix Ordering [13]).


In the sparse graph, an edge cut between two points represents the similarity among them: the greater the weight of the edge cut, the more similar the two points are.

The graph is partitioned such that the edge cut is minimized, by finding the minimum edge cuts and removing them from the graph. This divides the sparse graph into smaller clusters on the basis of their similarity (a toy illustration of this objective follows below).
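The paper delegates this step to hMETIS/ParMETIS; as a self-contained stand-in, the sketch below shows the objective only — a naive greedy bisection that moves nodes while the total weight of cut edges decreases. It assumes edge weights encode similarity, uses the same adjacency-dict representation as the earlier k-NN sketch, and ignores the balance constraints that real multilevel partitioners [8][13] enforce:

def cut_weight(graph, part):
    # Total weight of edges crossing the two-way partition.
    return sum(w for u, nbrs in graph.items()
               for v, w in nbrs.items()
               if u < v and part[u] != part[v])

def greedy_bisection(graph):
    # Start from an arbitrary half/half split, then move single
    # nodes across while the edge cut keeps decreasing.
    nodes = sorted(graph)
    part = {u: (i < len(nodes) // 2) for i, u in enumerate(nodes)}
    improved = True
    while improved:
        improved = False
        for u in nodes:
            before = cut_weight(graph, part)
            part[u] = not part[u]          # tentative move
            if cut_weight(graph, part) < before:
                improved = True            # keep the move
            else:
                part[u] = not part[u]      # undo it
    return part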
Phase 3: Merging the clusters to get the final clustering
As the clusters obtained by partitioning the sparse graph are very small and might contain noisy data [9], the clusters are merged to form larger clusters on the basis of two factors: the relative inter-connectivity and the relative closeness of the clusters.

Parallel algorithm for merging the clusters:
Parallel threads are used to perform the merging of the two clusters. Parallelism is achieved using the work-pool analogy: merging two clusters on the basis of their similarity is a task, or unit of work, assigned to the work pool. The work pool is the set of these independent tasks, functions or methods, which can be executed in parallel. Threads that are not executing any process are assigned a task from the pool of work. The task is the merging of two clusters, defined by the function merge(Ci, Cj).
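In Python terms, the work-pool analogy maps directly onto a task queue consumed by an executor: each merge(Ci, Cj) candidate is one independent task, and idle threads pick up the next pending one. A hedged, self-contained sketch — the one-pass pairwise scan mirrors the paper's nested-loop pseudocode rather than a full hierarchical merge, and the toy β function is purely illustrative:

from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def merge_task(i, j, beta, th=1.0):
    # One unit of work from the pool: decide whether clusters i and j
    # should be merged (beta plays the role of RI**alpha * RC).
    return (i, j) if beta(i, j) >= th else None

def parallel_merge_pass(n_clusters, beta, n_threads=4):
    # Submit every cluster pair to the pool; pending tasks queue up
    # until a thread is free, exactly the work-pool analogy above.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(merge_task, i, j, beta)
                   for i, j in combinations(range(n_clusters), 2)]
        results = [f.result() for f in futures]
    return [r for r in results if r is not None]

if __name__ == "__main__":
    # Toy beta: pretend adjacent-numbered clusters are similar.
    toy_beta = lambda i, j: 1.5 if j == i + 1 else 0.2
    print(parallel_merge_pass(5, toy_beta))  # [(0,1), (1,2), (2,3), (3,4)]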
PSEUDO CODE

Algorithm:

void heapsort(array_of_nos, n)
{
    buildHp(array_of_nos, n);
    shrinkHp(array_of_nos, n);
}

void buildHp(array_of_nos, n)
{
    loop the three steps below till all nodes are checked;
    chld = i - 1;
    prnt = (chld - 1) / 2;
    make the maximum of the children the parent;
}

void shrinkHp(array_of_nos, n)
{
    // here each thread is assigned to a particular parent node
    prnt = 0;  // start from the root
    compare the left and right child and make the maximum the parent;
    take the max-heap from each thread, thereby getting each parent node,
    i.e. the nodes having a right and a left child;
    knowing the position of this set of nodes, construct the others;
}

levelorder()
{
    traverse the heap in level order, dividing the levels among threads;
    connect only siblings to form a graph;
}

Parallel K-NN clustering with heap-sorting algorithm
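The pseudocode above leaves the heap details abstract. The following is one concrete reading of it, using Python's heapq (a min-heap, so keys are negated to obtain a max-heap) and the standard array layout in which the children of node i sit at indices 2i+1 and 2i+2; the sibling-connection rule in levelorder() is our interpretation, not the authors' code:

import heapq

def build_max_heap(keys):
    # buildHp: arrange keys so every parent dominates its children
    # (heapq is a min-heap, hence the negation).
    heap = [-k for k in keys]
    heapq.heapify(heap)
    return [-k for k in heap]  # back to a max-heap array

def sibling_edges(heap):
    # levelorder: walk the implicit tree and connect only siblings
    # (children of the same parent) with an edge.
    edges = []
    for parent in range(len(heap)):
        left, right = 2 * parent + 1, 2 * parent + 2
        if right < len(heap):
            edges.append((heap[left], heap[right]))
    return edges

if __name__ == "__main__":
    h = build_max_heap([3, 9, 2, 7, 5])
    print(h)                 # [9, 7, 2, 3, 5]
    print(sibling_edges(h))  # [(7, 2), (3, 5)]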
Pseudo code of the parallel merging algorithm for final clusters:

RI – relative inter-connectivity
RC – relative closeness
α – user-defined parameter
β – RI^α × RC
th – threshold value to take the merging decision
n – number of clusters to be merged

Algorithm:
for i = 0 ... n          // i and j index clusters
    for j = i+1 ... n
        assign the task merge(i, j) to the work pool;
    end for              // iteration j
end for                  // iteration i

The merge function used above can be implemented as:

merge(Ci, Cj)
    calculate RI for Ci and Cj
    calculate RC for Ci and Cj
    calculate β = RI^α × RC
    if β is greater than or equal to th
        merge the two clusters Ci and Cj
    else
        do not merge the two clusters
    end if

4. EXPERIMENTAL WORK AND RESEARCH
An advantage of the chameleon algorithm is that it can easily be used to cluster two-dimensional as well as three-dimensional data sets. For parallel chameleon, experimental work was carried out and the results show improved performance compared to the serial chameleon. This paper compares the parallel implementation with the serial chameleon algorithm on three different data sets of 8,000, 10,000 and 20,000 data points. Parallel chameleon produces the same clustering as the serial implementation, but it provides better performance on processors with a multi-core architecture. Parallel chameleon was analyzed on two different processors, the Intel Core i5 and the Intel Core i7. The algorithm performed the clustering by distributing the points equally among threads so as to balance the work among them.
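For reference, the speed-up values reported below follow the standard definition (the timing numbers in the example are illustrative, not measurements from the paper):

$$\text{speedup} = \frac{T_{\text{serial}}}{T_{\text{parallel}}}$$

For example, a run taking $T_{\text{serial}} = 10\,\text{s}$ serially and $T_{\text{parallel}} = 2\,\text{s}$ in parallel has a speed-up of $5.0\times$.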


Fig 2: Work pool analogy for parallel chameleon.

[Figure: elapsed time across simultaneously utilized CPUs 1–4]
Fig 3: Simultaneously utilized CPUs.

Fig 4: Data set with 8000 data points (C1).

Fig 5: Data set with 10000 data points (C2).

The results obtained by each thread were combined to get the whole clustering output of the given data points, as shown in the figures. Large sets of data points were given as input to check the scalability of the algorithm.


Fig 6: Data set with 20000 data points (C3).

Table 1: Summary of performance analysis of parallel chameleon.

Processor       Threads   Data points   Speed-up vs. single-core processor
Intel Core i5   4         8000          2.0
Intel Core i7   8         8000          4.5
Intel Core i5   4         10000         2.6
Intel Core i7   8         10000         4.8
Intel Core i5   4         20000         3.1
Intel Core i7   8         20000         5.5

Fig 7: Comparative analysis on different processors for the three data sets.

For the set C1 with 8,000 data points we get a speed-up of approximately 4.46x on the multi-core processors analyzed, as compared to a single-core processor such as the Intel Pentium IV. For the set C2 with 10,000 data points we get a speed-up of approximately 2.6x on the Core i5 processor (with 4 threads), while on the Core i7 processor (with 8 threads) the speed-up is near 4.8x. Similarly, for the set C3 with 20,000 data points we get a speed-up of approximately 2.8x on the Core i5 processor (with 4 threads), while on the Core i7 processor (with 8 threads) the speed-up is near 5.0x, again as compared to a single-core processor such as the Intel Pentium IV.

5. CONCLUSIONS AND FUTURE WORKS
The parallel algorithm helps compute the data with a low time complexity, independent of the size of the dataset. It works on symmetric and asymmetric multiprocessors. As OpenMP handles asymmetric loop-level nested parallelism, the algorithm gives a performance gain on both small-scale and large-scale SMPs. The partitioning of data is thus highly reliable and parallel, with the support of the built-in library. The task construct also helps divide a set of computations among threads, which improves the performance of the functions in the program.

Future work includes a tool to perform dynamic modelling of hierarchical clustering in parallel. The algorithm will be made thread-independent, so it will be viable to run across platforms, and it will be designed to work on both parallel and distributed systems.

6. ACKNOWLEDGEMENTS
Our sincere thanks to all the faculty members, acquaintances and friends who helped us bring this work across.

7. REFERENCES
[1] Hadjidoukas, P. E. & Amsaleg, L. Parallelization of a Hierarchical Data Clustering Algorithm Using OpenMP. In Proc. of the 2nd International Workshop on OpenMP (IWOMP '06), 2006.
[2] Garcia, V.; Debreuve, E. & Barlaud, M. Fast k-nearest neighbor search using GPU. Computer Vision and Pattern Recognition Workshops (CVPRW '08), IEEE Computer Society Conference on, 2008, 1-6.
[3] Karypis, G.; Han, E.-H. & Kumar, V. Chameleon: Hierarchical Clustering Using Dynamic Modeling. Computer, IEEE Computer Society Press, 1999, 32, 68-75.
[4] Sismanis, N.; Pitsianis, N. & Sun, X. Parallel search of k-nearest neighbors with synchronous operations. High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, 2012, 1-6.
[5] Xu, R. & Wunsch, D. Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 2005, 16, 645-678.
[6] Maitrey, S.; Jha, C. K.; Gupta, R. & Singh, J. Enhancement of CURE Clustering Technique in Data Mining. IJCA Proceedings on Development of Reliable Information Systems, Techniques and Related Issues (DRISTI 2012), 2012, 7-11.
[7] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
[8] Karypis, G. & Kumar, V. Parallel Multilevel Graph Partitioning. Proceedings of the 10th International Parallel Processing Symposium, IEEE Computer Society, 1996, 314-319.


[9] Chamberlain, B. L. Graph Partitioning Algorithms for Distributing Workloads of Parallel Computations (generals exam). University of Washington Technical Report UW-CSE-98-10-03, October 1998.
[10] Foti, D.; Lipari, D.; Pizzuti, C. & Talia, D. Scalable Parallel Clustering for Data Mining on Multicomputers. Proceedings of the 15th IPDPS 2000 Workshops on Parallel and Distributed Processing, Springer-Verlag, 2000, 390-398.
[11] K. P. Soman, Shyam Diwakar, V. Ajay, Insight into Data Mining: Theory and Practice, PHI Learning Pvt Ltd, 2006.
[12] Xu, X.; Jäger, J. & Kriegel, H.-P. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Min. Knowl. Discov., Kluwer Academic Publishers, 1999, 3, 263-290.
[13] George Karypis and Vipin Kumar. hMETIS: A Hypergraph Partitioning Package, Version 1.5.3. Army HPC Research Center, November 22, 1998.
[14] https://developer.nvidia.com/cublas
[15] Guha, S.; Rastogi, R. & Shim, K. CURE: an efficient clustering algorithm for large databases. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, ACM, 1998.
