Chapter 2 (01-09-2019)
To cluster the increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message-passing model. The proposed algorithm exploits the inherent data parallelism of the k-means algorithm.
2 Distributed clustering:
Distributed clustering has been addressed in the Distributed Data Mining (DDM) community; for a detailed survey up through 2003, see [10]. Most relevant to this chapter are the following two parallel implementations of K-means clustering. Dhillon and Modha [11] divide the data set into p same-sized blocks; on each iteration, each of the p processors updates its current centroids based on its block, and the processors then broadcast their centroids and cluster counts. Once a processor has received the centroids from all the other processors, it forms the global centroids by weighted averaging, and each processor proceeds to the next iteration. Forman and Zhang [12] take a similar approach, but extend it to K-harmonic means. Note that the methods of [11] and [12] both start by partitioning and then distributing a centralized data set over many sites. This differs from the setting we consider: the data are never located in one central repository but are inherently distributed, and the network is of modest size and not dynamic. We do, however, directly employ their idea of exchanging centroids and updating them by weighted averaging.
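To make the centroid exchange concrete, the following minimal single-process simulation in Python/NumPy sketches one iteration of the scheme of [11]; the function names (`local_stats`, `parallel_kmeans_step`) are hypothetical, and the loop over blocks stands in for the actual message passing (e.g., an all-reduce).

```python
import numpy as np

def local_stats(block, centroids):
    """One processor's work on its data block: assign points to the
    nearest centroid and return per-cluster sums and counts."""
    k, d = centroids.shape
    # squared Euclidean distance from every point to every centroid
    dists = ((block[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for j in range(k):
        members = block[labels == j]
        if len(members) > 0:
            sums[j] = members.sum(axis=0)
            counts[j] = len(members)
    return sums, counts

def parallel_kmeans_step(blocks, centroids):
    """Simulate one message-passing iteration: every 'processor' broadcasts
    its (sums, counts); the global centroids are the weighted average."""
    k, d = centroids.shape
    global_sums = np.zeros((k, d))
    global_counts = np.zeros(k)
    for block in blocks:                  # in MPI this loop would be an all-reduce
        sums, counts = local_stats(block, centroids)
        global_sums += sums
        global_counts += counts
    new_centroids = centroids.copy()
    nonempty = global_counts > 0
    new_centroids[nonempty] = global_sums[nonempty] / global_counts[nonempty, None]
    return new_centroids

# toy usage: 4 "processors", 2 clusters in 2-D
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
blocks = np.array_split(rng.permutation(data), 4)
centroids = data[rng.choice(len(data), 2, replace=False)]
for _ in range(10):
    centroids = parallel_kmeans_step(blocks, centroids)
print(centroids)
```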
However, a distributed clustering algorithm must address several challenges:
• Quality of the final result: In distributed systems, data are clustered locally at each node, and the local clusters of all nodes are then aggregated to build a global data model. The quality of this global model must be equal or comparable to that of a model derived by a centralized process.
• Communication complexity: It is desirable to develop methods with low communication complexity, so that distributed data analysis can be performed with minimal communication overhead.
• Local data privacy: In some cases, where local data are sensitive and cannot easily be shared, it is desirable to preserve some level of local data confidentiality while building the global model.
2.1 Communication Model:
Hierarchical clustering can proceed in one of two ways: either starting with singleton clusters (one point each) and recursively merging the two or more most similar clusters into one "parent" cluster until the termination criterion is reached (e.g., k clusters have been built), or starting with one cluster of all objects and recursively splitting each cluster until the termination criterion is reached. The first strategy is known as agglomerative (bottom-up) clustering, the second as divisive (top-down) clustering.
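As an illustration of the first (agglomerative, bottom-up) strategy only, the short sketch below uses SciPy's hierarchical clustering routines to merge singleton clusters until k clusters remain; it is not part of the proposed method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Bottom-up (agglomerative) clustering: start from singletons and
# repeatedly merge the two closest clusters (Ward's minimum-variance rule).
Z = linkage(X, method="ward")

# Stop merging once the termination criterion "k clusters built" is reached.
k = 2
labels = fcluster(Z, t=k, criterion="maxclust")
print(np.bincount(labels)[1:])   # sizes of the k clusters
```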
K-means clustering is a well-known and well-studied exploratory data analysis technique. The standard version assumes that all data are available at a single location. However, if data sources are distributed over a large-scale Peer-to-Peer (P2P) network, collecting the data at a central location before clustering is neither an attractive nor a practical option. There exist many important and exciting applications of K-means clustering on data distributed over a P2P network, and for these, a highly scalable, communication-efficient, distributed algorithm is desired. Two such algorithms for K-means clustering on data distributed over a P2P network are proposed in [5]. The first algorithm takes a completely decentralized approach, where peers (nodes) only synchronize with their immediate topological neighbors in the underlying communication network. This algorithm can easily adapt to dynamic P2P networks in which existing nodes drop out, new nodes join during the execution of the algorithm, and the data in the network change. However, the algorithm is difficult to analyze theoretically, and its performance guarantees are verified experimentally. The experiments show that the algorithm converges quickly and that its accuracy is quite good and resilient to changes in the network topology (see Figure 2).
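The simulated sketch below gives a rough idea of such neighbor-only synchronization; it is not the exact protocol of [5], and the names (`local_update`, `p2p_kmeans_round`) are hypothetical. In each round every peer performs a local k-means update on its own data and then pools per-cluster statistics with its immediate neighbors in the network graph.

```python
import numpy as np

def local_update(data, centroids):
    """One peer's local k-means step: assignment + per-cluster sums/counts."""
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    k, d = centroids.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for j in range(k):
        pts = data[labels == j]
        if len(pts):
            sums[j], counts[j] = pts.sum(axis=0), len(pts)
    return sums, counts

def p2p_kmeans_round(peer_data, neighbors, centroids_per_peer):
    """One synchronization round: each peer pools statistics only with its
    immediate topological neighbors and re-estimates its own centroids."""
    stats = [local_update(d, c) for d, c in zip(peer_data, centroids_per_peer)]
    new = []
    for i in range(len(peer_data)):
        group = [i] + neighbors[i]                 # peer i and its neighbors
        sums = sum(stats[g][0] for g in group)
        counts = sum(stats[g][1] for g in group)
        c = centroids_per_peer[i].copy()
        mask = counts > 0
        c[mask] = sums[mask] / counts[mask, None]
        new.append(c)
    return new

# toy ring of 4 peers, each holding a slice of the data
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
peer_data = np.array_split(rng.permutation(data), 4)
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
init = data[rng.choice(len(data), 2, replace=False)]
centroids = [init.copy() for _ in range(4)]
for _ in range(15):
    centroids = p2p_kmeans_round(peer_data, neighbors, centroids)
print(centroids[0])
```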
3 Distributed algorithm:
Geometric metrics. These metrics define a cluster center for each cluster and use these cluster centers to determine the distances between clusters. Examples include:
• Centroid: The cluster center is the centroid of the points in the cluster. The Euclidean distance between the cluster centers is used.
• Median: The cluster center is the (unweighted) average of the centers of the two clusters agglomerated to form it. The Euclidean distance between the cluster centers is used.
• Minimum variance: The cluster center is the centroid of the points in the cluster. The distance between two clusters is the amount of increase in the sum of squared distances from each point to the center of its cluster that would be caused by agglomerating the clusters.
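To make these definitions concrete, the small NumPy sketch below (hypothetical helper names) computes the centroid distance, the minimum-variance (Ward) merge cost, and the median-style center for two clusters of points.

```python
import numpy as np

def centroid_distance(A, B):
    """Centroid metric: Euclidean distance between the two cluster centroids."""
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def ward_increase(A, B):
    """Minimum-variance metric: increase in the total within-cluster sum of
    squared distances caused by agglomerating clusters A and B."""
    def sse(C):
        return ((C - C.mean(axis=0)) ** 2).sum()
    merged = np.vstack([A, B])
    return sse(merged) - sse(A) - sse(B)

def median_center(center_a, center_b):
    """Median metric: the new center is the unweighted average of the two
    merged cluster centers (rather than the centroid of all their points)."""
    return (center_a + center_b) / 2.0

rng = np.random.default_rng(3)
A = rng.normal(0, 1, (30, 2))
B = rng.normal(4, 1, (40, 2))
print(centroid_distance(A, B), ward_increase(A, B))
print(median_center(A.mean(axis=0), B.mean(axis=0)))
```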
The EM algorithm estimates the parameters of a model iteratively, starting from some initial
guess. Each iteration consists of an Expectation (E) step, which finds the distribution for the
unobserved variables, given the known values for the observed variables and the current estimate
of the parameters, and a Maximization (M) step, which re-estimates the parameters to be those
with maximum likelihood, under the assumption that the distribution found in the E step is
correct. It can be shown that each such iteration improves the true likelihood, or leaves it
unchanged (if a local maximum has already been reached, or in uncommon cases, before then).
The M step of the algorithm may be only partially implemented, with the new estimate for the
parameters improving the likelihood given the distribution found in the E step, but not
necessarily maximizing it. Such a partial M step always results in the true likelihood improving
as well. Dempster et al. refer to such variants as "generalized EM (GEM)" algorithms. A subclass of GEM algorithms of wide applicability, the "Expectation Conditional Maximization (ECM)" algorithms, has been developed by Meng and Rubin (1992), and further generalized by Meng and van Dyk (1997). In many cases, partial implementation of the E step is also natural.
The unobserved variables are commonly independent, and influence the likelihood of the
parameters only through simple sufficient statistics. If these statistics can be updated
incrementally when the distribution for one of the variables is re-calculated, it makes sense to
immediately re-estimate the parameters before performing the E step for the next unobserved
variable, as this utilizes the new information immediately, speeding convergence. An
incremental algorithm along these general lines was investigated by Nowlan (1991). However,
such incremental variants of the EM algorithm have not previously received any formal
justification. We present here a view of the EM algorithm in which it is seen as maximizing a
joint function of the parameters and of the distribution over the unobserved variables that is
analogous to the "free energy" function used in statistical physics, and which can also be viewed
in terms of a Kullback-Leibler divergence. The E step maximizes this function with respect to the distribution over unobserved variables; the M step maximizes it with respect to the parameters. Csiszar and Tusnady (1984) and Hathaway (1986) have also viewed EM in this light. In this paper, we use this viewpoint to justify variants of the EM algorithm in which the joint maximization of this function is performed by other means, a process which must also lead to a maximum of the true likelihood. In particular, we can now justify incremental versions of the algorithm, which in effect employ a partial E step, as well as "sparse" versions, in which most iterations update only the part of the distribution for an unobserved variable pertaining to its most likely values, and "winner-take-all" versions, in which, for early iterations, the distributions over unobserved variables are restricted to those in which a single value has probability one. We include a brief demonstration showing that the use of an incremental algorithm speeds convergence for a simple mixture estimation problem.
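To fix ideas, the sketch below implements standard (batch) EM for a two-component one-dimensional Gaussian mixture; it is only a minimal illustration of the E and M steps described above (the function name `em_gmm_1d` is hypothetical), not the incremental or sparse variants.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Batch EM for a 2-component 1-D Gaussian mixture."""
    # initial guess for the parameters
    pi = 0.5
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E step: posterior responsibility of component 1 for each point,
        # given the current parameter estimates
        p0 = (1 - pi) * np.exp(-(x - mu[0]) ** 2 / (2 * var[0])) / np.sqrt(2 * np.pi * var[0])
        p1 = pi * np.exp(-(x - mu[1]) ** 2 / (2 * var[1])) / np.sqrt(2 * np.pi * var[1])
        r = p1 / (p0 + p1)
        # M step: re-estimate the parameters by maximum likelihood under
        # the distribution found in the E step
        pi = r.mean()
        mu = np.array([((1 - r) * x).sum() / (1 - r).sum(),
                       (r * x).sum() / r.sum()])
        var = np.array([((1 - r) * (x - mu[0]) ** 2).sum() / (1 - r).sum(),
                        (r * (x - mu[1]) ** 2).sum() / r.sum()])
    return pi, mu, var

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x))
```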
In this section, the distributed k-means algorithm will be proposed. To this end, we first briefly introduce the centralized k-means algorithm, which is well established in the existing literature.
3.3.2.1 Introduction to the Centralized k-Means Algorithm
Given a set of observations {x1, x2, . . . , xn}, where each observation is a d-dimensional real-
valued vector, the k-means algorithm [7] aims to partition the n observations into k(≤ n) sets S =
{S1, S2, . . . , Sk} so as to minimize the within-cluster sum of squares (WCSS) function. In other
words, its objective is to find
\[
\operatorname*{arg\,min}_{S} \; \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - c_j \rVert^2 ,
\]
where c_j is the centroid (mean) of the points in S_j.
The algorithm alternates between an assignment step, in which each observation is assigned to the cluster whose centroid is nearest, and an update step, in which each centroid is recomputed as the arithmetic mean of the observations assigned to it. Since the arithmetic mean is a least-squares estimator, the update step also minimizes the WCSS function. The algorithm converges when the centroids no longer change.
The k-means algorithm converges to a (local) optimum, but there is no guarantee that it reaches the global optimum [7]. Since, for a given k, the result of the above clustering algorithm depends solely on the initial centroids, a common practice is to execute the algorithm several times with different initial centroids and then select the best solution.
The computational complexity of the k-means algorithm is O(nkdM) [7], where n is the number of d-dimensional observation vectors, k is the number of clusters, and M is the number of iterations needed to reach convergence.
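A minimal centralized k-means sketch in Python/NumPy is given below; the function name `kmeans` and its parameters are hypothetical, and the restart loop over `n_init` random initializations reflects the common practice mentioned above. Each iteration costs O(nkd) time, giving O(nkdM) overall.

```python
import numpy as np

def kmeans(X, k, n_iter=100, n_init=10, seed=0):
    """Centralized k-means with several random restarts; returns the
    centroids and labels with the lowest WCSS."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            # assignment step: nearest centroid for every observation
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # update step: each centroid becomes the mean of its points
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)])
            if np.allclose(new_centroids, centroids):  # centroids no longer change
                centroids = new_centroids
                break
            centroids = new_centroids
        # final assignment and WCSS for this initialization
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        wcss = ((X - centroids[labels]) ** 2).sum()
        if best is None or wcss < best[0]:
            best = (wcss, centroids, labels)
    return best[1], best[2]

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(6, 1, (150, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```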
In this chapter, we reviewed several different distributed algorithms and discussed the quality of their final results, their communication complexity, and their treatment of local data confidentiality. In Chapter 3, we propose a distributed algorithm that addresses these problems.
References
[1] J. Qin, W. Fu, H. Gao, and W. X. Zheng, "Distributed k-Means Algorithm and Fuzzy c-Means," CoRR, vol. abs/1205.2282, April 2012.
[5] S. Datta, C. Giannella, and H. Kargupta, "Approximate Distributed K-Means Clustering over a Peer-to-Peer Network," IEEE Trans. on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1372–1388, October 2009.
[12] G. Forman and B. Zhang, "Distributed Data Clustering Can Be Efficient and Exact," SIGKDD Explorations, vol. 2, no. 2, pp. 34–38, 2000.
[13] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. R. Stat. Soc. B, vol. 39, pp. 1–38, 1977.
[14] T. Moon, "The Expectation-Maximization Algorithm," IEEE Signal Process. Mag., vol. 13, pp. 47–60, 1996.
[15] W. Meng, W. Xiao, and L. Xie, "An Efficient EM Algorithm for Multi-Source Localization in Wireless Sensor Networks," IEEE Trans. Instrum. Meas., vol. 60, pp. 1017–1027, 2011.
[16] R. M. Neal and G. E. Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants," NATO ASI Series D, Behavioural and Social Sciences, vol. 89, pp. 355–370, 1998.
[17] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 169–194, 1996.
[18] C. Olson, "Parallel Algorithms for Hierarchical Clustering," Parallel Computing, vol. 21, pp. 1313–1325, 1995.