Chapter 2 (01-09-2019)


1 Introduction :

To cluster the increasingly massive data sets that are common today in data and text mining, we
propose a parallel implementation of the k-means clustering algorithm based on the message-passing
model. The proposed algorithm exploits the inherent data parallelism of the k-means algorithm.

2 Distributed clustering :

Distributed clustering has been addressed in the Distributed Data Mining (DDM) community; for
a detailed survey up through 2003, see [10]. Most relevant to this work are the following two
parallel implementations of k-means clustering. Dhillon and Modha [11] divide the data set into
p same-sized blocks; on each iteration, each of the p processors updates its current centroids
based on its own block. The processors then broadcast their centroids and cluster counts. Once a
processor has received the centroids from all other processors, it forms the global centroids by
weighted averaging and proceeds to the next iteration. Forman and Zhang [12] take a similar
approach but extend it to k-harmonic means. Note that the methods of [11] and [12] both start by
partitioning and then distributing a centralized data set over many sites. This differs from the
setting we consider: the data are never located in one central repository but are inherently
distributed, and the network is of modest size and not dynamic. However, we directly employ
their idea of exchanging centroids and updating them by weighted averaging.
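The following sketch illustrates the local-update and weighted-averaging steps described above. In an actual message-passing implementation the per-processor statistics would be exchanged with a broadcast or all-reduce; here they are simply passed as Python lists, and the function names and the use of NumPy are illustrative assumptions rather than code from [11].

```python
import numpy as np

def local_update(block, centroids):
    """One local k-means pass over this processor's block.

    Returns, for each cluster, the sum of the assigned points and their
    count (the sufficient statistics needed for the global merge).
    """
    k, d = centroids.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for x in block:
        j = np.argmin(np.linalg.norm(centroids - x, axis=1))
        sums[j] += x
        counts[j] += 1
    return sums, counts

def merge_centroids(all_sums, all_counts, old_centroids):
    """Weighted averaging of the per-processor statistics.

    all_sums / all_counts: lists with one entry per processor, as if
    received after the broadcast step.
    """
    total_sums = sum(all_sums)
    total_counts = sum(all_counts)
    new_centroids = old_centroids.copy()
    nonempty = total_counts > 0
    new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty][:, None]
    return new_centroids
```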

However, a distributed clustering algorithm must address several challenges, which can be
summarized as follows:

• Quality of the final result: in distributed systems, data are clustered locally at each node,
and the local clusters of all nodes are then aggregated to build a global data model. The
quality of this global model must be equal or comparable to that of a model derived by a
centralized process.
• Communication complexity: it is desirable to develop methods with low communication
complexity, so that the distributed data analysis incurs a minimum of communication
overhead.
• Local data privacy: in some cases, where local data are sensitive and cannot easily be
shared, it is desirable to preserve some level of local data confidentiality while building
the global model.
2.1 Communication Model :

2.1.1 Hierarchical methods :

Hierarchical clustering builds a cluster hierarchy (a tree of clusters).

2.1.1.1 Agglomerative (bottom-up) techniques :

Starting with one-point (singleton) clusters and recursively merging the two (or more) most
similar clusters into one "parent" cluster until the termination criterion is reached (e.g., k
clusters have been built).

2.1.1.2 Divisive (top-down) techniques :

Starting with one cluster of all objects and recursively splitting each cluster until the termination
criterion is reached.
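As a brief illustration of the agglomerative scheme with a termination criterion of k clusters, the following sketch uses SciPy's hierarchical clustering routines; the toy data and the choice of single linkage are illustrative assumptions only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two loose groups of points (illustrative only).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
                    rng.normal(5.0, 0.5, size=(10, 2))])

# Bottom-up merging: each point starts as a singleton cluster and the
# two closest clusters (single-link distance) are merged repeatedly.
tree = linkage(points, method="single")

# Stop the hierarchy once k clusters have been built.
k = 2
labels = fcluster(tree, t=k, criterion="maxclust")
print(labels)
```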

The following figure shows the hierarchical communication model.

Figure 01: Hierarchical communication model

2.1.2 Peer to Peer (P2P) methods [5]:

K-means clustering is a well-known and well-studied exploratory data analysis technique. The
standard version assumes that all data are available at a single location. However, if the data
sources are distributed over a large-scale Peer-to-Peer (P2P) network, collecting the data at a
central location before clustering is neither an attractive nor a practical option. There exist many
important and exciting applications of k-means clustering on data distributed over a P2P
network, and for these, a highly scalable, communication-efficient, distributed algorithm is
desired. The work in [5] proposes two such algorithms for k-means clustering on data distributed
over a P2P network. The first algorithm takes a completely decentralized approach, where peers
(nodes) synchronize only with their immediate topological neighbors in the underlying
communication network. This algorithm can easily adapt to dynamic P2P networks in which
existing nodes drop out, new nodes join in during the execution of the algorithm, and the data in
the network change. However, the algorithm is difficult to analyze, and its performance
guarantees are verified experimentally. The experiments in [5] show that the algorithm converges
quickly and that its accuracy is quite good and resilient to changes in the network topology (see
Figure 02).

Figure 02 : Peer to Peer (P2P) communication model

2.1.3 Circular method : ??????

2.2 Data type exchanged : ???

3 Distributed algorithms :

3.1 Hierarchical methods [18] :

Clustering of multidimensional data is required in many fields. One popular method of
performing such clustering is hierarchical clustering. This method starts with a set of distinct
points, each of which is considered a separate cluster. The two clusters that are closest according
to some metric are agglomerated. This is repeated until all of the points belong to one
hierarchically constructed cluster. The final hierarchical cluster structure is called a dendrogram,
which is simply a tree that shows which clusters were agglomerated at each step. A dendrogram
can easily be broken at selected links to obtain clusters of desired cardinality or radius. This
representation is easy to generate and store, so the focus here is on determining which clusters to
merge at each step. Some metric must be used to determine the distance between pairs of
clusters. For individual points, the Euclidean distance is typically used. For clusters of points,
there are a number of metrics for determining the distances between clusters, and they fall into
two general classes: graph metrics and geometric metrics.

Graph metrics. Consider a completely connected graph whose vertices are the points we wish to
cluster and whose edges have a cost equal to the Euclidean distance between the points. The
graph metrics determine inter-cluster distances according to the costs of the edges between the
points in the two clusters. The common graph metrics are (see the sketch after this list):

• Single link: the distance between two clusters is given by the minimum-cost edge between
points in the two clusters.
• Average link: the distance between two clusters is the average of all of the edge costs
between points in the two clusters.
• Complete link: the distance between two clusters is given by the maximum-cost edge
between points in the two clusters.
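The three graph metrics follow directly from their definitions; the sketch below evaluates each inter-cluster distance from the pairwise Euclidean edge costs between two clusters (the function names are illustrative).

```python
import numpy as np

def edge_costs(cluster_a, cluster_b):
    """All pairwise Euclidean distances between points of two clusters."""
    a = np.asarray(cluster_a)[:, None, :]   # shape (|A|, 1, d)
    b = np.asarray(cluster_b)[None, :, :]   # shape (1, |B|, d)
    return np.linalg.norm(a - b, axis=2)    # shape (|A|, |B|)

def single_link(cluster_a, cluster_b):
    """Minimum-cost edge between the two clusters."""
    return edge_costs(cluster_a, cluster_b).min()

def average_link(cluster_a, cluster_b):
    """Average cost of all edges between the two clusters."""
    return edge_costs(cluster_a, cluster_b).mean()

def complete_link(cluster_a, cluster_b):
    """Maximum-cost edge between the two clusters."""
    return edge_costs(cluster_a, cluster_b).max()
```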

Geometric metrics. These metrics define a cluster center for each cluster and use these centers to
determine the distances between clusters. Examples include (a sketch of the minimum-variance
cost follows this list):

• Centroid: the cluster center is the centroid of the points in the cluster; the Euclidean
distance between the cluster centers is used.
• Median: the cluster center is the (unweighted) average of the centers of the two clusters
agglomerated to form it; the Euclidean distance between the cluster centers is used.
• Minimum variance: the cluster center is the centroid of the points in the cluster; the
distance between two clusters is the increase in the sum of squared distances from each
point to the center of its cluster that would be caused by agglomerating the clusters.
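For the minimum-variance metric, the merge cost can be evaluated directly from the definition above by comparing the sum of squared distances to the centroid before and after the two clusters are agglomerated. This is a sketch of that definition, not code from [18].

```python
import numpy as np

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    points = np.asarray(points)
    return np.sum((points - points.mean(axis=0)) ** 2)

def minimum_variance_distance(cluster_a, cluster_b):
    """Increase in total within-cluster sum of squares caused by merging."""
    merged = np.vstack([cluster_a, cluster_b])
    return sse(merged) - sse(cluster_a) - sse(cluster_b)
```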

3.2 Density-Based Clustering [17]


The key idea of density-based clustering is that, for each object of a cluster, the neighborhood of
a given radius (ε) has to contain at least a minimum number of objects (MinPts), i.e., the
cardinality of the neighborhood has to exceed a threshold. The formal definitions for this notion
of clustering are briefly introduced in the following and illustrated in Figure 03.

Figure 03. Density-reachability and connectivity

Definition 1: (directly density-reachable)
Object p is directly density-reachable from object q wrt. ε and MinPts in a set of objects D if
1) p ∈ Nε(q) (Nε(q) is the subset of D contained in the ε-neighborhood of q), and
2) Card(Nε(q)) ≥ MinPts (Card(N) denotes the cardinality of the set N).
The condition Card(Nε(q)) ≥ MinPts is called the "core object condition". If this condition holds
for an object p, then we call p a "core object". Other objects can be directly density-reachable
only from core objects.
Definition 2: (density-reachable)
An object p is density-reachable from an object q wrt. ε and MinPts in the set of objects D if
there is a chain of objects p1, ..., pn with p1 = q and pn = p such that pi ∈ D and pi+1 is directly
density-reachable from pi wrt. ε and MinPts.
Density-reachability is the transitive hull of direct density-reachability. This relation is not
symmetric in general. Only core objects can be mutually density-reachable.
Definition 3: (density-connected)
Object p is density-connected to object q wrt. ε and MinPts in the set of objects D if there is an
object o ∈ D such that both p and q are density-reachable from o wrt. ε and MinPts in D.
Density-connectivity is a symmetric relation. Figure 03 illustrates these definitions on a sample
database of 2-dimensional points from a vector space. Note that the above definitions only
require a distance measure and therefore also apply to data from a metric space.
A density-based cluster is now defined as a set of density-connected objects which is maximal
wrt. density-reachability, and the noise is the set of objects not contained in any cluster.
Definition 4: (cluster and noise)
Let D be a set of objects. A cluster C wrt. ε and MinPts in D is a non-empty subset of D
satisfying the following conditions:
1) Maximality: ∀ p, q ∈ D: if p ∈ C and q is density-reachable from p wrt. ε and MinPts, then
also q ∈ C.
2) Connectivity: ∀ p, q ∈ C: p is density-connected to q wrt. ε and MinPts in D.
Every object not contained in any cluster is noise.
Note that a cluster contains not only core objects but also objects that do not satisfy the core
object condition. These objects, called "border objects" of the cluster, are nevertheless directly
density-reachable from at least one core object of the cluster (in contrast to noise objects).
The algorithm DBSCAN [17], which discovers the clusters and the noise in a database according
to the above definitions, is based on the fact that a cluster is equivalent to the set of all objects in
D which are density-reachable from an arbitrary core object of the cluster (cf. Lemmata 1 and 2
in [17]).
The retrieval of density-reachable objects is performed by iteratively collecting directly density-
reachable objects. DBSCAN checks the ε-neighborhood of each point in the database. If the
ε-neighborhood Nε(p) of a point p contains at least MinPts points, a new cluster C containing the
objects in Nε(p) is created. Then, the ε-neighborhood of every point q in C that has not yet been
processed is checked. If Nε(q) contains at least MinPts points, the neighbors of q that are not
already contained in C are added to the cluster, and their ε-neighborhoods are checked in the next
step. This procedure is repeated until no new point can be added to the current cluster C.
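A minimal sketch of this expansion procedure is given below. It follows the definitions above (ε-neighborhood query, core object condition, iterative expansion) but is not the original DBSCAN implementation; the naive O(n²) neighborhood query and the names are simplifying assumptions.

```python
import numpy as np

NOISE, UNVISITED = -1, 0

def region_query(data, p_idx, eps):
    """Indices of all points within distance eps of point p (its ε-neighborhood)."""
    dists = np.linalg.norm(data - data[p_idx], axis=1)
    return list(np.where(dists <= eps)[0])

def dbscan(data, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or NOISE (-1)."""
    data = np.asarray(data)
    labels = np.full(len(data), UNVISITED)
    cluster_id = 0
    for p in range(len(data)):
        if labels[p] != UNVISITED:
            continue
        neighbors = region_query(data, p, eps)
        if len(neighbors) < min_pts:          # core object condition fails
            labels[p] = NOISE
            continue
        cluster_id += 1
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:                          # expand the current cluster
            q = seeds.pop()
            if labels[q] == NOISE:            # border object: reachable but not core
                labels[q] = cluster_id
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(data, q, eps)
            if len(q_neighbors) >= min_pts:   # q is a core object: expand further
                seeds.extend(q_neighbors)
    return labels
```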

3.2.1 Density-Based Cluster-Ordering :


To introduce the notion of a density-based cluster-ordering, we first make the following
observation: for a constant MinPts value, density-based clusters with respect to a higher density
(i.e., a lower value for ε) are completely contained in density-connected sets with respect to a
lower density (i.e., a higher value for ε). This fact is illustrated in Figure 04, where C1 and C2 are
density-based clusters with respect to ε2 < ε1 and C is a density-based cluster with respect to ε1
that completely contains the sets C1 and C2.

Figure 04. Illustration of “nested” density-based clusters


Consequently, we could extend the DBSCAN algorithm such that several distance parameters are
processed at the same time, i.e., the density-based clusters with respect to different densities are
constructed simultaneously. To produce a consistent result, however, we would have to obey a
specific order in which objects are processed when expanding a cluster: we always have to select
an object which is density-reachable with respect to the lowest ε value, to guarantee that clusters
with respect to higher densities (i.e., smaller ε values) are finished first.
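The containment observation can also be checked empirically: clustering the same data once with a smaller ε and once with a larger ε, every cluster found at the smaller ε should lie inside a single cluster found at the larger ε. The sketch below does this with scikit-learn's DBSCAN; the data, the ε values, and MinPts are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two tight groups close to each other, plus a third group farther away.
data = np.vstack([rng.normal([0, 0], 0.2, (30, 2)),
                  rng.normal([1, 0], 0.2, (30, 2)),
                  rng.normal([6, 0], 0.2, (30, 2))])

labels_small_eps = DBSCAN(eps=0.3, min_samples=5).fit_predict(data)  # higher density
labels_large_eps = DBSCAN(eps=1.5, min_samples=5).fit_predict(data)  # lower density

# Each high-density cluster should fall inside a single low-density cluster.
for c in set(labels_small_eps) - {-1}:
    parents = set(labels_large_eps[labels_small_eps == c])
    print(f"small-eps cluster {c} is contained in large-eps cluster(s) {parents}")
```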

3.3 Partitioning methods :

3.3.1 Distributed method based on the EM algorithm [16]


The Expectation-Maximization (EM) algorithm finds maximum-likelihood parameter estimates
in problems where some variables are unobserved. Special cases of the algorithm date back
several decades, and its use has grown further since its generality and widespread applicability
were discussed by Dempster, Laird, and Rubin (1977). The scope of the algorithm's applications
is evident in the book by McLachlan and Krishnan (1997).

The EM algorithm estimates the parameters of a model iteratively, starting from some initial
guess. Each iteration consists of an Expectation (E) step, which finds the distribution of the
unobserved variables given the known values of the observed variables and the current estimate
of the parameters, and a Maximization (M) step, which re-estimates the parameters to be those
with maximum likelihood, under the assumption that the distribution found in the E step is
correct. It can be shown that each such iteration improves the true likelihood or leaves it
unchanged (if a local maximum has already been reached, or in uncommon cases, before then).

The M step of the algorithm may be only partially implemented, with the new estimate for the
parameters improving the likelihood given the distribution found in the E step, but not
necessarily maximizing it. Such a partial M step always results in the true likelihood improving
as well. Dempster et al. refer to such variants as "generalized EM (GEM)" algorithms. A subclass
of GEM algorithms of wide applicability, the "Expectation Conditional Maximization (ECM)"
algorithms, was developed by Meng and Rubin (1992) and further generalized by Meng and van
Dyk (1997). In many cases, partial implementation of the E step is also natural. The unobserved
variables are commonly independent and influence the likelihood of the parameters only through
simple sufficient statistics. If these statistics can be updated incrementally when the distribution
for one of the variables is re-calculated, it makes sense to immediately re-estimate the parameters
before performing the E step for the next unobserved variable, as this uses the new information
immediately and speeds convergence. An incremental algorithm along these general lines was
investigated by Nowlan (1991), but such incremental variants of the EM algorithm had not
previously received any formal justification.

Neal and Hinton [16] present a view of the EM algorithm in which it is seen as maximizing a
joint function of the parameters and of the distribution over the unobserved variables, a function
analogous to the "free energy" used in statistical physics, which can also be viewed in terms of a
Kullback-Leibler divergence. The E step maximizes this function with respect to the distribution
over the unobserved variables; the M step maximizes it with respect to the parameters. Csiszar
and Tusnady (1984) and Hathaway (1986) have also viewed EM in this light. This viewpoint
justifies variants of the EM algorithm in which the joint maximization of this function is
performed by other means, a process that must also lead to a maximum of the true likelihood. In
particular, it justifies incremental versions of the algorithm, which in effect employ a partial E
step, as well as "sparse" versions, in which most iterations update only the part of the distribution
for an unobserved variable pertaining to its most likely values, and "winner-take-all" versions, in
which, for early iterations, the distributions over unobserved variables are restricted to those in
which a single value has probability one. Neal and Hinton also include a brief demonstration
showing that the use of an incremental algorithm speeds convergence for a simple mixture
estimation problem.
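As a concrete instance of the E step / M step alternation described above, the sketch below runs EM for a one-dimensional mixture of Gaussians; the model choice, the initialization, and the variable names are illustrative assumptions, not code from [16].

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture: alternate E and M steps."""
    rng = np.random.default_rng(seed)
    weights = np.full(k, 1.0 / k)
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, np.var(x))

    for _ in range(n_iter):
        # E step: distribution over the unobserved component labels,
        # given the data and the current parameter estimates.
        resp = np.stack([w * gaussian_pdf(x, m, v)
                         for w, m, v in zip(weights, means, variances)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: re-estimate the parameters by maximum likelihood,
        # assuming the E-step distribution is correct.
        n_k = resp.sum(axis=0)
        weights = n_k / len(x)
        means = (resp * x[:, None]).sum(axis=0) / n_k
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k

    return weights, means, variances

# Example: data drawn from two well-separated Gaussians.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
print(em_gmm_1d(data))
```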

3.3.2 Distributed method based on the K-means algorithm [1]:

In this section, the distributed k-means algorithm of [1] is presented. For this purpose, we first
briefly introduce the centralized k-means algorithm, which is well established in the existing
literature.
3.3.2.1 Introduction to the Centralized k-Means Algorithm
Given a set of observations {x1, x2, . . . , xn}, where each observation is a d-dimensional real-
valued vector, the k-means algorithm [7] aims to partition the n observations into k(≤ n) sets S =
{S1, S2, . . . , Sk} so as to minimize the within-cluster sum of squares (WCSS) function. In other
words, its objective is to find
argmin_S ∑_{j=1}^{k} ∑_{xi ∈ Sj} ‖xi − cj‖²

where cj is the representative of cluster j, generally the centroid of the points in Sj.


The algorithm uses an iterative refinement technique. Given an initial set of k centroids c1(1), . . .
, ck(1), the algorithm proceeds by alternating between an assignment step and an update step as
follows. During the assignment step, each observation is assigned to the cluster with the nearest
centroid, that is

Si(T) = { xp : ‖xp − ci(T)‖² ≤ ‖xp − cj(T)‖² for all 1 ≤ j ≤ k },

where each xp is assigned to exactly one cluster, even if it could be assigned to two or more of
them. This step minimizes the WCSS function.
During the update step, the centroid ci(T + 1) of the observations in the new cluster Si(T) is
computed as

ci(T + 1) = (1 / |Si(T)|) ∑_{xj ∈ Si(T)} xj.

Since the arithmetic mean is a least-squares estimator, this step also minimizes the WCSS
function.
The algorithm converges when the centroids no longer change.
The k-means algorithm can converge to a (local) optimum, while there is no guarantee for it to
converge to the global optimum [7]. Since for a given k, the result of the above clustering
algorithm depends solely on the initial centroids, a common practice in clustering the sensor
observations is to execute the algorithm several times for different initial centroids and then
select the best solution.
The computational complexity of the k-means algorithm is O(nkdM) [7], where n is the number
of the d-dimensional vectors, k is the number of clusters, and M is the number of iterations
before reaching convergence.
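A minimal sketch of this iterative refinement (assignment step, update step, convergence check) is shown below; it follows the description above, while the random initialization and the convergence tolerance are illustrative choices.

```python
import numpy as np

def kmeans(x, k, max_iter=100, seed=0):
    """Lloyd-style k-means: alternate assignment and update steps."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]

    for _ in range(max_iter):
        # Assignment step: each observation goes to the nearest centroid.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: each centroid becomes the mean of its observations.
        new_centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Converged when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            centroids = new_centroids
            break
        centroids = new_centroids

    # Final assignment and within-cluster sum of squares (WCSS).
    labels = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    wcss = float(np.sum((x - centroids[labels]) ** 2))
    return centroids, labels, wcss
```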

3.3.2.2 Choosing Initial Centroids Using the Distributed k-Means++ Algorithm :


The choice of the initial centroids is key to making the k-means algorithm work well. For a given
k, the partition produced by the k-means algorithm depends only on the initial centroids, and the
same holds for the distributed version. Thus, an effective distributed method for choosing the
initial centroids is important. Here we describe a distributed implementation of the k-means++
algorithm [24], a centralized algorithm for finding the initial centroids of the k-means algorithm.
It is noted that k-means++ generally outperforms k-means in terms of both accuracy and speed
[24].
Let D(x) denote the distance from observation x to the closest centroid that has already been
chosen. The k-means++ algorithm [24] is executed as follows (a sketch is given after this list).
1) Choose an observation uniformly at random from {x1, x2, . . . , xn} as the first centroid c1.
2) Take a new centroid cj, choosing x from the observations with probability D(x)² / ∑_{xi} D(xi)².
3) Repeat step 2) until k centroids have been chosen altogether.
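The sketch below implements the three centralized k-means++ steps; the distributed realization of [1] (Algorithm 1) is not reproduced here, and the function name and sampling details are illustrative.

```python
import numpy as np

def kmeans_pp_init(x, k, seed=0):
    """Centralized k-means++ seeding: choose k initial centroids."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)

    # Step 1: the first centroid is a uniformly random observation.
    centroids = [x[rng.integers(len(x))]]

    while len(centroids) < k:
        # D(x): distance from each observation to its closest chosen centroid.
        d = np.min(np.linalg.norm(x[:, None, :] - np.array(centroids)[None, :, :],
                                  axis=2), axis=1)
        # Step 2: sample the next centroid with probability proportional to D(x)^2.
        probs = d ** 2 / np.sum(d ** 2)
        centroids.append(x[rng.choice(len(x), p=probs)])

    # Step 3 is the loop condition: stop once k centroids have been chosen.
    return np.array(centroids)
```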


Consider that the n nodes are deployed in a Euclidean space of any dimension, and suppose that
each node has a limited sensing radius, which may differ from node to node, so that the
underlying topology of the wireless sensor network (WSN) is directed. Each node is endowed
with a real vector xi ∈ Rd representing its observation. Here we assume that every node has its
own unique identification (ID) and that the underlying topology of the WSN is strongly
connected.
The detailed realization of the distributed k-means++ algorithm is given as Algorithm 1 in [1]
(not reproduced here).
3.3.2.3 Distributed k-Means Algorithm :
Conclusion :

In this chapter, we have reviewed several distributed clustering algorithms and discussed the
quality of their final results, their communication complexity, and the confidentiality of local
data. In Chapter 3, we propose a distributed algorithm that addresses these issues.
References

[1] J. Qin, W. Fu, H. Gao, and W. X. Zheng, "Distributed k-Means Algorithm and Fuzzy
c-Means," CoRR, vol. abs/1205.2282, April 2012.

[5] S. Datta, C. Giannella, and H. Kargupta, "Approximate distributed k-means clustering over a
peer-to-peer network," IEEE Trans. on Knowl. and Data Eng., vol. 21, no. 10, pp. 1372–1388,
October 2009.

[10] H. Kargupta and K. Sivakumar, "Existential Pleasures of Distributed Data Mining," Data
Mining: Next Generation Challenges and Future Directions, AAAI Press, 2004.

[11] I. Dhillon and D. Modha, "A Data-Clustering Algorithm on Distributed Memory
Multiprocessors," Proc. KDD Workshop on High Performance Knowledge Discovery,
pp. 245–260, 1999.

[12] G. Forman and B. Zhang, "Distributed Data Clustering Can Be Efficient and Exact,"
SIGKDD Explorations, vol. 2, no. 2, pp. 34–38, 2000.

[13] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the
EM algorithm," J. R. Stat. Soc. B, vol. 39, pp. 1–38, 1977.

[14] T. Moon, "The expectation-maximization algorithm," IEEE Signal Process. Mag., vol. 13,
pp. 47–60, 1996.

[15] W. Meng, W. Xiao, and L. Xie, "An efficient EM algorithm for multi-source localization in
wireless sensor networks," IEEE Trans. Instrum. Meas., vol. 60, pp. 1017–1027, 2011.

[16] R. M. Neal and G. E. Hinton, "A view of the EM algorithm that justifies incremental, sparse,
and other variants," NATO ASI Series D, Behavioural and Social Sciences, vol. 89, pp. 355–370,
1998.

[17] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering
Clusters in Large Spatial Databases with Noise," Proc. 2nd International Conference on
Knowledge Discovery and Data Mining, pp. 226–231, 1996.

[18] C. F. Olson, "Parallel Algorithms for Hierarchical Clustering," Parallel Computing, vol. 21,
pp. 1313–1325, 1995.
