Fast Scalable K-Means++ Algorithm with MapReduce
Y. Xu et al.
X.-h. Sun et al. (Eds.): ICA3PP 2014, Part II, LNCS 8631, pp. 15–28, 2014.
© Springer International Publishing Switzerland 2014

1 Introduction

Clustering has been applied in many areas of computer science and related fields, such as data mining, pattern recognition, and image retrieval [1–4]. K-means is one of the most widely used clustering methods, but it suffers from the well-known problem of converging to a local optimum, because it is highly sensitive to the choice of the initial centers. In recent years, much research has focused on improving its initialization method [5,6]. An important
piece of work in this direction is k-means++ [7]. In practice, this algorithm is fast on small data. Moreover, it was the first to provide a theoretical guarantee, obtaining an O(log k) approximation to the optimal k-means solution.
However, the era of big data poses new challenges for the k-means++ algorithm. Although it can be run on MapReduce [8], and many clustering algorithms [9–12] run efficiently on the MapReduce platform in practice, k-means++ is an exception. The fundamental reason is that k-means++ is a sequential algorithm and lacks scalability: the probability that a point is chosen as a center strongly depends on the previously chosen centers. The algorithm chooses one center in each round and needs k rounds over the data to produce the expected initial centers, which requires many iterative computations. On a single computer, iterative computation is common and easily implemented. The MapReduce framework, however, does not directly support such iterative data analysis applications; instead, iterative programs must be implemented by manually issuing multiple MapReduce jobs, which forces the data to be re-loaded and re-processed at each iteration, wasting I/O, network, and CPU resources [13,14].
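To make the sequential nature concrete, a minimal Python sketch of the D²-weighted seeding behind k-means++ is shown below (an illustration of the general technique, not the paper's code; function and variable names are our own). Note that each of the k − 1 sampling rounds depends on all previously chosen centers, which is exactly what prevents straightforward parallelization.

```python
import random

def kmeans_pp_init(points, k):
    """D^2-weighted seeding: each new center is drawn with probability
    proportional to the squared distance to its nearest existing center."""
    centers = [random.choice(points)]          # first center: uniform
    for _ in range(k - 1):                     # k-1 strictly sequential rounds
        # squared distance from each point to its nearest chosen center
        d2 = [min(sum((p - c) ** 2 for p, c in zip(x, ctr))
                  for ctr in centers)
              for x in points]
        total = sum(d2)
        # roulette-wheel draw proportional to d2
        r = random.uniform(0, total)
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
        else:
            centers.append(points[-1])
    return centers
```

Because the weights `d2` must be recomputed after every single center is added, the loop cannot be collapsed into fewer passes without changing the algorithm.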
To reduce the number of rounds of k-means++, Bahman Bahmani et al. proposed the scalable k-means++ algorithm [15], which we describe in more detail in Section 3. It is a parallel version of the inherently sequential k-means++. Instead of choosing one point in each round, scalable k-means++ uses oversampling to choose ℓ = Ω(k) points per round. Hence, it drastically reduces the number of rounds from k to approximately O(log ψ). Scalable k-means++ enhances the scalability of k-means++ and is easily parallelized in the MapReduce framework. Another merit is that it achieves an O(log k) approximation to the k-means objective.
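The oversampling idea can be sketched as follows (a simplified illustration of the technique in [15] under our own naming and parameter choices; the final reduction of the candidate set to exactly k centers is omitted):

```python
import math
import random

def scalable_kmeans_pp_init(points, k, ell=None, rounds=None):
    """Sketch of scalable k-means++ oversampling: each round samples
    about ell points in parallel (ell = Omega(k)), so only ~O(log psi)
    rounds over the data are needed instead of k."""
    if ell is None:
        ell = 2 * k                                  # an assumed choice

    def d2(x, C):
        return min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in C)

    C = [random.choice(points)]                      # one uniform seed
    psi = sum(d2(x, C) for x in points)              # initial cost
    if rounds is None:
        rounds = max(1, int(math.log(psi + 1)))      # ~O(log psi) rounds
    for _ in range(rounds):
        cost = sum(d2(x, C) for x in points)
        if cost == 0:
            break
        # each point is kept independently with prob. min(1, ell*d2/cost);
        # this per-point decision is what makes a round parallelizable
        sampled = [x for x in points
                   if random.random() < min(1.0, ell * d2(x, C) / cost)]
        C.extend(sampled)
    # C now holds roughly ell * rounds candidates; a final (weighted)
    # clustering step reduces them to exactly k centers -- omitted here.
    return C
```

The key contrast with plain k-means++ is that within a round every point's sampling decision is independent, so the round maps naturally onto parallel workers.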
However, scalable k-means++ does not thoroughly break the inherently sequential nature of k-means++. Thus, it is not embarrassingly parallel and cannot be executed efficiently on MapReduce-based systems. Because there is no communication between Mappers, MapReduce scalable k-means++ requires two MapReduce jobs in each round: the first job chooses centers and combines them, and the second computes the clustering cost. Therefore, it has to iterate O(log ψ) rounds and run at least 2 · O(log ψ) MapReduce jobs to choose the initial centers. As mentioned above, MapReduce does not directly support iterative analysis applications; when log ψ is large, this is time-consuming, and so many MapReduce jobs are unacceptable. In addition, it incurs a large amount of network and I/O overhead.
This paper proposes an efficient parallel scalable k-means++ algorithm, called Oversampling and Refining (OnR), for the big-data setting on MapReduce. The main idea of OnR is to use only one MapReduce job, instead of two, to complete the task of choosing new centers and computing the clustering cost. Because there is no communication in the Mapper phase, we could not compute the