illustrated in Section 3. A small selection of applications of soft clustering is discussed in Section 4. The paper is concluded in Section 5.

2. Overview of Soft Clustering

2.1 General Information of Soft Clustering

Soft clustering is one of the most fundamental tasks in exploratory data analysis: it groups similar data points in an unsupervised process. The main task of a clustering algorithm is to divide a set of unlabeled data objects into different groups, where cluster membership is based on a similarity measure. To obtain a high-quality partition, the similarity between data objects in the same group should be maximized, while the similarity between data objects from different groups should be minimized [7]. Most clustering tasks use an iterative process to find locally or globally optimal solutions in high-dimensional data sets. In addition, there is no unique clustering solution for real-life data, and it is also hard to interpret the 'cluster' representations [8]. The clustering task therefore requires considerable experimentation with different algorithms or with different features of the same data set, so reducing computational cost is a significant issue for clustering algorithms. Moreover, clustering very large data sets that contain large numbers of high-dimensional records is a particularly important problem nowadays. Most conventional clustering algorithms do not scale to larger data sets, and most of them are computationally expensive in terms of both memory space and time. For these reasons, parallelization is a way to overcome these problems, and the parallel implementation of clustering algorithms is inevitable.

More importantly, clustering analysis is unsupervised, 'nonpredictive' learning: it divides a data set into several clusters based on a subjective measure and, unlike supervised learning, is not based on a 'trained characterization'. In general, there is a set of desirable features for a clustering algorithm [9]: scalability, robustness, order insensitivity, minimal user-specified input, support for arbitrary-shaped clusters, and point proportion admissibility. Thus, a clustering algorithm should be chosen such that duplicating the data set and re-clustering it does not change the clustering results.

Two general methods of soft clustering are identified: discrete and continuous methods, specifically rough clustering and fuzzy clustering. Hard clustering can be considered a special case of soft clustering in which the membership values are discrete and restricted to either 0 or 1 (see Fig. 1), whereas fuzzy clustering provides continuous membership degrees ranging from 0 to 1. The objective of fuzzy clustering is to minimize a weighted sum of Euclidean distances between the objects and the cluster centers; it allows one piece of data to belong to two or more clusters (see Fig. 2). The Fuzzy C-Means (FCM) algorithm is an iterative partition clustering technique that was first introduced by Dunn [10] and later extended by Bezdek [11]. FCM uses a standard least-squared-error model that generalizes the earlier and very popular non-fuzzy c-means model, which produces hard clusters of the data.
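To make the discrete/continuous distinction concrete, here is a minimal Python sketch (NumPy assumed; the matrices are illustrative toy values, not from the paper) contrasting the two kinds of membership:

    import numpy as np

    # Hard clustering: discrete memberships, each object in exactly one cluster.
    hard_U = np.array([[1.0, 0.0, 0.0],    # object 1 belongs fully to cluster 1
                       [0.0, 0.0, 1.0]])   # object 2 belongs fully to cluster 3

    # Fuzzy clustering: continuous memberships in [0, 1] that sum to 1 per object.
    fuzzy_U = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])

    assert np.allclose(hard_U.sum(axis=1), 1.0)   # rows (objects) sum to 1 in both cases
    assert np.allclose(fuzzy_U.sum(axis=1), 1.0)

Hard clustering is recovered from the fuzzy case by restricting every entry to the set {0, 1}.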
Rough clustering extends the theory of rough, or approximation, sets. Rough k-means was first introduced by Lingras [12]. Each cluster has a lower and an upper approximation, and the lower approximation is a subset of the upper approximation (see Fig. 3); the difference between the two forms a boundary region. Members of the lower approximation certainly belong to the cluster and cannot belong to any other cluster. Data objects in the boundary region of an upper approximation may or may not belong to the cluster; since their membership is uncertain, they must be members of the upper approximation of at least one other cluster. Hence, an object relates to a cluster through two membership degrees: one for the lower approximation and one for the upper approximation.
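A minimal Python sketch of this assignment rule follows, in the spirit of rough k-means; the function name rough_assign and the distance-ratio threshold are illustrative assumptions rather than Lingras's exact formulation:

    import numpy as np

    def rough_assign(X, centers, threshold=1.3):
        """Assign objects to lower/upper approximations, rough-k-means style.

        If a second center is almost as close as the closest one (distance
        ratio below `threshold`), the object goes only into the upper
        approximations of both clusters; otherwise it goes into the lower
        (and hence also the upper) approximation of the closest cluster.
        """
        c = len(centers)
        lower = [set() for _ in range(c)]
        upper = [set() for _ in range(c)]
        for k, x in enumerate(X):
            d = np.linalg.norm(centers - x, axis=1)   # distances to all centers
            i = int(np.argmin(d))                     # closest cluster
            close = [j for j in range(c)
                     if j != i and d[j] / max(d[i], 1e-12) <= threshold]
            if close:                                 # uncertain (boundary) object
                upper[i].add(k)
                for j in close:
                    upper[j].add(k)
            else:                                     # certain member
                lower[i].add(k)
                upper[i].add(k)
        return lower, upper

Objects whose second-closest center is nearly as close as their closest one are placed only in upper approximations, which is exactly how the boundary region arises.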
2.2 Fuzzy Clustering

Fuzzy clustering is a method of clustering which allows one piece of data to belong to two or more clusters. The fuzzy c-means algorithm is a standard least-squared-error model that generalizes the earlier and very popular non-fuzzy c-means model, which produces hard clusters of the data. An optimal partition is produced iteratively by minimizing the weighted within-group sum of squared errors objective function [13]:

    J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} d^2(x_k, v_i)        (1)

where X = \{x_1, x_2, \ldots, x_n\} is the data set in a d-dimensional vector space, n is the number of data items, c is the number of clusters defined by the user with 2 \le c \le n, u_{ik} is the degree of membership of x_k in the i-th cluster, m is a weighting exponent on each fuzzy membership, v_i is the center of cluster i, and d^2(x_k, v_i) is a distance measure between object x_k and cluster center v_i.
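As a sketch of how Eq. (1) is evaluated, the following Python function (the name fcm_objective is an illustrative assumption) computes the objective for a given data set, set of centers, and membership matrix:

    import numpy as np

    def fcm_objective(X, V, U, m=2.0):
        """Weighted within-group sum of squared errors, Eq. (1).

        X: (n, d) data, V: (c, d) cluster centers,
        U: (c, n) memberships whose columns sum to 1, with m > 1.
        """
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # (c, n) squared distances
        return float(((U ** m) * d2).sum())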
The FCM algorithm proceeds as follows (a runnable sketch is given after the steps):

1. Input c, m, the stopping tolerance ε, and the data.
2. Initialize the fuzzy partition matrix U = [u_{ik}].
3. Start the iteration and set t = 1.
4. Calculate the cluster centers v_i^{(t)} with U^{(t)}:

       v_i^{(t)} = \frac{\sum_{k=1}^{n} (u_{ik}^{(t)})^m x_k}{\sum_{k=1}^{n} (u_{ik}^{(t)})^m}        (2)

5. Calculate the membership matrix U^{(t+1)} using:

       u_{ik}^{(t+1)} = \left[ \sum_{j=1}^{c} \left( \frac{d(x_k, v_i^{(t)})}{d(x_k, v_j^{(t)})} \right)^{2/(m-1)} \right]^{-1}        (3)

6. If the stopping criterion is not met, set t = t + 1 and go to Step 4.
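Putting Steps 1-6 together, a compact self-contained Python sketch (the function and parameter names fuzzy_c_means, eps, max_iter, and seed are illustrative assumptions) might look like this:

    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
        """Minimal FCM sketch following Steps 1-6 and Eqs. (2)-(3)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        U = rng.random((c, n))
        U /= U.sum(axis=0)                                   # Step 2: random fuzzy partition
        for _ in range(max_iter):                            # Step 3: iterate over t
            um = U ** m
            V = (um @ X) / um.sum(axis=1, keepdims=True)     # Step 4: centers, Eq. (2)
            d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # (c, n) distances
            d = np.fmax(d, 1e-12)                            # guard against division by zero
            # Step 5: memberships, Eq. (3); sum over j of (d_ik / d_jk)^(2/(m-1))
            U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1))).sum(axis=1)
            if np.abs(U_new - U).max() < eps:                # Step 6: stopping criterion
                return V, U_new
            U = U_new                                        # t = t + 1
        return V, U

For example, V, U = fuzzy_c_means(np.random.rand(200, 2), c=3) partitions 200 random two-dimensional points into three fuzzy clusters; column k of U holds the memberships of point k across the three clusters.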
4. Applications