
Neural Processing Letters 18: 155–162, 2003.
© 2003 Kluwer Academic Publishers. Printed in the Netherlands.

Clustering Incomplete Data Using Kernel-Based Fuzzy C-means Algorithm

DAO-QIANG ZHANG* and SONG-CAN CHEN


Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, People's Republic of China. e-mail: [email protected]
*Corresponding author.

Abstract. There is a recent trend in the machine learning community to construct nonlinear versions of linear algorithms using the 'kernel method', e.g. Support Vector Machines (SVMs), kernel principal component analysis, kernel Fisher discriminant analysis, and the recent kernel clustering algorithms. In unsupervised clustering algorithms using the kernel method, a nonlinear mapping is typically used first to map the data into a potentially much higher-dimensional feature space, where clustering is then performed. A drawback of these kernel clustering algorithms is that the clustering prototypes lie in the high-dimensional feature space and hence lack clear and intuitive descriptions, unless additional approximate projections from the feature space back to the data space are used, as done in the existing literature. In this paper, a novel clustering algorithm using the 'kernel method', based on the classical fuzzy c-means algorithm (FCM), is proposed and called the kernel fuzzy c-means algorithm (KFCM). KFCM adopts a new kernel-induced metric in the data space to replace the original Euclidean norm of FCM, and the clustered prototypes still lie in the data space, so the clustering results can be reformulated and interpreted in the original space. Our analysis shows that KFCM is robust to noise and outliers and also tolerates unequal-sized clusters. Finally, this property is utilized to cluster incomplete data. Experiments on two artificial datasets and one real dataset show that KFCM has better clustering performance and is more robust than several modifications of FCM for incomplete data clustering.

Key words. clustering, fuzzy c-means, incomplete data, kernel methods

1. Introduction
The fuzzy c-means (FCM) algorithm [1], as a typical clustering algorithm, has been utilized in a wide variety of engineering and scientific disciplines such as medical imaging, bioinformatics, pattern recognition, and data mining. FCM partitions a given dataset, $X = \{x_1, \ldots, x_n\} \subset R^p$, into $c$ fuzzy subsets by minimizing the following objective function

$$J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m \|x_k - v_i\|^2 \qquad (1)$$

where $c$ is the number of clusters (treated as a prespecified value in this paper), $n$ the number of data points, $u_{ik}$ the membership of $x_k$ in class $i$, $m$ the quantity controlling clustering fuzziness, and $V$ the set of cluster centers ($v_i \in R^p$). The matrix $U$ with $ik$-th entry $u_{ik}$ is constrained to contain elements in the range $[0, 1]$ such that $\sum_{i=1}^{c} u_{ik} = 1$ for all $k = 1, 2, \ldots, n$. The function $J_m$ is minimized by the well-known alternating iterative algorithm, sketched below.
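For readers who want to see the alternation concretely, here is a minimal NumPy sketch of the standard FCM updates; it illustrates the classical algorithm of [1], not the authors' code, and the random initialization and tolerance are our own choices.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Minimal sketch of standard FCM: alternating updates of U and V."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                # enforce sum_i u_ik = 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # fuzzy-weighted means
        d2 = ((X[None] - V[:, None]) ** 2).sum(axis=2)  # ||x_k - v_i||^2
        w = 1.0 / np.fmax(d2, 1e-12) ** (1.0 / (m - 1))
        U_new = w / w.sum(axis=0)                     # standard FCM memberships
        if np.abs(U_new - U).max() < eps:
            return U_new, V
        U = U_new
    return U, V
```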
Since the original FCM uses the squared norm to measure similarity between prototypes and data points, it is only effective in clustering 'spherical' clusters. To cluster more general datasets, many algorithms have been proposed that replace the squared norm in Equation (1) with other similarity measures [1, 2]. A recent development is to use the kernel method to construct kernel versions of the FCM algorithm [3–5]. A common ground of these algorithms is that clustering is performed in the transformed feature space after an (implicit) nonlinear data transformation $\Phi$. However, a drawback of these algorithms is that the clustering results, especially the prototypes, are difficult to represent exactly, because some prototypes in the feature space have no corresponding pre-images in the data space. To avoid this problem, additional approximate projection techniques must be used, as shown in [6, 7].
On the other hand, all the aforementioned algorithms are based on the assumption that the data in a dataset are complete, that is, all of the features (i.e., components) of every vector in $X$ are known. However, many real datasets, such as medical data, suffer from incompleteness: one or more of the components in $X$ are missing as a result of measurement errors, missing observations, etc. [8]. Many algorithms have been proposed to deal with incomplete data [8–12]. An elementary but good summary was given in [9], which presented several principles for handling incomplete data. More recently, the triangle inequality was used to estimate missing dissimilarity data [8], and in [12] several ways were developed to continue the FCM clustering of incomplete data: one simple method uses the partial distance strategy (PDS) in FCM, another estimates each missing feature as the weighted sum of the prototypes (WSP), and a third is the nearest prototype strategy (NPS); a sketch of the PDS distance is given below.
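As a rough illustration of PDS (WSP and NPS impute the missing features rather than adjusting the distance), the partial distance computes the squared distance over observed components only and rescales it to compensate for the missing ones, following [12]; the function below is a sketch under that reading.

```python
import numpy as np

def partial_distance(x, v, observed):
    """PDS of [12] (sketch): squared distance over observed components,
    rescaled by p / |observed| to compensate for the missing ones.
    `observed` is a boolean mask marking which features of x are present."""
    p = x.shape[0]
    n_obs = observed.sum()          # the masking scheme guarantees n_obs >= 1
    return (p / n_obs) * np.sum((x[observed] - v[observed]) ** 2)
```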
In this paper, an alternative kernel-based fuzzy c-means (KFCM) algorithm is proposed to cluster incomplete data. Unlike the usual way of utilizing the kernel method in FCM, the proposed KFCM clustering algorithm is still performed in the original data space, i.e., the prototypes lie in the data space. Furthermore, KFCM adopts a more robust kernel-induced metric in place of the Euclidean norm of the original FCM. In a similar way as in [12], we apply the proposed KFCM to cluster incomplete data, and show that WSP and NPS are two special cases of KFCM for incomplete data clustering. Moreover, because KFCM has better outlier and noise immunity than FCM, it is especially suitable for dealing with incomplete data. In this paper, two artificial datasets and one real dataset are used for testing. Experimental results show that KFCM has better performance than WSP and NPS.
In Section 2, we first discuss the alternative kernel-based fuzzy c-means clustering algorithm; in Section 3, we apply this algorithm to incomplete data clustering. To demonstrate the effectiveness of the proposed algorithm, experiments on three incomplete datasets are conducted and their results are given in Section 4. Finally, conclusions and discussions are given in Section 5.

2. Kernel Fuzzy c-means Clustering (KFCM)


Define a nonlinear map $\Phi : x \to \Phi(x) \in F$, where $x \in X$. Here $X$ denotes the data space and $F$ the transformed feature space, of higher or even infinite dimension. KFCM minimizes the following objective function

$$J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m \|\Phi(x_k) - \Phi(v_i)\|^2 \qquad (2)$$

where

$$\|\Phi(x_k) - \Phi(v_i)\|^2 = K(x_k, x_k) + K(v_i, v_i) - 2K(x_k, v_i) \qquad (3)$$

and $K(x, y) = \Phi(x)^T \Phi(y)$ is an inner-product kernel function. If we adopt the Gaussian function as the kernel, i.e., $K(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$, then $K(x, x) = 1$ and, according to Equation (3), Equation (2) can be rewritten as

$$J_m(U, V) = 2 \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m \left(1 - K(x_k, v_i)\right) \qquad (4)$$

Minimizing Equation (4) under the constraint on $U$, we obtain

$$u_{ik} = \frac{\left(1/(1 - K(x_k, v_i))\right)^{1/(m-1)}}{\sum_{j=1}^{c} \left(1/(1 - K(x_k, v_j))\right)^{1/(m-1)}}, \quad \forall i = 1, 2, \ldots, c; \; k = 1, 2, \ldots, n \qquad (5)$$

$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^m K(x_k, v_i)\, x_k}{\sum_{k=1}^{n} u_{ik}^m K(x_k, v_i)}, \quad \forall i = 1, 2, \ldots, c \qquad (6)$$

It is worth noting that although Equations (5)–(6) are derived using the Gaussian kernel function, other functions satisfying $K(x, x) = 1$ can be used in Equations (5)–(6) in real applications, such as the following RBF and hyper tangent functions:
(1) RBF functions:

$$K(x, y) = \exp\left(-\sum_i |x_i^a - y_i^a|^b / \sigma^2\right) \quad (0 < b \le 2) \qquad (7)$$

(2) Hyper tangent functions:

$$K(x, y) = 1 - \tanh\left(\|x - y\|^2 / \sigma^2\right) \qquad (8)$$

Note that the RBF function with $a = 1$, $b = 2$ reduces to the commonly used Gaussian function. In fact, Equation (3) can be viewed as a new kernel-induced metric in the data space, defined as

$$d(x, y) = \|\Phi(x) - \Phi(y)\| = \sqrt{2\left(1 - K(x, y)\right)} \qquad (9)$$

In Appendix I, we prove that $d(x, y)$ defined in Equation (9) is a metric in the original space when $K(x, y)$ is taken as the Gaussian, RBF, or hyper tangent kernel function. According to Equation (6), the data point $x_k$ is endowed with an additional weight $K(x_k, v_i)$, which measures the similarity between $x_k$ and $v_i$. When $x_k$ is an outlier, i.e., $x_k$ is far from the other data points, $K(x_k, v_i)$ will be very small, so the weighted sum of data points becomes more robust. Since in an incomplete dataset a data point with missing components is likely to turn into an outlier, an algorithm based on KFCM has great potential for clustering incomplete data.
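To make the alternating optimization concrete, the following sketch implements one joint update of Equations (5) and (6) with the Gaussian kernel; the vectorized layout and the guard against division by zero are our own choices, not part of the paper.

```python
import numpy as np

def kfcm_step(X, V, m=2.0, sigma=1.0):
    """One joint update of Equations (5) and (6) with the Gaussian kernel.
    X: (n, p) data; V: (c, p) prototypes. Returns new memberships and prototypes."""
    # K(x_k, v_i) for all pairs, shape (c, n)
    K = np.exp(-((X[None] - V[:, None]) ** 2).sum(axis=2) / sigma ** 2)
    # Equation (5): memberships from the kernel-induced dissimilarity 1 - K
    w = (1.0 / np.fmax(1.0 - K, 1e-12)) ** (1.0 / (m - 1))
    U = w / w.sum(axis=0)
    # Equation (6): prototypes stay in the data space, weighted by u^m * K
    W = (U ** m) * K
    V_new = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, V_new
```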

3. Clustering Incomplete Data Using KFCM


To implement clustering on incomplete data, we derive the following algorithm based on KFCM (a minimal implementation sketch is given at the end of this section):
(1) Fix $c$, $m > 1$, and a small positive constant $\varepsilon > 0$;
(2) Set $x_{kj} = 0$ if $x_{kj}$ is a missing feature;
(3) Initialize the prototypes $v_i$ using FCM;
(4) For $t = 1, 2, \ldots, t_{\max}$, do:
    (a) Update all memberships $u_{ik}^t$ with Equation (5);
    (b) Update all prototypes $v_i^t$ with Equation (6);
    (c) Calculate the missing values using Equation (10);
    (d) Compute $E^t = \max_{i,k} |u_{ik}^t - u_{ik}^{t-1}|$; if $E^t \le \varepsilon$, stop;
End for.
$$x_{kj} = \frac{\sum_{i=1}^{c} u_{ik}^m K(x_k, v_i)\, v_{ij}}{\sum_{i=1}^{c} u_{ik}^m K(x_k, v_i)} \qquad (10)$$

Here the kernel function is the same as in Section 2. From Equations (5), (6) and (10), as $\sigma \to \infty$ we have $K(x_k, v_i) \approx 1 - \|x_k - v_i\|^2/\sigma^2$, so KFCM reduces to the classical FCM and Equation (10) becomes the expression used in the WSP algorithm. Furthermore, as $\sigma \to 0$, Equation (10) reduces to $x_{kj} = v_{pj}$ with $p = \arg\min_i \|x_k - v_i\|^2$, which is exactly the strategy used in the NPS algorithm.
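A minimal sketch of steps (1)–(4) follows. For simplicity it initializes the prototypes from a random subset of the data rather than a preliminary FCM run (step (3)), and all helper names are ours.

```python
import numpy as np

def kfcm_incomplete(X, mask, c, m=2.0, sigma=1.0, eps=1e-5, t_max=100, seed=0):
    """Sketch of the KFCM algorithm for incomplete data (steps (1)-(4)).
    X: (n, p) data with arbitrary values at missing positions;
    mask: (n, p) boolean, True where a feature is observed."""
    rng = np.random.default_rng(seed)
    X = np.where(mask, X, 0.0)                      # step (2): zero missing features
    V = X[rng.choice(len(X), c, replace=False)]     # simplified step (3)
    U_old = np.zeros((c, len(X)))
    for _ in range(t_max):                          # step (4)
        K = np.exp(-((X[None] - V[:, None]) ** 2).sum(axis=2) / sigma ** 2)
        w = (1.0 / np.fmax(1.0 - K, 1e-12)) ** (1.0 / (m - 1))
        U = w / w.sum(axis=0)                       # (a) Equation (5)
        W = (U ** m) * K
        V = (W @ X) / W.sum(axis=1, keepdims=True)  # (b) Equation (6)
        X_hat = (W.T @ V) / W.sum(axis=0)[:, None]  # (c) Equation (10)
        X = np.where(mask, X, X_hat)                # impute missing entries only
        if np.abs(U - U_old).max() < eps:           # (d) E^t <= eps: stop
            break
        U_old = U
    return U, V, X
```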

4. Experimental Results and Discussions


In this section, we compare the performance of FCM with that of KFCM on some incomplete datasets. In all cases, the incomplete data are artificially generated by randomly selecting a specified percentage of the components of a dataset and designating them as missing values. The random selection of missing values is constrained so that (1) each original feature vector retains at least one component, and (2) each feature has at least one value present in the incomplete dataset [12]; a sketch of this masking procedure is given below. The initial values for the prototypes are obtained by running FCM on the original (complete) datasets, and the kernel functions used are the Gaussian, RBF, and hyper tangent functions.
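Since the paper does not specify how the constrained random selection is carried out, one simple possibility is rejection sampling over random masks, sketched here.

```python
import numpy as np

def make_incomplete(X, frac, seed=0):
    """Mark a fraction of entries missing by rejection sampling, so that
    (1) every row keeps at least one observed component and
    (2) every column keeps at least one observed value."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_missing = int(round(frac * n * p))
    while True:
        mask = np.ones(n * p, dtype=bool)
        mask[rng.choice(n * p, n_missing, replace=False)] = False
        mask = mask.reshape(n, p)
        if mask.any(axis=1).all() and mask.any(axis=0).all():
            return mask            # True = observed, False = missing
```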
The first dataset (Data Set A) is an artificially generated one [2]. It contains two clusters: one has 25 points and the other 125 points. Figure 1 shows the clustering results on the complete data using FCM and KFCM. We use both the Gaussian and hyper tangent kernel functions in this experiment, with $\sigma = 2$; the experiment is repeated 10 times, each run giving the same result. It can be seen from Figure 1 that FCM misclassifies 8 data points, while KFCM classifies all the data correctly. Table I gives the clustering results on incomplete Data Set A. Both Gaussian and RBF kernels are used in KFCM, with $\sigma = 0.2$. Results are averaged over a total of 1000 trials. Clearly, KFCM performs much better than FCM on Data Set A, and the best result is obtained using the RBF function with $a = 1.5$, $b = 1.2$, as shown in Table I.
The second artificial dataset (Data Set B) consists of two clusters of 100 points each in $R^5$ [12]. The points are drawn from Gaussian distributions with means $(1, 1, 1, 1, 1)$ and $(-1, -1, -1, -1, -1)$, respectively, each with identity covariance. Figure 2 shows a 2-D plot of Data Set B obtained with the standard PCA technique [13].

Figure 1. Clustering result of the FCM and KFCM algorithms on Data Set A. +: cluster 1; ·: cluster 2; *: data points misclassified by FCM.

Figure 2. Two-dimensional plot of the distribution of Data Set B. +: cluster 1; *: cluster 2.
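A 2-D PCA projection such as that of Figure 2 can be produced along the following lines; this is a generic sketch of the standard technique [13], not the authors' code.

```python
import numpy as np

def pca_2d(X):
    """Project data onto its first two principal components for plotting."""
    Xc = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # (n, 2) plot coordinates
```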

For Data Set B with Gaussian distributions, both FCM and KFCM classify the complete data correctly. Next, we perform clustering on incomplete Data Set B; Table II gives the results. The kernel functions used in KFCM are the Gaussian and hyper tangent functions, both with $\sigma = 2$. Results are averaged over a total of 1000 trials. The Gaussian function has the smallest average number of misclassifications in all cases, as shown in Table II.
The last dataset is the well-known Iris dataset. It consists of 150 four-dimensional feature vectors, 50 for each of three physically labeled classes. To achieve better clustering performance, each vector is normalized. The clustering results on the incomplete Iris data are shown in Table III.

Table I. Averaged number of misclassifications on incomplete Data Set A.

% Missing   PDS     WSP     NPS     Gaussian   RBF (a = 1.5, b = 1.2)
10          21.84   11.92   17.95   7.62       7.27
20          33.56   16.12   31.45   16.08      14.93

Table II. Averaged number of misclassifications on incomplete Data Set B.

% Missing   PDS     WSP     NPS     Gaussian   Hyper tangent
20          2.57    2.54    2.61    2.43       2.51
40          6.39    6.33    6.71    6.07       6.10
60          15.70   14.66   30.50   14.32      14.39

Table III. Averaged number of misclassifications on incomplete Iris dataset.

% Missing   PDS     WSP     NPS     Gaussian   RBF (a = 0.5, b = 2)
25          63.96   16.33   29.14   13.57      12.73
50          77.79   37.21   50.75   37.66      31.26

The kernel functions used are the Gaussian and RBF functions, both with $\sigma = 1$. Results are averaged over a total of 1000 trials. The RBF function with $a = 0.5$, $b = 2$ has the smallest average number of misclassifications in all cases, as shown in Table III.
From our experiments, we found that different kernels with different parameters lead to different clustering results, so a key point is choosing an appropriate kernel parameter. However, there is no general theory to guide the selection of the best parameter in most kernel-based algorithms; this remains an open problem. In our experiments, we adopted an approach similar to that in [6], i.e., running a 5-fold cross-validation procedure on only a few realizations of the dataset. Each time, this is done in two stages: first taking a large interval on an exponential scale to find a good initial guess of the parameter, and then gradually shortening the interval to refine the parameter. We use the median of the five estimates throughout the remaining trials on the dataset. Typically, several tens of trials on the dataset are needed to obtain an appropriate parameter; considering that a total of 1000 trials are performed in our experiments, the computational cost of choosing an appropriate parameter remains comparatively small (a sketch of this two-stage search is given below).
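The two-stage search described above might be sketched as follows, where the cross-validated scoring function is left abstract; its definition, like the grid bounds and zoom factors, is our own assumption.

```python
import numpy as np

def select_sigma(score, lo=1e-2, hi=1e2, n_grid=9, n_refine=3):
    """Two-stage search for the kernel width sigma.
    `score(sigma)` is a user-supplied validation criterion (higher is better),
    e.g. a cross-validated misclassification-based score."""
    best = None
    for _ in range(1 + n_refine):
        grid = np.geomspace(lo, hi, n_grid)   # large interval, exponential scale
        best = grid[int(np.argmax([score(s) for s in grid]))]
        lo, hi = best / 2.0, best * 2.0       # gradually shorten the interval
    return best
```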

5. Conclusions
In this paper, we proposed a new kernel-induced metric to replace the Euclidean norm of the fuzzy c-means algorithm in the original space, and then derived the alternative kernel-based fuzzy c-means algorithm. Unlike the common way of using the 'kernel method' to represent a variable in dual form, as in SVMs [7], kernel PCA [6], kernel Fisher discriminant analysis [6] and kernel clustering algorithms [3–5], we adopted a new kernel-induced metric as in Equation (2). It was shown that the proposed kernel clustering algorithm is robust to noise and outliers and also tolerates unequal-sized clusters. This property was further utilized to cluster incomplete data, yielding better performance than the classical counterparts.

Appendix I. Proof that d(x,y) defined in Equation (9) is a metric


Proof. To prove that $d(x, y)$ is a metric, the necessary and sufficient condition is that $d(x, y)$ satisfies the following three conditions [14]:
(i) $d(x, y) > 0$, $\forall x \neq y$, and $d(x, x) = 0$;
(ii) $d(x, y) = d(y, x)$;
(iii) $d(x, y) \le d(x, z) + d(z, y)$, $\forall z$.

It is easy to verify that for the Gaussian, RBF and hyper tangent kernel functions, $d(x, y)$ defined in Equation (9) satisfies $d(x, y) = d(y, x) > 0$ for all $x \neq y$ and $d(x, x) = 0$, so conditions (i) and (ii) hold. From Equation (9), we have

$$d(x, y) = \|\Phi(x) - \Phi(y)\| \le \|\Phi(x) - \Phi(z)\| + \|\Phi(z) - \Phi(y)\| = d(x, z) + d(z, y).$$

Thus condition (iii) is also satisfied, by the triangle inequality of the norm. So $d(x, y)$ is a metric.

Acknowledgements
The authors are grateful to the anonymous reviewers for their comments and suggestions to improve the presentation of this paper. This work was supported in part by the National Science Foundations of China and of Jiangsu under Grant Nos. 60271017 and BK2002092, the 'QingLan Project' Foundation of Jiangsu, and the Returnee Foundation of the Ministry of Education.

References
1. Bezdek, J. C.: Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
2. Wu, K. L. and Yang, M. S.: Alternative c-means clustering algorithms, Pattern Recognition 35 (2002), 2267–2278.
3. Girolami, M.: Mercer kernel-based clustering in feature space, IEEE Trans. Neural Networks 13(3) (2002), 780–784.
4. Zhang, L., Zhou, W. D. and Jiao, L. C.: Kernel clustering algorithm, Chinese J. Computers 25(6) (2002), 587–590 (in Chinese).
5. Zhang, D. Q. and Chen, S. C.: Fuzzy clustering using kernel methods, In: Proceedings of the International Conference on Control and Automation (ICCA'02), June 16–19, Xiamen, China, 2002, pp. 123–128.
6. Müller, K. R., Mika, S. et al.: An introduction to kernel-based learning algorithms, IEEE Trans. Neural Networks 12(2) (2001), 181–202.
7. Vapnik, V. N.: Statistical Learning Theory, Wiley, New York, 1998.
8. Hathaway, R. J. and Bezdek, J. C.: Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm, Pattern Recognition Letters 23 (2002), 151–160.
9. Jain, A. K. and Dubes, R. C.: Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988.
10. Gaul, W. and Schader, M.: Pyramidal classification based on incomplete data, J. Classification 11 (1994), 171–193.
11. Schafer, J. L.: Analysis of Incomplete Multivariate Data, Chapman & Hall, London, 1997.
12. Hathaway, R. J. and Bezdek, J. C.: Fuzzy c-means clustering of incomplete data, IEEE Trans. Systems, Man, and Cybernetics 31(5) (2001), 735–744.
13. Jolliffe, I. T.: Principal Component Analysis, Springer-Verlag, New York, 1986.
14. Rudin, W.: Principles of Mathematical Analysis, McGraw-Hill, New York, 1976.
