Research On Spectral Clustering Algorithms and Prospects

Abstract—Along with the expansion and deepening of the application domains of cluster analysis, a new kind of clustering algorithm, the Spectral Clustering algorithm, has aroused great interest among scholars. Spectral Clustering is a newly developed technique in the field of machine learning. Unlike the traditional clustering algorithms, it can cluster sample spaces that are not convex spheres and yields a globally optimal solution. This paper introduces the principles of Spectral Clustering and summarizes the current state of research on the algorithm and its various application domains. First, several Spectral Clustering algorithms are analyzed and summarized from aspects such as the idea of the algorithm, its key technology, and its advantages and disadvantages. Then, some typical Spectral Clustering algorithms are selected for analysis and comparison. Finally, the key open problems and future directions are pointed out.

Keywords-cluster analysis; spectral clustering; Laplacian matrix; graph partition; eigenvalue

I. INTRODUCTION

So far, cluster analysis has no generally accepted definition in the academic community. Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity [1]. Clustering can be a stand-alone tool and can also serve as a preprocessing step for other model algorithms. Clustering plays an important role in the fields of pattern recognition and image processing [2].

Cluster analysis, as a data preprocessing step, is the basis of further analysis and data processing [3]. It is also known as an unsupervised learning process, since there is no prior knowledge about the data set. It also acts as an important data analysis tool aimed at exploratory tasks. Cluster analysis is a multivariate statistical method which studies the principle that "birds of a feather flock together"; it is one of the three practical statistical methods [4], the other two being regression analysis and discriminant analysis. It mainly studies distance or similarity measures between individuals and, according to some clustering rule, divides the individuals or the data set into different clusters, so that objects within the same cluster have high similarity and different clusters have low similarity [5]. In recent years, cluster analysis has been a hotspot for analyzing data and extracting information in the fields of pattern recognition and data analysis [6].

There are many clustering algorithms, but each algorithm is optimized for certain aspects of the data, such as minimizing the within-class distance or maximizing the inter-class distance. So far, no single clustering algorithm can be used universally to reveal the structure of all kinds of multi-dimensional data sets [7]. Each algorithm imposes some structure on the data set, explicitly or implicitly, which makes clustering validity difficult to assess. Most traditional clustering algorithms, such as K-means and Fuzzy C-Means (FCM), need to assume that the cluster objects have certain characteristics, that they form a number of distinct regions, and that they lie in a convex spherical sample space. When the sample space is not convex, such an algorithm will be trapped in a local optimum. In order to solve this problem, a new clustering algorithm, known as the Spectral Clustering algorithm, has been proposed [8].

Spectral Clustering algorithms are based on spectral graph theory. They treat data clustering as a graph partitioning problem without making any assumption on the form of the data clusters: the data set is mapped to the row vectors of the matrix composed of the first k eigenvectors of the Laplacian matrix. Not only can n-dimensional data sets be converted into k-dimensional data sets (often with k << n), achieving dimensionality reduction, but good clustering results are also obtained [9]. Spectral Clustering is a pairwise clustering algorithm and has good application prospects. In recent years, Spectral Clustering has been studied more and more and has become increasingly widespread as a cluster analysis algorithm. It is a new branch of cluster analysis; it was originally used for load balancing, parallel computing, VLSI design [10] and other areas, has recently begun to be used in machine learning, and has quickly become an international hot spot in that field. Currently, Spectral Clustering attracts increasing attention in text mining [11, 12], information retrieval [13] and image segmentation [14], and research results have been achieved.

978-1-4244-6349-7/10/$26.00 © 2010 IEEE    V6-149
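The pipeline described in the introduction — build a similarity matrix, take eigenvectors of its Laplacian, and cluster in that reduced space — can be sketched as follows. This is a minimal two-cluster illustration using only NumPy; the Gaussian width sigma, the use of the unnormalized Laplacian, and the sign-of-Fiedler-vector partition are choices made for this sketch, not prescriptions from the paper:

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Split points into 2 clusters via the Fiedler vector of L = D - W."""
    # Gaussian similarity matrix W (Eq. (2)-style weights), zero diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / sigma**2)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))          # degree matrix
    L = D - W                           # unnormalized Laplacian
    # eigh returns eigenvalues in ascending order; the eigenvector of the
    # second-smallest eigenvalue is the Fiedler vector (cf. Fiedler [20]).
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    # The sign pattern of the Fiedler vector gives a two-way partition.
    return (fiedler > 0).astype(int)

# Two well-separated 1-D groups; the same pipeline applies unchanged to
# the non-convex shapes where K-means fails.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = spectral_bipartition(X)
```

For more than two clusters, the same idea stacks the first k eigenvectors as columns and runs an ordinary clustering algorithm (e.g. K-means) on the rows, as described in Section II.C below.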
This paper introduces the basic principles of the Spectral Clustering algorithm, some typical algorithms, the state of research, and the applications in the field of machine learning, and meanwhile points out the key issues and the development trends for the future.

II. FUNDAMENTAL PRINCIPLES

A. Basic Knowledge of Graphs

Spectral Clustering is based on spectral graph partition theory. The spectral graph approach converts the solution space from discrete to continuous and thus transforms the graph partitioning problem into a matrix eigenvalue problem. At present it has become a popular high-performance approach.

The Spectral Clustering algorithm treats the data set to be clustered as an undirected complete graph G = (V, E), where the vertex set corresponds to the data set, that is, a vertex vi corresponds to a data point xi, and the weight wij is the similarity between vi and vj, which are joined by an edge [15]. The clustering problem thus becomes a graph partitioning problem: cut V into k disjoint subsets of vertices, V = (V1, V2, …, Vk), so that vertices in the same subset have high similarity and different subsets have low similarity.

Different graphs can be expressed by different matrices; the same graph can also be described with different matrices, and for the same problem different matrices yield different results. The commonly used matrices are given below [16, 17, 18].

1) Binary adjacency matrix: for a graph Gk = (Vk, Ek), where Vk is the vertex set, Ek ⊆ Vk × Vk is the edge set, and i, j index the rows and columns, the binary adjacency matrix is

    Ak(i,j) = 1 if (i,j) ∈ Ek, and 0 otherwise.    (1)

2) Weighted adjacency matrix: for the graph Gk, with d(i,j) the distance between two points, the weighted adjacency matrix is

    Ak(i,j) = exp(−d(i,j)² / σ²) if (i,j) ∈ E, and 0 otherwise.    (2)

3) Laplacian matrix: for the graph Gk, with Ak(i,j) the binary adjacency matrix and Dk(i,j) the degree matrix, the matrix Lk(i,j) = Dk(i,j) − Ak(i,j) is called the Laplacian matrix. The most commonly used Laplacian matrices are summarized in Table I.

TABLE I. LAPLACIAN MATRIX TYPES

    Unnormalized:  L = D − W
    Symmetric:     Lsym = D^(−1/2) L D^(−1/2)
    Asymmetric:    Las = D^(−1) L = I − D^(−1) W

4) Binary k-adjacency matrix: for the graph Gk, if the ith and jth points satisfy the k-nearest-neighbor condition, then a(i,j) = 1, else a(i,j) = 0; the binary k-adjacency matrix is

    Ak(i,j) = 1 if (i,j) ∈ E(KNN), i.e. i ∈ knn(j) or j ∈ knn(i), and 0 otherwise.    (3)

B. Spectral Theory

In 1973, Donath and Hoffman [19] were the earliest to relate the graph partitioning problem to the eigenvectors of the similarity matrix; afterwards Fiedler [20] presented the relationship between the bipartition of a graph and the second eigenvector of its Laplacian matrix. In 1992, Hagen and Kahng proposed the Ratio Cut, which is based on spectral analysis [21]. Spectral Clustering algorithms derive new features of the clustering objects through the theory of matrix analysis and use these new features to cluster the original data.

An arbitrary undirected graph G = (V, E) can be expressed by a symmetric matrix, its adjacency matrix. The graph is then divided into two disconnected sub-graphs A and B (where A∪B = V, A∩B = Φ), such that the similarity within each sub-graph is as large as possible and the similarity between the sub-graphs is as small as possible [22]. The idea is to minimize the sum of the crossing weights, that is, to minimize the cost function cut(A,B) = Σ_{i∈A, j∈B} wij (where A and B are the sub-graphs and wij is the edge weight). The division criterion has a direct impact on the quality of the clustering results. Because the weights of the graph can combine various features of the clustering objects, the Spectral Clustering algorithm is simple and can handle complex data types.

C. Spectral Clustering Algorithm Model

Since Spectral Clustering was proposed, many researchers have put forward different concrete realizations, but these methods all follow four main steps:

Step 1: build the similarity matrix A;
Step 2: compute the first k eigenvalues and eigenvectors of A to construct the feature vector space;
Step 3: determine the number of clusters;
Step 4: use a clustering algorithm to cluster the rows of the feature vector space.

III. TYPICAL SPECTRAL CLUSTERING ALGORITHMS

A. Spectral Clustering Algorithms Based on Graph Partitioning

Spectral Clustering can be interpreted from several angles, such as graph cut-set theory, random walks and perturbation theory [23]. But no matter which theory is used, Spectral Clustering is converted to the eigenvector problem of the Laplacian matrix, and then the eigenvectors are clustered. Assume that the data X = {x1, x2, …, xn} ⊂ R^l is to be divided into c clusters. Among the graph-cut criteria, the Normalized Cut (N-Cut) is recognized as the best for Spectral Clustering [24].

Spectral Clustering based on the N-Cut algorithm:

Step 1: construct the similarity matrix W = (wij) ∈ R^(N×N), where wii = 0 and, for i ≠ j, wij = exp(−|xi − xj|² / σ²);
Step 2: calculate the diagonal matrix D = Diag(W · 1N), in which Dii = Σ_j wij;
Step 3: normalize the similarity matrix to obtain L,

TABLE II. SPECTRAL ALGORITHMS BASED ON GRAPH CUTS

    Classification Criterion | Cost Function
    Minimum Cut:    cut(A,B) = Σ_{u∈A, v∈B} w(u,v)
    Normalized Cut: Ncut(A,B) = cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)
    Ratio Cut:      Rcut(A,B) = cut(A,B) / min(|A|, |B|)
    Average Cut:    Avcut(A,B) = cut(A,B)/|A| + cut(A,B)/|B|

V6-150    2010 2nd International Conference on Computer Engineering and Technology [Volume 6]
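The cut criteria of Table II can be evaluated directly from the weight matrix. The following sketch shows them on a tiny weighted graph (the helper names `cut`, `assoc`, `ncut`, `rcut`, `avcut` and the example graph are our own, chosen to mirror the table's formulas):

```python
import numpy as np

def cut(W, A, B):
    """Sum of edge weights crossing from vertex set A to vertex set B."""
    return W[np.ix_(A, B)].sum()

def assoc(W, A, V):
    """Total connection from the vertices in A to the whole vertex set V."""
    return W[np.ix_(A, V)].sum()

def ncut(W, A, B):
    """Normalized Cut: penalizes cuts that isolate weakly connected sets."""
    V = A + B
    return cut(W, A, B) / assoc(W, A, V) + cut(W, A, B) / assoc(W, B, V)

def rcut(W, A, B):
    """Ratio Cut: normalizes by the size of the smaller side."""
    return cut(W, A, B) / min(len(A), len(B))

def avcut(W, A, B):
    """Average Cut: normalizes by the size of each side."""
    return cut(W, A, B) / len(A) + cut(W, A, B) / len(B)

# Weighted graph on 4 vertices: two tight pairs joined by one weak edge.
W = np.array([[0, 4, 0, 0],
              [4, 0, 1, 0],
              [0, 1, 0, 4],
              [0, 0, 4, 0]], dtype=float)
A, B = [0, 1], [2, 3]
# Cutting between the pairs severs only the weak edge: cut(A, B) = 1.
```

Note that the Minimum Cut criterion alone favors cutting off isolated vertices; the three normalized variants in the table exist precisely to balance the cut weight against the size or association of the resulting parts.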
robust to random noise. Since the algorithm's time complexity is O(n³), it has the same time complexity as the original clustering algorithm, with differences only in the constant factors. The algorithm does not inherit the advantage of reducing the data dimension, and it is difficult for it to deal with the out-of-core data problem.

C. Spectral Clustering Algorithms Based on Sub-spaces

Under normal circumstances, we hope to cluster a data set through the nature of its local neighborhoods, in order to realize the so-called manifold convergence criterion [29]: data points lying on the same manifold structure are put into the same class, so that even if different manifold structures overlap or cross, they can still be successfully identified. Obviously, in traditional clustering algorithms a simple similarity criterion can hardly identify such crossing clusters. The sub-space Spectral Clustering algorithm has a strong ability to identify such data sets and can effectively solve the above problem. The Spectral Clustering algorithm uses the eigenvectors to construct a simplified data space, which can make the distribution structure of the data in the sub-space more apparent while reducing the data dimension. The algorithm introduces the sub-space neighborhood similarity matrix wsub [30], which greatly enhances the ability to identify cross-shaped data. However, the study of this similarity matrix is not yet mature, so its theory and applications should be examined on more data sets.

    wsub(i,j) = ( w'sub(i,j) + w'sub(j,i) ) / 2    (5)

    w'sub(i,j) = 1 if xi ∈ Nsub(xj), and 0 if xi ∉ Nsub(xj)    (6)

D. Multi-layer Spectral Clustering Algorithm that Determines the Number of Clusters [31]

The key problems of Spectral Clustering are to automatically determine the multi-layer cluster number and to deal with out-of-core data. The Multi-SC algorithm that determines the number of clusters automatically (ACNAFLDS) is a multi-layer algorithm which can handle large-scale data sets. The core idea of this algorithm is to combine the large-scale data, according to a certain correlation, into a small group of data sets, then use the new algorithm to cluster the small data sets, and finally, step by step, split and fine-tune to complete the clustering of all the data.

This algorithm was proposed based on ACNA [32]. Compared with traditional clustering algorithms, ACNA can automatically determine the number of clusters, but for large images the computational cost of ACNA is so high that it easily exceeds the computer's memory. ACNAFLDS overcomes this shortcoming of ACNA: not only is it able to automatically determine the number of clusters, but it also solves the memory problems caused by the eigenvector and similarity matrix computations when handling massive data. ACNAFLDS is suitable for large images and is fast, simple and easy to implement. This algorithm has been effectively applied in image segmentation.

IV. PROSPECTS

Since the Spectral Clustering algorithm appeared, many scholars have developed it further, but they found that the main obstacle to its development is the lack of a theoretical basis. Although there are a variety of algorithms, the differences lie only in how the matrix is handled; the relationship between the matrix spectrum and the eigenvectors is not clear, and most of the existing Spectral Clustering algorithms need the number of clusters to be given in advance. The Spectral Clustering algorithm must face and solve two main problems: 1) How to measure the similarity and dissimilarity between the data points? 2) How to quickly and efficiently find the optimal partition?

Spectral Clustering depends solely on the number of data points and has nothing to do with the dimension, which avoids the singularity problem caused by high-dimensional eigenvectors. Spectral Clustering is also a discriminative method and does not assume a global structure. Although Spectral Clustering is an extremely competitive clustering method, it is still at an early stage, and some problems remain to be solved. Under normal circumstances, Spectral Clustering methods are evaluated from several perspectives, without fully using quantitative objective criteria. Here are the six main criteria for a Spectral Clustering algorithm:

a) the capacity to handle large data sets;
b) the ability to handle clusters of arbitrary shape, including nested data;
c) whether the results depend on the data input, namely whether the algorithm is independent of the data input order;
d) the ability to handle noise in the data;
e) whether the number of clusters and domain knowledge must be given by the users;
f) the ability to handle data with many attributes, that is, whether the algorithm is sensitive to the data dimension.

The various Spectral Clustering algorithms have their own advantages and disadvantages. Because of the complexity of real problems and the diversity of data, each algorithm can only solve a certain set of problems. Therefore, users should select an appropriate clustering algorithm based on the specific problem. In recent years, along with the development of traditional methods and of new technologies in the fields of data mining, machine learning and artificial intelligence, the Spectral Clustering algorithm has seen considerable development. It is not difficult to identify the new trends: (1) the convergent development of traditional clustering methods; (2) the emergence of new methods; (3) the targeted blending of techniques from various fields according to actual needs. In short, the Spectral Clustering algorithm combines data mining, pattern recognition, mathematics, image processing and many other areas of research. As the theory of these areas develops, improves and cross-penetrates, and as new technologies emerge, cluster analysis will develop even faster.
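As a concrete illustration of the sub-space neighborhood similarity of Section III.C, Eqs. (5) and (6) amount to building a k-nearest-neighbor membership matrix and symmetrizing it by averaging. The following sketch assumes a Euclidean neighborhood and our own helper names and neighborhood size; it is not the exact construction of [30]:

```python
import numpy as np

def knn_membership(X, k):
    """w'_sub of Eq. (6): w'(i,j) = 1 iff x_i lies in the k-nearest
    neighborhood N_sub(x_j) of x_j, else 0."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)        # a point is not its own neighbor
    Wp = np.zeros_like(d2)
    for j in range(len(X)):
        nearest = np.argsort(d2[:, j])[:k]
        Wp[nearest, j] = 1.0
    return Wp

def symmetrized(Wp):
    """w_sub of Eq. (5): average w'(i,j) and w'(j,i), giving a symmetric
    similarity even though k-NN membership itself is asymmetric."""
    return (Wp + Wp.T) / 2

# Four 1-D points; with k=1 the nearest-neighbor relation is asymmetric
# (e.g. x2's neighbor is x1, but x1's neighbor is x0), which Eq. (5) repairs.
X = np.array([[0.0], [1.0], [2.5], [9.0]])
Ws = symmetrized(knn_membership(X, k=1))
```

Entries of 1.0 mark mutual neighbors, 0.5 marks one-sided neighbors, and 0 marks non-neighbors, which is exactly the graded similarity that Eq. (5) extracts from the binary membership of Eq. (6).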
ACKNOWLEDGMENT

This work is supported by the Basic Research Program (Natural Science Foundation) of Jiangsu Province of China (No. BK2009093), the National Natural Science Foundation of China (No. 60975039), and the Opening Foundation of the Key Laboratory of Intelligent Information Processing of the Chinese Academy of Sciences (No. IIP2006-2).

REFERENCES

[1] Y. ZHU, H. YANG, L. SUN. Data Mining Technology. Nanjing: Southeast University Press, 2006.
[2] H. Qiu, Edwin R. Hancock. Graph matching and clustering using spectral partitions. Pattern Recognition, 2006, vol.39(3), pp.22-24.
[3] H. ZHAO. Study on a Number of Clustering Issues in Data Mining. Xi'an: Xi'an University of Electronic Science and Technology, 2005.
[4] X. TIAO. Cluster Analysis of Fixed Assets in the Enterprise Management Application. Jinan: Shandong Normal University, 2008.
[5] Mehmed Kantardzic; S. SHAN, Y. CHEN, Y. CHENG (translators). Data Mining: Concepts, Models, Methods and Algorithms. Beijing: Tsinghua University Press, 2003.
[6] G. MAO, L. DUAN, S. WANG. Data Mining Principles and Algorithms (Second Edition). Beijing: Tsinghua University Press, 2007.
[7] J. SUN, J. LIU, L. ZHAO. Clustering Algorithms Research. Journal of Software, 2008, vol.19(1), pp.48-61.
[8] Y. GAO, S. GU, J. TANG. Research on Spectral Clustering in Machine Learning. Computer Science, 2007, vol.34(2), pp.201-203.
[9] Ng A Y, Jordan M I, Weiss Y. On spectral clustering: Analysis and an algorithm. In: Dietterich T G, Becker S, Ghahramani Z (eds). Advances in Neural Information Processing Systems (NIPS). Cambridge: MIT Press, 2002.
[10] Y. Weiss. Segmentation using eigenvectors: A unified view. International Conference on Computer Vision, 1999.
[11] L. WANG, H. WANG, Y. LU. Text Mining and the Key Techniques and Methods. Computer Science, 2002, vol.29(12), pp.21-24.
[12] S. XU, Z. LU, G. GU. Two Spectral Algorithms to Solve the Integration Issues of the Text Clustering. Acta Automatica Sinica, 2009, vol.35(7), pp.997-1002.
[13] C. DING, X. HE, H. ZHA, M. GU. Spectral Min-Max Cut for Graph Partitioning and Data Clustering. In: Proc. of 1st IEEE Int'l Conf. on Data Mining. San Jose, CA, 2001.
[14] Inderjit Dhillon, Yuqiang Guan, Brian Kulis. A Unified View of Kernel k-means, Spectral Clustering and Graph Cuts. UTCS Technical Report #TR-04-25, 2005.
[15] Y. CHEN. Data Structures (Second Edition). Beijing: Higher Education Press, 2009.
[16] M. KONG. Spectral Analysis and Clustering Relational Graphs. Hefei: Anhui University, 2006.
[17] Z. LI, J. LIU, S. Chen. Noise Robust Spectral Clustering. International Conference on Computer Vision, 2007.
[18] Y. SU, C. JIANG, Y. ZHANG. Matrix Theory. Beijing: Science Press, 2006.
[19] W. E. Donath, A. J. Hoffman. Lower bounds for the partitioning of graphs. IBM J. Res. Develop., 1973, vol.17(1), pp.420-425.
[20] Fiedler M. Algebraic Connectivity of Graphs. Czechoslovak Mathematical Journal, 1973, vol.23(2), pp.298-305.
[21] Y. SHEN, X. SHEN, L. ZHANG. Based on the graph division of the SC algorithm in the text mining application. Computer Technology and Development, 2009.5, vol.19(5), pp.96-98.
[22] J. Shi, J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, vol.22(8), pp.888-905.
[23] Z. TIAN, X. LI, Y. JU. The perturbation analysis of the Spectral clustering. Chinese Science, 2007, vol.37(4), pp.527-543.
[24] R. GU, B. YE, W. XU. An improved spectral clustering algorithm. Computer Research and Development, 2007, vol.44, pp.145-149.
[25] M. Brand, H. Kun. A Unifying Theorem for Spectral Embedding and Clustering. Proc. of the 9th International Conference on Artificial Intelligence and Statistics. Key West, Florida, 2003.
[26] X. CAI, G. DAI, L. YANG. Survey on Spectral Clustering Algorithms. Computer Science, 2008, vol.35(7), pp.14-18.
[27] C. WANG, J. WANG, J. ZHEN. Application of Spectral Clustering in Image Retrieval. Computer Technology and Development, 2009.1, vol.19(1), pp.207-210.
[28] L. WANG, L. BO, L. JIAO. Density-Sensitive Spectral Clustering. Acta Electronica Sinica, 2007, vol.35(8), pp.1577-1581.
[29] J. Ding. Research on the Complex Structure of the Clustering and Image Segmentation. Nanjing: Nanjing University of Aeronautics and Astronautics, 2008.
[30] S. Zhou, Y. Zhao, J. Guan, et al. A Neighborhood-Based Clustering Algorithm. Proc. of PAKDD, 2005, vol.35(18), pp.361-371.
[31] H. JIN, L. ZHAO. Multilevel spectral clustering with ascertainable clustering number. CA, 2008, vol.28(5), pp.129-1231.
[32] C. WANG, W. LI, L. DING. Image Segmentation Using Spectral Clustering. Proc. of the 17th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'05). Washington, DC: IEEE Computer Society, 2005.
[33] Z. WANG, G. LIU, E. CHEN. A spectral clustering algorithm based on fuzzy K-harmonic means. CAAI Transactions on Intelligent Systems, 2009, vol.4(2), pp.95-99.
[34] W. SI, Y. QIAN. Semi-supervised clustering based on spectral clustering. Computer Applications, 2005, vol.25(6), pp.1347-1349.
[35] M. KONG, J. TANG, B. LUO. Image Clustering Based on Spectral Features of Laplacian graph. Journal of University of Science and Technology of China, 2007, vol.37(9), pp.1125-1129.
[36] S. XU, Z. LU, G. GU. Text integration issues in the spectral clustering algorithm. Control and Decision, 2009, vol.24(8), pp.1277-1280.