Evolutionary Spectral Clustering by Incorporating Temporal Smoothness

Yun Chi†   Xiaodan Song‡   Dengyong Zhou†   Koji Hino†   Belle L. Tseng†

† NEC Laboratories America, 10080 N. Wolfe Rd, SW3-350, Cupertino, CA 95014, USA
‡ NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540, USA
† {ychi,xiaodan,hino,belle}@sv.nec-labs.com, ‡ [email protected]
∗ Current address for this author: Microsoft Research, One Microsoft Way, Redmond, WA 98052, email: [email protected]
ABSTRACT

Evolutionary clustering is an emerging research area essential to important applications such as clustering dynamic Web and blog contents and clustering data streams. In evolutionary clustering, a good clustering result should fit the current data well, while simultaneously not deviate too dramatically from the recent history. To fulfill this dual purpose, a measure of temporal smoothness is integrated in the overall measure of clustering quality. In this paper, we propose two frameworks that incorporate temporal smoothness in evolutionary spectral clustering. For both frameworks, we start with intuitions gained from the well-known k-means clustering problem, and then propose and solve corresponding cost functions for the evolutionary spectral clustering problems. Our solutions to the evolutionary spectral clustering problems provide more stable and consistent clustering results that are less sensitive to short-term noises while at the same time being adaptive to long-term cluster drifts. Furthermore, we demonstrate that our methods provide the optimal solutions to the relaxed versions of the corresponding evolutionary k-means clustering problems. Performance experiments over a number of real and synthetic data sets illustrate that our evolutionary spectral clustering methods provide more robust clustering results that are not sensitive to noise and can adapt to data drifts.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

General Terms
Algorithms, Experimentation, Measurement, Theory

Keywords
Evolutionary Spectral Clustering, Temporal Smoothness, Preserving Cluster Quality, Preserving Cluster Membership, Mining Data Streams

1. INTRODUCTION

In many clustering applications, the characteristics of the objects to be clustered change over time. Very often, such characteristic change contains both a long-term trend due to concept drift and short-term variation due to noise. For example, in the blogosphere where blog sites are to be clustered (e.g., for community detection), the overall interests of a blogger and the blogger's friendship network may drift slowly over time and, simultaneously, short-term variation may be triggered by external events. As another example, in a ubiquitous computing environment, moving objects equipped with GPS sensors and wireless connections are to be clustered (e.g., for traffic jam prediction or for animal migration analysis). The coordinate of a moving object may follow a certain route in the long term, but its estimated coordinate at a given time may vary due to limitations on bandwidth and sensor accuracy.

These application scenarios, where the objects to be clustered evolve with time, raise new challenges to traditional clustering algorithms. On one hand, the current clusters should depend mainly on the current data features — aggregating all historic data features makes little sense in non-stationary scenarios. On the other hand, the current clusters should not deviate too dramatically from the most recent history. This is because in most dynamic applications we do not expect data to change too quickly and, as a consequence, we expect certain levels of temporal smoothness between clusters in successive timesteps. We illustrate this point by using the following example. Assume we want to partition 5 blogs into 2 clusters. Figure 1 shows the relationship among the 5 blogs at time t-1 and time t, where each node represents a blog and the numbers on the edges represent the similarities (e.g., the number of links) between blogs. Obviously, the blogs at time t-1 should be clustered by Cut I. The clusters at time t are not so clear. Both Cut II and Cut III partition the blogs equally well. However, according to the principle of temporal smoothness, Cut III should be preferred because it is more consistent with recent history (time t-1). Similar ideas have long been used in time series analysis [5], where moving averages are often used to smooth out short-term fluctuations. Because similar short-term variances also exist in clustering applications, either due to data noises or due to non-robust behaviors
of clustering algorithms (e.g., converging to different locally suboptimal modes), new clustering techniques are needed to handle evolving objects and to obtain stable and consistent clustering results.

[Figure 1: An evolutionary clustering scenario — five blogs A-E at timestep t-1 and timestep t, with the numbers on the edges giving pairwise similarities; Cut I is the natural partition at t-1, while Cut II and Cut III are the two candidate partitions at t.]

In this paper, we propose two evolutionary spectral clustering algorithms in which the clustering cost functions contain terms that regularize temporal smoothness. Evolutionary clustering was first formulated by Chakrabarti et al. [3], where they proposed heuristic solutions to evolutionary hierarchical clustering problems and evolutionary k-means clustering problems. In this paper, we focus on evolutionary spectral clustering algorithms under a more rigorous framework. Spectral clustering algorithms have a solid theory foundation [6] and have shown very good performance. They have been successfully applied to many areas such as document clustering [22, 15], image segmentation [19, 21], and Web/blog clustering [9, 18]. Spectral clustering algorithms can be considered as solving certain graph partition problems, where different graph-based measures are to be optimized. Based on this observation, we define the cost functions in our evolutionary spectral clustering algorithms by using the graph-based measures and derive corresponding (relaxed) optimal solutions. At the same time, it has been shown that these graph partition problems have close connections to different variations of the k-means clustering problem. Through these connections, we demonstrate that our evolutionary spectral clustering algorithms provide solutions to the corresponding evolutionary k-means clustering problems as special cases.

In summary, our main contributions in this paper are the following:

1. We propose two frameworks for evolutionary spectral clustering in which the temporal smoothness is incorporated into the overall clustering quality. To the best of our knowledge, our frameworks are the first evolutionary versions of the spectral clustering algorithms.

2. We derive optimal solutions to the relaxed versions of the proposed evolutionary spectral clustering frameworks. Because the unrelaxed versions are shown to be NP-hard, our solutions provide both practical ways of obtaining the final clusters and bounds on the performance of the algorithms.

3. We also introduce extensions to our algorithms to handle the case where the number of clusters changes with time and the case where new data points are inserted and old ones are removed over time.

1.1 Related Work

As stated in [3], evolutionary clustering is a fairly new topic formulated in 2006. However, it has close relationships with other research areas such as clustering data streams, incremental clustering, and constrained clustering.

In clustering data streams, the large amount of data that arrives at a high rate makes it impractical to store all the data in memory or to scan them multiple times. Under such a new data model, many researchers have investigated issues such as how to efficiently cluster massive data sets by using limited memory and one-pass scanning of data [12], and how to cluster evolving data streams under multiple resolutions so that a user can query any historic time period with guaranteed accuracy [1]. Clustering data streams is related to our work in that data in data streams evolve with time. However, instead of the scalability and one-pass-access issues, we focus on how to obtain clusters that evolve smoothly over time, an issue that has not been studied in the above works.

Incremental clustering algorithms are also related to our work. There exists a large research literature on incremental clustering algorithms, whose main task is to efficiently apply dynamic updates to the cluster centers [13], medoids [12], or hierarchical trees [4] when new data points arrive. However, in most of these studies, newly arrived data points have no direct relationship with existing data points, other than that they probably share similar statistical characteristics. In comparison, our study mainly focuses on the case when the similarity among existing data points varies with time, although we can also handle insertion and removal of data points over time. In [16], an algorithm is proposed to cluster moving objects based on a novel concept of micro-clustering. In [18], an incremental spectral clustering algorithm is proposed to handle similarity changes among objects that evolve with time. However, the focus of both [16] and [18] is to improve computation efficiency at the cost of lower cluster quality.

There is also a large body of work on constrained clustering. In these studies, either hard constraints such as cannot-links and must-links [20] or soft constraints such as prior preferences [15] are incorporated in the clustering task. In comparison, in our work the constraints are not given a priori. Instead, we set our goal to optimize a cost function that incorporates temporal smoothness. As a consequence, some soft constraints are automatically implied when historic data and clusters are connected with current ones.

Our work is especially inspired by the work by Chakrabarti et al. [3], in which they propose an evolutionary hierarchical clustering algorithm and an evolutionary k-means clustering algorithm. We mainly discuss the latter because of its connection to spectral clustering. Chakrabarti et al. proposed to measure the temporal smoothness by a distance between the clusters at time t and those at time t-1. Their cluster distance is defined by (1) pairing each centroid at t to its nearest peer at t-1 and (2) summing the distances between all paired centroids. We believe that such a distance has two weak points. First, the pairing procedure is based on heuristics and it could be unstable (a small perturbation on the centroids may change the pairing dramatically). Second, because it ignores the fact that the same data points are to be clustered in both t and t-1, this distance may be sensitive to movements of the data points such as shifts and rotations (e.g., consider a fleet of vehicles that move together while the relative distances among them remain the same).
2. NOTATIONS AND BACKGROUND

First, a word about notation. Capital letters, such as W and Z, represent matrices. Lower-case letters in vector form, such as $\vec{v}_i$ and $\vec{\mu}_l$, represent column vectors. Scripted letters, such as $\mathcal{V}$ and $\mathcal{V}_p$, represent sets. For easy presentation, for a given variable, such as W and $\vec{v}_i$, we attach a subscript t, i.e., $W_t$ and $\vec{v}_{i,t}$, to represent the value of the variable at time t. We use Tr(W) to represent the trace of W, where $\mathrm{Tr}(W) = \sum_i W(i,i)$. In addition, for a matrix $X \in \mathbb{R}^{n \times k}$, we use span(X) to represent the subspace spanned by the columns of X. For vector norms we use the Euclidean norm and for matrix norms we use the Frobenius norm, i.e., $\|W\|^2 = \sum_{i,j} W(i,j)^2 = \mathrm{Tr}(W^T W)$.

2.1 The clustering problem

We state the clustering problem in the following way. For a set $\mathcal{V}$ of n nodes, a clustering result is a partition $\{\mathcal{V}_1, \dots, \mathcal{V}_k\}$ of the nodes in $\mathcal{V}$ such that $\mathcal{V} = \cup_{l=1}^{k} \mathcal{V}_l$ and $\mathcal{V}_p \cap \mathcal{V}_q = \emptyset$ for $1 \le p, q \le k$, $p \ne q$. A partition (clustering result) can be equivalently represented as an n-by-k matrix Z whose elements are in {0, 1}, where Z(i, j) = 1 if and only if node i belongs to cluster j. Obviously, $Z \cdot \vec{1}_k = \vec{1}_n$, where $\vec{1}_k$ and $\vec{1}_n$ are k-dimensional and n-dimensional vectors of all ones. In addition, we can see that the columns of Z are orthogonal. Furthermore, we normalize Z in the following way: we divide the l-th column of Z by $\sqrt{|\mathcal{V}_l|}$ to get $\tilde{Z}$, where $|\mathcal{V}_l|$ is the size of $\mathcal{V}_l$. Note that the columns of $\tilde{Z}$ are orthonormal, i.e., $\tilde{Z}^T \tilde{Z} = I_k$.

2.2 K-means clustering

The k-means clustering problem is one of the most widely studied clustering problems. Assume the i-th node in $\mathcal{V}$ can be represented by an m-dimensional feature vector $\vec{v}_i \in \mathbb{R}^m$; then the k-means clustering problem is to find a partition $\{\mathcal{V}_1, \dots, \mathcal{V}_k\}$ that minimizes the following measure

$KM = \sum_{l=1}^{k} \sum_{i \in \mathcal{V}_l} \|\vec{v}_i - \vec{\mu}_l\|^2$   (1)

where $\vec{\mu}_l$ is the centroid (mean) of the l-th cluster, i.e., $\vec{\mu}_l = \sum_{j \in \mathcal{V}_l} \vec{v}_j / |\mathcal{V}_l|$.

A well-known algorithm for the k-means clustering problem is the so-called k-means algorithm, in which, after initially picking k centroids at random, the following procedure is repeated until convergence: all the data points are assigned to the clusters whose centroids are nearest to them, and then the cluster centroids are updated by taking the average of the data points assigned to them.

2.3 Spectral clustering

The basic idea of spectral clustering is to cluster based on the eigenvectors of a (possibly normalized) similarity matrix W defined on the set of nodes in $\mathcal{V}$. Very often W is positive semi-definite. Commonly used similarities include the inner product of the feature vectors, $W(i,j) = \vec{v}_i^T \vec{v}_j$, the diagonally-scaled Gaussian similarity, $W(i,j) = \exp(-(\vec{v}_i - \vec{v}_j)^T \mathrm{diag}(\vec{\gamma})(\vec{v}_i - \vec{v}_j))$, and the affinity matrices of graphs.

Spectral clustering algorithms usually solve graph partitioning problems where different graph-based measures are to be optimized. Two popular measures are to maximize the average association and to minimize the normalized cut [19]. For two subsets, $\mathcal{V}_p$ and $\mathcal{V}_q$, of the node set $\mathcal{V}$ (where $\mathcal{V}_p$ and $\mathcal{V}_q$ do not have to be disjoint), we first define the association between $\mathcal{V}_p$ and $\mathcal{V}_q$ as $\mathrm{assoc}(\mathcal{V}_p, \mathcal{V}_q) = \sum_{i \in \mathcal{V}_p, j \in \mathcal{V}_q} W(i,j)$. Then we can write the k-way average association as

$AA = \sum_{l=1}^{k} \mathrm{assoc}(\mathcal{V}_l, \mathcal{V}_l) / |\mathcal{V}_l|$   (2)

and the k-way normalized cut as

$NC = \sum_{l=1}^{k} \mathrm{assoc}(\mathcal{V}_l, \mathcal{V} \setminus \mathcal{V}_l) / \mathrm{assoc}(\mathcal{V}_l, \mathcal{V})$   (3)

where $\mathcal{V} \setminus \mathcal{V}_l$ is the complement of $\mathcal{V}_l$ in $\mathcal{V}$. For consistency, we further define the negated average association as

$NA = \mathrm{Tr}(W) - AA = \mathrm{Tr}(W) - \sum_{l=1}^{k} \mathrm{assoc}(\mathcal{V}_l, \mathcal{V}_l) / |\mathcal{V}_l|$   (4)

where, as will be shown later, NA is always non-negative if W is positive semi-definite. In the remainder of the paper, instead of maximizing AA, we equivalently aim to minimize NA, and as a result, all three objective functions, KM, NA, and NC, are to be minimized.

Finding the optimal partition Z for either the negated average association or the normalized cut is NP-hard [19]. Therefore, in spectral clustering algorithms, usually a relaxed version of the optimization problem is solved by (1) computing eigenvectors X of some variation of the similarity matrix W, (2) projecting all data points into span(X), and (3) applying the k-means algorithm to the projected data points to obtain the clustering result. While it may seem nonintuitive to apply spectral analysis and then again use the k-means algorithm, it has been shown that such procedures have many advantages, such as working well in cases when the data points are not linearly separable [17]. The focus of our paper is on step (1). For steps (2) and (3) we follow the standard procedures in traditional spectral clustering and thus will not give more details on them.

3. EVOLUTIONARY SPECTRAL CLUSTERING — TWO FRAMEWORKS

In this section we propose two frameworks for evolutionary spectral clustering. We first describe the basic idea.

3.1 Basic Idea

We define a general cost function to measure the quality of a clustering result on evolving data points. The function contains two costs. The first cost, the snapshot cost (CS), only measures the snapshot quality of the current clustering result with respect to the current data features, where a higher snapshot cost means worse snapshot quality. The second cost, the temporal cost (CT), measures the temporal smoothness in terms of the goodness-of-fit of the current clustering result with respect to either historic data features or historic clustering results, where a higher temporal cost means worse temporal smoothness. The overall cost function¹ is defined as a linear combination of these two costs:

$Cost = \alpha \cdot CS + \beta \cdot CT$   (5)

where 0 ≤ α ≤ 1 is a parameter assigned by the user and, together with β (= 1 − α), they reflect the user's emphasis on the snapshot cost and the temporal cost, respectively.

¹ Our general cost function is equivalent to the one defined in [3], differing only by a constant factor and a negative sign.
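Both frameworks developed below end up feeding a suitably combined matrix into the standard relaxed procedure of Section 2.3. As a reference point, the following Python sketch is our own illustration of that three-step procedure (not code from the paper; it assumes NumPy and scikit-learn are available, and the function name is ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def relaxed_spectral_clustering(M, k, random_state=0):
    """Relaxed spectral clustering of n nodes given a symmetric n x n matrix M
    (e.g., W for the negated average association, or D^{-1/2} W D^{-1/2} for
    the normalized cut)."""
    # Step (1): top-k eigenvectors of M; np.linalg.eigh returns eigenvalues in
    # ascending order, so the last k columns are the top-k eigenvectors.
    _, eigvecs = np.linalg.eigh(M)
    X = eigvecs[:, -k:]
    # Step (2): row i of X is the projection of node i into span(X).
    # Step (3): k-means on the projected points yields the discrete partition.
    return KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
```

The evolutionary variants derived in Sections 3.2 and 3.3 only change the matrix M that is passed to this routine.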
In both frameworks that we propose, for a current partition (clustering result), the snapshot cost CS is measured by the clustering quality when the partition is applied to the current data. The two frameworks differ in how the temporal cost CT is defined. In the first framework, which we name PCQ for preserving cluster quality, the current partition is applied to historic data and the resulting cluster quality determines the temporal cost. In the second framework, which we name PCM for preserving cluster membership, the current partition is directly compared with the historic partition and the resulting difference determines the temporal cost.

In the discussion of both frameworks, we first use the k-means clustering problem, Equation (1), as a motivational example and then formulate the corresponding evolutionary spectral clustering problems (both NA and NC). We also provide the optimal solutions to the relaxed versions of the evolutionary spectral clustering problems and show how they relate back to the evolutionary k-means clustering problem. In addition, in this section we focus on a special case where the number of clusters does not change with time and neither does the number of nodes to be clustered. We will discuss the more general cases in the next section.

3.2 Preserving Cluster Quality (PCQ)

In the first framework, PCQ, the temporal cost is expressed as how well the current partition clusters historic data. We illustrate this through an example. Assume that two partitions, $Z_t$ and $Z_t'$, cluster the current data at time t equally well. However, to cluster the historic data at time t-1, the clustering performance using partition $Z_t$ is better than that using partition $Z_t'$. In such a case, $Z_t$ is preferred over $Z_t'$ because $Z_t$ is more consistent with historic data. We formalize this idea for the k-means clustering problem using the following overall cost function

$Cost_{KM} = \alpha \cdot CS_{KM} + \beta \cdot CT_{KM} = \alpha \cdot KM_t\big|_{Z_t} + \beta \cdot KM_{t-1}\big|_{Z_t}$
$\qquad = \alpha \cdot \sum_{l=1}^{k} \sum_{i \in \mathcal{V}_{l,t}} \|\vec{v}_{i,t} - \vec{\mu}_{l,t}\|^2 + \beta \cdot \sum_{l=1}^{k} \sum_{i \in \mathcal{V}_{l,t}} \|\vec{v}_{i,t-1} - \vec{\mu}_{l,t-1}\|^2$   (6)

where $\cdot\big|_{Z_t}$ means "evaluated by the partition $Z_t$, where $Z_t$ is computed at time t" and $\vec{\mu}_{l,t-1} = \sum_{j \in \mathcal{V}_{l,t}} \vec{v}_{j,t-1} / |\mathcal{V}_{l,t}|$. Note that in the formula of $CT_{KM}$, the inner summation is over all data points in $\mathcal{V}_{l,t}$, the clusters at time t. That is, although the feature values used in the summation are those at time t-1 (i.e., the $\vec{v}_{i,t-1}$'s), the partition used is that at time t (i.e., $Z_t$). As a result, this cost $CT_{KM} = KM_{t-1}\big|_{Z_t}$ penalizes those clustering results (at t) that do not fit well with recent historic data (at t-1) and therefore promotes temporal smoothness of clusters.

3.2.1 Negated Average Association

We now formulate the PCQ framework for evolutionary spectral clustering. We start with the case of the negated average association. Following the idea of Equation (6), at time t, for a given partition $Z_t$, a natural definition of the overall cost is

$Cost_{NA} = \alpha \cdot CS_{NA} + \beta \cdot CT_{NA} = \alpha \cdot NA_t\big|_{Z_t} + \beta \cdot NA_{t-1}\big|_{Z_t}$   (7)

The above cost function is almost identical to Equation (6), except that the cluster quality is measured by the negated average association NA rather than the k-means cost KM.

Next, we derive a solution to minimizing $Cost_{NA}$. First, it can be easily shown that the negated average association defined in Equation (4) can be equivalently written as

$NA = \mathrm{Tr}(W) - \mathrm{Tr}(\tilde{Z}^T W \tilde{Z})$   (8)

Therefore² we write the overall cost (7) as

$Cost_{NA} = \alpha \cdot [\mathrm{Tr}(W_t) - \mathrm{Tr}(\tilde{Z}_t^T W_t \tilde{Z}_t)] + \beta \cdot [\mathrm{Tr}(W_{t-1}) - \mathrm{Tr}(\tilde{Z}_t^T W_{t-1} \tilde{Z}_t)]$
$\qquad = \mathrm{Tr}(\alpha W_t + \beta W_{t-1}) - \mathrm{Tr}\big[\tilde{Z}_t^T (\alpha W_t + \beta W_{t-1}) \tilde{Z}_t\big]$   (9)

Notice that the first term $\mathrm{Tr}(\alpha W_t + \beta W_{t-1})$ is a constant independent of the clustering partition and, as a result, minimizing $Cost_{NA}$ is equivalent to maximizing the trace $\mathrm{Tr}[\tilde{Z}_t^T (\alpha W_t + \beta W_{t-1}) \tilde{Z}_t]$, subject to $\tilde{Z}_t$ being a normalized indicator matrix (cf. Section 2.1). Because maximizing the average association is an NP-hard problem, finding the solution $\tilde{Z}_t$ that minimizes $Cost_{NA}$ is also NP-hard. So, following most spectral clustering algorithms, we relax $\tilde{Z}_t$ to $X_t \in \mathbb{R}^{n \times k}$ with $X_t^T X_t = I_k$. It is well known [11] that one solution to this relaxed optimization problem is the matrix $X_t$ whose columns are the k eigenvectors associated with the top-k eigenvalues of the matrix $\alpha W_t + \beta W_{t-1}$. Therefore, after computing the solution $X_t$ we can project the data points into span($X_t$) and then apply k-means to obtain a solution to the evolutionary spectral clustering problem under the measure of negated average association. In addition, the value $\mathrm{Tr}(\alpha W_t + \beta W_{t-1}) - \mathrm{Tr}[X_t^T (\alpha W_t + \beta W_{t-1}) X_t]$ provides a lower bound on the cost of the evolutionary clustering problem.

Moreover, Zha et al. [22] have shown a close connection between the k-means clustering problem and spectral clustering algorithms — they proved that if we put the m-dimensional feature vectors of the n data points in $\mathcal{V}$ into an m-by-n matrix $A = (\vec{v}_1, \dots, \vec{v}_n)$, then

$KM = \mathrm{Tr}(A^T A) - \mathrm{Tr}(\tilde{Z}^T A^T A \tilde{Z})$   (10)

Comparing Equations (10) and (8), we can see that the k-means clustering problem is a special case of the negated average association spectral clustering problem, where the similarity matrix W is defined by the inner product $A^T A$. As a consequence, our solution to the NA evolutionary spectral clustering problem can also be applied to solve the evolutionary k-means clustering problem in the PCQ framework, i.e., under the cost function defined in Equation (6).

² Here we can show that NA is non-negative when W is positive semi-definite: we have $\tilde{Z}^T \tilde{Z} = I_k$ and $\mathrm{Tr}(W) = \sum_{i=1}^{n} \lambda_i$, where the $\lambda_i$'s are the eigenvalues of W ordered by decreasing magnitude. Therefore, by Fan's theorem [10], which says that $\max_{X \in \mathbb{R}^{n \times k},\, X^T X = I_k} \mathrm{Tr}(X^T W X) = \sum_{j=1}^{k} \lambda_j$, we can derive from (8) that $NA \ge \sum_{j=k+1}^{n} \lambda_j \ge 0$ if W is positive semi-definite.
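As a concrete, hedged illustration (our own code, not the authors' implementation), the PCQ solution under the negated average association just feeds the combined matrix of Equation (9) to the relaxed_spectral_clustering helper sketched in Section 3.1:

```python
def pcq_na(W_t, W_prev, k, alpha):
    """PCQ under negated average association: cluster with the combined
    similarity matrix alpha * W_t + (1 - alpha) * W_{t-1} (a sketch)."""
    beta = 1.0 - alpha
    M = alpha * W_t + beta * W_prev
    return relaxed_spectral_clustering(M, k)
```

Setting alpha = 1 recovers ordinary spectral clustering on the current snapshot alone.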
3.2.2 Normalized Cut

For the normalized cut, we extend the idea of Equation (6) similarly. By replacing the KM in Equation (6) with NC, we define the overall cost for the evolutionary normalized cut to be

$Cost_{NC} = \alpha \cdot CS_{NC} + \beta \cdot CT_{NC} = \alpha \cdot NC_t\big|_{Z_t} + \beta \cdot NC_{t-1}\big|_{Z_t}$   (11)

Shi et al. [19] have proved that computing the optimal solution that minimizes the normalized cut is NP-hard. As a result, finding an indicator matrix $Z_t$ that minimizes $Cost_{NC}$ is also NP-hard. We now provide an optimal solution to a relaxed version of the problem. Bach et al. [2] proved that for a given partition Z, the normalized cut can be equivalently written as

$NC = k - \mathrm{Tr}\big[Y^T D^{-1/2} W D^{-1/2} Y\big]$   (12)

where D is a diagonal matrix with $D(i,i) = \sum_{j=1}^{n} W(i,j)$ and Y is any matrix in $\mathbb{R}^{n \times k}$ that satisfies two conditions: (a) the columns of $D^{-1/2} Y$ are piecewise constant with respect to Z and (b) $Y^T Y = I_k$. We remove constraint (a) to get a relaxed version of the optimization problem

$Cost_{NC} \approx \alpha \cdot k - \alpha \cdot \mathrm{Tr}\big[X_t^T D_t^{-1/2} W_t D_t^{-1/2} X_t\big] + \beta \cdot k - \beta \cdot \mathrm{Tr}\big[X_t^T D_{t-1}^{-1/2} W_{t-1} D_{t-1}^{-1/2} X_t\big]$
$\qquad = k - \mathrm{Tr}\big[X_t^T \big(\alpha D_t^{-1/2} W_t D_t^{-1/2} + \beta D_{t-1}^{-1/2} W_{t-1} D_{t-1}^{-1/2}\big) X_t\big]$   (13)

for some $X_t \in \mathbb{R}^{n \times k}$ such that $X_t^T X_t = I_k$. Again we have a trace maximization problem, and a solution is the matrix $X_t$ whose columns are the k eigenvectors associated with the top-k eigenvalues of the matrix $\alpha D_t^{-1/2} W_t D_t^{-1/2} + \beta D_{t-1}^{-1/2} W_{t-1} D_{t-1}^{-1/2}$. And again, after obtaining $X_t$, we can further project the data points into span($X_t$) and then apply the k-means algorithm to obtain the final clusters.

Moreover, Dhillon et al. [8] have proved that the normalized cut approach can be used to minimize the cost function of a weighted kernel k-means problem. As a consequence, our evolutionary spectral clustering algorithm can also be applied to solve the evolutionary version of the weighted kernel k-means clustering problem.

3.2.3 Discussion on the PCQ Framework

The PCQ evolutionary clustering framework provides a data clustering technique similar to the moving-average framework in time series analysis, in which short-term fluctuation is expected to be smoothed out. The solutions to the PCQ framework turn out to be very intuitive — the historic similarity matrix is scaled and combined with the current similarity matrix, and the new combined similarity matrix is fed to traditional spectral clustering algorithms.

Notice that one assumption we have used in the above derivation is that the temporal cost is determined by data at time t-1 only. However, the PCQ framework can be easily extended to cover longer historic data by including similarity matrices W at older times, probably with different weights (e.g., scaled by an exponentially decaying factor to emphasize more recent history).

3.3 Preserving Cluster Membership (PCM)

The second framework of evolutionary spectral clustering, PCM, differs from the first framework, PCQ, in how the temporal cost is measured. In this second framework, the temporal cost is expressed as the difference between the current partition and the historic partition. We again illustrate this with an example. Assume that two partitions, $Z_t$ and $Z_t'$, cluster the current data at time t equally well. However, when compared to the historic partition $Z_{t-1}$, $Z_t$ is much more similar to $Z_{t-1}$ than $Z_t'$ is. In such a case, $Z_t$ is preferred over $Z_t'$ because $Z_t$ is more consistent with the historic partition.

We first formalize this idea for the evolutionary k-means problem. For convenience of discussion, we write the current partition as $Z_t = \{\mathcal{V}_{1,t}, \dots, \mathcal{V}_{k,t}\}$ and the historic partition as $Z_{t-1} = \{\mathcal{V}_{1,t-1}, \dots, \mathcal{V}_{k,t-1}\}$. Now we want to define a measure for the difference between $Z_t$ and $Z_{t-1}$. Comparing two partitions has long been studied in the literature of classification and clustering. Here we use the traditional chi-square statistic [14] to represent the distance between two partitions

$\chi^2(Z_t, Z_{t-1}) = n\Big(\sum_{i=1}^{k}\sum_{j=1}^{k} \frac{|\mathcal{V}_{ij}|^2}{|\mathcal{V}_{i,t}| \cdot |\mathcal{V}_{j,t-1}|} - 1\Big)$

where $|\mathcal{V}_{ij}|$ is the number of nodes that are both in $\mathcal{V}_{i,t}$ (at time t) and in $\mathcal{V}_{j,t-1}$ (at time t-1). Actually, in the above definition, the number of clusters k does not have to be the same at time t and t-1, and we will come back to this point in the next section. By ignoring the constant shift of −1 and the constant scaling n, we define the temporal cost for the k-means clustering problem as

$CT_{KM} = -\sum_{i=1}^{k}\sum_{j=1}^{k} \frac{|\mathcal{V}_{ij}|^2}{|\mathcal{V}_{i,t}| \cdot |\mathcal{V}_{j,t-1}|}$   (14)

where the negative sign is because we want to minimize $CT_{KM}$. The overall cost can be written as

$Cost_{KM} = \alpha \cdot CS_{KM} + \beta \cdot CT_{KM} = \alpha \cdot \sum_{l=1}^{k}\sum_{i \in \mathcal{V}_{l,t}} \|\vec{v}_{i,t} - \vec{\mu}_{l,t}\|^2 - \beta \cdot \sum_{i=1}^{k}\sum_{j=1}^{k} \frac{|\mathcal{V}_{ij}|^2}{|\mathcal{V}_{i,t}| \cdot |\mathcal{V}_{j,t-1}|}$   (15)

3.3.1 Negated Average Association

Recall that in the case of the negated average association, minimizing NA amounts to maximizing $\mathrm{Tr}(\tilde{Z}^T W \tilde{Z})$, where $\tilde{Z}$ is further relaxed to a continuous-valued X whose columns are the k eigenvectors associated with the top-k eigenvalues of W. So in the PCM framework, we shall define a distance dist($X_t$, $X_{t-1}$) between $X_t$, the set of eigenvectors at time t, and $X_{t-1}$, the set of eigenvectors at time t-1. However, there is a subtle point — for a solution $X \in \mathbb{R}^{n \times k}$ that maximizes $\mathrm{Tr}(X^T W X)$, any $X' = XQ$ is also a solution, where $Q \in \mathbb{R}^{k \times k}$ is an arbitrary orthogonal matrix. This is because $\mathrm{Tr}(X^T W X) = \mathrm{Tr}(X^T W X Q Q^T) = \mathrm{Tr}((XQ)^T W (XQ)) = \mathrm{Tr}(X'^T W X')$. Therefore we want a distance dist($X_t$, $X_{t-1}$) that is invariant with respect to the rotation Q. One such solution, according to [11], is the norm of the difference between the two projection matrices, i.e.,

$\mathrm{dist}(X_t, X_{t-1}) = \tfrac{1}{2}\,\|X_t X_t^T - X_{t-1} X_{t-1}^T\|^2$   (16)

which essentially measures the distance between span($X_t$) and span($X_{t-1}$).
Furthermore, in Equation (16) the number of columns in $X_t$ does not have to be the same as that in $X_{t-1}$, and we will discuss this in the next section.

By using this distance to quantify the temporal cost, we derive the total cost for the negated average association as

$Cost_{NA} = \alpha \cdot CS_{NA} + \beta \cdot CT_{NA}$
$\qquad = \alpha \cdot \big[\mathrm{Tr}(W_t) - \mathrm{Tr}(X_t^T W_t X_t)\big] + \tfrac{\beta}{2}\,\|X_t X_t^T - X_{t-1} X_{t-1}^T\|^2$
$\qquad = \alpha \cdot \big[\mathrm{Tr}(W_t) - \mathrm{Tr}(X_t^T W_t X_t)\big] + \tfrac{\beta}{2}\,\mathrm{Tr}\big[(X_t X_t^T - X_{t-1} X_{t-1}^T)^T (X_t X_t^T - X_{t-1} X_{t-1}^T)\big]$
$\qquad = \alpha \cdot \big[\mathrm{Tr}(W_t) - \mathrm{Tr}(X_t^T W_t X_t)\big] + \tfrac{\beta}{2}\,\mathrm{Tr}\big(X_t X_t^T X_t X_t^T - 2 X_t X_t^T X_{t-1} X_{t-1}^T + X_{t-1} X_{t-1}^T X_{t-1} X_{t-1}^T\big)$
$\qquad = \alpha \cdot \big[\mathrm{Tr}(W_t) - \mathrm{Tr}(X_t^T W_t X_t)\big] + \beta k - \beta\,\mathrm{Tr}\big[X_t^T X_{t-1} X_{t-1}^T X_t\big]$
$\qquad = \alpha \cdot \mathrm{Tr}(W_t) + \beta \cdot k - \mathrm{Tr}\big[X_t^T (\alpha W_t + \beta X_{t-1} X_{t-1}^T) X_t\big]$   (17)

Therefore, an optimal solution that minimizes $Cost_{NA}$ is the matrix $X_t$ whose columns are the k eigenvectors associated with the top-k eigenvalues of the matrix $\alpha W_t + \beta X_{t-1} X_{t-1}^T$. After getting $X_t$, the following steps are the same as before.

Furthermore, it can be shown that the un-relaxed version of the distance defined in Equation (16) for spectral clustering is equal to the temporal cost defined in Equation (15) for k-means clustering up to a constant shift. That is, it can be shown (cf. Bach et al. [2]) that

$\tfrac{1}{2}\,\|\tilde{Z}_t \tilde{Z}_t^T - \tilde{Z}_{t-1} \tilde{Z}_{t-1}^T\|^2 = k - \sum_{i=1}^{k}\sum_{j=1}^{k} \frac{|\mathcal{V}_{ij}|^2}{|\mathcal{V}_{i,t}| \cdot |\mathcal{V}_{j,t-1}|}$   (18)

As a result, the evolutionary spectral clustering based on negated average association in the PCM framework provides a relaxed solution to the evolutionary k-means clustering problem defined in the PCM framework, i.e., Equation (15).

3.3.2 Normalized Cut

It is straightforward to extend the PCM framework from the negated average association to the normalized cut as

$Cost_{NC} = \alpha \cdot CS_{NC} + \beta \cdot CT_{NC}$
$\qquad = \alpha \cdot k - \alpha \cdot \mathrm{Tr}\big[X_t^T D_t^{-1/2} W_t D_t^{-1/2} X_t\big] + \tfrac{\beta}{2}\,\|X_t X_t^T - X_{t-1} X_{t-1}^T\|^2$
$\qquad = k - \mathrm{Tr}\big[X_t^T \big(\alpha D_t^{-1/2} W_t D_t^{-1/2} + \beta X_{t-1} X_{t-1}^T\big) X_t\big]$   (19)

Therefore, an optimal solution that minimizes $Cost_{NC}$ is the matrix $X_t$ whose columns are the k eigenvectors associated with the top-k eigenvalues of the matrix $\alpha D_t^{-1/2} W_t D_t^{-1/2} + \beta X_{t-1} X_{t-1}^T$. After obtaining $X_t$, the subsequent steps are the same as before.

It is worth mentioning that in the PCM framework, $Cost_{NC}$ has an advantage over $Cost_{NA}$ in terms of the ease of selecting an appropriate α. In $Cost_{NA}$, the two terms $CS_{NA}$ and $CT_{NA}$ are of different scales — $CS_{NA}$ measures a sum of variances and $CT_{NA}$ measures some probability distribution. Consequently, this difference needs to be considered when choosing α. In contrast, for $Cost_{NC}$, because $CS_{NC}$ is normalized, both $D_t^{-1/2} W_t D_t^{-1/2}$ and $X_{t-1} X_{t-1}^T$ have the same 2-norm scale, for both matrices have $\lambda_{max} = 1$. Therefore, the two terms $CS_{NC}$ and $CT_{NC}$ are comparable and α can be selected in a straightforward way.

3.3.3 Discussion on the PCM Framework

In the PCM evolutionary clustering framework, all historic data are taken into consideration (with different weights) — $X_t$ partly depends on $X_{t-1}$, which in turn partly depends on $X_{t-2}$, and so on. Let us look at two extreme cases. When α approaches 1, the temporal cost becomes unimportant and, as a result, the clusters are computed at each time window independently of other time windows. On the other hand, when α approaches 0, the eigenvectors in all time windows are required to be identical. Then the problem becomes a special case of the higher-order singular value decomposition problem [7], in which singular vectors are computed for the three modes (the rows of W, the columns of W, and the timeline) of a data tensor W, where W is constructed by concatenating the $W_t$'s along the timeline.

In addition, if the similarity matrix $W_t$ is positive semi-definite, then $\alpha D_t^{-1/2} W_t D_t^{-1/2} + \beta X_{t-1} X_{t-1}^T$ is also positive semi-definite, because both $D_t^{-1/2} W_t D_t^{-1/2}$ and $X_{t-1} X_{t-1}^T$ are positive semi-definite.

3.4 Comparing Frameworks PCQ and PCM

Now we compare the PCQ and PCM frameworks. For simplicity of discussion, we only consider time slots t and t-1 and ignore older history.

In terms of the temporal cost, PCQ aims to maximize $\mathrm{Tr}(X_t^T W_{t-1} X_t)$ while for PCM, $\mathrm{Tr}(X_t^T X_{t-1} X_{t-1}^T X_t)$ is to be maximized. However, these two are closely connected. By applying the eigen-decomposition to $W_{t-1}$, we have

$X_t^T W_{t-1} X_t = X_t^T (X_{t-1}, X_{t-1}^{\perp})\, \Lambda_{t-1}\, (X_{t-1}, X_{t-1}^{\perp})^T X_t$

where $\Lambda_{t-1}$ is a diagonal matrix whose diagonal elements are the eigenvalues of $W_{t-1}$ ordered by decreasing magnitude, and $X_{t-1}$ and $X_{t-1}^{\perp}$ are the eigenvectors associated with the first k and the remaining n−k eigenvalues of $W_{t-1}$, respectively. It can be easily verified that both $\mathrm{Tr}(X_t^T W_{t-1} X_t)$ and $\mathrm{Tr}(X_t^T X_{t-1} X_{t-1}^T X_t)$ are maximized when $X_t = X_{t-1}$ (or, more rigorously, when span($X_t$) = span($X_{t-1}$)). The differences between PCQ and PCM are (a) whether the eigenvectors associated with the smaller eigenvalues (other than the top k) are considered and (b) the level of penalty when $X_t$ deviates from $X_{t-1}$. For PCQ, all the eigenvectors are considered and their deviations between time t and t-1 are penalized according to the corresponding eigenvalues. For PCM, rather than all eigenvectors, only the first k eigenvectors are considered and they are treated equally. In other words, in the PCM framework, other than the historic cluster membership, all details about historic data are ignored.

Although by keeping only the historic cluster membership PCM introduces more information loss, there may be benefits in other aspects. For example, the CT part in the PCM framework does not necessarily have to be a temporal cost — it can represent any prior knowledge about cluster membership. For example, we can cluster blogs purely based on interlinks. However, other information such as the content of the blogs and the demographic data about the bloggers may provide valuable prior knowledge about cluster membership that can be incorporated into the clustering. The PCM framework can handle such information fusion easily.
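Operationally, the remaining variants derived in Sections 3.2.2, 3.3.1, and 3.3.2 follow the same eigenvector-plus-k-means recipe with different combined matrices. The following is a hedged summary sketch (our own code and naming, assuming NumPy and scikit-learn; not the authors' implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def normalize_similarity(W, eps=1e-12):
    """Return D^{-1/2} W D^{-1/2}, where D is the diagonal degree matrix of W."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), eps))
    return W * np.outer(d_inv_sqrt, d_inv_sqrt)

def _top_k_eigenvectors(M, k):
    """Top-k eigenvectors of a symmetric matrix (eigh sorts eigenvalues ascending)."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -k:]

def pcq_nc(W_t, W_prev, k, alpha):
    """PCQ, normalized cut (Eq. (13)): combine the two *normalized* matrices."""
    M = alpha * normalize_similarity(W_t) + (1 - alpha) * normalize_similarity(W_prev)
    return KMeans(n_clusters=k, n_init=10).fit_predict(_top_k_eigenvectors(M, k))

def pcm_na(W_t, X_prev, k, alpha):
    """PCM, negated average association (Eq. (17)); X_prev holds the eigenvectors
    kept from time t-1 (its column count may differ from k)."""
    M = alpha * W_t + (1 - alpha) * (X_prev @ X_prev.T)
    X_t = _top_k_eigenvectors(M, k)                 # keep X_t for time t+1
    return KMeans(n_clusters=k, n_init=10).fit_predict(X_t), X_t

def pcm_nc(W_t, X_prev, k, alpha):
    """PCM, normalized cut (Eq. (19))."""
    M = alpha * normalize_similarity(W_t) + (1 - alpha) * (X_prev @ X_prev.T)
    X_t = _top_k_eigenvectors(M, k)
    return KMeans(n_clusters=k, n_init=10).fit_predict(X_t), X_t
```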
4. GENERALIZATION

There are two assumptions in the PCQ and the PCM frameworks proposed in the last section. First, we assumed that the number of clusters remains the same over all time. Second, we assumed that the same set of nodes is to be clustered in all timesteps. Both assumptions are too restrictive in many applications. In this section, we extend our frameworks to handle the issues of variation in cluster numbers and insertion/removal of nodes over time.

4.1 Variation in Cluster Numbers

In our discussions so far, we have assumed that the number of clusters k does not change with time. However, keeping a fixed k over all time windows is a very strong restriction. Determining the number of clusters is an important research problem in clustering and there are many effective methods for selecting appropriate cluster numbers (e.g., by thresholding the gaps between consecutive eigenvalues). Here we assume that the number of clusters k at time t has been determined by one of these methods, and we investigate what will happen if the cluster number k at time t is different from the cluster number k' at time t-1.

It turns out that both the PCQ and the PCM frameworks can handle variations in cluster number already. In the PCQ framework, the temporal cost is expressed by the historic data themselves, not by the historic clusters, and therefore the computation at time t is independent of the cluster number k' at time t-1. In the PCM framework, as we have mentioned, the partition distance (Equation 14) and the subspace distance (Equation 16) can both be used without change when the two partitions have different numbers of clusters. As a result, both of our PCQ and PCM frameworks can handle variations in the cluster numbers.

4.2 Insertion and Removal of Nodes

Another assumption that we have been using is that the number of nodes in V does not change with time. However, in many applications the data points to be clustered may vary with time. In the blog example, very often there are old bloggers who stop blogging and new bloggers who just start. Here we propose some heuristic solutions to handle this issue.

4.2.1 Node Insertion and Removal in PCQ

For the PCQ framework, the key is $\alpha W_t + \beta W_{t-1}$. When old nodes are removed, we can simply remove the corresponding rows and columns from $W_{t-1}$ to get $\tilde{W}_{t-1}$ (assuming $\tilde{W}_{t-1}$ is $n_1 \times n_1$). However, when new nodes are inserted at time t, we need to add entries to $\tilde{W}_{t-1}$ and to extend it to $\hat{W}_{t-1}$, which has the same dimension as $W_t$ (assuming $W_t$ is $n_2 \times n_2$). Without loss of generality, we assume that the first $n_1$ rows and columns of $W_t$ correspond to those nodes in $\tilde{W}_{t-1}$. We propose to achieve this by defining

$\hat{W}_{t-1} = \begin{pmatrix} \tilde{W}_{t-1} & E_{t-1} \\ E_{t-1}^T & F_{t-1} \end{pmatrix}$  for  $E_{t-1} = \tfrac{1}{n_1}\, \tilde{W}_{t-1} \vec{1}_{n_1} \vec{1}_{n_2-n_1}^T$,  $F_{t-1} = \tfrac{1}{n_1^2}\, \big(\vec{1}_{n_1}^T \tilde{W}_{t-1} \vec{1}_{n_1}\big)\, \vec{1}_{n_2-n_1} \vec{1}_{n_2-n_1}^T$

Such a heuristic has the following good properties, whose proofs are skipped due to the space limitation.

Property 1. (1) $\hat{W}_{t-1}$ is positive semi-definite if $W_{t-1}$ is. (2) In $\hat{W}_{t-1}$, for each existing node $v_{old}$, each newly inserted node $v_{new}$ looks like an average node in that the similarity between $v_{new}$ and $v_{old}$ is the same as the average similarity between any existing node and $v_{old}$. (3) In $\hat{W}_{t-1}$, the similarity between any pair of newly inserted nodes is the same as the average similarity among all pairs of existing nodes.

We can see that these properties are appealing when no prior knowledge is given about the newly inserted nodes.

4.2.2 Node Insertion and Removal in PCM

For the PCM framework, when old nodes are removed, we remove the corresponding rows from $X_{t-1}$ to get $\tilde{X}_{t-1}$ (assuming $\tilde{X}_{t-1}$ is $n_1 \times k$). When new nodes are inserted at time t, we extend $\tilde{X}_{t-1}$ to $\hat{X}_{t-1}$, which has the same dimension as $X_t$ (assuming $X_t$ is $n_2 \times k$), as follows

$\hat{X}_{t-1} = \begin{pmatrix} \tilde{X}_{t-1} \\ G_{t-1} \end{pmatrix}$  for  $G_{t-1} = \tfrac{1}{n_1}\, \vec{1}_{n_2-n_1} \vec{1}_{n_1}^T \tilde{X}_{t-1}$   (20)

That is, we insert new rows equal to the row average of $\tilde{X}_{t-1}$. After obtaining $\hat{X}_{t-1}$, we replace the term $\beta X_{t-1} X_{t-1}^T$ with $\beta \hat{X}_{t-1} (\hat{X}_{t-1}^T \hat{X}_{t-1})^{-1} \hat{X}_{t-1}^T$ in Equations (17) and (19). Such a heuristic has the following good property, whose proof is skipped due to the space limit.

Property 2. Equation (20) corresponds to, for each newly inserted node, assigning to it a prior clustering membership that is approximately proportional to the sizes of the clusters at time t-1.
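A sketch of these two extension heuristics (our own code and naming; W_old and X_old denote W̃_{t-1} and X̃_{t-1} with the removed nodes already dropped, and n2 is the number of nodes at time t):

```python
import numpy as np

def extend_similarity(W_old, n2):
    """PCQ: extend the n1 x n1 historic similarity matrix to n2 x n2 so that
    every newly inserted node looks like an 'average' node (Property 1)."""
    n1 = W_old.shape[0]
    ones_old = np.ones((n1, 1))
    ones_new = np.ones((n2 - n1, 1))
    E = (W_old @ ones_old) @ ones_new.T / n1               # E_{t-1}
    F = W_old.sum() / n1 ** 2 * (ones_new @ ones_new.T)    # F_{t-1}
    return np.block([[W_old, E], [E.T, F]])

def extend_eigenvectors(X_old, n2):
    """PCM, Eq. (20): append the row average of X_old for every new node."""
    new_rows = np.repeat(X_old.mean(axis=0, keepdims=True), n2 - X_old.shape[0], axis=0)
    return np.vstack([X_old, new_rows])

def pcm_temporal_term(X_hat):
    """Replacement for X_{t-1} X_{t-1}^T in Eqs. (17) and (19):
    X_hat (X_hat^T X_hat)^{-1} X_hat^T."""
    return X_hat @ np.linalg.solve(X_hat.T @ X_hat, X_hat.T)
```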
5. EXPERIMENTAL STUDIES

In this section, we report experimental studies based on both synthetic data sets and a real blog data set.

5.1 Synthetic Data

First, we use several experiments on synthetic data sets to illustrate the good properties of our algorithms.

5.1.1 NA-based Evolutionary Spectral Clustering

In this subsection, we show three experimental studies based on synthetic data. In the first experiment, we demonstrate a stationary case where data variation is due to a zero-mean noise. In the second experiment, we show a non-stationary case where there are concept drifts. In the third experiment, we show a case where there is a large difference between the PCQ and PCM frameworks.

By using the k-means algorithm, we design two baselines. The first baseline, which we call ACC, accumulates all historic data before the current timestep t and applies the k-means algorithm on the aggregated data. The second baseline, which we call IND, independently applies the k-means algorithm on the data in only timestep t and ignores all historic data before t.

For our algorithms, we use the NA-based PCQ and PCM algorithms, because of the equivalence between the NA-based spectral clustering problem and the k-means clustering problem (Equation (10)). We choose to use W = A^T A in the NA-based evolutionary spectral clustering and compare its results with those of the k-means baseline algorithms. For a fair comparison, we use the KM defined for the k-means clustering problem (i.e., Equation (1)) as the measure of performance, where a smaller KM value is better.

The data points to be clustered are generated in the following way. 800 two-dimensional data points are initially positioned as described in Figure 2(a) at timestep 1. As can be seen, there are roughly four clusters (the data were actually generated by using four Gaussian distributions centered in the four quadrants). Then in timesteps 2 to 10, we perturb the initial positions of the data points by adding different noises according to the experimental setup. Unless stated otherwise, all experiments are repeated 50 times with different random seeds and the average performances are reported.

[Figure 2: (a) The initial data point positions and (b) a typical position in the non-stationary case.]
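For concreteness, a hedged sketch of this synthetic generator (our own code; the paper does not give the exact cluster centers or spreads, so the values below are assumptions, and N(0, 0.5) is read as a variance of 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

# Initial positions: 800 2-D points drawn from four Gaussians centered in the
# four quadrants (the centers and spread below are assumed values).
centers = np.array([[3.0, 3.0], [-3.0, 3.0], [-3.0, -3.0], [3.0, -3.0]])
base = np.vstack([c + rng.normal(scale=1.0, size=(200, 2)) for c in centers])

def snapshot(noise_var=0.5):
    """Positions at one of timesteps 2-10: the initial positions plus
    i.i.d. zero-mean Gaussian noise with the given variance."""
    return base + rng.normal(scale=np.sqrt(noise_var), size=base.shape)
```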
In the first experiment, for timesteps 2 through 10, we add i.i.d. Gaussian noise following N(0, 0.5) to the initial positions of the data points. We use this data to simulate a stationary situation where the concept is relatively stable but there exist short-term noises. In Figures 3(a) and 3(b), we report the snapshot cost CS_KM and the temporal cost CT_KM for the two baselines and for our algorithms (with α = 0.9 for both PCQ and PCM) from timesteps 1 to 10. For both costs, a lower value is better. As can be seen from the figure, the ACC baseline has low temporal cost but very high snapshot cost, whereas the IND baseline has low snapshot cost but very high temporal cost. In comparison, our two algorithms show low temporal cost at the price of a small increase in snapshot cost. In Figure 3(c), which reports the overall cost, the ACC baseline has the worst overall performance. In addition, Figure 3(d) shows the degree of cluster change; as expected, the cluster membership change using our frameworks is less dramatic than that of the IND baseline, which takes no historic information into account.

[Figure 3: The performance for the stationary synthetic data set, which shows that PCQ and PCM result in low temporal cost at the price of a small increase in snapshot cost (panels: (a) Snapshot Cost (CS), (b) Temporal Cost (CT), (c) Overall Cost, (d) Degree of Cluster Change).]

Next, for the same data set, we let α increase from 0.2 to 1 with a step of 0.1. Figure 4 shows the average snapshot cost and the average temporal cost over all 10 timesteps under different settings of α. As we expected, when α increases, to emphasize more the snapshot cost, we get better snapshot quality at the price of worse temporal smoothness. This result demonstrates that our frameworks are able to control the tradeoff between the snapshot quality and the temporal smoothness.

[Figure 4: The tradeoff between snapshot cost and temporal cost, which can be controlled by α.]

In the second experiment, we simulate a non-stationary situation. At timesteps 2 through 10, before adding random noises, we first rotate all data points by a small random angle (with zero mean and a variance of π/4). Figure 2(b) shows the positions of the data points in a typical timestep. Figure 5 gives the performance of the four algorithms. As can be seen, while the performance of our algorithms and the IND baseline has little change, the performance of the ACC baseline becomes very poor. This result shows that if an aggregation approach is used, we should not aggregate the data features in a non-stationary scenario — instead, we should aggregate the similarities among data points.

[Figure 5: Performance for a non-stationary synthetic data set, which shows that aggregating data features does not work.]

In the third experiment, we show a case where the PCQ and PCM frameworks behave differently. We first generate data points using the procedure described in the first experiment (the stationary scenario), except that this time we generate 60 timesteps for a better view. This time, instead of 4 clusters, we let the algorithms partition the data into 2 clusters. From Figure 2(a) we can see that there are obviously two possible partitions, a horizontal cut or a vertical cut at the center, that will give similar performance, where the performance difference will mainly be due to short-term noises. Figure 6 shows the degree of cluster membership change over the 60 timesteps in one run (for obvious reasons, no averaging is taken in this experiment). As can
be seen, the cluster memberships of the two baselines jump around, which shows that they are not robust to noise in this case. As can also be seen, the cluster membership of the PCM algorithm varies much less than that of the PCQ algorithm. The reason for this difference is that switching the partition from the horizontal cut to the vertical cut introduces a much higher penalty for PCM than for PCQ — PCM is directly penalized by the change of eigenvectors, which corresponds to the change of cluster membership; for PCQ, the penalty is indirectly acquired from the historic data, not the historic cluster membership.

[Figure 6: Cluster Membership Variation over the 60 timesteps for ACC, IND, PCQ, and PCM.]

[Figure 7: A toy example that demonstrates our evolutionary … (two rows of panels at t = 1, 2, 3, 4).]

… data sets demonstrate that, compared to traditional clustering methods, our evolutionary spectral clustering algorithms can provide clustering results that are more stable and consistent, less sensitive to short-term noise, and adaptive to long-term trends.

5.2 Real Blog Data

The real blog data was collected by an NEC in-house blog crawler. Due to the space limit, we will not describe how the data was collected and refer interested readers to [18] for details. This NEC blog data set contains 148,681 entry-to-entry links among 407 blogs crawled during 63 consecutive weeks, between July 10th, 2005 and September 23rd, 2006. By looking at the contents of the blogs, we discovered two main groups: a set of 314 blogs with a technology focus and a set of 93 blogs with a politics focus. Figure 8 shows the blog graph for this NEC data set, where the nodes are blogs (with different labels depending on their group membership) and the edges are interlinks among blogs (obtained by aggregating all entry-to-entry links).

[Figure 8: The blog graph for the NEC data set.]

… plot the results for PCQ, which are similar to those of PCM, …

[Figure 9: The performance on the NEC data, which shows that evolutionary spectral clustering clearly outperforms non-evolutionary ones (panels include Overall Cost and Error Compared to the Ground Truth).]
… as shown in Table 1). In Figure 9(d), we show the error between the cluster results and the ground truth obtained from content analysis, where the error is the distance between partitions defined in Equation (18). As can be seen from these figures, the evolutionary spectral clustering has the best performance in all four measures. The high snapshot cost of IND was surprising to us. We believe this could be due to the non-robustness of the normalized cut package (which we obtained from the homepage of the first author of [19]). In addition, note that CT_NC is usually smaller than CS_NC because CT_NC is computed over those nodes that are active in both t and t-1, and such nodes are usually fewer than those that are active at t. This is also one of the reasons for the high variation of the curves.

Table 1: Performance under Different Cluster Numbers for the Blog Data Set

        Measure       ACC     IND     NC-PCQ   NC-PCM
  k=2   CS            0.76    0.79    0.68     0.46
        CT            0.59    0.20    0.10     0.06
        Total Cost    0.74    0.73    0.63     0.42
  k=3   CS            1.22    1.53    1.12     1.07
        CT            0.98    0.22    0.24     0.02
        Total Cost    1.21    1.43    1.06     0.98
  k=4   CS            1.71    2.05    1.70     1.71
        CT            1.40    0.18    0.39     0.03
        Total Cost    1.69    1.89    1.59     1.57

In addition, we run the algorithms under different cluster numbers and report the performance in Table 1, where the best results within each category are highlighted in the original table. Our evolutionary clustering algorithms always give more stable and consistent cluster results than the baselines, where the historic data is either totally ignored or totally aggregated.

6. CONCLUSION

There are new challenges when traditional clustering techniques are applied to new data types, such as streaming data and Web/blog data, where the relationship among data evolves with time. On one hand, because of long-term concept drifts, a naive approach based on aggregation will not give satisfactory cluster results. On the other hand, short-term variations occur very often due to noise. Preferably, the cluster results should not change dramatically over short time periods and should exhibit temporal smoothness. In this paper, we propose two frameworks to incorporate temporal smoothness in evolutionary spectral clustering. In both frameworks, a cost function is defined where, in addition to the traditional cluster quality cost, a second cost is introduced to regularize the temporal smoothness. We then derive the (relaxed) optimal solutions for the cost functions. The solutions turn out to have very intuitive interpretations and have forms analogous to traditional techniques used in time series analysis. Experimental studies demonstrate that these new frameworks provide cluster results that are both stable and consistent in the short term and adaptive in the long run.

7. ACKNOWLEDGMENTS

We thank Shenghuo Zhu, Wei Xu, and Kai Yu for the inspiring discussions, and thank Junichi Tatemura for helping us prepare the data sets.

8. REFERENCES

[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proc. of the 12th VLDB Conference, 2003.
[2] F. R. Bach and M. I. Jordan. Learning spectral clustering, with application to speech separation. Journal of Machine Learning Research, 7, 2006.
[3] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary clustering. In Proc. of the 12th ACM SIGKDD Conference, 2006.
[4] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. In Proc. of the 29th STOC Conference, 1997.
[5] C. Chatfield. The Analysis of Time Series: An Introduction. Chapman & Hall/CRC.
[6] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[7] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. on Matrix Analysis and Applications, 21(4), 2000.
[8] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proc. of the 10th ACM SIGKDD Conference, 2004.
[9] C. Ding and X. He. K-means clustering via principal component analysis. In Proc. of the 21st ICML Conference, 2004.
[10] K. Fan. On a theorem of Weyl concerning eigenvalues of linear transformations. In Proc. Natl. Acad. Sci., 1949.
[11] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third edition, 1996.
[12] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In IEEE Symposium on Foundations of Computer Science, 2000.
[13] C. Gupta and R. Grossman. GenIc: A single pass generalized incremental algorithm for clustering. In SIAM Int. Conf. on Data Mining, 2004.
[14] L. J. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2, 1985.
[15] X. Ji and W. Xu. Document clustering with prior knowledge. In SIGIR, 2006.
[16] Y. Li, J. Han, and J. Yang. Clustering moving objects. In Proc. of the 10th ACM SIGKDD Conference, 2004.
[17] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.
[18] H. Ning, W. Xu, Y. Chi, Y. Gong, and T. Huang. Incremental spectral clustering with application to monitoring of evolving blog communities. In SIAM Int. Conf. on Data Mining, 2007.
[19] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8), 2000.
[20] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained K-means clustering with background knowledge. In Proc. of the 18th ICML Conference, 2001.
[21] Y. Weiss. Segmentation using eigenvectors: A unifying view. In Proc. of the International Conference on Computer Vision (ICCV), 1999.
[22] H. Zha, X. He, C. H. Q. Ding, M. Gu, and H. D. Simon. Spectral relaxation for k-means clustering. In NIPS, 2001.