Low rank subspace clustering (LRSC)

René Vidal, Paolo Favaro

Pattern Recognition Letters
Article history: Available online xxxx
Communicated by Kim Boyer

Keywords: Subspace clustering; Low-rank and sparse methods; Principal component analysis; Motion segmentation; Face clustering

Abstract

We consider the problem of fitting a union of subspaces to a collection of data points drawn from one or more subspaces and corrupted by noise and/or gross errors. We pose this problem as a non-convex optimization problem, where the goal is to decompose the corrupted data matrix as the sum of a clean and self-expressive dictionary plus a matrix of noise and/or gross errors. By self-expressive we mean a dictionary whose atoms can be expressed as linear combinations of themselves with low-rank coefficients. In the case of noisy data, our key contribution is to show that this non-convex matrix decomposition problem can be solved in closed form from the SVD of the noisy data matrix. The solution involves a novel polynomial thresholding operator on the singular values of the data matrix, which requires minimal shrinkage. For one subspace, a particular case of our framework leads to classical PCA, which requires no shrinkage. For multiple subspaces, the low-rank coefficients obtained by our framework can be used to construct a data affinity matrix from which the clustering of the data according to the subspaces can be obtained by spectral clustering. In the case of data corrupted by gross errors, we solve the problem using an alternating minimization approach, which combines our polynomial thresholding operator with the more traditional shrinkage-thresholding operator. Experiments on motion segmentation and face clustering show that our framework performs on par with state-of-the-art techniques at a reduced computational cost.
1.3. Paper outline

The remainder of the paper is organized as follows (see Table 1): Section 2 reviews existing results on sparse representation and rank minimization for subspace estimation and clustering, as well as some background material needed for our derivations. Section 3 formulates the low rank subspace clustering problem for linear subspaces in the absence of noise or gross errors and derives a closed form solution for A and C. Section 4 extends the results of Section 3 to data contaminated by noise and derives a closed form solution for A and C based on the polynomial thresholding operator. Section 5 extends the results to data contaminated by both noise and gross errors and shows that A and C can be found using alternating minimization. Section 6 presents experiments that evaluate our method on synthetic and real data. Section 7 gives the conclusions.

2. Background

In this section we review existing results on sparse representation and rank minimization for subspace estimation (Section 2.1) and subspace clustering (Section 2.2).

2.1. Subspace estimation by sparse representation and rank minimization

2.1.1. Low rank minimization

Given a data matrix corrupted by Gaussian noise D = A + G, where A is an unknown low-rank matrix and G represents the noise, the problem of finding a low-rank approximation of D can be formulated as

min_A ‖D − A‖_F^2  subject to  rank(A) ≤ r.    (2)

The optimal solution to this (PCA) problem is given by A = U H_{σ_{r+1}}(Σ) V^T, where D = U Σ V^T is the SVD of D, σ_k is the kth singular value of D, and H_ε(x) is the hard thresholding operator:

H_ε(x) = { x if |x| > ε;  0 otherwise }.    (3)

When r is unknown, the problem in (2) can be formulated as

min_A rank(A) + (α/2) ‖D − A‖_F^2,    (4)

where α > 0 is a parameter. Since the optimal solution of (2) for a fixed rank r = rank(A) is A = U H_{σ_{r+1}}(Σ) V^T, the problem in (4) is equivalent to

min_r r + (α/2) Σ_{k>r} σ_k^2.    (5)

The optimal r is the smallest r such that σ_{r+1} ≤ √(2/α). Therefore, the optimal A is given by A = U H_{√(2/α)}(Σ) V^T.
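For concreteness, this closed-form solution can be sketched in a few lines of NumPy. The sketch below is an illustration and not the authors' code; the function name lowrank_pca and the example data are ours.

```python
import numpy as np

def lowrank_pca(D, alpha):
    # Solve (4): hard-threshold the singular values of D at sqrt(2/alpha), cf. (3) and (5)
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    s_hard = np.where(s > np.sqrt(2.0 / alpha), s, 0.0)   # hard thresholding H_eps
    A = (U * s_hard) @ Vt                                  # optimal low-rank approximation
    return A, int(np.count_nonzero(s_hard))                # A and the selected rank r

# Example usage on a noisy low-rank matrix
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 80)) \
    + 0.01 * rng.standard_normal((50, 80))
A_hat, r_hat = lowrank_pca(D, alpha=20.0)
```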
Since rank minimization problems are in general NP hard, a common practice (see Recht et al., 2010) is to replace the rank of A by its nuclear norm ‖A‖_*, i.e., the sum of its singular values, which leads to the following convex problem

min_A ‖A‖_* + (α/2) ‖D − A‖_F^2.    (6)

As shown in Cai et al. (2008), the optimal solution to (6) is A = U S_{1/α}(Σ) V^T, where S_ε(x) is the shrinkage-thresholding operator

S_ε(x) = sign(x) max(|x| − ε, 0).    (7)

Notice that the latter solution does not coincide with the one given by PCA, which performs hard-thresholding of the singular values of D without shrinking them by 1/α.

2.1.2. Principal component pursuit

While the above methods work well for data corrupted by Gaussian noise, they break down for data corrupted by gross errors. In Candès et al. (2011) this issue is addressed by assuming sparse gross errors, i.e., only a small percentage of the entries of D are corrupted. Hence, the goal is to decompose the data matrix D as the sum of a low-rank matrix A and a sparse matrix E, i.e.,

min_{A,E} rank(A) + γ‖E‖_0  s.t.  D = A + E,    (8)

where γ > 0 is a parameter and ‖E‖_0 is the number of nonzero entries in E. Since this problem is in general NP hard, a common practice is to replace the rank of A by its nuclear norm and the ℓ_0 semi-norm by the ℓ_1 norm. It is shown in Candès et al. (2011) that, under broad conditions, the optimal solution to the problem in (8) is identical to that of the convex problem

min_{A,E} ‖A‖_* + γ‖E‖_1  s.t.  D = A + E.    (9)

While a closed form solution to this problem is not known, convex optimization techniques can be used to find the minimizer. We refer the reader to Lin et al. (2011) for a review of numerous approaches. One such approach is the Augmented Lagrange Multiplier (ALM) method, an iterative approach for solving the following optimization problem

max_Y min_{A,E} ‖A‖_* + γ‖E‖_1 + ⟨Y, D − A − E⟩ + (α/2)‖D − A − E‖_F^2,    (10)

where the third term enforces the equality constraint via the matrix of Lagrange multipliers Y, while the fourth term (which is zero at the optimum) makes the cost function strictly convex. The optimization problem in (10) is solved using the inexact ALM method in (13). This method is motivated by the fact that the minimization over A and E for a fixed Y can be re-written as

min_{A,E} ‖A‖_* + γ‖E‖_1 + (α/2)‖D − A − E + α^{−1}Y‖_F^2.    (11)

Given E and Y, it follows from the solution of (6) that the optimal A is A = U S_{α^{−1}}(Σ) V^T, where U Σ V^T is the SVD of D − E + α^{−1}Y. Given A and Y, the optimal E satisfies

−α(D − A − E + α^{−1}Y) + γ sign(E) = 0.    (12)

It is shown in Lin et al. (2011) that this equation can be solved in closed form using the shrinkage-thresholding operator as E = S_{γα^{−1}}(D − A + α^{−1}Y). Therefore, the inexact ALM method iterates the following steps till convergence:

(U, Σ, V) = svd(D − E_k + α_k^{−1} Y_k),
A_{k+1} = U S_{α_k^{−1}}(Σ) V^T,
E_{k+1} = S_{γα_k^{−1}}(D − A_{k+1} + α_k^{−1} Y_k).    (13)
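A compact NumPy sketch of the inexact ALM iteration (13) follows. The updates of the multiplier Y and of the penalty α (increased by a factor ρ at each step) belong to the standard method of Lin et al. (2011) but are not shown in (13) above, so they are included here as assumptions; all names are ours.

```python
import numpy as np

def soft(X, eps):
    # Shrinkage-thresholding operator S_eps applied entrywise, cf. (7)
    return np.sign(X) * np.maximum(np.abs(X) - eps, 0.0)

def inexact_alm_pcp(D, gamma, alpha0=None, rho=1.5, n_iter=100):
    # Sketch of the inexact ALM iteration (13) for principal component pursuit (9)
    alpha = 1.0 / np.linalg.norm(D, 2) if alpha0 is None else alpha0   # common initialization (assumption)
    A, E, Y = np.zeros_like(D), np.zeros_like(D), np.zeros_like(D)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(D - E + Y / alpha, full_matrices=False)
        A = (U * soft(s, 1.0 / alpha)) @ Vt           # singular value shrinkage, cf. solution of (6)
        E = soft(D - A + Y / alpha, gamma / alpha)    # entrywise shrinkage, cf. (12)
        Y = Y + alpha * (D - A - E)                   # multiplier update (assumed standard ALM step)
        alpha *= rho                                  # penalty increase (assumed standard ALM step)
    return A, E
```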
2.2. Subspace clustering by sparse representation and rank minimization

Consider now the more challenging problem of clustering data drawn from multiple subspaces. In what follows, we discuss two methods based on sparse and low-rank representation for addressing this problem.

2.2.1. Sparse subspace clustering (SSC)

The work of Elhamifar and Vidal (2009) shows that, in the case of uncorrupted data, an affinity matrix for solving the subspace clustering problem can be constructed by expressing each data point as a linear combination of all other data points. That is, we wish to find a matrix C such that D = DC and diag(C) = 0. In principle, this leads to an ill-posed problem with many possible solutions. To resolve this issue, the principle of sparsity is invoked. Specifically, every point is written as a sparse linear combination of all other data points by minimizing the number of nonzero coefficients. That is,

min_C Σ_i ‖C_i‖_0  s.t.  D = DC and diag(C) = 0,    (14)

where C_i is the ith column of C. Since this problem is combinatorial, a simpler ℓ_1 optimization problem is solved

min_C ‖C‖_1  s.t.  D = DC and diag(C) = 0.    (15)

It is shown in Elhamifar and Vidal (2009, 2010, 2013) that, under some conditions on the subspaces and the data, the solutions to the optimization problems in (14) and (15) are such that C_ij = 0 when points i and j are in different subspaces. In other words, the nonzero coefficients of the ith column of C correspond to points in the same subspace as point i. Therefore, one can use C to define an affinity matrix as |C| + |C^T|. The segmentation of the data is then obtained by applying spectral clustering (von Luxburg, 2007) to this affinity.
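A minimal sketch of the noise-free program (15) using the cvxpy modeling package is shown below; the choice of cvxpy and the function name are ours, and any ℓ_1 solver could be substituted.

```python
import numpy as np
import cvxpy as cp

def ssc_affinity(D):
    # Sketch of (15): min ||C||_1  s.t.  D = DC, diag(C) = 0, followed by the affinity |C| + |C^T|
    N = D.shape[1]
    C = cp.Variable((N, N))
    objective = cp.Minimize(cp.sum(cp.abs(C)))      # entrywise l1 norm of C
    constraints = [D @ C == D, cp.diag(C) == 0]
    cp.Problem(objective, constraints).solve()
    W = np.abs(C.value) + np.abs(C.value).T         # affinity for spectral clustering
    return C.value, W
```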
In the case of data contaminated by noise G, the SSC algorithm assumes that each data point can be written as a linear combination of other data points up to an error G, i.e., D = DC + G, and solves the following convex problem

min_{C,G} ‖C‖_1 + (α/2)‖G‖_F^2  s.t.  D = DC + G and diag(C) = 0.    (16)

In the case of data contaminated also by gross errors E, the SSC algorithm assumes that D = DC + G + E, where E is sparse. Since both C and E are sparse, the equation D = DC + G + E = [D I][C^T E^T]^T + G means that each point is written as a sparse linear combination of a dictionary composed of all other data points plus the columns of the identity matrix I. Thus, one can find C by solving the convex optimization problem in (17), which also penalizes the ℓ_1 norm of E.

2.2.2. Low rank representation (LRR)

This algorithm (Liu et al., 2010) is very similar to SSC, except that it aims to find a low-rank representation instead of a sparse representation. LRR is motivated by the fact that for uncorrupted data drawn from n independent subspaces of dimensions {d_i}_{i=1}^n, the data matrix D is low rank because r = rank(D) = Σ_{i=1}^n d_i, which is assumed to be much smaller than min{M, N}. Therefore, the equation D = DC has a solution C that is low-rank. The LRR algorithm finds C by solving the following convex optimization problem

min_C ‖C‖_*  s.t.  D = DC.    (18)

It is shown in Liu et al. (2011) that the optimal solution to (18) is given by C = V_1 V_1^T, where D = U_1 Σ_1 V_1^T is the rank-r SVD of D. Notice that this solution for C coincides with the affinity matrix proposed by Costeira and Kanade (1998), as described in the introduction. It is shown in Vidal et al. (2008) that this matrix is such that C_ij = 0 when points i and j are in different subspaces, hence it can be used to build an affinity matrix.
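Because the noise-free LRR solution is available in closed form, it can be sketched directly from the SVD of D (again an illustration with our own names, assuming the rank r is known or estimated):

```python
import numpy as np

def lrr_affinity(D, r):
    # Sketch of the closed-form solution of (18): C = V1 V1^T from the rank-r SVD of D
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    V1 = Vt[:r].T                    # leading r right singular vectors
    C = V1 @ V1.T                    # shape interaction matrix (Costeira and Kanade, 1998)
    W = np.abs(C) + np.abs(C.T)      # affinity |C| + |C^T| for spectral clustering
    return C, W
```

The segmentation is then obtained by applying spectral clustering to W, e.g., with scikit-learn's SpectralClustering using affinity='precomputed'.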
In the case of data contaminated by noise or gross errors, the LRR algorithm solves the convex optimization problem

min_C ‖C‖_* + γ‖E‖_{2,1}  s.t.  D = DC + E,    (19)

where ‖E‖_{2,1} = Σ_{k=1}^N √(Σ_{j=1}^N |E_jk|^2) is the ℓ_{2,1} norm of the matrix of errors E. Notice that the problem in (19) is analogous to that in (17), except that the ℓ_1 norms of C and E are replaced by the nuclear norm of C and the ℓ_{2,1} norm of E, respectively. The nuclear norm is a convex relaxation of the rank of C, while the ℓ_{2,1} norm is a convex relaxation of the number of nonzero columns of E. Therefore, the LRR algorithm aims to minimize the rank of the matrix of coefficients and the number of corrupted data points, while the SSC algorithm aims to minimize the number of nonzero coefficients and the number of corrupted entries. As before, the optimal C is used to define an affinity matrix |C| + |C^T| and the segmentation of the data is obtained by applying spectral clustering to this affinity.

3. Low rank subspace clustering with uncorrupted data

In this section, we consider the low rank subspace clustering problem P in the case of uncorrupted data, i.e., α = ∞ and γ = ∞ so that G = E = 0 and D = A. In Section 3.1, we show that the optimal solution for C can be obtained in closed form from the SVD of A by applying a nonlinear thresholding to its singular values. In Section 3.2, we assume that the self-expressiveness constraint is satisfied exactly, i.e., τ = ∞ so that A = AC. As shown in Liu et al. (2011), the optimal solution to this problem can be obtained by hard thresholding the singular values of A. Here, we provide a simpler derivation of this result.

3.1. Uncorrupted data and relaxed constraints
C = V P_τ(Λ) V^T = V_1 (I − (1/τ) Λ_1^{−2}) V_1^T,    (20)

where the operator P_τ acts on the diagonal entries of Λ as

P_τ(x) := { 1 − 1/(τx^2)  if x > 1/√τ;   0  if x ≤ 1/√τ },    (21)

and U = [U_1 U_2], Λ = diag(Λ_1, Λ_2) and V = [V_1 V_2] are partitioned according to the sets I_1 = {i : λ_i > 1/√τ} and I_2 = {i : λ_i ≤ 1/√τ}. Moreover, the optimal value of P1 is

Φ_τ(A) := Σ_{i∈I_1} (1 − (1/(2τ)) λ_i^{−2}) + (τ/2) Σ_{i∈I_2} λ_i^2.    (22)
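The operator in (21) is straightforward to apply numerically; the sketch below (our own illustration, not the authors' code) computes C from the SVD of A as in (20).

```python
import numpy as np

def lrsc_relaxed_C(A, tau):
    # Sketch of (20)-(21): C = V P_tau(Lambda) V^T from the SVD of A
    _, lam, Vt = np.linalg.svd(A, full_matrices=False)
    p = np.zeros_like(lam)
    mask = lam > 1.0 / np.sqrt(tau)
    p[mask] = 1.0 - 1.0 / (tau * lam[mask] ** 2)   # P_tau applied to each singular value
    return (Vt.T * p) @ Vt                         # C = sum_i p_i v_i v_i^T
```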
since U^T U = I and U_C^T U_C = I. Let W = V^T U_C = [w_1, ..., w_N]. Then Λ w_j = Λ w_j δ_j for all j = 1, ..., N. This means that δ_j = 1 if Λ w_j ≠ 0 and δ_j is arbitrary otherwise. Since our goal is to minimize ‖C‖_* = ‖Δ‖_* = Σ_{j=1}^N |δ_j|, we need to set as many δ_j to zero as possible. Since A = AC implies that rank(A) ≤ rank(C), we can set at most N − rank(A) of the δ_j to zero and the remaining rank(A) of the δ_j must be equal to one. Now, if δ_j = 0, then Λ w_j = Λ_1 V_1^T U_C e_j = 0, where e_j is the jth column of the identity. This means that the columns of U_C associated to δ_j = 0 must be orthogonal to the columns of V_1, and hence the columns of U_C associated with δ_j = 1 must be in the range of V_1. Thus, U_C = [V_1 R_1  U_2 R_2] P for some rotation matrices R_1 and R_2, and permutation matrix P, and so the optimal C is

C = [V_1 R_1  V_2 R_2] [I 0; 0 0] [V_1 R_1  V_2 R_2]^T = V_1 V_1^T,    (26)

as claimed. □

4. Low rank subspace clustering with noisy data

Theorem 3. Let D = U Σ V^T be the SVD of the data matrix D. The optimal solutions to P3 are of the form

A = U Λ V^T  and  C = V P_τ(Λ) V^T,    (27)

where each entry of Λ = diag(λ_1, ..., λ_n) is obtained from each entry of Σ = diag(σ_1, ..., σ_n) as the solutions to

σ = ψ(λ) := { λ + (1/(ατ)) λ^{−3}  if λ > 1/√τ;   λ + (τ/α) λ  if λ ≤ 1/√τ }    (28)

that minimize

φ(λ, σ) := (α/2)(σ − λ)^2 + { 1 − (1/(2τ)) λ^{−2}  if λ > 1/√τ;   (τ/2) λ^2  if λ ≤ 1/√τ }.    (29)
Theorem 4. There exists a σ* > 0 such that the solutions to (28) that minimize (29) can be computed as

λ = P_{α,τ}(σ) := { b_1(σ)  if σ ≤ σ*;   b_3(σ)  if σ > σ* },    (31)

where b_1(σ) = (α/(α+τ)) σ and b_3(σ) is the real root of the polynomial

p(λ) = λ^4 − σ λ^3 + 1/(ατ) = 0    (32)

that minimizes (29). If 3τ ≤ α, the solution for λ is unique and

σ* = ψ(1/√τ) = (α + τ)/(α√τ).    (33)

If 3τ > α, the solution for λ is unique, except when σ satisfies

φ(b_1(σ), σ) = φ(b_3(σ), σ),    (34)

and σ* must lie in the range

(4/3)(3/(ατ))^{1/4} < σ* < (α + τ)/(α√τ).    (35)
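Numerically, P_{α,τ} can be evaluated one singular value at a time by comparing the cost φ at the candidate b_1(σ) and at the real roots of the quartic in (32). The sketch below is our own illustration (with hypothetical names) of that comparison, using a generic polynomial root finder.

```python
import numpy as np

def poly_threshold(sigma, alpha, tau):
    # Sketch of the polynomial thresholding operator P_{alpha,tau} for a single singular value sigma
    def phi(lam):
        # cost phi(lambda, sigma), cf. (29)
        quad = 0.5 * alpha * (sigma - lam) ** 2
        if lam > 1.0 / np.sqrt(tau):
            return quad + 1.0 - 1.0 / (2.0 * tau * lam ** 2)
        return quad + 0.5 * tau * lam ** 2

    candidates = [alpha / (alpha + tau) * sigma]                     # b_1(sigma)
    roots = np.roots([1.0, -sigma, 0.0, 0.0, 1.0 / (alpha * tau)])   # p(lam) = lam^4 - sigma*lam^3 + 1/(alpha*tau)
    candidates += [r.real for r in roots
                   if abs(r.imag) < 1e-9 and r.real > 1.0 / np.sqrt(tau)]
    return min(candidates, key=phi)                                  # candidate with the smallest cost
```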
4.2. Noisy data and exact constraints

Assume now that the data are generated from the exact self-expressive model, A = AC, and contaminated by noise, i.e., D = A + G. This leads to the optimization problem

(P4)  min_{A,C} ‖C‖_* + (α/2)‖D − A‖_F^2  s.t.  A = AC and C = C^T.

This problem can be seen as the limiting case of P3 with τ → ∞. In this case, b_1(σ) → 0, b_3(σ) → σ and σ* = √(2/α), hence the polynomial thresholding operator P_{α,τ} in (31) reduces to the hard thresholding operator H_{√(2/α)} in (3). Therefore, the optimal A can be obtained from the SVD of D = U Σ V^T as A = U H_{√(2/α)}(Σ) V^T, while the optimal C is given by Theorem 2.

Theorem 5. Let D = U Σ V^T be the SVD of the data matrix D. The optimal solution to P4 is given by

A = U_1 Σ_1 V_1^T  and  C = V_1 V_1^T,    (41)

where Σ_1 contains the singular values of D that are larger than √(2/α), and U_1 and V_1 contain the corresponding singular vectors.

This suggests an iterative thresholding algorithm that, starting from A_0 = D and E_0 = 0, alternates between applying polynomial thresholding to D − E_k to obtain A_{k+1} and applying shrinkage-thresholding to D − A_{k+1} to obtain E_{k+1}, i.e.,

(U_k, Σ_k, V_k) = svd(D − E_k),
A_{k+1} = U_k P_{α,τ}(Σ_k) V_k^T,
E_{k+1} = S_{γα^{−1}}(D − A_{k+1}).    (45)

Notice that we do not need to compute C at each iteration because the updates for A and E do not depend on C. Therefore, we can obtain C from A upon convergence.
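The iteration in (45) then only combines this operator with entrywise shrinkage. A sketch is given below; it assumes the soft() and poly_threshold() helpers from the earlier sketches and, as in the text, recovers C from A only after the loop.

```python
import numpy as np
# assumes soft() and poly_threshold() from the sketches above

def lrsc_ipt(D, alpha, tau, gamma, n_iter=50):
    # Sketch of the iterative thresholding scheme (45) for data with noise and gross errors
    A, E = D.copy(), np.zeros_like(D)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(D - E, full_matrices=False)
        lam = np.array([poly_threshold(si, alpha, tau) for si in s])   # polynomial thresholding
        A = (U * lam) @ Vt
        E = soft(D - A, gamma / alpha)                                 # shrinkage-thresholding
    # recover C from A upon convergence, as in (20)-(21)
    _, lam, Vt = np.linalg.svd(A, full_matrices=False)
    p = np.zeros_like(lam)
    mask = lam > 1.0 / np.sqrt(tau)
    p[mask] = 1.0 - 1.0 / (tau * lam[mask] ** 2)
    C = (Vt.T * p) @ Vt
    return A, E, C
```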
Although the optimization problem in (42) is non-convex, the algorithm in (45) is guaranteed to converge, as shown in Tseng (2001). Specifically, it follows from Theorem 1 that the optimization problem in (42) is equivalent to the minimization of the cost function

where ρ > 1 is a parameter. As in the case of the IPT method, C is obtained from A upon convergence. Experimentally, we have observed that our method always converges. However, while the convergence of the ADMM is well studied for convex problems, we are not aware of any extensions to the nonconvex case.

5.2. Corrupted data and exact constraints

Let us now consider the low rank subspace clustering problem P, where the constraint A = AC is enforced exactly, i.e.,

Motion segmentation refers to the problem of clustering a set of 2D point trajectories extracted from a video sequence into groups corresponding to different rigid-body motions. Here, the data matrix D is of dimension 2F × N, where N is the number of 2D trajectories and F is the number of frames in the video. Under the affine
Fig. 2. Motion segmentation: given feature points on multiple rigidly moving objects tracked in multiple frames of a video (top), the goal is to separate the feature trajectories
according to the moving objects (bottom).
Table 3
Clustering error (%) of different variants of the LRSC algorithm on the Hopkins 155 database. The parameters in the first four columns are set as τ = 420, α = 3000 for 2 motions, α = 5000 for 3 motions, and γ = 5. The parameters in the last four columns are set as τ = 4.5×10^4/√(MN) and α = 3000 for two motions, τ = 6×10^4/√(MN) and α = 5000 for 3 motions, and γ = 5. For P5-ADMM, we set μ_0 = 100 and ρ = 1.1. In all cases, we use homogeneous coordinates with η = 0.1. Boldface indicates the top performing algorithm in each experiment.
performs very well, achieving a clustering error of about 10% for 10 subjects. Table 7 shows the performance of different variants of LRSC on the same data. P1 has the best performance for 2 and 3 subjects, and the worst performance for 5, 8 and 10 subjects. All other variants of LRSC have similar performance, perhaps with P5-IPT being the best. Overall, LRSC performs better than LSA, SCC and LRR, and worse than LRR-H and SSC-N.

Finally, Fig. 4 shows the average computational time of each algorithm as a function of the number of subjects (or equivalently the number of data points). Note that the computational time of SCC is drastically higher than that of the other algorithms. This comes from the fact that the complexity of SCC increases exponentially in the dimension of the subspaces, which in this case is d = 9. On the other hand, SSC, LRR and LRSC use fast and efficient convex optimization techniques which keep their computational time lower than that of other algorithms. Overall, LRR and LRSC are the fastest methods.

7. Discussion and conclusion

We have proposed a new algorithm for clustering data drawn from a union of subspaces and corrupted by noise/gross errors. Our approach was based on solving a non-convex optimization problem whose solution provides an affinity matrix for spectral clustering. Our key contribution was to show that important particular cases of our formulation can be solved in closed form by applying a polynomial thresholding operator to the SVD of the data. A drawback of our approach to be addressed in the future is the need to tune the parameters of our cost function. Further research is also needed to understand the correctness of the resulting affinity matrix in the presence of noise and corruptions. Finally, all existing methods decouple the learning of the affinity from the segmentation of the data. Further research is needed to integrate these two steps into a single objective.
Table 4
Clustering error (%) on the Hopkins 155 database. Boldface indicates the top performing algorithm in each experiment.
Fig. 3. Face clustering: given face images of multiple subjects (top), the goal is to find images that belong to the same subject (bottom).
Table 5
Clustering error (%) of different algorithms on the Extended Yale B database after applying RPCA separately to the images from each subject. Boldface indicates the top performing algorithm in each experiment.

Method        LSA     SCC     LRR     LRR-H   SSC-N   LRSC
2 Subjects
  Mean        6.15    1.29    0.09    0.05    0.06    0.00
  Median      0.00    0.00    0.00    0.00    0.00    0.00
3 Subjects
  Mean        11.67   19.33   0.12    0.10    0.08    0.00
  Median      2.60    8.59    0.00    0.00    0.00    0.00
5 Subjects
  Mean        21.08   47.53   0.16    0.15    0.07    0.00
  Median      19.21   47.19   0.00    0.00    0.00    0.00
8 Subjects
  Mean        30.04   64.20   4.50    11.57   0.06    0.00
  Median      29.00   63.77   0.20    15.43   0.00    0.00
10 Subjects
  Mean        35.31   63.80   0.15    13.02   0.89    0.00
  Median      30.16   64.84   0.00    13.13   0.31    0.00

Table 6
Clustering error (%) of different algorithms on the Extended Yale B database without pre-processing the data. Boldface indicates the top performing algorithm in each experiment.

Method        LSA     SCC     LRR     LRR-H   SSC-N
2 Subjects
  Mean        32.80   16.62   9.52    2.54    1.86
  Median      47.66   7.82    5.47    0.78    0.00
3 Subjects
  Mean        52.29   38.16   19.52   4.21    3.10
  Median      50.00   39.06   14.58   2.60    1.04
5 Subjects
  Mean        58.02   58.90   34.16   6.90    4.31
  Median      56.87   59.38   35.00   5.63    2.50
8 Subjects
  Mean        59.19   66.11   41.19   14.34   5.85
  Median      58.59   64.65   43.75   10.06   4.49
10 Subjects
  Mean        60.42   73.02   38.85   22.92   10.94
  Median      57.50   75.78   41.09   23.59   5.63

Appendix A. The Von Neumann trace inequality

This appendix reviews two matrix product inequalities, which we will use later in our derivations.

Lemma 1 (Von Neumann's inequality). For any m × n real valued matrices X and Y,

trace(X^T Y) ≤ Σ_{i=1}^n σ_i(X) σ_i(Y),    (A.1)

where σ_1(X) ≥ σ_2(X) ≥ ··· ≥ 0 and σ_1(Y) ≥ σ_2(Y) ≥ ··· ≥ 0 are the descending singular values of X and Y, respectively. The case of equality occurs if and only if it is possible to find unitary matrices U_X and V_X that simultaneously singular value-decompose X and Y in the sense that

X = U_X Σ_X V_X^T  and  Y = U_X Σ_Y V_X^T,    (A.2)

where Σ_X and Σ_Y denote the m × n diagonal matrices with the singular values of X and Y, respectively, down in the diagonal.
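Lemma 1 is easy to check numerically; the short script below is an illustration only and verifies the inequality on random matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))
Y = rng.standard_normal((6, 4))
lhs = np.trace(X.T @ Y)
rhs = np.sum(np.linalg.svd(X, compute_uv=False) * np.linalg.svd(Y, compute_uv=False))
assert lhs <= rhs + 1e-12   # trace(X^T Y) <= sum_i sigma_i(X) sigma_i(Y)
```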
Fig. 4. Average computational time (s) of the algorithms on the Extended Yale B database as a function of the number of subjects.

Lemma 2. For any n × n real valued matrices X and Z,

trace(X^T Z) ≥ Σ_{i=1}^n σ_i(X) σ_{n−i+1}(Z),    (A.3)

where σ_1(X) ≥ σ_2(X) ≥ ··· ≥ 0 and σ_1(Z) ≥ σ_2(Z) ≥ ··· ≥ 0 are the descending singular values of X and Z, respectively. The case of equality occurs if and only if it is possible to find a unitary matrix U_X that simultaneously singular value-decomposes X and Z in the sense that

X = U_X Σ_X U_X^T  and  Z = U_X P Σ_Z P^T U_X^T,    (A.4)

where Σ_X and Σ_Z denote the n × n diagonal matrices with the singular values of X and Z, respectively, down in the diagonal in descending order, and P is a permutation matrix such that P Σ_Z P^T contains the singular values of Z in the diagonal in ascending order.

Appendix B

Proof of Theorem 1. Let A = U Λ V^T be the SVD of A and C = U_C Δ U_C^T be the eigenvalue decomposition (EVD) of C. The cost function of P1 reduces to ‖Δ‖_* + (τ/2)‖Λ W (I − Δ)‖_F^2, where W = V^T U_C, so to find the optimal W we only need to consider the last term of the cost function, i.e.,

‖Λ W (I − Δ)‖_F^2 = trace((I − Δ)^2 W^T Λ^2 W).    (B.5)

Applying Lemma 2 to X = W (I − Δ)^2 W^T and Z = Λ^2, we obtain that for all unitary matrices W

min_W trace((I − Δ)^2 W^T Λ^2 W) = Σ_{i=1}^N σ_i((I − Δ)^2) σ_{n−i+1}(Λ^2),    (B.6)

where the minimum is achieved by a permutation matrix W = P^T that sorts the diagonal entries of Λ^2 in ascending order, i.e., the diagonal entries of P Λ^2 P^T are in ascending order.

Let the ith largest entry of (I − Δ)^2 and Λ^2 be, respectively, (1 − δ_i)^2 = σ_i((I − Δ)^2) and ν_{n−i+1}^2 = λ_i^2 = σ_i(Λ^2). Then
min_W ‖Δ‖_* + (τ/2)‖Λ W (I − Δ)‖_F^2 = Σ_{i=1}^N |δ_i| + (τ/2) Σ_{i=1}^N ν_i^2 (1 − δ_i)^2.    (B.7)

To find the optimal Δ, we can solve for each δ_i independently as δ_i = argmin_δ |δ| + (τ/2) ν_i^2 (1 − δ)^2. As shown, e.g., by Parikh and Boyd (2013), the solution to this problem can be found in closed form using the shrinkage-thresholding operator in (7), which gives

δ_i = S_{1/(τν_i^2)}(1) = { 1 − 1/(τν_i^2)  if ν_i > 1/√τ;   0  if ν_i ≤ 1/√τ }.    (B.8)

Then, δ_i = P_τ(λ_{n−i+1}), which can be compactly written as Δ = P P_τ(Λ) P^T. Therefore,

P^T Δ P = P_τ(Λ) = [I − (1/τ)Λ_1^{−2}  0;  0  0],    (B.9)

where Λ = diag(Λ_1, Λ_2) is partitioned according to the sets I_1 = {i : λ_i > 1/√τ} and I_2 = {i : λ_i ≤ 1/√τ}.

To find the optimal W, notice from Lemma 2 that the equality trace((I − Δ)^2 W^T Λ^2 W) = Σ_{i=1}^N (1 − δ_i)^2 λ_{n−i+1}^2 is achieved if and only if there exists a unitary matrix U_X such that

(I − Δ)^2 = U_X (I − Δ)^2 U_X^T  and  W^T Λ^2 W = U_X P Λ^2 P^T U_X^T.    (B.10)

Since the SVD of a matrix is unique up to the sign of the singular vectors associated with different singular values and up to a rotation and sign of the singular vectors associated with repeated singular values, we conclude that U_X = I up to the aforementioned ambiguities of the SVD of (I − Δ)^2. Likewise, we have that W^T = U_X P up to the aforementioned ambiguities of the SVD of Λ^2. Now, if Λ^2 has repeated singular values, then (I − Δ)^2 has repeated eigenvalues at the same locations. Therefore, W^T = U_X P = P up to a block-diagonal transformation, where each block is an orthonormal matrix that corresponds to a repeated singular value of Δ. Nonetheless, even though W may not be unique, the matrix C is always unique and equal to

C = U_C Δ U_C^T = V W Δ W^T V^T = V P^T Δ P V^T = [V_1 V_2] [I − (1/τ)Λ_1^{−2}  0;  0  0] [V_1 V_2]^T = V_1 (I − (1/τ)Λ_1^{−2}) V_1^T,    (B.11)

as claimed.

Finally, the optimal C is such that AC = U_1 (Λ_1 − (1/τ)Λ_1^{−1}) V_1^T and A − AC = U_2 Λ_2 V_2^T + (1/τ) U_1 Λ_1^{−1} V_1^T. This shows (22), because

‖C‖_* + (τ/2)‖A − AC‖_F^2 = Σ_{i∈I_1} (1 − (1/τ) λ_i^{−2}) + (τ/2) ( Σ_{i∈I_1} λ_i^{−2}/τ^2 + Σ_{i∈I_2} λ_i^2 ),

as claimed. □

Theorem 3. Let D = U Σ V^T be the SVD of the data matrix D. The optimal solutions to P3 are of the form

A = U Λ V^T  and  C = V P_τ(Λ) V^T,    (B.12)

where each entry of Λ = diag(λ_1, ..., λ_n) is obtained from each entry of Σ = diag(σ_1, ..., σ_n) as the solutions to

σ = ψ(λ) := { λ + (1/(ατ)) λ^{−3}  if λ > 1/√τ;   λ + (τ/α) λ  if λ ≤ 1/√τ },    (B.13)

that minimize

φ(λ, σ) := (α/2)(σ − λ)^2 + { 1 − (1/(2τ)) λ^{−2}  if λ > 1/√τ;   (τ/2) λ^2  if λ ≤ 1/√τ }.    (B.14)

Proof. The proof of this result will be done in two steps. First, we will show that the optimal C can be computed in closed form from the SVD of A. Second, we will show that the optimal A can be obtained in closed form from the SVD of D.

A closed-form solution for C. Notice that when A is fixed, P3 reduces to P1. Therefore, it follows from Theorem 1 that the optimal solution for C is C = V P_τ(Λ) V^T, where A = U Λ V^T is the SVD of A. Moreover, it follows from (22) that if we replace the optimal C into the cost of P3, then P3 is equivalent to

min_A Φ_τ(A) + (α/2)‖D − A‖_F^2.    (B.15)

A closed-form solution for A. Let D = U Σ V^T and A = U_A Λ V_A^T be the SVDs of D and A, respectively. Then,

‖D − A‖_F^2 = ‖U Σ V^T − U_A Λ V_A^T‖_F^2 = ‖Σ‖_F^2 − 2 trace(V Σ U^T U_A Λ V_A^T) + ‖Λ‖_F^2 = ‖Σ‖_F^2 − 2 trace(Σ W_1 Λ W_2^T) + ‖Λ‖_F^2,    (B.16)

where W_1 = U^T U_A and W_2 = V^T V_A. Therefore, the minimization over A in (B.15) can be carried out by minimizing first with respect to W_1 and W_2 and then with respect to Λ.

The minimization over W_1 and W_2 is equivalent to

max_{W_1, W_2} trace(Σ W_1 Λ W_2^T).    (B.17)

By letting X = Σ and Y = W_1 Λ W_2^T in Lemma 1, we obtain

max_{W_1, W_2} trace(Σ W_1 Λ W_2^T) = Σ_{i=1}^n σ_i(Σ) σ_i(Λ) = Σ_{i=1}^n σ_i λ_i.    (B.18)

Moreover, the maximum is achieved if and only if there exist orthogonal matrices U_W and V_W such that

Σ = U_W Σ V_W^T  and  W_1 Λ W_2^T = U_W Λ V_W^T.    (B.19)

Hence, the optimal solutions are W_1 = U_W = I and W_2 = V_W = I up to a unitary transformation that accounts for the sign and rotational ambiguities of the singular vectors of Σ. This means that A and D have the same singular vectors, i.e., U_A = U and V_A = V, and that ‖D − A‖_F^2 = ‖U(Σ − Λ)V^T‖_F^2 = ‖Σ − Λ‖_F^2. By substituting this expression for ‖D − A‖_F^2 into (B.15), we obtain

min Σ_{i∈I_1} (1 − (1/(2τ)) λ_i^{−2}) + (τ/2) Σ_{i∈I_2} λ_i^2 + (α/2) Σ_i (σ_i − λ_i)^2,    (B.20)

where I_1 = {i : λ_i > 1/√τ} and I_2 = {i : λ_i ≤ 1/√τ}.

It follows from the above equation that the optimal λ_i can be obtained independently for each σ_i by minimizing the ith term of the above summation, which is of the form φ(λ, σ) in (29). The first order derivative of φ is given by

∂φ/∂λ = α(λ − σ) + { (1/τ) λ^{−3}  if λ > 1/√τ;   τ λ  if λ ≤ 1/√τ }.    (B.21)

Therefore, the optimal λ's can be obtained as the solutions of the nonlinear equation σ = ψ(λ) in (28) that minimize (29). This completes the proof of Theorem 3. □

Theorem 4. There exists a σ* > 0 such that the solutions to (28) that minimize (29) can be computed as

λ = P_{α,τ}(σ) := { b_1(σ)  if σ ≤ σ*;   b_3(σ)  if σ > σ* },    (B.22)

where b_1(σ) = (α/(α+τ)) σ and b_3(σ) is the real root of the polynomial
p(λ) = λ^4 − σ λ^3 + 1/(ατ) = 0    (B.23)

that minimizes (29). If 3τ ≤ α, the solution for λ is unique and

σ* = ψ(1/√τ) = (α + τ)/(α√τ).    (B.24)

If 3τ > α, the solution for λ is unique, except when σ satisfies

φ(b_1(σ), σ) = φ(b_3(σ), σ),    (B.25)

and σ* must lie in the range

(4/3)(3/(ατ))^{1/4} < σ* < (α + τ)/(α√τ).    (B.26)

Proof. When 3τ ≤ α, the solution to σ = ψ(λ) that minimizes φ(λ, σ) is unique, as illustrated in Fig. 1(b). This is because

∂²φ/∂λ² = { α − (3/τ) λ^{−4}  if λ > 1/√τ;   α + τ  if λ ≤ 1/√τ }    (B.27)

is strictly positive, hence φ is a strictly convex function of λ. When λ ≤ 1/√τ, the unique solution to σ = ψ(λ) is given by

λ = b_1(σ) := (α/(α + τ)) σ.    (B.28)

From this equation we immediately conclude that σ* = (α + τ)/(α√τ). Now, if λ > 1/√τ, then λ must be one of the real solutions of the polynomial in (B.23). Since 3τ ≤ α, this polynomial has a unique real root with multiplicity 2, which we denote as b_3(σ).

When 3τ > α, we have λ = b_1(σ) if σ < h_1 := (4/3)(3/(ατ))^{1/4} and λ = b_3(σ) if σ > h_3 := (α + τ)/(α√τ), as illustrated in Fig. 1(c). However, if h_1 ≤ σ ≤ h_3 there could be up to three solutions for λ. The first candidate solution is b_1(σ). The remaining two candidate solutions b_2(σ) and b_3(σ) are the two real roots of the polynomial in (B.23), with b_2(σ) being the smallest and b_3(σ) being the largest root. The other two roots of p are complex. Out of the three candidate solutions, b_1(σ) and b_3(σ) correspond to a minimum and b_2(σ) corresponds to a maximum, because

b_1(σ) ≤ 1/√τ,  b_2(σ) < (3/(ατ))^{1/4}  and  b_3(σ) > (3/(ατ))^{1/4},    (B.29)

and so ∂²φ/∂λ² is positive for b_1, negative for b_2 and positive for b_3. Therefore, except when σ is such that (B.25) holds true, the solution to σ = ψ(λ) that minimizes (29) is unique and equal to

λ = { b_1(σ)  if φ(b_1(σ), σ) < φ(b_3(σ), σ);   b_3(σ)  if φ(b_1(σ), σ) > φ(b_3(σ), σ) }.    (B.30)

To show that λ can be obtained as in (B.22), we need to show that there exists a σ* > 0 such that φ(b_1(σ), σ) < φ(b_3(σ), σ) for σ < σ* and φ(b_1(σ), σ) > φ(b_3(σ), σ) for σ > σ*. Because of the intermediate value theorem, it is sufficient to show that

d/dσ [φ(b_1(σ), σ) − φ(b_3(σ), σ)] = 0 + α(σ − b_1) − 0 − α(σ − b_3) = α(b_3 − b_1) > 0.    (B.31)

Now, notice from Fig. 1(c) that when σ < h_1 the optimal solution is λ = b_1. When σ = h_1, b_1 = (4α/(3(α + τ)))(3/(ατ))^{1/4} is a minimum and b_2 = b_3 = (3/(ατ))^{1/4} is an inflection point, thus the optimal solution is λ = b_1.¹ When σ > h_3, the optimal solution is b_3. Finally, when σ = h_3, b_1 = b_2 = 1/√τ is a maximum and b_3 is a minimum, thus the optimal solution is λ = b_3. Therefore, the threshold for σ must lie in the range

(4/3)(3/(ατ))^{1/4} < σ* < (α + τ)/(α√τ).    (B.32)  □

¹ … which follows from the fact that 3τ > α.

References

Agarwal, P., Mustafa, N., 2004. k-Means projective clustering. In: ACM Symposium on Principles of Database Systems.
Basri, R., Jacobs, D., 2003. Lambertian reflection and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (3), 218–233.
Boult, T., Brown, L., 1991. Factorization-based segmentation of motions. In: IEEE Workshop on Motion Understanding, pp. 179–186.
Bradley, P.S., Mangasarian, O.L., 2000. k-plane clustering. Journal of Global Optimization 16 (1), 23–32.
Cai, J.-F., Candès, E.J., Shen, Z., 2008. A singular value thresholding algorithm for matrix completion. SIAM Journal of Optimization 20 (4), 1956–1982.
Candès, E., Li, X., Ma, Y., Wright, J., 2011. Robust principal component analysis? Journal of the ACM 58 (3).
Chen, G., Lerman, G., 2009. Spectral curvature clustering (SCC). International Journal of Computer Vision 81 (3), 317–330.
Costeira, J., Kanade, T., 1998. A multibody factorization method for independently moving objects. International Journal of Computer Vision 29 (3), 159–179.
Elhamifar, E., Vidal, R., 2009. Sparse subspace clustering. In: IEEE Conference on Computer Vision and Pattern Recognition.
Elhamifar, E., Vidal, R., 2010. Clustering disjoint subspaces via sparse representation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing.
Elhamifar, E., Vidal, R., 2013. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Favaro, P., Vidal, R., Ravichandran, A., 2011. A closed form solution to robust subspace estimation and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition.
Gear, C.W., 1998. Multibody grouping from motion images. International Journal of Computer Vision 29 (2), 133–150.
Goh, A., Vidal, R., 2007. Segmenting motions of different types by unsupervised manifold clustering. In: IEEE Conference on Computer Vision and Pattern Recognition.
Gruber, A., Weiss, Y., 2004. Multibody factorization with uncertainty and missing data using the EM algorithm. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. I, pp. 707–714.
Ho, J., Yang, M.H., Lim, J., Lee, K., Kriegman, D., 2003. Clustering appearances of objects under varying illumination conditions. In: IEEE Conference on Computer Vision and Pattern Recognition.
Hong, W., Wright, J., Huang, K., Ma, Y., 2006. Multi-scale hybrid linear models for lossy image representation. IEEE Transactions on Image Processing 15 (12), 3655–3671.
Lauer, F., Schnörr, C., 2009. Spectral clustering of linear subspaces for motion segmentation. In: IEEE International Conference on Computer Vision.
Lee, K.-C., Ho, J., Kriegman, D., 2005. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (5), 684–698.
Lin, Z., Chen, M., Wu, L., Ma, Y., 2011. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv:1009.5055v2.
Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., Ma, Y., 2011. Robust recovery of subspace structures by low-rank representation. http://arxiv.org/pdf/1010.2955v1.
Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., Ma, Y., 2012. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Liu, G., Lin, Z., Yu, Y., 2010. Robust subspace segmentation by low-rank representation. In: International Conference on Machine Learning.
Lu, L., Vidal, R., 2006. Combined central and subspace clustering on computer vision applications. In: International Conference on Machine Learning, pp. 593–600.
Ma, Y., Derksen, H., Hong, W., Wright, J., 2007. Segmentation of multivariate mixed data via lossy coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (9), 1546–1562.
Mirsky, L., 1975. A trace inequality of John von Neumann. Monatshefte für Mathematik 79, 303–306.
Parikh, N., Boyd, S., 2013. Proximal Algorithms. Now Publishers.
Rao, S., Tron, R., Ma, Y., Vidal, R., 2008. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In: IEEE Conference on Computer Vision and Pattern Recognition.
Rao, S., Tron, R., Vidal, R., Ma, Y., 2010. Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (10), 1832–1845.
Recht, B., Fazel, M., Parrilo, P., 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52 (3), 471–501.
Soltanolkotabi, M., Elhamifar, E., Candès, E., 2013. Robust subspace clustering. http://arxiv.org/abs/1301.2603.
Sugaya, Y., Kanatani, K., 2004. Geometric structure of degeneracy for multi-body motion segmentation. In: Workshop on Statistical Methods in Video Processing.
Tipping, M., Bishop, C., 1999. Mixtures of probabilistic principal component analyzers. Neural Computation 11 (2), 443–482.
Tomasi, C., Kanade, T., 1992. Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision 9, 137–154.
Tron, R., Vidal, R., 2007. A benchmark for the comparison of 3-D motion segmentation algorithms. In: IEEE Conference on Computer Vision and Pattern Recognition.
Tseng, P., 2000. Nearest q-flat to m points. Journal of Optimization Theory and Applications 105 (1), 249–252.
Tseng, P., 2001. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109 (3), 475–494.
Vidal, R., 2011. Subspace clustering. IEEE Signal Processing Magazine 28 (3), 52–68.
Vidal, R., Ma, Y., Piazzi, J., 2004. A new GPCA algorithm for clustering subspaces by fitting, differentiating and dividing polynomials. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. I, pp. 510–517.
Vidal, R., Ma, Y., Sastry, S., 2003a. Generalized Principal Component Analysis (GPCA). In: IEEE Conference on Computer Vision and Pattern Recognition, vol. I, pp. 621–628.
Vidal, R., Ma, Y., Sastry, S., 2005. Generalized Principal Component Analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (12), 1–15.
Vidal, R., Soatto, S., Ma, Y., Sastry, S., 2003b. An algebraic geometric approach to the identification of a class of linear hybrid systems. In: Conference on Decision and Control, pp. 167–172.
Vidal, R., Tron, R., Hartley, R., 2008. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. International Journal of Computer Vision 79 (1), 85–105.
von Luxburg, U., 2007. A tutorial on spectral clustering. Statistics and Computing 17.
Yan, J., Pollefeys, M., 2006. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In: European Conference on Computer Vision, pp. 94–106.
Yang, A., Wright, J., Ma, Y., Sastry, S., 2008. Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding 110 (2), 212–225.
Yang, A.Y., Rao, S., Ma, Y., 2006. Robust statistical estimation and segmentation of multiple subspaces. In: Workshop on 25 Years of RANSAC.
Zhang, T., Szlam, A., Lerman, G., 2009. Median k-flats for hybrid linear modeling with many outliers. In: Workshop on Subspace Methods.
Zhang, T., Szlam, A., Wang, Y., Lerman, G., 2010. Hybrid linear modeling via local best-fit flats. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1927–1934.