Fusion Subspace Clustering For Incomplete Data

Abstract—This paper introduces fusion subspace clustering, a novel method to learn low-dimensional structures that approximate large-scale yet highly incomplete data. The main idea is to assign each datum to a subspace of its own, and minimize the distance between the subspaces of all data, so that subspaces of the same cluster get fused together. Our method allows low, high, and even full-rank data; it directly accounts for noise, and its sample complexity approaches the information-theoretic limit. In addition, our approach provides a natural model selection clusterpath, and a direct completion method. We give convergence guarantees, analyze computational complexity, and show through extensive experiments on real and synthetic data that our approach performs comparably to the state-of-the-art with complete data, and dramatically better if data is missing.

I. INTRODUCTION

Inferring low-dimensional structures that explain high-dimensional data has become a cornerstone of discovery in virtually all fields of science. Principal component analysis (PCA), which identifies the low-dimensional linear subspace that best explains a dataset, is arguably the most prominent technique for this purpose. In many applications — computer vision, image processing, bioinformatics, linguistics, network analysis, and more [1]–[10] — data is often composed of a mixture of several classes, each of which can be explained by a different subspace. Clustering accordingly is an important unsupervised learning problem that has received tremendous attention in recent years, producing theory and algorithms to handle outliers, noisy measurements, privacy concerns, and data constraints, among other difficulties [11]–[31].

However, one major contemporary challenge is that data is often incomplete. For example, in image inpainting, the values of some pixels are missing due to faulty sensors and image contamination [32]; in computer vision, features are often missing due to occlusions and tracking algorithm malfunctions [33]; in recommender systems, each user only rates a limited number of items [34]; in a network, most nodes communicate in subsets, producing only a handful of all the possible measurements [7].

Missing data notoriously complicates clustering. The main difficulty with highly incomplete data is that subsets of points are rarely observed in overlapping coordinates, which impedes assessing distances. Existing self-expressive formulations [9], [35], agglomerative strategies [25], and partial neighborhoods [39] all require observing O(r + 1) overlapping coordinates in at least K sets of O(r + 1) points in order to cluster K r-dimensional subspaces. In low-sampling regimes, this would require a super-polynomial number of points [39], which are rarely available in practice. Alternatively, filling missing entries with a sensible value (e.g., zeros or means [35], or using low-rank matrix completion [36]) may work if data is missing at a rate inversely proportional to the subspaces' dimensions [37], or if data is low-rank. However, in most applications data is missing at much higher rates, and due to the number and dimensions of the subspaces, data is typically high or even full-rank. In general, data filled with zeros or means no longer lies in a union of subspaces (UoS), thus guaranteeing failure even with a modest amount of missing data [38]. Other approaches include alternating methods like k-subspaces [40], expectation-maximization [41], group-lasso [42], and lifting techniques [38], [43], [44] that require (at the very least) squaring the dimension of an already high-dimensional problem, which severely limits their applicability. More recently, methods like [45], [46] incorporate a variation of fuzzy c-means for data imputation and/or clustering. However, these existing approaches either have limited applicability or do not perform well if data is missing in large quantities [50]. For example, k-nearest neighbors imputation can distort the data distribution, resulting in inaccurate nearest neighbor identification [47]. Regression methods can also lead to low accuracy, especially if the underlying variables have low correlation. Existing approaches, along with their weaknesses, are compared in [50]. These challenges call attention to new strategies to address missing data.

This paper introduces fusion subspace clustering (FSC), a novel approach to address incomplete data, inspired by greedy methods, convex relaxations, and fusion penalties [59]–[64]. The main idea is to assign each datum to a subspace of its own, and then fuse together nearby subspaces by minimizing (i) the distance between each datum and its subspace (thus guaranteeing that each datum is explained by its subspace), and (ii) the distance between the subspaces of all data, so that subspaces from points that belong together get fused into one. While FSC is mainly motivated by missing data, it is also new to full-data, and has the following advantages: it allows low, high, and even full-rank data; it directly allows noise; and its sample complexity approaches the information-theoretic limit [65], as shown in Sections VII-A4 and VII-A5. Similar to hierarchical clustering, FSC can produce a model selection clusterpath providing detailed information about intra-cluster and cluster-to-cluster distances (see Figure 2). Finally, its simplicity makes FSC amenable to analysis: our main theoretical result shows that FSC converges to a local minimum. This is particularly remarkable in light of the fact that most other subspace clustering algorithms lack theoretical guarantees (even local convergence) when data is missing (except for restrictive fractions of missing entries, and liftings, which are unfeasible for high-dimensional data). Our experiments on real and synthetic data show that with full-data, FSC performs comparably to the state-of-the-art, and dramatically better if data is missing.

Notice that if λ ≥ n(n − 1)/2 (the number of distinct pairs of points), then the problem is unconstrained, and a trivial solution is Ui formed by xi and any other r − 1 vectors (in fact this is precisely our choice for initialization, with the additional r − 1 vectors populated with i.i.d. N(0, 1) entries, known to produce incoherent and nearly orthogonal subspaces with high probability [51]). If λ = n(n − 1)/2 − 1, then (1) forces two subspaces to fuse, similar to the first step in hierarchical clustering. More generally, if λ = n(n − 1)/2 − ℓ, then (1) forces ℓ − 1 subspaces to fuse. However, (1) is a combinatorial problem, so we propose the relaxation in (2).
The first term in (2) keeps each subspace close to the column assigned to it, while the second term forces subspaces from different columns to get closer, even if they no longer contain exactly their assigned columns. As λ grows, subspaces get closer and closer, up to the point where some subspaces fuse into one. This is verified in our experiments (see Figure 3). The extreme case (λ = ∞) forces all subspaces to fuse into one (to attain zero in the second term), meaning we only have one subspace to explain all data, which is precisely PCA (for full-data) and LRMC (for incomplete data). In other words, FSC is a generalization of PCA and LRMC, which is the particular case of (2) with λ = ∞. Iteratively decreasing λ will result in more and more clusters, until λ = 0 produces n clusters. The more subspaces, the more accuracy, but the more degrees of freedom (overfitting). For each λ that provides a different clustering, we can compute a goodness-of-fit test (like the Akaike information criterion, AIC [72]) that quantifies the tradeoff between accuracy and degrees of freedom, to determine the best number of subspaces K. For example, this test can be in terms of K and the residuals of the projections of each x_i^Ω onto its corresponding Û_k^Ω, as defined in Section III-A. Similarly, we can iteratively increase r to find all the columns that lie in 1-dimensional subspaces, then all the columns that lie in 2-dimensional subspaces, and so on (pruning the data at each iteration). This will result in an estimate of the number of subspaces K, and their dimensions.

The Subspace Clusterpath. Notice that iteratively increasing λ also provides a natural way to quantify and visualize intra-cluster and outer-cluster similarities through a graph showing the evolution of subspaces as they fuse together, similar to the clusterpath produced in [62] for Euclidean clustering (Figure 2). Notice, however, that fusion is not necessarily monotonic, i.e., fused subspaces may split, so in general this graph may be a network, rather than a tree.
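As an illustration of the model selection over λ described above, the sketch below sweeps λ, records the number of clusters produced at each value, and scores each clustering with an AIC-style criterion. The helper callables run_fsc and fit_residual are hypothetical placeholders for the clustering routine and the projection residuals, and the exact form of the criterion is an assumption, not the paper's.

```python
import numpy as np

def aic_score(residual_sq, num_clusters, r, d, n):
    """AIC-style tradeoff (an assumed form): data fit plus a penalty
    proportional to the degrees of freedom of K r-dimensional subspaces."""
    dof = num_clusters * r * (d - r)        # rough parameter count of K subspaces
    return n * np.log(residual_sq / n + 1e-12) + 2 * dof

def select_model(X, lambdas, r, run_fsc, fit_residual):
    """Sweep lambda, score each resulting clustering, keep the best one.
    run_fsc(X, lam, r) -> labels; fit_residual(X, labels, r) -> sum of squared
    projection residuals. Both are hypothetical stand-ins for the FSC pipeline."""
    d, n = X.shape
    best = None
    for lam in lambdas:
        labels = run_fsc(X, lam, r)
        K = len(set(labels))
        score = aic_score(fit_residual(X, labels, r), K, r, d, n)
        if best is None or score < best[0]:
            best = (score, lam, K, labels)
    return best  # (score, lambda, number of clusters, labels)
```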
Fig. 2: Left: Clusterpath showing how subspace estimates (corresponding to 32 data points in 4 Gaussian subspaces, 8 each) progressively fuse as λ increases. Center: Distribution of Gaussian kernels in the Yale B dataset with γ = 1; notice that most outer-cluster points receive small values and vice versa, as desired. Right: Similarity matrices produced by FSC on the Yale dataset with γ = 0 (uniform weights, producing a poor clustering because subspaces are not well-separated), and with γ = 1 and Gaussian kernel weights restricted to κ nearest neighbors (producing a near perfect clustering).

VI. PENALTY WEIGHTS, COMPUTATIONAL COMPLEXITY, AND PARAMETERS

Like other fusion formulations [53]–[57], FSC involves weight terms wij that bring the flexibility to distinguish which subspaces to fuse, and which ones not to, which in turn can dramatically improve the clustering quality and computational complexity [73], [74]. Ideally, we would like wij to be large if xi and xj lie in the same subspace, so that the penalty ρ(Ui, Uj) gets a higher weight, forcing subspaces Ui and Uj to fuse into one. Conversely, if xi and xj lie in different subspaces, we want wij to be small, so that the penalty ρ(Ui, Uj) gets ignored, and subspaces Ui and Uj do not fuse.

Here we use wij = √(rd) · 1κ · exp(−γρ²(Ui, Uj)), where the indicator 1κ takes the value one if j is amongst the κ nearest neighbors of i or vice versa, and zero otherwise. Here the factor √(rd) ensures that the penalty is on the order of the number of degrees of freedom (the same rescaling is used in [54], [55]), and the second factor is a Gaussian kernel that slows the fusion of distant subspaces [73], [74], where γ ≥ 0 regulates how separated subspaces are. In particular, γ = 0 corresponds to uniform weights (wij = 1 for every (i, j)), known to be a good option if subspaces are well-separated, and to produce no splits in the clusterpath of Euclidean clustering (Theorem 1 in [62]). More generally, with γ > 0 the second factor measures the distance between subspaces Ui and Uj, such that if Ui and Uj are close, then ρ(Ui, Uj) will be small, resulting in a large value of exp(−γρ²(Ui, Uj)) (and vice versa), as desired. Figure 2 shows the distribution of these Gaussian kernels on the Yale B dataset (which are mostly small for outer-cluster points, and mostly large for intra-cluster points, as desired), together with the similarity matrices produced by two options (γ = 0 and γ = 1).

We point out that besides improving clustering quality, limiting positive weights to nearest neighbors (NNs) improves the computational complexity of FSC. To see this, first notice that finding NNs requires n² linear operations to compute pairwise distances. However, these calculations are negligible in comparison with the polynomial operations required to compute n² gradients (which require matrix inversions). Consequently, by limiting positive weights to κ NNs we cut down these n² polynomial calculations to κn, thus reducing the effective computational complexity of FSC to O(drκn) and achieving linear complexity in problem size (see Sections 3.3 and 5.1 in [73]). This, of course, comes at a price: increasing the effective number of parameters of FSC to a total of four: λ, which controls how much subspaces fuse (see Section V); γ and κ, which together determine the weights wij that control which subspaces fuse; and the gradient step size, which can be tuned with standard techniques; in our experiments we used 5-fold cross-validation.
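To make the weight construction concrete, here is a minimal numpy sketch of weights of the form wij = √(rd) · 1κ · exp(−γρ²(Ui, Uj)). The use of the projection (chordal) distance for ρ and of a distance-based κ-NN rule are illustrative assumptions; the paper's exact choices may differ.

```python
import numpy as np

def pairwise_weights(subspaces, r, d, gamma=1.0, kappa=5):
    """Sketch of FSC-style penalty weights:
    w_ij = sqrt(r*d) * 1_kappa(i, j) * exp(-gamma * rho(U_i, U_j)**2),
    where 1_kappa(i, j) = 1 if j is among the kappa nearest neighbors of i
    (or vice versa) under rho, and rho is taken to be the chordal distance
    between column spans (an illustrative assumption)."""
    n = len(subspaces)
    projectors = [U @ np.linalg.pinv(U) for U in subspaces]
    rho = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            rho[i, j] = rho[j, i] = np.linalg.norm(projectors[i] - projectors[j])

    # kappa-nearest-neighbor indicator, symmetrized: i~j if either is a NN of the other.
    nn_mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        order = np.argsort(rho[i])
        neighbors = [j for j in order if j != i][:kappa]
        nn_mask[i, neighbors] = True
    nn_mask = nn_mask | nn_mask.T

    W = np.sqrt(r * d) * nn_mask * np.exp(-gamma * rho ** 2)
    np.fill_diagonal(W, 0.0)
    return W
```

Setting γ = 0 (and κ = n − 1) recovers the uniform-weight case described above, up to the √(rd) scaling.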
VII. EXPERIMENTS

Now we study the performance of FSC. For reference, we compare against the following subspace clustering algorithms that allow missing data: (a) Entry-wise zero-filling SSC [35]. (b) LRMC + SSC [35]. (c) SSC-Lifting [38]. (d) Algebraic variety high-rank matrix completion [43]. (e) k-subspaces with missing data [40]. (f) EM [41]. (g) Group-sparse subspace clustering [42]. (h) Mixture subspace clustering [42]. We chose these algorithms based on [38], [42], where they show comparable state-of-the-art performance. (a)-(d) essentially run SSC after filling missing entries (with zeros or according to a single larger subspace) or after lifting the data. (e)-(h) are alternating algorithms that according to [42] produce best results when initialized with the output of (a), and so indirectly they also depend on SSC. To measure performance we compute the clustering error (fraction of misclassified points). When applicable (i.e., when no data is missing) we additionally compare against the following full-data approaches: BDR [30], iPursuit [28], LRR [11], [13], LSR [24], LRSC [23], L2Graph [26], SCC [6], SSC [9], and S3C [29]. In the interest of reproducibility, all our code is included in the supplementary material. In the interest of fairness to other algorithms, whenever available we (a) used their code, (b) used their specified parameters, (c) did a sweep to find the best parameters, and (d) used reported results from the literature. Whenever there was a discrepancy, we reported their best performance, be that from reports or from our experiments.
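The clustering error above is the fraction of misclassified points after matching estimated and true cluster labels; a standard way to compute it (an illustrative sketch, not necessarily the paper's implementation) uses the Hungarian algorithm to find the best label matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(true_labels, pred_labels):
    """Fraction of misclassified points under the best one-to-one matching
    between predicted and true cluster labels (Hungarian algorithm)."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    true_ids = np.unique(true_labels)
    pred_ids = np.unique(pred_labels)
    # Confusion matrix: counts of points with true label t and predicted label p.
    C = np.zeros((len(true_ids), len(pred_ids)), dtype=int)
    for a, t in enumerate(true_ids):
        for b, p in enumerate(pred_ids):
            C[a, b] = np.sum((true_labels == t) & (pred_labels == p))
    row, col = linear_sum_assignment(-C)        # maximize correctly matched points
    return 1.0 - C[row, col].sum() / len(true_labels)
```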
A. Simulations

Since FSC is an entirely new approach to both full and incomplete data, we present a thorough series of experiments to study its behavior as a function of the penalty parameter λ, the ambient dimension d, the number of subspaces involved K, their dimensions r, the noise variance σ², the number of data points in each cluster nk, and of course, the fraction of unobserved entries p. Unless otherwise stated, we use the following default settings: d = 100, K = 4, r = 5, σ = 0, nk = 20, and p = 0. We run 30 trials of each experiment, and show the average results of FSC and all the algorithms above.

In all our simulations we first generate K matrices U*_k ∈ R^{d×r} with i.i.d. N(0, 1) entries, to use as bases of the true subspaces. For each k we generate a matrix Θ*_k ∈ R^{r×nk}, also with i.i.d. N(0, 1) entries, to use as coefficients of the columns in the kth subspace. We then form X as the concatenation [U*_1Θ*_1  U*_2Θ*_2  ···  U*_KΘ*_K], plus a d × n noise matrix with i.i.d. N(0, σ²) entries. To create Ω, we sample each entry independently with probability 1 − p.
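The synthetic data described above can be generated with a few lines of numpy; this sketch follows the stated recipe (Gaussian bases, Gaussian coefficients, additive noise, and an observation mask Ω sampled entrywise with probability 1 − p), with function and variable names chosen here for illustration.

```python
import numpy as np

def generate_uos_data(d=100, K=4, r=5, nk=20, sigma=0.0, p=0.0, rng=None):
    """Union-of-subspaces data as in the simulations: X = [U1*Th1* ... UK*ThK*] + noise,
    with an observation mask Omega sampled entrywise with probability 1 - p."""
    rng = np.random.default_rng(rng)
    blocks, labels = [], []
    for k in range(K):
        U_star = rng.standard_normal((d, r))        # basis of the k-th true subspace
        Theta_star = rng.standard_normal((r, nk))   # coefficients of its nk columns
        blocks.append(U_star @ Theta_star)
        labels += [k] * nk
    X = np.hstack(blocks) + sigma * rng.standard_normal((d, K * nk))
    Omega = rng.random((d, K * nk)) < (1 - p)       # True = observed entry
    return X, Omega, np.array(labels)

# Example: default settings with 90% of the entries missing.
X, Omega, labels = generate_uos_data(p=0.9, rng=0)
```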
1) Effect of the penalty parameter: We first study the number of clusters obtained by FSC as a function of λ, with the default settings above. Figure 3 shows, consistent with our discussion in Section V, that if λ = 0, FSC assigns each point to its own cluster. As λ increases, subspaces start fusing together, up to the point where if λ is too large, FSC fuses all subspaces into one, and all data gets clustered together. Next we study performance. Figure 3 shows that there is a wide range of values of λ that produce low error, showing that FSC is quite stable. Note that the error increases if λ is too small or too large. This is consistent with our previous experiment, showing that extreme values of λ produce too few or too many clusters.

2) Effect of dimensionality: It is well-documented that data in lower-dimensional subspaces are easier to cluster [38], [39], [42]–[44], [65]. In the extreme case, clustering 1-dimensional subspaces requires a simple co-linearity test, and is theoretically possible with as little as 2 samples per column [65]. In contrast, no existing algorithm can successfully cluster (d − 1)-dimensional subspaces (hyperplanes), which is actually impossible even if one entry per column is missing [65]. Of course, subspaces' dimensionality is relative to the ambient dimension: a 10-dimensional subspace is a hyperplane in R^11, but low-dimensional in R^1000. In this experiment we test FSC as a function of the low-dimensionality of the subspaces, i.e., the gap between the ambient dimension d and the subspaces' dimension r. First we fix d = 100 and compute error as a function of r. As r grows, the subspace becomes higher and higher-dimensional. Then we turn things around, fixing r = 5 and varying d. As d grows, this subspace becomes lower and lower-dimensional. Figure 3 shows that FSC is more sensitive to high-dimensionality than the state-of-the-art. However, pay attention to the scale: even in the worst case (r = 30), the gap between FSC and the state-of-the-art is around 10%.

[Figure 3: panels of results on simulations and real data (Yale, Hopkins). The x-axes span the parameter λ, the number of subspaces, the subspaces' dimension, the ambient dimension, the columns per subspace, the number of subjects/objects, and the fraction of missing data; the y-axes show the number of produced clusters and the clustering error. Compared methods include SSC(EWZF), SSC(LRMC), Lift, EM, GSSC, Mixture, BDR, iPursuit, LRR, LSR, LRSC, L2Graph, SCC, SSC, and S3C.]

Fig. 3: Top-left corner: Number of clusters obtained by FSC as a function of the parameter λ in (2). Rest: Clustering error of FSC and other baseline algorithms. Notice the different scales. With full-data FSC is rarely and barely outperformed by other algorithms. In contrast, when data is missing, FSC outperforms other algorithms by a wide margin. For example, in simulations with nk = 20 (resp. the Yale dataset) and p = 0.9, FSC achieves 7.5% error (resp. 25.7%), while the next best algorithm achieves 71.25% (resp. 64.79%). We point out that some curves are "missing" from some plots because some methods are not applicable; e.g., SCC cannot handle missing data, and Lift cannot handle large dimensions.
3) Effect of noise: Figure 3 shows that FSC performs as well as or better than the state-of-the-art at different noise levels. Recall that λ quantifies the tradeoff between how accurately we want to represent each point xi (the first term in (2)) and how close subspaces from different points will be (second term), which in turn determines how subspaces fuse together, or equivalently, how many subspaces we will obtain. If data is completely noiseless, we expect to represent each point very accurately, and so we can use a smaller λ (giving more weight to the first term). On the other hand, if data is noisy, we expect to represent each point within the noise level, and so we can use a larger λ. As a rule of thumb, we can use λ proportional to the noise level σ.

4) Effect of the number of subspaces and data points: Figure 3 shows that FSC is quite robust to the number of subspaces. Recall that in our default settings r = 5, so K ≥ 20 produces a full-rank data matrix X. Figure 3 also evaluates the performance of FSC as a function of the columns per subspace nk. Since r = 5, nk = 6 is information-theoretically necessary for subspace clustering; we conclude that FSC only requires little more than that to perform as well as the state-of-the-art.

5) Effect of missing data: There is a tradeoff between the number of columns per subspace nk and the sampling rate (1 − p) required for subspace clustering [65]. The larger nk, the higher p may be, and vice versa. Figure 3 evaluates the performance of FSC as a function of p with nk = 20, 50 (few and many columns). Notice that if p ≈ 0 (few missing data), then FSC performs as well as the state-of-the-art, and much better as p increases (many missing data); see for example nk = 20 and p = 0.9, where the best alternative algorithm gets 71.25% error, which is close to random guessing (because there are K = 4 subspaces in our default settings). In contrast, FSC gets 7.5% error. Notice that p = 0.9 is very close to the exact information-theoretic minimum sampling rate p = 1 − (r + 1)/d = 0.94 [65]. Similar to noise, if there is much missing data the first term in (2) will carry less weight, which we can compensate for by making λ smaller.
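The information-theoretic limit quoted above is easy to check numerically: for r-dimensional subspaces in ambient dimension d, the quoted threshold on the fraction of missing entries is 1 − (r + 1)/d. A small sanity check for the default simulation setting:

```python
# Sanity check of the quoted missing-data limit p = 1 - (r + 1) / d.
def missing_data_limit(r, d):
    """Largest missing-data fraction quoted in the text, 1 - (r + 1)/d."""
    return 1 - (r + 1) / d

print(missing_data_limit(r=5, d=100))   # 0.94, the value quoted for the simulations
```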
B. Real Data Experiments

1) Face Clustering: It has been shown that the vectorized images of the same person under different illuminations lie near a 9-dimensional subspace [4]. In this experiment we evaluate the performance of FSC at clustering faces of multiple individuals, using the Yale B dataset [75], containing a total of 2432 images, each of size 48 × 42, evenly distributed among 38 individuals. To compare things vis-à-vis, before clustering we use robust PCA [76] on each cluster to remove outliers; this is a widely used preprocessing step [9], [35], [38], [41]. In each of 30 trials, we select K people uniformly at random, and record the clustering error. Figure 3 shows that FSC is very competitive and there is only a negligible gap between FSC and the best alternative algorithm. Figure 3 also shows the average clustering error as a function of the amount of missing data (induced uniformly at random), with K fixed to 6 people. Consistent with our simulations, FSC outperforms the state-of-the-art in the low-sampling regime (many missing data). For example, with p = 0.9 FSC gets 25.7% error, while the next best algorithm gets 64.79%. Note that p = 0.9 is quite close to the exact information-theoretic limit p = 1 − (r + 1)/d = 0.995 [65].

2) Motion Segmentation: It is well-known that the locations over time of a rigidly moving object approximately lie in a 3-dimensional affine subspace [2], [3] (which can be thought of as a 4-dimensional linear subspace whose fourth component accounts for the offset); a minimal sketch of this lifting is given below. Hence, by tracking points in a video, and subspace clustering them, we can segment the multiple moving objects appearing in the video. In this experiment we test FSC on this task, using the Hopkins 155 dataset [77], containing sequences of points tracked over time in 155 videos. Each video contains K = 2, 3 objects. On average, each object is tracked at nk = 133 points (described by two coordinates) over 29 frames, producing vectors in ambient dimension d = 58. Figure 3 shows the results. With full-data FSC is far from the best, but has performance comparable to the rest of the algorithms. However, when data is missing, we again see that FSC dramatically outperforms the rest of the algorithms. Figure 3 shows the average results over all videos when missing data is induced uniformly at random. For example, with p = 0.9, the best baseline algorithm gets 52.95% error. In contrast, FSC achieves 15.03% error. Notice that p = 0.9 is very close to the exact information-theoretic minimum sampling rate p = 1 − (r + 1)/d = 0.914 [65].
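The affine-to-linear viewpoint mentioned above is the standard homogeneous-coordinate trick: appending a constant coordinate to columns lying in a 3-dimensional affine subspace yields columns lying in a 4-dimensional linear subspace. The sketch below illustrates this on synthetic trajectories; the construction and dimensions follow the description in the text and are not the Hopkins preprocessing code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one rigid object: 2 coordinates over 29 frames (d = 58),
# with columns lying in a 3-dimensional affine subspace (basis B, offset c).
d, n_points = 58, 133
B = rng.standard_normal((d, 3))                  # affine basis
c = rng.standard_normal((d, 1))                  # affine offset
coeffs = rng.standard_normal((3, n_points))
X = B @ coeffs + c                               # columns in a 3-dim affine subspace

# Homogeneous lift: append a constant 1 to every column.
X_lifted = np.vstack([X, np.ones((1, n_points))])

# The lifted columns span a 4-dimensional *linear* subspace:
print(np.linalg.matrix_rank(X - X[:, [0]]))      # 3 (affine dimension)
print(np.linalg.matrix_rank(X_lifted))           # 4 (linear dimension after lifting)
```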
3) Handwritten Digits Clustering: As a last experiment we use FSC to cluster vectorized images of handwritten digits, known to be well-approximated by 12-dimensional subspaces [79]. For this purpose we use the MNIST dataset [80], containing thousands of grayscale 28 × 28 images. First we test FSC as a function of the number of digits (subspaces) in the mix. Following common practice, for each K = 2, 3, . . . , 10 we randomly selected K digits, nk = 50 images per digit, and aimed to cluster the images. Figure 3 shows the average results of 10 independent trials of each configuration. Consistent with our previous experiments, if no data is missing, FSC performs comparably to the rest of the algorithms, with a gap (in the worst cases) no larger than 5%. However, as soon as we induce missing data (uniformly at random), FSC starts outperforming all other methods by a huge margin (up to 40%). Figure 3 shows the results of this experiment.

BROADER IMPACT

This paper introduces a novel strategy to address missing data in subspace clustering, which enables clustering and completion in regimes where other methods fail. Practitioners in computer vision, recommender systems, network inference, and data science in general can use our new method. We expect this paper to motivate the learning community to explore new directions that stem from this initial work. These include the investigation of (i) ADMM and AMA formulations of (2) that reduce computational complexity (as in [73] for Euclidean clustering), (ii) optimal initializations, and greedy, adaptive, data-driven, and outlier-robust variants, and (iii) geodesics on the Stiefel and Grassmann manifolds (similar to [78] for subspace tracking) to avoid the inversion of the term U_i^T U_i in P_i, which may become ill-conditioned. Ultimately, we hope this publication spurs discussions and insights that lead to better methods and a better understanding of subspace clustering when data is missing.

REFERENCES

[1] T. Hastie and P. Simard, Metrics and models for handwritten character recognition, Statistical Science, 1998.
[2] C. Tomasi and T. Kanade, Shape and motion from image streams under orthography, International Journal of Computer Vision, 1992.
[3] K. Kanatani, Motion segmentation by subspace separation and model selection, IEEE International Conference on Computer Vision, 2001.
[4] R. Basri and D. Jacobs, Lambertian reflection and linear subspaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.
[5] J. Rennie and N. Srebro, Fast maximum margin matrix factorization for collaborative prediction, International Conference on Machine Learning, 2005.
[6] G. Chen and G. Lerman, Spectral curvature clustering (SCC), International Journal of Computer Vision, 2009.
[7] B. Eriksson, P. Barford, J. Sommers and R. Nowak, DomainImpute: Inferring unseen components in the Internet, IEEE INFOCOM Mini-Conference, 2011.
[8] A. Zhang, N. Fawaz, S. Ioannidis and A. Montanari, Guess who rated this movie: Identifying users through subspace clustering, Uncertainty in Artificial Intelligence, 2012.
[9] E. Elhamifar and R. Vidal, Sparse subspace clustering: Algorithm, theory, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[10] G. Mateos and K. Rajawat, Dynamic network cartography: Advances in network health monitoring, IEEE Signal Processing Magazine, 2013.
[11] G. Liu, Z. Lin and Y. Yu, Robust subspace segmentation by low-rank representation, International Conference on Machine Learning, 2010.
[12] R. Vidal, Subspace clustering, IEEE Signal Processing Magazine, 2011.
[13] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu and Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[14] M. Soltanolkotabi, E. Elhamifar and E. Candès, Robust subspace clustering, Annals of Statistics, 2014.
[15] C. Qu and H. Xu, Subspace clustering with irrelevant features via robust Dantzig selector, Advances in Neural Information Processing Systems, 2015.
[16] X. Peng, Z. Yi and H. Tang, Robust subspace clustering via thresholding ridge regression, AAAI Conference on Artificial Intelligence, 2015.
[17] Y. Wang and H. Xu, Noisy sparse subspace clustering, International Conference on Machine Learning, 2013.
[18] Y. Wang, Y. Wang and A. Singh, Differentially private subspace clustering, Advances in Neural Information Processing Systems, 2015.
[19] H. Hu, J. Feng and J. Zhou, Exploiting unsupervised and supervised constraints for subspace clustering, IEEE Pattern Analysis and Machine Intelligence, 2015.
[20] C. You, D. Robinson and R. Vidal, Scalable sparse subspace clustering by orthogonal matching pursuit, IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[21] Y. Yang, J. Feng, N. Jojic, J. Yang and T. Huang, ℓ0-sparse subspace clustering, European Conference on Computer Vision, 2016.
[22] B. Xin, Y. Wang, W. Gao and D. Wipf, Data-dependent sparsity for subspace clustering, Uncertainty in Artificial Intelligence, 2017.
[23] P. Favaro, R. Vidal and A. Ravichandran, A closed form solution to robust subspace estimation and clustering, IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[24] C. Lu, H. Min, Z. Zhao, L. Zhu, D. Huang and S. Yan, Robust and efficient subspace segmentation via least squares regression, European Conference on Computer Vision, 2012.
[25] D. Park, C. Caramanis and S. Sanghavi, Greedy subspace clustering, Advances in Neural Information Processing Systems, 2014.
[26] X. Peng, Z. Yu, Z. Yi and H. Tang, Constructing the L2-graph for robust subspace learning and subspace clustering, IEEE Transactions on Cybernetics, 2017.
[27] P. Ji, T. Zhang, H. Li, M. Salzmann and I. Reid, Deep subspace clustering networks, Advances in Neural Information Processing Systems, 2017.
[28] M. Rahmani and G. Atia, Innovation pursuit: a new approach to subspace clustering, IEEE Transactions on Signal Processing, 2017.
[29] C. Li, C. You and R. Vidal, Structured sparse subspace clustering: a joint affinity learning and subspace clustering framework, IEEE Transactions on Image Processing, 2017.
[30] C. Lu, J. Feng, Z. Lin, T. Mei and S. Yan, Subspace clustering by block diagonal representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[31] M. Yin, S. Xie, Z. Wu, Y. Zhang and J. Gao, Subspace clustering via learning an adaptive low-rank graph, IEEE Transactions on Image Processing, 2018.
[32] J. Mairal, F. Bach, J. Ponce and G. Sapiro, Online dictionary learning for sparse coding, International Conference on Machine Learning, 2009.
[33] R. Vidal, R. Tron and R. Hartley, Multiframe motion segmentation with missing data using Power Factorization and GPCA, International Journal of Computer Vision, 2008.
[34] D. Park, J. Neeman, J. Zhang, S. Sanghavi and I. Dhillon, Preference completion: Large-scale collaborative ranking from pairwise comparisons, International Conference on Machine Learning, 2015.
[35] C. Yang, D. Robinson and R. Vidal, Sparse subspace clustering with missing entries, International Conference on Machine Learning, 2015.
[36] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 2009.
[37] M. Tsakiris and R. Vidal, Theoretical analysis of sparse subspace clustering with missing entries, International Conference on Machine Learning, 2018.
[38] E. Elhamifar, High-rank matrix completion and clustering under self-expressive models, Neural Information Processing Systems, 2016.
[39] B. Eriksson, L. Balzano and R. Nowak, High-rank matrix completion and subspace clustering with missing data, Artificial Intelligence and Statistics, 2012.
[40] L. Balzano, R. Nowak, A. Szlam and B. Recht, k-Subspaces with missing data, IEEE Statistical Signal Processing, 2012.
[41] D. Pimentel-Alarcón, L. Balzano and R. Nowak, On the sample complexity of subspace clustering with missing data, IEEE Statistical Signal Processing, 2014.
[42] D. Pimentel-Alarcón, L. Balzano, R. Marcia, R. Nowak and R. Willett, Group-sparse subspace clustering with missing data, IEEE Statistical Signal Processing, 2016.
[43] G. Ongie, R. Willett, R. Nowak and L. Balzano, Algebraic variety models for high-rank matrix completion, International Conference on Machine Learning, 2017.
[44] D. Pimentel-Alarcón, G. Ongie, L. Balzano, R. Willett and R. Nowak, Low algebraic dimension matrix completion, Allerton Conference on Communication, Control, and Computing, 2017.
[45] Y. Song, M. Li, Z. Zhu, G. Yang and X. Luo, Non-negative latent factor analysis-incorporated and feature-weighted fuzzy double c-means clustering for incomplete data, IEEE Transactions on Fuzzy Systems.
[46] D. Li, H. Zhang, T. Li, A. Bouras, X. Yu and T. Wang, Hybrid missing value imputation algorithms using fuzzy c-means and vaguely quantified rough set, IEEE Transactions on Fuzzy Systems.
[47] L. Beretta and A. Santaniello, Nearest neighbor imputation algorithms: a critical evaluation, BMC Medical Informatics and Decision Making, 2016.
[48] P. Lodder, To impute or not impute, that's the question, in G.J. Mellenbergh and H.J. Adèr (eds.), Advising on Research Methods: Selected Topics 2013, Johannes van Kessel Publishing, 2014.
[49] A. Chaudhry et al., A method for improving imputation and prediction accuracy of highly seasonal univariate data with large periods of missingness, Wireless Communications and Mobile Computing, 2019.
[50] C. Lane, R. Boger, C. You, M. Tsakiris, B. Haeffele and R. Vidal, Classifying and comparing approaches to subspace clustering with missing data, IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019.
[51] E. Candès, Y. Eldar, D. Needell and P. Randall, Compressed sensing with coherent and redundant dictionaries, Applied and Computational Harmonic Analysis, 2011.
[52] K. Ye and L. Lim, Schubert varieties and distances between subspaces of different dimensions, SIAM Journal on Matrix Analysis and Applications, 2016.
[53] A. Antoniadis and J. Fan, Regularization of wavelet approximations (with discussion), Journal of the American Statistical Association, 2001.
[54] M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society, 2006.
[55] L. Meier, S. Van de Geer and P. Bühlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society, 2008.
[56] J. Friedman, T. Hastie and R. Tibshirani, A note on the group lasso and a sparse group lasso, arXiv preprint, 2010.
[57] M. Kshirsagar, E. Yang and A. Lozano, Learning task structure via sparsity grouped multitask learning, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017.
[58] L. Balzano, B. Recht and R. Nowak, High-dimensional matched subspace detection when data are missing, IEEE International Symposium on Information Theory, 2010.
[59] S. Land and J. Friedman, Variable fusion: a new method of adaptive signal regression, Technical Report, Department of Statistics, Stanford University, 1996.
[60] R. Tibshirani, S. Rosset, J. Zhu and K. Knight, Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society, 2005.
[61] X. Shen and H. Huang, Grouping pursuit through a regularization solution surface, Journal of the American Statistical Association, 2010.
[62] T. Hocking, A. Joulin and F. Bach, Clusterpath: An algorithm for clustering using convex fusion penalties, International Conference on Machine Learning, 2011.
[63] F. Lindsten, H. Ohlsson and L. Ljung, Clustering using sum-of-norms regularization: with application to particle filter output computation, IEEE Statistical Signal Processing, 2011.
[64] S. Poddar and M. Jacob, Clustering of data with missing entries using non-convex fusion penalties, arXiv preprint, 2017.
[65] D. Pimentel-Alarcón and R. Nowak, The information-theoretic requirements of subspace clustering with missing data, International Conference on Machine Learning, 2016.
[66] M. Powell, On search directions for minimization algorithms, Mathematical Programming, 1973.
[67] J. Nocedal and S. Wright, Numerical Optimization, Springer Science and Business Media, 2006.
[68] A. Beck, On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes, SIAM Journal on Optimization, 2015.
[69] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[70] C. Ding, D. Zhou, X. He and H. Zha, R1-PCA: Rotational invariant L1-norm principal component analysis for robust subspace factorization, International Conference on Machine Learning, 2006.
[71] A. Ng, M. Jordan and Y. Weiss, On spectral clustering: analysis and an algorithm, Advances in Neural Information Processing Systems, 2002.
[72] H. Akaike, Information theory and an extension of the maximum likelihood principle, IEEE International Symposium on Information Theory, 1973.
[73] E. Chi and K. Lange, Splitting methods for convex clustering, Journal of Computational and Graphical Statistics, 2015.
[74] K. Tan and D. Witten, Statistical properties of convex clustering, Electronic Journal of Statistics, 2015.
[75] A. Georghiades, P. Belhumeur and D. Kriegman, From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[76] E. Candès, X. Li, Y. Ma and J. Wright, Robust principal component analysis?, Journal of the ACM, 2011.
[77] R. Tron and R. Vidal, A benchmark for the comparison of 3-D motion segmentation algorithms, IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[78] L. Balzano, R. Nowak and B. Recht, Online identification and tracking of subspaces from highly incomplete information, Allerton Conference on Communication, Control and Computing, 2010.
[79] T. Hastie and P. Simard, Metrics and models for handwritten character recognition, Statistical Science, 1998.
[80] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 1998.