Triple Component Matrix Factorization: Untangling Global, Local, and Noisy Components
Abstract
In this work, we study the problem of common and unique feature extraction from noisy data.
When we have N observation matrices from N different and associated sources corrupted
by sparse and potentially gross noise, can we recover the common and unique components
from these noisy observations? This is a challenging task as the number of parameters
to estimate is approximately thrice the number of observations. Despite the difficulty, we
propose an intuitive alternating minimization algorithm called triple component matrix
factorization (TCMF) to recover the three components exactly. TCMF is distinguished from
existing works in the literature by two salient features. First, TCMF is a principled method
that provably separates the three components from noisy observations. Second, the bulk of the
computation in TCMF can be distributed. On the technical side, we formulate the problem
as a constrained nonconvex nonsmooth optimization problem. Despite the intricate nature
of the problem, we provide a Taylor series characterization of its solution by solving the
corresponding Karush–Kuhn–Tucker conditions. Using this characterization, we can show
that the alternating minimization algorithm makes significant progress at each iteration
and converges to the ground truth at a linear rate. Numerical experiments in video
segmentation and anomaly detection highlight the superior feature extraction abilities of
TCMF.
Keywords: Matrix Factorization, Heterogeneity, Alternating minimization, Sparse noise,
Outlier identification
1. Introduction
In the era of Big Data, an important task is to find low-rank features from high-dimensional
observations. Methods including principal component analysis (Hotelling, 1933), low-rank
matrix factorization (Koren et al., 2009), and dictionary learning (Aharon et al., 2006), have
found success in numerous fields of statistics and machine learning (Wright and Ma, 2022).
Among them, matrix factorization (MF) is an efficient method to identify the features that
best explain the observation matrices.
Despite their wide popularity, standard MF methods are known to be brittle in the
presence of outliers with large noise (Candès et al., 2011). Such noise is often sparse but
can have a large norm. A series of methods (e.g., (Candès et al., 2011; Netrapalli et al., 2014;
Wong and Lee, 2017; Fattahi and Sojoudi, 2020; Chen et al., 2021)) have been developed to
estimate low-rank features from data that contain outliers. When the portion of outliers
is not too large, one can provably identify the outliers and the low-rank components with
convex programming (Candès et al., 2011) or nonconvex optimization algorithms equipped
with convergence guarantees (Netrapalli et al., 2014).
Recently, there has been a growing number of applications where data are acquired from
diverse but connected sources, such as smartphones, car sensors, or medical records from
different patients. This type of data displays both mutual and individual characteristics. For
instance, in biostatistics, the measurements of different miRNA and gene expressions from
the same set of samples can reveal co-varying patterns yet exhibit heterogeneous trends (Lock
et al., 2013). Statistical modeling of the common information among all data sources and
the specific information for each source is of central interest in these applications. Multiple
works propose to recover such common and unique features by minimizing the square norm
of the residuals of fitting (Lock et al., 2013; Zhou et al., 2015; Gaynanova and Li, 2019; Park
and Lock, 2020; Lee and Choi, 2009; Yang and Michailidis, 2016; Shi and Kontar, 2024;
Shi et al., 2023; Liang et al., 2023). These methods prove to be useful in aligning genetics
features (Lock et al., 2013), visualizing bone and soft tissues in X-ray images (Zhou et al.,
2015), functional magnetic resonance imaging (Kashyap et al., 2019), surveillance video
segmentation (Shi and Kontar, 2024), stock market analysis (Shi et al., 2023), and many
more.
Though these algorithms achieve decent performance on multiple applications, they rely
on least-squares estimates, which are not robust to outliers in the data. Real-world data are
commonly corrupted by outliers (Tan et al., 2005). Factors including measurement errors or
sensor malfunctions can give rise to large noise in data. These outliers can substantially skew
the estimation of low-rank features. As such, we attempt to answer the following question.
Question: How can one provably identify low-rank common and unique information
robustly from data corrupted by outlier noise?
A natural thought is to borrow techniques in robust PCA to handle outlier noise. Indeed,
there exist a few heuristic methods in literature (Sagonas et al., 2017; Panagakis et al.,
2015; Ponzi et al., 2021) to find robust estimates of shared and unique features. These
methods often use $\ell_1$ regularization (Sagonas et al., 2017; Panagakis et al., 2015) or Huber
loss (Ponzi et al., 2021) to accommodate the sparsity of noise. However, these algorithms are
mainly based on heuristics and lack theoretical guarantees, thus potentially compromising
the quality of their outputs. A theoretically justifiable method to identify low-rank shared
and unique components from outlier noise is still lacking. In this paper, we will study the
question rigorously and develop an efficient algorithm to solve it.
2. Problem Statement
We consider the framework where $N$ observation matrices $M_{(1)}, M_{(2)}, \cdots, M_{(N)}$ come from
$N \in \mathbb{N}_+$ different but associated sources. These matrices $M_{(i)} \in \mathbb{R}^{n_1 \times n_{2,(i)}}$ have the same
number of features $n_1$. To model their commonality and uniqueness, we assume each matrix
is driven by $r_1$ shared factors and $r_{2,(i)}$ unique factors and contaminated by potentially gross
noise. More specifically, we consider the model where the observation $M_{(i)}$ from source $i$ is
generated by,
$$M_{(i)} = U^\star_g V_{(i),g}^{\star T} + U^\star_{(i),l} V_{(i),l}^{\star T} + S^\star_{(i)}, \qquad (1)$$
where $U^\star_g \in \mathbb{R}^{n_1 \times r_1}$, $V^\star_{(i),g} \in \mathbb{R}^{n_{2,(i)} \times r_1}$, $U^\star_{(i),l} \in \mathbb{R}^{n_1 \times r_{2,(i)}}$, $V^\star_{(i),l} \in \mathbb{R}^{n_{2,(i)} \times r_{2,(i)}}$, and $S^\star_{(i)} \in \mathbb{R}^{n_1 \times n_{2,(i)}}$. We use $\star$ to denote the ground truth. $r_1$ is the rank of the global (shared) feature
matrices, and $r_{2,(i)}$ is the rank of the local (unique) feature matrix from source $i$. The matrix
$U^\star_g V_{(i),g}^{\star T}$ models the shared low-rank part of the observation matrix, as its column space
is the same across different sources. $U^\star_{(i),l} V_{(i),l}^{\star T}$ models the unique low-rank part. $S^\star_{(i)}$
models the noise from source $i$.
In matrix factorization problems, the representations U? and V? often correspond to
latent data features. For instance, in recommender systems, U? can be interpreted as
user features that reveal their preferences on different items in the latent space (Koren
et al., 2009). For better interpretability, it is often desirable to have the underlying features
disentangled so that each feature can vary independently of others (Higgins et al., 2017).
Under this rationale, we consider the model where shared and unique factors are orthogonal,
$$U_g^{\star T} U^\star_{(i),l} = 0, \quad \forall i \in [N], \qquad (2)$$
where [N ] denotes the set {1, 2, · · · , N }. The orthogonality of features implies that the
shared and unique features span different subspaces, thus describing different patterns in the
observation. The orthogonal condition (2) is thus an inductive bias that reflects our prior
belief about the independence between common and unique factors and naturally models a
diverse range of applications, such as miRNA and gene expression (Lock et al., 2013), human
faces (Zhou et al., 2015), and many more (Sagonas et al., 2017; Shi and Kontar, 2024).
We should note that the orthogonality (2) does not limit the representation power of the model.
Suppose otherwise that $U_g^{\star T} U^\star_{(i),l} \neq 0$; we can decompose $U^\star_{(i),l}$ into two parts,
$$U^\star_{(i),l} = U^\star_g \left(U_g^{\star T} U^\star_g\right)^{-1} U_g^{\star T} U^\star_{(i),l} + \left(I - U^\star_g \left(U_g^{\star T} U^\star_g\right)^{-1} U_g^{\star T}\right) U^\star_{(i),l}.$$
The first part lies in the column subspace of $U^\star_g$, while the second part lies in the orthogonal
complement of the column subspace of $U^\star_g$. If we define $\widetilde U^\star_{(i),l} = \left(I - U^\star_g \left(U_g^{\star T} U^\star_g\right)^{-1} U_g^{\star T}\right) U^\star_{(i),l}$
and $\widetilde V^\star_{(i),g} = V^\star_{(i),g} + V^\star_{(i),l} U_{(i),l}^{\star T} U^\star_g \left(U_g^{\star T} U^\star_g\right)^{-1}$, we have $M_{(i)} = U^\star_g \widetilde V_{(i),g}^{\star T} + \widetilde U^\star_{(i),l} V_{(i),l}^{\star T} + S^\star_{(i)}$,
where $U_g^{\star T} \widetilde U^\star_{(i),l} = 0$. This formulation admits the form of model (1) with constraint (2).
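As a quick sanity check of this reparameterization, the following numpy sketch (with randomly generated, hypothetical factor matrices of our own choosing) verifies that the reparameterized factors reproduce the same low-rank observation while satisfying constraint (2):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r1, r2 = 30, 20, 3, 2
Ug = rng.normal(size=(n1, r1))
Ul = rng.normal(size=(n1, r2))       # generally not orthogonal to Ug
Vg = rng.normal(size=(n2, r1))
Vl = rng.normal(size=(n2, r2))

P = Ug @ np.linalg.solve(Ug.T @ Ug, Ug.T)          # projection onto col(Ug)
Ul_tilde = (np.eye(n1) - P) @ Ul                   # part orthogonal to col(Ug)
Vg_tilde = Vg + Vl @ Ul.T @ Ug @ np.linalg.inv(Ug.T @ Ug)

M_orig = Ug @ Vg.T + Ul @ Vl.T
M_reparam = Ug @ Vg_tilde.T + Ul_tilde @ Vl.T
print(np.allclose(M_orig, M_reparam))              # True: same low-rank observation
print(np.allclose(Ug.T @ Ul_tilde, 0))             # True: constraint (2) now holds
```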
The noise term S? (i) in (1) models the sparse and large noise, where only a small
fraction of S? (i) registers as nonzero. The noise sparsity is extensively invoked in literature,
particularly when datasets are plagued by outliers (Candès et al., 2011; Netrapalli et al.,
2014; Chen et al., 2020, 2021).
2.1 Challenges
Given data generation model (1), our task is to separate common, individual, and noise
components. The task seems Herculean as the problem is under-definite: we need to estimate
three sets of parameters from one set of observations. There are two major challenges
associated with the problem,
Challenge 1: New identifiability conditions are needed. Standard analysis in robust PCA
(Candès et al., 2011; Netrapalli et al., 2014) often uses the incoherence condition to distinguish
low-rank components from sparse noise. However, the incoherence condition alone is
insufficient to guarantee the separability between common and unique features. Since there
are infinitely many ways in which shared, unique, and noise components can form the
observation matrices, it is not apparent whether untangling them is even feasible. Thus, the
crux of our investigation is to understand when the separation is possible.
Fortunately, we show that a group of conditions–known here as identifiability conditions–
exists that can ensure the precise retrieval of the shared, unique, and sparse noise. Intuitively,
these identifiability conditions require the three components to have “little overlaps”.
Based on these conditions, we will develop an alternating minimization algorithm called
TCMF to iteratively update the three components. An illustration of the algorithm is shown
in the left graph in Figure 1. The hard-thresholding step finds the closest sparse matrix for
the data noise. We use JIMF to denote a subroutine that represents a group of algorithms
(e.g., (Lock et al., 2013; Shi and Kontar, 2024)) to identify common and unique low-rank
features. In essence, JIMF solves a sub-problem in TCMF. It is worth noting that there exist
multiple algorithms in literature to implement JIMF, many of which can produce high-quality
outputs. With the implemented JIMF, TCMF applies hard thresholding and JIMF alternately
to estimate the sparse, as well as common and unique low-rank components. The left graph
of Figure 1 offers an intuitive understanding of how estimates of various components progress
toward the ground truth with each iterative step.
Figure 1: Left: An illustration of TCMF’s update trajectory. The purple and blue curves rep-
resent the spaces for the low-rank and sparse matrices. The algorithm alternately
performs hard thresholding and JIMF, making the updates closer and closer to
the ground truth. Middle: An illustration showing why insufficient understanding
about the output of JIMF can be problematic in the convergence analysis. Right:
Our contribution: representing the solution as a Taylor-like series.
Challenge 2: New analysis tools are needed. Showing the exact recovery of low-rank and
sparse components is not easy. Even in standard robust PCA, one needs to apply highly
nontrivial analytical techniques to provide theoretical guarantees. For example, Robust PCA
(Candès et al., 2011) relies on a “golfing scheme” to construct dual variables that ensure the
uniqueness of a convex optimization problem. Nonconvex robust PCA (Netrapalli et al.,
2014) applies a perturbation analysis of SVD to quantify the improvement of the algorithm
per iteration. These techniques are tailored for standard robust PCA and cannot be directly
extended to the case where both common and unique features are involved, which increases
the complexity of the analysis. The major difficulty stems from the fact that TCMF updates
the low-rank common and unique components by another iterative algorithm JIMF. Unlike
robust PCA, the output of JIMF does not have a closed-form formula. This conceptual
hurdle is illustrated in the middle graph of Figure 1. As a result, novel analysis tools are
needed to justify the convergence of the proposed TCMF.
One of our key contributions in tackling the challenge is to develop innovative analysis
tools by solving the Karush–Kuhn–Tucker conditions of the objective of JIMF and expressing the
solutions as a Taylor-like series. From the Taylor-like series, we can precisely characterize
the output of JIMF, thereby showing that the series converges to a close estimate of the ground
truth shared and unique features.
The Taylor-like series is depicted in the right graph of Figure 1. Perhaps surprisingly,
regardless of the choice of the subroutine JIMF, as long as JIMF finds a close estimate of
the optimal solutions to a subproblem, its output can be represented by an infinite series.
The series describes the optimal solution of the subproblem and is independent of the
intermediate steps in JIMF. The derivation and analysis of the Taylor-like series have stand-alone
value for theoretical research on the sensitivity analysis of matrix factorization. With
the new analysis tool, we are able to show that even if the JIMF only outputs a reasonable
approximate solution, the meta-algorithm TCMF can still take advantage of the information in
such an inexact solution to refine the estimates of the three components. We will elaborate
on the Taylor-like series in greater detail in Section 6 and the Appendix.
We summarize our contributions in the following.
to high precision. Our theoretical analysis introduces new techniques to solve the KKT
conditions via a Taylor-like series and to bound each term in the series. It sheds light on
sensitivity analysis with the $\ell_\infty$ norm.
Case studies. We use a wide range of numerical experiments to demonstrate the
application of TCMF in different case studies, as well as the effectiveness of our proposed
method. Numerical experiments corroborate theoretical convergence results. Also, the case
studies on video segmentation and anomaly detection showcase the benefits of untangling
shared, unique, and noisy components.
In the rest of the paper, we provide a comprehensive review of the literature in Section 3.
Then, we elaborate on the conditions sufficient for the separation of the three components
in Section 4. In Section 5, we introduce the alternating minimization algorithm. We present
our convergence theorem in Section 6 and discuss the key insights in the proof and how
they solve challenge 2. In Section 7, we demonstrate the numerical experiment results. The
detailed proofs are relegated to the Appendix for brevity of the main paper.
3. Related Work
Matrix Factorization There are numerous works that analyze the theoretical and practical
properties of first-order algorithms that solve the (asymmetric) matrix factorization problem
$\min_{U,V} \|M - UV^{T}\|_F^2$ or its variants (Li et al., 2018; Ye and Du, 2021; Sun and Luo, 2016;
Park et al., 2017; Tu et al., 2016). Among them, Sun and Luo (2016) analyzes the local
landscape of the optimization problem and establishes the local linear convergence of a series
of first-order algorithms. Park et al. (2017); Ge et al. (2017) study the global geometry of the
optimization problem. Tu et al. (2016) proposes the Rectangular Procrustes Flow algorithm
that is proved to converge linearly to the ground truth under proper initialization and
a balancing regularization. Recently, Ye and Du (2021) shows that gradient descent with
small and random initialization can converge to the ground truth.
Robust PCA When the observation is corrupted by sparse and potentially large noise,
several approaches can still identify the low-rank components. An exemplary work is Robust
PCA (Candès et al., 2011), which proposes an elegant convex optimization problem called
principal component pursuit that uses the nuclear norm and the $\ell_1$ norm to promote the
low-rankness and sparsity of the solutions. It is proved that under incoherence assumptions, the solution
of the convex optimization is unique and corresponds to the ground truth. Several works also
consider the problem of matrix completion under outlier noise (Wong and Lee, 2017; Chen
et al., 2021). Nonconvex robust PCA (Netrapalli et al., 2014) improves the computational
efficiency of principal component pursuit by proposing a nonconvex formulation and using
an alternating projection algorithm to solve it. Though the formulation is nonconvex, the
alternating projection algorithm is also proved to recover the ground truth exactly under
incoherence and sparsity requirements. For the special case of rank-1 robust PCA, Fattahi and
Sojoudi (2020) show that a simple sub-gradient method applied directly to the nonsmooth
$\ell_1$-loss provably recovers the low-rank component and sparse noise under the same incoherence
and sparsity requirements. To model a broader set of noise, Meng and De La Torre (2013)
consider a mixture of Gaussian noise models and exploit the EM algorithm to estimate the
low-rank components. Robust PCA has found successful applications in video segmentation
(Bouwmans and Zahzah, 2014), image processing (Vaswani et al., 2018), change point
detection (Xiao et al., 2020), and many more. Nevertheless, the formulations of robust PCA
focus on shared low-rank features among all data and neglect unique components.
Robust shared and unique feature extraction As discussed, a few heuristic methods
also attempt to find the shared and unique features when data are corrupted by large noise
(Sagonas et al., 2017; Ponzi et al., 2021; Panagakis et al., 2015). Amid them, RaJIVE
(Ponzi et al., 2021) employs robust SVD (Zhang et al., 2013) to remove noise from the
observations and then uses a variant of JIVE (Feng et al., 2018) to separate common and
unique components. RJIVE (Sagonas et al., 2017) proposes a constrained optimization
formulation to minimize the `1 norm of the fitting residuals and exploits ADMM to solve the
problem. RCICA (Panagakis et al., 2015) adopts a similar optimization objective but uses a
regularization to encourage the similarity of common subspaces and only works for N = 2
cases. Though these methods can achieve decent performance in applications including
facial expression synthesis and audio-visual fusion, they are based on heuristics and it is
not clear whether their output converges to the ground truth common and unique factors.
In contrast, we prove that TCMF is guaranteed to recover the ground truths and use a few
numerical examples to show that TCMF indeed recovers more meaningful components.
4. Identifiability Conditions
Our goal is to decouple the common components, unique components, and the sparse noise,
given a group of data observations $\{M_{(i)}\}_{i=1}^N$. At first glance, such decoupling may seem
impossible or even ill-defined: roughly speaking, the number of unknown variables, namely
global components, local components, and noise, is thrice the number of observed data
matrices $\{M_{(i)}\}_{i=1}^N$, and hence, there are infinitely many decouplings that can give rise
to the same $M_{(i)}$.
The very first question to ask is whether such decoupling is possible and, if so, which
properties can ensure the identifiability of three components. Intriguingly, we are able to
prove that the exact decoupling of shared features, unique features, and noise is possible if
there is “little overlap” among the three components. Below, we will formalize this intuition
in more detail. Though intuitive, it turns out that these conditions can guarantee the
identifiability of the shared components, unique components, and the sparse noise.
4.1 Sparsity
As discussed, identifying arbitrarily dense and large noise from signals is not possible. Hence,
we consider sparse noise where only a small fraction of observations are corrupted. To
characterize the sparsity of S? (i) , we use the following definition of α-sparsity.
Definition 1 (α-sparsity) A matrix S ∈ Rn1 ×n2 is α-sparse if at most αn1 entries in each
column and at most αn2 entries in each row are nonzero.
The definition follows from that of Netrapalli et al. (2014). In Definition 1, α characterizes
the maximum portion of corrupted entries in each row and each column. Intuitively, if
a matrix is α-sparse with small α, then its nonzero entries are “spread out” instead of
concentrated on specific columns or rows.
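For concreteness, the following small numpy sketch (our own illustration, not part of the paper's code) checks Definition 1 for a given matrix and sparsity level α:

```python
import numpy as np

def is_alpha_sparse(S: np.ndarray, alpha: float) -> bool:
    """Definition 1: at most alpha*n1 nonzeros per column and alpha*n2 nonzeros per row."""
    n1, n2 = S.shape
    return (np.count_nonzero(S, axis=0).max() <= alpha * n1 and
            np.count_nonzero(S, axis=1).max() <= alpha * n2)

# Example: a corruption pattern where roughly 2% of the entries are large outliers.
rng = np.random.default_rng(0)
S = np.zeros((100, 200))
mask = rng.random(S.shape) < 0.02
S[mask] = rng.normal(scale=10.0, size=mask.sum())
print(is_alpha_sparse(S, alpha=0.1))   # True: nonzeros are spread out over rows and columns
```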
4.2 Incoherence
It is shown that distinguishing sparse components from arbitrary low-rank components is
also hard (Candès et al., 2011; Netrapalli et al., 2014). As a simple counterexample, the
matrix $M = e_i e_j^{T}$, where we use $e_i$ to denote the basis vector of axis $i$, has its $ij$-th entry
to be 1 and all other entries to be 0. This matrix has rank 1, and is also sparse since it
has only one nonzero entry. Thus, deciding whether it is sparse or low rank is difficult as it
satisfies both requirements.
From the above analysis, one can see that the low-rank components should not be
sparse. In other words, to be distinguishable from the sparse noise, their elements should
be sufficiently spread out. In the literature, this requirement is often characterized by the
so-called incoherence condition (Candès et al., 2011; Netrapalli et al., 2014).
where $e_i \in \mathbb{R}^n$ is the standard basis vector of axis $i$, defined as $e_i = (0, 0, \ldots, 0, 1, 0, \ldots, 0)^{T}$.
The incoherence condition restricts the maximum row-wise $\ell_2$ norm of a matrix U, thus
preventing the entries of U from being too concentrated on a few specific axes.
Remember that in model (1), $U^\star_g V_{(i),g}^{\star T}$ and $U^\star_{(i),l} V_{(i),l}^{\star T}$ represent the global (shared)
and local (unique) factors. For any $n \ge r$, we use $\mathcal{O}^{n \times r}$ to denote the set of $n$-by-$r$ matrices
whose column vectors are orthonormal, $\mathcal{O}^{n \times r} = \{W \in \mathbb{R}^{n \times r} \,|\, W^{T} W = I\}$. We assume the
SVD of $U^\star_g V_{(i),g}^{\star T}$ and $U^\star_{(i),l} V_{(i),l}^{\star T}$ has the following form,
$$\begin{cases} U^\star_g V_{(i),g}^{\star T} = H^\star_g \Sigma^\star_{(i),g} W_{(i),g}^{\star T} \\ U^\star_{(i),l} V_{(i),l}^{\star T} = H^\star_{(i),l} \Sigma^\star_{(i),l} W_{(i),l}^{\star T} \end{cases}, \qquad (3)$$
where $H^\star_g \in \mathcal{O}^{n_1 \times r_1}$, $W^\star_{(i),g} \in \mathcal{O}^{n_{2,(i)} \times r_1}$, $H^\star_{(i),l} \in \mathcal{O}^{n_1 \times r_{2,(i)}}$. Moreover, $W^\star_{(i),l} \in \mathcal{O}^{n_{2,(i)} \times r_{2,(i)}}$
are matrices with orthonormal columns, and $\Sigma^\star_{(i),g} \in \mathbb{R}^{r_1 \times r_1}$ and $\Sigma^\star_{(i),l} \in \mathbb{R}^{r_{2,(i)} \times r_{2,(i)}}$ are
positive diagonal matrices. In (3), we consider the case where the global and local column
singular vectors are orthogonal, i.e., $H_g^{\star T} H^\star_{(i),l} = 0$. We use $r_2 = \max_i r_{2,(i)}$ throughout the
paper.
To avoid overlapping between sparse and low-rank components, we assume the row
and column singular vectors $H^\star_g$, $H^\star_{(i),l}$, $W^\star_{(i),g}$, and $W^\star_{(i),l}$ are all µ-incoherent. This
assumption ensures that the low-rank components do not have entries too concentrated on
specific rows or columns. As a result, the incoherence on singular vectors encourages the
low-rank components to distribute evenly on all entries, which is distinguished from sparse
noise that is nonzero only on a small fraction of entries.
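A common way to formalize incoherence (e.g., following Netrapalli et al., 2014) is to require that an orthonormal $U \in \mathcal{O}^{n \times r}$ satisfies $\max_i \|e_i^{T} U\|_2 \le \sqrt{\mu r / n}$. The sketch below computes the smallest µ for which a given orthonormal basis satisfies this bound; the function name and the exact normalization are our assumptions rather than the paper's notation:

```python
import numpy as np

def incoherence_mu(U: np.ndarray) -> float:
    """Smallest mu with max_i ||e_i^T U||_2 <= sqrt(mu * r / n) for an orthonormal U (n x r)."""
    n, r = U.shape
    return np.linalg.norm(U, axis=1).max() ** 2 * n / r

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(1000, 5)))   # random basis: rows are spread out
print(incoherence_mu(Q))                          # a small constant
E = np.eye(1000)[:, :5]                           # axis-aligned basis: maximally coherent
print(incoherence_mu(E))                          # equals n / r = 200
```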
4.3 Misalignment
As discussed in (3), we use orthogonality between shared and unique features, $H_g^{\star T} H^\star_{(i),l} = 0$,
to encode our prior belief about the independence of different features. This is equivalent
to $U_g^{\star T} U^\star_{(i),l} = 0$. Such orthogonality, however, is still insufficient to guarantee the
identifiability of shared and unique factors.
To see this, consider a counterexample where all $U^\star_{(i),l}$'s are equal, i.e., $U^\star_{(1),l} = U^\star_{(2),l} =
\cdots = U^\star_{(N),l}$. In this case, “unique” factors are also shared among all observation matrices.
Thus, separating them from the ground truth $U^\star_g$ is not possible. From this counterexample,
we can see that it is essential for the local features not to be perfectly aligned with each
other. Next, we formally introduce the notion of misalignment. For a full column-rank
matrix $U \in \mathbb{R}^{d \times n}$, we define the projection matrix $P_U \in \mathbb{R}^{d \times d}$ as $P_U = U \left(U^{T} U\right)^{-1} U^{T}$.
Definition 3 (θ-misalignment) We say $\{U^\star_{(i),l}\}$ are θ-misaligned if there exists a positive
constant $\theta \in (0, 1)$ such that:
$$\lambda_{\max}\left(\frac{1}{N} \sum_{i=1}^N P_{U^\star_{(i),l}}\right) \le 1 - \theta. \qquad (4)$$
By the triangle inequality for $\lambda_{\max}(\cdot)$, we know $\lambda_{\max}\left(\frac{1}{N}\sum_{i=1}^N P_{U^\star_{(i),l}}\right) \le
\frac{1}{N}\sum_{i=1}^N \lambda_{\max}\left(P_{U^\star_{(i),l}}\right) = 1$. Thus, the introduced θ is always nonnegative. Indeed, all
$P_{U^\star_{(i),l}}$'s have a common nonempty eigenspace with eigenvalue 1 if and only if θ = 0. Thus,
the θ-misalignment condition requires that the subspaces spanned by all unique factors do
not contain a common subspace. On the contrary, all global features are shared; hence, the
subspaces spanned by these features are also identical. This comparison shows that the
misalignment condition unequivocally distinguishes unique features from shared ones.
As a concrete example, consider $N = 2$ and $U_{(1),l} = (\cos\vartheta, \sin\vartheta)^{T}$, $U_{(2),l} =
(\cos\vartheta, -\sin\vartheta)^{T}$ for $\vartheta \in [0, \frac{\pi}{4}]$. Indeed, the angle between $U_{(1),l}$ and $U_{(2),l}$ is $2\vartheta$ and
$$\frac{1}{2}\left(P_{U_{(1),l}} + P_{U_{(2),l}}\right) = \begin{pmatrix} \cos^2\vartheta & 0 \\ 0 & \sin^2\vartheta \end{pmatrix}.$$
Hence, by definition, $\theta = \sin^2\vartheta$. We can thus clearly see that as $\vartheta$ increases, $U_{(1),l}$
and $U_{(2),l}$ become more misaligned.
The notion of θ-misalignment was first proposed by Shi and Kontar (2024) and is intimately
related to the uniqueness conditions in Lock et al. (2013).
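The two-source example above can be reproduced numerically; the sketch below (with our own helper names) computes θ directly from Definition 3:

```python
import numpy as np

def projection(U: np.ndarray) -> np.ndarray:
    """P_U = U (U^T U)^{-1} U^T for a full column-rank U."""
    return U @ np.linalg.solve(U.T @ U, U.T)

def misalignment_theta(U_locals) -> float:
    """theta = 1 - lambda_max( (1/N) sum_i P_{U_(i),l} ), cf. Definition 3."""
    avg = sum(projection(U) for U in U_locals) / len(U_locals)
    return 1.0 - np.linalg.eigvalsh(avg)[-1]

v = np.pi / 6
U1 = np.array([[np.cos(v)], [np.sin(v)]])
U2 = np.array([[np.cos(v)], [-np.sin(v)]])
print(misalignment_theta([U1, U2]), np.sin(v) ** 2)   # both equal 0.25
```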
5. Algorithm
The introduced identifiability conditions restrict the overlaps between shared, unique, and
sparse components. It remains to develop algorithms to untangle the three parts from N
matrices. In Section 5.1, we introduce a constrained optimization formulation, and in Section
5.2, we propose an alternating minimization program to decouple the three parts. The
alternating minimization requires solving subproblems to distinguish shared features from
unique ones.
Throughout the paper, we use $\|A\|$ or $\|A\|_2$ to denote the operator norm of a matrix
$A \in \mathbb{R}^{m \times n}$ and $\|A\|_F$ to denote the Frobenius norm of $A$. We use $r$ to denote $r = r_1 + r_2$.
$$\begin{aligned} \min_{x} \quad & \sum_{i=1}^N h_i\left(U_g, V_{(i),g}, U_{(i),l}, V_{(i),l}, S_{(i)}; \lambda\right) \\ \text{s.t.} \quad & U_g^{T} U_{(i),l} = 0, \quad \forall i \in [N]. \end{aligned} \qquad (5)$$
Term (fi ) measures the distance between the sum of shared, unique, and sparse components
and the observation matrix M(i) . It denotes the residual of fitting. A common approach for
solving this problem is based on convex relaxation (Candès et al., 2011). However, convex
relaxation increases the number of variables to O(n1 n2 ), while our nonconvex formulation
keeps it in the order of O(max{n1 , n2 }(r1 + r2 )), which is significantly smaller.
Term $(\Phi_i)$ is an $\ell_0$ regularization term that promotes the sparsity of the matrix $S_{(i)}$. The
parameter λ mediates the balance between the $\ell_0$ penalty and the residual of fitting. A
large value of λ leads to sparser S(i) with only large nonzero elements. Conversely, a small
value of λ yields a denser S(i) with potentially small nonzero elements. Therefore, to identify
both large and small nonzero values of S(i) , while correctly filtering out its zero elements,
we propose to gradually decrease the value of λ during the optimization of objective (5).
We use the notation hi (Ug , V(i),g , U(i),l , V(i),l , S(i) ; λ) to explicitly show that the objective
hi is dependent on the regularization parameter λ.
At first glance, the proposed optimization problem (5) may appear daunting due to
its inherent nonconvexity and nonsmoothness. Notably, it exhibits two distinct sources of
nonconvexity: firstly, both terms (fi ) and (Φi ) are nonconvex, and secondly, the feasible
set corresponding to the constraint $U_g^{T} U_{(i),l} = 0$ is also nonconvex. Furthermore, the $\ell_0$
regularization term in $(\Phi_i)$ introduces nonsmoothness into the problem. However, we will
introduce an intuitive and efficient algorithm designed to alleviate these challenges and
effectively solve the problem. Surprisingly, under our identifiability conditions introduced in
Section 4, this algorithm can be proven to converge to the ground truth.
where $\mathrm{Hard}_\lambda(\cdot)$ is the hard-thresholding operator. For a matrix $X \in \mathbb{R}^{m \times n}$, the hard-thresholding operator is defined as:
$$\left[\mathrm{Hard}_\lambda(X)\right]_{ij} = \begin{cases} X_{ij}, & \text{if } |X_{ij}| > \lambda \\ 0, & \text{if } X_{ij} \in [-\lambda, \lambda] \end{cases}. \qquad (7)$$
The coefficient λ is a thresholding parameter that controls the sparsity of the output.
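The operator in (7) is straightforward to implement; a minimal numpy version (our own helper name) is:

```python
import numpy as np

def hard_threshold(X: np.ndarray, lam: float) -> np.ndarray:
    """Entrywise hard thresholding of (7): keep entries with |X_ij| > lam, zero the rest."""
    return np.where(np.abs(X) > lam, X, 0.0)

print(hard_threshold(np.array([[0.2, -3.0], [1.5, 0.0]]), 1.0))   # [[ 0.  -3. ] [ 1.5  0. ]]
```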
To recover the correct sparsity pattern of Ŝ(i),t , our approach is to maintain a small false
positive rate (elements that are incorrectly identified as nonzero), while gradually improving
the true positive rate (elements that are correctly identified as nonzero). To this goal, we
start with a large λ to obtain a conservative estimate of Ŝ(i),t . Then, we decrease λ to refine
the estimate.
In the second step, we fix $\hat S_{(i),t}$ and optimize $U_g, \{V_{(i),g}, U_{(i),l}, V_{(i),l}\}$ under the
constraint $U_g^{T} U_{(i),l} = 0$. Removing the $\ell_0$ regularization term that is independent of
$U_g, \{V_{(i),g}, U_{(i),l}, V_{(i),l}\}$, the optimization subproblem takes the following form,
$$\begin{aligned} \min_{\left(U_g, \{V_{(i),g}, U_{(i),l}, V_{(i),l}\}\right)} \quad & \sum_{i=1}^N \left\| \hat M_{(i)} - U_g V_{(i),g}^{T} - U_{(i),l} V_{(i),l}^{T} \right\|_F^2 \\ \text{s.t.} \quad & U_g^{T} U_{(i),l} = 0, \quad \forall i \in [N], \end{aligned} \qquad (8)$$
Definition 4 (ε-optimality) Given $(\hat U_g, \{\hat V_{(i),g}, \hat U_{(i),l}, \hat V_{(i),l}\})$ as any global optimal solution
to problem (8) and a constant ε > 0, we say $(\hat U^\epsilon_g, \{\hat V^\epsilon_{(i),g}, \hat U^\epsilon_{(i),l}, \hat V^\epsilon_{(i),l}\})$ is an ε-optimal
solution to (8) if it satisfies,
$$\left\| \hat U^\epsilon_g \hat V_{(i),g}^{\epsilon T} + \hat U^\epsilon_{(i),l} \hat V_{(i),l}^{\epsilon T} - \hat U_g \hat V_{(i),g}^{T} - \hat U_{(i),l} \hat V_{(i),l}^{T} \right\|_\infty \le \epsilon, \quad \forall i,$$
and
$$\hat U_g^{\epsilon T} \hat U^\epsilon_{(i),l} = 0, \quad \forall i.$$
The nonconvexity of (8) gives rise to multiple global optimal solutions. Our definition
of ε-optimality only emphasizes the closeness between the product of features and the
coefficients, and the product of any set of global optimal solutions. As discussed, there exist
multiple methods proposed to solve (8) that demonstrate decent practical performance. In
particular, PerPCA, PerDL, and HMF are proved to converge to the optimal solutions of
(8) at linear rates when initialized properly. Hence, under suitable initializations, PerPCA
and HMF can reach ε-optimality of (8) within $O\left(\log\frac{1}{\epsilon}\right)$ iterations for any value of ε.
In Algorithm 1, we use $\mathrm{JIMF}\left(\{\hat M_{(i)}\}, \epsilon\right)$ to denote the call to a subroutine that solves (8) to
ε-optimality. In each epoch, sparse matrices $\hat S_{(i),t}$ are first estimated by hard thresholding.
Then $\hat M_{(i)} = M_{(i)} - \hat S_{(i),t}$ are calculated, which are subsequently decoupled into the shared
and unique components via a JIMF call. The output of this subroutine is represented as
$(\hat U^\epsilon_{g,t}, \{\hat V^\epsilon_{(i),g,t}\}, \{\hat U^\epsilon_{(i),l,t}\}, \{\hat V^\epsilon_{(i),l,t}\})$, where the superscript ε signifies ε-optimality. These
outputs are used to improve the estimate of $\hat S_{(i)}$ in the next epoch. After each epoch, we
decrease the thresholding parameter $\lambda_t$ by a constant factor ρ < 1, then add a constant ε.
The inclusion of ε in $\lambda_{t+1}$ is necessary to ensure that the estimated $\hat S_{(i),t+1}$ does not contain
any false positive entries. By incorporating ε into $\lambda_{t+1}$, we guarantee that the inexactness
of the JIMF outputs does not undermine the false positive rate of the entries in $\hat S_{(i),t+1}$.
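The overall loop can be summarized in a short sketch. This is only our reading of the alternating scheme described above, not the authors' reference implementation; `jimf` stands for any subroutine that solves (8) to ε-optimality (e.g., HMF or PerPCA) and is assumed to return, for each source, the estimated low-rank part $\hat U^\epsilon_g \hat V_{(i),g}^{\epsilon T} + \hat U^\epsilon_{(i),l} \hat V_{(i),l}^{\epsilon T}$:

```python
import numpy as np

def tcmf(M_list, jimf, lam1, rho, eps, n_epochs):
    """Sketch of the alternation in Algorithm 1: hard-threshold the residuals to
    estimate the sparse parts, run a JIMF subroutine on the cleaned matrices, and
    shrink the threshold as lam_{t+1} = rho * lam_t + eps."""
    L_hat = [np.zeros_like(M) for M in M_list]      # initial low-rank estimates
    lam = lam1
    for _ in range(n_epochs):
        # Estimate sparse noise: keep only residual entries larger than lam.
        S_hat = [np.where(np.abs(M - L) > lam, M - L, 0.0)
                 for M, L in zip(M_list, L_hat)]
        # Remove the sparse estimates and decouple shared/unique low-rank parts.
        L_hat = jimf([M - S for M, S in zip(M_list, S_hat)], eps)
        # Shrink the threshold while keeping the epsilon slack.
        lam = rho * lam + eps
    return L_hat, S_hat
```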
The per-epoch computational complexity of Algorithm 1 is $O(n_1 n_2 N)$ when the ranks
$r_1$ and $r_2$ are small, $r_1, r_2 \ll n_1, n_2$. To see this, we can add up the computational
complexity of hard-thresholding and JIMF. Element-wise hard-thresholding requires $O(n_1 n_2)$
computations for each source. Efficient implementations of the subroutine JIMF, such as
PerPCA and HMF, converge to ε-optimal solutions within $O(\log\frac{1}{\epsilon})$ steps, where
each step requires $O(n_1 n_2)$ computations. Therefore, the per-epoch computational complexity
of TCMF is $O(n_1 n_2 N)$, where log factors are omitted.
Furthermore, if the JIMF and hard-thresholding are distributed among N sources, TCMF
can further exploit parallel computation to reduce the running time. In the regime where com-
munication cost is negligible, the per-iteration total running time scales as O (n1 n2 + N n1 ).
We will show later that such a design can ensure that the estimation error diminishes
linearly. A pictorial representation of Algorithm 1 is plotted in the left graph of Figure 1.
6. Convergence Analysis
In this section, we will analyze the convergence of Algorithm 1. Our theorem characterizes the
conditions under which Algorithm 1 converges linearly to the ground truth. We additionally
introduce $\sigma_{\max} > 0$ to denote an upper bound on the singular values of $\{U^\star_g V_{(i),g}^{\star T} +
U^\star_{(i),l} V_{(i),l}^{\star T}\}_{i=1}^N$, and $\sigma_{\min} > 0$ to denote a lower bound on the smallest nonzero singular
values of $\{U^\star_g V_{(i),g}^{\star T} + U^\star_{(i),l} V_{(i),l}^{\star T}\}_{i=1}^N$. For simplicity, we assume $n_{2,(i)} = n_2$, $r_{2,(i)} = r_2$,
and $r = r_1 + r_2$ in this section.
Theorem 5 (Convergence of Algorithm 1) Consider the true model (1) with SVD de-
fined in (3), where the nonzero singular values of $U^\star_g V_{(i),g}^{\star T} + U^\star_{(i),l} V_{(i),l}^{\star T}$ are lower bounded
by $\sigma_{\min} > 0$ and upper bounded by $\sigma_{\max} \ge \sigma_{\min}$ for each source $i$. Suppose that the following
conditions are satisfied:
• (µ-incoherency) The matrices $H^\star_g$ and $\{H^\star_{(i),l}, W^\star_{(i),g}, W^\star_{(i),l}\}_{i=1}^N$ are µ-incoherent, where
$r = r_1 + r_2$.
Then, there exist constants $C_{g,1}, C_{g,2}, C_{l,1}, C_{l,2}, C_{s,1}, C_{s,2} > 0$ and $\rho_{\min} = O\left(\frac{\alpha\mu^2\sqrt{r}}{\sqrt{\theta}}\right) < 1$
such that the iterations of Algorithm 1 with $\lambda_1 = \frac{\sigma_{\max}\mu^2\sqrt{r}}{n_1 n_2}$, $\epsilon \le \lambda_1(1 - \rho_{\min})$, and $1 - \frac{\epsilon}{\lambda_1} >
\rho \ge \rho_{\min}$ satisfy
Theorem 5 presents a set of sufficient conditions under which the model is identifiable,
and Algorithm 1 converges to the ground truth at a linear rate. As discussed in Section 4,
these conditions are indeed sufficient to guarantee the identifiability of the true model. In
particular, µ-incoherency is required for disentangling global and local components from
noise, whereas θ-misalignment is needed to separate local and global components. Moreover,
there is a natural trade-off between the parameters µ, θ, and α: the upper bound on the
sparsity level α, $O\left(\frac{\theta^2}{\mu^4 r^2}\right)$, is proportional to $\theta^2$, implying that more alignment among local
feature matrices can be tolerated only at the expense of sparser noise matrices. Similarly,
α is inversely proportional to µ4 , indicating that more coherency in the local and global
components is only possible with sparser noise matrices. We also highlight the dependency
of α on the rank r; such dependency is required even in the standard settings of robust
PCA (Netrapalli et al., 2014; Chandrasekaran et al., 2011; Hsu et al., 2011), albeit with
a milder condition on r. Finally, the scaling of α does not depend on the number of
sources N , suggesting that the convergence guarantees provided by Theorem 5 are valid for
extremely large datasets.
Two important observations are in order. First, we do not impose any constraint on the
norm or sign of the sparse noise $S^\star_{(i)}$. Thus, Theorem 5 holds for arbitrarily large noise
values. Second, at every epoch, Algorithm 1 solves the inner optimization problem (8) via
JIMF to ε-optimality. Also, the convergence of Algorithm 1 is contingent upon the precision
of the JIMF output: the $\ell_\infty$ norm of the optimization error should not be larger than $O(\lambda_1)$.
Such a requirement is not strong, as even the trivial solution $U_g, V_{(i),g}, U_{(i),l}, V_{(i),l} = 0$ is
$\lambda_1$-optimal. One should expect many algorithms to perform much better than the trivial
solution. Indeed, methods including PerPCA and PerDL are proved to output ε-optimal
solutions for arbitrarily small ε within logarithmically many iterations, thus satisfying the requirement.
In practice, heuristic methods including JIVE or COBE can output reasonable solutions
that may also satisfy the requirement in Theorem 5.
In the next section, we provide the sketch of the proof for Theorem 5.
$\mathrm{supp}\left(S^\star_{(i)}\right)$. Therefore, the initial error matrix $E_{(i),1}$ is also α-sparse.
Step 2: Error reduction via JIMF. Suppose that $E_{(i),t}$ is α-sparse. In Step 7, $\hat L_{(i),t}$
is obtained by applying JIMF on $M_{(i)} - \hat S_{(i),t} = L^\star_{(i)} + E_{(i),t}$. Note that the input to JIMF
is the true low-rank component perturbed by an α-sparse matrix $E_{(i),t}$. One of our key
contributions is to show that $\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_\infty$ is much smaller than $\left\|E_{(i),t}\right\|_\infty$, provided that
the true local and global components are µ-incoherent and $E_{(i),t}$ is α-sparse. This fact is
delineated in the following key lemma.
Lemma 6 (Error reduction via JIMF (informal)) Suppose that the conditions of Theorem 5 are satisfied. Moreover, suppose that $E_{(i),t}$ is α-sparse for each client $i$. We have
$$\left\| L^\star_{(i)} - \hat L_{(i),t} \right\|_\infty \le C \cdot \frac{\alpha\mu^2\sqrt{r}}{\sqrt{\theta}} \cdot \max_j \left\| E_{(j),t} \right\|_\infty,$$
Indeed, proving Lemma 6 is particularly daunting since $\hat L_{(i),t}$ does not have a closed-form
solution. We will elaborate on the major techniques to prove Lemma 6 in Section 6.2.
Suppose that α is small enough such that $C \cdot \frac{\alpha\mu^2\sqrt{r}}{\sqrt{\theta}} \le \frac{\rho}{2}$. Then, Lemma 6 implies
that $\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_\infty \le \frac{\rho}{2} \max_i \left\|E_{(i),t}\right\|_\infty$. From the definition of ε-optimality, we know
$\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_\infty \le \frac{\rho}{2} \max_i \left\|E_{(i),t}\right\|_\infty + \epsilon$. This implies that the $\ell_\infty$ norm of the error in
the output of JIMF shrinks by a factor of $\frac{\rho}{2}$ compared with the error in the input, $\left\|E_{(i),t}\right\|_\infty$
(modulo an additive factor ε). As will be discussed next, this shrinkage in the $\ell_\infty$ norm of
the error is essential for the exact sparsity recovery of the noise matrix.
Step 3: Preservation of sparsity via hard-thresholding. Given that
$\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_\infty \le \frac{\rho}{2} \max_i \left\|E_{(i),t}\right\|_\infty + \epsilon$, our next goal is to show that $\mathrm{supp}\left(E_{(i),t+1}\right) \subseteq
\mathrm{supp}\left(S^\star_{(i)}\right)$ (i.e., $E_{(i),t+1}$ remains α-sparse) and $\max_i \left\|E_{(i),t+1}\right\|_\infty \le 2\lambda_{t+1}$. To prove
$\mathrm{supp}\left(E_{(i),t+1}\right) \subseteq \mathrm{supp}\left(S^\star_{(i)}\right)$, suppose that $\left[S^\star_{(i)}\right]_{kl} = 0$ for some $(k, l)$. We have
$\left[\hat S_{(i),t+1}\right]_{kl} \neq 0$ only if $\left|\left[M_{(i)} - \hat L_{(i),t}\right]_{kl}\right| = \left|\left[L^\star_{(i)} - \hat L_{(i),t}\right]_{kl}\right| > \lambda_{t+1}$. On the other
hand, in the Appendix, we show that $\max_i \left\|E_{(i),t}\right\|_\infty \le 2\lambda_t$. This implies that
$\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_\infty \le \frac{\rho}{2} \max_i \left\|E_{(i),t}\right\|_\infty + \epsilon \le \rho\lambda_t + \epsilon = \lambda_{t+1}$. This in turn leads to
$\left[\hat S_{(i),t+1}\right]_{kl} = \left[E_{(i),t+1}\right]_{kl} = 0$, and hence, $\mathrm{supp}\left(E_{(i),t+1}\right) \subseteq \mathrm{supp}\left(S^\star_{(i)}\right)$. Finally, according
to the definition of hard-thresholding, we have $\left|\left[\hat S_{(i),t+1} - S^\star_{(i)} + L^\star_{(i)} - \hat L_{(i),t}\right]_{kl}\right| \le \lambda_{t+1}$,
which, by the triangle inequality, yields $\left|\left[E_{(i),t+1}\right]_{kl}\right| \le \left|\left[L^\star_{(i)} - \hat L_{(i),t}\right]_{kl}\right| + \lambda_{t+1} \le 2\lambda_{t+1}$.
Step 4: Establishing linear convergence. Repeating Steps 2 and 3, we have
$\max_i \left\|E_{(i),t+1}\right\|_\infty \le 2\lambda_{t+1}$ and $\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_\infty \le \lambda_{t+1}$ for all $t$. Noting that $\lambda_t =
\rho\lambda_{t-1} + \epsilon = \epsilon + \rho\epsilon + \rho^2\lambda_{t-2} = \cdots = \epsilon + \rho\epsilon + \cdots + \rho^{t-2}\epsilon + \rho^{t-1}\lambda_1 \le \frac{\epsilon}{1-\rho} + \rho^{t-1}\lambda_1$, we establish
that $\max_i \left\|E_{(i),t}\right\|_\infty = O(\epsilon)$ and $\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_\infty = O(\epsilon)$ in $O\left(\frac{\log(\lambda_1/\epsilon)}{\log(1/\rho)}\right)$ iterations.
Step 5: Untangling global and local components. Under the misalignment condition,
a small error in the joint low-rank components, $\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_F$, indicates that both the shared-component error and the unique-component error are small. More specifically, Theorem 1 in Shi and
Kontar (2024) indicates $\left\|U^\star_g V_{(i),g}^{\star T} - \hat U_{g,t} \hat V_{(i),g,t}^{T}\right\|_F, \left\|U^\star_{(i),l} V_{(i),l}^{\star T} - \hat U_{(i),l,t} \hat V_{(i),l,t}^{T}\right\|_F =
O\left(\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_F\right)$. Since $\left\|L^\star_{(i)} - \hat L_{(i),t}\right\|_F$ shrinks linearly to a small constant, we
can conclude that the estimation errors for the shared and unique features also decrease linearly
to $O(\epsilon)$.
$\{E_{(i)}\}$, how will the recovered solutions change in terms of the $\ell_\infty$ norm? We highlight that
standard matrix perturbation analyses, such as the classical Davis-Kahan bound (Bhatia,
2013) as well as the more recent $\ell_\infty$ bound (Fan et al., 2018), fall short of answering this
question for two main reasons. First, these bounds are overly pessimistic and cannot take
into account the underlying sparsity structure of the noise. Second, they often control the
singular vectors and singular values of the perturbed matrices, whereas the optimal solutions
to problem (8) generally do not correspond to the singular vectors of M̂(i) .
To address these challenges, we characterize the optimal solutions of (8) by analyzing
its Karush–Kuhn–Tucker (KKT) conditions. We establish the KKT condition and ensure
the linear independence constraint qualification (LICQ). Afterward, we obtain closed-form
solutions for the KKT conditions in the form of convergent series and use these series to
control the element-wise perturbation of the solutions.
KKT conditions. The following lemma shows two equivalent formulations for the KKT
conditions. For convenience, we drop the subscript $t$ in our subsequent arguments.
Lemma 7 Suppose that $\{\hat U_g, \hat U_{(i),l}, \hat V_{(i),g}, \hat V_{(i),l}\}$ is the optimal solution to problem (8)
and $\hat M_{(i)}$ has rank at least $r_1 + r_2$. We have
$$\sum_{i=1}^N \left(\hat U_g \hat V_{(i),g}^{T} + \hat U_{(i),l} \hat V_{(i),l}^{T} - \hat M_{(i)}\right) \hat V_{(i),g} = 0 \qquad (12a)$$
$$\left(\hat U_g \hat V_{(i),g}^{T} + \hat U_{(i),l} \hat V_{(i),l}^{T} - \hat M_{(i)}\right) \hat V_{(i),l} = 0 \qquad (12b)$$
$$\left(\hat U_g \hat V_{(i),g}^{T} + \hat U_{(i),l} \hat V_{(i),l}^{T} - \hat M_{(i)}\right)^{T} \hat U_{(i),l} = 0 \qquad (12c)$$
$$\left(\hat U_g \hat V_{(i),g}^{T} + \hat U_{(i),l} \hat V_{(i),l}^{T} - \hat M_{(i)}\right)^{T} \hat U_g = 0 \qquad (12d)$$
$$\hat U_{(i),l}^{T} \hat U_{(i),l} = I, \quad \hat U_g^{T} \hat U_g = I, \quad \hat U_{(i),l}^{T} \hat U_g = 0. \qquad (12e)$$
Moreover, there exist positive diagonal matrices $\Lambda_1 \in \mathbb{R}^{r_1 \times r_1}$, $\Lambda_{2,(i)} \in \mathbb{R}^{r_2 \times r_2}$, and $\Lambda_{3,(i)} \in
\mathbb{R}^{r_1 \times r_2}$ such that the optimality conditions imply:
for some $\hat H_g \in \mathcal{O}^{n_1 \times r_1}$ that spans the same subspace as $\hat U_g$, and some $\hat H_{(i),l} \in \mathcal{O}^{n_1 \times r_2}$ that
spans the same subspace as $\hat U_{(i),l}$.
The $\Lambda_{3,(i)}$ term in (13) complicates the relation between $\hat H_g$ and $\hat H_{(i),l}$. When $\Lambda_{3,(i)}$ is
nonzero, one can see that neither $\hat H_g$ nor $\hat H_{(i),l}$ spans an invariant subspace of $\hat M_{(i)} \hat M_{(i)}^{T}$. As
a consequence, the perturbation analysis from Netrapalli et al. (2014) based on characteristic
equations is not applicable. To alleviate this issue, we provide a more delicate control over
the solution set of (13).
$$\begin{aligned} \left(E_{(i),t} L_{(i)}^{\star T} + L^\star_{(i)} E_{(i),t}^{T} + E_{(i),t} E_{(i),t}^{T}\right) \hat H_{(i),l} - \hat H_{(i),l} \Lambda_{2,(i)} &= -\left(L^\star_{(i)} L_{(i)}^{\star T} \hat H_{(i),l} - \hat H_g \Lambda_{3,(i)}\right) \\ \left(\frac{1}{N} \sum_{i=1}^N \left(E_{(i),t} L_{(i)}^{\star T} + L^\star_{(i)} E_{(i),t}^{T} + E_{(i),t} E_{(i),t}^{T}\right)\right) \hat H_g - \hat H_g \Lambda_1 &= -\left(\frac{1}{N} \sum_{i=1}^N L^\star_{(i)} L_{(i)}^{\star T} \hat H_g - \frac{1}{N} \sum_{i=1}^N \hat H_{(i),l} \Lambda_{3,(i)}^{T}\right). \end{aligned} \qquad (14)$$
We can show that the norms of the input errors $E_{(i),t}$ and $\Lambda_{3,(i)}$ are upper bounded by $O(\sqrt{\alpha})$.
Thus, when the sparsity parameter α is not too large, we can write the solutions to (14) as
a series in α.
In the limit α = 0, we have $E_{(i),t} = 0$; thus, we can solve for the leading terms of $\hat H_{(i),l}$ and
$\hat H_g$ from (14). When α is not too large, we can prove the following lemma.
Lemma 8 (informal) If α is not too large, then $\hat H_g$ and $\hat H_{(i),l}$ introduced in Lemma 7
satisfy,
$$\begin{aligned} \hat H_g &= \left(\frac{1}{N} \sum_{i=1}^N L^\star_{(i)} L_{(i)}^{\star T} \left(\hat H_g - \hat H_{(i),l} \Lambda_{2,(i)}^{-1} \Lambda_{3,(i)}^{T}\right)\right) \Lambda_6^{-1} + O(\sqrt{\alpha}) \\ \hat H_{(i),l} &= L^\star_{(i)} L_{(i)}^{\star T} \hat H_{(i),l} \Lambda_{2,(i)}^{-1} - \left(\frac{1}{N} \sum_{j=1}^N L^\star_{(j)} L_{(j)}^{\star T} \left(\hat H_g - \hat H_{(j),l} \Lambda_{2,(j)}^{-1} \Lambda_{3,(j)}^{T}\right)\right) \Lambda_6^{-1} \Lambda_{3,(i)} \Lambda_{2,(i)}^{-1} + O(\sqrt{\alpha}), \end{aligned} \qquad (15)$$
where the $O(\sqrt{\alpha})$'s are terms whose Frobenius norm and $\ell_\infty$ norm are upper bounded by $O(\sqrt{\alpha})$,
and $\Lambda_6$ is defined as $\Lambda_6 = \Lambda_1 - \frac{1}{N} \sum_{j=1}^N \Lambda_{3,(j)} \Lambda_{2,(j)}^{-1} \Lambda_{3,(j)}^{T}$.
The formal version of Lemma 8 and its proof are relegated to the appendix. We now briefly
introduce our methodology for deriving the solutions in Lemma 8. For matrices $A, B, X, Y$
satisfying the Sylvester equation $AX - XB = -Y$, if the spectra of $A$ and $B$ are separated,
i.e., $\sigma_{\max}(A) < \sigma_{\min}(B)$, then the solution can be written as $X = \sum_{p=0}^\infty A^p Y B^{-1-p}$. We
apply this solution form to (14) and iteratively expand $\hat H_g$ and $\hat H_{(i),l}$. The exact forms
of the resulting series are shown in (41) and (42) in the appendix. In the series, each
term is a product of a group of sparse matrices, an incoherent matrix, and some remaining
terms. Based on the special structure of the series, we can calculate upper bounds on
the Frobenius norm and maximum row norm of each term in the series. The leading
terms are simply $\left(\frac{1}{N}\sum_{i=1}^N L^\star_{(i)} L_{(i)}^{\star T}\left(\hat H_g - \hat H_{(i),l}\Lambda_{2,(i)}^{-1}\Lambda_{3,(i)}^{T}\right)\right)\Lambda_6^{-1}$ and $L^\star_{(i)} L_{(i)}^{\star T}\hat H_{(i),l}\Lambda_{2,(i)}^{-1} -
\left(\frac{1}{N}\sum_{j=1}^N L^\star_{(j)} L_{(j)}^{\star T}\left(\hat H_g - \hat H_{(j),l}\Lambda_{2,(j)}^{-1}\Lambda_{3,(j)}^{T}\right)\right)\Lambda_6^{-1}\Lambda_{3,(i)}\Lambda_{2,(i)}^{-1}$. By summing up the norms of all
remaining higher-order terms in the series and applying a few basic inequalities for geometric
series, we can prove the result in Lemma 8.
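The series form of the Sylvester solution can be checked numerically. The toy example below is purely illustrative (randomly generated matrices of our own choosing): it truncates the series and verifies that the result satisfies $AX - XB = -Y$ when the spectra of $A$ and $B$ are well separated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = 0.1 * rng.normal(size=(n, n))                      # small spectrum
B = 5.0 * np.eye(n) + 0.1 * rng.normal(size=(n, n))    # spectrum bounded away from A's
Y = rng.normal(size=(n, n))
Binv = np.linalg.inv(B)

# Truncated series X = sum_{p >= 0} A^p Y B^{-(p+1)}.
X = sum(np.linalg.matrix_power(A, p) @ Y @ np.linalg.matrix_power(Binv, p + 1)
        for p in range(60))
print(np.linalg.norm(A @ X - X @ B + Y))               # ~0 up to truncation error
```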
Then, we can replace $\hat H_{g,t}$ and $\hat H_{(i),l,t}$ by the Taylor-like series described in Lemma
8. The error between $L^\star_{(i)}$ and $\hat L_{(i),t}$ can be written as the summation of a few terms.
The leading term is $H^\star_g H_g^{\star T} L^\star_{(i)} + H^\star_{(i),l} H_{(i),l}^{\star T} L^\star_{(i)}$, which is identical to $L^\star_{(i)}$ because
of the SVD (3). The remaining terms are errors resulting from the $O(\sqrt{\alpha})$ terms in (15)
and $E_{(i)}$. Each of the error terms possesses a special structure that allows us to derive
an upper bound on its $\ell_\infty$ norm. By summing up all these bounds, we can show that
$\left\|\hat L_{(i),t} - L^\star_{(i)}\right\|_\infty \le O\left(\sqrt{\alpha} \max_j \left\|E_{(j),t}\right\|_\infty\right)$. The detailed calculations of the upper bounds
on the $\ell_\infty$ norms of the error terms are long and repetitive, and are thus relegated to the proof of Lemma
20 in the Appendix.
7. Numerical Experiments
In this section, we investigate the numerical performance of TCMF on several datasets. We
first use synthetic datasets to verify the convergence in Theorem 5 and validate TCMF’s
capability in recovering the common and individual features from noisy observations. Then,
we use two examples of noisy video segmentation and anomaly detection to illustrate the
utility of common, unique, and noise components. We implement Algorithm 1 with HMF
(Shi et al., 2023) as its subroutine JIMF. Experiments in this section are performed on a
desktop with 11th Gen Intel(R) i7-11700KF and NVIDIA GeForce RTX 3080. Code is
available in the linked Github repository.
epoch $t$ as
$$\ell_\infty\text{-global error} = \frac{1}{N} \sum_{i=1}^N \left\| \hat U^\epsilon_{g,t} \hat V_{(i),g,t}^{T} - U^\star_{(i),g} V_{(i),g}^{\star T} \right\|_\infty,$$
the $\ell_\infty$ local error at epoch $t$ as
$$\ell_\infty\text{-local error} = \frac{1}{N} \sum_{i=1}^N \left\| \hat U^\epsilon_{(i),l,t} \hat V_{(i),l,t}^{T} - U^\star_{(i),l} V_{(i),l}^{\star T} \right\|_\infty.$$
We show the error plot for three different sparsity parameters α in Figure 2.
Figure 2: Error plots of Algorithm 1. The x-axis denotes the iteration index, and the y-axis
shows the `∞ error at the corresponding iteration. The y-axis is in log scale.
From Figure 2, it is clear that the global, local, and sparse components indeed converge
linearly to the ground truth.
Further, we compare the feature extraction performance of TCMF with benchmark algo-
rithms, including JIVE (Lock et al., 2013), RJIVE (Sagonas et al., 2017), RaJIVE (Ponzi
et al., 2021), and HMF (Shi et al., 2023). We do not include the comparison with RCICA
(Panagakis et al., 2015) because RCICA is designed only for N = 2, while we have 100 different
sources. Since the errors of different methods vary drastically, we calculate and report the
logarithm of the global error as $\text{g-error} = \log_{10}\left(\frac{1}{N}\sum_{i=1}^N \left\|\hat U^\epsilon_{g,t}\hat V_{(i),g,t}^{\epsilon T} - U^\star_{(i),g}V_{(i),g}^{\star T}\right\|_F^2\right)$, the
logarithm of the local error as $\text{l-error} = \log_{10}\left(\frac{1}{N}\sum_{i=1}^N \left\|\hat U^\epsilon_{(i),l,t}\hat V_{(i),l,t}^{\epsilon T} - U^\star_{(i),l}V_{(i),l}^{\star T}\right\|_F^2\right)$,
and the logarithm of the sparse noise error as $\text{s-error} = \log_{10}\left(\sum_{i=1}^N \left\|\hat S_{(i),t} - S^\star_{(i)}\right\|_F^2\right)$ at
t = 20. We run experiments from 5 different random seeds and calculate the mean and
standard deviation of the log errors. Results are reported in Table 1.
Table 1: Recovery error of different algorithms. The columns g-error, l-error, and
s-error stand for the log recovery errors of global components, local components,
and sparse components.
α = 0.01 α = 0.1
g-error l-error s-error g-error l-error s-error
JIVE 5.52 ± 0.01 5.64 ± 0.01 - 6.52 ± 0.01 6.58 ± 0.01 -
HMF 5.49 ± 0.01 5.62 ± 0.01 - 6.48 ± 0.01 6.55 ± 0.01 -
RaJIVE 5.46 ± 0.01 5.36 ± 0.05 5.71 ± 0.05 6.48 ± 0.00 6.25 ± 0.14 6.59 ± 0.12
RJIVE 5.49 ± 0.01 5.44 ± 0.01 5.77 ± 0.01 6.48 ± 0.00 6.47 ± 0.00 6.78 ± 0.00
TCMF -3.38 ± 0.14 -3.37 ± 0.13 -2.94 ± 0.08 -1.93 ± 0.09 -1.95 ± 0.06 -1.54 ± 0.04
Table 1 shows that TCMF outperforms benchmark algorithms by several orders of magnitude. This
is understandable, as TCMF provably converges to the ground truth, while the benchmark
algorithms either neglect sparse noise or rely on instance-dependent heuristics.
Figure: Original noisy frames, recovered noise, global components, and local components.
2014) to extract the sparse and low-rank components from Mstack . The low-rank component
is regarded as the background, while the sparse component captures the foreground.
To assess the performance of these methods, we calculate the differences between the
recovered background and foreground compared to the ground truths. Specifically, we
estimate the mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural
similarity index (SSIM) of the recovered foreground and background with respect to the
true foreground and background. The comparison results are presented in Table 3.
Table 3: Background and foreground recovery quality metrics for different algorithms.
Background Foreground Wall-clock
MSE ↓ PSNR ↑ SSIM ↑ MSE ↓ PSNR ↑ SSIM ↑ time (s) ↓
JIVE 415 -26 0.08 2521 14 0.03 1.6 × 103
HMF 198 -22 0.18 2413 14 0.05 2.3 × 101
PerPCA 236 -23 0.14 2389 14 0.07 9.8 × 101
RJIVE 277 -24 0.13 1309 16 0.22 9.2 × 101
RaJIVE 170 -22 0.22 166 26 0.18 1.2 × 104
Robust PCA 0.0016 31 1.00 5105 11 0.61 3.3 × 10−1
TCMF 0.0003 33 1.00 98 31 0.98 3.5 × 102
In Table 3, a lower MSE, a higher PSNR, and a higher SSIM signify superior recovery
quality. In terms of background recovery, both TCMF and robust PCA exhibit low MSE,
high PSNR, and high SSIM, surpassing other methods. This suggests that both algorithms
effectively reconstruct the background. This outcome was anticipated as TCMF and robust
PCA possess the capability to differentiate between significant noise and low-rank components.
In contrast, other benchmarks either neglect large noise in the model or rely on heuristics.
Furthermore, TCMF showcases marginally superior performance in MSE and PSNR compared
to robust PCA, signifying higher-quality background recovery.
When it comes to foreground recovery, TCMF outperforms benchmark algorithms signifi-
cantly across all metrics. The inability of robust PCA to achieve high-quality foreground
recovery is likely due to its inability to separate sparse noise from the foreground. JIVE and
HMF yield high MSE and low PSNR, indicating noisy foreground reconstruction. Although
heuristic methods, such as RJIVE and RaJIVE, exhibit slight performance improvements
over JIVE and HMF, they still fall short of the performance exhibited by TCMF. This com-
parison underscores TCMF’s remarkable power to identify unique components from sparse
noise accurately.
We also report the running time of each experiment in Table 3. Compared with heuristic
methods to robustly separate the shared and unique components, TCMF exhibits a slightly
longer running time than RJIVE but significantly outperforms RaJIVE in terms of speed.
The comparison highlights TCMF’s superior performance with moderate computation demands.
Although Robust PCA demonstrates a relatively short running time in this instance, larger-scale experiments presented in Appendix B will show that the running time of Robust PCA
scales more steeply as the problem size increases.
implementation alternately applies SVD and hard thresholding. We require all entries in
the sparse component to be negative in the hard-thresholding step to encode the domain
knowledge that surface defects tend to have lower temperatures. The hyper-parameters for
SVD and thresholding are consistent for both TCMF and Robust PCA.
Figure 3: Left: An example of the surface of the steel bar. There are a few anomalies inside
the red ellipse. Right: Recovered sparse noises, shared components, and unique
components from 2 frames.
In Figure 3, we can see that TCMF effectively identifies the small defects on the steel
plate surface. The global component reflects the general patterns in the video frames, while
the local component accentuates the variations in different frames. In contrast, the sparse
components recovered by robust PCA do not faithfully represent the surface defects.
We proceed to show that TCMF-recovered sparse components can be conveniently leveraged
for frame-level anomaly detection. Our task here is to identify which frames contain surface
anomalies. Inspired by the statistics-based anomaly detection (Chandola et al., 2009), we
construct a simple test statistic to monitor the anomalies. The test statistic is defined as
the $\ell_1$ norm of the recovered sparse noise on each frame, $\|\hat S_{(i)}\|_1$. Indeed, a large $\|\hat S_{(i)}\|_1$
provides strong evidence for surface defects. The choice of the $\ell_1$ norm is not special, as we find
that other norms, such as the $\ell_2$ norm, yield similar performance.
After using TCMF to extract the sparse components, we calculate the test statistics for each
frame. Then, we can set up a simple threshold-based classification rule for anomaly detection:
when the $\ell_1$ norm exceeds the threshold, we report an anomaly in the corresponding frame.
In the case study, the threshold is set to be the highest value in the first 50 frames, which is
the in-control group that does not contain anomalies (Yan et al., 2018). We plot the test
statistics and thresholds in Figure 5. The blue dots and red crosses denote the (ground truth)
normal and abnormal frame labels in Yan et al. (2018). In an ideal plot of test statistics, one
would expect the abnormal samples to have higher `1 norms, while normal samples should
have lower norms. This is indeed the case for Figure 5, where a simple threshold based on
the sparse features can distinguish abnormal samples from normal ones with high accuracy.
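The detection rule above amounts to a few lines of code; the following sketch (with variable names of our own, not the paper's code) computes the per-frame statistics from the recovered sparse components and flags frames that exceed the in-control threshold:

```python
import numpy as np

def frame_statistics(S_hat_list):
    """l1-norm test statistic of the recovered sparse component for each frame."""
    return np.array([np.abs(S).sum() for S in S_hat_list])

def detect_anomalies(stats, n_in_control=50):
    """Flag frames whose statistic exceeds the largest value among the in-control frames."""
    threshold = stats[:n_in_control].max()
    return stats > threshold, threshold
```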
In comparison, we also calculate the $\ell_1$ norm of the sparse noise recovered by robust PCA
and plot the test statistics in Figure 4. In Figure 4, the $\ell_1$ norm is less indicative of
anomaly labels, as some abnormal samples have small test statistics, while some normal
samples have large statistics. It is also hard to use a threshold on the test statistics to
predict anomalies.
The comparison highlights TCMF’s ability to find surface defects. The results are un-
derstandable as TCMF uses a more refined model to decompose the thermal frames into
three parts, thus having more representation power to fit the underlying physics in the
manufacturing process. As a result, the recovered sparse components are more representative
of the anomalies.
8. Conclusion
In this work, we propose a systematic method TCMF to separate shared, unique, and noise
components from noisy observation matrices. TCMF is the first algorithm that is provably
convergent to the ground truth under identifiability conditions that require the three
components to have “small overlaps”. TCMF outperforms previous heuristic algorithms
by large margins in numerical experiments and finds interesting applications in video
segmentation, anomaly detection, and time series imputation.
Our work also opens up several venues for future theoretical exploration in separating
shared and unique low-rank features from noisy matrices. For example, a minimax lower
bound on µ, θ, and α can help fathom the statistical difficulty of such separation. Also,
as many existing methods for JIMF rely on good initialization to excel, designing efficient
algorithms for JIMF that are independent of the initialization is also an interesting topic.
On the practical side, methods to integrate TCMF with other machine learning models, e.g.,
auto-encoders, to find nonlinear features in data are worth exploring.
Acknowledgments
This research is supported in part by Raed Al Kontar’s NSF CAREER Award 2144147 and
Salar Fattahi’s NSF CAREER Award CCF-2337776, NSF Award DMS-2152776, and ONR
Award N00014-22-1-2127.
$$\begin{aligned} \min_{U_g, \{V_{(i),g}, U_{(i),l}, V_{(i),l}\}_{i=1,\cdots,N}} \quad & \sum_{i=1}^N \tilde f_i\left(U_g, \{V_{(i),g}, U_{(i),l}, V_{(i),l}\}\right) \\ = & \sum_{i=1}^N \frac{1}{2}\left\| \hat M_{(i)} - U_g V_{(i),g}^{T} - U_{(i),l} V_{(i),l}^{T} \right\|_F^2 + \frac{\beta}{2}\left\| U_g^{T} U_g - I \right\|_F^2 + \frac{\beta}{2}\left\| U_{(i),l}^{T} U_{(i),l} - I \right\|_F^2 \\ \text{subject to} \quad & U_g^{T} U_{(i),l} = 0, \quad \forall i. \end{aligned} \qquad (16)$$
Compared with (8), the objective in (16) contains two additional regularization terms,
$\frac{\beta}{2}\left\|U_g^{T} U_g - I\right\|_F^2 + \frac{\beta}{2}\left\|U_{(i),l}^{T} U_{(i),l} - I\right\|_F^2$. The regularization terms enhance the smoothness
of the optimization objective, thereby facilitating convergence. Despite the regularization
terms, any optimal solution to (16) is also an optimal solution to (8). We can prove the
claim in the following proposition.
Proof The proof is straightforward. We first claim that $\hat U_g^{\mathrm{HMF}\,T} \hat U_g^{\mathrm{HMF}} = I$ and $\hat U_{(i),l}^{\mathrm{HMF}\,T} \hat U_{(i),l}^{\mathrm{HMF}} = I$.
We prove the claim by contradiction. Suppose otherwise; then we can find QR decompositions
of $\hat U_g^{\mathrm{HMF}}$ and $\hat U_{(i),l}^{\mathrm{HMF}}$ as $\hat U_g^{\mathrm{HMF}} = Q_g R_g$ and $\hat U_{(i),l}^{\mathrm{HMF}} = Q_{(i),l} R_{(i),l}$, where $Q_g$ and the $Q_{(i),l}$'s are
orthonormal and $R_g$ and the $R_{(i),l}$'s are upper triangular. Furthermore, not both $R_g$ and $R_{(i),l}$
are identity matrices; thus $\left\|R_g^{T} R_g - I\right\|_F^2 + \left\|R_{(i),l}^{T} R_{(i),l} - I\right\|_F^2 > 0$. Now, we construct a
refined set of solutions as,
$$\hat U_g^{\mathrm{HMF,refined}} = Q_g, \quad \hat V_{(i),g}^{\mathrm{HMF,refined}} = \hat V_{(i),g}^{\mathrm{HMF}} R_g^{T}, \quad \hat U_{(i),l}^{\mathrm{HMF,refined}} = Q_{(i),l}, \quad \hat V_{(i),l}^{\mathrm{HMF,refined}} = \hat V_{(i),l}^{\mathrm{HMF}} R_{(i),l}^{T}.$$
$$\begin{aligned} & \sum_{i=1}^N \tilde f_i\left(\hat U_g^{\mathrm{HMF,refined}}, \hat V_{(i),g}^{\mathrm{HMF,refined}}, \hat U_{(i),l}^{\mathrm{HMF,refined}}, \hat V_{(i),l}^{\mathrm{HMF,refined}}\right) \\ = & \sum_{i=1}^N \left[\tilde f_i\left(\hat U_g^{\mathrm{HMF}}, \hat V_{(i),g}^{\mathrm{HMF}}, \hat U_{(i),l}^{\mathrm{HMF}}, \hat V_{(i),l}^{\mathrm{HMF}}\right) - \frac{\beta}{2}\left(\left\|R_g^{T} R_g - I\right\|_F^2 + \left\|R_{(i),l}^{T} R_{(i),l} - I\right\|_F^2\right)\right] \\ < & \sum_{i=1}^N \tilde f_i\left(\hat U_g^{\mathrm{HMF}}, \hat V_{(i),g}^{\mathrm{HMF}}, \hat U_{(i),l}^{\mathrm{HMF}}, \hat V_{(i),l}^{\mathrm{HMF}}\right), \end{aligned}$$
which contradicts the global optimality of $\hat U_g^{\mathrm{HMF}}, \{\hat V_{(i),g}^{\mathrm{HMF}}, \hat U_{(i),l}^{\mathrm{HMF}}, \hat V_{(i),l}^{\mathrm{HMF}}\}$. This proves the
claim.
From the orthogonality, we know $f_i\left(\hat U_g^{\mathrm{HMF}}, \hat V_{(i),g}^{\mathrm{HMF}}, \hat U_{(i),l}^{\mathrm{HMF}}, \hat V_{(i),l}^{\mathrm{HMF}}\right) =
\tilde f_i\left(\hat U_g^{\mathrm{HMF}}, \hat V_{(i),g}^{\mathrm{HMF}}, \hat U_{(i),l}^{\mathrm{HMF}}, \hat V_{(i),l}^{\mathrm{HMF}}\right)$.
N
X
fi ÛJIMF
g , V̂ JIMF
(i),g , ÛJIMF
(i),l , V̂ JIMF
(i),l
i=1
N
X
< fi ÛHMF
g , V̂ HMF
(i),g , ÛHMF
(i),l , V̂ HMF
(i),l
i=1
XN
= f˜i ÛHMF HMF HMF HMF
g , V̂(i),g , Û(i),l , V̂(i),l .
i=1
\[
\hat U_g^{\rm JIMF,refined} = Q_g^{\rm JIMF}, \qquad
\hat V_{(i),g}^{\rm JIMF,refined} = \hat V_{(i),g}^{\rm JIMF}\big(R_g^{\rm JIMF}\big)^T, \qquad
\hat U_{(i),l}^{\rm JIMF,refined} = Q_{(i),l}^{\rm JIMF}, \qquad
\hat V_{(i),l}^{\rm JIMF,refined} = \hat V_{(i),l}^{\rm JIMF}\big(R_{(i),l}^{\rm JIMF}\big)^T,
\]
where $Q_g^{\rm JIMF}, R_g^{\rm JIMF}, Q_{(i),l}^{\rm JIMF}, R_{(i),l}^{\rm JIMF}$ are QR decompositions that satisfy $\hat U_g^{\rm JIMF} = Q_g^{\rm JIMF} R_g^{\rm JIMF}$ and $\hat U_{(i),l}^{\rm JIMF} = Q_{(i),l}^{\rm JIMF} R_{(i),l}^{\rm JIMF}$. Based on the refined set of solutions, we
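To make the QR-based refinement used in the arguments above concrete, the following minimal Python sketch (ours, purely illustrative; the variable names are not from the paper's code) checks numerically that replacing $(\hat U, \hat V)$ with $(Q, \hat V R^T)$ from a QR decomposition leaves the product $\hat U\hat V^T$, and hence the data-fitting term, unchanged while orthonormalizing the feature matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 20, 30, 3

# A non-orthonormal feature matrix U and coefficient matrix V.
U = rng.normal(size=(n1, r))
V = rng.normal(size=(n2, r))

# QR refinement: U = Q R with Q orthonormal, R upper-triangular.
Q, R = np.linalg.qr(U)
V_refined = V @ R.T          # absorb R into the coefficients

# The product (and therefore the residual ||M - U V^T||_F) is unchanged ...
assert np.allclose(U @ V.T, Q @ V_refined.T)
# ... while the refined feature matrix is orthonormal, so the
# beta-penalty ||Q^T Q - I||_F^2 vanishes.
assert np.allclose(Q.T @ Q, np.eye(r), atol=1e-10)
print("refinement preserves U V^T and orthonormalizes U")
```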
HMF optimizes the objective by gradient descent. To ensure feasibility, HMF employs a
special correction step to orthogonalize Ug and U(i),l without changing the objective at
every step. The pseudo-code is presented in Algorithm 2.
In Algorithm 2, we use τ to denote the iteration index, where a half-integer index indicates
that the update of the corresponding variable is only half complete: the variable has been updated
by gradient descent but is not yet feasible. It is proven that, under a group of sufficient conditions,
Algorithm 2 converges to the optimal solutions of problem (16). The sufficient conditions require the
stepsize ητ to be chosen appropriately and the initialization to be close to the optimal solution
(Shi et al., 2023).
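As an illustration of this update pattern, the Python sketch below performs one gradient step on the regularized objective for a single source and then restores feasibility by projecting $U_{(i),l}$ onto the orthogonal complement of the span of $U_g$. This is our own simplified sketch, not the exact correction step of Algorithm 2 (which is designed so that the objective is unchanged, a property this toy projection does not guarantee); all names are illustrative assumptions.

```python
import numpy as np

def hmf_style_step(M, Ug, Vg, Ul, Vl, beta=1.0, eta=1e-3):
    """One illustrative HMF-style update for a single source: a gradient step on
    the regularized objective (16), followed by a simple feasibility correction."""
    R = M - Ug @ Vg.T - Ul @ Vl.T                 # data-fitting residual
    rg, rl = Ug.shape[1], Ul.shape[1]

    # Gradient step (the "half-complete" update: feasibility not yet enforced).
    grad_Ug = -R @ Vg + 2 * beta * Ug @ (Ug.T @ Ug - np.eye(rg))
    grad_Ul = -R @ Vl + 2 * beta * Ul @ (Ul.T @ Ul - np.eye(rl))
    Ug = Ug - eta * grad_Ug
    Ul = Ul - eta * grad_Ul

    # Correction: re-impose Ug^T U_(i),l = 0 by projecting Ul onto the
    # orthogonal complement of span(Ug).
    Qg, _ = np.linalg.qr(Ug)
    Ul = Ul - Qg @ (Qg.T @ Ul)
    return Ug, Vg, Ul, Vl
```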
In practice, Algorithm 2 is often efficient and accurate. Therefore, we implement HMF
as the subroutine JIMF for all of our numerical simulations in Section 7. To initialize
Algorithm 2, we adopt a spectral initialization approach. Specifically, we concatenate all
The objective only optimizes over the feature matrices $U_g$ and $U_{(i),l}$, but it is essentially equivalent
to problem (8). The formal statement is presented in the following proposition.
\[
\begin{aligned}
\sum_{i=1}^{N} f_i\big(\hat U_g^{\rm JIMF}, \hat V_{(i),g}^{\rm JIMF}, \hat U_{(i),l}^{\rm JIMF}, \hat V_{(i),l}^{\rm JIMF}\big)
&< \sum_{i=1}^{N} f_i\big(\hat U_g^{\rm PerPCA}, \hat M_{(i)}^T\hat U_g^{\rm PerPCA}, \hat U_{(i),l}^{\rm PerPCA}, \hat M_{(i)}^T\hat U_{(i),l}^{\rm PerPCA}\big) \\
&= \sum_{i=1}^{N}\big\|\hat M_{(i)} - U_g^{\rm PerPCA}\big(U_g^{\rm PerPCA}\big)^T\hat M_{(i)} - U_{(i),l}^{\rm PerPCA}\big(U_{(i),l}^{\rm PerPCA}\big)^T\hat M_{(i)}\big\|_F^2.
\end{aligned}
\]
\[
\begin{aligned}
&\sum_{i=1}^{N}\Big\|\hat M_{(i)} - \hat U_{(i),l}^{\rm JIMF}\big((\hat U_{(i),l}^{\rm JIMF})^T\hat U_{(i),l}^{\rm JIMF}\big)^{-1}(\hat U_{(i),l}^{\rm JIMF})^T\hat M_{(i)}
- \hat U_{(i),g}^{\rm JIMF}\big((\hat U_{(i),g}^{\rm JIMF})^T\hat U_{(i),g}^{\rm JIMF}\big)^{-1}(\hat U_{(i),g}^{\rm JIMF})^T\hat M_{(i)}\Big\|_F^2 \\
&\quad= \sum_{i=1}^{N} f_i\big(\hat U_g^{\rm JIMF}, \hat V_{(i),g}^{\rm JIMF,opt}, \hat U_{(i),l}^{\rm JIMF}, \hat V_{(i),l}^{\rm JIMF,opt}\big)
\le \sum_{i=1}^{N} f_i\big(\hat U_g^{\rm JIMF}, \hat V_{(i),g}^{\rm JIMF}, \hat U_{(i),l}^{\rm JIMF}, \hat V_{(i),l}^{\rm JIMF}\big) \\
&\quad< \sum_{i=1}^{N}\big\|\hat M_{(i)} - U_g^{\rm PerPCA}\big(U_g^{\rm PerPCA}\big)^T\hat M_{(i)} - U_{(i),l}^{\rm PerPCA}\big(U_{(i),l}^{\rm PerPCA}\big)^T\hat M_{(i)}\big\|_F^2.
\end{aligned}
\]
If we define $\hat U_{(i),l}^{\rm JIMF,refine} = \hat U_{(i),l}^{\rm JIMF}\big((\hat U_{(i),l}^{\rm JIMF})^T\hat U_{(i),l}^{\rm JIMF}\big)^{-1/2}$ and
$\hat U_{(i),g}^{\rm JIMF,refine} = \hat U_{(i),g}^{\rm JIMF}\big((\hat U_{(i),g}^{\rm JIMF})^T\hat U_{(i),g}^{\rm JIMF}\big)^{-1/2}$, then $\hat U_{(i),l}^{\rm JIMF,refine}$ and $\hat U_{(i),g}^{\rm JIMF,refine}$ are also feasible for (17)
and achieve a lower objective. This contradicts the optimality of $U_g^{\rm PerPCA}$ and $U_{(i),l}^{\rm PerPCA}$.
To solve the constrained optimization problem (17), personalized PCA adopts a distributed
version of Stiefel gradient descent. The pseudo-code is presented in Algorithm 3.
In Algorithm 3, GR denotes the generalized retraction. In practice, it can be implemented
via the polar projection
\[
{\rm GR}_U(V) = (U + V)\big(U^T U + V^T U + U^T V + V^T V\big)^{-\frac{1}{2}}.
\]
Algorithm 3 can also be proved to converge to the optimal solutions with suitable choices of stepsize and
initialization (Shi and Kontar, 2024).
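A minimal Python sketch of this polar-projection retraction (our own illustrative implementation, not the paper's code) is given below; it maps a point U with orthonormal columns and an update direction V back onto the Stiefel manifold.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def generalized_retraction(U, V):
    """Polar projection GR_U(V) = (U + V)(U^T U + V^T U + U^T V + V^T V)^{-1/2}."""
    G = U.T @ U + V.T @ U + U.T @ V + V.T @ V   # Gram matrix of (U + V)
    return (U + V) @ fractional_matrix_power(G, -0.5)

# Quick check: the retracted point has orthonormal columns.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(10, 3)))   # a point on the Stiefel manifold
V = 0.1 * rng.normal(size=(10, 3))              # an arbitrary update direction
W = generalized_retraction(U, V)
assert np.allclose(W.T @ W, np.eye(3), atol=1e-8)
```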
In this section, we include an additional running time comparison between TCMF and Robust
PCA. We use a set of synthetic datasets with varying numbers of sources N and compare
the per-iteration running time of the two algorithms. More specifically, we follow the setting
in Section 7.1, where n1 = 15 and n2 = 1000, and generate synthetic datasets in which the
number of sources N changes from 100 to 10000. Then, we apply TCMF and Robust PCA to
the same dataset. For a fair comparison, we do not parallelize computations for either algorithm.
The per-iteration running time of the two algorithms is collected and plotted in Figure 6.
Figure 6: Per-iteration running time comparison between TCMF and Robust PCA.
From Figure 6, it is clear that the running time of TCMF scales linearly with the number
of sources N, which is consistent with the complexity analysis.
In contrast, although Robust PCA has a smaller per-iteration runtime when N is small,
as N becomes larger, its runtime increases faster than that of TCMF. This is because Robust
PCA vectorizes the observation matrices from each source. The resulting vector from each
source has dimension n1 n2. Robust PCA then concatenates these vectors into an n1 n2 × N
matrix and alternately performs singular value decomposition and hard-thresholding. For
each application of SVD, the computational complexity is O(n1 n2 (N^2 + N) + N^3) when
N ≤ n1 n2 (Li et al., 2019). Indeed, in Figure 6, the slope of the initial part of the Robust
PCA curve is around 1.2, and the slope of the final part is around 3.0, suggesting that the
running time scales cubically in the large-N regime.
Since in the ground truth model the SVD of $L^\star_{(i)}$ can be written as
$L^\star_{(i)} = [H^\star_g, H^\star_{(i),l}]\,{\rm diag}(\Sigma_{(i),g}, \Sigma_{(i),l})\,[W^\star_{(i),g}, W^\star_{(i),l}]^T$, one can immediately see that the
nonzero eigenvalues of $T_{(i)}$ are upper bounded by $\sigma_{\max}^2$ and lower bounded by $\sigma_{\min}^2$. Finally,
recall that we use $\hat U_g$, $\hat U_{(i),l}$, $\hat V_{(i),g}$, and $\hat V_{(i),l}$ to denote the optimal solutions to (8) (we
omit the subscript $t$ here for brevity). For a series of square matrices of the same shape
$A_1, \cdots, A_k \in \mathbb{R}^{r\times r}$, we use $\prod_{m=1}^{k} A_m$ to denote the product of these matrices in ascending
order of indices, and $\prod_{m=k}^{1} A_m$ to denote the product in descending order of indices,
\[
\prod_{m=1}^{k} A_m = A_1 A_2 \cdots A_{k-1} A_k, \qquad
\prod_{m=k}^{1} A_m = A_k A_{k-1} \cdots A_2 A_1.
\]
Our next two lemmas provide upper bounds on the maximum row norm of the errors
in terms of the $\ell_\infty$-norms of $E_{(i)}$. Building upon these two lemmas, we provide a key
result in Lemma 13 connecting $\{F_{(i)}\}$ and the error matrices $\{E_{(i)}\}$.
Lemma 11 Suppose that $E_{(1)}, \cdots, E_{(N)} \in \mathbb{R}^{n_1\times n_2}$ are $\alpha$-sparse and $U \in \mathbb{R}^{n_1\times r}$ is $\mu$-incoherent with $\|U\| \le 1$. For any integers $p_1, p_2, \cdots, p_k \ge 0$ and $i_1, i_2, \cdots, i_k \in \{0, 1, \cdots, N\}$, we have
\[
\max_j \Big\| e_j^T \Big(\prod_{\ell=1}^{k} (E_{(i_\ell)} E_{(i_\ell)}^T)^{p_\ell}\Big) U \Big\|_2
\le \sqrt{\frac{\mu^2 r}{n_1}} \Big(\alpha n \max_i \big\|E_{(i)}\big\|_\infty\Big)^{2(p_1 + p_2 + \cdots + p_k)}. \tag{20}
\]
With a slight abuse of notation, in Lemma 11 and the rest of the paper, we define $E_{(0)} E_{(0)}^T$ to be
\[
E_{(0)} E_{(0)}^T = \frac{1}{N} \sum_{i=1}^{N} E_{(i)} E_{(i)}^T. \tag{21}
\]
Proof We will prove it by induction on the exponent. From the definition of incoherence,
we know that when $p_1 + \cdots + p_k = 0$, the inequality (20) holds. Now suppose that
the inequality (20) holds for all $p_1, p_2, \cdots, p_k \ge 0$ such that $p_1 + \cdots + p_k \le s-1$ and
$i_1, i_2, \cdots, i_k \in \{0, 1, \cdots, N\}$. We will prove the statement for $p_1 + \cdots + p_k = s$. Without
loss of generality, we assume $p_1 \ge 1$. One can write
\[
\begin{aligned}
\Big\| e_j^T \Big(\prod_{\ell=1}^{k}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big) U \Big\|_2^2
&= \sum_{l} \Big( e_j^T \Big(\prod_{\ell=1}^{k}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big) U e_l \Big)^2 \\
&= \sum_{l} \Big( e_j^T E_{(i_1)}E_{(i_1)}^T (E_{(i_1)}E_{(i_1)}^T)^{p_1-1} \Big(\prod_{\ell=2}^{k}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big) U e_l \Big)^2 \\
&= \sum_{l} \Big( \sum_{h} \big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h}\, e_h^T (E_{(i_1)}E_{(i_1)}^T)^{p_1-1} \Big(\prod_{\ell=2}^{k}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big) U e_l \Big)^2 \\
&= \sum_{l}\sum_{h_1,h_2} \big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_1}\big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_2} \\
&\qquad\times e_{h_1}^T (E_{(i_1)}E_{(i_1)}^T)^{p_1-1}\Big(\prod_{\ell=2}^{k}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big) U e_l\, e_l^T U^T \Big(\prod_{\ell=k}^{2}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big)(E_{(i_1)}E_{(i_1)}^T)^{p_1-1} e_{h_2}.
\end{aligned} \tag{22}
\]
Since $\sum_l e_l e_l^T = I$, we can simplify the summation as
\[
\begin{aligned}
& \sum_{h_1,h_2} \big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_1}\big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_2}
\times e_{h_1}^T (E_{(i_1)}E_{(i_1)}^T)^{p_1-1}\Big(\prod_{\ell=2}^{k}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big) U U^T \Big(\prod_{\ell=k}^{2}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big)(E_{(i_1)}E_{(i_1)}^T)^{p_1-1} e_{h_2} \\
&\le \sum_{h_1,h_2} \Big|\big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_1}\Big|\Big|\big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_2}\Big|
\times \max_m \Big\| e_m^T (E_{(i_1)}E_{(i_1)}^T)^{p_1-1}\Big(\prod_{\ell=2}^{k}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big) U \Big\|_2^2 \\
&\le \sum_{h_1,h_2} \Big|\big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_1}\Big|\Big|\big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_2}\Big|
\times \frac{\mu^2 r}{n_1}\Big(\alpha n \max_i \big\|E_{(i)}\big\|_\infty\Big)^{4s-4},
\end{aligned}
\]
where in the last step we used the induction hypothesis. Now, to complete the proof, we
consider two cases.
If $i_1 > 0$, we have:
\[
\begin{aligned}
\sum_{h_1,h_2} \Big|\big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_1}\Big|\Big|\big[E_{(i_1)}E_{(i_1)}^T\big]_{j,h_2}\Big|
&= \sum_{h_1,h_2,g_1,g_2} \big|(E_{(i_1)})_{j,g_1}\big|\big|(E_{(i_1)})_{h_1,g_1}\big|\big|(E_{(i_1)})_{j,g_2}\big|\big|(E_{(i_1)})_{h_2,g_2}\big| \\
&\le \alpha n_1\big\|E_{(i_1)}\big\|_\infty\cdot\alpha n_2\big\|E_{(i_1)}\big\|_\infty\cdot\alpha n_1\big\|E_{(i_1)}\big\|_\infty\cdot\alpha n_2\big\|E_{(i_1)}\big\|_\infty
= \Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big)^4,
\end{aligned}
\]
where the last inequality holds because at most $\alpha n_1$ entries in each column of $E_{(i_1)}$ are
nonzero and at most $\alpha n_2$ entries in each row of $E_{(i_1)}$ are nonzero. On the other hand, if
$i_1 = 0$, we have:
\[
\sum_{h_1,h_2} \Big|\big[E_{(0)}E_{(0)}^T\big]_{j,h_1}\Big|\Big|\big[E_{(0)}E_{(0)}^T\big]_{j,h_2}\Big|
\le \frac{1}{N^2}\sum_{h_1,h_2}\sum_{f_1>0}\sum_{f_2>0}\Big|\big[E_{(f_1)}E_{(f_1)}^T\big]_{j,h_1}\Big|\Big|\big[E_{(f_2)}E_{(f_2)}^T\big]_{j,h_2}\Big|
= \frac{1}{N^2}\,N^2\Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big)^4.
\]
In both cases, we conclude that
\[
\Big\| e_j^T \Big(\prod_{\ell=1}^{k}(E_{(i_\ell)}E_{(i_\ell)}^T)^{p_\ell}\Big) U \Big\|_2^2
\le \frac{\mu^2 r}{n_1}\Big(\alpha n \max_i \big\|E_{(i)}\big\|_\infty\Big)^{4s},
\]
which completes the induction.
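As a quick numerical sanity check of the bound (20) (our own illustrative script, assuming $n = \sqrt{n_1 n_2}$ and measuring the incoherence and sparsity levels empirically from randomly generated matrices), the following Python sketch verifies the simplest nontrivial case $k = 1$, $p_1 = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 60, 80, 3
n = np.sqrt(n1 * n2)

# A mu-incoherent U with ||U|| <= 1: orthonormal columns of a random Gaussian.
U, _ = np.linalg.qr(rng.normal(size=(n1, r)))
mu = np.max(np.linalg.norm(U, axis=1)) * np.sqrt(n1 / r)   # empirical incoherence

# An alpha-sparse E: only a small fraction of entries is nonzero.
E = rng.normal(size=(n1, n2)) * (rng.random((n1, n2)) < 0.02)
alpha = max(np.max(np.count_nonzero(E, axis=1)) / n2,
            np.max(np.count_nonzero(E, axis=0)) / n1)      # empirical sparsity level
E_inf = np.abs(E).max()

lhs = np.max(np.linalg.norm((E @ E.T) @ U, axis=1))        # max_j ||e_j^T (E E^T) U||_2
rhs = np.sqrt(mu**2 * r / n1) * (alpha * n * E_inf) ** 2    # bound (20) with k = 1, p1 = 1
print(lhs <= rhs, lhs, rhs)
```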
Lemma 12 Suppose that $E_{(1)}, \cdots, E_{(N)} \in \mathbb{R}^{n_1\times n_2}$ are $\alpha$-sparse and $V \in \mathbb{R}^{n_2\times r}$ is $\mu$-incoherent. For any integers $p_1, p_2, \cdots, p_k \ge 0$ and $i_1, i_2, \cdots, i_k \in \{0, 1, \cdots, N\}$, we have
\[
\max_j \Big\| e_j^T \Big(\prod_{\ell=1}^{k} (E_{(i_\ell)} E_{(i_\ell)}^T)^{p_\ell}\Big) E_{(i_{k+1})} V \Big\|_2
\le \sqrt{\frac{\mu^2 r}{n_1}} \Big(\alpha n \max_i \big\|E_{(i)}\big\|_\infty\Big)^{2(p_1 + p_2 + \cdots + p_k)+1}. \tag{23}
\]
Proof The proof is analogous to that of Lemma 11 and is hence omitted for brevity.
Combining Lemmas 11 and 12, we can show the following key lemma on the connection
between $\{F_{(i)}\}$ and the error matrices $\{E_{(i)}\}$.
Lemma 13 For every $i \in [N]$, suppose that $E_{(i)} \in \mathbb{R}^{n_1\times n_2}$ is $\alpha$-sparse and $L^\star_{(i)} = H^\star_{(i)}\Sigma^\star_{(i)}W^{\star T}_{(i)}$ is rank-$r$ with $\mu$-incoherent matrices $H^\star_{(i)} \in \mathbb{R}^{n_1\times r}$ and $W^\star_{(i)} \in \mathbb{R}^{n_2\times r}$.
Proof We first expand $F_{(i_1)}^{p_1}F_{(i_2)}^{p_2}\cdots F_{(i_k)}^{p_k} U$ and rearrange the terms by the number of
consecutive $E_{(i)}E_{(i)}^T$ factors appearing at the beginning of each term.
For an $\alpha$-sparse matrix $E_{(i)} \in \mathbb{R}^{n_1\times n_2}$, its operator norm is bounded by
\[
\begin{aligned}
\big\|E_{(i)}\big\|_2 &= \max_{\|v\|=1, \|h\|=1} v^T E_{(i)} h
= \max_{\|v\|=1, \|h\|=1} \sum_{j,k} v_j h_k [E_{(i)}]_{jk} \\
&\le \max_{\|v\|=1, \|h\|=1} \sum_{j,k} \frac{1}{2}\Big(\sqrt{\frac{n_1}{n_2}}\, v_j^2 + \sqrt{\frac{n_2}{n_1}}\, h_k^2\Big)\big|[E_{(i)}]_{jk}\big| \\
&\le \max_{\|v\|=1, \|h\|=1} \frac{1}{2}\big\|E_{(i)}\big\|_\infty\Big(\sqrt{\frac{n_1}{n_2}}\,\alpha n_2\sum_j v_j^2 + \sqrt{\frac{n_2}{n_1}}\,\alpha n_1\sum_k h_k^2\Big)
= \alpha\sqrt{n_1 n_2}\,\big\|E_{(i)}\big\|_\infty.
\end{aligned}
\]
Therefore $\|E_{(i)}\|_2 \le \alpha n \max_i \|E_{(i)}\|_\infty$. As a result, we know
\[
\big\|F_{(i)}\big\|_2 \le 2\sigma_{\max}\,\alpha n\max_i\big\|E_{(i)}\big\|_\infty + \Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big)^2
= \alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big(2\sigma_{\max} + \alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big).
\]
We thus have:
\[
\begin{aligned}
&= \sqrt{\frac{\mu^2 r}{n_1}}\Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big)^{2\sum_{\ell=1}^k p_\ell}
+ \Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big)^{\sum_{\ell=1}^k p_\ell}\Big(2\sigma_{\max} + \alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big)^{\sum_{\ell=1}^k p_\ell}
- \Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big)^{\sum_{\ell=1}^k p_\ell} \\
&= \sqrt{\frac{\mu^2 r}{n_1}}\Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big)^{\sum_{\ell=1}^k p_\ell}\Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty + 2\sigma_{\max}\Big)^{\sum_{\ell=1}^k p_\ell}.
\end{aligned}
\]
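The following short Python check (ours, purely illustrative) confirms the operator norm bound $\|E\|_2 \le \alpha\sqrt{n_1 n_2}\,\|E\|_\infty$ derived above on a randomly generated sparse matrix, where $\alpha$ is measured empirically as the largest fraction of nonzero entries in any row or column.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 50, 120
E = rng.normal(size=(n1, n2)) * (rng.random((n1, n2)) < 0.05)  # sparse corruption

alpha = max(np.max(np.count_nonzero(E, axis=1)) / n2,   # densest row
            np.max(np.count_nonzero(E, axis=0)) / n1)   # densest column
op_norm = np.linalg.norm(E, 2)                          # spectral norm
bound = alpha * np.sqrt(n1 * n2) * np.abs(E).max()      # alpha * sqrt(n1 n2) * ||E||_inf
assert op_norm <= bound + 1e-12
print(f"||E||_2 = {op_norm:.3f} <= {bound:.3f}")
```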
Proof of Equations (12). The Lagrangian of the optimization problem (8) can be written as
\[
\mathcal{L} = \sum_{i=1}^{N}\frac{1}{2}\big\|U_g V_{(i),g}^T + U_{(i),l} V_{(i),l}^T - \hat M_{(i)}\big\|_F^2
+ {\rm Tr}\big(\Lambda_{8,(i)} U_g^T U_{(i),l}\big), \tag{25}
\]
where $\Lambda_{8,(i)}$ is the dual variable for the constraint $U_g^T U_{(i),l} = 0$.
Under the LICQ, we know that $\hat U_g, \{\hat V_{(i),g}, \hat U_{(i),l}, \hat V_{(i),l}\}$ satisfies the KKT conditions. Setting
the gradient of $\mathcal{L}$ with respect to $V_{(i),g}$ and $V_{(i),l}$ to zero, we can prove (12d) and (12c).
Considering the constraint $\hat U_g^T\hat U_{(i),l} = 0$, we can solve them as $\hat V_{(i),g} = \hat M_{(i)}^T\hat U_g\big(\hat U_g^T\hat U_g\big)^{-1}$
and $\hat V_{(i),l} = \hat M_{(i)}^T\hat U_{(i),l}\big(\hat U_{(i),l}^T\hat U_{(i),l}\big)^{-1}$. Then we examine the gradient of $\mathcal{L}$ with respect to
$U_{(i),l}$:
\[
\frac{\partial}{\partial U_{(i),l}}\mathcal{L} = \big(U_g V_{(i),g}^T + U_{(i),l} V_{(i),l}^T - \hat M_{(i)}\big)V_{(i),l} + U_g\Lambda_{(8),i}^T.
\]
Substituting $\hat V_{(i),g} = \hat M_{(i)}^T\hat U_g\big(\hat U_g^T\hat U_g\big)^{-1}$ and $\hat V_{(i),l} = \hat M_{(i)}^T\hat U_{(i),l}\big(\hat U_{(i),l}^T\hat U_{(i),l}\big)^{-1}$ in the
above gradient and setting it to zero, we have
\[
\Big(\hat U_g\big(\hat U_g^T\hat U_g\big)^{-1}\hat U_g^T + \hat U_{(i),l}\big(\hat U_{(i),l}^T\hat U_{(i),l}\big)^{-1}\hat U_{(i),l}^T - I\Big)\hat M_{(i)}\hat V_{(i),l}
+ \hat U_g\Lambda_{(8),i}^T = 0.
\]
Similarly,
\[
\frac{\partial}{\partial U_g}\mathcal{L} = \sum_{i=1}^{N}\big(\hat U_g\hat V_{(i),g}^T + \hat U_{(i),l}\hat V_{(i),l}^T - \hat M_{(i)}\big)\hat V_{(i),g} = 0.
\]
Left multiplying both sides by $\hat U_g^T$, we have $\hat U_g^T\hat U_g - I = 0$. We have thus proven (12a).
This completes the proof for (12).
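For intuition, the closed-form coefficient updates appearing in this proof are ordinary least-squares solutions. The minimal Python sketch below (ours, illustrative) computes $\hat V_{(i),g} = \hat M_{(i)}^T\hat U_g(\hat U_g^T\hat U_g)^{-1}$ and $\hat V_{(i),l} = \hat M_{(i)}^T\hat U_{(i),l}(\hat U_{(i),l}^T\hat U_{(i),l})^{-1}$ for one source on a toy example.

```python
import numpy as np

def optimal_coefficients(M, Ug, Ul):
    """Closed-form V-updates for one source, given feature matrices Ug, Ul
    with Ug^T Ul = 0 (see the KKT derivation above)."""
    Vg = M.T @ Ug @ np.linalg.inv(Ug.T @ Ug)   # V_(i),g = M^T Ug (Ug^T Ug)^{-1}
    Vl = M.T @ Ul @ np.linalg.inv(Ul.T @ Ul)   # V_(i),l = M^T Ul (Ul^T Ul)^{-1}
    return Vg, Vl

# Toy usage: orthogonal global/local features plus small dense noise.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(15, 4)))
Ug, Ul = Q[:, :2], Q[:, 2:]                    # ensures Ug^T Ul = 0
M = Ug @ rng.normal(size=(100, 2)).T + Ul @ rng.normal(size=(100, 2)).T \
    + 0.01 * rng.normal(size=(15, 100))
Vg, Vl = optimal_coefficients(M, Ug, Ul)
print(np.linalg.norm(M - Ug @ Vg.T - Ul @ Vl.T))   # small residual
```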
Proof of Equations (13). Equation (12b) can be rewritten as:
\[
\hat M_{(i)}\hat M_{(i)}^T\hat U_{(i),l} = \hat U_{(i),l}\hat U_{(i),l}^T\hat M_{(i)}\hat M_{(i)}^T\hat U_{(i),l} + \hat U_g\hat U_g^T\hat M_{(i)}\hat M_{(i)}^T\hat U_{(i),l}. \tag{26}
\]
$\hat H_{(i),l}$ is also orthonormal, as $\hat H_{(i),l}^T\hat H_{(i),l} = W_{(i),l}^T\hat U_{(i),l}^T\hat U_{(i),l}W_{(i),l} = I$. Similarly, we rewrite
equation (12a) as:
\[
\frac{1}{N}\sum_{i=1}^{N}M_{(i)}M_{(i)}^T\hat U_g = \hat U_g\hat U_g^T\frac{1}{N}\sum_{i=1}^{N}M_{(i)}M_{(i)}^T\hat U_g + \frac{1}{N}\sum_{i=1}^{N}\hat U_{(i),l}\hat U_{(i),l}^T M_{(i)}M_{(i)}^T\hat U_g. \tag{27}
\]
Since $\hat U_g^T\big(\frac{1}{N}\sum_{i=1}^{N}\hat M_{(i)}\hat M_{(i)}^T\big)\hat U_g$ is positive definite, we can use $W_g\Lambda_1 W_g^T = \hat U_g^T\big(\frac{1}{N}\sum_{i=1}^{N}\hat M_{(i)}\hat M_{(i)}^T\big)\hat U_g$ to denote its eigendecomposition, where $\Lambda_1 \in \mathbb{R}^{r_1\times r_1}$ is positive
diagonal and $W_g \in \mathbb{R}^{r_1\times r_1}$ is orthogonal, $W_g W_g^T = W_g^T W_g = I$. We define $\hat H_g$ as $\hat H_g = \hat U_g W_g$;
then $\hat H_g$ is also orthonormal. Additionally, $\hat H_g^T\hat H_{(i),l} = W_g^T\hat U_g^T\hat U_{(i),l}W_{(i),l} = 0$. This
completes the proof of equation (13c).
Next, we proceed with the proof of equations (13b) and (13a). By right multiplying both
sides of (27) with $W_g$ and replacing $\hat U_g$ and $\hat U_{(i),l}$ by $\hat H_g$ and $\hat H_{(i),l}$, we have
\[
\frac{1}{N}\sum_{i=1}^{N}\hat M_{(i)}\hat M_{(i)}^T\hat H_g = \hat H_g\Lambda_1 + \frac{1}{N}\sum_{i=1}^{N}P_{\hat H_{(i),l}}\hat M_{(i)}\hat M_{(i)}^T\hat H_g. \tag{28}
\]
Similarly, by right multiplying both sides of (26) with $W_{(i),l}$, we can rewrite (26) as
\[
\hat M_{(i)}\hat M_{(i)}^T\hat H_{(i),l} = \hat H_{(i),l}\Lambda_{2,(i)} + \hat H_g\hat H_g^T\hat M_{(i)}\hat M_{(i)}^T\hat H_{(i),l}. \tag{29}
\]
We thus prove equations (13b) and (13a), where $\Lambda_{3,(i)} = \hat H_g^T\hat M_{(i)}\hat M_{(i)}^T\hat H_{(i),l}$.
We note that the KKT conditions provide a set of conditions that must be satisfied
by all stationary points of (8). Our next key contribution is to use these conditions to
characterize a few interesting properties satisfied by all optimal solutions. To this goal,
we heavily rely on the spectral properties of $\Lambda_1$, $\Lambda_{2,(i)}$, and $\Lambda_{3,(i)}$.
For simplicity, we introduce three additional notations: $\Lambda_{4,(i)} = -\Lambda_{3,(i)}$, $\Lambda_{5,(i)} = -\Lambda_{3,(i)}^T/N$, and $\Lambda_6 \in \mathbb{R}^{r_1\times r_1}$ defined as
\[
\Lambda_6 = \Lambda_1 - \frac{1}{N}\sum_{i=1}^{N}\Lambda_{3,(i)}\Lambda_{2,(i)}^{-1}\Lambda_{3,(i)}^T. \tag{30}
\]
$\Lambda_6$ is a symmetric matrix. It is worth noting that $\Lambda_6$ is well defined, as the diagonal matrix
$\Lambda_{2,(i)}$ is invertible throughout the proof. We also introduce the shorthand notation $\Delta P_g$ to
denote $P_{\hat H_g} - P_{U^\star_g}$ and $\Delta P_{(i),l}$ to denote $P_{\hat H_{(i),l}} - P_{U^\star_{(i),l}}$.
Spectral properties of $\Lambda_1$, $\Lambda_{2,(i)}$, $\Lambda_{3,(i)}$, and $\Lambda_6$ are critical for developing the solutions
to the KKT conditions. We will establish these properties in the following lemmas.
Before diving into these properties, we investigate the deviation of the estimated features
$\hat H_g$ and $\hat H_{(i),l}$ from the ground truth features $U^\star_g$ and $U^\star_{(i),l}$.
Lemma 14 If $\max_i\|E_{(i)}\|_\infty \le 4\sigma_{\max}\frac{\mu^2 r}{n}$, and $E_{(i)}$ is $\alpha$-sparse with
\[
\alpha \le \frac{1}{60\,\mu^4 r^2}\,\frac{\sigma_{\min}^4}{\sigma_{\max}^4}\Big(1 + \frac{4\sigma_{\max}^2}{\sqrt{\theta}\,\sigma_{\min}^2}\Big)^{-2},
\]
we have
\[
\big\|P_{U^\star_g} - P_{\hat H_g}\big\|_F \le \alpha n\max_i\big\|E_{(i)}\big\|_\infty\,\frac{5\sigma_{\max}}{\sqrt{\theta}\,\sigma_{\min}^2}, \tag{31}
\]
and
\[
\big\|P_{U^\star_{(i),l}} - P_{\hat H_{(i),l}}\big\| \le 3\,\alpha n\max_i\big\|E_{(i)}\big\|_\infty\,\frac{\sigma_{\max}}{\sigma_{\min}^2}\Big(1 + \frac{4\sigma_{\max}^2}{\sqrt{\theta}\,\sigma_{\min}^2}\Big). \tag{32}
\]
Proof By Shi and Kontar (2024, Theorem 1), we know that the $\hat H_g$ and $\hat H_{(i),l}$ corresponding to the
global optimal solutions to problem (8) satisfy
\[
\big\|P_{U^\star_g} - P_{\hat H_g}\big\|_F^2 + \frac{1}{N}\sum_{i=1}^{N}\big\|P_{U^\star_{(i),l}} - P_{\hat H_{(i),l}}\big\|_F^2
\le \frac{4}{N}\sum_{i=1}^{N}\frac{\big\|F_{(i)}\big\|_F^2}{\theta\sigma_{\min}^4}. \tag{33}
\]
In particular,
\[
\big\|P_{U^\star_g} - P_{\hat H_g}\big\|_F
\le \frac{2\big\|F_{(i)}\big\|_F}{\sqrt{\theta\sigma_{\min}^4}}
\le \frac{2\,\alpha n\big\|E_{(i),t}\big\|_\infty\big(2\sigma_{\max} + \alpha n\max_i\big\|E_{(i)}\big\|_\infty\big)}{\sqrt{\theta\sigma_{\min}^4}}
\le \alpha n\max_i\big\|E_{(i)}\big\|_\infty\,\frac{5\sigma_{\max}}{\sqrt{\theta}\,\sigma_{\min}^2},
\]
where we used the condition $\alpha n\max_i\|E_{(i)}\|_\infty \le \sigma_{\max}/2$ for the last inequality.
From (8), we can also deduce that the column vectors of $\hat H_{(i),l}$ span the top invariant
subspace of $\big(I - P_{\hat H_g}\big)\hat M_{(i)}\hat M_{(i)}^T\big(I - P_{\hat H_g}\big)$. Since the column vectors of $U^\star_{(i),l}$ span the top
invariant subspace of $\big(I - P_{U^\star_g}\big)M^\star_{(i)}M^{\star T}_{(i)}\big(I - P_{U^\star_g}\big)$, we know from Weyl's theorem (Tao, 2010). Since
\[
\begin{aligned}
&\Big\|\big(I - P_{\hat H_g}\big)\hat M_{(i)}\hat M_{(i)}^T\big(I - P_{\hat H_g}\big) - \big(I - P_{U^\star_g}\big)M^\star_{(i)}M^{\star T}_{(i)}\big(I - P_{U^\star_g}\big)\Big\|_F \\
&\quad\le \Big\|\big(I - P_{\hat H_g}\big)\big(\hat M_{(i)}\hat M_{(i)}^T - M^\star_{(i)}M^{\star T}_{(i)}\big)\big(I - P_{\hat H_g}\big)\Big\|_F
+ \Big\|M^\star_{(i)}M^{\star T}_{(i)}\big(P_{U^\star_g} - P_{\hat H_g}\big)\Big\|_F
+ \Big\|\big(P_{U^\star_g} - P_{\hat H_g}\big)M^\star_{(i)}M^{\star T}_{(i)}\Big\|_F \\
&\quad\le \big\|F_{(i)}\big\|_F + 2\sigma_{\max}^2\big\|P_{\hat H_g} - P_{U^\star_g}\big\|_F
\le \frac{5}{2}\,\alpha n\max_i\big\|E_{(i)}\big\|_\infty\,\sigma_{\max}\Big(1 + \frac{4\sigma_{\max}^2}{\sqrt{\theta}\,\sigma_{\min}^2}\Big)
\le \frac{\sigma_{\min}^2}{6},
\end{aligned}
\]
we have (32).
With Lemma 14 in hand, we first provide an upper bound on the operator norm of $\Lambda_{3,(i)}$.
Lemma 15 For every $i \in [N]$, suppose that the $U^\star_{(i),l}$'s are $\theta$-misaligned, $\max_i\|E_{(i)}\|_\infty \le 4\sigma_{\max}\frac{\mu^2 r}{n}$, and $E_{(i)}$ is $\alpha$-sparse with $\alpha \le \frac{1}{10\mu^2 r}$. Then
\[
\big\|\Lambda_{3,(i)}\big\| \le 2\sigma_{\max}.
\]
We then estimate lower bounds on the smallest eigenvalues of Λ1 , Λ2,(i) , and Λ6 . These
estimates rely on more refined matrix perturbation analysis.
Lemma 16 For every $i \in [N]$, suppose that the $U^\star_{(i),l}$'s are $\theta$-misaligned, $\max_i\|E_{(i)}\|_\infty \le 4\sigma_{\max}\frac{\mu^2 r}{n}$, and $E_{(i)}$ is $\alpha$-sparse with
\[
\alpha \le \frac{1}{64}\,\frac{1}{\mu^4 r^2}\,\frac{\sigma_{\min}^4}{\sigma_{\max}^4}\Big(1 + 2\frac{\sigma_{\max}^2}{\sigma_{\min}^2} + \frac{8}{\sqrt{\theta}}\frac{\sigma_{\max}^4}{\sigma_{\min}^4}\Big)^{-2}.
\]
Then $\lambda_{\min}\big(\Lambda_{2,(i)}\big) \ge \frac34\sigma_{\min}^2$ and $\lambda_{\min}(\Lambda_1) \ge \frac34\sigma_{\min}^2$.
Proof This lemma is a result of Weyl's theorem (Tao, 2010) and the perturbation bound
on the eigenspaces. From the first equation in (13), we know
\[
P_{\hat H_{(i),l}}\big(T_{(i)} + F_{(i)}\big)P_{\hat H_{(i),l}}\hat H_{(i),l} = \hat H_{(i),l}\Lambda_{2,(i)}.
\]
Therefore,
\[
\lambda_{\min}\big(\Lambda_{2,(i)}\big) \ge \sigma_{\min}^2 - \Big\|P_{U^\star_{(i),l}}T_{(i)}P_{U^\star_{(i),l}} - P_{\hat H_{(i),l}}\big(T_{(i)} + F_{(i)}\big)P_{\hat H_{(i),l}}\Big\|_2.
\]
Similarly, we can solve $\Lambda_{3,(i)}$ from the first equation of (13) as $\Lambda_{3,(i)} = \hat H_g^T\big(T_{(i)} + F_{(i)}\big)\hat H_{(i),l}$. Plugging this into the second equation of (13), we have
\[
\frac{1}{N}\sum_{i=1}^{N}\big(I - P_{\hat H_{(i),l}}\big)\big(T_{(i)} + F_{(i)}\big)\big(I - P_{\hat H_{(i),l}}\big)\hat H_g = \hat H_g\Lambda_1.
\]
Since the minimum eigenvalue of $\frac{1}{N}\sum_{i=1}^{N}\big(I - P_{U^\star_{(i),l}}\big)T_{(i)}\big(I - P_{U^\star_{(i),l}}\big)$ is lower bounded by $\sigma_{\min}^2$, Weyl's inequality
can be invoked to provide a lower bound on the minimum eigenvalue of $\Lambda_1$:
\[
\begin{aligned}
\lambda_{\min}(\Lambda_1) &= \lambda_{\min}\Big(\frac{1}{N}\sum_{i=1}^{N}\big(I - P_{\hat H_{(i),l}}\big)\big(T_{(i)} + F_{(i)}\big)\big(I - P_{\hat H_{(i),l}}\big)\Big) \\
&\ge \sigma_{\min}^2 - \frac{1}{N}\Big\|\sum_{i=1}^{N}\Big[\big(I - P_{U^\star_{(i),l}}\big)T_{(i)}\big(I - P_{U^\star_{(i),l}}\big) - \big(I - P_{\hat H_{(i),l}}\big)\big(T_{(i)} + F_{(i)}\big)\big(I - P_{\hat H_{(i),l}}\big)\Big]\Big\|_2.
\end{aligned}
\]
The operator norm on the right-hand side can be bounded by triangle inequalities. For each
term in the summation, we have
\[
\begin{aligned}
&\Big\|\big(I - P_{U^\star_{(i),l}}\big)T_{(i)}\big(I - P_{U^\star_{(i),l}}\big) - \big(I - P_{\hat H_{(i),l}}\big)\big(T_{(i)} + F_{(i)}\big)\big(I - P_{\hat H_{(i),l}}\big)\Big\|_2 \\
&\quad\le \Big\|\big(I - P_{U^\star_{(i),l}}\big)T_{(i)}\big(I - P_{U^\star_{(i),l}}\big) - \big(I - P_{\hat H_{(i),l}}\big)T_{(i)}\big(I - P_{U^\star_{(i),l}}\big)\Big\|
+ \Big\|\big(I - P_{\hat H_{(i),l}}\big)T_{(i)}\big(I - P_{U^\star_{(i),l}}\big) - \big(I - P_{\hat H_{(i),l}}\big)T_{(i)}\big(I - P_{\hat H_{(i),l}}\big)\Big\| \\
&\qquad + \Big\|\big(I - P_{\hat H_{(i),l}}\big)F_{(i)}\big(I - P_{\hat H_{(i),l}}\big)\Big\|
\le 2\sigma_{\max}^2\big\|P_{U^\star_{(i),l}} - P_{\hat H_{(i),l}}\big\| + \big\|F_{(i)}\big\|
\le \frac14\sigma_{\min}^2,
\end{aligned}
\]
where we used the assumed upper bound on $\alpha$. This completes the proof.
We also provide a lower bound on the minimum eigenvalue of the symmetric matrix $\Lambda_6$.
Recall that $\Lambda_6$ is defined as $\Lambda_6 = \Lambda_1 - \frac{1}{N}\sum_{i=1}^{N}\Lambda_{3,(i)}\Lambda_{2,(i)}^{-1}\Lambda_{3,(i)}^T$.
Lemma 17 For every $i \in [N]$, suppose that the $U^\star_{(i),l}$'s are $\theta$-misaligned, $\max_i\|E_{(i)}\|_\infty \le 4\sigma_{\max}\frac{\mu^2 r}{n}$, and $E_{(i)}$ is $\alpha$-sparse with
\[
\alpha \le \frac{1}{640\,(\mu^2 r)^2}\Big(\frac{\sigma_{\min}}{\sigma_{\max}}\Big)^{8}\Big(1 + \frac{4\sigma_{\max}^2}{\sqrt{\theta}\,\sigma_{\min}^2}\Big)^{-2}.
\]
Then the minimum eigenvalue of $\Lambda_6$ is lower bounded by $\frac34\sigma_{\min}^2$.
Proof The proof is constructive and uses two steps. In the first step, we introduce a block
matrix $\Lambda_7$ defined as
\[
\Lambda_7 = \begin{pmatrix}
\sqrt{N}\,\Lambda_1 & \Lambda_{3,(1)} & \cdots & \Lambda_{3,(N)} \\
\Lambda_{3,(1)}^T & \sqrt{N}\,\Lambda_{2,(1)} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\Lambda_{3,(N)}^T & 0 & \cdots & \sqrt{N}\,\Lambda_{2,(N)}
\end{pmatrix}, \tag{35}
\]
and show that the minimum eigenvalue of $\Lambda_7$ is lower bounded by $\sqrt{N}\,\frac34\sigma_{\min}^2$. Then, in the second step, we prove that the minimum eigenvalue of $\Lambda_6$ is
lower bounded by $\frac{1}{\sqrt{N}}$ times the minimum eigenvalue of $\Lambda_7$, i.e., $\lambda_{\min}(\Lambda_6) \ge \lambda_{\min}\big(\frac{1}{\sqrt{N}}\Lambda_7\big)$.
for notational simplicity. From the SVD (3) and the assumption on the singular values of $L^\star_{(i)}$,
we know:
\[
[H^\star_g, H^\star_{(i),l}]^T L^\star_{(i)}L^{\star T}_{(i)}[H^\star_g, H^\star_{(i),l}]
= \begin{pmatrix}\Lambda^\star_{1,(i)} & \Lambda^\star_{3,(i)} \\ \Lambda^{\star T}_{3,(i)} & \Lambda^\star_{2,(i)}\end{pmatrix}
\succeq \sigma_{\min}^2 I. \tag{36}
\]
$\hat H_{(i),l}^T H^\star_{(i),l}H^{\star T}_{(i),l}T_{(i)}\Delta P_{(i),l}\hat H_{(i),l}$, and
$\hat H_g^T T_{(i)}\hat H_{(i),l} = \hat H_g^T H^\star_g H^{\star T}_g T_{(i)} H^\star_{(i),l}H^{\star T}_{(i),l}\hat H_{(i),l} + \hat H_g^T\Delta P_g T_{(i)}\hat H_{(i),l} + \hat H_g^T H^\star_g H^{\star T}_g T_{(i)}\Delta P_{(i),l}\hat H_{(i),l}$.
Therefore, we can rewrite $\Lambda_{7,2}$ as
\[
\Lambda_{7,2} = \Lambda_{7,3} + \underbrace{\begin{pmatrix}
\sqrt{N}\,\hat H_g^T H^\star_g\Lambda^\star_1 H^{\star T}_g\hat H_g & \hat H_g^T H^\star_g\Lambda^\star_{3,(1)} H^{\star T}_{(1),l}\hat H_{(1),l} & \cdots & \hat H_g^T H^\star_g\Lambda^\star_{3,(N)} H^{\star T}_{(N),l}\hat H_{(N),l} \\
\hat H_{(1),l}^T H^\star_{(1),l}\Lambda^{\star T}_{3,(1)} H^{\star T}_g\hat H_g & \sqrt{N}\,\hat H_{(1),l}^T H^\star_{(1),l}\Lambda^\star_{2,(1)} H^{\star T}_{(1),l}\hat H_{(1),l} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\hat H_{(N),l}^T H^\star_{(N),l}\Lambda^{\star T}_{3,(N)} H^{\star T}_g\hat H_g & 0 & \cdots & \sqrt{N}\,\hat H_{(N),l}^T H^\star_{(N),l}\Lambda^\star_{2,(N)} H^{\star T}_{(N),l}\hat H_{(N),l}
\end{pmatrix}}_{\Lambda_{7,4}}
\]
\[
\times\,{\rm Diag}\big(\hat H^{\star T}_g\hat H_g,\; H^{\star T}_{(1),l}\hat H_{(1),l},\;\cdots,\; H^{\star T}_{(N),l}\hat H_{(N),l}\big).
\]
\[
\sqrt{N}\,\Lambda_1 - \sqrt{N}\,\frac34\sigma_{\min}^2 I - \sum_{i=1}^{N}\Lambda_{3,(i)}^T\Big(\sqrt{N}\,\Lambda_{2,(i)} - \sqrt{N}\,\frac34\sigma_{\min}^2 I\Big)^{-1}\Lambda_{3,(i)} \succeq 0.
\]
Lemma 16 already shows that $\Lambda_{2,(i)} \succeq \frac34\sigma_{\min}^2 I$. As a result,
\[
\Lambda_{3,(i)}^T\Big(\sqrt{N}\,\Lambda_{2,(i)} - \sqrt{N}\,\frac34\sigma_{\min}^2 I\Big)^{-1}\Lambda_{3,(i)}
= \frac{1}{\sqrt{N}}\,\Lambda_{3,(i)}^T\Big(\sum_{p=0}^{\infty}\Lambda_{2,(i)}^{-1}\Big(\frac34\sigma_{\min}^2\,\Lambda_{2,(i)}^{-1}\Big)^{p}\Big)\Lambda_{3,(i)}
\succeq \frac{1}{\sqrt{N}}\,\Lambda_{3,(i)}^T\Lambda_{2,(i)}^{-1}\Lambda_{3,(i)}.
\]
By rearranging terms, we have
\[
\sqrt{N}\,\Lambda_1 - \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\Lambda_{3,(i)}^T\Lambda_{2,(i)}^{-1}\Lambda_{3,(i)} \succeq \sqrt{N}\,\frac34\sigma_{\min}^2 I.
\]
Dividing both sides by $\sqrt{N}$ and recalling the definition of $\Lambda_6$ in (30) yields $\lambda_{\min}(\Lambda_6) \ge \frac34\sigma_{\min}^2$, which completes the proof.
Lemma 18 For every $i \in [N]$, suppose that the $U^\star_{(i),l}$'s are $\theta$-misaligned, $\max_i\|E_{(i)}\|_\infty \le 4\sigma_{\max}\frac{\mu^2 r}{n}$, and $E_{(i)}$ is $\alpha$-sparse with $\alpha \le \frac{1}{40\mu^2 r}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{-3}$. The solutions to (13) satisfy the
following,
and
\[
\begin{aligned}
\hat H_{(j),l} = \hat H_{(i),l,0}
&+ \sum_{p=1}^{\infty} F_{(j)}^{p} T_{(j)}\hat H_{(j),l}\Lambda_{2,(j)}^{-p-1}
+ \hat H_{g,1}\Lambda_{4,(j)}\Lambda_{2,(j)}^{-1} \\
&+ \sum_{k=0}^{\infty}\sum_{p_0+p_1\ge 1}\cdots\sum_{p_{2k}+p_{2k+1}\ge 1}\;\sum_{i_1,i_3,\cdots,i_{2k+1}=1}^{N}
\Big[\prod_{l=0}^{k}F_{(0)}^{p_{2l}}F_{(i_{2l+1})}^{p_{2l+1}}\Big]\hat H_{g,0}
\Big[\prod_{l=k}^{0}\Lambda_{4,(i_{2l+1})}\Lambda_{2,(i_{2l+1})}^{-p_{2l+1}-1}\Lambda_{5,(i_{2l+1})}\Lambda_1^{-p_{2l}-1}\Big]\Lambda_1\Lambda_6^{-1}\Lambda_{4,(j)}\Lambda_{2,(j)}^{-1} \\
&+ \sum_{k=0}^{\infty}\sum_{p_0+p_1\ge 1}\cdots\sum_{p_{2k}+p_{2k+1}\ge 1}\;\sum_{i_1,i_3,\cdots,i_{2k+1}=1}^{N}
\Big[\prod_{l=0}^{k}F_{(0)}^{p_{2l}}F_{(i_{2l+1})}^{p_{2l+1}}\Big]\hat H_{g,1}
\Big[\prod_{l=k}^{0}\Lambda_{4,(i_{2l+1})}\Lambda_{2,(i_{2l+1})}^{-p_{2l+1}-1}\Lambda_{5,(i_{2l+1})}\Lambda_1^{-p_{2l}-1}\Big]\Lambda_1\Lambda_6^{-1}\Lambda_{4,(j)}\Lambda_{2,(j)}^{-1} \\
&+ \sum_{p=1}^{\infty} F_{(j)}^{p}\hat H_g\Lambda_{4,(j)}\Lambda_{2,(j)}^{-1},
\end{aligned} \tag{42}
\]
$\hat H_{g,1}$ is defined as
\[
\hat H_{g,1} = \sum_{p_0+p_1\ge 1}\sum_{i_1=1}^{N} F_{(0)}^{p_0}F_{(i_1)}^{p_1}T_{(i_1)}\hat H_{(i_1),l}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1}\Lambda_1\Lambda_6^{-1}. \tag{44}
\]
Note that $\sigma_{\min}(\Lambda_{2,(i)}) > \|F_{(i)}\|$ and $\sigma_{\min}(\Lambda_1) > \|F_{(0)}\|$. Therefore, according to Bhatia
(2013, Theorem VII.2.2), the solution to (46) satisfies the following equations:
\[
\begin{aligned}
\hat H_g &= \sum_{p=0}^{\infty} F_{(0)}^{p} T_{(0)}\hat H_g\Lambda_1^{-p-1}
+ \sum_{p=0}^{\infty}\sum_{i=1}^{N} F_{(0)}^{p}\hat H_{(i),l}\Lambda_{5,(i)}\Lambda_1^{-p-1}, \\
\hat H_{(i),l} &= \sum_{p=0}^{\infty} F_{(i)}^{p} T_{(i)}\hat H_{(i),l}\Lambda_{2,(i)}^{-p-1}
+ \sum_{p=0}^{\infty} F_{(i)}^{p}\hat H_g\Lambda_{4,(i)}\Lambda_{2,(i)}^{-p-1}.
\end{aligned} \tag{47}
\]
We can substitute $\hat H_{(i),l}$ on the right-hand side of the first equation of (47) by the second
equation in (47):
\[
\begin{aligned}
\hat H_g &= \sum_{p=0}^{\infty} F_{(0)}^{p} T_{(0)}\hat H_g\Lambda_1^{-p-1}
+ \sum_{p_0=0}^{\infty}\sum_{p_1=0}^{\infty}\sum_{i_1=1}^{N} F_{(0)}^{p_0}F_{(i_1)}^{p_1}T_{(i_1)}\hat H_{(i_1),l}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1} \\
&\quad+ \sum_{p_0=0}^{\infty}\sum_{p_1=0}^{\infty}\sum_{i_1=1}^{N} F_{(0)}^{p_0}F_{(i_1)}^{p_1}\hat H_g\Lambda_{4,(i_1)}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1} \\
&= T_{(0)}\hat H_g\Lambda_1^{-1}
+ \sum_{i=1}^{N} T_{(i)}\hat H_{(i),l}\Lambda_{2,(i)}^{-1}\Lambda_{5,(i)}\Lambda_1^{-1}
+ \sum_{i=1}^{N}\hat H_g\Lambda_{4,(i)}\Lambda_{2,(i)}^{-1}\Lambda_{5,(i)}\Lambda_1^{-1} \\
&\quad+ \sum_{p_0+p_1\ge 1}\sum_{i_1=1}^{N} F_{(0)}^{p_0}F_{(i_1)}^{p_1}T_{(i_1)}\hat H_{(i_1),l}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1}
+ \sum_{p_0+p_1\ge 1}\sum_{i_1=1}^{N} F_{(0)}^{p_0}F_{(i_1)}^{p_1}\hat H_g\Lambda_{4,(i_1)}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1}.
\end{aligned} \tag{48}
\]
We then move the third term on the right-hand side of (48) to the left-hand side, multiply
both sides by $\Lambda_1\Lambda_6^{-1}$ on the right, and recall the definition of $\Lambda_6$ in (30); we have
\[
\begin{aligned}
\hat H_g &= \underbrace{T_{(0)}\hat H_g\Lambda_6^{-1} + \sum_{i=1}^{N} T_{(i)}\hat H_{(i),l}\Lambda_{2,(i)}^{-1}\Lambda_{5,(i)}\Lambda_6^{-1}}_{\hat H_{g,0}} \\
&\quad+ \underbrace{\sum_{p_0+p_1\ge 1}\sum_{i_1=1}^{N} F_{(0)}^{p_0}F_{(i_1)}^{p_1}T_{(i_1)}\hat H_{(i_1),l}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1}\Lambda_1\Lambda_6^{-1}}_{\hat H_{g,1}} \\
&\quad+ \underbrace{\sum_{p_0+p_1\ge 1}\sum_{i_1=1}^{N} F_{(0)}^{p_0}F_{(i_1)}^{p_1}\hat H_g\Lambda_{4,(i_1)}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1}\Lambda_1\Lambda_6^{-1}}_{\text{residual term}}.
\end{aligned} \tag{49}
\]
On the right-hand side of (49), one can see that $\hat H_{g,0}$ and $\hat H_{g,1}$ are products of sparse
matrices, incoherent matrices, and remaining terms. Therefore, we can use Lemma 13 to
calculate an upper bound on their maximum row norm. However, the residual term does
not have such a specific structure, as we do not know whether $\hat H_g$ is incoherent. As a result,
we cannot provide a precise estimate of its maximum row norm directly. To circumvent the
issue, notice that (49) has a recursive form. Therefore, the residual term can be replaced by
\[
\hat H_g \;\to\; \hat H_{g,0} + \hat H_{g,1} + \sum_{p_0+p_1\ge 1}\sum_{i_1=1}^{N} F_{(0)}^{p_0}F_{(i_1)}^{p_1}\hat H_g\Lambda_{4,(i_1)}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1}\Lambda_1\Lambda_6^{-1}. \tag{50}
\]
The result will have 5 terms, the first 4 of which have the structure specified in Lemma
13. The 5th term does not, as it contains $\hat H_g$. We can apply the replacement rule (50) again
to the 5th term, generating 7 terms. After applying the replacement rule $\omega$ times, where $\omega$ is a nonnegative integer, the residual term can be bounded as follows:
\[
\begin{aligned}
&\Bigg\|\sum_{p_0+p_1\ge 1}\cdots\sum_{p_{2\omega+2}+p_{2\omega+3}\ge 1}\;\sum_{i_1=1}^{N}\sum_{i_3=1}^{N}\cdots\sum_{i_{2\omega+3}=1}^{N}
F_{(0)}^{p_0}F_{(i_1)}^{p_1}F_{(0)}^{p_2}F_{(i_3)}^{p_3}\cdots F_{(i_{2\omega+3})}^{p_{2\omega+3}}\,
\hat H_g\,\Lambda_{4,(i_{2\omega+3})}\Lambda_{2,(i_{2\omega+3})}^{-p_{2\omega+3}-1}\Lambda_{5,(i_{2\omega+3})}\Lambda_1^{-p_{2\omega+2}}\Lambda_6^{-1}\cdots
\Lambda_{4,(i_1)}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1}\Lambda_1\Lambda_6^{-1}\Bigg\| \\
&\le \sum_{p_0+p_1\ge 1}\cdots\sum_{p_{2\omega+2}+p_{2\omega+3}\ge 1}
\Bigg(\frac{\alpha n\max_i\|E_{(i)}\|_\infty\,\frac52\sigma_{\max}}{\frac34\sigma_{\min}^2}\Bigg)^{p_0+\cdots+p_{2\omega+3}}
\Bigg(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Bigg)^{2(\omega+2)} \\
&\le \Bigg(4\,\frac{\alpha n\max_i\|E_{(i)}\|_\infty\,\frac52\sigma_{\max}}{\frac34\sigma_{\min}^2}\Bigg)^{2(\omega+2)}
\Bigg(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Bigg)^{2(\omega+2)}
\le \Big(\frac12\Big)^{2(\omega+2)},
\end{aligned} \tag{51}
\]
where we used Lemma 23 in the first inequality and the condition that $\alpha \le \frac{1}{40\mu^2 r}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{-3}$
in the last inequality.
Therefore, we can take the limit $\omega \to \infty$ in (51) and rewrite it as a series. The series is
absolutely convergent when $\alpha$ is small. Finally, we prove (41). Though (41) is an infinite
series, each term in the series is the product of sparse matrices and an incoherent matrix.
Such structure will be useful later when we use Lemma 13 to calculate the maximum row
norm of $\hat H_g$.
Now we proceed to derive an expansion for Ĥ(i),l . We can replace Ĥg on the right hand
side of the second equation of (47) with (51) to derive,
In Lemma 18, although the series for $\hat H_g$ and the $\hat H_{(i),l}$'s have infinitely many terms, when $\alpha$ is not
too large, the leading term is only the first term. This is delineated in the following lemma,
which is a formal version of Lemma 8. For simplicity, we introduce the notation
\[
\zeta = \frac{\alpha n\max_i\big\|E_{(i)}\big\|_\infty\Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty + 2\sigma_{\max}\Big)}{\frac34\sigma_{\min}^2}. \tag{53}
\]
Lemma 19 Suppose that the conditions of Lemma 18 are satisfied. Additionally, suppose
that $\alpha \le \frac{6 - 3\sqrt{2}}{80}\,\frac{1}{\mu^2 r}\Big(\frac{\sigma_{\min}}{\sigma_{\max}}\Big)^2$; we have
\[
\times\Big[\prod_{l=k}^{0}\Lambda_{4,(i_{2l+1})}\Lambda_{2,(i_{2l+1})}^{-p_{2l+1}-1}\Lambda_{5,(i_{2l+1})}\Lambda_1^{-p_{2l}}\Big]\Lambda_6^{-1}.
\]
We first bound $\hat H_{g,1}$:
\[
\begin{aligned}
\big\|\hat H_{g,1}\big\|
&\le \sum_{p_0+p_1\ge 1}\sum_{i_1=1}^{N}\Big\|F_{(0)}^{p_0}F_{(i_1)}^{p_1}T_{(i_1)}\hat H_{(i_1),l}\Lambda_{2,(i_1)}^{-p_1-1}\Lambda_{5,(i_1)}\Lambda_1^{-p_0-1}\Lambda_1\Lambda_6^{-1}\Big\| \\
&\le \sum_{p_0+p_1\ge 1}\Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty\big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty + 2\sigma_{\max}\big)\Big)^{p_0+p_1}
\times 2\Big(\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\Big(\frac{1}{\frac34\sigma_{\min}^2}\Big)^{p_0+p_1} \\
&\le \frac{\alpha n\max_i\big\|E_{(i)}\big\|_\infty\big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty + 2\sigma_{\max}\big)}{\frac34\sigma_{\min}^2}\,
2\Big(\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\,
2\Big(1 - \frac{\alpha n\max_i\big\|E_{(i)}\big\|_\infty\big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty + 2\sigma_{\max}\big)}{\frac34\sigma_{\min}^2}\Big)^{-1} \\
&\le \frac{\alpha n\max_i\big\|E_{(i)}\big\|_\infty\big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty + 2\sigma_{\max}\big)}{\frac34\sigma_{\min}^2}\,
4\sqrt2\Big(\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2,
\end{aligned}
\]
where we used Lemma 22 in the first inequality and the upper bound on $\|F_{(i)}\|$ in the second inequality. Because of the upper bound on $\alpha$, we can use the auxiliary Lemma 23
to derive an upper bound on the series. The last inequality comes from the fact that
$\Big(1 - \frac{\alpha n\max_i\|E_{(i)}\|_\infty(\alpha n\max_i\|E_{(i)}\|_\infty + 2\sigma_{\max})}{\frac34\sigma_{\min}^2}\Big)^{-1} \le \sqrt2$.
Therefore, we can proceed to estimate $\|\delta H_g\|$:
\[
\begin{aligned}
\big\|\delta H_g\big\|
&\le \zeta\,4\sqrt2\Big(\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2
+ \sum_{k=0}^{\infty}\Big(\frac{2\zeta}{1-\zeta}\Big)^{k+1}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{2(k+1)}\Big(\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\frac{2\zeta}{1-\zeta} \\
&\quad+ \sum_{k=0}^{\infty}\Big(\frac{2\zeta}{1-\zeta}\Big)^{k+1}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{2(k+1)}\Big(\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)\Big(1+\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big) \\
&\le \zeta\,4\sqrt2\Big(\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\Big(1-\frac{2\zeta}{1-\zeta}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\Big)^{-1} \\
&\quad+ \frac{2\zeta}{1-\zeta}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\Big(\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)\Big(1+\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)\Big(1-\frac{2\zeta}{1-\zeta}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\Big)^{-1}.
\end{aligned}
\]
(58)
where we used the triangle inequality in the first inequality, and Lemma 13 together with
$r = r_1 + r_2$ in the second inequality.
A similar inequality also holds for $\max_k\big\|e_k^T\prod_{\ell=1}^{k}F_{(i_\ell)}^{p_\ell}T_{(0)}\big\|$:
\[
\begin{aligned}
\max_k\Big\|e_k^T\prod_{\ell=1}^{k}F_{(i_\ell)}^{p_\ell}T_{(0)}\Big\|
&= \max_k\Big\|e_k^T\prod_{\ell=1}^{k}F_{(i_\ell)}^{p_\ell}\,\frac{1}{N}\sum_{j=1}^{N}T_{(j)}\Big\|
\le \frac{1}{N}\sum_{j=1}^{N}\max_k\Big\|e_k^T\prod_{\ell=1}^{k}F_{(i_\ell)}^{p_\ell}T_{(j)}\Big\| \\
&\le 2\sigma_{\max}^2\sqrt{\frac{\mu^2 r}{n_1}}\Big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty\big(\alpha n\max_i\big\|E_{(i)}\big\|_\infty + 2\sigma_{\max}\big)\Big)^{\sum_{m=1}^{k}p_m}.
\end{aligned} \tag{59}
\]
Combining the above two inequalities, we have
\[
\begin{aligned}
\max_j\big\|e_j^T\delta H_g\big\|
&\le \max_j\big\|e_j^T\hat H_{g,1}\big\|
+ \sum_{k=0}^{\infty}\sum_{p_0+p_1\ge 1}\cdots\sum_{p_{2k}+p_{2k+1}\ge 1}\;\sum_{i_1,i_3,\cdots,i_{2k+1}=1}^{N}
\max_j\Big\|e_j^T\Big[\prod_{l=0}^{k}F_{(0)}^{p_{2l}}F_{(i_{2l+1})}^{p_{2l+1}}\Big]\hat H_{g,0}
\prod_{l=k}^{0}\Lambda_{4,(i_{2l+1})}\Lambda_{2,(i_{2l+1})}^{-p_{2l+1}-1}\Lambda_{5,(i_{2l+1})}\Lambda_1^{-p_{2l}}\Lambda_6^{-1}\Big\| \\
&\quad+ \sum_{k=0}^{\infty}\sum_{p_0+p_1\ge 1}\cdots\sum_{p_{2k}+p_{2k+1}\ge 1}\;\sum_{i_1,i_3,\cdots,i_{2k+1}=1}^{N}
\max_j\Big\|e_j^T\Big[\prod_{l=0}^{k}F_{(0)}^{p_{2l}}F_{(i_{2l+1})}^{p_{2l+1}}\Big]\hat H_{g,1}\Big\|
\Big\|\prod_{l=k}^{0}\Lambda_{4,(i_{2l+1})}\Lambda_{2,(i_{2l+1})}^{-p_{2l+1}-1}\Lambda_{5,(i_{2l+1})}\Lambda_1^{-p_{2l}}\Lambda_6^{-1}\Big\| \\
&\le \sum_{p_0+p_1\ge 1}\zeta^{p_0+p_1}\sqrt{\frac{\mu^2 r}{n_1}}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2
+ \sum_{k=0}^{\infty}\sum_{p_0+p_1\ge 1}\cdots\sum_{p_{2k+2}+p_{2k+3}\ge 1}
\zeta^{p_0+\cdots+p_{2k+3}}\sqrt{\frac{\mu^2 r}{n_1}}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{2(k+1)} \\
&\quad+ \sum_{k=0}^{\infty}\sum_{p_0+p_1\ge 1}\cdots\sum_{p_{2k}+p_{2k+1}\ge 1}
\zeta^{p_0+\cdots+p_{2k+1}}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{2(k+1)}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\Big(1+\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big) \\
&\le \frac{2\zeta}{1-\zeta}\sqrt{\frac{\mu^2 r}{n_1}}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2
+ \frac{2\zeta}{1-\zeta}\sqrt{\frac{\mu^2 r}{n_1}}\sum_{k=0}^{\infty}\Big(\frac{2\zeta}{1-\zeta}\Big)^{k+1}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{2(k+1)}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2 \\
&\quad+ \sqrt{\frac{\mu^2 r}{n_1}}\sum_{k=0}^{\infty}\Big(\frac{2\zeta}{1-\zeta}\Big)^{k+1}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{2(k+1)}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)\Big(1+\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big) \\
&\le \frac{2\zeta}{1-\zeta}\sqrt{\frac{\mu^2 r}{n_1}}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2
\Big(1-\frac{2\zeta}{1-\zeta}\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\Big)^{-1}
\Big(1+\Big(1+\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)\Big(1+\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)\Big),
\end{aligned}
\]
where we applied (58) and (59) in the second inequality and Lemma 23 in the third inequality.
Similarly, we define $\delta H_{(i),l}$ as the summation
\[
\delta H_{(i),l} = \sum_{p=1}^{\infty} F_{(i)}^{p} T_{(i)}\hat H_{(i),l}\Lambda_{2,(i)}^{-p-1}
+ \sum_{p=1}^{\infty} F_{(i)}^{p}\hat H_{g,0}\Lambda_{4,(i)}\Lambda_{2,(i)}^{-p-1}
+ \sum_{p=0}^{\infty} F_{(i)}^{p}\,\delta H_g\,\Lambda_{4,(i)}\Lambda_{2,(i)}^{-p-1}.
\]
Finally, we have,
The first two summations can be upper bounded by Lemma 22, and the last summation
can be estimated in a similar way as we calculated $\max_j\|e_j^T\delta H_g\|$. We omit the details and
only present the resulting upper bound for brevity.
This completes our proof.
Equipped with the aforementioned perturbation analysis on $\hat H_g$ and $\hat H_{(i),l}$, we are ready
to provide the formal version of Lemma 6.
Proof Notice that $L^\star_{(i)} = L^\star_{(i),g} + L^\star_{(i),l}$, and
$\hat L_{(i)} = \big(P_{\hat H_g} + P_{\hat H_{(i),l}}\big)\hat M_{(i)} = \big(P_{\hat H_g} + P_{\hat H_{(i),l}}\big)\big(L^\star_{(i),g} + L^\star_{(i),l} + E_{(i),t}\big)$. Therefore, we have
\[
\big\|L^\star_{(i)} - \hat L_{(i)}\big\|_\infty
\]
where we used the KKT condition (13a) and the definition $\Lambda_6 = \Lambda_1 + \sum_{i=1}^{N}\Lambda_{3,(i)}\Lambda_{2,(i)}^{-1}\Lambda_{5,(i)}$.
The first term in (TM1) is thus bounded by
\[
\big\|\hat H_{g,0}\hat H_{g,0}^T L^\star_{(i),g} - L^\star_{(i),g}\big\|_\infty
\le \frac{\mu^2 r}{n}\Big(\big\|\hat H_g\hat H_g^T L^\star_{(i),g} - H^\star_g H^{\star T}_g L^\star_{(i),g}\big\|
+ \big\|\hat H_g\,\delta H_{g,0}^T L^\star_{(i),g}\big\|
+ \big\|\delta H_{g,0}\hat H_g^T L^\star_{(i),g}\big\|
+ \big\|\delta H_{g,0}\delta H_{g,0}^T L^\star_{(i),g}\big\|\Big).
\]
The second term in (TM1) is bounded by
\[
\begin{aligned}
\big\|\big(I - H^\star_g H^{\star T}_g\big)\hat H_{g,0}\hat H_{g,0}^T L^\star_{(i),g}\big\|
&\le \big\|\big(I - H^\star_g H^{\star T}_g\big)\hat H_g\hat H_g^T L^\star_{(i),g}\big\|
+ \big\|\big(I - H^\star_g H^{\star T}_g\big)\delta H_{g,0}\hat H_g^T L^\star_{(i),g}\big\| \\
&\quad+ \big\|\big(I - H^\star_g H^{\star T}_g\big)\hat H_g\,\delta H_{g,0}^T L^\star_{(i),g}\big\|
+ \big\|\big(I - H^\star_g H^{\star T}_g\big)\delta H_{g,0}\delta H_{g,0}^T L^\star_{(i),g}\big\|.
\end{aligned}
\]
Moreover,
\[
\begin{aligned}
\big\|\big(I - H^\star_g H^{\star T}_g\big)\hat H_{g,0}\hat H_{g,0}^T L^\star_{(i),g}\big\|_\infty
&= \Big\|\sum_{i=1}^{N}\Big(\hat H^\star_{(i)}\hat H^{\star T}_{(i)}\frac{T_{(i)}}{N}\hat H_g\Lambda_6^{-1}
+ \hat H^\star_{(i)}\hat H^{\star T}_{(i)}T_{(i)}\hat H_{(i),l}\Lambda_{2,(i)}^{-1}\Lambda_{5,(i)}\Lambda_6^{-1}\Big)L^\star_{(i),g}\Big\|_\infty \\
&\le \frac{\mu^2 r}{n}\sum_{i=1}^{N}\Big\|\hat H^\star_{(i)}\hat H^{\star T}_{(i)}\frac{T_{(i)}}{N}\hat H_g\Lambda_6^{-1}
+ \hat H^\star_{(i)}\hat H^{\star T}_{(i)}T_{(i)}\hat H_{(i),l}\Lambda_{2,(i)}^{-1}\Lambda_{5,(i)}\Lambda_6^{-1}\Big\|\,\big\|L^\star_{(i),g}\big\|.
\end{aligned} \tag{62}
\]
We know that
\[
H^\star_{(i),l}H^{\star T}_{(i),l}T_{(i)} = P_{\hat H_{(i),l}}T_{(i)} + \big(H^\star_{(i),l}H^{\star T}_{(i),l} - P_{\hat H_{(i),l}}\big)T_{(i)}
= P_{\hat H_{(i),l}}S_{(i)} - P_{\hat H_{(i),l}}F_{(i)} + \big(H^\star_{(i),l}H^{\star T}_{(i),l} - P_{\hat H_{(i),l}}\big)T_{(i)},
\]
and that
\[
P_{\hat H_{(i),l}}\frac{S_{(i)}}{N}\hat H_g + P_{\hat H_{(i),l}}S_{(i)}\hat H_{(i),l}\Lambda_{2,(i)}^{-1}\Lambda_{5,(i)} = 0.
\]
As a result, we have
\[
\begin{aligned}
\big\|\big(I - H^\star_g H^{\star T}_g\big)\hat H_{g,0}\hat H_{g,0}^T L^\star_{(i),g}\big\|_\infty
&\le \frac{\mu^2 r}{n}\Big\|\sum_{i=1}^{N}\hat H^\star_{(i)}\hat H^{\star T}_{(i)}
\frac{-P_{\hat H_{(i),l}}F_{(i)} + \big(H^\star_{(i),l}H^{\star T}_{(i),l} - P_{\hat H_{(i),l}}\big)T_{(i)}}{N}
\Big(\hat H_g + \hat H_{(i),l}\Lambda_{2,(i)}^{-1}N\Lambda_{5,(i)}\Big)\Lambda_6^{-1}L^\star_{(i),g}\Big\| \\
&\le \frac{\mu^2 r}{n}\,\frac{1}{N}\sum_{i=1}^{N}
\frac{\big\|F_{(i)}\big\| + \sigma_{\max}^2\big\|H^\star_{(i),l}H^{\star T}_{(i),l} - P_{\hat H_{(i),l}}\big\|}{\frac34\sigma_{\min}^2}\,
\sigma_{\max}\Big(1 + \frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big).
\end{aligned}
\]
$= \hat H_{(i),l} - \delta H_{(i),l}$.
The first term in (TM2) is bounded by
\[
\big\|\hat H_{(i),l,0}\hat H_{(i),l,0}^T L^\star_{(i),l} - L^\star_{(i),l}\big\|_\infty
\le \frac{\mu^2 r}{n}\Big(\big\|\hat H_{(i),l}\hat H_{(i),l}^T L^\star_{(i),l} - H^\star_{(i),l}H^{\star T}_{(i),l}L^\star_{(i),l}\big\|
+ \big\|\hat H_{(i),l}\,\delta H_{(i),l,0}^T L^\star_{(i),l}\big\|
+ \big\|\delta H_{(i),l,0}\hat H_{(i),l}^T L^\star_{(i),l}\big\|
+ \big\|\delta H_{(i),l,0}\delta H_{(i),l,0}^T L^\star_{(i),l}\big\|\Big).
\]
The second term in (TM2) is upper bounded by
\[
\begin{aligned}
\big\|\big(I - H^\star_{(i),l}H^{\star T}_{(i),l}\big)\hat H_{(i),l,0}\hat H_{(i),l,0}^T L^\star_{(i),l}\big\|
&\le \big\|\big(I - H^\star_{(i),l}H^{\star T}_{(i),l}\big)P_{\hat H_{(i),l}}L^\star_{(i),l}\big\|
+ \big\|\big(I - H^\star_{(i),l}H^{\star T}_{(i),l}\big)\delta H_{(i),l,0}\hat H_{(i),l}^T L^\star_{(i),l}\big\| \\
&\quad+ \big\|\big(I - H^\star_{(i),l}H^{\star T}_{(i),l}\big)\hat H_{(i),l}\,\delta H_{(i),l,0}^T L^\star_{(i),l}\big\|
+ \big\|\big(I - H^\star_{(i),l}H^{\star T}_{(i),l}\big)\delta H_{(i),l,0}\delta H_{(i),l,0}^T L^\star_{(i),l}\big\| \\
&\le \big\|P_{\hat H_{(i),l}} - H^\star_{(i),l}H^{\star T}_{(i),l}\big\|\sigma_{\max}
+ 2\big\|\delta H_{(i),l,0}\big\|\sigma_{\max}
+ \big\|\delta H_{(i),l,0}\big\|^2\sigma_{\max}.
\end{aligned}
\]
Then we bound the third term of (TM2). From the definition of $\hat H_{(i),l,0}$ and $T_{(i)}$, we
know
\[
\big(I - H^\star_{(i),l}H^{\star T}_{(i),l}\big)\hat H_{(i),l,0}
= H^\star_g H^{\star T}_g T_{(i)}\hat H_{(i),l}\Lambda_{2,(i)}^{-1}
+ H^\star_g H^{\star T}_g\hat H_{g,0}\Lambda_{4,(i)}\Lambda_{2,(i)}^{-1}
+ \big(I - H^\star_g H^{\star T}_g - H^\star_{(i),l}H^{\star T}_{(i),l}\big)\hat H_{g,0}\Lambda_{4,(i)}\Lambda_{2,(i)}^{-1}
\]
\[
\times \sum_{j=1}^{N} H^\star_{(j),l}H^{\star T}_{(j),l}\Delta P_{(j),l}\Big(T_{(j)}\frac{\hat H_g}{N} + \hat H_{(j),l}\Lambda_{2,(j)}^{-1}\Lambda_{5,(j)}\Big)\Lambda_6^{-1}\Lambda_{4,(i)}\Lambda_{2,(i)}^{-1}.
\]
From the KKT conditions, we know
\[
\hat H_g^T S_{(i)}\hat H_{(i),l}
+ \hat H_g^T\Big(S_{(0)}\hat H_g\Lambda_6^{-1} + \sum_{j=1}^{N}S_{(j)}\hat H_{(j),l}\Lambda_{2,(j)}^{-1}\Lambda_{5,(j)}\Lambda_6^{-1}\Big)\Lambda_{4,(i)} = 0
\quad\text{and}\quad
\hat H_{(j),l}^T S_{(j)}\frac{\hat H_g}{N} + \hat H_{(j),l}\Lambda_{2,(j)}^{-1}\Lambda_{5,(j)} = 0.
\]
Therefore, we have
\[
\begin{aligned}
&\max_k\big\|e_k^T\big(I - H^\star_{(i),l}H^{\star T}_{(i),l}\big)\hat H_{(i),l,0}\big\| \\
&\le \sqrt{\frac{\mu^2 r}{n_1}}\Bigg[\frac{\|F_{(i)}\|}{\frac34\sigma_{\min}^2}
+ \frac{\|F_{(0)}\|}{\frac34\sigma_{\min}^2}\,\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big(1 + \frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}
+ \big\|\Delta P_g\big\|\,\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\,\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2} \\
&\qquad+ \frac{1}{N}\sum_{j=1}^{N}\frac{\|F_{(j)}\|}{\frac34\sigma_{\min}^2}\,\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big(2 + \frac{2\sigma_{\max}^2}{3\cdot\frac34\sigma_{\min}^2}\Big)
+ \frac{1}{N}\sum_{j=1}^{N}\big\|\Delta P_{(j),l}\big\|\,\frac{\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\,\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\,\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big(1 + \frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)\Bigg].
\end{aligned} \tag{64}
\]
(65)
(66)
(67)
where we used the definition of α-sparsity in the second inequality, and applied Lemma 19
in the last inequality.
Bounding the fifth term of (61):
where we applied the incoherence condition on $T_{(0)}$ and $T_{(i)}$, and the relation
$\sum_l\big|e_l^T E_{(i),t}e_k\big| \le \alpha n_1\|E\|_\infty$.
Bounding the seventh term of (61):
where we applied the incoherence condition in the first inequality, (57) in the second
inequality.
Bounding the eleventh term of (61):
where we applied the incoherence condition in the first inequality, and (57) in the second
inequality.
Bounding the twelfth term of (61):
(69)
where we applied the condition $\sum_l\big|e_l^T E_{(i),t}e_k\big| \le \alpha n_1\|E\|_\infty$ in the first inequality and the
incoherence condition in the second inequality.
Bounding the thirteenth term of (61):
where we apply the incoherence condition in the first inequality, and $\|\delta H_{(i),l}\| \le 1$ in the
second inequality.
where we applied the condition $\sum_l\big|e_l^T E_{(i),t}e_k\big| \le \alpha n_1\|E\|_\infty$ in the first inequality.
Bounding the fifteenth term of (61):
\[
\begin{aligned}
\big\|P_{\hat H_g}L^\star_{(i),l}\big\|_\infty
&= \max_{j,k}\Big|e_j^T\big(\hat H_{g,0} + \delta H_g\big)\hat H_g^T H^\star_{(i),l}\Sigma_{(i),l}W^{\star T}_{(i),l}e_k\Big| \\
&\le \Big(\max_j\big\|e_j^T\hat H_{g,0}\big\| + \max_j\big\|e_j^T\delta H_g\big\|\Big)\big\|\hat H_g^T H^\star_{(i),l}\big\|\,\sigma_{\max}\sqrt{\frac{\mu^2 r}{n_2}} \\
&\le \Big(\max_j\big\|e_j^T\hat H_{g,0}\big\| + \max_j\big\|e_j^T\delta H_g\big\|\Big)\big\|\Delta P_g\big\|\,\sigma_{\max}\sqrt{\frac{\mu^2 r}{n_2}} \\
&\le \frac{\mu^2 r}{n}\,\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big(1 + \frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)\big\|\Delta P_g\big\|\,\sigma_{\max}
+ \sigma_{\max}\big\|\Delta P_g\big\|\sqrt{\frac{\mu^2 r}{n_2}}\max_j\big\|e_j^T\delta H_g\big\|.
\end{aligned}
\]
\[
\begin{aligned}
\big\|P_{\hat H_{(i),l}}L^\star_{(i),g}\big\|_\infty
&= \max_{j,k}\Big|e_j^T\big(\hat H_{(i),l,0} + \delta H_{(i),l}\big)\hat H_{(i),l}^T H^\star_g\Sigma^\star_{(i),g}W^{\star T}_{(i),g}e_k\Big| \\
&\le \Big(\max_j\big\|e_j^T\hat H_{(i),l,0}\big\| + \max_j\big\|e_j^T\delta H_{(i),l}\big\|\Big)\big\|\hat H_{(i),l}^T H^\star_g\big\|\,\sigma_{\max}\sqrt{\frac{\mu^2 r}{n_2}} \\
&\le \Big(\max_j\big\|e_j^T\hat H_{(i),l,0}\big\| + \max_j\big\|e_j^T\delta H_{(i),l}\big\|\Big)\big\|\Delta P_{(i),l}\big\|\,\sigma_{\max}\sqrt{\frac{\mu^2 r}{n_2}} \\
&\le \frac{\mu^2 r}{n}\,\sigma_{\max}\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big(1 + \frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2} + \frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^2\big\|\Delta P_{(i),l}\big\|
+ \big\|\Delta P_{(i),l}\big\|\,\sigma_{\max}\sqrt{\frac{\mu^2 r}{n_2}}\max_j\big\|e_j^T\delta H_{(i),l}\big\|, 
\end{aligned}\tag{71}
\]
where the last summand corresponds to (TM16).
\[
C_4 = 34327\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{9}
+ 534\Big(\frac{2\sigma_{\max}^2}{\frac34\sigma_{\min}^2}\Big)^{5}\theta^{-\frac12}
= O\big(\kappa^9 + \kappa^5\theta^{-\frac12}\big). \tag{72}
\]
2. $\big\|\hat S_{(i),t} - S^\star_{(i)}\big\|_\infty \le 2\lambda_t \le 4\sigma_{\max}\frac{\mu^2 r}{n}$ for every $i\in[N]$.
Moreover, we have
\[
\big\|\hat U_{g,t}\hat V_{(i),g,t}^T - U^\star_g V^{\star T}_{(i),g}\big\|_\infty = O\Big(\rho^t + \frac{\epsilon}{1-\rho}\Big), \quad\text{for every } i\in[N], \tag{73}
\]
and
\[
\big\|\hat U_{(i),l,t}\hat V_{(i),l,t}^T - U^\star_{(i),l} V^{\star T}_{(i),l}\big\|_\infty = O\Big(\rho^t + \frac{\epsilon}{1-\rho}\Big), \quad\text{for every } i\in[N]. \tag{74}
\]
Remark The definition of the term $\rho_{\min}$ in the statement of the above theorem is kept
intentionally implicit to streamline the presentation. In what follows, we give an
estimate of the requirements on $\alpha$ purely in terms of the parameters of the problem.
Lemma 14 requires $\alpha = O\big(\frac{\theta}{\mu^4 r^2}\big(\frac{\sigma_{\min}}{\sigma_{\max}}\big)^{8}\big)$. Lemma 15 requires $\alpha = O\big(\frac{1}{\mu^2 r}\big)$. Lemma 16
requires $\alpha = O\big(\frac{\theta}{\mu^4 r^2}\big(\frac{\sigma_{\min}}{\sigma_{\max}}\big)^{12}\big)$. Lemma 17 requires $\alpha = O\big(\frac{\theta}{\mu^4 r^2}\big(\frac{\sigma_{\min}}{\sigma_{\max}}\big)^{12}\big)$. Lemma
18 requires $\alpha = O\big(\frac{\theta}{\mu^2 r}\big(\frac{\sigma_{\min}}{\sigma_{\max}}\big)^{6}\big)$. And Lemma 19 requires $\alpha = O\big(\frac{\theta}{\mu^2 r}\big(\frac{\sigma_{\min}}{\sigma_{\max}}\big)^{2}\big)$.
As $C_4 = O\big(\frac{1}{\sqrt\theta}\big(\frac{\sigma_{\max}}{\sigma_{\min}}\big)^{10}\big)$, the additional requirement in Theorem 21 becomes $\alpha = O\big(\frac{\theta}{\mu^4 r^2}\big(\frac{\sigma_{\min}}{\sigma_{\max}}\big)^{20}\big)$. Taking the intersection of all these requirements, we can derive
the upper bound on $\alpha$ as $\alpha = O\big(\frac{\theta}{\mu^4 r^2}\big(\frac{\sigma_{\min}}{\sigma_{\max}}\big)^{20}\big)$.
Induction step: Now supposing that Claims 1, 2, and 3 hold for iterations $1, \cdots, t$, we
will show their correctness for iteration $t+1$. Since Claim 3 holds for iteration $t$, we
know $\|\hat L_{(i),t} - L^\star_{(i)}\|_\infty \le \rho\lambda_t + \epsilon$ under the condition $\alpha \le \frac{\rho^2}{4\mu^4 r^2 C_4^2}$. With the choice of $\lambda_{t+1} =$
$\rho\lambda_t + \epsilon$, if the $jk$-th entry of $\hat S_{(i),t+1}$ is nonzero, we have $\big|[S^\star_{(i)}]_{jk} + [L^\star_{(i)}]_{jk} - [\hat L_{(i),t}]_{jk}\big| >$
$\lambda_{t+1}$. Since $\big|[L^\star_{(i)}]_{jk} - [\hat L_{(i),t}]_{jk}\big| \le \lambda_{t+1}$, we must have $\big|[S^\star_{(i)}]_{jk}\big| > 0$. This proves Claim 1
for iteration $t+1$.
We will now proceed to prove Claim 2. We consider each entry of $\hat S_{(i),t+1} =$
${\rm Hard}_{\lambda_{t+1}}\big[S^\star_{(i)} + L^\star_{(i)} - \hat L_{(i),t}\big]$. From the definition of hard-thresholding, we know
$\big|[\hat S_{(i),t+1}]_{jk} - \big([S^\star_{(i)}]_{jk} + [L^\star_{(i)}]_{jk} - [\hat L_{(i),t}]_{jk}\big)\big| \le \lambda_{t+1}$. Since we know
$\big|[L^\star_{(i)}]_{jk} - [\hat L_{(i),t}]_{jk}\big| \le \lambda_{t+1}$ from the correctness of Claim 3 at iteration $t$ and the upper bound on $\alpha$, we can derive $\big|[\hat S_{(i),t+1}]_{jk} - [S^\star_{(i)}]_{jk}\big| \le 2\lambda_{t+1}$ by the triangle inequality. We
hence prove Claim 2.
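For concreteness, a minimal Python sketch of the entrywise hard-thresholding step and the geometric threshold schedule used in this induction is given below (ours, purely illustrative; the numerical values of rho, eps, and the initial threshold are assumptions for the toy example).

```python
import numpy as np

def hard_threshold(X, lam):
    """Entrywise hard-thresholding: keep entries with |x| > lam, zero out the rest."""
    return X * (np.abs(X) > lam)

# Geometric threshold schedule lambda_{t+1} = rho * lambda_t + eps, so that
# lambda_t = O(rho^t + eps / (1 - rho)), matching the error rates in (73)-(74).
rho, eps, lam = 0.5, 1e-3, 1.0
for t in range(10):
    lam = rho * lam + eps
print(lam)   # approaches eps / (1 - rho) = 2e-3
```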
For Claim 3, since $E_{(i),t+1} = S^\star_{(i)} - \hat S_{(i),t+1}$, we have $\|E_{(i),t+1}\|_\infty \le 2\lambda_{t+1}$ for each $i$ as
well. Also, by Claim 1 at iteration $t$, the $E_{(i),t}$'s are $\alpha$-sparse at iteration $t$. Therefore, by Lemma
20, $\|\hat L_{(i),t+1} - L^\star_{(i)}\|_\infty \le 2\sqrt{\alpha}\mu^2 r C_4\lambda_{t+1}$. Under the constraint that $\alpha \le \frac{\rho^2}{4\mu^4 r^2 C_4^2}$, we know
\[
\begin{aligned}
\big\|\hat U_{g,t}\hat V_{(i),g,t}^T - U^\star_g V^{\star T}_{(i),g}\big\|_\infty
&= \big\|P_{\hat H_g}\big(L^\star_{(i),g} + L^\star_{(i),l} + E_{(i)}\big) - L^\star_{(i),g}\big\|_\infty \\
&\le \big\|P_{\hat H_g}L^\star_{(i),l}\big\|_\infty
+ \big\|T_{(0)}\hat H_g\Lambda_1^{-2}\hat H_g^T T_{(0)}\big\|
+ \big\|T_{(0)}\hat H_g\Lambda_1^{-1}\delta H_g^T\big\|
+ \big\|\delta H_g\Lambda_1^{-1}\hat H_g^T T_{(0)}\big\|
+ \big\|\delta H_g\,\delta H_g^T\big\|.
\end{aligned}
\]
In Lemma 20, we have shown that each term above is upper bounded by $O(\max_i\|E_{(i),t}\|_\infty)$.
Therefore, by Claim 2, we have $\|\hat U_{g,t}\hat V_{(i),g,t}^T - U^\star_g V^{\star T}_{(i),g}\|_\infty = O(\max_i\|E_{(i),t}\|_\infty) = O(\lambda_t) =$
$O\big(\rho^t + \frac{\epsilon}{1-\rho}\big)$, and (73) follows accordingly by the triangle inequality.
We can prove (74) in a similar way. This completes our proof of Theorem 21.
and
\[
\|AB\|_2 \le \|A\|_2\|B\|_2.
\]
Proof The proof is straightforward and can be found in Sun and Luo (2016).
Lemma 23 For $x, y \in [0, 1)$ such that $x + y < 1$, the following relation holds:
\[
\sum_{p_1+p_2\ge 1} x^{p_1+p_2} \le \frac{2x}{1-x}. \tag{75a}
\]
\[
\sum_{p_1+p_2\ge 1} x^{p_1+p_2} = \sum_{p_1=0}^{\infty}\sum_{p_2=0}^{\infty} x^{p_1+p_2} - 1
= \sum_{p_1=0}^{\infty} x^{p_1}\sum_{p_2=0}^{\infty} x^{p_2} - 1
= \frac{1}{(1-x)^2} - 1
= \frac{2x - x^2}{(1-x)^2} \le \frac{2x}{1-x}.
\]
Lemma 24 For symmetric matrices $A_0\in\mathbb{R}^{r_0\times r_0}$, $A_1\in\mathbb{R}^{r_1\times r_1}, \cdots, A_N\in\mathbb{R}^{r_1\times r_1}$, and
$B_i\in\mathbb{R}^{r_0\times r_1}$ for $i\in\{1,\cdots,N\}$, we can construct a symmetric block matrix $C$ as
\[
C = \begin{pmatrix}
A_0 & B_1 & B_2 & \cdots & B_N \\
B_1^T & A_1 & 0 & \cdots & 0 \\
B_2^T & 0 & A_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \cdots & 0 \\
B_N^T & 0 & 0 & \cdots & A_N
\end{pmatrix}. \tag{76}
\]
Then $C$ is positive definite if and only if $A_1, A_2, \cdots, A_N$ are positive definite and
$A_0 - \sum_{i=1}^{N} B_i A_i^{-1} B_i^T$ is positive definite.
Proof Since the $A_i$'s are positive definite, they are invertible. Thus we can decompose $C$ as
On the right hand side, C1 and CT1 are both invertible. Thus, C is positive definite if and
only if C2 is positive definite. Since C2 is a block diagonal matrix, we prove the statement
in the lemma.
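The following short Python sketch (ours, illustrative) checks Lemma 24 numerically by comparing the smallest eigenvalue of a randomly generated arrow-shaped block matrix $C$ with the Schur-complement criterion; the block sizes and scales are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
r0, r1, N = 3, 2, 4

def random_spd(d, scale=1.0):
    A = rng.normal(size=(d, d))
    return A @ A.T + scale * np.eye(d)

A0 = random_spd(r0, 5.0)
As = [random_spd(r1) for _ in range(N)]
Bs = [0.3 * rng.normal(size=(r0, r1)) for _ in range(N)]

# Assemble the arrow-shaped block matrix C of (76).
C = np.zeros((r0 + N * r1, r0 + N * r1))
C[:r0, :r0] = A0
for i, (Ai, Bi) in enumerate(zip(As, Bs)):
    s = r0 + i * r1
    C[:r0, s:s + r1] = Bi
    C[s:s + r1, :r0] = Bi.T
    C[s:s + r1, s:s + r1] = Ai

schur = A0 - sum(Bi @ np.linalg.inv(Ai) @ Bi.T for Ai, Bi in zip(As, Bs))
lhs = np.linalg.eigvalsh(C).min() > 0                       # C positive definite?
rhs = (np.linalg.eigvalsh(schur).min() > 0                  # Schur complement PD?
       and all(np.linalg.eigvalsh(Ai).min() > 0 for Ai in As))
print(lhs, rhs)   # the two criteria agree (Lemma 24)
```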
The following lemma provides an eigenvalue lower bound on the product of three matrices.
Lemma 25 For matrix A ∈ Rn×n , and symmetric positive semidefinite matrix B ∈ Rn×n ,
we know that,
We finally present the lemma that provides an upper bound of the operator norm of
block matrices.
Note: in (78), the diagonal blocks and off-diagonal blocks are treated differently.
Proof We first prove for the special case where Bij = 0. In this case,
\[
\le \max_{i=1,\cdots,N}\big\{\|A_i\|_2\big\}\sum_{i=1}^{N}\|v_i\|^2 = \max_{i=1,\cdots,N}\big\{\|A_i\|_2\big\}.
\]
References
M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete
dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):
4311–4322, 2006.
R. Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.
T. Bouwmans and E. H. Zahzah. Robust pca via principal component pursuit: A review for a
comparative evaluation in video surveillance. Computer Vision and Image Understanding,
122:22–34, 2014. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2013.11.009. URL
https://www.sciencedirect.com/science/article/pii/S1077314213002294.
E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal
of the ACM (JACM), 58(3):1–37, 2011.
D. Chai, L. Wang, K. Chen, and Q. Yang. Secure federated matrix factorization. IEEE
Intelligent Systems, 36(5):11–20, 2021. doi: 10.1109/MIS.2020.3014880.
X. Chen, B. Zhang, T. Wang, A. Bonni, and G. Zhao. Robust principal component analysis
for accurate outlier sample detection in rna-seq data. Bmc Bioinformatics, 21(1):1–20,
2020.
Y. Chen, J. Fan, C. Ma, and Y. Yan. Bridging convex and nonconvex optimization in robust
pca: Noise, outliers, and missing data. Annals of statistics, 49(5):2948, 2021.
J. Fan, W. Wang, and Y. Zhong. An l-infinity eigenvector perturbation bound and its
application to robust covariance estimation. Journal of Machine Learning Research, 18
(207):1–42, 2018.
S. Fattahi and S. Sojoudi. Exact guarantees on the absence of spurious local minima for
non-negative rank-1 robust principal component analysis. Journal of machine learning
research, 2020.
Q. Feng, M. Jiang, J. Hannig, and J. Marron. Angle-based joint and individual variation
explained. Journal of multivariate analysis, 166:241–265, 2018.
R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems:
A unified geometric analysis. In International Conference on Machine Learning, pages
1233–1242. PMLR, 2017.
D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions.
IEEE Transactions on Information Theory, 57(11):7221–7234, 2011.
N. Jin, S. Zhou, and T.-S. Chang. Identification of impacting factors of surface defects in hot
rolling processes using multi-level regression analysis. Society of Manufacturing Engineers
Southfield, MI, USA, 2000.
Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems.
Computer, 42(8):30–37, 2009. doi: 10.1109/MC.2009.263.
H. Lee and S. Choi. Group nonnegative matrix factorization for eeg classification. In
D. van Dyk and M. Welling, editors, Proceedings of the Twelth International Conference
on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning
Research, pages 320–327, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA,
16–18 Apr 2009. PMLR. URL https://proceedings.mlr.press/v5/lee09a.html.
X. Li, J. Haupt, J. Lu, Z. Wang, R. Arora, H. Liu, and T. Zhao. Symmetry, saddle points,
and global optimization landscape of nonconvex matrix factorization. In 2018 Information
Theory and Applications Workshop (ITA), pages 1–9, 2018. doi: 10.1109/ITA.2018.8503215.
X. Li, S. Wang, and Y. Cai. Tutorial: Complexity analysis of singular value decomposition
and its variants. arXiv preprint arXiv:1906.12085, 2019.
D. Meng and F. De La Torre. Robust matrix factorization with unknown noise. In Proceedings
of the IEEE International Conference on Computer Vision, pages 1337–1344, 2013.
E. Ponzi, M. Thoresen, and A. Ghosh. Rajive: Robust angle based jive for integrating noisy
multi-source data. arXiv preprint arXiv:2101.09110, 2021.
B. Shen, W. Xie, and Z. J. Kong. Smooth robust tensor completion for back-
ground/foreground separation with missing pixels: Novel algorithm with convergence
guarantee. Journal of Machine Learning Research, 23(217):1–40, 2022.
N. Shi and R. A. Kontar. Personalized pca: Decoupling shared and unique features. Journal
of Machine Learning Research, 25(41):1–82, 2024. URL http://jmlr.org/papers/v25/
22-0810.html.
R. Sun and Z.-Q. Luo. Guaranteed matrix completion via non-convex factorization. IEEE
Transactions on Information Theory, 62(11):6535–6579, 2016.
Y. L. Tan, V. Sehgal, and H. H. Shahri. Sensoclean: Handling noisy and incomplete data in
sensor networks using modeling. Main, pages 1–18, 2005.
R. K. Wong and T. C. Lee. Matrix completion with noisy entries and outliers. The Journal
of Machine Learning Research, 18(1):5404–5428, 2017.
J. Wright and Y. Ma. High-dimensional data analysis with low-dimensional models: Princi-
ples, computation, and applications. Cambridge University Press, 2022.
W. Xiao, X. Huang, F. He, J. Silva, S. Emrani, and A. Chaudhuri. Online robust principal
component analysis with change point detection. IEEE Transactions on Multimedia, 22
(1):59–68, 2020. doi: 10.1109/TMM.2019.2923097.
T. Ye and S. S. Du. Global convergence of gradient descent for asymmetric low-rank matrix
factorization. Advances in Neural Information Processing Systems, 34:1429–1439, 2021.
L. Zhang, H. Shen, and J. Z. Huang. Robust regularized singular value decomposition with
application to mortality data. The Annals of Applied Statistics, pages 1540–1561, 2013.
G. Zhou, A. Cichocki, Y. Zhang, and D. P. Mandic. Group component analysis for multiblock
data: Common and individual feature extraction. IEEE transactions on neural networks
and learning systems, 27(11):2426–2439, 2015.