Journal of Machine Learning Research 25 (2024) 1-76 Submitted 3/24; Revised 10/24; Published 11/24

Triple Component Matrix Factorization: Untangling Global, Local, and Noisy Components

Naichen Shi [email protected]


Department of Industrial & Operations Engineering
University of Michigan
Ann Arbor, MI 48109, USA
Salar Fattahi [email protected]
Department of Industrial & Operations Engineering
University of Michigan
Ann Arbor, MI 48109, USA
Raed Al Kontar ∗ [email protected]
Department of Industrial & Operations Engineering
University of Michigan
Ann Arbor, MI 48109, USA

Editor: Mahdi Soltanolkotabi

Abstract
In this work, we study the problem of common and unique feature extraction from noisy data.
When we have N observation matrices from N different and associated sources corrupted
by sparse and potentially gross noise, can we recover the common and unique components
from these noisy observations? This is a challenging task as the number of parameters
to estimate is approximately thrice the number of observations. Despite the difficulty, we
propose an intuitive alternating minimization algorithm called triple component matrix
factorization (TCMF) to recover the three components exactly. TCMF is distinguished from
existing work in the literature by two salient features. First, TCMF is a principled method that
provably separates the three components given noisy observations. Second, the bulk of the
computation in TCMF can be distributed. On the technical side, we formulate the problem
as a constrained nonconvex nonsmooth optimization problem. Despite the intricate nature
of the problem, we provide a Taylor series characterization of its solution by solving the
corresponding Karush–Kuhn–Tucker conditions. Using this characterization, we can show
that the alternating minimization algorithm makes significant progress at each iteration
and converges into the ground truth at a linear rate. Numerical experiments in video
segmentation and anomaly detection highlight the superior feature extraction abilities of
TCMF.
Keywords: Matrix Factorization, Heterogeneity, Alternating minimization, Sparse noise,
Outlier identification

1. Introduction
In the era of Big Data, an important task is to find low-rank features from high-dimensional
observations. Methods including principal component analysis (Hotelling, 1933), low-rank

∗. Corresponding author

©2024 Naichen Shi, Salar Fattahi, Raed Al Kontar.


License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v25/24-0400.html.

matrix factorization (Koren et al., 2009), and dictionary learning (Aharon et al., 2006), have
found success in numerous fields of statistics and machine learning (Wright and Ma, 2022).
Among them, matrix factorization (MF) is an efficient method to identify the features that
best explain the observation matrices.
Despite their wide popularity, standard MF methods are known to be brittle in the
presence of outliers with gross noise (Candès et al., 2011). Such noise is often sparse but
can have large norms. A series of methods (e.g., (Candès et al., 2011; Netrapalli et al., 2014;
Wong and Lee, 2017; Fattahi and Sojoudi, 2020; Chen et al., 2021)) have been developed to
estimate low-rank features from data that contain outliers. When the portion of outliers
is not too large, one can provably identify the outliers and the low-rank components with
convex programming (Candès et al., 2011) or nonconvex optimization algorithms equipped
with convergence guarantees (Netrapalli et al., 2014).
Recently, there has been a growing number of applications where data are acquired from
diverse but connected sources, such as smartphones, car sensors, or medical records from
different patients. This type of data displays both mutual and individual characteristics. For
instance, in biostatistics, the measurements of different miRNA and gene expressions from
the same set of samples can reveal co-varying patterns yet exhibit heterogeneous trends (Lock
et al., 2013). Statistical modeling of the common information among all data sources and
the specific information for each source is of central interest in these applications. Multiple
works propose to recover such common and unique features by minimizing the square norm
of the residuals of fitting (Lock et al., 2013; Zhou et al., 2015; Gaynanova and Li, 2019; Park
and Lock, 2020; Lee and Choi, 2009; Yang and Michailidis, 2016; Shi and Kontar, 2024;
Shi et al., 2023; Liang et al., 2023). These methods prove to be useful in aligning genetics
features (Lock et al., 2013), visualizing bone and soft tissues in X-ray images (Zhou et al.,
2015), functional magnetic resonance imaging (Kashyap et al., 2019), surveillance video
segmentation (Shi and Kontar, 2024), stock market analysis (Shi et al., 2023), and many
more.
Though these algorithms achieve decent performance on multiple applications, they rely
on least-squares estimates, which are not robust to outliers in data. Real-world data are
commonly corrupted by outliers (Tan et al., 2005). Factors including measurement errors or
sensor malfunctions can give rise to large noise in data. These outliers can substantially skew
the estimation of low-rank features. As such, we attempt to answer the following question.

Question: How can one provably identify low-rank common and unique information
robustly from data corrupted by outlier noise?

A natural thought is to borrow techniques in robust PCA to handle outlier noise. Indeed,
there exist a few heuristic methods in literature (Sagonas et al., 2017; Panagakis et al.,
2015; Ponzi et al., 2021) to find robust estimates of shared and unique features. These
methods often use $\ell_1$ regularization (Sagonas et al., 2017; Panagakis et al., 2015) or Huber
loss (Ponzi et al., 2021) to accommodate the sparsity of noise. However, these algorithms are
mainly based on heuristics and lack theoretical guarantees, thus potentially compromising
the quality of their outputs. A theoretically justifiable method to identify low-rank shared
and unique components from outlier noise is still lacking. In this paper, we will study the
question rigorously and develop an efficient algorithm to solve it.


2. Problem Statement
We consider the framework where $N$ observation matrices $\mathbf{M}_{(1)}, \mathbf{M}_{(2)}, \cdots, \mathbf{M}_{(N)}$ come from
$N \in \mathbb{N}_+$ different but associated sources. These matrices $\mathbf{M}_{(i)} \in \mathbb{R}^{n_1 \times n_{2,(i)}}$ have the same
number of features $n_1$. To model their commonality and uniqueness, we assume each matrix
is driven by $r_1$ shared factors and $r_{2,(i)}$ unique factors and contaminated by potentially gross
noise. More specifically, we consider the model where the observation $\mathbf{M}_{(i)}$ from source $i$ is
generated by
$$\mathbf{M}_{(i)} = \mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g} + \mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l} + \mathbf{S}^\star_{(i)}, \tag{1}$$
where $\mathbf{U}^\star_g \in \mathbb{R}^{n_1 \times r_1}$, $\mathbf{V}^\star_{(i),g} \in \mathbb{R}^{n_{2,(i)} \times r_1}$, $\mathbf{U}^\star_{(i),l} \in \mathbb{R}^{n_1 \times r_{2,(i)}}$, $\mathbf{V}^\star_{(i),l} \in \mathbb{R}^{n_{2,(i)} \times r_{2,(i)}}$, and
$\mathbf{S}^\star_{(i)} \in \mathbb{R}^{n_1 \times n_{2,(i)}}$. We use $\star$ to denote the ground truth. Here $r_1$ is the rank of the global (shared) feature
matrix, and $r_{2,(i)}$ is the rank of the local (unique) feature matrix from source $i$. The matrix
$\mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g}$ models the shared low-rank part of the observation matrix, as its column space
is the same across different sources. $\mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l}$ models the unique low-rank part, and $\mathbf{S}^\star_{(i)}$
models the noise from source $i$.
In matrix factorization problems, the representations $\mathbf{U}^\star$ and $\mathbf{V}^\star$ often correspond to
latent data features. For instance, in recommender systems, $\mathbf{U}^\star$ can be interpreted as
user features that reveal their preferences on different items in the latent space (Koren
et al., 2009). For better interpretability, it is often desirable to have the underlying features
disentangled so that each feature can vary independently of others (Higgins et al., 2017).
Under this rationale, we consider the model where shared and unique factors are orthogonal,
$$\mathbf{U}^{\star T}_g \mathbf{U}^\star_{(i),l} = \mathbf{0}, \quad \forall i \in [N], \tag{2}$$
where $[N]$ denotes the set $\{1, 2, \cdots, N\}$. The orthogonality of features implies that the
shared and unique features span different subspaces, thus describing different patterns in the
observation. The orthogonal condition (2) is thus an inductive bias that reflects our prior
belief about the independence between common and unique factors and naturally models a
diverse range of applications, such as miRNA and gene expression (Lock et al., 2013), human
faces (Zhou et al., 2015), and many more (Sagonas et al., 2017; Shi and Kontar, 2024).
We should note that the orthogonality (2) does not limit the model representation power.
Suppose otherwise that $\mathbf{U}^{\star T}_g \mathbf{U}^\star_{(i),l} \neq \mathbf{0}$; we can decompose $\mathbf{U}^\star_{(i),l}$ into two parts,
$$\mathbf{U}^\star_{(i),l} = \mathbf{U}^\star_g \left(\mathbf{U}^{\star T}_g \mathbf{U}^\star_g\right)^{-1} \mathbf{U}^{\star T}_g \mathbf{U}^\star_{(i),l} + \left(\mathbf{I} - \mathbf{U}^\star_g \left(\mathbf{U}^{\star T}_g \mathbf{U}^\star_g\right)^{-1} \mathbf{U}^{\star T}_g\right) \mathbf{U}^\star_{(i),l}.$$
The first part lies in the column subspace of $\mathbf{U}^\star_g$, while the second part lies in the orthogonal complement of the column
subspace of $\mathbf{U}^\star_g$. If we define $\widetilde{\mathbf{U}}^\star_{(i),l} = \left(\mathbf{I} - \mathbf{U}^\star_g \left(\mathbf{U}^{\star T}_g \mathbf{U}^\star_g\right)^{-1} \mathbf{U}^{\star T}_g\right) \mathbf{U}^\star_{(i),l}$ and
$\widetilde{\mathbf{V}}^\star_{(i),g} = \mathbf{V}^\star_{(i),g} + \mathbf{V}^\star_{(i),l} \mathbf{U}^{\star T}_{(i),l} \mathbf{U}^\star_g \left(\mathbf{U}^{\star T}_g \mathbf{U}^\star_g\right)^{-1}$, we have $\mathbf{M}_{(i)} = \mathbf{U}^\star_g \widetilde{\mathbf{V}}^{\star T}_{(i),g} + \widetilde{\mathbf{U}}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l} + \mathbf{S}^\star_{(i)}$,
where $\mathbf{U}^{\star T}_g \widetilde{\mathbf{U}}^\star_{(i),l} = \mathbf{0}$. This formulation admits the form of model (1) with constraint (2).
The noise term $\mathbf{S}^\star_{(i)}$ in (1) models sparse and potentially large noise, where only a small
fraction of the entries of $\mathbf{S}^\star_{(i)}$ is nonzero. The sparsity assumption on the noise is extensively invoked in the literature,
particularly when datasets are plagued by outliers (Candès et al., 2011; Netrapalli et al.,
2014; Chen et al., 2020, 2021).


2.1 Challenges
Given data generation model (1), our task is to separate common, individual, and noise
components. The task seems Herculean as the problem is underdetermined: we need to estimate
three sets of parameters from one set of observations. There are two major challenges
associated with the problem.
Challenge 1: New identifiability conditions are needed. Standard analysis in robust PCA
(Candès et al., 2011; Netrapalli et al., 2014) often uses the incoherence condition to distinguish
low-rank components from sparse noise. However, the incoherence condition alone is
insufficient to guarantee the separability between common and unique features. Since there
are infinitely many ways in which shared, unique, and noise components can form the
observation matrices, it is not apparent whether untangling them is even feasible. Thus, the
crux of our investigation is to understand when the separation is possible.
Fortunately, we show that there exists a group of conditions, known here as identifiability
conditions, that can ensure the precise retrieval of the shared, unique, and sparse noise components. Intuitively,
these identifiability conditions require the three components to have “little overlap”.
Based on these conditions, we will develop an alternating minimization algorithm called
TCMF to iteratively update the three components. An illustration of the algorithm is shown
in the left graph in Figure 1. The hard-thresholding step finds the closest sparse matrix for
the data noise. We use JIMF to denote a subroutine that represents a group of algorithms
(e.g., (Lock et al., 2013; Shi and Kontar, 2024)) to identify common and unique low-rank
features. In essence, JIMF solves a sub-problem in TCMF. It is worth noting that there exist
multiple algorithms in the literature to implement JIMF, many of which can produce high-quality
outputs. With the implemented JIMF, TCMF applies hard thresholding and JIMF alternately
to estimate the sparse as well as the common and unique low-rank components. The left graph
of Figure 1 offers an intuitive understanding of how estimates of various components progress
toward the ground truth with each iterative step.

Figure 1: Left: An illustration of TCMF’s update trajectory. The purple and blue curves rep-
resent the spaces of the low-rank and sparse matrices. The algorithm alternately
performs hard thresholding and JIMF, making the updates closer and closer to
the ground truth. Middle: An illustration showing why insufficient understanding
of the output of JIMF can be problematic in the convergence analysis. Right:
Our contribution: representing the solution as a Taylor-like series.

Challenge 2: New analysis tools are needed. Showing the exact recovery of low-rank and
sparse components is not easy. Even in standard robust PCA, one needs to apply highly
nontrivial analytical techniques to provide theoretical guarantees. For example, Robust PCA

(Candès et al., 2011) relies on a “golfing scheme” to construct dual variables that ensure the
uniqueness of a convex optimization problem. Nonconvex robust PCA (Netrapalli et al.,
2014) applies a perturbation analysis of SVD to quantify the improvement of the algorithm
per iteration. These techniques are tailored for standard robust PCA and cannot be directly
extended to the case where both common and unique features are involved, which increases
the complexity of the analysis. The major difficulty stems from the fact that TCMF updates
the low-rank common and unique components by another iterative algorithm JIMF. Unlike
robust PCA, the output of JIMF does not have a closed-form formula. This conceptual
hurdle is illustrated in the middle graph of Figure 1. As a result, novel analysis tools are
needed to justify the convergence of the proposed TCMF.
One of our key contributions in tackling this challenge is to develop innovative analysis
tools by solving the Karush–Kuhn–Tucker conditions of the objective of JIMF and expressing the
solutions as a Taylor-like series. From the Taylor-like series, we can precisely characterize
the output of JIMF, thereby showing that the series converges to a close estimate of the ground
truth shared and unique features.
The Taylor-like series is depicted in the right graph of Figure 1. Perhaps surprisingly,
regardless of the choice of the subroutine JIMF, as long as JIMF finds a close estimate of
the optimal solutions to a subproblem, its output can be represented by an infinite series.
The series describes the optimal solution of the subproblem and is independent of the
intermediate steps in JIMF. The derivation and analysis of the Taylor-like series have stand-alone
value for the theoretical study of the sensitivity analysis of matrix factorization. With
the new analysis tool, we are able to show that even if JIMF only outputs a reasonable
approximate solution, the meta-algorithm TCMF can still take advantage of the information in
such an inexact solution to refine the estimates of the three components. We will elaborate
on the Taylor-like series in greater detail in Section 6 and the Appendix.
We summarize our contributions in the following.

2.2 Summary of Contributions


Identifiability conditions. We discover a group of identifiability conditions sufficient for
the almost exact recovery of common, unique, and sparse components from noisy observation
matrices. Essentially, the identifiability conditions require that the fraction of nonzero
entries in the noise not be too large, the factor matrices be incoherent, and unique factors be
misaligned. The first two conditions are needed even in the standard analysis of the robust
PCA, while the third condition is essential for the disentanglement of the shared and unique
components.
Efficient and distributed algorithm. We propose a constrained nonconvex nons-
mooth matrix factorization problem to recover the shared, unique, and sparse components. De-
spite the nonconvexity of the problem, we design a meta-algorithm called Triple Component
Matrix Factorization (TCMF) to solve the problem. Our approach is able to leverage a wide
range of existing methods for separating the common and unique components to precision $\epsilon$.
Furthermore, TCMF can be distributed if the subroutine JIMF is distributed.
Convergence guarantee. We show that, under the identifiability conditions, our
proposed TCMF has a convergence guarantee. To the best of our knowledge, such a guarantee
is the first of its kind, as it ensures the recovery of common, unique, and noise components

5
Shi, Fattahi, and Al Kontar

to high precision. Our theoretical analysis introduces new techniques to solve the KKT
conditions in Taylor-like series and bound each term in the series. It sheds light on the
sensitivity analysis with the `∞ norm.
Case studies. We use a wide range of numerical experiments to demonstrate the
application of TCMF in different case studies, as well as the effectiveness of our proposed
method. Numerical experiments corroborate theoretical convergence results. Also, the case
studies on video segmentation and anomaly detection showcase the benefits of untangling
shared, unique, and noisy components.
In the rest of the paper, we provide a comprehensive review of the literature in Section 3.
Then, we elaborate on the conditions sufficient for the separation of the three components
in Section 4. In Section 5, we introduce the alternating minimization algorithm. We present
our convergence theorem in Section 6 and discuss the key insights in the proof and how
they solve challenge 2. In Section 7, we demonstrate the numerical experiment results. The
detailed proofs are relegated to the Appendix for brevity of the main paper.

3. Related Work
Matrix Factorization There are numerous works that analyze the theoretical and practical
properties of first-order algorithms that solve the (asymmetric) matrix factorization problem
$\min_{\mathbf{U},\mathbf{V}} \|\mathbf{M} - \mathbf{U}\mathbf{V}^T\|_F^2$ or its variants (Li et al., 2018; Ye and Du, 2021; Sun and Luo, 2016;
Park et al., 2017; Tu et al., 2016). Among them, Sun and Luo (2016) analyzes the local
landscape of the optimization problem and establishes the local linear convergence of a series
of first-order algorithms. Park et al. (2017); Ge et al. (2017) study the global geometry of the
optimization problem. Tu et al. (2016) proposes the Rectangular Procrustes Flow algorithm
that is proved to converge linearly to the ground truth under proper initialization and
a balancing regularization. Recently, Ye and Du (2021) shows that gradient descent with
small and random initialization can converge to the ground truth.
Robust PCA When the observation is corrupted by sparse and potentially large noise,
several approaches can still identify the low-rank components. An exemplary work is Robust
PCA (Candès et al., 2011), which proposes an elegant convex optimization problem called
principal component pursuit that uses nuclear norm and `1 norm to promote the sparsity and
low-rankness of the solutions. It is proved that under incoherence assumptions, the solution
of the convex optimization is unique and corresponds to the ground truth. Several works also
consider the problem of matrix completion under outlier noise (Wong and Lee, 2017; Chen
et al., 2021). Nonconvex robust PCA (Netrapalli et al., 2014) improves the computational
efficiency of principal component pursuit by proposing a nonconvex formulation and using
an alternating projection algorithm to solve it. Though the formulation is nonconvex, the
alternating projection algorithm is also proved to recover the ground truth exactly under
incoherence and sparsity requirements. For the special case of rank-1 robust PCA, Fattahi and
Sojoudi (2020) show that a simple sub-gradient method applies directly to the nonsmooth `1 -
loss provably recovers the low-rank component and sparse noise, under the same incoherence
and sparsity requirements. To model a broader set of noise, Meng and De La Torre (2013)
consider a mixture of Gaussian noise models and exploit the EM algorithm to estimate the
low-rank components. Robust PCA has found successful applications in video segmentation
(Bouwmans and Zahzah, 2014), image processing (Vaswani et al., 2018), change point
detection (Xiao et al., 2020), and many more. Nevertheless, the formulations of robust PCA
focus on shared low-rank features among all data and neglect unique components.

Distributed matrix factorization The emergence of edge computing has prompted
research on distributed matrix factorization. Gemulla et al. (2011) exploits distributed
gradient descent to factorize large matrices. Chai et al. (2021) proposes a cryptographic
framework where multiple clients use their local data to collaboratively factorize a matrix
without leaking private information to the server. These works use one set of feature matrices
U and V to fit data from all clients, thus also neglecting the feature differences from different
sources as well as the possible outliers in data. Our method TCMF is distributed when its
subroutine JIMF is distributed. Different from conventional distributed matrix factorization,
TCMF can find common and unique components simultaneously while remaining robust in
the presence of outliers.

Joint and individual feature extraction The literature on using MF to identify
shared and unique features abounds (Lock et al., 2013; Zhou et al., 2015; Gaynanova and Li,
2019; Park and Lock, 2020; Lee and Choi, 2009; Yang and Michailidis, 2016; Shi and Kontar,
2024; Liang et al., 2023; Shi et al., 2023). Among them, JIVE (Lock et al., 2013), COBE
(Zhou et al., 2015), PerPCA (Shi and Kontar, 2024), and HMF (Shi et al., 2023) use mutually
orthogonal features to model the shared and unique components. SLIDE (Gaynanova and
Li, 2019) and BIDIFAC (Park and Lock, 2020) do not pose orthogonality constraints but use
regularizations to encourage the unique features to have small norms. GNMF (Lee and Choi,
2009) and iNMF (Yang and Michailidis, 2016) further add nonnegativity constraints to the
factor matrices. In particular, PerPCA (Shi and Kontar, 2024) and HMF (Shi et al., 2023)
are two distributed algorithms that are guaranteed to converge to the optimal solutions under
proper conditions. It is worth mentioning that these methods do not account for sparse noise
in the observations. In this work, we remedy this challenge by utilizing existing methods as
basic building blocks for our approach, which focuses on simultaneously separating shared
and unique features as well as noise components.

Robust shared and unique feature extraction As discussed, a few heuristic methods
also attempt to find the shared and unique features when data are corrupted by large noise
(Sagonas et al., 2017; Ponzi et al., 2021; Panagakis et al., 2015). Amid them, RaJIVE
(Ponzi et al., 2021) employs robust SVD (Zhang et al., 2013) to remove noise from the
observations and then uses a variant of JIVE (Feng et al., 2018) to separate common and
unique components. RJIVE (Sagonas et al., 2017) proposes a constrained optimization
formulation to minimize the `1 norm of the fitting residuals and exploits ADMM to solve the
problem. RCICA (Panagakis et al., 2015) adopts a similar optimization objective but uses a
regularization to encourage the similarity of common subspaces and only works for N = 2
cases. Though these methods can achieve decent performance in applications including
facial expression synthesis and audio-visual fusion, they are based on heuristics and it is
not clear whether their output converges to the ground truth common and unique factors.
In contrast, we prove that TCMF is guaranteed to recover the ground truth and use a few
numerical examples to show that TCMF indeed recovers more meaningful components.


4. Identifiability Conditions
Our goal is to decouple the common components, unique components, and the sparse noise,
given a group of data observations $\{\mathbf{M}_{(i)}\}_{i=1}^N$. At first glance, such decoupling may seem
impossible or even ill-defined: roughly speaking, the number of unknown variables, namely
global components, local components, and noise, is thrice the number of observed data
matrices $\{\mathbf{M}_{(i)}\}_{i=1}^N$, and hence, there are infinitely many decouplings that can give rise
to the same $\mathbf{M}_{(i)}$.
The very first question to ask is whether such decoupling is possible and, if so, which
properties can ensure the identifiability of three components. Intriguingly, we are able to
prove that the exact decoupling of shared features, unique features, and noise is possible if
there is “little overlap” among the three components. Below, we will formalize this intuition
in more detail. Though intuitive, it turns out that these conditions can guarantee the
identifiability of the shared components, unique components, and the sparse noise.

4.1 Sparsity
As discussed, identifying arbitrarily dense and large noise from signals is not possible. Hence,
we consider sparse noise where only a small fraction of observations are corrupted. To
characterize the sparsity of $\mathbf{S}^\star_{(i)}$, we use the following definition of α-sparsity.

Definition 1 (α-sparsity) A matrix $\mathbf{S} \in \mathbb{R}^{n_1 \times n_2}$ is α-sparse if at most $\alpha n_1$ entries in each
column and at most $\alpha n_2$ entries in each row are nonzero.

The definition follows from that of Netrapalli et al. (2014). In Definition 1, α characterizes
the maximum portion of corrupted entries in each row and each column. Intuitively, if
a matrix is α-sparse with small α, then its nonzero entries are “spread out” instead of
concentrated on specific columns or rows.
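
As a quick illustration of Definition 1, here is a minimal sketch (a helper of ours, not part of the paper's released code) that computes the smallest α for which a given matrix is α-sparse.

```python
import numpy as np

def sparsity_level(S: np.ndarray) -> float:
    """Return the smallest alpha such that S is alpha-sparse (Definition 1):
    the larger of (max nonzeros per column)/n1 and (max nonzeros per row)/n2."""
    n1, n2 = S.shape
    nonzero = S != 0
    max_per_col = nonzero.sum(axis=0).max()   # worst column
    max_per_row = nonzero.sum(axis=1).max()   # worst row
    return max(max_per_col / n1, max_per_row / n2)

# Example: a 100 x 200 matrix with ~1% random Bernoulli support
rng = np.random.default_rng(0)
S = np.where(rng.random((100, 200)) < 0.01, 100.0, 0.0)
print(sparsity_level(S))
```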

4.2 Incoherence
It is shown that distinguishing sparse components from arbitrary low-rank components is
also hard (Candès et al., 2011; Netrapalli et al., 2014). As a simple counterexample, the
matrix $\mathbf{M} = \mathbf{e}_i \mathbf{e}_j^T$, where we use $\mathbf{e}_i$ to denote the basis vector of axis $i$, has its $ij$-th entry
equal to 1 and all other entries equal to 0. This matrix has rank 1, and is also sparse since it
has only one nonzero entry. Thus, deciding whether it is sparse or low rank is difficult as it
satisfies both requirements.
From the above analysis, one can see that the low-rank components should not be
sparse. In other words, to be distinguishable from the sparse noise, their elements should
be sufficiently spread out. In the literature, this requirement is often characterized by the
so-called incoherence condition (Candès et al., 2011; Netrapalli et al., 2014).

Definition 2 (µ-incoherence) A matrix $\mathbf{U} \in \mathbb{R}^{n \times r}$ is µ-incoherent if
$$\max_i \left\| \mathbf{e}_i^T \mathbf{U} \right\|_2 \leq \frac{\mu \sqrt{r}}{\sqrt{n}},$$
where $\mathbf{e}_i \in \mathbb{R}^n$ is the standard basis vector of axis $i$, defined as $\mathbf{e}_i = (0, 0, \ldots, 0, 1, 0, \ldots, 0)^T$.

The incoherence condition restricts the maximum row-wise $\ell_2$ norm of a matrix $\mathbf{U}$, thus
preventing the entries of $\mathbf{U}$ from being too concentrated on a few specific axes.
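
Similarly, the smallest µ for which a matrix with orthonormal columns is µ-incoherent can be read off from its row norms; a minimal sketch (helper name is ours, not from the paper's code) is given below.

```python
import numpy as np

def incoherence(U: np.ndarray) -> float:
    """Return the smallest mu such that U (n x r, orthonormal columns)
    is mu-incoherent: max_i ||e_i^T U||_2 <= mu * sqrt(r) / sqrt(n)."""
    n, r = U.shape
    row_norms = np.linalg.norm(U, axis=1)      # ||e_i^T U||_2 for each row i
    return row_norms.max() * np.sqrt(n) / np.sqrt(r)

# Example: left singular vectors of a random low-rank matrix are well spread out
rng = np.random.default_rng(0)
L = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 300))
H, _, _ = np.linalg.svd(L, full_matrices=False)
print(incoherence(H[:, :5]))
```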
Remember that in model (1), $\mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g}$ and $\mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l}$ represent the global (shared)
and local (unique) factors. For any $n \geq r$, we use $\mathcal{O}^{n \times r}$ to denote the set of $n$ by $r$ matrices
whose column vectors are orthonormal, $\mathcal{O}^{n \times r} = \{\mathbf{W} \in \mathbb{R}^{n \times r} \mid \mathbf{W}^T \mathbf{W} = \mathbf{I}\}$. We assume the
SVD of $\mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g}$ and $\mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l}$ has the following form,
$$\begin{cases} \mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g} = \mathbf{H}^\star_g \boldsymbol{\Sigma}^\star_{(i),g} \mathbf{W}^{\star T}_{(i),g} \\ \mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l} = \mathbf{H}^\star_{(i),l} \boldsymbol{\Sigma}^\star_{(i),l} \mathbf{W}^{\star T}_{(i),l} \end{cases} \tag{3}$$
where $\mathbf{H}^\star_g \in \mathcal{O}^{n_1 \times r_1}$, $\mathbf{W}^\star_{(i),g} \in \mathcal{O}^{n_{2,(i)} \times r_1}$, $\mathbf{H}^\star_{(i),l} \in \mathcal{O}^{n_1 \times r_{2,(i)}}$, and $\mathbf{W}^\star_{(i),l} \in \mathcal{O}^{n_{2,(i)} \times r_{2,(i)}}$
are orthonormal matrices, while $\boldsymbol{\Sigma}^\star_{(i),g} \in \mathbb{R}^{r_1 \times r_1}$ and $\boldsymbol{\Sigma}^\star_{(i),l} \in \mathbb{R}^{r_{2,(i)} \times r_{2,(i)}}$ are positive
diagonal matrices. In (3), we consider the case where the global and local column
singular vectors are orthogonal, i.e., $\mathbf{H}^{\star T}_g \mathbf{H}^\star_{(i),l} = \mathbf{0}$. We use $r_2 = \max_i r_{2,(i)}$ throughout the
paper.
To avoid overlap between sparse and low-rank components, we assume that the row
and column singular vectors $\mathbf{H}^\star_g$, $\mathbf{H}^\star_{(i),l}$, $\mathbf{W}^\star_{(i),g}$, and $\mathbf{W}^\star_{(i),l}$ are all µ-incoherent. This
assumption ensures that the low-rank components do not have entries too concentrated on
specific rows or columns. As a result, the incoherence on singular vectors encourages the
low-rank components to distribute evenly over all entries, which distinguishes them from sparse
noise that is nonzero only on a small fraction of entries.

4.3 Misalignment
As discussed in (3), we use orthogonality between shared and unique features, $\mathbf{H}^{\star T}_g \mathbf{H}^\star_{(i),l} = \mathbf{0}$,
to encode our prior belief about the independence of different features. This is equivalent
to $\mathbf{U}^{\star T}_g \mathbf{U}^\star_{(i),l} = \mathbf{0}$. Such orthogonality, however, is still insufficient to guarantee the
identifiability of shared and unique factors.
To see this, consider a counterexample where all $\mathbf{U}^\star_{(i),l}$'s are equal, i.e., $\mathbf{U}^\star_{(1),l} = \mathbf{U}^\star_{(2),l} =
\cdots = \mathbf{U}^\star_{(N),l}$. In this case, the “unique” factors are also shared among all observation matrices.
Thus, separating them from the ground truth $\mathbf{U}^\star_g$ is not possible. From this counterexample,
we can see that it is essential for the local features not to be perfectly aligned with each
other. Next, we formally introduce the notion of misalignment. For a full column-rank
matrix $\mathbf{U} \in \mathbb{R}^{d \times n}$, we define the projection matrix $\mathbf{P}_{\mathbf{U}} \in \mathbb{R}^{d \times d}$ as $\mathbf{P}_{\mathbf{U}} = \mathbf{U} \left(\mathbf{U}^T \mathbf{U}\right)^{-1} \mathbf{U}^T$.
Definition 3 (θ-misalignment) We say $\{\mathbf{U}^\star_{(i),l}\}$ are θ-misaligned if there exists a positive
constant $\theta \in (0, 1)$ such that:
$$\lambda_{\max}\left(\frac{1}{N} \sum_{i=1}^N \mathbf{P}_{\mathbf{U}^\star_{(i),l}}\right) \leq 1 - \theta. \tag{4}$$

By the triangle inequality for $\lambda_{\max}(\cdot)$, we know $\lambda_{\max}\left(\frac{1}{N} \sum_{i=1}^N \mathbf{P}_{\mathbf{U}^\star_{(i),l}}\right) \leq
\frac{1}{N} \sum_{i=1}^N \lambda_{\max}\left(\mathbf{P}_{\mathbf{U}^\star_{(i),l}}\right) = 1$. Thus, the introduced $\theta$ is always nonnegative. Indeed, all
$\mathbf{P}_{\mathbf{U}^\star_{(i),l}}$'s have a common nonempty eigenspace with eigenvalue 1 if and only if $\theta = 0$. Thus,
the θ-misalignment condition requires that the subspaces spanned by all unique factors do
not contain a common subspace. On the contrary, all global features are shared; hence, the
subspaces spanned by these features are also identical. This comparison shows that the
misalignment condition unequivocally distinguishes unique features from shared ones.
As a concrete example, consider $N = 2$ and $\mathbf{U}_{(1),l} = (\cos\vartheta, \sin\vartheta)^T$, $\mathbf{U}_{(2),l} =
(\cos\vartheta, -\sin\vartheta)^T$ for $\vartheta \in [0, \frac{\pi}{4}]$. Indeed, the angle between $\mathbf{U}_{(1),l}$ and $\mathbf{U}_{(2),l}$ is $2\vartheta$ and
$$\frac{1}{2}\left(\mathbf{P}_{\mathbf{U}_{(1),l}} + \mathbf{P}_{\mathbf{U}_{(2),l}}\right) = \begin{pmatrix} \cos^2\vartheta & 0 \\ 0 & \sin^2\vartheta \end{pmatrix}.$$
Hence, by definition, $\theta = \sin^2\vartheta$. We can thus clearly see that when $\vartheta$ increases, $\mathbf{U}_{(1),l}$
and $\mathbf{U}_{(2),l}$ become more misaligned.
The notion of θ-misalignment was first proposed by Shi and Kontar (2024) and is intimately
related to the uniqueness conditions in Lock et al. (2013).
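
To make Definition 3 concrete, here is a minimal sketch (helper name is ours, not from the paper's code) that computes θ for a collection of local factor matrices and reproduces the $N = 2$ example above.

```python
import numpy as np

def misalignment(U_list):
    """Return theta = 1 - lambda_max( (1/N) * sum_i P_{U_i} ) per Definition 3."""
    d = U_list[0].shape[0]
    avg_proj = np.zeros((d, d))
    for U in U_list:
        # projection onto the column space of U
        P = U @ np.linalg.inv(U.T @ U) @ U.T
        avg_proj += P / len(U_list)
    return 1.0 - np.linalg.eigvalsh(avg_proj)[-1]

# N = 2 example from the text: theta should equal sin(vartheta)**2
vartheta = np.pi / 6
U1 = np.array([[np.cos(vartheta)], [np.sin(vartheta)]])
U2 = np.array([[np.cos(vartheta)], [-np.sin(vartheta)]])
print(misalignment([U1, U2]), np.sin(vartheta) ** 2)
```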

5. Algorithm
The introduced identifiability conditions restrict the overlaps between shared, unique, and
sparse components. It remains to develop algorithms to untangle the three parts from N
matrices. In Section 5.1, we introduce a constrained optimization formulation, and in Section
5.2, we propose an alternating minimization program to decouple the three parts. The
alternating minimization requires solving subproblems to distinguish shared features from
unique ones.
Throughout the paper, we use $\|\mathbf{A}\|$ or $\|\mathbf{A}\|_2$ to denote the operator norm of a matrix
$\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\|\mathbf{A}\|_F$ to denote the Frobenius norm of $\mathbf{A}$. We use $r$ to denote $r = r_1 + r_2$.

5.1 Constrained Nonconvex Nonsmooth Optimization


We design a constrained optimization problem to decouple the three components. The
decision variables $x$ include features, coefficients, and sparse noise estimates: $x =
\left(\mathbf{U}_g, \{\mathbf{U}_{(i),l}, \mathbf{V}_{(i),g}, \mathbf{V}_{(i),l}, \mathbf{S}_{(i)}\}_{i=1}^N\right)$. The constrained optimization is formulated as,
$$\begin{aligned} \min_{x} \quad & \sum_{i=1}^N h_i(\mathbf{U}_g, \mathbf{V}_{(i),g}, \mathbf{U}_{(i),l}, \mathbf{V}_{(i),l}, \mathbf{S}_{(i)}; \lambda) \\ \text{s.t.} \quad & \mathbf{U}_g^T \mathbf{U}_{(i),l} = \mathbf{0}, \quad \forall i \in [N]. \end{aligned} \tag{5}$$

Here, $h_i$ is a regularized fitting residual consisting of two parts:
$$\begin{aligned} h_i(\mathbf{U}_g, \mathbf{V}_{(i),g}, \mathbf{U}_{(i),l}, \mathbf{V}_{(i),l}, \mathbf{S}_{(i)}; \lambda) &= f_i(\mathbf{U}_g, \mathbf{V}_{(i),g}, \mathbf{U}_{(i),l}, \mathbf{V}_{(i),l}, \mathbf{S}_{(i)}) + \Phi_i(\mathbf{S}_{(i)}; \lambda) \\ &= \underbrace{\frac{1}{2} \left\| \mathbf{M}_{(i)} - \mathbf{U}_g \mathbf{V}_{(i),g}^T - \mathbf{U}_{(i),l} \mathbf{V}_{(i),l}^T - \mathbf{S}_{(i)} \right\|_F^2}_{(f_i)} + \underbrace{\lambda^2 \left\| \mathbf{S}_{(i)} \right\|_0}_{(\Phi_i)}. \end{aligned}$$

Term (fi ) measures the distance between the sum of shared, unique, and sparse components
and the observation matrix M(i) . It denotes the residual of fitting. A common approach for

solving this problem is based on convex relaxation (Candès et al., 2011). However, convex
relaxation increases the number of variables to O(n1 n2 ), while our nonconvex formulation
keeps it in the order of O(max{n1 , n2 }(r1 + r2 )), which is significantly smaller.
Term $(\Phi_i)$ is an $\ell_0$ regularization term that promotes the sparsity of the matrix $\mathbf{S}_{(i)}$. The
parameter λ mediates the balance between the $\ell_0$ penalty and the residual of fitting. A
large value of λ leads to sparser S(i) with only large nonzero elements. Conversely, a small
value of λ yields a denser S(i) with potentially small nonzero elements. Therefore, to identify
both large and small nonzero values of S(i) , while correctly filtering out its zero elements,
we propose to gradually decrease the value of λ during the optimization of objective (5).
We use the notation hi (Ug , V(i),g , U(i),l , V(i),l , S(i) ; λ) to explicitly show that the objective
hi is dependent on the regularization parameter λ.
At first glance, the proposed optimization problem (5) may appear daunting due to
its inherent nonconvexity and nonsmoothness. Notably, it exhibits two distinct sources of
nonconvexity: firstly, both terms (fi ) and (Φi ) are nonconvex, and secondly, the feasible
set corresponding to the constraint $\mathbf{U}_g^T \mathbf{U}_{(i),l} = \mathbf{0}$ is also nonconvex. Furthermore, the $\ell_0$
regularization term in (Φi ) introduces nonsmoothness into the problem. However, we will
introduce an intuitive and efficient algorithm designed to alleviate these challenges and
effectively solve the problem. Surprisingly, under our identifiability conditions introduced in
Section 4, this algorithm can be proven to converge to the ground truth.

5.2 Alternating Minimization


One efficient approach to solving an $\ell_0$-regularized objective is alternating minimization. We
divide the decision variables $x$ into two blocks, $\left(\mathbf{U}_g, \{\mathbf{V}_{(i),g}, \mathbf{U}_{(i),l}, \mathbf{V}_{(i),l}\}\right)$ and $\{\mathbf{S}_{(i)}\}$, and
alternately minimize over one block with the other block of variables fixed.
More specifically, the alternating minimization proceeds in epochs, each comprised of
two steps. For ease of exposition, we use $\hat{\mathbf{U}}_{g,t-1}, \{\hat{\mathbf{V}}_{(i),g,t-1}, \hat{\mathbf{U}}_{(i),l,t-1}, \hat{\mathbf{V}}_{(i),l,t-1}, \hat{\mathbf{S}}_{(i),t-1}\}_{i=1}^N$
to denote the values of $x$ at the end of epoch $t-1$. The $\hat{\cdot}$ notation represents the estimated
values of the variables.
In the first step, we fix the values of $\left(\hat{\mathbf{U}}_{g,t-1}, \{\hat{\mathbf{V}}_{(i),g,t-1}, \hat{\mathbf{U}}_{(i),l,t-1}, \hat{\mathbf{V}}_{(i),l,t-1}\}\right)$ and
optimize over $\{\mathbf{S}_{(i)}\}$. The optimal $\hat{\mathbf{S}}_{(i),t}$ has a simple closed-form solution given by hard-
thresholding,
$$\begin{aligned} \hat{\mathbf{S}}_{(i),t} &= \arg\min_{\mathbf{S}_{(i)}} \left\| \mathbf{M}_{(i)} - \hat{\mathbf{U}}_{g,t-1} \hat{\mathbf{V}}_{(i),g,t-1}^T - \hat{\mathbf{U}}_{(i),l,t-1} \hat{\mathbf{V}}_{(i),l,t-1}^T - \mathbf{S}_{(i)} \right\|_F^2 + \lambda_t^2 \left\| \mathbf{S}_{(i)} \right\|_0 \\ &= \mathrm{Hard}_{\lambda_t}\left[ \mathbf{M}_{(i)} - \hat{\mathbf{U}}_{g,t-1} \hat{\mathbf{V}}_{(i),g,t-1}^T - \hat{\mathbf{U}}_{(i),l,t-1} \hat{\mathbf{V}}_{(i),l,t-1}^T \right], \end{aligned}$$
where $\mathrm{Hard}_\lambda(\cdot)$ is the hard-thresholding operator. For a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, the hard-
thresholding operator is defined as:
$$[\mathrm{Hard}_\lambda(\mathbf{X})]_{ij} = \begin{cases} \mathbf{X}_{ij}, & \text{if } |\mathbf{X}_{ij}| > \lambda \\ 0, & \text{if } \mathbf{X}_{ij} \in [-\lambda, \lambda] \end{cases}. \tag{7}$$
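
The hard-thresholding operator in (7) is straightforward to implement; a minimal sketch (function name is ours) is shown below.

```python
import numpy as np

def hard_threshold(X: np.ndarray, lam: float) -> np.ndarray:
    """Entrywise hard thresholding, as in (7): keep entries with |X_ij| > lam,
    set the rest to zero."""
    return np.where(np.abs(X) > lam, X, 0.0)

# Example: only entries strictly larger than lam in magnitude survive
X = np.array([[0.5, -3.0], [2.0, 0.1]])
print(hard_threshold(X, lam=1.0))   # [[0., -3.], [2., 0.]]
```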

The coefficient λ is a thresholding parameter that controls the sparsity of the output.
To recover the correct sparsity pattern of Ŝ(i),t , our approach is to maintain a small false


positive rate (elements that are incorrectly identified as nonzero), while gradually improving
the true positive rate (elements that are correctly identified as nonzero). To this goal, we
start with a large λ to obtain a conservative estimate of Ŝ(i),t . Then, we decrease λ to refine
the estimate.
In the second step, we fix $\hat{\mathbf{S}}_{(i),t}$ and optimize $\left(\mathbf{U}_g, \{\mathbf{V}_{(i),g}, \mathbf{U}_{(i),l}, \mathbf{V}_{(i),l}\}\right)$ under the
constraint $\mathbf{U}_g^T \mathbf{U}_{(i),l} = \mathbf{0}$. Removing the $\ell_0$ regularization term, which is independent of
$\left(\mathbf{U}_g, \{\mathbf{V}_{(i),g}, \mathbf{U}_{(i),l}, \mathbf{V}_{(i),l}\}\right)$, the optimization subproblem takes the following form,

$$\begin{aligned} \min_{\left(\mathbf{U}_g, \{\mathbf{V}_{(i),g}, \mathbf{U}_{(i),l}, \mathbf{V}_{(i),l}\}\right)} \quad & \sum_{i=1}^N \left\| \hat{\mathbf{M}}_{(i)} - \mathbf{U}_g \mathbf{V}_{(i),g}^T - \mathbf{U}_{(i),l} \mathbf{V}_{(i),l}^T \right\|_F^2 \\ \text{s.t.} \quad & \mathbf{U}_g^T \mathbf{U}_{(i),l} = \mathbf{0}, \quad \forall i \in [N], \end{aligned} \tag{8}$$
where $\hat{\mathbf{M}}_{(i)} = \mathbf{M}_{(i)} - \hat{\mathbf{S}}_{(i),t}$.


Despite its nonconvexity, there exist several iterative algorithms to solve the above
optimization problem, including but not limited to JIVE (Lock et al., 2013), COBE (Zhou
et al., 2015), PerPCA (Shi and Kontar, 2024), PerDL (Liang et al., 2023), and HMF (Shi
et al., 2023). Given the similarity of these methods, we employ the name Joint and Individual
Matrix Factorization (JIMF) to encapsulate the subroutine addressing problem (8).
The versatile JIMF is a meta-algorithm that can be implemented using any of the afore-
mentioned methods, provided that they generate good solutions. Among these algorithms,
PerPCA and HMF are of special interest as they are proved to converge to the optimal
solutions of (8) under suitable conditions. They are also intrinsically federated as most of
the computation can be distributed on N sources where the data are generated.
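
As a rough illustration of the kind of computation a JIMF subroutine performs (this is a simplified one-pass heuristic of ours, not the JIVE, PerPCA, or HMF algorithms used in the paper), one could estimate the shared subspace from the averaged covariance and then deflate each source to obtain its local subspace:

```python
import numpy as np

def jimf_one_pass(M_list, r1, r2):
    """A rough one-pass heuristic for subproblem (8), loosely in the spirit of
    JIVE-type methods (not the paper's JIMF implementation): estimate the shared
    column space from the averaged covariance, then deflate each source and
    extract its local space."""
    n1 = M_list[0].shape[0]
    avg_cov = sum(M @ M.T for M in M_list) / len(M_list)
    eigvals, eigvecs = np.linalg.eigh(avg_cov)
    Ug = eigvecs[:, -r1:]                      # top-r1 eigenvectors: shared features
    results = []
    for M in M_list:
        R = (np.eye(n1) - Ug @ Ug.T) @ M       # remove the shared subspace
        Ul, _, _ = np.linalg.svd(R, full_matrices=False)
        Ul = Ul[:, :r2]                        # local features, orthogonal to Ug
        results.append((Ug @ (Ug.T @ M), Ul @ (Ul.T @ M)))   # (global, local) parts
    return Ug, results
```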
As problem (8) does not have a simple closed-form solution, the algorithms discussed
above are iterative. The iterative algorithms do not output exact optimal solutions. Instead,
they refine the estimates at every iteration. Therefore, there will be a difference between
our algorithm-generated solutions and the optimal solution. To characterize the degree of
such difference, we resort to employing the concept of -optimality, a notion well-known in
the optimization community.

Definition 4 (ε-optimality) Given $(\hat{\mathbf{U}}_g, \{\hat{\mathbf{V}}_{(i),g}, \hat{\mathbf{U}}_{(i),l}, \hat{\mathbf{V}}_{(i),l}\})$ as any global optimal solution
to problem (8) and a constant $\epsilon > 0$, we say $(\hat{\mathbf{U}}^\epsilon_g, \{\hat{\mathbf{V}}^\epsilon_{(i),g}, \hat{\mathbf{U}}^\epsilon_{(i),l}, \hat{\mathbf{V}}^\epsilon_{(i),l}\})$ is an ε-optimal
solution to (8) if it satisfies
$$\left\| \hat{\mathbf{U}}^\epsilon_g \hat{\mathbf{V}}^{\epsilon T}_{(i),g} + \hat{\mathbf{U}}^\epsilon_{(i),l} \hat{\mathbf{V}}^{\epsilon T}_{(i),l} - \hat{\mathbf{U}}_g \hat{\mathbf{V}}_{(i),g}^T - \hat{\mathbf{U}}_{(i),l} \hat{\mathbf{V}}_{(i),l}^T \right\|_\infty \leq \epsilon, \quad \forall i$$
and
$$\hat{\mathbf{U}}^{\epsilon T}_g \hat{\mathbf{U}}^\epsilon_{(i),l} = \mathbf{0}, \quad \forall i.$$

The nonconvexity of (8) gives rise to multiple global optimal solutions. Our definition
of ε-optimality only emphasizes the closeness between the product of the features and the
coefficients and the product of any set of global optimal solutions. As discussed, there exist
multiple methods proposed to solve (8) that demonstrate decent practical performance. In
particular, PerPCA, PerDL, and HMF are proved to converge to the optimal solutions of
(8) at linear rates when initialized properly. Hence, under suitable initializations, PerPCA

and HMF can reach ε-optimality of (8) within $O\left(\log \frac{1}{\epsilon}\right)$ iterations for any value of $\epsilon$. The
details of the two algorithms will be discussed in Appendix A.1.


With the help of subroutine JIMF, the main alternating minimization algorithm proceeds
by optimizing two blocks of variables iteratively. We present the pseudo-code in Algorithm
1.

Algorithm 1 TCMF: alternating minimization

1: Input: observation matrices from $N$ sources $\{\mathbf{M}_{(i)}\}_{i=1}^N$, constant $\lambda_1$, multiplicative factor $\rho \in (0, 1)$, precision $\epsilon$.
2: Initialize $\hat{\mathbf{U}}^\epsilon_{g,0}, \hat{\mathbf{V}}^\epsilon_{(i),g,0}, \hat{\mathbf{U}}^\epsilon_{(i),l,0}, \hat{\mathbf{V}}^\epsilon_{(i),l,0}, \hat{\mathbf{S}}_{(i),0}$ to be zero matrices.
3: for Epoch $t = 1, \ldots, T$ do
4:   for Source $i = 1, \cdots, N$ do
5:     $\hat{\mathbf{S}}_{(i),t} = \mathrm{Hard}_{\lambda_t}\left[ \mathbf{M}_{(i)} - \hat{\mathbf{U}}^\epsilon_{g,t-1} \hat{\mathbf{V}}^{\epsilon T}_{(i),g,t-1} - \hat{\mathbf{U}}^\epsilon_{(i),l,t-1} \hat{\mathbf{V}}^{\epsilon T}_{(i),l,t-1} \right]$
6:   end for
7:   $(\hat{\mathbf{U}}^\epsilon_{g,t}, \{\hat{\mathbf{V}}^\epsilon_{(i),g,t}\}, \{\hat{\mathbf{U}}^\epsilon_{(i),l,t}\}, \{\hat{\mathbf{V}}^\epsilon_{(i),l,t}\}) = \text{JIMF}\left(\{\hat{\mathbf{M}}_{(i)}\} = \{\mathbf{M}_{(i)} - \hat{\mathbf{S}}_{(i),t}\}, \epsilon\right)$
8:   Set $\lambda_{t+1} = \rho \lambda_t + \epsilon$
9: end for
10: Return $\{\hat{\mathbf{U}}^\epsilon_{g,T}, \{\hat{\mathbf{V}}^\epsilon_{(i),g,T}\}, \{\hat{\mathbf{U}}^\epsilon_{(i),l,T}\}, \{\hat{\mathbf{V}}^\epsilon_{(i),l,T}\}\}$.

 
In Algorithm 1, we use $\text{JIMF}\left(\{\hat{\mathbf{M}}_{(i)}\}, \epsilon\right)$ to denote the call to a subroutine that solves (8) to
ε-optimality. In each epoch, the sparse matrices $\hat{\mathbf{S}}_{(i),t}$ are first estimated by hard thresholding.
Then $\hat{\mathbf{M}}_{(i)} = \mathbf{M}_{(i)} - \hat{\mathbf{S}}_{(i),t}$ are calculated, which are subsequently decoupled into the shared
and unique components via a JIMF call. The output of this subroutine is represented as
$(\hat{\mathbf{U}}^\epsilon_{g,t}, \{\hat{\mathbf{V}}^\epsilon_{(i),g,t}\}, \{\hat{\mathbf{U}}^\epsilon_{(i),l,t}\}, \{\hat{\mathbf{V}}^\epsilon_{(i),l,t}\})$, where the superscript $\epsilon$ signifies ε-optimality. These
outputs are used to improve the estimate of $\hat{\mathbf{S}}_{(i)}$ in
the next epoch. After each epoch, we decrease the thresholding parameter $\lambda_t$ by a constant
factor $\rho < 1$ and then add a constant $\epsilon$. The inclusion of $\epsilon$ in $\lambda_{t+1}$ is necessary to ensure that the
estimated $\hat{\mathbf{S}}_{(i),t+1}$ does not contain any false positive entries. By incorporating $\epsilon$ into $\lambda_{t+1}$,
we guarantee that the inexactness of the JIMF outputs does not undermine the false positive
rate of the entries in $\hat{\mathbf{S}}_{(i),t+1}$.
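
To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the outer TCMF loop; the `jimf` argument stands in for any subroutine that solves (8) (e.g., HMF or PerPCA), and its signature is an assumption of ours rather than an interface from the released code.

```python
import numpy as np

def hard_threshold(X, lam):
    return np.where(np.abs(X) > lam, X, 0.0)

def tcmf(M_list, jimf, lam1, rho=0.99, eps=1e-6, T=50):
    """Sketch of Algorithm 1 (TCMF). `jimf` must return, for each source i,
    low-rank estimates (Lg_i, Ll_i) of the global and local parts of M_i - S_i."""
    N = len(M_list)
    Lg = [np.zeros_like(M) for M in M_list]   # global low-rank estimates Ug Vg^T
    Ll = [np.zeros_like(M) for M in M_list]   # local low-rank estimates Ul Vl^T
    lam = lam1
    for t in range(T):
        # Step 1: hard-threshold the residuals to estimate the sparse noise
        S = [hard_threshold(M_list[i] - Lg[i] - Ll[i], lam) for i in range(N)]
        # Step 2: joint/individual factorization of the denoised matrices
        Lg, Ll = jimf([M_list[i] - S[i] for i in range(N)], eps)
        # Step 3: shrink the threshold
        lam = rho * lam + eps
    return Lg, Ll, S
```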
The per-epoch computational complexity of Algorithm 1 is $O(n_1 n_2 N)$ when the ranks
$r_1$ and $r_2$ are small, i.e., $r_1, r_2 \ll n_1, n_2$. To see this, we can add up the computational
complexities of hard-thresholding and JIMF. Element-wise hard-thresholding requires $O(n_1 n_2)$
computations for each source. Efficient implementations of the subroutine JIMF, such as
PerPCA and HMF, converge to ε-optimal solutions within $O(\log \frac{1}{\epsilon})$ steps, where
each step requires $O(n_1 n_2)$ computations. Therefore, the per-epoch computational complexity
of TCMF is $O(n_1 n_2 N)$, where log factors are omitted.
Furthermore, if JIMF and hard-thresholding are distributed among the N sources, TCMF
can further exploit parallel computation to reduce the running time. In the regime where com-
munication cost is negligible, the per-iteration total running time scales as O (n1 n2 + N n1 ).
We will show later that such a design can ensure that the estimation error diminishes
linearly. A pictorial representation of Algorithm 1 is plotted in the left graph of Figure 1.


6. Convergence Analysis
In this section, we will analyze the convergence of Algorithm 1. Our theorem characterizes the
conditions under which Algorithm 1 converges linearly to the ground truth. We additionally
introduce $\sigma_{\max} > 0$ to denote an upper bound on the singular values of $\{\mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g} +
\mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l}\}_{i=1}^N$, and $\sigma_{\min} > 0$ to denote a lower bound on the smallest nonzero singular
values of $\{\mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g} + \mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l}\}_{i=1}^N$. For simplicity, we assume $n_{2,(i)} = n_2$, $r_{2,(i)} = r_2$,
and $r = r_1 + r_2$ in this section.

Theorem 5 (Convergence of Algorithm 1) Consider the true model (1) with SVD defined in (3),
where the nonzero singular values of $\mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g} + \mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l}$ are lower bounded
by $\sigma_{\min} > 0$ and upper bounded by $\sigma_{\max} \geq \sigma_{\min}$ for each source $i$. Suppose that the following
conditions are satisfied:

• (µ-incoherency) The matrices $\mathbf{H}^\star_g$ and $\left\{\mathbf{H}^\star_{(i),l}, \mathbf{W}^\star_{(i),g}, \mathbf{W}^\star_{(i),l}\right\}_{i=1}^N$ are µ-incoherent for a constant $\mu > 0$.

• (θ-misalignment) The local feature matrices $\left\{\mathbf{U}^\star_{(i),l}\right\}_{i=1}^N$ are θ-misaligned for a constant $0 < \theta < 1$.

• (α-sparsity) The matrices $\left\{\mathbf{S}^\star_{(i)}\right\}_{i=1}^N$ are α-sparse for some $\alpha = O\left(\frac{\theta^2}{\mu^4 r^2}\right)$, where $r = r_1 + r_2$.

Then, there exist constants $C_{g,1}, C_{g,2}, C_{l,1}, C_{l,2}, C_{s,1}, C_{s,2} > 0$ and $\rho_{\min} = O\left(\frac{\sqrt{\alpha}\, \mu^2 r}{\sqrt{\theta}}\right) < 1$
such that the iterations of Algorithm 1 with $\lambda_1 = \frac{\sigma_{\max} \mu^2 r}{\sqrt{n_1 n_2}}$, $\epsilon \leq \lambda_1 (1 - \rho_{\min})$, and $1 - \frac{\epsilon}{\lambda_1} > \rho \geq \rho_{\min}$ satisfy
$$\left\| \hat{\mathbf{U}}^\epsilon_{g,t} \hat{\mathbf{V}}^{\epsilon T}_{(i),g,t} - \mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g} \right\|_\infty \leq C_{g,1} \rho^t + C_{g,2}\, \epsilon \tag{9}$$
$$\left\| \hat{\mathbf{U}}^\epsilon_{(i),l,t} \hat{\mathbf{V}}^{\epsilon T}_{(i),l,t} - \mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l} \right\|_\infty \leq C_{l,1} \rho^t + C_{l,2}\, \epsilon \tag{10}$$
$$\left\| \hat{\mathbf{S}}_{(i),t} - \mathbf{S}^\star_{(i)} \right\|_\infty \leq C_{s,1} \rho^t + C_{s,2}\, \epsilon. \tag{11}$$


Theorem 5 presents a set of sufficient conditions under which the model is identifiable,
and Algorithm 1 converges to the ground truth at a linear rate. As discussed in Section 4,
these conditions are indeed sufficient to guarantee the identifiability of the true model. In
particular, µ-incoherency is required for disentangling global and local components from
noise, whereas θ-misalignment is needed to separate local and global components. Moreover,
there is a natural trade-off between the parameters µ, θ, and α: the upper bound on the
sparsity level α, $O\left(\frac{\theta^2}{\mu^4 r^2}\right)$, is proportional to $\theta^2$, implying that more alignment among local
feature matrices can be tolerated only at the expense of sparser noise matrices. Similarly,
α is inversely proportional to $\mu^4$, indicating that more coherency in the local and global
components is only possible with sparser noise matrices. We also highlight the dependency
of α on the rank r; such dependency is required even in the standard settings of robust
PCA (Netrapalli et al., 2014; Chandrasekaran et al., 2011; Hsu et al., 2011), albeit with
a milder condition on r. Finally, the scaling of α does not depend on the number of
sources N , suggesting that the convergence guarantees provided by Theorem 5 are valid for
extremely large datasets.
Two important observations are in order. First, we do not impose any constraint on the
norm or sign of the sparse noise $\mathbf{S}^\star_{(i)}$. Thus, Theorem 5 holds for arbitrarily large noise
values. Second, at every epoch, Algorithm 1 solves the inner optimization problem (8) via
JIMF to ε-optimality. Also, the convergence of Algorithm 1 is contingent upon the precision
of the JIMF output: the $\ell_\infty$ norm of the optimization error should not be larger than $O(\lambda_1)$.
Such a requirement is not strong, as even the trivial solution $\mathbf{U}_g, \mathbf{V}_{(i),g}, \mathbf{U}_{(i),l}, \mathbf{V}_{(i),l} = \mathbf{0}$ is
$\lambda_1$-optimal. One should expect many algorithms to perform much better than the trivial
solution. Indeed, methods including PerPCA and PerDL are proved to output ε-optimal
solutions for arbitrarily small $\epsilon$ within logarithmically many iterations, thus satisfying the requirement.
In practice, heuristic methods including JIVE or COBE can output reasonable solutions
that may also satisfy the requirement in Theorem 5.
In the next section, we provide the sketch of the proof for Theorem 5.

6.1 Proof Sketch of Theorem 5


Algorithm 1 is essentially an alternating minimization algorithm comprising a hard-
thresholding step, followed by a joint and individual matrix factorization step. Our overar-
ching goal is to control the estimation error at each iteration of the algorithm, showing that
it decreases by a constant factor after every epoch. To this goal, we make extensive use of
the error matrix $\mathbf{E}_{(i),t}$ defined as $\mathbf{E}_{(i),t} = \mathbf{S}^\star_{(i)} - \hat{\mathbf{S}}_{(i),t}$ for every client $i$.
In the ideal case where $\mathbf{E}_{(i),t} = \mathbf{0}$, the global solution of (5) coincides with the true
shared and unique components, which is guaranteed by Theorem 1 in Shi and Kontar (2024).
Therefore, it is crucial to control the behavior of $\{\mathbf{E}_{(i),t}\}_{i=1}^N$ and its effect on the recovered
solution throughout the course of the algorithm. We define $\mathbf{L}^\star_{(i)} = \mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g} + \mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l}$
and $\hat{\mathbf{L}}_{(i),t} = \hat{\mathbf{U}}_{g,t} \hat{\mathbf{V}}_{(i),g,t}^T + \hat{\mathbf{U}}_{(i),l,t} \hat{\mathbf{V}}_{(i),l,t}^T$ as the true and estimated low-rank components
of client $i$. Similarly, $\hat{\mathbf{L}}^\epsilon_{(i),t} = \hat{\mathbf{U}}^\epsilon_{g,t} \hat{\mathbf{V}}^{\epsilon T}_{(i),g,t} + \hat{\mathbf{U}}^\epsilon_{(i),l,t} \hat{\mathbf{V}}^{\epsilon T}_{(i),l,t}$ is the reconstructed low-rank
component from ε-optimal estimates. The following steps outline the sketch of our proof:
Step 1: α-sparsity of the initial error. At the first iteration, the threshold level
$\lambda_1$ is large, enforcing $\mathrm{supp}\left(\hat{\mathbf{S}}_{(i),1}\right) \subseteq \mathrm{supp}\left(\mathbf{S}^\star_{(i)}\right)$, which in turn implies $\mathrm{supp}\left(\mathbf{E}_{(i),1}\right) \subseteq
\mathrm{supp}\left(\mathbf{S}^\star_{(i)}\right)$. Therefore, the initial error matrix $\mathbf{E}_{(i),1}$ is also α-sparse.


Step 2: Error reduction via JIMF. Suppose that $\mathbf{E}_{(i),t}$ is α-sparse. In Step 7, $\hat{\mathbf{L}}^\epsilon_{(i),t}$
is obtained by applying JIMF to $\mathbf{M}_{(i)} - \hat{\mathbf{S}}_{(i),t} = \mathbf{L}^\star_{(i)} + \mathbf{E}_{(i),t}$. Note that the input to JIMF
is the true low-rank component perturbed by an α-sparse matrix $\mathbf{E}_{(i),t}$. One of our key
contributions is to show that $\left\|\mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}_{(i),t}\right\|_\infty$ is much smaller than $\left\|\mathbf{E}_{(i),t}\right\|_\infty$, provided that
the true local and global components are µ-incoherent and $\mathbf{E}_{(i),t}$ is α-sparse. This fact is
delineated in the following key lemma.

Lemma 6 (Error reduction via JIMF (informal)) Suppose that the conditions of Theorem 5
are satisfied. Moreover, suppose that $\mathbf{E}_{(i),t}$ is α-sparse for each client $i$. We have
$$\left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}_{(i),t} \right\|_\infty \leq C \cdot \frac{\sqrt{\alpha}\, \mu^2 r}{\sqrt{\theta}} \cdot \max_j \left\{ \left\| \mathbf{E}_{(j),t} \right\|_\infty \right\},$$
where C > 0 is a constant.

Indeed, proving Lemma 6 is particularly daunting since $\hat{\mathbf{L}}_{(i),t}$ does not have a closed-form
solution. We will elaborate on the major techniques to prove Lemma 6 in Section 6.2.
Suppose that α is small enough such that $C \cdot \frac{\sqrt{\alpha}\, \mu^2 r}{\sqrt{\theta}} \leq \frac{\rho}{2}$. Then, Lemma 6 implies
that $\left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}_{(i),t} \right\|_\infty \leq \frac{\rho}{2} \max_i \left\{ \left\| \mathbf{E}_{(i),t} \right\|_\infty \right\}$. From the definition of ε-optimality, we know
$\left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}^\epsilon_{(i),t} \right\|_\infty \leq \frac{\rho}{2} \max_i \left\{ \left\| \mathbf{E}_{(i),t} \right\|_\infty \right\} + \epsilon$. This implies that the $\ell_\infty$ norm of the error in
the output of JIMF shrinks by a factor of $\frac{\rho}{2}$ compared with the error in the input $\left\| \mathbf{E}_{(i),t} \right\|_\infty$
(modulo an additive factor $\epsilon$). As will be discussed next, this shrinkage in the $\ell_\infty$ norm of
the error is essential for the exact sparsity recovery of the noise matrix.
Step 3: Preservation of sparsity via hard-thresholding. Given that
$\left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}^\epsilon_{(i),t} \right\|_\infty \leq \frac{\rho}{2} \max_i \left\{ \left\| \mathbf{E}_{(i),t} \right\|_\infty \right\} + \epsilon$, our next goal is to show that $\mathrm{supp}\left(\mathbf{E}_{(i),t+1}\right) \subseteq
\mathrm{supp}\left(\mathbf{S}^\star_{(i)}\right)$ (i.e., $\mathbf{E}_{(i),t+1}$ remains α-sparse) and $\max_i \left\{ \left\| \mathbf{E}_{(i),t+1} \right\|_\infty \right\} \leq 2\lambda_{t+1}$. To prove
$\mathrm{supp}\left(\mathbf{E}_{(i),t+1}\right) \subseteq \mathrm{supp}\left(\mathbf{S}^\star_{(i)}\right)$, suppose that $\left(\mathbf{S}^\star_{(i)}\right)_{kl} = 0$ for some $(k, l)$. We have
$\left(\hat{\mathbf{S}}_{(i),t+1}\right)_{kl} \neq 0$ only if $\left|\left(\mathbf{M}_{(i)} - \hat{\mathbf{L}}^\epsilon_{(i),t}\right)_{kl}\right| = \left|\left(\mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}^\epsilon_{(i),t}\right)_{kl}\right| > \lambda_{t+1}$. On the other
hand, in the Appendix, we show that $\max_i \left\{ \left\| \mathbf{E}_{(i),t} \right\|_\infty \right\} \leq 2\lambda_t$. This implies that
$\left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}^\epsilon_{(i),t} \right\|_\infty \leq \frac{\rho}{2} \max_i \left\{ \left\| \mathbf{E}_{(i),t} \right\|_\infty \right\} + \epsilon \leq \rho \lambda_t + \epsilon = \lambda_{t+1}$. This in turn leads to
$\left(\hat{\mathbf{S}}_{(i),t+1}\right)_{kl} = \left(\mathbf{E}_{(i),t+1}\right)_{kl} = 0$, and hence, $\mathrm{supp}\left(\mathbf{E}_{(i),t+1}\right) \subseteq \mathrm{supp}\left(\mathbf{S}^\star_{(i)}\right)$. Finally, according
to the definition of hard-thresholding, we have $\left|\left(\hat{\mathbf{S}}_{(i),t+1} - \left(\mathbf{S}^\star_{(i)} + \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}^\epsilon_{(i),t}\right)\right)_{kl}\right| \leq \lambda_{t+1}$,
which, by the triangle inequality, yields $\left|\left(\mathbf{E}_{(i),t+1}\right)_{kl}\right| \leq \left|\left(\mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}^\epsilon_{(i),t}\right)_{kl}\right| + \lambda_{t+1} \leq 2\lambda_{t+1}$.
Step 4: Establishing linear convergence. Repeating Steps 2 and 3, we have
$\max_i \left\{ \left\| \mathbf{E}_{(i),t+1} \right\|_\infty \right\} \leq 2\lambda_{t+1}$ and $\left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}^\epsilon_{(i),t} \right\|_\infty \leq \lambda_{t+1}$ for all $t$. Noting that
$\lambda_t = \rho \lambda_{t-1} + \epsilon = \epsilon + \rho \epsilon + \rho^2 \lambda_{t-2} = \cdots = \epsilon + \rho \epsilon + \rho^2 \epsilon + \cdots + \rho^{t-2} \epsilon + \rho^{t-1} \lambda_1 \leq \frac{\epsilon}{1-\rho} + \rho^{t-1} \lambda_1$, we establish
that $\max_i \left\{ \left\| \mathbf{E}_{(i),t} \right\|_\infty \right\} = O(\epsilon)$ and $\left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}^\epsilon_{(i),t} \right\|_\infty = O(\epsilon)$ in $O\left(\frac{\log(\lambda_1/\epsilon)}{\log(1/\rho)}\right)$ iterations.

Step 5: Untangling global and local components. Under the misalignment condition,
a small error in the joint low-rank components, $\left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}_{(i),t} \right\|_F$, indicates that the errors of both the shared
component and the unique component are small. More specifically, Theorem 1 in Shi and
Kontar (2024) indicates that $\left\| \mathbf{U}^\star_g \mathbf{V}^{\star T}_{(i),g} - \hat{\mathbf{U}}_{g,t} \hat{\mathbf{V}}_{(i),g,t}^T \right\|_F, \left\| \mathbf{U}^\star_{(i),l} \mathbf{V}^{\star T}_{(i),l} - \hat{\mathbf{U}}_{(i),l,t} \hat{\mathbf{V}}_{(i),l,t}^T \right\|_F =
O\left( \left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}_{(i),t} \right\|_F \right)$. Since $\left\| \mathbf{L}^\star_{(i)} - \hat{\mathbf{L}}_{(i),t} \right\|_F$ shrinks linearly to a small constant, we
can conclude that the estimation errors for the shared and unique features also decrease linearly
to $O(\epsilon)$.

6.2 Proof of Lemma 6


At the crux of our proof for Theorem 5 lies Lemma 6. In its essence, Lemma 6 seeks to
answer the following question: if the input to JIMF is corrupted by α-sparse noise matrices
$\{\mathbf{E}_{(i)}\}$, how will the recovered solutions change in terms of the $\ell_\infty$ norm? We highlight that
standard matrix perturbation analyses, such as the classical Davis-Kahan bound (Bhatia,
2013) as well as the more recent $\ell_\infty$ bound (Fan et al., 2018), fall short of answering this
question for two main reasons. First, these bounds are overly pessimistic and cannot take
into account the underlying sparsity structure of the noise. Second, they often control the
singular vectors and singular values of the perturbed matrices, whereas the optimal solutions
to problem (8) generally do not correspond to the singular vectors of M̂(i) .
To address these challenges, we characterize the optimal solutions of (8) by analyzing
its Karush–Kuhn–Tucker (KKT) conditions. We establish the KKT conditions and ensure
the linear independence constraint qualification (LICQ). Afterward, we obtain closed-form
solutions for the KKT conditions in the form of convergent series and use these series to
control the element-wise perturbation of the solutions.
KKT conditions. The following lemma shows two equivalent formulations for the KKT
conditions. For convenience, we drop the subscript t in our subsequent arguments.

Lemma 7 Suppose that $\{\hat{\mathbf{U}}_g, \hat{\mathbf{U}}_{(i),l}, \hat{\mathbf{V}}_{(i),g}, \hat{\mathbf{V}}_{(i),l}\}$ is the optimal solution to problem (8)
and $\hat{\mathbf{M}}_{(i)}$ has rank at least $r_1 + r_2$. We have
$$\sum_{i=1}^N \left( \hat{\mathbf{U}}_g \hat{\mathbf{V}}_{(i),g}^T + \hat{\mathbf{U}}_{(i),l} \hat{\mathbf{V}}_{(i),l}^T - \hat{\mathbf{M}}_{(i)} \right) \hat{\mathbf{V}}_{(i),g} = \mathbf{0} \tag{12a}$$
$$\left( \hat{\mathbf{U}}_g \hat{\mathbf{V}}_{(i),g}^T + \hat{\mathbf{U}}_{(i),l} \hat{\mathbf{V}}_{(i),l}^T - \hat{\mathbf{M}}_{(i)} \right) \hat{\mathbf{V}}_{(i),l} = \mathbf{0} \tag{12b}$$
$$\left( \hat{\mathbf{U}}_g \hat{\mathbf{V}}_{(i),g}^T + \hat{\mathbf{U}}_{(i),l} \hat{\mathbf{V}}_{(i),l}^T - \hat{\mathbf{M}}_{(i)} \right)^T \hat{\mathbf{U}}_{(i),l} = \mathbf{0} \tag{12c}$$
$$\left( \hat{\mathbf{U}}_g \hat{\mathbf{V}}_{(i),g}^T + \hat{\mathbf{U}}_{(i),l} \hat{\mathbf{V}}_{(i),l}^T - \hat{\mathbf{M}}_{(i)} \right)^T \hat{\mathbf{U}}_g = \mathbf{0} \tag{12d}$$
$$\hat{\mathbf{U}}_{(i),l}^T \hat{\mathbf{U}}_{(i),l} = \mathbf{I}, \quad \hat{\mathbf{U}}_g^T \hat{\mathbf{U}}_g = \mathbf{I}, \quad \hat{\mathbf{U}}_{(i),l}^T \hat{\mathbf{U}}_g = \mathbf{0}. \tag{12e}$$

Moreover, there exist positive diagonal matrices $\boldsymbol{\Lambda}_1 \in \mathbb{R}^{r_1 \times r_1}$, $\boldsymbol{\Lambda}_{2,(i)} \in \mathbb{R}^{r_2 \times r_2}$, and $\boldsymbol{\Lambda}_{3,(i)} \in
\mathbb{R}^{r_1 \times r_2}$ such that the optimality conditions imply:
$$\hat{\mathbf{M}}_{(i)} \hat{\mathbf{M}}_{(i)}^T \hat{\mathbf{H}}_{(i),l} = \hat{\mathbf{H}}_{(i),l} \boldsymbol{\Lambda}_{2,(i)} + \hat{\mathbf{H}}_g \boldsymbol{\Lambda}_{3,(i)} \tag{13a}$$
$$\frac{1}{N} \sum_{i=1}^N \hat{\mathbf{M}}_{(i)} \hat{\mathbf{M}}_{(i)}^T \hat{\mathbf{H}}_g = \hat{\mathbf{H}}_g \boldsymbol{\Lambda}_1 + \frac{1}{N} \sum_{i=1}^N \hat{\mathbf{H}}_{(i),l} \boldsymbol{\Lambda}_{3,(i)}^T \tag{13b}$$
$$\hat{\mathbf{H}}_g^T \hat{\mathbf{H}}_g = \mathbf{I}, \quad \hat{\mathbf{H}}_{(i),l}^T \hat{\mathbf{H}}_{(i),l} = \mathbf{I}, \quad \hat{\mathbf{H}}_g^T \hat{\mathbf{H}}_{(i),l} = \mathbf{0}, \tag{13c}$$
for some $\hat{\mathbf{H}}_g \in \mathcal{O}^{n_1 \times r_1}$ that spans the same subspace as $\hat{\mathbf{U}}_g$, and some $\hat{\mathbf{H}}_{(i),l} \in \mathcal{O}^{n_1 \times r_2}$ that
spans the same subspace as $\hat{\mathbf{U}}_{(i),l}$.

The $\boldsymbol{\Lambda}_{3,(i)}$ term in (13) complicates the relation between $\hat{\mathbf{H}}_g$ and $\hat{\mathbf{H}}_{(i),l}$. When $\boldsymbol{\Lambda}_{3,(i)}$ is
nonzero, one can see that neither $\hat{\mathbf{H}}_g$ nor $\hat{\mathbf{H}}_{(i),l}$ spans an invariant subspace of $\hat{\mathbf{M}}_{(i)} \hat{\mathbf{M}}_{(i)}^T$. As
a consequence, the perturbation analysis of Netrapalli et al. (2014) based on characteristic
equations is not applicable. To alleviate this issue, we provide more delicate control over
the solution set of (13).


Solutions to KKT conditions The characterization (13) contains structural information
about $\hat{\mathbf{H}}_g$ and $\hat{\mathbf{H}}_{(i),l}$ that can be exploited for the perturbation analysis. To see this, recall the
definition $\hat{\mathbf{M}}_{(i)} = \mathbf{M}_{(i)} - \hat{\mathbf{S}}_{(i)}$. Combining this definition with (13) leads to
$$\begin{cases} \left( \mathbf{E}_{(i),t} \mathbf{L}^{\star T}_{(i)} + \mathbf{L}^\star_{(i)} \mathbf{E}_{(i),t}^T + \mathbf{E}_{(i),t} \mathbf{E}_{(i),t}^T \right) \hat{\mathbf{H}}_{(i),l} - \hat{\mathbf{H}}_{(i),l} \boldsymbol{\Lambda}_{2,(i)} = - \left( \mathbf{L}^\star_{(i)} \mathbf{L}^{\star T}_{(i)} \hat{\mathbf{H}}_{(i),l} - \hat{\mathbf{H}}_g \boldsymbol{\Lambda}_{3,(i)} \right) \\[6pt] \left( \frac{1}{N} \sum_{i=1}^N \left( \mathbf{E}_{(i),t} \mathbf{L}^{\star T}_{(i)} + \mathbf{L}^\star_{(i)} \mathbf{E}_{(i),t}^T + \mathbf{E}_{(i),t} \mathbf{E}_{(i),t}^T \right) \right) \hat{\mathbf{H}}_g - \hat{\mathbf{H}}_g \boldsymbol{\Lambda}_1 = - \left( \frac{1}{N} \sum_{i=1}^N \mathbf{L}^\star_{(i)} \mathbf{L}^{\star T}_{(i)} \hat{\mathbf{H}}_g - \frac{1}{N} \sum_{i=1}^N \hat{\mathbf{H}}_{(i),l} \boldsymbol{\Lambda}_{3,(i)}^T \right). \end{cases} \tag{14}$$
We can show that the norms of the input errors $\mathbf{E}_{(i),t}$, $\boldsymbol{\Lambda}_{3,(i)}$ are upper bounded by $O(\sqrt{\alpha})$.
Thus, when the sparsity parameter α is not too large, we can write the solutions to (14) as
a series in α.
In the limit $\alpha = 0$, we have $\mathbf{E}_{(i),t} = \mathbf{0}$; thus, we can solve for the leading terms of $\hat{\mathbf{H}}_{(i),l}$ and
$\hat{\mathbf{H}}_g$ from (14). When α is not too large, we can prove the following lemma.
Lemma 8 (informal) If α is not too large, then $\hat{\mathbf{H}}_g$ and $\hat{\mathbf{H}}_{(i),l}$ introduced in Lemma 7
satisfy
$$\begin{cases} \hat{\mathbf{H}}_g = \left( \frac{1}{N} \sum_{i=1}^N \mathbf{L}^\star_{(i)} \mathbf{L}^{\star T}_{(i)} \left( \hat{\mathbf{H}}_g - \hat{\mathbf{H}}_{(i),l} \boldsymbol{\Lambda}_{2,(i)}^{-1} \boldsymbol{\Lambda}_{3,(i)}^T \right) \right) \boldsymbol{\Lambda}_6^{-1} + O(\sqrt{\alpha}) \\[8pt] \hat{\mathbf{H}}_{(i),l} = \mathbf{L}^\star_{(i)} \mathbf{L}^{\star T}_{(i)} \hat{\mathbf{H}}_{(i),l} \boldsymbol{\Lambda}_{2,(i)}^{-1} - \left( \frac{1}{N} \sum_{j=1}^N \mathbf{L}^\star_{(j)} \mathbf{L}^{\star T}_{(j)} \left( \hat{\mathbf{H}}_g - \hat{\mathbf{H}}_{(j),l} \boldsymbol{\Lambda}_{2,(j)}^{-1} \boldsymbol{\Lambda}_{3,(j)}^T \right) \right) \boldsymbol{\Lambda}_6^{-1} \boldsymbol{\Lambda}_{3,(i)} \boldsymbol{\Lambda}_{2,(i)}^{-1} + O(\sqrt{\alpha}), \end{cases} \tag{15}$$
where the $O(\sqrt{\alpha})$'s are terms whose Frobenius norm and $\ell_\infty$ norm are upper bounded by $O(\sqrt{\alpha})$,
and $\boldsymbol{\Lambda}_6$ is defined as $\boldsymbol{\Lambda}_6 = \boldsymbol{\Lambda}_1 - \frac{1}{N} \sum_{j=1}^N \boldsymbol{\Lambda}_{3,(j)} \boldsymbol{\Lambda}_{2,(j)}^{-1} \boldsymbol{\Lambda}_{3,(j)}^T$.

The formal version of Lemma 8 and its proof are relegated to the appendix. We now briefly
introduce our methodology for deriving the solutions in Lemma 8. For matrices $\mathbf{A}, \mathbf{B}, \mathbf{X}, \mathbf{Y}$
satisfying the Sylvester equation $\mathbf{A}\mathbf{X} - \mathbf{X}\mathbf{B} = -\mathbf{Y}$, if the spectra of $\mathbf{A}$ and $\mathbf{B}$ are separated,
i.e., $\sigma_{\max}(\mathbf{A}) < \sigma_{\min}(\mathbf{B})$, then the solution can be written as $\mathbf{X} = \sum_{p=0}^\infty \mathbf{A}^p \mathbf{Y} \mathbf{B}^{-1-p}$. We
apply this solution form to (14) and iteratively expand $\hat{\mathbf{H}}_g$ and $\hat{\mathbf{H}}_{(i),l}$. The exact forms
of the resulting series are shown in (41) and (42) in the appendix. In the series, each
term is a product of a group of sparse matrices, an incoherent matrix, and some remaining
terms. Based on the special structure of the series, we can calculate upper bounds on
the Frobenius norm and maximum row norm of each term in the series. The leading
terms are simply $\left( \frac{1}{N} \sum_{i=1}^N \mathbf{L}^\star_{(i)} \mathbf{L}^{\star T}_{(i)} \left( \hat{\mathbf{H}}_g - \hat{\mathbf{H}}_{(i),l} \boldsymbol{\Lambda}_{2,(i)}^{-1} \boldsymbol{\Lambda}_{3,(i)}^T \right) \right) \boldsymbol{\Lambda}_6^{-1}$ and
$\mathbf{L}^\star_{(i)} \mathbf{L}^{\star T}_{(i)} \hat{\mathbf{H}}_{(i),l} \boldsymbol{\Lambda}_{2,(i)}^{-1} - \left( \frac{1}{N} \sum_{j=1}^N \mathbf{L}^\star_{(j)} \mathbf{L}^{\star T}_{(j)} \left( \hat{\mathbf{H}}_g - \hat{\mathbf{H}}_{(j),l} \boldsymbol{\Lambda}_{2,(j)}^{-1} \boldsymbol{\Lambda}_{3,(j)}^T \right) \right) \boldsymbol{\Lambda}_6^{-1} \boldsymbol{\Lambda}_{3,(i)} \boldsymbol{\Lambda}_{2,(i)}^{-1}$. By summing up the norms of all
remaining higher-order terms in the series and applying a few basic inequalities for geometric
series, we can prove the result in Lemma 8.
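
As a quick numerical sanity check of the series representation used here (an illustration of ours, not part of the paper), one can verify on random matrices that the truncated series $\sum_{p=0}^{P} \mathbf{A}^p \mathbf{Y} \mathbf{B}^{-1-p}$ indeed solves $\mathbf{A}\mathbf{X} - \mathbf{X}\mathbf{B} = -\mathbf{Y}$ when the spectra are separated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# A has small spectrum, B has large spectrum, so sigma_max(A) < sigma_min(B)
A = 0.1 * rng.standard_normal((n, n))
B = 5.0 * np.eye(n) + 0.5 * rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))

# Truncated series solution X = sum_p A^p Y B^{-(1+p)}
X = np.zeros((n, n))
Binv = np.linalg.inv(B)
term = Y @ Binv
for p in range(50):
    X += term
    term = A @ term @ Binv   # next term: A^{p+1} Y B^{-(p+2)}

print(np.linalg.norm(A @ X - X @ B + Y))   # should be ~0
```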


Perturbations on the optimal solutions to (8) Lemma 8 then allows us to establish
the inequality in Lemma 6. In epoch $t$, we know $\hat{\mathbf{L}}_{(i),t} = \hat{\mathbf{U}}_{g,t} \hat{\mathbf{V}}_{(i),g,t}^T + \hat{\mathbf{U}}_{(i),l,t} \hat{\mathbf{V}}_{(i),l,t}^T$, where
$\hat{\mathbf{U}}_{g,t}, \hat{\mathbf{V}}_{(i),g,t}, \hat{\mathbf{U}}_{(i),l,t}, \hat{\mathbf{V}}_{(i),l,t}$ are the optimal solutions to the subproblem (8). By Lemma 7,
one can replace $\hat{\mathbf{U}}_{g,t}, \hat{\mathbf{V}}_{(i),g,t}, \hat{\mathbf{U}}_{(i),l,t}, \hat{\mathbf{V}}_{(i),l,t}$ by $\hat{\mathbf{H}}_g$ and $\hat{\mathbf{H}}_{(i),l}$, and rewrite $\hat{\mathbf{L}}_{(i),t}$ as
$$\hat{\mathbf{L}}_{(i),t} = \hat{\mathbf{H}}_{g,t} \hat{\mathbf{H}}_{g,t}^T \hat{\mathbf{M}}_{(i)} + \hat{\mathbf{H}}_{(i),l,t} \hat{\mathbf{H}}_{(i),l,t}^T \hat{\mathbf{M}}_{(i)}.$$
Then, we can replace $\hat{\mathbf{H}}_{g,t}$ and $\hat{\mathbf{H}}_{(i),l,t}$ by the Taylor-like series described in Lemma
8. The error between $\mathbf{L}^\star_{(i)}$ and $\hat{\mathbf{L}}_{(i),t}$ can be written as the summation of a few terms.
The leading term is $\mathbf{H}^\star_g \mathbf{H}^{\star T}_g \mathbf{L}^\star_{(i)} + \mathbf{H}^\star_{(i),l} \mathbf{H}^{\star T}_{(i),l} \mathbf{L}^\star_{(i)}$, which is identical to $\mathbf{L}^\star_{(i)}$ because
of the SVD (3). The remaining terms are errors resulting from the $O(\sqrt{\alpha})$ terms in (15)
and $\mathbf{E}_{(i)}$. Each of the error terms possesses a special structure that allows us to derive
an upper bound on its $\ell_\infty$ norm. By summing up all these bounds, we can show that
$\left\| \hat{\mathbf{L}}_{(i),t} - \mathbf{L}^\star_{(i)} \right\|_\infty \leq O\left( \sqrt{\alpha} \max_j \left\| \mathbf{E}_{(j),t} \right\|_\infty \right)$. The detailed calculations of the upper bounds
on the $\ell_\infty$ norm of the error terms are long and repetitive, and are thus relegated to the proof of Lemma
20 in the Appendix.

7. Numerical Experiments
In this section, we investigate the numerical performance of TCMF on several datasets. We
first use synthetic datasets to verify the convergence in Theorem 5 and validate TCMF’s
capability in recovering the common and individual features from noisy observations. Then,
we use two examples of noisy video segmentation and anomaly detection to illustrate the
utility of common, unique, and noise components. We implement Algorithm 1 with HMF
(Shi et al., 2023) as its subroutine JIMF. Experiments in this section are performed on a
desktop with 11th Gen Intel(R) i7-11700KF and NVIDIA GeForce RTX 3080. Code is
available in the linked Github repository.

7.1 Exact Recovery on Synthetic Data


On the synthetic dataset, we simulate the data generation process in (1). We use N = 100
sources and set the data dimension of M(i) to 15 × 1000 in each source. We randomly
generate r1 = 3 global features and r2 = 3 local features for each source. The local features
are first generated randomly, then deflated to be orthogonal to the global ones. The sparse
noise matrix S? (i) is randomly generated from the Bernoulli model, i.e., each entry of S? (i)
is nonzero with probability p and zero with probability 1 − p. We use p as a proxy of

the sparsity parameter α = p. The value of each entry in S? (i) is randomly sampled from
{−100, 100} with equal probability. Next, we use (1) to construct the observation matrix.
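For concreteness, the following Python sketch (our own illustration with assumed variable names, not the authors' released code) reproduces this synthetic generation process under the stated settings.

```python
# Sketch of the synthetic data generation in (1):
# M_(i) = U*_g V*_(i),g^T + U*_(i),l V*_(i),l^T + S*_(i), with local features deflated
# to be orthogonal to the global ones and Bernoulli-sparse noise valued in {-100, 100}.
import numpy as np

rng = np.random.default_rng(0)
N, n1, n2, r1, r2, p = 100, 15, 1000, 3, 3, 0.01

# Global features: a random orthonormal n1 x r1 basis shared by all sources.
U_g, _ = np.linalg.qr(rng.standard_normal((n1, r1)))

M, S_true = [], []
for i in range(N):
    # Local features: random, then deflated to be orthogonal to the global ones.
    U_l = rng.standard_normal((n1, r2))
    U_l -= U_g @ (U_g.T @ U_l)
    U_l, _ = np.linalg.qr(U_l)

    V_g = rng.standard_normal((n2, r1))
    V_l = rng.standard_normal((n2, r2))

    # Bernoulli-sparse noise: each entry nonzero w.p. p, value +/-100 with equal probability.
    mask = rng.random((n1, n2)) < p
    S = mask * rng.choice([-100.0, 100.0], size=(n1, n2))

    S_true.append(S)
    M.append(U_g @ V_g.T + U_l @ V_l.T + S)
```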
With the generated {M(i) }, we run Algorithm 1 with ρ = 0.99 to estimate local, global,
and sparse components. The subroutine JIMF in Algorithm 1 is implemented by HMF with
spectral initialization. As discussed in Appendix A.1, HMF is an iterative algorithm. In
practice, for each call of HMF, we run 500 iterations with a constant stepsize of 0.005, which takes around 62 seconds on our machine and generates satisfactory outputs. To quantitatively
evaluate the convergence error, we calculate the `∞ error of local, global, and sparse
components as specified in Theorem 5. More specifically, we calculate the $\ell_\infty$ global error at epoch $t$ as
\[
\ell_\infty\text{-global error} = \frac{1}{N}\sum_{i=1}^{N}\Big\|\hat U_{g,t}\hat V_{(i),g,t}^{T} - U^{\star}_{(i),g}V^{\star T}_{(i),g}\Big\|_\infty,
\]
the $\ell_\infty$ local error at epoch $t$ as
\[
\ell_\infty\text{-local error} = \frac{1}{N}\sum_{i=1}^{N}\Big\|\hat U_{(i),l,t}\hat V_{(i),l,t}^{T} - U^{\star}_{(i),l}V^{\star T}_{(i),l}\Big\|_\infty,
\]
and the $\ell_\infty$ sparse noise error at epoch $t$ as
\[
\ell_\infty\text{-sparse error} = \frac{1}{N}\sum_{i=1}^{N}\Big\|\hat S_{(i),t} - S^{\star}_{(i)}\Big\|_\infty.
\]
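As a concrete illustration, the sketch below (our own helper with illustrative variable names, not from the paper's repository) computes these three $\ell_\infty$ errors with NumPy, assuming the estimated and ground-truth factors are available as lists of arrays.

```python
# Sketch of the three l_inf error metrics defined above.
import numpy as np

def linf(A):
    return np.max(np.abs(A))

def tcmf_linf_errors(U_g_hat, V_g_hat, U_l_hat, V_l_hat, S_hat,
                     U_g_true, V_g_true, U_l_true, V_l_true, S_true):
    N = len(S_hat)
    g_err = np.mean([linf(U_g_hat @ V_g_hat[i].T - U_g_true @ V_g_true[i].T)
                     for i in range(N)])
    l_err = np.mean([linf(U_l_hat[i] @ V_l_hat[i].T - U_l_true[i] @ V_l_true[i].T)
                     for i in range(N)])
    s_err = np.mean([linf(S_hat[i] - S_true[i]) for i in range(N)])
    return g_err, l_err, s_err
```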

We show the error plot for three different sparsity parameters α in Figure 2.

Figure 2: Error plots of Algorithm 1. The x-axis denotes the iteration index, and the y-axis
shows the `∞ error at the corresponding iteration. The y-axis is in log scale.

From Figure 2, it is clear that the global, local, and sparse components indeed converge
linearly to the ground truth.
Further, we compare the feature extraction performance of TCMF with benchmark algo-
rithms, including JIVE (Lock et al., 2013), RJIVE (Sagonas et al., 2017), RaJIVE (Ponzi
et al., 2021), and HMF (Shi et al., 2023). We do not include the comparison with RCICA
(Panagakis et al., 2015) because RCICA is designed only for N = 2, while we have 100 different
sources. Since the errors of different methods vary drastically, we calculate and report the logarithm of the global error as
\[
\texttt{g-error} = \log_{10}\Big(\frac{1}{N}\sum_{i=1}^{N}\big\|\hat U_{g,t}\hat V_{(i),g,t}^{T} - U^{\star}_{(i),g}V^{\star T}_{(i),g}\big\|_F^{2}\Big),
\]
the logarithm of the local error as
\[
\texttt{l-error} = \log_{10}\Big(\frac{1}{N}\sum_{i=1}^{N}\big\|\hat U_{(i),l,t}\hat V_{(i),l,t}^{T} - U^{\star}_{(i),l}V^{\star T}_{(i),l}\big\|_F^{2}\Big),
\]
and the logarithm of the sparse noise error as
\[
\texttt{s-error} = \log_{10}\Big(\sum_{i=1}^{N}\big\|\hat S_{(i),t} - S^{\star}_{(i)}\big\|_F^{2}\Big),
\]
all evaluated at $t = 20$. We run experiments from 5 different random seeds and calculate the mean and
standard deviation of the log errors. Results are reported in Table 1.

Table 1: Recovery error of different algorithms. The columns g-error, l-error, and
s-error stand for the log recovery errors of global components, local components,
and sparse components.

α = 0.01 α = 0.1
g-error l-error s-error g-error l-error s-error
JIVE 5.52 ± 0.01 5.64 ± 0.01 - 6.52 ± 0.01 6.58 ± 0.01 -
HMF 5.49 ± 0.01 5.62 ± 0.01 - 6.48 ± 0.01 6.55 ± 0.01 -
RaJIVE 5.46 ± 0.01 5.36 ± 0.05 5.71 ± 0.05 6.48 ± 0.00 6.25 ± 0.14 6.59 ± 0.12
RJIVE 5.49 ± 0.01 5.44 ± 0.01 5.77 ± 0.01 6.48 ± 0.00 6.47 ± 0.00 6.78 ± 0.00
TCMF -3.38 ± 0.14 -3.37 ± 0.13 -2.94 ± 0.08 -1.93 ± 0.09 -1.95 ± 0.06 -1.54 ± 0.04

Table 1 shows that TCMF outperforms the benchmark algorithms by several orders of magnitude. This is understandable, as TCMF provably converges to the ground truth, while the benchmark algorithms either neglect sparse noise or rely on instance-dependent heuristics.

7.2 Video Segmentation from Noisy Frames


An important task in video segmentation is background-foreground separation. There
are several matrix factorization algorithms that can achieve decent performance in video
segmentation, including robust PCA (Candès et al., 2011), PerPCA (Shi and Kontar, 2024),
and HMF (Shi et al., 2023). However, the separation is much more challenging when the
videos are corrupted by large noise (Shen et al., 2022). TCMF can naturally handle such tasks
with its power to recover global and local components from highly noisy measurements.
We use a surveillance video from Vacavant et al. (2013) as an example. In the video,
multiple vehicles drive through the circle. We add large and sparse noise to the frames to
simulate the effects of large measurement errors. More specifically, similar to Section 7.1, we sample the noise entries from an i.i.d. Bernoulli model: each entry is zero with probability 0.99 and nonzero with probability 0.01, and each nonzero entry takes a value in {−500, 500} with equal probability. Then we apply TCMF on the noisy frames to recover $\hat U_g, \{\hat V_{(i),g}, \hat U_{(i),l}, \hat V_{(i),l}, \hat S_{(i)}\}$. We set ρ to 0.95 and use T = 15 epochs. The subroutine JIMF is still implemented by HMF with spectral initialization. We also set the number of iterations for HMF to 500. This is a conservative choice to ensure a small optimization error in the subroutine. To visualize the results, we plot the global components $\hat U_g\hat V_{(i),g}^{T}$ and the local components $\hat U_{(i),l}\hat V_{(i),l}^{T}$. They are
shown in Table 2.
In Table 2, the background and foreground are clearly separated from the noise. The
result highlights TCMF’s ability to extract features in high-dimensional noisy data.
We compare TCMF to several benchmark methods, namely JIVE, HMF, RJIVE, RaJIVE, and Robust PCA. Among these, JIVE, HMF, RJIVE, and RaJIVE produce joint and individual components of video frames. In our evaluation, we consider
the joint component as the background and the individual component as the foreground. As
for robust PCA, we flatten each image into a row vector and create a large matrix Mstack
by stacking these row vectors. We then utilize the nonconvex robust PCA (Netrapalli et al.,


Table 2: Foreground and background separation


Rows, from top to bottom: original noisy frames, noise, global components, and local components, shown for three example frames (1, 2, and 3).

2014) to extract the sparse and low-rank components from Mstack . The low-rank component
is regarded as the background, while the sparse component captures the foreground.
To assess the performance of these methods, we calculate the differences between the
recovered background and foreground compared to the ground truths. Specifically, we
estimate the mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural
similarity index (SSIM) of the recovered foreground and background with respect to the
true foreground and background. The comparison results are presented in Table 3.

Table 3: Background and foreground recovery quality metrics for different algorithms.
Background Foreground Wall-clock
MSE ↓ PSNR ↑ SSIM ↑ MSE ↓ PSNR ↑ SSIM ↑ time (s) ↓
JIVE 415 -26 0.08 2521 14 0.03 1.6 × 103
HMF 198 -22 0.18 2413 14 0.05 2.3 × 101
PerPCA 236 -23 0.14 2389 14 0.07 9.8 × 101
RJIVE 277 -24 0.13 1309 16 0.22 9.2 × 101
RaJIVE 170 -22 0.22 166 26 0.18 1.2 × 104
Robust PCA 0.0016 31 1.00 5105 11 0.61 3.3 × 10−1
TCMF 0.0003 33 1.00 98 31 0.98 3.5 × 102


In Table 3, a lower MSE, a higher PSNR, and a higher SSIM signify superior recovery
quality. In terms of background recovery, both TCMF and robust PCA exhibit low MSE,
high PSNR, and high SSIM, surpassing other methods. This suggests that both algorithms
effectively reconstruct the background. This outcome was anticipated as TCMF and robust
PCA possess the capability to differentiate between significant noise and low-rank components.
In contrast, other benchmarks either neglect large noise in the model or rely on heuristics.
Furthermore, TCMF showcases marginally superior performance in MSE and PSNR compared
to robust PCA, signifying higher-quality background recovery.
When it comes to foreground recovery, TCMF outperforms benchmark algorithms signifi-
cantly across all metrics. The inability of robust PCA to achieve high-quality foreground
recovery is likely due to its inability to separate sparse noise from the foreground. JIVE and
HMF yield high MSE and low PSNR, indicating noisy foreground reconstruction. Although
heuristic methods, such as RJIVE and RaJIVE, exhibit slight performance improvements
over JIVE and HMF, they still fall short of the performance exhibited by TCMF. This com-
parison underscores TCMF’s remarkable power to identify unique components from sparse
noise accurately.
We also report the running time of each experiment in Table 3. Compared with heuristic
methods to robustly separate the shared and unique components, TCMF exhibits a slightly
longer running time than RJIVE but significantly outperforms RaJIVE in terms of speed.
The comparison highlights TCMF’s superior performance with moderate computation demands.
Although Robust PCA demonstrates a relatively short running time in this instance, the larger-scale experiments presented in Appendix B show that the running time of Robust PCA scales less favorably as the problem size increases.

7.3 Case Study: Defect Detection on Steel Surface


Hot rolling is an important process in steel manufacturing. For better product quality, a
critical task is to detect and locate the defects that arise in the rolling process (Jin et al.,
2000). In this study, the dataset (Jin et al., 2000; Yan et al., 2018) comes from the HotEye
video of a rolling steel plate. The video captures sharp pictures of the surface of the steel
plate. An example is shown in the left graph of Figure 3. The irregular dark dots in the
graph indicate surface defects that require subsequent investigations (Jin et al., 2000).
As different frames of the rolling video are related, they possess similar background
patterns. Meanwhile, each frame also contains unique variations that reflect frame-by-frame
differences. On top of the changing patterns, there are small defects on the surface of the
steel plate. The defects, as shown in the left graph of Figure 3, only occupy small spatial
regions and thus can naturally be modeled by sparse outliers.
In such scenarios, the application of TCMF enables the identification of defects and
extraction of common and unique patterns simultaneously. For this experiment, we use
TCMF to segment 100 hot-rolling video frames. The right graph of Figure 3 illustrates two
frames selected from the rolling video alongside the corresponding recovered global, local,
and sparse components. We set the reduction parameter ρ = 0.97 and the number of epochs
T = 100. The details of the subroutine JIMF are relegated to Appendix A.1. Additionally,
as a comparative analysis, we employ nonconvex robust PCA (Netrapalli et al., 2014) to
recover and display the low-rank and sparse components from frames. Our robust PCA


implementation alternately applies SVD and hard thresholding. We require all entries in
the sparse component to be negative in the hard-thresholding step to encode the domain
knowledge that surface defects tend to have lower temperatures. The hyper-parameters for
SVD and thresholding are consistent for both TCMF and Robust PCA.

Figure 3: Left: An example of the surface of the steel bar. There are a few anomalies inside
the red ellipse. Right: Recovered sparse noises, shared components, and unique
components from 2 frames.

In Figure 3, we can see that TCMF effectively identifies the small defects on the steel
plate surface. The global component reflects the general patterns in the video frames, while
the local component accentuates the variations in different frames. In contrast, the sparse
components recovered by robust PCA do not faithfully represent the surface defects.
We proceed to show that TCMF-recovered sparse components can be conveniently leveraged
for frame-level anomaly detection. Our task here is to identify which frames contain surface
anomalies. Inspired by the statistics-based anomaly detection (Chandola et al., 2009), we
construct a simple test statistic to monitor the anomalies. The test statistic is defined as the $\ell_1$ norm of the recovered sparse noise on each frame, $\|\hat S_{(i)}\|_1$. Indeed, a large $\|\hat S_{(i)}\|_1$ provides strong evidence for surface defects. The choice of the $\ell_1$ norm is not special, as we find that other norms, such as the $\ell_2$ norm, yield similar performance.
After using TCMF to extract the sparse components, we calculate the test statistics for each
frame. Then, we can set up a simple threshold-based classification rule for anomaly detection:
when the `1 norm exceeds the threshold, we report an anomaly in the corresponding frame.
In the case study, the threshold is set to be the highest value in the first 50 frames, which is
the in-control group that does not contain anomalies (Yan et al., 2018). We plot the test
statistics and thresholds in Figure 5. The blue dots and red crosses denote the (ground truth)


normal and abnormal frame labels in Yan et al. (2018). In an ideal plot of test statistics, one
would expect the abnormal samples to have higher `1 norms, while normal samples should
have lower norms. This is indeed the case for Figure 5, where a simple threshold based on
the sparse features can distinguish abnormal samples from normal ones with high accuracy.
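For reference, the threshold rule described above can be implemented in a few lines. The sketch below is our own illustration (variable names are hypothetical), assuming the recovered sparse components are available as a list of per-frame arrays and that the first 50 frames form the in-control group.

```python
# Sketch of the frame-level anomaly rule: l1-norm test statistic with a threshold
# set to the largest statistic observed over the in-control frames.
import numpy as np

def detect_anomalies(S_hat, n_in_control=50):
    stats = np.array([np.abs(S).sum() for S in S_hat])   # l1 norm per frame
    threshold = stats[:n_in_control].max()                # highest in-control value
    flags = stats > threshold                             # True where a frame is flagged
    return stats, threshold, flags
```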
In comparison, we also calculate the `1 norm of sparse noise recovered by robust PCA
and plot the testing statistics in Figure 4. In Figure 4, the `1 norm is less indicative of
anomaly labels, as some abnormal samples have small test statistics, while some normal
samples have large statistics. It is also hard to use a threshold on the test statistics to
predict anomalies.

Figure 4: Test statistics of robust PCA.
Figure 5: Test statistics of TCMF.

The comparison highlights TCMF’s ability to find surface defects. The results are un-
derstandable as TCMF uses a more refined model to decompose the thermal frames into
three parts, thus having more representation power to fit the underlying physics in the
manufacturing process. As a result, the recovered sparse components are more representative
of the anomalies.

8. Conclusion
In this work, we propose a systematic method TCMF to separate shared, unique, and noise
components from noisy observation matrices. TCMF is the first algorithm that is provably
convergent to the ground truth under identifiability conditions that require the three
components to have “small overlaps”. TCMF outperforms previous heuristic algorithms
by large margins in numerical experiments and finds interesting applications in video
segmentation, anomaly detection, and time series imputation.
Our work also opens up several avenues for future theoretical exploration in separating shared and unique low-rank features from noisy matrices. For example, a minimax lower bound in terms of µ, θ, and α would help characterize the statistical difficulty of such separation. Also,
as many existing methods for JIMF rely on good initialization to excel, designing efficient
algorithms for JIMF that are independent of the initialization is also an interesting topic.
On the practical side, methods to integrate TCMF with other machine learning models, e.g.,
auto-encoders, to find nonlinear features in data are worth exploring.


Acknowledgments

This research is supported in part by Raed Al Kontar’s NSF CAREER Award 2144147 and
Salar Fattahi’s NSF CAREER Award CCF-2337776, NSF Award DMS-2152776, and ONR
Award N00014-22-1-2127.


Appendix A. Details of subroutine algorithm


In this section, we elaborate on two subroutine algorithms from the literature that can solve problem (8), namely HMF and PerPCA. Among the many existing methods for separating shared and unique features, these two are of particular interest because they are proven to converge linearly to the optimal solution of problem (8) under appropriate initialization. We emphasize that the choice of JIMF is not restricted to these two methods: in essence, any algorithm that can separate common and unique components can be employed as JIMF.

A.1 Heterogeneous matrix factorization


Heterogeneous matrix factorization (HMF) (Shi et al., 2023) is an algorithm proposed to solve the following problem:
\[
\begin{aligned}
\min_{U_g,\{V_{(i),g},\,U_{(i),l},\,V_{(i),l}\}_{i=1,\cdots,N}} \;\sum_{i=1}^{N} \tilde f_i\big(U_g, \{V_{(i),g}, U_{(i),l}, V_{(i),l}\}\big)
&= \sum_{i=1}^{N} \frac{1}{2}\Big\|\hat M_{(i)} - U_g V_{(i),g}^{T} - U_{(i),l}V_{(i),l}^{T}\Big\|_F^{2} + \frac{\beta}{2}\Big\|U_g^{T}U_g - I\Big\|_F^{2} + \frac{\beta}{2}\Big\|U_{(i),l}^{T}U_{(i),l} - I\Big\|_F^{2} \\
\text{subject to}\;\; & U_g^{T}U_{(i),l} = 0, \;\forall i.
\end{aligned}
\tag{16}
\]
Compared with (8), the objective in (16) contains two additional regularization terms, $\frac{\beta}{2}\|U_g^{T}U_g - I\|_F^{2}$ and $\frac{\beta}{2}\|U_{(i),l}^{T}U_{(i),l} - I\|_F^{2}$. The regularization terms enhance the smoothness of the optimization objective, thereby facilitating convergence. Despite the regularization terms, any optimal solution to (16) is also an optimal solution to (8). We prove this claim in the following proposition.

Proposition 9 Let $\hat U^{\mathrm{HMF}}_{g}, \{\hat V^{\mathrm{HMF}}_{(i),g}, \hat U^{\mathrm{HMF}}_{(i),l}, \hat V^{\mathrm{HMF}}_{(i),l}\}$ be a set of optimal solutions to (16). Then $\hat U^{\mathrm{HMF}}_{g}, \{\hat V^{\mathrm{HMF}}_{(i),g}, \hat U^{\mathrm{HMF}}_{(i),l}, \hat V^{\mathrm{HMF}}_{(i),l}\}$ is also a set of optimal solutions to (8).

Proof The proof is straightforward. We first claim that $\hat U^{\mathrm{HMF}\,T}_{g}\hat U^{\mathrm{HMF}}_{g} = I$ and $\hat U^{\mathrm{HMF}\,T}_{(i),l}\hat U^{\mathrm{HMF}}_{(i),l} = I$. We prove the claim by contradiction. Suppose otherwise; then we can find QR decompositions of $\hat U^{\mathrm{HMF}}_{g}$ and $\hat U^{\mathrm{HMF}}_{(i),l}$ as $\hat U^{\mathrm{HMF}}_{g} = Q_g R_g$ and $\hat U^{\mathrm{HMF}}_{(i),l} = Q_{(i),l}R_{(i),l}$, where $Q_g$ and the $Q_{(i),l}$'s are orthonormal and $R_g$ and the $R_{(i),l}$'s are upper-triangular. Furthermore, not both $R_g$ and $R_{(i),l}$ are identity matrices, thus $\|R_g^{T}R_g - I\|_F^{2} + \|R_{(i),l}^{T}R_{(i),l} - I\|_F^{2} > 0$. Now, we construct a refined set of solutions as
\[
\hat U^{\mathrm{HMF,refined}}_{g} = Q_g,\qquad
\hat V^{\mathrm{HMF,refined}}_{(i),g} = \hat V^{\mathrm{HMF}}_{(i),g} R_g^{T},\qquad
\hat U^{\mathrm{HMF,refined}}_{(i),l} = Q_{(i),l},\qquad
\hat V^{\mathrm{HMF,refined}}_{(i),l} = \hat V^{\mathrm{HMF}}_{(i),l} R_{(i),l}^{T}.
\]

Then it’s easy to verify that

N  
HMF,ref ined HMF,ref ined HMF,ref ined
X
f˜i ÛHMF,ref
g
ined
, V̂ (i),g , Û(i),l , V̂ (i),l
i=1
N  β 
X  2 2
= f˜i ÛHMF
g , V̂ HMF
(i),g , ÛHMF
(i),l , V̂ HMF
(i),l − RTg Rg − I F
+ RT(i),l R(i),l − I
2 F
i=1
XN  
< f˜i ÛHMF
g , V̂ HMF
(i),g , ÛHMF
(i),l , V̂ HMF
(i),l ,
i=1

which contradicts with the global optimality of ÛHMF HMF HMF HMF
g , {V̂(i),g , Û(i),l , V̂(i),l }. This proves the
claim.
 
From the orthogonality, we know fi ÛHMFg , V̂ HMF , ÛHMF , V̂HMF
(i),g (i),l (i),l =
 
f˜i ÛHMF HMF HMF HMF
g , V̂(i),g , Û(i),l , V̂(i),l .

Now suppose ÛHMF HMF HMF HMF


g , {V̂(i),g , Û(i),l , V̂(i),l } is not an optimal solution to (8). Then, we can
find a different set of feasible solution ÛJIMF
g
JIMF , ÛJIMF , V̂JIMF } such that
, {V̂(i),g (i),l (i),l

N
X  
fi ÛJIMF
g , V̂ JIMF
(i),g , ÛJIMF
(i),l , V̂ JIMF
(i),l
i=1
N
X  
< fi ÛHMF
g , V̂ HMF
(i),g , ÛHMF
(i),l , V̂ HMF
(i),l
i=1
XN  
= f˜i ÛHMF HMF HMF HMF
g , V̂(i),g , Û(i),l , V̂(i),l .
i=1

We can similarly define a set of refined solutions

ined
ÛJIMF,ref
g = QJIMF
g
JIMF,ref ined JIMF JIMF T
V̂(i),g = V̂(i),g Rg
ined
ÛJIMF,ref
(i),l = QJIMF
(i),l
JIMF,ref ined (i),lT
V̂(i),l = V̂HMF RJIMF
(i),l ,

JIMF,ref ined
where QJIMF
g , RJIMF
g , QJIMF JIMF
(i),l , R(i),l are QR decompositions that satisfy Ûg =
ined
QJIMF
g RJIMF
g and ÛJIMF,ref
(i),l = QJIMF JIMF
(i),l R(i),l . Based on the refined set of solutions, we


can prove that,


N  
JIMF,ref ined JIMF,ref ined JIMF,ref ined
X
f˜i ÛJIMF,ref
g
ined
, V̂ (i),g , Û (i),l , V̂ (i),l
i=1
N
X  
= fi ÛJIMF
g , V̂ JIMF
(i),g , Û JIMF
(i),l , V̂ JIMF
(i),l
i=1
XN  
< f˜i ÛHMF
g , V̂ HMF
(i),g , ÛHMF
(i),l , V̂ HMF
(i),l ,
i=1

which contradicts the optimality of ÛHMF HMF HMF HMF


g , {V̂(i),g , Û(i),l , V̂(i),l }.
This completes the proof.

HMF optimizes the objective by gradient descent. To ensure feasibility, HMF employs a
special correction step to orthogonalize Ug and U(i),l without changing the objective at
every step. The pseudo-code is presented in Algorithm 2.

Algorithm 2 JIMF by heterogeneous matrix factorization
1: Input matrices $\{\hat M_{(i)}\}_{i=1}^{N}$, stepsize $\eta_\tau$, iteration budget $R$.
2: Initialize $U_{g,1}$, $V_{(i),g,1/2}$, $U_{(i),l,1/2}$, $V_{(i),l,1}$ to be small random matrices.
3: for iteration $\tau = 1, \ldots, R$ do
4:   for index $i = 1, \cdots, N$ do
5:     Correct $U_{(i),l,\tau} = U_{(i),l,\tau-1/2} - U_{g,\tau}\big(U_{g,\tau}^{T}U_{g,\tau}\big)^{-1}U_{g,\tau}^{T}U_{(i),l,\tau-1/2}$
6:     Correct $V_{(i),g,\tau} = V_{(i),g,\tau-1/2} + V_{(i),l,\tau}U_{(i),l,\tau-1/2}^{T}U_{g,\tau}\big(U_{g,\tau}^{T}U_{g,\tau}\big)^{-1}$
7:     Update $U_{(i),g,\tau+1} = U_{g,\tau} - \eta_\tau \nabla_{U_g}\tilde f_i$
8:     Update $V_{(i),g,\tau+1/2} = V_{(i),g,\tau} - \eta_\tau \nabla_{V_{(i),g}}\tilde f_i$
9:     Update $U_{(i),l,\tau+1/2} = U_{(i),l,\tau} - \eta_\tau \nabla_{U_{(i),l}}\tilde f_i$
10:    Update $V_{(i),l,\tau+1} = V_{(i),l,\tau} - \eta_\tau \nabla_{V_{(i),l}}\tilde f_i$
11:   end for
12:   Calculate $U_{g,\tau+1} = \frac{1}{N}\sum_{i=1}^{N} U_{(i),g,\tau+1}$
13: end for
14: Return $U_{g,R}$, $\{V_{(i),g,R}\}$, $\{U_{(i),l,R}\}$, $\{V_{(i),l,R}\}$.

In Algorithm 2, we use τ to denote the iteration index; a half-integer index indicates that the update of the corresponding variable is only half complete: it has been updated by gradient descent but is not yet feasible. It is proven that under a group of sufficient conditions, Algorithm 2
converges to the optimal solutions of problem (16). The sufficient conditions require the
stepsize ητ to be chosen appropriately and the initialization close to the optimal solution
(Shi et al., 2023).
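To make the role of the correction step concrete, the following minimal Python sketch (our own illustration, not the released HMF implementation) mirrors lines 5 and 6 of Algorithm 2: it projects the local factor away from the current global factor and adjusts the coefficient matrix so that the fitted matrix $U_g V_{(i),g}^T + U_{(i),l}V_{(i),l}^T$ is unchanged by the correction.

```python
# Sketch of the HMF correction step (Algorithm 2, lines 5-6) for one source.
import numpy as np

def hmf_correction(U_g, U_l_half, V_g_half, V_l):
    # G = (U_g^T U_g)^{-1} U_g^T U_l_half
    G = np.linalg.solve(U_g.T @ U_g, U_g.T @ U_l_half)
    U_l = U_l_half - U_g @ G      # line 5: remove the component of U_l along U_g
    V_g = V_g_half + V_l @ G.T    # line 6: compensate in V_g so the product is unchanged
    return U_l, V_g
```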
In practice, Algorithm 2 is often efficient and accurate. Therefore, we implement HMF
as the subroutine JIMF for all of our numerical simulations in Section 7. To initialize
Algorithm 2, we adopt a spectral initialization approach. Specifically, we concatenate all


matrices column-wise to form $M_{\mathrm{concat}} = [M_{(1)}, M_{(2)}, \cdots, M_{(N)}]$. Subsequently, we perform a Singular Value Decomposition (SVD) on the concatenated matrix $M_{\mathrm{concat}}$ to extract the top $r_1$ column singular vectors, which serve as the initialization for $U_{g,1}$ in Algorithm 2. Utilizing the calculated $U_{g,1}$, we deflate $M_{(i)}$ by subtracting the projection of $M_{(i)}$ onto $U_{g,1}$, denoted as $M^{\mathrm{deflate}}_{(i)} = M_{(i)} - U_{g,1}U_{g,1}^{T}M_{(i)}$. We then conduct another SVD to identify the top $r_2$ singular vectors of $M^{\mathrm{deflate}}_{(i)}$, which are utilized as the initialization for $U_{(i),l,1/2}$. The initializations for the coefficient matrices are established as $V_{(i),g,1/2} = M_{(i)}^{T}U_{g,1}$ and $V_{(i),l,1} = M_{(i)}^{T}U_{(i),l,1/2}$.
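A minimal sketch of this spectral initialization is given below (our own illustration with assumed variable names, not the authors' implementation).

```python
# Sketch of the spectral initialization: one global SVD on the concatenated data,
# followed by per-source SVDs on the deflated matrices.
import numpy as np

def spectral_init(M, r1, r2):
    # M: list of n1 x n2 observation matrices, one per source.
    M_concat = np.concatenate(M, axis=1)                 # n1 x (N * n2)
    U, _, _ = np.linalg.svd(M_concat, full_matrices=False)
    U_g = U[:, :r1]                                       # top r1 global directions

    U_l, V_g, V_l = [], [], []
    for M_i in M:
        M_deflate = M_i - U_g @ (U_g.T @ M_i)             # remove the global subspace
        Ui, _, _ = np.linalg.svd(M_deflate, full_matrices=False)
        U_li = Ui[:, :r2]                                 # top r2 local directions
        U_l.append(U_li)
        V_g.append(M_i.T @ U_g)
        V_l.append(M_i.T @ U_li)
    return U_g, V_g, U_l, V_l
```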
The stepsize η in Algorithm 2 is individually adjusted for each dataset to achieve the
fastest convergence. We choose a large total number of iterations R to ensure a small
optimization error. Specifically, for the synthetic data, we set the stepsize to 0.005 and
R = 500. In the video segmentation task, we set the stepsize to 5 × 10−6 and R = 200. And
on the hot rolling data, we set the stepsize to 4 × 10−5 and R = 500. In our experiments, we
observe that the regularization parameter β exerts a negligible influence on the convergence
of Algorithm 2. Consequently, we maintain β within the range of 10−6 to 10−5 in all our
experiments.

A.2 Personalized PCA


Personalized PCA (Shi and Kontar, 2024) is another subroutine to solve (8). More specifically,
personalized PCA seeks to find orthonormal features Ug and U(i),l to minimize the residual
of fitting, as shown in the following objective:
\[
\begin{aligned}
\min_{U_g,\{U_{(i),l}\}_{i=1,\cdots,N}} \;& \frac{1}{2}\sum_{i=1}^{N}\Big\|\hat M_{(i)} - U_g U_g^{T}\hat M_{(i)} - U_{(i),l}U_{(i),l}^{T}\hat M_{(i)}\Big\|_F^{2} \\
\text{subject to}\;\; & U_g^{T}U_g = I,\; U_{(i),l}^{T}U_{(i),l} = I,\; U_g^{T}U_{(i),l} = 0, \;\forall i.
\end{aligned}
\tag{17}
\]

The objective only optimizes the feature matrices Ug and U(i),l , but it’s essentially equivalent
to problem (8). The formal statement is presented in the following proposition.

Proposition 10 Let $\hat U^{\mathrm{PerPCA}}_{g}, \{\hat U^{\mathrm{PerPCA}}_{(i),l}\}$ be a set of optimal solutions to (17). Then $\hat U^{\mathrm{PerPCA}}_{g}, \{\hat M_{(i)}^{T}\hat U^{\mathrm{PerPCA}}_{g}, \hat U^{\mathrm{PerPCA}}_{(i),l}, \hat M_{(i)}^{T}\hat U^{\mathrm{PerPCA}}_{(i),l}\}$ is also a set of optimal solutions to (8).

Proof We will also prove the proposition by contradiction. If


PerPCA
Ûg T PerPCA
, {M̂(i) Ûg PerPCA T PerPCA
, Û(i),l , M̂(i) Û(i),l } is not a set of optimal solution to (8),
we can find a different set of feasible solutions ÛJIMF
g
JIMF , ÛJIMF , V̂JIMF } such that
, {V̂(i),g (i),l (i),l

N
X  
fi ÛJIMF
g , V̂ JIMF
(i),g , Û JIMF
(i),l , V̂ JIMF
(i),l
i=1
N
X  
T T
< fi ÛPerPCA
g , M̂ Û
(i) g
PerPCA
, Û PerPCA
(i),l , M̂ Û PerPCA
(i) (i),l
i=1
N
X 2
T PerPCA T
= M̂(i) − UPerPCA
g UPerPCA
g M̂(i) − UPerPCA
(i),l U(i),l M̂(i) .
F
i=1


If we fix Ug and U(i),l to be ÛJIMF g and ÛJIMF


(i),l in problem (8), then the optimal
 −1
JIMF,opt JIMF,opt
solution of V(i),g and V(i),l is V(i),g = M̂T(i) ÛJIMF
g ÛJIMF T ÛJIMF
g g and V(i),l =
 −1
M̂T(i) ÛJIMF JIMF T JIMF
(i),l Û(i),l Û(i),l . As a result,

N
X  −1  −1 2
T JIMF T T JIMF T
M̂(i) − ÛJIMF
(i),l ÛJIMF
(i),l Û(i),l ÛJIMF
(i),l M̂(i) − ÛJIMF
(i),g ÛJIMF
(i),g Û(i),g ÛJIMF
(i),g M̂(i)
i=1 F
N
X  
JIMF,opt JIMF,opt
= fi ÛJIMF
g , V̂ (i),g , ÛJIMF
(i),l , V̂ (i),l
i=1
XN  
≤ fi ÛJIMF
g
JIMF
, V̂(i),g , ÛJIMF
(i),l , V̂ JIMF
(i),l
i=1
N
X 2
T PerPCA T
< M̂(i) − UPerPCA
g UPerPCA
g M̂(i) − UPerPCA
(i),l U(i),l M̂(i) .
F
i=1

 −1/2
ine ine
If we define ÛJIMF,ref
(i),l = ÛJIMF ÛJIMF T ÛJIMF
(i),l (i),l (i),l and ÛJIMF,ref
(i),g =
 −1/2
ine ine
ÛJIMF JIMF T JIMF
(i),g Û(i),g Û(i),g , then ÛJIMF,ref
(i),l and ÛJIMF,ref
(i),g are also feasible for (17)
and achieve lower objective. This contradicts the optimality of UPerPCA
g and UPerPCA
(i),l .

To solve the constrained optimization problem (17), personalized PCA adopts a dis-
tributed version of Stiefel gradient descent. The pseudo-code is presented in Algorithm
3.
In Algorithm 3, GR denotes the generalized retraction. In practice, it can be implemented via the polar projection $\mathrm{GR}_{U}(V) = (U + V)\big(U^{T}U + V^{T}U + U^{T}V + V^{T}V\big)^{-1/2}$. Algorithm 3 can also be proved to converge to the optimal solutions with suitable choices of stepsize and initialization (Shi and Kontar, 2024).
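For concreteness, a minimal sketch of this polar-projection retraction is given below; the function name and the SVD-based variant mentioned in the comment are our own illustration, not part of the PerPCA code.

```python
# Polar-projection retraction: returns (U + V) ((U + V)^T (U + V))^{-1/2},
# which has orthonormal columns whenever U + V has full column rank.
import numpy as np
from scipy.linalg import sqrtm

def polar_retraction(U, V):
    W = U + V
    # Note W^T W = U^T U + V^T U + U^T V + V^T V, as in the formula above.
    inv_sqrt = np.linalg.inv(np.real(sqrtm(W.T @ W)))
    return W @ inv_sqrt

# A numerically safer equivalent uses the polar factor of W:
# P, _, Qt = np.linalg.svd(W, full_matrices=False); return P @ Qt
```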

Appendix B. Additional running time comparisons

In this section, we include the additional running time comparison between TCMF and Robust
PCA. We use a set of synthetic datasets with varying numbers of sources N , then compare
the per-iteration running time of the two algorithms. More specifically, we follow the setting
in Section 7.1 where n1 = 15 and n2 = 1000, and generate synthetic datasets where the
number of sources N changes from 100 to 10000. Then, we apply TCMF and Robust PCA on
the same dataset. We do not parallelize computations for either algorithm for fair comparison.
The per-iteration running time of the two algorithms is collected and plotted in Figure 6.


Algorithm 3 JIMF by personalized PCA
Input observation matrices $\{\hat M_{(i)}\}_{i=1}^{N}$, stepsize $\eta_\tau$, iteration budget $R$.
Initialize $U_{g,1}$ and $U_{(1),l,1/2}, \cdots, U_{(N),l,1/2}$.
Calculate $S_{(i)} = \hat M_{(i)}\hat M_{(i)}^{T}$ for each $i$.
for iteration $\tau = 1, \ldots, R$ do
  for index $i = 1, \cdots, N$ do
    Correct $U_{(i),l,\tau} = \mathrm{GR}_{U_{(i),l,\tau-1/2}}\big(-U_{g,\tau}U_{g,\tau}^{T}U_{(i),l,\tau-1/2}\big)$
    Calculate $G_{(i),\tau} = \big(I - U_{g,\tau}U_{g,\tau}^{T} - U_{(i),l,\tau}U_{(i),l,\tau}^{T}\big)S_{(i)}\big[U_{g,\tau}, U_{(i),l,\tau}\big]$
    Update $U_{(i),g,\tau+1} = U_{g,\tau} + \eta_\tau (G_{(i),\tau})_{1:d,\,1:r_1}$
    Update $U_{(i),l,\tau+1/2} = \mathrm{GR}_{U_{(i),l,\tau}}\big(\eta_\tau (G_{(i),\tau})_{1:d,\,(r_1+1):(r_1+r_{2,(i)})}\big)$
  end for
  Update $U_{g,\tau+1} = \mathrm{GR}_{U_{g,\tau}}\big(\frac{1}{N}\sum_{i=1}^{N} U_{(i),g,\tau+1} - U_{g,\tau}\big)$
end for
Calculate $V_{(i),g,R} = \hat M_{(i)}^{T}U_{g,R}$ and $V_{(i),l,R} = \hat M_{(i)}^{T}U_{(i),l,R}$.
Return $U_{g,R}$, $\{V_{(i),g,R}\}$, $\{U_{(i),l,R}\}$, $\{V_{(i),l,R}\}$.

Figure 6: Per-iteration running time comparison between TCMF and Robust PCA.

From Figure 6, it is clear that the running time of TCMF scales linearly with the number
of sources N , which is consistent with the complexity analysis.
In contrast, though Robust PCA has a smaller per-iteration runtime when N is small, its runtime increases faster than that of TCMF as N becomes larger. This is because Robust PCA vectorizes the observation matrices from each source. The resulting vector from each source has dimension n1 n2. Robust PCA then concatenates these vectors into an n1 n2 × N matrix and alternately performs singular value decomposition and hard-thresholding. For each application of SVD, the computational complexity is O(n1 n2 (N² + N) + N³) when N ≤ n1 n2 (Li et al., 2019). Indeed, in Figure 6, the slope of the initial part of the Robust
PCA curve is around 1.2, and the slope of the final part is around 3.0, suggesting that the
running time scales cubically in the large N regime.


Such comparison highlights TCMF’s computational advantage when N is large.

Appendix C. Proof of Theorem 5


In this section, we will introduce the details of the proof of Theorem 5. We will firstly
introduce a few basic lemmas, then prove the KKT conditions in Lemma 7. Based on the
KKT conditions, we introduce an infinite series to represent the solutions to (8). Next, we
will prove Lemma 20, which is a formal version of Lemma 6. Finally, we will use induction
to prove Theorem 21, which is the formal version of Theorem 5.
Remember that we use L? (i) to denote L? (i) = L? (i),g + L? (i),l , where L? (i),g and L? (i),l
are the global and local components for source i defined as L? (i),g = U? g V? T(i),g and
L? (i),l = U? (i),l V? T(i),l . We assume all nonzero singular values of L? (i) are lower bounded
by σmin > 0 and upper bounded by σmax > 0. As introduced in the proof sketch, we use
E(i),t = S? (i) − Ŝ(i),t to denote the difference between our estimate of the sparse noise at
epoch t and the ground truth. The following notations will be used throughout our proof:
\[
F_{(i)} = E_{(i),t} L^{\star T}_{(i)} + L^{\star}_{(i)} E_{(i),t}^{T} + E_{(i),t} E_{(i),t}^{T},\; i \in [N], \quad\text{and}\quad F_{(0)} = \frac{1}{N}\sum_{i=1}^{N} F_{(i)}, \tag{18}
\]
\[
T_{(i)} = L^{\star}_{(i)} L^{\star T}_{(i)},\; i \in [N], \quad\text{and}\quad T_{(0)} = \frac{1}{N}\sum_{i=1}^{N} T_{(i)}. \tag{19}
\]
i=1

Since in the ground truth model, the SVD of L? (i) can be written as L? (i) =
 ? T
H g , H? (i),l diag(Σ(i),g , Σ(i),l ) W? (i),g , W? (i),l , one can immediately see that T(i) ’s
 

nonzero eigenvalues are upper bounded by σmax 2 and lower bounded by σmin2 . Finally,

recall that we use Ûg , Û(i),l , V̂(i),g , and V̂(i),l to denote the optimal solutions to (8) (we
omit the subscript t here for Qkbrevity.) For a series of square matrices of the same shape
A1 , · · · , Ak ∈ Rr×r , we use
Q m=1 Am to denote the product of these matrices in the ascend-
ing order of indices, and 1m=k Am to denote the product of these matrices in the descending
order of indices,
Yk
Am = A1 A2 · · · Ak−1 Ak
m=1
Y1
Am = Ak Ak−1 · · · A2 A1 .
m=k
Our next two lemmas provide upper bound on the maximum row-norm of the errors
with respect to the `∞ -norms of E(i) . By building upon these two lemmas, we provide a key
result in Lemma 13 connecting {F(i) } and the error matrices {E(i) }.

Lemma 11 Suppose that $E_{(1)}, \cdots, E_{(N)} \in \mathbb{R}^{n_1\times n_2}$ are $\alpha$-sparse and $U \in \mathbb{R}^{n_1\times r}$ is $\mu$-incoherent with $\|U\| \le 1$. For any integers $p_1, p_2, \cdots, p_k \ge 0$ and $i_1, i_2, \cdots, i_k \in \{0, 1, \cdots, N\}$, we have
\[
\max_j \Big\| e_j^{T} \Big(\prod_{\ell=1}^{k} (E_{(i_\ell)} E_{(i_\ell)}^{T})^{p_\ell}\Big) U \Big\|_2 \le \sqrt{\frac{\mu^2 r}{n_1}}\, \Big(\alpha n \max_i \|E_{(i)}\|_\infty\Big)^{2(p_1 + p_2 + \cdots + p_k)}. \tag{20}
\]

With a slight abuse of notation, in Lemma 11 and the rest of the paper, we define $E_{(0)}E_{(0)}^{T}$ to be
\[
E_{(0)}E_{(0)}^{T} = \frac{1}{N}\sum_{i=1}^{N} E_{(i)}E_{(i)}^{T}. \tag{21}
\]
i=0

Proof We will prove it by induction on the exponent. From the definition of incoherence,
we know that when p1 + · · · + pk = 0, the inequality (20) holds. Now suppose that
the inequality (20) holds for all p1 , p2 , · · · , pk ≥ 0 such that p1 + · · · + pk ≤ s − 1 and
i1 , i2 , · · · , ik ∈ {0, 1, · · · , N }. We will prove the statement for p1 + · · · + pk = s. Without
loss of generality, we assume p1 ≥ 1. One can write

k
! 2 k
! !2
Y X Y
eTj (E(i` ) ET(i` ) )p` U = eTj (E(i` ) ET(i` ) )p` Uel
`=1 2 l `=1
k
! !2
X Y
= eTj E(i1 ) ET(i1 ) (E(i1 ) ET(i1 ) )p1 −1 (E(i` ) ET(i` ) )p` Uel
l `=2
k
! !2
X Xh i Y
= E(i1 ) ET(i1 ) eTh (E(i1 ) ET(i1 ) )p1 −1 (E(i` ) ET(i` ) )p` Uel
j,h
l h `=2
XXh i h i
= E(i1 ) ET(i1 ) E(i1 ) ET(i1 )
j,h1 j,h2
l h1 ,h2
k 2
! !
Y Y
× eTh1 (E(i1 ) ET(i1 ) )p1 −1 (E(i` ) ET(i` ) )p` Uel eTl UT (E(i` ) ET(i` ) )p` (E(i1 ) ET(i1 ) )p1 −1 eh2
`=2 `=k
(22)
.
el eTl = I, we can simplify the summation as,
P
Since l

X   
E(i1 ) ET(i1 ) E(i1 ) ET(i1 )
j,h1 j,h2
h1 ,h2
k 2
! !
Y Y
× eTh1 (E(i1 ) ET(i1 ) )p1 −1 (E(i` ) ET(i` ) )p` UUT (E(i` ) ET(i` ) )p` (E(i1 ) ET(i1 ) )p1 −1 eh2
`=2 `=k
 
X   
≤ E(i1 ) ET(i1 ) E(i1 ) ET(i1 ) 
j,h1 j,h2
h1 ,h2

k
! 2
Y
× max eTm (E(i1 ) ET(i1 ) )p1 −1 (E(i` ) ET(i` ) )p` U
m
`=2 2
4s−4
µ2 r
X    
≤ E(i1 ) ET(i1 ) E(i1 ) ET(i1 ) αn max E(i) ∞
,
j,h1 j,h2 n1 i
h1 ,h2


where in the last step, we used the induction hypothesis. Now, to complete the proof, we
consider two cases.
If i1 > 0, we have:
X    X
E(i1 ) ET(i1 ) E(i1 ) ET(i1 ) = (E(i1 ) )j,g1 (E(i1 ) )h1 ,g1 (E(i1 ) )j,g2 (E(i1 ) )h2 ,g2
j,h1 j,h2
h1 ,h2 h1 ,h2 ,g1 ,g1
 4
≤ αn1 E(i1 ) ∞
αn2 E(i1 ) ∞
αn1 E(i1 ) ∞
αn2 E(i1 ) ∞
= αn max E(i) ∞
,
i

where the last inequality holds because at most αn1 entries in each column of E(i1 ) are
nonzero and at most αn2 entries in each row of E(i1 ) are nonzero. On the other hand, if
i1 = 0, we have:
   
X    1 X X X
E(0) ET(0) E(0) ET(0) ≤ 2 E(f1 ) ET(f1 )   E(f2 ) ET(f2 ) 
j,h1 j,h2 N
h1 ,h2 h1 ,h2 f1 >0 f2 >0
jk1 jk2
 4
1 2
= N αn max E(i) ∞
.
N2 i

Therefore, in both cases, we have,

k 2 4s
µ2 r
Y 
eTj (E(i` ) ET(i` ) )p` U ≤ αn max E(i) ∞
,
n1 i
`=1 2

for every possible i1 and every j. This concludes our proof.


Next, we present a slightly different lemma.

Lemma 12 Suppose that $E_{(1)}, \cdots, E_{(N)} \in \mathbb{R}^{n_1\times n_2}$ are $\alpha$-sparse and $V \in \mathbb{R}^{n_2\times r}$ is $\mu$-incoherent. For any integers $p_1, p_2, \cdots, p_k \ge 0$ and $i_1, i_2, \cdots, i_k \in \{0, 1, \cdots, N\}$, we have
\[
\max_j \Big\| e_j^{T} \Big(\prod_{\ell=1}^{k} (E_{(i_\ell)} E_{(i_\ell)}^{T})^{p_\ell}\Big) E_{(i_{k+1})} V \Big\|_2 \le \sqrt{\frac{\mu^2 r}{n_1}}\, \Big(\alpha n \max_i \|E_{(i)}\|_\infty\Big)^{2(p_1 + p_2 + \cdots + p_k)+1}. \tag{23}
\]

Proof The proof is analogous to that of Lemma 11, and hence, omitted for brevity.

Combining Lemma 11 and 12, we can show the following key lemma on the connection
between {F(i) } and the error matrices {E(i) }.

Lemma 13 For every $i \in [N]$, suppose that $E_{(i)} \in \mathbb{R}^{n_1\times n_2}$ is $\alpha$-sparse and $L^{\star}_{(i)} = H^{\star}_{(i)}\Sigma^{\star}_{(i)}W^{\star}_{(i)}$ is rank-$r$ with $\mu$-incoherent matrices $H^{\star}_{(i)} \in \mathbb{R}^{n_1\times r}$ and $W^{\star}_{(i)} \in \mathbb{R}^{n_2\times r}$. For any integers $p_1, p_2, \cdots, p_k \ge 0$ and $i_1, i_2, \cdots, i_k \in \{0, 1, \cdots, N\}$, the following holds for any $\mu$-incoherent matrix $U \in \mathbb{R}^{n_1\times r}$:
\[
\max_j \Big\| e_j^{T} \prod_{\ell=1}^{k} F_{(i_\ell)}^{p_\ell}\, U \Big\|_2 \le \sqrt{\frac{\mu^2 r}{n_1}}\, \Big(\alpha n \max_i \|E_{(i)}\|_\infty \big(\alpha n \max_i \|E_{(i)}\|_\infty + 2\sigma_{\max}\big)\Big)^{p_1 + p_2 + \cdots + p_k}, \tag{24}
\]
where $F_{(i)}$ is defined in (18).

Proof We firstly expand Fp(i11 ) Fp(i22 ) · · · Fp(ikk ) U and rearrange the terms by the number of
consecutive E(i) ET(i) terms appearing in the beginning of each factor.

Fp(i11 ) Fp(i22 ) · · · Fp(ikk ) U


   
= E(i1 ) (L? i1 )T + L? i1 ET(i1 ) + E(i1 ) ET(i1 ) · · · E(i1 ) (L? i1 )T + L? i1 ET(i1 ) + E(i1 ) ET(i1 )
   
E(i2 ) (L? i2 )T + L? i2 ET(i2 ) + E(i2 ) ET(i2 ) · · · E(i2 ) (L? i2 )T + L? i2 ET(i2 ) + E(i2 ) ET(i2 )
···
   
E(ik ) (L? (ik ) )T + L? (ik ) ET(ik ) + E(ik ) ET(ik ) · · · E(ik ) (L? (ik ) )T + L? (ik ) ET(ik ) + E(ik ) ET(ik ) U
 p1  p k
= E(i1 ) ET(i1 ) · · · E(ik ) ET(ik ) U
X k −1 
p1 +···+p p 1  pt−1  r−(Pt−1
g=1 pg )
+ E(i1 ) ET(i1 ) ··· E(it−1 ) ET(it−1 ) E(it ) ET(it )
r=0
  (Pt p )−1−r
p
· E(it ) L? T(it ) + L? (it ) ET(it ) F(it )g=1 · · · Fp(ikk ) U.
g
F(it+1
t+1 )

For the first term, by Lemma 11, we have


s 2(p1 +···+pk )
µ2 r
 p1 
T pk
eTj E(i1 ) ET(i1 )

· · · E ik Eik U ≤ αn max E(i) ∞
.
2 n1 i

For the remaining terms, we have


 p 1  pt−1  r−(Pt−1 g=1 pg )
E(i1 ) ET(i1 ) T
· · · E(it−1 ) E(it−1 ) T
E(it ) E(it )
  (Pt p )−1−r
p
· · · Fp(ikk ) U
? T ? T g
E(it ) (L (it ) ) + L (it ) E(it ) F(it )g=1 F(it+1
t+1 )
 p 1  pt−1  r−(Pt−1 g=1 pg )
T T T
= E(i1 ) E(i1 ) · · · E(it−1 ) E(it−1 ) E(it ) E(it ) E(it ) W? (it )
( t
P
pg )−1−r pt+1
× Σ? (it ) H? T(it ) F(it )g=1 F(it+1 ) · · · Fp(ikk ) U
 p 1  pt−1  r−(Pt−1 g=1 pg )
T T T
+ E(i1 ) E(i1 ) · · · E(it−1 ) E(it−1 ) E(it ) E(it ) H? (it )
( t
P
pg )−1−r pt+1
× Σ? (it ) W? T(it ) ET(it ) F(it )g=1 F(it+1 ) · · · Fp(ikk ) U.


We can bound the two terms separately. By Lemma 12,


 p 1  pt−1  r−(Pt−1
g=1 pg )
eTj E(i1 ) ET(i1 ) ··· E(it−1 ) ET(it−1 ) E(it ) ET(it ) E(it ) W? (it )
s 2r+1
µ2 r

≤ αn max E(i) ∞
.
n1 i

And by Lemma 11,


 p 1  pt−1  r−(Pt−1
g=1 pg )
eTj E(i1 ) ET(i1 ) ··· E(it−1 ) ET(it−1 ) E(it ) ET(it ) H? (it )
s 2r
µ2 r

≤ αn max E(i) ∞
.
n1 i

For an α-sparse matrix E(i) ∈ Rn1 ×n2 , its operator norm is bounded by
X
E(i) 2
= max vT E(i) h = max vj hk [E(i) ]jk
kvk=1,khk=1 kvk=1,khk=1
j,k
X 1  r n1 r 
n2
2 2
≤ max vj + hk [E(i) ]jk
kvk=1,khk=1 2 n2 n1
j,k
 
r r
1 X n1 X n2
≤ max E(i) ∞  vj2 αn2 + h2k αn1 
kvk=1,khk=1 2 n2 n1
j k

= α n1 n2 E(i) ∞ .

Therefore E(i) 2
≤ αn maxi E(i) ∞
. As a result, we know,
 2
F(i) 2 ≤ 2σmax αn max E(i) ∞ + αn max E(i) ∞
i i
 
= αn max E(i) ∞ 2σmax + αn max E(i) ∞ .
i i

We thus have:

eTj Fp(i11 ) Fp(i22 ) · · · Fp(ikk ) U?


s 2(Pk`=1 p` )
µ2 r 

≤ αn max E(i) ∞
n1 i
Pk
p` −1 
`=1 2r+1
X
+ αn max E(i) ∞
i
r=0
  Pk`=1 p` −1−r 
× 2σmax αn max E(i) ∞
2σmax + αn max E(i) ∞
i i


s 2 Pk`=1 p`
µ2 r 

= αn max E(i) ∞
n1 i
 Pk`=1 p`   Pk`=1 p`
+ αn max E(i) ∞ 2σmax + αn max E(i) ∞
i i
 Pk`=1 p` 

− αn max E(i) ∞
i
s Pk`=1 p`  Pk`=1 p`
µ2 r

= αn max E(i) ∞ αn max E(i) ∞
+ 2σmax .
n1 i i

This finishes our proof.

Lemma 13 is an important lemma as it provides an upper bound on the maximum row


norm of the product of a group of sparse matrices and an incoherent matrix. We will use
Lemma 13 extensively when we calculate the `∞ norm of error terms in the output of JIMF.
Next, we prove that the optimal solution indeed satisfies the KKT conditions delineated
in Lemma 7.
Proof of Lemma 7 The proof is presented in three parts. In the first part, we show
that the optimal solution optimal Ûg , {V̂(i),g , Û(i),l , V̂(i),l } satisfies the linear independence
constraint qualification (LICQ). This ensures that the optimal solution satisfies the KKT
conditions. In the second part, we prove the validity of the equations in (12). Finally, we
prove the correctness of the equations in (13).
Proof of LICQ. We begin by showing Ûg has full column rank. By contradiction,
0
suppose Ûg has rank r1 < r1 . Since M̂(i) has rank at least r1 + r2 , the residual
T T 0
M̂(i) − Ûg V̂(i),g − Û(i),l V̂(i),l has rank at least 1. Therefore we can always find another Ûg
0
such that fi (Ûg , V̂(i),g , Û(i),l , V̂(i),l ) < fi (Ûg , V̂(i),g , Û(i),l , V̂(i),l ). This contradicts the fact
that Ûg is optimal.
Next we will establish the LICQ of the constraints. We define hijk as the inner product
between the j-th column of Ug and the k-th column of U(i),l , hijk (x) = [Ug ]T:,j [U(i),l ]:,k .
The constraints in (8) can be rewritten as hijk (x̂) = 0, ∀i ∈ [r1 ], ∀j ∈ [r2 ], ∀k ∈ [N ]. LICQ
requires ∇hijk (x̂) to be linearly independent for all ijk (Bertsekas, 1997, Proposition 3.1.1).
Suppose we can find constants ψijk such that N
P Pr1 Pr2
i=1 j=1 k=1 ψijk ∇hijk (x̂) = 0. We
0
consider the partial derivative of hijk over the k -th column of U(i0 ),l . It is easy to derive,

hijk (x̂) = δii0 δkk0 [Ûg ]:,j ,
∂[U(i0 ),l ]:,k0
where δii0 is the Kronecker delta function. Then the constants ψijk should satisfy,
r2
X
ψi0 jk0 [Ûg ]:,j = 0.
j=1
0
As the columns of Ûg are linearly independent, ψi0 jk0 = 0 for each j. This holds for any i
0
and k . Therefore ψijk = 0 for all i, j, k. This implies ∇hijk ’s are linearly independent.


Proof of Equations (12). The Lagrangian of the optimization problem (8) can be written as

N
1X T T
2
L= Ug V(i),g + U(i),l V(i),l − M̂(i)
2 F (25)
i=1
+ Tr Λ8,(i) UTg U(i),l ,


where Λ8,(i) is the dual variable for the constraint UTg U(i),l = 0.
Under the LICQ, we know that Ûg , {V̂(i),g , Û(i),l , V̂(i),l } satisfies KKT condition. Setting
the gradient of L with respect to V(i),g and V(i),l to zero, we can prove (12d) and (12c).
 −1
Considering the constraint ÛTg Û(i),l = 0, we can solve them as V̂(i),g = M̂T(i) Ûg ÛTg Ûg
 −1
and V̂(i),l = M̂T(i) Û(i),l ÛT(i),l Û(i),l . Then we examine the gradient of L with respect to
U(i),l :
∂ 
T T

L = Ug V(i),g + U(i),l V(i),l − M̂(i) V(i),l + Ug ΛT(8),i .
∂U(i),l
 −1  −1
Substituting V̂(i),g = M̂T(i) Ûg ÛTg Ûg and V̂(i),l = M̂T(i) Û(i),l ÛT(i),l Û(i),l in the
above gradient and setting it to zero, we have
  −1  −1 
T T T T
Ûg Ûg Ûg Ûg + Û(i),l Û(i),l Û(i),l Û(i),l − I M̂(i) V̂(i),l

+ Ûg ΛT(8),i = 0.

Left multiplying both sides by ÛTg , we have Λ8,(i) = 0. Left multiplying


T T
both sides by Û(i),l , we have Û(i),l Û(i),l − I = 0. Therefore we also have
  −1  −1 
Ûg ÛTg Ûg ÛTg + Û(i),l ÛT(i),l Û(i),l ÛT(i),l − I M̂(i) V(i),l = 0. This proves equa-
tion (12b). Now, setting the derivative of L with respect to Ug to zero, we have

N
∂ X
T T

L= Ûg V̂(i),g + Û(i),l V̂(i),l − M̂(i) V̂(i),g = 0.
∂Ug
i=1

Left multiplying both sides by ÛTg , we have ÛTg Ûg − I = 0. We have thus proven (12a).
This completes the proof for (12).
Proof of Equations (13). Equation (12b) can be rewritten as:

M̂(i) M̂T(i) Û(i),l = Û(i),l ÛT(i),l M̂(i) M̂T(i) Û(i),l + Ûg ÛTg M̂(i) M̂T(i) Û(i),l . (26)

Since ÛT(i),l M̂(i) M̂T(i) Û(i),l is positive definite, T


we can use W(i),l Λ2,(i) W(i),l =
ÛT(i),l M̂(i) M̂T(i) Û(i),l to denote its eigen-decomposition, where Λ2,(i) ∈ Rr1 ×r1 is a positive def-
inite diagonal matrix and W(i),l ∈ Rr1 ×r1 is orthonormal. Upon defining Ĥ(i),l = Û(i),l W(i),l ,


T ÛT Û
Ĥ(i),l is also orthonormal as ĤT(i),l Ĥ(i),l = W(i),l (i),l (i),l W(i),l = I. Similarly, we rewrite
the equation (12a) as:
N N N
1 X 1 X 1 X
M(i) MT(i) Ûg = Ûg ÛTg M(i) MT(i) Ûg + Û(i),l ÛT(i),l M(i) MT(i) Ûg . (27)
N N N
i=1 i=1 i=1
PN
Since ÛTg 1 i=1 M̂(i) M̂T(i) Ûg is positive definite, we can use Wg Λ1 WgT =
PN N ×r
ÛTg N1 T
i=1 M̂(i) M̂(i) Ûg to denote its eigen decomposition, where Λ1 ∈ R
r 1 1 is positive
diagonal, Wg ∈ Rr1 ×r1 is orthogonal Wg WgT = WgT Wg = I. We define Ĥg as Ĥg = Ûg Wg ,
then Ĥg is also orthonormal. Additionally, ĤTg Ĥ(i),l = WgT ÛTg Û(i),l W(i),l = 0. This
completes the proof of equation (13c).
Next, we proceed with the proof of equations (13b) and (13a). By right multiplying both
sides of (27) with Wg and replacing Ûg and Û(i),l by Ĥg and Ĥ(i),l , we have
N N
1 X T 1 X
M̂(i) M̂(i) Ĥg = Ĥg Λ1 + PĤ M̂(i) M̂T(i) Ĥg . (28)
N (i),l N
i=1 i=1

Similarly, by right multiplying both sides of (26) with W(i),l , we can rewrite (26) as,

M̂(i) M̂T(i) Ĥ(i),l = Ĥ(i),l Λ2,(i) + Ĥg ĤTg M̂(i) M̂T(i) Ĥ(i),l . (29)

We thus prove the equations (13b) and (13a), where Λ3,(i) = ĤTg M̂(i) M̂T(i) Ĥ(i),l . 
We note that the KKT conditions provide a set of conditions that must be satisfied
for all stationary points of (8). Our next key contribution is to use these conditions to
characterize a few interesting properties satisfied by all the optimal solutions. To this goal,
we heavily rely on the spectral properties of Λ1 , Λ2,(i) , and Λ3,(i) .
For simplicity, we introduce three additional notations: $\Lambda_{4,(i)} = -\Lambda_{3,(i)}$, $\Lambda_{5,(i)} = -\Lambda_{3,(i)}^{T}/N$, and $\Lambda_6 \in \mathbb{R}^{r_1\times r_1}$ defined as
\[
\Lambda_6 = \Lambda_1 - \frac{1}{N}\sum_{i=1}^{N} \Lambda_{3,(i)}\Lambda_{2,(i)}^{-1}\Lambda_{3,(i)}^{T}. \tag{30}
\]
i=1

$\Lambda_6$ is a symmetric matrix. It is worth noting that $\Lambda_6$ is well defined, as the diagonal matrix $\Lambda_{2,(i)}$ is invertible throughout the proof. We also introduce the shorthand notation $\Delta P_g$ to denote $P_{\hat H_g} - P_{U^{\star}_g}$ and $\Delta P_{(i),l}$ to denote $P_{\hat H_{(i),l}} - P_{U^{\star}_{(i),l}}$.
Spectral properties of $\Lambda_1$, $\Lambda_{2,(i)}$, $\Lambda_{3,(i)}$, and $\Lambda_6$ are critical for developing the solutions to the KKT conditions. We will establish these properties in the following lemmas.
Before diving into these properties, we investigate the deviation of the estimated features $\hat H_g$ and $\hat H_{(i),l}$ from the ground truth features $U^{\star}_g$ and $U^{\star}_{(i),l}$.

2
Lemma 14 If $\max_i \|E_{(i)}\|_\infty \le 4\sigma_{\max}\frac{\mu^2 r}{n}$, and $E_{(i)}$ is $\alpha$-sparse with
\[
\alpha \le \frac{1}{60\mu^4 r^2}\,\frac{\sigma_{\min}^4}{\sigma_{\max}^4}\left(1 + \frac{4\sigma_{\max}^2}{\sqrt{\theta}\,\sigma_{\min}^2}\right)^{-2},
\]
we have
\[
\big\|P_{U^{\star}_g} - P_{\hat H_g}\big\|_F \le \sqrt{\alpha}\, n\, \max_i \|E_{(i)}\|_\infty\, \frac{5\sigma_{\max}}{\sqrt{\theta}\,\sigma_{\min}^2}, \tag{31}
\]
and
\[
\big\|P_{U^{\star}_{(i),l}} - P_{\hat H_{(i),l}}\big\| \le 3\sqrt{\alpha}\, n\, \max_i \|E_{(i)}\|_\infty\, \frac{\sigma_{\max}}{\sigma_{\min}^2}\left(1 + \frac{4\sigma_{\max}^2}{\sqrt{\theta}\,\sigma_{\min}^2}\right). \tag{32}
\]

Proof
By Shi and Kontar (2024, Theorem 1), we know that Ĥg and Ĥ(i),l corresponding to the
global optimal solutions to the problem (8) satisfy
N N 2
2 1 X 2 4 X F(i) F
P U? g − PĤg + PU? (i),l − PĤ ≤ 4 . (33)
F N (i),l F N θσmin
i=1 i=1

Note that the norm of the error term F(i) F


is bounded by:

F(i) = L? (i) ET(i),t + E(i),t L? (i) T + E(i),t ET(i),t


F F
 
≤ E(i),t F 2 L? (i) 2 + E(i),t 2 (34)

 
≤ αn E(i),t ∞ 2σmax + αn max E(i) ∞ .
i

Therefore, we know from (33) that

PU? g − PĤg
F
F(i) F √
 
2
≤ 2q ≤ αn E(i),t ∞
2σmax + αn max E(i) ∞
q
4 i 4
θσmin θσmin
√ 5σ
≤ αn max E(i) ∞
√ max ,
i 2
θσmin

where we used the condition αn maxi E(i) ∞ ≤ σmax /2 for the last inequality.
From (8), we can also deduce that that the column vectors of Ĥ(i),l span the top invariant
   
subspace of I − PĤg M̂(i) M̂T(i) I − PĤg . Since column vectors of U? (i),l span the top
invariant subspace of I − PU? g M? (i) M? T(i) I − PU? g , we know from Weyl’s theorem (Tao,
 

2010) and Davis-Khan theorem (Rinaldo, 2017) that,

PU? (i),l − PĤ ≤


(i),l
   
I − PĤg M̂(i) M̂T(i) I − PĤg − I − PU? g M? (i) M? T(i) I − PU? g
 
F
2
 
T
   T
 .
? ?
σmin − I − PĤg M̂(i) M̂(i) I − PĤg − I − PU? g M (i) M (i) I − PU? g

Since
   
I − PĤg M̂(i) M̂T(i) I − PĤg − I − PU? g M? (i) M? T(i) I − PU? g
 
F


  
= M̂(i) M̂T(i) − M? (i) M? T(i) I − PĤg
I − PĤg
    F
?T
?
+ M (i) M (i) PU? g − PĤg + PU? g − PĤg M? (i) M? T(i)
F F
!
5√ σ2
≤ F(i) F
2
+ 2σmax PĤg − PU? g ≤ αn max E(i) ∞
σmax 1 + 4 √ max
2
F 2 i θσmin
2
σmin
≤ ,
6
we have,

PU? (i),l − PĤ


(i),l
!
1 5√ σ2
≤ 2
σmin
αn max E(i) ∞
σmax 1 + 4 √ max
2
2
σmin − 2 i θσmin
6
!
√ σmax σ2
≤ 3 αn max E(i) ∞ 2 1 + 4 √ max
2
.
i σmin θσmin

This completes the proof.

With Lemma 14, we first provide upper bound on the operator norm of Λ3,(i) .
Lemma 15 For every i ∈ [N ], suppose that U?(i),l ’s are θ-misaligned, maxi E(i) ∞

2
4σmax µnr , and E(i) is α-sparse with α ≤ 1
10µ2 r
, we have,

Λ3,(i) ≤ 2σmax .

Proof From the KKT condition (13), we know,

Λ3,(i) = ĤTg T(i) + F(i) Ĥ(i),l




≤ T(i) + F(i) ≤ 2σmax .

This completes the proof.

We then estimate lower bounds on the smallest eigenvalues of Λ1 , Λ2,(i) , and Λ6 . These
estimates rely on more refined matrix perturbation analysis.

Lemma 16 For every $i \in [N]$, suppose that the $U^{\star}_{(i),l}$'s are $\theta$-misaligned, $\max_i \|E_{(i)}\|_\infty \le 4\sigma_{\max}\frac{\mu^2 r}{n}$, and $E_{(i)}$ is $\alpha$-sparse with
\[
\alpha \le \frac{1}{64}\,\frac{1}{\mu^4 r^2}\left(\frac{\sigma_{\min}}{\sigma_{\max}}\right)^{4}\left(1 + 2\left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)^{2} + \frac{8}{\sqrt{\theta}}\left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)^{4}\right)^{-2}.
\]
Then the minimum eigenvalues of $\Lambda_1$ and $\Lambda_{2,(i)}$ are lower bounded by $\frac{3}{4}\sigma_{\min}^2$.


Proof This lemma is a result of Weyl’s theorem (Tao, 2010) and the perturbation bound
on the eigenspaces. From the first equation in (13), we know,

PĤ T(i) + F(i) PĤ Ĥ(i),l = Ĥ(i),l Λ2,(i) .
(i),l (i),l

Therefore, Ĥ(i),l’s columns are the eigenvectors of the symmetric matrix


PĤ T(i) + F(i) PĤ , with eigenvalues corresponding to the diagonal entries
(i),l (i),l
of Λ2,(i) . According to the definition of T(i) , we know that the eigenvalues of
PU? (i),l T(i) PU? (i),l = H? (i),l Σ? 2(i),l H? T(i),l are lower bounded by σmin 2 . Hence, as a result of

Weyl’s inequality, we have


   
λmin Λ2,(i) = λmin PĤ T(i) + F(i) PĤ
(i),l (i),l

2

≥ σmin − PU? (i),l T(i) PU? (i),l − PĤ T(i) + F(i) PĤ .
(i),l (i),l 2

On the other hand, by triangle inequalities, we have



PU? (i),l T(i) PU? (i),l − PĤ T(i) + F(i) PĤ
(i),l (i),l 2

≤ PU? (i),l T(i) PU? (i),l − PĤ T(i) PĤ + F(i) 2


(i),l (i),l 2

≤ PU? (i),l T(i) PU? (i),l − PĤ T(i) PU? (i),l


(i),l 2

+ PĤ T(i) PU? (i),l − PĤ T(i) PĤ + F(i) 2


(i),l (i),l (i),l 2
2
≤ 2σmax PĤ − PU? (i),l + F(i) 2
(i),l
2
√ √
 
≤ αn max E(i) ∞ 2σmax + αn max E(i) ∞
i i
!
5√ σmax 2
σmax
+ αn max E(i) ∞ 2 1 + 4√ 2
2 i σmin θσmin
1 2
≤ σmin ,
4
where we used the fact T(i) ≤ σmax 2 in the third inequality, Lemma 14 in the 4th inequality,
and the assumed upper bound on α in the last inequality. We thus have λmin Λ2,(i) ≥ 34 σmin2 .


Similarly, we can solve Λ3,(i) from the first equation of (13) as Λ3,(i) =
T

Ĥg T(i) + F(i) Ĥ(i),l . Plugging this into the second equation of (13), we have

N
1 X   
I − PĤ T(i) + F(i) I − PĤ Ĥg = Ĥg Λ1 .
N (i),l (i),l
i=1

Thus, the columns of  Ĥg are the eigenvectors of the matrix


1 PN   
N i=1 I − PĤ(i),l T(i) + F(i) I − PĤ
(i),l
, with eigenvalues correspond-
ing to the diagonal entries of Λ1 . Again, since the minimum eigenvalue of


1 PN    
2 , Weyl’s inequality
N i=1 I − P ?
U (i),l T (i) I − P ?
U (i),l is lower bounded by σmin
can be invoked to provide a lower bound on the minimum eigenvalue of Λ1 :
N
!
1 X   
λmin (Λ1 ) = λmin I − PĤ T(i) + F(i) I − PĤ
N (i),l (i),l
i=1
2
≥ σmin −
PN        
i=1 I − PU? (i),l T(i) I − PU? (i),l − I − PĤ T(i) + F(i) I − PĤ
(i),l (i),l
.
N
2

The operator norm on the right hand side can be bounded by triangle inequalities. For each
term in the summation, we have
       
I − PU? (i),l T(i) I − PU? (i),l − I − PĤ T(i) + F(i) I − PĤ
(i),l (i),l 2
       
≤ I − PU? (i),l T(i) I − PU? (i),l − I − PĤ T(i) I − PU? (i),l
(i),l
       
+ I − PĤ T(i) I − PU? (i),l − I − PĤ T(i) I − PĤ
(i),l (i),l (i),l
   
+ I − PĤ F(i) I − PĤ
(i),l (i),l

2
≤2σmax PU? (i),l − PĤ + F(i)
(i),l

1 2
≤ σmin .
4
where we used the assumed upper bound on α. This completes the proof.

We also provide a lower bound on the minimum eigenvalue of the symmetric matrix $\Lambda_6$. Remember that $\Lambda_6$ is defined as $\Lambda_6 = \Lambda_1 - \frac{1}{N}\sum_{i=1}^{N}\Lambda_{3,(i)}\Lambda_{2,(i)}^{-1}\Lambda_{3,(i)}^{T}$.

Lemma 17 For every $i \in [N]$, suppose that the $U^{\star}_{(i),l}$'s are $\theta$-misaligned, $\max_i \|E_{(i)}\|_\infty \le 4\sigma_{\max}\frac{\mu^2 r}{n}$, and $E_{(i)}$ is $\alpha$-sparse with
\[
\alpha \le \frac{1}{(640\mu^2 r)^2}\left(\frac{\sigma_{\min}}{\sigma_{\max}}\right)^{8}\left(1 + \frac{4\sigma_{\max}^2}{\sqrt{\theta}\,\sigma_{\min}^2}\right)^{-2}.
\]
Then the minimum eigenvalue of $\Lambda_6$ is lower bounded by $\frac{3}{4}\sigma_{\min}^2$.

Proof The proof is constructive. We use two steps. In the first step, we introduce a block
matrix Λ7 defined as,
√
···

N Λ1 √ Λ3,(1) Λ3,(N )
 ΛT N Λ2,(1) · · · 0 
 3,(1)
Λ7 =  . , (35)

 .. .. .. ..
. . . 

Λ3,(N ) 0 ··· N Λ2,(N )

and√show that the minimum eigenvalue of the minimum eigenvalue of Λ7 is lower bounded
by N 34 σmin
2 . Then, in the second step, we prove that the minimum eigenvalue of Λ is
6
lower bounded by √1N multiplies the minimum eigenvalue of Λ7 , λmin (Λ6 ) ≥ λmin ( √1N Λ7 ).


During this proof, we further introduce




 Λ? 1,(i) = H? Tg M? (i) M? T(i) H? g


 N
 Λ? 1 = 1

 X
Λ? 1,(i)

N
 i=1
Λ 2,(i) = H (i),l M? (i) M? T(i) H? (i),l
? ?T






Λ? ?T ? ?T ?

3,(i) = H g M (i) M (i) H (i),l ,

for notational simplicity. From the SVD (3) and the assumption on singular values of L? (i) ,
we know:
 ?
Λ 1,(i) Λ? 3,(i)

? ? T ? ?T ? ? 2
[H g , H (i),l ] L (i) L (i) [H g , H (i),l ] =  σmin I. (36)
Λ? T3,(i) Λ? 2,(i)

Step 1: Minimum eigenvalue of Λ7 : From definitions of Λ1 , Λ2,(i) , and Λ3,(i) in (13), we


know,
√ 
N ĤTg T(0) Ĥg √ ĤTg T(1) Ĥ(1),l ··· ĤTg T(N ) Ĥ(N ),l
 ĤT T Ĥ N ĤT(1),l T(1) Ĥ(1),l · · · 0 
 (1),l (1) g 
Λ7 = 
 .. .. .. . 
. . . .
.

 √ 
ĤT(N ),l T(N ) Ĥg 0 ··· N ĤT(N ),l T(N ) Ĥ(N ),l
| {z }
Λ7,2
√ 
N ĤTg F(0) Ĥg T
√ Ĥg TF(1) Ĥ(1),l ··· ĤTg F(N ) Ĥ(N ),l

 ĤT(1),l F(1) Ĥg N Ĥ(1),l F(1) Ĥ(1),l ··· 0 

+ .. .. .. . .

. . . .. 

T
√ T

Ĥ(N ),l F(N ) Ĥg 0 ··· N Ĥ(N ),l F(N ) Ĥ(N ),l
| {z }
Λ7,1

By Lemma 26, the operator norm of Λ7,1 is upper bounded by,


v
u N
√ u X 2
kΛ7,1 k ≤ max { N F(i) } + t2 F(i)
i=0,1,··· ,N
i=1

√ 5 2
N σmin
≤ 2 N αn max E(i) σ
∞ max
≤ .
i 2 16
(37)
2
σmin 1
where we used the condition α ≤ 2
σmax 320µ2 r
in the last inequality.
We can further decompose Λ7,2 . . Then, we can derive ĤTg T(0) Ĥg =
ĤTg H? g H? Tg T(0) H? g H? Tg Ĥg + ĤTg ∆Pg T(0) Ĥg + ĤTg H? g H? Tg T(0) ∆Pg Ĥg ,
ĤT(i),l T(i) Ĥ(i),l = ĤT(i),l H? (i),l H? T(i),l T(i) H? (i),l H? T(i),l Ĥ(i),l + ĤT(i),l ∆P(i),l T(i) Ĥ(i),l +


ĤT(i),l H? (i),l H? T(i),l T(i) ∆P(i),l Ĥ(i),l , and ĤTg T(i) Ĥ(i),l = ĤTg H? g H? Tg T(i) H? (i),l H? T(i),l Ĥ(i),l +
ĤTg ∆Pg T(i) Ĥ(i),l + ĤTg H? g H? Tg T(i) ∆P(i),l Ĥ(i),l .
Therefore, we can rewrite Λ7,2 as,

Λ7,2 = Λ7,3 +
 √ 
N ĤTg H? g Λ? 1 H?Tg Ĥg ĤTg H? g Λ? 3,(1) H?T(1),l Ĥ(1),l ··· ĤTg H? g Λ? 3,(N ) H?T(N ),l Ĥ(N ),l
 T √
 Ĥ(1),l H? (1),l Λ?T3,(1) H?Tg Ĥg N ĤT(1),l H? (1),l Λ? 2,(1) H?T(1),l Ĥ(1),l · · · 0


.. .. .. ,
..

. . . .
 
 √ 
ĤT(N ),l H? (N ),l Λ?T3,(N ) H?Tg Ĥg 0 ··· N ĤT(N ),l H? (N ),l Λ? 2,(N ) H?T(N ),l Ĥ(N ),l
| {z }
Λ7,4

where Λ7,3 consists of residual terms that contain ∆Pg or ∆P(i),l .


We use Lemma 26 to estimate an upper bound for the operator norm of Λ7,3 . The
maximum operator norm of the diagonal block of Λ7,3 is

max{ N ĤTg ∆Pg T(0) Ĥg + ĤTg H? g H? Tg T(0) ∆Pg Ĥg ,

max{ N ĤT(i),l ∆P(i),l T(i) Ĥ(i),l + ĤT(i),l H? (i),l H? T(i),l T(i) ∆P(i),l Ĥ(i),l }}
i
!
√ 2 5√ σmax 2
σmax
≤ N 2σmax αn max E(i) ∞ 2 1 + 4√ 2 .
2 i σmin θσmin
The summation of the operator norm of the off-diagonal blocks of Λ7,3 is
v
u N
u X 2
t2 ĤTg ∆Pg T(i) Ĥ(i),l + ĤTg H? g H? Tg T(i) ∆P(i),l Ĥ(i),l
i=1
!
√ 5√ σmax σ2
≤ 2
2N 2σmax αn max E(i) ∞ 2 1 + 4 √ max
2
.
2 i σmin θσmin
As a result, Lemma 26 implies that,
!
√ √ σmax σ2
kΛ7,3 k ≤ 2
N 10σmax αn max E(i) ∞ 2 1 + 4 √ max
2
i σmin θσmin
√ 2
N σmin
≤ ,
16
(38)
 8  −2
1 σmin 4σ 2
where we applied the condition α ≤ (µ2 r640)2 σmax 1+ √ max
2
θσmin
in the last inequality.
We proceed to estimate the eigenvalue lower bound for Λ7,4 . We first factorize Λ7,4 as,
 
Λ7,4 = Diag ĤTg H? g , ĤT(1),l H? (1),l , · · · , ĤT(N ),l H? (N ),l
√ 
N Λ? 1 √ Λ? 3,(1) ··· Λ? 3,(N )
 ?T
 Λ 3,(1) N Λ? 2,(1) · · · 0


× .. .. .. .. 
. . . .

 √ 
Λ? T3,(N ) 0 ··· N Λ? 2,(N )


 
× Diag Ĥ?Tg Ĥg , H? T(1),l Ĥ(1),l , · · · , H? T(N ),l Ĥ(N ),l .

Lemma 24 and (36) indicate that Λ? 1,(i) − σmin


2 I  0, Λ? 2
2,(i) − σmin I  0, and
−1 ?
Λ? 1,(i) − σmin
2
I − Λ? T3,(i) Λ? 2,(i) − σmin
2

I Λ 3,(i)  0. (39)

Summing both sides of (39) for i = 1 to N , we know, N Λ? 1 − N σmin 2 I −


PN ?T ? 2 −1 √ √
Λ? 3,(i)  0, which is equivalent to N Λ? 1 − N σmin2 I−

i=1 Λ 3,(i) Λ 2,(i) − σmin I
PN ?T
 √ √  −1
i=1 Λ 3,(i) N Λ? 2,(i) − N σmin
2 I Λ? 3,(i)  0. Again, Lemma 24 indicates,
√ 
N Λ? 1 √ Λ? 3,(1) ··· Λ? 3,(N )
 ?T
 Λ 3,(1) N Λ? 2,(1) · · · 0  √

2
  N σmin
 .. .. .. .. I.

. . . .

 √ 
Λ? T3,(N ) 0 ··· N Λ? 2,(N )

On the other hand, we know form Lemma 26 that,


 
I − Diag ĤTg H? g Ĥ?Tg Ĥg , ĤT(1),l H? (1),l H? T(1),l Ĥ(1),l , · · · , ĤT(N ),l H? (N ),l H? T(N ),l Ĥ(N ),l
≤ max{k∆Pg k , max ∆P(j),l) }
j
!
αn maxi E(i) ∞
(αn maxi E(i) ∞
+ 2σmax ) 8σ 2 1
≤ 2 1 + √ max2
≤ ,
σmin θσmin 8
2
 −1
σmin 8σ 2
where we applied the condition α ≤ 80µ2 rσmax
2 1+ √ max
2
θσmin
in the last inequality.
Hence, Lemma 25 indicates,
√ 2 7
λmin (Λ7,4 ) ≥ N σmin .
8
By Wely’s theorem, we know that
√ 2 7
λmin (Λ7 ) ≥ N σmin − kΛ7,1 + Λ7,3 k
8
√ 2 3
≥ N σmin , (40)
4
where we applied the inequality (37) and (38) in the last inequality. √ 2 3 I  0.
Step 2: Minimum eigenvalue of Λ6 : The inequality (40) is equivalent to Λ7 − N σmin 4
Thus, Lemma 24 implies,

√ √ N 
√ √
−1
2 3 2 3
X
N Λ1 − N σmin I − ΛT3,(i) N Λ2,(i) − N σmin I Λ2,(i)  0.
4 4
i=1

2 3 I. As a result,
Lemma 16 already shows that Λ2,(i)  σmin 4
−1
√ √

T 2 3
Λ3,(i) N Λ2,(i) − N σmin I Λ2,(i)
4


  p 
∞ ∞
1 X X
2 3 −1  
=√ ΛT3,(i) Λ−1
2,(i) +
 σmin Λ Λ2,(i)
N p=0 4 2,(i)
p=0
1
 √ ΛT3,(i) Λ−1
2,(i) Λ3,(i) .
N
By rearranging terms, we have,

√ N √
1 X T 2 3
N Λ1 − √ Λ3,(i) Λ−1
2,(i) Λ2,(i)  N σmin I.
N i=1 4

This completes our proof.

With an understanding of spectral properties of Λ1 , Λ2,(i) , and Λ6 , we are now ready to


characterize the solutions to the KKT conditions and provide a proof for Lemma 8. To this
goal, we first write the solutions to (13) into Taylor-like series.

Lemma 18 For every i ∈ [N ], suppose that U?(i),l ’s are θ-misaligned, maxi E(i) ∞ ≤
2
 2 −3
4σmax µnr , and E(i) is α-sparse with α ≤ 40µ1 2 r 2σmax
3 2
σ
. The solutions to (13) satisfy the
4 min
following,

Ĥg = Ĥg,0 + Ĥg,1


∞ ∞ ∞ N
" k #
Y 
2l p2l+1
Fp(0)
X X X X
+ ··· F(i2l+1 ) Ĥg,0
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0
0  
−p −1
Λ5,(i2l+1 ) Λ−p
Y
× Λ4,(i2l+1 ) Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 ) 1
2l
Λ−1
6
l=k (41)
∞ ∞ ∞ N
" k #
Y p

Fp(0)
X X X X
+ ··· 2l
F(i2l+1
2l+1 )
Ĥg,1
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0
0  
−p −1 −p2l −1
Y
× Λ4,(i2l+1 ) Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 )
Λ Λ
5,(i2l+1 ) 1 Λ6 ,
l=k

and

−p−1
Fp(j) T(j) Ĥ(j),l Λ2,(j)
X
Ĥ(j),l = Ĥ(i),l,0 + + Ĥg,1 Λ4,(j) Λ−1
2,(j)
p=1
∞ ∞ ∞ N k 
" #

2l p2l+1
Fp(0)
X X X X Y
+ ··· F(i2l+1 ) Ĥg,0
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0
0  
−p −1 −p2l −1
Y
× Λ4,(i2l+1 ) Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 )
Λ Λ
5,(i2l+1 ) 1 Λ6 Λ4,(j) Λ−1
2,(j)
l=k


∞ ∞ ∞ N k 
" #

2l p2l+1
Fp(0)
X X X X Y
+ ··· F(i2l+1 ) Ĥg,1
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0
0  
−p −1 −p2l −1
Y
× Λ4,(i2l+1 ) Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 )
Λ Λ
5,(i2l+1 ) 1 Λ6 Λ4,(j) Λ−1
2,(j)
l=k

Fp(j) Ĥg Λ4,(j) Ĥg Λ4,(j) Λ−1
X
+ 2,(j) ,
p=1

(42)

where Ĥg,0 is defined as


N
X
Ĥg,0 = T(0) Ĥg Λ−1
6 + T(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ6 , (43)
i=1

Ĥg,1 is defined as
∞ N
Fp(0) Fp(i11 ) T(i1 ) Ĥ(i1 ),l Λ−p 1 −1 −p0 −1
X X
0
Ĥg,1 = 2,(i1 ) Λ5,(i1 ) Λ1 Λ6 , (44)
p0 +p1 ≥1 i1 =1

and Ĥ(i),l,0 is defined as

Ĥ(i),l,0 = T(j) Ĥ(j),1 Λ−1 −1


2 + Ĥg,0 Λ4,(j) Λ2,(j) (45)
.
Proof Notice that as we defined Λ4,(i) = −Λ3,(i) and Λ5,(i) = −ΛT3,(i) /N , the KKT condition
in (13) can be written as the following Sylvester equations
  


 F (i) Ĥ(i),l − Ĥ (i),l Λ2,(i) = − T (i) Ĥ(i),l + Ĥ g Λ 4,(i)

N (46)
!
X



 F (0) Ĥ g − Ĥg Λ 1 = − T(0) Ĥg + Ĥ (i),l Λ 5,(i) .
i=1

Note that σmin (Λ2,(i) ) > F(i) and σmin (Λ1,(i) ) > F(0) . Therefore, according to Bhatia
(2013, Theorem VII.2.2), the solution to (46) satisfies the following equation

∞ ∞ X N
−p−1
p
Fp(0) Ĥ(i),l Λ5,(i) Λ−p−1
 X X
Ĥ = F(0) T(0) Ĥg Λ1 +

 g


 1
p=0 p=0 i=1
∞ ∞
. (47)

p −p−1 p −p−1
 X X
Ĥ(i),l = F(i) T(i) Ĥ(i),l Λ2,(i) + F(i) Ĥg Λ4,(i) Λ2,(i)



p=0 p=0

We can substitute Ĥ(i),l in the right hand side of the first equation of (47) by the second
equation in (47)
∞ ∞ X
∞ X
N
Fp(0) T(0) Ĥg Λ1−p−1 Fp(0) Fp(i11 ) T(i1 ) Ĥ(i1 ),l Λ−p 1 −1 −p0 −1
X X
0
Ĥg = + 2,(i1 ) Λ5,(i1 ) Λ1
p=0 p0 =0 p1 =0 i1 =1

49
Shi, Fattahi, and Al Kontar

∞ X
∞ X
N
Fp(0) Fp(i11 ) Ĥg Λ4,(i1 ) Λ−p 1 −1 −p0 −1
X
0
+ 2,(i1 ) Λ5,(i1 ) Λ1
p0 =0 p1 =0 i1 =1
N
X N
X
= T(0) Ĥg Λ−1
1 + T(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ1 + Ĥg Λ4,(i) Λ−1 −1
2,(i) Λ5,(i) Λ1
i=1 i=1
∞ N
Fp(0) Fp(i11 ) T(i1 ) Ĥ(i1 ),l Λ−p 1 −1 −p0 −1
X X
0
+ 2,(i1 ) Λ5,(i1 ) Λ1
p0 +p1 ≥1 i1 =1
∞ N
Fp(0) Fp(i11 ) Ĥg Λ4,(i1 ) Λ−p 1 −1 −p0 −1
X X
0
+ 2,(i1 ) Λ5,(i1 ) Λ1 .
p0 +p1 ≥1 i1 =1

(48)

We then move the third term on the right hand side of (48) to the left hand side, multiply
both sides by Λ1 Λ−1
6 on the right, and recall the definition of Λ6,(i) in (30), we have,

N
X
Ĥg = T(0) Ĥg Λ−1
6 + T(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ6
i=1
| {z }
Ĥg,0
∞ N
F(i1 ) T(i1 ) Ĥ(i1 ),l Λ−p
p0 p1 1 −1 −p0 −1
X X
+ F(0) 2,(i1 ) Λ5,(i1 ) Λ1 Λ6
p0 +p1 ≥1 i1 =1
| {z }
Ĥg,1
∞ N
Fp(0) Fp(i11 ) Ĥg Λ4,(i1 ) Λ−p 1 −1 −p0 −1
X X
0
+ 2,(i1 ) Λ5,(i1 ) Λ1 Λ6 .
p0 +p1 ≥1 i1 =1
| {z }
residual term
(49)

On the right hand side of (49), one can see that the Ĥg,0 and Ĥg,1 are products of sparse
matrices, incoherent matrices, and remaining terms. Therefore we can use Lemma 13 to
calculate an upper bound on their maximum row norm. However, the residual term does
not have such specific structure as we do not know whether Ĥg is incoherent. As a result
we cannot provide precise estimate on its maximum row norm directly. To circumvent the
issue, notice that (49) has a recursive form. Therefore, the residual term can be replaced by

∞ N
Fp(0) Fp(i11 ) Ĥg Λ4,(i1 ) Λ−p 1 −1 −p0 −1
X X
0
Ĥg → Ĥg,0 + Ĥg,1 + 2,(i1 ) Λ5,(i1 ) Λ1 Λ6 . (50)
p0 +p1 ≥1 i1 =1

The result will have 5 terms, the first 4 of which have the structure specified in Lemma
13. The 5-th term does not as it contains Ĥg . We can apply the replacement rule (50) again
for the 5-th term, generating 7 terms. After applying the replacement rule ω times, where ω

50
TCMF

is an integer, the results become,

Ĥg = Ĥg,0 + Ĥg,1


ω ∞ ∞ ∞ N X
N N
p
Fp(0) Fp(i11 ) Fp(0) Fp(i33 ) · · · Fp(0)
X X X X X X
+ ··· ··· 0 2 2k
F(i2k+1
2k+1 )
k=0 p0 +p1 ≥1 p2 +p3 ≥1 p2k +p2k+1 ≥1 i1 =1 i3 =1 i2k+1 =1
−p −1 −p −p1 −1 −p0 −1
Ĥg,0 Λ4,(i2k+1 ) Λ2,(i2k−1
2k+1 )
Λ5,(i2k+1 ) Λ1 2k−2 Λ−1
6 · · · Λ4,(i1 ) Λ2,(i1 ) Λ5,(i1 ) Λ1 Λ6
ω ∞ ∞ ∞ N X
N N
p
Fp(0) Fp(i11 ) Fp(0) Fp(i33 ) · · · Fp(0)
X X X X X X
+ ··· ··· 0 2 2k
F(i2k+1
2k+1 )
k=0 p0 +p1 ≥1 p2 +p3 ≥1 p2k +p2k+1 ≥1 i1 =1 i3 =1 i2k+1 =1
−p −1 −p −p1 −1 −p0 −1
Ĥg,1 Λ4,(i2k+1 ) Λ2,(i2k−1
2k+1 )
Λ5,(i2k+1 ) Λ1 2k−2 Λ−1
6 · · · Λ4,(i1 ) Λ2,(i1 ) Λ5,(i1 ) Λ1 Λ6
∞ ∞ ∞ N X
N N
p p
Fp(0) Fp(i11 ) Fp(0) Fp(i33 ) · · · F(0)
X X X X X
2ω+2
+ ··· ··· 0 2
F(i2ω+3
2ω+3 )
p0 +p1 ≥1 p2 +p3 ≥1 p2ω+2 +p2ω+3 ≥1 i1 =1 i3 =1 i2ω+3 =1
−p −1 −p −p1 −1 −p0 −1
Ĥg Λ4,(i2ω+3 ) Λ2,(i2ω+3
2ω+3 )
Λ5,(i2ω+3 ) Λ1 2ω+2 Λ−1
6 · · · Λ4,(i1 ) Λ2,(i1 ) Λ5,(i1 ) Λ1 Λ6 ,

(51)

which holds for any integer ω ≥ 0.


Recall that our goal is to write Ĥg in a form with which we can easily determine its
maximum row norm. By observing (51), we know Lemma 13 can be applied to estimate
the maximum row norm of all but the last terms. The last summation term still cannot be
handled by Lemma 13 directly. To resolve the issue, we take an alternative route to use ω
to control the last summation term.
We claim that under the provided upper bound for α, the last term will approach
zero in the limit ω → ∞. To see this, note that Lemma 16 and Lemma 17 show
that σmin (Λ1 ), σmin (Λ2,(i) ), and σmin (Λ6 ) are lower bounded by 34 σmin
2 . Since F
(i) ≤
5
αn maxi E(i) ∞ (2σmax + αn maxi E(i) ∞ ) ≤ αn maxi E(i) ∞ 2 σmax for each i, we have,

∞ ∞ N X
N N
p
Fp(0) Fp(i11 ) Fp(0) Fp(i33 ) · · · F(i2ω+3
X X X X
0 2
··· ··· 2ω+3 ) F
p0 +p1 ≥1 p2ω+2 +p2ω+3 ≥1 i1 =1 i3 =1 i2ω+3 =1
−p −1 −p2ω+2 −p1 −1 −p0 −1
Ĥg Λ4,(i2ω+3 ) Λ2,(i2ω+3
2ω+3 )
Λ5,(i2ω+3 ) Λ1 Λ−1
6 · · · Λ4,(i1 ) Λ2,(i1 ) Λ5,(i1 ) Λ1 Λ6
∞ ∞
 !p0 +···+p2ω+3
X X αn maxi E(i) ∞ 52 σmax
≤ ··· 3 2
p0 +p1 ≥1 p2ω+2 +p2ω+3 ≥1 4 σmin
!2(ω+2)
2
2σmax
× 3 2
4 σmin
5
 !2(ω+2) !2(ω+2)
αn maxi E(i) 2
∞ 2 σmax 2σmax
≤ 4 3 2 3 2
σ
4 min 4 σmin
 2(ω+2)
1
≤ ,
2

51
Shi, Fattahi, and Al Kontar

 2
−3
1 2σmax
where we used Lemma 23 in the first inequality and the condition that α ≤ 40µ2 r 3 2
σ
4 min
in the last inequality.
Therefore, we can take the limit ω → ∞ in (51) and rewrite it as a series. The series is
absolutely convergent when α is small. Finally, we prove (41). Though (41) is an infinite
series, each term in the series is the product of sparse matrices and an incoherent matrix.
Such structure will be useful later when we use Lemma 13 to calculate the maximum row
norm of Ĥg .
Now we proceed to derive an expansion for Ĥ(i),l . We can replace Ĥg on the right hand
side of the second equation of (47) with (51) to derive,

Ĥ(i),l = T(i) Ĥ(i),l Λ−1 −1


2,(i) + Ĥg Λ4,(i) Λ2,(i)
∞ ∞
Fp(i) T(i) Ĥ(i),l Λ−p−1 Fp(i) Ĥg Λ4,(i) Λ−p−1
X X
+ 2,(i) + 2,(i)
p=1 p=1
∞ ∞ ∞ N
" k #
Y 
2l p2l+1
Fp(0)
X X X X
+ ··· F(i2l+1 ) Ĥg,0
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0
0  
−p −1 −p2l −1
Y
× Λ4,(i2l+1 ) Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 )
Λ Λ
5,(i2l+1 ) 1 Λ6 Λ4,(i) Λ−1
2,(i)
l=k
∞ ∞ ∞ N
" k #
Y 
2l p2l+1
Fp(0)
X X X X
+ ··· F(i2l+1 ) Ĥg,1
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0
0  
−p −1 −p2l −1
Y
× Λ4,(i2l+1 ) Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 )
Λ5,(i 2l+1 ) Λ1 Λ6 Λ4,(i) Λ−1
2,(i) .
l=k
(52)

We thus prove (42).

In Lemma 18, although the series of Ĥg and Ĥ(i),l ’s have infinite terms, when α is not
too large, the leading term is only the first term. This is delineated in the following lemma,
which is a formal version of Lemma 8. For simplicity, we introduce a notation
 
αn maxi E(i) ∞ αn maxi E(i) ∞ + 2σmax
ζ= 3 2 . (53)
4 σmin

Lemma 19 Suppose that the conditions of Lemma 18 are satisfied. Additionally, suppose
√  2
σmin
that α ≤ 6−3 2 1
80 µ2 r σmax , we have

Ĥg = Ĥg,0 + δHg


Ĥ(i),l = Ĥ(i),l,0 + δH(i),l

52
TCMF

where δHg and δH(i),l satisfy


!2  ! !2 
2
2σmax 2
2σmax 2
2σmax
kδHg k ≤ ζ 3 2
1 + 2
3 2 +2 3 2
 (54)
4 σmin 4 σmin 4 σmin
s !2  ! !2 
µ2 r 2
2σmax 2
2σmax 2
2σmax
max eTj δHg ≤ ζ 4 3 2
1 +
3 2 + 3 2
 (55)
j n1 4 σmin 4 σmin 4 σmin
 !2 !3 !4 
2σ 2 2
2σmax 2
2σmax 2
2σmax 2
2σmax
δH(i),l ≤ ζ 3 max
2
2 +
3 2 +3 3 2 +4 3 2 +4 3 2

4 σmin 4 σmin 4 σmin 4 σmin 4 σmin
(56)
s
2
µ2 r 2σmax
max eTj δH(i),l ≤ ζ 23 2
j n1 4 σmin
 !2 !3 !4 
2σ 2 2
2σmax 2
2σmax 2
2σmax
× 1 + 3 max
2
+5 3 2 +4 3 2 +4 3 2
, (57)
4 σmin 4 σmin 4 σmin 4 σmin

with ζ is defined in (53).


Proof We need to provide upper bounds on the series in Lemma 18. From Lemma 18, Ĥg
can be written as a series. We can define δHg as the summation of all but the first term in
the series, as in
∞ ∞ ∞ N
" k #
X X X X Y p p 
2l+1
δHg = Ĥg,1 + ··· 2l
F(0) F(i2l+1 ) Ĥg,0
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0
0  
−p −1
Λ5,(i2l+1 ) Λ−p
Y
× Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 ) 1
2l
Λ−1
6
l=k
∞ ∞ ∞ N k 
" #

p
Fp(0)
X X X X Y
+ ··· 2l
F(i2l+1
2l+1 )
Ĥg,1
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0
0  
−p −1 −p2l −1
Y
× Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 )
Λ Λ
5,(i2l+1 ) 1 Λ6 .
l=k

Hence, by applying Lemma 22, we have



X ∞
X ∞
X N
X
kδHg k ≤ Ĥg,1 + ···
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1
" k # 0  
p2l p2l+1  −p −1
Λ5,(i2l+1 ) Λ−p
Y Y
F(0) F(i2l+1 ) Ĥg,0 Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 ) 1
2l
Λ−1
6
l=0 l=k
∞ ∞ ∞ N k
" #
p2l p2l+1 
X X X X Y
+ ··· F(0) F(i2l+1 ) Ĥg,1
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0

53
Shi, Fattahi, and Al Kontar

0
−p −1
Λ5,(i2l+1 ) Λ−p
Y
× Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 ) 1
2l
Λ−1
6 .
l=k

We first estimate an upper bound for Ĥg,1 ,

∞ N
p0 p1
T(i1 ) Ĥ(i1 ),l Λ−p 1 −1 −p0 −1
X X
Ĥg,1 ≤ F(0) F(i1 ) 2,(i1 ) Λ5,(i1 ) Λ1 Λ6
p0 +p1 ≥1 i1 =1

X   p0 +p1
≤ αn max E(i) ∞
αn max E(i) ∞
+ 2σmax
i i
p0 +p1 ≥1
!2 !p0 +p1
2
σmax 1
×2 3 2 3 2
4 σmin 4 σmin
  !2
αn maxi E(i) ∞
αn maxi E(i) ∞
+ 2σmax σ2
≤ 3 2 2 3 max
2
4 σmin 4 σmin
   −1
αn maxi E(i) ∞
αn maxi E(i) ∞
+ 2σmax
2 1 − 3 2

4 σmin
  !2
αn maxi E(i) ∞
αn maxi E(i) ∞
+ 2σmax √ 2
σmax
≤ 3 2 4 2 3 2 ,
4 σmin 4 σmin

where we used Lemma 22 in the first inequality, the upper bound on F(i) in the sec-
ond inequality. Because of the upper bound on α, we can use auxiliary Lemma 23
to derive an upper bound on the series. The last inequality comes from the fact that
αn maxi kE(i) k∞ (αn maxi kE(i) k∞ +2σmax ) −1 √
(1 − 3 2
σ
) ≤ 2.
4 min
Therefore, we can proceed to estimate,

kδHg k
!2 ∞ 
k+1 !2(k+1) !2
√ 2
σmax 2X 2
2σmax 2
σmax 2
≤ ζ4 2 3 2 + ζ 3 2 3 2

4 σmin
1−ζ 4 σmin 4 σmin
1−ζ
k=0
∞  k+1 !2(k+1) ! !
X 2 2
2σmax 2
σmax 2
2σmax
+ ζ 3 2 3 2 1+ 3 2
1−ζ 4 σ min 4 σ min 4 σmin
k=0
!2  !2 −1
√ 2
σmax 2ζ 2σ 2
max
≤ ζ4 2 3 2 1 −
3 2

σ
4 min
1 − ζ σ
4 min
!2 ! ! !2 −1
2 2
2σmax 2
σmax 2σ 2 2
+ζ 1 + 3 max 1 − 2ζ 2σmax 
.
1 − ζ 34 σmin2 3 2
σ
4 min σ 2
4 min
1 − ζ 3 2
4 σmin

54
TCMF

We can estimate an upper bound on maxk eTk δHg in a similar fashion.


We first show that, for any j = 1, 2, · · · , N ,
k
Fp(i`` ) T(j)
Y
max eTk
k
`=1
k  
Fp(i`` ) H? g H? Tg + H? (j),l H? T(j),l T(j)
Y
= max eTk
k
`=1
k k
Fp(i`` ) H? g H? Tg T(j) + max eTk Fp(i`` ) H? g H? Tg T(j)
Y Y
≤ max eTk
k k
`=1 `=1
s Pkm=1 pm
µ2 r

2
≤ 2σmax αn max E(i) ∞
(αn max E(i) ∞
+ 2σmax ) ,
n1 i i

(58)
where we used the triangle inequality in the first inequality, and Lemma 13 together with
r = r1 + r2 in the second inequality.
p`
A similar equality also holds for maxk eTk k`=1 F(i
Q
`)
T(0) :

k k N N k
1 1 X
Fp(i`` ) T(0) Fp(i`` ) Fp(i`` ) T(j)
Y Y X Y
max eTk = max eTk T(j) ≤ T
max ek
k k N N k
`=1 `=1 j=1 j=1 `=1
s Pkm=1 pm
µ2 r

2
≤ 2σmax αn max E(i) ∞
(αn max E(i) ∞
+ 2σmax ) .
n1 i i
(59)
Combining the above two inequalities, we have
max eTj δHg
j

X ∞
X ∞
X N
X
≤ max eTj Ĥg,1 + ···
j
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1
k  0
" #

p −p −1
Fp(0) Λ5,(i2l+1 ) Λ−p
Y Y
max eTj 2l
F(i2l+1
2l+1 )
Ĥg,0 Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 ) 1
2l
Λ−1
6
j
l=0 l=k
∞ ∞ ∞ N k 
" #

2l p2l+1
Fp(0)
X X X X Y
+ ··· max eTj F(i2l+1 ) Ĥg,1
j
k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 i1 ,i3 ,··· ,i2k+1 =1 l=0
0
−p −1
Λ5,(i2l+1 ) Λ−p
Y
× Λ4,(i2l+1 ) Λ2,(i2l+1
2l+1 ) 1
2l
Λ−1
6
l=k
s !2
X µ2 r 2
2σmax
≤ ζ p0 +p1 3 2
n1 4 σmin
p0 +p1 ≥1

55
Shi, Fattahi, and Al Kontar


s !2 !2(k+1)
X X X µ2 r 2
2σmax 2
2σmax
+ ··· ζ p0 +···+p2k+3 3 2 3 2
n1 4 σmin 4 σmin
k=0 p0 +p1 ≥1 p2k+2 +p2k+3 ≥1


!2(k+1) !2  !2 
X X X 2
2σmax 2
2σmax 2
2σmax
+ ··· ζ p0 +···+p2k+1 3 2 3 2
1 +
3 2

k=0 p0 +p1 ≥1 p2k +p2k+1 ≥1 4 σmin 4 σmin 4 σmin
s !2 ∞
 !2 k+1 s
!2
2 µ2 r 2
2σmax 2ζ X 2
2r
2σmax
µ 2σ 2
max
≤ζ +
3 2
 
3 2 3 2
1−ζ n1 4 σmin
1−ζ n1
4 σmin σ
4 min
k=0
 !2 k+1 s

! !
X 2ζ 2σ 2 µ 2r 2σ 2 2σ 2
max max max
+   1+ 3 2
1 − ζ 34 σmin
2 n1 3 2
4 σ min 4 σmin
k=0
s !2  !2 −1 ! !!
2 2
µ r 2σmax2 2ζ 2σ 2 2σ 2 2σ 2
max max max
≤ζ 3 2
1 −
3 2
 1+ 3 2 1+ 3 2 ,
1−ζ n1 4 σmin
1 − ζ 4 σmin 4 σmin 4 σmin

where we applied (58), (59) in the second inequality, and Lemma 23 in the third inequality.
Similarly, we define δH(i),l as the summation,
∞ ∞
Fp(i) T(i) Ĥ(i),l Λ−p−1 Fp(i) Ĥg,0 Λ4,(i) Λ−p−1
X X
δH(i),l = 2,(i) + 2,(i)
p=1 p=1

Fp(i) δHg Λ4,(i) Λ−p−1
X
+ 2,(i) .
p=0

We first calculate the `2 norm of δH(i),l as,


∞ ∞
Fp(i) T(i) Ĥ(i),l Λ−p−1 Fp(i) Ĥg,0 Λ4,(i) Λ−p−1
X X
δH(i),l ≤ 2,(i) + 2,(i)
p=1 p=1

Fp(i) δHg Λ4,(i) Λ−p−1
X
+ 2,(i)
p=0
∞ ∞
!2 !
X 1 X 1 1 2
2σmax 2σ 2
≤ ζ p σmax
2
3 2 + ζ p σmax
2
3 2 3 2 1 + 3 max
2
4 σmin 4 σmin
2 4 σmin 4 σmin
p=1 p=1

X σ2
+ kδHg k ζ p 3 max
2
p=0 4 σmin
 !2 !3 !4 
2
1 σmax  2
2σmax 2
2σmax 2
2σmax 2
2σmax
≤ζ 2 + +3 +4 +4 ,
1 − ζ 34 σmin
2 3 2
4 σmin
3 2
4 σmin
3 2
4 σmin
3 2
4 σmin

where we applied the upper bound on kδHk in the last inequality.

56
TCMF

Finally, we have,

max eTj δH(i),l


j
∞ ∞
max eTj Fp(0) Ĥ(i),l Λ−p 0 −1
max eTj Fp(i) Ĥg,0 Λ4,(i) Λ−p−1
X X
0
≤ T(i) 2,(i) + 2,(i)
j j
p0 =1 p=1

max eTj Fp(i) δHg Λ4,(i) Λ−p−1
X
+ 2,(i) .
j
p=0

The first two summations can be upper bounded by Lemma 22, and the last summation
can be estimated in a similar way we calculate maxj eTj δHg . We omit the details and
present the estimated upper bound for brevity.
This completes our proof.

Equipped with the aforementioned perturbation analysis on Ĥg and H(i),l , we are ready
to provide the formal version of Lemma 6.

Lemma 20 Under the same conditions as Lemma 19, we have:



L? (i) − L̂(i) ≤ αµ2 r max E(j) ∞
C4 ,
∞ j

where C4 is a constant satisfying,


 10  18 !
σmax 1 σmax
C4 = O √ + . (60)
σmin θ σmin

Proof Notice that L? (i) = L? (i),g + L? (i),l , and L̂(i) = (PĤg + PĤ )M̂(i) = (PĤg +
(i),l
PĤ )(L? (i),g + L? (i),l + E(i),t ). Therefore, we have
(i),l

L? (i) − L̂(i)

≤ (PĤg + PĤ )(L? (i),g + L? (i),l + E(i),t ) − L? (i),g − L? (i),l


(i),l ∞
? ? ?
≤ PĤg (L (i),g +L (i),l + E(i),t ) − L (i),g

? ? ?
+ PĤ (L (i),g +L (i),l + E(i),t ) − L (i),l
(i),l ∞
? ?
≤ PĤg L (i),l + PĤ L (i),g
∞ (i),l ∞
 
+ Ĥg,0 Ĥg,0 + Ĥg,0 δHg + δHg ĤTg,0 + δHg δHTg (L? (i),g + E(i),t ) − L? (i),g
T T

 
+ Ĥ(i),l,0 ĤT(i),l,0 + Ĥ(i),l,0 δHT(i),l + δH(i),l ĤT(i),l,0 + δH(i),l δHT(i),l

57
Shi, Fattahi, and Al Kontar

(L? (i),l + E(i),t ) − L? (i),l


≤ Ĥg,0 ĤTg,0 L? (i),g −L ?


(i),g + Ĥ(i),l,0 ĤT(i),l,0 L? (i),l − L? (i),l
∞ ∞

+ Ĥg,0 δHTg L? (i),g + δHg ĤTg,0 L? (i),g + δHg δHTg L? (i),g ∞


∞ ∞

+ Ĥg,0 δHTg E(i),t + δHg Ĥg,0 E(i),t + δHg δHTg E(i),t ∞


∞ ∞

+ Ĥ(i),l,0 δHT(i),l L? (i),l + δH(i),l Ĥ(i),l,0 L ?


(i),l + δH(i),l δHT(i),l L? (i),l
∞ ∞ ∞

+ Ĥ(i),l,0 δHT(i),l E(i),t + δH(i),l ĤT(i),l,0 E(i),t + δH(i),l δHT(i),l E(i),t


∞ ∞ ∞
? ?
+ PĤg L (i),l + PĤ L (i),g .
∞ (i),l ∞
(61)

There are 16 terms in (61), we will bound each of them respectively.


Bounding the first term of (61):

Ĥg,0 ĤTg,0 L? (i),g − L? (i),g



 
?T
≤ H ? T ?
g H g Ĥg,0 Ĥg,0 L (i),g I − H? g H? Tg Ĥg,0 ĤTg,0 L? (i),g
− L? (i),g +
∞ ∞
2
µ r  
≤ H? g H? Tg Ĥg,0 ĤTg,0 L? (i),g − L? (i),g + I − H? g H? Tg Ĥg,0 ĤTg,0 L? (i),g
n ∞
µ2 r µ2 r  
≤ Ĥg,0 ĤTg,0 L? (i),g − L? (i),g + I − H? g H? Tg Ĥg,0 ĤTg,0 L? (i),g
n  n
? ?T T ?
+ I − H g H g Ĥg,0 Ĥg,0 L (i),g .

(TM1)

Recall the definition of Ĥg,0 as,


N
X
Ĥg,0 = T(0) Ĥg Λ−1
6 + T(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ6
i=1
N
X
= Ĥg −F(0) Ĥg Λ−1
6 − F(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ6 ,
i=1
| {z }
δHg,0

where
PN we used the KKT condition (13a) and the definition of Λ6 = Λ1 +
−1
i=1 3,(i) Λ2,(i) Λ5,(i) .
Λ
The first term in (TM1) is thus bounded by,

µ2 r
Ĥg,0 ĤTg,0 L? (i),g − L? (i),g
n
µ2 r
≤ Ĥg ĤTg L? (i),g − H? g H? Tg L? (i),g
n

58
TCMF

µ2 r µ2 r µ2 r
+ Ĥg δHTg,0 L? (i),g + δHg,0 ĤTg L? (i),g + δHg,0 δHTg,0 L? (i),g .
n n n
The second term in (TM1) is bounded by
 
I − H? g H? Tg Ĥg,0 ĤTg,0 L? (i),g
   
≤ I − H? g H? Tg Ĥg ĤTg L? (i),g + I − H? g H? Tg δHg,0 ĤTg L? (i),g
   
?T ?T
? T ?
+ I − H g H g Ĥg δHg,0 L (i),g + I − H g H g δHg,0 δHTg,0 L? (i),g
?

≤ Ĥg ĤTg − H? g H? Tg σmax + 2 kδHg,0 k σmax + kδHg,0 k2 σmax .

The third term in (TM1) is bounded by

 
I − H? g H? Tg Ĥg,0 ĤTg,0 L? (i),g

N 
?T T(i)
X 
= Ĥ ?
(i) Ĥ (i) Ĥg Λ−1
6 + Ĥ ? ?T −1 −1
(i) Ĥ (i) T(i) Ĥ(i),l Λ2,(i) Λ5,(i) Λ6 L? (i),g
N
i=1 ∞
N
µ2 r X T(i)
 
≤ Ĥ? (i) Ĥ?T(i) Ĥg Λ−1 ? ?T −1 −1
6 + Ĥ (i) Ĥ (i) T(i) Ĥ(i),l Λ2,(i) Λ5,(i) Λ6 L? (i),g .
n N
i=1
(62)

We know that,
 
H? (i),l H? T(i),l T(i) = PĤ T(i) + H? (i),l H? T(i),l − PĤ T(i)
(i),l (i),l
 
= PĤ S(i) − PĤ F(i) + H? (i),l H? T(i),l − PĤ T(i) ,
(i),l (i),l (i),l

and that,
S(i)
PĤ Ĥg + PĤ S(i) Ĥ(i),l Λ−1
2,(i) Λ5,(i) = 0.
(i),l N (i),l

As a result, we have
 
I − H? g H? Tg Ĥg,0 ĤTg,0 L? (i),g

 
µ2 r
N
X −PĤ F(i) + H? (i),l H? T(i),l − PĤ T(i)
Ĥ? (i) Ĥ?T(i)
(i),l (i),l

n N
i=1
 
× Ĥg + Ĥ(i),l Λ−1 −1 ?
2,(i) N Λ5,(i) Λ6 L (i),g
 
2 N F + σ 2 H ? H?T − P
!
µ r 1 X (i) max (i),l (i),l Ĥ(i),l 2σ 2
≤ σmax 3 2
1 + 3 max
2
.
n N 4 σmin 4 σmin
i=1

59
Shi, Fattahi, and Al Kontar

Combing them all, we have,

Ĥg,0 ĤTg,0 L? (i),g − L? (i),g



N
! !
2
µ2 r F(0) 2
2σmax 1 X F(i) + σmax 2
∆P(i),l 2σmax
≤ σmax 2 k∆Pg k + 6 3 2 3 2 + 3 2 3 2 .
n 4 σmin 4 σmin
N 4 σmin 4 σmin
i=1
(63)

Bounding the second term of (61):

Ĥ(i),l,0 ĤT(i),l,0 L? (i),l − L? (i),l



? ?T T ?
≤ H − L? (i),l
(i),l H (i),l Ĥ(i),l,0 Ĥ(i),l,0 L (i),l

 
+ I − H? (i),l H? T(i),l Ĥ(i),l,0 ĤT(i),l,0 L? (i),l

µ2 r
≤ H? (i),l H? T(i),l Ĥ(i),l,0 ĤT(i),l,0 L? (i),l − L? (i),l
n
 
+ I − H? (i),l H? T(i),l Ĥ(i),l,0 ĤT(i),l,0 L? (i),l

µ2 r µ2 r  
≤ Ĥ(i),l,0 ĤT(i),l,0 L? (i),l − L? (i),l + I − H? (i),l H? T(i),l Ĥ(i),l,0 ĤT(i),l,0 L? (i),l
n  n ∞
? ?T
+ I−H (i),l H (i),l Ĥ(i),l,0 ĤT(i),l,0 L? (i),l .

(TM2)

We will estimate upper bounds of three terms in (TM2) respectively.


Recall that the definition of Ĥ(i),l,0 as,

Ĥ(i),l,0 = T(i) Ĥ(i),l Λ−1 −1


2,(i) + Ĥg,0 Λ4,(i) Λ2,(i)

= S(i) Ĥ(i),l Λ−1 −1 −1 −1


2,(i) + Ĥg Λ4,(i) Λ2,(i) − F(i) Ĥ(i),l Λ2,(i) + δHg,0 Λ4,(i) Λ2,(i)

= Ĥ(i),l − δH(i),l .

The first term in (TM2) is thus bounded by,

µ2 r
Ĥ(i),l,0 ĤT(i),l,0 L? (i),l − L? (i),l
n
µ2 r
≤ Ĥ(i),l Ĥ(i),l L? (i),g − H? g H? Tg L? (i),g
n
µ2 r µ2 r µ2 r
+ Ĥ(i),l δHT(i),l,0 L? (i),l + δH(i),l,0 ĤT(i),l L? (i),l + δH(i),l,0 δHT(i),l,0 L? (i),l .
n n n
The second term in (TM2) is upper bounded by
 
I − H? (i),l H? T(i),l Ĥ(i),l,0 ĤT(i),l,0 L? (i),l
   
≤ I − H? (i),l H? T(i),l PĤ L? (i),l + I − H? (i),l H? T(i),l δH(i),l,0 ĤT(i),l L? (i),l
(i),l

60
TCMF

   
+ I − H? (i),l H? T(i),l Ĥ(i),l δHT(i),l,0 L? (i),l + I − H? (i),l H? T(i),l δH(i),l,0 δHT(i),l,0 L? (i),l
2
≤ PĤ − H? (i),l H? T(i),l σmax + 2 δH(i),l,0 σmax + δH(i),l,0 σmax .
(i),l

Then we bound the third term of (TM2). From the definition of Ĥ(i),l,0 and T(i) , we
know,

 
I − H? (i),l H? T(i),l Ĥ(i),l,0
= H? g H? Tg T(i) Ĥ(i),l Λ−1 ? ?T −1
2,(i) + H g H g Ĥg,0 Λ4,(i) Λ2,(i)
 
+ I − H? g H? Tg − H? (i),l H? T(i),l Ĥg,0 Λ4,(i) Λ−1
2,(i)

= H? g H? Tg T(i) Ĥ(i),l Λ−1


2,(i)
 
N
X  
?T T(0) Ĥg Λ−1 ?T
+H ?
gH g 6 + T(j) Ĥ(j),l Λ−1 −1 
2,(j) Λ5,(j) Λ6 Λ4,(i) Λ−1
2,(i) + I−H ?
(i),l H (i),l
j=1
 
N N
X T(j)
X
× H? (j),l H? T(j),l Ĥg + H? (j),l H? T(j),l T(j) Ĥ(j),l Λ−1  −1 −1
2,(j) Λ5,(j) Λ6 Λ4,(i) Λ2,(i)
N
j=1 j=1
? ?T −1
=H g H g PĤg S(i) Ĥ(i),l Λ2,(i)
 
N
X
+ H? g H? Tg PĤg S(0) Ĥg Λ−1
6 + S(j) Ĥ(j),l Λ−1 −1 
2,(j) Λ5,(j) Λ6 Λ4,(i) Λ−1
2,(i)
j=1
?T
−H ? −1
g H g PĤg F(i) Ĥ(i),l Λ2,(i) − H? g H? Tg ∆Pg T(i) Ĥ(i),l Λ−1
2,(i)
 
N
X
− H? g H? Tg PĤg F(0) Ĥg Λ−1
6 + F(j) Ĥ(j),l Λ−1 −1 
2,(j) Λ5,(j) Λ6 Λ4,(i) Λ−1
2,(i)
j=1
 
N
X
− H? g H? Tg ∆Pg T(0) Ĥg Λ−1
6 + T(j) Ĥ(j),l Λ−1 −1 
2,(j) Λ5,(j) Λ6 Λ4,(i) Λ−1
2,(i)
j=1
 
+ I − H? (i),l H? T(i),l
 !
N
X Ĥg
×  H? (j),l H? T(j),l Ĥ(j),l ĤT(j),l S(j) + Ĥ(j),l Λ−1
2,(j) Λ5,(j)
 Λ−1 −1
6 Λ4,(i) Λ2,(i)
N
j=1
 
− I − H? (i),l H? T(i),l
 !
N
X Ĥg
× H? (j),l H? T(j),l Ĥ(j),l ĤT(j),l F(j) + Ĥ(j),l Λ−1
2,(j) Λ5,(j)
 Λ−1 −1
6 Λ4,(i) Λ2,(i)
N
j=1
 
− I − H? (i),l H? T(i),l

61
Shi, Fattahi, and Al Kontar

 !
N
X Ĥg
× H? (j),l H? T(j),l ∆P(j),l T(j) + Ĥ(j),l Λ−1
2,(j) Λ5,(j)
 Λ−1 −1
6 Λ4,(i) Λ2,(i)
N
j=1

.
From the KKT conditions, we know ĤTg Si Ĥ(i),l +
 
T −1 P N −1 −1
Ĥg S(0) Ĥg Λ6 + j=1 S(j) Ĥ(j),l Λ2,(j) Λ5,(j) Λ6 Λ4,(i) = 0 and
 

ĤT(j),l S(j) Ng + Ĥ(j),l Λ−12,(j) Λ5,(j) = 0.
Therefore, we have,
 
max eTk I − H? (i),l H? T(i),l Ĥ(i),l,0
k
s  !2 
2
µ r F(i) F(0) 2σ 2 σ 2 2 2
1 + 2σmax + 2σmax 

≤ 3 2
+ 3 2 3 max 2
+ k∆Pg k 3 max
2 3 2 3 2
n1 4 σmin 4 σmin 4 σmin 4 σmin 4 σmin 4 σmin
N N
! !
1 X F(j) 2σmax 2 2
2σmax  1 X 2
σmax 2
2σmax 2
2σmax 
+ 3 2 3 2 2 + 3 3 2 + ∆P (j) 3 2 2 3 2 1 + 3 2 .
N 4 σmin 4 σmin 4 σmin
N 4 σmin 4 σmin 4 σmin
j=1 j=1

(64)

Combing these results, we have,

Ĥ(i),l,0 ĤT(i),l,0 L? (i),l − L? (i),l



!
µ2 r  F
(i) F(0) 2
2σmax
≤ σmax × 3 2 + 3 2 6+73 2
n 4 σmin 4 σmin 4 σmin
 !2 
σ 2 2 2
+ k∆Pg k 3 max 1 + 2σmax + 2σmax  + 2 ∆P(i),l
2 3 2 3 2
σ
4 min 4 σmin 4 σmin
N N
! !
1 X F(j) 2σmax 2 2
2σmax  1 X 2
σmax 2
2σmax 2σ 2 
+ 3 2 3 2 2 + 3 3 2 + ∆P (j) 3 2 2 3 2 1 + 3 max
2
.
N σ σ 4 σmin
N 4 σmin 4 σmin 4 σmin
j=1 4 min 4 min j=1

(65)

Bounding the third term of (61):

Ĥg,0 δHTg L? (i),g



N
!
X
= max eTj T(0) Ĥg Λ−1
6 + T(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ6 δHg L? (i),g ek
j,k
i=1
 !2 
µ2 r 2σ 2 2
2σmax
≤ σmax kδHg k  3 max
2
+ 3 2
.
n 4 σmin 4 σmin

62
TCMF

(66)

Bounding the fourth term of (61):

Ĥg,0 δHTg E(i),t



X
= max eTj Ĥg,0 δHTg el eTl E(i),t ek
j,k
l

≤ max eTj Ĥg,0 δHTg el αn1 E(i) ∞


j,l
N
!
X
= max eTj T(0) Ĥg Λ−1
6 + T(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ6 δHTg el αn1 E(i) ∞
j,l
i=1
!
2
µ2 r 2σmax 2σ 2
≤ 1 + 3 max αn1 E(i) ,
n1 34 σmin
2 2
4 σmin

(67)

where we used the definition of α-sparsity in the second inequality, and applied Lemma 19
in the last inequality.
Bounding the fifth term of (61):

δHg ĤTg,0 L? (i),g



N
!
X
= max eTj δHg T(0) Ĥg Λ−1
6 + T(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ6 L? (i),g ek
j,k
i=1
s
N
µ2 r X
≤ max T(0) Ĥg Λ−1
eTj δHg
6 + T(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ6
j n2
i=1
s !
2
µ2 r σmax 2
2σmax
T
≤ σmax max ej δHg 1+ 3 2 .
j n2 34 σmin
2
4 σmin

Bounding the sixth term of (61):

δHg ĤTg,0 E(i),t



N
!T
X X
= max eTj δHg T(0) Ĥg Λ−1
6 + T(i) Ĥ(i),l Λ−1 −1
2,(i) Λ5,(i) Λ6 E(i),t ek
j,k
l i=1
!s
2
2σmax 2
2σmax µ2 r
≤ max eTj δHg 3 2 1+ 3 2 αn1 kEk∞ ,
j
4 σmin 4 σmin
n1

63
Shi, Fattahi, and Al Kontar

where we applied the incoherence condition on T(0) and T(i) , and the relation
P T
l el E(i),t ek ≤ αn1 kEk∞ .
Bounding the seventh term of (61):

δHg δHTg L? (i),g ∞


= max eTj δHg δHTg L? (i),g ek
j,k
s
µ2 r
≤ max eTj δHg σmax ,
j n2
(68)
where we applied the incoherence on L? (i),g and kδHg k ≤ 1 in the first inequality.
Bounding the eighth term of (61):

δHg δHTg E(i),t ∞


X
= max eTj δHg δHTg el eTl E(i),t ek
j,k
l
≤ max eTj δHg eTl δHg αn1 kEk∞
j,l
 2
T
= max ej δHg αn1 kEk∞ ,
j

where we applied l eTl E(i),t ek ≤ αn1 kEk∞ in the first inequality.


P
Bounding the ninth term of (61):

Ĥ(i),l,0 δHT(i),l L? (i),l



 
= max ej T(i) H(i),l Λ−1
T
2,(i) + Ĥ −1 T ?
g,0 4,(i) 2,(i) δH(i),l L (i),l ek
Λ Λ
j,k
 !2 
2
µ r 2σ 2 2 2
= σmax 3 max 1 + 2σmax + 2σmax  δH(i),l ,
n 2 3 2 3 2
4 σmin 4 σmin 4 σmin

where we applied the incoherence condition of T(0) and T(i) .


Bounding the tenth term of (61):

Ĥ(i),l,0 δHT(i),l E(i),t



X
= max eTj Ĥ(i),l,0 δHT(i),l el eTl E(i),t ek
j,k
l

= max eTj Ĥ(i),l,0 δHT(i),l el


αn1 kEk∞
j,l
s  !2 
2 2
µ r 2σmax  2σ 2 2σ 2
≤ 1 + 3 max + 3 max  eTl δH(i),l αn1 kEk ,

n1 34 σmin
2 σ 2
4 min σ 2
4 min

64
TCMF

where we applied the incoherence condition in the first inequality, (57) in the second
inequality.
Bounding the eleventh term of (61):

δH(i),l ĤT(i),l,0 L? (i),l


= max eTj δH(i),l ĤT(i),l,0 L? (i),l ek


j,k
 !2  s
2
σmax 2
2σmax 2
2σmax µ2 r
≤ max eTj δH(i),l 3 2
1 +
3 2 + 3 2
 σmax ,
j
4 σmin 4 σmin 4 σmin
n2

where we applied the incoherence condition in the first inequality, and (57) in the second
inequality.
Bounding the twelfth term of (61):

δH(i),l ĤT(i),l,0 E(i),t



X
= max eTj δH(i),l ĤT(i),l,0 el eTl E(i),l ek
j,k
l

≤ max eTj δH(i),l ĤT(i),l,0 el αn1 kEk∞


j,l
s  !2 
2 2
µ r 2σmax  2
2σmax 2
2σmax
≤ max eTj δH(i),l 1 + +  αn1 kEk ,

j n1 34 σmin
2 3 2
4 σmin
3 2
4 σmin

(69)

where we applied the condition l eTl E(i),t ek ≤ αn1 kEk∞ in the first inequality, the
P
incoherence condition in the second inequality.
Bounding the thirteenth term of (61):

δH(i),l δHT(i),l L? (i),l


= max eTj δH(i),l δHT(i),l L? (i),l ek


j,k
s
µ2 r (70)
≤ max eTj δH(i),l δH(i),l σmax
j n2
s
µ2 r
≤ max eTj δH(i),l σmax ,
j n2

where we apply the incoherence condition in the first inequality, and δH(i),l ≤ 1 in the
second inequality.

65
Shi, Fattahi, and Al Kontar

Bounding the fourteenth term of (61):

δH(i),l δHT(i),l E(i),t



X
= max eTj δH(i),l δHT(i),l em eTm E(i),t ek
j,k
m
≤ max eTj δH(i),l
eTm δH(i),l αn1 kEk∞
j.m
 2
T
≤ max ej δH(i),l αn1 kEk∞ ,
j

where we applied the condition l eTl E(i),t ek ≤ αn1 kEk∞ in the first inequality.
P
Bounding the fifteenth term of (61):

PĤg L? (i),l
∞ 
= max ej Ĥg,0 + δHg ĤTg H? (i),l Σ(i),l W? T(i),l ek
T
j,k
s
µ2 r
 
≤ max eTj Ĥg,0 + max eTj δHg ĤTg H? (i),l σmax
j j n2
s
µ2 r
 
≤ max eTj Ĥg,0 + max eTj δHg k∆Pg k σmax
j j n2
! s
µ2 r 2σmax
2 2σ 2 µ2 r
= 1 + 3 max k∆Pg k σmax + σmax k∆Pg k max eTj δHg .
n 34 σmin
2 2
4 σmin
n2 j

The second inequality comes from the relation ĤTg H? (i),l =


 
ĤTg I − H? g H? Tg H? (i),l ĤTg PĤg − PĤg H? g H? Tg H? (i),l

= ≤
PĤg − PĤg H? g H? Tg = PĤg (PĤg − H? g H? Tg ) ≤ PĤg − H? g H? Tg .
Bounding the sixteenth term of (61):

 
PĤ = max eTj Ĥ(i),0 + δH(i),l ĤT(i),l H? g Σ? (i),g W? T(i),g ek
L? (i),g
(i),l ∞ j,k
s
µ2 r
 
≤ max eTj Ĥ(i),l,0 + max eTj δH(i),l ĤT(i),l H? g σmax
j j n2
s
µ2 r
 
≤ max eTj Ĥ(i),l,0 + max eTj δH(i),l ∆P(i),l σmax
j j n2
 !2 
µ2 r 2σ 2  2
2σmax 2
2σmax
≤ σmax 3 max
2
1 + 3 2 + 3 2
 ∆P(i),l (71)
n σ
4 min σ
4 min σ
4 min
s
µ2 r
+ ∆P(i),l σmax max eTj δH(i),l (TM16)
n2 j

66
TCMF

, where we again applied Lemma 19in the first


 inequality. The second inequality comes
 from
T ? T ? T
the relation Ĥ(i),l H g = Ĥ(i),l I − PĤg H g = Ĥ(i),l PH? g − PĤg PH? g H g ≤ ?

PH? g − PĤg PH? g ≤ PH? g − PH? g .


Combining these sixteen terms (TM1)-(TM16) and considering the fact that α ≤ 1, we
have,

L? (i) − L̂(i) ≤ αµ2 r kEk∞ C4 ,

where

!9 !5
2
2σmax 2
2σmax 1
 1

C4 = 34327 3 2 + 534 3 2 θ − 2 = O κ9 + κ5 θ − 2 . (72)
4 σmin 4 σmin

This completes our proof.


Finally we will prove Theorem 5. We will first state its formal version below.
Theorem 21 Suppose that the conditions of Lemma 19 are satisfied. Additionally, suppose
ρ2
that there exists a constant 0 < ρmin < 1 such that α ≤ 4µ4min
r2 C 2
. Then, the following
4
σ√ 2
max µ r
statements hold at iteration t ≥ 1 of Algorithm 1 with λ1 = n1 n2 ,  ≤ λ1 (1 − ρmin ), and
1 − λ1 > ρ ≥ ρmin :
 
1. supp Ŝ(i),t ⊂ supp S? (i) for every i ∈ [N ].


2
2. Ŝ(i),t − S? (i) ≤ 2λt ≤ 4σmax µnr for every i ∈ [N ].

3. L̂(i),t − L? ≤  + ρλt for every i ∈ [N ].


Moreover, we have
 
 T ? ?T  t
Û g,t V̂ (i),g,t −U g V (i),g =O ρ + , for every i ∈ [N ]. (73)
∞ 1−ρ
and
 
 T ? ?T 
Û (i),l,t V̂ (i),l,t −U (i),l V (i),l = O ρt + , for every i ∈ [N ]. (74)
∞ 1−ρ
Remark The definition of the term ρmin in the statement of the above theorem is kept
intentionally implicit to streamline the presentation. In what follows, we will give an
estimate of the requirements on α purely in terms of the parameters of the problem.
8   
θ σmax
Lemma 14 requires α = O µ4 r2 σmin . Lemma 15 requires α = O µ12 r . Lemma 16
  12    12 
θ σmin θ σmin
requires α = O µ4 r2 σmax . Lemma 17 requires α = O µ4 r2 σmax . Lemma
  6    2 
θ σmin θ σmin
18 requires α = O µ2 r σmax . And Lemma 19 requires α = O µ2 r σmax .

67
Shi, Fattahi, and Al Kontar

  10 
√1 σmax
As C4 = O θ σmin , the additional requirement in Theorem 21 requires α =
  20 
θ σmin
O µ4 r2 σmax . Taking the intersections of all these requirements, we can derive
  20 
θ σmin
the upper bound on α as α = O µ4 r2 σmax .

Proof We will prove this theorem by induction.


µ2 r  
Base case: At t = 1, L̂(i),0 = 0. As λ1 = n σmax , we have Ŝ(i),1 = Hard µ2 r M(i) . By
n
σmax
definition of hard-thresholding, if the jk-th entry of Ŝ(i),1 is nonzero, we know [M(i) ]jk >
µ2 r µ2 r
n σmax .Since [L? (i) ]jk ≤ n σmax for each j and k, we must have [S? (i) ]jk > 0. This
proves Claim 1 for t = 1.
Now we will prove Claim 2 holds when t = 1. If [Ŝ(i),1 ]jk = 0, we know
[S? ? 2 ? 2
(i) ]jk + [L (i) ]jk ≤ µ r/nσmax , thus [S (i) ]jk ≤ 2µ r/nσmax . If [Ŝ(i),1 ]jk 6= 0, by
the definition of hard-thresholding, we know [Ŝ(i),1 ]jk = [M(i) ]jk = [S? (i) ]jk + [L? (i) ]jk .
By rearranging terms, we have [S? (i) ]jk − [Ŝ(i),1 ]jk = [L? (i) ]jk ≤ µ2 r/nσmax . We hence
proved Claim 2 for t = 1.
2
Since E(i),1 = S? (i) − Ŝ(i),1 , we have E(i),1 ∞
≤ 2 µnr σmax for each i as well. Also,
ρ2 2
by Claim 1, E(i),1 ’s are α-sparse. Therefore by Lemma 20, when α ≤ 4µ4min r2 C42
≤ 4µ4ρr2 C 2 ,
4
√ 2
L̂(i),1 − L? (i) ≤ 2 α µnr σmax C4 ≤ ρλ1 . From the definition of -optimality and triangle

inequality, we know L̂(i),1 − L? (i) ≤ ρλ1 + . We thus proved Claim 3 for t = 1.

Induction step: Now supposing that Claims 1, 2, and 3 hold for iterations 1, · · · , t, we
will show their correctness for the iteration t + 1. Since Claim 3 holds for iteration t, we
2
know L̂(i),t − L? (i) ≤ ρλt +  under the condition α ≤ 4µ4ρr2 C 2 . With the choice of λt+1 =
∞ 4

ρλt + , if the jk-th entry of Ŝ(i),t+1 is nonzero, we have [S? (i) ]jk + [L? (i) ]jk − [L̂(i),t ]jk >
λt+1 . Since [L? (i) ]jk − [L̂(i),t ]jk ≤ λt+1 , we must have [S? (i) ]jk > 0. This proves Claim 1
for iteration t + 1.
We will now proceed to prove Claim 2. We consider each entry of Ŝ(i),t+1 =
h i
Hardλt+1 S? (i) + L? (i) − L̂(i),t . From the definition of hard-thresholding, we know
 
[Ŝ(i),t+1 ]jk − [S? (i) ]jk + [L? (i) ]jk − [L̂(i),t ]jk ≤ λt+1 . Remember that we know
[L? (i) ]jk − [L̂(i),t ]jk ≤ λt+1 from the correctness of Claim 3 at iteration t and the up-
per bound on α, we can derive [Ŝ(i),t+1 ]jk − [S? (i) ]jk ≤ 2λt+1 by triangle inequality. We
hence prove Claim 2.
For Claim 3, since E(i),t+1 = S? (i) − Ŝ(i),t+1 , we have E(i),t+1 ∞ ≤ 2λt+1 for each i as
well. Also, by Claim 1 at iteration t, E(i),t ’s are α-sparse at iteration t. Therefore by Lemma
√ 2
20, L̂(i),t+1 − L? ≤ 2 αµ2 rC4 λt+1 . Under the constraint that α ≤ 4µ4ρr2 C 2 , we know
∞ 4

68
TCMF

L̂(i),t+1 − L? ≤ ρλt+1 . From the definition of -optimality and triangle inequality, we



have L̂(i),t+1 − L? ≤ ρλt+1 + . We thus proved Claim 3 at iteration t + 1.

Combining them, we can conclude that 1, 2, and 3 hold for every t = 1, 2, · · · .
T
Finally, we will prove (73) and (74). We have known that Ûg,t V̂(i),g,t = PĤg M̂(i) , then
from similar analysis of (61), we have,

T
Ûg,t V̂(i),g,t − U? g V? T(i),g

? ?
+ E(i) − L? (i),g

= PĤg L (i),g +L (i),l

?
≤ PĤg L (i),l

 
+ T(0) Ĥg Λ1 Ĥg T(0) + T(0) Ĥg Λ−1
−2 T
1 δHT
g + δH Λ−1 T
g 1 Ĥ T
g (0) + δHg δHT
g

(L? (i),g + E(i),t ) − L? (i),g .


In Lemma 20, we have shown that each term above is upper bounded by O(maxi E(i),t ∞
).
Therefore by Claim 2, we have T
Ûg,t V̂(i),g,t − U? ?T = O(maxi E(i),t ) = O(λt ) =
g V (i),g ∞


O(ρt + 1−ρ ). (73) follows accordingly by triangle inequality.
We can prove (74) in a similar way. This completes our proof of Theorem 21.

Appendix D. Auxiliary Lemma


This section discusses some helper lemmas useful for our main proofs. These lemmas are
mostly derived from basic linear algebra and series.
The following lemma is a well-known result and provides an upper bound on the norm
of product matrices.

Lemma 22 Form two matrices A ∈ Rm×n and B ∈ Rn×p , we have,

kABkF ≤ kAk2 kBkF

and
kABk2 ≤ kAk2 kBk2

Proof The proof is straightforward and can be found in Sun and Luo (2016).

Lemma 23 For x, y ∈ [0, 1) such that x + y < 1, the following relation holds:
X 2x
xp1 +p2 ≤ . (75a)
1−x
p1 +p2 ≥1

69
Shi, Fattahi, and Al Kontar

Proof This proof follows from the direct calculation.

X ∞ X
X ∞
xp1 +p2 = xp1 +p2 − 1
p1 +p2 ≥1 p1 =0 p2 =0
  
∞ ∞
X X 1
= xp 1   xp 2  − 1 = −1
(1 − x)2
p1 =0 p2 =0

2x − x2 2x
= 2
≤ .
(1 − x) 1−x

We also present a lemma related to the Schur complement of block matrices.

Lemma 24 For symmetric matrices A0 ∈ Rr0 ×r0 , A1 ∈ Rr1 ×r1 , · · · , AN ∈ Rr1 ×r1 , and
Bi ∈ Rr0 ×rN for i ∈ {1, · · · , N }, we can construct a symmetric block matrix C as,
 
A0 B1 B2 · · · BN
 BT A1 0 ··· 0 
 1 
 T ··· 0 
C =  B2 0 A2 . (76)
 .. .. . . 
 . . . ··· 0 
BTN 0 0 · · · AN

Then,
PN C is−1positive definite if and only if A1 , A2 , · · · , AN are positive definite and A0 −
B A B T is positive definite.
i=1 i i i

Proof Sicne Ai ’s are positive definite, they are invertible. Thus we can decompose C as

I B1 A−1 B2 A−1 · · · BN A−1


 
1 2 A0 − i Bi A−1 BTi
N
 P 
i 0 ··· 0
0 I 0 ··· 0 
 0 A1 · · · 0 
C=
0 0 I · · · 0  

.. .. . .


 .. .. .. . .   . . . 0 
. . . . 0 
0 0 · · · AN
0 0 0 ··· I | {z }
| {z } C2
C1
 
I 0 0 ··· 0
A−1 BT I 0 ··· 0
 1−1 1 
T 0 I ···
· A2 B2 0.

 .. .. .. . . 
 . . . . 0
ATN BTN 0 0 ··· I
| {z }
CT
1

70
TCMF

On the right hand side, C1 and CT1 are both invertible. Thus, C is positive definite if and
only if C2 is positive definite. Since C2 is a block diagonal matrix, we prove the statement
in the lemma.

The following lemma provides an eigenvalue lower bound on the product of three matrices.

Lemma 25 For matrix A ∈ Rn×n , and symmetric positive semidefinite matrix B ∈ Rn×n ,
we know that,

λmin AT BA ≥ λmin (B) λmin AT A .


 

Proof The proof follows from the Courant–Fischer–Weyl variational principle.

λmin AT BA = min vT AT BAv



kvk=1

≥ λmin (B) min kAvk2


kvk=1

= λmin (B) λmin AT A .




We finally present the lemma that provides an upper bound of the operator norm of
block matrices.

Lemma 26 For a symmetric block matrix C defined as


 
A1 B12 B13 · · · B1N
 BT · · · B2N 
 12 A2 B23 
 T T · · · B3N 
C =  B13 B23 A3 , (77)
 .. .. .. .. .. 
 . . . . . 
BT1N BT2N BT3N · · · AN

where Ai ’s are symmetric, we have,


s X
kCk ≤ max {kAi k} + 2 kBij k2 . (78)
i=1,··· ,N
i<j

Note: in (78), the diagonal blocks and off-diagonal blocks are treated differently.
Proof We first prove for the special case where Bij = 0. In this case,

kCk2 = max kCvk2


kvk=1
N
X N
X
= max kAi vi k2 ≤ max kAi k2 kvi k2
kvk=1 kvk=1
i=1 i=1

71
Shi, Fattahi, and Al Kontar

N
X
≤ max {kAi k2 } × kvi k2 = max {kAi k2 }.
i=1,··· ,N i=1,··· ,N
i=1

We then prove for the special case where Ai = 0. We have


2
N
X X
kCk2 = max kCvk2 = max Bij vj
kvk=1 kvk=1
i=1 j6=i
N X
X N X
X
= max vkT BTik Bij vj ≤ max kvk k kBik k kBij k kvj k
kvk=1 kvk=1
i=1 j,k6=i i=1 j,k6=i
 2   
N
X X N
X X X
= max  kBij k kvj k ≤ max  kBij k2   kvj k2 
kvk=1 kvk=1
i=1 j6=i i=1 j6=i j
X
=2 kBij k2 ,
i<j

where we used Cauchy-Schwarz inequality in the second inequality.


By applying triangle inequality of the matrix operator norm, we can combine the upper
bounds derived from two special cases and obtain (78).

References
M. Aharon, M. Elad, and A. Bruckstein. K-svd: An algorithm for designing overcomplete
dictionaries for sparse representation. IEEE Transactions on signal processing, 54(11):
4311–4322, 2006.

D. P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48


(3):334–334, 1997.

R. Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.

T. Bouwmans and E. H. Zahzah. Robust pca via principal component pursuit: A review for a
comparative evaluation in video surveillance. Computer Vision and Image Understanding,
122:22–34, 2014. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2013.11.009. URL
https://www.sciencedirect.com/science/article/pii/S1077314213002294.

E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal
of the ACM (JACM), 58(3):1–37, 2011.

D. Chai, L. Wang, K. Chen, and Q. Yang. Secure federated matrix factorization. IEEE
Intelligent Systems, 36(5):11–20, 2021. doi: 10.1109/MIS.2020.3014880.

V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM computing


surveys (CSUR), 41(3):1–58, 2009.

72
TCMF

V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence


for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

X. Chen, B. Zhang, T. Wang, A. Bonni, and G. Zhao. Robust principal component analysis
for accurate outlier sample detection in rna-seq data. Bmc Bioinformatics, 21(1):1–20,
2020.

Y. Chen, J. Fan, C. Ma, and Y. Yan. Bridging convex and nonconvex optimization in robust
pca: Noise, outliers, and missing data. Annals of statistics, 49(5):2948, 2021.

J. Fan, W. Wang, and Y. Zhong. An l-infinity eigenvector perturbation bound and its
application to robust covariance estimation. Journal of Machine Learning Research, 18
(207):1–42, 2018.

S. Fattahi and S. Sojoudi. Exact guarantees on the absence of spurious local minima for
non-negative rank-1 robust principal component analysis. Journal of machine learning
research, 2020.

Q. Feng, M. Jiang, J. Hannig, and J. Marron. Angle-based joint and individual variation
explained. Journal of multivariate analysis, 166:241–265, 2018.

I. Gaynanova and G. Li. Structural learning and integrative decomposition of multi-view


data. Biometrics, 75(4):1121–1132, 2019.

R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems:
A unified geometric analysis. In International Conference on Machine Learning, pages
1233–1242. PMLR, 2017.

R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization


with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 69–77, 2011.

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and


A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational
framework. In International Conference on Learning Representations, 2017. URL https:
//openreview.net/forum?id=Sy2fzU9gl.

H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal


of Educational Psychology, 24:417–441, 1933. doi: http://dx.doi.org/10.1037/h0071325.

D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions.
IEEE Transactions on Information Theory, 57(11):7221–7234, 2011.

N. Jin, S. Zhou, and T.-S. Chang. Identification of impacting factors of surface defects in hot
rolling processes using multi-level regression analysis. Society of Manufacturing Engineers
Southfield, MI, USA, 2000.

R. Kashyap, R. Kong, S. Bhattacharjee, J. Li, J. Zhou, and B. T. Yeo. Individual-specific


fmri-subspaces improve functional connectivity prediction of behavior. NeuroImage, 189:
804–812, 2019.

73
Shi, Fattahi, and Al Kontar

Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems.
Computer, 42(8):30–37, 2009. doi: 10.1109/MC.2009.263.

H. Lee and S. Choi. Group nonnegative matrix factorization for eeg classification. In
D. van Dyk and M. Welling, editors, Proceedings of the Twelth International Conference
on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning
Research, pages 320–327, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA,
16–18 Apr 2009. PMLR. URL https://proceedings.mlr.press/v5/lee09a.html.

X. Li, J. Haupt, J. Lu, Z. Wang, R. Arora, H. Liu, and T. Zhao. Symmetry. saddle points,
and global optimization landscape of nonconvex matrix factorization. In 2018 Information
Theory and Applications Workshop (ITA), pages 1–9, 2018. doi: 10.1109/ITA.2018.8503215.

X. Li, S. Wang, and Y. Cai. Tutorial: Complexity analysis of singular value decomposition
and its variants. arXiv preprint arXiv:1906.12085, 2019.

G. Liang, N. Shi, R. A. Kontar, and S. Fattahi. Personalized dictionary learning for


heterogeneous datasets. In Advances in Neural Information Processing Systems, 2023.

E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel. Joint and individual variation


explained (jive) for integrated analysis of multiple data types. The annals of applied
statistics, 7(1):523, 2013.

D. Meng and F. De La Torre. Robust matrix factorization with unknown noise. In Proceedings
of the IEEE International Conference on Computer Vision, pages 1337–1344, 2013.

P. Netrapalli, N. UN, S. Sanghavi, A. Anandkumar, and P. Jain. Non-convex robust pca.


Advances in neural information processing systems, 27, 2014.

Y. Panagakis, M. A. Nicolaou, S. Zafeiriou, and M. Pantic. Robust correlated and individual


component analysis. IEEE transactions on pattern analysis and machine intelligence, 38
(8):1665–1678, 2015.

D. Park, A. Kyrillidis, C. Carmanis, and S. Sanghavi. Non-square matrix sensing without


spurious local minima via the Burer-Monteiro approach. In A. Singh and J. Zhu, editors,
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics,
volume 54 of Proceedings of Machine Learning Research, pages 65–74. PMLR, 20–22 Apr
2017. URL https://proceedings.mlr.press/v54/park17a.html.

J. Y. Park and E. F. Lock. Integrative factorization of bidimensionally linked matrices.


Biometrics, 76(1):61–74, 2020.

E. Ponzi, M. Thoresen, and A. Ghosh. Rajive: Robust angle based jive for integrating noisy
multi-source data. arXiv preprint arXiv:2101.09110, 2021.

A. Rinaldo. Davis-kahan theorem. Advanced Statistical Theory I, 2017.

C. Sagonas, Y. Panagakis, A. Leidinger, and S. Zafeiriou. Robust joint and individual


variance explained. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 5267–5276, 2017.

74
TCMF

B. Shen, W. Xie, and Z. J. Kong. Smooth robust tensor completion for back-
ground/foreground separation with missing pixels: Novel algorithm with convergence
guarantee. Journal of Machine Learning Research, 23(217):1–40, 2022.

N. Shi and R. A. Kontar. Personalized pca: Decoupling shared and unique features. Journal
of Machine Learning Research, 25(41):1–82, 2024. URL http://jmlr.org/papers/v25/
22-0810.html.

N. Shi, R. A. Kontar, and S. Fattahi. Heterogeneous matrix factorization: When features


differ by datasets. arXiv preprint arXiv:2305.17744, 2023.

R. Sun and Z.-Q. Luo. Guaranteed matrix completion via non-convex factorization. IEEE
Transactions on Information Theory, 62(11):6535–6579, 2016.

Y. L. Tan, V. Sehgal, and H. H. Shahri. Sensoclean: Handling noisy and incomplete data in
sensor networks using modeling. Main, pages 1–18, 2005.

T. Tao. 254a, notes 3a: Eigenvalues and sums of hermi-


tian matrices. https://terrytao.wordpress.com/2010/01/12/
254a-notes-3a-eigenvalues-and-sums-of-hermitian-matrices/, 2010. Accessed:
2022-03-01.

S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank solutions of


linear matrix equations via procrustes flow. In M. F. Balcan and K. Q. Weinberger, editors,
Proceedings of The 33rd International Conference on Machine Learning, volume 48 of
Proceedings of Machine Learning Research, pages 964–973, New York, New York, USA,
20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/tu16.html.

A. Vacavant, T. Chateau, A. Wilhelm, and L. Lequievre. A benchmark dataset for outdoor


foreground/background extraction. In Computer Vision-ACCV 2012 Workshops: ACCV
2012 International Workshops, Daejeon, Korea, November 5-6, 2012, Revised Selected
Papers, Part I 11, pages 291–300. Springer, 2013.

N. Vaswani, T. Bouwmans, S. Javed, and P. Narayanamurthy. Robust subspace learning:


Robust pca, robust subspace tracking, and robust subspace recovery. IEEE Signal
Processing Magazine, 35(4):32–55, 2018. doi: 10.1109/MSP.2018.2826566.

R. K. Wong and T. C. Lee. Matrix completion with noisy entries and outliers. The Journal
of Machine Learning Research, 18(1):5404–5428, 2017.

J. Wright and Y. Ma. High-dimensional data analysis with low-dimensional models: Princi-
ples, computation, and applications. Cambridge University Press, 2022.

W. Xiao, X. Huang, F. He, J. Silva, S. Emrani, and A. Chaudhuri. Online robust principal
component analysis with change point detection. IEEE Transactions on Multimedia, 22
(1):59–68, 2020. doi: 10.1109/TMM.2019.2923097.

H. Yan, K. Paynabar, and J. Shi. Real-time monitoring of high-dimensional functional data


streams via spatio-temporal smooth sparse decomposition. Technometrics, 60(2):181–197,
2018.

75
Shi, Fattahi, and Al Kontar

Z. Yang and G. Michailidis. A non-negative matrix factorization method for detecting


modules in heterogeneous omics multi-modal data. Bioinformatics, 32(1):1–8, 2016.

T. Ye and S. S. Du. Global convergence of gradient descent for asymmetric low-rank matrix
factorization. Advances in Neural Information Processing Systems, 34:1429–1439, 2021.

L. Zhang, H. Shen, and J. Z. Huang. Robust regularized singular value decomposition with
application to mortality data. The Annals of Applied Statistics, pages 1540–1561, 2013.

G. Zhou, A. Cichocki, Y. Zhang, and D. P. Mandic. Group component analysis for multiblock
data: Common and individual feature extraction. IEEE transactions on neural networks
and learning systems, 27(11):2426–2439, 2015.

76

You might also like