
Fundamental Computational Limits in Pursuing Invariant Causal Prediction and Invariance-Guided Regularization

Yihong Gu1, Cong Fang2, Yang Xu2, Zijian Guo3, Jianqing Fan1
1 Princeton University, 2 Peking University, and 3 Rutgers University

arXiv:2501.17354v1 [math.ST] 29 Jan 2025

Abstract
Pursuing invariant prediction from heterogeneous environments opens the door to learning
causality in a purely data-driven way and has several applications in causal discovery and robust
transfer learning. However, existing methods such as ICP (Peters et al., 2016) and EILLS
(Fan et al., 2024) that can attain sample-efficient estimation are based on exponential time
algorithms. In this paper, we show that such a problem is intrinsically hard in computation:
the decision problem, testing whether a non-trivial prediction-invariant solution exists across two
environments, is NP-hard even for the linear causal relationship. In the world where P̸=NP, our
results imply that the estimation error rate of any computationally efficient algorithm can be arbitrarily slow. This suggests that pursuing causality is fundamentally harder than detecting associations when no prior assumptions are imposed.
Given there is almost no hope of computational improvement under the worst case, this paper
proposes a method capable of attaining both computationally and statistically efficient estimation
under additional conditions. Furthermore, our estimator is a distributionally robust estimator
with an ellipsoid-shaped uncertainty set where more uncertainty is placed on spurious directions than
invariant directions, resulting in a smooth interpolation between the most predictive solution
and the causal solution by varying the invariance hyper-parameter. Non-asymptotic results and
empirical applications support the claim.
Keywords: Causality, Distributional Robustness, Invariant Prediction, Maximin Effects, NP-hardness,
Parsimonious Reduction.

1 Introduction
How do humans deduce the cause of a target variable from a set of candidate variables when only passive
observations are available? A natural high-level principle is to identify the variables that produce consistent
predictions at different times, locations, experimental conditions, or more generally, across various environments.
This heuristic is implemented in statistical learning by seeking invariant predictions from diverse environments
(Peters et al., 2016; Heinze-Deml et al., 2018; Fan et al., 2024; Gu et al., 2024). This approach goes beyond
just learning associations in the recognition hierarchy (Bareinboim et al., 2022) and enables the discovery of
certain data-driven causal relationships without prior causal assumptions. However, existing methods that
realize general invariance learning rely on explicit or implicit exhaustive searches, which are computationally
inefficient. This raises the question of whether learning invariant predictions is fundamentally hard. This
paper contributes to understanding these fundamental limits and introduces a novel relaxed estimator for
invariance learning. Theoretically, we prove this problem is intrinsically hard using a reduction argument
(Karp, 1972) with novel constructions. Our theoretical message further implies that learning data-driven
causality is fundamentally harder than detecting associations. On the methodological side, we propose a
relaxation in two aspects: our approach balances computational efficiency and statistical accuracy on one
hand while optimizing trade-offs between prediction power and robustness on the other.
∗ Supported by NSF Grants DMS-2210833 and DMS-2412029.

1.1 Pursuit of Linear Invariant Predictions
Suppose we are interested in pursuing the linear invariant relationship between the response variable Y ∈ R and explanatory covariate X ∈ R^d using data from multiple sources/environments. Let E be the set of environments. For each environment e ∈ E, we observe n data {(X_i^(e), Y_i^(e))}_{i=1}^n that are i.i.d. drawn from some distribution (X^(e), Y^(e)) ∼ µ^(e) satisfying

Y^(e) = (β⋆_{S⋆})^⊤ X^(e)_{S⋆} + ε^(e)   with   E[X^(e)_{S⋆} ε^(e)] ≡ 0,    (1.1)

where β⋆ is the true parameter that is invariant across different environments and S⋆ = supp(β⋆) denotes the support of β⋆, while the distribution µ^(e) may vary across environments. Here we assume different environments have the same sample size n for presentation simplicity. The goal is to recover S⋆ and β⋆ using the observed data D_E = {(X_i^(e), Y_i^(e))}_{i∈[n], e∈E}.
Methods inferring the invariant set S ⋆ from (1.1) can be applied to causal discovery under the structural
causal model (SCM) (Glymour et al., 2016) framework. This is because when observing environments
where interventions are applied within the covariates X, S⋆ = {j : Xj is a direct cause of Y} satisfies (1.1) and is unique in a certain sense when the interventions are non-degenerate and sufficiently rich (Peters et al., 2016; Gu
et al., 2024); see the discussion in Section 1.4. Though initially motivated by causal discovery under the SCM
framework that may be sensitive to model misspecification, pursuing invariant predictions from heterogeneous
environments itself is a much more generic principle in statistical learning, or a type of inductive bias in
causality (Bühlmann, 2020; Gu et al., 2024), that can also facilitate, for example, robust transfer learning
(Rojas-Carulla et al., 2018), prediction fairness among sub-populations (Hébert-Johnson et al., 2018), and
out-of-distribution generalization (Arjovsky et al., 2019).
Unlike the standard linear regression under which each variable Xj is either truly important (j ∈ S ⋆ ) or
exogenously spurious (Fan & Zhou, 2016) (j ∈ / S ⋆ but E[Xj ε] = E[Xj (Y − X ⊤ β ⋆ )] = 0), the set of candidate
variables in (1.1) can be decomposed into three groups:
{1, . . . , d} = S⋆ ∪ {j ∉ S⋆ : Cov^E(ε, Xj) ≠ 0} ∪ {j ∉ S⋆ : Cov^E(ε, Xj) = 0},    (1.2)

where the last two groups together constitute the spurious variables (S⋆)^c: we refer to G := {j ∉ S⋆ : Cov^E(ε, Xj) ≠ 0} as the endogenously spurious variables and to {j ∉ S⋆ : Cov^E(ε, Xj) = 0} as the exogenously spurious variables. Here Cov^E(ε, Xj) := (1/|E|) Σ_{e∈E} E[ε^(e) X_j^(e)] is the pooled covariance between the noise and the covariate
Xj across different environments. The major difference compared with standard linear regression and the
main difficulty behind such an estimation problem is the presence of endogenously spurious variables G
(Fan & Liao, 2014). An exogenously spurious variable is one that lacks predictive power for the noise ε = Y − X_{S⋆}^⊤ β⋆_{S⋆} at the population level and only increases the estimation error by n^{−1/2} if it is falsely included. It usually does not bias the estimation but slightly inflates the variance. In contrast, endogenously spurious variables contribute to predicting the noise; thus, the false inclusion of any such variable results in inconsistent estimation due to the biases they create. An illustrative example is to classify whether the object in an image is a cow (Y = 0) or camel (Y = 1) using three extracted features X1 = body shape, X2 = background color, and X3 = the temperature or time at which the photo was taken. Here S⋆ = {1} is the invariant and causal feature, while G = {2} helps predict the noise ε, since cows (resp. camels) usually appear on green grass (resp. yellow sand) in the data collected. X3 is exogenously spurious: including it does not increase estimation bias but slightly inflates the variance. From a statistical viewpoint, the core difficulty is to distinguish
whether a variable is truly important, or endogenously spurious among those statistically significant variables
that contribute to predicting Y. This is where multi-environment data come into play. There is a considerable
literature on estimating the parameter β ⋆ in (1.1) (Peters et al., 2016; Rothenhäusler et al., 2019, 2021;
Pfister et al., 2019; Arjovsky et al., 2019; Yin et al., 2021).
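To make the distinction concrete, the following small simulation is a minimal sketch of the phenomenon (the coefficients, sample size, and variable names are illustrative assumptions of ours, not taken from the paper): including an endogenously spurious variable biases the least squares coefficient of the causal variable even with huge n, whereas including an exogenously spurious one only adds a vanishing amount of variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                           # large n: the bias below is not a finite-sample artifact

# Illustrative data-generating process: X1 is causal, X2 is a child of Y
# (endogenously spurious), X3 is independent noise (exogenously spurious).
x1 = rng.normal(size=n)
eps = rng.normal(size=n)
y = 2.0 * x1 + eps                    # true model: coefficient 2 on X1 only
x2 = 0.8 * y + rng.normal(size=n)     # correlated with the noise eps through Y
x3 = rng.normal(size=n)               # uncorrelated with eps

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.solve(X.T @ X, X.T @ y)

print("X1 only         :", ols(np.column_stack([x1]), y))       # ~ [2.0]
print("X1 + X3 (exog.) :", ols(np.column_stack([x1, x3]), y))   # ~ [2.0, 0.0]
print("X1 + X2 (endog.):", ols(np.column_stack([x1, x2]), y))   # X1 coefficient biased away from 2
```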
Fan et al. (2024) first realized sample-efficient estimation for the general model (1.1) and offered a
comprehensive non-asymptotic analysis in terms of both n and E; this idea is further extended to the
fully non-parametric setting in Gu et al. (2024). Specifically, it shows that given data from finitely many environments |E| < ∞, one can identify S⋆ with n = ∞ under the minimal identification condition:

∀S ⊆ [d] with S ∩ G ≠ ∅  =⇒  ∃ e, e′ ∈ E, β^(e,S) ≠ β^(e′,S)    (1.3)

where β^(e,S) := argmin_{supp(β)⊆S} E_{(X,Y)∼µ^(e)}[|Y − β^⊤X|²]. This requires that S⋆ is the maximum set that
preserves the invariance structure in that incorporating any endogenously spurious variables in G will result
in shifts in predictions across E. Turning to the empirical counterpart, the optimal rate for linear regression
can be attained therein using their proposed environment invariant linear least squares (EILLS) estimator.
This implies that as long as β ⋆ can be identified under finitely many environments, unveiling the data-
driven causality parameter β ⋆ in (1.1) is as statistically efficient as estimating the association counterpart
in standard linear regression.
Promising though the above progress is, the invariance pursuit procedure has two drawbacks. The first is
about the computational burden. The estimation error is only guaranteed for the global minimizer of the
objective function in Fan et al. (2024) and Gu et al. (2024). An exponential-in-d algorithm is adopted to
find the global minimizer of the objective function that Fan et al. (2024) proposes. Though the Gumbel trick
introduced by Gu et al. (2024) allows variants of the gradient descent algorithm to perform well in practice, the nonconvex nature remains and there are no theoretical guarantees on the optimization.
The second is that the invariant model is typically conservative in its predictive performance for a
new environment. Though it finds the “maximum” invariant set, the invariant prediction model will
eliminate the endogenously spurious variables that result in heterogeneous predictions in E. This may lead to conservative predictions relative to models that exploit the endogenous variables: such conservativeness is optimal for an adversarial environment but not necessarily for the prediction environment of interest. In the aforementioned
cow-camel classification task, suppose a fraction r1 = 95% of cows (resp. camels) appear on grass (resp. sand) in the first environment e = 1 and the spurious ratio is r2 = 70% in environment e = 2. In this case, an invariant prediction model drops the background color X2 due to its variability across environments. In general, a prediction model without X2 is intuitively the best when the spurious ratio r of the evaluation environment is 0%, yet it potentially has less predictive power than models including X2 when evaluated in an environment with r > 50%.
The above discussion gives rise naturally to the following two questions, which will be addressed in this
paper.

Q1. Can statistically efficient estimation of β⋆ in (1.1) be attained by computationally efficient algorithms in general? If not, can it be attained under some additional conditions?

Q2. Can we benefit from designing methods that smoothly “interpolate” between the estimators for the invariant causal model β⋆ and the most predictive solution β̄ := argmin_β Σ_{e∈E} E_{(X,Y)∼µ^(e)}[|Y − β^⊤X|²]?

1.2 Computational Barrier


The main theoretical message this paper delivers is: the problem of finding invariant solutions is intrinsically
hard. In the following, we introduce a decision problem whose computational complexity is fundamentally equivalent to that of causal invariance learning. Denote the boolean operators AND, OR and NOT by ∧, ∨ and ¬, respectively. Consider the two questions below.

A. What does the formula below evaluate to? (a) True (b) False

(True ∧ True) ∨ False ∧ (True ∨ ¬True) ∧ (¬False ∨ ¬(True ∧ True)) (1.4)

B. Can we choose v1, v2, v3, v4 in {True, False} so that the formula evaluates to True? (a)
Yes (b) No

(v1 ∧ v4 ) ∨ v2 ∧ (v3 ∨ ¬v4 ) ∧ (¬v2 ∨ ¬(v1 ∧ v3 )) (1.5)

The latter question is an instance of the circuit satisfiability (CircuitSAT) problem (Karp, 1972). The answers to both questions are (a), and the substitution in (1.4) exhibits a valid solution to (1.5), namely (v1, v2, v3, v4) = (True, False, True, True).
From an intuitive perspective, we argue that the relationship between “finding the best linear predictor”
and “finding any non-trivial invariant (causal) prediction” shares some similarities with the relationship
between the two questions posed above. While both problems in each pair involve the same setting, that is, a “boolean formula” for the second pair and a “linear model” for the first pair, and may potentially yield the same solution, their computational complexities and their places in the recognition hierarchy differ significantly. The former tasks only involve simple arithmetic calculations, are conceptually straightforward, and can be solved quickly. In contrast, the latter tasks suffer from seemingly inevitable brute-force attempts, require complicated reasoning, and necessitate a potentially much larger time budget. The latter tasks involve reasoning using the information extracted from the corresponding former perception tasks.
Formally, consider the testing problem ExistsLIS-2 using population-level quantities.
Problem 1.1 (ExistsLIS-2). Consider the case of |E| = 2. Given the positive definite covariance matrices
Σ(1) , Σ(2) ∈ Rd×d with Σ(e) = E[X (e) (X (e) )⊤ ] and the covariance vectors u(1) , u(2) ∈ Rd with u(e) =
E[X (e) Y (e) ], it asks whether it is possible to find a non-empty prediction-invariant set S ⊆ [d] such that
β^(1,S) = β^(2,S) ≠ 0. Here β^(e,S) is defined in (1.3) and can be arithmetically calculated as β^(e,S) = [(Σ^(e)_S)^{-1} u^(e)_S, 0_{S^c}] provided Σ^(e) is positive definite and thus invertible.
Problem 1.1 simplifies the original linear invariance pursuit problem, i.e., estimating β ⋆ or S ⋆ in (1.1),
in several aspects: we consider only two heterogeneous environments, the minimal number needed to identify β⋆ when G ≠ ∅, and it only checks the existence of a solution.
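Absent further assumptions, the natural population-level procedure is an exhaustive search over subsets. The sketch below (our own illustration; the function and variable names are ours, not from the paper) decides ExistsLIS-2 by enumerating every non-empty S ⊆ [d] and comparing β^(1,S) with β^(2,S), which takes Θ(2^d) iterations — exactly the exponential-time behaviour that the NP-hardness result of Section 2 says cannot be avoided in general.

```python
import itertools
import numpy as np

def beta_on_subset(Sigma, u, S):
    """Population least squares restricted to support S: [(Sigma_S)^{-1} u_S, 0_{S^c}]."""
    d = Sigma.shape[0]
    beta = np.zeros(d)
    S = list(S)
    beta[S] = np.linalg.solve(Sigma[np.ix_(S, S)], u[S])
    return beta

def exists_lis_2(Sigma1, u1, Sigma2, u2, tol=1e-9):
    """Brute-force decision for ExistsLIS-2: is there a non-empty S with
    beta^(1,S) = beta^(2,S) != 0?  Exponential in d."""
    d = Sigma1.shape[0]
    for size in range(1, d + 1):
        for S in itertools.combinations(range(d), size):
            b1 = beta_on_subset(Sigma1, u1, S)
            b2 = beta_on_subset(Sigma2, u2, S)
            if np.allclose(b1, b2, atol=tol) and np.linalg.norm(b1) > tol:
                return True, S
    return False, None
```

Each candidate S costs O(|S|³) for the linear solve, so the overall cost is dominated by the 2^d − 1 subsets.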
As the answer to Q1 in Section 1.1, this paper shows that the aforementioned simplified ExistsLIS-2 problem is NP-hard, i.e., essentially as hard as the problem CircuitSat, of which (1.5) is an instance. Furthermore, the NP-hardness is not due to the existence of exponentially many possible invariant solutions; it remains even when β⋆ is identifiable by (1.3). Many problems are classified as NP-hard; other examples include 3Sat, MaxClique, and Partition (Erickson, 2023). The Cook–Levin theorem (Karp, 1972)
states that if there exists a polynomial-time algorithm to solve any NP-hard problem, then P=NP, meaning all the N(ondeterministic-)P(olynomial-time) problems, whose solutions are verifiable in polynomial time, are P(olynomial-time) problems that are solvable in polynomial time. It is suspected, but is still a conjecture (Bovet et al., 1994; Fortnow, 2021), that P≠NP. This implies it is unlikely that there exists any polynomial-time algorithm for NP-hard problems. This paper proves the NP-hardness of the ExistsLIS-2 problem and an easier problem with constraint (1.3) by constructing a parsimonious polynomial-time reduction from the 3Sat problem, a restriction of CircuitSat, to our ExistsLIS-2 problem. See the formal definition of NP-hardness and
reduction in Section 2.
In many statistical problems, though attaining correct variable selection suffers from computational
barriers, it is possible to construct a computationally efficient and accurate estimator of the continuous
parameters of interest. For example, as a convex relaxation of L0 regularized least squares, L1 regularized least squares can attain an n^{−1/2} prediction error rate (Bickel et al., 2009) in general and match the optimal n^{−1} rate under the additional yet mild restricted eigenvalue (RE) condition (Candes & Tao, 2007)1.
On the other hand, compared with L0 (Zhang & Zhang, 2012) penalty, L1 penalty requires a much more
restrictive, usually impossible (Fan & Li, 2001; Zou, 2006), condition to attain variable selection consistency
(Zhao & Yu, 2006; Meinshausen & Bühlmann, 2006). It is natural to ask if obtaining a reasonable prediction
error using a computationally efficient algorithm is possible in finding invariant predictions. Our result also
says “No” if P̸=NP.
In summary, this paper proves that consistent variable selection and reasonable prediction error in
finding invariant predictions are NP-hard. In the world of P̸=NP, this establishes a dilemma between
computational and statistical tractability for the invariance pursuit problem, and such an impossibility
result has implications for several fields and questions.
1 The RE condition can be relaxed to the restricted strong convexity condition. In this case, if the covariate is zero-mean Gaussian (Raskutti et al., 2010) or sub-Gaussian (Rudelson & Zhou, 2013), the optimal estimation error can be obtained by L1 regularization when |supp(β⋆)| log p = o(n), provided the curvature is bounded from below, i.e., λmin(E[XX^⊤]) ≳ 1.

(a) It has long been hypothesized that there may exist some intrinsic computation barrier in finding
invariant solutions given that the problem has a combinatorial formulation and all the existing provable
sample-efficient methods use exhaustive search explicitly or implicitly. It has remained open whether finding an invariant solution is fundamentally hard or can be solved by a yet-to-be-discovered computationally efficient algorithm. We offer a definitive pessimistic answer to this question.
(b) Our established dilemma above shows that pursuing invariance is fundamentally harder than pursuing
sparsity. The latter can guarantee a decent prediction error using computationally efficient algorithms
under a mild assumption that does not hurt the generality of the problem, and the corresponding
estimation error keeps decreasing as n grows. However, these properties no longer apply to the former. Thus, the relaxation tricks used in sparsity pursuit, such as L1 regularization, may not be a good fit, and potentially new relaxation techniques should be introduced to pursue invariance.

1.3 Our Proposed Method


This paper proposes a simple method that answers question Q2 with “Yes” by achieving a better balance
between prediction power and invariance, while partially circumventing the computational barriers, as asked in the second part of Q1. Given data from environments E, the population-level estimator with n = ∞ is the
minimizer of the following objective function
β^{k,γ} = argmin_{β∈R^d} (1/|E|) Σ_{e∈E} E[|Y^(e) − β^⊤X^(e)|²] + γ Σ_{j=1}^d w_k^E(j) · |β_j|.

It regularizes the pooled least squares using a pre-calculated weighted L1 penalty, where the adaptive, data-driven weight w_k^E(j) on |β_j| is an upper bound on the prediction variation across environments E when incorporating variable X_j; see the details in Section 3. Here γ is the hyper-parameter that trades off predictive
power and robustness against spurious signals, and k is the hyper-parameter that controls the computation
budget through wkE (j). The key features of our proposed estimator are as follows.
(a) Regarding computation, our proposed estimator provably attains causal identification, i.e., β^{k,γ} = β⋆ for large enough γ, at an affordable computational cost (small k) under some unknown low-dimensional structure among the variables. On the other hand, by increasing the computation budget k to d, our proposal achieves causal identification under the same assumptions as those in EILLS (Fan et al., 2024).
(b) The estimator reaches the goal in Q2 by tuning γ. When causal identification is attained in (a), it leads to a continuous solution path interpolating between the pooled least squares solution at γ = 0 and the causal solution β⋆ for large enough γ. For any fixed γ, it has a certain distributional robustness
interpretation in that β k,γ can be represented as the maximin effects (Meinshausen & Bühlmann, 2015;
Guo, 2024) over some uncertainty set.

1.4 Related Works and Our Contribution


Peters et al. (2016) first considers (1.1) with more distributional constraints for causal discovery. To be
specific, they consider doing causal discovery that infers the direct cause of the target response Y , using data
under different environments (Didelez et al., 2012; Meinshausen et al., 2016), where in each environment,
some unknown interventions are applied to the variables other than Y. Under the modularity assumption (Schölkopf et al., 2012) in the SCM framework, also referred to as autonomy (Haavelmo, 1944; Aldrich, 1989) or stability (Dawid & Didelez, 2010), an intervention on Xj only changes the conditional distribution of Xj given all its direct causes; hence the conditional distribution of Y given all its direct causes remains the same across these different environments. This leads to the following distributional invariance structure if a linear model with exogenous noise is further assumed: Y^(e) = (β⋆_{S⋆})^⊤ X^(e)_{S⋆} + ε with ε ∼ F_ε ⊥⊥ X^(e)_{S⋆} and E[ε] = 0, where S⋆ is the set of direct causes of the target response Y. A hypothesis-test based method is
proposed in Peters et al. (2016) to guarantee P(Ŝ ⊆ S⋆) ≥ 1 − α. However, the set Ŝ_∞ it selects when n = ∞ will lie between ∅ and S⋆, i.e., ∅ ⊆ Ŝ_∞ ⊆ S⋆, and easily collapses to ∅ in most cases when the interventions are not rich enough. The idea of penalizing least squares using an exact invariance regularizer (Fan et al., 2024; Gu et al., 2024) will select a set Ŝ_∞ satisfying S⋆ ⊆ Ŝ_∞ ⊆ S̄ as n = ∞, where S̄ is the Markov blanket of Y, but it will eliminate any child of Y that is intervened on in a non-degenerate manner. Causal though the solution is, it may lack some predictive power under the circumstances discussed
before Q2. The estimator proposed in this paper leverages the invariance principle as an inductive bias for
“soft” regularization instead of that for “hard” structural equation estimation and can alleviate the lack of
predictive power in this aspect.
There are also attempts to attain both computationally and statistically efficient estimation under (1.1).
For example, Rothenhäusler et al. (2019, 2021) consider the case where the mechanism among all covariate and response variables (X, Y) remains unchanged and linear, while the heterogeneity across environments comes from additive interventions on X. Estimators similar to instrumental variable (IV) regression in causal identification are proposed. This idea is further extended (Kania & Wit, 2022; Shen et al., 2023), but cannot go beyond circumventing the computational barrier via assumptions similar to those of IV regression. This is conceptually the same as least squares that relies on prior untestable assumptions to pinpoint a unique solution and may suffer from model misspecification. Li & Zhang (2024) studies a similar model with one additional constraint – the covariance matrix of X_{S⋆} remains the same across environments. A seemingly computation-efficient variable selection method is proposed there. However, the additional constraint seems to be superfluous in that it
cannot change the NP-hardness of the problem; see Appendix B.1. Therefore, there is still a gap in attaining
sample-efficient estimation by computation-efficient algorithms under mild assumptions that will not ruin
the prior-knowledge blind nature of invariance pursuit. This paper makes progress in this direction.
There is also a considerable literature on robustifying prediction using the idea of distributionally robust
optimization, which finds a predictor that minimizes the worst-case risk over a set of distributions referred to as the uncertainty set. The uncertainty set is typically an (isotropic) ball in a postulated metric centered at the training distribution. Examples of pre-determined metrics include the KL divergence (Bagnell, 2005), f-divergence (Duchi & Namkoong, 2021) and Wasserstein distance (Mohajerin Esfahani & Kuhn, 2018; Blanchet et al., 2019). Such a postulated metric is uninformative, which leads to a relatively conservative
solution. Our estimator is a distributionally robust estimator with an ellipsoid-shaped uncertainty set. It
assigns minimal uncertainty to invariant (causal) directions while allocating greater uncertainty to spurious
directions, which balances the robustness and power in a better way.
The NP-hardness and the conjecture P̸=NP are used to derive computation barriers in many statistical
problems, mainly about detecting sparse low-dimensional structures in high-dimensional data. For the
sparse linear model, Huo & Ni (2007) shows finding the global minima of L0 penalized least squares is
NP-hard, Chen et al. (2014) shows the NP-hardness holds for any Lq loss and Lp penalty with q ≥ 1 and
p ∈ [0, 1), and Chen et al. (2017) extends it to general convex loss and concave penalty. However, these are
computational barriers tailored to specific algorithms, not the fundamental limits of the problem itself. Zhang et al. (2014) shows that when P≠NP, in the absence of the restricted eigenvalue condition, no polynomial-time algorithm can attain an estimation error faster than n^{−1/2}, which is attained by L1 regularization but is sub-optimal compared with the optimal n^{−1} error. There is also a considerable literature on deriving the statistical sub-optimality of computationally efficient algorithms using reductions from the planted clique problem (Brennan & Bresler, 2019), for example sparse principal component analysis (Berthet & Rigollet, 2013a,b; Wang et al., 2016) and sparse submatrix recovery (Ma & Wu, 2015). However, in those problems a reasonable error is still attainable using
computationally efficient alternatives. As discussed above, this is not the case for pursuing invariance as
shown by this paper.
Our Contributions. The main contributions are as follows:
• We establish the fundamental computational limits of finding prediction-invariant solutions in linear
models, which is the first such result in the literature. Our proof is based on constructing a novel parsimonious reduction from the 3Sat problem to the ExistLIS problem.
• A simple estimator is proposed to relax the computational budget and exact invariance pursuit using
two hyper-parameters. It allows for provably computationally and statistically efficient estimation of the exact invariant (causal) parameters under mild additional assumptions and also offers flexibility in trading off efficiency and invariance (robustness).
Organization. This paper is organized as follows. In Section 2, we introduce the concept of NP-hardness
and present our main computation barrier result accompanied by the proofs. In Section 3, we propose our
method that relaxes the computation budget and conservativeness, illustrate its distributional robustness
interpretation, and present the corresponding non-asymptotic result. The proofs for the results in Section 3
are deferred to the supplementary material. Section 4 presents the real-world application.
Notations. We will use the following notations. Let X ∈ R^d, Y ∈ R be random variables and x, y be their instances, respectively. We let [m] = {1, . . . , m}. For a vector z = (z_1, . . . , z_m)^⊤ ∈ R^m, we let ∥z∥_q = (Σ_{j=1}^m |z_j|^q)^{1/q} with q ∈ [1, ∞) be its ℓ_q norm, and let ∥z∥_∞ = max_{j∈[m]} |z_j|. For a given index set S = {j_1, . . . , j_{|S|}} ⊆ [m] with j_1 < · · · < j_{|S|}, we denote [z]_S = (z_{j_1}, . . . , z_{j_{|S|}})^⊤ ∈ R^{|S|} and abbreviate it as z_S if there is no ambiguity. We use A ∈ R^{n×m} to denote an n by m matrix, use A_{S,T} = {a_{i,j}}_{i∈S,j∈T} to denote a sub-matrix, and abbreviate it as A_S if S = T and n = m. For a d-dimensional vector z and a d × d positive semi-definite matrix A, we let ∥z∥_A = √(z^⊤ A z), and let λ_min(A) (resp. λ_max(A)) be the minimum (resp. maximum) eigenvalue of A.
We collect data from multiple environments E. For each environment e ∈ E, we observe n data {(X_i^(e), Y_i^(e))}_{i=1}^n which are drawn i.i.d. from µ^(e). We denote E[f(X^(e), Y^(e))] = ∫ f(x, y) µ^(e)(dx, dy) and Ê[f(X^(e), Y^(e))] = (1/n) Σ_{i=1}^n f(X_i^(e), Y_i^(e)), and define

Σ^(e) = E[X^(e)(X^(e))^⊤],   u^(e) = E[X^(e) Y^(e)],   Σ = (1/|E|) Σ_{e∈E} Σ^(e),   u = (1/|E|) Σ_{e∈E} u^(e).    (1.6)

We assume there is no collinearity, i.e., Σ^(e) ≻ 0, such that we can define the population-level best linear predictor constrained on any set S in each environment e, β^(e,S) := argmin_{supp(β)⊆S} E[|Y^(e) − β^⊤X^(e)|²], and over all the environments, β^(S) := argmin_{supp(β)⊆S} Σ_{e∈E} E[|Y^(e) − β^⊤X^(e)|²]. Let the pooled least squares loss over all the environments be

R_E(β) = (1/(2|E|)) Σ_{e∈E} E[|Y^(e) − β^⊤X^(e)|²].    (1.7)

2 The Fundamental Limit of Computation


2.1 Preliminary: NP-hardness
We first introduce the idea of decision problem, NP-hardness, and reduction argument.
Definition 1 (Decision Problem). A decision problem P is a problem whose output is 1/0, meaning Yes/No.
Let x be an instance of the problem; we use |x| to denote the size of its input and use X_P to denote the set
of all the problem instances. We use Sx to denote the set of solutions for the problem instance x. We use
the notation x ∈ XP,1 if the answer to the instance x is 1(Yes). Clearly, we have x ∈ XP,1 ⇐⇒ |Sx | ≥ 1.
The particular decision problem that we consider is the 3Sat problem below.
Problem 2.1 (3Sat). Given a conjunctive normal form (CNF) ∧_{i=1}^k (l_{i,1} ∨ l_{i,2} ∨ l_{i,3}) of k clauses, where the literal l_{i,u} is either v_ℓ or ¬v_ℓ for some boolean variable v_ℓ ∈ {True, False} with ℓ ∈ [n], it asks if there exists an assignment of the variables such that the entire formula evaluates to True. The size of a problem instance is k. S_x is the set of assignments of (v_ℓ)_{ℓ=1}^n that make the formula True.
We now present an instance of the 3Sat problem.
Example 2.1 (An Instance of 3Sat Problem). Consider an instance x with k = 9 clauses, whose input is the CNF f = (v1 ∨ v2 ∨ v3) ∧ (v1 ∨ v2 ∨ ¬v3) ∧ (v1 ∨ ¬v2 ∨ v3) ∧ (v1 ∨ ¬v2 ∨ ¬v3) ∧ (¬v1 ∨ ¬v2 ∨ v3) ∧ (¬v1 ∨ v2 ∨ ¬v3) ∧ (¬v1 ∨ ¬v2 ∨ ¬v3) ∧ (¬v4 ∨ v4 ∨ v2) ∧ (¬v1 ∨ v2 ∨ v4) in n = 4 variables. It is easy to see that S_x = {(True, False, False, True)} and hence the answer to the above 3Sat instance is 1(Yes).
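The claim in Example 2.1 can be checked mechanically. The snippet below is a minimal sketch using our own encoding (each clause is a list of signed variable indices, positive for v_ℓ and negative for ¬v_ℓ); it enumerates all 2^4 assignments and confirms that (True, False, False, True) is the only satisfying one.

```python
from itertools import product

# CNF of Example 2.1: clause = list of literals, +j means v_j, -j means NOT v_j (1-indexed).
clauses = [[1, 2, 3], [1, 2, -3], [1, -2, 3], [1, -2, -3],
           [-1, -2, 3], [-1, 2, -3], [-1, -2, -3], [-4, 4, 2], [-1, 2, 4]]

def satisfies(assignment, clauses):
    """assignment: tuple of booleans for (v1, ..., v4)."""
    return all(any(assignment[abs(l) - 1] == (l > 0) for l in clause)
               for clause in clauses)

solutions = [v for v in product([True, False], repeat=4) if satisfies(v, clauses)]
print(solutions)   # [(True, False, False, True)]
```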

We also consider a potentially easier variant of 3Sat to be used in this section. The problem is potentially easier than 3Sat because it pursues the same target under an additional non-trivial restriction.
Problem 2.2 (3Sat-Unique). The 3Sat-Unique problem is the same as 3Sat under the promise that the solution is unique if it exists, i.e., X_{3Sat-Unique} = {x ∈ X_{3Sat} : |S_x| ≤ 1}.
We then introduce the idea of reduction and NP-hardness.
Definition 2 (Reduction). We say T : XP → XQ is a deterministic polynomial-time reduction from problem
P to problem Q if there exists some polynomial p such that for all x ∈ XP , (1) T (x) can be calculated on a
deterministic Turing machine with time complexity p(|x|); and (2) T (x) ∈ XQ,1 if and only if x ∈ XP,1 .
We say T : XP → XQ is a randomized polynomial-time reduction (Valiant & Vazirani, 1985) from problem
P to problem Q if there exists some polynomial p such that (1) T(x) can be calculated on a randomized (coin-flipping) Turing machine with computational complexity p(|x|) for any x ∈ X_P; (2) for all x ∈ X_P \ X_{P,1}, T(x) ∉ X_{Q,1}; (3) for all x ∈ X_{P,1}, P[T(x) ∈ X_{Q,1}] ≥ 1/p(|x|).
Definition 3 (NP-hardness). We say a problem P is NP-hard under deterministic (resp. randomized)
polynomial-time reduction if there exists deterministic (resp. randomized) polynomial-time reduction from
the circuit satisfiability problem (Karp, 1972) to problem P .
The NP-hardness of a problem is widely used to measure the existence of the underlying computational
barrier for the problem; examples in statistics include sparse PCA under particular regime (Berthet &
Rigollet, 2013a,b; Wang et al., 2016), sparse regression (Zhang et al., 2014) without restricted eigenvalue
condition. The underlying reason why an NP-hard problem P is “hard” can be illustrated via the Cook–
Levin theorem (Karp, 1972): the existence of any polynomial-time algorithm for the NP-hard problem under
deterministic polynomial-time reduction will assert P=NP, which implies any NP problem, defined as a problem whose solutions can be verified within polynomial time, can be solved within polynomial time. The NP-hardness under randomized polynomial-time reduction can be understood similarly: the existence of any polynomial-time algorithm for such a problem implies any NP problem can be solved within polynomial time with high probability, that is, for any NP decision problem P, we can design a polynomial-time randomized algorithm Ã such that

∀x ∈ X_P \ X_{P,1}, Ã(x) = 0   and   ∀x ∈ X_{P,1}, P[Ã(x) = 1] ≥ 1 − 0.01|x|^{−100}.

If the conjecture “P̸=NP” holds, then the NP-hardness of a problem naturally implies “there is no polynomial-
time algorithm for the problem”. We introduce the NP-hardness under randomized polynomial-time reduction
to characterize the computation barrier of the linear invariance pursuit under identification condition (1.3).
We have the following result for the above two problems.
Lemma 2.1. The problem 3Sat is NP-hard under deterministic polynomial-time reduction. The problem
3Sat-Unique is NP-hard under randomized polynomial-time reduction.
Proof of Lemma 2.1. The NP-hardness of 3Sat follows from Karp (1972); the proof of the NP-hardness of 3Sat-Unique can be found in Appendix A.4.

2.2 The Hardness of Population-level Linear Invariance Pursuit


When |E| = 2, we will show that finding a non-trivial invariant solution using population covariance matrices
has a computation barrier similar to the 3Sat problem. Moreover, even when β ⋆ and S ⋆ are identifiable, the
computation limit remains in a similar manner to the 3Sat-Unique problem. This claim can be rigorously
delivered in the following Theorem 2.1. Without loss of generality, we assume that X (e) and Y (e) are all
zero-mean random variables in each environment.
Problem 2.3 (Existence of Linear Prediction-Invariant Set). Let d ∈ N+ be the dimension of the explanatory covariate, and E be the number of environments. Let Σ^(1), . . . , Σ^(E) ∈ R^{d×d} be positive definite matrices representing the covariance matrices of X^(e), i.e., Σ^(e) = E[X^(e)(X^(e))^⊤], and u^(1), . . . , u^(E) be d-dimensional vectors representing the covariance between X^(e) and Y^(e), i.e., u^(e) = E[X^(e) Y^(e)]. In this case, the population-level least squares solutions can be written as β^(e,S) = [β^(e,S)_S, 0_{S^c}] with β^(e,S)_S = (Σ^(e)_S)^{-1} u^(e)_S and β^(S) = [β^(S)_S, 0_{S^c}] with β^(S)_S = (Σ_{e∈E} Σ^(e)_S)^{-1}(Σ_{e∈E} u^(e)_S).
We define the problem ExistLIS as follows:
[Input] Σ^(1), . . . , Σ^(E) and u^(1), . . . , u^(E) satisfying the above constraints.
[Output] Returns 1(Yes) if there exists S ⊆ [d] such that β^(e,S) ≡ β^(S) ≠ 0; otherwise 0(No).
We simplify the original problem, that is, unveiling S ⋆ in (1.1), when n = ∞ from two aspects in
Problem 2.3. Firstly, we only use the first-order linear information rather than the full distributional information, so that the input of the problem is of size O(d²) when |E| = O(1). The space of ExistLIS can be seen as a “linear projection” of the space of the problems of recovering S⋆ in (1.1) provided Σ^(e) ≻ 0. Secondly,
we state it as a decision problem rather than a solution-solving problem: it suffices to answer whether a
non-trivial invariant set exists instead of pursuing one. For simplicity in this section, we use the terminology
“invariant set” instead of “linear prediction-invariant set”. We define the concept of the maximum invariant
set to present the same problem under the identification condition (1.3).
Definition 4 (Invariant Set and Maximum Invariant Set). Under the setting of Problem 2.3, we say a set S̄ is an invariant set if β^(e,S̄) ≡ β^(S̄). We say a set S̄ is a maximum invariant set if it is an invariant set and satisfies

∀S ⊆ [d],   either   β^(S∪S̄) = β^(S̄)   or   sup_{e,e′∈[E]} ∥β^(e,S) − β^(e′,S)∥_2 > 0.    (2.1)

Problem 2.4 (Existence of Linear Invariant Set under Identification). Problem ExistLIS-Ident is defined
as the same problem as ExistLIS with the additional constraint that there exists a maximum invariant set
S†.
Note that S † can be an empty set, under which the corresponding problem instance does not have non-
trivial invariant solutions. Observe that the boolean formula (a∨b) is equivalent to the statement (if ¬a then
b). As required by (2.1), an invariant set S̄ is a maximum invariant set if incorporating any variable that
enhances the prediction performance will lead to shifts in best linear predictions. Therefore, the existence of
the maximum invariant set defined in Definition 4 is just a restatement of the identification condition (1.3),
that is, S̄ is a maximum invariant set if and only if S ⋆ = S̄ satisfies (1.1) and (1.3) simultaneously.
Problem 2.4 is an easier version of the problem of recovering S ⋆ in (1.1) with the identification constraint
(1.3) in population n = ∞. The following example gives an instance of the problem ExistLIS-Ident. This
example also indicates that the maximum invariant set may not be unique, but all the maximum invariant
sets yield the same prediction performance.
Example 2.2 (An Instance of ExistLIS-Ident Problem). Consider an instance with d = 4, E = 2 and input

Σ^(1) = [ 1, 0, 2/√15, 0 ; 0, 1, 1/√15, 0 ; 2/√15, 1/√15, 3/5, 0 ; 0, 0, 0, 1 ],
Σ^(2) = [ 1, 0, 2/√5, 0 ; 0, 1, 1/√5, 0 ; 2/√5, 1/√5, 7/5, 0 ; 0, 0, 0, 1 ],
u^(1) = (2, 1, 6/√15, 0)^⊤,   u^(2) = (2, 1, 6/√5, 0)^⊤.

It can be seen as a “linear projection” of the following data-generating process with e ∈ {1, 2} and independent standard normal random variables ε0, . . . , ε4:

X1^(e) ← ε1,   X2^(e) ← ε2,   X4^(e) ← ε4,
Y^(e) ← 2·X1^(e) + X2^(e) + ε0,
X3^(e) ← ((√3)^{e−2} Y^(e) + ε3)/√5.

It is easy to see that the sets ∅, {1}, {2}, {4}, {1, 2}, {1, 4}, {2, 4}, {1, 2, 4} are all invariant sets, while the sets {1, 2} and {1, 2, 4} are maximum invariant sets.
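These claims can be verified numerically. The sketch below is our own illustration: it derives the population covariances exactly from the stated SCM by representing each variable as a linear combination of the independent noises ε0, . . . , ε4, then enumerates all subsets and reports which are invariant.

```python
import itertools
import numpy as np

def population_moments(e):
    """Represent (X1, X2, X3, X4, Y) as linear maps of (eps0,...,eps4) ~ N(0, I5)
    for environment e in {1, 2}, following the SCM of Example 2.2."""
    eps = np.eye(5)                       # rows: eps0, eps1, eps2, eps3, eps4
    x1, x2, x4 = eps[1], eps[2], eps[4]
    y = 2 * x1 + x2 + eps[0]
    s = np.sqrt(3.0) ** (e - 2)           # (sqrt 3)^{e-2}: 1/sqrt(3) in e=1, 1 in e=2
    x3 = (s * y + eps[3]) / np.sqrt(5.0)
    X = np.stack([x1, x2, x3, x4])        # 4 x 5 coefficient matrix
    return X @ X.T, X @ y                 # Sigma^(e) = E[XX^T], u^(e) = E[XY]

(S1, u1), (S2, u2) = population_moments(1), population_moments(2)

def beta(Sig, u, S):
    b = np.zeros(4); S = list(S)
    b[S] = np.linalg.solve(Sig[np.ix_(S, S)], u[S])
    return b

for r in range(1, 5):
    for S in itertools.combinations(range(4), r):
        if np.allclose(beta(S1, u1, S), beta(S2, u2, S)):
            print("invariant set:", {j + 1 for j in S})
# prints {1}, {2}, {4}, {1, 2}, {1, 4}, {2, 4}, {1, 2, 4}; no set containing 3 is invariant
```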
From the perspective of a computational problem, the existence of a maximum invariant set offers non-
trivial constraints on the problem and one can construct a model where this condition fails to hold; see
Example 2.3 below. On the other hand, the non-existence of a maximum invariant set rarely happens under
the causal discovery setting. To be specific, under the setting of the structural causal model with intervention
on X, it is known from Theorem 3.1 in Gu et al. (2024) that a maximum invariant set always exists if the
intervention is non-degenerate, which occurs with probability 1 under suitable measure on the intervention.
Example 2.3 (An Instance of ExistLIS that is not ExistLIS-Ident). Consider the model of Example 4.1
in Fan et al. (2024) with s(1) = 1/2 and s(2) = 2, that is, the SCMs in environment e ∈ {1, 2} are
(e)

X1 ← 0.5ε1
(e) (e)

Y ← X1 + 0.5ε0
(e)
X2 ← 22e−3 Y (e) + ε2

where ε0, . . . , ε2 are i.i.d. standard Gaussian random variables. It is easy to check that the sets ∅, {1}, {2} are
all invariant sets but none of them satisfies the second constraint (2.1), and the set {1, 2} is not an invariant
set. So there does not exist a maximum invariant set.
Given XExistLIS-Ident ⊊ XExistLIS , ExistLIS may be potentially harder than ExistLIS-Ident. We
will establish NP-hardness of both ExistLIS and ExistLIS-Ident to rule out the possibility that the computational hardness is due to non-identifiability, or in other words, that the computational difficulty can be resolved when S⋆ is identifiable in (1.1) by (1.3).

Theorem 2.1. When E = 2, the problem ExistLIS is NP-hard under deterministic polynomial-time reduction; the problem ExistLIS-Ident is NP-hard under randomized polynomial-time reduction.
Theorem 2.1 states that there exist certain fundamental computational limits for the problem of pursuing a linear invariant prediction: the difficulties are intrinsic to the problem itself – there does not exist a polynomial-time algorithm to test whether a non-trivial invariant prediction exists in general if P≠NP.
Remark 1 (NP-hardness under More Restrictive Conditions). It is worth noting that the underlying computational barrier is attributed to the nature of the problem, i.e., pursuing invariance, instead of artificial and technical difficulties. Such a barrier remains for related models and for models under more restrictive conditions. Examples include (1) finding a prediction with a stronger invariance condition like
distributional invariance in Peters et al. (2016); (2) problems with row-wise sparse covariance matrices
where all the covariance matrices only have constant-level non-zero entries in each row; (3) problems with
well-separated heterogeneity in that the variations in prediction are large for all the non-invariant solutions.
See the rigorous statement and discussion in Appendix A.
In fact, we will show that a much easier problem with a fixed (Σ^(1), u^(1)) structure is NP-hard.

2.3 Proof of Theorem 2.1


The following lemma claims that we can construct a parsimonious polynomial-time reduction from the well-known problem 3Sat to our problem ExistLIS that preserves the number of solutions. Given an instance x of the 3Sat problem stated in Problem 2.1, we let S_x = {v ∈ {True, False}^n : v makes the formula True} be its set of solutions. Given an instance y of the ExistLIS problem with E = 2, we define its solution set S_y as the collection of all S ⊆ [d] satisfying β^(1,S) = β^(2,S) ≠ 0. We let k be the number of clauses in the instance x and d be the number of covariates in the instance y, and omit the dependency on x (resp. y) in k (resp. d) for presentation simplicity. For an integer m, we let 1_m be an m-dimensional vector with all entries being 1, and let I_m be an m × m identity matrix.
Unlike the standard reduction argument whose goal is to find a polynomial time reduction T : X3Sat →
XExistLIS such that 1{|Sx | > 0} = 1{|ST (x) | > 0}, we will construct a parsimonious reduction satisfying
|Sx | = |ST (x) |. This finer construction transfers the promise of the unique solution in 3Sat-Unique to the
promise of the identification in ExistLIS-Ident.
Lemma 2.2. We can construct a parsimonious polynomial-time reduction from 3Sat to ExistLIS: for each
instance x of problem 3Sat with input size k, we can transform it to y = T (x) of problem ExistLIS within
polynomial-time with d = 7k + 1 such that |Sy | = |Sx |.
Proof of Lemma 2.2. We construct the reduction as follows. Let x be any 3Sat instance with k clauses.
Without loss of generality, we assume that each variable has appeared at least once in some clause. For
each clause, we use action ID in {0, . . . , 7} to represent the assignment for it. For example, for the clause
v1 ∨ ¬v2 ∨ ¬v5 and the action ID 6 with binary representation 110 means we let v1 = True, ¬v2 = True
and ¬v5 = False. One will not adopt action ID 0 in a valid solution because a 3Sat valid solution should
let each clause evaluate to True. For arbitrary i, i′ ∈ [k] and t, t′ ∈ [7], we say the action ID t in clause i
contradicts the action ID t′ in clause i′ if and only if t will assign a boolean variable to be True (resp. False)
while t′ will assign the same boolean variable to be False (resp. True). In the proof, we use i, i′ to represent
the index in [k], and use j, j ′ to represent the index in [d].
We construct the problem y as follows: we set d = 7k + 1, and use fixed first environment (Σ(1) , u(1) ) =
(Id , 1d ). For the second environment, we pick

Σ^(2) = [ 5d·I_{7k} + A ,  (1/2)·1_{7k} ;  (1/2)·1_{7k}^⊤ ,  5d ]   and   u^(2) = [ (5d + 1/2)·1_{7k} ;  5d + k/2 ],

where the 7k by 7k symmetric matrix A is defined as

i = i′ and t = t′


 1{t contradicts itself}
i = i′ and t ≠ t′

1
A7(i−1)+t,7(i′ −1)+t′ = (2.2)


 1 i ̸= i′ and t contradicts t′
0 otherwise

for any i, i′ ∈ [k] and t, t′ ∈ [7]. It is easy to verify that Σ(1) and Σ(2) are all positive definite matrices and it
is a deterministic polynomial-time reduction. Indeed, one has λmin (Σ(e) ) ≥ 1 for any e ∈ [2]. By definition,
S ∈ Sy if and only if β (1,S) = β (2,S) with |S| ≥ 1.
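To make the construction concrete, the following sketch (not part of the proof; the encoding and names are our own) builds (Σ^(1), u^(1), Σ^(2), u^(2)) from a 3Sat instance given as a list of clauses, each clause being three signed variable indices. Action IDs t ∈ [7] are encoded by their binary representation (b1, b2, b3), with b_u = 1 meaning literal l_{i,u} is set to True.

```python
import numpy as np

def action_literals(clause, t):
    """Literal-level assignment of action ID t in a clause: a list of (variable, value)
    pairs, where bit b_u = 1 means literal l_{i,u} is set to True."""
    bits = [(t >> 2) & 1, (t >> 1) & 1, t & 1]
    return [(abs(lit), bool(b) if lit > 0 else not bool(b)) for lit, b in zip(clause, bits)]

def contradicts(pairs_a, pairs_b):
    """True iff some variable is assigned True in one list and False in the other."""
    return any(va == vb and xa != xb for va, xa in pairs_a for vb, xb in pairs_b)

def reduce_3sat_to_existlis(clauses):
    """Parsimonious reduction of Lemma 2.2: returns (Sigma1, u1, Sigma2, u2), d = 7k+1."""
    k, d = len(clauses), 7 * len(clauses) + 1
    acts = [[action_literals(c, t) for t in range(1, 8)] for c in clauses]  # action IDs 1..7
    A = np.zeros((7 * k, 7 * k))
    for i in range(k):
        for t in range(7):
            for ip in range(k):
                for tp in range(7):
                    if i == ip and t == tp:
                        A[7*i + t, 7*ip + tp] = float(contradicts(acts[i][t], acts[i][t]))
                    elif i == ip:
                        A[7*i + t, 7*ip + tp] = 1.0
                    else:
                        A[7*i + t, 7*ip + tp] = float(contradicts(acts[i][t], acts[ip][tp]))
    Sigma1, u1 = np.eye(d), np.ones(d)
    Sigma2 = np.zeros((d, d))
    Sigma2[:7*k, :7*k] = 5 * d * np.eye(7 * k) + A
    Sigma2[:7*k, -1] = Sigma2[-1, :7*k] = 0.5
    Sigma2[-1, -1] = 5 * d
    u2 = np.full(d, 5 * d + 0.5)
    u2[-1] = 5 * d + k / 2
    return Sigma1, u1, Sigma2, u2
```

For tiny formulas (one or two clauses), combining this with the brute-force checker given after Problem 1.1 lets one confirm on toy cases that the number of non-trivial invariant sets of the constructed instance equals the number of satisfying assignments, as Lemma 2.2 asserts.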
The intuitions behind the constructions are as follows: (a) the construction of (Σ^(1), u^(1)) is to enforce the entries in the valid solutions, i.e., β^(S) with S ∈ S_y, to be either 0 or 1; (b) the positive non-integer 1/2, together with the last column of Σ^(2), is to make sure d ∈ S for any S ∈ S_y, which further forces |S| = k + 1 for any S ∈ S_y; (c) the construction of A is to connect any valid S ∈ S_y to a valid solution v ∈ S_x in a bijective manner. The above intuitions can be formally stated as follows: the first claim (a) follows directly from our construction of (Σ^(1), u^(1)); we defer the technical verification of (b) and (c) to the end of the proof.
S ∈ S_y  ⇐⇒(a)  S ≠ ∅ and β^(2,S) = β^(1,S) with β^(1,S)_j = 1{j ∈ S}
         ⇐⇒(b)  S = S̊ ∪ {d} with |S̊| = k and A_{j,j′} = 0 for all j, j′ ∈ S̊ ⊆ [7k]    (2.3)
         ⇐⇒(c)  S = S̊ ∪ {d} where S̊ = {7(i − 1) + a_i}_{i=1}^k with a_i ∈ [7] s.t. adopting action ID a_i in clause i ∈ [k] leads to a valid solution v ∈ S_x.

Based on (2.3), for any v ∈ Sx , we can find a corresponding S ∈ Sy : Let v ∈ {True, False}n be the
assignments of the variables and ai be the corresponding action ID induced by v. Then it follows from (2.3)
that S = {d} ∪ {7(i − 1) + ai }ki=1 ∈ Sy . On the other hand, for any S ∈ Sy , we can also find a corresponding

v ∈ Sx by (2.3). Note the mapping between Sy and A = {(ai )ki=1 : (ai )ki=1 is induced by some solution v ∈
Sx } and the mapping between A and Sx are all bijective maps. So we can conclude that |Sx | = |A| = |Sy |.
Proof of (2.3) (b). The direction ⇐ is obvious. For the ⇒ direction, we first show that d ∈ S by contradiction. Suppose |S| ≥ 1 but d ∉ S; pick j ∈ S, then

[Σ^(2)_S β^(2,S)_S]_j = 5d + Σ_{j′=1}^{7k} A_{j,j′} 1{j′ ∈ S} ≠ 5d + 1/2 = u^(2)_j,

where the first equality follows from the assumption β^(2,S)_j = β^(1,S)_j = 1{j ∈ S} and d ∉ S, and the inequality follows from the fact that A ∈ {0, 1}^{7k×7k}, hence the L.H.S. is an integer. This indicates that β^(1,S) ≠ β^(2,S) if |S| ≥ 1 and d ∉ S. Given d ∈ S, we then obtain

5d + k/2 = u^(2)_d = [Σ^(2)_S β^(2,S)_S]_d = 5d + (1/2) Σ_{j′=1}^{7k} 1{j′ ∈ S} = 5d + (|S| − 1)/2,

which implies that |S| = k + 1. Now we still have the constraint u^(2)_{S̊} = Σ^(2)_{S̊} β^(2,S)_{S̊} + (1/2)·1_k. The last claim A_{j′,j} = 0 for any j′, j ∈ S̊ then follows from this by observing that

(5d + 1/2)·1_k = Σ^(2)_{S̊} 1_k + (1/2)·1_k   =⇒   A_{S̊} 1_k = 0   =⇒(i)   A_{j′,j} = 0 ∀ j′, j ∈ S̊,

where (i) follows from the fact that A ∈ {0, 1}^{7k×7k}.


Proof of (2.3) (c). For the ⇒ direction, that S̊ = S \ {d} admits the form S̊ = {7(i − 1) + a_i}_{i=1}^k with a_i ∈ [7] follows from the fact that |S̊| = k and that, for each i ∈ [k], S̊ contains at most one index of the form 7(i − 1) + r with r ∈ [7] (because A_{7(i−1)+t, 7(i−1)+t′} = 1 whenever t ≠ t′), so it contains exactly one for each i. The satisfiability of the variable assignment induced by (a_i)_{i=1}^k can be realized by setting the variables based on the action ID a_i, going from i = 1 to i = k. This procedure has no conflicts because a conflict between the assignment of a boolean variable at clause i and that at clause i′ would lead to A_{7(i−1)+a_i, 7(i′−1)+a_{i′}} = 1 by the definition of the matrix A, which is contrary to the condition A_{S̊} = 0. For the ⇐ direction, the claim |S̊| = k is obvious. For any j, j′ ∈ S̊, one can write j = 7(i − 1) + a_i and j′ = 7(i′ − 1) + a_{i′}. When j = j′, A_{j,j′} = A_{j,j} = 0 follows from the fact that an action a_i induced by a valid solution v ∈ S_x cannot contradict itself. We use proof by contradiction when j ≠ j′: if A_{j,j′} = 1, then the action a_i for clause i contradicts the action a_{i′} for clause i′ by the definition of A, which is contrary to the fact that the actions (a_i)_{i=1}^k lead to a valid solution v ∈ S_x.
The NP-hardness of ExistLIS follows from Lemma 2.2 and Lemma 2.1.
Now we are ready to establish the NP-hardness of ExistLIS-Ident. Given the problem 3Sat-Unique is
NP-hard under randomized polynomial-time reduction by Lemma 2.1, it suffices to show that we can reduce
any 3Sat-Unique problem x with input size k to an ExistLIS-Ident problem y with input size d = 7k + 1 under deterministic polynomial-time reduction. We let y be the problem constructed from x in Lemma 2.2. Now it suffices to show that y is an instance of ExistLIS-Ident, that is, y satisfies the constraint in Problem 2.4. Since |S_y| = |S_x| ∈ {0, 1} by our parsimonious reduction in Lemma 2.2 and the promise in the 3Sat-Unique problem x, we consider the following two cases.
Case 1. |S_x| = 0: We claim that S† = ∅ is the maximum invariant set. In this case, |S_y| = 0 by our reduction construction in Lemma 2.2. This implies that for all S ⊆ [d] with |S| ≥ 1, the condition β^(e,S) ≡ β^(S) does not hold, which validates (2.1) for S†.
Case 2. |S_x| = 1: We claim that S† = S̃, where S̃ is the set obtained via the map v → S̃ of our constructed reduction in Lemma 2.2 from the unique solution v of x; it satisfies |S̃| = k + 1 and β^(1,S̃)_{S̃} = 1_{k+1} = β^(2,S̃)_{S̃}. Given S_y = {S̃}, we can claim that ∥β^(1,S) − β^(2,S)∥_2 > 0 for any S ∉ {∅, S̃}; this verifies (2.1) for S†.

2.4 Hardness of Finding Approximate Solutions with Error Guarantees
The claim in Theorem 2.1 indicates a computational barrier exists in finding an exact invariant set. At first
glance, it does not rule out the possibility that there exists some polynomial-time algorithm that can find
an approximate solution whose prediction is relatively close to one of the non-trivial invariant ones. The construction in Theorem 2.1 in fact rules out this possibility as well, as demonstrated in Corollary 2.2. As a by-product, Corollary 2.2 also rules out the possibility of efficiently finding a non-trivial invariant solution when one exists, since it even allows for estimation errors.
Problem 2.5. Consider the same setting as Problem 2.3 with E = 2 and suppose further Y (e) = (β (e,[d]) )⊤ X (e) ,
i.e., there is no intrinsic noise.
[Input] (Σ(1) , Σ(2) ) and (u(1) , u(2) ) as in Problem 2.3.
[Output] Return a d-dimensional vector β̄: β̄ should be an approximate solution to any of the non-trivial
invariant solutions if there exists a non-trivial invariant solution, that is

inf_{S: β^(e,S) ≡ β^(S) ≠ 0}  ∥β̄ − β^(S)∥²_Σ / ( Σ_{e∈[2]} E[|Y^(e)|²] )  <  (20d)^{−1}   if {S : β^(1,S) = β^(2,S) ≠ 0} ≠ ∅;    (2.4)

β̄ can be an arbitrary d-dimensional vector otherwise.


Corollary 2.2. If Problem 2.5 can be solved by a polynomial-time algorithm, then 3Sat can also be solved
by a polynomial-time algorithm.

Moreover, in the construction in Lemma 2.2, the solutions are well-separated: whenever variable selection
is incorrect, the resulting predictions in the two environments are not very close, and the pooled prediction
also deviates from any invariant predictions; see the formal claims in (2.6) and (2.5), respectively. The
inequality (2.6) also rules out the possibility of finding an o(d^{−4})-approximate invariant set in a computationally efficient manner.

Lemma 2.3 (Relative Estimation Error Gap). In the constructed instance in Lemma 2.2, if we let Y (e) =
(β (e,[d]) )⊤ X (e) , then the following holds,

∀ S, S† ⊆ [d],   ∥β^(S) − β^(S†)∥²_Σ / ( Σ_{e∈[2]} E[|Y^(e)|²] )  ∈  [ 1{S ≠ S†}·(40d)^{−1}, 1 ],    (2.5)

∀ S ⊆ [d],   ( Σ_{e∈[2]} ∥β^(S) − β^(e,S)∥²_{Σ^(e)} ) / ( Σ_{e∈[2]} E[|Y^(e)|²] )  ∈  {0} ∪ [ (10d)^{−4}, 1 ].    (2.6)

Remark 2 (Dilemma between Statistical and Computational Tractability). One can choose either the
relative distance to the closest non-trivial invariant solution, denoted δ1, or the relative prediction variation defined on the L.H.S. of (2.6), denoted δ2, as the “estimation error” of interest. If P≠NP, taking all the polynomial-time algorithms into consideration, Corollary 2.2 claims that the worst-case estimation error δ1 is lower bounded
by (20d)−1 , and Lemma 2.3 shows that the worst-case estimation error δ2 is lower bounded by (10d)−4 . A
finer construction in Appendix A improves the error lower bounds in (2.4), (2.5) and (2.6) to be d−ϵ for
any fixed ϵ > 0. Given that our theorem is stated at a population level, and one can estimate all the β^(e,S) uniformly well provided n ≳ poly(d), we can claim that the statistical estimation error can decay arbitrarily slowly for polynomial-time algorithms if P≠NP.
Proof of Corollary 2.2. We use the same reduction as in Lemma 2.2. For 3Sat instance x, we let y = T (x)
be the constructed ExistLIS instance in Lemma 2.2. Let β̄ be the output required by Problem 2.5 in the
instance y, and S̃ = {j : β̄_j ≥ 0.5}. Following the notations therein, we claim that

S̃ ∈ S_y  ⇐⇒(a)  |S_y| ≥ 1  ⇐⇒  x ∈ X_{3Sat,1}.    (2.7)

Therefore, if an algorithm A can take a Problem 2.5 instance y as input and return the desired output β̄(y) within time O(p(|y|)) for some polynomial p, then the following algorithm can solve 3Sat within polynomial time: for any instance x, it first transforms x into y = T(x), then uses algorithm A to solve y and obtains the returned β̄, and finally outputs 1{S̃ ∈ S_y}.
It remains to verify (a): the ⇒ direction is obvious. For the ⇐ direction, suppose |S_y| ≥ 1; then the estimation error guarantee in Problem 2.5 indicates that

∥β̄ − β^(S†)∥_∞ ≤ ∥β̄ − β^(S†)∥_2 ≤ √( ∥β̄ − β^(S†)∥²_Σ / λ_min(Σ) )  <(i)  √( (20d)^{−1} · 10d² / (2d) )  ≤  1/2

for some S† ∈ S_y. Here (i) follows from the error guarantee (2.4) and the facts λ_min(Σ) ≥ 0.5 λ_min(Σ^(2)) ≥ 2d and Σ_{e∈[2]} E[|Y^(e)|²] ≤ 10d² in (C.2) and (C.1), respectively. This further indicates S̃ = S† by the fact that β^(S†)_j = 1{j ∈ S†} for any j ∈ [d].

2.5 Remarks and Lessons from Theorem 2.1


The fundamental limits delivered in Theorem 2.1 assert that realizing both computationally and statistically
efficient estimation is impossible unless P=NP or the original problem is simplified. The latter may result in
restrictive applicability.
Two remarks on the severity of the computational barrier are worth mentioning. Firstly, in the world
of P≠NP, no polynomial-time algorithm can attain estimation accuracy d^{−ϵ} for any fixed ϵ > 0, by Remark 2. This indicates that the computational barrier for pursuing invariance is more severe than
that for other estimation problems such as pursuing sparsity (Zhang et al., 2014; Wang et al., 2016) in which
polynomial-time algorithms can obtain a sub-optimal but still decent rate. Secondly, the computational
barrier is due to pursuing invariance itself rather than picking from exponentially many invariant solutions
based on some criterion or the non-identifiability of the problem. In fact, the computational barrier remains
under the promise of one unique invariant solution in Theorem 2.1.
The results and construction in Theorem 2.1 also imply that, in the worst case, the computational barrier remains under some typical potential strategies: firstly, the construction of the identity Σ^(1) implies that perfectly orthogonal covariates in |E| − 1 environments will not help. Secondly, the construction of X_d indicates that, in the worst case, searching over all variable sets with cardinality less than r cannot furnish any
insights on determining whether there are invariant sets whose cardinalities are larger than or equal to r.
Finally, a finer construction in Appendix A asserts that further imposing row-wise constant-level sparsity
on all the covariance matrices will not help, or in other words, the computation difficulty is not due to the
dense covariance structure.

3 Regularization by Environment Prediction Variation


The results in Section 2 indicate that consistent estimation with polynomial-time algorithms is impossible
under the worst-case scenario. Such a worst-case hardness remains when there is (1) perfect orthogonality in
one environment, and (2) near-perfect sparsity across different environments. In Section 3.1, we first impose
one additional restrictive assumption, see how the computational barrier can be resolved, and derive the
distributional robustness interpretation of our proposed estimator under this assumption. Section 3.2 further
demonstrates the general estimator and establishes the corresponding causal identification and distributional
robustness result when n = ∞. The finite sample estimator and the non-asymptotic results are presented in
Section 3.3.
Without loss of generality, we assume the covariate is non-degenerate and (pooled) normalized.
Condition 3.1 (Non-collinearity and Normalization). Assume Σ(e) ≻ 0 for any e ∈ E. Recall the definition
in (1.6); we have Σj,j = 1 for any j ∈ [d].

3.1 Warmup: Orthogonal Important Covariate
Let us first impose an additional restrictive assumption Condition 3.2 in the model (1.1) and see how the
computational barrier can be circumvented under this condition. In the following Section 3.2, we shall
consider a more general relaxation regime and establish a tradeoff between the additional assumption and
computational complexity.
Condition 3.2. For all e ∈ E, $\Sigma^{(e)}_{i,j} = 0$ for any i, j ∈ S⋆ with i ̸= j.

Recall the definition of β (e,S) and β (S). If Condition 3.2 holds, then under (1.1) and (1.3), S⋆ can be
simplified as
$$S^\star = \Big\{ j : \forall e\in\mathcal{E},\ \beta_j^{(e,\{j\})} \equiv \beta_j^{(\{j\})} \Big\},$$
which involves only marginal regression coefficients, where $\beta_j^{(\{j\})}$ stands for the pooled effect obtained by simply using
the j-th variable as the predictor. This means that under Condition 3.2, one can enumerate j ∈ [d] and screen
out those $X_j$ with varying marginal regression coefficients, i.e., those with $r_j^{(e)} \ne r_j^{(e')}$ for some e, e′ ∈ E, where
$r_j^{(e)} = E[X_j^{(e)}Y^{(e)}]/E[|X_j^{(e)}|^2]$. The surviving variables furnish S⋆. Turning to the empirical counterpart,
it is a multi-environment version of the sure-screening (Fan & Lv, 2008).
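For concreteness, a minimal sketch of this multi-environment marginal screening step is given below (written in Python with NumPy; the data layout and the tolerance `tol`, which replaces the exact population-level comparison, are illustrative assumptions rather than part of the paper).

```python
import numpy as np

def marginal_screening(X_envs, Y_envs, tol=1e-3):
    """Screen out variables whose marginal regression coefficients vary across
    environments; under Condition 3.2 the surviving indices estimate S*.

    X_envs: list of (n_e, d) arrays, one per environment.
    Y_envs: list of (n_e,) arrays, one per environment.
    tol:    illustrative numerical threshold for "invariant" coefficients.
    """
    d = X_envs[0].shape[1]
    # r[e, j] = E[X_j^{(e)} Y^{(e)}] / E[|X_j^{(e)}|^2], the marginal coefficient.
    r = np.array([
        [np.mean(X[:, j] * Y) / np.mean(X[:, j] ** 2) for j in range(d)]
        for X, Y in zip(X_envs, Y_envs)
    ])
    # Keep variable j only if its marginal coefficient is (numerically) invariant.
    return [j for j in range(d) if np.max(r[:, j]) - np.min(r[:, j]) <= tol]
```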
The above procedure is still of a discontinuous, hard-thresholding style. Recalling $R^{\mathcal{E}}(\beta)$ in (1.7), the main idea motivates
minimizing the following penalized least squares
$$Q_{1,\gamma}(\beta) = R^{\mathcal{E}}(\beta) + \gamma \underbrace{\sum_{j=1}^{d} |\beta_j|\, \underbrace{\sqrt{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}} \Sigma^{(e)}_{j,j}\Big(\beta_j^{(e,\{j\})} - \beta_j^{(\{j\})}\Big)^{2}}}_{w_1(j)}}_{J_1(\beta)}, \tag{3.1}$$

where the penalty term measures the discrepancy across different environments.
Here we use $\Sigma^{(e)}_{j,j}\big|\beta_j^{(e,\{j\})} - \beta_j^{(\{j\})}\big|^2$ rather than $\big|\beta_j^{(e,\{j\})} - \beta_j^{(\{j\})}\big|^2$ since the former is x-scale invariant
and has a better interpretation in terms of prediction. To be specific, the term $w_1(j)$ will be the same if we replace X
by aX for any a ∈ ℝ \ {0}. More importantly, it can be interpreted as the variation of the optimal prediction in
L2 norm across environments, namely,
$$w_1(j) = \sqrt{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}} \int \big\{ f^{(e,j)}(x) - f^{(j)}(x) \big\}^{2}\, \mu^{(e)}(dx)}, \tag{3.2}$$
where $f^{(e,j)}(x) = \beta_j^{(e,\{j\})} x_j$ is the best linear prediction on $X_j$ in environment e and $f^{(j)}(x) = \beta_j^{(\{j\})} x_j$ is
the best linear prediction on $X_j$ across all environments.
The proposed optimization program can be understood in two aspects. On the one hand, it maintains the
capability to solve the invariant pursuit problem, that is, recover β ⋆ from (1.1), when γ is large enough. To
see this, when γ ≍ 1, the introduced penalty γJ1 (β) will place a constant penalty on the spurious variables,
i.e., j ∈ G, and will not penalize any variables in S ⋆ . Therefore, one can expect that β ⋆ will be the unique
minimizer of Q1,γ (β) when γ is large enough so that the penalty term is larger than the prediction error of
using β ⋆ . On the other hand, it maximizes a relaxed worst-case explained variance over small perturbations
around the pooled least squares, defined as $\bar\beta := \Sigma^{-1}u$, when γ is small. Recalling the definition of the pooled
quantities (Σ, u) in (1.6), the two-fold characterization of the population-level minimizer of (3.1) can be
formally delivered as follows.
Proposition 3.1. Let $\mathcal{P}_\gamma(\Sigma, u) = \big\{(X,Y)\sim\mu : E[XX^\top] = \Sigma,\ |E[XY] - u| \le \gamma\cdot(w_1(1),\ldots,w_1(d))^\top\big\}$ be
the uncertainty set of distributions. Under Condition 3.1, $Q_{1,\gamma}(\beta)$ has a unique minimizer $\beta^\gamma$ satisfying
$$\beta^\gamma = \mathop{\mathrm{argmin}}_{\beta}\ \max_{\mu\in\mathcal{P}_\gamma(\Sigma,u)} E_{(X,Y)\sim\mu}\big[|Y - \beta^\top X|^2 - |Y|^2\big]. \tag{3.3}$$
Moreover, under (1.1) with S⋆ further satisfying (1.3), if Condition 3.2 holds, then $\beta^\gamma = \beta^\star$ when $\gamma \ge \gamma^\star :=
\max_{j\in G}\big|\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}} E[X_j^{(e)}\varepsilon^{(e)}]\big| / w_1(j)$, where $w_1(j)$ is defined in (3.1).

Proposition 3.1 offers interpretations of the population-level minimizer β γ of Q1,γ (β) for varying γ from
two perspectives. On the one hand, β γ can be interpreted as the distributionally robust prediction model over
the uncertainty set Pγ (Σ, u): it minimizes the worst-case negative explained variance, or it is the maximin
effects (Meinshausen & Bühlmann, 2015; Guo, 2024) over the uncertainty set Pγ (Σ, u). The uncertainty
class contains all joint distributions of (X, Y) in which the covariates X have second-order moment matrix
Σ and the covariance between X and Y is perturbed around u. Similar to Theorem 1 in Meinshausen &
Bühlmann (2015) and Proposition 1 in Guo (2024), $\beta^\gamma$ has the following geometric interpretation:
$$\beta^\gamma = \mathop{\mathrm{argmin}}_{\beta\in\Theta_\gamma} \beta^\top\Sigma\beta \qquad\text{with}\qquad \Theta_\gamma = \big\{\beta : |\Sigma\beta - u| \le \gamma\cdot(w_1(1),\ldots,w_1(d))^\top\big\}. \tag{3.4}$$

This basically says that β γ is the projection of the null β = 0 on the convex closed set Θγ with respect to
the norm ∥ · ∥ = ∥Σ1/2 · ∥2 ; see the proof in Appendix D.2. The distributional robustness (3.3) and geometric
interpretation (3.4) are independent of the invariance structure (1.1) and further structural assumption
Condition 3.2. Instead, they are attributed to the choice of L1 regularization with inhomogeneous weights
(w1 (1), . . . , w1 (d)). This is a realization of the heuristic idea of adopting an anisotropic uncertainty ellipsoid
based on the observed environments. Specifically, more uncertainty is placed on the variables predicting
differently in the observed environments than those with invariant predictions.
On the other hand, consider the case where the data generating process satisfies the invariance structure
(1.1), the sufficient heterogeneity (1.3), together with an additional structure assumption Condition 3.2. Now
the above distributionally robust procedure will place zero uncertainty on the invariant, causal variables,
and will place linear-in-γ uncertainty on the spurious variables. The minimizer β γ will coincide with the
true, causal parameter β ⋆ when γ is large enough.
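To make the geometric characterization (3.4) concrete, the projection can be computed with an off-the-shelf convex solver. The following sketch uses CVXPY; the function name and interface are illustrative assumptions, not code from the paper.

```python
import cvxpy as cp
import numpy as np

def beta_gamma(Sigma, u, w, gamma):
    """Solve (3.4): project beta = 0 onto Theta_gamma w.r.t. the norm ||Sigma^{1/2} x||_2.

    Sigma: (d, d) pooled second-moment matrix (assumed positive definite).
    u:     (d,) pooled cross moment E[XY].
    w:     (d,) prediction-variation weights (w_1(1), ..., w_1(d)).
    gamma: invariance hyper-parameter scaling the uncertainty set Theta_gamma.
    """
    L = np.linalg.cholesky(Sigma)                        # Sigma = L @ L.T
    beta = cp.Variable(len(u))
    objective = cp.Minimize(cp.sum_squares(L.T @ beta))  # equals beta' Sigma beta
    constraints = [cp.abs(Sigma @ beta - u) <= gamma * np.asarray(w)]
    cp.Problem(objective, constraints).solve()
    return beta.value
```

At γ = 0 the constraint forces Σβ = u, recovering the pooled least squares β̄; as γ increases, Θγ widens along the spurious directions (those with positive w₁(j)) and the projection moves toward the invariant solution.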
Let us illustrate the above ideas using the toy example below.
Example 3.1. Consider the following data-generating process with d = 3, E = {1, 2} and independent
standard normal random variables ε0 , . . . , ε3 ; the cause-effect relationship and the intervention effects are
illustrated in Fig. 1 (a). The constant factors before εj with j ≥ 2 are added to ensure Xj has unit
variance. In environment e = 1,
$$X_1^{(1)} \leftarrow \varepsilon_1, \qquad Y^{(1)} \leftarrow X_1^{(1)} + \varepsilon_0, \qquad X_2^{(1)} \leftarrow (2/3)\cdot Y^{(1)} + (1/3)\cdot\varepsilon_2, \qquad X_3^{(1)} \leftarrow (2/3)\cdot Y^{(1)} + (1/3)\cdot\varepsilon_3;$$
and in environment e = 2,
$$X_1^{(2)} \leftarrow \varepsilon_1, \qquad Y^{(2)} \leftarrow X_1^{(2)} + \varepsilon_0, \qquad X_2^{(2)} \leftarrow 0.5\cdot Y^{(2)} + (\sqrt{2})^{-1}\cdot\varepsilon_2, \qquad X_3^{(2)} \leftarrow 0.25\cdot Y^{(2)} + \sqrt{7/8}\cdot\varepsilon_3.$$
In Example 3.1, X1 is the invariant (causal) variable, while X2 and X3 are all endogenous spurious (reverse
causal) variables as shown in Fig. 1 (a). They have identical spurious predictive powers in environment e = 1,
and variable X3 is confronted with stronger perturbations than X2 in environment e = 2. The invariance
structure is well identified with S ⋆ = {1} satisfying (1.1) and (1.3) simultaneously. The prediction variations
in (3.2) are (w1 (1), w1 (2), w1 (3)) = (0, 1/6, 1/4).
Fig. 1 (b) visualizes the maximin effect (3.3) over the uncertainty set shaped by the prediction variation.
For a given fixed γ, the uncertainty set in E[XY ] in (3.3) places no uncertainty on the causal variable X1 ,
while it places a relatively small uncertainty γ/6 on the variable X2 , which suffers from less perturbation, and
a relatively large uncertainty γ/4 on the variable X3 , which predicts more differently in the observed environments
E. This two-dimensional uncertainty plane in covariance space further yields the two-dimensional uncertainty
plane centered on the pooled least squares β̄ in the solution space after the affine transformation x → Σ−1 x
as shown in Fig. 1 (b). The uncertainty sets Θγ all lie in the same hyper-plane and their diameter scales
linearly with γ. The corresponding population-level minimizer β γ is the projection of the null β = 0 on
Θγ . This leads to a solution path that connects the most predictive solution β̄ and the causal solution

[Figure 1 here: (a) structural causal model diagrams for the two environments e = 1, 2 in Example 3.1; (b) the uncertainty sets Θ0.4, Θ2, Θ3.6 and the regularization path of β^γ in the parameter space (β1, β2, β3), with β^0 = β̄ and β^3.6 = β⋆; (c) coordinate-wise solution paths β_j^γ against γ for the proposed estimator (upper panel) and β_{FAIR,j} for the FAIR-Linear estimator (lower panel).]

Figure 1: (a) A structural causal model illustration of the multi-environment model in Example 3.1: the arrow from node u to
node v with number s means there is a linear causal effect s of u on v. (b) visualizes the uncertainty set Θγ at three checkpoints
of γ ∈ {0.4, 2, 3.6} and regularization path of the proposed estimator (3.3) in the three-dimensional parameter space β ∈ R3 .
For each γ, the uncertainty set Θγ is a two-dimensional plane filled by colors changing from red to blue as γ increases. The
upper panel of (c) depicts how the population level solution β γ ∈ R3 changes according to γ in each coordinate j ∈ [3]: the
causal variable is represented by a green solid line, and the two spurious (reverse causal) variables are represented by yellow
dashed (β2 ) and dotted (β3 ) lines, respectively. The lower panel of (c) plots the counterpart for the FAIR-Linear estimator in
Gu et al. (2024).

β ⋆ continuously. When γ is smaller than the critical threshold, such a prediction β γ still leverages part
of the spurious variables for prediction and will predict better than both β ⋆ and β̄ when it is deployed
in an environment where the reverse causal effects are still positive but slightly shrunken, for example,
$X_3 \leftarrow Y/(3\sqrt{2}) + \sqrt{8/9}\,\varepsilon_3$. Such a solution β γ stands in between β ⋆ and β̄: it is more robust than β̄ and
less conservative than β ⋆ . As a comparison, the FAIR-Linear estimator (Gu et al., 2024), which solves the
hard-constrained structural estimation problem, is less flexible in this regard: as shown in the lower panel of
Fig. 1 (c), it adopts a hard threshold and chooses either to include or to eliminate the spurious variables.

3.2 Interpolating between the Orthogonal and General Cases


The population-level minimizer of (3.1) can solve the linear invariance pursuit in (1.1) efficiently within
time complexity O((|E| + n)d + TLasso (|E| · n, d)), where TLasso (N, d) is the complexity of running a d-variate
N -sample Lasso. However, the estimation can only be guaranteed when Condition 3.2 holds, and it may fail
when Condition 3.2 does not hold. Here, we introduce a more general relaxation balancing estimation error
and time complexity.
Instead of calculating the prediction variation of the marginal linear predictor for each variable Xj , we
consider calculating the prediction variation of predictors using variable sets of size less than or equal to k. For the
population-level counterpart, it minimizes the following objective
$$Q_{k,\gamma}(\beta) = R^{\mathcal{E}}(\beta) + \gamma \underbrace{\sum_{j=1}^{d} |\beta_j|\cdot \underbrace{\min_{S:\, j\in S,\, |S|\le k} \sqrt{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}} \big\|\beta_S^{(e,S)} - \beta_S^{(S)}\big\|^{2}_{\Sigma_S^{(e)}}}}_{w_k(j)}}_{J_k(\beta)} \tag{3.5}$$
with some computational budget hyper-parameter k ∈ ℕ.
As k grows or equivalently as more computational budget is paid, the space of instances that can be
solved enlarges and will finally coincide with that of EILLS or FAIR when k ≥ |S ⋆ |. On the other hand,
if the computational budget we can pay is relatively limited, one can still probably solve some problem
instances with low-dimensional structures as elaborated in the following Theorem 3.3.
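The weight $w_k(j)$ in (3.5) can be evaluated by enumerating the candidate sets containing j of size at most k, which is where the computational budget enters. A population-level sketch is given below; taking the per-environment moments $(\Sigma^{(e)}, u^{(e)})$ as inputs is an illustrative interface rather than the paper's.

```python
import numpy as np
from itertools import combinations

def w_k(j, k, Sigmas, us):
    """Population-level w_k(j) from (3.5) by enumerating sets S with j in S, |S| <= k.

    Sigmas: list of (d, d) per-environment second-moment matrices Sigma^{(e)}.
    us:     list of (d,) per-environment cross moments u^{(e)} = E[X^{(e)} Y^{(e)}].
    """
    d, E = Sigmas[0].shape[0], len(Sigmas)
    others = [i for i in range(d) if i != j]
    best = np.inf
    for size in range(k):                       # choose up to k - 1 companions of j
        for comp in combinations(others, size):
            S = np.array([j, *comp])
            Sig_S = sum(Sig[np.ix_(S, S)] for Sig in Sigmas) / E   # pooled Sigma_S
            u_S = sum(u[S] for u in us) / E                        # pooled u_S
            beta_S = np.linalg.solve(Sig_S, u_S)                   # pooled beta^{(S)}
            val = 0.0
            for Sig, u in zip(Sigmas, us):
                beta_eS = np.linalg.solve(Sig[np.ix_(S, S)], u[S]) # beta^{(e,S)}
                diff = beta_eS - beta_S
                val += diff @ Sig[np.ix_(S, S)] @ diff             # ||.||^2 in Sigma_S^{(e)}
            best = min(best, np.sqrt(val / E))
    return best
```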

Condition 3.3 (Restricted Invariance). For any j ∈ S ⋆ , there exists some S ⊆ [d] with |S| ≤ k and j ∈ S
such that β (e,S) ≡ β (S) for any e ∈ E.
Note that when Condition 3.3 holds, the weight wk (j) in the penalty term equals 0 for all j ∈ S ⋆.
On the other hand, for a large enough γ, all endogenous variables will be excluded due to a positive
wk (j). Hence, the objective (3.5) will screen out all endogenously spurious variables and meanwhile minimize
the prediction errors using the remaining variables. Condition 3.3 naturally holds when k ≥ |S ⋆ |. When
k < |S ⋆ |, Condition 3.3 requires a stronger identification condition than the invariance assumption (1.1),
namely that every invariant variable Xj with j ∈ S ⋆ can be identified using a smaller set Sj with |Sj | ≤
k < |S ⋆ |. This is a generic condition and can hold under different circumstances: for example, when there is
some shared group-orthogonal structure in the set S ⋆, e.g., $\Sigma^{(e)}_{S^\star}$ admits a block-diagonal structure with
maximum block size ≤ k (which includes the diagonal case in Condition 3.2 as a specific instance), or when
the interventions on the ancestors of S ⋆ are insufficient, for example, when all the ancestors of Y are free of
intervention. Proposition B.2 in the appendix further offers conditions under which Condition 3.3 holds.
The following two theorems generalize Proposition 3.1 for growing k.
Theorem 3.2. Let $\mathcal{P}_{\gamma,k}(\Sigma, u) = \big\{(X,Y)\sim\mu : E[XX^\top] = \Sigma,\ |E[X_j Y] - u_j| \le \gamma\cdot w_k(j)\ \forall j\in[d]\big\}$ be the
uncertainty set of distributions. Under Condition 3.1, $Q_{k,\gamma}(\beta)$ has a unique minimizer $\beta^{k,\gamma}$ satisfying
$$\beta^{k,\gamma} = \mathop{\mathrm{argmin}}_{\beta}\ \max_{\mu\in\mathcal{P}_{\gamma,k}(\Sigma,u)} E_{(X,Y)\sim\mu}\big[|Y - \beta^\top X|^2 - |Y|^2\big]. \tag{3.6}$$

Theorem 3.3. Under the setting of Theorem 3.2, assume the invariance structure (1.1) holds with S ⋆
satisfying (1.3). Suppose further that Condition 3.3 holds; then $\beta^{k,\gamma} = \beta^\star$ when $\gamma \ge \gamma_k^\star$ with $\gamma_k^\star :=
\max_{j\in G}\big|\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}} E[X_j^{(e)}\varepsilon^{(e)}]\big| / w_k(j)$.

Remark 3. One can show that $\gamma_k^\star$ is uniformly upper bounded by
$$\gamma_k^\star \le \Big\{\min_{e\in\mathcal{E}} \lambda_{\min}\big(\Sigma^{(e)}\big)\Big\} \cdot \sqrt{\gamma^*},$$
where γ ∗ is the critical threshold, or the signal-to-noise ratio in heterogeneity in Fan et al. (2024). It was
defined on a square scale, so a square root is taken here; see the formal definition of γ ∗ in (D.2) in the
appendix. This indicates that one does not need to adopt a potentially larger hyper-parameter to achieve
causal identification compared with EILLS in Fan et al. (2024), recalling the scaling Condition 3.1.
Similar to Proposition 3.1, the first distributional robustness interpretation (3.6) in Theorem 3.2 is
due to adopting inhomogeneous L1 penalization on the variables based on a finer prediction variation
$(w_k(1),\cdots,w_k(d))$ observed in the environments E than the marginal counterpart $(w_1(1),\cdots,w_1(d))$.
Theorem 3.3 states that when the additional structural assumption Condition 3.3 holds, the causal
parameter β ⋆ under (1.1) with (1.3) can be identified by our estimator when γ is large enough.

3.3 Empirical-level Estimator and Non-asymptotic Analysis
Turning to the empirical counterpart, for given k and γ, we consider minimizing the following empirical-level
penalized least squares
$$\hat\beta^{k,\gamma} = \mathop{\mathrm{argmin}}_{\beta}\ \underbrace{\frac{1}{2n|\mathcal{E}|}\sum_{e\in\mathcal{E},\, i\in[n]} \Big(Y_i^{(e)} - \beta^\top X_i^{(e)}\Big)^{2} + \gamma\cdot\sum_{j=1}^{d}|\beta_j|\sqrt{\hat w_k(j)}}_{\hat Q_{k,\gamma}(\beta)}, \quad\text{with}\quad \hat w_k(j) = \inf_{S\subseteq[d],\,|S|\le k,\, j\in S} \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}} \big\|\hat\beta_S^{(e,S)} - \hat\beta_S^{(S)}\big\|^{2}_{\hat\Sigma_S^{(e)}}. \tag{3.7}$$

The weighted L1 -penalty aims at attenuating the endogenously spurious variables. This will be applied to
the low-dimensional regime d = o(n). Under the high-dimensional regime d ≳ n, we further add another L1
penalization with hyper-parameter λ, which aims at reducing exogenously spurious variables:

$$\hat\beta^{k,\gamma,\lambda} = \mathop{\mathrm{argmin}}_{\beta\in\mathbb{R}^d}\ \hat Q_{k,\gamma}(\beta) + \lambda\|\beta\|_1. \tag{3.8}$$
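Both (3.7) and (3.8) are weighted-L1 penalized least squares over the pooled data and can be minimized by any proximal-gradient (soft-thresholding) routine. A minimal sketch follows; the per-coordinate penalties $\gamma\sqrt{\hat w_k(j)} + \lambda$ are assumed to be precomputed (e.g., by an empirical analogue of the $w_k$ enumeration sketched earlier), and the iteration count is an illustrative choice.

```python
import numpy as np

def weighted_lasso(X, Y, pen, n_iter=2000):
    """Minimize (2N)^{-1} ||Y - X beta||_2^2 + sum_j pen[j] * |beta_j| by ISTA.

    X:   (N, d) pooled design stacking all training environments.
    Y:   (N,)   pooled response.
    pen: (d,)   per-coordinate penalty, here gamma * sqrt(w_hat_k(j)) + lambda.
    """
    N, d = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / N)   # 1 / Lipschitz constant of the gradient
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y) / N            # gradient of the squared loss
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * pen, 0.0)  # soft-thresholding
    return beta
```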

For the theoretical analysis, we impose some standard assumptions used in linear regression.
Condition 3.4 (Regularity). The following conditions hold:
(a) (Data Generating Process) We collect data from $|\mathcal{E}|\in\mathbb{N}_+$ environments. For each environment $e\in\mathcal{E}$,
we observe $(X_1^{(e)}, Y_1^{(e)}),\ldots,(X_n^{(e)}, Y_n^{(e)}) \overset{i.i.d.}{\sim} \mu^{(e)}$. The data from different environments are also
independent.
(b) (Non-collinearity and Normalization) Assume $\Sigma^{(e)}\succ 0$ for any $e\in\mathcal{E}$. Recall the definition in (1.6);
we have $\Sigma_{j,j} = 1$ for any $j\in[d]$.
(c) (Sub-Gaussian Covariate and Noise) There exist some constants $\sigma_x\in[1,\infty)$ and $\sigma_y\in\mathbb{R}_+$ such that
$$\forall e\in\mathcal{E}\qquad E\Big[\exp\Big\{v^\top\big(\Sigma_S^{(e)}\big)^{-1/2}X_S^{(e)}\Big\}\Big] \le \exp\Big(\frac{\sigma_x^2}{2}\cdot\|v\|_2^2\Big)\qquad\forall S\subseteq[d],\ v\in\mathbb{R}^{|S|},$$
$$E\Big[\exp\big\{\lambda Y^{(e)}\big\}\Big] \le \exp\Big(\frac{\lambda^2\sigma_y^2}{2}\Big)\qquad\forall\lambda\in\mathbb{R}.$$
(d) (Relative Bounded Covariance) There exists a constant $b\in[1,\infty)$ such that
$$\forall e\in\mathcal{E}\text{ and }S\subseteq[p]\qquad \lambda_{\max}\big(\Sigma_S^{-1/2}\Sigma_S^{(e)}\Sigma_S^{-1/2}\big)\le b.$$
To simplify the presentation, let $c_1$ be such that $c_1\ge\max\{b,\sigma_x,\sigma_y\}$ and $|\mathcal{E}|\le n^{c_1}$.
These assumptions are standard in the analysis of linear regression. It is easy to see the sub-Gaussian
covariate conditions hold with σx = 1 when X (e) ∼ N (0, Σ(e) ). The sub-Gaussian condition can be relaxed by
the finite fourth-moment conditions with robust inputs; see Fan et al. (2021). Our error bound is independent
of supe∈E λmax (Σ(e) ) given fixed b. The maximum eigenvalue λmax (Σ(e) ) may grow with d in the presence of
highly correlated covariates such as factor models (Fan et al., 2022; Fan & Gu, 2024). It is also easy to see
that b ≤ |E| by observing that
$$\lambda_{\max}\big(\Sigma_S^{-1/2}\Sigma_S^{(e)}\Sigma_S^{-1/2}\big) \le \Big\{\lambda_{\min}\big((\Sigma_S^{(e)})^{-1/2}\Sigma_S(\Sigma_S^{(e)})^{-1/2}\big)\Big\}^{-1} \le |\mathcal{E}|. \tag{3.9}$$

The following theorem establishes the L2 error bound with respect to β k,γ identified in Theorem 3.2 in
the low-dimensional regime.

Theorem 3.4. Assume Condition 3.4 holds. There exists a constant $\tilde C = O(\mathrm{poly}(c_1))$ such that if $n \ge
\tilde C\max\{d,\, k\log d,\, t\}$ and $t\ge\log n$, then with probability at least $1-e^{-t}$,
$$\big\|\hat\beta^{k,\gamma} - \beta^{k,\gamma}\big\|_2 \le \tilde C\cdot\sqrt{\frac{d}{n}}\cdot\left\{\frac{\gamma}{\kappa}\sqrt{t+\log(n)+k\log d} + \sqrt{\frac{1+t/d}{\kappa\cdot|\mathcal{E}|}}\right\},$$
where $\kappa = \min_{e\in\mathcal{E}}\lambda_{\min}(\Sigma^{(e)})$.


As shown in Theorem 3.2 and Theorem 3.3, the invariance hyper-parameter γ interpolates in a smooth manner
between the most predictive solution, the pooled least squares $\bar\beta = \Sigma^{-1}u$, at γ = 0 and the most robust solution, the
invariant (causal) solution β ⋆ , for large enough γ ≥ γk⋆ , provided the additional condition
Condition 3.3 holds. Under the regime κ ≍ γk⋆ ≍ 1, our proposed empirical estimator converges to the
target β̄ at the rate $\{d/(n\cdot|\mathcal{E}|)\}^{1/2}$ when γ = 0 on the one hand. On the other hand, combining it with Theorem 3.3,
we also have the convergence rate to the causal parameter β ⋆ , that is,
$$P\left[\big\|\hat\beta^{k,\gamma} - \beta^\star\big\|_2 \le C\sqrt{\frac{d}{n}\{\log(n)+k\log(d)\}}\right] \ge 1 - n^{-10}$$
with the proper choice of γ ≍ γk⋆ ≍ 1. When 0 < γ < γk⋆ , the estimator $\hat\beta^{k,\gamma}$ serves as an invariance-information-guided
distributionally robust estimator whose variance lies between the two.
Turning to the high-dimensional regime, we have the following result. The main message is that the
proposed estimator in (3.8) can handle the high-dimensional covariates in a similar spirit to Lasso (Tibshirani,
1997; Bickel et al., 2009) for the sparse linear model with the help of another L1 penalty.

Theorem 3.5. Assume Condition 3.4 holds. Denote $S^{k,\gamma} = \mathrm{supp}(\beta^{k,\gamma})$. There exists a constant $\tilde C =
O(\mathrm{poly}(c_1))$ such that if $n \ge \tilde C(k + \kappa^{-1}|S^{k,\gamma}|)\log d$, then with probability at least $1-(nd)^{-10}$,
$$\big\|\hat\beta^{k,\gamma,\lambda} - \beta^{k,\gamma}\big\|_2 \le \frac{12\sqrt{|S^{k,\gamma}|}}{\kappa}\,\lambda \qquad \text{if}\qquad \lambda \ge \tilde C\left(\gamma\cdot\sqrt{\frac{k\log d + \log n}{n}} + \sqrt{\frac{\log d + \log n}{n\cdot|\mathcal{E}|}}\right).$$

4 Real Data Applications


In this section, we compare our method invariance-guided regularization (IGR) with other estimators in two
real data applications: daily stock log-return prediction and earth climate system prediction. Our proposed
method attains more robust predictions compared with predecessors. We summarize the framework in
Algorithm 1. In the two applications, we simply adopt k = 2. The performance of varying k ∈ {1, 2, 3} is
similar; see Appendix F.4.
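To make the framework concrete, a minimal Python sketch of the hyper-parameter selection step (4.1) in Algorithm 1 (displayed later in this section) is given below. Here `fit_igr` stands for any routine computing β̂^{k,γ,λ} from (3.8), e.g., the weighted-Lasso sketch in Section 3.3; its name and interface are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def select_igr(fit_igr, train_envs, valid_env, k, gammas, lambdas):
    """Grid search over (gamma, lambda) with validation MSE as in (4.1).

    fit_igr:    callable (train_envs, k, gamma, lam) -> beta_hat, assumed given.
    train_envs: list of (X, Y) pairs, one per training environment.
    valid_env:  (X_valid, Y_valid) pair used only for hyper-parameter selection.
    """
    X_val, Y_val = valid_env
    best, best_beta = np.inf, None
    for gamma in gammas:
        for lam in lambdas:
            beta = fit_igr(train_envs, k, gamma, lam)
            mse = np.mean((X_val @ beta - Y_val) ** 2)
            if mse < best:
                best, best_beta = mse, beta
    return best_beta
```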

4.1 Stock Log-return Prediction


We follow Varambally et al. (2023) and use the daily log-returns of 100 stocks from S&P 100, defined as
the differences in the logarithms of the closing prices of successive days. We denote the daily log-returns of
these stocks as {Zt,j }t∈[T ],j∈[100] where T is the length of the sequence. In this study, we focus on predicting
the stocks in the Real Estate sector: American Tower (Symbol: AMT) and Simon Property Group (Symbol:
SPG). For the task of predicting the outcome variables {AMT, SPG} with index j0 ∈ [100], the target response
variable is $Y_t = Z_{t,j_0}$, and the covariate is $X_t = \{Z_{t-1,j} : j\in[100]\}\cup\{Z_{t,j} : j\ne j_0\}$, the same as in Varambally
et al. (2023).
We use data from 800 consecutive days starting in August 2018 and partition this time series into seven
segments: days 1–100 (D1 ) and 101–200 (D2 ) serve as the two training environments, days 201–400 (D3 ) as
the validation environment, and days 401–500, 501–600, 601–700, and 701–800 as the four test environments

Algorithm 1 Linear Regression with Invariance-Guided Regularization (IGR)
1: Input: Training environments $\{D^{(e)}\}_{e\in\mathcal{E}}$ with $D^{(e)} = \{(X_i^{(e)}, Y_i^{(e)})\}_{i=1}^{n}$; validation environment $D^{(valid)}$.
2: Input: computational budget k.
3: Input: candidate sets of hyper-parameters Γ and Λ.
4: For each pair of hyper-parameters $(\gamma,\lambda)\in\Gamma\times\Lambda$, calculate $\hat\beta^{k,\gamma,\lambda}$ using (3.8) on the training environments.
5: Choose hyper-parameters as
$$(\hat\gamma, \hat\lambda) \in \mathop{\mathrm{argmin}}_{\gamma\in\Gamma,\,\lambda\in\Lambda}\ \frac{1}{|D^{(valid)}|}\sum_{(X_i,Y_i)\in D^{(valid)}} \big(X_i^\top\hat\beta^{k,\gamma,\lambda} - Y_i\big)^2 \tag{4.1}$$
6: Output: $\hat\beta^{k,\hat\gamma,\hat\lambda}$.

denoted as {D3+i }4i=1 . This partitioning is motivated by the results of Varambally et al. (2023), which
indicate that the market behavior between the two training time spans differs significantly. We set both X
and Y to be zero-mean in each environment to remove the effect of the trend.
We fix the computational budget k = 2 and compare our method with Causal Dantzig (Rothenhäusler
et al., 2019), Anchor Regression (Rothenhäusler et al., 2021) and DRIG (Shen et al., 2023) with the aid of L1
penalty (if applicable), along with PCMCI+ (Runge, 2020) with the aid of L2 penalty. The hyper-parameters
for all models are determined via the validation set $D^{(valid)} = D_3$ using a criterion similar to (4.1). Here
the hyper-parameters in the two prediction tasks are determined independently. We finally evaluate each
method using the worst-case out-of-sample R2 across the four test environments defined as

$$\min_{e\in\{4,5,6,7\}} R^2_{\mathrm{oos},e} \qquad\text{with}\qquad R^2_{\mathrm{oos},e} = 1 - \frac{\sum_{(X,Y)\in D_e}\big(Y - \hat Y(X)\big)^2}{\sum_{(X,Y)\in D_e} Y^2}, \tag{4.2}$$
where $\hat Y(X)$ is the model's prediction. Here we use the R2 rather than the mean squared error in (4.1) to
present the result to illustrate the challenge of this task, given most of the previous methods have negative
out-of-sample R2 , indicating that their fitted models are even worse than simply using the null prediction
model.
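The evaluation metric (4.2) is straightforward to compute; a short sketch, with the fitted coefficients and test splits as illustrative inputs, reads:

```python
import numpy as np

def worst_case_oos_r2(beta, test_envs):
    """Worst-case out-of-sample R^2 over the test environments, as in (4.2)."""
    r2s = []
    for X, Y in test_envs:
        resid = Y - X @ beta
        r2s.append(1.0 - np.sum(resid ** 2) / np.sum(Y ** 2))
    return min(r2s)
```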
This process is repeated 100 times. For each trial, we use a random sample of 90 observations in each training
environment D1 , D2 to fit the model. The average ± standard deviation of the worst-case out-of-sample R2
is reported in Table 1.

Data AMT SPG


IGR (Ours) 0.131 ± 0.074 0.048 ± 0.039
Causal Dantzig −0.150 ± 0.296 −0.006 ± 0.072
Anchor/Lasso −0.199 ± 0.097 −0.018 ± 0.021
DRIG −0.553 ± 0.309 −0.201 ± 0.099
PCMCI+ 0.051 ± 0.075 −0.057 ± 0.041

Table 1: The average ± standard deviation of the worst-case out-of-sample R2 (4.2) for predicting the stocks AMT and SPG using
different estimators.

We can see that our method outperforms competing methods in terms of robustness, as it provides more
consistent estimations across different environments. In particular, our method achieves a positive worst-case
out-of-sample R2 when predicting SPG, while the other methods result in negative R2 values. To qualitatively
illustrate why most of the other competing methods yield negative R2 values, we apply LASSO with an L1
penalty parameter of 0.125 on the training data in the AMT task to select covariates. Using the selected
covariates, we refit the target on the training environments D1 , D2 , as well as one of the test environments
D6 . As shown in Fig. 2, the resulting estimations differ drastically, highlighting strong heterogeneity across
environments. This observation partially explains why other methods may produce negative R2 values.

[Figure 2 here: estimated coefficients of the eleven selected variables (0, 84), (1, 60), (0, 21), (0, 22), (0, 59), (0, 69), (0, 36), (0, 74), (0, 96), (0, 76), (0, 60), refitted on the environments {1, 2} and {6}; the color scale ranges from −0.4 to 0.4.]
Figure 2: The estimated coefficients of the selected variables are shown for D1 ∪ D2 and D6 . Warm colors represent positive
coefficients, while cool colors indicate negative coefficients. Variables are denoted as (τ, j), where τ ∈ {0, 1} represents the time
lag, and j indicates the stock index.

4.2 Climate Dynamic Prediction


We apply our method to the NCEP-NCAR Reanalysis Dataset (Kalnay et al., 1996) provided by NOAA
PSL, Boulder, Colorado, USA. The dataset is widely used in atmospheric and climate research. It comprises
10512 global grid points with a resolution of 2.5 degrees in both latitude and longitude, spanning multiple
vertical levels, and is available on a daily timescale. The dataset encompasses a range of meteorological
properties, including air temperature (air), clear sky upward solar flux (csulf), surface pressure (pres),
sea level pressure (slp), and others.
We treat the aforementioned four properties {air, csulf, slp, pres} as four independent tasks. For each
property a ∈ {air, csulf, slp, pres}, we perform time-series prediction on {Za,t,j }t∈[T ],j∈[r] , where Za,t,j
represents the (pre-processed) measurement of the property a in geographic region j at timestep t. We omit
the dependency on a and denote it as {Zt,j }t∈[T ],j∈[r] when a is clear from context. See Appendix F.1 for how
the data is pre-processed.
In our experiment, we consider the datasets for the years 1950 (D1 ) and 2000 (D2 ) as the two training
environments, the year 2010 as the validation environment (D3 ), and the year 2020 as the test environment
(D4 ), all on a daily timescale. The target is to predict a set of variables Y = {j1 , . . . , jw } ⊆ [60], namely
Yt = (Zt,j1 , . . . , Zt,jw )⊤ , using all the variables from the past seven days as covariates, namely,

Xt := (Zt−1,1 , . . . , Zt−1,60 , Zt−2,1 , . . . , Zt−2,60 , . . . , Zt−7,1 , . . . , Zt−7,60 ) .

Here Y is the set of variables that can be predicted by Xt significantly better than by the null prediction
model; see the formal procedure for determining Y in Appendix F.2.
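To make the covariate construction concrete, the design (X_t, Y_t) built from the past seven days can be assembled as in the following sketch; the array layout and the name `build_lagged_design` are illustrative assumptions.

```python
import numpy as np

def build_lagged_design(Z, target_idx, n_lags=7):
    """Build (X_t, Y_t) from a (T, r) array Z of pre-processed measurements.

    X_t stacks Z_{t-1}, ..., Z_{t-n_lags} over all r regions; Y_t collects the
    target regions at time t.
    """
    T, r = Z.shape
    X = np.stack([Z[t - n_lags:t][::-1].ravel() for t in range(n_lags, T)])  # (T - n_lags, n_lags * r)
    Y = Z[n_lags:, target_idx]                                               # (T - n_lags, len(target_idx))
    return X, Y
```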
For competing estimators, we consider PCMCI+ (Runge, 2020) and Granger causality (Granger, 1969)
with the aid of L2 penalty, along with the following three causality-oriented linear models: Causal Dantzig
(Rothenhäusler et al., 2019), Anchor Regression (Rothenhäusler et al., 2021) and DRIG (Shen et al., 2023)
with the aid of L1 penalty (if applicable). We use mean squared error (MSE) as both the validation metric
and test metric, which is defined as
$$\mathrm{MSE}_e = \sum_{(X_t,Y_t)\in D_e} \big\|Y_t - \hat Y(X_t)\big\|_2^2, \qquad e\in[4], \tag{4.3}$$
where $\hat Y(X_t)$ is the model's prediction. The hyper-parameters for each model are tuned using the validation
environment D3 as described in Algorithm 1.
This process is repeated 100 times. For each trial, we use a random sample of 300 observations in each training
environment D1 , D2 to fit the model. The average ± standard deviation of the average mean squared error
on the test environment D4 of each method for each task is reported in Table 2. The quantitative results
show that our method outperforms all competing methods across all tasks, indicating that IGR can provide
more robust predictions. We also qualitatively visualize the causal relation detected by our method; see
Appendix F.5.

Data air csulf pres slp
IGR(Ours) 3.7838 ± 0.3281 2.0523 ± 0.0883 1.6077 ± 0.1122 3.0466 ± 0.1955
Causal Dantzig 4.3742 ± 0.2099 2.6197 ± 0.0455 2.0429 ± 0.1502 3.5819 ± 0.3161
LASSO 3.9171 ± 0.3194 2.1327 ± 0.0567 1.6726 ± 0.0897 3.1261 ± 0.1887
Anchor 3.9007 ± 0.2394 2.1142 ± 0.0615 1.6638 ± 0.0981 3.1235 ± 0.1622
DRIG 3.9579 ± 0.2594 2.1844 ± 0.1233 1.7235 ± 0.1176 3.2890 ± 0.1618
Granger 4.3174 ± 0.3842 2.3182 ± 0.0736 1.8470 ± 0.1222 3.5308 ± 0.1484
PCMCI+ 4.3533 ± 0.3062 2.4024 ± 0.0422 1.9499 ± 0.1213 3.6627 ± 0.2711

Table 2: The average ± standard deviation of the mean squared error (4.3) of the four tasks air temperature (air), clear sky
upward solar flux (csulf), surface pressure (pres) and sea level pressure (slp) using different estimators.

References
Aldrich, J. (1989). Autonomy. Oxford Economic Papers, 41(1), 15–34.
Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint
arXiv:1907.02893.
Bagnell, J. A. (2005). Robust supervised learning. In AAAI (pp. 714–719).
Bareinboim, E., Correa, J. D., Ibeling, D., & Icard, T. (2022). On pearl’s hierarchy and the foundations of
causal inference. In Probabilistic and causal inference: the works of judea pearl (pp. 507–556).
Berthet, Q. & Rigollet, P. (2013a). Complexity theoretic lower bounds for sparse principal component
detection. In Conference on learning theory (pp. 1046–1066).: PMLR.
Berthet, Q. & Rigollet, P. (2013b). Optimal detection of sparse principal components in high dimension.
The Annals of Statistics, 41(4), 1780–1815.
Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. The
Annals of statistics, 37(4), 1705–1732.
Blanchet, J., Kang, Y., Murthy, K., & Zhang, F. (2019). Data-driven optimal transport cost selection
for distributionally robust optimization. In 2019 winter simulation conference (WSC) (pp. 3740–3751).:
IEEE.
Bovet, D. P., Crescenzi, P., & Bovet, D. (1994). Introduction to the Theory of Complexity, volume 7. Prentice
Hall London.
Brennan, M. & Bresler, G. (2019). Optimal average-case reductions to sparse pca: From weak assumptions
to strong hardness. In Conference on Learning Theory (pp. 469–470).: PMLR.
Bühlmann, P. (2020). Invariance, causality and robustness. Statistical Science, 35(3), 404–426.
Candes, E. & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n.
The Annals of Statistics, 35(6), 2313 – 2351.
Chen, X., Ge, D., Wang, Z., & Ye, Y. (2014). Complexity of l2 -lp unconstrained minimization. Mathematical
Programming, 143(1), 371–383.
Chen, Y., Ge, D., Wang, M., Wang, Z., Ye, Y., & Yin, H. (2017). Strong np-hardness for sparse optimization
with concave penalty functions. In International Conference on Machine Learning (pp. 740–747).: PMLR.
Conze, J., Gani, J., & Fernique, X. (1975). Regularité des trajectoires des fonctions aléatoires gaussiennes.
Springer.
Dawid, A. P. & Didelez, V. (2010). Identifying the consequences of dynamic treatment strategies: A decision-
theoretic overview. Statistics Surveys, 4(none), 184 – 231.

Didelez, V., Dawid, P., & Geneletti, S. (2012). Direct and indirect effects of sequential treatments. arXiv
preprint arXiv:1206.6840.
Duchi, J. C. & Namkoong, H. (2021). Learning models with uniform performance via distributionally robust
optimization. The Annals of Statistics, 49(3), 1378–1406.

Erickson, J. (2023). Algorithms.


Fan, J., Fang, C., Gu, Y., & Zhang, T. (2024). Environment invariant linear least squares. Annals of
Statistics, 52(5), 2268–2292.
Fan, J. & Gu, Y. (2024). Factor augmented sparse throughput deep relu neural networks for high dimensional
regression. Journal of the American Statistical Association, 119(548), 2680–2694.

Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.
Journal of the American statistical Association, 96(456), 1348–1360.
Fan, J. & Liao, Y. (2014). Endogeneity in high dimensions. Annals of statistics, 42(3), 872.
Fan, J., Lou, Z., & Yu, M. (2022). Are latent factor regression and sparse regression adequate? arXiv
preprint arXiv:2203.01219.
Fan, J. & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911.
Fan, J., Wang, K., Zhong, Y., & Zhu, Z. (2021). Robust high dimensional factor models with applications
to statistical machine learning. Statistical Science, 36(2), 303–327.
Fan, J. & Zhou, W.-X. (2016). Guarding against spurious discoveries in high dimensions. Journal of Machine
Learning Research, 17(203), 1–34.
Fortnow, L. (2021). Fifty years of p vs. np and the possibility of the impossible. Communications of the
ACM, 65(1), 76–85.

Glymour, M., Pearl, J., & Jewell, N. P. (2016). Causal inference in statistics: A primer. John Wiley & Sons.
Granger, C. W. (1969). Investigating causal relations by econometric models and cross-spectral methods.
Econometrica: journal of the Econometric Society, (pp. 424–438).
Gu, Y., Fang, C., Bühlmann, P., & Fan, J. (2024). Causality pursuit from heterogeneous environments via
neural adversarial invariance learning. arXiv preprint arXiv:2405.04715.
Guo, Z. (2024). Statistical inference for maximin effects: Identifying stable associations across multiple
studies. Journal of the American Statistical Association, 119(547), 1968–1984.
Haavelmo, T. (1944). The probability approach in econometrics. Econometrica: Journal of the Econometric
Society, (pp. iii–115).
Hébert-Johnson, U., Kim, M., Reingold, O., & Rothblum, G. (2018). Multicalibration: Calibration for the
(computationally-identifiable) masses. In International Conference on Machine Learning (pp. 1939–1948).:
PMLR.
Heinze-Deml, C., Peters, J., & Meinshausen, N. (2018). Invariant causal prediction for nonlinear models.
Journal of Causal Inference, 6(2).
Huo, X. & Ni, X. (2007). When do stepwise algorithms meet subset selection criteria? The Annals of
Statistics, (pp. 870–887).

Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3),
187–200.
Kalnay, E., Kanamitsu, M., Kistler, R., Collins, W., Deaven, D., Gandin, L., Iredell, M., Saha, S., White,
G., Woollen, J., Zhu, Y., Leetmaa, A., Reynolds, B., Chelliah, M., Ebisuzaki, W., Higgins, W., Janowiak,
J., Mo, K. C., Ropelewski, C., Wang, J., Jenne, R., & Joseph, D. (1996). The NCEP/NCAR 40-Year
Reanalysis Project. Bulletin of the American Meteorological Society, 77(3), 437–472.
Kania, L. & Wit, E. (2022). Causal regularization: On the trade-off between in-sample risk and out-of-sample
risk guarantees. arXiv preprint arXiv:2205.01593.
Karp, R. M. (1972). Reducibility among combinatorial problems. In Complexity of Computer Computations:
Proceedings of a symposium on the Complexity of Computer Computations (pp. 85–103). New York:
Springer.
Kumar, K. K., Rajagopalan, B., & Cane, M. A. (1999). On the weakening relationship between the indian
monsoon and enso. Science, 284(5423), 2156–2159.
Li, S. & Zhang, L. (2024). Fairm: Learning invariant representations for algorithmic fairness and domain
generalization with minimax optimality. arXiv preprint arXiv:2404.01608.
Li, T., Zhang, Y., Chang, C.-P., & Wang, B. (2001). On the relationship between indian ocean sea surface
temperature and asian summer monsoon. Geophysical Research Letters, 28(14), 2843–2846.
Ma, Z. & Wu, Y. (2015). Computational barriers in minimax submatrix detection. The Annals of Statistics,
(pp. 1089–1116).
Meinshausen, N. & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso.
The Annals of Statistics, (pp. 1436–1462).
Meinshausen, N. & Bühlmann, P. (2015). Maximin effects in inhomogeneous large-scale data. The Annals of
Statistics, 43(4), 1801–1830.

Meinshausen, N., Hauser, A., Mooij, J. M., Peters, J., Versteeg, P., & Bühlmann, P. (2016). Methods for
causal inference from gene perturbation experiments and validation. Proceedings of the National Academy
of Sciences, 113(27), 7361–7368.
Mendelson, S., Pajor, A., & Tomczak-Jaegermann, N. (2007). Reconstruction and subgaussian operators in
asymptotic geometric analysis. Geometric and Functional Analysis, 17(4), 1248–1282.
Mohajerin Esfahani, P. & Kuhn, D. (2018). Data-driven distributionally robust optimization using the
wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming,
171(1), 115–166.
Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference by using invariant prediction:
identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical
Methodology), (pp. 947–1012).
Pfister, N., Bühlmann, P., & Peters, J. (2019). Invariant causal prediction for sequential data. Journal of
the American Statistical Association, 114(527), 1264–1276.

Raskutti, G., Wainwright, M. J., & Yu, B. (2010). Restricted eigenvalue properties for correlated gaussian
designs. The Journal of Machine Learning Research, 11, 2241–2259.
Rojas-Carulla, M., Schölkopf, B., Turner, R., & Peters, J. (2018). Invariant models for causal transfer
learning. The Journal of Machine Learning Research, 19(1), 1309–1342.

Rothenhäusler, D., Bühlmann, P., & Meinshausen, N. (2019). Causal dantzig: fast inference in linear
structural equation models with hidden variables under additive interventions. The Annals of Statistics,
47(3), 1688–1722.
Rothenhäusler, D., Meinshausen, N., Bühlmann, P., & Peters, J. (2021). Anchor regression: Heterogeneous
data meet causality. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 83(2),
215–246.
Rudelson, M. & Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Transactions
on Information Theory, 6(59), 3434–3447.
Runge, J. (2020). Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time
series datasets. In Conference on Uncertainty in Artificial Intelligence (pp. 1388–1397).: PMLR.
Runge, J., Petoukhov, V., Donges, J. F., Hlinka, J., Jajcay, N., Vejmelka, M., Hartman, D., Marwan, N.,
Paluš, M., & Kurths, J. (2015). Identifying causal gateways and mediators in complex spatio-temporal
systems. Nature communications, 6(1), 8502.
Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., & Mooij, J. (2012). On causal and anticausal
learning. arXiv preprint arXiv:1206.6471.
Shen, X., Bühlmann, P., & Taeb, A. (2023). Causality-oriented robustness: exploiting general additive
interventions. arXiv preprint arXiv:2307.10299.
Talagrand, M. (2005). The generic chaining: upper and lower bounds of stochastic processes. Springer
Science & Business Media.
Tibshirani, R. (1997). The lasso method for variable selection in the cox model. Statistics in medicine, 16(4),
385–395.
Timmermann, A., An, S.-I., Kug, J.-S., Jin, F.-F., Cai, W., Capotondi, A., Cobb, K. M., Lengaigne,
M., McPhaden, M. J., Stuecker, M. F., et al. (2018). El niño–southern oscillation complexity. Nature,
559(7715), 535–545.
Valiant, L. G. & Vazirani, V. V. (1985). Np is as easy as detecting unique solutions. In Proceedings of the
seventeenth annual ACM symposium on Theory of computing (pp. 458–463).
Varambally, S., Ma, Y.-A., & Yu, R. (2023). Discovering mixtures of structural causal models from time
series data. arXiv preprint arXiv:2310.06312.
Vejmelka, M., Pokorná, L., Hlinka, J., Hartman, D., Jajcay, N., & Paluš, M. (2015). Non-random correlation
structures and dimensionality reduction in multivariate climate data. Climate Dynamics, 44, 2663–2682.
Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science,
volume 47. Cambridge university press.

Wang, T., Berthet, Q., & Samworth, R. (2016). Statistical and computational trade-offs in estimation of
sparse principal components. Annals of Statistics, 44(5), 1896–1930.
Yin, M., Wang, Y., & Blei, D. M. (2021). Optimization-based causal estimation from heterogenous
environments. arXiv preprint arXiv:2109.11990.

Zhang, C.-H. & Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse
estimation problems. Statistical Science, 27(4), 576–593.
Zhang, Y., Wainwright, M. J., & Jordan, M. I. (2014). Lower bounds on the performance of polynomial-time
algorithms for sparse linear regression. In Conference on Learning Theory (pp. 921–948).: PMLR.

Zhao, P. & Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning
Research, 7, 2541–2563.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association,
101(476), 1418–1429.

Supplemental Materials
The supplemental materials are organized as follows:

Appendix A provides additional discussions about the computation barrier omitted in the main text.
Appendix B contains the discussions omitted in the main paper.
Appendix C contains the proofs for the computation barrier results.
Appendix D contains the proofs for the population-level results.
Appendix E contains the proofs for the finite sample results.

A More Discussions on Computational Barriers


In this section, we (1) answer the following questions related to the NP-hardness of the problem ExistLIS;
and (2) prove Lemma 2.1 in the main text.

Q1 The setup in (1.1) searches for a set under a weaker invariance condition than those adopted in the causal discovery
literature. Can stronger invariance conditions, like the full distributional invariance in Peters et al.
(2016), help?
Q2 The NP-hardness of the 0/1 Knapsack Problem relies on an exponentially large total budget, and there is a
poly(number of items, total budget) algorithm. Is the NP-hardness of ExistLIS due to the existence of
many varying solutions with heterogeneity signal $e^{-\Omega(d)}$, so that a computationally efficient algorithm
is possible if all the non-invariant solutions have large heterogeneity signals?
Q3 The covariance matrices in the construction of Lemma 2.2 are dense. Is computationally efficient estimation
attainable when the covariance matrices are all sparse?

The brief answers to the above questions are all “No”, and the rigorous statements can be found in the
following subsections.

A.1 Stronger Invariance Condition Cannot Help


We consider the following task, which is a special case of pursuing distributional invariance and under which
the conditional independence test can be easily done via simple calculations on the full covariance
matrix of $(X^\top, Y)^\top$. To be specific, (X, Y) are multivariate normally distributed with a positive definite full
covariance matrix in each environment. The decision problem can be described as follows.
Problem A.1. Let $d\in\mathbb{N}_+$ be the dimension of the explanatory covariate, and E be the number of
environments. We assume that for each $e\in\{1,\ldots,E\}$,
$$\begin{pmatrix} X^{(e)} \\ Y^{(e)} \end{pmatrix} \sim N\left(0,\ \begin{pmatrix} \Sigma^{(e)} & u^{(e)} \\ (u^{(e)})^\top & v^{(e)} \end{pmatrix}\right),$$
where $\Sigma^{(1)},\ldots,\Sigma^{(E)}\in\mathbb{R}^{d\times d}$ are positive definite matrices, $u^{(1)},\ldots,u^{(E)}$ are d-dimensional vectors, and $v^{(e)}$
is a scalar satisfying $v^{(e)} > (u^{(e)})^\top(\Sigma^{(e)})^{-1}u^{(e)}$. We say a set S is a non-trivial distribution-invariant set if
$$\beta^{(e,S)} \equiv \beta^{(S)} \ne 0 \quad\text{and}\quad v^{(e)} - (\beta_S^{(e,S)})^\top\Sigma_S^{(e)}\beta_S^{(e,S)} \equiv v_\varepsilon. \tag{A.1}$$
We define the problem ExistDIS as follows:
[Input] $\Sigma^{(1)},\ldots,\Sigma^{(E)}$, $u^{(1)},\ldots,u^{(E)}$ and $v^{(1)},\ldots,v^{(E)}$ satisfying the above constraints.
[Output] Returns 1 if there exists a non-trivial distribution-invariant set, otherwise 0.
We define the problem ExistDIS-Unique as the same problem with the promise that the non-trivial distribution-
invariant set is unique if it exists.

The following lemma shows that (A.1) is equivalent to the full distribution invariance condition (Assumption
1 in Peters et al. (2016)) under the setting in Problem A.1.
Lemma A.1. Under the setting of Problem A.1, S satisfies (A.1) if and only if
$$\exists\bar\beta\in\mathbb{R}^d,\ \bar\beta_S\ne 0\ \text{ s.t. }\ Y^{(e)} = (X_S^{(e)})^\top\bar\beta_S + \varepsilon^{(e)}\ \text{ with }\ \varepsilon^{(e)}\sim F_\varepsilon \perp\!\!\!\perp X_S^{(e)}\quad \forall e\in\{1,\ldots,E\}. \tag{A.2}$$

It is then easy to see that the problem ExistDIS-Unique corresponds to the case where the non-trivial
invariant set is unique if it exists and ICP (Peters et al., 2016) can uniquely identify S ⋆ . We have the
following result. The proof idea is that we construct the problem such that the additional invariant noise-
level constraint trivially holds for all the prediction-invariant solutions.

Theorem A.1. When E = 2, the problem ExistDIS is NP-hard under deterministic polynomial-time
reduction, and the problem ExistDIS-Unique is NP-hard under randomized polynomial-time reduction.

A.2 NP-hardness Remains when It is Well-Separated


For any fixed ϵ ∈ (0, 1), consider the following restricted version of the problem.
Problem A.2 (Existence of Linear Invariant Set under ϵ-Separation). For any fixed constant ϵ > 0, Problem
Exist-ϵ-Sep-LIS is defined as the same problem as ExistLIS with the additional ϵ-separation conditions
as follows
(a) $1 \le E[|Y^{(e)}|^2] \le 1000$ for any $e\in[E]$;
(b) $\frac{1}{E}\sum_{e=1}^{E}\|\beta_S^{(e,S)} - \beta_S^{(S)}\|^2_{\Sigma_S^{(e)}} \in \{0\}\cup[d^{-\epsilon}/1280, \infty)$ for any $S\subseteq[d]$;
(c) $\|\beta^{(S)} - \beta^{(S^\dagger)}\|^2_\Sigma \in [\mathbf{1}\{S\ne S^\dagger\}\, d^{-\epsilon}/1280, \infty)$ for any $S\subseteq[d]$ and any invariant set $S^\dagger$.
Condition (a) promises O(1) variance for the response, which is a typical regime considered by linear
regression analysis. Condition (b) enforces that the prediction variation is $\Omega(d^{-\epsilon})$ whenever S is not an invariant
set, and condition (c) assures that any non-invariant prediction is $\Omega(d^{-\epsilon})$ away from the invariant
prediction. The next theorem confirms that NP-hardness remains under this restrictive case.
Theorem A.2. For any fixed ϵ > 0, the problem Exist-ϵ-Sep-LIS is NP-hard under deterministic polynomial-
time reduction.
The above construction also naturally implies that the computation barrier remains if our target is to
find a solution close to some invariant solution within O(d−ϵ ) error.

Problem A.3. Consider the problem Exist-ϵ-Sep-LIS with E = 2 and suppose further $Y^{(e)} = (\beta^{(e,[d])})^\top X^{(e)}$,
i.e., there is no intrinsic noise. The input is the same, and the algorithm is required to output $\hat\beta\in\mathbb{R}^d$ such that
$\inf_{S:\,\beta^{(e,S)}\equiv\beta^{(S)}\ne 0} \|\hat\beta - \beta^{(S)}\|^2_\Sigma \le d^{-\epsilon}/4$ if $\{S : \beta^{(e,S)}\equiv\beta^{(S)}\ne 0\}\ne\emptyset$; its output can be an arbitrary
d-dimensional vector otherwise.

Corollary A.3. If Problem A.3 can be solved by a worst-case polynomial-time algorithm, then 3Sat can
also be solved by a worst-case polynomial-time algorithm.
The key idea is to divide the set [d] into two blocks: a block of size $d^{\epsilon/3}$, whose construction is similar
to that in Lemma 2.2, and a remaining auxiliary block, in which there is no invariant solution and the
predictive variance is carefully controlled. It is interesting to see if a similar result holds for ϵ = 0. We leave
it for future studies.

A.3 NP-Hardness Remains for Row-wise O(1)-Sparse Covariance


The following theorem shows the problem ExistLIS is NP-hard even when each row or column of matrix
Σ(e) , e ∈ E has only O(1) non-zero elements.

Theorem A.4. Consider the problem ExistLIS with the additional constraint that for any e ∈ [E], each
row of matrix Σ(e) has no more than C non-zero elements for some universal constant C > 0. The above
problem is NP-hard under deterministic polynomial-time reduction when E = 2.
The proof idea is as follows. We first reduce the general 3Sat problem x with k clauses to another 3Sat
problem x′ with $O(k^2)$ clauses. In x′ , each variable appears at most 15 times. This will further lead
to a row-wise sparse A in Lemma 2.2. A finer construction will also be adopted to distribute the constraints
imposed by the last dense row of Σ(2) into $O(k^2)$ sparse rows.

A.4 Proof of Lemma 2.1


The proof is similar to Theorem 1.1 in Valiant & Vazirani (1985). We will use the following lemma akin to
their Lemma 2.1. For any u, v ∈ {0, 1}n , we let u · v be the inner product over GF[2] of u, v.
Lemma A.2. Given any 3Sat formula f with n variables $v = (v_i)_{i=1}^n$ and $w\in\{0,1\}^n$, let S be the set of
all assignments that make the formula evaluate to true. One can construct a 3Sat formula f ′ in at most
2n variables v ′ and 3(n − 1) + k + 1 clauses such that there exists a bijection between its solution set S ′ and
$S\cap\{v : v\cdot w = 0\}$.
Proof of Lemma A.2. Let $i_1,\ldots,i_m$ be the indices with $w_{i_j} = 1$. The constraint $w\cdot v = 0$ can be written as
$v_{i_1}\otimes\cdots\otimes v_{i_m} = 0$, where ⊗ is the XOR operation, which is equivalent to
$$v_{i_1}\otimes v_{i_2} = t_1,\quad v_{i_3}\otimes t_1 = t_2,\quad\cdots\quad v_{i_{m-1}}\otimes t_{m-2} = t_{m-1},\quad v_{i_m}\otimes t_{m-1} = t_m,\quad t_m = 0$$
with another $m\le n$ binary variables $t_1,\cdots,t_{m-1},t_m$. The constraint $x_1\otimes x_2 = x_3$ is equivalent to the
following 4-clause 3Sat formula

(¬x1 ∨ ¬x2 ∨ ¬x3 ) ∧ (x1 ∨ ¬x2 ∨ x3 ) ∧ (¬x1 ∨ x2 ∨ x3 ) ∧ (x1 ∨ x2 ∨ ¬x3 )

The last constraint tm = 0 can be written as the clause (¬tm ∨ ¬tm ∨ ¬tm ).

The rest of the proof is the same as that in Theorem 1.1: we can construct a randomized polynomial
reduction from 3Sat to 3Sat-Unique.
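The clause gadget in the proof of Lemma A.2 can be written out mechanically. The following sketch generates the 3Sat clauses encoding the parity constraint; the variable encoding (positive integers for variables, negative for negations) is an illustrative convention, not part of the paper.

```python
def parity_clauses(indices, t_start):
    """3Sat clauses encoding v_{i_1} xor ... xor v_{i_m} = 0, as in Lemma A.2.

    indices: the indices i_1, ..., i_m with w_{i_j} = 1 (1-based).
    t_start: first index available for the auxiliary variables t_1, t_2, ...
    """
    def xor_eq(a, b, c):
        # the 4-clause gadget for a xor b = c
        return [(-a, -b, -c), (a, -b, c), (-a, b, c), (a, b, -c)]

    clauses, prev = [], indices[0]
    for step, v in enumerate(indices[1:]):
        t = t_start + step                 # auxiliary variable t_{step+1}
        clauses += xor_eq(prev, v, t)
        prev = t
    clauses.append((-prev, -prev, -prev))  # final constraint t_m = 0
    return clauses
```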

B Omitted Discussions
B.1 Discussion on Li & Zhang (2024)
Problem B.1. Under the same setting as Problem 2.3 with E = 2, it takes $(\Sigma^{(1)},\Sigma^{(2)})$ and $(u^{(1)},u^{(2)})$ as
input and is required to determine whether there exists $S\subseteq[d]$ with $|S|\ge d/7$ such that $\Sigma_S^{(1)} = \Sigma_S^{(2)}$ and
$u_S^{(1)} = u_S^{(2)}$.
Here we test the existence of a “large”, namely |S| ≥ d/7, covariance-invariant set rather than any covariance-invariant
set; this is because S being covariance-invariant in Problem B.1 implies that {j} is covariance-invariant
for any j ∈ S, and testing the existence of a univariate invariant set is trivial and admits an O(d·E) algorithm.
The proof is similar to that of Theorem 2.1 by letting $u^{(1)} = u^{(2)}$.
Theorem B.1. Problem B.1 is NP-hard.

B.2 Discussion on Condition 3.3


To show when Condition 3.3 holds for small k under the structural causal model (SCM) framework, we consider
the setting of an SCM with interventions on X; see also Section 3 of Gu et al. (2024). We first introduce the
definition of an SCM and the setting considered.
Definition 5 (Structural Causal Model). A structural causal model $M = (\mathcal{S}, \nu)$ on p variables $Z_1,\ldots,Z_p$
can be described using p assignment functions $\{f_1,\ldots,f_p\} = \mathcal{S}$:
$$Z_j \leftarrow f_j(Z_{\mathrm{pa}(j)}, U_j), \qquad j = 1,\ldots,p,$$
where $\mathrm{pa}(j)\subseteq\{1,\ldots,p\}$ is the set of parents, or the direct causes, of the variable $Z_j$, and $\nu(du) = \prod_{j=1}^p \nu_j(du_j)$
is the joint distribution over p independent exogenous variables $(U_1,\ldots,U_p)$. For a given model M, there is
an associated directed graph $G(M) = (V, E)$ that describes the causal relationships among variables, where
V = [p] is the set of nodes and E is the edge set such that $(i,j)\in E$ if and only if $i\in\mathrm{pa}(j)$. G(M) is acyclic if
there is no sequence $(v_1,\ldots,v_k)$ with $k\ge 2$ such that $v_1 = v_k$ and $(v_i, v_{i+1})\in E$ for any $i\in[k-1]$.
As in Peters et al. (2016), we consider the following data-generating process in |E| environments. For each
e ∈ E, the process governing p = d + 1 random variables $Z^{(e)} = (Z_1^{(e)},\ldots,Z_{d+1}^{(e)}) = (X_1^{(e)},\ldots,X_d^{(e)}, Y^{(e)})$ is
derived from an SCM $M^{(e)}(\mathcal{S}^{(e)}, \nu)$. We let $e_0\in\mathcal{E}$ be the observational environment for reference and the
rest are interventional environments. We let G be the directed graph representing the causal relationships
in $e_0$, and simply let G be shared across E without loss of generality. We assume G is acyclic. In each
environment e ∈ E, the assignments are as follows:
$$X_j^{(e)} \leftarrow f_j^{(e)}\big(Z_{\mathrm{pa}(j)}^{(e)}, U_j\big), \quad j = 1,\ldots,d, \qquad\qquad Y^{(e)} \leftarrow f_{d+1}\big(X_{\mathrm{pa}(d+1)}^{(e)}, U_{d+1}\big). \tag{B.1}$$
Here the distribution of exogenous variables $(U_1,\ldots,U_{d+1})$, the cause-effect relationship $\{\mathrm{pa}(j)\}_{j=1}^{d+1}$ represented
by G, and the structural assignment $f_{d+1}$ are invariant across e ∈ E, while the structural assignments for X
may vary among e ∈ E. The heterogeneity, which is emphasized by the superscript (e), is due to the arbitrary
interventions on the variables X. We use $Z_{\mathrm{pa}(j)}$ to emphasize that Y can be a direct cause of some variables
in the covariate vector.
We denote by $I\subseteq[d]$, defined as $I := \{j : f_j^{(e)}\ne f_j^{(e_0)}\text{ for some }e\in\mathcal{E}\}$, the set of intervened variables.
We summarize the above data-generating process as a condition.
Condition B.1. Suppose {M (e) }e∈E are defined by (B.1), G is acyclic, and fd+1 is a linear function.
Proposition B.2. Under the model (1.1) with regularity condition Condition 3.4, suppose one of the
following conditions holds.
(a) There exists a partition $S^\star = \cup_{l=1}^L S_l^\star$ such that $E[X_{S_l^\star}^{(e)}(X_{S_r^\star}^{(e)})^\top] = 0$ for any $l\ne r$ and $|S_l^\star|\le k$.
(b) Assume Condition B.1 holds, and define the ancestor set recursively as $\mathrm{at}(j) = \mathrm{pa}(j)\cup\bigcup_{k\in\mathrm{pa}(j)}\mathrm{at}(k)$.
We have $I\cap\mathrm{at}(d+1) = \emptyset$, $S^\star = \mathrm{pa}(d+1)$, and $k\ge 1$.
Then Condition 3.3 holds.
Proof of Proposition B.2. We first prove (a). To be specific, we show that
$$\forall e\in\mathcal{E},\ l\in[L],\qquad \beta_{S_l^\star}^{(e,S_l^\star)} = \beta_{S_l^\star}^{\star}.$$
It follows from Condition 3.4 and the definition of least squares that
$$\begin{aligned}
\beta_{S_l^\star}^{(e,S_l^\star)} &= \Big[\Sigma_{S_l^\star}^{(e)}\Big]^{-1} E\big[X_{S_l^\star}^{(e)} Y^{(e)}\big] \\
&= \Big[\Sigma_{S_l^\star}^{(e)}\Big]^{-1} E\Big[X_{S_l^\star}^{(e)}\Big(\varepsilon^{(e)} + (X_{S_l^\star}^{(e)})^\top\beta_{S_l^\star}^{\star} + \sum_{r\ne l}(X_{S_r^\star}^{(e)})^\top\beta_{S_r^\star}^{\star}\Big)\Big] \\
&= \Big[\Sigma_{S_l^\star}^{(e)}\Big]^{-1}\Big\{ E\big[X_{S_l^\star}^{(e)}\varepsilon^{(e)}\big] + \Sigma_{S_l^\star}^{(e)}\beta_{S_l^\star}^{\star} + \sum_{r\ne l} E\big[X_{S_l^\star}^{(e)}(X_{S_r^\star}^{(e)})^\top\big]\beta_{S_r^\star}^{\star}\Big\} \overset{(i)}{=} \beta_{S_l^\star}^{\star},
\end{aligned}$$
where (i) follows from the exogeneity of $X_{S^\star}$ in (1.1) and condition (a).
Now we prove (b). Given the condition in (b), we have for any $j\in S^\star = \mathrm{pa}(d+1)$ and $e, e'\in\mathcal{E}$,
$$\beta_j^{(e,\{j\})} = \frac{E[X_j^{(e)}Y^{(e)}]}{E[|X_j^{(e)}|^2]} = \frac{E[X_j^{(e')}Y^{(e')}]}{E[|X_j^{(e')}|^2]} = \beta_j^{(e',\{j\})},$$
so Condition 3.3 holds with $S = \{j\}$.

C Proofs for Computation Fundamental Limits


C.1 Proof of Lemma 2.3
Proof of (2.5). We first establish the upper bound in (2.5). It follows from the definition of $\beta^{(S)}$ that
$$\|\beta^{(S)} - \beta^{(S^\dagger)}\|^2_\Sigma = u_S^\top\Sigma_S^{-1}u_S + u_{S^\dagger}^\top\Sigma_{S^\dagger}^{-1}u_{S^\dagger} - 2u_{S\cap S^\dagger}^\top\Sigma_{S\cap S^\dagger}^{-1}u_{S\cap S^\dagger} \overset{(a)}{\le} 2\times\frac{1}{2}\sum_{e\in[2]}E[|Y^{(e)}|^2] \le \sum_{e\in[2]}E[|Y^{(e)}|^2].$$
Here (a) follows from the fact that the pooled full covariance matrix $\begin{pmatrix}\Sigma & u\\ u^\top & \frac{1}{2}\sum_{e\in\mathcal{E}}E[|Y^{(e)}|^2]\end{pmatrix}$ is positive
semi-definite.
Now we turn to the lower bound. We denote $\tilde A = \begin{pmatrix} A & \frac{1}{2}\mathbf{1}_k \\ \frac{1}{2}\mathbf{1}_k^\top & 0\end{pmatrix}$. It is easy to see that $\|\tilde A\|_F \le d$;
combining this with the fact that $\|\tilde A\|_2 \le \|\tilde A\|_F$, the maximum and minimum eigenvalues of $\Sigma^{(2)}$ can be
controlled by
$$4d \le 5d - \|\tilde A\|_2 \le \lambda_{\min}(\Sigma^{(2)}) \le \lambda_{\max}(\Sigma^{(2)}) \le 5d + \|\tilde A\|_2 \le 6d. \tag{C.1}$$
4d ≤ 32k − ∥A∥ e 2 ≤ 6d (C.1)

When there is no intrinsic noise, the variance of $Y^{(e)}$ can be exactly calculated as
$$E[|Y^{(1)}|^2] = (u^{(1)})^\top(\Sigma^{(1)})^{-1}u^{(1)} = d$$
and upper bounded as
$$E[|Y^{(2)}|^2] = (u^{(2)})^\top(\Sigma^{(2)})^{-1}u^{(2)} \le \big[\lambda_{\min}(\Sigma^{(2)})\big]^{-1}\|u^{(2)}\|_2^2 \le \frac{1}{4d}\times d\times(6d)^2 \le 9d^2.$$
Therefore, we have
$$\sum_{e\in[2]}E[|Y^{(e)}|^2] \le 10d^2. \tag{C.2}$$

On the other hand, by (C.1), we obtain
$$\|\beta^{(S)} - \beta^{(S^\dagger)}\|^2_\Sigma = \big(\beta^{(S)} - \beta^{(S^\dagger)}\big)^\top\Sigma\Sigma^{-1}\Sigma\big(\beta^{(S)} - \beta^{(S^\dagger)}\big) \ge \frac{1}{\lambda_{\max}(\Sigma)}\Big\|\Sigma\big(\beta^{(S)} - \beta^{(S^\dagger)}\big)\Big\|_2^2 \ge \frac{1}{7d}\Big\|\Sigma\big(\beta^{(S)} - \beta^{(S^\dagger)}\big)\Big\|_2^2. \tag{C.3}$$
We denote $S_1 = S\setminus S^\dagger$ and $S_2 = S^\dagger\setminus S$. We will establish the lower bound on $\|\Delta\|_2^2$ for $\Delta = \Sigma(\beta^{(S)} - \beta^{(S^\dagger)}) \in
\mathbb{R}^d$ when $S\ne S^\dagger$. Given $S\ne S^\dagger$, one has either $S_1\ne\emptyset$ or $S_2\ne\emptyset$. Without loss of generality, we assume
that $S_2\ne\emptyset$.

First, one has
$$\begin{aligned}
\big\|\beta_S^{(S)}\big\|_2 = \big\|(\Sigma_S)^{-1}u_S\big\|_2 &= \Big\|\Big(\tfrac{5d+1}{2}I_{|S|} + \tfrac{1}{2}\tilde A_S\Big)^{-1}u_S\Big\|_2 \\
&= \frac{2}{5d+1}\Big\|\Big(I_{|S|} + \tfrac{1}{5d+1}\tilde A_S\Big)^{-1}u_S\Big\|_2 \\
&\overset{(a)}{\le} \frac{2}{5d+1}\Big\|\Big(I_{|S|} + \tfrac{1}{5d+1}\tilde A_S\Big)^{-1} - I_{|S|}\Big\|_2\|u_S\|_2 + \frac{2}{5d+1}\|u_S\|_2 \\
&\overset{(b)}{\le} \frac{2}{5d+1}\Big(1 + 2\frac{d}{5d+1}\Big)\frac{5d+1+0.5k}{2}\sqrt{d} \\
&\le (1 + 2/5)\times(1 + 0.5/5)\sqrt{d} \le 1.5\sqrt{d}.
\end{aligned}$$
Here (a) follows from the triangle inequality, and (b) follows from the fact that $\|(I+M)^{-1} - I\|_2 \le 2\|M\|_2$ if
$\|M\|_2\le 0.5$. Pick $j\in S_2$; it follows from the above upper bound, the fact $j\notin S$ and the Cauchy–Schwarz
inequality that
$$\Delta_j = \Sigma_{j,S}^\top\beta_S^{(S)} - u_j \le \|\tilde A_{j,S}\|_2\big\|\beta_S^{(S)}\big\|_2 - \frac{1}{2}(5d + 0.5 + 1) \le 1.5d - 2.5d - 0.75 \le -d - 0.75.$$
This further yields that $\|\Delta\|_2^2 \ge |\Delta_j|^2 \ge d^2$. Combining it with (C.3) and (C.2) completes the proof of the
lower bound.

Proof of (2.6). For the upper bound, we have
$$\begin{aligned}
\sum_{e\in[2]}\|\beta^{(S)} - \beta^{(e,S)}\|^2_{\Sigma^{(e)}} &= \min_{\mathrm{supp}(\beta)\subseteq S}\sum_{e\in[2]}\|\beta^{(e,S)} - \beta\|^2_{\Sigma^{(e)}} \\
&\overset{(a)}{\le} \sum_{e\in[2]}\|\beta^{(e,S)}\|^2_{\Sigma^{(e)}} = \sum_{e\in[2]}(u_S^{(e)})^\top(\Sigma_S^{(e)})^{-1}u_S^{(e)} \overset{(b)}{\le} \sum_{e\in[2]}E[|Y^{(e)}|^2].
\end{aligned}$$
Here (a) follows from the definition of $\beta^{(e,S)}$, and (b) follows from the fact that the covariance matrix
$\begin{pmatrix}\Sigma^{(e)} & u^{(e)} \\ (u^{(e)})^\top & E[|Y^{(e)}|^2]\end{pmatrix}$ is positive semi-definite.
Turning to the lower bound,
$$\begin{aligned}
\sum_{e\in[2]}\|\beta^{(S)} - \beta^{(e,S)}\|^2_{\Sigma^{(e)}} &\ge \|\beta^{(S)} - \beta^{(1,S)}\|_2^2 = \Big\|\Sigma_S^{-1}\big(0.5u_S^{(1)} + 0.5u_S^{(2)}\big) - u_S^{(1)}\Big\|_2^2 \\
&\ge \big[\lambda_{\max}(\Sigma)\big]^{-2}\Big\|0.5u_S^{(1)} + 0.5u_S^{(2)} - 0.5u_S^{(1)} - 0.5\Sigma_S^{(2)}u_S^{(1)}\Big\|_2^2 \\
&\ge \big[4\lambda_{\max}(\Sigma)\big]^{-2}\big\|2\Sigma_S^{(2)}u_S^{(1)} - u_S^{(2)}\big\|_2^2.
\end{aligned}$$
Observe that all the entries of the vector $2\Sigma_S^{(2)}u_S^{(1)} - u_S^{(2)}$ are integers. Then unless $\Sigma_S^{(2)}u_S^{(1)} = u_S^{(2)}$, in other
words, unless S is an invariant set by Definition 4, we have $\|2\Sigma_S^{(2)}u_S^{(1)} - u_S^{(2)}\|_2^2 \ge 1$. Therefore, we have
$$\sum_{e\in[2]}\|\beta^{(S)} - \beta^{(e,S)}\|^2_{\Sigma^{(e)}} \ge \|\beta^{(S)} - \beta^{(1,S)}\|_2^2 \ge \big[4\lambda_{\max}(\Sigma)\big]^{-2} \ge (784d^2)^{-1}$$
if S is not an invariant set. Combining it with the upper bound (C.2) completes the proof.

C.2 Proofs in Appendix A.1


Proof of Lemma A.1. Denote $\hat\varepsilon^{(e)} = Y^{(e)} - (\beta^{(e,S)})^\top X^{(e)}$; we have
$$\text{Cond (A.2)} \iff \beta^{(e,S)}\equiv\bar\beta_S\ne 0\ \text{ and }\ \hat\varepsilon^{(e)}\sim F_\varepsilon \perp\!\!\!\perp X_S^{(e)} \overset{(a)}{\iff} \beta^{(e,S)}\equiv\bar\beta_S\ne 0\ \text{ and }\ \mathrm{var}(\hat\varepsilon^{(e)})\equiv v_\varepsilon \overset{(b)}{\iff} \text{Cond (A.1)},$$
where (a) follows from the fact that (X, Y) are multivariate Gaussian, under which independence is equivalent
to uncorrelatedness, together with the fact that $\hat\varepsilon^{(e)}$ is also Gaussian, and (b) follows from the fact that
$$\mathrm{var}(\hat\varepsilon^{(e)}) = E[|Y^{(e)}|^2] - 2(\beta_S^{(e,S)})^\top E[X_S^{(e)}Y^{(e)}] + (\beta_S^{(e,S)})^\top\Sigma_S^{(e)}\beta_S^{(e,S)} = v^{(e)} - (\beta_S^{(e,S)})^\top\Sigma_S^{(e)}\beta_S^{(e,S)}.$$

Proof of Theorem A.1. The proof is similar to that of Theorem 2.1. For each instance x, we use the same
reduction construction of (Σ, u) in the problem y constructed in Lemma 2.2 and let
$$v^{(1)} = 100d^5 + k + 1, \qquad v^{(2)} = 100d^5 + 5d(k+1) + k;$$
this furnishes a new problem $\tilde y$ of ExistDIS. It is easy to see that
$$\forall e\in[2],\qquad v^{(e)} - (u_S^{(e)})^\top(\Sigma_S^{(e)})^{-1}u_S^{(e)} \ge 1.$$
Moreover, for any valid solution $S\in\mathcal{S}_y$, one has
$$v^{(1)} - (\beta_S^{(1,S)})^\top\Sigma_S^{(1)}\beta_S^{(1,S)} = v^{(1)} - \mathbf{1}_{|S|}^\top\mathbf{1}_{|S|} = v^{(1)} - (k+1) = 100d^5$$
and
$$\begin{aligned}
v^{(2)} - (\beta_S^{(2,S)})^\top\Sigma_S^{(2)}\beta_S^{(2,S)} &= v^{(2)} - \mathbf{1}_{|S|}^\top\Sigma_S^{(2)}\mathbf{1}_{|S|} \\
&\overset{(a)}{=} v^{(2)} - \Big(5d + 2\cdot\frac{1}{2}(|S|-1) + \mathbf{1}_k^\top(5dI_d + A)_{\mathring S}\mathbf{1}_k\Big) \\
&\overset{(b)}{=} v^{(2)} - 5d(1+k) - k = 100d^5.
\end{aligned}$$
Here (a) follows from the fact that $d\in S$ provided $S\in\mathcal{S}_y$, and (b) follows from the facts that $A_{i,j} = 0$ for any
$i,j\in\mathring S$ and $|S| = k+1$ provided $S\in\mathcal{S}_y$. This further yields that $\mathcal{S}_y\subseteq\mathcal{S}_{\tilde y}$. Combined with the fact that
$\mathcal{S}_{\tilde y}\subseteq\mathcal{S}_y$, one further has $\mathcal{S}_{\tilde y} = \mathcal{S}_y$. The rest of the proof follows similarly.

C.3 Proof of Theorem A.2
We adopt a similar reduction idea as that in Lemma 2.2. Without loss of generality, we assume k ≥ 104 and
ϵ < 0.5.
We first introduce one additional notation. For any integer $\ell > 0$, we define the positive definite $\ell \times \ell$ matrix $H_\ell$ as follows:
\[
(H_\ell)_{j,j'} = \begin{cases} 2 & j = j' \\ 1 & \text{otherwise} \end{cases} \tag{C.4}
\]
for any $j, j' \in [\ell]$. Namely, $H_\ell = I_\ell + 1_\ell 1_\ell^\top$ for any $\ell \ge 1$. One can thereby obtain $H_\ell^{-1} = I_\ell - \frac{1}{\ell+1} 1_\ell 1_\ell^\top$.
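As a quick sanity check (not part of the argument), the closed form of $H_\ell^{-1}$ above, together with the identity $1_\ell^\top H_\ell^{-1} 1_\ell = \ell/(1+\ell)$ used later in this proof, can be verified numerically; the sketch below uses an arbitrary size $\ell = 5$.

    import numpy as np

    l = 5
    H = np.eye(l) + np.ones((l, l))                  # H_l: 2 on the diagonal, 1 elsewhere
    H_inv = np.eye(l) - np.ones((l, l)) / (l + 1)    # claimed inverse (Sherman-Morrison)
    assert np.allclose(H @ H_inv, np.eye(l))
    assert np.isclose(np.ones(l) @ H_inv @ np.ones(l), l / (l + 1))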
Step 1. Construct the Reduction. For any 3Sat instance $x$ with input size $k$, we construct an ExistLIS instance $y$ with size $d = \lceil k^{3/\epsilon} \rceil$ as follows:
\[
\Sigma^{(1)} = \begin{pmatrix} 32k \cdot I_{7k+1} & 0 \\ 0 & H_{d-7k-1} \end{pmatrix}
\quad\text{and}\quad
u^{(1)} = k^{-1}\begin{pmatrix} 32k \cdot 1_{7k+1} \\ 1_{d-7k-1} \end{pmatrix},
\]
and
\[
\Sigma^{(2)} = \begin{pmatrix} 32k \cdot I_{7k} + A & \frac{1}{2}\cdot 1_{7k} & 0 \\ \frac{1}{2}\cdot 1_{7k}^\top & 32k & 0 \\ 0 & 0 & H_{d-7k-1} \end{pmatrix}
\quad\text{and}\quad
u^{(2)} = k^{-1}\begin{pmatrix} (32k + \frac{1}{2})\cdot 1_{7k} \\ 32k + \frac{1}{2}k \\ -3\cdot 1_{d-7k-1} \end{pmatrix}.
\]

One can observe that both $\Sigma^{(1)}$ and $\Sigma^{(2)}$ are respectively composed of an upper-left $(7k+1)\times(7k+1)$ matrix and a lower-right $(d-7k-1)\times(d-7k-1)$ matrix $H_{d-7k-1}$. Recall that in the proof of (2.5) we introduced the notation of the matrix $\tilde A = \begin{pmatrix} A & \frac{1}{2} 1_{7k} \\ \frac{1}{2} 1_{7k}^\top & 0 \end{pmatrix}$, and we have $\|\tilde A\|_2 \le \|\tilde A\|_F \le 7k+1$. Then, similar to (C.1), the maximum and minimum eigenvalues of $\Sigma^{(2)}_{[7k+1]}$ can be controlled by
\[
24k \le 32k - \|\tilde A\|_2 \le \lambda_{\min}\big(\Sigma^{(2)}_{[7k+1]}\big) \le \lambda_{\max}\big(\Sigma^{(2)}_{[7k+1]}\big) \le 32k + \|\tilde A\|_2 \le 40k. \tag{C.5}
\]
Combining this with the fact that $H_\ell$ is positive definite for any $\ell \ge 1$, we can conclude that both $\Sigma^{(1)}$ and $\Sigma^{(2)}$ are positive definite, and the above reduction can be computed within polynomial time.
Now it suffices to show that (1) The above construction is a parsimonious reduction; and (2) The instance
y lies in the problem Exist-ϵ-Sep-LIS. In order to complete the remaining proof, it is helpful to observe
that there are three modifications in this construction compared to the construction in Lemma 2.2.
(a) We introduce an auxiliary (d − 7k − 1)-dimension part [d] \ [7k + 1]. We will show that this part is
precluded by any invariant set.
(1) (2)
(b) We change the diagonal coordinates in Σ[7k+1] from 1 to 32k, and those in Σ[7k+1] from 5d to 32k to
(1) (2)
make E[|Y (1) |2 ] ≍ E[|Y (2) |2 ]. We also change the coordinates of u[7k+1] and u[7k+1] accordingly.

(c) We add a k −1 multiplicative factor in u(1) and u(2) to let E[|Y (1) |2 ], E[|Y (2) |2 ] ≍ 1. This will also result
in all the β (e,S) and β (S) being multiplied by the same k −1 factor.
Step 2. Verification of Parsimonious Reduction. We first claim that the auxiliary (d − 7k − 1)-
dimension part [d] \ [7k + 1] is precluded by any invariant set, namely
∀S † ∈ Sy =⇒ S † ∩ {7k + 2, . . . , d} = ∅. (C.6)

To this end, we adopt the proof-by-contradiction argument. To be specific, if $j \in S^\dagger$ for some $j \in \{7k+2, \ldots, d\}$ and $S^\dagger \in \mathcal{S}_y$, then the equations $\Sigma^{(1)}_{S^\dagger}\beta^{(S^\dagger)}_{S^\dagger} = u^{(1)}_{S^\dagger}$ and $\Sigma^{(2)}_{S^\dagger}\beta^{(S^\dagger)}_{S^\dagger} = u^{(2)}_{S^\dagger}$ yield
\[
u_j^{(1)} = \Big[\Sigma^{(1)}_{S^\dagger}\beta^{(S^\dagger)}\Big]_j = \Sigma^{(1)}_{j,S^\dagger}\beta^{(S^\dagger)}
\qquad\text{and}\qquad
u_j^{(2)} = \Big[\Sigma^{(2)}_{S^\dagger}\beta^{(S^\dagger)}\Big]_j = \Sigma^{(2)}_{j,S^\dagger}\beta^{(S^\dagger)}.
\]
However, in our construction $\Sigma^{(1)}_{j,S^\dagger} = \Sigma^{(2)}_{j,S^\dagger}$ while $u^{(1)}_j \ne u^{(2)}_j$. This leads to a contradiction. Therefore, an invariant set should not contain any element in $\{7k+2, \ldots, d\}$.
By (C.6), we have the following statements similar to (2.3) in the proof of Lemma 2.2:
\[
\begin{aligned}
S^\dagger \in \mathcal{S}_y
&\overset{(a)}{\iff} S^\dagger \ne \emptyset \text{ and } \beta^{(2,S^\dagger)} = \beta^{(1,S^\dagger)} \text{ with } \beta_j^{(1,S^\dagger)} = k^{-1}\mathbb{1}\{j \in S^\dagger\}\\
&\overset{(b)}{\iff} S^\dagger = \mathring S \cup \{7k+1\} \text{ with } |\mathring S| = k \text{ and } A_{j,j'} = 0 \;\; \forall j, j' \in \mathring S \subseteq [7k]\\
&\overset{(c)}{\iff} \mathring S = \{7(i-1) + a_i\}_{i=1}^k \text{ with } a_i \in [7] \text{ s.t. adopting action ID } a_i\\
&\qquad\quad\text{ in clause } i \in [k] \text{ will lead to a valid solution } v \in \mathcal{S}_x.
\end{aligned}
\tag{C.7}
\]

We emphasize that the proofs of (C.7)(a) and (c) are essentially identical to those of (2.3). For completeness, we prove (b).
Proof of (C.7)(b). The proof is almost identical to the proof of (2.3)(b), since the major difference is the $k^{-1}$ multiplicative factor. The direction $\Leftarrow$ is obvious. For the $\Rightarrow$ direction, we first show that $7k+1 \in S^\dagger$ using the proof-by-contradiction argument. Suppose $|S^\dagger| \ge 1$ but $7k+1 \notin S^\dagger$; picking $j \in S^\dagger$, we have
\[
k\Big[\Sigma^{(2)}_{S^\dagger}\beta^{(2,S^\dagger)}_{S^\dagger}\Big]_j = 32k + \sum_{j'=1}^{7k} A_{j,j'}\mathbb{1}\{j' \in S^\dagger\} \ne 32k + \frac{1}{2} = k\cdot u_j^{(2)},
\]

(2,S † ) (1,S † )
where the first equality follows from the assumption βj = βj = k −1 1{j ∈ S † } and 7k + 1 ∈
/ S † , and
the inequality follows from the fact that A ∈ {0, 1}7k×7k hence the L.H.S. is an integer. This indicates that
† †
β (1,S ) ̸= β (2,S ) if |S † | ≥ 1 and 7k + 1 ∈
/ S † . Given 7k + 1 ∈ S † , we then obtain
7k
1 (2)
h
(2) (2,S † )
i 1X 1
32k + k = k · u7k+1 = k ΣS † βS † = 32k + 1{j ′ ∈ S † } = 32k + (|S † | − 1),
2 †
|S | 2 ′ 2
j =1

(2) (2) (2,S † )


which implies that |S † | = k + 1. Now we still have the constraint uS̊ = ΣS̊ βS̊ + 1
2 · 1k . The last claim

Aj ′ ,j = 0 for any j , j ∈ S̊ then follows from this by observing that
 
1 (2) 1 (i)
32k + · 1k = ΣS̊ 1k + · 1k =⇒ AS̊ 1k = 0 =⇒ Aj ′ ,j = 0 ∀j ′ , j ∈ S̊
2 2

where (i) follows from the fact that A ∈ {0, 1}7k×7k .


Therefore, we can conclude that this mapping is a parsimonious polynomial-time reduction from 3Sat to
ExistLIS. Given the conditions (1) – (3) further holds as verified below, the instance is an Exist-ϵ-Sep-LIS
instance. Hence the problem Exist-ϵ-Sep-LIS is NP-hard.
Lemma C.1. The above constructed instance is an Exist-ϵ-Sep-LIS instance.
Proof of Lemma C.1. Step 1 Calculating the Variance of Y (e) for e ∈ {1, 2}. Now we calculate
E[|Y (1) |2 ] and E[|Y (2) |2 ]. Without loss of generality we consider the cases where Y (e) is a linear combination
of X (e) for e ∈ {1, 2}, under which E[|Y (e) |2 ] = (u(e) )⊤ (Σ(e) )−1 u(e) for e ∈ {1, 2}.
For e = 1, we have

E[|Y (1) |2 ] = (u(1) )⊤ (Σ(1) )−1 u(1)


(a) (1) (1) (1) (1) (1) (1)
= (u[7k+1] )⊤ (Σ[7k+1] )−1 u[7k+1] + (u[d]\[7k+1] )⊤ (Σ[d]\[7k+1] )−1 u[d]\[7k+1]
 
1 −1
= (k −1 )2 (32k)2 (7k + 1) + 1⊤
d−7k−1 d−7k−1 d−7k−1 .
H 1
32k

−1
Here (a) follows from the fact that Σ(1) is a block diagonal matrix. It follows from the identity 1⊤
ℓ Hℓ 1ℓ =
ℓ/(1 + ℓ) that
1
1 < (k −1 )2 (32k)2 (7k + 1)
32k
≤ E[|Y (1) |2 ]
d − 7k − 1
 
≤ (k −1 )2 32k · (7k + 1) +
d − 7k − 1 + 1
< 256.
Similarly, for E[|Y (2) |2 ], following from the fact that Σ(2) is block diagonal, we obtain
E[|Y (2) |2 ] = (u(2) )⊤ (Σ(2) )−1 u(2)
(2) (2) (2) (2) (2) (2)
= (u[7k+1] )⊤ (Σ[7k+1] )−1 u[7k+1] + (u[d]\[7k+1] )⊤ (Σ[d]\[7k+1] )−1 u[d]\[7k+1]
(2) (2) (2) −1
= (u[7k+1] )⊤ (Σ[7k+1] )−1 u[7k+1] + 9(k −1 )2 1⊤
d−7k−1 Hd−7k−1 1d−7k−1 .

Recall that $\lambda_{\max}(\Sigma^{(2)}_{[7k+1]}) \le 40k$ and $\lambda_{\min}(\Sigma^{(2)}_{[7k+1]}) \ge 24k$; we have
\[
1 < (k^{-1})^2 (40k)^{-1} (32k)^2 (7k+1) \le (u^{(2)}_{[7k+1]})^\top (\Sigma^{(2)}_{[7k+1]})^{-1} u^{(2)}_{[7k+1]} \le (k^{-1})^2 (24k)^{-1} (32k+k)^2 (7k+1) < 999.
\]
Therefore,
\[
1 < \mathbb{E}[|Y^{(2)}|^2] < 999 + 9(k^{-1})^2 < 1000.
\]
Hence we can conclude that $1 \le \mathbb{E}[|Y^{(1)}|^2], \mathbb{E}[|Y^{(2)}|^2] \le 1000$.
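The bounds above can also be exercised numerically. The sketch below is only an illustration and not part of the proof: it uses the all-zero matrix as a placeholder for $A$ (the true $A$ depends on the 3Sat instance) and a dimension $d$ far smaller than the $\lceil k^{3/\epsilon}\rceil$ required by the reduction, and checks that $\mathbb{E}[|Y^{(e)}|^2] = (u^{(e)})^\top(\Sigma^{(e)})^{-1}u^{(e)}$ falls in $[1, 1000]$.

    import numpy as np

    k, d = 4, 200
    m = 7 * k + 1
    H = np.eye(d - m) + np.ones((d - m, d - m))
    A = np.zeros((7 * k, 7 * k))                      # placeholder for the instance matrix A

    Sigma1 = np.block([[32 * k * np.eye(m), np.zeros((m, d - m))],
                       [np.zeros((d - m, m)), H]])
    u1 = np.concatenate([32 * k * np.ones(m), np.ones(d - m)]) / k

    top = np.block([[32 * k * np.eye(7 * k) + A, 0.5 * np.ones((7 * k, 1))],
                    [0.5 * np.ones((1, 7 * k)), 32 * k * np.ones((1, 1))]])
    Sigma2 = np.block([[top, np.zeros((m, d - m))],
                       [np.zeros((d - m, m)), H]])
    u2 = np.concatenate([(32 * k + 0.5) * np.ones(7 * k),
                         [32 * k + 0.5 * k],
                         -3 * np.ones(d - m)]) / k

    for Sigma, u in [(Sigma1, u1), (Sigma2, u2)]:
        var_y = u @ np.linalg.solve(Sigma, u)         # E[|Y^(e)|^2] when there is no noise
        assert 1 <= var_y <= 1000, var_y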
Step 2. Calculating the Prediction Variation. Now we lower bound the heterogeneity gap $\frac{1}{|E|}\sum_{e\in E}\|\beta_S^{(e,S)} - \beta_S^{(S)}\|^2_{\Sigma_S^{(e)}} \ge d^{-\epsilon}/1280$ when $S$ is not an invariant set as in Definition 4. Denote $S_1 = S \cap [7k+1]$ and $S_2 = S \setminus [7k+1]$. We divide it into two cases when $\beta^{(1,S)} \ne \beta^{(2,S)}$:
Case 1. S2 ̸= ∅: Observe Σ(1) and Σ(2) are block diagonal matrices, we have
(1,S) (1,S2 ) −1 (1)
βS2 = βS2 = H|S u ,
2 | S2

(2,S) (2,S2 ) −1 (2) −1 (1)


βS2 = βS2 = H|S u = −3H|S
2 | S2
u ,
2 | S2
(C.8)
(S) (S ) (1)
−1 (2) (1)
βS2 = βS22 = (H|S2 | + H|S2 | )−1 (uS2 + uS2 ) = −H|S u .
2 | S2

Substituting the above terms, we can lower bound the heterogeneity gap as
1  (1,S) (S) (2,S) (S)

∥βS − βS ∥2Σ(1) + ∥βS − βS ∥2Σ(2)
2  S S

1 (1,S1 ) (S1 ) 2 (2,S1 ) (S1 ) 2
= ∥βS1 − βS1 ∥Σ(1) + ∥βS1 − βS1 ∥Σ(2)
2 S1 S1
 
1 (1,S2 ) (S2 ) 2 (2,S2 ) (S2 ) 2
+ ∥βS2 − βS2 ∥Σ(1) + ∥βS2 − βS2 ∥Σ(2)
2 S2 S2
 
1 (1,S2 ) (S2 ) 2 (2,S2 ) (S2 ) 2
≥ ∥βS2 − βS2 ∥Σ(1) + ∥βS2 − βS2 ∥Σ(2)
2 S2 S2

1  (1) ⊤ −1 (1) (2) ⊤ −1 (2)



= 4(uS2 ) H|S2 | uS2 + 4(uS2 ) H|S2 | uS2
2
4|S2 |
= (k −1 )2 ≥ 2 · k −2 .
|S2 | + 1

(1,S) (2,S) (e,S)
Case 2. S2 = ∅: In this case, we must have βS1 ̸= βS1 because β(S1 )c = 0 for any e ∈ {1, 2}. At the
same time,
1  (1,S) (S) (2,S) (S)

∥βS − βS ∥2Σ(1) + ∥βS − βS ∥2Σ(2)
2  S S

(a) 1 (1,S1 ) (S1 ) 2 (2,S1 ) (S1 ) 2
= ∥βS1 − βS1 ∥Σ(1) + ∥βS1 − βS1 ∥Σ(2)
2 S1 S1
   
(1) (2)
λmin Σ[7k+1] ∧ λmin Σ[7k+1]  
(1,S ) (S ) (2,S ) (S )
≥ ∥βS1 1 − βS11 ∥22 + ∥βS1 1 − βS11 ∥22
2
(b) 24k  
(1,S ) (S ) (2,S ) (S )
≥ ∥βS1 1 − βS11 ∥22 + ∥βS1 1 − βS11 ∥22
2
(c) 24k
 
1 (1,S1 ) (2,S1 ) 2 (1,S ) (2,S )
≥ ∥β − βS1 ∥2 = 6k∥βS1 1 − βS1 1 ∥22 .
2 2 S1
Here (a) follows from the fact that Σ(1) and Σ(2) are block diagonal and S2 = ∅; (b) follows from the
(1) (2)
fact that λmin (Σ[7k+1] ), λmin (Σ[7k+1] ) ≥ 24k; and (c) follows from the fact that ∥a − c∥22 + ∥b − c∥22 ≥
minx ∥a − x∥22 + ∥b − x∥22 ≥ ∥a − (a + b)/2∥22 + ∥a − (a + b)/2∥22 ≥ 0.5∥a − b∥22 for any vector a, b, c.
(2) (2,S) (2) (2)
Recall that ΣS1 βS1 = uS1 and λmax (ΣS1 ) ≤ 40k, then
 ⊤  
(1,S ) (2,S ) (2) (1,S) (2) (2) (2) (1,S) (2)
6k∥βS1 1 − βS1 1 ∥2 = 6k ΣS1 βS1 − uS1 (ΣS1 )−2 ΣS1 βS1 − uS1
6k 2
(2) (1,S) (2)
≥ (2)
ΣS1 βS1 − uS1
λmax (ΣS1 )2 2

6k 2
(2) (1,S) (2)
≥ 2
ΣS1 βS1 − uS1
(40k) 2
6k 2
(2) (1,S) (2)
= (2ΣS1 )(kβS1 ) − (2k) · uS1 .
(80k 2 )2 2

(1,S)
Combining βS1 = (k −1 )1|S1 | and the definition of Σ(2) and u(2) , we obtain that each coordinate of the
(2) (1,S) (2)
vector (2ΣS1 )(kβS1 ) − (2k) · uS1 is an integer. At the same time, we also have
 
(2) (1,S) (2) (2) (1,S) (2,S)
(2ΣS1 )(kβS1 ) − (2k) · uS1 = 2kΣS1 βS1 − βS1 ̸= 0
(2) (2) (1,S) (2)
because ΣS1 has full rank, which further yields ∥(2ΣS1 )(kβS1 ) − (2k) · uS1 ∥22 ≥ 1. So we can conclude that
1  (1,S) (S) (2,S) (S)
 6k 1 −3
∥βS − βS ∥2Σ(1) + ∥βS − βS ∥2Σ(2) ≥ 2 2
≥ k
2 S S (80k ) 1280
under Case 2. Combining the above two cases together, we can conclude that
\[
\frac{1}{2}\Big(\|\beta^{(1,S)} - \beta^{(S)}\|^2_{\Sigma^{(1)}_S} + \|\beta^{(2,S)} - \beta^{(S)}\|^2_{\Sigma^{(2)}_S}\Big) \ge k^{-3}/1280 \ge d^{-\epsilon}/1280.
\]

Step 3. Calculating the Gap between $\beta^{(S)}$ and $\beta^{(S^\dagger)}$. Let $S^\dagger$ be an arbitrary invariant set according to Definition 4 and $S$ be any set that does not equal $S^\dagger$. We keep adopting the notation $S_1 = S \cap [7k+1]$, $S_2 = S \setminus [7k+1]$, and divide it into two cases.
Case 1. $S_2 \ne \emptyset$: In this case, from the calculations above we have $\beta^{(S)}_{S_2} = -H_{|S_2|}^{-1} u^{(1)}_{S_2}$. On the other hand, $\beta^{(S^\dagger)}_{S_2} = 0$ for any invariant set $S^\dagger$ according to (C.6). Combining the two facts together yields
\[
\|\beta^{(S)} - \beta^{(S^\dagger)}\|^2_\Sigma \ge \|\beta^{(S)}_{S_2}\|^2_{H_{|S_2|}} = (u^{(1)}_{S_2})^\top H_{|S_2|}^{-1} u^{(1)}_{S_2} \ge (k^{-1})^2 \frac{|S_2|}{|S_2|+1} \ge \frac{1}{2}k^{-2} \ge d^{-\epsilon}/2.
\]

Case 2. $S_2 = \emptyset$: In this case, one must have $S \subset [7k+1]$. On the other hand, in (C.6) we showed that any invariant set $S^\dagger$ is also a subset of $[7k+1]$. We claim that a stronger statement holds: for any pair of distinct subsets $S, S'$ of $[7k+1]$, one has $\|\beta^{(S)} - \beta^{(S')}\|^2_\Sigma \ge d^{-\epsilon}/1280$.
(2) (2)
Recall that in (C.5) we obtain 24k ≤ λmin (Σ[7k+1] ) ≤ λmax (Σ[7k+1] ) ≤ 40k. This implies 28k ≤
λmin (Σ[7k+1] ) ≤ λmax (Σ[7k+1] ) ≤ 36k. It follows from the assumption S2 = ∅, our construction of Σ
′ (S) (S ′ )
∥β (S) − β (S ) ∥2Σ = ∥β[7k+1] − β[7k+1] ∥2Σ[7k+1]
⊤
(S ′ ) (S ′ )
  
(S) (S)
= β[7k+1] − β[7k+1] Σ[7k+1] Σ−1 [7k+1] Σ [7k+1] β [7k+1] − β [7k+1]

1 
(S) †
(S )
 2 (C.9)
≥ Σ[7k+1] β[7k+1] − β[7k+1]
λmax (Σ[7k+1] ) 2

1  ′
 2
(S) (S )
≥ Σ[7k+1] β[7k+1] − β[7k+1] .
36k 2

First, we provide an upper bound for ∥β (S) ∥2 , as S ⊆ [7k + 1] by assumption S2 = ∅,


 −1
(S) 1e
∥β (S) ∥2 = ∥βS ∥2 = (ΣS )−1 uS 2
= 32kI|S| + A S uS
2
2
 −1
1 e 1
= I|S| + AS uS
64k 32k
2
 −1 !
(a) 1 e 1
≤ I|S| + AS − I|S| + 1 ∥uS ∥2
64k 32k
(b)
 
1 1
≤ +1 ∥u[7k+1] ∥2
32k 32k
1 −1 √
 
1
≤ +1 k 7k + 1(32 + 1/4)k
32k 32k
≤ 3k −1/2 .

Here (a) follows from the triangle inequality, (b) follows from the fact that ∥(I + M )−1 − I∥2 ≤ 2∥M ∥ if

∥M ∥ ≤ 0.5. Hence ∥β (S) ∥2 ≤ 3k −1/2 for any S ⊂ [7k + 1]. Similarly ∥β (S ) ∥2 ≤ 3k −1/2 .
Since $S \ne S'$, there exists some $j \in [7k+1]$ such that $j \in (S \setminus S') \cup (S' \setminus S)$. Without loss of generality, we assume $j \in S' \setminus S$. Then it follows from the above upper bound, the fact $j \in S' \setminus S$, and the Cauchy–Schwarz inequality that
(S)
∆j = Σ⊤
j,S βS − uj

≤ ∥Aej,S ∥2 ∥β (S) ∥2 − (k −1 ) 1 (32k)


S
2
1/2 −1/2
≤ (8k) (3k ) − 32/2 ≤ −1.

This further yields that $\|\Delta\|_2^2 \ge \Delta_j^2 \ge 1$. Combining (C.9), we have $\|\beta^{(S)} - \beta^{(S')}\|^2_\Sigma \ge \frac{1}{36k} \ge d^{-\epsilon}/36$. Combining Case 1 and Case 2, we complete the lower bound for the gap between $\beta^{(S)}$ and $\beta^{(S^\dagger)}$.

C.4 Proof of Corollary A.3


We use the same reduction as in Theorem A.2. For any ϵ > 0 and 3Sat instance x, we let y = Tϵ (x) be the
constructed Exist-ϵ-Sep-LIS instance in Theorem A.2. Let βb be the output required by Problem A.3 in

the instance $y$, and $\tilde S = \{j \in [7k+1] : \hat\beta_j \ge k^{-1}/2\}$. Following the notations therein, we claim that
\[
\tilde S \in \mathcal{S}_y \overset{(a)}{\iff} |\mathcal{S}_y| \ge 1 \iff x \in \mathcal{X}_{3\text{Sat},1}. \tag{C.10}
\]

Therefore, if an algorithm A can take Problem A.3 instance y as input and return the desired output β(y) b
within time O(p(|y|)) for some polynomial p, then the following algorithm can solve 3Sat within polynomial
time: for any instance x, it first transforms x into y = Tϵ (x), then use algorithm A to solve y and gets the
b and finally output 1{Se ∈ Sy }.
returned β,
It remains to verify (a): the $\Rightarrow$ direction is obvious. For the $\Leftarrow$ direction, suppose $|\mathcal{S}_y| \ge 1$; the estimation error guarantee in Problem A.3 indicates that
\[
\|\hat\beta_{[7k+1]} - \beta^{(S^\dagger)}_{[7k+1]}\|_\infty \le \|\hat\beta - \beta^{(S^\dagger)}\|_2 \le \sqrt{\frac{\|\hat\beta - \beta^{(S^\dagger)}\|_\Sigma^2}{\lambda_{\min}(\Sigma)}} \overset{(a)}{<} \sqrt{0.25\, d^{-\epsilon}} \le \frac{1}{2}k^{-1}
\]
for some $S^\dagger \in \mathcal{S}_y$. Here (a) follows from the error guarantee in Problem A.3 and the fact $\lambda_{\min}(\Sigma) \ge 1$ derived in the proof of Theorem A.2. This further indicates $\tilde S = S^\dagger$ by the fact that $S^\dagger \subset [7k+1]$ and $\beta_j^{(S^\dagger)} = k^{-1}\mathbb{1}\{j \in S^\dagger\}$ for any $j \in [7k+1]$, derived in the proof of Theorem A.2.

C.5 Proof of Theorem A.4


Step 1. Sparse Reduction for the 3Sat Problem. We first show that there exists a parsimonious polynomial-time reduction $T$ from the 3Sat problem to the restricted 3Sat problem in which every boolean variable appears no more than 15 times.
To be specific, given a 3Sat instance $x$ with $k$ clauses and $n$ boolean variables $\{v_m\}_{m=1}^n$, where obviously $n \le 3k$, we construct the new instance $x' = T(x)$ as follows. We first introduce $n \times k$ boolean variables $\{w_{m,i}\}_{m\in[n],i\in[k]}$. For each $i \in [k]$ and $m \in [n]$, if the boolean variable $v_m$ appears in clause $i$ of the original instance $x$, we replace the variable $v_m$ with $w_{m,i}$. Then all the original variables $\{v_m\}_{m\in[n]}$ are completely replaced, and each variable in $\{w_{m,i}\}_{m\in[n],i\in[k]}$ appears no more than 3 times.
Secondly, we need to add the following $n \times (k-1)$ additional constraints
\[
w_{m,1} = w_{m,2}, \quad w_{m,2} = w_{m,3}, \quad \cdots \quad w_{m,k-1} = w_{m,k}, \qquad \forall m \in [n]. \tag{C.11}
\]
Note that a constraint $w = w'$ is equivalent to
\[
(\neg w \vee \neg w' \vee w^\circ) \wedge (w \vee \neg w' \vee w^\circ) \wedge (\neg w \vee w' \vee w^\circ) \wedge (w \vee w' \vee w^\circ) \wedge (\neg w \vee w' \vee \neg w^\circ) \wedge (w \vee \neg w' \vee \neg w^\circ) \tag{C.12}
\]
with an additionally introduced boolean variable $w^\circ$ that is forced to be True by the first four clauses in (C.12). Hence the constraints (C.11) can be translated into $6n(k-1)$ clauses, with additionally introduced $n(k-1)$ variables $\{w_\ell^\circ\}_{\ell=1}^{n(k-1)}$. Finally, in instance $x'$ there are $k' = k + 6n(k-1) < 18k^2$ clauses in total. Each boolean variable in $\{w_{m,i}\}_{m\in[n],i\in[k]}$ appears no more than $3 + 2\times 6 = 15$ times, and each additionally introduced boolean variable in $\{w_\ell^\circ\}_{\ell=1}^{n(k-1)}$ appears no more than 6 times.
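As an aside, the claim that the six-clause gadget (C.12) encodes $w = w'$ and forces $w^\circ$ to be True can be checked by brute force; the sketch below (not part of the proof) enumerates all eight assignments.

    from itertools import product

    def gadget(w, wp, wo):
        return ((not w or not wp or wo) and (w or not wp or wo) and
                (not w or wp or wo) and (w or wp or wo) and
                (not w or wp or not wo) and (w or not wp or not wo))

    sat = [(w, wp, wo) for w, wp, wo in product([False, True], repeat=3)
           if gadget(w, wp, wo)]
    # only assignments with w == w' and w_o == True survive
    assert sat == [(False, False, True), (True, True, True)]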
Now we prove that the mapping $T$ we construct is a parsimonious polynomial-time reduction; namely, for any valid solution $v \in \mathcal{S}_x$, setting $w_{m,i} = v_m$ for $m \in [n], i \in [k]$ and $w_\ell^\circ = \text{True}$ for $\ell \in [n(k-1)]$ leads to a valid solution $w \in \mathcal{S}_{x'}$, and such a mapping from $\mathcal{S}_x$ to $\mathcal{S}_{x'}$ is a bijection.
The verification of injectivity is obvious. Now we prove it is a surjection. For any valid solution $w$ of instance $x'$, the constraints (C.11) require $w_{m,1} = \cdots = w_{m,k}$ for $m \in [n]$. Hence setting $v_m = w_{m,1}$ for $m \in [n]$ leads to a valid solution $v \in \mathcal{S}_x$ whose image is $w$. This completes the proof of the bijection.

Step 2. Construction of ExistLIS-Ident Problem. Next, we construct the $7k' \times 7k'$ matrix $A$ that corresponds to the 3Sat instance $x'$, as shown in (2.2). Namely,
\[
A_{7(i-1)+t,\,7(i'-1)+t'} =
\begin{cases}
\mathbb{1}\{t \text{ contradicts itself}\} & i = i' \text{ and } t = t'\\
1 & i = i' \text{ and } t \ne t'\\
1 & i \ne i' \text{ and } t \text{ contradicts } t'\\
0 & \text{otherwise}
\end{cases}
\]
for any $i, i' \in [k']$ and $t, t' \in [7]$. We define a $k' \times k'$ symmetric matrix $B$ as follows:
\[
B_{i,i'} = \begin{cases} 1 & |i - i'| = 1 \\ 0 & \text{otherwise} \end{cases} \tag{C.13}
\]
for any $i, i' \in [k']$. Matrix $B$ can be seen as the adjacency matrix of a connected graph over $k'$ vertices. We define the matrix $K \in \mathbb{R}^{k' \times 7k'}$ as follows:
\[
K_{i,j} = \begin{cases} 1 & 7(i-1) < j \le 7i \\ 0 & \text{otherwise} \end{cases} \tag{C.14}
\]
for any $i \in [k']$, $j \in [7k']$. We construct its corresponding ExistLIS instance $y$ with $|y| = 8k'$ as follows:

\[
\Sigma^{(1)} = I_{8k'} \quad\text{and}\quad u^{(1)} = 1_{8k'},
\]
and
\[
\Sigma^{(2)} = \begin{pmatrix} 1000 I_{7k'} + A & \frac{1}{2}K^\top \\ \frac{1}{2}K & 1000 I_{k'} + \frac{1}{8}B \end{pmatrix}
\quad\text{and}\quad
u^{(2)} = \begin{pmatrix} (1000 + \frac{1}{2})\cdot 1_{7k'} \\ (1000 + \frac{1}{2})1_{k'} + \frac{1}{8}B 1_{k'} \end{pmatrix}.
\]

One can easily verify both Σ(1) and Σ(2) are positive definite from the fact that Σ(2) is diagonally dominant,
and Hℓ is positive definite for any ℓ ≥ 1. Note that A7(i−1)+s,7(j−1)+t ̸= 0 immediately implies the i-th clause
and the i′ -th clause have shared variable. Since each variable appears no more than 15 times, one clause
shares common variables with up to 3 × 15 other clauses. Then we can conclude that each row of matrix
A has no more than 7 × (3 × 15 + 1) = 322 non-zero elements. Combining with the fact that there are no
more than 2 non-zero elements in each row of B and no more than 7 non-zero elements in each row/column
of K, we can conclude that for any e ∈ E, each row of matrix Σ(e) has no more than 322 + 7 + 2 + 1 < 400
non-zero elements.
Similar to (2.3) in the proof of Lemma 2.2, we claim the following and defer the proof to the end of this step:
\[
\begin{aligned}
S^\dagger \in \mathcal{S}_y
&\overset{(a)}{\iff} \emptyset \ne S^\dagger \subset [8k'] \text{ and } \beta^{(2,S^\dagger)} = \beta^{(1,S^\dagger)} \text{ with } \beta_j^{(1,S^\dagger)} = \mathbb{1}\{j \in S^\dagger\}\\
&\overset{(b)}{\iff} S^\dagger = \mathring S \cup \{7k'+1, \ldots, 8k'\} \text{ with } |\mathring S \cap \{7i-6, \ldots, 7i\}| = 1,\ \forall 1 \le i \le k',\\
&\qquad\quad\text{ and } A_{j,j'} = 0 \;\; \forall j, j' \in \mathring S \subseteq [7k']\\
&\overset{(c)}{\iff} \mathring S = \{7(i-1) + a_i\}_{i=1}^{k'} \text{ with } a_i \in [7] \text{ s.t. adopting action ID } a_i\\
&\qquad\quad\text{ in clause } i \in [k'] \text{ will lead to a valid solution } v \in \mathcal{S}_{x'}.
\end{aligned}
\tag{C.15}
\]

Combining (C.15) and Step 1, we have |Sx | = |Sx′ | = |Sy |. Since d = 8k ′ = poly(k) and such construction
can be done in polynomial time, this mapping admits a deterministic polynomial-time reduction from 3Sat
to the problem we construct. Therefore, we can conclude that the problem we construct is NP-hard.
Proof of (C.15)(a) is essentially identical to the proof of (2.3)(a) in Lemma 2.2. Now we prove (C.15)(b)
and (c).

Proof of (C.15)(b) The proof idea is similar to (2.3)(b). The direction ⇐ is obvious. For the ⇒ direction,
we first assert that
{7k ′ + 1, . . . , 8k ′ } ∩ S † ̸= ∅ (C.16)

We use the proof by contradiction argument. If {7k ′ + 1, . . . , 8k ′ } ∩ S † = ∅, there must exist an index
(2,S † ) (1,S † )
j ∈ [7k ′ ] ∩ S † since S † is nonempty. Combined with the fact βj = βj = 1{j ∈ S † }, the equation

(2) (2,S ) (2)
Σj,S † βS † = uj tells

7k
X (2,S † ) 1
1000 + Aj,j ′ βj ′ = 1000 +
2
j ′ =1

The L.H.S. is an integer while the R.H.S. is not an integer. This leads to a contradiction. This proves (C.16).
(2) (2,S † ) (2)
Now we consider the element i + 7k ′ ∈ {7k ′ + 1, . . . , 8k ′ } ∩ S † . Then the equation ΣS † βS † = uS † tells
1 X (2,S † ) 1 X (2,S † ) 1 1 X
βj + βi′ +7k′ + 1000 = 1000 + + 1.
2 8 2 8
7i−6<j≤7i i′ :Bi,i′ =1 i′ :Bi,i′ =1

(2,S † ) (1,S † ) (2,S † )


= 1{j ∈ S † }, then i′ :Bi,i′ =1 βi′ +7k′ can only take values 0, 1 or 2. Through taking
P
Since βj = βj
both sides of the equation modulo 1/2 we can then obtain
(2,S † )
X X
βi′ +7k′ = 1.
i′ :Bi,i′ =1 i′ :Bi,i′ =1

This indicates that all the neighbors of i (with respect to the adjacency matrix B) should be simultaneously
contained in S † . Since B represents the adjacency matrix of a connected graph, we can then inductively
(2) (2,S † ) (2)
prove that {7k ′ + 1, . . . , 8k ′ } ⊂ S † . Given this, the equation ΣS βS = uS † now becomes
1 1
K1S̊ = 1S̊ =⇒ |S̊ ∩ {7i − 6, . . . , 7i}| = 1, for ∀1 ≤ i ≤ k ′
2 2
and
(i)
AS̊ 1k′ = 0 =⇒ Aj ′ ,j = 0, for ∀j ′ , j ∈ S̊
′ ′
where (i) follows from the fact that A ∈ {0, 1}7k ×7k .
Proof of (C.15)(c) The direction ⇒ follows from the proof of (2.3)(c). For the direction ⇐, it follows

from the proof of (2.3)(c) and the fact that S̊ = {7(i − 1) + ai }ki=1 with ai ∈ [7] naturally implies |S̊ ∩ {7i −

6, . . . , 7i}| = 1 for i ∈ [k ].

C.6 Proof of Theorem B.1


It suffices to construct a polynomial-time reduction from 3Sat to Problem B.1. Let $x$ be any 3Sat instance with input size $k$. Following the notation in Lemma 2.2, we let $y = T(x)$ be an instance of Problem B.1 with $d = 7k$, $\Sigma^{(1)} = 5dI_d$, $u^{(1)} = 5d1_d$, and $\Sigma^{(2)} = 5dI_d + A$, $u^{(2)} = 5d1_d$. Now the constraint $u_S^{(1)} = u_S^{(2)}$ trivially holds for any $S \subseteq [d]$. We claim that
\[
\begin{aligned}
S \in \mathcal{S}_y &\iff |S| = k \text{ and } A_{j,j'} = 0, \;\forall j, j' \in S\\
&\iff S = \{7(i-1) + a_i\}_{i=1}^k \text{ with } a_i \in [7] \text{ s.t. adopting action ID } a_i\\
&\qquad\quad\text{ in clause } i \in [k] \text{ will lead to a valid solution } v \in \mathcal{S}_x.
\end{aligned}
\]
The proof of the equivalence is identical to that in Lemma 2.2. This completes the proof.

D Proofs for the Population-level Results
D.1 Proof of Proposition 3.1
Applying Theorem 3.2 with k = 1 completes the proof of (3.3). To establish the causal identification result,
it suffices to verify Condition 3.3 with k = 1.
To see this, under (1.1) and (1.3), if Condition 3.2 further holds, we have, for each $j \in S^\star$ and $e \in E$,
\[
\beta^{(e,\{j\})} = \frac{\mathbb{E}[X_j^{(e)} Y^{(e)}]}{\mathbb{E}[X_j^{(e)} X_j^{(e)}]} = \frac{\mathbb{E}\big[X_j^{(e)}\big(\sum_{i\in S^\star} X_i^{(e)}\beta_i^\star + \varepsilon^{(e)}\big)\big]}{\mathbb{E}[X_j^{(e)} X_j^{(e)}]} \overset{(a)}{=} \beta_j^\star,
\]
where (a) follows from
\[
\forall i, j \in S^\star \text{ with } i \ne j, \qquad \mathbb{E}[X_i^{(e)} X_j^{(e)}] = 0 \quad\text{and}\quad \mathbb{E}[X_j^{(e)}\varepsilon^{(e)}] = 0,
\]
provided Condition 3.2 and (1.1), respectively. This completes the proof.

D.2 Proof of (3.4)


Denote $q = (w_1(1), \ldots, w_1(d)) \in \mathbb{R}^d$. It follows from Proposition 3.1 that
\[
\begin{aligned}
\beta^\gamma &= \operatorname*{argmin}_\beta \max_{\mu \in \mathcal{P}_\gamma(\Sigma, u)} \mathbb{E}_\mu[|Y - \beta^\top X|^2] - \mathbb{E}_\mu[Y^2]\\
&= \operatorname*{argmin}_\beta \max_{\mu \in \mathcal{P}_\gamma(\Sigma, u)} \Big\{\beta^\top \mathbb{E}_\mu[XX^\top]\beta - 2\beta^\top \mathbb{E}_\mu[XY]\Big\}\\
&= \operatorname*{argmin}_\beta \max_{\tilde u : |\tilde u - u| \le \gamma\cdot q} \Big\{\beta^\top \Sigma\beta - 2\beta^\top \tilde u\Big\}\\
&= \operatorname*{argmin}_\beta \max_{\tilde\beta \in \Theta_\gamma} \Big\{\beta^\top \Sigma\beta - 2\beta^\top \Sigma\tilde\beta\Big\}.
\end{aligned}
\]
It is easy to check that the convex hull of $\Theta_\gamma$ is itself; applying Theorem 1 of Meinshausen & Bühlmann (2015) completes the proof.

D.3 Proof of Theorem 3.2


Proof of (3.6). The existence and uniqueness of the optimal solution follow from Proposition E.2. We will show that
\[
Q_{k,\gamma}(\beta) = \sup_{\mu \in \mathcal{P}_{k,\gamma}(\Sigma, u)} \mathbb{E}_\mu\big[|Y - \beta^\top X|^2 - |Y|^2\big].
\]
For a given fixed $\mu \in \mathcal{P}_{k,\gamma}$, one has
\[
\mathbb{E}_\mu\big[|Y - \beta^\top X|^2 - |Y|^2\big] = \beta^\top \mathbb{E}_\mu[XX^\top]\beta - 2\beta^\top \mathbb{E}_\mu[XY] = \beta^\top \Sigma\beta - 2\beta^\top \mathbb{E}_\mu[XY].
\]
On the other hand, it follows from the definition of $\Sigma$, $u$ and $Q_{k,\gamma}(\beta)$ that
\[
Q_{k,\gamma}(\beta) = \frac{1}{2}\beta^\top \Sigma\beta - \beta^\top u + \gamma v^\top|\beta| \qquad\text{with } v = (w_k(1), \ldots, w_k(d)).
\]
Now it suffices to show that for any $\beta \in \mathbb{R}^d$,
\[
\frac{1}{2}\beta^\top \Sigma\beta - \beta^\top u + \gamma v^\top|\beta| = \sup_{\tilde u : |\tilde u - u| \le \gamma v} \frac{1}{2}\beta^\top \Sigma\beta - \beta^\top \tilde u. \tag{D.1}
\]
To see this, it is easy to verify that, for any given $x, a \in \mathbb{R}$ and $b \in \mathbb{R}_+$, one has
\[
\sup_{y \in [a-b,\, a+b]} -xy = -ax + b|x|,
\]
then we can obtain
\[
\sup_{\tilde u : |\tilde u - u| \le \gamma v} \frac{1}{2}\beta^\top \Sigma\beta - \beta^\top \tilde u
= \frac{1}{2}\beta^\top \Sigma\beta + \sum_{j=1}^d \sup_{\tilde u_j \in [u_j - \gamma v_j,\; u_j + \gamma v_j]} (-\beta_j \tilde u_j)
= \frac{1}{2}\beta^\top \Sigma\beta - \sum_{j=1}^d \big(u_j\beta_j - |\beta_j| v_j \gamma\big);
\]
this verifies (D.1) and thus completes the proof of the claim (3.6).
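The coordinate-wise identity behind (D.1), namely that the supremum of $-\beta^\top\tilde u$ over the box $|\tilde u - u| \le \gamma v$ equals $-\beta^\top u + \gamma v^\top|\beta|$, can be spot-checked numerically; the sketch below (not part of the proof) uses randomly drawn $\beta$, $u$, and nonnegative weights $v$.

    import numpy as np

    rng = np.random.default_rng(0)
    d, gamma = 6, 0.7
    beta, u = rng.normal(size=d), rng.normal(size=d)
    v = rng.uniform(size=d)                     # stands in for (w_k(1), ..., w_k(d)) >= 0
    u_worst = u - gamma * v * np.sign(beta)     # maximizer of -beta^T u_tilde over the box
    assert np.isclose(-beta @ u_worst, -beta @ u + gamma * v @ np.abs(beta))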

D.4 Proof of Theorem 3.3


Proof of the Causal Identification Result. It follows from Condition 3.3 and the definition of $w_k(j)$ that $w_k(j) = 0$ for any $j \in S^\star$. It also follows from (1.3) that
\[
w_k(j) \ne 0 \qquad \forall j \in [d] \text{ with } \sum_{e\in E}\mathbb{E}[X_j^{(e)}\varepsilon^{(e)}] \ne 0.
\]

Therefore, for any β ∈ Rd ,


d
1 X h (e) i X
Qk,γ (β) − Qk,γ (β ⋆ ) = E |Y − β ⊤ X (e) |2 − |Y (e) − (β ⋆ )⊤ X (e) |2 + γ|βj |wk (j)
2|E| j=1
e∈E
d
1 X
= (β − β ⋆ )⊤ Σ(β − β ⋆ ) − (β − β ⋆ )Σ(β̄ − β ⋆ ) + γ |βj |wk (j)
2 j=1
!
(a) X 1 X (e) (e)
≥ |βj | {γ · wk (j)} − βj E[Xj ε ] .
|E|
j∈G e∈E

where G is defined in (1.2) and β̄ = Σ−1 u. Here (a) follows from the fact that the first quadratic term is
non-negative, and the identity
1 Xn o 1 X
Σ(β̄ − β ⋆ ) = E[X (e) Y (e) ] − E[X (e) (X (e) )⊤ β ⋆ ] = E[X (e) ε(e) ].
|E| |E|
e∈E e∈E

Therefore, we have
\[
Q_{k,\gamma}(\beta) - Q_{k,\gamma}(\beta^\star) \ge 0 \qquad\text{if } \gamma \ge \max_{j\in G} \frac{\big|\frac{1}{|E|}\sum_{e\in E}\mathbb{E}[\varepsilon^{(e)} X_j^{(e)}]\big|}{w_k(j)} := \gamma_k^\star;
\]
this completes the proof.

We finally establish the upper bound on γk⋆ . It follows from the definition of wk that
2
1 (e)
E[ε(e) Xj ]
P
|E| e∈E
(γk⋆ )2 = max
j∈G {wk (j)}2
2
1 (e)
E[ε(e) Xj ]
P
|E| e∈E
= max (e,S) (S)
1
− βS ∥2 (e)
j∈G
P
inf S:j∈S |E| e∈E ∥βS
ΣS
2
1 (e)
E[ε(e) XS ]
P
|E| e∈E
2
≤ max max (e,S) (S)
1
− βS ∥2 (e)
j∈G S:j∈S
P
|E| e∈E ∥βS ΣS
2
1 (e) (e)
P
|E| e∈E E[ε XS ]
2
= max (e,S) (S) 2
S:S∩G̸=∅ 1
P
|E| e∈E ∥βS − βS ∥ (e)
Σ S

(e)
Let κmin = mine∈E λmin (Σ ), we have
1 X (e,S) (S) 1 X (e,S) (S)
∥βS − βS ∥2Σ(e) ≥ κmin ∥βS − βS ∥22
|E| S |E|
e∈E e∈E
1 X (e,S) 1 X (e,S)
≥ κmin inf ∥β − β∥22 ≥ κmin ∥β − β̄ (S) ∥22
β:βS c =0 |E| |E|
e∈E e∈E
(S) 1 (e,S) ⋆ 2
P
with β̄ = |E| e∈E β . Plugging it back into the upper bounded on (γk ) , we conclude that
2
1 (e)
E[ε(e) XS ]
P
|E| e∈E
(γk⋆ )2 ≤ (κmin )−1 max 1
P (e,S) − β (S) ∥2
2
= γ ∗ κ2min ,
S:S∩G̸=∅
|E| e∈E ∥β 2

where
2
1 (e)
E[ε(e) XS ]
P
|E| e∈E
γ ∗ = (κmin )−3 max 1
P (e,S) − β (S) ∥2
2
(D.2)
S:S∩G̸=∅
|E| e∈E ∥β 2

is the quantity defined in (4.5) of Fan et al. (2024).

E Proofs for Non-asymptotic Results


E.1 Preliminaries
We first introduce some notation. Recall the definition of $(\Sigma, u)$ in (1.6); we denote their empirical counterparts as
\[
\hat\Sigma = \frac{1}{n\cdot|E|}\sum_{i\in[n],\, e\in E} X_i^{(e)} (X_i^{(e)})^\top
\qquad\text{and}\qquad
\hat u = \frac{1}{n\cdot|E|}\sum_{i\in[n],\, e\in E} X_i^{(e)} Y_i^{(e)}. \tag{E.1}
\]

We define
1 X b h (e) (e)
i 1 ⊤b 1 ⊤
R(β)
b = E |Yi − β ⊤ Xi |2 = β Σβ − β ⊤ u
b+ ub u
b,
2|E| 2 2
e∈E
1 X h (e) (e)
i 1 ⊤ 1
R(β) = E |Yi − β ⊤ Xi |2 = β Σβ − β ⊤ u + u⊤ u.
2|E| 2 2
e∈E

We let
\[
v(S) = \frac{1}{|E|}\sum_{e\in E} \big\|\beta_S^{(e,S)} - \beta_S^{(S)}\big\|^2_{\Sigma_S^{(e)}}
\qquad\text{and}\qquad
\hat v(S) = \frac{1}{|E|}\sum_{e\in E} \big\|\hat\beta_S^{(e,S)} - \hat\beta_S^{(S)}\big\|^2_{\hat\Sigma_S^{(e)}}.
\]

One can expect $|v(S) - \hat v(S)| \asymp (|S|/n)^{1/2}$ by the CLT. However, applying such a crude bound would result in a slower rate. Instead, the next proposition aims to establish a sharper instance-dependent error bound for the difference. We define
\[
\rho(s, t) = \frac{s\log(4d/s) + \log(|E|) + t}{n}
\qquad\text{and}\qquad
\zeta(s, t) = \frac{s\log(4d/s) + t}{n\cdot|E|} \tag{E.2}
\]
with $s \in [d]$ and $t > 0$.
We also define some concepts that will be used throughout the proof.
Definition 6 (Sub-Gaussian Random Variable). A random variable $X$ is a sub-Gaussian random variable with parameter $\sigma \in \mathbb{R}_+$ if
\[
\forall \lambda \in \mathbb{R}, \qquad \mathbb{E}\big[\exp\big(\lambda(X - \mathbb{E}[X])\big)\big] \le \exp\Big(\frac{\lambda^2}{2}\sigma^2\Big).
\]
Definition 7 (Sub-exponential Random Variable). A random variable $X$ is a sub-exponential random variable with parameter $(\nu, \alpha) \in \mathbb{R}_+ \times \mathbb{R}_+$ if
\[
\forall |\lambda| < 1/\alpha, \qquad \mathbb{E}\big[\exp\big(\lambda(X - \mathbb{E}[X])\big)\big] \le \exp\Big(\frac{\lambda^2}{2}\nu^2\Big).
\]
It is easy to verify that the product of two sub-Gaussian random variables is a sub-exponential random
variable, and the dependence of the parameters can be written as follows.
Lemma E.1 (Product of Two Sub-Gaussian Random Variables). Suppose X1 and X2 are two zero-mean
sub-Gaussian random variables with parameters σ1 and σ2 , respectively. Then X1 X2 is a sub-exponential
random variable with parameter (Cσ1 σ2 , Cσ1 σ2 ), where C > 0 is some universal constant.
We also have the following lemma stating the concentration inequality for the sum of independent sub-
exponential random variables.
Lemma E.2 (Sum of Independent Sub-exponential Random Variables). Suppose $X_1, \ldots, X_N$ are independent sub-exponential random variables with parameters $\{(\nu_i, \alpha_i)\}_{i=1}^N$, respectively. There exists some universal constant $C > 0$ such that the following holds:
\[
\mathbb{P}\left(\Big|\sum_{i=1}^N (X_i - \mathbb{E}[X_i])\Big| \ge C\Big(\sqrt{t \times \sum_{i=1}^N \nu_i^2} + t \times \max_{i\in[N]}\alpha_i\Big)\right) \le 2e^{-t}.
\]
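To build intuition for Lemma E.2 (this is only an illustration and is not used anywhere in the proofs), one can simulate averages of i.i.d. sub-exponential variables, e.g. products of two independent standard Gaussians, for which $\nu_i$ and $\alpha_i$ are bounded by universal constants, and compare the empirical tail with the $\sqrt{t/N} + t/N$ scale; the constant $C = 4$ below is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(1)
    N, reps, t, C = 1000, 2000, 5.0, 4.0
    X = rng.normal(size=(reps, N)) * rng.normal(size=(reps, N))  # mean-zero sub-exponential entries
    dev = np.abs(X.mean(axis=1))                                 # |(1/N) sum_i X_i| over repetitions
    threshold = C * (np.sqrt(t / N) + t / N)
    assert (dev > threshold).mean() <= 2 * np.exp(-t)            # empirical tail vs. 2 e^{-t}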

The next proposition provides upper bounds for $|v(S) - \hat v(S)|$.
Proposition E.1 (Instance-dependent Error Bounds on $|v(S) - \hat v(S)|$). Suppose Condition 3.4 holds. There exists some universal constant $C$ such that, for any $t > 0$ and $\epsilon > 0$, if $C\sigma_x^4\rho(k, t) \le 1$, then the following event
\[
\Big\{\forall S \subseteq [d] \text{ with } |S| \le k:\; |v(S) - \hat v(S)| \le C\Big(\sqrt{v(S)\cdot \sigma_x^4\sigma_y^2 b\rho(k, t)} + \sigma_x^4\sigma_y^2 b\rho(k, t)\Big)\Big\}
\]
occurs with probability at least $1 - e^{-t}$.
The above inequality is instance-dependent in that both the L.H.S. and the R.H.S. of the inequality contain $v(S)$, which depends on $S$. The next proposition claims that one can establish strong convexity around $\beta^{k,\gamma}$.

Proposition E.2. Under Condition 3.1, for any $k \in [d]$ and $\gamma \ge 0$, $Q_{k,\gamma}(\beta)$ is uniquely minimized by some $\beta^{k,\gamma}$. Moreover, for any $\beta \in \mathbb{R}^d$,
\[
Q_{k,\gamma}(\beta) - Q_{k,\gamma}(\beta^{k,\gamma}) \ge \frac{1}{2}\big\|\Sigma^{1/2}(\beta - \beta^{k,\gamma})\big\|_2^2.
\]
The next lemma shows that the explained variance of $\beta^{k,\gamma}$ is no larger than that of the population-level least squares solution, which in turn is at most $\sigma_y^2$.
Lemma E.3. Let $\beta^{k,\gamma}$ be the unique minimizer of $Q_{k,\gamma}(\beta)$, and $\bar\beta$ be the unique minimizer of $R(\beta)$. Then we have
\[
\|\Sigma^{1/2}\beta^{k,\gamma}\|_2 \le \|\Sigma^{1/2}\bar\beta\|_2 \le \sigma_y. \tag{E.3}
\]

E.2 Proof of Theorem 3.4


We need the following technical lemma.

Lemma E.4. For any $x, y, \delta \ge 0$ and $C > 0$, if $|x - y| \le C(\delta^2 + \delta\sqrt{y})$, then
\[
|\sqrt{x} - \sqrt{y}| \le 2(C+1)\delta.
\]
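A randomized spot-check of Lemma E.4 (not part of the proof; the constant $C = 1.5$ and the sampling ranges below are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    C = 1.5
    for _ in range(10000):
        y, delta = rng.uniform(0, 10), rng.uniform(0, 2)
        slack = C * (delta ** 2 + delta * np.sqrt(y))
        x = max(0.0, y + rng.uniform(-1, 1) * slack)   # any x with |x - y| <= slack
        assert abs(np.sqrt(x) - np.sqrt(y)) <= 2 * (C + 1) * delta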

We are ready to prove Theorem 3.4.


Proof of Theorem 3.4. We consider the following decomposition, for any β ∈ Rd ,

Qk,γ (β) − Qk,γ (β k,γ ) = Qk,γ (β) − Qb k,γ (β) + Qb k,γ (β) − Q b k,γ (β k,γ )

+Qb k,γ (β k,γ ) − Qk,γ (β k,γ )


b k,γ (β k,γ )
b k,γ (β) − Q
=Q
 
+ R(β) − R(β k,γ ) − R(β)
b b k,γ )
− R(β
d
X  
+γ bk (j)) |βj | − |βjk,γ |
(wk (j) − w
j=1

= T1 (β) + T2 (β) + γT3 (β).

For T2 (β), under the event A3 (d, t) and A4 (d, t) defined in (E.18), the following holds with a universal

constant C1
 
∀β, T2 (β) = R(β) − R(β k,γ ) − R(β)
b b k,γ )
− R(β
1
= (β − β k,γ )⊤ Σ(β − β k,γ )+(β − β k,γ )⊤ (Σβ k,γ − u)
2
1
− (β − β k,γ )⊤ Σ(β
b − β k,γ ) − (β − β k,γ )⊤ (Σβb k,γ − u
b)
2
1 n o
= {Σ1/2 (β − β k,γ )}⊤ I − Σ−1/2 ΣΣ b −1/2 {Σ1/2 (β − β k,γ )}
2 n o
− {Σ1/2 (β − β k,γ )}⊤ I − Σ−1/2 ΣΣ b −1/2 (Σ1/2 β k,γ )

+ {Σ1/2 (β − β k,γ )}⊤ Σ−1/2 (b


u − u)
(
1 1/2 p
≤ C1 ∥Σ (β − β k,γ )∥22 · σx2 b · ζ(d, t)
2
)
 p
1/2 k,γ 1/2 k,γ
+ ∥Σ (β − β )∥2 · ∥Σ β ∥2 σx2 + σx2 σy b · ζ(d, t)
(
1 1/2 p
≤ C1 ∥Σ (β − β k,γ )∥22 · σx2 b · ζ(d, t)
2
1
+ ∥Σ1/2 (β − β k,γ )∥22
8C1
)
 2
1/2 k,γ 2 2
+ 2C1 ∥Σ β ∥2 σx + σx σy b · ζ(d, t) .

Here we substitute the upper bounds in (E.18) and use condition that n ≥ 3(d + t) such that b · ζ(d, t) ≤
(t + d · log 4)/n ≤ 1. We also use ∥Σ1/2 β k,γ ∥2 ≤ σy derived in Lemma E.3. Substituting the low-dimension
structure, if n · |E| ≥ 64C12 bσx4 (d + t), we can obtain

1 1/2 d+t
∀β, T2 (β) ≤ ∥Σ (β − β k,γ )∥22 + C2 bσx4 σy2 ·
4 n · |E|

using the fact xy ≤ x2 /ϵ + ϵy 2 for any x, y, ϵ > 0.


For T3 (β), we have
d
X
∀β, T3 (β) ≤ bk (j)||βj − βjk,γ |
|wk (j) − w
j=1
q q
≤ sup v(Sbj ) − vb(Sbj ) · ∥β − β k,γ ∥1
j∈[d]
(a) q √
≤ C3 σx4 σy2 bρ(k, t) d∥β − β k,γ ∥2 .

Here in (a) we use the facts $\|x - y\|_1 \le \sqrt{d}\|x - y\|_2$ and
\[
\forall S \text{ with } |S| \le k, \qquad \big|\sqrt{\hat v(S)} - \sqrt{v(S)}\big| \lesssim \delta \qquad\text{with } \delta = \sqrt{\sigma_x^4\sigma_y^2 b\rho(k, t)}, \tag{E.4}
\]
which follows by first applying Proposition E.1 and then applying Lemma E.4, provided $C\sigma_x^4\rho(k, t) \le 1$, which in turn follows from $n \ge \tilde C(k\log(d) + t)$ and $|E| < n^{c_1}$.

Now we plug in β = βbk,γ , under which T1 (βbk,γ ) ≤ 0, denote ♣ = bσx4 σy2 , we have

Qk,γ (βbk,γ ) − Qk,γ (β k,γ )


( r )
d+t (k log(d) + log(|E|) + t) · d bk,γ
≤ C4 ♣ · +γ ♣· ∥β − β k,γ ∥2 (E.5)
n · |E| n
1
+ ∥Σ1/2 (βbk,γ − β k,γ )∥22 .
4
On the other hand, it follows from Proposition E.2 that
∥Σ1/2 (βbk,γ − β k,γ )∥22
Qk,γ (βbk,γ ) − Qk,γ (β k,γ ) ≥ . (E.6)
2
Combining (E.6) and (E.5) and recalling that we assume |E| ≤ nc1 , we obtain
 2 
γ (k log(d) + c1 log(n) + t)d d+t
∥βbk,γ − β k,γ ∥22 ≤ C5 · ♣ + .
κ2 n κn · |E|
e = C6 max{σx4 c2 , b1/2 σx2 σy c1/2 }.
We complete the proof with C 1 1

Proof of Lemma E.4. We divide it into two cases.


Case 1. y ≤ δ 2 . In this case, it follows from triangle inequality that
|x| ≤ |x − y| + |y| ≤ C δ 2 + δ 2 ≤ 2Cδ 2 ,


then we can obtain


√ √ √ √ √
| x − y| ≤ | x| + | y| ≤ ( 2C + 1)δ.
Case 2. y ≥ δ 2 . In this case, it follows from the upper bound on |x − y| and the assumption y ≥ δ 2 that

√ √ |x − y| |x − y| Cδ 2 + δ y δ2
| x − y| = √ √ ≤ √ ≤ √ =δ+C ≤ (1 + C)δ.
x+ y y y y
Combining the above two cases completes the proof.

E.3 Proof of Theorem 3.5


The next several lemmas are standard in high-dimensional linear regression analysis, and we simply adapt
it to the multi-environment setting.
Lemma E.5. Suppose Condition 3.4 holds. Let β k,γ be the unique minimizer of Qk,γ (β). Then there exist
some universal constants C such that, the following event

1 X  b h (e) 
(e)
i h 
(e)
i
∀j ∈ [d], E Y − (X (e) )⊤ β k,γ Xj − E Y (e) − (X (e) )⊤ β k,γ Xj
|E|
e∈E
p 
≤ Cbσx2 σy ζ(1, t) + ζ(1, t)

happens with probability at least 1 − e−t .


Lemma E.6. Suppose Condition 3.4 holds. Then there exist some universal constants C, c > 0 such that,
for any constant α > 0, the following event
[ 1 X b ⊤ (e) 2 1
∀θ ∈ {∆ ∈ Rd : ∥∆S c ∥1 ≤ α∥∆S ∥1 } E[|θ X | ] ≥ κ∥θ∥22
|E| 2
S⊆[d],|S|≤s e∈E

n/(Cσx )4 , where s = c(1 + α)−2 σx−4 κ · n|E|/(b · log d).



occurs with probability at least 1 − 3 exp −e

Now we are ready to prove Theorem 3.5.
Proof of Theorem 3.5. Denote βb = βbk,γ,λ , β⋆ = (β⋆,1 , . . . , β⋆,d )⊤ = β k,γ , S⋆ = supp(β⋆ ) and ∆
b = βb − β⋆ .
First, one can observe that
(a) (b)
b S ∥1 − ∥ ∆
λ(∥∆ b S c ∥1 ) ≥ λ(∥β⋆ ∥1 − ∥β∥
b 1) ≥ Q b −Q
b k,γ (β) b k,γ (β⋆ ).
⋆ ⋆

Here (a) follows from the triangle inequality; and (b) follows from the fact that βb minimizes Q
b k,γ (β) + λ∥β∥1 .
At the same time, it follows from the definition of Qk,γ (β) that
b

Q b −Q
b k,γ (β) b k,γ (β⋆ )
d
X d
X
= R(β)
b + bk (j) · |βbj | − R(β
w b ⋆) − bk (j) · |β⋆,j |
w
j=1 j=1
d
1 b⊤ b b b⊤ 1 ⊤b ⊤
X  
= β Σβ − β u b − β⋆ Σβ⋆ + β⋆ u
b+γ bk (j) |βbj | − |β⋆,j |
w
2 2 j=1
d
1 b b βb − β⋆ + β⋆ ) − 1 β⋆⊤ Σβ
X  
= (β − β⋆ + β⋆ )⊤ Σ( b ⋆ − (βb − β⋆ )⊤ u
b+γ bk (j) |βbj | − |β⋆,j |
w
2 2 j=1
d
1 b⊤b b b⊤b b ⊤u
X  
= ∆ Σ∆ + ∆ Σβ⋆ − ∆ b+γ bk (j) |βbj | − |β⋆,j |
w (E.7)
2 j=1
d
1 b⊤b b b⊤ b ⊤ (u − Σβ⋆ ) − ∆
b ⊤ (u − Σβ⋆ ) + γ
X  
= ∆ Σ∆ − ∆ (b u − Σβ
b ⋆) + ∆ bk (j) |βbj | − |β⋆,j |
w
2 j=1
(a) 1 ⊤
n o
= ∆ b Σ
b∆b −∆b ⊤ (b
u − Σβ
b ⋆ ) − (u − Σβ⋆ )
2  
Xd   X d  
+ γ − ξj wk (j) βbj − β⋆,j + bk (j) |βbj | − |β⋆,j | 
w
j=1 j=1

1 b⊤b b
= ∆ Σ∆ + T1 + γT2 (β).
b
2
Here (a) follows from the KKT condition that
\[
\Sigma\beta_\star - u = \Sigma(\beta_\star - \bar\beta) = -\gamma\cdot v \odot \xi \qquad\text{with } v = (w_k(1), \ldots, w_k(d))^\top \tag{E.8}
\]
and
\[
\xi_j \in \begin{cases} \{\mathrm{sign}(\beta_{\star,j})\} & j \in S_\star \\ [-1, 1] & j \notin S_\star. \end{cases}
\]
n o
For T1 , note that the j-th coordinate of (bu − Σβ
b ⋆ ) − (u − Σβ⋆ ) is

1 X  b h (e) (e)
i h
(e)
i
E (Y − (X (e) )⊤ β k,γ )Xj − E (Y (e) − (X (e) )⊤ β k,γ )Xj .
|E|
e∈E

Denote ♠ = bσx2 σy . Applying Lemma E.5 with t = C log(n · d) yields


s
log d + log n b
T1 ≥ −∥(b
u − Σβ
b ⋆ ) − (u − Σβ⋆ )∥∞ ∥∆∥
b 1 ≥ −C♠ ∥∆∥1 (E.9)
n · |E|

with probability at least 1 − (nd)−20 .
For T2 , one has
X  X 
T2 = |βbj | − |β⋆,j | wbk (j) − βbj − β⋆,j wk (j) sign(β⋆,j )
j∈S⋆ j∈S⋆
X X
+ |βbj | · w
bk (j) − βbj · wk (j)ξj
j ∈S
/ ⋆ j ∈S
/ ⋆
X  X 
≥ |βbj | − |β⋆,j | wbk (j) − βbj − β⋆,j wk (j) sign(β⋆,j )
j∈S⋆ j∈S⋆
X X
+ |βbj | · w
bk (j) − |βbj | · wk (j)|ξj |
j ∈S
/ ⋆ j ∈S
/ ⋆
Xh i
= |βbj | · (w
bk (j) − wk (j)) + |βbj | · wk (j)(1 − |ξj |)
j∈S⋆c
X h    i
+ |βbj | − |β⋆,j | · w
bk (j) + |β⋆,j | − (βbj ) sign(β⋆,j ) wk (j)
j∈S⋆
(a) Xh i X h    i
≥ |βbj | · (w
bk (j) − wk (j)) + |βbj | − |β⋆,j | · w
bk (j) + |β⋆,j | − |βbj | wk (j)
j∈S⋆c j∈S⋆
Xh i Xh i
= (|βbj | − |β⋆,j |) · (w
bk (j) − wk (j)) + (|βbj | − |β⋆,j |) · (w
bk (j) − wk (j))
j∈S⋆c j∈S⋆
d
X
≥− |βbj − β⋆,j | · |w
bk (j) − wk (j)|.
j=1

Here (a) follows from the fact that 1 − |ξj | ≥ 0 and (βbj ) sign(β⋆,j ) ≥ −|βbj |. It follows from the upper bound
bk − wk ∥∞ derived in (E.4) that, provided Cσx4 ρ(k, t) ≤ 1, the following holds with probability at least
of ∥w
1 − e−t
p
T2 ≥ −∥βb − β⋆ ∥1 ∥w bk − wk ∥∞ ≥ −∥∆∥ b 1 ♣ · ρ(k, t).

Set t = C log(n · d) and recall that we assume |E| ≤ nc1 . Then the following holds with probability at least
1 − (n · d)−20
r
b 1 · k log d + (c1 + 1) log n .
p
T2 ≥ −C ♣∥∆∥ (E.10)
n
Combining (E.7), (E.9) and (E.10), we obtain
r s !
b S ∥1 − ∥ ∆
b S c ∥1 ) ≥ −∥∆∥
p k log d + (c1 + 1) log n log d + log n
λ(∥∆ ⋆
b 1 C ♣·γ + C♠ .

n n · |E|
| {z }
λ⋆

b S c ∥1 (λ − λ⋆ ) ≤ (λ + λ⋆ ) · ∥∆
This immediately implies ∥∆ b S ∥1 , then the following holds
⋆ ⋆

∥∆
b S c ∥1 ≤ 3∥∆

b S ∥1
⋆ (E.11)

provided λ ≥ 2λ⋆ . Given (E.11), we can apply the restricted strong convexity derived from Lemma E.6 with
α = 3 and combine (E.7), which yields
1 b⊤b b b 1 ≥ κ ∥∆∥
b S ∥1 − ∥∆
λ(∥∆ ⋆
b S c ∥1 ) ≥ ∆ Σ∆ − λ⋆ ∥∆∥ b 22 − λ⋆ ∥∆∥
b 1

2 4

with probability over 1 − 3 exp(−cn/σx4 ) ≥ 1 − (n · d)−20 . This further implies
(a) (b) p
b 22 ≤ 4λ⋆ ∥∆∥
κ∥∆∥ b S ∥1 ≤ 12λ∥∆
b 1 + 4λ∥∆ b S ∥1 ≤ 12λ |S⋆ |∥∆∥
b 2.
⋆ ⋆

Here (a) follows from λ ≥ 2λ⋆ and ∥∆ b S c ∥1 ≤ 3∥∆



b S ∥1 ; and (b) follows from Cauchy-Schwarz inequality. By

e = C3 max{σx4 c2 , (c1 b)1/2 σx2 σy , bσx2 σy }, we can conclude that
letting C 1

12λ p
∥∆∥
b 2≤ |S⋆ |.
κ

Proof of Lemma E.5. Note that


1 X  b h (e) 
(e)
i h 
(e)
i
E Y − (X (e) )⊤ β k,γ Xj − E Y (e) − (X (e) )⊤ β k,γ Xj (E.12)
|E|
e∈E

is the recentered average of mean-zero independent random variables, each of which is the product of two
sub-Gaussian variables. By Condition 3.4, the product of sub-Gaussian parameters of (Y (e) − (X (e) )⊤ β k,γ )
(e)
and Xj is no more than
√ √
  r (a)
(e) 1/2 k,γ (e)
C σy + σx max ∥(Σ ) β ∥2 σx max |Σjj | ≤ C(σy + bσx ∥Σ1/2 β k,γ ∥2 )σx b
e∈E e∈E,j∈[d]

(b) √ √
≤ C(σy + bσx σy ) bσx ≤ 2Cbσx2 σy .
Here the (a) follows from Condition 3.4; and (b) follows from Lemma E.3. Consequently, it follows from the
concentration inequality Lemma E.2 that

1 X  b h (e) (e)
i h
(e)
i
∀j ∈ [d], E (Y − (X (e) )⊤ β k,γ )Xj − E (Y (e) − (X (e) )⊤ β k,γ )Xj
|E|
e∈E
p 
≤ 2Cbσx2 σy ζ(1, t) + ζ(1, t)

happens with probability at least 1 − e−t .

E.4 Proof of Proposition E.1


Proposition E.1 is based on an instance-dependent decomposition of the response $Y^{(e)}$. We denote the residual defined by the least squares solution constrained on $X_S$ using all the data as
\[
R^{(e,S)} := Y^{(e)} - (\beta^{(S)})^\top X^{(e)}.
\]
Define the random vector
\[
U^{(e,S)} = X_S^{(e)} R^{(e,S)}. \tag{E.13}
\]
The following deterministic lemma unveils the relationship between the calculated weight $v(S)$ and the population-level FAIR loss proposed by Gu et al. (2024).
Lemma E.7. Suppose $\Sigma^{(e)} \succ 0$ for any $e \in E$. We have
\[
v(S) = \min_{u\in\mathbb{R}^S} \frac{1}{|E|}\sum_{e\in E}\Big\|(\Sigma_S^{(e)})^{-1/2}\,\mathbb{E}\big[(Y^{(e)} - u^\top X_S^{(e)})X_S^{(e)}\big]\Big\|_2^2 \tag{E.14}
\]
\[
\phantom{v(S)} = \frac{1}{|E|}\sum_{e\in E}\Big\|(\Sigma_S^{(e)})^{-1/2}\,\mathbb{E}\big[U^{(e,S)}\big]\Big\|_2^2. \tag{E.15}
\]
Moreover,
\[
\frac{1}{|E|}\sum_{e\in E}\mathbb{E}\big[X_S^{(e)} R^{(e,S)}\big] = 0. \tag{E.16}
\]

Proposition E.1 is a deterministic result after defining the following high-probability events.
Lemma E.8. Suppose Condition 3.4 hold. Then there exists some universal constants C1 , C2 > 0 such that
the following two events
(
A1 (s, t) = ∀e ∈ E, S ⊆ [d] with |S| ≤ s,
)
(e)
√ p 
(ΣS )−1/2 (E[U
b (e,S) ] − E[U (e,S)
]) ≤ C1 bσx2 σy ρ(s, t) + ρ(s, t)
2
( (E.17)
A2 (s, t) = ∀e ∈ E, S ⊆ [d] with |S| ≤ s,
)
p 
(e) b (e) )(Σ(e) )−1/2
(ΣS )−1/2 (Σ S S −I ≤ C2 σx2 ρ(s, t) + ρ(s, t)
2

occurs with probability at least 1 − e−t .


Proof of Lemma E.8. See Appendix E.7.
Lemma E.9. Suppose Condition 3.4 hold. Then there exists some universal constants C1 , C2 > 0 such that
the following two events
(
A3 (s, t) = ∀S ⊆ [d] with |S| ≤ s,
)
−1/2 1 X b (e,S) p 
(ΣS ) (E[U ] − E[U (e,S) ]) ≤ C1 σx2 σy bζ(s, t) + bζ(s, t) (E.18)
|E|
e∈E 2
( )
p 
−1/2 b −1/2
A4 (s, t) = ∀S ⊆ [d] with |S| ≤ s, ΣS Σ S (ΣS ) −I ≤ C2 σx2 bζ(s, t) + bζ(s, t)
2

occurs with probability at least 1 − e−t .


Proof of Lemma E.9. See Appendix E.8.
Now we are ready to prove Proposition E.1.
4
Proof of Proposition E.1. The proof proceeds when A1 (k, t) – A4 (k, t) happens and
p Cσx ρ(k, t) ≤ 1 for some
(e)
large enough universal constant C. In this case we have Σ ≻ 0 and ρ(k, t) ≤ ρ(k, t). It follows similar
to the proof of Lemma E.7 that vb(S) = minu∈R|S| qbS (a) with
( )
(S) ⊤ b (S) (S) ⊤ 1 X b (e,S) (e)
qbS (a) = (a − βS ) Σ(a − βS ) − 2(a − βS ) E[R XS ]
|E|
e∈E
1 X b (e) −1/2 b h (e) (e,S) i 2
+ (Σ ) E XS R
|E| 2
e∈E

b (e) ≻ 0 for any e ∈ E. We can claim that qbS (a) can be minimized by
provided Σ S
( ) ( )
(S) (S) −1 1 X b (e,S) (e) (S) −1 1 X b (e,S)
a=β b = βS + (Σ)b E[R XS ] = βS + (Σ) b E[U ] ,
|E| |E|
b
e∈E e∈E

substituting it into qbS , we obtain

1 X b h (e,S) i⊤ b (e) −1 b h (e,S) i


vb(S) = qbS (b
a) = E U (Σ ) E U
|E|
e∈E
( ) ( )
1 X b (e,S) −1 1 X
(e,S)
− E[U ] (ΣbS) E[U
b ] =T
b1 + T
b2.
|E| |E|
e∈E e∈E

For T
b 1 , we do the following decomposition,
o⊤
b1 = 1
X n (e) 
(e)
T (ΣS )−1/2 E[U
b (e,S) ] − E[U (e,S) ] + E[U (e,S) ] b (e) (Σ(e) )−1/2 )−1
((ΣS )−1/2 Σ S S
|E|
e∈E
n  o
(e)
× (ΣS )−1/2 E[Ub (e,S) ] − E[U (e,S) ] + E[U (e,S) ]

(e) (e) b (e) (Σ(e) )−1/2 − I satisfying


b (e,S) ] − E[U (e,S) ]) ∈ R|S| , and ∆(e) = (Σ(e) )−1/2 Σ
We let ∆1 = (ΣS )−1/2 (E[U 2 S S S
(e)
∥∆2 ∥ ≤ 0.5 by our assumption on n. Then it follows from Weyl’s theorem that

(e)
 1 1 (e)
λmin (∆2 + I)−1 ≥  ≥ (e)
≥ 1 − 2∥∆2 ∥2
(e)
λmax ∆2 +I 1 + ∥∆2 ∥2

(e)
 1 1 (e)
λmax (∆2 + I)−1 ≤ (e)
≤ (e)
≤ 1 + 2∥∆2 ∥2 ,
λmin (∆2 + I) 1− ∥∆2 ∥2

where the last inequalities follows from the fact that 1/(1 − x) ≤ 1 + 2x and 1/(1 + x) ≥ 1 − 2x when
x ∈ [0, 0.5]. We thus have
(e) (e)
(∆2 + I)−1 − I ≤ 2∥∆2 ∥2 and (∆2 + I)−1 ≤ 1.5 (E.19)
2 2

Therefore, it follows from the triangle inequality and Cauchy-Schwarz inequality that
o⊤
b 1 − v(S) = 1
X n (e) (e)
T ∆1 + (ΣS )−1/2 E[U (e,S) ] (∆2 + I)−1
|E|
e∈E
n o
(e) (e)
× ∆1 + (ΣS )−1/2 E[U (e,S) ] − v(S)

1 X (e) (e)
≤ 2∥∆1 ∥2 (∆2 + I)−1 2 (ΣS )−1/2 E[U (e,S) ]
|E| 2
e∈E
1 X (e)
+ ∥∆1 ∥22 (∆2 + I)−1 2
|E|
e∈E
1 X 2
(e)
+ (∆2 + I)−1 − I 2 (ΣS )−1/2 E[U (e,S) ]
|E| 2
e∈E
s s
(a) 1 X 2 1 X (e) 2 1 X (e) 2
(e)
≤ 3 (ΣS )−1/2 E[U (e,S) ] ∥∆1 ∥2 + 1.5 ∥∆1 ∥2
|E| 2 |E| |E|
e∈E e∈E e∈E
 
1 X 2
(e) 2 (e)
+ sup 2∥∆2 ∥2 (ΣS )−1/2 E[U (e,S) ]
e∈E |E| 2
e∈E
(b) q p
≤ 3 v(S) · C12 σx4 σy2 bρ(k, t) + 1.5C12 σx4 σy2 bρ(k, t) + 2C2 σx2 ρ(k, t) · v(S)
(c) p q
≤ 4 v(S) · (C12 + C22 )σx4 σy2 b · ρ(k, t) + 1.5C12 σx4 σy2 bρ(k, t)
where (a) follows from the inequalities (E.19) and Cauchy-Schwarz inequality, (b) follows from (E.17), and
(c) follows from the fact
(e) (e) (e)
∥(ΣS )−1/2 E[U (e,S) ]∥2 ≤ ∥(ΣS )−1/2 E[X (e) Y (e) ]∥2 + ∥(ΣS )1/2 β (S) ∥2
(d)
q
(e) (e) (e)
≤ (E[XS Y (e) ])⊤ (ΣS )−1 (E[XS Y (e) ])
(e) −1/2 1/2
+ ∥(ΣS )1/2 ΣS ∥2 ∥ΣS β (S) ∥2
√ √
≤ σy + bσy ≤ 2 bσy ,
which further implies that
s
1 (e)
p q p
v(S) = ∥(ΣS )−1/2 E[U (e,S) ]∥22 · v(S) ≤ 4bσy2 · v(S).
|E|
(e)
Here (d) follows from the fact that the covariance matrix of [XS , Y (e) ] are positive semi-definite thus the
Schur complement satisfies
(e) (e) (e)
σy2 − (E[XS Y (e) ])⊤ (ΣS )−1 (E[XS Y (e) ]) ≥ 0,
1
P (e)
and a similar argument to the covariance matrix of the mixture distribution [XS , Y ] ∼ |E| e∈E µ(xS ,y) .
b 2 , observe that 1 P (e) (e,S)
For T |E| e∈E E[XS R ] = 0, then following (E.18), (E.19) and the fact that bσx4 ζ(k, t) ≤
σx4 ρ(k, t) ≤ 1 since b ≤ |E| by Condition 3.4,
2
b 2 | ≤ (ΣS )−1/2 1 b (e,S) ] − E[U (e,S) ]) · (Σ−1/2 Σ b S Σ−1/2 )−1
X
|T (E[U S S
|E| 2
e∈E 2
p 2
≤ 1.5(C22 σx2 σy )2 bζ(s, t) + bζ(s, t) ≤ C3 σx4 σy2 ρ(k, t).

Putting all the pieces together, we can conclude that

T b 2 − v(S) ≤ T
b1 + T b 1 − v(S) + T b2
 p q 
≤ C4 bσx4 σy2 ρ(k, t) + v(S) bσx4 σy2 ρ(k, t) .

This completes the proof.

E.5 Proof of Proposition E.2


We first establish the existence and uniqueness of β k,γ . The existence of an optimal solution follows from
the fact that Qk,γ (β) is continuous in Rd , and its optimal solution can be attained on the closed set F =
{β : ∥β − β̄∥2 ≤ (1/λmin (Σ))1/2 β̄ ⊤ Σβ̄} given
2Qk,γ (β) ≥ (β − β̄)⊤ Σ(β − β̄) ≥ β̄ ⊤ Σβ̄ = 2Qk,γ (0) ∀ β ∈ F c.
The uniqueness will be established using the proof-by-contradiction argument. Let β ′ and β † be two optimal
solutions with β ′ ̸= β † , then
d
βj′ + βj†
 ′
β + β†
 ′
β + β†
  X
Qk,γ =R + γwk (j)
2 2 j=1
2
 
d
(a) 1  1 X 
< R(β ′ ) + R(β † ) + γwk (j)(|βj′ | + |βj† |)
2 2  j=1 
1
≤ Qk,γ (β ′ ) + Qk,γ β † .

2
Here (a) follows from the fact that R(β) is quadratic function with positive eigenvalues and hence is further
strongly convex. This is contrary to the fact that β ′ and β † are optimal solutions.
Finally, we show the loss is strong convex with respect to β k,γ . Let S k,γ = supp(β k,γ ). Observe that
R(β) − R(β k,γ ) = (β − β̄)⊤ Σ(β − β̄) − (β k,γ − β̄)⊤ Σ(β γ − β̄)
1
= (β − β k,γ )⊤ Σ(β − β k,γ ) − (β − β k,γ )⊤ Σ(β̄ − β k,γ ).
2
Putting these pieces together, we obtain
(a) 1
Qk,γ (β) − Qk,γ (β k,γ ) = (β − β k,γ )⊤ Σ(β − β k,γ ) − (β − β k,γ )⊤ Σ(β̄ − β k,γ )
2
Xd  
+γ vj |βj | − |βjk,γ |
j=1
1
(b)
= (β − β k,γ )⊤ Σ(β − β k,γ )
2
X d n o
+γ vj −(βj − βjk,γ )ξj + |βj | − |βjk,γ |
j=1
d
1 X
= (β − β k,γ )⊤ Σ(β − β k,γ ) + γ vj {|βj | − ξj βj }
2 j=1
(c) 1 1/2
≥ ∥Σ (β − β k,γ )∥22
2
Here (a) follows from the calculation of R(β) − R(β k,γ ), (b) follows from the KKT condition (E.8), (c) follows
from the fact that ξj ∈ [−1, 1]. This completes the proof.

E.6 Proof of Lemma E.7
(e,S) (e) (e)
It follows from the identity βS = (ΣS )−1 E[Y (e) XS ] that
1 X (e,S) (S) (e) (e,S) (S)
v(S) = (βS − βS )⊤ ΣS (βS − βS )
|E|
e∈E
1 X (e) (e) (e) (e) (S) (S) (e) (S)
= E[Y (e) XS ]⊤ (ΣS )−1 E[Y (e) XS ] − E[Y (e) XS ]⊤ βS + βS ΣS βS
|E|
e∈E
1 X   2
(e) (e) (e) (S)
= (ΣS )−1/2 E[Y (e) XS ] − ΣS βS .
|E| 2
e∈E

(S) (e)
At the same time, for any a ∈ R|S| , plugging Y (e) = (βS )⊤ XS + R(e,S) gives

1 X   2
(e) (e) (e)
qS (a) = (ΣS )−1/2 E[Y (e) XS ] − ΣS a
|E| 2
e∈E
1   2
(e) (e) (e) (S)
X
= (ΣS )−1/2 E[R(e,S) XS ] − ΣS (a − βS )
|E| 2
e∈E
( )
(S) ⊤ (S) (S) ⊤ 1 X (e,S) (e)
=(a − βS ) Σ(a − βS ) − 2(a − βS ) E[R XS ]
|E|
e∈E
1 X h i 2
(e)
+ (Σ(e) )−1/2 E XS R(e,S) .
|E| 2
e∈E

It follows from the definition of R(e,S) and the definition of β (S) that
1 X (e) 1 X (e) (S) (e)
E[XS R(e,S) ] = E[XS (Y (e) − (βS )⊤ XS )]
|E| |E|
e∈E e∈E
1 X (e) (S)
= E[XS Y (e) ] − ΣS βS = 0.
|E|
e∈E

(S)
This verifies (E.16). Therefore a⋆ = βS attains the global minima of qS (a), this verifies (E.14) and (E.15).

E.7 Proof of Lemma E.8


High probability error bound in A1 (k, t). For any S ⊆ [d] with |S| ≤ s, let w(S,1) , . . . , w(S,NS ) be an
d
1/4−covering of unit ball BS = {x ∈ R : xS c = 0, ∥x∥2 ≤ 1}, that is, for any w ∈ BS , there exists some
π(w) ∈ [NS ] such that

∥w − w(S,π(w)) ∥2 ≤ 1/4. (E.20)

It follows from a standard empirical process result that $N_S \le 9^{|S|}$; then
\[
N = \sum_{|S|\le s} N_S \le \sum_{|S|\le s} 9^{|S|} \le \sum_{i=0}^{s}\binom{d}{i}9^i
\le \Big(\frac{9d}{s}\Big)^s \sum_{i=0}^{s}\Big(\frac{s}{d}\Big)^i\binom{d}{i}
\le \Big(\frac{9d}{s}\Big)^s \sum_{i=0}^{d}\Big(\frac{s}{d}\Big)^i\binom{d}{i}
= \Big(\frac{9d}{s}\Big)^s\Big(1 + \frac{s}{d}\Big)^d
\le \Big(\frac{9\times 4d}{s}\Big)^s. \tag{E.21}
\]
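The counting bound (E.21) can be spot-checked numerically for a few $(d, s)$ pairs; the sketch below is only an illustration, and the chosen pairs are arbitrary.

    from math import comb

    for d, s in [(50, 5), (200, 10), (30, 30)]:
        N = sum(comb(d, i) * 9 ** i for i in range(s + 1))
        assert N <= (36 * d / s) ** s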
(e)
At the same time, for fixed e and S, denote ξ = (ΣS )−1/2 (E[U
b (e,S) ] − E[U (e,S) ]). It follows from the
variational representation of the ℓ2 norm that
(S,ℓ) ⊤ (S,π(w)) ⊤ (S,ℓ) ⊤ 1
∥ξ∥2 = sup wS⊤ ξ ≤ sup (wS ) ξ + sup (wS − wS ) ξ ≤ sup (wS ) ξ + ∥ξ∥2 ,
w∈BS ℓ∈[NS ] w∈BS ℓ∈[NS ] 4
where the last inequality follows from the Cauchy-Schwarz inequality and our construction of covering in
(S,ℓ)
(E.20). This implies ∥ξ∥2 ≤ 2 supℓ∈[NS ] (wS )⊤ ξ, thus
(e)
sup sup (ΣS )−1/2 (E[U
b (e,S) ] − E[U (e,S) ])
e∈E |S|≤s 2
n
(S,ℓ) ⊤ (e) 1 X  (e) (e,S) (e)

(E.22)
≤2 sup (wS ) (ΣS )−1/2 Xi,S Ri − E[XS R(e,S) ] .
e∈E,|S|≤s,ℓ∈[NS ] n i=1
| {z }
Z1 (e,S,ℓ)

Note for fixed e, S and ℓ, Z1 (e, S, ℓ) is the recentered average of independent random variables, each of which
(S,ℓ) (e) (e)
is the product of two sub-Gaussian variables. By Condition 3.4, (wS )⊤ (ΣS )−1/2 XS has sub-Gaussian
(e,S) (e) (S) ⊤ (e)
parameter at most σx , and the sub-Gaussian parameter of R := Y − (β ) X is no more than
!
(a) 1 X
(e) 1/2 (S) (e) 1/2 −1/2 −1/2 (e) (e)
σy + σx (ΣS ) β ≤ σy + σx (ΣS ) ΣS ΣS E[XS Y ]
2 2 |E|
e∈E 2
s
(b) 1
(e)
≤ σy + σx (ΣS )1/2 ΣS
−1/2
X
E[(Y (e) )2 ] (E.23)
2 |E|
e∈E
(c) √
≤ σy + σx bσy .
Here (a) follows from the property of the operator norm and the definition of β (S) ; (b) follows from the
(S,ℓ) (e) −1/2 (e) (e,S)
Cauchy-Schwarz inequality; and (c) follows from Condition 3.4. Therefore, (wS )⊤ (ΣS )√ XS R
is the product of two sub-Gaussian variables with parameter no more than σx and σy + σx bσy . Then it
follows from the tail bound for sub-exponential random variable that
  r 
′ 1/2 2 u u
∀e ∈ E, |S| ≤ s, ℓ ∈ [NS ], P |Z1 (e, S, ℓ)| ≥ C b σx σy + ≤ 2e−u , ∀u > 0.
n n
Letting u = t + log(2N |E|) ≤ 6 (t + s log(4d/s) + log(|E|)), we obtain
" #
p 
P sup |Z1 (e, S, ℓ)| ≥ 6C ′ b1/2 σx2 σy ρ(s, t) + ρ(s, t)
e∈E,|S|≤s,ℓ∈[NS ]

≤ N |E| × 2e− log(2N |E|)−t ≤ e−t .


Combining with the argument (E.22) concludes the proof of the claim with C1 = 12C ′ .
High probability error bound in A2 (k, t). For any symmetric matrix Q ∈ Rd×d , it follows from the variational
representation of the operator norm that,
(S,ℓ) ⊤ (S,ℓ)
∥QS ∥2 = sup wS⊤ QS wS ≤ sup (wS ) QS (wS )
w∈BS l∈[NS ]
(S,π(w)) ⊤ (S,π(w))
+ sup 2(wS − wS ) QS wS
w∈BS
(S,π(w)) ⊤ (S,π(w))
+ sup (wS − wS ) QS (wS − wS ).
w∈BS

(S,ℓ) ⊤ (S,ℓ) 1 1
≤ sup (wS ) QS (wS ) + ∥QS ∥2 + ∥QS ∥2 ,
l∈[NS ] 2 16

(S,ℓ) ⊤ (S,ℓ)
which implies ∥QS ∥2 ≤ 3 supℓ∈[NS ] (wS ) QS (wS ), thus

(e) b (e) )(Σ(e) )−1/2 − I


sup sup (ΣS )−1/2 (Σ S S
e∈E |S|≤s 2

(E.24)
h i
(S,ℓ) (e) b (e) )(Σ(e) )−1/2 − I (w(S,ℓ) ) .
≤3 sup (wS )⊤ (ΣS )−1/2 (Σ S S S
e∈E,|S|≤s,ℓ∈[NS ] | {z }
Z2 (e,S,ℓ)

Note that for fixed $e$, $S$ and $\ell$, $Z_2(e, S, \ell)$ is the recentered average of independent random variables, each of which is
(S,ℓ) (2) (e)
the square of a sub-Gaussian variable (wS )⊤ (ΣS )−1/2 XS with parameter at most σx , by Condition 3.4.
Then it follows from the tail bound for exponential random variable that
  r 
u u
∀e ∈ E, |S| ≤ s, ℓ ∈ [NS ], P |Z2 (e, S, ℓ)| ≥ C ′ σx2 + ≤ 2e−u , ∀u > 0.
n n

Letting u = t + log(2N |E|) ≤ 6 (t + s log(4d/s) + log(|E|)), we obtain


" #
p 
′ 2
P sup |Z2 (e, S, ℓ)| ≥ 6C σx ρ(s, t) + ρ(s, t) ≤ N |E| × 2e− log(2N |E|)−t ≤ e−t .
e∈E,|S|≤s,ℓ∈[NS ]

Combining with the argument (E.24) concludes the proof of the claim with C2 = 18C ′ .

E.8 Proof of Lemma E.9


High probability error bound in A3 (k, t). The proof idea is almost identical to Lemma E.8. For any S ⊆ [d]
(S) (S)
with |S| ≤ s, let w1 , . . . , wNS be an 1/4−covering of unit ball BS = {x ∈ Rd : xS c = 0, ∥x∥2 ≤ 1}. Recall
(S,ℓ) ⊤
that in Lemma E.8 we obtain ∥ξ∥2 ≤ 2 supℓ∈[NS ] (wS ) ξ by the variational representation of ℓ2 norm.
This immediately yields,

1 X b (e,S)
sup (ΣS )−1/2 (E[U ] − E[U (e,S) ])
|S|≤s |E|
e∈E 2
n
(S,ℓ) 1 X X  (e) (e,S) (e)
 (E.25)
≤2 sup (wS )⊤ (ΣS )−1/2 Xi,S Ri − E[XS R(e,S) ] .
|S|≤s,ℓ∈[NS ] n·E
e∈E i=1
| {z }
Z3 (S,ℓ)

(S,ℓ) (e)
Note by Condition 3.4, for fixed e, S and ℓ, (wS )⊤ (ΣS )−1/2 XS is a sub-Gaussian variable with parameter
 1/2
(S,ℓ) (e) (S,ℓ)
σe,S,ℓ = (wS )⊤ (ΣS )−1/2 (ΣS )(ΣS )−1/2 wS σx , which satisfies

n
!1/2 !1/2
XX X
(σe,S,ℓ )2 = n (σe,S,ℓ )2
e∈E i=1 e∈E
!1/2
X (S,ℓ) (e) (S,ℓ)
= n (wS )⊤ (ΣS )−1/2 (ΣS )(ΣS )−1/2 wS σx
e∈E
! !1/2
(S,ℓ) (e) (S,ℓ)
X
= n· (wS )⊤ (ΣS )−1/2 ΣS (ΣS )−1/2 wS σx
e∈E

= (n · |E|)1/2 σx .

Also, from Condition 3.4, we have
 1/2
(S,ℓ) (e) (S,ℓ)
∀e ∈ E, |S| ≤ s, ℓ ∈ [NS ], σe,S,ℓ = (wS )⊤ (ΣS )−1/2 (ΣS )(ΣS )−1/2 wS σx
q
(S,ℓ) (S,ℓ) (E.26)
≤ b · (wS )⊤ wS · σx

= b · σx .

While for fixed e and S, R(e,S) is a sub-Gaussian variable with parameter σy (1 + σx b), as obtained in
(E.23). Thus Z3 (e, S, ℓ) is the recentered average of independent random variables,
√ each of which is the
product of two sub-Gaussian variables with parameters σe,S,ℓ and σy (1 + σx b). Then it follows from the
tail bound for exponential random variable that
√ √
  r 
u u
|S| ≤ s, ℓ ∈ [NS ], P |Z3 (S, ℓ)| ≥ C ′ bσx2 σy b + ≤ 2e−u , ∀u > 0.
n · |E| n · |E|

Letting u = t + log(2N ) ≤ 6 (t + s log(ed/s)), we obtain


" #
p 
′ 2
P sup |Z3 (S, k)| ≥ 6C σx σy bζ(s, t) + bζ(s, t) ≤ N × 2e− log(2N )−t ≤ e−t .
|S|≤s,ℓ∈[NS ]

Combining with the argument (E.25) concludes the proof of the claim with C1 = 12C ′ .
High probability error bound in A4 (k, t). Recall that in Lemma E.8 we obtain that for any symmetric
(S,ℓ) (S,ℓ)
matrix Q ∈ Rd×d , ∥QS ∥2 ≤ 3 supℓ∈[NS ] (wS )⊤ QS wS , by the variational representation of the operator
norm. This immediately yields,

sup (ΣS )−1/2 (Σ


b S )(ΣS )−1/2 − I
|S|≤s 2

(E.27)
h i
(S,ℓ) ⊤ b S )(ΣS )−1/2 − I (w(S,ℓ) ) .
≤3 sup (wS ) (ΣS )−1/2 (Σ S
|S|≤s,ℓ∈[NS ] | {z }
Z4 (S,ℓ)

Note for fixed S and ℓ,


h i
(S,ℓ) ⊤ b S )(ΣS )−1/2 − I (w(S,ℓ) )
Z4 (S, ℓ) = (wS ) (ΣS )−1/2 (Σ S
n 2
1 X X  (S,ℓ) ⊤ (e)
= (wS ) (ΣS )−1/2 Xi − 1,
n · |E| i=1
e∈E

is the recentered average of independent random variables, each of which is the square of sub-Gaussian

(S,ℓ) (e) (S,ℓ)
1/2 √
variable with parameter σe,S,ℓ = (wS )⊤ (ΣS )−1/2 (ΣS )(ΣS )−1/2 wS σx . We have σe,S,ℓ ≤ b · σx

as obtained in (E.26), and
n
!1/2
XX
4
(σe,S,ℓ )
e∈E i=1
!1/2
X
= n (σe,S,ℓ )4
e∈E
!1/2
√ X
2
≤ n · (max σe,S,ℓ ) σe,S,ℓ
e,S,ℓ
e∈E
!1/2
√ √ X
2
≤ n· b · σx σe,S,ℓ
e∈E
!1/2
√ √ X (S,ℓ) (e) (S,ℓ)
= n· b · σx (wS )⊤ (ΣS )−1/2 (ΣS )(ΣS )−1/2 wS σx
e∈E
! !1/2
√ √ (S,ℓ)
X (e) (S,ℓ)
= n· b · σx (wS )⊤ (ΣS )−1/2 ΣS (ΣS ) −1/2
wS σx
e∈E
√ √ p
= n· b· |E| · σx2 .
Then it follows from the tail bound for the sub-exponential random variable that
  r 
′ 2 u u
∀|S| ≤ s, ℓ ∈ [NS ], P |Z4 (S, ℓ)| ≥ C σx b + b ≤ 2e−u , ∀u > 0.
n · |E| n · |E|
Letting u = t + log(2N ) ≤ 6 (t + s log(ed/s)), we obtain
" #
p 
′ 2
P sup |Z4 (S, ℓ)| ≥ 6C σx bζ(s, t) + bζ(s, t) ≤ N × 2e− log(2N )−t ≤ e−t .
|S|≤s,ℓ∈[NS ]

Combining with the argument (E.27) concludes the proof of the claim with C1 = 18C ′ .

E.9 Proof of Lemma E.6


We first introduce some notation and outline the sketch of the proof. For any given fixed v ∈ Rd , we define
the random variables Zv and Wv be
1 X (e)
p
Zv = (v ⊤ Xi )2 − v ⊤ Σv and Wv = Zv + v ⊤ Σv,
n · |E|
i∈[n],e∈E

respectively. Given any fixed α > 0, let s = c(1 + α)−2 σx−4 κ · n|E|/(b · log d) where c is a universal constant.
We also define the set
[
Θ = Θs,α := {θ ∈ Rd : ∥θS c ∥1 ≤ α∥θS ∥1 }
S⊆[d],|S|≤s

and abbreviate it as Θ given our analysis focused on any fixed (α, s(α)). Note that Θ is a cone, in the sense
that for any θ ∈ Θ and t > 0 we also have t · θ ∈ Θ, and note that the result we want to prove is quadratic
in θ on both sides. Therefore it suffices to consider {v ∈ Θ : ∥v∥Σ = 1}, and we define the following set
B := Bs,α : = Θ ∩ {θ ∈ Rd : ∥θ∥Σ = 1}
[
= {θ ∈ Rd : ∥θS c ∥1 ≤ α∥θS ∥1 , ∥θ∥Σ = 1}
S⊆[d],|S|≤s

and abbreviate it as B. We also define the following metric on Rd

d(v, v ′ ) = σx ∥v − v ′ ∥Σ

and simply let d(v, T ) = inf a∈T d(v, a) for some set T . It suffices to show that there exists some universal
constant C such that
 
1
P inf Zv + 1 ≥ ≥ 1 − 3 exp(−e n/(Cσx )4 ),
v∈B 2

where
n · |E|
n
e := .
b
It is obvious that ne ≥ n follows from b ≤ |E| derived in (3.9). Our proof is divided into three steps.
In the first step, we establish concentration inequalities for any fixed v and v ′ . To be specific, we show
that for some universal constant C > 0, the following holds: for any t > 0,
" r !#
′ ′ t t
P |Zv − Zv′ | > Cd(v, −v )d(v, v ) + ≤ 2e−t , ∀v, v ′ ∈ Rd ; (E.28)
n
e n e
" r !#
t t
P |Zv | > Cσx 2
+ ≤ 2e−t , ∀v ∈ B; (E.29)
n
e n e
" r !#
t
P Wv > Cd(v, 0) +1 ≤ 2e−t , ∀v ∈ Rd . (E.30)
n
e

In the second step, we establish an upper bound on the Talagrand’s γ2 functional (Vershynin, 2018) of
Θ, which is defined as

X
γ2 (Θ, d) := inf sup 2k/2 d(v, Bk ). (E.31)
{Bk }∞ 2k v∈Θ
k=0 :|B0 |=1,|Bk |≤2 k=0

To be specific, we show that


p
γ2 (B, d) ≤ Cσx (1 + α)κ−1/2 s log d (E.32)

where C is a universal constant.


Finally, we combine the concentration inequalities and the complexity measure γ2 (Θ, d) to bound the
e1/2 ≥ Cσx γ2 (B, d) , then
supremum supv∈Θ |Zv |. Specifically, we show that if n
 
n/(Cσx )4 .

P sup |Zv | > 1/2 ≤ 3 exp −e (E.33)
v∈B

Step 1. Establish Concentration Inequalities for Fixed v. In this step we prove the concentration
inequalities (E.28),(E.29) and (E.30). For (E.28), it follows from the definition of Z that
1 X 
(e) (e)

(v ⊤ Xi )2 − ((v ′ )⊤ Xi )2 − v ⊤ Σv − v ′⊤ Σv ′

Zv − Zv ′ =
n · |E|
i∈[n],e∈E
1 X 
(e)
 
(e)

(v + v ′ )⊤ Xi · (v − v ′ )⊤ Xi − v ⊤ Σv − v ′⊤ Σv ′ .

=
n · |E|
i∈[n],e∈E

It is the recentered average of independent random variables, each of which is the product of two sub-Gaussian
variables with parameter σe,v+v′ and σe,v−v′ satisfying

(a)
σe,v+v′ σe,v−v′ ≤ (σx ∥v + v ′ ∥Σ(e) ) · (σx ∥v − v ′ ∥Σ(e) )
(b) √
≤ b · d(v, −v ′ ) · σx ∥v − v ′ ∥Σ(e)
(c)
≤ b · d(v, −v ′ )d(v, v ′ );
X (d) X
(σe,v+v′ σe,v−v′ )2 ≤ b · d(v, −v ′ )2 · σx2 · ∥v − v ′ ∥2Σ(e)


i∈[n],e∈E i∈[n],e∈E
(e)
= n|E|bd(v, −v ′ )2 σx2 ∥v − v ′ ∥2Σ
= n|E|bd(v, −v ′ )2 d(v, v ′ )2 .

Here (a) follows from the data generating process Condition 3.4(c); (b) and (c) follow from the fact that $\|v\|_{\Sigma^{(e)}} \le \sqrt{b}\|v\|_\Sigma$ by $\lambda_{\max}(\Sigma^{-1/2}\Sigma^{(e)}\Sigma^{-1/2}) \le b$; (d) follows directly from (b); and (e) follows from $\frac{1}{|E|}\sum_{e\in E}\|\cdot\|^2_{\Sigma^{(e)}} = \|\cdot\|^2_\Sigma$ since $|E|\cdot\Sigma = \sum_{e\in E}\Sigma^{(e)}$. Using Lemma E.1 and Lemma E.2, we can obtain that for all $v, v' \in \mathbb{R}^d$ and $t > 0$,
|E| e∈E ∥ · ∥ Σ (e) = ∥ · ∥ Σ since |E| · Σ = e∈E Σ(e) . Using Lemma E.1 and Lemma E.2, we can obtain
that for all v, v ′ ∈ Rd and t > 0,
" r !#
′ ′ t t
P |Zv − Zv′ | > Cd(v, −v )d(v, v ) + ≤ 2e−t
n
e n e

for some universal constant C > 0. This completes the proof of (E.28).
(E.29) is a corollary of (E.28), following from assigning v ′ = 0 and noticing that d(v, 0) = σx for all
v ∈ B.
For (E.30), observe that W_v² = Z_v + v^⊤Σv. Combining with (E.28), we can conclude that for all v ∈ Rd and
t > 0,
\[
\begin{aligned}
\mathbb{P}\left[ W_v > C\, d(v, 0) \left( \sqrt{\frac{t}{\tilde{n}}} + 1 \right) \right]
&= \mathbb{P}\left[ W_v^2 > C^2\, d(v, 0)^2 \left( \frac{t}{\tilde{n}} + 2\sqrt{\frac{t}{\tilde{n}}} + 1 \right) \right] \\
&\overset{(a)}{\le} \mathbb{P}\left[ W_v^2 > C\, d(v, 0)^2 \left( \sqrt{\frac{t}{\tilde{n}}} + \frac{t}{\tilde{n}} \right) + v^\top \Sigma v \right] \\
&= \mathbb{P}\left[ Z_v > C\, d(v, 0)^2 \left( \sqrt{\frac{t}{\tilde{n}}} + \frac{t}{\tilde{n}} \right) \right] \\
&\le 2e^{-t}.
\end{aligned}
\]
Here in (a) we use the fact that C > 1 and that d(v, 0)² = σ_x² v^⊤Σv ≥ v^⊤Σv since σ_x ≥ 1.
Step 2. Bounding the γ2-functional. In this step we prove (E.32). We define another set
\[
B^\Sigma := \Sigma^{1/2} B = \{ x \in \mathbb{R}^d : \|x\|_2 = 1,\ \Sigma^{-1/2} x \in \Theta \}.
\]
Note that (B, d) is isometric to (B^Σ, σ_x∥·∥_2), and (B^Σ, σ_x∥·∥_2) is isometric to (σ_x B^Σ, ∥·∥_2). From the fact that
the γ2 functional is invariant under isometries, we have
\[
\gamma_2(B, d) = \gamma_2(B^\Sigma, \sigma_x \|\cdot\|_2) = \gamma_2(\sigma_x B^\Sigma, \|\cdot\|_2). \tag{E.34}
\]
Also, the γ2 functional respects scaling in the sense that
\[
\gamma_2(\sigma_x B^\Sigma, \|\cdot\|_2) = \sigma_x\, \gamma_2(B^\Sigma, \|\cdot\|_2). \tag{E.35}
\]
Additionally, it follows from Talagrand's majorizing measure theorem (Talagrand, 2005) that there exists
some universal constant C > 0 such that
\[
\gamma_2(B^\Sigma, \|\cdot\|_2) \le C \cdot \mathbb{E}_{g \sim N(0, I_d)} \left[ \sup_{x \in B^\Sigma} g^\top x \right]. \tag{E.36}
\]
So, it remains to obtain an upper bound on the right-hand side, as follows:
\[
\begin{aligned}
\mathbb{E}_{g \sim N(0, I_d)} \left[ \sup_{x \in B^\Sigma} g^\top x \right]
&= \mathbb{E}_{g \sim N(0, I_d)} \left[ \sup_{x \in B^\Sigma} (\Sigma^{1/2} g)^\top (\Sigma^{-1/2} x) \right] \\
&\le \sup_{x \in B^\Sigma} \|\Sigma^{-1/2} x\|_1 \cdot \mathbb{E}_{g \sim N(0, I_d)} \left[ \|\Sigma^{1/2} g\|_\infty \right] \\
&\overset{(a)}{\le} (1 + \alpha) \sqrt{s} \sup_{x \in B^\Sigma} \|\Sigma^{-1/2} x\|_2 \cdot \mathbb{E}_{g \sim N(0, I_d)} \left[ \|\Sigma^{1/2} g\|_\infty \right] \\
&\overset{(b)}{\le} (1 + \alpha) \sqrt{s}\, \kappa^{-1/2}\, \mathbb{E}_{g \sim N(0, I_d)} \left[ \|\Sigma^{1/2} g\|_\infty \right] \\
&\overset{(c)}{\le} (1 + \alpha) \sqrt{s}\, \kappa^{-1/2} \cdot \sqrt{50 \log d}.
\end{aligned}
\]
Here (a) follows from the fact that Σ^{-1/2}x ∈ B and that, for any v ∈ B, we have
\[
\|v\|_1 = \|v_S\|_1 + \|v_{S^c}\|_1 \le (1 + \alpha) \|v_S\|_1 \le (1 + \alpha) \sqrt{s}\, \|v_S\|_2 \le (1 + \alpha) \sqrt{s}\, \|v\|_2
\]
for some subset |S| ≤ s by the definition of Θ; (b) follows from ∥Σ^{-1/2}x∥_2 ≤ κ^{-1/2}∥x∥_2 = κ^{-1/2}; and (c)
follows from E_{g∼N(0,I_d)}[∥Σ^{1/2}g∥_∞] ≤ E_{g∼N(0,I_d)}[∥g∥_∞] ≤ √(50 log d) by Sudakov-Fernique's inequality (Conze
et al., 1975) and Condition 3.4(b). Combining with (E.34), (E.36) and (E.35), we complete the proof of
(E.32).
Step 3. Bounding the maximum of |Zv|. In this step, we prove (E.33) following Mendelson et al. (2007).
It follows from the definition of the γ2-functional that there exists a sequence of subsets {B_k : k ≥ 0} of B with
|B_0| = 1 and |B_k| ≤ 2^{2^k} such that for every v ∈ B,
\[
\sum_{k=0}^{\infty} 2^{k/2}\, d(v, \pi_k(v)) \le 1.01\, \gamma_2(B, d),
\]
where π_k(v) denotes the nearest element to v in B_k. This immediately implies
\[
\begin{aligned}
\sum_{k=0}^{\infty} 2^{k/2}\, d(\pi_{k+1}(v), \pi_k(v)) &\le \sum_{k=0}^{\infty} 2^{k/2} \left( d(v, \pi_k(v)) + d(v, \pi_{k+1}(v)) \right) \\
&\le (1 + 2^{-1/2}) \sum_{k=0}^{\infty} 2^{k/2}\, d(v, \pi_k(v)) \\
&\le 2\, \gamma_2(B, d).
\end{aligned} \tag{E.37}
\]
Let the integer k_0 satisfy 2ñ ≥ 2^{k_0} > ñ. It follows from the triangle inequality and the definitions of W_v and Z_v
that
\[
|Z_v| \le |Z_v - Z_{\pi_{k_0}(v)}| + |Z_{\pi_{k_0}(v)}| = |W_v^2 - W_{\pi_{k_0}(v)}^2| + |Z_{\pi_{k_0}(v)}|. \tag{E.38}
\]

From Minkowski's inequality, we can observe that W_v is sub-additive with respect to v, that is, for any
v_1, v_2 ∈ Rd,
\[
\begin{aligned}
W_{v_1 + v_2} &= \left( \frac{1}{n \cdot |E|} \sum_{i \in [n], e \in E} \left( v_1^\top X_i^{(e)} + v_2^\top X_i^{(e)} \right)^2 \right)^{1/2} \\
&\le \left( \frac{1}{n \cdot |E|} \sum_{i \in [n], e \in E} \left( v_1^\top X_i^{(e)} \right)^2 \right)^{1/2} + \left( \frac{1}{n \cdot |E|} \sum_{i \in [n], e \in E} \left( v_2^\top X_i^{(e)} \right)^2 \right)^{1/2} \\
&= W_{v_1} + W_{v_2}.
\end{aligned} \tag{E.39}
\]

This helps us to obtain
\[
\left( W_{\pi_{k_0}(v)} - W_{v - \pi_{k_0}(v)} \right)^2 - W_{\pi_{k_0}(v)}^2 \le W_v^2 - W_{\pi_{k_0}(v)}^2 \le \left( W_{\pi_{k_0}(v)} + W_{v - \pi_{k_0}(v)} \right)^2 - W_{\pi_{k_0}(v)}^2.
\]
Then we can derive
\[
|W_v^2 - W_{\pi_{k_0}(v)}^2| \le W_{v - \pi_{k_0}(v)}^2 + 2\, W_{v - \pi_{k_0}(v)}\, W_{\pi_{k_0}(v)}. \tag{E.40}
\]

Therefore, combining (E.38) and (E.40), and letting the positive integer k_1 < k_0 be determined later, we can
upper bound sup_{v∈B} |Z_v| as follows:
\[
\begin{aligned}
\sup_{v \in B} |Z_v| &\le \sup_{v \in B} W_{v - \pi_{k_0}(v)}^2 + 2 \sup_{v \in B} W_{v - \pi_{k_0}(v)} \sup_{v_0 \in B_{k_0}} W_{v_0} + \sup_{v_0 \in B_{k_0}} |Z_{v_0}| \\
&= \sup_{v \in B} W_{v - \pi_{k_0}(v)}^2 + 2 \sup_{v \in B} W_{v - \pi_{k_0}(v)} \sup_{v_0 \in B_{k_0}} \sqrt{Z_{v_0} + 1} + \sup_{v_0 \in B_{k_0}} |Z_{v_0}| \\
&\le \sup_{v \in B} W_{v - \pi_{k_0}(v)}^2 + 2 \sup_{v \in B} W_{v - \pi_{k_0}(v)} \sup_{v_0 \in B_{k_0}} \left( |Z_{v_0}| + 1 \right) + \sup_{v_0 \in B_{k_0}} |Z_{v_0}| \\
&= \sup_{v \in B} W_{v - \pi_{k_0}(v)}^2 + 2 \sup_{v \in B} W_{v - \pi_{k_0}(v)} + \left( 2 \sup_{v \in B} W_{v - \pi_{k_0}(v)} + 1 \right) \sup_{v_0 \in B_{k_0}} |Z_{v_0}| \\
&\le \sup_{v \in B} W_{v - \pi_{k_0}(v)}^2 + 2 \sup_{v \in B} W_{v - \pi_{k_0}(v)} \\
&\qquad + \left( 2 \sup_{v \in B} W_{v - \pi_{k_0}(v)} + 1 \right) \left( \sup_{v_0 \in B_{k_0}} |Z_{v_0} - Z_{\pi_{k_1}(v_0)}| + \sup_{v_1 \in B_{k_1}} |Z_{v_1}| \right).
\end{aligned} \tag{E.41}
\]
Then it remains to upper bound sup_{v∈B} W_{v−π_{k_0}(v)}, sup_{v_0∈B_{k_0}} |Z_{v_0} − Z_{π_{k_1}(v_0)}| and sup_{v_1∈B_{k_1}} |Z_{v_1}|.
First, we upper bound sup_{v∈B} W_{v−π_{k_0}(v)}. It follows from the sub-additivity of W_v that
\[
W_{v - \pi_{k_0}(v)} \le \sum_{k = k_0}^{\infty} W_{\pi_{k+1}(v) - \pi_k(v)}, \qquad \forall v \in B. \tag{E.42}
\]
For each k ≥ k_0, we define the following event
\[
U_1(k) = \left\{ \sup_{v \in B} W_{\pi_{k+1}(v) - \pi_k(v)} \le 8C \sqrt{2^k/\tilde{n}} \cdot d\left( \pi_{k+1}(v), \pi_k(v) \right) \right\},
\]
where the constant C is the same as that in (E.30). Since |B_k| ≤ 2^{2^k}, there are at most 2^{2^k} × 2^{2^{k+1}} ≤ 2^{2^{k+2}}
distinct pairs of (π_{k+1}(v), π_k(v)). Thus, we can take a union bound over all such pairs, combine with (E.30)
and use the fact that 2^k > ñ provided k ≥ k_0 to obtain
\[
\begin{aligned}
\mathbb{P}\left( \overline{U_1(k)} \right)
&\le \sum_{(\pi_{k+1}(v), \pi_k(v))} \mathbb{P}\left( W_{\pi_{k+1}(v) - \pi_k(v)} > 8C \sqrt{2^k/\tilde{n}}\, d\left( \pi_{k+1}(v), \pi_k(v) \right) \right) \\
&\le \sum_{(\pi_{k+1}(v), \pi_k(v))} \mathbb{P}\left( W_{\pi_{k+1}(v) - \pi_k(v)} > C \left( \sqrt{16 \cdot 2^k/\tilde{n}} + 1 \right) d\left( \pi_{k+1}(v) - \pi_k(v), 0 \right) \right) \\
&\le 2^{2^{k+2}} \cdot 2 \exp(-16 \cdot 2^k) \le \exp(-8 \cdot 2^k).
\end{aligned} \tag{E.43}
\]
Under the event ∩_{k ≥ k_0} U_1(k), it follows from (E.37) and (E.42) that
\[
\sup_{v \in B} W_{v - \pi_{k_0}(v)} \le \sum_{k = k_0}^{\infty} \sup_{v \in B} W_{\pi_{k+1}(v) - \pi_k(v)} \le 8C \tilde{n}^{-1/2} \sum_{k = k_0}^{\infty} 2^{k/2}\, d(\pi_{k+1}(v), \pi_k(v)) \le 16C \tilde{n}^{-1/2}\, \gamma_2(B, d). \tag{E.44}
\]

For sup_{v_0∈B_{k_0}} |Z_{v_0} − Z_{π_{k_1}(v_0)}|, we first define the following event for each 0 ≤ k ≤ k_0 − 1,
\[
U_2(k) = \left\{ \sup_{v \in B} \left| Z_{\pi_{k+1}(v)} - Z_{\pi_k(v)} \right| \le 40 C \sigma_x\, d(\pi_{k+1}(v), \pi_k(v)) \sqrt{2^k/\tilde{n}} \right\},
\]
where the constant C is the same as that in (E.28). Since |B_k| ≤ 2^{2^k}, there are at most 2^{2^k} × 2^{2^{k+1}} ≤ 2^{2^{k+2}}
distinct pairs of (π_{k+1}(v), π_k(v)). Thus, we can take a union bound over all such pairs, combine with (E.28)
and use the fact that 2^k ≤ ñ provided k ≤ k_0 − 1 to obtain
\[
\begin{aligned}
\mathbb{P}\left( \overline{U_2(k)} \right)
&\le \sum_{(\pi_{k+1}(v), \pi_k(v))} \mathbb{P}\left( \left| Z_{\pi_{k+1}(v)} - Z_{\pi_k(v)} \right| > C \cdot 2\sigma_x\, d(\pi_{k+1}(v), \pi_k(v)) \cdot 20 \sqrt{2^k/\tilde{n}} \right) \\
&\le \sum_{(\pi_{k+1}(v), \pi_k(v))} \mathbb{P}\left( \left| Z_{\pi_{k+1}(v)} - Z_{\pi_k(v)} \right| > C\, d\left( \pi_{k+1}(v), -\pi_k(v) \right) d(\pi_{k+1}(v), \pi_k(v)) \left( \sqrt{(16 \cdot 2^k)/\tilde{n}} + (16 \cdot 2^k)/\tilde{n} \right) \right) \\
&\le 2^{2^{k+2}} \cdot 2 \exp(-16 \cdot 2^k) \le \exp(-8 \cdot 2^k).
\end{aligned} \tag{E.45}
\]
From the triangle inequality, under the event ∩_{k = k_1}^{k_0 - 1} U_2(k), we have
\[
\begin{aligned}
\sup_{v_0 \in B_{k_0}} |Z_{v_0} - Z_{\pi_{k_1}(v_0)}| &\le \sup_{v \in B} \sum_{k = k_1}^{k_0 - 1} \left| Z_{\pi_{k+1}(v)} - Z_{\pi_k(v)} \right| \\
&\le 40 C \sigma_x \tilde{n}^{-1/2} \sum_{k = k_1}^{k_0 - 1} 2^{k/2}\, d(\pi_{k+1}(v), \pi_k(v)) \\
&\le 40 C \sigma_x \tilde{n}^{-1/2} \cdot 2 \gamma_2(B, d).
\end{aligned} \tag{E.46}
\]

For sup_{v_1∈B_{k_1}} |Z_{v_1}|, we define the following event for each 0 ≤ k ≤ k_0 − 1,
\[
U_3(k) = \left\{ \sup_{v \in B} \left| Z_{\pi_k(v)} \right| \le 32 C \sigma_x^2 \sqrt{2^k/\tilde{n}} \right\}, \tag{E.47}
\]
where the constant C is the same as that in (E.29). We take a union bound over all elements in B_k, combine
with (E.29) and use the fact that 2^k ≤ ñ to obtain
\[
\begin{aligned}
\mathbb{P}\left( \overline{U_3(k)} \right) &\le \sum_{\pi_k(v)} \mathbb{P}\left( \left| Z_{\pi_k(v)} \right| > 32 C \sigma_x^2 \sqrt{2^k/\tilde{n}} \right) \\
&\le \sum_{\pi_k(v)} \mathbb{P}\left( \left| Z_{\pi_k(v)} \right| > C \sigma_x^2 \left( \sqrt{(16 \cdot 2^k)/\tilde{n}} + (16 \cdot 2^k)/\tilde{n} \right) \right) \\
&\le 2^{2^k} \cdot 2 \exp(-16 \cdot 2^k) \le \exp(-8 \cdot 2^k).
\end{aligned} \tag{E.48}
\]


Now we choose k_1 such that ñ/(2^{23} C² σ_x^4) ≤ 2^{k_1} < ñ/(2^{22} C² σ_x^4). Then, combining with (E.44), (E.46)
and (E.47), there exists a universal constant C′ such that, provided ñ^{1/2} ≥ C′ σ_x γ2(B, d), the following holds
under the event (∩_{k=k_0}^{∞} U_1(k)) ∩ (∩_{k=k_1}^{k_0−1} U_2(k)) ∩ U_3(k_1):
\[
\begin{aligned}
\sup_{v \in B} |Z_v| &\le \sup_{v \in B} W_{v - \pi_{k_0}(v)}^2 + 2 \sup_{v \in B} W_{v - \pi_{k_0}(v)} \\
&\qquad + \left( 2 \sup_{v \in B} W_{v - \pi_{k_0}(v)} + 1 \right) \left( \sup_{v_0 \in B_{k_0}} |Z_{v_0} - Z_{\pi_{k_1}(v_0)}| + \sup_{v_1 \in B_{k_1}} |Z_{v_1}| \right) \\
&\le \left( \frac{1}{64} \right)^2 + 2 \cdot \frac{1}{64} + \left( 2 \cdot \frac{1}{64} + 1 \right) \left( \frac{1}{64} + \frac{1}{64} \right) < \frac{1}{2}.
\end{aligned}
\]

Therefore, combining this with (E.43), (E.45) and (E.48), we can conclude that the event
\[
\left\{ \sup_{v \in B} |Z_v| < \frac{1}{2} \right\}
\]
occurs with probability at least
\[
\begin{aligned}
1 - \sum_{k = k_0}^{\infty} \mathbb{P}\left( \overline{U_1(k)} \right) - \sum_{k = k_1}^{k_0 - 1} \mathbb{P}\left( \overline{U_2(k)} \right) - \mathbb{P}\left( \overline{U_3(k_1)} \right)
&\ge 1 - \sum_{k = k_0}^{\infty} \exp(-8 \cdot 2^k) - \sum_{k = k_1}^{k_0 - 1} \exp(-8 \cdot 2^k) - \exp(-8 \cdot 2^{k_1}) \\
&\ge 1 - \exp(-4 \cdot 2^{k_0}) - \exp(-4 \cdot 2^{k_1}) - \exp(-8 \cdot 2^{k_1}) \\
&\ge 1 - 3 \exp(-4 \cdot 2^{k_1}) \\
&\ge 1 - 3 \exp\left( -\tilde{n}/(C' \sigma_x^4) \right).
\end{aligned}
\]

With these results, we are ready to prove Lemma E.6.


Proof of Lemma E.6. Combining the results in Step 2 and Step 3, we can conclude that if ñ ≥ Cσ_x^4 (1 + α)² κ^{-1} s log d,
i.e., s ≤ C^{-1} (1 + α)^{-2} σ_x^{-4} κ · n|E|/(b log d) where C is a universal constant, then
\[
\mathbb{P}\left( \inf_{v \in B} Z_v + 1 \ge \frac{1}{2} \right) \ge \mathbb{P}\left( \sup_{v \in B} |Z_v| < 1/2 \right) \ge 1 - 3 \exp(-\tilde{n}/(C\sigma_x)^4).
\]
Therefore, with probability at least 1 − 3 exp(−ñ/(Cσ_x)^4), the following holds for all θ ∈ Θ \ {0}:
\[
\frac{1}{|E|} \sum_{e \in E} \widehat{\mathbb{E}}\left[ |\theta^\top X^{(e)}|^2 \right] = Z_\theta + \theta^\top \Sigma \theta = \|\theta\|_\Sigma^2 \left( Z_{\theta/\|\theta\|_\Sigma} + 1 \right) \ge 0.5 \|\theta\|_\Sigma^2 \ge 0.5 \kappa \|\theta\|_2^2.
\]
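As an illustrative sanity check of this conclusion (not part of the proof), the short simulation below empirically verifies that the pooled empirical second moment dominates 0.5∥θ∥²_Σ over randomly drawn sparse directions θ. All names, dimensions, and the data-generating choices are assumptions made purely for illustration.

```python
# Illustrative simulation only: checks that (1/|E|) * sum_e Ehat[(theta' X^(e))^2]
# >= 0.5 * theta' Sigma theta for random s-sparse directions theta, where Sigma
# is the average of the (assumed) environment covariances.
import numpy as np

rng = np.random.default_rng(0)
n, d, num_env, s = 500, 50, 3, 5

# Hypothetical environment covariances; Sigma is their average.
Sigmas = []
for _ in range(num_env):
    A = rng.normal(size=(d, d))
    Sigmas.append(np.eye(d) + A @ A.T / d)
Sigma = sum(Sigmas) / num_env

# Pooled empirical second-moment matrix over all environments.
G = np.zeros((d, d))
for S_e in Sigmas:
    X = rng.multivariate_normal(np.zeros(d), S_e, size=n)
    G += X.T @ X / (n * num_env)

# Random s-sparse directions (members of the cone Theta with alpha = 0).
worst = np.inf
for _ in range(2000):
    theta = np.zeros(d)
    idx = rng.choice(d, size=s, replace=False)
    theta[idx] = rng.normal(size=s)
    worst = min(worst, (theta @ G @ theta) / (theta @ Sigma @ theta))

print(f"smallest ratio over sampled sparse directions: {worst:.3f}")  # typically well above 0.5
```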

E.10 Proof of Lemma E.3
 
The R.H.S. of the inequality follows from the fact that the augmented covariance matrix
\(\begin{pmatrix} \Sigma & u \\ u^\top & \sigma_y^2 \end{pmatrix}\)
is positive semi-definite, and thus σ_y² − u^⊤Σ^{-1}u ≥ 0, i.e., ∥Σ^{1/2}β̄∥_2 ≤ σ_y for β̄ = Σ^{-1}u. For the L.H.S., we
apply a proof-by-contradiction argument. To be specific, we will show that if ∥Σ^{1/2}β^{k,γ}∥_2 > ∥Σ^{1/2}β̄∥_2, then
β^{k,γ} will not be the unique minimizer of Q^{k,γ}(β), which is contrary to the claim in Theorem 3.2. To see this, let
\[
\tilde{\beta} = \mathop{\mathrm{argmin}}_{\beta = t \cdot \beta^{k,\gamma},\, t \in \mathbb{R}} \|\Sigma^{1/2}(\beta - \bar{\beta})\|_2 \quad \text{with } \bar{\beta} = \Sigma^{-1} u. \tag{E.49}
\]
Observe that β̃ is the projection of β̄ onto the subspace {t · β^{k,γ} : t ∈ R} with respect to the ∥Σ^{1/2}·∥_2 norm, which
implies that
\[
(\tilde{\beta} - \bar{\beta})^\top \Sigma v = 0, \qquad \forall v \in \{ t \cdot \beta^{k,\gamma} : t \in \mathbb{R} \}. \tag{E.50}
\]
Then we can obtain that
\[
\|\Sigma^{1/2}\tilde{\beta}\|_2
= \frac{\tilde{\beta}^\top \Sigma \tilde{\beta}}{\|\Sigma^{1/2}\tilde{\beta}\|_2}
\overset{(a)}{=} \frac{\bar{\beta}^\top \Sigma \tilde{\beta}}{\|\Sigma^{1/2}\tilde{\beta}\|_2}
\overset{(b)}{\le} \frac{\|\Sigma^{1/2}\bar{\beta}\|_2 \, \|\Sigma^{1/2}\tilde{\beta}\|_2}{\|\Sigma^{1/2}\tilde{\beta}\|_2}
= \|\Sigma^{1/2}\bar{\beta}\|_2
\overset{(c)}{<} \|\Sigma^{1/2}\beta^{k,\gamma}\|_2,
\]
which means β̃ = t̃ · β^{k,γ} with |t̃| < 1 because λ_min(Σ) > 0. Here in (a) we set v = β̃ in (E.50); (b) follows
from the Cauchy-Schwarz inequality; and (c) follows from our assumption ∥Σ^{1/2}β^{k,γ}∥_2 > ∥Σ^{1/2}β̄∥_2. Therefore,
we have
\[
\begin{aligned}
Q^{k,\gamma}(\tilde{\beta}) - Q^{k,\gamma}(\beta^{k,\gamma})
&= \|\Sigma^{1/2}(\tilde{\beta} - \bar{\beta})\|_2^2 - \|\Sigma^{1/2}(\beta^{k,\gamma} - \bar{\beta})\|_2^2 + \gamma \sum_{j=1}^{n} \left( |\tilde{\beta}_j| - |\beta^{k,\gamma}_j| \right) w^k(j) \\
&\overset{(a)}{\le} 0 + \gamma \sum_{j=1}^{n} \left( |\tilde{\beta}_j| - |\beta^{k,\gamma}_j| \right) w^k(j) \overset{(b)}{<} 0,
\end{aligned}
\]
where (a) follows from the minimization program in (E.49), and (b) follows from β̃ = t̃ · β^{k,γ} with |t̃| < 1. This
is contrary to the fact that β^{k,γ} uniquely minimizes Q^{k,γ}(β). Then we can conclude that ∥Σ^{1/2}β^{k,γ}∥_2 ≤
∥Σ^{1/2}β̄∥_2 ≤ σ_y.

F Implementation Details and Omitted Results in Experiments


In this section we elaborate more on the implementation details.

F.1 Pre-Processing in Climate Dynamic Prediction


For Climate Dynamic Prediction, we follow the approach of Runge et al. (2015) and conduct preprocessing
as follows. For each of the four tasks, we perform the cosine transform on the gridded data. Specifically, for a
measurement x at a grid point with latitude ϕ ∈ [−π/2, π/2], we apply the following transformation:
\[
x_{\cos} = x \cdot \sqrt{\cos(\phi)}.
\]
The cosine transform compensates for the varying areas that grids at different latitudes represent, helping to
avoid over-compression or over-amplification of grids at higher latitudes. Next, we estimate the covariance
matrix on the training data and compute the eigenvectors, which are then rotated using the Varimax (Kaiser,
1958; Vejmelka et al., 2015) criterion. We select the N = 60 most significant components based on a comparison of
the eigenvalues of the original data with those of surrogate data that only represent the autocorrelation
structure. Finally, for each task, the component weight matrix computed from the training dataset is
multiplied with the cosine-transformed daily gridded time series. The resulting product is then normalized to
have zero mean and unit variance based on the training data.
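To make the pipeline concrete, the following is a minimal sketch of this pre-processing in Python; it is not the authors' code, the array names `grid_data_train`, `grid_data` (time × grid points), and `lat` (latitude per grid point, in radians) are hypothetical, and the surrogate-data component selection is abbreviated to simply keeping the top N = 60 rotated components.

```python
# Illustrative sketch (assumed names, not the paper's implementation): cosine-latitude
# weighting, PCA on the training covariance, Varimax rotation, projection, and
# standardization with training statistics.
import numpy as np


def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Kaiser's varimax rotation of a (p x k) loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        tmp = L ** 3 - (gamma / p) * L @ np.diag(np.sum(L ** 2, axis=0))
        u, s, vt = np.linalg.svd(loadings.T @ tmp)
        R = u @ vt
        var_new = np.sum(s)
        if var_new - var_old < tol * var_new:
            break
        var_old = var_new
    return loadings @ R


def preprocess(grid_data_train, grid_data, lat, n_components=60):
    # Area weighting: x_cos = x * sqrt(cos(latitude)), one weight per grid point.
    w = np.sqrt(np.cos(lat))
    train_cos = grid_data_train * w          # (T_train, p)
    all_cos = grid_data * w                  # (T, p)

    # Eigenvectors of the training covariance, rotated by the Varimax criterion.
    cov = np.cov(train_cos, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:n_components]
    components = varimax(eigvecs[:, top])    # (p, n_components)

    # Project onto the rotated components and standardize with training statistics.
    scores_train = train_cos @ components
    mu, sd = scores_train.mean(axis=0), scores_train.std(axis=0)
    return (all_cos @ components - mu) / sd
```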

F.2 Construction of the Target Variables in Climate Dynamic Prediction
For each task a ∈ {air, csulf, slp, pres}, we tentatively regress Z_{t,j} on X_t for each j ∈ [60] using all
training data and evaluate the R², which is defined as
\[
1 - \frac{\sum_{(X_t, Z_{t,j}) \in D_1 \cup D_2} \left( Z_{t,j} - \widehat{Y}(X_t) \right)^2}{\sum_{(X_t, Z_{t,j}) \in D_1 \cup D_2} (Z_{t,j})^2}.
\]
We add j to the set of target variables Y if the R² exceeds a predefined threshold. We set the threshold to 0.75
for air and csulf, and 0.9 for pres and slp. The selected target variables are shown in Table 3. We do
so because we only consider target variables that are strongly correlated with the explanatory variables. We
use the same hyper-parameters when predicting multiple targets, while different tasks do not share the same
hyper-parameters.
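The selection rule can be sketched as follows; the array names `X_train` and `Z_train` are assumptions for illustration, not from the paper, and the least-squares fit stands in for whatever regression procedure is used.

```python
# Hedged sketch of the target-selection rule: keep component j whenever the R^2
# defined in the display above, computed on the pooled training data, exceeds
# the task-specific threshold.
import numpy as np
from sklearn.linear_model import LinearRegression


def select_targets(X_train, Z_train, threshold):
    selected = []
    for j in range(Z_train.shape[1]):
        z = Z_train[:, j]
        pred = LinearRegression().fit(X_train, z).predict(X_train)
        # The denominator uses the raw second moment of Z, matching the display.
        r2 = 1.0 - np.sum((z - pred) ** 2) / np.sum(z ** 2)
        if r2 > threshold:
            selected.append(j)
    return selected


# e.g., threshold = 0.75 for air and csulf, 0.9 for pres and slp
```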

F.3 The Procedure of Applying PCMCI+ or Granger Causality


When applying PCMCI+ or Granger causality in Section 4.1 and Section 4.2, we perform the analysis on
the entire training data and use a significance level of α = 0.01. We fix the set of selected covariates and
then conduct 100 random trials, with the L2 regularization parameter set to 0.1.
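For concreteness, the Granger-causality screening step can be sketched as below, assuming the statsmodels implementation of the test; `series` (a T × N array of component time series), `target_idx`, and `max_lag` are hypothetical placeholders, and the analogous PCMCI+ screening (run at the same significance level) is omitted here.

```python
# Minimal, assumption-laden sketch of the Granger screening step using statsmodels:
# for each candidate j, test whether series j Granger-causes the target series at
# significance level alpha = 0.01 and keep the significant candidates.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests


def granger_screen(series, target_idx, max_lag=1, alpha=0.01):
    selected = []
    for j in range(series.shape[1]):
        if j == target_idx:
            continue
        # Column order matters: the test asks whether the 2nd column causes the 1st.
        pair = np.column_stack([series[:, target_idx], series[:, j]])
        result = grangercausalitytests(pair, maxlag=max_lag)
        p_value = result[max_lag][0]["ssr_ftest"][1]  # p-value of the F test
        if p_value < alpha:
            selected.append(j)
    return selected
```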

Data    Target Variables
air     1, 2, 6, 9, 13, 15, 19, 20, 21, 23, 24, 27, 31, 33, 37, 38, 40, 47, 48, 49, 54, 55, 58
csulf   1, 2, 8, 17, 25, 27, 45, 53
pres    1, 3, 7, 8, 13, 28, 32, 33, 45, 46, 50, 53, 54, 55, 56
slp     1, 2, 6, 7, 10, 14, 16, 17, 20, 22, 28, 30, 33, 36, 42, 45, 49, 54, 55, 58

Table 3: Selected target variables for the four tasks air temperature (air), clear sky upward solar flux (csulf), surface pressure
(pres) and sea level pressure (slp) over 100 replications.

F.4 Comparison of Different k


In this section we show the performance of varying k ∈ {1, 2, 3} for our method, invariance-guided regularization
(IGR), in the same settings as Section 4.1 and Section 4.2. The results presented in Table 4 and Table 5
indicate that the performance across different k is similar.

Data AMT SPG


k=1 0.135 ± 0.068 0.036 ± 0.034
k=2 0.131 ± 0.074 0.048 ± 0.039
k=3 0.129 ± 0.094 0.051 ± 0.040

Table 4: The average ± standard deviation of the worst-case out-of-sample R2 (4.2) for predicting the stocks AMT and SPG using
IGR with different k.

Data air csulf pres slp


k=1 3.7882 ± 0.3416 2.0431 ± 0.0673 1.5892 ± 0.1952 3.0392 ± 0.2569
k=2 3.7838 ± 0.3281 2.0523 ± 0.0883 1.6077 ± 0.1122 3.0466 ± 0.1955
k=3 3.7652 ± 0.4016 2.0637 ± 0.0415 1.6004 ± 0.1042 3.0379 ± 0.2157

Table 5: The average ± standard deviation of the mean squared error (4.3) of the four tasks air temperature (air), clear sky
upward solar flux (csulf), surface pressure (pres) and sea level pressure (slp) using IGR with different k.

F.5 Causal Relation Identified by Our Method in Climate Dynamic Data
To qualitatively evaluate our method for causal discovery, we present the paths identified by our approach
among six regions (No. 20, 23, 38, 40, 48, and 49) in the air temperature task (air) in Fig. 3. In particular,
the causal path from the Arabian Sea (No. 38) to the eastern limb of ENSO (No. 40) via the Indian Ocean
(No. 49) is verified by Kumar et al. (1999) and Timmermann et al. (2018). Additionally, the paths between
East Asia (No. 48 and No. 23) and the high surface pressure sector of the Indian Monsoon region (No. 38)
align with the known relationship between the sea surface temperatures of the Indian Ocean and the Asian
Summer Monsoon (Li et al., 2001). These results demonstrate that our method is capable of effectively
identifying causal relationships.

Figure 3: The paths identified by our approach among the six regions (No. 20, 23, 38, 40, 48, and 49) in the air temperature
task (air). The edge colors represent the path coefficients, while the labels indicate the time lags in days.
