Fundamental Computational Limits in Pursuing Invariant Causal Prediction and Invariance-Guided Regularization
Fundamental Computational Limits in Pursuing Invariant Causal Prediction and Invariance-Guided Regularization
Abstract
Pursuing invariant prediction from heterogeneous environments opens the door to learning
causality in a purely data-driven way and has several applications in causal discovery and robust
transfer learning. However, existing methods such as ICP (Peters et al., 2016) and EILLS
(Fan et al., 2024) that can attain sample-efficient estimation are based on exponential time
algorithms. In this paper, we show that such a problem is intrinsically hard in computation:
the decision problem, testing whether a non-trivial prediction-invariant solution exists across two
environments, is NP-hard even for the linear causal relationship. In the world where P̸=NP, our
results imply that the estimation error rate can be arbitrarily slow using any computationally
efficient algorithm. This suggests that pursuing causality is fundamentally harder than detecting
associations when no prior assumption is pre-offered.
Given there is almost no hope of computational improvement under the worst case, this paper
proposes a method capable of attaining both computationally and statistically efficient estimation
under additional conditions. Furthermore, our estimator is a distributionally robust estimator
with an ellipse-shaped uncertain set where more uncertainty is placed on spurious directions than
invariant directions, resulting in a smooth interpolation between the most predictive solution
and the causal solution by varying the invariance hyper-parameter. Non-asymptotic results and
empirical applications support the claim.
Keywords: Causality, Distributional Robustness, Invariant Prediction, Maximin Effects, NP-hardness,
Parsimonious Reduction.
1 Introduction
How do humans deduce the cause of a target variable from a set of candidate variables when only passive
observations are available? A natural high-level principle is to identify the variables that produce consistent
predictions at different times, locations, experimental conditions, or more generally, across various environments.
This heuristic is implemented in statistical learning by seeking invariant predictions from diverse environments
(Peters et al., 2016; Heinze-Deml et al., 2018; Fan et al., 2024; Gu et al., 2024). This approach goes beyond
just learning associations in the recognition hierarchy (Bareinboim et al., 2022) and enables the discovery of
certain data-driven causal relationships without prior causal assumptions. However, existing methods that
realize general invariance learning rely on explicit or implicit exhaustive searches, which are computationally
inefficient. This raises the question of whether learning invariant predictions is fundamentally hard. This
paper contributes to understanding the fundamental limits and introducing a novel relaxed estimator for
invariance learning. Theoretically, we prove this problem is intrinsically hard using a reduction argument
(Karp, 1972) with novel constructions. Our theoretical message further implies that learning data-driven
causality is fundamentally harder than detecting associations. On the methodological side, we propose a
relaxation in two aspects: our approach balances computational efficiency and statistical accuracy on one
hand while optimizing trade-offs between prediction power and robustness on the other.
∗ Supported by NSF Grants DMS-2210833 and DMS-2412029.
1
1.1 Pursuit of Linear Invariant Predictions
Suppose we are interested in pursuing the linear invariant relationship between the response variable Y ∈ R
and explanatory covariate X ∈ Rd using data from multiple sources/environments. Let E be the set of
(e) (e)
environments. For each environment e ∈ E, we observe n data {(Xi , Yi )}ni=1 that are i.i.d. drawn from
(e) (e) (e)
some distribution (X , Y ) ∼ µ satisfying
(e) (e)
Y (e) = (βS⋆ ⋆ )⊤ XS ⋆ + ε(e) with E[XS ⋆ ε(e) ] ≡ 0 (1.1)
where β ⋆ is the true parameter that is invariant across different environment and S ⋆ = supp(β ⋆ ) denotes
the support of β ⋆ , while the distribution of µ(e) may vary across environments. Here we assume different
environments have the same sample size n for presentation simplicity. The goal is to recover S ⋆ and β ⋆ using
(e) (e)
the observed data DE = {(Xi , Yi )}i∈[n],e∈E .
Methods inferring the invariant set S ⋆ from (1.1) can be applied to causal discovery under the structural
causal model (SCM) (Glymour et al., 2016) framework. This is because when observing environments
where interventions are applied within the covariates X, S ⋆ = {j : Xj is direct cause of Y } satisfies (1.1)
and is unique in some sense when the intervention is non-degenerate and enough (Peters et al., 2016; Gu
et al., 2024); see the discussion in Section 1.4. Though initially motivated by causal discovery under the SCM
framework that may be sensitive to model misspecification, pursuing invariant predictions from heterogeneous
environments itself is a much more generic principle in statistical learning, or a type of inductive bias in
causality (Bühlmann, 2020; Gu et al., 2024), that can also facilitate, for example, robust transfer learning
(Rojas-Carulla et al., 2018), prediction fairness among sub-populations (Hébert-Johnson et al., 2018), and
out-of-distribution generalization (Arjovsky et al., 2019).
Unlike the standard linear regression under which each variable Xj is either truly important (j ∈ S ⋆ ) or
exogenously spurious (Fan & Zhou, 2016) (j ∈ / S ⋆ but E[Xj ε] = E[Xj (Y − X ⊤ β ⋆ )] = 0), the set of candidate
variables in (1.1) can be decomposed into three groups:
Spurious Variables (S ⋆ )c
[z }|[ {
⋆ ⋆ E ⋆ E
{1, . . . , d} = S j∈
/ S : Cov (ε, Xj ) ̸= 0 j∈
/ S : Cov (ε, Xj ) = 0 , (1.2)
| {z } | {z }
Endogenously Spurious Variables G:= Exogeneously Spurious Variables
(e) (e)
where CovE (ε, Xj ) := |E|1
P
e∈E E[ε Xj ] is the pooled covariance between the noise and the covariate
Xj across different environments. The major difference compared with standard linear regression and the
main difficulty behind such an estimation problem is the presence of endogenously spurious variables G
(Fan & Liao, 2014). The exogenously spurious variable is one that lacks predictive power for the noise
ε = Y − XS⊤⋆ βS⋆ ⋆ in population and only increases the estimation error by n−1/2 if it is falsely included. It
usually does not cause the bias of estimation but inflates slightly the variance. In contrast, endogenously
spurious variables contribute to predicting the noise; thus, the false inclusion of any such variable results
in inconsistent estimation due to the biases they create. An illustrative example is to classify whether the
object in an image is a cow (Y = 0) or camel (Y = 1) using three extracted features X1 =body shape,
X2 =background color, and X3 = temperature or time that the photo is taken. Here S ⋆ = {1} is the invariant
and causal feature, while G = {2} helps predict the noise ε, since cows (resp. camels) usually appear on
green grass (resp. yellow sand) in the data collected. X3 is exogeneously spurious: including it does not
increase estimation bias but slight variance. From a statistical viewpoint, the core difficulty is to distinguish
whether a variable is truly important, or endogenously spurious among those statistically significant variables
that contribute to predicting Y . This is where multi-environment comes into play. There is a considerable
literature on estimating the parameter β ⋆ in (1.1) (Peters et al., 2016; Rothenhäusler et al., 2019, 2021;
Pfister et al., 2019; Arjovsky et al., 2019; Yin et al., 2021).
Fan et al. (2024) first realized sample-efficient estimation for the general model (1.1) and offered a
comprehensive non-asymptotic analysis in terms of both n and E, this idea is further extended to the
fully non-parametric setting in Gu et al. (2024). Specifically, it shows that given data from finitely many
2
environments |E| < ∞, one can identify S ⋆ with n = ∞ under the minimal identification condition:
′
∀S ⊆ [d] with S ∩ G ̸= ∅ =⇒ ∃e, e′ ∈ E, β (e,S) ̸= β (e ,S) (1.3)
where β (e,S) := argminsupp(β)⊆S E(X,Y )∼µ(e) [|Y − β ⊤ X|2 ]. This requires that S ⋆ is the maximum set that
preserves the invariance structure in that incorporating any endogenously spurious variables in G will result
in shifts in predictions across E. Turning to the empirical counterpart, the optimal rate for linear regression
can be attained therein using their proposed environment invariant linear least squares (EILLS) estimator.
This implies that as long as β ⋆ can be identified under finitely many environments, unveiling the data-
driven causality parameter β ⋆ in (1.1) is as statistically efficient as estimating the association counterpart
in standard linear regression.
Promising through the above progress, the invariance pursuit procedure has two drawbacks. The first is
about the computational burden. The estimation error is only guaranteed for the global minimizer of the
objective function in Fan et al. (2024) and Gu et al. (2024). An exponential-in-d algorithm is adopted to
find the global minimizer of the objective function that Fan et al. (2024) proposes. Though the Gumbel trick
introduced by Gu et al. (2024) allows variants of gradient descent algorithm to perform well in practice, the
nonconvexity nature is still kept and there are no theoretical guarantees on the optimization.
The second is that the invariant model is typically conservative in its predictive performance for a
new environment. Though it finds the “maximum” invariant set, the invariant prediction model will
eliminate the endogenously spurious variables that result in heterogeneous predictions in E. This may
result in conservativeness in prediction with the help of the endogenous variables, which is the best for the
adversarial environment but is not so for the prediction environment of interest. In the aforementioned
cow-camel classification task, suppose r1 = 95% cows (resp. camels) appear on grass (resp. sand) in the
first environment e = 1 and the spurious ratio is r2 = 70% in environment e = 2. In this case, an invariant
prediction model drops the background color X2 due to its variability across environments. In general, a
prediction model without X2 is intuitively the best when r = 0%, yet potentially reduces predictive power
compared to the ones including X2 when evaluated in an environment with r > 50%.
The above discussion gives rise naturally to the following two questions, which will be addressed in this
paper.
A. What does the formula below evaluate? (a) True (b) False
(True ∧ True) ∨ False ∧ (True ∨ ¬True) ∧ (¬False ∨ ¬(True ∧ True)) (1.4)
B. Can we choose v1 , v2 , v3 , v4 in {True, False} to make the result of the formula as True? (a)
Yes (b) No
(v1 ∧ v4 ) ∨ v2 ∧ (v3 ∨ ¬v4 ) ∧ (¬v2 ∨ ¬(v1 ∧ v3 )) (1.5)
3
The latter question is an instance of the circuit satisfiability (CircuitSAT) problem (Karp, 1972). The
answers to both questions are (a), and (1.4) offers the unique valid solution to (1.5) as (v1 , v2 , v3 , v4 ) =
(True, False, True, True).
From an intuitive perspective, we argue that the relationship between “finding the best linear predictor”
and “finding any non-trivial invariant (causal) prediction” shares some similarities with the relationship
between the two questions posed above. While both scenarios involve the same setting, that is “boolean
formula” for the second pair and “linear model” for the first pair, and may potentially yield the same
solution, their computation complexities and hierarchy in recognition tasks differ significantly. The former
ones only involve simple arithmetic calculations, are straightforward in thought, and can be solved quickly.
In contrast, the latter ones will suffer from inevitable brute force attempts, require complicated reasoning,
and necessitate a potentially larger time budget. The latter tasks involve reasoning using the information
extracted from the corresponding former perception tasks.
Formally, consider the testing problem ExistsLIS-2 using population-level quantities.
Problem 1.1 (ExistsLIS-2). Consider the case of |E| = 2. Given the positive definite covariance matrices
Σ(1) , Σ(2) ∈ Rd×d with Σ(e) = E[X (e) (X (e) )⊤ ] and the covariance vectors u(1) , u(2) ∈ Rd with u(e) =
E[X (e) Y (e) ], it asks whether it is possible to find a non-empty prediction-invariant set S ⊆ [d] such that
β (1,S) = β (2,S) ̸= 0. Here β (e,S) is defined in (1.3) and can be arithmetically calculated as β (e,S) =
(e) (e)
[(ΣS )−1 uS , 0S c ] provided Σ(e) is positive definite thus invertible.
Problem 1.1 simplifies the original linear invariance pursuit problem, i.e., estimating β ⋆ or S ⋆ in (1.1),
in several aspects: we consider only two heterogeneous environments to identify β ⋆ when G ̸= ∅, and it only
checks the existence of solution.
As the answer to Q1 in Section 1.1, this paper shows that the aforementioned simplified ExistsLIS-2
problem is NP-hard, which is essentially the same as the problem CircuitSat with an instance example
(1.5). Furthermore, the NP-hardness is not because of the existence of exponentially many possible invariant
solutions, it remains when β ⋆ is identifiable by (1.3). Many problems are classified as NP-hard, other
examples include 3Sat, MaxClique, Partition (Erickson, 2023). The Cook–Levin theorem (Karp, 1972)
states that if there exists a polynomial time algorithm to solve any NP-hard problem, then P=NP, meaning all
the N(ondeterministic-)P(olynomial-time) problems, which is verifiable in polynomial time, are P(olynoimal-
time) problems that are solvable in polynomial time. It is suspected, but is still a conjecture (Bovet et al.,
1994; Fortnow, 2021), that P̸=NP. This implies it is unlikely that there exists any polynomial-time algorithms
for NP-hard problems. This paper proves the NP-hardness of ExistsLIS-2 problem and an easier problem
with constraint (1.3) by constructing a parsimonious polynomial-time reduction from the 3Sat problem, a
simplification of CircuitSat, to our ExistsLIS-2 problem. See the formal definition of NP-hardness and
reduction in Section 2.
In many statistical problems, though attaining correct variable selection suffers from computational
barriers, it is possible to construct a computationally efficient and accurate estimator of the continuous
parameters of interest. For example, as a convex relaxation of L0 regularized least squares, L1 regularized
least squares can obtain n−1/2 (Bickel et al., 2009) prediction error rate in general and match the same
optimal n−1 rate under the additional yet mild restricted eigenvalue (RE) condition (Candes & Tao, 2007)1 .
On the other hand, compared with L0 (Zhang & Zhang, 2012) penalty, L1 penalty requires a much more
restrictive, usually impossible (Fan & Li, 2001; Zou, 2006), condition to attain variable selection consistency
(Zhao & Yu, 2006; Meinshausen & Bühlmann, 2006). It is natural to ask if obtaining a reasonable prediction
error using a computationally efficient algorithm is possible in finding invariant predictions. Our result also
says “No” if P̸=NP.
In summary, this paper proves that consistent variable selection and reasonable prediction error in
finding invariant predictions are NP-hard. In the world of P̸=NP, this establishes a dilemma between
computational and statistical tractability for the invariance pursuit problem, and such an impossibility
result has implications for several fields and questions.
1 The RE condition can be relaxed by the restricted strong convexity condition. In this case, if the covariate is zero-mean
Gaussian (Raskutti et al., 2010) or sub-Gaussian (Rudelson & Zhou, 2013), optimal estimation error can be obtained by L1
regularization when |supp(β ⋆ )| log p = o(n) provided the curvature is bounded from below, i.e., λmin (E[XX ⊤ ]) ≳ 1.
4
(a) It has long been hypothesized that there may exist some intrinsic computation barrier in finding
invariant solutions given that the problem has a combinatorial formulation and all the existing provable
sample-efficient methods use exhaustive search explicitly or implicitly. It is still open whether finding
an invariant solution is fundamentally hard or can be solved by a (still not discovered) computationally
efficient algorithm. We offer a definite pessimistic answer to this.
(b) Our established dilemma above shows that pursuing invariance is fundamentally harder than pursuing
sparsity. The latter can guarantee a decent prediction error using computationally efficient algorithms
under a mild assumption that does not hurt the generality of the problem, and the corresponding
estimation error will decrease when we keep increasing n. However, these no longer apply to the
former. Thus, the relaxation tricks used in the sparsity pursuit like L1 regularization may not be a
good fit, and potentially new relaxation techniques should be introduced to pursue invariance.
It regularizes the pooled least squares using pre-calculated weighted L1 penalty, where the adaptive, data-
driven weight wkE (j) on |βj | is the upper bound of the prediction variations across environments E when
incorporating variable xj ; see the details in Section 3. Here γ is the hyper-parameter that trades off predictive
power and robustness against spurious signals, and k is the hyper-parameter that controls the computation
budget through wkE (j). The key features of our proposed estimator are as follows.
(a) For the computation concern, our proposed estimator provably attains the causal identification, i.e.,
β k,γ = β ⋆ for large enough γ, by paying affordable computation cost (small k) under some unknown
low-dimensional structure among the variables. On the other hand, by increasing the computation
budget k to p, our proposal achieves the causal identification under the same assumptions as those in
EILLS (Fan et al., 2024).
(b) The estimator reaches the goal in Q2 by tuning γ. When causal identification is attained in (a), it
leads to a continuous solution path interpolating the pooled least squares solution with γ = 0 and
the causal solution β ⋆ with large enough γ. For any fixed γ, it has a certain distributional robustness
interpretation in that β k,γ can be represented as the maximin effects (Meinshausen & Bühlmann, 2015;
Guo, 2024) over some uncertainty set.
5
will stand in between ∅ and S ⋆ , i.e., ∅ ⊆ Sb∞ ⊆ S ⋆ , and easily be collapsed to ∅ in most of the cases when
the interventions are not enough. The idea of penalizing least squares using exact invariance regularizer
(Fan et al., 2024; Gu et al., 2024) will select variables Sb∞ satisfying S ⋆ ⊆ Sb∞ ⊆ S̄ as n = ∞ where S̄
is the Markov blanket of Y , but it will eliminate any of Y ’s child if it is intervened in a non-degenerate
manner. Though causal the solution is, it may lack some predictive power under the circumstances discussed
before Q2. The estimator proposed in this paper leverages the invariance principle as an inductive bias for
“soft” regularization instead of that for “hard” structural equation estimation and can alleviate the lack of
predictive power in this aspect.
There are also attempts to attain both computationally and statistically efficient estimation under (1.1).
For example, Rothenhäusler et al. (2019, 2021) consider the case where the mechanism among all covariate
and response variables (X, Y ) remain unchanged and linear, while the heterogeneity across environments
comes from additive interventions on X. Estimators similar to instrumental variable (IV) regression in
causal identification are proposed. This idea is further extended (Kania & Wit, 2022; Shen et al., 2023), but
can not go beyond circumventing the computation barrier by assumptions similar to IV regression. This is
conceptually the same as least squares that follow the prior untestable assumptions to pinpoint the unique
solution and may suffer from model misspecification. Li & Zhang (2024) studies a similar model with one
additional constraint – the covariance between XS ⋆ remains the same. A seemingly computation-efficient
variable selection method is proposed. However, the additional constraint seems to be superfluous in that it
cannot change the NP-hardness of the problem; see Appendix B.1. Therefore, there is still a gap in attaining
sample-efficient estimation by computation-efficient algorithms under mild assumptions that will not ruin
the prior-knowledge blind nature of invariance pursuit. This paper makes progress in this direction.
There is also a considerable literature on robustifying prediction using the idea of distributionally robust
optimization, which finds a predictor that minimizes the worst-case risk on a set of distributions referred
to as the uncertain set. The uncertain set is typically a (isotropic) sphere in postulated metric centered
on the training distribution. Examples of pre-determined metrics include KL divergence (Bagnell, 2005),
f -divergence (Duchi & Namkoong, 2021) and Wasserstein distance (Mohajerin Esfahani & Kuhn, 2018;
Blanchet et al., 2019). Such a postulated metric is uninformative which leads to a relatively conservative
solution. Our estimator is a distributionally robust estimator with an ellipsoid-shaped uncertainty set. It
assigns minimal uncertainty to invariant (causal) directions while allocating greater uncertainty to spurious
directions, which balances the robustness and power in a better way.
The NP-hardness and the conjecture P̸=NP are used to derive computation barriers in many statistical
problems, mainly about detecting sparse low-dimensional structures in high-dimensional data. For the
sparse linear model, Huo & Ni (2007) shows finding the global minima of L0 penalized least squares is
NP-hard, Chen et al. (2014) shows the NP-hardness holds for any Lq loss and Lp penalty with q ≥ 1 and
p ∈ [0, 1), and Chen et al. (2017) extends it to general convex loss and concave penalty. However, these are
computation barriers tailored to specific algorithms, not the fundamental limits of the problem itself. Zhang
et al. (2014) shows when P̸=NP, in the absence of the restricted eigenvalue condition, any polynomial-time
algorithm can not attain estimation error faster than n−1/2 , which is attained by L1 regularization but is
sub-optimal compared with optimal n−1 error. There is also a considerable literature on deriving statistical
sub-optimality of computationally efficient algorithms using the reduction from the planted clique problem
(Brennan & Bresler, 2019), such as sparse principle component (Berthet & Rigollet, 2013a,b; Wang et al.,
2016), sparse submatrix recovery (Ma & Wu, 2015). However, a reasonable error is still attainable using
computationally efficient alternatives. As discussed above, this is not the case for pursuing invariance as
shown by this paper.
Our Contributions. The main contributions are as follows:
• We establish the fundamental computational limits of finding prediction-invariant solutions in linear
models, which is the first in the literature. Our proof is based on constructing a novel parsimonious
reduction from the 3Sat problem to the ExistLIS-2 problem.
• A simple estimator is proposed to relax the computational budget and exact invariance pursuit using
two hyper-parameters. It allows for provably computational and statistical efficiency estimation of
6
the exact invariant (causal) parameters with mild additional assumptions and also offers flexibility in
trade-offing efficiency and invariance (robustness).
Organization. This paper is organized as follows. In Section 2, we introduce the concept of NP-hardness
and present our main computation barrier result accompanied by the proofs. In Section 3, we propose our
method that relaxes the computation budget and conservativeness, illustrate its distributional robustness
interpretation, and present the corresponding non-asymptotic result. The proofs for the results in Section 3
are deferred to the supplement material. Section 4 collects the real-world application.
Notations. We will use the following notations. Let X ∈ Rd , Y ∈ R be random variables and x, y be
their instances,
Pm respectively. We let [m] = {1, . . . , m}. For a vector z = (z1 , . . . , zm )⊤ ∈ Rm , we let
∥z∥q = ( j=1 |zj |q )1/q with q ∈ [1, ∞) be its ℓq norm, and let ∥z∥∞ = maxj∈[m] |zj |. For given index set
S = {j1 , . . . , j|S| } ⊆ [m] with j1 < · · · < j|S| , we denote [z]S = (zj1 , . . . , zj|S| )⊤ ∈ R|S| and abbreviate it
as zS if there is no ambiguity. We use A ∈ Rn×m to denote a n by m matrix, use AS,T = {ai,j }i∈S,j∈T to
denote a sub-matrix and abbreviate it as AS if √ S = T and n = m. For a d-dimensional vector z and d × d
positive semi-definite matrix A, we let ∥z∥A = z ⊤ Az, and let λmin (A) (resp. λmax (A)) be the minimum
(resp. maximum) eigenvalue of A.
We collect data from multiple environments E. For each environment e ∈ E, we observe n data
(e) (e)
{(Xi , Yi )}ni=1 which are drawn i.i.d. from µ(e) . We denote E[f (X (e) , Y (e) )] = f (x, y)µ(e) (dx, dy) and
R
b (X (e) , Y (e) )] = 1 n f (X (e) , Y (e) ), and define
E[f
P
n i=1 i i
1 X (e) 1 X (e)
Σ(e) = E[X (e) (X (e) )⊤ ], u(e) = E[X (e) Y (e) ], Σ = Σ , u= u . (1.6)
|E| |E|
e∈E e∈E
We assume there is no collinearity, i.e., Σ(e) ≻ 0 such that we can define the population-level best linear
predictor constrained on any set S in each environment e, β (e,S) := argminsupp(β)⊆S E[|Y (e) − β ⊤ X (e) |2 ], and
all the environment, β (S) := argminsupp(β)⊆S e∈E E[|Y (e) − β ⊤ X (e) |2 ]. Let the pooled least squares loss
P
over all the environments be
1 X
RE (β) = E[|Y (e) − β ⊤ X (e) |2 ]. (1.7)
2|E|
e∈E
7
We also consider a potentially easier variant of 3Sat to be used in the section. The problem is potentially
easier than 3Sat because it pursues the same target under additional non-trivial restrictions.
Problem 2.2 (3Sat-Unique). The 3Sat-Unique problem is the same as 3Sat under the promise that the
solution is unique if exists, i.e., X3Sat-Unique = {x ∈ X3Sat , |Sx | ≤ 1}.
We then introduce the idea of reduction and NP-hardness.
Definition 2 (Reduction). We say T : XP → XQ is a deterministic polynomial-time reduction from problem
P to problem Q if there exists some polynomial p such that for all x ∈ XP , (1) T (x) can be calculated on a
deterministic Turing machine with time complexity p(|x|); and (2) T (x) ∈ XQ,1 if and only if x ∈ XP,1 .
We say T : XP → XQ is a randomized polynomial-time reduction (Valiant & Vazirani, 1985) from problem
P to problem Q if there exists some polynomial p such that (1) T (x) can be calculated on a randomized (coin-
flipping) Turning machine with computational complexity p(|x|) for any x ∈ XP ; (2) For all x ∈ XP \ XP,1 ,
T (x) ∈
/ XQ,1 ; (3) For all x ∈ XP,1 , P[T (x) ∈ XQ,1 ] ≥ 1/p(|x|).
Definition 3 (NP-hardness). We say a problem P is NP-hard under deterministic (resp. randomized)
polynomial-time reduction if there exists deterministic (resp. randomized) polynomial-time reduction from
the circuit satisfiability problem (Karp, 1972) to problem P .
The NP-hardness of a problem is widely used to measure the existence of the underlying computational
barrier for the problem; examples in statistics include sparse PCA under particular regime (Berthet &
Rigollet, 2013a,b; Wang et al., 2016), sparse regression (Zhang et al., 2014) without restricted eigenvalue
condition. The underlying reason why an NP-hard problem P is “hard” can be illustrated via the Cook–
Levin theorem (Karp, 1972): the existence of any polynomial-time algorithm for the NP-hard problem under
deterministic polynomial-time reduction will assert P=NP, which implies any NP problem, defined as the
problem whose validness of solution can be verified within polynomial-time, can be solved within polynomial-
time. The NP-hardness under randomized polynomial-time reduction can be understood similarly: the
existence of any polynomial-time algorithm for such a problem implies any NP problem can be solved within
polynomial-time with high probability, that is, for any NP decision problem P , we can design a polynomial-
time randomized algorithm A e such that
∀x ∈ XP \ XP,1 , A(x)
e =0 and ∀x ∈ XP,1 , P[A(x)
e = 1] ≥ 1 − 0.01|x|−100 .
If the conjecture “P̸=NP” holds, then the NP-hardness of a problem naturally implies “there is no polynomial-
time algorithm for the problem”. We introduce the NP-hardness under randomized polynomial-time reduction
to characterize the computation barrier of the linear invariance pursuit under identification condition (1.3).
We have the following result for the above two problems.
Lemma 2.1. The problem 3Sat is NP-hard under deterministic polynomial-time reduction. The problem
3Sat-Unique is NP-hard under randomized polynomial-time reduction.
Proof of Lemma 2.1. The NP-hardness of 3Sat follows from Karp (1972), the proof for the NP-hardness of
3Sat-Unique can be found in Appendix A.4.
8
representing the covariance matrices of X (e) , i.e., Σ(e) = E[X (e) (X (e) )⊤ ], and u(1) , . . . , u(E) be d-dimensional
vectors representing the covariance between X (e) and Y (e) , i.e., u(e) = E[X (e) Y (e) ]. In this case, the
(e,S) (e,S) (e) (e)
population-level least squares solutions can be written as β (e,S) = [βS , 0S c ] with βS = (ΣS )−1 uS
(S) (S) (e) (e)
and β (S) = [βS , 0S c ] with βS = ( e∈E ΣS )−1 ( e∈E uS ).
P P
We define the problem ExistLIS as follows:
[Input] Σ(1) , . . . , Σ(E) and u(1) , . . . , u(E) satisfying the above constraints.
[Output] Returns 1(Yes) if there exists S ⊆ [d] such that β (e,S) ≡ β (S) ̸= 0; otherwise 0(No).
We simplify the original problem, that is, unveiling S ⋆ in (1.1), when n = ∞ from two aspects in
Problem 2.3. Firstly, we only use the first-order linear information rather than the full distribution information
such that the input of the problem is of O(d2 ) when |E| = O(1). The space of ExistLIS can be seen as
a “linear projection” of the space of the problems that recovering S ⋆ in (1.1) provided Σ(e) ≻ 0. Secondly,
we state it as a decision problem rather than a solution-solving problem: it suffices to answer whether a
non-trivial invariant set exists instead of pursuing one. For simplicity in this section, we use the terminology
“invariant set” instead of “linear prediction-invariant set”. We define the concept of the maximum invariant
set to present the same problem under the identification condition (1.3).
Definition 4 (Invariant Set and Maximum Invariant Set). Under the setting of Problem 2.3, we say a set
S̄ is a invariant set if β (e,S̄) ≡ β (S̄) . We say a set S̄ is a maximum invariant set if it is an invariant set and
satisfies
!
(S∪S̄) (S̄) (e,S) (e′ ,S)
∀S ⊆ [d], either β =β or sup ∥β −β ∥2 > 0 . (2.1)
e,e′ ∈[E]
Problem 2.4 (Existence of Linear Invariant Set under Identification). Problem ExistLIS-Ident is defined
as the same problem as ExistLIS with the additional constraint that there exists a maximum invariant set
S†.
Note that S † can be an empty set, under which the corresponding problem instance does not have non-
trivial invariant solutions. Observe that the boolean formula (a∨b) is equivalent to the statement (if ¬a then
b). As required by (2.1), an invariant set S̄ is a maximum invariant set if incorporating any variable that
enhances the prediction performance will lead to shifts in best linear predictions. Therefore, the existence of
the maximum invariant set defined in Definition 4 is just a restatement of the identification condition (1.3),
that is, S̄ is a maximum invariant set if and only if S ⋆ = S̄ satisfies (1.1) and (1.3) simultaneously.
Problem 2.4 is an easier version of the problem of recovering S ⋆ in (1.1) with the identification constraint
(1.3) in population n = ∞. The following example gives an instance of the problem ExistLIS-Ident. This
example also indicates that the maximum invariant set may not be unique, but all the maximum invariant
sets yield the same prediction performance.
Example 2.2 (An Instance of ExistLIS-Ident Problem). Consider an instance with d = 4, E = 2 and
input
√2 √2
1 0 15
0 1 0 5
0 2 2
0 1 √1 0 0 1 √1 0 1 1
(Σ(1) , Σ(2) ) = 2 15 , 2 5 , (u(1) , u(2) ) = 3 , √6 .
q
√15 √115 3
5 0 √
5
√1
5
7
5 0 2 5 5
0 0 0 1 0 0 0 1 0 0
It can be seen as a “linear projection” of the following data-generating process with e = {1, 2} and independent
standard normal random variables ε0 , . . . , ε4 :
(e) (e) (e)
X1 ← ε1 , X 2 ← ε2 , X4 ← ε4 ,
(e) (2)
Y (e) ← 2 · X1 + X2 + ε0 ,
√
(e) ( 3)e−2 Y (e) + ε3
X3 ← √ .
5
9
It is easy to see that the sets ∅, {1}, {2}, {4}, {1, 2}, {1, 4}, {2, 4}, {1, 2}, {1, 2, 4} are all invariant sets, while
the sets {1, 2} and {1, 2, 4} are maximum invariant sets.
From the perspective of a computational problem, the existence of a maximum invariant set offers non-
trivial constraints on the problem and one can construct a model where this condition fails to hold; see
Example 2.3 below. On the other hand, the non-existence of a maximum invariant set rarely happens under
the causal discovery setting. To be specific, under the setting of the structural causal model with intervention
on X, it is known from Theorem 3.1 in Gu et al. (2024) that a maximum invariant set always exists if the
intervention is non-degenerate, which occurs with probability 1 under suitable measure on the intervention.
Example 2.3 (An Instance of ExistLIS that is not ExistLIS-Ident). Consider the model of Example 4.1
in Fan et al. (2024) with s(1) = 1/2 and s(2) = 2, that is, the SCMs in environment e ∈ {1, 2} are
(e)
√
X1 ← 0.5ε1
(e) (e)
√
Y ← X1 + 0.5ε0
(e)
X2 ← 22e−3 Y (e) + ε2
with ε0 , . . . , ε2 are i.i.d. standard Gaussian random variables. It is easy to check that the sets ∅, {1}, {2} are
all invariant sets but none of them satisfies the second constraint (2.1), and the set {1, 2} is not an invariant
set. So there does not exist a maximum invariant set.
Given XExistLIS-Ident ⊊ XExistLIS , ExistLIS may be potentially harder than ExistLIS-Ident. We
will establish NP-hardness to both ExistLIS and ExistLIS-Ident to rule out the possibility that the
computational hardness is because of nonidentifiability, or in other words, computational difficulty can be
resolved when S ⋆ is identifiable in (1.1) by (1.3).
Theorem 2.1. When E = 2, the problem ExistsLIS is NP-hard under deterministic polynomial-time
reduction; the problem ExistsLIS-Ident is NP-hard under randomized polynomial-time reduction.
Theorem 2.1 states that there exist certain fundamental computational limits under the problem of
pursuing a linear invariant prediction: the difficulties are intrinsically inherited in the problem itself – there
does not exist a polynomial-time algorithm to test whether there exists a non-trivial invariant prediction in
general if P̸=NP.
Remark 1 (NP-hardness under More Restrictive Conditions). It is worth noticing that the underlying
computational barrier is attributed to the nature of the problem, i.e., pursuing invariance, instead of artificial
and technical difficulties. Such a barrier will remain for other cousin models and models under more
restrictive conditions. Examples include (1) finding a prediction with stronger invariance condition like
distributional invariance in Peters et al. (2016); (2) problems with row-wise sparse covariance matrices
where all the covariance matrices only have constant-level non-zero entries in each row; (3) problems with
well-separated heterogeneity in that the variations in prediction are large for all the non-invariant solutions.
See the rigorous statement and discussion in Appendix A.
We will show a much easier problem with fixed (Σ(1) , u(1) ) structure is NP-hard.
10
(resp. d) for presentation simplicity. For an integer m, We let 1m be a m-dimensional vector with all entries
being 1, and let Im be a m × m identity matrix.
Unlike the standard reduction argument whose goal is to find a polynomial time reduction T : X3Sat →
XExistLIS such that 1{|Sx | > 0} = 1{|ST (x) | > 0}, we will construct a parsimonious reduction satisfying
|Sx | = |ST (x) |. This finer construction transfers the promise of the unique solution in 3Sat-Unique to the
promise of the identification in ExistLIS-Ident.
Lemma 2.2. We can construct a parsimonious polynomial-time reduction from 3Sat to ExistLIS: for each
instance x of problem 3Sat with input size k, we can transform it to y = T (x) of problem ExistLIS within
polynomial-time with d = 7k + 1 such that |Sy | = |Sx |.
Proof of Lemma 2.2. We construct the reduction as follows. Let x be any 3Sat instance with k clauses.
Without loss of generality, we assume that each variable has appeared at least once in some clause. For
each clause, we use action ID in {0, . . . , 7} to represent the assignment for it. For example, for the clause
v1 ∨ ¬v2 ∨ ¬v5 and the action ID 6 with binary representation 110 means we let v1 = True, ¬v2 = True
and ¬v5 = False. One will not adopt action ID 0 in a valid solution because a 3Sat valid solution should
let each clause evaluate to True. For arbitrary i, i′ ∈ [k] and t, t′ ∈ [7], we say the action ID t in clause i
contradicts the action ID t′ in clause i′ if and only if t will assign a boolean variable to be True (resp. False)
while t′ will assign the same boolean variable to be False (resp. True). In the proof, we use i, i′ to represent
the index in [k], and use j, j ′ to represent the index in [d].
We construct the problem y as follows: we set d = 7k + 1, and use fixed first environment (Σ(1) , u(1) ) =
(Id , 1d ). For the second environment, we pick
i = i′ and t = t′
1{t contradicts itself}
i = i′ and t ≠ t′
1
A7(i−1)+t,7(i′ −1)+t′ = (2.2)
1 i ̸= i′ and t contradicts t′
0 otherwise
for any i, i′ ∈ [k] and t, t′ ∈ [7]. It is easy to verify that Σ(1) and Σ(2) are all positive definite matrices and it
is a deterministic polynomial-time reduction. Indeed, one has λmin (Σ(e) ) ≥ 1 for any e ∈ [2]. By definition,
S ∈ Sy if and only if β (1,S) = β (2,S) with |S| ≥ 1.
The intuitions behind the constructions are as follows: (a) the construction of (Σ(1) , u(1) ) is to enforce
the entries in the valid solutions, i.e., β (S) with S ∈ Sy , being either 0 or 1; (b) the positive non-integer 21
together with the last column of Σ(2) is to make sure d ∈ S for any S ∈ Sy , which further let |S| = k + 1 for
any S ∈ Sy ; (c) the construction of A is to connect any valid S ∈ Sy to a valid solution v ∈ Sx in a bijective
manner. The above intuitions can be formally stated as follows: the first claim (a) follows directly from our
construction of (Σ(1) , u(1) ), we defer technical verification of (b) and (c) to the end of the proof.
(a) (1,S)
S ∈ Sy ⇐⇒ S ̸= ∅ and β (2,S) = β (1,S) with βj = 1{j ∈ S}
(b)
⇐⇒ S = S̊ ∪ {d} with |S̊| = k and Aj,j ′ = 0 ∀j, j ′ ∈ S̊ ⊆ [7k]
(2.3)
(c)
⇐⇒ S = S̊ ∪ {d} where S̊ = {7(i − 1) + ai }ki=1 with ai ∈ [7] s.t. adopting
action ID ai in clause i ∈ [k] will lead to a valid solution v ∈ Sx .
Based on (2.3), for any v ∈ Sx , we can find a corresponding S ∈ Sy : Let v ∈ {True, False}n be the
assignments of the variables and ai be the corresponding action ID induced by v. Then it follows from (2.3)
that S = {d} ∪ {7(i − 1) + ai }ki=1 ∈ Sy . On the other hand, for any S ∈ Sy , we can also find a corresponding
11
v ∈ Sx by (2.3). Note the mapping between Sy and A = {(ai )ki=1 : (ai )ki=1 is induced by some solution v ∈
Sx } and the mapping between A and Sx are all bijective maps. So we can conclude that |Sx | = |A| = |Sy |.
Proof of (2.3) (b). The direction ⇐ is obvious. For the ⇒ direction, we first show that d ∈ S using the proof
by contradiction argument. Suppose |S| ≥ 1 but d ∈ / S, we pick j ∈ S, then
7k
h
(2) (2,S)
i X 1 (2)
ΣS β S = 5d + Aj,j ′ 1{j ′ ∈ S} =
̸ 5d + = uj
j 2
j ′ =1
(2,S) (1,S)
where the first equality follows from the assumption βj = βj = 1{j ∈ S} and d ∈/ S, and the inequality
follows from the fact that A ∈ {0, 1}7k×7k hence the L.H.S. is an integer. This indicates that β (1,S) ̸= β (2,S)
if |S| ≥ 1 and d ∈/ S. Given d ∈ S, we then obtain
7k
1 (2)
h
(2) (2,S)
i 1X 1
5d + k = ud = ΣS βS = 5d + 1{j ′ ∈ S} = 5d + (|S| − 1),
2 |S| 2 ′ 2
j =1
12
2.4 Hardness of Finding Approximate Solutions with Error Guarantees
The claim in Theorem 2.1 indicates a computational barrier exists in finding an exact invariant set. At first
glance, it does not rule out the possibility that there exists some polynomial-time algorithm that can find
an approximate solution whose prediction is relatively close to one of the non-trivial invariant ones. The
construction in Theorem 2.1 implicitly implies this, as demonstrated in Corollary 2.2. As a by-product,
Corollary 2.2 also rules out the possibility of finding a non-trivial invariant solution if one exists, as it allows
for estimation errors.
Problem 2.5. Consider the same setting as Problem 2.3 with E = 2 and suppose further Y (e) = (β (e,[d]) )⊤ X (e) ,
i.e., there is no intrinsic noise.
[Input] (Σ(1) , Σ(2) ) and (u(1) , u(2) ) as in Problem 2.3.
[Output] Return a d-dimensional vector β̄: β̄ should be an approximate solution to any of the non-trivial
invariant solutions if there exists a non-trivial invariant solution, that is
Moreover, in the construction in Lemma 2.2, the solutions are well-separated: whenever variable selection
is incorrect, the resulting predictions in the two environments are not very close, and the pooled prediction
also deviates from any invariant predictions; see the formal claims in (2.6) and (2.5), respectively. The
inequality (2.6) also rules out the possibility of finding a o(d−4 )-approximate invariant set in a computationally
efficient manner.
Lemma 2.3 (Relative Estimation Error Gap). In the constructed instance in Lemma 2.2, if we let Y (e) =
(β (e,[d]) )⊤ X (e) , then the following holds,
†
† ∥β (S) − β (S ) ∥2Σ
∀S, S ⊆ [d], P (e) |2 ]
∈ [1{S ̸= S † }(40d)−1 , 1] (2.5)
e∈[2] E[|Y
(S)
− β (e,S) ∥2Σ(e)
P
e∈[2] ∥β
∀S ⊆ [d], P (e) |2 ]
∈ {0} ∪ [(10d)−4 , 1] (2.6)
e∈[2] E[|Y
Remark 2 (Dilemma between Statistical and Computational Tractability). One can choose either the
relative distance to the closest non-trivial invariant solution δ1 , or the relative prediction variation defined
in the L.H.S. of (2.6) δ2 as the “estimation error” of interests. If P̸=NP, taking all the polynomial-time
algorithms into consideration, Corollary 2.2 claims that the worst-case estimation error δ1 is lower bounded
by (20d)−1 , and Lemma 2.3 shows that the worst-case estimation error δ2 is lower bounded by (10d)−4 . A
finer construction in Appendix A improves the error lower bounds in (2.4), (2.5) and (2.6) to be d−ϵ for
any fixed ϵ > 0. Given that our theorem is stated at a population level, and one can estimate all the β (e,S)
uniformly well provided n ≳ poly(d), we can claim that the statistical estimation error can be arbitrarily slow
with polynomial-time algorithms if P̸=NP.
Proof of Corollary 2.2. We use the same reduction as in Lemma 2.2. For 3Sat instance x, we let y = T (x)
be the constructed ExistLIS instance in Lemma 2.2. Let β̄ be the output required by Problem 2.5 in the
instance y, and Se = {j : β̄j ≥ 0.5}. Following the notations therein, we claim that
(a)
Se ∈ Sy ⇐⇒ |Sy | ≥ 1 ⇐⇒ x ∈ X3Sat,1 (2.7)
13
Therefore, if an algorithm A can take Problem 2.5 instance y as input and return the desired output β(y) b
within time O(p(|y|)) for some polynomial p, then the following algorithm can solve 3Sat within polynomial
time: for any instance x, it first transforms x into y = T (x), then use algorithm A to solve y and gets the
returned β̄, and finally output 1{Se ∈ Sy }.
It remains to verify (a): the ⇒ direction is obvious. For the ⇐ direction, suppose |Sy | ≥ 1, the estimation
error guarantee in Problem 2.5 indicates that
s r
† † ∥ β̄ − β (S † ) ∥2 (i) (20d)−1 10d2 1
(S ) (S ) Σ
∥βb − β ∥∞ ≤ ∥β̄ − β ∥2 ≤ < ≤
λmin (Σ) 2d 2
for some S † ∈ Sy . Here (i) follows from the the error guarantee (2.4), and the facts λmin (Σ) ≥ 0.5λmin (Σ(2) ) ≥
2d and e∈[2] E[|Y (e) |] ≤ 10d2 in (C.2) and (C.1), respectively. This further indicates Se = S † by the fact
P
(S † )
that βj = 1{j ∈ S † } for any j ∈ [d].
14
3.1 Warmup: Orthogonal Important Covariate
Let us first impose an additional restrictive assumption Condition 3.2 in the model (1.1) and see how the
computational barrier can be circumvented under this condition. In the following Section 3.2, we shall
consider a more general relaxation regime and establish a tradeoff between the additional assumption and
computational complexity.
(e)
Condition 3.2. For all e ∈ E, Σi,j = 0 for any i, j ∈ S ⋆ with i ̸= j.
Recall the definition of β (e,S) and β (S) . If Condition 3.2 holds, then under (1.1) and (1.3) S ⋆ can be
simplied as
n o
(e,{j}) ({j})
S ⋆ = j : ∀e ∈ E, βj ≡ βj
({j})
that involves only marginal regression coefficients, where βj stands for the pooled effect by simply using
the j-th variable as the predictor. This means under Condition 3.2, one can enumerate j ∈ [d] and screen
(e) (e′ )
out those Xj with varying marginal regression coefficients, i.e., Xj with rj ̸= rj for some e, e′ ∈ E, where
(e) (e) (e)
rj = E[Xj Y (e) ]/E[|Xj |2 ]. The survived variables will furnish S ⋆ . Turning to the empirical counterpart,
it is a multi-environment version of the sure-screening (Fan & Lv, 2008).
The above procedure is still of a discontinuity style. Recall RE (β) in (1.7), the main idea motivates
minimizing the following penalized least squares
J1 (β)
z }| {
d s 2
X 1 X (e) (e,{j}) ({j})
Q1,γ (β) = RE (β) + γ |βj | Σj,j βj − βj , (3.1)
j=1
|E|
e∈E
| {z }
w1 (j)
where the penalty term measures the discrepancy across different environments.
(e) (e,{j}) ({j}) 2 (e,{j}) ({j}) 2
Here we use Σj,j |βj − βj | rather than |βj − βj | since the former is x-scale invariant
and has a better explanation in prediction. To be specific, the term w1 (j) will be the same if we replace X
by aX for any a ∈ R \ {0}. More importantly, it can be explained as the variation of optimal prediction in
L2 norm across environments, namely,
Z o2
1 X n (e,j)
w1 (j) = f (x) − f (j) (x) µ(e) (dx) (3.2)
|E|
e∈E
(e,{j}) ({j})
where f (e,j) (x) = βj xj is the best linear prediction on Xj in environment e and f (j) (x) = βj xj is
the best linear prediction on Xj across all environments.
The proposed optimization program can be understood in two aspects. On the one hand, it maintains the
capability to solve the invariant pursuit problem, that is, recover β ⋆ from (1.1), when γ is large enough. To
see this, when γ ≍ 1, the introduced penalty γJ1 (β) will place a constant penalty on the spurious variables,
i.e., j ∈ G, and will not penalize any variables in S ⋆ . Therefore, one can expect that β ⋆ will be the unique
minimizer of Q1,γ (β) as γ is large enough so that the penalty term is larger than the prediction error of
using β ⋆ . On the other hand, it maximizes relaxed worst-case explained variance over small perturbations
around the pooled least squares, defined as β̄ := Σ−1 u, when γ is small. Recall the definition of pooled
quantity (Σ, u) in (1.6), the two-fold characterization of the population-level minimizer of (3.1) can be
formally delivered as follows.
Proposition 3.1. Let Pγ (Σ, u) = (X, Y ) ∼ µ : E[XX ⊤ ] = Σ, |E[XY ] − u| ≤ γ · (w1 (1), . . . , w1 (d))⊤ be
the uncertainty set of distributions. Under Condition 3.1, Q1,γ (β) has an unique minimizer β γ satisfying
15
Moreover, under (1.1) with S ⋆ further satisfying (1.3), if Condition 3.2 holds, then β γ = β ⋆ when γ ≥ γ ⋆ :=
1
P (e) (e)
maxj∈G | |E| e∈E E[Xj ε ]|/w1 (j), where w1 (j) is defined in (3.1).
Proposition 3.1 offers interpretations of the population-level minimizer β γ of Q1,γ (β) for varying γ from
two perspectives. On the one hand, β γ can be interpreted as the distributionally robust prediction model over
the uncertainty set Pγ (Σ, u): it minimizes the worst-case negative explained variance, or it is the maximin
effects (Meinshausen & Bühlmann, 2015; Guo, 2024) over the uncertainty set Pγ (Σ, u). The uncertainty
class contains all joint distributions of (X, y), where the covariates X have the second-order moment matrix
as Σ and the covariance between X and Y is perturbed around u. Similar to Theorem 1 in Meinshausen &
Bühlmann (2015) and Proposition 1 in Guo (2024), β γ has the following geometric explanation, that
This basically says that β γ is the projection of the null β = 0 on the convex closed set Θγ with respect to
the norm ∥ · ∥ = ∥Σ1/2 · ∥2 ; see the proof in Appendix D.2. The distributional robustness (3.3) and geometric
interpretation (3.4) are independent of the invariance structure (1.1) and further structural assumption
Condition 3.2. Instead, they are attributed to the choice of L1 regularization with inhomogeneous weights
(w1 (1), . . . , w1 (d)). This is a realization of the heuristic idea of adopting an anisotropic uncertainty ellipsoid
based on the observed environments. Specifically, more uncertainty is placed on the variables predicting
differently in the observed environments than those with invariant predictions.
On the other hand, consider the case where the data generating process satisfies the invariance structure
(1.1), the sufficient heterogeneity (1.3), together with an additional structure assumption Condition 3.2. Now
the above distributionally robust procedure will place zero uncertainty on the invariant, causal variables,
and will place linear-in-γ uncertainty on the spurious variables. The minimizer β γ will coincide with the
true, causal parameter β ⋆ when γ is large enough.
Let us illustrate the above ideas using the toy example below.
Example 3.1. Consider the following data-generating process with d = 3, E = {1, 2} and independent
standard normal random variables ε0 , . . . , ε3 , the cause-effect relationship and the intervention effects are
illustrated in Fig. 1 (a). The constant factors before εj with j ≥ 2 are added to ensure Xj has a unit
variance.
(1) (2)
X1 ← ε1 , X1 ← ε1 ,
(1) (2)
Y (1) ← X1 + ε0 , Y (2) ← X1 + ε0 ,
(1)
and (2)
√
X2 ← (2/3) · Y (1) + (1/3) · ε2 , X2 ← 0.5 · Y (2) + ( 2)−1 · ε2 ,
(1) (2)
p
X3 ← (2/3) · Y (1) + (1/3) · ε3 ; X3 ← 0.25 · Y (2) + 7/8 · ε3 .
In Example 3.1, X1 is the invariant (causal) variable, while X2 and X3 are all endogenous spurious (reverse
causal) variables as shown in Fig. 1 (a). They have identical spurious predictive powers in environment e = 1,
and variable X3 is confronted with stronger perturbations than X2 in environment e = 2. The invariance
structure is well identified with S ⋆ = {1} satisfying (1.1) and (1.3) simultaneously. The prediction variation
in (3.2) are (w1 (1), w1 (2), w1 (3)) = (0, 1/6, 1/4).
Fig. 1 (b) visualize the maximin effect (3.3) over the uncertainty set shaped by the prediction variation.
For given fixed γ, the uncertainty set in E[XY ] in (3.3) does not place uncertainty on the causal variable X1 ,
while it places a relatively small uncertainty γ/6 on the variables X2 which suffers from less perturbation, and
a relatively large uncertainty γ/4 on the variable X3 that predicts more differently in observed environments
E. This two-dimensional uncertainty plane in covariance space further yields the two-dimensional uncertainty
plane centered on the pooled least squares β̄ in the solution space after the affine transformation x → Σ−1 x
as shown in Fig. 1 (b). The uncertainty sets Θγ all lie in the same hyper-plane and their diameter scales
linearly with γ. The corresponding population-level minimizer β γ is the projection of the null β = 0 on
Θγ . This leads to a solution path that connects the most predictive solution β̄ and the causal solution
16
Θ3.6 βjγ
X1
1 β3
Y β2
Θ2
2/3 2/3
β1
X2 X3 β3
e=1 γ
Θ0.4 0.4 2 3.6
β0 = β̄ β2 γ
X1 ⋆
βFAIR,j
β = β 3.6 β 0.4
β2
1
β1
Y
β2
0.5 0.25
β1
X2 X3 β3
√
e=2 γ
(a) (b) (c)
Figure 1: (a) A structural causal model illustration of the multi-environment model in Example 3.1: the arrow from node u to
node v with number s means there is a linear causal effect s of u on v. (b) visualize the uncertainty set Θγ in three checkpoints
of γ ∈ {0.4, 2, 3.6} and regularization path of the proposed estimator (3.3) in the three-dimensional parameter space β ∈ R3 .
For each γ, the uncertainty set Θγ is a two-dimensional plane filled by colors changing from red to blue as γ increases. The
upper panel of (c) depicts how the population level solution β γ ∈ R3 changes according to γ in each coordinate j ∈ [3]: the
causal variable is represented by green solid line, and the two spurious (reverse causal) variable are represented by yellow
dashed (β2 ) and dotted (β3 ) lines, respectively. The lower panel of (c) plots the counterpart for the FAIR-Linear estimator in
Gu et al. (2024).
β ⋆ continuously. When γ is smaller than the critical threshold, such a prediction β γ still leverages part
of the spurious variables for prediction and will have better prediction over β ⋆ and β̄ when it is deployed
in an environment
√ p where the reverse causal effects are still positive but slightly shrinkage, for example,
X3 ← Y /3 2 + 8/9ε3 . Such a solution β γ stands in between β ⋆ and β̄: it is more robust than β̄ and
less conservative than β ⋆ . As a comparison, the FAIR-Linear (Gu et al., 2024) estimator that solves the
hard-constrained structural estimation problem is less flexible in this regard, as shown in the lower panel of
Fig. 1 (c), it adopts certain hard threshold and choose either to include or eliminate the spurious variables.
17
with some computational budget hyper-parameter k ∈ N.
As k grows or equivalently as more computational budget is paid, the space of instances that can be
solved enlarges and will finally coincide with that of EILLS or FAIR when k ≥ |S ⋆ |. On the other hand,
if the computational budget we can pay is relatively limited, one can still probably solve some problem
instances with low-dimensional structures as elaborated in the following Theorem 3.3.
Condition 3.3 (Restricted Invariance). For any j ∈ S ⋆ , there exists some S ⊆ [d] with |S| ≤ k and j ∈ S
such that β (e,S) ≡ β (S) for any e ∈ E.
Note that when Condition 3.3 holds, for all j ∈ S ⋆ , the weight wk (j) in the penalty term is equal to
0. On the other hand, for a large enough γ, all endogenous variables will be excluded due to a positive
wk (j). Hence, the object (3.5) will screen out all endogenously spurious variables and meanwhile minimize
the prediction errors using the remaining variables. Condition 3.3 naturally holds when k ≥ |S ⋆ |. When
k < |S ⋆ |, Condition 3.3 requires a stronger identification condition than the invariance assumption (1.1)
such that all the invariant variables Xj with j ∈ S ⋆ can be identified using a smaller set Sj with |Sj | ≤
k < |S ⋆ |. This is a generic condition and can hold under different circumstances. For example, there are
(e)
some shared group-orthogonal structures in the set S ⋆ such as ΣS ∗ admits a block diagonal structure with
the maximum block size ≤ k, which includes the diagonal case in Condition 3.2 as a specific instance, or
the insufficiency of interventions on the ancestors of S ⋆ , for example, all the ancestors of Y are free of
intervention. Proposition B.2 in the appendix further offers conditions under which Condition 3.3 holds.
The following two theorems generalize Proposition 3.1 for growing k.
Theorem 3.2. Let Pγ,k (Σ, u) = (X, Y ) ∼ µ : E[XX ⊤ ] = Σ, |E[Xj Y ] − uj | ≤ γ · wk (j) ∀j ∈ [d] be the
uncertain set of distributions. Under Condition 3.1, Qk,γ (β) has a unique minimizer β k,γ satisfying
Theorem 3.3. Under the setting of Theorem 3.2, assume the invariance structure (1.1) holds with S ⋆
satisfying (1.3). Suppose further that Condition 3.3 holds, then β k,γ = β ⋆ when γ ≥ γk⋆ with γk⋆ :=
1
P (e) (e)
maxj∈G | |E| e∈E E[Xj ε ]|/wk (j).
√
⋆ (e)
γk ≤ min λmin (Σ ) · γ ∗
e∈E
where γ ∗ is the critical threshold, or the signal-to-noise ratio in heterogeneity in Fan et al. (2024). It was
defined on a square scale, so a square root is taken here; see the formal definition of γ ∗ in (D.2) in the
appendix. This indicates that one does not need to adopt a potentially larger hyper-parameter to achieve
causal identification compared with EILLS in Fan et al. (2024), recalling the scaling Condition 3.1.
Similar to Proposition 3.1, the first distributional robustness interpretation (3.6) in Theorem 3.2 is
due to adopting inhomogeneous L1 penalization on the variables based on a finer prediction variation
(wk (1), · · · , wk (d)) observed in the environments E than the marginal counterpart (w1 (1), · · · , w1 (d)). The
second theorem Theorem 3.3 states that when additional structural assumption (3.3) holds, the causal
parameter β ⋆ under (1.1) with (1.3) can be identified by our estimator when γ is large enough.
18
3.3 Empirical-level Estimator and Non-asymptotic Analysis
Turning to the empirical counterpart, for given k and γ, we consider minimizing the following empirical-level
penalized least squares
β̂^{k,γ} = argmin_β Q̂_{k,γ}(β),  where  Q̂_{k,γ}(β) := (1/(2n|E|)) Σ_{e∈E, i∈[n]} (Y_i^{(e)} − β^⊤X_i^{(e)})² + γ · Σ_{j=1}^d |β_j| √(ŵ_k(j)),  (3.7)

with  ŵ_k(j) = inf_{S⊆[d], |S|≤k, j∈S} (1/|E|) Σ_{e∈E} ∥β̂_S^{(e,S)} − β̂_S^{(S)}∥²_{Σ̂_S^{(e)}}.
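To make the weights concrete, the following minimal sketch (Python/NumPy) computes ŵ_k(j) by enumerating all subsets S with |S| ≤ k containing j, fitting per-environment and pooled least squares on each subset, and keeping the smallest averaged discrepancy in the Σ̂_S^{(e)} norm. The helper names and the convention that β̂^{(S)} is the least squares on the pooled data are our own assumptions, not the paper's implementation.

```python
import itertools
import numpy as np

def subset_ols(X, Y, S):
    """Least-squares coefficients of Y on the columns S of X."""
    XS = X[:, S]
    return np.linalg.solve(XS.T @ XS, XS.T @ Y)

def empirical_weights(envs, k):
    """envs: list of (X, Y) arrays, one pair per environment; returns (w_hat_k(1), ..., w_hat_k(d))."""
    d = envs[0][0].shape[1]
    X_pool = np.vstack([X for X, _ in envs])           # pooled data, assuming equal n per environment
    Y_pool = np.concatenate([Y for _, Y in envs])
    w = np.full(d, np.inf)
    for size in range(1, k + 1):                       # enumeration is exponential in k (the budget)
        for S in itertools.combinations(range(d), size):
            S = list(S)
            beta_pool = subset_ols(X_pool, Y_pool, S)  # pooled beta_hat^(S)
            gap = 0.0
            for X, Y in envs:
                beta_e = subset_ols(X, Y, S)           # per-environment beta_hat^(e,S)
                Sigma_eS = X[:, S].T @ X[:, S] / X.shape[0]
                diff = beta_e - beta_pool
                gap += diff @ Sigma_eS @ diff          # squared Sigma_hat_S^(e)-norm of the gap
            gap /= len(envs)
            for j in S:
                w[j] = min(w[j], gap)
    return w
```

The penalty in (3.7) then uses the square roots of these weights.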
The weighted L1 -penalty aims at attenuating the endogenously spurious variables. This will be applied to
the low-dimensional regime d = o(n). Under the high-dimensional regime d ≳ n, we further add another L1
penalization with hyper-parameter λ, which aims at reducing exogenously spurious variables:
β̂^{k,γ,λ} = argmin_{β∈R^d} Q̂_{k,γ}(β) + λ∥β∥_1.  (3.8)
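Since the combined penalty in (3.8) is coordinate-separable, γΣ_j√(ŵ_k(j))|β_j| + λ∥β∥_1 = Σ_j(γ√(ŵ_k(j)) + λ)|β_j|, one plausible solver is proximal gradient descent with coordinate-wise soft-thresholding. The sketch below illustrates that observation only; the step size and iteration count are arbitrary choices of ours, not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_igr(envs, w_hat, gamma, lam, n_iter=5000):
    """Proximal gradient for (3.8): squared loss of (3.7) plus weighted L1 and plain L1 penalties."""
    X = np.vstack([Xe for Xe, _ in envs])
    Y = np.concatenate([Ye for _, Ye in envs])
    n, d = X.shape
    Sigma_hat = X.T @ X / n                             # pooled Gram matrix
    u_hat = X.T @ Y / n
    step = 1.0 / np.linalg.eigvalsh(Sigma_hat)[-1]      # 1 / Lipschitz constant of the smooth part
    thresh = step * (gamma * np.sqrt(w_hat) + lam)      # coordinate-wise soft-threshold level
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = Sigma_hat @ beta - u_hat                 # gradient of 0.5 * average squared loss
        beta = soft_threshold(beta - step * grad, thresh)
    return beta
```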
For the theoretical analysis, we impose some standard assumptions used in linear regression.
Condition 3.4 (Regularity). The following conditions hold:
(a) (Data Generating Process) We collect data from |E| ∈ N_+ environments. For each environment e ∈ E, we observe (X_1^{(e)}, Y_1^{(e)}), . . . , (X_n^{(e)}, Y_n^{(e)}) i.i.d. ∼ µ^{(e)}. The data from different environments are also independent.

(b) (Non-collinearity and Normalization) Assume Σ^{(e)} ≻ 0 for any e ∈ E. Recalling the definition in (1.6), we have Σ_{j,j} = 1 for any j ∈ [d].

(c) (Sub-Gaussian Covariate and Noise) There exist constants σ_x ∈ [1, ∞) and σ_y ∈ R_+ such that, for all e ∈ E,

E[exp{v^⊤(Σ_S^{(e)})^{−1/2} X_S^{(e)}}] ≤ exp((σ_x²/2) · ∥v∥_2²)  ∀S ⊆ [d], v ∈ R^{|S|},  and  E[exp{λ Y^{(e)}}] ≤ exp(λ²σ_y²/2)  ∀λ ∈ R.

(d) (Relative Bounded Covariance) There exists a constant b ∈ [1, ∞) such that

∀e ∈ E and S ⊆ [d]:  λ_max(Σ_S^{−1/2} Σ_S^{(e)} Σ_S^{−1/2}) ≤ b.

To simplify the presentation, let c_1 be a constant such that c_1 ≥ max{b, σ_x, σ_y} and |E| ≤ n^{c_1}.
These assumptions are standard in the analysis of linear regression. It is easy to see that the sub-Gaussian covariate condition holds with σ_x = 1 when X^{(e)} ∼ N(0, Σ^{(e)}). The sub-Gaussian condition can be relaxed to finite fourth-moment conditions with robustified inputs; see Fan et al. (2021). Our error bound is independent of sup_{e∈E} λ_max(Σ^{(e)}) given fixed b. The maximum eigenvalue λ_max(Σ^{(e)}) may grow with d in the presence of highly correlated covariates, such as in factor models (Fan et al., 2022; Fan & Gu, 2024). It is also easy to see that b ≤ |E| by observing that

λ_max(Σ_S^{−1/2} Σ_S^{(e)} Σ_S^{−1/2}) ≤ {λ_min((Σ_S^{(e)})^{−1/2} Σ_S (Σ_S^{(e)})^{−1/2})}^{−1} ≤ |E|.  (3.9)
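As a quick numerical sanity check of (3.9) (our own illustration, assuming the pooling convention Σ = |E|^{−1} Σ_{e∈E} Σ^{(e)} and taking S = [d]):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_env = 5, 3

Sigmas = []
for _ in range(n_env):                             # random positive definite covariances
    A = rng.standard_normal((d, d))
    Sigmas.append(A @ A.T + 0.1 * np.eye(d))

Sigma_pool = sum(Sigmas) / n_env                   # pooled covariance (assumed convention)
vals, vecs = np.linalg.eigh(Sigma_pool)
inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T   # Sigma_pool^{-1/2}

for Sigma_e in Sigmas:
    b_e = np.linalg.eigvalsh(inv_half @ Sigma_e @ inv_half)[-1]
    assert b_e <= n_env + 1e-8                     # the bound b <= |E| in (3.9)
    print(round(b_e, 3))
```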
The following theorem establishes the L2 error bound with respect to β k,γ identified in Theorem 3.2 in
the low-dimensional regime.
Theorem 3.4. Assume Condition 3.4 holds. There exists a constant C̃ = O(poly(c_1)) such that if n ≥ C̃ max{d, k log d, t} and t ≥ log n, then with probability at least 1 − e^{−t},

∥β̂^{k,γ} − β^{k,γ}∥_2 ≤ C̃ · √(d/n) · ( (γ/κ) · √(t + log(n) + k log d) + √((1 + t/d)/(κ · |E|)) ),

with the proper choice of γ ≍ γ_k^⋆ ≍ 1. When 0 < γ < γ_k^⋆, the estimator β̂^{k,γ} serves as an invariance-information-guided distributionally robust estimator whose variance lies between that of the pooled least-squares estimator (γ = 0) and that of the causal estimator (γ ≥ γ_k^⋆).
Turning to the high-dimensional regime, we have the following result. The main message is that the
proposed estimator in (3.8) can handle the high-dimensional covariates in a similar spirit to Lasso (Tibshirani,
1997; Bickel et al., 2009) for the sparse linear model with the help of another L1 penalty.
Theorem 3.5. Assume Condition 3.4 holds. Denote S^{k,γ} = supp(β^{k,γ}). There exists a constant C̃ = O(poly(c_1)) such that if n ≥ C̃(k + κ^{−1}|S^{k,γ}|) log d, then with probability at least 1 − (nd)^{−10},

∥β̂^{k,γ,λ} − β^{k,γ}∥_2 ≤ (12λ√(|S^{k,γ}|))/κ   if   λ ≥ C̃( γ · √((k log d + log n)/n) + √((log d + log n)/(n · |E|)) ).
Algorithm 1 Linear Regression with Invariance-Guided Regularization (IGR)
1: Input: training environments {D^{(e)}}_{e∈E} with D^{(e)} = {(X_i^{(e)}, Y_i^{(e)})}_{i=1}^n; validation environment D^{(valid)}.
2: Input: computational budget k.
3: Input: candidate sets of hyper-parameters Γ and Λ.
4: For each pair of hyper-parameters (γ, λ) ∈ Γ × Λ, calculate β̂^{k,γ,λ} using (3.8) on the training environments.
5: Choose the hyper-parameters as

(γ̂, λ̂) ∈ argmin_{γ∈Γ, λ∈Λ} (1/|D^{(valid)}|) Σ_{(X_i,Y_i)∈D^{(valid)}} (X_i^⊤ β̂^{k,γ,λ} − Y_i)².  (4.1)

6: Output: β̂^{k,γ̂,λ̂}.
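A minimal end-to-end sketch of Algorithm 1, reusing the hypothetical empirical_weights and fit_igr helpers sketched above; the validation criterion is the mean squared error in (4.1).

```python
import numpy as np

def run_igr(train_envs, valid_env, k, gammas, lambdas):
    """Algorithm 1: grid-search (gamma, lambda) by the validation criterion (4.1)."""
    X_val, Y_val = valid_env
    w_hat = empirical_weights(train_envs, k)             # weights from (3.7), computed once
    best = None
    for gamma in gammas:
        for lam in lambdas:
            beta = fit_igr(train_envs, w_hat, gamma, lam)
            val_mse = np.mean((X_val @ beta - Y_val) ** 2)
            if best is None or val_mse < best[0]:
                best = (val_mse, gamma, lam, beta)
    _, gamma_hat, lam_hat, beta_hat = best
    return beta_hat, gamma_hat, lam_hat
```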
denoted as {D3+i }4i=1 . This partitioning is motivated by the results of Varambally et al. (2023), which
indicate that the market behavior between the two training time spans differs significantly. We set both X
and Y to be zero-mean in each environment to remove the effect of the trend.
We fix the computational budget k = 2 and compare our method with Causal Dantzig (Rothenhäusler
et al., 2019), Anchor Regression (Rothenhäusler et al., 2021) and DRIG (Shen et al., 2023) with the aid of L1
penalty (if applicable), along with PCMCI+ (Runge, 2020) with the aid of L2 penalty. The hyper-parameters
for all models are determined via the validation set D(valid) = D3 using the criterion similar to (4.1). Here
the hyper-parameters in the two prediction tasks are determined independently. We finally evaluate each
method using the worst-case out-of-sample R² across the four test environments, defined as

min_{e∈{4,5,6,7}} R²_{oos,e}   with   R²_{oos,e} = 1 − (Σ_{(X,Y)∈D_e} (Y − Ŷ(X))²) / (Σ_{(X,Y)∈D_e} Y²),  (4.2)

where Ŷ(X) is the model's prediction. Here we use the R² rather than the mean squared error in (4.1) to present the results and to illustrate the challenge of this task, given that most of the previous methods have negative out-of-sample R², indicating that their fitted models are even worse than simply using the null prediction model.
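The worst-case out-of-sample R² in (4.2) can be computed as in the following sketch, where test_envs is a hypothetical list of (X, Y) arrays for D_4, . . . , D_7 (centered per environment, as described above) and predict is any fitted prediction function.

```python
import numpy as np

def worst_case_r2(predict, test_envs):
    """Worst-case out-of-sample R^2 across test environments, as in (4.2)."""
    scores = []
    for X, Y in test_envs:
        resid = Y - predict(X)
        scores.append(1.0 - np.sum(resid ** 2) / np.sum(Y ** 2))  # R^2 against the null (zero) prediction
    return min(scores)
```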
This process is repeated 100 times. For each trial, we use a random subsample of 90 observations from each training environment D_1, D_2 to fit the model. The average ± standard deviation of the worst-case out-of-sample R² is reported in Table 1.
Table 1: The average ± standard deviation of the worst-case out-of-sample R2 (4.2) for predicting the stocks AMT and SPG using
different estimators.
We can see that our method outperforms competing methods in terms of robustness, as it provides more
consistent estimations across different environments. In particular, our method achieves a positive worst-case
out-of-sample R2 when predicting SPG, while the other methods result in negative R2 values. To qualitatively
illustrate why most of the other competing methods yield negative R2 values, we apply LASSO with an L1
penalty parameter of 0.125 on the training data in the AMT task to select covariates. Using the selected
covariates, we refit the target on the training environments D1 , D2 , as well as one of the test environments
D6 . As shown in Fig. 2, the resulting estimations differ drastically, highlighting strong heterogeneity across
environments. This observation partially explains why other methods may produce negative R2 values.
Figure 2: The estimated coefficients of the selected variables for D_1 ∪ D_2 and D_6. Warm colors represent positive coefficients, while cool colors indicate negative coefficients. Variables are denoted as (τ, j), where τ ∈ {0, 1} represents the time lag and j indicates the stock index.
Here Y consists of the variables that can be predicted from X_t significantly better than by the null prediction model; see the formal procedure for determining Y in Appendix F.2.
For competing estimators, we consider PCMCI+ (Runge, 2020) and Granger causality (Granger, 1969)
with the aid of L2 penalty, along with the following three causality-oriented linear models: Causal Dantzig
(Rothenhäusler et al., 2019), Anchor Regression (Rothenhäusler et al., 2021) and DRIG (Shen et al., 2023)
with the aid of L1 penalty (if applicable). We use mean squared error (MSE) as both the validation metric
and test metric, which is defined as
MSE_e = Σ_{(X_t,Y_t)∈D_e} ∥Y_t − Ŷ(X_t)∥²_2,   e ∈ [4],  (4.3)

where Ŷ(X_t) is the model's prediction. The hyper-parameters for each model are tuned using the validation
environment D3 as described in Algorithm 1.
This process is repeated 100 times. For each trial, we use a random subsample of 300 observations from each training environment D_1, D_2 to fit the model. The average ± standard deviation of the mean squared error on the test environment D_4 of each method for each task is reported in Table 2. The quantitative results
show that our method outperforms all competing methods across all tasks, indicating that IGR can provide
more robust predictions. We also qualitatively visualize the causal relation detected by our method; see
Appendix F.5.
Data air csulf pres slp
IGR(Ours) 3.7838 ± 0.3281 2.0523 ± 0.0883 1.6077 ± 0.1122 3.0466 ± 0.1955
Causal Dantzig 4.3742 ± 0.2099 2.6197 ± 0.0455 2.0429 ± 0.1502 3.5819 ± 0.3161
LASSO 3.9171 ± 0.3194 2.1327 ± 0.0567 1.6726 ± 0.0897 3.1261 ± 0.1887
Anchor 3.9007 ± 0.2394 2.1142 ± 0.0615 1.6638 ± 0.0981 3.1235 ± 0.1622
DRIG 3.9579 ± 0.2594 2.1844 ± 0.1233 1.7235 ± 0.1176 3.2890 ± 0.1618
Granger 4.3174 ± 0.3842 2.3182 ± 0.0736 1.8470 ± 0.1222 3.5308 ± 0.1484
PCMCI+ 4.3533 ± 0.3062 2.4024 ± 0.0422 1.9499 ± 0.1213 3.6627 ± 0.2711
Table 2: The average ± standard deviation of the mean squared error (4.3) of the four tasks air temperature (air), clear sky
upward solar flux (csulf), surface pressure (pres) and sea level pressure (slp) using different estimators.
References
Aldrich, J. (1989). Autonomy. Oxford Economic Papers, 41(1), 15–34.
Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint
arXiv:1907.02893.
Bagnell, J. A. (2005). Robust supervised learning. In AAAI (pp. 714–719).
Bareinboim, E., Correa, J. D., Ibeling, D., & Icard, T. (2022). On Pearl's hierarchy and the foundations of
causal inference. In Probabilistic and Causal Inference: The Works of Judea Pearl (pp. 507–556).
Berthet, Q. & Rigollet, P. (2013a). Complexity theoretic lower bounds for sparse principal component
detection. In Conference on learning theory (pp. 1046–1066).: PMLR.
Berthet, Q. & Rigollet, P. (2013b). Optimal detection of sparse principal components in high dimension.
The Annals of Statistics, 41(4), 1780–1815.
Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. The
Annals of statistics, 37(4), 1705–1732.
Blanchet, J., Kang, Y., Murthy, K., & Zhang, F. (2019). Data-driven optimal transport cost selection
for distributionally robust optimization. In 2019 winter simulation conference (WSC) (pp. 3740–3751).:
IEEE.
Bovet, D. P., Crescenzi, P., & Bovet, D. (1994). Introduction to the Theory of Complexity, volume 7. Prentice
Hall London.
Brennan, M. & Bresler, G. (2019). Optimal average-case reductions to sparse pca: From weak assumptions
to strong hardness. In Conference on Learning Theory (pp. 469–470).: PMLR.
Bühlmann, P. (2020). Invariance, causality and robustness. Statistical Science, 35(3), 404–426.
Candes, E. & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n.
The Annals of Statistics, 35(6), 2313 – 2351.
Chen, X., Ge, D., Wang, Z., & Ye, Y. (2014). Complexity of l2 -lp unconstrained minimization. Mathematical
Programming, 143(1), 371–383.
Chen, Y., Ge, D., Wang, M., Wang, Z., Ye, Y., & Yin, H. (2017). Strong NP-hardness for sparse optimization
with concave penalty functions. In International Conference on Machine Learning (pp. 740–747).: PMLR.
Conze, J., Gani, J., & Fernique, X. (1975). Regularité des trajectoires des fonctions aléatoires gaussiennes.
Springer.
Dawid, A. P. & Didelez, V. (2010). Identifying the consequences of dynamic treatment strategies: A decision-
theoretic overview. Statistics Surveys, 4(none), 184 – 231.
Didelez, V., Dawid, P., & Geneletti, S. (2012). Direct and indirect effects of sequential treatments. arXiv
preprint arXiv:1206.6840.
Duchi, J. C. & Namkoong, H. (2021). Learning models with uniform performance via distributionally robust
optimization. The Annals of Statistics, 49(3), 1378–1406.
Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.
Journal of the American statistical Association, 96(456), 1348–1360.
Fan, J. & Liao, Y. (2014). Endogeneity in high dimensions. Annals of statistics, 42(3), 872.
Fan, J., Lou, Z., & Yu, M. (2022). Are latent factor regression and sparse regression adequate? arXiv
preprint arXiv:2203.01219.
Fan, J. & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911.
Fan, J., Wang, K., Zhong, Y., & Zhu, Z. (2021). Robust high dimensional factor models with applications
to statistical machine learning. Statistical Science, 36(2), 303–327.
Fan, J. & Zhou, W.-X. (2016). Guarding against spurious discoveries in high dimensions. Journal of Machine
Learning Research, 17(203), 1–34.
Fortnow, L. (2021). Fifty years of p vs. np and the possibility of the impossible. Communications of the
ACM, 65(1), 76–85.
Glymour, M., Pearl, J., & Jewell, N. P. (2016). Causal inference in statistics: A primer. John Wiley & Sons.
Granger, C. W. (1969). Investigating causal relations by econometric models and cross-spectral methods.
Econometrica: journal of the Econometric Society, (pp. 424–438).
Gu, Y., Fang, C., Bühlmann, P., & Fan, J. (2024). Causality pursuit from heterogeneous environments via
neural adversarial invariance learning. arXiv preprint arXiv:2405.04715.
Guo, Z. (2024). Statistical inference for maximin effects: Identifying stable associations across multiple
studies. Journal of the American Statistical Association, 119(547), 1968–1984.
Haavelmo, T. (1944). The probability approach in econometrics. Econometrica: Journal of the Econometric
Society, (pp. iii–115).
Hébert-Johnson, U., Kim, M., Reingold, O., & Rothblum, G. (2018). Multicalibration: Calibration for the
(computationally-identifiable) masses. In International Conference on Machine Learning (pp. 1939–1948).:
PMLR.
Heinze-Deml, C., Peters, J., & Meinshausen, N. (2018). Invariant causal prediction for nonlinear models.
Journal of Causal Inference, 6(2).
Huo, X. & Ni, X. (2007). When do stepwise algorithms meet subset selection criteria? The Annals of
Statistics, (pp. 870–887).
Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3),
187–200.
Kalnay, E., Kanamitsu, M., Kistler, R., Collins, W., Deaven, D., Gandin, L., Iredell, M., Saha, S., White,
G., Woollen, J., Zhu, Y., Leetmaa, A., Reynolds, B., Chelliah, M., Ebisuzaki, W., Higgins, W., Janowiak,
J., Mo, K. C., Ropelewski, C., Wang, J., Jenne, R., & Joseph, D. (1996). The NCEP/NCAR 40-Year
Reanalysis Project. Bulletin of the American Meteorological Society, 77(3), 437–472.
Kania, L. & Wit, E. (2022). Causal regularization: On the trade-off between in-sample risk and out-of-sample
risk guarantees. arXiv preprint arXiv:2205.01593.
Karp, R. M. (1972). Reducibility among combinatorial problems. In Complexity of Computer Computations:
Proceedings of a symposium on the Complexity of Computer Computations (pp. 85–103). New York:
Springer.
Kumar, K. K., Rajagopalan, B., & Cane, M. A. (1999). On the weakening relationship between the indian
monsoon and enso. Science, 284(5423), 2156–2159.
Li, S. & Zhang, L. (2024). Fairm: Learning invariant representations for algorithmic fairness and domain
generalization with minimax optimality. arXiv preprint arXiv:2404.01608.
Li, T., Zhang, Y., Chang, C.-P., & Wang, B. (2001). On the relationship between indian ocean sea surface
temperature and asian summer monsoon. Geophysical Research Letters, 28(14), 2843–2846.
Ma, Z. & Wu, Y. (2015). Computational barriers in minimax submatrix detection. The Annals of Statistics,
(pp. 1089–1116).
Meinshausen, N. & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso.
The Annals of Statistics, (pp. 1436–1462).
Meinshausen, N. & Bühlmann, P. (2015). Maximin effects in inhomogeneous large-scale data. The Annals of
Statistics, 43(4), 1801–1830.
Meinshausen, N., Hauser, A., Mooij, J. M., Peters, J., Versteeg, P., & Bühlmann, P. (2016). Methods for
causal inference from gene perturbation experiments and validation. Proceedings of the National Academy
of Sciences, 113(27), 7361–7368.
Mendelson, S., Pajor, A., & Tomczak-Jaegermann, N. (2007). Reconstruction and subgaussian operators in
asymptotic geometric analysis. Geometric and Functional Analysis, 17(4), 1248–1282.
Mohajerin Esfahani, P. & Kuhn, D. (2018). Data-driven distributionally robust optimization using the
wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming,
171(1), 115–166.
Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference by using invariant prediction:
identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical
Methodology), (pp. 947–1012).
Pfister, N., Bühlmann, P., & Peters, J. (2019). Invariant causal prediction for sequential data. Journal of
the American Statistical Association, 114(527), 1264–1276.
Raskutti, G., Wainwright, M. J., & Yu, B. (2010). Restricted eigenvalue properties for correlated gaussian
designs. The Journal of Machine Learning Research, 11, 2241–2259.
Rojas-Carulla, M., Schölkopf, B., Turner, R., & Peters, J. (2018). Invariant models for causal transfer
learning. The Journal of Machine Learning Research, 19(1), 1309–1342.
Rothenhäusler, D., Bühlmann, P., & Meinshausen, N. (2019). Causal dantzig: fast inference in linear
structural equation models with hidden variables under additive interventions. The Annals of Statistics,
47(3), 1688–1722.
Rothenhäusler, D., Meinshausen, N., Bühlmann, P., & Peters, J. (2021). Anchor regression: Heterogeneous
data meet causality. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 83(2),
215–246.
Rudelson, M. & Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Transactions
on Information Theory, 6(59), 3434–3447.
Runge, J. (2020). Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time
series datasets. In Conference on Uncertainty in Artificial Intelligence (pp. 1388–1397).: PMLR.
Runge, J., Petoukhov, V., Donges, J. F., Hlinka, J., Jajcay, N., Vejmelka, M., Hartman, D., Marwan, N.,
Paluš, M., & Kurths, J. (2015). Identifying causal gateways and mediators in complex spatio-temporal
systems. Nature communications, 6(1), 8502.
Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., & Mooij, J. (2012). On causal and anticausal
learning. arXiv preprint arXiv:1206.6471.
Shen, X., Bühlmann, P., & Taeb, A. (2023). Causality-oriented robustness: exploiting general additive
interventions. arXiv preprint arXiv:2307.10299.
Talagrand, M. (2005). The generic chaining: upper and lower bounds of stochastic processes. Springer
Science & Business Media.
Tibshirani, R. (1997). The lasso method for variable selection in the cox model. Statistics in medicine, 16(4),
385–395.
Timmermann, A., An, S.-I., Kug, J.-S., Jin, F.-F., Cai, W., Capotondi, A., Cobb, K. M., Lengaigne,
M., McPhaden, M. J., Stuecker, M. F., et al. (2018). El niño–southern oscillation complexity. Nature,
559(7715), 535–545.
Valiant, L. G. & Vazirani, V. V. (1985). NP is as easy as detecting unique solutions. In Proceedings of the
seventeenth annual ACM symposium on Theory of computing (pp. 458–463).
Varambally, S., Ma, Y.-A., & Yu, R. (2023). Discovering mixtures of structural causal models from time
series data. arXiv preprint arXiv:2310.06312.
Vejmelka, M., Pokorná, L., Hlinka, J., Hartman, D., Jajcay, N., & Paluš, M. (2015). Non-random correlation
structures and dimensionality reduction in multivariate climate data. Climate Dynamics, 44, 2663–2682.
Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science,
volume 47. Cambridge university press.
Wang, T., Berthet, Q., & Samworth, R. (2016). Statistical and computational trade-offs in estimation of
sparse principal components. Annals of Statistics, 44(5), 1896–1930.
Yin, M., Wang, Y., & Blei, D. M. (2021). Optimization-based causal estimation from heterogenous
environments. arXiv preprint arXiv:2109.11990.
Zhang, C.-H. & Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse
estimation problems. Statistical Science, 27(4), 576–593.
Zhang, Y., Wainwright, M. J., & Jordan, M. I. (2014). Lower bounds on the performance of polynomial-time
algorithms for sparse linear regression. In Conference on Learning Theory (pp. 921–948).: PMLR.
Zhao, P. & Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning
Research, 7, 2541–2563.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association,
101(476), 1418–1429.
Supplemental Materials
The supplemental materials are organized as follows:
Appendix A provides additional discussions about the computation barrier omitted in the main text.
Appendix B contains the discussions omitted in the main paper.
Appendix C contains the proofs for the computation barrier results.
Appendix D contains the proofs for the population-level results.
Appendix E contains the proofs for the finite sample results.
Q1 The setup in (1.1) searches for a set satisfying a weaker invariance condition than those adopted in the causal discovery literature. Can stronger invariance conditions, such as the full distributional invariance in Peters et al. (2016), help?

Q2 The NP-hardness of the 0/1 Knapsack problem relies on an exponentially large total budget, and there is a poly(number of items, total budget)-time algorithm. Is the NP-hardness of ExistLIS due to the existence of many varying solutions with heterogeneity signal e^{−Ω(d)}, so that a computationally efficient algorithm is possible if all the non-invariant solutions have large heterogeneity signals?

Q3 The covariance matrices in the construction of Lemma 2.2 are dense. Is computationally efficient estimation attainable when the covariance matrices are all sparse?
The brief answers to the above questions are all “No”, and the rigorous statements can be found in the
following subsections.
where Σ^{(1)}, . . . , Σ^{(E)} ∈ R^{d×d} are positive definite matrices, u^{(1)}, . . . , u^{(E)} are d-dimensional vectors, and v^{(e)} is a scalar satisfying v^{(e)} > (u^{(e)})^⊤(Σ^{(e)})^{−1}u^{(e)}. We say a set S is a non-trivial distribution-invariant set if

β^{(e,S)} ≡ β^{(S)} ≠ 0   and   v^{(e)} − (β_S^{(e,S)})^⊤ Σ_S^{(e)} β_S^{(e,S)} ≡ v_ε.  (A.1)
The following lemma shows that (A.1) is equivalent to the full distribution invariance condition (Assumption
1 in Peters et al. (2016)) under the setting in Problem A.1.
Lemma A.1. Under the setting of Problem A.1, S satisfies (A.1) if and only if

∃ β̄ ∈ R^d with β̄_S ≠ 0  s.t.  Y^{(e)} = (X_S^{(e)})^⊤ β̄_S + ε^{(e)}  with  ε^{(e)} ∼ F_ε ⊥⊥ X_S^{(e)}  ∀e ∈ {1, . . . , E}.  (A.2)
It is then easy to see that the problem ExistDIS-Unique corresponds to the case where the non-trivial invariant set is unique if it exists and ICP (Peters et al., 2016) can uniquely identify S⋆. We have the following result. The proof idea is to construct the problem such that the additional invariant noise-level constraint trivially holds for all the prediction-invariant solutions.

Theorem A.1. When E = 2, the problem ExistDIS is NP-hard under deterministic polynomial-time reduction, and the problem ExistDIS-Unique is NP-hard under randomized polynomial-time reduction.
Problem A.3. Consider the problem Exist-ϵ-Sep-LIS with E = 2 and suppose further that Y^{(e)} = (β^{(e,[d])})^⊤ X^{(e)}, i.e., there is no intrinsic noise. The input is the same, and the algorithm is required to output β̂ ∈ R^d such that inf_{S: β^{(e,S)} ≡ β^{(S)} ≠ 0} ∥β̂ − β^{(S)}∥²_Σ ≤ d^{−ϵ}/4 whenever {S : β^{(e,S)} ≡ β^{(S)} ≠ 0} ≠ ∅; its output can be an arbitrary d-dimensional vector otherwise.
Corollary A.3. If Problem A.3 can be solved by a worst-case polynomial-time algorithm, then 3Sat can
also be solved by a worst-case polynomial-time algorithm.
The key idea is to divide the set [d] into two blocks: a block with size dϵ/3 , whose construction is similar
to Lemma 2.2, and a remaining auxiliary block, where there is no invariant solution in this block and the
predictive variance is carefully controlled. It is interesting to see if a similar result holds for ϵ = 0. We leave
it for future studies.
Theorem A.4. Consider the problem ExistLIS with the additional constraint that for any e ∈ [E], each
row of matrix Σ(e) has no more than C non-zero elements for some universal constant C > 0. The above
problem is NP-hard under deterministic polynomial-time reduction when E = 2.
The proof idea is as follows. We first reduce a general 3Sat problem x with k clauses to another 3Sat problem x′ with O(k²) clauses in which each variable appears at most 15 times. This further leads to a row-wise sparse A in Lemma 2.2. A finer construction is also adopted to distribute the constraints imposed by the last dense row of Σ^{(2)} into O(k²) sparse rows.
The last constraint tm = 0 can be written as the clause (¬tm ∨ ¬tm ∨ ¬tm ).
The rest of the proof is the same as that in Theorem 1.1: we can construct a randomized polynomial
reduction from 3Sat to 3Sat-Unique.
B Omitted Discussions
B.1 Discussion on Li & Zhang (2024)
Problem B.1. Under the same setting as Problem 2.3 with E = 2, it takes (Σ^{(1)}, Σ^{(2)}) and (u^{(1)}, u^{(2)}) as input and is required to determine whether there exists S ⊆ [d] with |S| ≥ d/7 such that Σ_S^{(1)} = Σ_S^{(2)} and u_S^{(1)} = u_S^{(2)}.
Here we test the existence of a "large" covariance-invariant set, namely one with |S| ≥ d/7, rather than an arbitrary covariance-invariant set. This is because if S is covariance-invariant in Problem B.1, then {j} is covariance-invariant for any j ∈ S, and testing the existence of a univariate invariant set is trivial, with an O(d·E)-time algorithm. The proof is similar to that of Theorem 2.1 by letting u^{(1)} = u^{(2)}.

Theorem B.1. Problem B.1 is NP-hard.
Definition 5 (Structural Causal Model). A structural causal model M = (S, ν) on p variables Z_1, . . . , Z_p can be described using p assignment functions {f_1, . . . , f_p} = S:

Z_j ← f_j(Z_{pa(j)}, U_j),  j = 1, . . . , p,

where pa(j) ⊆ {1, . . . , p} is the set of parents, or direct causes, of the variable Z_j, and ν(du) = Π_{j=1}^p ν_j(du_j) is the joint distribution over the p independent exogenous variables (U_1, . . . , U_p). For a given model M, there is an associated directed graph G(M) = (V, E) that describes the causal relationships among variables, where V = [p] is the set of nodes and E is the edge set such that (i, j) ∈ E if and only if i ∈ pa(j). G(M) is acyclic if there is no sequence (v_1, . . . , v_k) with k ≥ 2 such that v_1 = v_k and (v_i, v_{i+1}) ∈ E for all i ∈ [k − 1].
As in Peters et al. (2016), we consider the following data-generating process in |E| environments. For each e ∈ E, the process governing the p = d + 1 random variables Z^{(e)} = (Z_1^{(e)}, . . . , Z_{d+1}^{(e)}) = (X_1^{(e)}, . . . , X_d^{(e)}, Y^{(e)}) is derived from an SCM M^{(e)} = M(S^{(e)}, ν). We let e_0 ∈ E be the observational environment for reference; the rest are interventional environments. We let G be the directed graph representing the causal relationships in e_0, and simply let G be shared across E without loss of generality. We assume G is acyclic. In each environment e ∈ E, the assignments are as follows:

X_j^{(e)} ← f_j^{(e)}(Z_{pa(j)}^{(e)}, U_j),  j = 1, . . . , d,
Y^{(e)} ← f_{d+1}(X_{pa(d+1)}^{(e)}, U_{d+1}).  (B.1)

Here the distribution of the exogenous variables (U_1, . . . , U_{d+1}), the cause-effect relationships {pa(j)}_{j=1}^{d+1} represented by G, and the structural assignment f_{d+1} are invariant across e ∈ E, while the structural assignments for X may vary among e ∈ E. The heterogeneity, emphasized by the superscript (e), is due to arbitrary interventions on the variables X. We use Z_{pa(j)} to emphasize that Y can be a direct cause of some variables in the covariate vector. We let I ⊆ [d], defined as I := {j : f_j^{(e)} ≠ f_j^{(e_0)} for some e ∈ E}, denote the set of intervened variables. We summarize the above data-generating process as a condition.
Condition B.1. Suppose {M (e) }e∈E are defined by (B.1), G is acyclic, and fd+1 is a linear function.
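To make (B.1) concrete, the following small simulation sketch generates two environments from a linear SCM with d = 3 covariates, where the mechanism of X_1 is intervened on while the assignment for Y is shared; the graph, coefficients, and intervention are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def simulate_environment(n, shift, rng):
    """One environment of (B.1): X1 -> X2 -> Y -> X3, with a mean-shift intervention on X1."""
    U = rng.standard_normal((n, 4))      # independent exogenous variables U_1, ..., U_4
    X1 = U[:, 0] + shift                 # intervened mechanism: varies with the environment
    X2 = 0.8 * X1 + U[:, 1]              # parent of Y
    Y = 1.5 * X2 + U[:, 2]               # invariant assignment f_{d+1}; here S* = pa(Y) = {2}
    X3 = -0.7 * Y + U[:, 3]              # a child of Y, hence spurious for invariance learning
    return np.column_stack([X1, X2, X3]), Y

rng = np.random.default_rng(1)
envs = [simulate_environment(500, shift, rng) for shift in (0.0, 2.0)]  # e_0 and one interventional environment
```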
Proposition B.2. Under the model (1.1) with the regularity condition Condition 3.4, suppose one of the following conditions holds.

(a) There exists a partition S⋆ = ∪_{l=1}^L S_l^⋆ such that E[X_{S_l^⋆}^{(e)}(X_{S_r^⋆}^{(e)})^⊤] = 0 for any l ≠ r and |S_l^⋆| ≤ k.

(b) Condition B.1 holds, so that we can define the ancestor set recursively as at(j) = pa(j) ∪ ⋃_{k′∈pa(j)} at(k′); assume I ∩ at(d + 1) = ∅, S⋆ = pa(d + 1), and k ≥ 1.

Then Condition 3.3 holds.
Proof of Proposition B.2. We first prove (a). To be specific, we show that

∀e ∈ E, l ∈ [L]:  β_{S_l^⋆}^{(e,S_l^⋆)} = β_{S_l^⋆}^⋆.

It follows from Condition 3.4 and the definition of least squares that

β_{S_l^⋆}^{(e,S_l^⋆)} = [Σ_{S_l^⋆}^{(e)}]^{−1} E[X_{S_l^⋆}^{(e)} Y^{(e)}]
 = [Σ_{S_l^⋆}^{(e)}]^{−1} E[ X_{S_l^⋆}^{(e)} ( ε^{(e)} + (X_{S_l^⋆}^{(e)})^⊤ β_{S_l^⋆}^⋆ + Σ_{r≠l} (X_{S_r^⋆}^{(e)})^⊤ β_{S_r^⋆}^⋆ ) ]
 = [Σ_{S_l^⋆}^{(e)}]^{−1} ( E[X_{S_l^⋆}^{(e)} ε^{(e)}] + Σ_{S_l^⋆}^{(e)} β_{S_l^⋆}^⋆ + Σ_{r≠l} E[X_{S_l^⋆}^{(e)}(X_{S_r^⋆}^{(e)})^⊤] β_{S_r^⋆}^⋆ )
 (i)= β_{S_l^⋆}^⋆,

where (i) follows from the exogeneity of X_{S⋆} in (1.1) and condition (a).
Now we prove (b). Given the condition in (b), we have, for any j ∈ S⋆ = pa(d + 1) and e, e′ ∈ E,

β_j^{(e)} = E[X_j^{(e)} Y^{(e)}] / E[|X_j^{(e)}|²] = E[X_j^{(e′)} Y^{(e′)}] / E[|X_j^{(e′)}|²] = β_j^{(e′)}.

Here (a) follows from the fact that the pooled full covariance matrix

[ Σ , u ; u^⊤ , (1/2) Σ_{e∈E} E[|Y^{(e)}|²] ]

is positive semi-definite.
Now we turn to the lower bound. We denote Ã = [ A , (1/2)·1_k ; (1/2)·1_k^⊤ , 0 ]. It is easy to see that ∥Ã∥_F ≤ d. Combining this with the fact that ∥Ã∥_2 ≤ ∥Ã∥_F, the maximum and minimum eigenvalues of Σ^{(2)} can be controlled by

4d ≤ 5d − ∥Ã∥_2 ≤ λ_min(Σ^{(2)}) ≤ λ_max(Σ^{(2)}) ≤ 5d + ∥Ã∥_2 ≤ 6d.  (C.1)
When there is no intrinsic noise, the variance of Y (e) can be exactly calculated as
∥β^{(S)} − β^{(S†)}∥²_Σ = (β^{(S)} − β^{(S†)})^⊤ Σ Σ^{−1} Σ (β^{(S)} − β^{(S†)})
 ≥ (1/λ_max(Σ)) ∥Σ(β^{(S)} − β^{(S†)})∥²_2 ≥ (1/(7d)) ∥Σ(β^{(S)} − β^{(S†)})∥²_2.  (C.3)

We denote S_1 = S \ S† and S_2 = S† \ S. We will establish a lower bound on ∥∆∥²_2 for ∆ = Σ(β^{(S)} − β^{(S†)}) ∈ R^d when S ≠ S†. Given S ≠ S†, one has either S_1 ≠ ∅ or S_2 ≠ ∅. Without loss of generality, we assume that S_2 ≠ ∅.
First, one has
−1
(S) −1 5d + 1 1e
∥βS ∥2 = (ΣS ) uS 2 = I|S| + AS uS
2 2
2
−1
1 e 2
= I|S| + AS uS
5d + 1 5d + 1
2
(a)
−1
1 e 2 2
≤ I|S| + AS − I|S| ∥uS ∥2 + ∥uS ∥2
5d + 1 5d + 1 5d + 1
2 5d + 1 + 0.5k √
(b)
d
≤ 1+2 d
5d + 1 5d + 1 2
√ √
≤ (1 + 2/5) × (1 + 0.5/5) d ≤ 1.5 d.
Here (a) follows from the triangle inequality, (b) follows from the fact that ∥(I + M )−1 − I∥2 ≤ 2∥M ∥ if
∥M ∥ ≤ 0.5. Pick j ∈ S2 , it follows from the above upper bound, the fact j ∈ / S and Cauchy Schwarz
inequality that
(S)
∆j = Σ⊤
j,S βS − uj
This further yields that ∥∆∥22 ≥ ∥∆j ∥2 ≥ d2 . Combining it with (C.3) and (C.2) completes the proof of the
lower bound.
Here (a) follows from the definition of β^{(e,S)}, and (b) follows from the fact that the following covariance matrix

[ Σ^{(e)} , u^{(e)} ; (u^{(e)})^⊤ , E[|Y^{(e)}|²] ]

is positive semi-definite.
Turning to the lower bound,
X
∥β (S) − β (e,S) ∥2Σ(e) ≥ ∥β (S) − β (1,S) ∥22
e∈[2]
2
(1) (2) (1)
= Σ−1
S (0.5uS + 0.5uS ) − uS
2
−2 (1) (2) (1) (2) (1)
≥ [λmax (Σ)] 0.5uS + 0.5uS − 0.5uS − 0.5ΣS uS
2
−2 (2) (1) (2)
≥ [4λmax (Σ)] ∥2ΣS uS − uS ∥22
(2) (1) (2) (2) (1) (2)
Observe that all the entries in the vector 2ΣS uS − uS are integer. Then unless ΣS uS = uS , in other
(2) (1) (2)
words, S is a invariant set by Definition 4, we have ∥2ΣS uS − uS ∥22 ≥ 1. Therefore, we have
X
∥β (S) − β (e,S) ∥2Σ(e) ≥ ∥β (S) − β (1,S) ∥22 ≥ [4λmax (Σ)]−2 ≥ 784d−2
e∈[2]
if S is not a invariant set. Combining it with the upper bound (C.2) completes the proof.
where (a) follows from the fact that (X, Y) are multivariate Gaussian, under which independence is equivalent to uncorrelatedness, together with the fact that ε̂^{(e)} is also Gaussian, and (b) follows from the fact that

var(ε̂^{(e)}) = E[|Y^{(e)}|²] − 2(β_S^{(e,S)})^⊤ E[X_S^{(e)} Y^{(e)}] + (β_S^{(e,S)})^⊤ Σ_S^{(e)} β_S^{(e,S)} = v^{(e)} − (β_S^{(e,S)})^⊤ Σ_S^{(e)} β_S^{(e,S)}.
Proof of Theorem A.1. The proof is similar to that of Theorem 2.1. For each instance x, we use the same reduction construction of (Σ, u) in the problem y constructed in Lemma 2.2, and choose the scalars v^{(1)} and v^{(2)} accordingly. Then, for any S ∈ S_y,

v^{(2)} − (β_S^{(2,S)})^⊤ Σ_S^{(2)} β_S^{(2,S)} = v^{(2)} − 1_{|S|}^⊤ Σ_S^{(2)} 1_{|S|}
 (a)= v^{(2)} − ( 5d + (1/2)·2(|S| − 1) + 1_k^⊤(5dI_k + A_{S̊})1_k )
 (b)= v^{(2)} − 5d(1 + k) − k = 100d^5.

Here (a) follows from the fact that d ∈ S provided S ∈ S_y, and (b) follows from the fact that A_{i,j} = 0 for any i, j ∈ S̊ and |S| = k + 1 provided S ∈ S_y. This further yields that S_y ⊆ S_ỹ. Combined with the fact that S_ỹ ⊆ S_y, one further has S_ỹ = S_y. The rest of the proof follows similarly.
C.3 Proof of Theorem A.2
We adopt a similar reduction idea to that in Lemma 2.2. Without loss of generality, we assume k ≥ 10^4 and ϵ < 0.5.
We first introduce one additional piece of notation. For any integer ℓ > 0, we define the positive definite ℓ × ℓ matrix H_ℓ by

(H_ℓ)_{j,j′} = 2 if j = j′ and (H_ℓ)_{j,j′} = 1 otherwise,  (C.4)

for any j, j′ ∈ [ℓ]. Namely, H_ℓ = I_ℓ + 1_ℓ1_ℓ^⊤ for any ℓ ≥ 1. One can thereby obtain H_ℓ^{−1} = I_ℓ − (1/(ℓ+1))·1_ℓ1_ℓ^⊤.
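For completeness, the stated inverse can be checked with one application of the Sherman–Morrison formula (a routine verification added here, not part of the original argument):

H_ℓ^{−1} = (I_ℓ + 1_ℓ1_ℓ^⊤)^{−1} = I_ℓ − (1_ℓ1_ℓ^⊤)/(1 + 1_ℓ^⊤1_ℓ) = I_ℓ − (1/(ℓ+1))·1_ℓ1_ℓ^⊤,   so that   1_ℓ^⊤ H_ℓ^{−1} 1_ℓ = ℓ − ℓ²/(ℓ+1) = ℓ/(ℓ+1).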
Step 1. Construct the Reduction. For any 3Sat instance x with input size k, we construct an ExistLIS instance y with size d = ⌈k^{3/ϵ}⌉ as follows:

Σ^{(1)} = [ 32k·I_{7k+1} , 0 ; 0 , H_{d−7k−1} ]   and   u^{(1)} = k^{−1} [ 32k·1_{7k+1} ; 1_{d−7k−1} ],

and

Σ^{(2)} = [ 32k·I_{7k} + A , (1/2)·1_{7k} , 0 ; (1/2)·1_{7k}^⊤ , 32k , 0 ; 0 , 0 , H_{d−7k−1} ]   and   u^{(2)} = k^{−1} [ (32k + 1/2)·1_{7k} ; 32k + (1/2)k ; −3·1_{d−7k−1} ].

One can observe that both Σ^{(1)} and Σ^{(2)} are composed of an upper-left (7k + 1) × (7k + 1) block and a lower-right (d − 7k − 1) × (d − 7k − 1) block H_{d−7k−1}. Recall that in the proof of (2.5) we introduced the matrix Ã = [ A , (1/2)1_{7k} ; (1/2)1_{7k}^⊤ , 0 ], and we have ∥Ã∥_2 ≤ ∥Ã∥_F ≤ 7k + 1. Then, similar to (C.1), the maximum and minimum eigenvalues of Σ_{[7k+1]}^{(2)} can be controlled by

24k ≤ 32k − ∥Ã∥_2 ≤ λ_min(Σ_{[7k+1]}^{(2)}) ≤ λ_max(Σ_{[7k+1]}^{(2)}) ≤ 32k + ∥Ã∥_2 ≤ 40k.  (C.5)

Combining this with the fact that H_ℓ is positive definite for any ℓ ≥ 1, we can conclude that both Σ^{(1)} and Σ^{(2)} are positive definite, and that the above reduction can be calculated within polynomial time.
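The block construction above is mechanical; the following sketch (a hypothetical helper of ours, not code from the paper) assembles Σ^{(1)}, u^{(1)}, Σ^{(2)}, u^{(2)} from the 7k × 7k conflict matrix A of Lemma 2.2. Note that d = ⌈k^{3/ϵ}⌉ is astronomically large for the k ≥ 10^4 assumed in the proof, so this is only runnable for toy values of k and ϵ.

```python
import numpy as np

def H(ell):
    """The matrix H_ell = I_ell + 1 1^T from (C.4)."""
    return np.eye(ell) + np.ones((ell, ell))

def build_instance(A, k, eps):
    """Assemble (Sigma1, u1, Sigma2, u2) of the reduction in Step 1; A is the {0,1} conflict matrix."""
    d = int(np.ceil(k ** (3.0 / eps)))
    m = d - 7 * k - 1                                    # size of the auxiliary block
    Sigma1 = np.zeros((d, d))
    Sigma1[:7 * k + 1, :7 * k + 1] = 32 * k * np.eye(7 * k + 1)
    Sigma1[7 * k + 1:, 7 * k + 1:] = H(m)
    u1 = np.concatenate([32 * k * np.ones(7 * k + 1), np.ones(m)]) / k

    Sigma2 = np.zeros((d, d))
    Sigma2[:7 * k, :7 * k] = 32 * k * np.eye(7 * k) + A  # conflict structure of the 3Sat instance
    Sigma2[:7 * k, 7 * k] = 0.5
    Sigma2[7 * k, :7 * k] = 0.5
    Sigma2[7 * k, 7 * k] = 32 * k
    Sigma2[7 * k + 1:, 7 * k + 1:] = H(m)
    u2 = np.concatenate([(32 * k + 0.5) * np.ones(7 * k),
                         [32 * k + 0.5 * k],
                         -3 * np.ones(m)]) / k
    return Sigma1, u1, Sigma2, u2
```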
Now it suffices to show that (1) the above construction is a parsimonious reduction, and (2) the instance y lies in the problem Exist-ϵ-Sep-LIS. In order to complete the remaining proof, it is helpful to observe that there are three modifications in this construction compared to the construction in Lemma 2.2.

(a) We introduce an auxiliary (d − 7k − 1)-dimensional part [d] \ [7k + 1]. We will show that this part is precluded by any invariant set.

(b) We change the diagonal entries of Σ_{[7k+1]}^{(1)} from 1 to 32k, and those of Σ_{[7k+1]}^{(2)} from 5d to 32k, to make E[|Y^{(1)}|²] ≍ E[|Y^{(2)}|²]. We also change the coordinates of u_{[7k+1]}^{(1)} and u_{[7k+1]}^{(2)} accordingly.

(c) We add a k^{−1} multiplicative factor in u^{(1)} and u^{(2)} to make E[|Y^{(1)}|²], E[|Y^{(2)}|²] ≍ 1. This also results in all the β^{(e,S)} and β^{(S)} being multiplied by the same k^{−1} factor.

Step 2. Verification of Parsimonious Reduction. We first claim that the auxiliary (d − 7k − 1)-dimensional part [d] \ [7k + 1] is precluded by any invariant set, namely

S† ∈ S_y  ⟹  S† ∩ {7k + 2, . . . , d} = ∅.  (C.6)
However, in our construction Σ_{j,S†}^{(1)} = Σ_{j,S†}^{(2)} while u_j^{(1)} ≠ u_j^{(2)}. This leads to a contradiction. Therefore, an invariant set should not contain any element in {7k + 2, . . . , d}.
By (C.6), we have the following statements, similar to (2.3) in the proof of Lemma 2.2:

S† ∈ S_y ⟺(a) S† ≠ ∅ and β^{(2,S†)} = β^{(1,S†)} with β_j^{(1,S†)} = k^{−1}·1{j ∈ S†}
 ⟺(b) S† = S̊ ∪ {7k + 1} with |S̊| = k and A_{j,j′} = 0 for all j, j′ ∈ S̊ ⊆ [7k]  (C.7)
 ⟺(c) S̊ = {7(i − 1) + a_i}_{i=1}^k with a_i ∈ [7] such that adopting action ID a_i in clause i ∈ [k] leads to a valid solution v ∈ S_x.

We emphasize that the proofs of (C.7)(a) and (c) are essentially identical to those of (2.3). For completeness, we prove (b).
Proof of (C.7)(b). The proof is almost identical to the proof of (2.3)(b), since the major difference is the k^{−1} multiplicative factor. The direction ⇐ is obvious. For the ⇒ direction, we first show that 7k + 1 ∈ S† by contradiction. Suppose |S†| ≥ 1 but 7k + 1 ∉ S†, and pick j ∈ S†; then

k [Σ_{S†}^{(2)} β_{S†}^{(2,S†)}]_j = 32k + Σ_{j′=1}^{7k} A_{j,j′} 1{j′ ∈ S†} ≠ 32k + 1/2 = k · u_j^{(2)},

where the first equality follows from the assumption β_j^{(2,S†)} = β_j^{(1,S†)} = k^{−1}1{j ∈ S†} and 7k + 1 ∉ S†, and the inequality follows from the fact that A ∈ {0, 1}^{7k×7k}, hence the L.H.S. is an integer. This indicates that β^{(1,S†)} ≠ β^{(2,S†)} if |S†| ≥ 1 and 7k + 1 ∉ S†. Given 7k + 1 ∈ S†, we then obtain

32k + (1/2)k = k · u_{7k+1}^{(2)} = k [Σ_{S†}^{(2)} β_{S†}^{(2,S†)}]_{7k+1} = 32k + (1/2) Σ_{j′=1}^{7k} 1{j′ ∈ S†} = 32k + (1/2)(|S†| − 1),
Here (a) follows from the fact that Σ^{(1)} is a block diagonal matrix. It follows from the identity 1_ℓ^⊤ H_ℓ^{−1} 1_ℓ = ℓ/(1 + ℓ) that

1 < (k^{−1})²(32k)²(7k + 1)·(1/(32k)) ≤ E[|Y^{(1)}|²] ≤ (k^{−1})²·32k·(7k + 1) + (d − 7k − 1)/(d − 7k − 1 + 1) < 256.

Similarly, for E[|Y^{(2)}|²], using the fact that Σ^{(2)} is block diagonal, we obtain

E[|Y^{(2)}|²] = (u^{(2)})^⊤(Σ^{(2)})^{−1}u^{(2)}
 = (u_{[7k+1]}^{(2)})^⊤(Σ_{[7k+1]}^{(2)})^{−1}u_{[7k+1]}^{(2)} + (u_{[d]\[7k+1]}^{(2)})^⊤(Σ_{[d]\[7k+1]}^{(2)})^{−1}u_{[d]\[7k+1]}^{(2)}
 = (u_{[7k+1]}^{(2)})^⊤(Σ_{[7k+1]}^{(2)})^{−1}u_{[7k+1]}^{(2)} + 9(k^{−1})²·1_{d−7k−1}^⊤ H_{d−7k−1}^{−1} 1_{d−7k−1}.

Recall that λ_max(Σ_{[7k+1]}^{(2)}) ≤ 40k and λ_min(Σ_{[7k+1]}^{(2)}) ≥ 24k; we have

1 < (k^{−1})²(40k)^{−1}(32k)²(7k + 1) ≤ (u_{[7k+1]}^{(2)})^⊤(Σ_{[7k+1]}^{(2)})^{−1}u_{[7k+1]}^{(2)} ≤ (k^{−1})²(24k)^{−1}(32k + k)²(7k + 1) < 999.

Therefore,

1 < E[|Y^{(2)}|²] < 999 + 9(k^{−1})² < 1000.

Hence we can conclude that 1 ≤ E[|Y^{(1)}|²], E[|Y^{(2)}|²] ≤ 1000.
Step 2. Calculating the Prediction Variation. Now we lower bound the heterogeneity gap (1/|E|) Σ_{e∈E} ∥β_S^{(e,S)} − β_S^{(S)}∥²_{Σ_S^{(e)}} ≥ d^{−ϵ}/1280 when S is not an invariant set as in Definition 4. Denote S_1 = S ∩ [7k + 1] and S_2 = S \ [7k + 1]. We divide the argument into two cases when β^{(1,S)} ≠ β^{(2,S)}.

Case 1. S_2 ≠ ∅: Observing that Σ^{(1)} and Σ^{(2)} are block diagonal matrices, we have

β_{S_2}^{(1,S)} = β_{S_2}^{(1,S_2)} = H_{|S_2|}^{−1} u_{S_2}^{(1)},

Substituting the above terms, we can lower bound the heterogeneity gap as

(1/2)[ ∥β_S^{(1,S)} − β_S^{(S)}∥²_{Σ_S^{(1)}} + ∥β_S^{(2,S)} − β_S^{(S)}∥²_{Σ_S^{(2)}} ]
 = (1/2)[ ∥β_{S_1}^{(1,S_1)} − β_{S_1}^{(S_1)}∥²_{Σ_{S_1}^{(1)}} + ∥β_{S_1}^{(2,S_1)} − β_{S_1}^{(S_1)}∥²_{Σ_{S_1}^{(2)}} ] + (1/2)[ ∥β_{S_2}^{(1,S_2)} − β_{S_2}^{(S_2)}∥²_{Σ_{S_2}^{(1)}} + ∥β_{S_2}^{(2,S_2)} − β_{S_2}^{(S_2)}∥²_{Σ_{S_2}^{(2)}} ]
 ≥ (1/2)[ ∥β_{S_2}^{(1,S_2)} − β_{S_2}^{(S_2)}∥²_{Σ_{S_2}^{(1)}} + ∥β_{S_2}^{(2,S_2)} − β_{S_2}^{(S_2)}∥²_{Σ_{S_2}^{(2)}} ]
Case 2. S_2 = ∅: In this case, we must have β_{S_1}^{(1,S)} ≠ β_{S_1}^{(2,S)} because β_{(S_1)^c}^{(e,S)} = 0 for any e ∈ {1, 2}. At the same time,

(1/2)[ ∥β_S^{(1,S)} − β_S^{(S)}∥²_{Σ_S^{(1)}} + ∥β_S^{(2,S)} − β_S^{(S)}∥²_{Σ_S^{(2)}} ]
 (a)= (1/2)[ ∥β_{S_1}^{(1,S_1)} − β_{S_1}^{(S_1)}∥²_{Σ_{S_1}^{(1)}} + ∥β_{S_1}^{(2,S_1)} − β_{S_1}^{(S_1)}∥²_{Σ_{S_1}^{(2)}} ]
 ≥ (1/2)[ λ_min(Σ_{[7k+1]}^{(1)}) ∧ λ_min(Σ_{[7k+1]}^{(2)}) ] · [ ∥β_{S_1}^{(1,S_1)} − β_{S_1}^{(S_1)}∥²_2 + ∥β_{S_1}^{(2,S_1)} − β_{S_1}^{(S_1)}∥²_2 ]
 (b)≥ (24k/2) [ ∥β_{S_1}^{(1,S_1)} − β_{S_1}^{(S_1)}∥²_2 + ∥β_{S_1}^{(2,S_1)} − β_{S_1}^{(S_1)}∥²_2 ]
 (c)≥ (24k/2) · (1/2) ∥β_{S_1}^{(1,S_1)} − β_{S_1}^{(2,S_1)}∥²_2 = 6k ∥β_{S_1}^{(1,S_1)} − β_{S_1}^{(2,S_1)}∥²_2.

Here (a) follows from the fact that Σ^{(1)} and Σ^{(2)} are block diagonal and S_2 = ∅; (b) follows from the fact that λ_min(Σ_{[7k+1]}^{(1)}), λ_min(Σ_{[7k+1]}^{(2)}) ≥ 24k; and (c) follows from the fact that ∥a − c∥²_2 + ∥b − c∥²_2 ≥ min_x {∥a − x∥²_2 + ∥b − x∥²_2} = ∥a − (a + b)/2∥²_2 + ∥b − (a + b)/2∥²_2 = 0.5∥a − b∥²_2 for any vectors a, b, c.
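For completeness, fact (c) is the standard midpoint inequality (a routine verification added here, not part of the original argument):

∥a − c∥²_2 + ∥b − c∥²_2 ≥ min_x {∥a − x∥²_2 + ∥b − x∥²_2} = ∥a − (a + b)/2∥²_2 + ∥b − (a + b)/2∥²_2 = (1/2)∥a − b∥²_2,

since the quadratic x ↦ ∥a − x∥²_2 + ∥b − x∥²_2 is minimized at the midpoint x = (a + b)/2.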
Recall that Σ_{S_1}^{(2)} β_{S_1}^{(2,S)} = u_{S_1}^{(2)} and λ_max(Σ_{S_1}^{(2)}) ≤ 40k; then

6k ∥β_{S_1}^{(1,S)} − β_{S_1}^{(2,S)}∥²_2 = 6k ( Σ_{S_1}^{(2)} β_{S_1}^{(1,S)} − u_{S_1}^{(2)} )^⊤ (Σ_{S_1}^{(2)})^{−2} ( Σ_{S_1}^{(2)} β_{S_1}^{(1,S)} − u_{S_1}^{(2)} )
 ≥ (6k / λ_max(Σ_{S_1}^{(2)})²) ∥Σ_{S_1}^{(2)} β_{S_1}^{(1,S)} − u_{S_1}^{(2)}∥²_2
 ≥ (6k / (40k)²) ∥Σ_{S_1}^{(2)} β_{S_1}^{(1,S)} − u_{S_1}^{(2)}∥²_2
 = (6k / (80k²)²) ∥(2Σ_{S_1}^{(2)})(kβ_{S_1}^{(1,S)}) − (2k)·u_{S_1}^{(2)}∥²_2.

Combining β_{S_1}^{(1,S)} = k^{−1} 1_{|S_1|} with the definitions of Σ^{(2)} and u^{(2)}, we obtain that each coordinate of the vector (2Σ_{S_1}^{(2)})(kβ_{S_1}^{(1,S)}) − (2k)·u_{S_1}^{(2)} is an integer. At the same time, we also have

(2Σ_{S_1}^{(2)})(kβ_{S_1}^{(1,S)}) − (2k)·u_{S_1}^{(2)} = 2k Σ_{S_1}^{(2)} ( β_{S_1}^{(1,S)} − β_{S_1}^{(2,S)} ) ≠ 0

because Σ_{S_1}^{(2)} has full rank, which further yields ∥(2Σ_{S_1}^{(2)})(kβ_{S_1}^{(1,S)}) − (2k)·u_{S_1}^{(2)}∥²_2 ≥ 1. So we can conclude that

(1/2)[ ∥β_S^{(1,S)} − β_S^{(S)}∥²_{Σ_S^{(1)}} + ∥β_S^{(2,S)} − β_S^{(S)}∥²_{Σ_S^{(2)}} ] ≥ 6k/(80k²)² ≥ (1/1280) k^{−3}

under Case 2. Combining the above two cases together, we can conclude that

(1/2)[ ∥β^{(1,S)} − β^{(S)}∥²_{Σ^{(1)}} + ∥β^{(2,S)} − β^{(S)}∥²_{Σ^{(2)}} ] ≥ k^{−3}/1280 ≥ d^{−ϵ}/1280.
Step 3. Calculating the Gap between β^{(S)} and β^{(S†)}. Let S† be an arbitrary invariant set according to Definition 4, and let S be any set that does not equal S†. We keep adopting the notation S_1 = S ∩ [7k + 1], S_2 = S \ [7k + 1], and divide the argument into two cases.

Case 1. S_2 ≠ ∅: In this case, from the calculations above we have β_{S_2}^{(S)} = −H_{|S_2|}^{−1} u_{S_2}^{(1)}. On the other hand, β_{S_2}^{(S†)} = 0 for any invariant set S† according to (C.6). Combining the two facts together yields

∥β^{(S)} − β^{(S†)}∥²_Σ ≥ ∥β_{S_2}^{(S)}∥²_{H_{|S_2|}} = (u_{S_2}^{(1)})^⊤ H_{|S_2|}^{−1} u_{S_2}^{(1)} ≥ (k^{−1})² · |S_2|/(|S_2| + 1) ≥ (1/2) k^{−2} ≥ d^{−ϵ}/2.
Case 2. S_2 = ∅: In this case, since S_2 = ∅, one must have S ⊂ [7k + 1]. On the other hand, in (C.6) we showed that any invariant set S† is also a subset of [7k + 1]. In this case, we claim that a stronger statement holds: for any pair of distinct subsets S, S′ of [7k + 1], one has ∥β^{(S)} − β^{(S′)}∥²_Σ ≥ d^{−ϵ}/1280.

Recall that in (C.5) we obtained 24k ≤ λ_min(Σ_{[7k+1]}^{(2)}) ≤ λ_max(Σ_{[7k+1]}^{(2)}) ≤ 40k. This implies 28k ≤ λ_min(Σ_{[7k+1]}) ≤ λ_max(Σ_{[7k+1]}) ≤ 36k. It follows from the assumption S_2 = ∅ and our construction of Σ that

∥β^{(S)} − β^{(S′)}∥²_Σ = ∥β_{[7k+1]}^{(S)} − β_{[7k+1]}^{(S′)}∥²_{Σ_{[7k+1]}}
 = (β_{[7k+1]}^{(S)} − β_{[7k+1]}^{(S′)})^⊤ Σ_{[7k+1]} Σ_{[7k+1]}^{−1} Σ_{[7k+1]} (β_{[7k+1]}^{(S)} − β_{[7k+1]}^{(S′)})
 ≥ (1/λ_max(Σ_{[7k+1]})) ∥Σ_{[7k+1]}(β_{[7k+1]}^{(S)} − β_{[7k+1]}^{(S′)})∥²_2  (C.9)
 ≥ (1/(36k)) ∥Σ_{[7k+1]}(β_{[7k+1]}^{(S)} − β_{[7k+1]}^{(S′)})∥²_2.

Here (a) follows from the triangle inequality, and (b) follows from the fact that ∥(I + M)^{−1} − I∥_2 ≤ 2∥M∥_2 if ∥M∥_2 ≤ 0.5. Hence ∥β^{(S)}∥_2 ≤ 3k^{−1/2} for any S ⊂ [7k + 1], and similarly ∥β^{(S′)}∥_2 ≤ 3k^{−1/2}.
Since S ≠ S′, there exists some j ∈ [7k + 1] such that j ∈ (S \ S′) ∪ (S′ \ S). Without loss of generality, we assume j ∈ S′ \ S. Then it follows from the above upper bound, the fact that j ∈ S′ \ S, and the Cauchy–Schwarz inequality that

∆_j = Σ_{j,S}^⊤ β_S^{(S)} − u_j
the instance y, and let S̃ = {j ∈ [7k + 1] : β̂_j ≥ k^{−1}/2}. Following the notation therein, we claim that

S̃ ∈ S_y ⟺(a) |S_y| ≥ 1 ⟺ x ∈ X_{3Sat,1}.  (C.10)

Therefore, if an algorithm A can take a Problem A.3 instance y as input and return the desired output β̂(y) within time O(p(|y|)) for some polynomial p, then the following algorithm solves 3Sat within polynomial time: for any instance x, it first transforms x into y = T_ϵ(x), then uses algorithm A to solve y and obtain the returned β̂, and finally outputs 1{S̃ ∈ S_y}.
It remains to verify (a): the ⇒ direction is obvious. For the ⇐ direction, suppose |S_y| ≥ 1; the estimation error guarantee in Problem A.3 indicates that

∥β̂_{[7k+1]} − β_{[7k+1]}^{(S†)}∥_∞ ≤ ∥β̂ − β^{(S†)}∥_2 ≤ √( ∥β̂ − β^{(S†)}∥²_Σ / λ_min(Σ) ) (a)< √(0.25 d^{−ϵ}) ≤ (1/2) k^{−1}

for some S† ∈ S_y. Here (a) follows from the error guarantee in Problem A.3 and the fact that λ_min(Σ) ≥ 1, derived in the proof of Theorem A.2. This further indicates that S̃ = S†, by the fact that S† ⊂ [7k + 1] and β_j^{(S†)} = k^{−1}·1{j ∈ S†} for any j ∈ [7k + 1], derived in the proof of Theorem A.2.
with an additionally introduced boolean variable w^◦ that is forced to be True by the first four clauses in (C.12). Hence the constraints (C.11) can be translated into 6n(k − 1) clauses, with n(k − 1) additionally introduced variables {w_ℓ^◦}_{ℓ=1}^{n(k−1)}. Finally, in instance x′ there are k′ = k + 6n(k − 1) < 18k² clauses in total. Each boolean variable in {w_{m,i}}_{m∈[n],i∈[k]} appears no more than 3 + 2 × 6 = 15 times, and each additionally introduced boolean variable in {w_ℓ^◦}_{ℓ=1}^{n(k−1)} appears no more than 6 times.
Now we prove that the mapping T we construct is a parsimonious polynomial-time reduction, namely,
for any valid solution v ∈ Sx , setting wm,i = vm for m ∈ [n], i ∈ [k] and wℓ◦ = True for ℓ ∈ [n(k − 1)] leads
to a valid solution w ∈ Sx′ , and such mapping from Sx to Sx′ is a bijection.
The verification of injectivity is obvious. Now we prove surjectivity. For any valid solution w of instance x′, the constraints (C.11) require w_{m,1} = · · · = w_{m,k} for m ∈ [n]. Hence setting v_m = w_{m,1} for m ∈ [n] leads to a valid solution v ∈ S_x whose image is w. This completes the proof of the bijection.
Step 2. Construction of the ExistLIS-Ident Problem. Next, we construct the 7k′ × 7k′ matrix A that corresponds to the 3Sat instance x′, as in (2.2). Namely,

A_{7(i−1)+t, 7(i′−1)+t′} = 1{t contradicts itself} if i = i′ and t = t′;  1 if i = i′ and t ≠ t′;  1 if i ≠ i′ and t contradicts t′;  0 otherwise,

for any i, i′ ∈ [k′]. Matrix B can be seen as the adjacency matrix of a connected graph over k′ vertices. We define the matrix K ∈ R^{k′×7k′} by

K_{i,j} = 1 if 7(i − 1) < j ≤ 7i and K_{i,j} = 0 otherwise,  (C.14)

for any i ∈ [k′], j ∈ [7k′]. We construct its corresponding ExistLIS instance y with |y| = 8k′ as follows:

Σ^{(2)} = [ 1000·I_{7k′} + A , (1/2)K^⊤ ; (1/2)K , 1000·I_{k′} + (1/8)B ]   and   u^{(2)} = [ (1000 + 1/2)·1_{7k′} ; (1000 + 1/2)·1_{k′} + (1/8)B1_{k′} ].

One can easily verify that both Σ^{(1)} and Σ^{(2)} are positive definite from the fact that Σ^{(2)} is diagonally dominant and H_ℓ is positive definite for any ℓ ≥ 1. Note that A_{7(i−1)+s, 7(i′−1)+t} ≠ 0 immediately implies that the i-th clause and the i′-th clause share a variable. Since each variable appears no more than 15 times, one clause shares common variables with at most 3 × 15 other clauses. We can then conclude that each row of the matrix A has no more than 7 × (3 × 15 + 1) = 322 non-zero elements. Combining this with the fact that there are no more than 2 non-zero elements in each row of B and no more than 7 non-zero elements in each row/column of K, we conclude that, for any e ∈ E, each row of the matrix Σ^{(e)} has no more than 322 + 7 + 2 + 1 < 400 non-zero elements.
Similar to (2.3) in the proof of Lemma 2.2, we claim the following and defer the proof to the end of this step:

S† ∈ S_y ⟺(a) ∅ ≠ S† ⊂ [8k′] and β^{(2,S†)} = β^{(1,S†)} with β_j^{(1,S†)} = 1{j ∈ S†}
 ⟺(b) S† = S̊ ∪ {7k′ + 1, . . . , 8k′} with |S̊ ∩ {7i − 6, . . . , 7i}| = 1 for all 1 ≤ i ≤ k′ and A_{j,j′} = 0 for all j, j′ ∈ S̊ ⊆ [7k′]  (C.15)
 ⟺(c) S̊ = {7(i − 1) + a_i}_{i=1}^{k′} with a_i ∈ [7] such that adopting action ID a_i in clause i ∈ [k′] leads to a valid solution v ∈ S_{x′}.

Combining (C.15) and Step 1, we have |S_x| = |S_{x′}| = |S_y|. Since d = 8k′ = poly(k) and the construction can be done in polynomial time, this mapping is a deterministic polynomial-time reduction from 3Sat to the problem we construct. Therefore, we can conclude that the problem we construct is NP-hard.
The proof of (C.15)(a) is essentially identical to the proof of (2.3)(a) in Lemma 2.2. Now we prove (C.15)(b) and (c).
Proof of (C.15)(b). The proof idea is similar to that of (2.3)(b). The direction ⇐ is obvious. For the ⇒ direction, we first assert that

{7k′ + 1, . . . , 8k′} ∩ S† ≠ ∅.  (C.16)

We use a proof by contradiction. If {7k′ + 1, . . . , 8k′} ∩ S† = ∅, there must exist an index j ∈ [7k′] ∩ S† since S† is nonempty. Combined with the fact that β_j^{(2,S†)} = β_j^{(1,S†)} = 1{j ∈ S†}, the equation Σ_{j,S†}^{(2)} β_{S†}^{(2,S†)} = u_j^{(2)} gives

1000 + Σ_{j′=1}^{7k′} A_{j,j′} β_{j′}^{(2,S†)} = 1000 + 1/2.

The L.H.S. is an integer while the R.H.S. is not. This leads to a contradiction and proves (C.16).
Now we consider an element i + 7k′ ∈ {7k′ + 1, . . . , 8k′} ∩ S†. The equation Σ_{S†}^{(2)} β_{S†}^{(2,S†)} = u_{S†}^{(2)} then gives

(1/2) Σ_{7i−6<j≤7i} β_j^{(2,S†)} + (1/8) Σ_{i′: B_{i,i′}=1} β_{i′+7k′}^{(2,S†)} + 1000 = 1000 + 1/2 + (1/8) Σ_{i′: B_{i,i′}=1} 1.

This indicates that all the neighbors of i (with respect to the adjacency matrix B) must be simultaneously contained in S†. Since B represents the adjacency matrix of a connected graph, we can then inductively prove that {7k′ + 1, . . . , 8k′} ⊂ S†. Given this, the equation Σ_{S†}^{(2)} β_{S†}^{(2,S†)} = u_{S†}^{(2)} now becomes

(1/2) K 1_{S̊} = (1/2)·1_{k′}  ⟹  |S̊ ∩ {7i − 6, . . . , 7i}| = 1 for all 1 ≤ i ≤ k′,

and

A_{S̊} 1_{k′} = 0  ⟹(i)  A_{j′,j} = 0 for all j′, j ∈ S̊,

where (i) follows from the fact that A ∈ {0, 1}^{7k′×7k′}.

Proof of (C.15)(c). The direction ⇒ follows from the proof of (2.3)(c). For the direction ⇐, it follows from the proof of (2.3)(c) and the fact that S̊ = {7(i − 1) + a_i}_{i=1}^{k′} with a_i ∈ [7] naturally implies |S̊ ∩ {7i − 6, . . . , 7i}| = 1 for i ∈ [k′].
D Proofs for the Population-level Results
D.1 Proof of Proposition 3.1
Applying Theorem 3.2 with k = 1 completes the proof of (3.3). To establish the causal identification result,
it suffices to verify Condition 3.3 with k = 1.
To see this, under (1.1) and (1.3), if Condition 3.2 further holds, we have, for each j ∈ S⋆ and e ∈ E,

β^{(e,{j})} = E[X_j^{(e)} Y^{(e)}] / E[X_j^{(e)} X_j^{(e)}] = E[X_j^{(e)} (Σ_{i∈S⋆} X_i^{(e)} β_i^⋆ + ε^{(e)})] / E[X_j^{(e)} X_j^{(e)}] (a)= β_j^⋆,

provided Condition 3.2 and (1.1), respectively. This completes the proof.
It is easy to check that the convex hull of Θ_γ is itself; applying Theorem 1 of Meinshausen & Bühlmann (2015) completes the proof.
Q_{k,γ}(β) = sup_{µ∈P_{k,γ}(Σ,u)} E_µ[|Y − β^⊤X|²] − E_µ[|Y|²].

Note that

E_µ[|Y − β^⊤X|²] − E_µ[|Y|²] = β^⊤ E_µ[XX^⊤] β − 2β^⊤ E_µ[XY] = β^⊤Σβ − 2β^⊤ E_µ[XY].

On the other hand, it follows from the definitions of Σ, u and Q_{k,γ}(β) that

Q_{k,γ}(β) = (1/2)β^⊤Σβ − β^⊤u + γ v^⊤|β|   with   v = (w_k(1), . . . , w_k(d)).

Now it suffices to show that, for any β ∈ R^d,

(1/2)β^⊤Σβ − β^⊤u + γ v^⊤|β| = sup_{ũ: |ũ−u| ≤ γ·v} { (1/2)β^⊤Σβ − β^⊤ũ }.  (D.1)
To see this, it is easy to verify that, for any given x, a ∈ R and b ∈ R_+, one has −ax + b|x| = sup_{ã: |ã−a|≤b} (−ãx); applying this coordinate-wise verifies (D.1) and thus completes the proof of the claim (3.6).
where G is defined in (1.2) and β̄ = Σ^{−1}u. Here (a) follows from the fact that the first quadratic term is non-negative, together with the identity

Σ(β̄ − β⋆) = (1/|E|) Σ_{e∈E} { E[X^{(e)}Y^{(e)}] − E[X^{(e)}(X^{(e)})^⊤β⋆] } = (1/|E|) Σ_{e∈E} E[X^{(e)}ε^{(e)}].

Therefore, we have

Q_{k,γ}(β) − Q_{k,γ}(β⋆) ≥ 0   if   γ ≥ max_{j∈G} |(1/|E|) Σ_{e∈E} E[ε^{(e)}X_j^{(e)}]| / w_k(j) := γ_k^⋆,
We finally establish the upper bound on γ_k^⋆. It follows from the definition of w_k that

(γ_k^⋆)² = max_{j∈G} |(1/|E|) Σ_{e∈E} E[ε^{(e)}X_j^{(e)}]|² / {w_k(j)}²
 = max_{j∈G} |(1/|E|) Σ_{e∈E} E[ε^{(e)}X_j^{(e)}]|² / inf_{S: j∈S, |S|≤k} (1/|E|) Σ_{e∈E} ∥β_S^{(e,S)} − β_S^{(S)}∥²_{Σ_S^{(e)}}
 ≤ max_{j∈G} max_{S: j∈S} ∥(1/|E|) Σ_{e∈E} E[ε^{(e)}X_S^{(e)}]∥²_2 / ( (1/|E|) Σ_{e∈E} ∥β_S^{(e,S)} − β_S^{(S)}∥²_{Σ_S^{(e)}} )
 = max_{S: S∩G≠∅} ∥(1/|E|) Σ_{e∈E} E[ε^{(e)}X_S^{(e)}]∥²_2 / ( (1/|E|) Σ_{e∈E} ∥β_S^{(e,S)} − β_S^{(S)}∥²_{Σ_S^{(e)}} ).

Let κ_min = min_{e∈E} λ_min(Σ^{(e)}); we have

(1/|E|) Σ_{e∈E} ∥β_S^{(e,S)} − β_S^{(S)}∥²_{Σ_S^{(e)}} ≥ κ_min (1/|E|) Σ_{e∈E} ∥β_S^{(e,S)} − β_S^{(S)}∥²_2
 ≥ κ_min inf_{β: β_{S^c}=0} (1/|E|) Σ_{e∈E} ∥β^{(e,S)} − β∥²_2 ≥ κ_min (1/|E|) Σ_{e∈E} ∥β^{(e,S)} − β̄^{(S)}∥²_2

with β̄^{(S)} = (1/|E|) Σ_{e∈E} β^{(e,S)}. Plugging this back into the upper bound on (γ_k^⋆)², we conclude that

(γ_k^⋆)² ≤ (κ_min)^{−1} max_{S: S∩G≠∅} ∥(1/|E|) Σ_{e∈E} E[ε^{(e)}X_S^{(e)}]∥²_2 / ( (1/|E|) Σ_{e∈E} ∥β^{(e,S)} − β̄^{(S)}∥²_2 ) = γ^∗ κ_min²,

where

γ^∗ = (κ_min)^{−3} max_{S: S∩G≠∅} ∥(1/|E|) Σ_{e∈E} E[ε^{(e)}X_S^{(e)}]∥²_2 / ( (1/|E|) Σ_{e∈E} ∥β^{(e,S)} − β̄^{(S)}∥²_2 ).  (D.2)
We define
1 X b h (e) (e)
i 1 ⊤b 1 ⊤
R(β)
b = E |Yi − β ⊤ Xi |2 = β Σβ − β ⊤ u
b+ ub u
b,
2|E| 2 2
e∈E
1 X h (e) (e)
i 1 ⊤ 1
R(β) = E |Yi − β ⊤ Xi |2 = β Σβ − β ⊤ u + u⊤ u.
2|E| 2 2
e∈E
We let

v(S) = (1/|E|) Σ_{e∈E} ∥β_S^{(e,S)} − β_S^{(S)}∥²_{Σ_S^{(e)}}   and   v̂(S) = (1/|E|) Σ_{e∈E} ∥β̂_S^{(e,S)} − β̂_S^{(S)}∥²_{Σ̂_S^{(e)}}.

One can expect |v(S) − v̂(S)| ≍ (|S|/n)^{1/2} by the CLT. However, applying such a crude bound would result in a slower rate. Instead, the next proposition aims to establish a sharper, instance-dependent error bound for the difference. We define

ρ(s, t) = (s log(4d/s) + log(|E|) + t) / n   and   ζ(s, t) = (s log(4d/s) + t) / (n · |E|)  (E.2)

with s ∈ [d] and t > 0.
We also define some concepts that will be used throughout the proof.
Definition 6 (Sub-Gaussian Random Variable). A random variable X is a sub-Gaussian random variable with parameter σ ∈ R_+ if

∀λ ∈ R,  E[exp(λ(X − E[X]))] ≤ exp(λ²σ²/2).
Proposition E.2. Under Condition 3.1, for any k ∈ [d] and γ ≥ 0, Q_{k,γ}(β) is uniquely minimized by some β^{k,γ}. Moreover, for any β ∈ R^d,

Q_{k,γ}(β) − Q_{k,γ}(β^{k,γ}) ≥ (1/2) ∥Σ^{1/2}(β − β^{k,γ})∥²_2.
The next lemma shows that the explained variance of β^{k,γ} is smaller than the explained variance of the population-level least squares, which in turn is smaller than σ_y².
Lemma E.3. Let β k,γ be the unique minimizer of Qk,γ (β), and β̄ be the unique minimizer of R(β). Then
we have
Qk,γ (β) − Qk,γ (β k,γ ) = Qk,γ (β) − Qb k,γ (β) + Qb k,γ (β) − Q b k,γ (β k,γ )
For T2 (β), under the event A3 (d, t) and A4 (d, t) defined in (E.18), the following holds with a universal
constant C1
∀β, T2 (β) = R(β) − R(β k,γ ) − R(β)
b b k,γ )
− R(β
1
= (β − β k,γ )⊤ Σ(β − β k,γ )+(β − β k,γ )⊤ (Σβ k,γ − u)
2
1
− (β − β k,γ )⊤ Σ(β
b − β k,γ ) − (β − β k,γ )⊤ (Σβb k,γ − u
b)
2
1 n o
= {Σ1/2 (β − β k,γ )}⊤ I − Σ−1/2 ΣΣ b −1/2 {Σ1/2 (β − β k,γ )}
2 n o
− {Σ1/2 (β − β k,γ )}⊤ I − Σ−1/2 ΣΣ b −1/2 (Σ1/2 β k,γ )
Here we substitute the upper bounds in (E.18) and use condition that n ≥ 3(d + t) such that b · ζ(d, t) ≤
(t + d · log 4)/n ≤ 1. We also use ∥Σ1/2 β k,γ ∥2 ≤ σy derived in Lemma E.3. Substituting the low-dimension
structure, if n · |E| ≥ 64C12 bσx4 (d + t), we can obtain
1 1/2 d+t
∀β, T2 (β) ≤ ∥Σ (β − β k,γ )∥22 + C2 bσx4 σy2 ·
4 n · |E|
by first applying Proposition E.1 and then applying Lemma E.4 provided Cσx4 ρ(k, t) ≤ 1 following from
e log(d) + t) and |E| < nc1 .
n ≥ C(k
Now we plug in β = βbk,γ , under which T1 (βbk,γ ) ≤ 0, denote ♣ = bσx4 σy2 , we have
1 X b h (e)
(e)
i h
(e)
i
∀j ∈ [d], E Y − (X (e) )⊤ β k,γ Xj − E Y (e) − (X (e) )⊤ β k,γ Xj
|E|
e∈E
p
≤ Cbσx2 σy ζ(1, t) + ζ(1, t)
Now we are ready to prove Theorem 3.5.
Proof of Theorem 3.5. Denote βb = βbk,γ,λ , β⋆ = (β⋆,1 , . . . , β⋆,d )⊤ = β k,γ , S⋆ = supp(β⋆ ) and ∆
b = βb − β⋆ .
First, one can observe that
(a) (b)
b S ∥1 − ∥ ∆
λ(∥∆ b S c ∥1 ) ≥ λ(∥β⋆ ∥1 − ∥β∥
b 1) ≥ Q b −Q
b k,γ (β) b k,γ (β⋆ ).
⋆ ⋆
Here (a) follows from the triangle inequality; and (b) follows from the fact that βb minimizes Q
b k,γ (β) + λ∥β∥1 .
At the same time, it follows from the definition of Qk,γ (β) that
b
Q b −Q
b k,γ (β) b k,γ (β⋆ )
d
X d
X
= R(β)
b + bk (j) · |βbj | − R(β
w b ⋆) − bk (j) · |β⋆,j |
w
j=1 j=1
d
1 b⊤ b b b⊤ 1 ⊤b ⊤
X
= β Σβ − β u b − β⋆ Σβ⋆ + β⋆ u
b+γ bk (j) |βbj | − |β⋆,j |
w
2 2 j=1
d
1 b b βb − β⋆ + β⋆ ) − 1 β⋆⊤ Σβ
X
= (β − β⋆ + β⋆ )⊤ Σ( b ⋆ − (βb − β⋆ )⊤ u
b+γ bk (j) |βbj | − |β⋆,j |
w
2 2 j=1
d
1 b⊤b b b⊤b b ⊤u
X
= ∆ Σ∆ + ∆ Σβ⋆ − ∆ b+γ bk (j) |βbj | − |β⋆,j |
w (E.7)
2 j=1
d
1 b⊤b b b⊤ b ⊤ (u − Σβ⋆ ) − ∆
b ⊤ (u − Σβ⋆ ) + γ
X
= ∆ Σ∆ − ∆ (b u − Σβ
b ⋆) + ∆ bk (j) |βbj | − |β⋆,j |
w
2 j=1
(a) 1 ⊤
n o
= ∆ b Σ
b∆b −∆b ⊤ (b
u − Σβ
b ⋆ ) − (u − Σβ⋆ )
2
Xd X d
+ γ − ξj wk (j) βbj − β⋆,j + bk (j) |βbj | − |β⋆,j |
w
j=1 j=1
1 b⊤b b
= ∆ Σ∆ + T1 + γT2 (β).
b
2
Here (a) follows from the KKT condition that
and
(
{sign(β⋆ )} j ∈ S⋆
ξj ∈ .
[−1, 1] j∈/ S⋆
n o
For T1 , note that the j-th coordinate of (bu − Σβ
b ⋆ ) − (u − Σβ⋆ ) is
1 X b h (e) (e)
i h
(e)
i
E (Y − (X (e) )⊤ β k,γ )Xj − E (Y (e) − (X (e) )⊤ β k,γ )Xj .
|E|
e∈E
with probability at least 1 − (nd)−20 .
For T2 , one has
X X
T2 = |βbj | − |β⋆,j | wbk (j) − βbj − β⋆,j wk (j) sign(β⋆,j )
j∈S⋆ j∈S⋆
X X
+ |βbj | · w
bk (j) − βbj · wk (j)ξj
j ∈S
/ ⋆ j ∈S
/ ⋆
X X
≥ |βbj | − |β⋆,j | wbk (j) − βbj − β⋆,j wk (j) sign(β⋆,j )
j∈S⋆ j∈S⋆
X X
+ |βbj | · w
bk (j) − |βbj | · wk (j)|ξj |
j ∈S
/ ⋆ j ∈S
/ ⋆
Xh i
= |βbj | · (w
bk (j) − wk (j)) + |βbj | · wk (j)(1 − |ξj |)
j∈S⋆c
X h i
+ |βbj | − |β⋆,j | · w
bk (j) + |β⋆,j | − (βbj ) sign(β⋆,j ) wk (j)
j∈S⋆
(a) Xh i X h i
≥ |βbj | · (w
bk (j) − wk (j)) + |βbj | − |β⋆,j | · w
bk (j) + |β⋆,j | − |βbj | wk (j)
j∈S⋆c j∈S⋆
Xh i Xh i
= (|βbj | − |β⋆,j |) · (w
bk (j) − wk (j)) + (|βbj | − |β⋆,j |) · (w
bk (j) − wk (j))
j∈S⋆c j∈S⋆
d
X
≥− |βbj − β⋆,j | · |w
bk (j) − wk (j)|.
j=1
Here (a) follows from the fact that 1 − |ξj | ≥ 0 and (βbj ) sign(β⋆,j ) ≥ −|βbj |. It follows from the upper bound
bk − wk ∥∞ derived in (E.4) that, provided Cσx4 ρ(k, t) ≤ 1, the following holds with probability at least
of ∥w
1 − e−t
p
T2 ≥ −∥βb − β⋆ ∥1 ∥w bk − wk ∥∞ ≥ −∥∆∥ b 1 ♣ · ρ(k, t).
Set t = C log(n · d) and recall that we assume |E| ≤ nc1 . Then the following holds with probability at least
1 − (n · d)−20
r
b 1 · k log d + (c1 + 1) log n .
p
T2 ≥ −C ♣∥∆∥ (E.10)
n
Combining (E.7), (E.9) and (E.10), we obtain
r s !
b S ∥1 − ∥ ∆
b S c ∥1 ) ≥ −∥∆∥
p k log d + (c1 + 1) log n log d + log n
λ(∥∆ ⋆
b 1 C ♣·γ + C♠ .
⋆
n n · |E|
| {z }
λ⋆
b S c ∥1 (λ − λ⋆ ) ≤ (λ + λ⋆ ) · ∥∆
This immediately implies ∥∆ b S ∥1 , then the following holds
⋆ ⋆
∥∆
b S c ∥1 ≤ 3∥∆
⋆
b S ∥1
⋆ (E.11)
provided λ ≥ 2λ⋆ . Given (E.11), we can apply the restricted strong convexity derived from Lemma E.6 with
α = 3 and combine (E.7), which yields
1 b⊤b b b 1 ≥ κ ∥∆∥
b S ∥1 − ∥∆
λ(∥∆ ⋆
b S c ∥1 ) ≥ ∆ Σ∆ − λ⋆ ∥∆∥ b 22 − λ⋆ ∥∆∥
b 1
⋆
2 4
with probability over 1 − 3 exp(−cn/σx4 ) ≥ 1 − (n · d)−20 . This further implies
(a) (b) p
b 22 ≤ 4λ⋆ ∥∆∥
κ∥∆∥ b S ∥1 ≤ 12λ∥∆
b 1 + 4λ∥∆ b S ∥1 ≤ 12λ |S⋆ |∥∆∥
b 2.
⋆ ⋆
12λ p
∥∆∥
b 2≤ |S⋆ |.
κ
It remains to control the sup-norm term used in (E.9). Each coordinate displayed above is the recentered average of mean-zero independent random variables, each of which is the product of two sub-Gaussian variables. By Condition 3.4, the product of the sub-Gaussian parameters of $Y^{(e)} - (X^{(e)})^\top\beta^{k,\gamma}$ and $X_j^{(e)}$ is no more than
\[
C\Big(\sigma_y + \sigma_x\max_{e\in\mathcal{E}}\big\|(\Sigma^{(e)})^{1/2}\beta^{k,\gamma}\big\|_2\Big)\,\sigma_x\max_{e\in\mathcal{E},\,j\in[d]}\sqrt{\Sigma_{jj}^{(e)}} \overset{(a)}{\leq} C\big(\sigma_y + \sqrt{b}\,\sigma_x\|\Sigma^{1/2}\beta^{k,\gamma}\|_2\big)\sigma_x\sqrt{b} \overset{(b)}{\leq} C\big(\sigma_y + \sqrt{b}\,\sigma_x\sigma_y\big)\sqrt{b}\,\sigma_x \leq 2Cb\sigma_x^2\sigma_y.
\]
Here (a) follows from Condition 3.4; and (b) follows from Lemma E.3. Consequently, it follows from the concentration inequality in Lemma E.2 that
\[
\forall j\in[d],\quad \bigg|\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\Big\{\widehat{\mathbb{E}}\big[(Y^{(e)} - (X^{(e)})^\top\beta^{k,\gamma})X_j^{(e)}\big] - \mathbb{E}\big[(Y^{(e)} - (X^{(e)})^\top\beta^{k,\gamma})X_j^{(e)}\big]\Big\}\bigg| \leq 2Cb\sigma_x^2\sigma_y\big(\sqrt{\zeta(1,t)} + \zeta(1,t)\big).
\]
Moreover,
\[
\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathbb{E}\big[X_S^{(e)}R^{(e,S)}\big] = 0. \tag{E.16}
\]
Proposition E.1 is a deterministic result once the following high-probability events are in force.

Lemma E.8. Suppose Condition 3.4 holds. Then there exist universal constants $C_1, C_2 > 0$ such that each of the two events
\[
\mathcal{A}_1(s,t) = \bigg\{\forall e\in\mathcal{E},\ S\subseteq[d] \text{ with } |S|\leq s:\ \Big\|(\Sigma_S^{(e)})^{-1/2}\big(\widehat{\mathbb{E}}[U^{(e,S)}] - \mathbb{E}[U^{(e,S)}]\big)\Big\|_2 \leq C_1\sqrt{b}\,\sigma_x^2\sigma_y\big(\sqrt{\rho(s,t)} + \rho(s,t)\big)\bigg\}, \tag{E.17}
\]
\[
\mathcal{A}_2(s,t) = \bigg\{\forall e\in\mathcal{E},\ S\subseteq[d] \text{ with } |S|\leq s:\ \Big\|(\Sigma_S^{(e)})^{-1/2}\widehat{\Sigma}_S^{(e)}(\Sigma_S^{(e)})^{-1/2} - I\Big\|_2 \leq C_2\sigma_x^2\big(\sqrt{\rho(s,t)} + \rho(s,t)\big)\bigg\}
\]
holds with probability at least $1 - e^{-t}$.
provided $\widehat{\Sigma}_S^{(e)} \succ 0$ for any $e\in\mathcal{E}$. We claim that $\widehat{q}_S(a)$ is minimized by
\[
a = \widehat{\beta}^{(S)} = \beta_S^{(S)} + (\widehat{\Sigma}_S)^{-1}\bigg\{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\widehat{\mathbb{E}}\big[R^{(e,S)}X_S^{(e)}\big]\bigg\} = \beta_S^{(S)} + (\widehat{\Sigma}_S)^{-1}\bigg\{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\widehat{\mathbb{E}}\big[U^{(e,S)}\big]\bigg\}.
\]
For $\widehat{T}_1$, we use the following decomposition:
\[
\widehat{T}_1 = \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\Big((\Sigma_S^{(e)})^{-1/2}\big\{\widehat{\mathbb{E}}[U^{(e,S)}] - \mathbb{E}[U^{(e,S)}] + \mathbb{E}[U^{(e,S)}]\big\}\Big)^\top\Big((\Sigma_S^{(e)})^{-1/2}\widehat{\Sigma}_S^{(e)}(\Sigma_S^{(e)})^{-1/2}\Big)^{-1}\Big((\Sigma_S^{(e)})^{-1/2}\big\{\widehat{\mathbb{E}}[U^{(e,S)}] - \mathbb{E}[U^{(e,S)}] + \mathbb{E}[U^{(e,S)}]\big\}\Big).
\]
Write $\Delta_1^{(e)} := (\Sigma_S^{(e)})^{-1/2}\big(\widehat{\mathbb{E}}[U^{(e,S)}] - \mathbb{E}[U^{(e,S)}]\big)$ and $\Delta_2^{(e)} := (\Sigma_S^{(e)})^{-1/2}\widehat{\Sigma}_S^{(e)}(\Sigma_S^{(e)})^{-1/2} - I$. Using the facts that $1/(1-x) \leq 1+2x$ and $1/(1+x) \geq 1-2x$ when $x\in[0,0.5]$, we have
\[
\big\|(\Delta_2^{(e)} + I)^{-1} - I\big\|_2 \leq 2\|\Delta_2^{(e)}\|_2 \quad\text{and}\quad \big\|(\Delta_2^{(e)} + I)^{-1}\big\|_2 \leq 1.5. \tag{E.19}
\]
Therefore, it follows from the triangle inequality and the Cauchy--Schwarz inequality that
\begin{align*}
\big|\widehat{T}_1 - v(S)\big| &= \bigg|\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\Big\{\Delta_1^{(e)} + (\Sigma_S^{(e)})^{-1/2}\mathbb{E}[U^{(e,S)}]\Big\}^\top(\Delta_2^{(e)} + I)^{-1}\Big\{\Delta_1^{(e)} + (\Sigma_S^{(e)})^{-1/2}\mathbb{E}[U^{(e,S)}]\Big\} - v(S)\bigg| \\
&\leq \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}2\|\Delta_1^{(e)}\|_2\,\big\|(\Delta_2^{(e)}+I)^{-1}\big\|_2\,\big\|(\Sigma_S^{(e)})^{-1/2}\mathbb{E}[U^{(e,S)}]\big\|_2 + \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\|\Delta_1^{(e)}\|_2^2\,\big\|(\Delta_2^{(e)}+I)^{-1}\big\|_2 \\
&\qquad + \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\big\|(\Delta_2^{(e)}+I)^{-1} - I\big\|_2\,\big\|(\Sigma_S^{(e)})^{-1/2}\mathbb{E}[U^{(e,S)}]\big\|_2^2 \\
&\overset{(a)}{\leq} 3\sqrt{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\big\|(\Sigma_S^{(e)})^{-1/2}\mathbb{E}[U^{(e,S)}]\big\|_2^2}\,\sqrt{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\|\Delta_1^{(e)}\|_2^2} + 1.5\,\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\|\Delta_1^{(e)}\|_2^2 \\
&\qquad + \sup_{e\in\mathcal{E}}2\|\Delta_2^{(e)}\|_2\cdot\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\big\|(\Sigma_S^{(e)})^{-1/2}\mathbb{E}[U^{(e,S)}]\big\|_2^2 \\
&\overset{(b)}{\leq} 3\sqrt{v(S)\cdot C_1^2\sigma_x^4\sigma_y^2b\,\rho(k,t)} + 1.5\,C_1^2\sigma_x^4\sigma_y^2b\,\rho(k,t) + 2C_2\sigma_x^2\sqrt{\rho(k,t)}\cdot v(S) \\
&\overset{(c)}{\leq} 4\sqrt{v(S)\cdot(C_1^2 + C_2^2)\sigma_x^4\sigma_y^2b\cdot\rho(k,t)} + 1.5\,C_1^2\sigma_x^4\sigma_y^2b\,\rho(k,t),
\end{align*}
where (a) follows from the inequalities in (E.19) and the Cauchy--Schwarz inequality, (b) follows from (E.17), and (c) follows from the fact that
\begin{align*}
\big\|(\Sigma_S^{(e)})^{-1/2}\mathbb{E}[U^{(e,S)}]\big\|_2 &\leq \big\|(\Sigma_S^{(e)})^{-1/2}\mathbb{E}[X_S^{(e)}Y^{(e)}]\big\|_2 + \big\|(\Sigma_S^{(e)})^{1/2}\beta^{(S)}\big\|_2 \\
&\overset{(d)}{\leq} \sqrt{\big(\mathbb{E}[X_S^{(e)}Y^{(e)}]\big)^\top(\Sigma_S^{(e)})^{-1}\big(\mathbb{E}[X_S^{(e)}Y^{(e)}]\big)} + \big\|(\Sigma_S^{(e)})^{1/2}\Sigma_S^{-1/2}\big\|_2\,\big\|\Sigma_S^{1/2}\beta^{(S)}\big\|_2 \\
&\leq \sigma_y + \sqrt{b}\,\sigma_y \leq 2\sqrt{b}\,\sigma_y,
\end{align*}
which further implies that
\[
v(S) = \sqrt{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\big\|(\Sigma_S^{(e)})^{-1/2}\mathbb{E}[U^{(e,S)}]\big\|_2^2\cdot v(S)} \leq \sqrt{4b\sigma_y^2\cdot v(S)}.
\]
Here (d) follows from the fact that the covariance matrix of $[X_S^{(e)}, Y^{(e)}]$ is positive semi-definite, so that the Schur complement satisfies
\[
\sigma_y^2 - \big(\mathbb{E}[X_S^{(e)}Y^{(e)}]\big)^\top(\Sigma_S^{(e)})^{-1}\big(\mathbb{E}[X_S^{(e)}Y^{(e)}]\big) \geq 0,
\]
and a similar argument applied to the covariance matrix of the mixture distribution $[X_S, Y] \sim \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mu^{(e)}_{(x_S,y)}$.
For $\widehat{T}_2$, observe that $\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathbb{E}[X_S^{(e)}R^{(e,S)}] = 0$; then, following (E.18), (E.19) and the fact that $b\sigma_x^4\zeta(k,t) \leq \sigma_x^4\rho(k,t) \leq 1$ since $b\leq|\mathcal{E}|$ by Condition 3.4,
\[
|\widehat{T}_2| \leq \bigg\|(\Sigma_S)^{-1/2}\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\big(\widehat{\mathbb{E}}[U^{(e,S)}] - \mathbb{E}[U^{(e,S)}]\big)\bigg\|_2^2\cdot\Big\|\big(\Sigma_S^{-1/2}\widehat{\Sigma}_S\Sigma_S^{-1/2}\big)^{-1}\Big\|_2 \leq 1.5\big(C_2^2\sigma_x^2\sigma_y\big)^2\big(\sqrt{b\zeta(s,t)} + b\zeta(s,t)\big)^2 \leq C_3\sigma_x^4\sigma_y^2\rho(k,t).
\]
Putting all the pieces together, we can conclude that
\[
\big|\widehat{T}_1 + \widehat{T}_2 - v(S)\big| \leq \big|\widehat{T}_1 - v(S)\big| + \big|\widehat{T}_2\big| \leq C_4\Big(b\sigma_x^4\sigma_y^2\rho(k,t) + \sqrt{v(S)\,b\sigma_x^4\sigma_y^2\rho(k,t)}\Big).
\]
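Before moving to the proof of Lemma E.7, it may help to see the quantity $v(S)$ numerically. The sketch below is a hedged illustration, assuming the representation of $v(S)$ used in the next subsection (the average, over environments, of the $\Sigma_S^{(e)}$-weighted distance between the per-environment least-squares coefficients and the pooled coefficient); the two-environment simulation design is invented for illustration and is not from the paper.

```python
# Hedged sketch: plug-in estimate of v(S) for an invariant set (coefficient on
# X1 stable across environments) versus a spurious set (X2's coefficient shifts).
import numpy as np

rng = np.random.default_rng(1)
n = 20000

def sample(env_coef, n):
    X = rng.standard_normal((n, 2))
    Y = 1.0 * X[:, 0] + env_coef * X[:, 1] + 0.5 * rng.standard_normal(n)
    return X, Y

def v_of_S(datasets, S):
    Sig_e, b_e = [], []
    for X, Y in datasets:
        XS = X[:, S]
        Sig = XS.T @ XS / len(Y)            # per-environment Sigma_S^{(e)}
        u = XS.T @ Y / len(Y)
        Sig_e.append(Sig)
        b_e.append(np.linalg.solve(Sig, u))  # per-environment beta^{(e,S)}
    Sig_bar = sum(Sig_e) / len(Sig_e)
    u_bar = sum(S_ @ b_ for S_, b_ in zip(Sig_e, b_e)) / len(Sig_e)
    b_pool = np.linalg.solve(Sig_bar, u_bar)  # pooled beta^{(S)}
    return np.mean([(b - b_pool) @ Sig @ (b - b_pool) for Sig, b in zip(Sig_e, b_e)])

data = [sample(0.5, n), sample(-0.5, n)]
print("v(S) for the invariant set {X1}:", v_of_S(data, [0]))      # close to 0
print("v(S) for the spurious set {X1, X2}:", v_of_S(data, [0, 1]))  # ~0.25
```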
E.6 Proof of Lemma E.7
It follows from the identity $\beta_S^{(e,S)} = (\Sigma_S^{(e)})^{-1}\mathbb{E}[Y^{(e)}X_S^{(e)}]$ that
\begin{align*}
v(S) &= \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\big(\beta_S^{(e,S)} - \beta_S^{(S)}\big)^\top\Sigma_S^{(e)}\big(\beta_S^{(e,S)} - \beta_S^{(S)}\big) \\
&= \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\Big\{\mathbb{E}[Y^{(e)}X_S^{(e)}]^\top(\Sigma_S^{(e)})^{-1}\mathbb{E}[Y^{(e)}X_S^{(e)}] - 2\,\mathbb{E}[Y^{(e)}X_S^{(e)}]^\top\beta_S^{(S)} + (\beta_S^{(S)})^\top\Sigma_S^{(e)}\beta_S^{(S)}\Big\} \\
&= \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\Big\|(\Sigma_S^{(e)})^{-1/2}\big(\mathbb{E}[Y^{(e)}X_S^{(e)}] - \Sigma_S^{(e)}\beta_S^{(S)}\big)\Big\|_2^2.
\end{align*}
At the same time, for any $a\in\mathbb{R}^{|S|}$, plugging $Y^{(e)} = (\beta_S^{(S)})^\top X_S^{(e)} + R^{(e,S)}$ in gives
\begin{align*}
q_S(a) &= \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\Big\|(\Sigma_S^{(e)})^{-1/2}\big(\mathbb{E}[Y^{(e)}X_S^{(e)}] - \Sigma_S^{(e)}a\big)\Big\|_2^2 \\
&= \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\Big\|(\Sigma_S^{(e)})^{-1/2}\big(\mathbb{E}[R^{(e,S)}X_S^{(e)}] - \Sigma_S^{(e)}(a - \beta_S^{(S)})\big)\Big\|_2^2 \\
&= (a - \beta_S^{(S)})^\top\Sigma_S(a - \beta_S^{(S)}) - 2(a - \beta_S^{(S)})^\top\bigg\{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathbb{E}[R^{(e,S)}X_S^{(e)}]\bigg\} + \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\Big\|(\Sigma_S^{(e)})^{-1/2}\mathbb{E}\big[X_S^{(e)}R^{(e,S)}\big]\Big\|_2^2.
\end{align*}
It follows from the definition of $R^{(e,S)}$ and the definition of $\beta^{(S)}$ that
\[
\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathbb{E}\big[X_S^{(e)}R^{(e,S)}\big] = \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathbb{E}\big[X_S^{(e)}\big(Y^{(e)} - (\beta_S^{(S)})^\top X_S^{(e)}\big)\big] = \frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathbb{E}\big[X_S^{(e)}Y^{(e)}\big] - \Sigma_S\beta_S^{(S)} = 0.
\]
This verifies (E.16). Therefore $a_\star = \beta_S^{(S)}$ attains the global minimum of $q_S(a)$, which verifies (E.14) and (E.15).
At the same time, for fixed $e$ and $S$, denote $\xi = (\Sigma_S^{(e)})^{-1/2}\big(\widehat{\mathbb{E}}[U^{(e,S)}] - \mathbb{E}[U^{(e,S)}]\big)$. It follows from the variational representation of the $\ell_2$ norm that
\[
\|\xi\|_2 = \sup_{w\in\mathcal{B}_S}w_S^\top\xi \leq \sup_{\ell\in[N_S]}\big(w_S^{(S,\ell)}\big)^\top\xi + \sup_{w\in\mathcal{B}_S}\big(w_S - w_S^{(S,\pi(w))}\big)^\top\xi \leq \sup_{\ell\in[N_S]}\big(w_S^{(S,\ell)}\big)^\top\xi + \frac{1}{4}\|\xi\|_2,
\]
where the last inequality follows from the Cauchy--Schwarz inequality and our construction of the covering in (E.20). This implies $\|\xi\|_2 \leq 2\sup_{\ell\in[N_S]}(w_S^{(S,\ell)})^\top\xi$, and thus
\[
\sup_{e\in\mathcal{E}}\sup_{|S|\leq s}\Big\|(\Sigma_S^{(e)})^{-1/2}\big(\widehat{\mathbb{E}}[U^{(e,S)}] - \mathbb{E}[U^{(e,S)}]\big)\Big\|_2 \leq 2\sup_{e\in\mathcal{E},|S|\leq s,\ell\in[N_S]}\underbrace{\big(w_S^{(S,\ell)}\big)^\top(\Sigma_S^{(e)})^{-1/2}\bigg\{\frac{1}{n}\sum_{i=1}^n X_{i,S}^{(e)}R_i^{(e,S)} - \mathbb{E}\big[X_S^{(e)}R^{(e,S)}\big]\bigg\}}_{Z_1(e,S,\ell)}. \tag{E.22}
\]
Note that for fixed $e$, $S$ and $\ell$, $Z_1(e,S,\ell)$ is the recentered average of independent random variables, each of which is the product of two sub-Gaussian variables. By Condition 3.4, $(w_S^{(S,\ell)})^\top(\Sigma_S^{(e)})^{-1/2}X_S^{(e)}$ has sub-Gaussian parameter at most $\sigma_x$, and the sub-Gaussian parameter of $R^{(e,S)} := Y^{(e)} - (\beta_S^{(S)})^\top X_S^{(e)}$ is no more than
\begin{align*}
\sigma_y + \sigma_x\big\|(\Sigma_S^{(e)})^{1/2}\beta^{(S)}\big\|_2 &\overset{(a)}{\leq} \sigma_y + \sigma_x\big\|(\Sigma_S^{(e)})^{1/2}\Sigma_S^{-1/2}\big\|_2\,\bigg\|\Sigma_S^{-1/2}\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathbb{E}\big[X_S^{(e)}Y^{(e)}\big]\bigg\|_2 \\
&\overset{(b)}{\leq} \sigma_y + \sigma_x\big\|(\Sigma_S^{(e)})^{1/2}\Sigma_S^{-1/2}\big\|_2\,\sqrt{\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathbb{E}\big[(Y^{(e)})^2\big]} \tag{E.23} \\
&\overset{(c)}{\leq} \sigma_y + \sigma_x\sqrt{b}\,\sigma_y.
\end{align*}
Here (a) follows from the property of the operator norm and the definition of $\beta^{(S)}$; (b) follows from the Cauchy--Schwarz inequality; and (c) follows from Condition 3.4. Therefore, $(w_S^{(S,\ell)})^\top(\Sigma_S^{(e)})^{-1/2}X_S^{(e)}R^{(e,S)}$ is the product of two sub-Gaussian variables with parameters no more than $\sigma_x$ and $\sigma_y + \sigma_x\sqrt{b}\,\sigma_y$, respectively. Then it follows from the tail bound for sub-exponential random variables that
\[
\forall e\in\mathcal{E},\ |S|\leq s,\ \ell\in[N_S],\quad \mathbb{P}\bigg[|Z_1(e,S,\ell)| \geq C'b^{1/2}\sigma_x^2\sigma_y\bigg(\sqrt{\frac{u}{n}} + \frac{u}{n}\bigg)\bigg] \leq 2e^{-u}, \quad\forall u>0.
\]
Letting $u = t + \log(2N|\mathcal{E}|) \leq 6\big(t + s\log(4d/s) + \log(|\mathcal{E}|)\big)$, we obtain
\[
\mathbb{P}\bigg[\sup_{e\in\mathcal{E},|S|\leq s,\ell\in[N_S]}|Z_1(e,S,\ell)| \geq 6C'b^{1/2}\sigma_x^2\sigma_y\big(\sqrt{\rho(s,t)} + \rho(s,t)\big)\bigg] \leq N|\mathcal{E}|\times 2e^{-\log(2N|\mathcal{E}|)-t} \leq e^{-t}.
\]
We next turn to the event $\mathcal{A}_2(s,t)$. For any symmetric matrix $Q\in\mathbb{R}^{d\times d}$ and any $S$ with $|S|\leq s$, the variational representation of the operator norm together with the covering construction gives
\[
\|Q_S\|_2 \leq \sup_{\ell\in[N_S]}\big(w_S^{(S,\ell)}\big)^\top Q_S\big(w_S^{(S,\ell)}\big) + \frac{1}{2}\|Q_S\|_2 + \frac{1}{16}\|Q_S\|_2,
\]
which implies $\|Q_S\|_2 \leq 3\sup_{\ell\in[N_S]}(w_S^{(S,\ell)})^\top Q_S(w_S^{(S,\ell)})$, and thus
\[
\sup_{e\in\mathcal{E},|S|\leq s}\Big\|(\Sigma_S^{(e)})^{-1/2}\widehat{\Sigma}_S^{(e)}(\Sigma_S^{(e)})^{-1/2} - I\Big\|_2 \leq 3\sup_{e\in\mathcal{E},|S|\leq s,\ell\in[N_S]}\underbrace{\big(w_S^{(S,\ell)}\big)^\top\Big[(\Sigma_S^{(e)})^{-1/2}\widehat{\Sigma}_S^{(e)}(\Sigma_S^{(e)})^{-1/2} - I\Big]\big(w_S^{(S,\ell)}\big)}_{Z_2(e,S,\ell)}. \tag{E.24}
\]
Note that for fixed $e$, $S$ and $\ell$, $Z_2(e,S,\ell)$ is the recentered average of independent random variables, each of which is the square of the sub-Gaussian variable $(w_S^{(S,\ell)})^\top(\Sigma_S^{(e)})^{-1/2}X_S^{(e)}$, whose parameter is at most $\sigma_x$ by Condition 3.4. Then it follows from the tail bound for sub-exponential random variables that
\[
\forall e\in\mathcal{E},\ |S|\leq s,\ \ell\in[N_S],\quad \mathbb{P}\bigg[|Z_2(e,S,\ell)| \geq C'\sigma_x^2\bigg(\sqrt{\frac{u}{n}} + \frac{u}{n}\bigg)\bigg] \leq 2e^{-u}, \quad\forall u>0.
\]
Combining this with the argument leading to (E.24) concludes the proof of the claim with $C_2 = 18C'$.
By the same covering argument, we also have
\[
\sup_{|S|\leq s}\bigg\|(\Sigma_S)^{-1/2}\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\big(\widehat{\mathbb{E}}[U^{(e,S)}] - \mathbb{E}[U^{(e,S)}]\big)\bigg\|_2 \leq 2\sup_{|S|\leq s,\ell\in[N_S]}\underbrace{\big(w_S^{(S,\ell)}\big)^\top(\Sigma_S)^{-1/2}\bigg\{\frac{1}{n\cdot|\mathcal{E}|}\sum_{e\in\mathcal{E}}\sum_{i=1}^n\Big(X_{i,S}^{(e)}R_i^{(e,S)} - \mathbb{E}\big[X_S^{(e)}R^{(e,S)}\big]\Big)\bigg\}}_{Z_3(S,\ell)}. \tag{E.25}
\]
Note that, by Condition 3.4, for fixed $e$, $S$ and $\ell$, $(w_S^{(S,\ell)})^\top(\Sigma_S)^{-1/2}X_S^{(e)}$ is a sub-Gaussian variable with parameter
\[
\sigma_{e,S,\ell} = \Big(\big(w_S^{(S,\ell)}\big)^\top(\Sigma_S)^{-1/2}\Sigma_S^{(e)}(\Sigma_S)^{-1/2}w_S^{(S,\ell)}\Big)^{1/2}\sigma_x,
\]
which satisfies
\begin{align*}
\bigg(\sum_{e\in\mathcal{E}}\sum_{i=1}^n(\sigma_{e,S,\ell})^2\bigg)^{1/2} &= \bigg(n\sum_{e\in\mathcal{E}}(\sigma_{e,S,\ell})^2\bigg)^{1/2} \\
&= \bigg(n\sum_{e\in\mathcal{E}}\big(w_S^{(S,\ell)}\big)^\top(\Sigma_S)^{-1/2}\Sigma_S^{(e)}(\Sigma_S)^{-1/2}w_S^{(S,\ell)}\bigg)^{1/2}\sigma_x \\
&= \bigg(n\cdot\big(w_S^{(S,\ell)}\big)^\top(\Sigma_S)^{-1/2}\Big(\sum_{e\in\mathcal{E}}\Sigma_S^{(e)}\Big)(\Sigma_S)^{-1/2}w_S^{(S,\ell)}\bigg)^{1/2}\sigma_x \\
&= (n\cdot|\mathcal{E}|)^{1/2}\sigma_x.
\end{align*}
Also, from Condition 3.4, we have
\[
\forall e\in\mathcal{E},\ |S|\leq s,\ \ell\in[N_S],\quad \sigma_{e,S,\ell} = \Big(\big(w_S^{(S,\ell)}\big)^\top(\Sigma_S)^{-1/2}\Sigma_S^{(e)}(\Sigma_S)^{-1/2}w_S^{(S,\ell)}\Big)^{1/2}\sigma_x \leq \sqrt{b\cdot\big(w_S^{(S,\ell)}\big)^\top w_S^{(S,\ell)}}\,\sigma_x = \sqrt{b}\cdot\sigma_x. \tag{E.26}
\]
Meanwhile, for fixed $e$ and $S$, $R^{(e,S)}$ is a sub-Gaussian variable with parameter $\sigma_y(1 + \sigma_x\sqrt{b})$, as obtained in (E.23). Thus $Z_3(S,\ell)$ is the recentered average of independent random variables, each of which is the product of two sub-Gaussian variables with parameters $\sigma_{e,S,\ell}$ and $\sigma_y(1 + \sigma_x\sqrt{b})$. Then it follows from the tail bound for sub-exponential random variables that
\[
\forall |S|\leq s,\ \ell\in[N_S],\quad \mathbb{P}\bigg[|Z_3(S,\ell)| \geq C'\sqrt{b}\,\sigma_x^2\sigma_y\bigg(\sqrt{\frac{u}{n\cdot|\mathcal{E}|}} + \sqrt{b}\,\frac{u}{n\cdot|\mathcal{E}|}\bigg)\bigg] \leq 2e^{-u}, \quad\forall u>0.
\]
Combining this with the argument leading to (E.25) concludes the proof of the claim with $C_1 = 12C'$.
High probability error bound in $\mathcal{A}_4(k,t)$. Recall that in the proof of Lemma E.8 we showed that, for any symmetric matrix $Q\in\mathbb{R}^{d\times d}$, $\|Q_S\|_2 \leq 3\sup_{\ell\in[N_S]}(w_S^{(S,\ell)})^\top Q_S\,w_S^{(S,\ell)}$ by the variational representation of the operator norm. This immediately yields
\[
\sup_{|S|\leq s}\Big\|(\Sigma_S)^{-1/2}\widehat{\Sigma}_S(\Sigma_S)^{-1/2} - I\Big\|_2 \leq 3\sup_{|S|\leq s,\ell\in[N_S]}\underbrace{\big(w_S^{(S,\ell)}\big)^\top\Big[(\Sigma_S)^{-1/2}\widehat{\Sigma}_S(\Sigma_S)^{-1/2} - I\Big]\big(w_S^{(S,\ell)}\big)}_{Z_4(S,\ell)}. \tag{E.27}
\]
Note that $Z_4(S,\ell)$ is the recentered average of independent random variables, each of which is the square of a sub-Gaussian variable with parameter $\sigma_{e,S,\ell} = \big((w_S^{(S,\ell)})^\top(\Sigma_S)^{-1/2}\Sigma_S^{(e)}(\Sigma_S)^{-1/2}w_S^{(S,\ell)}\big)^{1/2}\sigma_x$. We have $\sigma_{e,S,\ell} \leq \sqrt{b}\cdot\sigma_x$,
as obtained in (E.26), and
\begin{align*}
\bigg(\sum_{e\in\mathcal{E}}\sum_{i=1}^n(\sigma_{e,S,\ell})^4\bigg)^{1/2} &= \bigg(n\sum_{e\in\mathcal{E}}(\sigma_{e,S,\ell})^4\bigg)^{1/2} \\
&\leq \sqrt{n}\cdot\Big(\max_{e,S,\ell}\sigma_{e,S,\ell}\Big)\bigg(\sum_{e\in\mathcal{E}}\sigma_{e,S,\ell}^2\bigg)^{1/2} \\
&\leq \sqrt{n}\cdot\sqrt{b}\cdot\sigma_x\bigg(\sum_{e\in\mathcal{E}}\sigma_{e,S,\ell}^2\bigg)^{1/2} \\
&= \sqrt{n}\cdot\sqrt{b}\cdot\sigma_x\bigg(\sum_{e\in\mathcal{E}}\big(w_S^{(S,\ell)}\big)^\top(\Sigma_S)^{-1/2}\Sigma_S^{(e)}(\Sigma_S)^{-1/2}w_S^{(S,\ell)}\bigg)^{1/2}\sigma_x \\
&= \sqrt{n}\cdot\sqrt{b}\cdot\sigma_x\bigg(\big(w_S^{(S,\ell)}\big)^\top(\Sigma_S)^{-1/2}\Big(\sum_{e\in\mathcal{E}}\Sigma_S^{(e)}\Big)(\Sigma_S)^{-1/2}w_S^{(S,\ell)}\bigg)^{1/2}\sigma_x \\
&= \sqrt{n}\cdot\sqrt{b}\cdot\sqrt{|\mathcal{E}|}\cdot\sigma_x^2.
\end{align*}
Then it follows from the tail bound for sub-exponential random variables that
\[
\forall |S|\leq s,\ \ell\in[N_S],\quad \mathbb{P}\bigg[|Z_4(S,\ell)| \geq C'\sigma_x^2\bigg(\sqrt{b\,\frac{u}{n\cdot|\mathcal{E}|}} + b\,\frac{u}{n\cdot|\mathcal{E}|}\bigg)\bigg] \leq 2e^{-u}, \quad\forall u>0.
\]
Letting $u = t + \log(2N) \leq 6\big(t + s\log(ed/s)\big)$, we obtain
\[
\mathbb{P}\bigg[\sup_{|S|\leq s,\ell\in[N_S]}|Z_4(S,\ell)| \geq 6C'\sigma_x^2\big(\sqrt{b\zeta(s,t)} + b\zeta(s,t)\big)\bigg] \leq N\times 2e^{-\log(2N)-t} \leq e^{-t}.
\]
Combining with the argument (E.27) concludes the proof of the claim with $C_1 = 18C'$.
Throughout this proof, for $v\in\mathbb{R}^d$ we write
\[
W_v := \bigg(\frac{1}{n\cdot|\mathcal{E}|}\sum_{i\in[n],e\in\mathcal{E}}\big(v^\top X_i^{(e)}\big)^2\bigg)^{1/2} \quad\text{and}\quad Z_v := W_v^2 - v^\top\Sigma v,
\]
respectively. Given any fixed $\alpha>0$, let $s = c(1+\alpha)^{-2}\sigma_x^{-4}\kappa\cdot n|\mathcal{E}|/(b\cdot\log d)$, where $c$ is a universal constant. We also define the set
\[
\Theta = \Theta_{s,\alpha} := \bigcup_{S\subseteq[d],|S|\leq s}\big\{\theta\in\mathbb{R}^d : \|\theta_{S^c}\|_1 \leq \alpha\|\theta_S\|_1\big\}
\]
and abbreviate it as $\Theta$, since our analysis focuses on a fixed pair $(\alpha, s(\alpha))$. Note that $\Theta$ is a cone, in the sense that for any $\theta\in\Theta$ and $t>0$ we also have $t\cdot\theta\in\Theta$, and note that the result we want to prove is quadratic in $\theta$ on both sides. Therefore it suffices to consider $\{v\in\Theta : \|v\|_\Sigma = 1\}$, and we define the set
\[
\mathcal{B} := \mathcal{B}_{s,\alpha} := \Theta\cap\{\theta\in\mathbb{R}^d : \|\theta\|_\Sigma = 1\} = \bigcup_{S\subseteq[d],|S|\leq s}\big\{\theta\in\mathbb{R}^d : \|\theta_{S^c}\|_1 \leq \alpha\|\theta_S\|_1,\ \|\theta\|_\Sigma = 1\big\}
\]
and abbreviate it as $\mathcal{B}$. We also define the following metric on $\mathbb{R}^d$:
\[
d(v,v') = \sigma_x\|v - v'\|_\Sigma,
\]
and simply let $d(v,T) = \inf_{a\in T}d(v,a)$ for a set $T$. It suffices to show that there exists some universal constant $C$ such that
\[
\mathbb{P}\Big[\inf_{v\in\mathcal{B}}Z_v + 1 \geq \frac{1}{2}\Big] \geq 1 - 3\exp\big(-\widetilde{n}/(C\sigma_x)^4\big),
\]
where
\[
\widetilde{n} := \frac{n\cdot|\mathcal{E}|}{b}.
\]
It is obvious that $\widetilde{n}\geq n$ follows from $b\leq|\mathcal{E}|$ derived in (3.9). Our proof is divided into three steps.

In the first step, we establish concentration inequalities for any fixed $v$ and $v'$. To be specific, we show that for some universal constant $C>0$ the following holds: for any $t>0$,
\[
\mathbb{P}\bigg[|Z_v - Z_{v'}| > Cd(v,-v')d(v,v')\bigg(\sqrt{\frac{t}{\widetilde{n}}} + \frac{t}{\widetilde{n}}\bigg)\bigg] \leq 2e^{-t}, \quad\forall v,v'\in\mathbb{R}^d; \tag{E.28}
\]
\[
\mathbb{P}\bigg[|Z_v| > C\sigma_x^2\bigg(\sqrt{\frac{t}{\widetilde{n}}} + \frac{t}{\widetilde{n}}\bigg)\bigg] \leq 2e^{-t}, \quad\forall v\in\mathcal{B}; \tag{E.29}
\]
\[
\mathbb{P}\bigg[W_v > Cd(v,0)\bigg(\sqrt{\frac{t}{\widetilde{n}}} + 1\bigg)\bigg] \leq 2e^{-t}, \quad\forall v\in\mathbb{R}^d. \tag{E.30}
\]
In the second step, we establish an upper bound on Talagrand's $\gamma_2$ functional (Vershynin, 2018) of $\Theta$, which is defined as
\[
\gamma_2(\Theta, d) := \inf_{\{\mathcal{B}_k\}_{k=0}^\infty : |\mathcal{B}_0|=1,\,|\mathcal{B}_k|\leq 2^{2^k}}\ \sup_{v\in\Theta}\ \sum_{k=0}^\infty 2^{k/2}d(v,\mathcal{B}_k). \tag{E.31}
\]
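Since $\gamma_2$ is defined as an infimum over admissible sequences, any particular admissible sequence $\{\mathcal{B}_k\}$ yields an upper bound on $\gamma_2(T,d)$. The sketch below illustrates the definition (E.31) on a small synthetic point cloud with Euclidean distance, building $\mathcal{B}_k$ greedily by farthest-point selection; it is a toy illustration of the definition, not a computation used anywhere in the paper.

```python
# Toy illustration of the gamma_2 definition: evaluate the chaining sum
# sum_k 2^(k/2) d(v, B_k) for one admissible sequence {B_k}, |B_k| <= 2^(2^k),
# which upper bounds gamma_2 of the point cloud under the Euclidean metric.
import numpy as np

rng = np.random.default_rng(4)
T = rng.standard_normal((200, 2))   # the index set (200 points in the plane)

def farthest_point_subset(points, m):
    idx = [0]
    for _ in range(m - 1):
        d_to_set = np.min(np.linalg.norm(points[:, None] - points[idx][None], axis=-1), axis=1)
        idx.append(int(np.argmax(d_to_set)))
    return points[idx]

chaining_sum = np.zeros(len(T))
for k in range(6):                  # for k >= 3, |B_k| = |T| here and the term vanishes
    Bk = farthest_point_subset(T, min(2 ** (2 ** k), len(T)))
    d_vBk = np.min(np.linalg.norm(T[:, None] - Bk[None], axis=-1), axis=1)
    chaining_sum += 2 ** (k / 2) * d_vBk
print("upper bound on gamma_2(T, ||.||_2):", chaining_sum.max())
```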
Step 1: Establish concentration inequalities for fixed $v$. In this step we prove the concentration inequalities (E.28), (E.29) and (E.30). For (E.28), it follows from the definition of $Z$ that
\begin{align*}
Z_v - Z_{v'} &= \frac{1}{n\cdot|\mathcal{E}|}\sum_{i\in[n],e\in\mathcal{E}}\Big[\big(v^\top X_i^{(e)}\big)^2 - \big((v')^\top X_i^{(e)}\big)^2\Big] - \big(v^\top\Sigma v - v'^\top\Sigma v'\big) \\
&= \frac{1}{n\cdot|\mathcal{E}|}\sum_{i\in[n],e\in\mathcal{E}}(v+v')^\top X_i^{(e)}\cdot(v-v')^\top X_i^{(e)} - \big(v^\top\Sigma v - v'^\top\Sigma v'\big).
\end{align*}
It is the recentered average of independent random variables, each of which is the product of two sub-Gaussian variables with parameters $\sigma_{e,v+v'}$ and $\sigma_{e,v-v'}$ satisfying
\begin{align*}
\sigma_{e,v+v'}\sigma_{e,v-v'} &\overset{(a)}{\leq} \big(\sigma_x\|v+v'\|_{\Sigma^{(e)}}\big)\cdot\big(\sigma_x\|v-v'\|_{\Sigma^{(e)}}\big) \overset{(b)}{\leq} \sqrt{b}\,d(v,-v')\cdot\sigma_x\|v-v'\|_{\Sigma^{(e)}} \overset{(c)}{\leq} b\,d(v,-v')d(v,v'); \\
\sum_{i\in[n],e\in\mathcal{E}}\big(\sigma_{e,v+v'}\sigma_{e,v-v'}\big)^2 &\overset{(d)}{\leq} b\,d(v,-v')^2\cdot\sigma_x^2\cdot\sum_{i\in[n],e\in\mathcal{E}}\|v-v'\|_{\Sigma^{(e)}}^2 \overset{(e)}{=} n|\mathcal{E}|\,b\,d(v,-v')^2\sigma_x^2\|v-v'\|_\Sigma^2 = n|\mathcal{E}|\,b\,d(v,-v')^2d(v,v')^2.
\end{align*}
The tail bound for sub-exponential random variables then yields (E.28) for some universal constant $C>0$. This completes the proof of (E.28).
(E.29) is a corollary of (E.28), obtained by taking $v'=0$ and noticing that $d(v,0) = \sigma_x$ for all $v\in\mathcal{B}$.

For (E.30), observe that $W_v^2 = Z_v + v^\top\Sigma v$. Combining this with (E.28), we can conclude that for all $v\in\mathbb{R}^d$ and $t>0$,
\begin{align*}
\mathbb{P}\bigg[W_v > Cd(v,0)\bigg(\sqrt{\frac{t}{\widetilde{n}}} + 1\bigg)\bigg] &= \mathbb{P}\bigg[W_v^2 > C^2d(v,0)^2\bigg(2\sqrt{\frac{t}{\widetilde{n}}} + \frac{t}{\widetilde{n}} + 1\bigg)\bigg] \\
&\overset{(a)}{\leq} \mathbb{P}\bigg[W_v^2 > Cd(v,0)^2\bigg(\sqrt{\frac{t}{\widetilde{n}}} + \frac{t}{\widetilde{n}}\bigg) + v^\top\Sigma v\bigg] \\
&= \mathbb{P}\bigg[Z_v > Cd(v,0)^2\bigg(\sqrt{\frac{t}{\widetilde{n}}} + \frac{t}{\widetilde{n}}\bigg)\bigg] \leq 2e^{-t}.
\end{align*}
Here in (a) we use the fact that $C>1$ and that $d(v,0)^2 = \sigma_x^2\,v^\top\Sigma v \geq v^\top\Sigma v$ since $\sigma_x\geq 1$.
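As a quick sanity check on the fixed-$v$ concentration inequality (E.29), the sketch below simulates $Z_v$ in a simplified setting (a single Gaussian environment with $\Sigma = I_d$, so $b = 1$ and $\widetilde{n} = n$) and compares the empirical tail probability at level $\mathrm{const}\cdot(\sqrt{t/\widetilde{n}} + t/\widetilde{n})$ with $2e^{-t}$; the constant $4$ playing the role of $C\sigma_x^2$ is an arbitrary illustrative value.

```python
# Monte Carlo check of the sub-exponential tail of Z_v = (1/n) sum_i (v'X_i)^2 - v'Sigma v
# in a single isotropic Gaussian environment (so ||v||_Sigma = ||v||_2 and n_tilde = n).
import numpy as np

rng = np.random.default_rng(2)
n, d, reps, t = 1000, 20, 2000, 5.0
v = rng.standard_normal(d)
v /= np.linalg.norm(v)                 # ||v||_Sigma = 1 since Sigma = I

Z = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, d))
    Z[r] = np.mean((X @ v) ** 2) - 1.0

threshold = 4.0 * (np.sqrt(t / n) + t / n)   # illustrative constant in place of C*sigma_x^2
print("empirical P(|Z_v| > threshold):", np.mean(np.abs(Z) > threshold),
      " vs  2*exp(-t) =", 2 * np.exp(-t))
```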
Step 2: Bounding the $\gamma_2$-functional. In this step we prove (E.32). We define another set $\mathcal{B}^\Sigma := \{\Sigma^{1/2}v : v\in\mathcal{B}\}$. Since $(\mathcal{B}, d)$ is isometric to $(\mathcal{B}^\Sigma, \sigma_x\|\cdot\|_2)$, $(\mathcal{B}^\Sigma, \sigma_x\|\cdot\|_2)$ is isometric to $(\sigma_x\mathcal{B}^\Sigma, \|\cdot\|_2)$, and the $\gamma_2$ functional is invariant under isometries, we have
\[
\gamma_2(\mathcal{B}, d) = \gamma_2\big(\sigma_x\mathcal{B}^\Sigma, \|\cdot\|_2\big). \tag{E.34}
\]
Also, the $\gamma_2$ functional respects scaling, in the sense that
\[
\gamma_2\big(\sigma_x\mathcal{B}^\Sigma, \|\cdot\|_2\big) = \sigma_x\gamma_2\big(\mathcal{B}^\Sigma, \|\cdot\|_2\big). \tag{E.35}
\]
Additionally, it follows from Talagrand's majorizing measure theorem (Talagrand, 2005) that there exists some universal constant $C>0$ such that
\[
\gamma_2\big(\mathcal{B}^\Sigma, \|\cdot\|_2\big) \leq C\cdot\mathbb{E}_{g\sim\mathcal{N}(0,I_d)}\Big[\sup_{x\in\mathcal{B}^\Sigma}g^\top x\Big]. \tag{E.36}
\]
Here (a) follows from the fact that $\Sigma^{-1/2}x\in\mathcal{B}$ and that, for any $v\in\mathcal{B}$,
\[
\|v\|_1 = \|v_S\|_1 + \|v_{S^c}\|_1 \leq (1+\alpha)\|v_S\|_1 \leq (1+\alpha)\sqrt{s}\|v_S\|_2 \leq (1+\alpha)\sqrt{s}\|v\|_2
\]
for some subset $|S|\leq s$ by the definition of $\Theta$; (b) follows from $\|\Sigma^{-1/2}x\|_2 \leq \kappa^{-1/2}\|x\|_2 = \kappa^{-1/2}$; and (c) follows from $\mathbb{E}_{g\sim\mathcal{N}(0,I_d)}[\|\Sigma^{1/2}g\|_\infty] \leq \mathbb{E}_{g\sim\mathcal{N}(0,I_d)}[\|g\|_\infty] \leq 50\log d$ by the Sudakov--Fernique inequality (Conze et al., 1975) and Condition 3.4(b). Combining (E.34), (E.36) and (E.35) completes the proof of (E.32).
Step 3: Bounding the maximum of $|Z_v|$. In this step, we prove (E.33), following Mendelson et al. (2007). It follows from the definition of the $\gamma_2$-functional that there exists a sequence of subsets $\{\mathcal{B}_k : k\geq 0\}$ of $\mathcal{B}$ with $|\mathcal{B}_0|=1$ and $|\mathcal{B}_k|\leq 2^{2^k}$ such that for every $v\in\mathcal{B}$,
\[
\sum_{k=0}^\infty 2^{k/2}d\big(v,\pi_k(v)\big) \leq 1.01\,\gamma_2(\mathcal{B}, d), \tag{E.37}
\]
where $\pi_k(v)\in\mathcal{B}_k$ denotes a point attaining $d(v,\mathcal{B}_k)$. Let the integer $k_0$ satisfy $2\widetilde{n} \geq 2^{k_0} > \widetilde{n}$. It follows from the triangle inequality and the definitions of $W_v$ and $Z_v$ that
\[
|Z_v| \leq |Z_v - Z_{\pi_{k_0}(v)}| + |Z_{\pi_{k_0}(v)}| = \big|W_v^2 - W_{\pi_{k_0}(v)}^2\big| + |Z_{\pi_{k_0}(v)}|. \tag{E.38}
\]
From Minkowski's inequality, we can observe that $W_v$ is sub-additive with respect to $v$; that is, for any $v_1, v_2\in\mathbb{R}^d$,
\[
W_{v_1+v_2} = \bigg(\frac{1}{n\cdot|\mathcal{E}|}\sum_{i\in[n],e\in\mathcal{E}}\big(v_1^\top X_i^{(e)} + v_2^\top X_i^{(e)}\big)^2\bigg)^{1/2} \leq \bigg(\frac{1}{n\cdot|\mathcal{E}|}\sum_{i\in[n],e\in\mathcal{E}}\big(v_1^\top X_i^{(e)}\big)^2\bigg)^{1/2} + \bigg(\frac{1}{n\cdot|\mathcal{E}|}\sum_{i\in[n],e\in\mathcal{E}}\big(v_2^\top X_i^{(e)}\big)^2\bigg)^{1/2} = W_{v_1} + W_{v_2}. \tag{E.39}
\]
Consequently,
\[
\big(W_{\pi_{k_0}(v)} - W_{v-\pi_{k_0}(v)}\big)^2 - W_{\pi_{k_0}(v)}^2 \leq W_v^2 - W_{\pi_{k_0}(v)}^2 \leq \big(W_{\pi_{k_0}(v)} + W_{v-\pi_{k_0}(v)}\big)^2 - W_{\pi_{k_0}(v)}^2. \tag{E.40}
\]
Therefore, combining (E.38) and (E.40), and letting the positive integer $k_1 < k_0$ be determined later, we can upper bound $\sup_{v\in\mathcal{B}}|Z_v|$ as follows:
\begin{align*}
\sup_{v\in\mathcal{B}}|Z_v| &\leq \sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}^2 + 2\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}\sup_{v_0\in\mathcal{B}_{k_0}}W_{v_0} + \sup_{v_0\in\mathcal{B}_{k_0}}|Z_{v_0}| \\
&= \sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}^2 + 2\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}\sup_{v_0\in\mathcal{B}_{k_0}}\sqrt{Z_{v_0}+1} + \sup_{v_0\in\mathcal{B}_{k_0}}|Z_{v_0}| \\
&\leq \sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}^2 + 2\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}\sup_{v_0\in\mathcal{B}_{k_0}}\big(|Z_{v_0}|+1\big) + \sup_{v_0\in\mathcal{B}_{k_0}}|Z_{v_0}| \\
&= \sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}^2 + 2\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)} + \Big(2\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)} + 1\Big)\sup_{v_0\in\mathcal{B}_{k_0}}|Z_{v_0}| \\
&\leq \sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}^2 + 2\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)} + \Big(2\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)} + 1\Big)\bigg(\sup_{v_0\in\mathcal{B}_{k_0}}\big|Z_{v_0} - Z_{\pi_{k_1}(v_0)}\big| + \sup_{v_1\in\mathcal{B}_{k_1}}|Z_{v_1}|\bigg). \tag{E.41}
\end{align*}
It then remains to upper bound $\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}$, $\sup_{v_0\in\mathcal{B}_{k_0}}|Z_{v_0} - Z_{\pi_{k_1}(v_0)}|$ and $\sup_{v_1\in\mathcal{B}_{k_1}}|Z_{v_1}|$.
First, we upper bound $\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}$. It follows from the sub-additivity of $W_v$ that
\[
W_{v-\pi_{k_0}(v)} \leq \sum_{k=k_0}^\infty W_{\pi_{k+1}(v)-\pi_k(v)}, \quad\forall v\in\mathcal{B}. \tag{E.42}
\]
For each $k\geq k_0$, define the event
\[
\mathcal{U}_1(k) = \bigg\{\sup_{v\in\mathcal{B}}W_{\pi_{k+1}(v)-\pi_k(v)} \leq 8C\,d\big(\pi_{k+1}(v),\pi_k(v)\big)\sqrt{2^k/\widetilde{n}}\bigg\},
\]
where the constant $C$ is the same as that in (E.30). Since $|\mathcal{B}_k|\leq 2^{2^k}$, there are at most $2^{2^k}\times 2^{2^{k+1}} \leq 2^{2^{k+2}}$ distinct pairs $(\pi_{k+1}(v),\pi_k(v))$. Thus, we can take a union bound over all such pairs, combine it with (E.30), and use the fact that $2^k > \widetilde{n}$ for $k\geq k_0$ to obtain
\begin{align*}
\mathbb{P}\big[\mathcal{U}_1(k)^c\big] &\leq \sum_{(\pi_{k+1}(v),\pi_k(v))}\mathbb{P}\Big[W_{\pi_{k+1}(v)-\pi_k(v)} > 8C\sqrt{2^k/\widetilde{n}}\cdot d\big(\pi_{k+1}(v),\pi_k(v)\big)\Big] \\
&\leq \sum_{(\pi_{k+1}(v),\pi_k(v))}\mathbb{P}\Big[W_{\pi_{k+1}(v)-\pi_k(v)} > C\Big(\sqrt{16\cdot 2^k/\widetilde{n}} + 1\Big)\cdot d\big(\pi_{k+1}(v)-\pi_k(v),\,0\big)\Big] \tag{E.43} \\
&\leq 2^{2^{k+2}}\cdot 2\exp(-16\cdot 2^k) \leq \exp(-8\cdot 2^k).
\end{align*}
Under the event $\bigcap_{k\geq k_0}\mathcal{U}_1(k)$, it follows from (E.37) and (E.42) that
\[
\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)} \leq \sup_{v\in\mathcal{B}}\sum_{k=k_0}^\infty W_{\pi_{k+1}(v)-\pi_k(v)} \leq 8C\,\widetilde{n}^{-1/2}\sup_{v\in\mathcal{B}}\sum_{k=k_0}^\infty 2^{k/2}d\big(\pi_{k+1}(v),\pi_k(v)\big) \leq 16C\,\widetilde{n}^{-1/2}\gamma_2(\mathcal{B}, d). \tag{E.44}
\]
For $\sup_{v_0\in\mathcal{B}_{k_0}}|Z_{v_0} - Z_{\pi_{k_1}(v_0)}|$, we first define the following event for each $0\leq k\leq k_0-1$:
\[
\mathcal{U}_2(k) = \bigg\{\sup_{v\in\mathcal{B}}\big|Z_{\pi_{k+1}(v)} - Z_{\pi_k(v)}\big| \leq 40C\sigma_x\,d\big(\pi_{k+1}(v),\pi_k(v)\big)\sqrt{2^k/\widetilde{n}}\bigg\},
\]
where the constant $C$ is the same as that in (E.28). Since $|\mathcal{B}_k|\leq 2^{2^k}$, there are at most $2^{2^k}\times 2^{2^{k+1}} \leq 2^{2^{k+2}}$ distinct pairs $(\pi_{k+1}(v),\pi_k(v))$. Thus, we can take a union bound over all such pairs, combine it with (E.28), and use the fact that $2^k\leq\widetilde{n}$ for $k\leq k_0-1$ to obtain
\begin{align*}
\mathbb{P}\big[\mathcal{U}_2(k)^c\big] &\leq \sum_{(\pi_{k+1}(v),\pi_k(v))}\mathbb{P}\Big[\big|Z_{\pi_{k+1}(v)} - Z_{\pi_k(v)}\big| > 2C\sigma_x\,d\big(\pi_{k+1}(v),\pi_k(v)\big)\cdot 20\sqrt{2^k/\widetilde{n}}\Big] \\
&\leq \sum_{(\pi_{k+1}(v),\pi_k(v))}\mathbb{P}\Big[\big|Z_{\pi_{k+1}(v)} - Z_{\pi_k(v)}\big| > C\,d\big(\pi_{k+1}(v), -\pi_k(v)\big)\,d\big(\pi_{k+1}(v),\pi_k(v)\big)\Big(\sqrt{(16\cdot 2^k)/\widetilde{n}} + (16\cdot 2^k)/\widetilde{n}\Big)\Big] \tag{E.45} \\
&\leq 2^{2^{k+2}}\cdot 2\exp(-16\cdot 2^k) \leq \exp(-8\cdot 2^k).
\end{align*}
From the triangle inequality, under the event $\bigcap_{k=k_1}^{k_0-1}\mathcal{U}_2(k)$, we have
\[
\sup_{v_0\in\mathcal{B}_{k_0}}\big|Z_{v_0} - Z_{\pi_{k_1}(v_0)}\big| \leq \sup_{v\in\mathcal{B}}\sum_{k=k_1}^{k_0-1}\big|Z_{\pi_{k+1}(v)} - Z_{\pi_k(v)}\big| \leq 40C\sigma_x\,\widetilde{n}^{-1/2}\sum_{k=k_1}^{k_0-1}2^{k/2}d\big(\pi_{k+1}(v),\pi_k(v)\big) \leq 40C\sigma_x\,\widetilde{n}^{-1/2}\cdot 2\gamma_2(\mathcal{B}, d). \tag{E.46}
\]
For $\sup_{v_1\in\mathcal{B}_{k_1}}|Z_{v_1}|$, we define the following event for each $0\leq k\leq k_0-1$:
\[
\mathcal{U}_3(k) = \bigg\{\sup_{v\in\mathcal{B}}\big|Z_{\pi_k(v)}\big| \leq 32C\sigma_x^2\sqrt{2^k/\widetilde{n}}\bigg\}, \tag{E.47}
\]
where the constant $C$ is the same as that in (E.29). We take a union bound over all elements in $\mathcal{B}_k$, combine it with (E.29), and use the fact that $2^k\leq\widetilde{n}$ to obtain
\begin{align*}
\mathbb{P}\big[\mathcal{U}_3(k)^c\big] &\leq \sum_{\pi_k(v)}\mathbb{P}\Big[\big|Z_{\pi_k(v)}\big| > C\sigma_x^2\cdot 32\sqrt{2^k/\widetilde{n}}\Big] \\
&\leq \sum_{\pi_k(v)}\mathbb{P}\Big[\big|Z_{\pi_k(v)}\big| > C\sigma_x^2\Big(\sqrt{(16\cdot 2^k)/\widetilde{n}} + (16\cdot 2^k)/\widetilde{n}\Big)\Big] \tag{E.48} \\
&\leq 2^{2^k}\cdot 2\exp(-16\cdot 2^k) \leq \exp(-8\cdot 2^k).
\end{align*}
Under the events $\bigcap_{k\geq k_0}\mathcal{U}_1(k)$, $\bigcap_{k=k_1}^{k_0-1}\mathcal{U}_2(k)$ and $\mathcal{U}_3(k_1)$, combining (E.44) and (E.46) with the $\gamma_2$ bound (E.32) and our choice of $s$ (hence of $k_1$), each of the three suprema on the right-hand side of (E.41) is at most $1/64$, so that
\begin{align*}
\sup_{v\in\mathcal{B}}|Z_v| &\leq \sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)}^2 + 2\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)} + \Big(2\sup_{v\in\mathcal{B}}W_{v-\pi_{k_0}(v)} + 1\Big)\bigg(\sup_{v_0\in\mathcal{B}_{k_0}}\big|Z_{v_0} - Z_{\pi_{k_1}(v_0)}\big| + \sup_{v_1\in\mathcal{B}_{k_1}}|Z_{v_1}|\bigg) \\
&\leq \Big(\frac{1}{64}\Big)^2 + 2\cdot\frac{1}{64} + \Big(2\cdot\frac{1}{64} + 1\Big)\Big(\frac{1}{64} + \frac{1}{64}\Big) < \frac{1}{2}.
\end{align*}
Therefore, combining this with (E.43), (E.45) and (E.48), we can conclude that the event
\[
\bigg\{\sup_{v\in\mathcal{B}}|Z_v| < \frac{1}{2}\bigg\}
\]
occurs with probability at least
\begin{align*}
1 - \sum_{k=k_0}^\infty\mathbb{P}\big[\mathcal{U}_1(k)^c\big] - \sum_{k=k_1}^{k_0-1}\mathbb{P}\big[\mathcal{U}_2(k)^c\big] - \mathbb{P}\big[\mathcal{U}_3(k_1)^c\big] &\geq 1 - \sum_{k=k_0}^\infty\exp(-8\cdot 2^k) - \sum_{k=k_1}^{k_0-1}\exp(-8\cdot 2^k) - \exp(-8\cdot 2^{k_1}) \\
&\geq 1 - 3\exp(-4\cdot 2^{k_1}) \\
&\geq 1 - 3\exp\big(-\widetilde{n}/(C'\sigma_x^4)\big).
\end{align*}
E.10 Proof of Lemma E.3
The R.H.S. of the inequality follows from the fact that the augmented covariance matrix
\[
\begin{pmatrix}\Sigma & u \\ u^\top & \sigma_y^2\end{pmatrix}
\]
is positive semi-definite, so that the Schur complement satisfies $\sigma_y^2 - u^\top\Sigma^{-1}u \geq 0$, i.e., $\|\Sigma^{1/2}\bar{\beta}\|_2 \leq \sigma_y$. For the L.H.S., we apply a proof-by-contradiction argument. To be specific, we will show that if $\|\Sigma^{1/2}\beta^{k,\gamma}\|_2 > \|\Sigma^{1/2}\bar{\beta}\|_2$, then $\beta^{k,\gamma}$ is not the unique minimizer of $Q_{k,\gamma}(\beta)$, which is contrary to the claim in Theorem 3.2. To see this, let
\[
\widetilde{\beta} = \mathop{\mathrm{argmin}}_{\beta = t\cdot\beta^{k,\gamma},\,t\in\mathbb{R}}\big\|\Sigma^{1/2}(\beta - \bar{\beta})\big\|_2 \quad\text{with } \bar{\beta} = \Sigma^{-1}u. \tag{E.49}
\]
Observe that $\widetilde{\beta}$ is the projection of $\bar{\beta}$ onto the subspace $\{t\cdot\beta^{k,\gamma} : t\in\mathbb{R}\}$ with respect to the $\|\Sigma^{1/2}\cdot\|_2$ norm; this implies that
\[
(\widetilde{\beta} - \bar{\beta})^\top\Sigma v = 0 \quad \forall v\in\{t\cdot\beta^{k,\gamma} : t\in\mathbb{R}\}. \tag{E.50}
\]
Then we can obtain that
\[
\|\Sigma^{1/2}\widetilde{\beta}\|_2 = \frac{\widetilde{\beta}^\top\Sigma\widetilde{\beta}}{\|\Sigma^{1/2}\widetilde{\beta}\|_2} \overset{(a)}{=} \frac{\bar{\beta}^\top\Sigma\widetilde{\beta}}{\|\Sigma^{1/2}\widetilde{\beta}\|_2} \overset{(b)}{\leq} \frac{\|\Sigma^{1/2}\bar{\beta}\|_2\,\|\Sigma^{1/2}\widetilde{\beta}\|_2}{\|\Sigma^{1/2}\widetilde{\beta}\|_2} = \|\Sigma^{1/2}\bar{\beta}\|_2 \overset{(c)}{<} \|\Sigma^{1/2}\beta^{k,\gamma}\|_2,
\]
where (a) follows from the orthogonality (E.50) implied by the minimization program in (E.49), (b) follows from the Cauchy--Schwarz inequality, and (c) is the assumption made for contradiction. The display shows that $\widetilde{\beta} = \widetilde{t}\cdot\beta^{k,\gamma}$ with $|\widetilde{t}| < 1$, so $\widetilde{\beta}\neq\beta^{k,\gamma}$ while $Q_{k,\gamma}(\widetilde{\beta}) \leq Q_{k,\gamma}(\beta^{k,\gamma})$: by (E.49), $\widetilde{\beta}$ has the smallest risk on the line $\{t\cdot\beta^{k,\gamma} : t\in\mathbb{R}\}$, and shrinking by $|\widetilde{t}| < 1$ cannot increase the weighted $\ell_1$ penalty. This is contrary to the fact that $\beta^{k,\gamma}$ uniquely minimizes $Q_{k,\gamma}(\beta)$. We can then conclude that $\|\Sigma^{1/2}\beta^{k,\gamma}\|_2 \leq \|\Sigma^{1/2}\bar{\beta}\|_2 \leq \sigma_y$.
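The norm comparison established in this proof can be checked numerically. The sketch below uses a generic weighted-$\ell_1$-penalized quadratic as a stand-in for $Q_{k,\gamma}$ (arbitrary nonnegative weights $w$ in place of $w_k$, and a simple proximal-gradient solver, neither of which is from the paper): the penalized minimizer's $\Sigma$-norm does not exceed that of $\bar{\beta} = \Sigma^{-1}u$, which in turn is at most $\sigma_y$ when empirical second moments are used.

```python
# Hedged check: for a weighted-l1-penalized quadratic (stand-in for Q_{k,gamma}),
# ||Sigma^{1/2} beta_hat||_2 <= ||Sigma^{1/2} beta_bar||_2 <= sigma_y.
import numpy as np

rng = np.random.default_rng(3)
n, d = 5000, 8
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))   # correlated design
Y = X @ (rng.standard_normal(d) / np.sqrt(d)) + rng.standard_normal(n)

Sigma, u = X.T @ X / n, X.T @ Y / n           # empirical second moments
sigma_y = np.sqrt(np.mean(Y ** 2))
w, gamma = rng.uniform(0.0, 1.0, size=d), 0.3  # arbitrary nonnegative weights

beta = np.zeros(d)
step_L = np.linalg.eigvalsh(Sigma).max()       # Lipschitz constant of the gradient
for _ in range(20000):                         # proximal gradient on (1/2)b'Sigma b - b'u + gamma*sum(w|b|)
    z = beta - (Sigma @ beta - u) / step_L
    beta = np.sign(z) * np.maximum(np.abs(z) - gamma * w / step_L, 0.0)

beta_bar = np.linalg.solve(Sigma, u)
Rhalf = np.linalg.cholesky(Sigma).T            # ||Rhalf @ b||_2 = ||Sigma^{1/2} b||_2
print(np.linalg.norm(Rhalf @ beta), "<=", np.linalg.norm(Rhalf @ beta_bar), "<=", sigma_y)
```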
F.2 Construction of the Target Variables in Climate Dynamic Prediction
For each task $a\in\{$air, csulf, slp, pres$\}$, we tentatively regress $Z_{t,j}$ on $X_t$ for each $j\in[60]$ using all training data and evaluate the $R^2$, defined as
\[
1 - \frac{\sum_{(X_t,Z_{t,j})\in\mathcal{D}_1\cup\mathcal{D}_2}\big(Z_{t,j} - \widehat{Y}(X_t)\big)^2}{\sum_{(X_t,Z_{t,j})\in\mathcal{D}_1\cup\mathcal{D}_2}(Z_{t,j})^2}.
\]
We add $j$ to the set of target variables $Y$ if the $R^2$ exceeds a predefined threshold, set to 0.75 for air and csulf and to 0.9 for pres and slp. The selected target variables are shown in Table 3. We do so because we only care about target variables that are strongly correlated with the explanatory variables. We use the same hyper-parameters when predicting multiple targets within a task, while different tasks do not share the same hyper-parameters.
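The screening rule above amounts to a few lines of code. The sketch below is schematic: the array names, shapes, and data loading are placeholders rather than the released pipeline; only the $R^2$ formula mirrors the definition given above (note the uncentered denominator).

```python
# Schematic sketch of the target-screening rule described above.
import numpy as np
from sklearn.linear_model import LinearRegression

def select_targets(X_train, Z_train, threshold):
    """X_train: (T, p) explanatory series; Z_train: (T, 60) candidate targets (placeholders)."""
    selected = []
    for j in range(Z_train.shape[1]):
        z = Z_train[:, j]
        z_hat = LinearRegression().fit(X_train, z).predict(X_train)
        r2 = 1.0 - np.sum((z - z_hat) ** 2) / np.sum(z ** 2)   # R^2 as defined above
        if r2 > threshold:
            selected.append(j)
    return selected

# thresholds from the text: 0.75 for air/csulf, 0.9 for pres/slp, e.g.
# targets = select_targets(X_train, Z_train, threshold=0.75)
```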
Table 3: Selected target variables for the four tasks air temperature (air), clear sky upward solar flux (csulf), surface pressure
(pres) and sea level pressure (slp) over 100 replications.
Table 4: The average ± standard deviation of the worst-case out-of-sample R2 (4.2) for predicting the stocks AMT and SPG using
IGR with different k.
Table 5: The average ± standard deviation of the mean squared error (4.3) of the four tasks air temperature (air), clear sky
upward solar flux (csulf), surface pressure (pres) and sea level pressure (slp) using IGR with different k.
F.5 Causal Relation Identified by Our Method in Climate Dynamic Data
To qualitatively evaluate our method for causal discovery, we present the paths identified by our approach among six regions (No. 20, 23, 38, 40, 48, and 49) in the air temperature task (air) in Fig. 3. In particular, the causal path from the Arabian Sea (No. 38) to the eastern limb of ENSO (No. 40) via the Indian Ocean (No. 49) is verified by Kumar et al. (1999) and Timmermann et al. (2018). Additionally, the paths between East Asia (No. 48 and No. 23) and the high surface pressure sector of the Indian Monsoon region (No. 38) align with the known relationship between the sea surface temperatures of the Indian Ocean and the Asian Summer Monsoon (Li et al., 2001). These results demonstrate that our method can effectively identify causal relationships.
Figure 3: The paths identified by our approach among the six regions (No. 20, 23, 38, 40, 48, and 49) in the air temperature
task (air). The edge colors represent the path coefficients, while the labels indicate the time lags in days.