Causal Discovery with General Non-Linear Relationships Using Non-Linear ICA

Ricardo Pio Monti¹, Kun Zhang², Aapo Hyvärinen¹,³

¹ Gatsby Computational Neuroscience Unit, University College London, UK
² Department of Philosophy, Carnegie Mellon University, USA
³ Department of Computer Science and HIIT, University of Helsinki, Finland

Abstract

We consider the problem of inferring causal relationships between two or more passively observed variables. While the problem of such causal discovery has been extensively studied, especially in the bivariate setting, the majority of current methods assume a linear causal relationship, and the few methods which consider non-linear relations usually make the assumption of additive noise. Here, we propose a framework through which we can perform causal discovery in the presence of general non-linear relationships. The proposed method is based on recent progress in non-linear independent component analysis (ICA) and exploits the non-stationarity of observations in order to recover the underlying sources. We show rigorously that in the case of bivariate causal discovery, such non-linear ICA can be used to infer causal direction via a series of independence tests. We further propose an alternative measure for inferring causal direction based on asymptotic approximations to the likelihood ratio, as well as an extension to multivariate causal discovery. We demonstrate the capabilities of the proposed method via a series of simulation studies and conclude with an application to neuroimaging data.

1 INTRODUCTION

Causal models play a fundamental role in modern scientific endeavor (Pearl, 2009). While randomized control studies are the gold standard, such an approach is unfeasible or unethical in many scenarios (Spirtes and Zhang, 2016). Even when it is possible to run randomized control trials, the number of experiments required may raise practical challenges (Eberhardt et al., 2005). Furthermore, big data sets publicly available on the internet often try to be generic and thus cannot be strongly based on specific interventions; a prominent example is the Human Connectome Project, which collects resting-state fMRI data from over 500 subjects (Van Essen et al., 2012). As such, it is important to develop causal discovery methods through which to uncover causal structure from (potentially large-scale) passively observed data. Such data, collected without explicit manipulation of certain variables, is often termed observational data.

The intrinsic appeal of causal discovery methods is that they allow us to uncover the underlying causal structure of complex systems, providing an explicit description of the underlying generative mechanisms. Within the context of machine learning, causal knowledge has also been shown to play an important role in many domains such as semi-supervised and transfer learning (Schölkopf et al., 2012; Zhang et al., 2013), covariate shift, and algorithmic fairness (Kusner et al., 2017). A wide range of methods have been proposed to discover causal knowledge (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyvärinen, 2009; Peters et al., 2016; Zhang et al., 2017). However, many of the current methods rely on restrictive assumptions regarding the nature of the causal relationships. For example, Shimizu et al. (2006) assume linear causal models with non-Gaussian disturbances and demonstrate that independent component analysis (ICA) may be employed to uncover causal structure. Hoyer et al. (2009) provide an extension to non-linear causal models but under the assumption of additive noise.

In this paper we propose a general method for bivariate causal discovery in the presence of general non-linearities. The proposed method is able to uncover non-linear causal relationships without requiring assumptions such as linear causal structure or additive noise. Our approach exploits a correspondence between a non-linear ICA model and non-linear causal models, and is specifically tailored for observational data which are collected across a series of distinct experimental conditions or regimes. Given such data, we seek to exploit the non-stationarity introduced via distinct experimental conditions in order to perform causal discovery. We demonstrate that if latent sources can be recovered via non-linear ICA, then a series of independence tests may be employed to uncover causal structure. As an alternative to independence testing, we further propose a novel measure of non-linear causal direction based on an asymptotic approximation to the likelihood ratio.

[Footnote: KZ acknowledges the support by the U.S. Air Force under Contract No. FA8650-17-C-7715 and by NSF EAGER Grant No. IIS-1829681. The U.S. Air Force and the NSF are not responsible for the views reported in this article.]

2 PRELIMINARIES

In this section we introduce the class of causal models to be studied. We also present an overview of non-linear ICA methods based on contrastive learning, upon which we base the proposed method.

2.1 MODEL DEFINITION

Suppose we observe d-dimensional random variables X = (X1, . . . , Xd) with joint distribution P(X). The objective of causal discovery is to use the observed data, which give the empirical version of P(X), to infer the associated causal graph which describes the data generating procedure (Spirtes et al., 2000; Pearl, 2009).

A structural equation model (SEM) is here defined (generalizing the traditional definition) as a collection of d structural equations:

    Xj = fj(PAj, Nj),  j = 1, . . . , d    (1)

together with a joint distribution, P(N), over disturbance (noise) variables, Nj, which are assumed to be mutually independent. We write PAj to denote the parents of the variable Xj. The causal graph, G, associated with a SEM in equation (1) is a graph consisting of one node corresponding to each variable Xj; throughout this work we assume G is a directed acyclic graph (DAG).

While functions fj in equation (1) can be any (possibly non-linear) functions, to date the causal discovery community has focused on specific special cases in order to obtain identifiability results as well as provide practical algorithms. Pertinent examples include: a) the linear non-Gaussian acyclic model (LiNGAM; Shimizu et al., 2006), which assumes each fj is a linear function and the Nj are non-Gaussian, b) the additive noise model (ANM; Hoyer et al., 2009), which assumes the noise is additive, and c) the post-nonlinear causal model, which also captures possible non-linear distortion in the observed variables (Zhang and Hyvärinen, 2009).

The aforementioned approaches enforce strict constraints on the functional class of the SEM. Otherwise, without suitable constraints on the functional class, for any two variables one can always express one of them as a function of the other and independent noise (Hyvärinen and Pajunen, 1999). We are motivated to develop novel causal discovery methods which benefit from new identifiability results established from a different angle, in the context of general non-linear (and non-additive) relationships. A key component of our method exploits some recent advances in non-linear ICA, which we review next.

2.2 NON-LINEAR ICA VIA TCL

We briefly outline the recently proposed Time Contrastive Learning (TCL) algorithm, through which it is possible to demix (or disentangle) latent sources from observed non-linear mixtures; this algorithm provides hints as to the identifiability of causal direction between two variables in general non-linear cases under certain assumptions and is exploited in our causal discovery method. For further details we refer readers to Hyvärinen and Morioka (2016), but we also provide a brief review in Supplementary Material A. We assume we observe d-dimensional data, X, which are generated according to a smooth and invertible non-linear mixture of independent latent variables S = (S1, . . . , Sd). In particular, we have

    X = f(S).    (2)

The goal of non-linear ICA is then to recover S from X.

TCL, as introduced by Hyvärinen and Morioka (2016), is a method for non-linear ICA which is premised on the assumption that both latent sources and observed data are non-stationary time series. Formally, they assume that while components Sj are mutually independent, the distribution of each component is piece-wise stationary, implying they can be divided into non-overlapping time segments such that their distribution varies across segments, indexed by e ∈ E. In the basic case, the log-density of the jth latent source in segment e is assumed to follow an exponential family distribution such that:

    log pe(Sj) = qj,0(Sj) + λj(e) qj(Sj) − log Z(e),    (3)

where qj,0 is a stationary baseline and qj is a non-linear scalar function defining an exponential family for the jth source. (Exponential families with more than one sufficient statistic are also allowed.) The final term in equation (3) corresponds to a normalization constant. It is important to note that parameters λj(e) are functions of the segment index, e, implying that the distribution of sources will vary across segments. It follows from equation (2) that observations X may also be divided into non-overlapping segments indexed by e ∈ E. We write X(i) to denote the ith observation and Ci ∈ E to denote its corresponding segment.
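To make these generative assumptions concrete, the piece-wise stationary source model of equation (3) together with the smooth invertible mixing of equation (2) can be simulated in a few lines. The sketch below is illustrative only: the Laplace sources, the per-segment scales (playing the role of λj(e)), and the rotation-plus-tanh mixture are our own arbitrary choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_segments, n_per_seg = 2, 10, 512

# Segment-wise modulation parameters, playing the role of lambda_j(e) in eq. (3):
# each source's scale differs across the non-overlapping segments.
scales = rng.uniform(0.5, 2.0, size=(n_segments, d))

# Mutually independent Laplace sources, piece-wise stationary across segments.
S = np.concatenate(
    [rng.laplace(scale=scales[e], size=(n_per_seg, d)) for e in range(n_segments)]
)
labels = np.repeat(np.arange(n_segments), n_per_seg)  # segment index C_i per X(i)

# Smooth, invertible non-linear mixing X = f(S), as in eq. (2): a rotation
# followed by an element-wise strictly increasing (hence bijective) map.
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = np.tanh(S @ A.T) + 0.1 * (S @ A.T)

print(X.shape, labels.shape)
```

Any strictly monotone element-wise map composed with an invertible linear map gives a valid smooth, invertible mixing f here.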
TCL proceeds by defining a multinomial classification task, where we consider each original data point X(i) as a data point to be classified, and the segment indices Ci give the labels. Given the observations, X, together with the associated segment labels, C, TCL can then be proven to recover f⁻¹ as well as independent components, S, by learning to classify the observations into their corresponding segments. In particular, TCL trains a deep neural network using multinomial logistic regression to perform this classification task. The network architecture employed consists of a feature extractor corresponding to the last hidden layer, denoted by h(X(i); θ) and parameterised by θ, together with a final linear layer. The central Theorem on TCL is given in our notation as

Theorem 1 (Hyvärinen and Morioka (2016)) Assume the following conditions hold:

1. We observe data generated by independent sources according to equation (3) and mixed via an invertible, smooth function f as stated in equation (2).

2. We train a neural network consisting of a feature extractor h(X(i); θ) and a final linear layer (i.e., softmax classifier) to classify each observation to its corresponding segment label, Ci. We require the dimension of h(X(i); θ) to be the same as X(i).

3. The matrix L with elements Le,j = λj(e) − λj(1) for segments e = 1, . . . , E and j = 1, . . . , d has full rank.

Then in the limit of infinite data, the outputs of the feature extractor are equal to q(S), up to an invertible linear transformation.

Theorem 1 states that we may perform non-linear ICA by training a neural network to classify the segments associated with each observation, followed by linear ICA on the hidden representations, h(X; θ). This theorem provides identifiability of this particular non-linear ICA model, meaning that it is possible to recover the sources. This is not the case with many simpler attempts at non-linear ICA models (Hyvärinen and Pajunen, 1999), such as the case with a single segment in the model above.

While Theorem 1 provides identifiability for a particular non-linear ICA model, it requires a final linear unmixing of sources (i.e., via linear ICA). However, when sources follow the piece-wise stationary distribution detailed in equation (3), traditional linear ICA methods may not be appropriate as sources will only be independent conditional on the segment. For example, it is possible that exponential family parameters, λj(e), are dependent across sources (e.g., they may be correlated). This problem will be particularly pertinent when data is only collected over a reduced number of segments. As such, alternative linear ICA algorithms are required to effectively employ TCL in such a setting, as addressed in Section 3.2.

3 NON-LINEAR CAUSAL DISCOVERY VIA NON-LINEAR ICA

In this section we outline the proposed method for causal discovery over bivariate data, which we term Non-linear SEM Estimation using Non-Stationarity (NonSENS). We begin by providing an intuition for the proposed method in Section 3.1, which is based on the connection between non-linear ICA and non-linear SEMs. In Section 3.2 we propose a novel linear ICA algorithm which complements TCL for the purpose of causal discovery, particularly in the presence of observational data with few segments. Our method is formally detailed in Section 3.3, which also contains a proof of identifiability. Finally in Section 3.4 we present an alternative measure of causal direction based on asymptotic approximations to the likelihood ratio of non-linear causal models.

3.1 RELATING SEM TO ICA

We assume we observe bivariate data X(i) ∈ R² and write X1(i) and X2(i) to denote the first and second entries of X(i) respectively. We will omit the i index whenever it is clear from context. Following the notation of Peters et al. (2016), we further assume data is available over a set of distinct environmental conditions E = {1, . . . , E}. As such, each X(i) is allocated to an experimental condition denoted by Ci ∈ E. Let ne be the number of observations within each experimental condition such that ntot = Σ_{e∈E} ne.

The objective of the proposed method is to uncover the causal direction between X1 and X2. Suppose that X1 → X2, such that the associated SEM is of the form:

    X1(i) = f1(N1(i)),    (4)
    X2(i) = f2(X1(i), N2(i)),    (5)

where N1, N2 are latent disturbances whose distributions are also assumed to vary across experimental conditions. The DAG associated with equations (4) and (5) is shown in Figure 1. Fundamentally, the proposed NonSENS algorithm exploits the correspondence between the non-linear ICA model described in Section 2.2 and non-linear SEMs. This correspondence is formally stated as follows: observations generated according to the (possibly non-linear) SEM detailed in equations (4) and (5) will follow a non-linear ICA model where each disturbance variable, Nj, corresponds to a latent source, Sπ(j).
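Condition 3 of Theorem 1 is easy to check for a given simulation design: form the matrix L with entries Le,j = λj(e) − λj(1) and verify that it has full column rank. A small sketch with hypothetical, randomly drawn modulation parameters (our assumption for illustration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n_segments, d = 8, 2

# Hypothetical modulation parameters lambda_j(e), drawn at random for illustration.
lam = rng.uniform(0.5, 2.0, size=(n_segments, d))

# L_{e,j} = lambda_j(e) - lambda_j(1). With continuously distributed random
# lambdas this matrix almost surely has full column rank d, satisfying
# condition 3 of Theorem 1.
L = lam - lam[0]

rank = np.linalg.matrix_rank(L)
print(rank)
```

A rank-deficient L would mean the segments do not modulate the sources in sufficiently diverse ways, and identifiability would be lost.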
[Figure 1: Visualization of the DAG, G, associated with the SEM in equations (4) and (5): N1 → X1 (via f1, with source S1: X1 = f1(N1)), and X1, N2 → X2 (via f2, with source S2: X2 = f2(X1, N2)).]

Moreover, structural equations f1 and f2 jointly define a bivariate non-linear mapping from sources to observations as in non-linear ICA. However, the mixing function f in non-linear ICA is not exactly the same as f1 and f2 (see Supplementary Material B). We note that due to the permutation indeterminacy present in ICA, each disturbance variable, Nj, will only be identifiable up to some permutation π of the set {1, 2}.

The proposed method consists of a two-step procedure. First, it seeks to recover latent disturbances via non-linear ICA. Given the estimated latent disturbances, the following property highlights how we may employ statistical independencies between observations and estimated sources in order to infer the causal structure:

Property 1 Assume the true causal structure follows equations (4) and (5), as depicted in Figure 1. Then, assuming each observed variable is statistically dependent on its latent disturbance (thus avoiding degenerate cases), it follows that X1 ⊥⊥ N2, while X1 ⊥̸⊥ N1 and X2 ⊥̸⊥ N1 as well as X2 ⊥̸⊥ N2.[1]

[1] We note that the property that the effect is dependent on its direct causes typically holds, although one may construct specific examples (with discrete variables or continuous variables with complex causal relations) in which the effect is independent from its direct causes. In particular, if faithfulness is assumed (Spirtes et al., 2000), the above property clearly holds.

Property 1 highlights the relationship between observations X and latent sources, N, and provides some insight into how a non-linear ICA method, together with independence testing, could be employed to perform bivariate causal discovery. This is formalized in Section 3.3.

3.2 A LINEAR ICA ALGORITHM FOR PIECE-WISE STATIONARY SOURCES

Before proceeding, we have to improve the non-linear ICA theory of Hyvärinen and Morioka (2016). Assumptions 1–3 of Theorem 1 for TCL guarantee that the feature extractor, h(X; θ), will recover a linear mixture of latent independent sources (up to element-wise transformation by q). As a result, applying a linear unmixing method to the final representations, h(X; θ), will allow us to recover latent disturbances. However, the use of ordinary linear ICA to unmix h(X; θ) is premised on the assumption that latent sources are independent. This is not necessarily guaranteed when sources follow the ICA model presented in equation (3) with a fixed number of segments. For example, it is possible that parameters λj(e) are correlated across segments. We note that this is not a problem when the number of segments increases asymptotically and parameters λj(e) are assumed to be randomly generated, as stated in Corollary 1 of Hyvärinen and Morioka (2016).

In order to address this issue, we propose an alternative linear ICA algorithm to be employed in the final stage of TCL, through which to accurately recover latent sources in the presence of a small number of segments. The proposed linear ICA algorithm explicitly models latent sources as following the piece-wise stationary distribution specified in equation (3). We write Z(i) ∈ Rᵈ to denote the ith observation, generated as a linear mixture of sources: Z(i) = AS(i), where A ∈ Rᵈˣᵈ is a square mixing matrix. Estimation of parameters proceeds via score matching (Hyvärinen, 2005), which yields an objective function of the following form:

    J = Σ_{e∈E} Σ_{j=1}^{d} λj(e) (1/ne) Σ_{Ci=e} qj''(wjᵀ Z(i))
        + (1/2) Σ_{e∈E} Σ_{j,k=1}^{d} λk(e) λj(e) wkᵀwj (1/ne) Σ_{Ci=e} qk'(wkᵀ Z(i)) qj'(wjᵀ Z(i)),

where W ∈ Rᵈˣᵈ denotes the unmixing matrix and qj' and qj'' denote the first and second derivatives of the non-linear scalar functions introduced in equation (3). Details and results are provided in Supplementary C, where the proposed method is shown to outperform both FastICA and Infomax ICA, as well as the joint diagonalization method of Pham and Cardoso (2001), which is explicitly tailored for non-stationary sources.

3.3 CAUSAL DISCOVERY USING INDEPENDENCE TESTS

Now we give the outline of NonSENS. NonSENS performs causal discovery by combining Property 1 with a non-linear ICA algorithm. Notably, we employ TCL, described in Section 2.2, with the important addition that the final linear unmixing of the hidden representations, h(X; θ), is performed using the objective given in Section 3.2. The proposed method is summarized as follows:

1. (a) Using TCL, train a deep neural network with feature extractor h(X(i); θ) to accurately classify each observation X(i) according to its segment label Ci.
   (b) Perform linear unmixing of h(X; θ) using the algorithm presented in Section 3.2.

2. Perform the four tests listed in Property 1, and conclude a cause-effect relationship in the case where there is evidence to reject the null hypothesis in three of the tests and only one of the tests fails to reject the null. The variable for which the null hypothesis was not rejected is considered the cause.
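Step 2 above is a simple decision rule over four p-values. A minimal sketch, assuming p-values from some independence test are already available (the function name and inputs are hypothetical; the HSIC test itself is not shown):

```python
def infer_cause(p_vals, alpha=0.05):
    """Decide the cause from p-values of the four tests in Property 1.

    p_vals maps (observed variable, estimated disturbance) pairs to p-values,
    e.g. ("X1", "N2"): 0.4. Hypothetical sketch, not the authors' code.
    """
    alpha_corrected = alpha / len(p_vals)  # Bonferroni correction
    # Pairs where we fail to reject the null of independence:
    accepted = [pair for pair, p in p_vals.items() if p >= alpha_corrected]
    if len(accepted) != 1:
        return None  # no decision: need exactly one failure to reject
    (cause, _), = accepted  # the variable independent of a disturbance
    return cause

# Under X1 -> X2, only the (X1, N2) test should fail to reject independence.
decision = infer_cause({("X1", "N1"): 1e-4, ("X1", "N2"): 0.4,
                        ("X2", "N1"): 1e-5, ("X2", "N2"): 1e-3})
print(decision)  # "X1"
```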
R = L1→2 − L2→1 (6)
Each test is run at a pre-specified significance level, α, and Bonferroni corrected in order to control the family-wise error rate. Throughout this work we employ HSIC as a test for statistical independence (Gretton et al., 2005). Pseudo-code is provided in Supplementary G. Theorem 2 formally states the assumptions and identifiability properties of the proposed method.

Theorem 2 Assume the following conditions hold:

1. We observe bivariate data X which has been generated from a non-linear SEM with smooth non-linearities and no hidden confounders.

2. Data is available over at least three distinct experimental conditions and latent disturbances, Nj, are generated according to equation (3).

3. We employ TCL, with a sufficiently deep neural network as the feature extractor, followed by linear ICA (as described in Section 3.2) on hidden representations to recover the latent sources.

4. We employ an independence test which can capture any type of departure from independence, for example HSIC, with Bonferroni correction and significance level α.

Then in the limit of infinite data the proposed method will identify the cause variable with probability 1 − α.

See Supplementary D for a proof. Theorem 2 extends previous identifiability results relying on constraints on functional classes (e.g., ANM in Hoyer et al. (2009)) to the domain of arbitrary non-linear models, under further assumptions on non-stationarity of the given data.

3.4 LIKELIHOOD RATIO-BASED MEASURES OF CAUSAL DIRECTION

While independence tests are widely used in causal discovery, they may not be statistically optimal for deciding causal direction. In this section, we further propose a novel measure of causal direction which is based on the likelihood ratio under non-linear causal models, and which thus is likely to be more efficient.

The proposed measure can be seen as the extension of linear measures of causal direction, such as those proposed by Hyvärinen and Smith (2013), to the domain of non-linear SEMs. Briefly, Hyvärinen and Smith (2013) consider the likelihood ratio between two candidate models of causal influence: X1 → X2 or X2 → X1. The log-likelihood ratio is then defined as the difference in log-likelihoods under each model:

    R = L1→2 − L2→1    (6)

where we write L1→2 to denote the log-likelihood under the assumption that X1 is the causal variable and L2→1 for the alternative model. Under the assumption that X1 → X2, it follows that the underlying SEM is of the form described in equations (4) and (5). The log-likelihood for a single data point may thus be written as

    L1→2 = log pX1(X1) + log pX2|X1(X2 | X1).

Furthermore, in the context of linear causal models we have that equations (4) and (5) define a bijection between N2 and X2 whose Jacobian has unit determinant, such that the log-likelihood can be expressed as:

    L1→2 = log pX1(X1) + log pN2(N2).

In the asymptotic limit we can take the expectation of the log-likelihood, and the log-likelihood converges to:

    E[L1→2] = −H(X1) − H(N2)    (7)

where H(·) denotes the differential entropy. Hyvärinen and Smith (2013) note that the benefit of equation (7) is that only univariate approximations of the differential entropy are required. In this section we seek to derive equivalent measures for causal direction without the assumption of linear causal effects. Recall that after training via TCL, we obtain an estimate of g = f⁻¹ which is parameterized by a deep neural network.

In order to compute the log-likelihood, L1→2, we consider the following change of variables:

    (X1, N2)ᵀ = g̃((X1, X2)ᵀ) = (X1, g2(X1, X2))ᵀ

where we note that g2 : R² → R refers to the second component of g. Further, we note that the mapping g̃ only applies the identity to the first element, thereby leaving X1 unchanged. Given such a change of variables, we may evaluate the log-likelihood as follows:

    L1→2 = log pX1(X1) + log pN2(N2) + log |det Jg̃|,

where Jg̃ denotes the Jacobian of g̃, as we have X1 ⊥⊥ N2 by construction under the assumption that X1 → X2. Due to the particular choice of g̃, we are able to easily evaluate the Jacobian, which can be expressed as:

    Jg̃ = [ ∂g̃1/∂X1  ∂g̃1/∂X2 ]   [ 1         0        ]
         [ ∂g̃2/∂X1  ∂g̃2/∂X2 ] = [ ∂g2/∂X1  ∂g2/∂X2 ]
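Because g̃ applies the identity to its first argument, Jg̃ is lower-triangular, so |det Jg̃| = |∂g2/∂X2|; this is easy to confirm numerically. In the sketch below, g2 is an arbitrary smooth toy function standing in for the second output of the learned inverse network (an assumption for illustration, not the trained model):

```python
import numpy as np

def g2(x1, x2):
    # Arbitrary smooth toy "unmixing" component, standing in for the
    # second component of the learned inverse mapping g.
    return np.tanh(x2 - 0.5 * x1 ** 2) + 0.3 * x2

def jacobian_gtilde(x1, x2, eps=1e-6):
    # Central finite-difference Jacobian of gtilde(x1, x2) = (x1, g2(x1, x2)).
    g = lambda u: np.array([u[0], g2(u[0], u[1])])
    u = np.array([x1, x2])
    return np.column_stack([
        (g(u + eps * np.eye(2)[k]) - g(u - eps * np.eye(2)[k])) / (2 * eps)
        for k in range(2)
    ])

Jg = jacobian_gtilde(0.4, -1.2)
dg2_dx2 = (g2(0.4, -1.2 + 1e-6) - g2(0.4, -1.2 - 1e-6)) / 2e-6
print(np.linalg.det(Jg), dg2_dx2)  # the two quantities agree
```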
As a result, the determinant can be directly evaluated as ∂g2/∂X2. Furthermore, since g2 is parameterized by a deep network, we can directly evaluate its derivative with respect to X2. This allows us to directly evaluate the log-likelihood of X1 being the causal variable as:

    L1→2 = log pX1(X1) + log pN2(N2) + log |∂g2/∂X2|.

Finally, we consider the asymptotic limit and obtain the non-linear generalization of equation (7) as:

    E[L1→2] = −H(X1) − H(N2) + E[log |∂g2/∂X2|].

In practice we use the sample mean instead of the expectation.

One remaining issue to address is the permutation invariance of estimated sources (note that this permutation is not about the causal order of the observed variables). We must consider both permutations π of the set {1, 2}. In order to resolve this issue, we note that if the true permutation is π = (1, 2), then assuming X1 → X2, we have ∂g1/∂X2 = 0 while ∂g2/∂X2 ≠ 0. This is because g1 unmixes observations to return the latent disturbance for the causal variable, X1, and is therefore not a function of X2. The converse is true if the permutation is π = (2, 1). Similar reasoning can be employed for the reverse model: X2 → X1. As such, we propose to select the permutation as follows:

    π* = argmax_π { E[log |∂gπ(2)/∂X2|] + E[log |∂gπ(1)/∂X1|] }.

For a chosen permutation, π*, we may therefore compute the likelihood ratio in equation (6) as:

    R = −H(X1) − H(Nπ*(2)) + E[log |∂gπ*(2)/∂X2|]
        + H(X2) + H(Nπ*(1)) − E[log |∂gπ*(1)/∂X1|].

If R is positive, we conclude that X1 is the causal variable, whereas if R is negative X2 is reported as the causal variable. When computing the differential entropy, we employ the approximations described in Kraskov et al. (2004). We note that such approximations require variables to be standardized; in the case of latent variables this can be achieved by defining a further change of variables corresponding to a standardization.

Finally, we note that the likelihood ratio presented above can be connected to the independence measures employed in Section 3.3 when mutual information is used as a measure of statistical dependence. In particular, we have

    R = −I(X1, Nπ(2)) + I(X2, Nπ(1)),    (8)

where I(·, ·) denotes the mutual information between two variables. We provide a full derivation in Supplementary E. This result serves to connect the proposed likelihood ratio to independence testing methods for causal discovery which use mutual information.

3.5 EXTENSION TO MULTIVARIATE DATA

It is not straightforward to extend NonSENS to multivariate cases. Due to the permutation invariance of sources, we would require d² independence tests, where d is the number of variables, leading to a significant drop in power after Bonferroni correction. Likewise, the likelihood ratio test inherently considers only two variables. Instead, we propose to extend the proposed method to the domain of multivariate causal discovery by employing it in conjunction with a traditional constraint-based method such as the PC algorithm, as in Zhang and Hyvärinen (2009). Formally, the PC algorithm is first employed to estimate the skeleton and orient as many edges as possible. Any remaining undirected edges are then directed using the proposed bivariate method.

3.6 RELATIONSHIP TO PREVIOUS METHODS

NonSENS is closely related to linear ICA-based methods as described in Shimizu et al. (2006). However, there are important differences: LiNGAM focuses exclusively on linear causal models whilst NonSENS is specifically designed to recover arbitrary non-linear causal structure. Moreover, the proposed method is mainly designed for bivariate causal discovery whereas the original LiNGAM method can easily perform multivariate causal discovery by permuting the estimated ICA unmixing matrix. In this sense NonSENS is more closely aligned to the Pairwise LiNGAM method (Hyvärinen and Smith, 2013).

Hoyer et al. (2009) and Peters et al. (2014) propose a non-linear causal discovery method named regression and subsequent independence test (RESIT) which is able to recover the causal structure under the assumption of an additive noise model. RESIT essentially shares the same underlying idea as NonSENS, with the difference being that it estimates latent disturbances via non-linear regression, as opposed to via non-linear ICA. Related is the Regression Error Causal Inference (RECI) algorithm (Blöbaum et al., 2018), which proposes measures of causal direction based on the magnitude of (non-linear) regression errors. Importantly, both of those methods restrict the non-linear relations to have additive noise.

Recently several methods have been proposed which seek to exploit non-stationarity in order to perform causal discovery. Following Schölkopf et al. (2012), Peters et al. (2016) propose to leverage the invariance of causal models under covariate shift in order to recover the true causal structure. Their method, termed Invariant Causal Prediction (ICP), is tailored to the setting where data is collected across a variety of experimental regimes, similar to ours. However, their main results, including identifiability, are in the linear or additive noise settings.
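Returning briefly to the likelihood-ratio measure of Section 3.4: since R is the difference of two expected log-likelihoods, any implementation should be exactly antisymmetric under swapping the two candidate directions, which gives a cheap sanity check. The sketch below is illustrative only: it uses a crude Gaussian entropy proxy H ≈ ½ log(2πe·Var) in place of the Kraskov et al. (2004) approximations, and random stand-in values for the log-derivative terms.

```python
import numpy as np

def entropy_proxy(x):
    # Crude stand-in for a differential entropy estimator (illustration only).
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

def expected_loglik(x_cause, n_effect, log_abs_dg):
    # E[L_{cause->effect}] = -H(cause) - H(disturbance) + E[log|dg/dx_effect|]
    return -entropy_proxy(x_cause) - entropy_proxy(n_effect) + np.mean(log_abs_dg)

def likelihood_ratio(x1, n2, logd2, x2, n1, logd1):
    # R = E[L_{1->2}] - E[L_{2->1}]; positive R favours X1 -> X2.
    return expected_loglik(x1, n2, logd2) - expected_loglik(x2, n1, logd1)

rng = np.random.default_rng(3)
x1, x2 = rng.laplace(size=500), rng.laplace(size=500)
n1, n2 = rng.laplace(size=500), rng.laplace(size=500)
logd1, logd2 = rng.normal(size=500), rng.normal(size=500)

R_fwd = likelihood_ratio(x1, n2, logd2, x2, n1, logd1)
R_rev = likelihood_ratio(x2, n1, logd1, x1, n2, logd2)
print(R_fwd, R_rev)  # antisymmetric: R_rev == -R_fwd
```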
turbances, N, were randomly generated by simulating
Zhang et al. (2017) proposed a method, termed Laplace random variables with distinct variances in each
CD-NOD, for causal discovery from heterogeneous, segment. For the non-linear mixing function we employ
multiple-domain data or non-stationary data, which al- a deep neural network (“mixing-DNN”) with randomly
lows for general non-linearities. Their method thus generated weights such that:
solves a problem similar to ours, although with a very
different approach. Their method accounts for non- X(1) = A(1) N, (9)
 
stationarity, which manifests itself via changes in the X(l) = A(l) f X(l−1) , (10)
causal modules, via the introduction of an surrogate
variable representing the domain or time index into the where we write X(l) to denote the activations at the lth
causal DAG. Conditional independence testing is em- layer and f corresponds to the leaky-ReLU activation
ployed to recover the skeleton over the augmented DAG, function which is applied element-wise. We restrict ma-
and their method does not produce an estimate of the trices A(l) to be lower-triangular in order to introduce
SEM to represent the causal mechanism. acyclic causal relations. In the special case of multivari-
ate causal discovery, we follow Peters et al. (2014) and
2
4 EXPERIMENTAL RESULTS include edges with a probability of d−1 , implying that
the expected number of edges is d. We present exper-
In order to demonstrate the capabilities of the proposed iments for d = 6 dimensions. Note that equation (9)
method we consider a series of experiments on synthetic follows the LiNGAM. For depths l ≥ 2, equation (10)
data as well as real neuroimaging data. generates data with non-linear causal structure.
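As a concrete illustration, the generative process of equations (9) and (10) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the weight ranges, the leaky-ReLU slope, and the segment-wise variance range are assumptions, and the random edge pruning with probability 2/(d − 1) is omitted for brevity.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):  # slope alpha is an assumption
    return np.where(x > 0, x, alpha * x)

def generate_mixing_dnn_data(d=6, n_segments=10, n_per_segment=512,
                             depth=3, seed=0):
    rng = np.random.default_rng(seed)

    # Non-stationary disturbances N: Laplace sources whose variance
    # differs in every segment (i.e., experimental condition).
    scales = rng.uniform(0.5, 2.0, size=(n_segments, 1, d))
    N = (scales * rng.laplace(size=(n_segments, n_per_segment, d))).reshape(-1, d)

    def random_lower_triangular():
        # Lower-triangular weights induce an acyclic causal ordering.
        A = np.tril(rng.uniform(-1.0, 1.0, size=(d, d)), k=-1)
        np.fill_diagonal(A, 1.0)  # keep each variable's own disturbance
        return A

    X = N @ random_lower_triangular().T            # equation (9)
    for _ in range(depth - 1):                     # equation (10)
        X = leaky_relu(X) @ random_lower_triangular().T
    return X, N

X, N = generate_mixing_dnn_data()  # X.shape == (5120, 6)
```

With depth = 1 the loop body never runs and the process reduces to the linear (LiNGAM) case of equation (9); deeper networks yield increasingly non-linear causal dependencies.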
4.1 SIMULATIONS ON ARTIFICIAL DATA

In the implementation of the proposed method we employed deep neural networks of varying depths as feature extractors. All networks were trained with a cross-entropy loss using stochastic gradient descent. For the final linear unmixing required by TCL, we employ the linear ICA model described in Section 3.2. For independence testing, we employ HSIC with a Gaussian kernel. All tests are run at the α = 5% level and Bonferroni corrected.

Throughout the experiments we vary the following factors: the number of distinct experimental conditions (i.e., distinct segments), the number of observations per segment, ne, and the depth, l, of the mixing-DNN. In the context of bivariate causal discovery we measure how frequently each method correctly identifies the cause variable. For multivariate causal discovery we consider the F1 score, which quantifies the agreement between the estimated and true DAGs.

We benchmark the performance of the NonSENS algorithm against several state-of-the-art methods. As a measure of performance against linear methods we compare against LiNGAM; in particular, we compare performance to DirectLiNGAM (Shimizu et al., 2011). In order to highlight the need for non-linear ICA methods, we also consider the performance of the proposed method when linear ICA is employed to estimate the latent disturbances; we refer to this baseline as Linear-ICA NonSENS. We further compare against the RESIT method of Peters et al. (2014), where we employ Gaussian process regression to estimate non-linear effects and HSIC as a measure of statistical dependence. Finally, we also compare against the CD-NOD method of Zhang et al. (2017) as well as the RECI method presented in Blöbaum et al. (2018). For the latter, we employ Gaussian process regression and note that this method assumes the presence of a causal effect; it is therefore only included in some experiments. We provide a description of each of the methods in the Supplementary material F.

Figure 2 shows the results for bivariate causal discovery as the number of distinct experimental conditions, |E|, increases while the number of observations within each condition is fixed at ne = 512. Each horizontal panel shows the results as the depth of the mixing-DNN increases from l = 1 to l = 5. The top panels show the proportion of times the correct cause variable was identified across 100 independent simulations. The first top panel corresponds to linear causal dependencies; as such, all methods are able to accurately recover the true cause variable. However, as the depth of the mixing-DNN increases, the causal dependencies become increasingly non-linear and the performance of all methods deteriorates. While we attribute this drop in performance to the increasingly non-linear nature of the causal structure, we note that the NonSENS algorithm out-performs all alternative methods.

The bottom panels of Figure 2 show the results when no directed acyclic causal structure is present. Here, data was generated such that A(l) was not lower-triangular.
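The HSIC-based independence tests underpinning these decisions can be sketched as below. This is a minimal biased estimator with a permutation p-value; the median-heuristic bandwidth is an assumption, as the text above specifies only a Gaussian kernel (the statistic itself follows Gretton et al., 2005).

```python
import numpy as np

def _gaussian_gram(x):
    """Gaussian-kernel Gram matrix for a 1-D sample; the median-heuristic
    bandwidth is an assumption (any reasonable bandwidth rule would do)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    sigma2 = 0.5 * np.median(d2[d2 > 0])
    return np.exp(-d2 / (2.0 * sigma2))

def hsic(x, y):
    """Biased empirical HSIC: trace(K H L H) / n^2."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K, L = _gaussian_gram(x), _gaussian_gram(y)
    return np.trace(K @ H @ L @ H) / n ** 2

def hsic_pvalue(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: x is independent of y."""
    rng = np.random.default_rng(seed)
    stat = hsic(x, y)
    null = [hsic(x, rng.permutation(y)) for _ in range(n_perm)]
    return stat, (1 + sum(s >= stat for s in null)) / (1 + n_perm)
```

A strongly dependent pair drives the p-value to its resolution floor of 1/(n_perm + 1), while independent draws yield large p-values; as stated above, collections of such tests are then Bonferroni corrected.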
Figure 2: Experimental results indicating performance as we increase the number of experimental conditions, |E|, whilst keeping the number of observations per condition fixed at ne = 512. Each horizontal panel plots results for varying depths of the mixing-DNN, ranging from l = 1, . . . , 5. The top panels show the proportion of times the correct cause variable is identified when a causal effect exists. The bottom panels consider data where no acyclic causal structure exists (the A(l) are not lower-triangular) and report the proportion of times the absence of a causal effect is correctly reported. The dashed, horizontal red line indicates the theoretical (1 − α)% true negative rate. For clarity we omit the standard errors, but we note that they were small in magnitude (approximately 2–5%).

[Figure 2 plots: the top row shows the "Proportion correct causal direction detected" and the bottom row the "Proportion null correctly accepted (1 − Type I error)". Panels correspond to mixing-DNN depths of 1 to 5 layers; the x-axes give the number of experimental conditions (10–50); the methods plotted are NonSENS, Lin-ICA NonSENS, DirectLiNGAM, RESIT, ICP and CD-NOD.]

Figure 3: Experimental results visualizing performance under the assumption that a causal effect exists. This reduces the bivariate causal discovery problem to recovering the causal ordering over X1 and X2. The top panels consider an increasing number of experimental conditions, whilst the bottom panels show results when we vary the number of observations within a fixed number of experimental conditions, |E| = 10. Each horizontal panel plots results for varying depths of the mixing-DNN, ranging from l = 1, . . . , 5.
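In the setting of Figure 3, where a causal effect is assumed to exist, the likelihood-ratio measure of equation (8) reduces direction detection to comparing mutual informations between the observed variables and the estimated disturbances. A sketch, assuming scikit-learn's k-nearest-neighbour MI estimator as a stand-in (the excerpt does not commit to a particular estimator) and, for illustration only, ground-truth disturbances in place of the non-linear-ICA estimates:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi(a, b, seed=0):
    """k-NN mutual-information estimate I(a; b) for 1-D arrays (in nats)."""
    return mutual_info_regression(a.reshape(-1, 1), b, random_state=seed)[0]

def nonsens_lr(x1, x2, n1_hat, n2_hat):
    """Equation (8): R = -I(X1, N_hat_2) + I(X2, N_hat_1).
    Under X1 -> X2, X1 is independent of X2's disturbance, so the first
    term vanishes while the second is positive, giving R > 0."""
    return -mi(x1, n2_hat) + mi(x2, n1_hat)

# Toy example with a linear additive effect X1 -> X2 (for illustration;
# the method itself allows general non-linear mixing).
rng = np.random.default_rng(0)
n1, n2 = rng.laplace(size=2000), rng.laplace(size=2000)
x1 = n1
x2 = x1 + 0.5 * n2
R = nonsens_lr(x1, x2, n1_hat=n1, n2_hat=n2)  # R > 0, favouring X1 -> X2
```

The function and variable names here (`nonsens_lr`, `n1_hat`, `n2_hat`) are hypothetical labels for this sketch, not identifiers from the authors' implementation.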
[Figure 4 plot: F1 scores for each algorithm (NonSENS, RESIT, LiNGAM, PC Alg., CD-NOD) at mixing-DNN depths l = 1, . . . , 5.]

Figure 4: F1 score for multivariate causal discovery over 6-dimensional data. For each algorithm, we plot the F1 scores as we vary the depth of the mixing-DNN from l = 1, . . . , 5. Higher F1 scores indicate better performance.

[Figure 5 plot: estimated causal DAG over the regions PRc, ERc, PHc, CA1, DG and Sub.]

Figure 5: Estimated causal DAG on fMRI Hippocampal data by the proposed method. Blue edges are feasible given anatomical connectivity; red edges are not.

In particular, we set the off-diagonal entries of A(l) to be identical and non-zero, resulting in a cyclic causal structure. In the context of such data, we would expect all methods to report that the causal structure is inconclusive 95% of the time, as all tests are Bonferroni corrected at the α = 5% level. The bottom panels of Figure 2 show the proportion of times the causal structure is correctly reported as inconclusive. The results indicate that all methods are overly conservative in their testing, and become increasingly conservative as the depth, l, increases.

We also consider the performance of all algorithms in the context of a fixed number of experimental conditions, |E| = 10, and an increasing number of observations per condition, ne, in Supplementary H.

Furthermore, we also consider the scenario where a causal effect is assumed to exist. In this scenario, we consider both the likelihood ratio approach described in Section 3.4, termed NonSENS LR, and a heuristic approach of comparing the p-values of independence tests, termed NonSENS p-val. In the case of algorithms such as RESIT, we compare p-values in order to determine direction. The results for these experiments are shown in Figure 3. The top panels show results as the number of experimental conditions, |E|, increases; as before, we fix the number of observations per condition to ne = 512. The bottom panels show results for a fixed number of experimental conditions, |E| = 10, as we increase the number of observations per condition. We note that the proposed measure of causal direction out-performs the alternative algorithms. Performance in Figure 3 appears significantly higher than that shown in Figure 2 due to the fact that a causal effect is known to exist; this reduces the bivariate causal discovery problem to recovering the causal ordering over X1 and X2. The CD-NOD algorithm cannot easily be extended to assume the existence of a causal effect and is therefore not included in these experiments.

Finally, the results for multivariate causal discovery are presented in Figure 4, where we plot the F1 score between the true and inferred DAGs as the depth of the mixing-DNN increases. The proposed method is competitive across all depths. In particular, the proposed method outperforms the PC algorithm, indicating that its use to resolve undirected edges is beneficial.

4.2 HIPPOCAMPAL FMRI DATA

As a real-data application, the proposed method was applied to resting-state fMRI data collected from six distinct brain regions as part of the MyConnectome project (Poldrack et al., 2015). Data was collected from a single subject over 84 successive days. Further details are provided in Supplementary Material I. We treated each day as a distinct experimental condition and employed the multivariate extension of the proposed method. For each unresolved edge, we employed NonSENS as described in Section 3.3 with a 5-layer network. The results are shown in Figure 5. While there is no ground truth available, we highlight in blue all estimated edges which are feasible given the anatomical connectivity between the regions, and in red the estimated edges which are not feasible (Bird and Burgess, 2008). We note that the proposed method recovers feasible directed connectivity structures for the entorhinal cortex (ERc), which is known to play a prominent role within the hippocampus.

5 CONCLUSION

We present a method to perform causal discovery in the context of general non-linear SEMs in the presence of non-stationarities or different conditions. This is in contrast to alternative methods, which often require restrictions on the functional form of the SEMs. The proposed method exploits the correspondence between non-linear ICA and non-linear SEMs, as originally considered in the linear setting by Shimizu et al. (2006). Notably, we established the identifiability of causal direction from a completely different angle, by making use of non-stationarity instead of constraining functional classes. Developing computationally more efficient methods for the multivariate case is one line of our future work.
References

Anthony Bell and Terrence Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Comput., 7(6):1129–1159, 1995.

Chris M. Bird and Neil Burgess. The hippocampus and memory: Insights from spatial processing. Nat. Rev. Neurosci., 9(3):182–194, 2008.

Patrick Blöbaum, Dominik Janzing, Takashi Washio, Shohei Shimizu, and Bernhard Schölkopf. Cause-effect inference by comparing regression errors. AISTATS, 2018.

Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. Proc. Twenty-First Conf. Uncertain. Artif. Intell., pages 178–184, 2005.

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. Int. Conf. Algorithmic Learn. Theory, pages 63–77, 2005.

Patrik O. Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. Neural Inf. Process. Syst., pages 689–696, 2009.

Aapo Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks Learn. Syst., 10(3):626–634, 1999.

Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 6:695–708, 2005.

Aapo Hyvärinen. Some extensions of score matching. Comput. Stat. Data Anal., 51(5):2499–2512, 2007.

Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Neural Inf. Process. Syst., 2016.

Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.

Aapo Hyvärinen and Stephen M. Smith. Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. J. Mach. Learn. Res., 14:111–152, 2013.

Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Phys. Rev. E, 69(6), 2004.

Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Neural Inf. Process. Syst., 2017.

Judea Pearl. Causality. Cambridge University Press, 2009.

Jonas Peters, Joris Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. J. Mach. Learn. Res., 15:2009–2053, 2014.

Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. J. R. Stat. Soc. Ser. B, pages 947–1012, 2016.

Dinh Tuan Pham and Jean-Francois Cardoso. Blind separation of instantaneous mixtures of non-stationary sources. IEEE Trans. Signal Process., 49(9):1837–1848, 2001.

Russell A. Poldrack et al. Long-term neural and physiological phenotyping of a single human. Nat. Commun., 6, 2015.

Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Int. Conf. Mach. Learn., pages 1255–1262, 2012.

Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res., 7:2003–2030, 2006.

Shohei Shimizu et al. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. J. Mach. Learn. Res., 12:1225–1248, 2011.

Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. Appl. Informatics, 2016.

Peter Spirtes, Clark Glymour, Richard Scheines, David Heckerman, Christopher Meek, and Thomas Richardson. Causation, Prediction and Search. MIT Press, 2000.

David Van Essen et al. The Human Connectome Project: A data acquisition perspective. NeuroImage, 62(4):2222–2231, 2012.

Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. Proc. Twenty-Fifth Conf. Uncertain. Artif. Intell., pages 647–655, 2009.

Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. Proc. 30th Int. Conf. Mach. Learn., 28:819–827, 2013.

Kun Zhang, Biwei Huang, Jiji Zhang, Clark Glymour, and Bernhard Schölkopf. Causal discovery from non-stationary/heterogeneous data: Skeleton estimation and orientation determination. In Int. Jt. Conf. Artif. Intell., pages 1347–1353, 2017.
