Causal Discovery With General Non-Linear Relationships Using Non-Linear ICA
2. Data is available over at least three distinct experimental conditions and the latent disturbances, N_j, are generated according to equation (3).

3. We employ TCL, with a sufficiently deep neural network as the feature extractor, followed by linear ICA (as described in Section 3.2) on the hidden representations to recover the latent sources.

4. We employ an independence test which can capture any type of departure from independence, for example HSIC, with Bonferroni correction and significance level α.

Then in the limit of infinite data the proposed method will identify the cause variable with probability 1 − α.

See Supplementary D for a proof. Theorem 2 extends previous identifiability results relying on constraints on functional classes (e.g., ANM in Hoyer et al. (2009)) to the domain of arbitrary non-linear models, under further assumptions on the nonstationarity of the given data.
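As an illustration of the procedure described in conditions 2–4, a minimal Python sketch of the resulting bivariate decision rule is given below. The helpers `estimate_sources` (TCL plus linear ICA) and `hsic_pvalue` (an HSIC test returning a p-value) are assumed rather than prescribed, and the decision logic is one plausible reading of the testing step, not the authors' exact implementation.

```python
import numpy as np

def nonsens_bivariate(x1, x2, estimate_sources, hsic_pvalue, alpha=0.05):
    """Sketch of the bivariate testing step; `estimate_sources` and
    `hsic_pvalue` are assumed helpers, not a prescribed API."""
    n_hat = estimate_sources(x1, x2)  # two estimated latent disturbances
    # Four (variable, source) pairs are tested, hence Bonferroni: alpha / 4.
    p = np.array([[hsic_pvalue(x, n) for n in n_hat] for x in (x1, x2)])
    retained = p > alpha / 4  # True where independence is retained
    # The cause should be independent of exactly one estimated disturbance
    # (the one driving the effect), while the effect depends on both.
    if retained[0].sum() == 1 and retained[1].sum() == 0:
        return "X1 causes X2"
    if retained[1].sum() == 1 and retained[0].sum() == 0:
        return "X2 causes X1"
    return "no decision"
```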
3.4 LIKELIHOOD RATIO-BASED MEASURES OF CAUSAL DIRECTION

While independence tests are widely used in causal discovery, they may not be statistically optimal for deciding causal direction. In this section, we further propose a novel measure of causal direction which is based on the likelihood ratio under non-linear causal models, and which is thus likely to be more efficient.

The proposed measure can be seen as the extension of linear measures of causal direction, such as those proposed by Hyvärinen and Smith (2013), to the domain of non-linear models. In the linear case, assuming the model X_1 → X_2, the log-likelihood can be written as:

L_{1 \to 2} = \log p_{X_1}(X_1) + \log p_{N_2}(N_2).

In the asymptotic limit we can take the expectation of the log-likelihood, which converges to:

E[L_{1 \to 2}] = -H(X_1) - H(N_2)     (7)

where H(·) denotes the differential entropy. Hyvärinen and Smith (2013) note that the benefit of equation (7) is that only univariate approximations of the differential entropy are required. In this section we seek to derive equivalent measures of causal direction without the assumption of linear causal effects. Recall that after training via TCL, we obtain an estimate of g = f^{-1} which is parameterized by a deep neural network.

In order to compute the log-likelihood, L_{1 \to 2}, we consider the following change of variables:

\begin{pmatrix} X_1 \\ N_2 \end{pmatrix} = \tilde{g} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ g_2(X_1, X_2) \end{pmatrix}

where we note that g_2 : R^2 → R refers to the second component of g. Further, we note that the mapping g̃ applies the identity to the first element, thereby leaving X_1 unchanged. Given such a change of variables, we may evaluate the log-likelihood as follows:

L_{1 \to 2} = \log p_{X_1}(X_1) + \log p_{N_2}(N_2) + \log |\det J\tilde{g}|,

where J\tilde{g} denotes the Jacobian of g̃, and we have X_1 ⊥⊥ N_2 by construction under the assumption that X_1 → X_2.

Due to the particular choice of g̃, we are able to easily evaluate the Jacobian, which can be expressed as:

J\tilde{g} = \begin{pmatrix} \partial \tilde{g}_1 / \partial X_1 & \partial \tilde{g}_1 / \partial X_2 \\ \partial \tilde{g}_2 / \partial X_1 & \partial \tilde{g}_2 / \partial X_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \partial g_2 / \partial X_1 & \partial g_2 / \partial X_2 \end{pmatrix}.
As a result, the determinant can be directly evaluated as ∂g_2/∂X_2. Furthermore, since g_2 is parameterized by a deep network, we can directly evaluate its derivative with respect to X_2. This allows us to directly evaluate the log-likelihood of X_1 being the causal variable as:

L_{1 \to 2} = \log p_{X_1}(X_1) + \log p_{N_2}(N_2) + \log \left| \frac{\partial g_2}{\partial X_2} \right|.

Finally, we consider the asymptotic limit and obtain the non-linear generalization of equation (7) as:

E[L_{1 \to 2}] = -H(X_1) - H(N_2) + E \left[ \log \left| \frac{\partial g_2}{\partial X_2} \right| \right].

In practice we use the sample mean instead of the expectation.
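Because g is parameterized by a standard feed-forward network, the Jacobian term above is available through automatic differentiation. The following is a minimal PyTorch sketch of the sample-mean estimate of E[log |∂g_2/∂X_2|], assuming a trained unmixing network g applied row-wise; it is an illustration under those assumptions, not the implementation used in the paper.

```python
import torch

def expected_log_abs_dg2_dx2(g, X):
    """Sample-mean estimate of E[log |dg2/dX2|] for an unmixing network
    g: R^2 -> R^2 applied row-wise to an (n, 2) tensor of observations X.
    Illustrative sketch; g is an assumed trained estimate of f^{-1}."""
    X = X.clone().requires_grad_(True)
    n_hat = g(X)  # estimated latent disturbances, shape (n, 2)
    # Summing over the batch gives per-sample gradients, since each output
    # row depends only on the corresponding input row.
    grads = torch.autograd.grad(n_hat[:, 1].sum(), X)[0]
    dg2_dx2 = grads[:, 1]  # per-sample value of dg2/dX2
    return torch.log(dg2_dx2.abs() + 1e-12).mean()
```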
One remaining issue to address is the permutation invariance of the estimated sources (note that this permutation is not about the causal order of the observed variables). We must consider both permutations π of the set {1, 2}. In order to resolve this issue, we note that if the true permutation is π = (1, 2), then assuming X_1 → X_2, we have ∂g_1/∂X_2 = 0 while ∂g_2/∂X_2 ≠ 0. This is because g_1 unmixes observations to return the latent disturbance for the causal variable, X_1, and is therefore not a function of X_2. The converse is true if the permutation is π = (2, 1). Similar reasoning can be employed for the reverse model: X_2 → X_1. As such, we propose to select the permutation as follows:

\pi^* = \arg\max_{\pi} \; E \left[ \log \left| \frac{\partial g_{\pi(2)}}{\partial X_2} \right| \right] + E \left[ \log \left| \frac{\partial g_{\pi(1)}}{\partial X_1} \right| \right].
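Continuing the PyTorch sketch above, the permutation can be selected directly from per-sample Jacobian estimates (again an assumed illustration rather than the authors' code):

```python
def select_permutation(g, X):
    """Return the source permutation (pi(1), pi(2)), as indices over {0, 1},
    that maximizes the criterion above. Continues the earlier sketch."""
    X = X.clone().requires_grad_(True)
    n_hat = g(X)
    # grads[k][:, j] holds per-sample estimates of dg_k / dX_j.
    grads = [torch.autograd.grad(n_hat[:, k].sum(), X, retain_graph=True)[0]
             for k in (0, 1)]
    def score(pi):
        return (torch.log(grads[pi[1]][:, 1].abs() + 1e-12).mean()
                + torch.log(grads[pi[0]][:, 0].abs() + 1e-12).mean()).item()
    return max([(0, 1), (1, 0)], key=score)
```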
For a chosen permutation, π*, we may therefore compute the likelihood ratio in equation (6) as:

R = -H(X_1) - H(N_{\pi^*(2)}) + E \left[ \log \left| \frac{\partial g_{\pi^*(2)}}{\partial X_2} \right| \right] + H(X_2) + H(N_{\pi^*(1)}) - E \left[ \log \left| \frac{\partial g_{\pi^*(1)}}{\partial X_1} \right| \right].

If R is positive, we conclude that X_1 is the causal variable, whereas if R is negative X_2 is reported as the causal variable. When computing the differential entropy, we employ the approximations described in Kraskov et al. (2004). We note that such approximations require variables to be standardized; in the case of latent variables this can be achieved by defining a further change of variables corresponding to a standardization.

Finally, we note that the likelihood ratio presented above can be connected to the independence measures employed in Section 3.3 when mutual information is used as a measure of statistical dependence. In particular, we have

R = -I(X_1, N_{\pi(2)}) + I(X_2, N_{\pi(1)}),     (8)

where I(·, ·) denotes the mutual information between two variables. We provide a full derivation in Supplementary E. This result serves to connect the proposed likelihood ratio to independence testing methods for causal discovery which use mutual information.
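Putting these pieces together, a sketch of the resulting decision rule follows. SciPy's spacing-based `differential_entropy` is substituted here for the Kraskov et al. (2004) approximation, and variables are simply standardized before entropy estimation; the signature and helper names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import differential_entropy

def likelihood_ratio_direction(x1, x2, n_cause_hat, n_effect_hat,
                               log_dg2_dx2, log_dg1_dx1):
    """Evaluate R for a chosen permutation pi* and report the cause.
    x1, x2                    : observed variables (1-D arrays)
    n_cause_hat, n_effect_hat : estimated disturbances g_{pi*(1)}, g_{pi*(2)}
    log_dg2_dx2, log_dg1_dx1  : per-sample log|dg_{pi*(2)}/dX2| and
                                log|dg_{pi*(1)}/dX1| (e.g., from autograd)
    Sketch only: the paper instead absorbs standardization into a further
    change of variables."""
    z = lambda v: (v - v.mean()) / v.std()  # standardize for the estimator
    R = (-differential_entropy(z(x1)) - differential_entropy(z(n_effect_hat))
         + np.mean(log_dg2_dx2)
         + differential_entropy(z(x2)) + differential_entropy(z(n_cause_hat))
         - np.mean(log_dg1_dx1))
    return "X1 is the cause" if R > 0 else "X2 is the cause"
```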
3.5 EXTENSION TO MULTIVARIATE DATA

It is not straightforward to extend NonSENS to multivariate cases. Due to the permutation invariance of the sources, we would require d^2 independence tests, where d is the number of variables, leading to a significant drop in power after Bonferroni correction. Likewise, the likelihood ratio test inherently considers only two variables. Instead, we propose to extend the proposed method to the domain of multivariate causal discovery by employing it in conjunction with a traditional constraint-based method such as the PC algorithm, as in Zhang and Hyvärinen (2009). Formally, the PC algorithm is first employed to estimate the skeleton and orient as many edges as possible. Any remaining undirected edges are then directed using either of the proposed bivariate methods, as sketched below.
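A minimal sketch of this hybrid scheme, assuming a PC implementation that returns a partially directed graph and one of the bivariate routines above (both helpers are assumptions, not a prescribed API):

```python
def nonsens_multivariate(data, pc_partially_direct, orient_pair):
    """Hybrid multivariate discovery: PC first, bivariate rule second.
    data                : (n, d) array of observations
    pc_partially_direct : assumed helper running the PC algorithm and
                          returning (directed, undirected) edge sets
    orient_pair         : assumed bivariate rule (independence-test or
                          likelihood-ratio based) returning True when its
                          first argument is judged to be the cause."""
    directed, undirected = pc_partially_direct(data)
    for i, j in undirected:
        if orient_pair(data[:, i], data[:, j]):
            directed.add((i, j))  # orient as i -> j
        else:
            directed.add((j, i))  # orient as j -> i
    return directed
```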
3.6 RELATIONSHIP TO PREVIOUS METHODS

NonSENS is closely related to the linear ICA-based methods described in Shimizu et al. (2006). However, there are important differences: LiNGAM focuses exclusively on linear causal models whilst NonSENS is specifically designed to recover arbitrary non-linear causal structure. Moreover, the proposed method is mainly designed for bivariate causal discovery whereas the original LiNGAM method can easily perform multivariate causal discovery by permuting the estimated ICA unmixing matrix. In this sense NonSENS is more closely aligned to the Pairwise LiNGAM method (Hyvärinen and Smith, 2013).

Hoyer et al. (2009) and Peters et al. (2014) propose a non-linear causal discovery method named regression and subsequent independence test (RESIT) which is able to recover the causal structure under the assumption of an additive noise model. RESIT essentially shares the same underlying idea as NonSENS, with the difference being that it estimates latent disturbances via non-linear regression, as opposed to via non-linear ICA. Related is the Regression Error Causal Inference (RECI) algorithm (Blöbaum et al., 2018), which proposes measures of causal direction based on the magnitude of (non-linear) regression errors. Importantly, both of those methods restrict the non-linear relations to have additive noise.

Recently several methods have been proposed which seek to exploit non-stationarity in order to perform causal discovery. Following Schölkopf et al. (2012), Peters et al. (2016) propose to leverage the invariance of causal models under covariate shift in order to recover the true causal structure. Their method, termed Invariant Causal Prediction (ICP), is tailored to the setting where data is collected across a variety of experimental regimes, similar to ours. However, their main results, including identifiability, are in the linear or additive noise settings.

Zhang et al. (2017) proposed a method, termed CD-NOD, for causal discovery from heterogeneous, multiple-domain data or non-stationary data, which allows for general non-linearities. Their method thus solves a problem similar to ours, although with a very different approach. Their method accounts for non-stationarity, which manifests itself via changes in the causal modules, via the introduction of a surrogate variable representing the domain or time index into the causal DAG. Conditional independence testing is employed to recover the skeleton over the augmented DAG, and their method does not produce an estimate of the SEM to represent the causal mechanism.

4 EXPERIMENTAL RESULTS

In order to demonstrate the capabilities of the proposed method we consider a series of experiments on synthetic data as well as real neuroimaging data.
4.1 SIMULATIONS ON ARTIFICIAL DATA

In the implementation of the proposed method we employed deep neural networks of varying depths as feature extractors. All networks were trained on the cross-entropy loss using stochastic gradient descent. In the final linear unmixing required by TCL, we employ the linear ICA model described in Section 3.2. For independence testing, we employ HSIC with a Gaussian kernel. All tests are run at the α = 5% level and Bonferroni corrected.

We benchmark the performance of the NonSENS algorithm against several state-of-the-art methods. As a measure of performance against linear methods we compare against LiNGAM; in particular, we compare performance to DirectLiNGAM (Shimizu et al., 2011). In order to highlight the need for non-linear ICA methods, we also consider the performance of the proposed method where linear ICA is employed to estimate latent disturbances; we refer to this baseline as Linear-ICA NonSENS. We further compare against the RESIT method of Peters et al. (2014). Here we employ Gaussian process regression to estimate non-linear effects and HSIC as a measure of statistical dependence. Finally, we also compare against the CD-NOD method of Zhang et al. (2017) as well as the RECI method presented in Blöbaum et al. (2018). For the latter, we employ Gaussian process regression and note that this method assumes the presence of a causal effect, and is therefore only included in some experiments. We provide a description of each of the methods in the Supplementary material F.
We generate synthetic data from the non-linear ICA model detailed in Section 2.2. Non-stationary disturbances, N, were randomly generated by simulating Laplace random variables with distinct variances in each segment. For the non-linear mixing function we employ a deep neural network ("mixing-DNN") with randomly generated weights such that:

X^{(1)} = A^{(1)} N,     (9)
X^{(l)} = A^{(l)} f(X^{(l-1)}),     (10)

where we write X^{(l)} to denote the activations at the l-th layer and f corresponds to the leaky-ReLU activation function, which is applied element-wise. We restrict the matrices A^{(l)} to be lower-triangular in order to introduce acyclic causal relations. In the special case of multivariate causal discovery, we follow Peters et al. (2014) and include edges with probability 2/(d-1), implying that the expected number of edges is d. We present experiments for d = 6 dimensions. Note that equation (9) follows the LiNGAM. For depths l ≥ 2, equation (10) generates data with non-linear causal structure.
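For concreteness, the following NumPy sketch generates data of this form; the segment scales, leaky-ReLU slope, and other constants are illustrative choices, not the exact settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_tril(d):
    """Random lower-triangular mixing matrix with non-zero diagonal."""
    A = np.tril(rng.normal(size=(d, d)))
    A[np.diag_indices(d)] = np.sign(rng.normal(size=d)) * rng.uniform(0.5, 1.5, size=d)
    return A

def generate_mixing_dnn(n_segments=10, n_per_segment=512, depth=3, d=2, slope=0.2):
    """Simulate data via equations (9)-(10): segment-wise non-stationary
    Laplace disturbances pushed through random lower-triangular layers with
    element-wise leaky-ReLU non-linearities. depth=1 reduces to LiNGAM."""
    leaky_relu = lambda z: np.where(z > 0, z, slope * z)
    # Distinct Laplace scale per segment and per source (non-stationarity).
    scales = rng.uniform(0.5, 3.0, size=(n_segments, d))
    N = np.concatenate([rng.laplace(scale=s, size=(n_per_segment, d))
                        for s in scales])
    X = N @ random_tril(d).T                  # equation (9)
    for _ in range(depth - 1):                # equation (10)
        X = leaky_relu(X) @ random_tril(d).T
    segments = np.repeat(np.arange(n_segments), n_per_segment)  # TCL labels
    return X, segments
```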
Throughout the experiments we vary the following factors: the number of distinct experimental conditions (i.e., distinct segments), the number of observations per segment, n_e, as well as the depth, l, of the mixing-DNN. In the context of bivariate causal discovery we measure how frequently each method is able to correctly identify the cause variable. For multivariate causal discovery we consider the F1 score, which serves to quantify the agreement between the estimated and true DAGs.

Figure 2 shows the results for bivariate causal discovery as the number of distinct experimental conditions, |E|, increases while the number of observations within each condition is fixed at n_e = 512. Each horizontal panel shows the results as the depth of the mixing-DNN is increased from l = 1 to l = 5. The top panels show the proportion of times the correct cause variable was identified across 100 independent simulations. In particular, the first top panel corresponds to linear causal dependencies; as such, all methods are able to accurately recover the true cause variable. However, as the depth of the mixing-DNN increases, the causal dependencies become increasingly non-linear and the performance of all methods deteriorates. While we attribute this drop in performance to the increasingly non-linear nature of the causal structure, we note that the NonSENS algorithm is able to out-perform all alternative methods.

The bottom panels of Figure 2 show the results when no directed acyclic causal structure is present. Here data was generated such that A^{(l)} was not lower-triangular.
Figure 2: Experimental results indicating performance as we increase the number of experimental conditions, |E|, whilst keeping the number of observations per condition fixed at n_e = 512. Each horizontal panel plots results for varying depths of the mixing-DNN, ranging from l = 1, . . . , 5. The top panels show the proportion of times the correct cause variable is identified when a causal effect exists. The bottom panels consider data where no acyclic causal structure exists (A^{(l)} are not lower-triangular) and report the proportion of times no causal effect is correctly reported. The dashed, horizontal red line indicates the theoretical (1 − α)% true negative rate. For clarity we omit the standard errors, but we note that they were small in magnitude (approximately 2–5%).
[Figure 2 panels: proportion correct versus the number of experimental conditions, one column per mixing-DNN depth, with curves for NonSENS, Lin-ICA NonSENS, DirectLiNGAM, RESIT, ICP and CD-NOD.]
Figure 3: Experimental results visualizing performance under the assumption that a causal effect exists. This reduces the bivariate causal discovery problem to recovering the causal ordering over X_1 and X_2. The top panel considers an increasing number of experimental conditions whilst the bottom panel shows results when we vary the number of observations within a fixed number of experimental conditions, |E| = 10. Each horizontal panel plots results for varying depths of the mixing-DNN, ranging from l = 1, . . . , 5.