
Deep Causal Learning: Representation, Discovery and Inference

ZIZHEN DENG, XIAOLONG ZHENG*, HU TIAN, and DANIEL DAJUN ZENG, School of
Artificial Intelligence, University of Chinese Academy of Sciences; The State Key Laboratory of Management and
Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Causal learning has attracted much attention in recent years because causality reveals the essential relationship between things and
indicates how the world progresses. However, there are many problems and bottlenecks in traditional causal learning methods, such as
high-dimensional unstructured variables, combinatorial optimization problems, unknown interventions, unobserved confounders,
selection bias and estimation bias. Deep causal learning, that is, causal learning based on deep neural networks, brings new insights for
addressing these problems. While many deep learning-based causal discovery and causal inference methods have been proposed, there
is a lack of reviews exploring the internal mechanism of deep learning to improve causal learning. In this article, we comprehensively
review how deep learning can contribute to causal learning by addressing conventional challenges from three aspects: representation,
discovery, and inference. We point out that deep causal learning is important for the theoretical extension and application expansion of
causal science and is also an indispensable part of general artificial intelligence. We conclude the article with a summary of open issues
and potential directions for future work.

CCS CONCEPTS • Computing methodologies → Machine learning; Neural network; Causal reasoning and diagnostics;
Mathematics of computing → Causal networks

Additional Keywords and Phrases: Deep learning, Causal variables, Causal discovery, Causal inference

1 INTRODUCTION
The study of causality has always been a very important part of scientific research. Causality has been studied in many
fields, such as biology [1-2], medicine [3-7], economics [8-12], epidemiology [13-15], and sociology [16-20]. For the
construction of general artificial intelligence systems, causality is also indispensable [21-23]. For a long time, causal
discovery and causal inference have been the main research directions of causal science. In causal discovery (CD) [24-
25], causal relationships are found in observational data, usually in the form of a causal graph. The traditional causal
discovery methods typically include constraint-based methods [26], score-based methods [27], asymmetric (function)-
based methods [28-29] and other classes of methods [30-32]. In causal inference (CI) [33-34], the causal effect is
estimated, which can be further divided into causal identification and causal estimation [35]. In causal identification,
whether the causal effect can be estimated based on the existing information is determined, and in causal estimation,
specific causal effect values are obtained. There are two mainstream frameworks for causal inference: the structural causal model (SCM) [33] and the potential outcome model (POM) [34]. Many methods have been developed by previous researchers based on these two frameworks, such as front-door adjustment, back-door adjustment [35], matching [36-37], propensity scores [38-40], and doubly robust regression [41-42].

*Authors' address: Z. Deng, X. Zheng (corresponding author), H. Tian and D. Zeng, The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; emails: {dengzizhen2021, xiaolong.zheng, tianhu2018, dajun.zeng}@ia.ac.cn.
Although there have been many methods in the field of causal learning, there are still many unsolved problems. In the
past, causality was usually studied on low-dimensional structured data, so there was no need to extract features from the
data. However, with the expansion of application scenarios, many high-dimensional unstructured data need to be
processed, such as images, text and video [43-47]. To discuss causal learning more conveniently and clearly, we refer to
all variables involved in causal tasks as causal variables. (e.g., variables in causal discovery and variables in causal
inference). Even if causal variables are structured, their covariate distributions could be unbalanced, becoming a source
of selection bias. For causal discovery, most methods require strong assumptions (such as linear non-Gaussian, causal
Markov conditions, and faithfulness assumptions) [26, 48-49], which are often impossible to verify. In addition, most of
the traditional causal discovery methods are based on combinatorial optimization [25], which is comparatively
intractable when the number of nodes is large. For causal inference, due to the lack of counterfactual data, the gold
standard for estimating causal effects is randomized controlled trials (RCTs) [50]. However, in reality, we are often unable to
do this due to high costs or ethical constraints. Therefore, it is common to use observational data to make causal
inferences. The key problem with causal inference from observational data is selection bias [33], which consists of
confounding bias (from confounders) and data selection bias (from colliders). Due to selection bias, we may observe
false causality or see correlation as causation. Traditional causal inference methods often have large estimation bias due
to limited fitting ability [51-52]. There are also some common persistent issues in the causal field, such as unknown
interventions [53], unobserved confounders [54], and missing data [55].
We review the use of deep learning methods to address the above problems in the causal learning field from three
points of view: representation, discovery and inference. The three core strengths of deep learning for causal learning are strong representational capabilities, fitting capabilities, and the ability to approximate data generation mechanisms. First, deep representation learning for causal variables [22, 56] uses deep learning methods to learn the low-dimensional
balanced structured representation of high-dimensional, unstructured and unbalanced data so that variables can be better
used for causal discovery and causal inference [57]. Second, the universal approximation theorem indicates that neural
networks can learn extremely complex functions to estimate heterogeneous treatment effects with low bias estimators
[58-59]. Because of their general fitting ability and flexibility, neural networks have become the main continuous optimization tool for solving the long-standing combinatorial optimization problem in causal discovery and can
theoretically cope with large-scale data. Finally, deep learning methods can generate counterfactuals implicitly through
adversarial learning (usually implemented with a generative adversarial network (GAN) [60]) or explicitly model the
data generating process through disentanglement mechanisms to generate proxy/latent variables (usually implemented
with a variational autoencoder (VAE) [61]). Neural network-based methods for modeling data mechanisms require very
little prior knowledge and do not make many assumptions about the relationship between variables, so that deep
learning-based causal inference and causal discovery methods allow the presence of unobserved confounders and can
also make use of intervention data. Figure 1 shows the main difference between traditional causal learning and deep
causal learning. We can clearly see the improvements that deep learning brings to causal learning.
In the past few years, many deep learning-based causal discovery and causal inference methods have been proposed.
There have been many reviews related to causal learning, but few of them summarized how to take advantage of deep
learning to improve causal learning methods. Guo et al. [23] reviewed causal discovery and causal inference methods


Figure 1: The difference between causal learning and deep causal learning. The comparison between (a) and (b) shows the theoretical
advantages of deep causal learning. In the framework of deep causal learning, unstructured data can be processed with the
representational power of neural networks. With the modeling capabilities of neural networks, in causal discovery, observational data
and (known or unknown) intervention data can be comprehensively used in the presence of unobserved confounders to obtain a causal
graph that is closer to the facts. With the fitting ability of neural networks, the estimation bias of causal effects can be reduced in causal
inference. The 4 orange arrows represent the neural network's empowerment of representation, discovery, and inference. (c) and (d)
demonstrate the advantages of deep causal learning in more detail by exploring examples of the effect of exercise on blood pressure. We
assume that the ground truth of exercise on blood pressure is 𝐸(𝑋3|𝑑𝑜(𝑋2 = 𝑥)) = −1.1𝑥 + 84.

with observational data, but very little content was related to deep learning. Yao et al. [34] focused on causal inference
based on the potential outcome framework. It also mentioned some causal inference methods based on neural networks,
but the description was not systematic. Nogueira et al. [62] summarized causal discovery and causal inference datasets,
software, evaluation metrics and running examples without focusing on the theoretical level. Glymour et al. [24] mainly
reviewed traditional causal discovery methods, and Vowels et al. [25] focused on continuous optimization-based
discovery methods. There have been reviews that combined causality with machine learning [22, 63-64], but these
survey papers mainly explored how causal knowledge can be used to solve problems in the machine learning community.
The work of Koch et al. [65] is more similar to our starting point, and focused on the improvements that deep learning
brings to causal learning. However, they only considered the combination of deep learning and causal inference under
the framework of the potential outcome model and did not address other aspects of the field of causal learning, such as
the representation of causal variables, causal discovery, and causal inference under the framework of the structural causal
model. In this article, we argue that deep representation learning for causal variables, deep causal discovery, and deep
causal inference together constitute the field of deep causal learning, as these three parts cover the general process of

Figure 2: An overview of the structure of the survey. 1. Introduction; 2. Preliminaries (2.1 Structural causal model, 2.2 Potential outcome model, 2.3 NNs for causal learning); 3. Deep Representation Learning for Causal Variables (3.1 Regularization adjustment for covariate balance, 3.2 Implicit generation of counterfactual data, 3.3 Explicit generation for disentangled causal mechanisms); 4. Deep Causal Discovery (4.1 Observation data-driven CD, 4.2 Intervention data-driven CD, 4.3 Unobserved confounders data-driven CD); 5. Deep Causal Inference (5.1 CI through covariate balance, 5.2 Adversarial training for CI, 5.3 Proxy variable-based CI, 5.4 Deep structural causal model); 6. Conclusion and Future Directions.

exploring causal relationships: representation, discovery, and inference. We present a more comprehensive and detailed
review of the changes that deep learning brings to causal learning. The overall framework of this survey is shown in
Figure 2.
The rest of this article is organized as follows. Section 2 provides basic concepts related to causality, including
structural causal models, potential outcome models, causal discovery and some main types of neural networks used in
deep causal learning. Section 3 reviews three kinds of deep representation learning methods for causal variables based on
regularization adjustment, implicit generation, and explicit generation. In Section 4, we introduce deep causal discovery methods for observational data, intervention data, and data with unobserved confounders. Section 5 reviews the
deep causal inference methods based on covariate balance, adversarial training and proxy variables. In addition, the deep
structural causal model is introduced. Section 6 provides a conclusion and discusses the future directions of deep causal
learning.

2 PRELIMINARIES
In this section, we briefly introduce causal discovery, causal inference and some main types of neural networks used in
deep causal learning. There are two mainstream frameworks for causal inference: the structural causal model (SCM) and
the potential outcome model (POM). Our survey focuses on the combination of causality and deep learning in both
frameworks. The composition, assumptions, and main methods of these two frameworks are introduced to provide the
necessary background knowledge for subsequent integration with deep learning methods. Table 1 presents the basic
notations used in this article.

Table 1: Basic notations and their corresponding descriptions

$X_i$: covariates of sample $i$.
$T_i$: treatment of sample $i$.
$Y_i^F$: factual outcome of sample $i$.
$Y_i^{CF}$: counterfactual outcome of sample $i$.
$e(X_i)$: propensity score of $X_i$.
$\Phi(X_i)$: covariate representation of $X_i$.
$\theta$: neural network parameters.
$h(\cdot)$: neural network mapping.
$IPM_G(\cdot)$: integral probability metric.
$\|\cdot\|_p$: $p$-norm.
$Loss(\cdot)$: loss function.
$V$: the set of endogenous variables.
$U$: the set of exogenous variables.
$F$: the set of mapping functions.
$Pa_v$: the parent nodes of $v$.
$P(U)$: distribution of exogenous variables.
$I$: instrumental variable.
$Z$: hidden/latent factors.
$dis(\cdot)$: distance measure.
$E_p$: the expectation based on data distribution $p$.
$N(\mu, \sigma^2)$: Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
$\alpha, \beta, \gamma$: hyperparameters.

2.1 Structural Causal Model


Definition 1 (Structural Causal Model). A structural causal model is a 4-tuple < 𝑉, 𝑈, 𝐹, 𝑃(𝑈) >, where 𝑉 is the set of endogenous variables, 𝑈 is the set of exogenous variables, 𝐹 is the set of mapping functions, each of which maps the parent nodes 𝑃𝑎𝑣 of a variable 𝑣 to 𝑣, and 𝑃(𝑈) is the distribution of the exogenous variables.

In essence, SCM is a subjective abstraction of the objective world, and the involved endogenous and exogenous
variables are heavily dependent on the researchers' prior knowledge. That is, the definitions of these variables themselves
are not necessarily accurate, or the most essential variables cannot be observed due to various limitations. For example,
when studying the effect of a person's family status on their academic performance, we might use the family's annual
income as a proxy variable, although this variable may not be entirely appropriate or even correct.
Definition 2 (Causal Graph). Usually, each SCM has a corresponding causal graph, which is typically a directed acyclic graph (DAG) [33]. In fact, the causal graph can be seen as an integral part of the SCM, in addition to counterfactual logic. As shown in Figure 3, there are three basic structures in a causal graph: Chain (a), Fork (b) and Collider (c). These three basic structures constitute a variety of causal graphs. In Figure 3 (e), 𝑋 represents the covariates, 𝑌 represents the outcome, 𝑇 represents the treatment, 𝐶 represents the confounders, 𝐼 represents the instrumental variable, 𝑈𝑡 represents the exogenous variable of 𝑇, and 𝑈𝑦 represents the exogenous variable of 𝑌.

To calculate various causal effects in the SCM framework, we must understand three forms of data corresponding to Pearl's causal hierarchy (PCH) [35, 66]: observational data, intervention data and
counterfactual data. Observational data represent passive collection without any intervention. Causal effects cannot be
calculated by relying solely on observational data without making any assumptions. Intervention refers to changing the
value or distribution of a few variables; from data collected under such interventions, the average treatment effect (ATE) can be calculated.
Counterfactual data are unavailable in the real world. However, under various assumptions, individual treatment effects
(ITE) can be calculated with the help of counterfactual theory. The capabilities of the three models were discussed in
more detail in previous work [33].
Calculating causal effects is the core of causal inference. The average treatment effect (ATE) under the SCM
framework is a common indicator to measure causal effects. It is defined as follows:

𝐴𝑇𝐸 = 𝐸[𝑌|𝑑𝑜(𝑡 = 1)] − 𝐸[𝑌|𝑑𝑜(𝑡 = 0)]. (1)

Figure 3: Three basic DAGs ((a) Chain, (b) Fork, (c) Collider), a simple structural causal model (e), and two adjustment criteria ((d) back-door criterion, (f) front-door criterion).

As shown in Equation (1), the key to calculating the causal effect is to calculate the probability under the intervention. In
the SCM framework, there are many ways to calculate the probability under the intervention in different situations (for
example, whether there are unobserved confounders). Here, we briefly introduce several commonly used methods.
Back-door adjustment. In the causal graph corresponding to the SCM, there is a pair of ordered variables (𝑋, 𝑌). If
the variable set 𝑍 satisfies the conditions that no node in 𝑍 is a descendant of 𝑋 and that 𝑍 blocks every path between 𝑋 and 𝑌 that contains an arrow pointing into 𝑋, then 𝑍 is said to satisfy the back-door criterion [35] for (𝑋, 𝑌), as shown in Figure 3 (d). If the variable set 𝑍 satisfies the back-door criterion for (𝑋, 𝑌), then the causal effect of 𝑋 on 𝑌 can be calculated using the
following formula:

$$P(Y = y \mid do(X = x)) = \sum_{z} P(Y = y \mid X = x, Z = z)\,P(Z = z). \qquad (2)$$
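To make Equation (2) concrete, the following minimal sketch (with hypothetical binary variables and made-up counts) estimates 𝑃(𝑌 = 1|𝑑𝑜(𝑋 = 1)) from observational data by adjusting for a single discrete confounder 𝑍; it illustrates the formula only, not any specific method from the literature.

```python
import numpy as np

# Hypothetical observational data: columns are (Z, X, Y), all binary.
data = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 1],
    [1, 0, 1], [1, 1, 0], [1, 1, 1], [1, 0, 0],
])
Z, X, Y = data[:, 0], data[:, 1], data[:, 2]

def backdoor_adjustment(x, y):
    """P(Y=y | do(X=x)) = sum_z P(Y=y | X=x, Z=z) P(Z=z), i.e., Equation (2)."""
    total = 0.0
    for z in np.unique(Z):
        p_z = np.mean(Z == z)                      # P(Z=z)
        mask = (X == x) & (Z == z)
        if mask.sum() == 0:                        # positivity must hold for every stratum
            continue
        total += np.mean(Y[mask] == y) * p_z       # P(Y=y | X=x, Z=z) * P(Z=z)
    return total

print(backdoor_adjustment(x=1, y=1))
```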

Front-door adjustment. As shown in Figure 3 (f), a variable set 𝑍 is said to satisfy the front-door criterion [35] for an ordered variable pair (𝑋, 𝑌) if the following conditions hold: 1) 𝑍 intercepts all directed paths from 𝑋 to 𝑌; 2) there is no unblocked back-door path from 𝑋 to 𝑍; and 3) all back-door paths from 𝑍 to 𝑌 are blocked by 𝑋. If 𝑍 satisfies the front-door criterion for the variable pair (𝑋, 𝑌), and 𝑃(𝑥, 𝑧) > 0, then the causal effect of 𝑋 on 𝑌 is identifiable and is calculated by:

$$P(Y = y \mid do(X = x)) = \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\,P(x'). \qquad (3)$$

Causal Discovery. Using observational data to discover causal relationships between variables is a fundamental and
important problem. Causal discovery has a wide range of applications in many fields, such as social science, medicine,
biology and atmospheric sciences. Causal relations among variables are usually described using a DAG, with the nodes
representing variables and the edges indicating probabilistic relations among them. When performing causal discovery,
causal Markov conditions and faithfulness assumptions are often needed. Traditional causal discovery algorithms are
roughly divided into score-based, constraint-based, and functional-based methods.
Constraint-based methods infer the causal graph by analyzing conditional independence in the data. Classical
algorithms include PC and FCI [26]. The score-based methods search through the space of all possible directed acyclic graphs (DAGs) representing the causal structure based on some form of scoring function for network structures. A
typical algorithm is GES [27]. Function-based methods make assumptions about the data-generating mechanism between variables and then judge the causal direction through the asymmetry of the residuals. Representative algorithms include LiNGAM [28] and ANM [29].

2.2 Potential Outcome Model


Definition 3 (Potential Outcome Model). In the context of the potential outcome model, 𝑇 represents the observed
treatment, 𝑌 is the outcome, 𝑖 denotes a specific individual, and 𝑋 denotes all other covariates. We use 𝑌𝑖 (0) to represent
the potential outcome under no treatment and 𝑌𝑖 (1) to represent the potential outcome under treatment. In the real world,
we can only observe one of these outcomes, denoted as $Y_i^F$. Therefore, the observed data we have collected can be expressed as $\{X_i, T_i, Y_i^F\}_{i=1}^{N}$, where 𝑁 is the number of samples. According to the representation of the POM, we define the
causal effect as follows:

Individual Treatment Effect (ITE):

ITE𝑖 = 𝑌𝑖 (𝑇 = 1) − 𝑌𝑖 (𝑇 = 0). (4)

Average Treatment Effect (ATE):

𝐴𝑇𝐸 = 𝐸[𝑌(𝑇 = 1) − 𝑌(𝑇 = 0)]. (5)

Conditional Average Treatment Effect (CATE):

𝐶𝐴𝑇𝐸 = 𝐸[𝑌(𝑇 = 1)|𝑋 = 𝑥] − 𝐸[𝑌(𝑇 = 0)|𝑋 = 𝑥]. (6)

Obtaining accurate causal effects requires both facts and counterfactuals, but we can only obtain factual data, so we aim
to approximate counterfactuals through various methods for estimating causal effects. When making causal estimates, we
usually need to rely on the following basic assumptions [15, 67]:
Stable unit treatment value assumption. The effect of treatment on a unit is independent of the treatment
assignment of other units.
Unconfoundedness. The distribution of treatment is independent of potential outcome when given the observed
variables, which means there are no unobserved confounders 𝑇 ⊥ (𝑌(0), 𝑌(1))|𝑋.
Positivity. Each unit has a nonzero probability of receiving either treatment status when given the observed variables.

0 < 𝑃(𝑇 = 1|𝑋 = 𝑥) < 1. (7)

Next, we introduce several methods for estimating causal effects that are frequently used in the POM framework.
Matching. Matching involves selecting individuals from the treatment group and the control group and repeatedly pairing the two most similar individuals. Outcomes from matched samples can be treated as approximate counterfactuals for each other, so causal effects can be calculated by comparing results from paired samples. When
matching, we can use nearest neighbor matching or set a certain distance threshold.
Propensity score matching. When matching is based directly on covariate distance, good results often cannot be obtained because of high dimensionality or insufficient samples. Therefore, the propensity score matching (PSM) [50] method was proposed, in which the probability 𝑒(𝑋) of a unit being treated is estimated using predictive models [39]. The estimated propensity score is then used as a similarity proxy for matching. The propensity score is defined as follows:

𝑒(𝑋) = 𝑃(𝑇 = 1|𝑋). (8)

Inverse propensity weighting. To further address the problem of insufficient sample utilization caused by matching, we introduce another class of techniques called reweighting. Here, we present the reweighting technique based on the propensity score, known as inverse propensity weighting (IPW) [68]. The propensity score of each sample is used to reweight it, which is equivalent to eliminating the estimation bias caused by the imbalance of covariates. The ATE estimated by the IPW method is defined as:
$$ATE_{IPW} = \frac{1}{n}\sum_{i=1}^{n} \frac{T_i Y_i}{\hat{e}(X_i)} - \frac{1}{n}\sum_{i=1}^{n} \frac{(1 - T_i)\,Y_i}{1 - \hat{e}(X_i)}, \qquad (9)$$

where 𝑒̂ (𝑋𝑖 ) is the estimated propensity score of the individual 𝑖.
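As a minimal illustration of Equations (8) and (9), the sketch below (assuming scikit-learn is available; the synthetic data and the choice of a logistic-regression propensity model are ours, not prescribed by the surveyed works) estimates 𝑒̂(𝑋) and computes the IPW estimate of the ATE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, T, Y):
    """IPW estimate of the ATE (Equation (9)) with a logistic propensity model (Equation (8))."""
    e_hat = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    e_hat = np.clip(e_hat, 1e-3, 1 - 1e-3)         # guard against near-violations of positivity
    n = len(Y)
    return np.sum(T * Y / e_hat) / n - np.sum((1 - T) * Y / (1 - e_hat)) / n

# Hypothetical usage with synthetic, confounded data (true ATE is 2.0).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))   # treatment depends on a confounder
Y = 2.0 * T + X[:, 0] + rng.normal(size=500)
print(ipw_ate(X, T, Y))
```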

2.3 Deep Neural Network for Causal Learning


In this subsection, we briefly introduce some types of neural networks used in deep causal learning. We introduce the
ideas behind each network and its role in causal learning. Different network structures represent different perspectives on
the development of deep learning. The feedforward neural network (FNN) is the simplest neural network. In deep
representation learning for causal variables, it is usually used to extract representations, convert covariates from the original space to a latent space, and work with regularization terms to achieve covariate balance. Generative adversarial networks
(GANs) [60] represent the idea of adversarial training, in which the data distribution is not explicitly modeled, but the
generator-discriminator structure is used to continuously approximate the true distribution by sampling noise. In causal
learning, GAN structures are often used to generate counterfactuals to address the core problem of the lack of
counterfactual data in causal inference, thus making it possible to estimate causal effects directly. Variational
autoencoders (VAEs) [61] represent the idea of disentanglement, in which the encoder-decoder architecture is utilized to
disentangle the data generation mechanism into independent mechanisms by explicitly modeling the distribution of the
data. In causal learning, VAEs are often used for representation learning that disentangles causal mechanisms and for estimation in the presence of unobserved confounders. Recurrent neural networks (RNNs) [69-70] are the
primary neural architectures used to model sequential information and are used for causal discovery and causal inference
problems in time series scenarios. Graph neural networks (GNNs) [71-72] can make good use of the information flow of
surrounding neighbor nodes to encode structural information into network parameters and exhibit excellent performance
in modeling graph-structured data. Therefore, we can use GNNs to model relational information when individual
relationships are network structures or to approximate structural causal models directly. The estimation of causal
parameters and causal effects can be completed in the GNNs’ optimization process.
Using different perspectives and information, we can find the most suitable structure under various conditions to
solve the problem of causal learning. In addition, deep learning methods can make full use of the advantages of big data
and computing power, which makes deep causal learning more advantageous in today's era of big data.

3 DEEP REPRESENTATION LEARNING FOR CAUSAL VARIABLES


Before performing causal tasks such as causal discovery and causal inference, variables should have a good
representation, especially in the current era of big data. In real scenarios, most causal variables are unstructured and high-
dimensional, such as images and text. For such variables, existing causal discovery and causal inference methods can be applied only after representation learning maps them to low-dimensional forms. Converting unstructured data into structured data is the subject of general representation learning and will not be discussed in detail in this article. We
mainly investigate how deep learning can be used to further optimize the representation of causal variables, thereby
reducing the obstacles that may be faced in causal discovery and causal inference. This requires us to think about what
constitutes a good representation in the field of causal learning. Starting from the two fundamental problems of causal
inference (selection bias and lack of counterfactual data), we believe that the following are necessary for a good causal
variable representation: the covariates are balanced, and the counterfactual data or the causal generation mechanism of
the variable can be obtained.
In this section, we introduce three deep representation learning frameworks for causal variables: regularization adjustment for covariate balance, implicit generation of counterfactual data, and explicit generation for disentangling causal mechanisms. This section focuses on the components of these frameworks and the role of each in constructing
representations. More details related to causal effect estimation are elaborated in section 5.

3.1 Regularization Adjustment for Covariate Balance


For causal inference, it would be very beneficial for estimating causal effects if the representation of the learned
variables across treatment groups had balanced covariates. This is equivalent to eliminating the influence of confounders
[73]. Regularization adjustment-based representation learning starts from this perspective, is adjusted through various
forms of regularization and attempts to balance covariates as much as possible, while obtaining structured variable
representations. Most traditional methods balance directly in the covariate space [74], such as the IPW [68] mentioned in subsection 2.2. These methods are limited by the representational ability of the original variables, especially when the sample size is insufficient, and a good representation cannot be achieved. This subsection
mainly introduces representation learning in latent space using deep neural networks based on regularization adjustment.
The architecture of regularization adjustment-based representation learning is usually as shown in Figure 4. It consists
of five parts: the representation network ①, outcome network ②, balance regularization term ③, prediction loss function
④ and total loss function ⑤. The total loss function is usually the sum of prediction error and the balanced regularization
term. We introduce the role of each part, the selection of prediction loss functions and regularization terms. The
representation network ① is used to obtain the representation of the original covariates. The original covariates may be
structured or unstructured, such as graph-structure data, text, and images. For unstructured data, the main representation
learning structure in [57] can be used; for structured data, standard feed-forward neural networks are often used.
The outcome network ② is used to predict the potential outcome. In the BNN [74], the outcome network has only one head to deal with both treatment and control groups. To avoid losing the influence of the treatment variable 𝑡 during network training, CFR [75] splits the outcome network into two heads with the same architecture to predict counterfactual
outcomes. The balanced regularization term ③ is the core of the regularization adjustment-based representation learning
framework. It represents the difference in covariate distribution between the treatment and control groups. Therefore,
many studies have started from the perspective of covariate shift or domain adaptation to design this regular term. The
integral probability metric (IPM) [76-77] is a popular way to measure the distance between the treatment-group and control-group covariate distributions. Among various IPMs, the Wasserstein
distance [78] and the maximum mean discrepancy distance (MMD) [79] are often used as the regularization term. The
prediction loss function ④ is used to record the error between the predicted outcome 𝑦̂𝑖𝐹 and the actual outcome 𝑦𝑖𝐹 . The
total loss function ⑤ simultaneously optimizes prediction error and distribution error (balance regularization) to make a
trade-off between balance and accuracy. The total loss function is:
$$L_{total} = L_{predict} + \alpha L_{balance}. \qquad (10)$$
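A minimal PyTorch sketch of this framework is shown below (layer sizes are hypothetical; a simple mean-difference penalty on the representation stands in for the IPM term, whereas CFR [75] also uses Wasserstein or MMD distances).

```python
import torch
import torch.nn as nn

class BalancedRepresentationNet(nn.Module):
    """Representation network with two outcome heads, trained with the loss of Equation (10)."""
    def __init__(self, x_dim, h_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, h_dim), nn.ReLU())        # representation network
        self.head_t0 = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 1))
        self.head_t1 = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 1))

    def forward(self, x, t):
        r = self.phi(x)
        y_hat = torch.where((t > 0.5).unsqueeze(1), self.head_t1(r), self.head_t0(r))
        return y_hat.squeeze(-1), r

def balance_term(r_treated, r_control):
    """Simple balance regularizer: squared distance between group means of the representation."""
    return ((r_treated.mean(0) - r_control.mean(0)) ** 2).sum()

def total_loss(model, x, t, y, alpha=1.0):
    y_hat, r = model(x, t)
    l_predict = ((y_hat - y) ** 2).mean()                   # factual prediction error
    l_balance = balance_term(r[t > 0.5], r[t <= 0.5])       # distribution discrepancy
    return l_predict + alpha * l_balance                    # Equation (10)
```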

Figure 4: The framework of representation learning based on regularization adjustment for covariate balance.

In the regularization adjustment-based causal representation framework, in addition to regularization terms, the
propensity scores can also be used to enable the representation network to learn balanced features. The first practical
method was DragonNet [80], in which a standard feedforward neural network is used to predict the probability of each
sample receiving treatment and the representation is obtained from the hidden layer. Such representations do not contain
treatment-independent variables and can be used to predict outcomes. The balance of covariates helps to eliminate
selection bias, but paying too much attention to the balance will affect the counterfactual prediction performance. While
maintaining global balance, SITE [81] found that maintaining local similarity can help improve prediction accuracy. This
method achieves global balance and local similarity and maintains certain predictive abilities while reducing selection
bias. The specific implementation details will be described in section 5.

3.2 Implicit Generation of Counterfactual Data


Implicit generation models do not directly model the probability density and likelihood functions [82]. Instead, they
interact with training data to make the distribution of the data generated by the model close to the true distribution. The
generative adversarial network (GAN) [60] is a type of implicit generation model and has been widely used in various
fields in recent years. With GAN, the generator is used to sample from the noise directly, the discriminator is used to
learn the distance between the generated data distribution and real data distribution, and the generator is trained to
minimize the distance.
Implicit generation-based representation learning differs from the previous regularization adjustment-based approach, which minimizes imbalance metrics. It eliminates selection bias and generates counterfactual data directly, because the discriminator is unable to distinguish whether a received sample comes from the treatment group (𝑇 = 1) or the control group (𝑇 = 0). When there are many covariates and their
relationships are particularly complex, methods based on weighting or balancing are often ineffective. This is because the feature representation obtained by the neural network is not reliable due to the complex relationships among covariates
and the limited number of samples. In contrast, the framework based on implicit generation can handle this problem very
well.

Figure 5: Representation learning framework based on implicit generation.

The basic architecture of implicit generation-based representation learning is shown in Figure 5. It consists of
potential outcome generator 𝐺 and potential outcome discriminator 𝐷. The potential outcome generator 𝐺 generates
potential outcomes 𝑦̃ based on covariates 𝑥, treatments 𝑡, and exogenous noise 𝑢; the generated potential outcomes 𝑦̃ and
factual outcomes 𝑦 are then fed into the discriminator 𝐷, which tries to distinguish which is the factual outcome and
which is the generated outcome. Implicit generation models can learn stable representations and improve the accuracy of
subsequent causal effect estimation. The adversarial learning architecture [83] can be used to control for confounders in
the latent space. This means that the generator 𝐺 tries to maximize the error of the discriminator 𝐷 through training and
finally cannot distinguish the training samples from the control group or the treatment group, thus achieving the purpose
of eliminating the effect of confounders. Therefore, 𝐺 can be used to generate approximate counterfactual data, so it can
be seen as the implicit representation of the counterfactual data. In addition to the most basic adversarial training
structure shown in Figure 5, some studies have extended the method to multivariate treatment variables, continuous data,
and time series, greatly expanding the adaptability of representation learning based on implicit generation.
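The following PyTorch sketch illustrates the adversarial setup of Figure 5 for a binary treatment (network sizes and the training loop are hypothetical simplifications in the spirit of GANITE-style methods, not a faithful reproduction of any single paper): the generator proposes both potential outcomes from (𝑥, 𝑡, noise), the factual slot is overwritten with the observed outcome, and the discriminator tries to identify which slot is factual.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, noise_dim, h = 10, 4, 64

# Generator G: (x, t, u) -> both potential outcomes (y_hat_0, y_hat_1).
G = nn.Sequential(nn.Linear(x_dim + 1 + noise_dim, h), nn.ReLU(), nn.Linear(h, 2))
# Discriminator D: (x, y_0, y_1) -> logit that the treated slot holds the factual outcome.
D = nn.Sequential(nn.Linear(x_dim + 2, h), nn.ReLU(), nn.Linear(h, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(x, t, y_f):
    """x: (B, x_dim); t: (B,) float in {0., 1.}; y_f: (B,) observed factual outcomes."""
    u = torch.randn(x.size(0), noise_dim)
    y_both = G(torch.cat([x, t.unsqueeze(1), u], dim=1))        # (y_hat_0, y_hat_1)
    mask = F.one_hot(t.long(), num_classes=2).float()           # marks the factual slot
    y_mix = y_both * (1 - mask) + y_f.unsqueeze(1) * mask       # factual slot = observed outcome
    # Discriminator: recover which slot is factual (i.e., predict the treatment indicator).
    d_loss = bce(D(torch.cat([x, y_mix.detach()], dim=1)).squeeze(1), t)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator while staying close to the observed factual outcome.
    g_adv = -bce(D(torch.cat([x, y_mix], dim=1)).squeeze(1), t)
    g_sup = (((y_both * mask).sum(dim=1) - y_f) ** 2).mean()
    opt_g.zero_grad(); (g_adv + g_sup).backward(); opt_g.step()
```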

3.3 Explicit Generation for Disentangled Causal Mechanisms


In representational learning using explicit generation models, each variable is considered a disentangled causal
mechanism to obtain a more stable representation. The generation mechanism of data is explicitly modeled, generating
counterfactuals parametrically. Defining the variables associated with causal problems is essentially creating a division
of the world at a certain granularity [56]. The disentangled causal mechanism plays an increasingly important role in the
field of causality. In essence, it assumes that a data generation mechanism can be decomposed into multiple independent
causal mechanisms that are unrelated to each other and do not affect each other [84]. The term “independent” here has
two meanings. First, one mechanism does not contain any information about another. Second, changes to one mechanism do not affect the others. For instance, if a robot is asked to grab an object, it only needs to know the object’s shape and
position, not its color. The causal effects of intervention can be easily estimated by such disentanglement representation.
It can be seen that if independent mechanisms can be used to represent variables, subsequent causal inference results will
be more reliable.
The basic framework of explicit generation-based representation learning is shown in Figure 6. 𝑋 is the observed
variable, 𝐺 denotes the real generative mechanisms, and 𝑍 is the learned latent representation. This framework can be
implemented using VAE. From the perspective of causality, IRS [85] proposes a new metric to measure the robustness of
representations to evaluate disentangled representation learning algorithms. In IRS, disentanglement is seen as a property
of the causal process responsible for data generation, not just an encoded heuristic feature. Combining these discrete
causal processes with encoding allows us to investigate interventions on feature representations and estimate causal
effects from observational data. In IM [84], an algorithm was developed to automatically identify a set of independent

Figure 6: Representation learning framework based on explicit generation.

mechanisms from the mixture of shifted data without labels. The design architecture is modular and easily extended to a
variety of situations. Note that the real number of mechanisms is unknown a priori. The IM hypothesis can be used to
identify causal models (i.e., causal discovery), exploiting multi-expert competition to discover independent causal
mechanisms. Each expert is represented by a smaller neural network, and the experts compete for the sample data during
training. Each time, only the winning expert network can update the weight parameters through backpropagation, and the
other expert networks remain unchanged. Both GAN and VAE can be used to implement this method. In addition, in
CFL [56] the use of low-level data to discover high-level causal relations is proposed. In this way, variables that are
better suited for causal expression can be learned from low levels, which is not necessarily appropriate for all data
settings. One of the benefits of such a method is avoiding preconceived biases.

4 DEEP CAUSAL DISCOVERY


This section introduces causal discovery methods based on deep neural networks. Traditional causal discovery methods
pose combinatorial optimization problems, whereas the technique commonly used in training NNs is continuous optimization
[86-87]. Recently, there have been a large number of approaches utilizing NNs for causal discovery. The original causal
discovery field has been expanded and supplemented from different perspectives. In the traditional causal discovery field,
there are many unsolved problems that affect the accuracy and scalability of causal discovery results. Here, we describe
how deep learning can improve causal discovery and address some of the most common challenges.

4.1 Observation Data-driven Causal Discovery


This subsection introduces methods for causal discovery on observational data using deep neural networks. These
methods usually convert the constraints of acyclicity into constraints on the trace of the adjacency matrix, as shown in
Figure 7 (a). NOTEARS [88] was the first to transform causal discovery from a combinatorial optimization problem to a
continuous optimization problem, setting the stage for the subsequent introduction of neural networks. Specifically, the
previous causal discovery methods usually require the obtained causal graph to be a DAG. However, with the increase in
the number of nodes, the complexity of the combinatorial optimization problem increases very rapidly, and the speed of
solving the problem is greatly reduced, which seriously limits the size of problems that can be solved. The transformation takes the following form. In Equation (11), 𝐴 represents the adjacency matrix of the causal graph, 𝑑 represents the number

Figure 7: Causal discovery using (a) observed data, (b) known and unknown intervention data, and (c) data with unobserved confounders.

of nodes, and ℎ(∙) is a smooth function over real matrices. This derivation is concise and powerful but also ingenious.
Most of the subsequent gradient-based methods are extensions of this. Note that in practice the value of ℎ(𝐴) is usually small but not exactly 0, so a threshold must be set in most cases.

$$G(A) \in \mathrm{DAGs} \;\Leftrightarrow\; h(A) = 0 \;\Leftrightarrow\; \mathrm{tr}\!\left(e^{A \circ A}\right) - d = 0. \qquad (11)$$
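A minimal sketch of the acyclicity function ℎ(𝐴) from Equation (11) (using the matrix exponential from SciPy; the small example matrices are illustrative only):

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(A):
    """h(A) = tr(exp(A ∘ A)) - d, which equals 0 iff the weighted graph A is a DAG (Equation (11))."""
    d = A.shape[0]
    return np.trace(expm(A * A)) - d          # A * A is the Hadamard (elementwise) product

A_dag = np.triu(np.random.rand(4, 4), k=1)    # strictly upper-triangular => a DAG, h ≈ 0
A_cyc = A_dag.copy()
A_cyc[3, 0] = 0.8                             # edge 3 -> 0 closes a cycle, so h > 0
print(notears_acyclicity(A_dag), notears_acyclicity(A_cyc))
```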

NOTEARS performs well under the linear SEM assumption, and many subsequent works have extended it
to the nonlinear field. DAG-GNN [89] introduces neural networks into the process of causal discovery, extending
scenarios to nonlinearities. Using the encoder-decoder architecture, the adjacency matrix is obtained during the network
training process to complete the structure selection. In Equations (12)-(14), 𝐴 is the adjacency matrix, 𝑋 is a sample of a
joint distribution and 𝑍 is the noise matrix. 𝑓𝑖 is the function that implements the transformation between 𝑋 and 𝑍,
usually implemented using a neural network. Following the encoder-decoder idea, GAE [90] utilizes a graph autoencoder to better exploit graph structure information for causal graph structure discovery. Another method that
uses neural networks to adapt to nonlinear scenarios is GraN-DAG [91]. It can handle the parameter families of various
conditional probability distributions. The idea behind it is similar to that of DAG-GNN, but it achieves better results in
experiments.

$$Z = f_4\big((I - A^T) f_3(X)\big), \qquad (12)$$

$$X = f_2\big((I - A^T)^{-1} f_1(Z)\big), \qquad (13)$$

$$f_2^{-1}(X) = A^T f_2^{-1}(X) + f_1(Z). \qquad (14)$$
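A rough PyTorch sketch of the encoder and decoder transforms of Equations (12) and (13) (dimensions and class names are hypothetical; 𝑓1–𝑓4 are plain MLPs here, and 𝐴 is a learnable weighted adjacency matrix as in DAG-GNN):

```python
import torch
import torch.nn as nn

class DagGnnLikeAutoencoder(nn.Module):
    """Encoder Z = f4((I - A^T) f3(X)) and decoder X_hat = f2((I - A^T)^{-1} f1(Z))."""
    def __init__(self, d, h=16):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(d, d))          # learnable weighted adjacency matrix
        make_mlp = lambda: nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, d))
        self.f1, self.f2, self.f3, self.f4 = make_mlp(), make_mlp(), make_mlp(), make_mlp()
        self.register_buffer("I", torch.eye(d))

    def forward(self, x):                                 # x: (batch, d), one sample per row
        m = self.I - self.A.t()
        z = self.f4(self.f3(x) @ m.t())                   # Equation (12)
        x_hat = self.f2(self.f1(z) @ torch.linalg.inv(m).t())   # Equation (13)
        return x_hat, z
```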

A new indicator based on the reconstruction error of the autoencoder is proposed in AEQ [92]. Different indicator
values are used to distinguish the causal directions, and the identification effect is better under univariate conditions.
CASTLE [93] uses causal discovery as an auxiliary task to improve generalization when training a supervised model. In
CASTLE, the adjacency matrix of the DAG is learned in the process of continuous optimization and embeds it into the
input layer of the FNN. Essentially, the learned causal graph is used as an autoencoder-based regularization to improve
the model’s generalization by reconstructing only causal features. In CAN [94], high-quality and diverse samples are
generated from conditional and interventional distributions. Here, the relationship between variables is represented by a
matrix, which is continuously optimized during the training process. In practical implementations, the selection of
interventions is achieved through mask vectors. Finding the complete causal relationship from the data often requires
assumptions about the data-generating mechanism, and the data used by CAN often do not satisfy these assumptions, so
there is no guarantee that a true causal graph will be found.
Causal discovery in time series is the key to many fields in science [95]. However, most of the available causal
discovery methods for non-time-series data are not directly applicable to time series. There are several issues to consider for time series data, such as sampling frequency and unobserved confounders. At the same time, causal discovery on
time series data must make the necessary assumptions [96-97]; for example, the cause must occur before the effect, and
there is no instantaneous causal effect. For time series, there are two types of causal graphs: full-time graphs and
summary graphs. Full-time graphs depict the causal relationship between variables at each moment, and summary graphs
depict the causal relationship during this time. Therefore, the causal graph obtained from the time series is likely to have
cycles, which does not satisfy the DAG condition.
The most common method for discovering causal relationships in time series is Granger causality (GC) [98]. Although GC can achieve sensible results under linear assumptions, its results in nonlinear
scenarios are often unsatisfactory. There are many variations of GC-based methods that address the nonlinear problem
[99-102]. NGC [102] separates the functional representation of each variable to achieve an effective distinction between
cause and effect to a certain extent. NGC provides a neural network for each variable 𝑖 to calculate the influence of other
variables on it. If a column of the obtained weight matrix is 0, it means that the corresponding variable has no Granger
causality with respect to variable 𝑖. The core of NGC is a structured sparsity-inducing penalty that achieves regularization so that the Granger causality and the corresponding time delay can be selected at the same time.
$$\min_{W} \sum_{t=K}^{T} \Big(x_{it} - g_i\big(x_{(t-1):(t-K)}\big)\Big)^2 + \lambda \sum_{j=1}^{P} \Omega\big(W^{1}_{:j}\big), \qquad (15)$$

where Ω(⋅) is the penalty, 𝑊 is the weight matrix, and 𝑔𝑖 is the function capturing the relationships among variables.
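A minimal sketch of the group-sparsity penalty in Equation (15), applied to the input-layer weights of the per-variable MLP (shapes and the feature ordering are hypothetical; Ω is taken here as a group lasso over the weights attached to each candidate cause series, one common choice):

```python
import torch
import torch.nn as nn

P, K, H = 5, 3, 32        # number of series, lag order, hidden width

# One MLP g_i per target series i; its input holds the K lagged values of all P series.
g_i = nn.Sequential(nn.Linear(P * K, H), nn.ReLU(), nn.Linear(H, 1))

def group_lasso_penalty(first_layer_weight):
    """Sum over candidate causes j of the L2 norm of all input weights attached to series j."""
    W = first_layer_weight.view(H, K, P)       # assumes inputs are ordered as (lag, series)
    return sum(torch.norm(W[:, :, j]) for j in range(P))

def ngc_like_loss(x_lagged, x_target, lam=0.1):
    """Squared prediction error plus the structured sparsity penalty of Equation (15)."""
    pred = g_i(x_lagged).squeeze(1)
    return ((x_target - pred) ** 2).sum() + lam * group_lasso_penalty(g_i[0].weight)
```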
Successive images also naturally imply possible causal knowledge. VCC [103] learns causality from sequential
images combined with context. It attempts to identify the causal relationship between the events extracted from the
pictures by using cross-attention to calculate the causal score of one image against another. A high-quality causal image
annotation dataset known as Vis-Causal is proposed.
Convergent Cross Mapping (CCM) [104] is used to find causal relationships between time series variables in dynamical
systems. It is especially suitable for nonlinear weakly coupled dynamical systems. NSM [105] first uses AIR [106] to
learn a low-dimensional time series representation of variables from video data and then reconstructs the time series using time-delay embedding to obtain its nearest neighbors at each time point. Then, the encoder is used to obtain the
vector representation of the nearest neighbor sequence, calculate the correlation coefficient matrix, and judge the
existence and direction of the causal relationship through the value of the correlation coefficient.

4.2 Intervention Data-driven Causal Discovery


Most of the previous causal discovery algorithms have been applied to observational data; there may be many issues
applying the previous methods directly to intervention data. Intervention data reveal more information about data-
generating mechanisms and allow us to obtain a more accurate causal graph. Intervention data are further divided into
known intervention data and unknown intervention data, as shown in Figure 7 (b).

4.2.1 Known Intervention
Unlike methods dedicated solely to causal discovery, many deep learning-based methods use causal discovery as an auxiliary task for acquiring causal knowledge. MTCD [107] defines a meta-learning objective to measure the speed
of adaptation. The correct causal direction can adapt to the intervention faster when sampling; that is, the speed of
adaptation is used as a score for causal discovery. CBCD [108] utilizes VAE to extract binary conceptual causal
variables from unstructured data that can explain the classifier. This is a partial causal discovery method because it only discovers the causal variables that explain the results, while the relationships between these causal variables
are not explored. In the CausalVAE [109], the causal layer is used to convert independent exogenous variables into
endogenous variables with causal significance. In this method, the causal relationship is assumed to be linear.

$$v = A^T v + u = (I - A^T)^{-1} u, \quad u \sim N(0, I), \qquad (16)$$

$$v_i = f_i(A_i \circ v; \theta_i) + u_i, \qquad (17)$$

where 𝐴 is the adjacency matrix to be learned, 𝑢 is the exogenous factor that follows the Gaussian distribution, 𝑣 is the
structured representation of the variable, 𝑓𝑖 represents the functional relationship between variables, and 𝜃𝑖 is the
parameter of 𝑓𝑖. The observed variables 𝑥 are passed through the encoder to generate independent exogenous variables, and the causal layer module then converts them into endogenous variables 𝑣 with causal meaning. A mask mechanism is used to select the intervention variable from 𝑣, and the observed variables 𝑥 are reconstructed via the decoder. In this way, CausalVAE can discover the causal relationships encoded in the adjacency matrix 𝐴.
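A rough sketch of the linear causal layer in Equation (16) (dimensionality and class name are hypothetical; the layer simply solves (I − A^T)v = u for the endogenous representation rather than forming the inverse explicitly):

```python
import torch
import torch.nn as nn

class LinearCausalLayer(nn.Module):
    """Maps independent exogenous factors u to endogenous factors v = (I - A^T)^{-1} u (Equation (16))."""
    def __init__(self, d):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(d, d))   # adjacency matrix to be learned
        self.register_buffer("I", torch.eye(d))

    def forward(self, u):                          # u: (batch, d), u ~ N(0, I)
        m = self.I - self.A.t()
        return torch.linalg.solve(m, u.unsqueeze(-1)).squeeze(-1)

v = LinearCausalLayer(4)(torch.randn(8, 4))        # a batch of exogenous samples mapped to v
```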

4.2.2 Unknown Intervention
Interventions may also be unknown; that is, it is not known which variables were intervened on. A new class of approaches is
needed to deal with this situation. The SDI method [53] enables the simultaneous discovery of causal diagrams and
structural equations in the presence of unknown interventions. This method is score-based, including iterative and
continuous optimization. Considering that the structural parameters and functional parameters are not independent and
affect each other, the structural representation of the DAG and the functional representation of a set of independent
causal mechanisms are jointly trained until convergence. The first step is parameterization. There are two types of
parameters in this method: structural parameters (i.e., an adjacency matrix of 𝑀 ∗ 𝑀) and function parameters (i.e., 𝑀
function parameters). Then, a multilayer perceptron is used to fit the observed data to update the parameters. The score of
the graph is further obtained with the intervention data, which includes the penalty term for cyclic graphs.
$$g_{ij} = \frac{\sum_{k} \big(\sigma(\gamma_{ij}) - c_{ij}^{(k)}\big)\, L_{C,i}^{(k)}(X)}{\sum_{k} L_{C,i}^{(k)}(X)}, \quad \forall i, j \in \{0, \dots, M-1\}, \qquad (18)$$

where 𝛾 are the structural parameters and $L_{C,i}^{(k)}$ denotes the log-likelihood of variable 𝑋𝑖.
However, SDI cannot handle dynamic systems well because it can only learn one causal graph at a time. To better
address the problem of causal discovery in dynamic systems, CRN [110] trains a learner to sample from different causal
mechanisms each time. In each training, several interventions and the neural network are used to synthesize the
distribution of multiple intervention data to learn the causal graph. This method can achieve domain adaptation to a
certain extent and can also accumulate prior knowledge for subsequent structural learning. Specifically, at the beginning
of each episode, a new causal graph is selected as the learning target, an intervention is randomly selected at each time,
the outcome after each intervention is predicted, and the neural network is used to train the adjacency relationship matrix.
Each episode uses 𝑘 different interventions to achieve structural learning.

4.3 Unobserved Confounders Data-driven Causal Discovery


In practice, we do not necessarily have access to all relevant variables, and not all observed variables contribute to causal discovery.
However, when confounding variables are not observed, causal discovery will face various challenges. One of the most
serious problems is that unobserved confounders can lead to an observed association between two variables that do not have a causal relationship between them, as shown in Figure 7 (c). Among traditional approaches, there are some methods that deal with unobserved confounders, such as FCI. However, most of them are based on combinatorial optimization [111],
which is not efficient enough. Here, we introduce some causal discovery methods based on deep neural networks, which
can more efficiently and accurately find causal relationships in the presence of unobserved confounders.
In CGNN [112], the prior form of the function is not set and MMD is used as a metric to calculate the score of each
graph to evaluate how well each causal graph fits the observed data. The generation mechanism can be used to generate
distributions that are arbitrarily close to the observed data. The most important contribution of this method is the formal
definition of a functional causal model (FCM) with latent variables; an exogenous variable 𝑈𝑖𝑗 that affects both variables 𝑖 and 𝑗 is introduced to handle the presence of unobserved confounders, and it is proved that learning with backpropagation remains possible. The maximum mean discrepancy (MMD) is defined as shown in Equation (19), where 𝑘(∙) is the Gaussian kernel:
$$\widehat{MMD}_k(D, \hat{D}) = \frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j) + \frac{1}{n^2}\sum_{i,j=1}^{n} k(\hat{x}_i, \hat{x}_j) - \frac{2}{n^2}\sum_{i,j=1}^{n} k(x_i, \hat{x}_j). \qquad (19)$$
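A small NumPy sketch of Equation (19) with a Gaussian kernel (the bandwidth and the synthetic samples are illustrative choices):

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)), evaluated for all pairs of rows."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(D, D_hat, bandwidth=1.0):
    """Biased estimate of MMD^2 between observed samples D and generated samples D_hat (Equation (19))."""
    n = D.shape[0]
    return (gaussian_kernel(D, D, bandwidth).sum() / n**2
            + gaussian_kernel(D_hat, D_hat, bandwidth).sum() / n**2
            - 2.0 * gaussian_kernel(D, D_hat, bandwidth).sum() / n**2)

# Hypothetical usage: compare observed data with data generated by a candidate causal graph.
rng = np.random.default_rng(0)
print(mmd_squared(rng.normal(size=(100, 2)), rng.normal(0.5, 1.0, size=(100, 2))))
```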

SAM [113] addresses the computational limitation of CGNN. SAM differs from standard structural equations: the equation for each variable incorporates all other variables, which is why it is called "structural agnostic". In SAM, noise matrices are used instead of the noise variables attached to variable pairs. The introduction of differentiable parameters in the correlation matrix allows the SAM structure to efficiently and automatically exploit the relations between variables, thus providing SAM with a way to discover correlations arising from unobserved variables.
$$X_j = \hat{f}_j(X, U_j) = \sum_{k=1}^{n_h} m_{j,k}\, \phi_{j,k}(X, U_j)\, z_{j,k} + m_{j,0}, \qquad (20)$$

where 𝑓̂𝑗 is the nonlinear function, 𝜙𝑗,𝑘 is the feature, 𝑧𝑗,𝑘 is the Boolean vector, and 𝑈𝑗 is the noise variable.
Similar to other neural network-based methods, SAM and CGNN suffer from instability; the randomness of the
learning process or neural network initialization can influence their final performance and predictions even with the same
data and parameters. Such instability can be mitigated by computing multiple runs independently on the same setup and
then averaging the results. ACD [114] is used to deal with the presence of unobserved confounders in time series data.
The core idea of this method is that different causal graphs may belong to the same dynamical system and therefore
potentially have much common information. Therefore, the goal of this method is to obtain a model that can realize
causal discovery for different causal graphs belonging to the same dynamic system. During sampling, multiple causal
graphs and their corresponding data distributions belonging to the same dynamic system are selected, and each causal
graph is used as a training sample. By extending the amortized encoder, it is possible to predict an additional variable,
combined with structural knowledge, which can be used to represent unobserved confounders. Moreover, ACD can learn
the causal mechanism of this dynamic system from the different causal graphs, which can reduce the influence of
unobserved confounders to a certain extent.
In addition to the above-mentioned methods for dealing with time series, V-CDN [115] can discover causal relations
from video without ground truth by extracting key points from videos, learning causal relationships between key points,
and predicting future dynamics using dynamical system interactions. The V-CDN has three modules: visual perception,
structure inference and dynamics prediction. The visual perception module is used to extract key points from the image.

Table 2: The main deep causal discovery methods. 'Relationship' indicates the method's assumption about the relationship between variables.

Observed data, non-time-series:
- DAG-GNN [89] (VAE, GNN; nonlinear relationship): nonlinear, discrete and vector-valued variables.
- GraN-DAG [91] (FNN; nonlinear relationship): nonlinear relationships.
- RL-BIC [116] (FNN; nonlinear relationship, Gaussian noise): RL for DAG structure learning.
- CAN [94] (GAN; nonlinear relationship): generates conditional and interventional distributions without a causal graph.
- GAE [90] (AE; nonlinear relationship): graph autoencoder.
- AEQ [92] (AE; nonlinear relationship): a measure of causal asymmetry.
- CASTLE [93] (AE; nonlinear relationship): treats causal graphs as regularization.

Observed data, time series:
- NGC [102] (RNN; nonlinear relationship): extends Granger causality to the nonlinear domain.
- VCC [103] (Attention; nonlinear relationship): learns contextual causality from the visual signal.
- NSM [105] (FNN; nonlinear relationship): uses neural networks to implement CCM.

Intervention data, known intervention:
- MTCD [107] (FNN; linear relationship, Gaussian noise): meta-learning.
- CausalVAE [109] (VAE; linear relationship, Gaussian noise): from the perspective of causal disentanglement.
- CBCD [108] (VAE; nonlinear relationship, Gaussian noise): discovers concepts with more significant causality.

Intervention data, unknown intervention:
- SDI [53] (FNN; nonlinear relationship): the proposed model generalizes well to unknown interventions.
- CRN [110] (RNN; nonlinear relationship): learns causal graphs using continuous representations via unsupervised losses.

Unobserved confounders, non-time-series:
- SAM [113] (FNN; nonlinear relationship): learns causal graphs from continuous observational data in a multivariate nonparametric setting.
- CGNN [112] (VAE; nonlinear relationship): leverages the power of generative models by minimizing the MMD.

Unobserved confounders, time series:
- ACD [114] (GNN; nonlinear relationship): learns causal relations from samples with different underlying causal graphs but shared dynamics.
- V-CDN [115] (GNN; nonlinear relationship): causal discovery from video without supervision on the ground-truth graph structure.

Structure inference uses the extracted key points and graph neural networks to learn the causal graph. With the learned
causal graph, the dynamics module is used to learn the future condition of these key points.
In this section, we reviewed causal discovery methods based on neural networks. According to the type of data, they are divided into causal discovery driven by observational data, causal discovery driven by intervention data, and causal discovery with data containing unobserved confounders. Table 2 summarizes the main deep causal discovery methods, and some of these methods may apply to more than one data situation.

5 DEEP CAUSAL INFERENCE
Most traditional causal inference methods operate directly on the original feature space, which often limits their performance. With the popularity of deep learning, many studies have begun to use the powerful fitting ability of neural networks to explore the relationship between treatment and effect.
The core problems of causal inference are the missing counterfactual data and selection bias, as shown in Figure 8. Of
the two problems, the former is more fundamental because once counterfactual data are obtained, the estimation of
causal effects will be very simple and natural. Therefore, from the perspective of solving the fundamental problems, the
methods of causal inference can be divided into selection bias-oriented and counterfactual data-oriented methods.
Existing deep learning-based causal inference methods can be roughly divided into four categories: covariate balance-
based methods adjust covariates to balance the distribution of covariates in different treatment groups, thereby
eliminating selection bias; adversarial training-based methods utilize adversarial training to make the discriminator
unable to distinguish between the real data and the data generated by the generator, thereby realizing the generation of
implicit counterfactual data; proxy variable-based methods model the data generation mechanism as the joint action
between multiple latent variables to achieve explicit counterfactual generation; deep structural causal model methods
usually combine the SCM and the neural network structure, using the structural information of the SCM and the fitting
ability of the neural network to model the data generation mechanism and realize counterfactual generation. The
relationship between the two core problems and three main methods is shown in Figure 8.

5.1 Causal Inference Through Covariate Balancing


In this subsection, we focus on causal inference methods based on the covariate balance perspective. The core of traditional causal inference methods is to balance the covariates in order to estimate causal effects. Neural networks have a strong fitting ability, so with sufficient training, the quantitative relationship between the treatment T and the effect Y can be found. Precisely because of this strong fitting ability, regularization terms are needed to balance the covariates X when neural networks are used to estimate causal effects. Using neural networks, the counterfactual outcome y_{t=1}^{CF} of a control-group sample (t = 0) with a certain covariate value (X = x) can be predicted, thus obtaining the causal effect y_{t=1}^{CF} − y_{t=0}^{F}. Such methods typically use neural networks to obtain the representations of the covariates or the propensity scores and then train estimators by minimizing the differences in representations between the treatment and control groups. In the loss function, these balancing ideas are usually implemented through regularization terms. The advantage of the neural network is that it can flexibly use various forms of regularization to balance the distribution of the covariates to eliminate the influence of confounders. At the same time, it has a strong estimation ability and high accuracy.
The first method to use neural networks to achieve counterfactual prediction was BNN [74], which considers the problem of counterfactual inference from a domain adaptation perspective. A neural network is used to learn the representation of the covariates, and matching is then performed in the representation space, i.e., the nearest observed outcome is selected as the counterfactual outcome for training. For example, for a sample x_i in the treatment group (t = 1), the observed outcome of the sample x_j in the control group (t = 0) nearest to it in covariate representation is chosen as its counterfactual outcome y_i^CF = y_{j(i)}^F. During training, each sample has two estimated outcomes, h(Φ(x_i), t_i) and h(Φ(x_i), 1 − t_i). In Equation (21), the first and third terms represent the difference between the observed and predicted outcomes, and the discrepancy distance dis_H is used to measure the difference between the two distributions. BNN was also the earliest causal representation learning method to balance the latent space. The important contribution of the method is making a trade-off between balance and accuracy. The optimization objective is as follows:

Figure 8: The methods to solve the fundamental problems of causal inference using deep learning. The most fundamental problem in
causal inference is the missing counterfactual data. Due to the lack of counterfactual data, only observational data can be used to
estimate causal effects, leading to selection bias. In essence, if the data generation mechanism can be modeled, then the "counterfactual
data" can be approximated, and the problem of causal inference can be solved. In “Selection Bias”, gray and white nodes represent
individuals with different covariates. In “Missing Counterfactual Data”, white and gray nodes represent outcomes under two treatments,
i.e., fact and counterfactual, and only one of them can be observed.
$$B_{H,\alpha,\gamma}(\Phi, h) = \frac{1}{n}\sum_{i=1}^{n}\left|h(\Phi(x_i), t_i) - y_i^{F}\right| + \alpha\, dis_H\!\left(\hat{P}_{\Phi}^{F}, \hat{P}_{\Phi}^{CF}\right) + \frac{\gamma}{n}\sum_{i=1}^{n}\left|h(\Phi(x_i), 1 - t_i) - y_{j(i)}^{F}\right|, \tag{21}$$

where 𝛷 is the learned representation and 𝑑𝑖𝑠𝐻 (∙,∙) is the distance measure. By minimizing the loss function equation
(21), the BNN can simultaneously accomplish counterfactual inference and covariate balance. Unbalanced distributions
of covariates are helpful for prediction but affect the estimation of causal effects. A balanced distribution can reduce the
prediction variance when the treatment variable is shifted. In the BNN, the network structure only has one head. This
means that the treatment assignment information 𝑡𝑖 needs to be concatenated to the representation of covariate 𝛷(𝑥𝑖 ). In
most cases, 𝛷(𝑥𝑖 ) is high-dimensional, so the information of 𝑡𝑖 might be lost during training.
To address this problem of the BNN, a new architecture was proposed in CFR [75]: two separate heads representing the control group and the treatment group share a common representation network. This architecture avoids losing the treatment variable t during network training. In the actual training process, each sample is used to update the parameters of the corresponding head according to the value of t_i. Using the integral probability metric (IPM) [76-77] to measure the distance between the control and treated distributions p(x|t = 0) and p(x|t = 1) was also proposed. In BRNN [58], the MSE is decomposed into bias and variance, and the estimation ability of the multi-head model is compared with that of the single-head model. A new regularization term, PRG, was introduced to assess differences between the treatment and control groups. It was proven that the estimation bias of PSM increases as the covariate dimension increases. Two methods for estimating ITE were compared: inductive inference and transductive inference. Because inductive inference shares the same representation layers, it may have less noise and bias. This inspires us to try more statistical quantities as regularization terms to constrain the network and obtain a better estimation effect.
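The following is a minimal PyTorch sketch of this two-head design with a shared representation and a balance penalty; the layer sizes, the simple mean-difference surrogate for the IPM term, and all names are illustrative assumptions, not the CFR reference implementation.

```python
# Minimal sketch (illustrative only): a TARNet/CFR-style estimator with a
# shared representation Phi(x), two outcome heads, and a crude IPM surrogate.
import torch
import torch.nn as nn

class TwoHeadCFR(nn.Module):
    def __init__(self, x_dim, h_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, h_dim), nn.ReLU())
        self.head0 = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 1))
        self.head1 = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 1))

    def forward(self, x, t):
        r = self.phi(x)                                        # shared representation
        y0, y1 = self.head0(r), self.head1(r)                  # potential-outcome heads
        y_pred = torch.where(t.bool().unsqueeze(-1), y1, y0)   # factual head per sample
        return y_pred.squeeze(-1), r

def linear_mmd(r, t):
    """Crude IPM surrogate: distance between mean representations of the two groups
    (assumes both groups are present in the batch)."""
    r0, r1 = r[t == 0], r[t == 1]
    return (r0.mean(0) - r1.mean(0)).pow(2).sum()

def cfr_loss(model, x, t, y, alpha=1.0):
    y_pred, r = model(x, t)
    factual = ((y_pred - y) ** 2).mean()      # factual prediction error
    balance = alpha * linear_mmd(r, t)        # representation imbalance penalty
    return factual + balance

# toy usage with synthetic data
x = torch.randn(128, 10)
t = torch.randint(0, 2, (128,))
y = torch.randn(128)
model = TwoHeadCFR(x_dim=10)
loss = cfr_loss(model, x, t, y)
loss.backward()
```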

The previous methods are only applicable to binary treatments; PM [117] extends the approach to multiple-treatment settings by using the balanced propensity score to perform matching and estimating the counterfactual outcomes with the nearest neighbors. Matching is performed at the minibatch level rather than the dataset level, which can reduce variance. RCFR [118] alleviates the bias of BNN [74] when sample sizes are large by reweighting samples according to imbalance terms and variance. IPM metrics and regularization terms are still used here:
$$L_{\pi}(h, \Phi, w; \beta) = \frac{1}{n}\sum_{i=1}^{n} w_i\, l_h(\Phi(x_i), t_i) + \frac{\lambda_h}{\sqrt{n}} R(h) + \alpha\, IPM_G\!\left(\hat{p}_{\pi,\Phi}, \hat{p}_{\mu,\Phi}\right) + \frac{\lambda_w}{n}\|w\|_2. \tag{22}$$

In previous weighting methods, the calculation of the propensity score often requires a regression approximation in practice, and using the approximated propensity score as the weight introduces many problems. BWCFR [73] proposed a function called balancing weights to make a trade-off between balance and predictability, as shown in Equation (23). It is not a direct balance between groups but rather a weighted balance of representations. By performing a deeper analysis of the bound, it is proven theoretically that a balanced feature distribution is beneficial to the model.
$$w(x, t) = \frac{f(x)}{t \cdot e(x) + (1 - t)\cdot(1 - e(x))}. \tag{23}$$
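A small NumPy sketch of the weighting in Equation (23) is given below; the clipping of the propensity score and the choices of the tilting function f(x) are illustrative assumptions (f(x) = 1 reduces to inverse propensity weighting).

```python
# Sketch of the balancing weights in Equation (23); e(x) is an estimated
# propensity score and f(x) is a user-chosen tilting function.
import numpy as np

def balancing_weight(e_x, t, f_x=1.0):
    """w(x, t) = f(x) / (t*e(x) + (1-t)*(1-e(x))) for binary treatment t."""
    e_x = np.clip(e_x, 1e-3, 1 - 1e-3)   # avoid division blow-up near 0 or 1
    return f_x / (t * e_x + (1 - t) * (1 - e_x))

# example: propensity scores from any estimator, binary treatments
e_hat = np.array([0.2, 0.7, 0.5])
t_obs = np.array([1, 0, 1])
print(balancing_weight(e_hat, t_obs))                             # f(x)=1 -> IPW-style weights
print(balancing_weight(e_hat, t_obs, f_x=e_hat * (1 - e_hat)))    # overlap-type weights
```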

Selection bias can also be addressed by transforming the counterfactual problem into a domain adaptation problem. CARW [119] integrates a context-aware weighting scheme that leverages the importance sampling technique, building on [75], to better address selection bias. SITE [81] is a balanced representation learning method that preserves local similarity. Balancing the covariates helps to eliminate selection bias, but paying too much attention to balance can hurt counterfactual prediction performance. SITE achieves global balance and local similarity and maintains a certain predictive ability while reducing selection bias. The method works on mini-batches and selects triplet pairs. There are two core components: the PDDM and the MPDM. The PDDM is used to preserve local similarity information, and the MPDM is used to achieve balanced distributions in the latent space. Then, the prediction network is used to obtain the prediction loss of the potential outcomes. The loss function is:

$$L = L_{FL} + \beta L_{PDDM} + \gamma L_{MPDM} + \lambda \|W\|_2. \tag{24}$$

Using a neural network to directly fit the relationship of covariate X to outcome Y can cause many problems; for example, the neural network may use all of the variables to predict Y, even though not all of them are needed or should be used for estimating causal effects. Covariates can be divided into instrumental variables (only affecting
treatment), adjustment variables (only affecting outcome), irrelevant variables (having no effect on either treatment or
outcome), and confounders (the cause of both treatment and outcome). When learning representations, SCRNet [120] only balances the representations of the confounders; the confounder representation is then concatenated with the representation of the adjustment variables. This approach reduces computational overhead and increases efficiency in practical applications. However, the
division of variables is usually subjective, especially when the true causal graph cannot be obtained. DragonNet [80] is a
method for using neural networks to find those covariates that are associated with treatment and only use these variables
to predict the outcome. First, a deep neural network (DNN) is trained to predict 𝑇, and then the last prediction layer is
removed to obtain the representation 𝛷. Next, similar to the TARNet [75], two separate DNNs are used to predict the
outcome at 𝑡 = 0 and 𝑡 = 1. Essentially, 𝛷 stands for the representation related only to 𝑇, i.e., represents the propensity
score. The objective function is:

Table 3: Covariate balance-based causal inference methods

| Subcategory | Object | Treatments | Paper | Year | Core Idea |
|---|---|---|---|---|---|
| Distance score as regular term | ITE | 2 | BNN [74] | 2016 | First brings together ideas from domain adaptation and representation learning. |
| | | 2 | CFR [75] | 2017 | Different treatments use different networks. |
| | | 2 | CARW [119] | 2019 | Propose a context-aware weighting scheme based on importance sampling. |
| | | 2 | BRNN [58] | 2020 | Propose a new balanced regular term PRG. |
| | | 2 | SCRNet [120] | 2020 | More detailed partitioning of covariates. |
| | | Any | PM [117] | 2018 | Deal with any number of treatments. |
| | ATE/CATE | 2 | RCFR [118] | 2018 | Reweighting with complementary robustness properties. |
| | | 2 | BWCFR [73] | 2020 | Integration of balancing weights alleviates the trade-off between feature balance and predictive power. |
| Propensity score as regular term | ITE | 2 | DCN [121] | 2017 | Use the propensity score to drop-out. |
| | | 2 | PropenNet [122] | 2018 | Use neural network to predict propensity. |
| | | Any | Deep-treat [123] | 2018 | Minimizing reconstruction error using autoencoders. |
| | ATE/CATE | Any | DragonNet [80] | 2019 | Use propensity score for adjustment and proposed targeted regularization. |
| Local similarity | ITE | 2 | SITE [81] | 2018 | Preserve local similarity and balance by focusing on several hard samples. |
| | | 2 | ACE [124] | 2019 | Adaptively extracts fine-grained similarity information. |

Table 3. 'Object' indicates the types of causal effect. 'Treatments' indicates the number of treatments.

$$\hat{\theta} = \arg\min_{\theta} \hat{R}(\theta; X), \quad \text{where} \tag{25}$$

$$\hat{R}(\theta; X) = \frac{1}{n}\sum_{i}\left[\left(Q_{nn}(t_i, x_i; \theta) - y_i\right)^2 + \alpha\, \mathrm{CrossEntropy}\!\left(g_{nn}(x_i; \theta), t_i\right)\right], \tag{26}$$
where Q_nn(⋅) is the DNN that models the outcome, g_nn(⋅) is the DNN that models the propensity score, and α is the hyperparameter weighting the propensity loss term. Training the network to predict propensity scores and outcomes simultaneously ensures that the features used are treatment-relevant. Compared with TARNet [75], DragonNet adds a head that predicts the propensity score. DragonNet makes a clear distinction between the prediction of outcomes and the estimation of causal effects because accurate predictions do not imply that causal effects can be accurately estimated. In DragonNet, g_nn(⋅) is used to find the confounding factors in the covariates so that the resulting representation contains only the part of X related to the treatment (according to the sufficiency of the propensity score). DragonNet also uses targeted regularization from semiparametric estimation theory, whose purpose is to ensure asymptotic consistency if the semiparametric estimating equation is satisfied, with fast convergence.
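The following is a minimal PyTorch sketch of this three-headed design trained with the joint objective of Equation (26); the layer widths and names are illustrative assumptions, and the targeted regularization term is omitted for brevity.

```python
# Minimal sketch (illustrative, not the reference implementation) of a
# DragonNet-style model: shared representation, two outcome heads Q(t, x)
# and a propensity head g(x), trained jointly as in Equation (26).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DragonNetSketch(nn.Module):
    def __init__(self, x_dim, h_dim=200):
        super().__init__()
        self.rep = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ELU(),
                                 nn.Linear(h_dim, h_dim), nn.ELU())
        self.q0 = nn.Sequential(nn.Linear(h_dim, 100), nn.ELU(), nn.Linear(100, 1))
        self.q1 = nn.Sequential(nn.Linear(h_dim, 100), nn.ELU(), nn.Linear(100, 1))
        self.g = nn.Linear(h_dim, 1)                 # propensity head (logit of P(T=1|x))

    def forward(self, x):
        z = self.rep(x)
        return self.q0(z).squeeze(-1), self.q1(z).squeeze(-1), self.g(z).squeeze(-1)

def dragonnet_loss(model, x, t, y, alpha=1.0):
    q0, q1, g_logit = model(x)
    q = torch.where(t.bool(), q1, q0)                                        # Q_nn(t_i, x_i)
    outcome = F.mse_loss(q, y)                                               # (Q - y)^2 term
    propensity = F.binary_cross_entropy_with_logits(g_logit, t.float())      # CrossEntropy(g, t)
    return outcome + alpha * propensity

# toy usage with synthetic data
x = torch.randn(256, 25)
t = torch.randint(0, 2, (256,))
y = torch.randn(256)
model = DragonNetSketch(x_dim=25)
loss = dragonnet_loss(model, x, t, y)
loss.backward()
```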

In DCN [121], the architecture for predicting the outcome is similar to that of DragonNet, but a separate neural network is used to model the propensity score. For each head of the outcome-prediction DNN, there is a certain probability of dropout [125-126] at each training step, and this probability depends on the propensity score. Through dropout, the balancing effect of the propensity score is implicitly reflected in the neural network. Deep-treat [123] divides the counterfactual prediction problem into two steps: the first step uses an autoencoder to learn representations that trade off bias and information loss (by controlling the hyperparameter λ); then, an FNN is used on the transformed data to determine the treatment allocation. There is a module for learning propensity scores in the autoencoder (AE) that minimizes the reconstruction error of the feature representation across different populations. This method makes the representation as accurate as possible while making the representation distributions of different populations as close as possible to achieve covariate balance. Deep-treat is also suitable for the case of multivariate treatments. Table 3 summarizes the main covariate balance-based deep causal inference methods.

5.2 Adversarial Training for Causal Inference


Causal inference methods based on adversarial training mechanisms utilize generative-adversarial networks to learn
counterfactual distributions. Generators are typically used to generate counterfactual results, and then a discriminator is
used to determine whether the input data come from the factual distribution or the generated distribution. GANITE [127]
uses the adversarial idea to generate the counterfactual outcomes for a given set of covariates. It can deal with multiple
treatment situations. GANITE has two blocks: the counterfactual block is used to impute the “complete” dataset, and the ITE block is used to estimate the causal effect. First, the generator G uses the covariates x, the treatment t, the factual outcome y^F, and the noise z_G to generate the potential outcome vector. The loss of G is the difference between the estimated value ỹ^F of the factual outcome and the factual outcome itself, y^F. Then, the counterfactual discriminator D_G is used to determine the probability that each dimension of the potential outcome vector is the factual one. Through continuous adversarial learning, the generator eventually becomes able to generate potential outcomes that the discriminator cannot distinguish. The generator G can then be used to generate the counterfactual outcomes, yielding the complete dataset D̃ = (x, y^F, ỹ^CF). The ITE generator I uses only the covariates and noise to generate the potential outcomes ŷ, and the ITE discriminator D_I is used to determine the probability that an outcome comes from the “real dataset” D̃. Through this generative adversarial training framework, the ITE generator I finally acquires the capacity to estimate the true potential outcomes, which can then be used to estimate causal effects. Within the GANITE framework, potential outcomes can be predicted, and confidence intervals can also be generated; moreover, the information used for prediction is not lost. At the same time, although there are multiple treatment variables to choose from, only one of them can be observed in the actual situation. A drawback of GANITE is that it cannot deal with
continuous-valued interventions. The SCIGAN [128] addresses this problem very well and provides a theoretical
verification of causal estimation under the GAN framework. However, the disadvantage is that it requires thousands of
training samples.
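The following is a compact, illustrative PyTorch sketch of the counterfactual block for a binary treatment; the network sizes, optimizers, and the simple adversarial losses are assumptions made for exposition, and the second (ITE) block is omitted.

```python
# Compact sketch (binary treatment, illustrative only) of a GANITE-style
# counterfactual block: a generator proposes both potential outcomes and a
# discriminator tries to spot which component is the observed (factual) one.
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, h = 10, 64
G = nn.Sequential(nn.Linear(x_dim + 4, h), nn.ReLU(), nn.Linear(h, 2))   # -> (y~_0, y~_1)
D = nn.Sequential(nn.Linear(x_dim + 2, h), nn.ReLU(), nn.Linear(h, 1))   # -> logit "factual slot is t=1"
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

def train_step(x, t, y_f):
    n = x.size(0)
    z = torch.randn(n, 2)                                                  # noise z_G
    y_tilde = G(torch.cat([x, t[:, None].float(), y_f[:, None], z], 1))   # both potential outcomes
    y_bar = y_tilde.clone()
    y_bar[torch.arange(n), t] = y_f                                        # put observed outcome in its slot

    # discriminator update: guess which slot holds the factual outcome (i.e., recover t)
    d_logit = D(torch.cat([x, y_bar.detach()], 1)).squeeze(-1)
    d_loss = F.binary_cross_entropy_with_logits(d_logit, t.float())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator update: fool the discriminator and reconstruct the factual outcome
    d_logit_g = D(torch.cat([x, y_bar], 1)).squeeze(-1)
    y_fact_hat = y_tilde[torch.arange(n), t]
    g_loss = -F.binary_cross_entropy_with_logits(d_logit_g, t.float()) + F.mse_loss(y_fact_hat, y_f)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# toy usage with synthetic data
x = torch.randn(128, x_dim)
t = torch.randint(0, 2, (128,))
y_f = torch.randn(128)
print(train_step(x, t, y_f))
```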
There are many methods that also use the idea of adversarial training. MatchGAN [129] maps the original samples
into the GAN latent space and selects the samples of different categories with the closest distance to enable removal of
bias while preserving critical features. This method can extract matching samples from a dataset of real face images. It
selects each pair of samples (2 images) so they differ in terms of one selected attribute (such as gender) and are as similar
as possible in terms of all other attributes. This method can be used to study the effects of an element in an image, such
as studying the fairness of gender or ethnicity in image recognition. CF-Net [130] describes an end-to-end approach to
obtaining features that remain invariant to confounders while considering the intrinsic correlation between confounders

Table 4: Adversarial training-based causal inference methods

| Object | Paper | Year | Core Idea |
|---|---|---|---|
| Potential Outcome | CausalGAN [131] | 2018 | Train causal implicit generative model based on causal graph. |
| | SCIGAN [128] | 2020 | Continuous-valued data and provide theoretical results to support use of the GAN framework. |
| | CF-Net [130] | 2020 | Using adversarial training to remove the influence of confounders. |
| | MatchGAN [129] | 2021 | Producing a dataset of matched samples culled from a larger dataset of real face images. |
| ITE | GANITE [127] | 2018 | Use GAN to generate potential outcome. |
| | CTAM [83] | 2019 | Deal with covariates containing text data. |
| | CRN [132] | 2020 | Time series data and constructs a representation which removes the association between patient history and treatment assignments. |
| ATE/ATT/CATT | CTAM [83] | 2019 | Simultaneous estimation of multiple causal effects: ITE, ATE, ATT. |
| | χ-GAN [133] | 2020 | Use importance sampling and the χ²-divergence. |
| | DeepMatch [134] | 2020 | Propose a new method based on adversarial training of a weighting and a discriminator network for rich covariates and complex relationships. |

Table 4. 'Object' is the type of result obtained from the method.

and predicted outcomes. In the architecture of CF-Net, the CP module is designed to eliminate the influence of
confounders when extracting features and has achieved good results in the medical field. Another adversarial learning
architecture, CTAM [83], is used to control the confounders in the latent space. A discriminator D is trained to predict the treatment assignment from the learned representation, while the representation learner tries to maximize the error of the discriminator; after training, the discriminator can no longer distinguish whether a training sample comes from the control group or the treatment group, thus eliminating the effect of confounders.
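This adversarial balancing idea can be sketched as follows; the gradient-reversal trick used here is one common way to implement such a min-max objective and is an illustrative choice on our part, not necessarily the mechanism used by CTAM.

```python
# Hedged sketch of adversarial confounder control: a discriminator predicts
# the treatment from the representation, while the encoder is trained (via a
# gradient-reversal layer) to make that prediction impossible.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output          # flip gradients so the encoder maximizes D's loss

encoder = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32))
treat_disc = nn.Linear(32, 1)                       # predicts treatment from Phi(x)
outcome_head = nn.Sequential(nn.Linear(32 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
params = list(encoder.parameters()) + list(treat_disc.parameters()) + list(outcome_head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

# toy data and a few illustrative training steps
x = torch.randn(128, 20)
t = torch.randint(0, 2, (128,)).float()
y = torch.randn(128)
for _ in range(3):
    phi = encoder(x)
    y_hat = outcome_head(torch.cat([phi, t[:, None]], 1)).squeeze(-1)
    t_logit = treat_disc(GradReverse.apply(phi)).squeeze(-1)
    loss = F.mse_loss(y_hat, y) + F.binary_cross_entropy_with_logits(t_logit, t)
    opt.zero_grad(); loss.backward(); opt.step()
```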
There are also approaches that combine adversarial and balanced approaches, i.e., using adversarial training to
produce covariate-balanced representations [135]. DeepMatch [134] uses adversarial training to balance covariates in
such situations. DeepMatch uses the discriminative discrepancy metric in the context of NNs and requires a few further
developments of alternating gradient approaches similar to GAN. DeepMatch uses the idea of adversarial learning to
learn stable representations and improves the accuracy of subsequent causal effect estimation. Another feature balancing
approach that uses adversarial training is χ-GAN [133]. It introduces the importance sampling theory to minimize the
variance of causal effect estimation. This method can better handle samples whose propensity scores are close to 0 or 1
and the prediction results are more stable. There is also a class of methods that consider time-related confounding factors,
which can help to understand how results change over time. The previously mentioned static-based methods are often not
directly applicable to time series scenarios. CRN [132] is based on the recurrent neural network and uses adversarial
training to obtain feature representations that are not affected by time, thereby removing time-related confounding factors.
This approach is ideal for precision medicine and can be used to answer key questions about when to treat patients, when
to stop, and how to determine the dosage. Table 4 summarizes the main adversarial training-based deep causal inference methods.

5.3 Proxy Variable-based Causal Inference
For causal inference, there is generally an “unconfoundedness” assumption, meaning that there are no unobserved confounders. Most existing methods for estimating causal effects are based on this assumption, including the traditional methods (matching, propensity score-based methods) and the balance-based methods mentioned earlier, and these methods are guaranteed to recover the true causal effect only when the relevant confounders are observed. However, in many cases, this assumption is not satisfied. Once there are unobserved confounders, the previous methods can have large biases and may even lead to incorrect conclusions. This subsection discusses how to use neural networks for causal inference in the presence of unobserved confounders.
Although it may not be possible to observe all confounders, it is generally possible to measure proxy variables of the confounders. Proxy variable-based methods utilize proxy variables to separate different types of variables through disentangled representations. By collecting a large number of observed variables, most of the proxies needed for the confounders can be covered, which is easy to achieve in today's era of big data. Exactly how these proxy variables are used depends on their relationship to the unobserved confounders, treatments, and outcomes. There are also many causal identification problems based on proxy variables that study the presence of unobserved confounders.
Deep latent variable techniques can use noisy proxy variables to infer unobserved confounders [136]. CEVAE [137]
uses latent variable generative models to discover unobserved confounders from the perspective of maximum likelihood.
One of the typical scenarios is shown in Figure 9 (a). This approach requires fairly weak assumptions about the data
generation process and the structure of unobserved confounders. It uses the VAE [138-139] architecture and contains
both an inference network and a model network. The inference network is equivalent to an encoder, and the model
network is equivalent to a decoder. The nonlinearity of the neural network is used to detect the nonlinearity of causal
effects; that is, the neural network is used to fit the causal relationship between the variables. The core of this approach is
the use of variational inference to obtain the probability distribution needed to estimate causal effects, 𝑝(𝑋, 𝑍, 𝑡, 𝑦). We
use Z to denote the hidden confounder, and X is the proxy variable, which has no direct effect on the outcome y or the treatment t. Actually, Z can be seen as the latent variable of a VAE. The inference network is used to obtain q(t|x) and q(z|t, y, x), where q(t|x) can be seen as the propensity score. The model network is used to obtain p(t|z), p(x|z) and p(y|t, z). Then, the
framework is used to model the generation mechanism when unobserved confounders exist:
$$L = \sum_{i=1}^{N} \mathbb{E}_{q(z_i | x_i, t_i, y_i)}\left[\log p(x_i, t_i | z_i) + \log p(y_i | t_i, z_i) + \log p(z_i) - \log q(z_i | x_i, t_i, y_i)\right], \tag{27}$$

$$F_{CEVAE} = L + \sum_{i=1}^{N}\left(\log q(t_i = t_i^{*} | x_i^{*}) + \log q(y_i = y_i^{*} | x_i^{*}, t_i^{*})\right), \tag{28}$$

In practice, to better estimate the distributions of the parameters, two auxiliary terms are added to the variational lower bound, yielding F_CEVAE; y_i^*, x_i^*, t_i^* are the observed input values. The advantage of CEVAE is that it can cope well with the presence
of unobserved confounders, but the disadvantage is the lack of theoretical guarantees. Furthermore, the latent variable
can be divided into risk latent variable 𝑧𝑦 , instrumental latent variable 𝑧𝑡 , confounding latent variable 𝑧𝑐 and noisy latent
variable 𝑧𝑜 . This more granular division can help obtain an accurate distribution when making variational inferences to
facilitate subsequent estimation of causal effects. A more complete division is shown in the lower right corner “Proxy
variable” of Figure 8.
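A simplified PyTorch sketch of this latent-variable formulation is shown below; the Gaussian and Bernoulli likelihoods, the analytic KL term, and all dimensions are illustrative assumptions rather than the original CEVAE architecture.

```python
# Simplified sketch of a CEVAE-style model: an inference network q(z|x,t,y)
# and decoders p(x|z), p(t|z), p(y|t,z), trained by maximizing the ELBO in
# the spirit of Equation (27).
import torch
import torch.nn as nn
from torch.distributions import Normal, Bernoulli, kl_divergence

x_dim, z_dim, h = 20, 5, 64
q_net = nn.Sequential(nn.Linear(x_dim + 2, h), nn.ReLU(), nn.Linear(h, 2 * z_dim))  # q(z|x,t,y)
px_net = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))          # mean of p(x|z)
pt_net = nn.Linear(z_dim, 1)                                                          # logit of p(t|z)
py_net = nn.Sequential(nn.Linear(z_dim + 1, h), nn.ReLU(), nn.Linear(h, 1))           # mean of p(y|t,z)

def elbo(x, t, y):
    stats = q_net(torch.cat([x, t[:, None], y[:, None]], 1))
    mu, logvar = stats.chunk(2, dim=1)
    q_z = Normal(mu, (0.5 * logvar).exp())
    z = q_z.rsample()                                                   # reparameterized sample
    log_px = Normal(px_net(z), 1.0).log_prob(x).sum(1)                  # log p(x|z)
    log_pt = Bernoulli(logits=pt_net(z).squeeze(-1)).log_prob(t)        # log p(t|z)
    log_py = Normal(py_net(torch.cat([z, t[:, None]], 1)).squeeze(-1), 1.0).log_prob(y)  # log p(y|t,z)
    kl = kl_divergence(q_z, Normal(0.0, 1.0)).sum(1)                    # E_q[log q(z|.) - log p(z)]
    return (log_px + log_pt + log_py - kl).mean()

# toy usage: maximize the ELBO (minimize its negative)
x = torch.randn(64, x_dim)
t = torch.randint(0, 2, (64,)).float()
y = torch.randn(64)
loss = -elbo(x, t, y)
loss.backward()
```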
For the estimation of causal effects, controlling for too many variables weakens the estimation ability, and including unnecessary variables results in suboptimal nonparametric estimation. In the high-dimensional case,

(a) Proxy Variable                (b) Instrumental Variable

Figure 9: Typical scenarios for proxy variable-based causal inference methods. Gray nodes represent unknown variables or variables
that cannot be observed.

many variables are not confounders and need to be excluded from adjustment. This leads to a dilemma: including too many unnecessary variables increases the bias and variance of the estimate, reducing its accuracy, while limiting the included variables may cause confounders to be missed and introduce selection bias.
To solve this problem, TEDVAE [140] divides covariates into three categories: confounders, instrumental factors and
risk factors. Confounders affect both cause and effect, instrumental factors only affect the cause, and risk factors only
affect the effect. TEDVAE uses variational inference to infer latent variables from observed data and decouples them
into these three types of variables. The rest of the architecture is similar to CEVAE and can also be used for continuous
treatment variables. TVAE [136] improves TEDVAE by combining targeted learning with maximum likelihood estimation during training. Since the causal graph assumed when making causal inferences may vary, the errors introduced by different methods differ; TVAE tends to have a smaller error even when the assumed causal graph is wrong. The purpose of introducing targeted regularization is to make the outcome y and the treatment assignment t as independent as possible. TVAE can be seen as a combination of DragonNet and TEDVAE.
Unlike most methods, CausalVAE [109] does not require an a priori causal graph and only needs a small amount of information related to the true causal concepts u as a supervision signal, converting independent exogenous variables into causal endogenous variables and realizing causal discovery at the same time. There are two uses of u: the first is to use p(z|u) to regularize the posterior of z, and the second is to use u to learn the causal structure A. In addition to learning causal representations, interventions can also be applied to generate counterfactual data that are never observed.
Instrumental variable (IV) [141-142] methods rely on an auxiliary instrument for causal inference instead of attempting to recover the hidden confounders [15]. The IV framework has a long history, especially in economics [143]. The typical scenario for the instrumental variable method is shown in Figure 9 (b). A more powerful instrumental variable approach incorporating deep learning has been introduced. DeepIV [144] uses instrumental variables and defines a counterfactual error function to implement neural network-based causal inference in the presence of unobserved confounders. The method can validate predictive accuracy on held-out samples, which is very beneficial for tuning the hyperparameters of the neural network. DeepIV is implemented in two steps. The first step is to learn the treatment distribution using a neural network: F̂ = F_φ(t|x, z), where x is the covariate, t is the treatment variable, and z is the instrumental variable. The second step is to use the outcome network to predict the counterfactual outcomes. The
objective function is:
$$L(D; \theta) = |D|^{-1}\sum_{i}\left(y_i - \int h_{\theta}(t, x_i)\, d\hat{F}_{\phi}(t\,|\,x_i, z_i)\right)^{2}, \tag{29}$$

where h is the prediction function, F̂_φ is the treatment distribution obtained from the first step, and D is the dataset.

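A compact two-stage sketch in the spirit of DeepIV is given below on toy data; the Gaussian treatment model (the original uses a mixture density network) and the simple Monte Carlo approximation of the integral in Equation (29) are simplifying assumptions.

```python
# Two-stage sketch of a DeepIV-style estimator on synthetic data:
# stage 1 fits the treatment distribution F(t|x,z); stage 2 trains h(t,x)
# against the counterfactual loss of Equation (29) via Monte Carlo draws.
import torch
import torch.nn as nn
import torch.nn.functional as F

h = 64
treat_net = nn.Sequential(nn.Linear(2, h), nn.ReLU(), nn.Linear(h, 2))    # (mu, log_sigma) of F(t|x,z)
outcome_net = nn.Sequential(nn.Linear(2, h), nn.ReLU(), nn.Linear(h, 1))  # h_theta(t, x)

# toy data: z is the instrument, x the covariate, t the treatment, y the outcome
n, S = 256, 8
x = torch.randn(n, 1); z = torch.randn(n, 1)
t = 0.5 * z + 0.3 * x + 0.1 * torch.randn(n, 1)
y = (2.0 * t + x + 0.1 * torch.randn(n, 1)).squeeze(-1)

# Stage 1: fit the conditional treatment distribution F_phi(t | x, z)
opt1 = torch.optim.Adam(treat_net.parameters(), lr=1e-2)
for _ in range(200):
    mu, log_sigma = treat_net(torch.cat([x, z], 1)).chunk(2, dim=1)
    nll = -torch.distributions.Normal(mu, log_sigma.exp()).log_prob(t).mean()
    opt1.zero_grad(); nll.backward(); opt1.step()

# Stage 2: train h_theta, approximating the integral with draws from the fitted F_phi
opt2 = torch.optim.Adam(outcome_net.parameters(), lr=1e-2)
for _ in range(200):
    with torch.no_grad():
        mu, log_sigma = treat_net(torch.cat([x, z], 1)).chunk(2, dim=1)
        t_draws = mu + log_sigma.exp() * torch.randn(n, S)       # samples of t | x, z
    inp = torch.stack([t_draws, x.expand(-1, S)], dim=-1).reshape(-1, 2)
    y_int = outcome_net(inp).reshape(n, S).mean(1)               # approx. integral of h(t, x) dF(t|x, z)
    loss = F.mse_loss(y_int, y)
    opt2.zero_grad(); loss.backward(); opt2.step()
```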
Figure 10: Timeline of the main deep causal inference methods. In this figure, the blue circle represents the covariate balance-based method, the green circle represents the adversarial training-based method, the yellow circle represents the proxy variable-based method, and the gray triangle represents the technique used.
Figure 10 shows the timeline of the main deep causal inference methods according to covariate balance-based,
adversarial training-based and proxy variable-based methods. We can see that covariate balance-based methods are the
core of deep causal inference methods; adversarial training-based and proxy variable-based methods have received
increasing attention in recent years.

5.4 Deep Structural Causal Model


In addition to the three categories of methods mentioned above, the deep fusion of neural networks and structural causal
models resulted in a new model called the deep structural causal model. Due to the similarity in structure, the
combination of SCM and graph neural networks naturally attracts attention. Graph neural networks are well suited to
learning the characterization of graph structures. The structure of the causal graph itself is naturally suitable for
representation with the GNN. Therefore, some studies have explored how GNNs and SCMs can be deeply integrated to
produce a model that combines the respective advantages of GNNs and SCMs. They usually include a general
approximation of neural networks and can model intervention distributions and counterfactual distributions. The GNN-
SCM [145] uses the graph neural network [146-147] for causal inference. Graph neural networks work well with
structured data, so they can be used as nonparametric function approximators for structural information. SCM acts as a
realistic model for data generation. The GNN-SCM derives the theoretical connection between the GNN and SCM from
the first principles. A more fine-grained neural causal model is defined [148], intervention under the GNN structure is
formalized, and a new deep structural causal model class is established using autoencoders [149-150]. Although existing

theories of causal identification prove that interventions are not always necessary to identify causal effects, intervention is still at the core of causal reasoning at the current stage of research. An intervention analogous to that of the SCM is defined in the computing layers of the GNN: in the graph, the intervention changes the connections to neighboring nodes, removing the edges from the parent nodes in the causal graph. The restrictions that Pearl's causal hierarchy (PCH) places on what can be learned from data (i.e., which level of information can be obtained from the data) still apply to neural networks. Neural networks are universal approximators, so a set of neural networks can be trained on data generated by an SCM to obtain an estimate of that SCM.
Furthermore, some graph-based variational autoencoders are beginning to merge with the SCM. VACA [151] does not require any assumptions about the parametric form of the structural equations; it mimics the necessary properties of an SCM and provides a framework for performing intervention operations, with graph neural networks used for causal inference. The conditions that the variational graph autoencoder (VGAE) [152] must satisfy to act as a density estimator for an a priori graph structure are described so that it can simulate the behavior of causal interventions. In VACA, probabilistic models that represent uncertainty are introduced to estimate causality, providing good approximations of the interventional and counterfactual distributions.
To build a deep structural causal model, the ability to model a distribution with three levels is required: association,
intervention, and counterfactual. At the same time, deep-structured causal models are no longer limited to structured data,
and since deep neural networks have been combined with structures, they can often directly process unstructured, high-
dimensional data. By fully combining deep mechanisms with the SCM, DSCM [153] can use exogenous variables for counterfactual inference, just as an SCM does, via variational inference. Three types of deep mechanisms combined with the SCM have been discussed: explicit likelihood, amortized explicit likelihood, and amortized implicit likelihood. These mechanisms may require variational inference and normalizing flows for modeling. DSCM can also carry out the three steps of
counterfactual inference depicted by Pearl [35], which are abduction, action, and prediction. First, DSCM uses the
available evidence to estimate the exogenous variables. Second, it intervenes in one of the variables and the other
mechanisms remain unchanged. Finally, it uses the exogenous variable and causal mechanisms to obtain the new
outcomes. A deep causal graph (DCG) [154] also uses neural networks to model causal relationships. This model adapts to data sampled from observational or interventional distributions to answer questions about interventional or counterfactual data. Specifically, DCG uses neural networks to simulate the structural equations; their elements are deep causal units (DCUs), which can perform three operations: sampling, computing the likelihood, and computing the noise posterior.
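The three counterfactual steps mentioned above can be illustrated on a toy, hand-written SCM; a deep structural causal model replaces these closed-form mechanisms with neural (e.g., invertible) components, but the abduction-action-prediction procedure is the same.

```python
# Toy illustration of abduction, action, and prediction on a two-variable
# SCM (x -> y); the mechanisms here are hand-written for clarity.
import numpy as np

# structural equations: x = u_x,  y = 2*x + u_y
def f_y(x, u_y):
    return 2.0 * x + u_y

x_obs, y_obs = 1.5, 4.1

# 1) abduction: infer the exogenous noise consistent with the observation
u_y = y_obs - 2.0 * x_obs          # = 1.1

# 2) action: intervene do(x = 0.0), leaving the other mechanisms untouched
x_do = 0.0

# 3) prediction: push the recovered noise through the modified model
y_counterfactual = f_y(x_do, u_y)
print(y_counterfactual)            # 1.1, i.e., "what y would have been had x been 0"
```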

6 CONCLUSION AND FUTURE DIRECTIONS


Deep learning models show great advantages in various fields, and an increasing number of researchers are trying to
combine deep learning with causal learning. Hopefully, the powerful representation and learning ability of neural
networks can be used to help existing causal learning from all aspects.
This article reviews three aspects of the improvements that deep learning brings to causal learning: deep
representation learning for causal variables, deep causal discovery, and deep causal inference. Deep representation
learning for causal variables uses a deep neural network to learn the representation of causal variables, especially in the
case of unstructured and unbalanced data. To learn representation with causal semantics for better causal discovery and
causal inference, it is necessary to control confounders, that is, balance the covariates’ distribution. For causal discovery,
we reviewed deep causal discovery methods based on neural networks according to the data situations. Then, we
reviewed causal inference methods with deep learning from three perspectives: covariate balance-based, adversarial
training-based, and proxy variable-based methods. Finally, we introduced a special class of structural causal models
called deep structural causal models. These models deeply integrate neural networks with causal models and can make
full use of the structural information of SCM and the fitting ability of neural networks.
Although deep learning has brought many changes to causal learning, there are still many problems that must be
addressed. Here, we raised these questions and included a brief discussion, hoping to provide researchers with some
future directions.
Scarcity of causal knowledge and too many strong assumptions. Although many of us are devoted to the field of
causal learning, as researchers, we must constantly reflect on whether what we think of as causality is truly causality and
whether there is a more suitable form of studying causality than the causal graph or potential outcomes. Because of this
lack of causal knowledge, we rely heavily on untestable assumptions when studying causality. In the causal discovery
field, the correctness of the causal Markov condition assumption and faithfulness assumption still needs to be fully
verified [155]. The SUTVA, unconfoundedness, and positivity assumptions are required when making causal inferences. The conditions under which these assumptions hold or fail need to be fully studied (e.g., due to social connections, one person's treatment outcomes may affect another person through social activities). Most existing methods and
applications of causal inference are based on the assumption of directed acyclic graphs [156], but in reality, there may be
causal feedback between variables, leading to the emergence of cyclic graphs [95]. In addition, different methods are
based on different assumptions about the distribution of noise [26, 157]. How to reasonably relax the assumptions while
ensuring the accuracy is a very challenging problem. Although these assumptions are convenient, they also bring many
risks, and care must be taken when using them.
Complex unstructured treatments and effects. Existing deep representation learning for causal variables does not
completely solve the problem of complex data. In many scenarios, data are heterogeneous [158], and treatment variables
may be very complex, such as time-series multivariate continuous variables (as opposed to simple binary discrete
variables) which pose challenges to many existing methods (e.g., the sampling frequency may affect results [95]). An
important problem in causal representation learning is how to make the representations as stable and unique as possible
and how to match the representations with human cognitive understanding. In addition, after causal representation
vectors are obtained, it may not make sense to test the independence between vectors; therefore, designing suitable
causal discovery and causal inference methods for different scenarios or designing general powerful methods is a way to
make deep causal learning more applicable.
Lack of causal datasets and suitable metrics. Although there are some commonly used datasets in the fields of
causal discovery and causal inference [25, 62], these small-scale datasets severely limit the performance of neural
networks for deep causal learning methods. Therefore, releasing large-scale causal datasets would be a significant boost
to the entire field of deep causal learning. At the same time, the metrics used to estimate causal effects on different
datasets remain inconsistent. For example, 𝜖𝑃𝐸𝐻𝐸 used in IHDP is only suitable for binary treatment variables, while the
metric 𝑅𝑝𝑜𝑙 used in the Jobs dataset is highly targeted and does not have universality. Rich and diverse causal "loss
functions" similar to the current stage of deep learning [159-161] would be very helpful for the rapid improvement of the
performance of deep causal learning algorithms.
Limited scalability and excessive computational consumption. Most current causal discovery methods can only
achieve good results on small-scale datasets, although methods based on continuous optimization can theoretically cope
with hundreds or thousands of high-dimensional variables [25]. Therefore, verifying the efficiency of existing methods
on large-scale data and developing theoretically more efficient methods are urgently needed in the field of causal
discovery. In addition to using purely observational data to discover causal relationships, integrating prior knowledge
into the causal discovery process is also a very meaningful direction for future study [162]. It is not always necessary to
pursue the discovery of a complete causal graph [163]; sometimes it is enough to know part of the causal structure to
solve the problem.
Deeper integration with deep learning. In this article, we mentioned the implicit generation of counterfactual data
and the explicit generation of disentangled causal mechanisms. We can also explore leveraging explicit generative
methods to generate counterfactuals, using implicit generative models to approximate causal mechanisms. Other deep
models, such as normalizing flow models [153], autoregressive flow models [164], and energy-based models [165], combined with causal discovery and causal inference methods, can be considered. Based on our classification system,
deep causal inference methods are classified into covariate balancing-based, adversarial training-based, and proxy
variable-based methods. Some studies combined covariate balancing with adversarial training and obtained good results
[134-135, 166] (e.g., achieving covariate balance through adversarial training). There are also some methods that use
Transformer for causal inference [167-169]. This inspires us to multi-dimensionally integrate deep learning ideas with
causal methods from different perspectives. Finally, a class of methods called neural causal models (NCMs) [148] has
been extensively studied in recent years, and many effective algorithms have been developed for causal discovery and
inference. NCM is a subset of SCM but generally has the same expressive power as SCM. Neural causal models usually
combine the characteristics of the SCM data structure and the advantages of the general fitting ability of neural networks.
Therefore, they are considered a research direction that may bring a breakthrough to causal learning.
Causality for deep learning. In this article, we mainly discussed the changes that deep learning brought to causal
learning. At the same time, causal learning is profoundly changing the field of deep learning. Many studies have focused
on how causality can help deep learning address long-standing issues such as interpretability [44-45, 47, 170-171],
generalization [172-174], robustness [175-177], and fairness [178-180].

ACKNOWLEDGMENTS
We thank Xingwei Zhang and Songran Bai for their precious suggestions and comments on this work. We also thank
Haitao Huang and Gang Zhou for invaluable discussion. This work is supported by the Ministry of Science and
Technology of China under Grant No. 2020AAA0108401, and the Natural Science Foundation of China under Grant
Nos. 72225011 and 71621002.

REFERENCES
[1] Md Vasimuddin and Srinivas Aluru. 2017. Parallel Exact Dynamic Bayesian Network Structure Learning with Application to Gene Networks. In
2017 IEEE 24th International Conference on High Performance Computing (HiPC), Dec 18-21, 2017. Jaipur, India. 42-51.
[2] Sofia Triantafillou, Vincenzo Lagani, Christina Heinze-Deml, Angelika Schmidt and Ioannis Tsamardinos. 2017. Predicting Causal Relationships
from Biological Data: Applying Automated Causal Discovery on Mass Cytometry Data of Human Immune Cells. Scientific Reports 7, 1 (2017), 1-12.
[3] Steffen L Lauritzen and David J Spiegelhalter. 1988. Local computations with probabilities on graphical structures and their application to expert
systems. Journal of the Royal Statistical Society: Series B (Methodological) 50, 2 (1988), 157-194.
[4] Guido W Imbens and Donald B Rubin. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
[5] Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American
Statistical Association 113, 523 (2018), 1228-1242.
[6] Subramani Mani and Gregory F Cooper. 2000. Causal discovery from medical textual data. In Proceedings of the AMIA Symposium, 2000. 542.
[7] Cross-Disorder Group of the Psychiatric Genomics Consortium. 2013. Identification of risk loci with shared effects on five major psychiatric
disorders: a genome-wide analysis. The Lancet 381, 9875 (2013), 1371-1379.
[8] Clive WJ Granger. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: journal of the Econometric
Society 37 (1969), 424-438.
[9] Kevin D Hoover. 2006. Causality in economics and econometrics. Springer.
[10] Alberto Abadie and Guido W Imbens. 2016. Matching on the estimated propensity score. Econometrica 84, 2 (2016), 781-807.
[11] Guido W Imbens. 2004. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and statistics 86,
1 (2004), 4-29.
[12] Serge Darolles, Yanqin Fan, Jean-Pierre Florens and Eric Renault. 2011. Nonparametric instrumental regression. Econometrica 79, 5 (2011), 1541-
1565.
[13] Miguel Ángel Hernán, Babette Brumback and James M Robins. 2000. Marginal structural models to estimate the causal effect of zidovudine on the
survival of HIV-positive men. Epidemiology 11, 5 (2000), 561-570.
[14] James M Robins, Miguel Angel Hernan and Babette Brumback. 2000. Marginal structural models and causal inference in epidemiology.
Epidemiology 11, 5 (2000), 550-560.
[15] Miguel A Hernán and James M Robins. 2010. Causal inference: What If. CRC Press.
[16] Miguel A Hernán. 2018. The C-word: scientific euphemisms do not improve causal inference from observational data. American journal of public
health 108, 5 (2018), 616-619.
[17] Michael P Grosz, Julia M Rohrer and Felix Thoemmes. 2020. The taboo against explicit causal inference in nonexperimental psychology.
Perspectives on Psychological Science 15, 5 (2020), 1243-1255.
[18] MJ Vowels. 2020. Limited functional form, misspecification, and unreliable interpretations in psychology and social science. arXiv:2009.10025.
Retrieved from https://arxiv.org/abs/2009.10025
[19] Michael E Sobel. 1998. Causal inference in statistical models of the process of socioeconomic achievement: A case study. Sociological Methods &
Research 27, 2 (1998), 318-348.
[20] Cosma Rohilla Shalizi and Andrew C Thomas. 2011. Homophily and contagion are generically confounded in observational social network studies.
Sociological methods & research 40, 2 (2011), 211-239.
[21] Bernhard Schölkopf. 2022. Causality for machine learning. Probabilistic and Causal Inference: The Works of Judea Pearl (2022), 765-804.
[22] Bernhard Scholkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal and Yoshua Bengio. 2021. Toward
Causal Representation Learning. Proceedings of the IEEE 109, 5 (2021), 612-634.
[23] Ruocheng Guo, Lu Cheng, Jundong Li, P. Richard Hahn and Huan Liu. 2021. A Survey of Learning Causality with Data. ACM Computing Surveys
53, 4 (2021), 1-37.
[24] Clark Glymour, Kun Zhang and Peter Spirtes. 2019. Review of Causal Discovery Methods Based on Graphical Models. Front Genet 10 (2019), 524.
[25] Matthew J Vowels, Necati Cihan Camgoz and Richard Bowden. 2021. D'ya like dags? A survey on structure learning and causal discovery.
arXiv:2103.02582. Retrieved from https://arxiv.org/abs/2103.02582
[26] Peter Spirtes, Clark N Glymour, Richard Scheines and David Heckerman. 2000. Causation, prediction, and search. MIT press.
[27] David Maxwell Chickering. 2002. Optimal structure identification with greedy search. Journal of machine learning research 3, Nov (2002), 507-554.
[28] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, Antti Kerminen and Michael Jordan. 2006. A linear non-Gaussian acyclic model for causal
discovery. Journal of Machine Learning Research 72, 7 (2006), 2003-2030.
[29] Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters and Bernhard Schölkopf. 2008. Nonlinear causal discovery with additive noise models.
In Advances in neural information processing systems, Dec 8-10, 2008. Vancouver, B.C., Canada. 689-696.
[30] Diego Colombo, Marloes H Maathuis, Markus Kalisch and Thomas S Richardson. 2012. Learning high-dimensional directed acyclic graphs with
latent and selection variables. The Annals of Statistics 40, 1 (2012), 294-321.
[31] Dominik Janzing, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel and Bernhard Schölkopf. 2012.
Information-geometric approach to inferring causal directions. Artificial Intelligence 182 (2012), 1-31.
[32] Antti Hyttinen, Patrik O Hoyer, Frederick Eberhardt and Matti Jarvisalo. 2013. Discovering cyclic causal models with latent variables: A general
SAT-based procedure. arXiv:1309.6836. Retrieved from https://arxiv.org/abs/1309.6836
[33] Judea Pearl. 2009. Causal inference in statistics: An overview. Statistics Surveys 3 (2009), 96-146.
[34] Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao and Aidong Zhang. 2021. A survey on causal inference. ACM Transactions on Knowledge
Discovery from Data (TKDD) 15, 5 (2021), 1-46.
[35] Judea Pearl. 2009. Causality. Cambridge university press.
[36] Alberto Abadie, David Drukker, Jane Leber Herr and Guido W Imbens. 2004. Implementing matching estimators for average treatment effects in
Stata. The stata journal 4, 3 (2004), 290-311.
[37] Elizabeth A Stuart. 2010. Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of
Mathematical Statistics 25, 1 (2010), 1-21.
[38] Donald B Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology 66, 5
(1974), 688.
[39] Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1
(1983), 41-55.
[40] Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association 100,
469 (2005), 322-331.
[41] James M Robins, Andrea Rotnitzky and Lue Ping Zhao. 1994. Estimation of regression coefficients when some regressors are not always observed.
Journal of the American statistical Association 89, 427 (1994), 846-866.
[42] Michele Jonsson Funk, Daniel Westreich, Chris Wiesen, Til Stürmer, M Alan Brookhart and Marie Davidian. 2011. Doubly robust estimation of
causal effects. American journal of epidemiology 173, 7 (2011), 761-767.
[43] Youngseo Son, Nipun Bayas and H Andrew Schwartz. 2018. Causal explanation analysis on social media. arXiv:1809.01202. Retrieved from
https://arxiv.org/abs/1809.01202
[44] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer and Stuart Shieber. 2020. Investigating gender bias in
language models using causal mediation analysis. In Advances in Neural Information Processing Systems, Dec 6-12, 2020. 12388-12401.
[45] Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen and Yonatan Belinkov. 2021. Causal analysis of syntactic
agreement mechanisms in neural language models. arXiv:2106.06087. Retrieved from https://arxiv.org/abs/2106.06087
[46] Dong Zhang, Hanwang Zhang, Jinhui Tang, Xian-Sheng Hua and Qianru Sun. 2020. Causal intervention for weakly-supervised semantic
segmentation. Advances in Neural Information Processing Systems 33, (2020), 655-666.
[47] Pranoy Panda, Sai Srinivas Kancheti and Vineeth N Balasubramanian. 2021. Instance-wise Causal Feature Selection for Model Interpretation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 1756-1759.
[48] Markus Kalisch and Peter Bühlman. 2007. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning
Research 8 (2007), 613-636.
[49] Joseph Ramsey, Jiji Zhang and Peter L Spirtes. 2012. Adjacency-faithfulness and conservative causal inference. arXiv:1206.6843. Retrieved from
https://arxiv.org/abs/1206.6843
[50] Peter C Austin. 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate
behavioral research 46, 3 (2011), 399-424.
[51] Rohit Bhattacharya, Razieh Nabi and Ilya Shpitser. 2020. Semiparametric inference for causal effects in graphical models with hidden variables.
arXiv:2003.12659. Retrieved from https://arxiv.org/abs/2003.12659
[52] Niki Kiriakidou and Christos Diou. 2022. An improved neural network model for treatment effect estimation. arXiv:2205.11106. Retrieved from
https://arxiv.org/abs/2205.11106
[53] Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Bernhard Schölkopf, Michael C Mozer, Chris Pal and Yoshua
Bengio. 2019. Learning neural causal models from unknown interventions. arXiv:1910.01075. Retrieved from https://arxiv.org/abs/1910.01075
[54] Christina Heinze-Deml, Marloes H Maathuis and Nicolai Meinshausen. 2018. Causal structure learning. Annual Review of Statistics and Its
Application 5 (2018), 371-391.
[55] Clark Glymour, Kun Zhang and Peter Spirtes. 2019. Review of causal discovery methods based on graphical models. Frontiers in genetics 10, (2019),
524.
[56] Krzysztof Chalupka, Frederick Eberhardt and Pietro Perona. 2016. Causal feature learning: an overview. Behaviormetrika 44, 1 (2016), 137-164.
[57] Yoshua Bengio, Aaron Courville and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern
analysis and machine intelligence 35, 8 (2013), 1798-1828.
[58] Mehrdad Farajtabar, Andrew Lee, Yuanjian Feng, Vishal Gupta, Peter Dolan, Harish Chandran and Martin Szummer. 2020. Balance regularized
neural network models for causal effect estimation. arXiv:2011.11199. Retrieved from https://arxiv.org/abs/2011.11199
[59] Jean Kaddour, Yuchen Zhu, Qi Liu, Matt J Kusner and Ricardo Silva. 2021. Causal effect inference for structured treatments. In Advances in Neural
Information Processing Systems, Dec 6-14, 2021. 24841-24854.
[60] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio. 2014.
Generative adversarial nets. In Advances in neural information processing systems, Dec 8-13, 2014. Montréal Canada. 2672-2680.
[61] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv:1312.6114. Retrieved from https://arxiv.org/abs/1312.6114
[62] Ana Rita Nogueira, Andrea Pugnana, Salvatore Ruggieri, Dino Pedreschi and João Gama. 2022. Methods and tools for causal discovery and causal
inference. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12, 2 (2022), e1449.
[63] Bernhard Schölkopf. 2019. Causality for machine learning. arXiv:1911.10500. Retrieved from https://arxiv.org/abs/1911.10500
[64] Jean Kaddour, Aengus Lynch, Qi Liu, Matt J Kusner and Ricardo Silva. 2022. Causal Machine Learning: A Survey and Open Problems.
arXiv:2206.15475. Retrieved from https://arxiv.org/abs/2206.15475
[65] Bernard Koch, Tim Sainburg, Pablo Geraldo, Song Jiang, Yizhou Sun and Jacob Gates Foster. 2021. Deep learning of potential outcomes.
arXiv:2110.04442. Retrieved from https://arxiv.org/abs/2110.04442
[66] Ilya Shpitser and Judea Pearl. 2008. Complete Identification Methods for the Causal Hierarchy. Journal of Machine Learning Research 9, 9 (2008),
1941-1979.
[67] Jonas Peters, Dominik Janzing and Bernhard Schölkopf. 2017. Elements of Causal Inference - Foundations and Learning Algorithms. The MIT Press.
[68] Keisuke Hirano, Guido W Imbens and Geert Ridder. 2003. Efficient estimation of average treatment effects using the estimated propensity score.
Econometrica 71, 4 (2003), 1161-1189.
[69] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735-1780.
[70] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In
Interspeech, Sep 26-30, 2010. Makuhari, Chiba, Japan. 1045-1048.
[71] Marco Gori, Gabriele Monfardini and Franco Scarselli. 2005. A new model for learning in graph domains. In Proceedings of 2005 IEEE international
joint conference on neural networks, Jul 31-Aug 4, 2005. Montreal, QC, Canada. 729-734.
[72] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907. Retrieved from https://arxiv.org/abs/1609.02907
[73] Serge Assaad, Shuxi Zeng, Chenyang Tao, Shounak Datta, Nikhil Mehta, Ricardo Henao, Fan Li and Lawrence Carin. 2021. Counterfactual
representation learning with balancing weights. In International Conference on Artificial Intelligence and Statistics, Apr 13-15, 2021. 1972-1980.
[74] Fredrik Johansson, Uri Shalit and David Sontag. 2016. Learning representations for counterfactual inference. In International conference on machine
learning, Jun 19-24, 2016. New York City, NY, USA. 3020-3029.
[75] Uri Shalit, Fredrik D Johansson and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In
International Conference on Machine Learning, Aug 6-11, 2017. Sydney, Australia. 3076-3085.
[76] Cédric Villani. 2009. Optimal transport: old and new. Springer.
[77] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf and Alexander Smola. 2012. A kernel two-sample test. The Journal of
Machine Learning Research 13, 1 (2012), 723-773.
[78] Nicolas Fournier and Arnaud Guillin. 2015. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and
Related Fields 162, 3 (2015), 707-738.
[79] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf and Alex J Smola. 2006. Integrating structured
biological data by kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49-e57.
[80] Claudia Shi, David Blei and Victor Veitch. 2019. Adapting neural networks for the estimation of treatment effects. arXiv:1906.02120. Retrieved from https://arxiv.org/abs/1906.02120
[81] Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao and Aidong Zhang. 2018. Representation learning for treatment effect estimation from
observational data. In Advances in Neural Information Processing Systems, Dec 2-8, 2018. Montréal Canada. 2638-2648.
[82] Shakir Mohamed and Balaji Lakshminarayanan. 2016. Learning in implicit generative models. arXiv:1610.03483. Retrieved from
https://arxiv.org/abs/1610.03483
[83] Liuyi Yao, Sheng Li, Yaliang Li, Hongfei Xue, Jing Gao and Aidong Zhang. 2019. On the estimation of treatment effect with text covariates. In the
28th International Joint Conference on Artificial Intelligence, Aug 10-16, 2019. Macao, China. 4106-4113.
[84] Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla and Bernhard Schölkopf. 2018. Learning independent causal mechanisms. In
International Conference on Machine Learning, Jul 10-15, 2018. Stockholmsmässan, Stockholm Sweden. 4036-4044.
[85] Raphael Suter, Djordje Miladinovic, Bernhard Schölkopf and Stefan Bauer. 2019. Robustly disentangled causal mechanisms: Validating deep
representations for interventional robustness. In International Conference on Machine Learning, Jun 9-15, 2019. Long Beach, California. 6056-6065.
[86] Antonin Chambolle and Thomas Pock. 2016. An introduction to continuous optimization for imaging. Acta Numerica 25 (2016), 161-319.
[87] Niclas Andréasson, Anton Evgrafov and Michael Patriksson. 2020. An introduction to continuous optimization: foundations and fundamental
algorithms. Courier Dover Publications.
[88] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar and Eric P Xing. 2018. Dags with no tears: Continuous optimization for structure learning. In
Advances in Neural Information Processing Systems, Dec 2-8, 2018. Montréal Canada. 9492-9503.
[89] Yue Yu, Jie Chen, Tian Gao and Mo Yu. 2019. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on
Machine Learning, June 9-15, 2019. Long Beach, California. 7154-7163.
[90] Ignavier Ng, Shengyu Zhu, Zhitang Chen and Zhuangyan Fang. 2019. A graph autoencoder approach to causal structure learning. arXiv:1911.07420.
Retrieved from https://arxiv.org/abs/1911.07420
[91] Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu and Simon Lacoste-Julien. 2019. Gradient-based neural dag learning. arXiv:1906.02226.
Retrieved from https://arxiv.org/abs/1906.02226
[92] Tomer Galanti, Ofir Nabati and Lior Wolf. 2020. A critical view of the structural causal model. arXiv:2002.10007. Retrieved from
https://arxiv.org/abs/2002.10007
[93] Trent Kyono, Yao Zhang and Mihaela van der Schaar. 2020. Castle: Regularization via auxiliary causal graph discovery. In Advances in Neural
Information Processing Systems, Dec 6-12, 2020. 1501-1512.
[94] Raha Moraffah, Bahman Moraffah, Mansooreh Karami, Adrienne Raglin and Huan Liu. 2020. Causal adversarial network for learning conditional
and interventional distributions. arXiv:2008.11376. Retrieved from https://arxiv.org/abs/2008.11376
[95] Jakob Runge, Sebastian Bathiany, Erik Bollt, Gustau Camps-Valls, Dim Coumou, Ethan Deyle, Clark Glymour, Marlene Kretschmer, Miguel D
Mahecha and Jordi Muñoz-Marí. 2019. Inferring causation from time series in Earth system sciences. Nature communications 10, 1 (2019), 1-13.
[96] David Danks and Sergey Plis. 2013. Learning causal structure from undersampled time series. In Twenty-seventh Conference on Neural Information
Processing Systems, Dec 5-10, 2013. Harrahs and Harveys, Lake Tahoe. 1-10.
[97] Antti Hyttinen, Sergey Plis, Matti Järvisalo, Frederick Eberhardt and David Danks. 2016. Causal discovery from subsampled time series data by
constraint optimization. In Conference on Probabilistic Graphical Models, Sep 6-9, 2016. Lugano, Switzerland. 216-227.
[98] C. W. J. Granger. 1980. Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control 2, (1980), 329-352.
[99] Lionel Barnett and Anil K Seth. 2014. The MVGC multivariate Granger causality toolbox: a new approach to Granger-causal inference. Journal of
neuroscience methods 223 (2014), 50-68.
[100] Lionel Barnett and Anil K Seth. 2014. The MVGC multivariate Granger causality toolbox: a new approach to Granger-causal inference. Journal of
neuroscience methods 223, (2014), 50-68.
[101] Belkacem Chikhaoui, Mauricio Chiazzaro and Shengrui Wang. 2015. A new granger causal model for influence evolution in dynamic social
networks: The case of dblp. In Proceedings of the AAAI Conference on Artificial Intelligence, January 25-30, 2015. Austin Texas, USA. 51-57.
[102] Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie and Emily Fox. 2021. Neural granger causality. IEEE Transactions on Pattern Analysis and
Machine Intelligence 44, 8 (2021), 4267-4279.
[103] Hongming Zhang, Yintong Huo, Xinran Zhao, Yangqiu Song and Dan Roth. 2021. Learning Contextual Causality between Daily Events from Time-
consecutive Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, December 6-12, 2021. 1752-1755.
[104] George Sugihara, Robert May, Hao Ye, Chih-hao Hsieh, Ethan Deyle, Michael Fogarty and Stephan Munch. 2012. Detecting causality in complex
ecosystems. Science 338, 6106 (2012), 496-500.
[105] Matthew J Vowels, Necati Cihan Camgoz and Richard Bowden. 2021. Shadow-mapping for unsupervised neural causal discovery. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 19-25, 2021. 1740-1743.
[106] SM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari and Geoffrey E Hinton. 2016. Attend, infer, repeat: Fast scene
understanding with generative models. arXiv:1603.08575. Retrieved from https://arxiv.org/abs/1603.08575
[107] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal and Christopher Pal. 2019. A
meta-transfer objective for learning to disentangle causal mechanisms. arXiv:1901.10912. Retrieved from https://arxiv.org/abs/1901.10912
[108] Thien Q Tran, Kazuto Fukuchi, Youhei Akimoto and Jun Sakuma. 2021. Unsupervised Causal Binary Concepts Discovery with VAE for Black-box
Model Explanation. arXiv:2109.04518. Retrieved from https://arxiv.org/abs/2109.04518
[109] Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao and Jun Wang. 2021. CausalVAE: Disentangled representation learning via
neural structural causal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, December 6-12, 2021.
9593-9602.
[110] Nan Rosemary Ke, Jane Wang, Jovana Mitrovic, Martin Szummer and Danilo J Rezende. 2020. Amortized learning of neural causal representations.
arXiv:2008.09301. Retrieved from https://arxiv.org/abs/2008.09301
[111] Bernhard H Korte, Jens Vygen, B Korte and J Vygen. 2011. Combinatorial optimization. Springer.
[112] Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, Isabelle Guyon, David Lopez-Paz and Michele Sebag. 2018. Learning functional causal
models with generative neural networks. Explainable and interpretable models in computer vision and machine learning (2018), 39-80.
[113] Diviyan Kalainathan. 2019. Generative Neural Networks to infer Causal Mechanisms: algorithms and applications. Université Paris-Saclay.
[114] Sindy Löwe, David Madras, Richard Zemel and Max Welling. 2020. Amortized causal discovery: Learning to infer causal graphs from time-series
data. arXiv:2006.10833. Retrieved from https://arxiv.org/abs/2006.10833
[115] Yunzhu Li, Antonio Torralba, Anima Anandkumar, Dieter Fox and Animesh Garg. 2020. Causal discovery in physical systems from videos. In
Advances in Neural Information Processing Systems, Dec 6-12, 2020. 9180-9192.
[116] Shengyu Zhu, Ignavier Ng and Zhitang Chen. 2019. Causal discovery with reinforcement learning. arXiv:1906.04477. Retrieved from
https://arxiv.org/abs/1906.04477
[117] Patrick Schwab, Lorenz Linhardt and Walter Karlen. 2018. Perfect match: A simple method for learning representations for counterfactual inference
with neural networks. arXiv:1810.00656. Retrieved from https://arxiv.org/abs/1810.00656
[118] Fredrik D Johansson, Nathan Kallus, Uri Shalit and David Sontag. 2018. Learning weighted representations for generalization across designs.
arXiv:1802.08598. Retrieved from https://arxiv.org/abs/1802.08598
[119] Negar Hassanpour and Russell Greiner. 2019. CounterFactual Regression with Importance Sampling Weights. In the 28th International Joint
Conference on Artificial Intelligence, Aug 10-16, 2019. Macao, China. 5880-5887.
[120] Liu Qidong, Tian Feng, Ji Weihua and Zheng Qinghua. 2020. A new representation learning method for individual treatment effect estimation: Split
covariate representation network. In Asian Conference on Machine Learning, Nov 17-19, 2020. 811-822.
[121] Ahmed M Alaa, Michael Weisz and Mihaela Van Der Schaar. 2017. Deep counterfactual networks with propensity-dropout. arXiv:1706.05966.
Retrieved from https://arxiv.org/abs/1706.05966
[122] Vikas Ramachandra. 2018. Deep learning for causal inference. arXiv:1803.00149. Retrieved from https://arxiv.org/abs/1803.00149
[123] Onur Atan, James Jordon and Mihaela Van der Schaar. 2018. Deep-treat: Learning optimal personalized treatments from observational data using
neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Feb 2-7, 2018. New Orleans, Louisiana, USA. 2071-2078.
[124] Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao and Aidong Zhang. 2019. Ace: Adaptively similarity-preserved representation learning for
individual treatment effect estimation. In 2019 IEEE International Conference on Data Mining (ICDM), Nov 8-11, 2019. Beijing, China. 1432-1437.
[125] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural
networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929-1958.
[126] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International
conference on machine learning, Jun 19-24, 2016. New York City, NY, USA. 1050-1059.
[127] Jinsung Yoon, James Jordon and Mihaela Van Der Schaar. 2018. GANITE: Estimation of individualized treatment effects using generative
adversarial nets. In International Conference on Learning Representations, Apr 30- May 3, 2018. Vancouver Canada.
[128] Ioana Bica, James Jordon and Mihaela van der Schaar. 2020. Estimating the effects of continuous-valued interventions using generative adversarial
networks. In Advances in Neural Information Processing Systems, Dec 6-12, 2020. 16434-16445.
[129] Chandan Singh, Guha Balakrishnan and Pietro Perona. 2021. Matched sample selection with GANs for mitigating attribute confounding.
arXiv:2103.13455. Retrieved from https://arxiv.org/abs/2103.13455
[130] Qingyu Zhao, Ehsan Adeli and Kilian M Pohl. 2020. Training confounder-free deep learning models for medical applications. Nature
communications 11, 1 (2020), 1-9.
[131] Murat Kocaoglu, Christopher Snyder, Alexandros G Dimakis and Sriram Vishwanath. 2017. Causalgan: Learning causal implicit generative models
with adversarial training. arXiv:1709.02023. Retrieved from https://arxiv.org/abs/1709.02023
[132] Ioana Bica, Ahmed M Alaa, James Jordon and Mihaela van der Schaar. 2020. Estimating counterfactual treatment outcomes over time through
adversarially balanced representations. arXiv:2002.04083. Retrieved from https://arxiv.org/abs/2002.04083
[133] Amelia J Averitt, Natnicha Vanitchanant, Rajesh Ranganath and Adler J Perotte. 2020. The Counterfactual χ-GAN. arXiv:2001.03115.
Retrieved from https://arxiv.org/abs/2001.03115
[134] Nathan Kallus. 2020. Deepmatch: Balancing deep covariate representations for causal inference using adversarial training. In International
Conference on Machine Learning, Jul 12-18, 2020. 5067-5077.
[135] Xin Du, Lei Sun, Wouter Duivesteijn, Alexander Nikolaev and Mykola Pechenizkiy. 2021. Adversarial balancing-based representation learning for
causal effect inference with observational data. Data Mining and Knowledge Discovery 35, 4 (2021), 1713-1738.
[136] Matthew James Vowels, Necati Cihan Camgoz and Richard Bowden. 2020. Targeted VAE: Structured inference and targeted learning for causal
parameter estimation. arXiv:2009.13472. Retrieved from https://arxiv.org/abs/2009.13472
[137] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel and Max Welling. 2017. Causal effect inference with deep latent-variable
models. In Advances in neural information processing systems, Dec 4-9, 2017. Long Beach, CA, USA. 6449-6459.
[138] Carl Doersch. 2016. Tutorial on variational autoencoders. arXiv:1606.05908. Retrieved from https://arxiv.org/abs/1606.05908
[139] Diederik P Kingma and Max Welling. 2019. An introduction to variational autoencoders. arXiv:1906.02691. Retrieved from
https://arxiv.org/abs/1906.02691
[140] Weijia Zhang, Lin Liu and Jiuyong Li. 2020. Treatment effect estimation with disentangled latent factors. arXiv:2001.10652. Retrieved from
https://arxiv.org/abs/2001.10652
[141] Jeffrey Wooldridge. 2009. Should instrumental variables be used as matching variables? Citeseer.
[142] Judea Pearl. 2012. On a class of bias-amplifying variables that endanger effect estimates. arXiv:1203.3503. Retrieved from
https://arxiv.org/abs/1203.3503
[143] Olav Reiersøl. 1945. Confluence analysis by means of instrumental sets of variables. Almqvist & Wiksell.
[144] Jason Hartford, Greg Lewis, Kevin Leyton-Brown and Matt Taddy. 2017. Deep IV: A flexible approach for counterfactual prediction. In
International Conference on Machine Learning, Aug 6-11, 2017. Sydney, Australia. 1414-1423.
[145] Matej Zecevic, Devendra Singh Dhami, Petar Velickovic and Kristian Kersting. 2021. Relating graph neural networks to structural causal models.
arXiv:2109.04173. Retrieved from https://arxiv.org/abs/2109.04173
[146] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner and Gabriele Monfardini. 2008. The graph neural network model. IEEE
transactions on neural networks 20, 1 (2008), 61-80.
[147] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio and Yoshua Bengio. 2017. Graph attention networks.
arXiv:1710.10903. Retrieved from https://arxiv.org/abs/1710.10903
[148] Kevin Xia, Kai-Zhan Lee, Yoshua Bengio and Elias Bareinboim. 2021. The causal-neural connection: Expressiveness, learnability, and inference.
arXiv:2107.00793. Retrieved from https://arxiv.org/abs/2107.00793
[149] Pascal Vincent, Hugo Larochelle, Yoshua Bengio and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising
autoencoders. In Proceedings of the 25th international conference on Machine learning, Jul 5-9, 2008. Helsinki Finland. 1096-1103.
[150] Pierre Baldi. 2012. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML workshop on unsupervised and transfer
learning, Jun 26-Jul 1, 2012. Bellevue, Washington, USA. 37-49.
[151] Pablo Sanchez Martin, Miriam Rateike and Isabel Valera. 2022. Variational Causal Autoencoder for Interventional and Counterfactual Queries. In
The Thirty-Sixth AAAI Conference on Artificial Intelligence, Feb 22- Mar 1, 2022. 8159-8168.
[152] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv:1611.07308. Retrieved from https://arxiv.org/abs/1611.07308
[153] Nick Pawlowski, Daniel Coelho de Castro and Ben Glocker. 2020. Deep structural causal models for tractable counterfactual inference. In Advances
in Neural Information Processing Systems, Dec 6-12, 2020. 857-869.
[154] Álvaro Parafita and Jordi Vitrià. 2020. Causal Inference with Deep Causal Graphs. arXiv:2006.08380. Retrieved from
https://arxiv.org/abs/2006.08380
[155] David Freedman and Paul Humphreys. 1999. Are there algorithms that discover causal structure? Synthese 121, 1 (1999), 29-54.
[156] Thomas C Williams, Cathrine C Bach, Niels B Matthiesen, Tine B Henriksen and Luigi Gagliardi. 2018. Directed acyclic graphs: a tool for causal
studies in paediatrics. Pediatric research 84, 4 (2018), 487-493.
[157] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, Antti Kerminen and Michael Jordan. 2006. A linear non-Gaussian acyclic model for causal
discovery. Journal of Machine Learning Research 7, 10 (2006), 2003-2030.
[158] Biwei Huang, Kun Zhang, Jiji Zhang, Joseph D Ramsey, Ruben Sanchez-Romero, Clark Glymour and Bernhard Schölkopf. 2020. Causal Discovery
from Heterogeneous/Nonstationary Data. Journal of Machine Learning Research 21, 89 (2020), 1-53.
[159] Ahmed Alaa and Mihaela Van Der Schaar. 2019. Validating causal inference models via influence functions. In Proceedings of the 36th International
Conference on Machine Learning, Jun 9-15, 2019. 191-201.
[160] Yu Luo, David A Stephens, Daniel J Graham and Emma J McCoy. 2021. Bayesian doubly robust causal inference via loss functions.
arXiv:2103.04086. Retrieved from https://arxiv.org/abs/2103.04086
[161] Moritz Willig, Matej Zečević, Devendra Singh Dhami and Kristian Kersting. 2021. The Causal Loss: Driving Correlation to Imply Causation. arXiv:2110.12066. Retrieved from https://arxiv.org/abs/2110.12066
[162] Uzma Hasan and Md Osman Gani. 2022. KCRL: A Prior Knowledge Based Causal Discovery Framework With Reinforcement Learning.
Proceedings of Machine Learning Research 182, (2022), 1-24.
[163] Wei Wang, Gangqiang Hu, Bo Yuan, Shandong Ye, Chao Chen, Yayun Cui, Xi Zhang and Liting Qian. 2020. Prior-knowledge-driven local causal
structure learning and its application on causal discovery between type 2 diabetes and bone mineral density. IEEE Access 8, (2020), 108798-108810.
[164] Ilyes Khemakhem, Ricardo Monti, Robert Leech and Aapo Hyvarinen. 2021. Causal autoregressive flows. In International conference on artificial
intelligence and statistics, Apr 13-15, 2021. 3520-3528.
[165] Ilyes Khemakhem, Diederik P Kingma, Ricardo Pio Monti and Aapo Hyvärinen. 2020. Ice-beem: Identifiable conditional energy-based deep models.
In Proceedings of the 34th International Conference on Neural Information Processing Systems, Dec 6-12, 2020. 12768–12778.
[166] Michal Ozery-Flato, Pierre Thodoroff, Matan Ninio, Michal Rosen-Zvi and Tal El-Hay. 2018. Adversarial balancing for causal inference.
arXiv:1810.07406. Retrieved from https://arxiv.org/abs/1810.07406
[167] Zhenyu Guo, Shuai Zheng, Zhizhe Liu, Kun Yan and Zhenfeng Zhu. 2021. CETransformer: Casual Effect Estimation via Transformer Based
Representation Learning. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 2021. 524-535.
[168] Valentyn Melnychuk, Dennis Frauen and Stefan Feuerriegel. 2022. Causal Transformer for Estimating Counterfactual Outcomes. arXiv:2204.07258.
Retrieved from https://arxiv.org/abs/2204.07258
[169] Yi-Fan Zhang, Hanlin Zhang, Zachary C Lipton, Li Erran Li and Eric Xing. 2022. Exploring transformer backbones for heterogeneous treatment
effect estimation. arXiv:2202.01336. Retrieved from https://arxiv.org/abs/2202.01336
[170] Tanmayee Narendra, Anush Sankaran, Deepak Vijaykeerthy and Senthil Mani. 2018. Explaining deep learning models using causal inference.
arXiv:1811.04376. Retrieved from https://arxiv.org/abs/1811.04376
[171] Álvaro Parafita and Jordi Vitrià. 2019. Explaining visual models by causal attribution. In 2019 IEEE/CVF International Conference on Computer
Vision Workshop (ICCVW), Oct 27-28, 2019. Seoul, Korea. 4167-4175.
[172] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani and David Lopez-Paz. 2019. Invariant risk minimization. arXiv:1907.02893. Retrieved from
https://arxiv.org/abs/1907.02893
[173] Divyat Mahajan, Shruti Tople and Amit Sharma. 2021. Domain generalization using causal matching. In International Conference on Machine
Learning, Jul 18-24, 2021. 7313-7324.
[174] Yue He, Zimu Wang, Peng Cui, Hao Zou, Yafeng Zhang, Qiang Cui and Yong Jiang. 2022. CausPref: Causal Preference Learning for Out-of-
Distribution Recommendation. In Proceedings of the ACM Web Conference 2022, Apr 25-29, 2022. Lyon, France. 410-421.
[175] Cheng Zhang, Kun Zhang and Yingzhen Li. 2020. A causal view on robustness of neural networks. Advances in Neural Information Processing
Systems 33, (2020), 289-301.
[176] Sreya Francis, Irene Tenison and Irina Rish. 2021. Towards causal federated learning for enhanced robustness and privacy. arXiv:2104.06557. Retrieved from https://arxiv.org/abs/2104.06557
[177] Haiteng Zhao, Chang Ma, Xinshuai Dong, Anh Tuan Luu, Zhi-Hong Deng and Hanwang Zhang. 2022. Certified Robustness Against Natural
Language Attacks by Causal Intervention. arXiv:2205.12331. Retrieved from https://arxiv.org/abs/2205.12331
[178] Yiquan Wu, Kun Kuang, Yating Zhang, Xiaozhong Liu, Changlong Sun, Jun Xiao, Yueting Zhuang, Luo Si and Fei Wu. 2020. De-biased court’s
view generation with causality. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov 16-20,
2020. 763-780.
[179] Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla and Anupam Datta. 2020. Gender bias in neural natural language processing. Springer.
[180] Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H Chi and Alex Beutel. 2019. Counterfactual fairness in text classification through
robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019. 219-226.