Deep Causal Learning
ZIZHEN DENG, XIAOLONG ZHENG*, HU TIAN, and DANIEL DAJUN ZENG, School of
Artificial Intelligence, University of Chinese Academy of Sciences; The State Key Laboratory of Management and
Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Causal learning has attracted much attention in recent years because causality reveals the essential relationship between things and
indicates how the world progresses. However, there are many problems and bottlenecks in traditional causal learning methods, such as
high-dimensional unstructured variables, combinatorial optimization problems, unknown intervention, unobserved confounders,
selection bias and estimation bias. Deep causal learning, that is, causal learning based on deep neural networks, brings new insights for
addressing these problems. While many deep learning-based causal discovery and causal inference methods have been proposed, there
is a lack of reviews exploring the internal mechanism of deep learning to improve causal learning. In this article, we comprehensively
review how deep learning can contribute to causal learning by addressing conventional challenges from three aspects: representation,
discovery, and inference. We point out that deep causal learning is important for the theoretical extension and application expansion of
causal science and is also an indispensable part of general artificial intelligence. We conclude the article with a summary of open issues
and potential directions for future work.
CCS CONCEPTS • Computing methodologies → Machine learning; Neural network; Causal reasoning and diagnostics;
Mathematics of computing → Causal networks
Additional Keywords and Phrases: Deep learning, Causal variables, Causal discovery, Causal inference
1 INTRODUCTION
The study of causality has always been a very important part of scientific research. Causality has been studied in many
fields, such as biology [1-2], medicine [3-7], economics [8-12], epidemiology [13-15], and sociology [16-20]. For the
construction of general artificial intelligence systems, causality is also indispensable [21-23]. For a long time, causal
discovery and causal inference have been the main research directions of causal science. In causal discovery (CD) [24-
25], causal relationships are found in observational data, usually in the form of a causal graph. The traditional causal
discovery methods typically include constraint-based methods [26], score-based methods [27], asymmetric (function)-
based methods [28-29] and other classes of methods [30-32]. In causal inference (CI) [33-34], the causal effect is
estimated, which can be further divided into causal identification and causal estimation [35]. In causal identification,
whether the causal effect can be estimated based on the existing information is determined, and in causal estimation,
specific causal effect values are obtained. There are two mainstream frameworks for causal inference: the structural
causal model (SCM) [33] and the potential outcome model (POM) [34]. Many methods have been developed by previous
researchers based on these two frameworks, such as front-door adjustment, back-door adjustment [35], matching [36-37],
propensity scores [38-40], and double robust regression [41-42].
Although there have been many methods in the field of causal learning, there are still many unsolved problems. In the
past, causality was usually studied on low-dimensional structured data, so there was no need to extract features from the
data. However, with the expansion of application scenarios, many high-dimensional unstructured data need to be
processed, such as images, text and video [43-47]. To discuss causal learning more conveniently and clearly, we refer to all variables involved in causal tasks as causal variables (e.g., variables in causal discovery and variables in causal inference). Even if causal variables are structured, their covariate distributions could be unbalanced, becoming a source
of selection bias. For causal discovery, most methods require strong assumptions (such as linear non-Gaussian, causal
Markov conditions, and faithfulness assumptions) [26, 48-49], which are often impossible to verify. In addition, most of
the traditional causal discovery methods are based on combinatorial optimization [25], which is comparatively
intractable when the number of nodes is large. For causal inference, due to the lack of counterfactual data, the gold
standard for estimating causal effects is randomized controlled trials (RCTs) [50]. However, in reality, we are often unable to
do this due to high costs or ethical constraints. Therefore, it is common to use observational data to make causal
inferences. The key problem with causal inference from observational data is selection bias [33], which consists of
confounding bias (from confounders) and data selection bias (from colliders). Due to selection bias, we may observe
false causality or see correlation as causation. Traditional causal inference methods often have large estimation bias due
to limited fitting ability [51-52]. There are also some common persistent issues in the causal field, such as unknown
interventions [53], unobserved confounders [54], and missing data [55].
We review the use of deep learning methods to address the above problems in the causal learning field from three
points of view: representation, discovery and inference. The three core strengths of deep learning for causal learning are
strong representational capabilities, fitting capabilities, and the ability to approximate data generation mechanisms.
First, deep representation learning for causal variables [22, 56] uses deep learning methods to learn the low-dimensional
balanced structured representation of high-dimensional, unstructured and unbalanced data so that variables can be better
used for causal discovery and causal inference [57]. Second, the universal approximation theorem indicates that neural
networks can learn extremely complex functions to estimate heterogeneous treatment effects with low bias estimators
[58-59]. Because of the general fitting ability and flexibility of neural networks, continuous optimization has become the main approach to the long-standing combinatorial optimization problem in causal discovery and can theoretically scale to large data. Finally, deep learning methods can generate counterfactuals implicitly through
adversarial learning (usually implemented with a generative adversarial network (GAN) [60]) or explicitly model the
data generating process through disentanglement mechanisms to generate proxy/latent variables (usually implemented
with a variational autoencoder (VAE) [61]). Neural network-based methods for modeling data mechanisms require very
little prior knowledge and do not make many assumptions about the relationship between variables, so that deep
learning-based causal inference and causal discovery methods allow the presence of unobserved confounders and can
also make use of intervention data. Figure 1 shows the main difference between traditional causal learning and deep
causal learning. We can clearly see the improvements that deep learning brings to causal learning.
In the past few years, many deep learning-based causal discovery and causal inference methods have been proposed.
There have been many reviews related to causal learning, but few of them summarized how to take advantage of deep
learning to improve causal learning methods. Guo et al. [23] reviewed causal discovery and causal inference methods
Figure 1: The difference between causal learning and deep causal learning. The comparison between (a) and (b) shows the theoretical
advantages of deep causal learning. In the framework of deep causal learning, unstructured data can be processed with the
representational power of neural networks. With the modeling capabilities of neural networks, in causal discovery, observational data
and (known or unknown) intervention data can be comprehensively used in the presence of unobserved confounders to obtain a causal
graph that is closer to the facts. With the fitting ability of neural networks, the estimation bias of causal effects can be reduced in causal
inference. The 4 orange arrows represent the neural network's empowerment of representation, discovery, and inference. (c) and (d)
demonstrate the advantages of deep causal learning in more detail by exploring examples of the effect of exercise on blood pressure. We
assume that the ground truth of exercise on blood pressure is 𝐸(𝑋3|𝑑𝑜(𝑋2 = 𝑥)) = −1.1𝑥 + 84.
with observational data, but very little content was related to deep learning. Yao et al. [34] focused on causal inference
based on the potential outcome framework. It also mentioned some causal inference methods based on neural networks,
but the description was not systematic. Nogueira et al. [62] summarized causal discovery and causal inference datasets,
software, evaluation metrics and running examples without focusing on the theoretical level. Glymour et al. [24] mainly
reviewed traditional causal discovery methods, and Vowels et al. [25] focused on continuous optimization-based
discovery methods. There have been reviews that combined causality with machine learning [22, 63-64], but these
survey papers mainly explored how causal knowledge can be used to solve problems in the machine learning community.
The work of Koch et al. [65] is more similar to our starting point, and focused on the improvements that deep learning
brings to causal learning. However, they only considered the combination of deep learning and causal inference under
the framework of the potential outcome model and did not address other aspects of the field of causal learning, such as
the representation of causal variables, causal discovery, and causal inference under the framework of the structural causal
model. In this article, we argue that deep representation learning for causal variables, deep causal discovery, and deep
causal inference together constitute the field of deep causal learning, as these three parts cover the general process of
exploring causal relationships: representation, discovery, and inference. We present a more comprehensive and detailed
review of the changes that deep learning brings to causal learning. The overall framework of this survey is shown in
Figure 2.
The rest of this article is organized as follows. Section 2 provides basic concepts related to causality, including
structural causal models, potential outcome models, causal discovery and some main types of neural networks used in
deep causal learning. Section 3 reviews three kinds of deep representation learning methods for causal variables based on
regularization adjustment, implicit generation, and explicit generation. Section 4 introduces deep causal discovery methods for observational data, intervention data, and data with unobserved confounders. Section 5 reviews the
deep causal inference methods based on covariate balance, adversarial training and proxy variables. In addition, the deep
structural causal model is introduced. Section 6 provides a conclusion and discusses the future directions of deep causal
learning.
2 PRELIMINARIES
In this section, we briefly introduce causal discovery, causal inference and some main types of neural networks used in
deep causal learning. There are two mainstream frameworks for causal inference: the structural causal model (SCM) and
the potential outcome model (POM). Our survey focuses on the combination of causality and deep learning in both
frameworks. The composition, assumptions, and main methods of these two frameworks are introduced to provide the
necessary background knowledge for subsequent integration with deep learning methods. Table 1 presents the basic
notations used in this article.
Table 1: Basic notations and their corresponding descriptions
In essence, SCM is a subjective abstraction of the objective world, and the involved endogenous and exogenous
variables are heavily dependent on the researchers' prior knowledge. That is, the definitions of these variables themselves
are not necessarily accurate, or the most essential variables cannot be observed due to various limitations. For example,
when studying the effect of a person's family status on their academic performance, we might use the family's annual
income as a proxy variable, although this variable may not be entirely appropriate or even correct.
Definition 2 (Causal Graph). Usually, each SCM model has a corresponding causal graph, which is typically a
directed acyclic graph (DAG) [33]. In fact, the causal graph can be seen as an integral part of the SCM, in addition to counterfactual logic. As shown in Figure 3, there are three basic structures in a causal graph: Chain (a), Fork (b) and Collider (c). These three basic structures constitute a variety of causal graphs. In Figure 3 (e), 𝑋 represents the covariates, 𝑌 represents the outcome, 𝑇 represents the treatment, 𝐶 represents the confounders, 𝐼 represents an instrumental variable, 𝑈𝑡 represents the exogenous variable of 𝑇, and 𝑈𝑦 represents the exogenous variable of 𝑌.
To calculate various causal effects in the SCM framework, we must understand three forms of data corresponding to
Pearl's causal hierarchy (PCH) [35, 66]: observational data, intervention data and counterfactual data. Observational data are collected passively, without any intervention. Causal effects cannot be calculated by relying solely on observational data without making any assumptions. Intervention data are collected after changing the value or distribution of one or a few variables, and from such data the average treatment effect (ATE) can be calculated. Counterfactual data are unavailable in the real world; however, under various assumptions, individual treatment effects (ITE) can be calculated with the help of counterfactual theory. The capabilities of the three levels were discussed in
more detail in previous work [33].
Calculating causal effects is the core of causal inference. The average treatment effect (ATE) under the SCM
framework is a common indicator to measure causal effects. It is defined as follows:
𝐴𝑇𝐸 = 𝐸[𝑌|𝑑𝑜(𝑇 = 1)] − 𝐸[𝑌|𝑑𝑜(𝑇 = 0)]. (1)
Figure 3: Three basic DAGs, a simple structural causal model, and two adjustment criteria.
As shown in Equation (1), the key to calculating the causal effect is to calculate the probability under the intervention. In
the SCM framework, there are many ways to calculate the probability under the intervention in different situations (for
example, whether there are unobserved confounders). Here, we briefly introduce several commonly used methods.
Back-door adjustment. In the causal graph corresponding to the SCM, consider a pair of ordered variables (𝑋, 𝑌). If the variable set 𝑍 contains no descendant node of 𝑋, and 𝑍 blocks every path between 𝑋 and 𝑌 that contains an arrow pointing into 𝑋, then 𝑍 is said to satisfy the backdoor criterion [35] for (𝑋, 𝑌), as shown in Figure 3 (d). If the variable set 𝑍 satisfies the backdoor criterion for (𝑋, 𝑌), then the causal effect of 𝑋 on 𝑌 can be calculated using the following formula:
𝑃(𝑌 = 𝑦|𝑑𝑜(𝑋 = 𝑥)) = ∑𝑧 𝑃(𝑌 = 𝑦|𝑋 = 𝑥, 𝑍 = 𝑧)𝑃(𝑍 = 𝑧).
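For concreteness, the following is a minimal numerical sketch of the expectation form of the back-door formula, 𝐸[𝑌|𝑑𝑜(𝑋 = 𝑥)] = ∑𝑧 𝐸[𝑌|𝑋 = 𝑥, 𝑍 = 𝑧]𝑃(𝑍 = 𝑧), by stratifying on a discrete back-door set; the function name, the assumption that 𝑍 is discrete, and the usage line are illustrative, not part of any specific method reviewed here.

import numpy as np

def backdoor_adjust(x_val, X, Y, Z):
    # Estimate E[Y | do(X = x_val)] by adjusting for a discrete back-door set Z.
    # X, Y, Z are 1-D arrays of observations of equal length.
    effect = 0.0
    for z_val in np.unique(Z):
        p_z = np.mean(Z == z_val)              # P(Z = z)
        mask = (X == x_val) & (Z == z_val)
        if mask.any():
            effect += Y[mask].mean() * p_z     # E[Y | X = x, Z = z] * P(Z = z)
    return effect

# Illustrative use for a binary X:
# ate = backdoor_adjust(1, X, Y, Z) - backdoor_adjust(0, X, Y, Z)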
Front-door adjustment. As shown in Figure 3 (f), a variable set 𝑍 is said to satisfy the front-door criterion [35] for an ordered variable pair (𝑋, 𝑌) if the following conditions are satisfied: 1) 𝑍 intercepts all directed paths from 𝑋 to 𝑌; 2) there is no backdoor path from 𝑋 to 𝑍; and 3) all backdoor paths from 𝑍 to 𝑌 are blocked by 𝑋. If 𝑍 satisfies the front-door criterion for the variable pair (𝑋, 𝑌), and 𝑃(𝑥, 𝑧) > 0, then the causal effect of 𝑋 on 𝑌 is identifiable and is calculated by:
𝑃(𝑌 = 𝑦|𝑑𝑜(𝑋 = 𝑥)) = ∑𝑧 𝑃(𝑍 = 𝑧|𝑋 = 𝑥) ∑𝑥′ 𝑃(𝑌 = 𝑦|𝑋 = 𝑥′, 𝑍 = 𝑧)𝑃(𝑋 = 𝑥′).
Causal Discovery. Using observational data to discover causal relationships between variables is a fundamental and
important problem. Causal discovery has a wide range of applications in many fields, such as social science, medicine,
biology and atmospheric sciences. Causal relations among variables are usually described using a DAG, with the nodes
representing variables and the edges indicating probabilistic relations among them. When performing causal discovery,
causal Markov conditions and faithfulness assumptions are often needed. Traditional causal discovery algorithms are
roughly divided into score-based, constraint-based, and functional-based methods.
Constraint-based methods infer the causal graph by analyzing conditional independence in the data. Classical
algorithms include PC and FCI [26]. The score-based methods search through the space of all possible directed acyclic
graphs (DAGs) representing the causal structure based on some form of scoring function for network structures. A
typical algorithm is GES [27]. Function-based methods require assumptions about the data generation mechanism between variables, and the causal direction is then judged through the asymmetry of the residuals. Representative algorithms include LiNGAM [28] and ANM [29].
Obtaining accurate causal effects requires both facts and counterfactuals, but we can only obtain factual data, so we aim
to approximate counterfactuals through various methods for estimating causal effects. When making causal estimates, we
usually need to rely on the following basic assumptions [15, 67]:
Stable unit treatment value assumption. The effect of treatment on a unit is independent of the treatment
assignment of other units.
Unconfoundedness. The treatment assignment is independent of the potential outcomes given the observed variables, i.e., 𝑇 ⊥ (𝑌(0), 𝑌(1))|𝑋, which means there are no unobserved confounders.
Positivity. Each unit has a nonzero probability of receiving either treatment status when given the observed variables.
Next, we introduce several methods for estimating causal effects that are frequently used in the POM framework.
Matching. Matching pairs each individual in the treatment group with the most similar individual in the control group. The outcomes of matched samples can be regarded as approximate counterfactuals for each other, so causal effects can be calculated by comparing the outcomes of the paired samples. When matching, we can use nearest neighbor matching or set a certain distance threshold.
Propensity score matching. When matching is based on distance, good results cannot be obtained when the covariates are high-dimensional or the samples are insufficient. Therefore, the propensity score matching (PSM) [50] method was proposed, in which the probability 𝑒(𝑋) of a unit being treated is estimated using predictive models [39]. The estimated propensity score is then used as a similarity proxy to implement matching. The propensity score is defined as follows:
𝑒(𝑋) = 𝑃(𝑇 = 1|𝑋). (8)
Inverse propensity weighting. To further address the problem of insufficient samples caused by matching, we introduce another class of techniques based on reweighting. Here, we present the reweighting technique based on the propensity score, known as inverse propensity weighting (IPW) [68]. Each sample is reweighted by its propensity score, which eliminates the estimation bias caused by the imbalance of covariates. The ATE estimated by the IPW method is defined as:
𝐴𝑇𝐸_𝐼𝑃𝑊 = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑇𝑖𝑌𝑖/𝑒̂(𝑋𝑖) − (1/𝑛) ∑_{𝑖=1}^{𝑛} (1 − 𝑇𝑖)𝑌𝑖/(1 − 𝑒̂(𝑋𝑖)), (9)
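As a concrete illustration, the following is a minimal sketch of the IPW estimator in Equation (9), assuming a logistic-regression propensity model; the clipping threshold and function names are illustrative choices rather than part of any specific method reviewed here.

import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, T, Y):
    # X: covariates (n, d); T: binary treatment (n,); Y: outcomes (n,).
    e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]   # estimated propensity e(X)
    e_hat = np.clip(e_hat, 1e-3, 1 - 1e-3)                          # avoid extreme weights
    treated = np.mean(T * Y / e_hat)
    control = np.mean((1 - T) * Y / (1 - e_hat))
    return treated - control                                        # Equation (9)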
Converting unstructured data into structured data falls within the scope of general representation learning and will not be discussed in detail in this article. We
mainly investigate how deep learning can be used to further optimize the representation of causal variables, thereby
reducing the obstacles that may be faced in causal discovery and causal inference. This requires us to think about what
constitutes a good representation in the field of causal learning. Starting from the two fundamental problems of causal
inference (selection bias and lack of counterfactual data), we believe that the following are necessary for a good causal
variable representation: the covariates are balanced, and the counterfactual data or the causal generation mechanism of
the variable can be obtained.
In this section, we introduce three deep representation learning frameworks for causal variables: regularization adjustment for covariate balance, implicit generation of counterfactual data, and explicit generation for disentangling causal mechanisms. This section focuses on the components of these frameworks and the role of each in constructing
representations. More details related to causal effect estimation are elaborated in section 5.
Figure 4: The framework of representation learning based on regularization adjustment for covariate balance.
In the regularization adjustment-based causal representation framework, in addition to regularization terms, the
propensity scores can also be used to enable the representation network to learn balanced features. The first practical
method was DragonNet [80], in which a standard feedforward neural network is used to predict the probability of each
sample receiving treatment and the representation is obtained from the hidden layer. Such representations do not contain
treatment-independent variables and can be used to predict outcomes. The balance of covariates helps to eliminate
selection bias, but paying too much attention to balance will harm counterfactual prediction performance. SITE [81] found that maintaining local similarity while preserving global balance helps improve prediction accuracy. This
method achieves global balance and local similarity and maintains certain predictive abilities while reducing selection
bias. The specific implementation details will be described in section 5.
Figure 5: The basic architecture of implicit generation-based representation learning, consisting of a potential outcome generator 𝐺 and a discriminator 𝐷.
The basic architecture of implicit generation-based representation learning is shown in Figure 5. It consists of
potential outcome generator 𝐺 and potential outcome discriminator 𝐷. The potential outcome generator 𝐺 generates
potential outcomes 𝑦̃ based on covariates 𝑥, treatments 𝑡, and exogenous noise 𝑢; the generated potential outcomes 𝑦̃ and
factual outcomes 𝑦 are then fed into the discriminator 𝐷, which tries to distinguish which is the factual outcome and
which is the generated outcome. Implicit generation models can learn stable representations and improve the accuracy of
subsequent causal effect estimation. The adversarial learning architecture [83] can be used to control for confounders in
the latent space. This means that the generator 𝐺 tries to maximize the error of the discriminator 𝐷 during training, until 𝐷 can no longer distinguish whether training samples come from the control group or the treatment group, thus eliminating the effect of confounders. Therefore, 𝐺 can be used to generate approximate counterfactual data, so it can
be seen as the implicit representation of the counterfactual data. In addition to the most basic adversarial training
structure shown in Figure 5, some studies have extended the method to multivariate treatment variables, continuous data,
and time series, greatly expanding the adaptability of representation learning based on implicit generation.
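To make the adversarial setup concrete, the following is a minimal PyTorch sketch of the generator/discriminator interplay described above; the network sizes, the Gaussian noise 𝑢, and the single-step training function are illustrative assumptions rather than the architecture of any particular method.

import torch
import torch.nn as nn

d_x, d_u = 25, 8  # illustrative covariate and noise dimensions
G = nn.Sequential(nn.Linear(d_x + 1 + d_u, 64), nn.ReLU(), nn.Linear(64, 1))  # potential outcome generator
D = nn.Sequential(nn.Linear(d_x + 1 + 1, 64), nn.ReLU(), nn.Linear(64, 1))    # factual-vs-generated discriminator
bce = nn.BCEWithLogitsLoss()

def adversarial_step(x, t, y_f):
    # x: (n, d_x) covariates, t: (n, 1) float treatment in {0, 1}, y_f: (n, 1) factual outcome.
    u = torch.randn(x.size(0), d_u)                        # exogenous noise
    y_cf = G(torch.cat([x, 1 - t, u], dim=1))              # generated (counterfactual) outcome
    real = D(torch.cat([x, t, y_f], dim=1))
    fake = D(torch.cat([x, 1 - t, y_cf.detach()], dim=1))
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))  # D separates factual from generated
    g_loss = bce(D(torch.cat([x, 1 - t, y_cf], dim=1)), torch.ones_like(real))     # G tries to fool D
    return g_loss, d_loss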
Figure 6: The framework of representation learning based on explicit generation, in which competing expert networks (𝐺1, …, 𝐺𝐾) implement independent mechanisms on top of an encoder-decoder architecture.
Such methods learn independent causal mechanisms from a mixture of shifted data without labels. The design architecture is modular and easily extended to a
variety of situations. Note that the real number of mechanisms is unknown a priori. The IM hypothesis can be used to
identify causal models (i.e., causal discovery), exploiting multi-expert competition to discover independent causal
mechanisms. Each expert is represented by a smaller neural network, and the experts compete for the sample data during
training. Each time, only the winning expert network can update the weight parameters through backpropagation, and the
other expert networks remain unchanged. Both GAN and VAE can be used to implement this method. In addition, CFL [56] proposes using low-level data to discover high-level causal relations. In this way, variables that are
better suited for causal expression can be learned from low levels, which is not necessarily appropriate for all data
settings. One of the benefits of such a method is avoiding preconceived biases.
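The winner-take-all competition can be sketched as follows, assuming K small expert networks and a simple reconstruction error as the competition score; the dimensions and the use of mean squared error are illustrative simplifications.

import torch
import torch.nn as nn

K, d = 4, 10  # illustrative number of experts and data dimension
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d)) for _ in range(K)
)
optimizers = [torch.optim.Adam(e.parameters(), lr=1e-3) for e in experts]

def competition_step(x_shifted, x_canonical):
    # Each expert tries to map shifted samples back to the reference distribution;
    # only the winning expert is updated, the others remain unchanged.
    losses = [nn.functional.mse_loss(e(x_shifted), x_canonical) for e in experts]
    winner = int(torch.argmin(torch.stack([l.detach() for l in losses])))
    optimizers[winner].zero_grad()
    losses[winner].backward()
    optimizers[winner].step()
    return winner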
Figure 7: Causal discovery using (a) observed data, (b) known and unknown intervention data, and (c) data with unobserved confounders.
of nodes, and ℎ(∙) is a smooth function over real matrices. This formulation is concise, powerful and ingenious, and most of the subsequent gradient-based methods are extensions of it. Note that in practice the learned value of ℎ(𝑊) is usually small but not exactly 0, so a threshold must be set in most cases.
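For reference, the smooth acyclicity function used by NOTEARS is ℎ(𝑊) = tr(𝑒^{𝑊∘𝑊}) − 𝑑, which equals zero exactly when the weighted adjacency matrix 𝑊 corresponds to a DAG. A minimal sketch follows, with an illustrative pruning threshold:

import numpy as np
from scipy.linalg import expm

def notears_h(W):
    # h(W) = tr(exp(W * W)) - d, where W * W is the elementwise square; h(W) = 0 iff W is acyclic.
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

def prune(W, tau=0.3):
    # In practice the optimized h(W) is small but rarely exactly 0, so small weights are thresholded.
    W_pruned = W.copy()
    W_pruned[np.abs(W_pruned) < tau] = 0.0
    return W_pruned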
NOTEARS performs well under the linear SEM assumption, and many subsequent works have extended it to the nonlinear setting. DAG-GNN [89] introduces neural networks into the process of causal discovery, extending it to nonlinear scenarios. Using an encoder-decoder architecture, the adjacency matrix is obtained during network training to complete the structure selection. In Equations (12)-(14), 𝐴 is the adjacency matrix, 𝑋 is a sample of a joint distribution and 𝑍 is the noise matrix. 𝑓𝑖 is the function that implements the transformation between 𝑋 and 𝑍, usually implemented using a neural network. Following the encoder-decoder idea, GAE [90] uses a graph autoencoder to better exploit graph structure information for causal graph structure discovery. Another method that
uses neural networks to adapt to nonlinear scenarios is GraN-DAG [91]. It can handle the parameter families of various
conditional probability distributions. The idea behind it is similar to that of DAG-GNN, but it achieves better results in
experiments.
A new indicator based on the reconstruction error of an autoencoder is proposed in AEQ [92]. Different indicator values are used to distinguish the causal directions, and identification performs better in the univariate case. CASTLE [93] uses causal discovery as an auxiliary task to improve generalization when training a supervised model. In CASTLE, the adjacency matrix of the DAG is learned through continuous optimization and embedded into the input layer of the FNN. Essentially, the learned causal graph is used as an autoencoder-based regularization to improve
the model’s generalization by reconstructing only causal features. In CAN [94], high-quality and diverse samples are
generated from conditional and interventional distributions. Here, the relationship between variables is represented by a
matrix, which is continuously optimized during the training process. In practical implementations, the selection of
interventions is achieved through mask vectors. Finding the complete causal relationship from the data often requires
assumptions about the data-generating mechanism, and the data used by CAN often do not satisfy these assumptions, so
there is no guarantee that a true causal graph will be found.
Causal discovery in time series is the key to many fields in science [95]. However, most of the available causal
discovery methods for nontime series are not directly applicable to time series data. There are several issues to consider
for time series data, such as sampling frequency and unobserved confounders. At the same time, causal discovery on
time series data must make the necessary assumptions [96-97]; for example, the cause must occur before the effect, and
there is no instantaneous causal effect. For time series, there are two types of causal graphs: full-time graphs and
summary graphs. Full-time graphs depict the causal relationship between variables at each moment, and summary graphs
depict the causal relationship during this time. Therefore, the causal graph obtained from the time series is likely to have
cycles, which does not satisfy the DAG condition.
The most common method for discovering causal relationships in time series is Granger causality (GC) [98]. Although GC can achieve sensible results under linear assumptions, its results in nonlinear scenarios are often unsatisfactory. There are many variations of GC-based methods that address the nonlinear problem
[99-102]. NGC [102] separates the functional representation of each variable to achieve an effective distinction between
cause and effect to a certain extent. NGC provides a neural network for each variable 𝑖 to calculate the influence of other
variables on it. If a column of the obtained weight matrix is 0, it means that the corresponding variable has no Granger
causality to the variable 𝑖 . The core of the NGC is to design a structured sparsity-induced penalty to achieve
regularization so that the Granger causality and the corresponding time delay 𝑡 can be selected at the same time.
min_𝑊 ∑_{𝑡=𝐾}^{𝑇} (𝑥𝑖𝑡 − 𝑔𝑖(𝑥_{(𝑡−1):(𝑡−𝐾)}))² + 𝜆 ∑_{𝑗=1}^{𝑃} 𝛺(𝑊¹_{:𝑗}), (15)
where Ω(⋅) is the penalty, W is the weights matrix and 𝑔𝑖 is the function of the relationships among variables.
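A minimal sketch of the structured sparsity idea in Equation (15): each series has its own small network, and the first-layer weights are grouped by input series and penalized with a group-lasso term so that whole groups can shrink to zero. The network sizes and the grouping convention are illustrative assumptions.

import torch
import torch.nn as nn

P, K, H = 5, 10, 32  # illustrative: number of series, lag length, hidden units

class SeriesPredictor(nn.Module):
    # g_i: predicts series i at time t from the K past lags of all P series.
    def __init__(self):
        super().__init__()
        self.first = nn.Linear(P * K, H)                  # first-layer weights, grouped by input series
        self.out = nn.Sequential(nn.ReLU(), nn.Linear(H, 1))

    def forward(self, lagged):                            # lagged: (batch, P * K)
        return self.out(self.first(lagged))

def group_penalty(model):
    # Sum over input series j of the L2 norm of the first-layer weights attached to j.
    # If a whole group is driven to zero, series j is judged not to Granger-cause series i.
    W1 = model.first.weight.view(H, P, K)
    return sum(torch.norm(W1[:, j, :]) for j in range(P))

# Objective for series i (sketch): mse(g_i(lagged), x_it) + lam * group_penalty(g_i)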
Successive images also naturally imply possible causal knowledge. VCC [103] learns causality from sequential
images combined with context. It attempts to identify the causal relationship between the events extracted from the
pictures by using cross-attention to calculate the causal score of one image against another. A high-quality causal image
annotation dataset known as Vis-Causal is proposed.
Convergent Cross Mapping (CCM) [104] is used to find causal relationships between time series variables in dynamical
systems. It is especially suitable for nonlinear weakly coupled dynamical systems. NSM [105] first uses AIR [106] to
learn a low-dimensional time series representation of variables from video data and then reconstructs the time series
using the time series delay to obtain its nearest neighbors at each time point. Then, the encoder is used to obtain the
vector representation of the nearest neighbor sequence, calculate the correlation coefficient matrix, and judge the
existence and direction of the causal relationship through the value of the correlation coefficient.
4.2.1 Known Intervention
Unlike methods dedicated solely to causal discovery, many deep learning-based methods use causal discovery as an auxiliary function to obtain causal knowledge. MTCD [107] defines a meta-learning objective to measure the speed of adaptation: the correct causal direction adapts to an intervention faster when sampling, so the speed of adaptation is used as a score for causal discovery. CBCD [108] utilizes a VAE to extract binary conceptual causal variables from unstructured data that can explain the classifier. This is a partial causal discovery method because, although it discovers the causal variables that explain the results, the relationships between these causal variables are not explored. In CausalVAE [109], a causal layer is used to convert independent exogenous variables into endogenous variables with causal significance. In this method, the causal relationships are assumed to be linear.
𝑣𝑖 = 𝑓𝑖 (𝐴𝑖 ∘ u; 𝜃𝑖 ) + 𝑢𝑖 , (17)
where 𝐴 is the adjacency matrix to be learned, 𝑢 is the exogenous factor that follows the Gaussian distribution, 𝑣 is the
structured representation of the variable, 𝑓𝑖 represents the functional relationship between variables, and 𝜃𝑖 is the
parameter of 𝑓𝑖. The observed variables 𝑥 are passed through the encoder to generate independent exogenous variables 𝑢, and then the causal layer module is used to convert them into endogenous variables 𝑣 with causal meaning. Then, a mask mechanism is used to select the intervention variable from 𝑣, and the observed variable 𝑥 is reconstructed via the decoder. In this way, CausalVAE can discover the causal relationships encoded in the adjacency matrix 𝐴.
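A minimal sketch of a causal layer in the spirit of Equation (17): each endogenous variable 𝑣𝑖 is produced by a small network applied to the exogenous factors selected by the learned adjacency matrix 𝐴, plus the exogenous term 𝑢𝑖. The dimensions, the column-wise masking, and the omission of any acyclicity constraint on 𝐴 are illustrative simplifications.

import torch
import torch.nn as nn

d = 4  # illustrative number of concepts

class CausalLayer(nn.Module):
    # Converts independent exogenous factors u into causally related endogenous variables v,
    # following v_i = f_i(A_i ∘ u) + u_i as in Equation (17).
    def __init__(self):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(d, d))          # adjacency matrix to be learned
        self.f = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1)) for _ in range(d)
        )

    def forward(self, u):                                 # u: (batch, d)
        v = [self.f[i](self.A[:, i] * u).squeeze(-1) + u[:, i] for i in range(d)]
        return torch.stack(v, dim=1)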
4.2.2 Unknown Intervention
Interventions may also be unknown; that is, it is not known which variables were intervened on. A new class of approaches is
needed to deal with this situation. The SDI method [53] enables the simultaneous discovery of causal diagrams and
structural equations in the presence of unknown interventions. This method is score-based, including iterative and
continuous optimization. Considering that the structural parameters and functional parameters are not independent and
affect each other, the structural representation of the DAG and the functional representation of a set of independent
causal mechanisms are jointly trained until convergence. The first step is parameterization. There are two types of
parameters in this method: structural parameters (i.e., an adjacency matrix of 𝑀 ∗ 𝑀) and function parameters (i.e., 𝑀
function parameters). Then, a multilayer perceptron is used to fit the observed data to update the parameters. The score of
the graph is further obtained with the intervention data, which includes the penalty term for cyclic graphs.
𝑔𝑖𝑗 = ∑𝑘 (𝜎(𝛾𝑖𝑗) − 𝑐𝑖𝑗^{(𝑘)}) 𝐿_{𝐶,𝑖}^{(𝑘)}(𝑋) / ∑𝑘 𝐿_{𝐶,𝑖}^{(𝑘)}(𝑋), ∀𝑖, 𝑗 ∈ {0, … , 𝑀 − 1}, (18)
where 𝛾 are the structural parameters and 𝐿_{𝐶,𝑖}^{(𝑘)} denotes the log-likelihood of variable 𝑋𝑖.
However, SDI cannot handle dynamic systems well because it can only learn one causal graph at a time. To better
address the problem of causal discovery in dynamic systems, CRN [110] trains a learner to sample from different causal
mechanisms each time. In each training, several interventions and the neural network are used to synthesize the
distribution of multiple intervention data to learn the causal graph. This method can achieve domain adaptation to a
certain extent and can also accumulate prior knowledge for subsequent structural learning. Specifically, at the beginning
of each episode, a new causal graph is selected as the learning target, an intervention is randomly selected at each time,
the outcome after each intervention is predicted, and the neural network is used to train the adjacency relationship matrix.
Each episode uses 𝑘 different interventions to achieve structural learning.
Unobserved confounders can induce spurious dependence between observed variables that do not have a causal relationship between them, as shown in Figure 7 (c). Among traditional approaches, there are some methods that deal with unobserved confounders, such as FCI. However, most of them are based on combinatorial optimization [111], which is not efficient enough. Here, we introduce causal discovery methods based on deep neural networks that can more efficiently and accurately find causal relationships in the presence of unobserved confounders.
In CGNN [112], the prior form of the function is not set and MMD is used as a metric to calculate the score of each
graph to evaluate how well each causal graph fits the observed data. The generation mechanism can be used to generate
distributions that are arbitrarily close to the observed data. The most important contribution of this method is the formal definition of a functional causal model (FCM) with latent variables: an exogenous variable 𝑈𝑖𝑗 that affects both variables 𝑖 and 𝑗 is introduced to handle unobserved confounders, and it is proven that learning with backpropagation remains possible. The maximum mean discrepancy (MMD) is defined as shown in Equation (19), where 𝑘(∙) is the Gaussian kernel:
𝑀𝑀𝐷̂_𝑘(𝐷, 𝐷̂) = (1/𝑛²) ∑_{𝑖,𝑗=1}^{𝑛} 𝑘(𝑥𝑖, 𝑥𝑗) + (1/𝑛²) ∑_{𝑖,𝑗=1}^{𝑛} 𝑘(𝑥̂𝑖, 𝑥̂𝑗) − (2/𝑛²) ∑_{𝑖,𝑗=1}^{𝑛} 𝑘(𝑥𝑖, 𝑥̂𝑗). (19)
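A minimal NumPy sketch of the empirical MMD in Equation (19) with a Gaussian kernel; the bandwidth is an illustrative choice.

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) evaluated for all pairs of rows of A and B.
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dist / (2 * sigma**2))

def mmd(D, D_hat, sigma=1.0):
    # Empirical MMD between observed samples D and generated samples D_hat (Equation (19)).
    return (gaussian_kernel(D, D, sigma).mean()
            + gaussian_kernel(D_hat, D_hat, sigma).mean()
            - 2 * gaussian_kernel(D, D_hat, sigma).mean())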
SAM [113] addresses the computational limitation of CGNN. Unlike standard structural equations, SAM incorporates all variables other than the target variable itself into each equation, which is why it is called "structural agnostic". In SAM, noise matrices are used instead of noise variables for variable pairs. The introduction of differentiable parameters in the correlation matrix allows the SAM structure to efficiently and automatically exploit the relations between variables, thus providing SAM with a way to discover correlations arising from unobserved variables.
In the structural equations of SAM, 𝑓̂𝑗 is the nonlinear function, 𝜙𝑗,𝑘 is the feature, 𝑧𝑗,𝑘 is the Boolean selection vector and 𝐸𝑗 is the noise variable.
Similar to other neural network-based methods, SAM and CGNN suffer from instability; the randomness of the
learning process or neural network initialization can influence their final performance and predictions even with the same
data and parameters. Such instability can be mitigated by computing multiple runs independently on the same setup and
then averaging the results. ACD [114] is used to deal with the presence of unobserved confounders in time series data.
The core idea of this method is that different causal graphs may belong to the same dynamical system and therefore
potentially have much common information. Therefore, the goal of this method is to obtain a model that can realize
causal discovery for different causal graphs belonging to the same dynamic system. During sampling, multiple causal
graphs and their corresponding data distributions belonging to the same dynamic system are selected, and each causal
graph is used as a training sample. By extending the amortized encoder, it is possible to predict an additional variable,
combined with structural knowledge, which can be used to represent unobserved confounders. Moreover, ACD can learn
the causal mechanism of this dynamic system from the different causal graphs, which can reduce the influence of
unobserved confounders to a certain extent.
In addition to the above mentioned methods for dealing with time series, V-CDN [115] can discover causal relations
from video without ground truth by extracting key points from videos, learning causal relationships between key points,
and predicting future dynamics using dynamical system interactions. The V-CDN has three modules: visual perception,
structure inference and dynamics prediction. The visual perception module is used to extract key points from the image.
Structure inference uses the extracted key points and graph neural networks to learn the causal graph. With the learned
causal graph, the dynamics module is used to learn the future condition of these key points.
In this section, we reviewed causal discovery methods based on neural networks. According to the type of data, they are divided into causal discovery driven by observational data, causal discovery driven by intervention data, and causal discovery in the presence of unobserved confounders. Table 2 summarizes the main deep causal discovery methods, and some of these methods may apply to more than one data situation.
Table 2: The main deep causal discovery methods
5 DEEP CAUSAL INFERENCE
Most traditional causal inference methods operate directly on the original low-dimensional feature space, and their performance may be limited. With the popularity of deep learning, many studies have begun to use the powerful fitting ability of neural networks to explore the relationship between treatment and outcome.
The core problems of causal inference are the missing counterfactual data and selection bias, as shown in Figure 8. Of
the two problems, the former is more fundamental because once counterfactual data are obtained, the estimation of
causal effects will be very simple and natural. Therefore, from the perspective of solving the fundamental problems, the
methods of causal inference can be divided into selection bias-oriented and counterfactual data-oriented methods.
Existing deep learning-based causal inference methods can be roughly divided into four categories: covariate balance-
based methods adjust covariates to balance the distribution of covariates in different treatment groups, thereby
eliminating selection bias; adversarial training-based methods utilize adversarial training to make the discriminator
unable to distinguish between the real data and the data generated by the generator, thereby realizing the generation of
implicit counterfactual data; proxy variable-based methods model the data generation mechanism as the joint action
between multiple latent variables to achieve explicit counterfactual generation; deep structural causal model methods
usually combine the SCM and the neural network structure, using the structural information of the SCM and the fitting
ability of the neural network to model the data generation mechanism and realize counterfactual generation. The
relationship between the two core problems and the main classes of methods is shown in Figure 8.
Figure 8: The methods to solve the fundamental problems of causal inference using deep learning. The most fundamental problem in
causal inference is the missing counterfactual data. Due to the lack of counterfactual data, only observational data can be used to
estimate causal effects, leading to selection bias. In essence, if the data generation mechanism can be modeled, then the "counterfactual
data" can be approximated, and the problem of causal inference can be solved. In “Selection Bias”, gray and white nodes represent
individuals with different covariates. In “Missing Counterfactual Data”, white and gray nodes represent outcomes under two treatments,
i.e., fact and counterfactual, and only one of them can be observed.
𝐵_{𝐻,𝛼,𝛾}(𝛷, ℎ) = (1/𝑛) ∑_{𝑖=1}^{𝑛} |ℎ(𝛷(𝑥𝑖), 𝑡𝑖) − 𝑦𝑖^𝐹| + 𝛼 𝑑𝑖𝑠_𝐻(𝑃̂_𝛷^𝐹, 𝑃̂_𝛷^𝐶𝐹) + (𝛾/𝑛) ∑_{𝑖=1}^{𝑛} |ℎ(𝛷(𝑥𝑖), 1 − 𝑡𝑖) − 𝑦_{𝑗(𝑖)}^𝐹|, (21)
where 𝛷 is the learned representation and 𝑑𝑖𝑠𝐻 (∙,∙) is the distance measure. By minimizing the loss function equation
(21), the BNN can simultaneously accomplish counterfactual inference and covariate balance. Unbalanced distributions
of covariates are helpful for prediction but affect the estimation of causal effects. A balanced distribution can reduce the
prediction variance when the treatment variable is shifted. In the BNN, the network structure only has one head. This
means that the treatment assignment information 𝑡𝑖 needs to be concatenated to the representation of covariate 𝛷(𝑥𝑖 ). In
most cases, 𝛷(𝑥𝑖 ) is high-dimensional, so the information of 𝑡𝑖 might be lost during training.
To address the problem of the BNN, a new architecture was proposed in the CFR [75], which has two separate heads
representing the control group and the treatment group, which share a representation network. This architecture avoids
the loss of treatment variable 𝑡 during network training. In the actual training process, according to the value of 𝑡𝑖 , each
sample is used to update the parameters of the corresponding header. Using the integral probability metric (IPM) [76-77]
to measure the distance of control and treated distributions 𝑝(𝑥|𝑡 = 0) and 𝑝(𝑥|𝑡 = 1) was also proposed. In BRNN [58],
the MSE is decomposed into bias and variance, and the estimation ability of the multi-head model is compared with that
of the single-head model. A new regularization term, PRG, was introduced to assess differences between the treatment
and control groups. It was proven that the estimation bias of PSM increases with the increase in the covariate dimension.
Two methods for estimating the ITE were compared: inductive inference and transductive inference. Because inductive inference shares the same representation layers, it may have less noise and bias. This inspires the use of more statistical quantities as regularization terms to constrain the network and obtain better estimates.
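A minimal PyTorch sketch of the shared representation with two outcome heads and a representation-balance penalty; here the IPM is approximated by a simple mean-embedding (linear-MMD) discrepancy, which is an illustrative substitute for the Wasserstein or MMD penalties used in practice.

import torch
import torch.nn as nn

d_x, d_r = 25, 32  # illustrative dimensions
phi = nn.Sequential(nn.Linear(d_x, 64), nn.ReLU(), nn.Linear(64, d_r))  # shared representation
h0 = nn.Sequential(nn.Linear(d_r, 32), nn.ReLU(), nn.Linear(32, 1))     # head for t = 0
h1 = nn.Sequential(nn.Linear(d_r, 32), nn.ReLU(), nn.Linear(32, 1))     # head for t = 1

def cfr_loss(x, t, y, alpha=1.0):
    # x: (n, d_x); t: (n, 1) float in {0, 1}; y: (n, 1).
    # Factual prediction loss plus a balance penalty between treated and control representations.
    r = phi(x)
    y_hat = torch.where(t.bool(), h1(r), h0(r))
    factual = nn.functional.mse_loss(y_hat, y)
    treated = t.squeeze(-1).bool()
    imbalance = torch.norm(r[treated].mean(0) - r[~treated].mean(0))
    return factual + alpha * imbalance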
The previous methods are only applicable to binary treatments, and PM [117] extends the approach to multiple treatment settings by using the balancing propensity score to perform matching and estimating the counterfactual outcomes with nearest neighbors. Matching is performed at the minibatch level rather than the dataset level, which can reduce variance. RCFR [118] alleviates the bias of the BNN [74] when sample sizes are large by reweighting samples according to imbalance and variance terms. IPM metrics and regularization terms are still used here:
𝐿_𝜋(ℎ, 𝛷, 𝑤; 𝛽) = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑤𝑖 𝑙_ℎ(𝛷(𝑥𝑖), 𝑡𝑖) + (𝜆_ℎ/√𝑛) 𝑅(ℎ) + 𝛼 𝐼𝑃𝑀_𝐺(𝑝̂_{𝜋,𝛷}, 𝑝̂_{𝜇,𝛷}) + (𝜆_𝑤/𝑛) ‖𝑤‖². (22)
In the previous weighting methods, the calculation of the propensity score often requires regression approximation in practice. However, there are many problems with using the approximated propensity score as the weight. BWCRF [73] proposes a function called balancing weights to trade off balance and predictability, as shown in Equation (23). Rather than directly balancing the groups, it performs a weighted balancing of the representations. Through a deeper analysis of the bound, it is proven theoretically that a balanced feature distribution is beneficial to the model.
𝑤(𝑥, 𝑡) = 𝑓(𝑥) / (𝑡 ∙ 𝑒(𝑥) + (1 − 𝑡) ∙ (1 − 𝑒(𝑥))). (23)
Selection bias can also be addressed by transforming the counterfactual problem into a domain adaptation problem. CARW [119] integrates a context-aware weighting scheme that leverages importance sampling, built on [75], to better address selection bias. SITE [81] is a balanced representation learning method that preserves local similarity. The balance of covariates helps to eliminate selection bias, but paying too much attention to balance will affect counterfactual prediction performance. SITE achieves global balance and local similarity and maintains certain predictive abilities while reducing selection bias. The method works on mini-batches and selects triplet pairs. There are two core components: the PDDM and the MPDM. The PDDM is used to preserve local similarity information, and the MPDM is used to achieve balanced distributions in the latent space. The prediction network is then used to obtain the prediction loss of the potential outcomes, which is combined with the PDDM and MPDM terms to form the overall loss function.
Using the neural network to directly fit the relationship of covariate 𝑋 to outcome 𝑌 can cause many problems; for
example, the neural network may use all the variables for 𝑌 prediction. In fact, these variables are not needed and should
not all be used for estimating causal effects. Covariates can be divided into instrumental variables (only affecting
treatment), adjustment variables (only affecting outcome), irrelevant variables (having no effect on either treatment or
outcome), and confounders (the cause of both treatment and outcome). When learning representations, SCRNet [120] balances only the representations of the confounders, which are then concatenated with the representations of the adjustment
variable. This approach reduces computational overhead and increases efficiency in practical applications. However, the
division of variables is usually subjective, especially when the true causal graph cannot be obtained. DragonNet [80] is a
method for using neural networks to find those covariates that are associated with treatment and only use these variables
to predict the outcome. First, a deep neural network (DNN) is trained to predict 𝑇, and then the last prediction layer is
removed to obtain the representation 𝛷. Next, similar to the TARNet [75], two separate DNNs are used to predict the
outcome at 𝑡 = 0 and 𝑡 = 1. Essentially, 𝛷 stands for the representation related only to 𝑇, i.e., represents the propensity
score. The objective function is:
𝑅̂(𝜃; 𝑋) = (1/𝑛) ∑𝑖 [(𝑄_𝑛𝑛(𝑡𝑖, 𝑥𝑖; 𝜃) − 𝑦𝑖)² + 𝛼 CrossEntropy(𝑔_𝑛𝑛(𝑥𝑖; 𝜃), 𝑡𝑖)], (26)
where 𝑄𝑛𝑛 (⋅) is the DNN to model the outcome and 𝑔𝑛𝑛 (⋅) is the DNN to model the propensity score model. 𝛼 is the
hyperparameter to weight the loss term. Simultaneous training to predict propensity scores and outcomes ensures that the
features used are treatment-relevant. Compared to TARNet [75], DragonNet has an additional head that predicts the propensity score. DragonNet makes a clear distinction between the prediction of outcomes and the estimation of causal effects because accurate predictions do not mean that causal effects can be accurately estimated. In DragonNet, 𝑔𝑛𝑛(⋅) is used to find the confounding factors in the covariates so that the resulting representation contains only the part of 𝑋 related to the treatment (according to the sufficiency of the propensity score). DragonNet also uses targeted regularization from semiparametric estimation theory. The purpose of targeted regularization is to ensure asymptotic consistency and fast convergence when the semiparametric estimating equation is satisfied.
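A minimal sketch of the three-headed objective in Equation (26): a shared representation feeds two outcome heads and a propensity head, trained jointly; the layer sizes are illustrative and the targeted-regularization term is omitted.

import torch
import torch.nn as nn

d_x, d_r = 25, 64  # illustrative dimensions
z = nn.Sequential(nn.Linear(d_x, 200), nn.ReLU(), nn.Linear(200, d_r), nn.ReLU())  # shared representation
q0 = nn.Linear(d_r, 1)   # outcome head for t = 0
q1 = nn.Linear(d_r, 1)   # outcome head for t = 1
g = nn.Linear(d_r, 1)    # propensity head

def joint_loss(x, t, y, alpha=1.0):
    # x: (n, d_x); t: (n, 1) float in {0, 1}; y: (n, 1).
    # Outcome loss plus weighted propensity loss, as in Equation (26); targeted regularization omitted.
    r = z(x)
    q = torch.where(t.bool(), q1(r), q0(r))
    outcome_loss = nn.functional.mse_loss(q, y)
    propensity_loss = nn.functional.binary_cross_entropy_with_logits(g(r), t)
    return outcome_loss + alpha * propensity_loss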
In DCN [121], the outcome-prediction architecture is similar to that of DragonNet, but a separate neural network is used to model the propensity score. At each training step, each head of the outcome-prediction DNN is dropped out [125-126] with a probability that depends on the propensity score, so the balancing effect of the propensity score is reflected implicitly in the neural network. Deep-treat [123] divides the counterfactual
prediction problem into two steps: the first step uses an autoencoder to learn representations that trade off bias and
information loss (by controlling the hyperparameter λ); then, the FNN is used on the transformed data to achieve
treatment allocation. There is a module for learning propensity scores in an autoencoder (AE) to minimize the
reconstructed feature representation error for different populations. This method makes the representation as accurate as
possible and makes the representation distribution of different populations as close as possible to achieve covariate
balance. At the same time, Deep-treat is suitable for the case of multivariate treatments. Table 3 summarizes the main covariate balance-based deep causal inference methods.
Table 3: Covariate balance-based causal inference methods
Table 4: Adversarial training-based causal inference methods
and predicted outcomes. In the architecture of CF-Net, the CP module is designed to eliminate the influence of
confounders when extracting features and has achieved good results in the medical field. Another adversarial learning
architecture, CTAM [83], is used to control the confounders in the latent space. The discriminator 𝐷 is trained to predict the treatment assignment, while the representation learner tries to maximize the discriminator's error during training, until 𝐷 can no longer distinguish whether training samples come from the control group or the treatment group, thus eliminating the effect of confounders.
There are also approaches that combine adversarial and balanced approaches, i.e., using adversarial training to
produce covariate-balanced representations [135]. DeepMatch [134] uses adversarial training to balance covariates in
such situations. DeepMatch uses the discriminative discrepancy metric in the context of NNs and requires a few further
developments of alternating gradient approaches similar to GAN. DeepMatch uses the idea of adversarial learning to
learn stable representations and improves the accuracy of subsequent causal effect estimation. Another feature balancing
approach that uses adversarial training is χ-GAN [133]. It introduces the importance sampling theory to minimize the
variance of causal effect estimation. This method can better handle samples whose propensity scores are close to 0 or 1
and the prediction results are more stable. There is also a class of methods that consider time-related confounding factors,
which can help to understand how results change over time. The previously mentioned static-based methods are often not
directly applicable to time series scenarios. CRN [132] is based on a recurrent neural network and uses adversarial training to obtain feature representations that are invariant to treatment assignment over time, thereby removing time-dependent confounding bias. This approach is well suited to precision medicine and can be used to answer key questions about when to treat patients, when to stop, and how to determine the dosage. Table 4 summarizes the main adversarial training-based deep causal inference methods.
5.3 Proxy Variable-based Causal Inference
For causal inference, there is generally an "unconfoundedness" assumption, meaning that there are no unobserved
confounders. Most existing methods of estimating causal effects are based on this assumption, such as the traditional
methods (matching, propensity score-based methods) or the balance-based methods mentioned earlier. The
aforementioned methods are guaranteed to recover the true causal effect only when all confounders are observed. However, in many cases, this assumption is not satisfied. Once there are unobserved confounders, the previous methods will have large
bias and even lead to incorrect conclusions. This subsection discusses how to use neural networks for causal inference in
the presence of unobserved confounders.
Although it may not be possible to observe all confounders, there is generally a way to measure the proxy variables of
the confounders. Proxy variable-based methods utilize proxy variables to separate different types of variables using
disentangled representations. By collecting a large number of observed variables, proxies for most confounders can usually be covered, which is easy to achieve in today's era of big data. Exactly how these proxy variables are used depends on their relationship to the unobserved confounders, treatments, and outcomes. There have also been many studies of causal identification based on proxy variables in the presence of unobserved confounders.
Deep latent variable techniques can use noisy proxy variables to infer unobserved confounders [136]. CEVAE [137]
uses latent variable generative models to discover unobserved confounders from the perspective of maximum likelihood.
One of the typical scenarios is shown in Figure 9 (a). This approach requires fairly weak assumptions about the data
generation process and the structure of unobserved confounders. It uses the VAE [138-139] architecture and contains
both an inference network and a model network. The inference network is equivalent to an encoder, and the model
network is equivalent to a decoder. The nonlinearity of the neural network is used to detect the nonlinearity of causal
effects; that is, the neural network is used to fit the causal relationship between the variables. The core of this approach is
the use of variational inference to obtain the probability distribution needed to estimate causal effects, 𝑝(𝑋, 𝑍, 𝑡, 𝑦). We
use 𝑍 to denote the hidden variable, and 𝑋 is the proxy variable, which has no direct effect on outcome 𝑦 or treatment 𝑡. Actually, 𝑍 can be seen as the latent variable in the VAE. The inference network is used to obtain 𝑞(𝑡|𝑥) and 𝑞(𝑧|𝑡, 𝑦, 𝑥), where 𝑞(𝑡|𝑥) can be seen as the propensity score. The model network is used to obtain 𝑝(𝑡|𝑧), 𝑝(𝑥|𝑧) and 𝑝(𝑦|𝑡, 𝑧). Then, the
framework is used to model the generation mechanism when unobserved confounders exist:
𝐿 = ∑_{𝑖=1}^{𝑁} 𝐸_{𝑞(𝑧𝑖|𝑥𝑖,𝑡𝑖,𝑦𝑖)} [log 𝑝(𝑥𝑖, 𝑡𝑖|𝑧𝑖) + log 𝑝(𝑦𝑖|𝑡𝑖, 𝑧𝑖) + log 𝑝(𝑧𝑖) − log 𝑞(𝑧𝑖|𝑥𝑖, 𝑡𝑖, 𝑦𝑖)]. (27)
In practice, to better estimate the distributions of the parameters, two terms are added to the variational lower bound to form 𝐹�sub𝐶𝐸𝑉𝐴𝐸, where 𝑦𝑖∗, 𝑥𝑖∗ and 𝑡𝑖∗ are the observed input values. The advantage of CEVAE is that it can cope well with the presence
of unobserved confounders, but the disadvantage is the lack of theoretical guarantees. Furthermore, the latent variable
can be divided into risk latent variable 𝑧𝑦 , instrumental latent variable 𝑧𝑡 , confounding latent variable 𝑧𝑐 and noisy latent
variable 𝑧𝑜 . This more granular division can help obtain an accurate distribution when making variational inferences to
facilitate subsequent estimation of causal effects. A more complete division is shown in the lower right corner “Proxy
variable” of Figure 8.
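A minimal sketch of the inference (encoder) and model (decoder) networks and the negative of the objective in Equation (27), assuming Gaussian latents, Gaussian 𝑥 and 𝑦, and Bernoulli 𝑡; all architectural details are illustrative.

import torch
import torch.nn as nn

d_x, d_z = 25, 16  # illustrative dimensions
enc = nn.Sequential(nn.Linear(d_x + 2, 64), nn.ReLU(), nn.Linear(64, 2 * d_z))  # q(z | x, t, y)
dec_x = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, d_x))        # p(x | z)
dec_t = nn.Linear(d_z, 1)                                                        # p(t | z)
dec_y = nn.Sequential(nn.Linear(d_z + 1, 64), nn.ReLU(), nn.Linear(64, 1))      # p(y | t, z)

def neg_elbo(x, t, y):
    # Reconstruction terms correspond to log p(x, t | z) + log p(y | t, z); the KL term to log p(z) - log q(z | x, t, y).
    mu, logvar = enc(torch.cat([x, t, y], dim=1)).chunk(2, dim=1)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)          # reparameterization trick
    rec_x = nn.functional.mse_loss(dec_x(z), x)
    rec_t = nn.functional.binary_cross_entropy_with_logits(dec_t(z), t)
    rec_y = nn.functional.mse_loss(dec_y(torch.cat([z, t], dim=1)), y)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_x + rec_t + rec_y + kl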
For the estimation of causal effects, when there are too many control variables, the ability to estimate is weak; when
unnecessary variables are included, it results in suboptimal nonparametric estimation. In the high-dimensional case,
Figure 9: Typical scenarios for proxy variable-based causal inference methods. Gray nodes represent unknown variables or variables
that cannot be observed.
many variables are not confounders and need to be excluded from adjustment. This leads to a dilemma: including too many unnecessary variables will increase the bias and variance of the estimate, reducing its accuracy, while limiting the included variables may cause confounders to be missed and introduce selection bias.
To solve this problem, TEDVAE [140] divides the covariates into three categories: confounders, instrumental factors and risk factors. Confounders affect both the cause and the effect, instrumental factors affect only the cause, and risk factors affect only the effect. TEDVAE uses variational inference to infer latent variables from the observed data and disentangles them into these three types of variables. The rest of the architecture is similar to CEVAE and can also be used for continuous treatment variables. TVAE [136] improves TEDVAE by combining targeted learning with maximum likelihood estimation for training. Since the causal graph assumed when making causal inferences may vary, different methods incur different errors; TVAE tends to have a smaller error even when the assumed causal graph is wrong. The purpose of introducing targeted regularization is to make the outcome y and the treatment assignment t as independent as possible. TVAE can be seen as a combination of DragonNet and TEDVAE.
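The disentanglement idea can be sketched as a simple partition of the latent code; the dimensions, the two prediction heads, and the omission of the variational machinery below are illustrative assumptions rather than the TEDVAE architecture itself.

import torch
import torch.nn as nn

class DisentangledLatents(nn.Module):
    """Partition the latent code into instrumental (z_t), confounding (z_c) and risk (z_y) factors."""
    def __init__(self, x_dim, dt=5, dc=10, dy=5, h=64):
        super().__init__()
        self.dims = (dt, dc, dy)
        self.encoder = nn.Sequential(nn.Linear(x_dim, h), nn.ELU(), nn.Linear(h, dt + dc + dy))
        self.t_head = nn.Linear(dt + dc, 1)      # treatment is driven by z_t and z_c only
        self.y_head = nn.Linear(dc + dy + 1, 1)  # outcome is driven by z_c, z_y and the treatment

    def forward(self, x, t):
        z_t, z_c, z_y = torch.split(self.encoder(x), list(self.dims), dim=1)
        t_logit = self.t_head(torch.cat([z_t, z_c], dim=1))
        y_hat = self.y_head(torch.cat([z_c, z_y, t], dim=1))
        return t_logit, y_hat

Restricting each head to the relevant latent blocks is what encourages the encoder to route instrument-like, confounder-like and risk-like information into separate factors.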
Unlike most methods, CausalVAE [109] does not require an a priori causal graph and only needs a small amount of information related to the true causal concepts u as a supervision signal, converting independent exogenous variables into causally dependent endogenous variables while realizing causal discovery at the same time. There are two uses of u: the first is to use p(z|u) to regularize the posterior of z, and the second is to use u to learn the causal structure A. In addition to learning causal representations, interventions can also be performed to generate counterfactual data that are never observed.
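A rough sketch of this kind of linear causal layer, which turns independent exogenous noise into causally dependent endogenous concepts and supports do-interventions, is shown below; the parameterization, masking, and the way u supervises both the posterior and A in CausalVAE differ from this simplified version.

import torch
import torch.nn as nn

class LinearCausalLayer(nn.Module):
    """Map independent exogenous noise eps to endogenous concepts z via z = A^T z + eps,
    i.e. z = (I - A^T)^{-1} eps, where A is a learnable adjacency among the latent concepts."""
    def __init__(self, n_concepts):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(n_concepts, n_concepts))

    def forward(self, eps, do=None):
        n = self.A.size(0)
        mask = 1.0 - torch.eye(n)            # forbid self-loops
        if do is not None:                   # do(z_idx = value): cut the edges entering concept idx
            idx, value = do
            mask[:, idx] = 0.0
            eps = eps.clone()
            eps[:, idx] = value              # with no parents left, z_idx equals the clamped value
        A = self.A * mask
        # Solve z = A^T z + eps  <=>  (I - A^T) z = eps
        return torch.linalg.solve(torch.eye(n) - A.t(), eps.t()).t()

In such a model, eps would come from the encoder, and generating a counterfactual sample amounts to calling the layer with a do argument and decoding the resulting z.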
Instrumental variable (IV) [141-142] methods look for proxy latent variables (instruments) for causal inference instead of searching for hidden confounders [15]. The IV framework has a long history, especially in economics [143]. The typical scenario for the instrumental variable method is shown in Figure 9 (b). A more powerful instrumental variable approach incorporating deep learning has been introduced: DeepIV [144] uses instrumental variables and defines a counterfactual error function to implement neural network-based causal inference in the presence of unobserved confounders. The method can verify accuracy on out-of-sample data, which is very beneficial for the hyperparameter tuning that neural networks require. DeepIV is implemented in two steps. The first step is to learn the treatment distribution using a neural network: F̂ = F_φ(t|x, z), where x is the covariate, t is the treatment variable, and z is the instrumental variable. The second step is to use an outcome network to predict the counterfactual outcomes. The objective function is:
L(D; θ) = |D|^{-1} ∑_i ( y_i − ∫ h_θ(t, x_i) dF̂_φ(t|x_i, z_i) )² ,   (29)
Figure 10: Timeline of the main deep causal inference methods. In this figure, the blue circle represents the covariate balance-based
method, the green circle represents the adversarial training-based method, the yellow circle represents the proxy variable-based method,
and the gray triangle represents the technique used.
where ℎ is the prediction function, 𝐹̂ф is the treatment distribution obtained from the first step, and 𝐷 is the dataset.
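A minimal sketch of this two-stage procedure follows; for brevity the treatment network outputs a single Gaussian rather than the mixture density network used in DeepIV, and the integral in Eq. (29) is approximated with Monte-Carlo samples (the original work additionally discusses independent sample sets for unbiased gradients), so all dimensions and layer sizes here are illustrative assumptions.

import torch
import torch.nn as nn

x_dim, z_dim = 10, 2   # illustrative covariate and instrument dimensions

# Stage 1: treatment network F_phi(t | x, z); a single Gaussian here for brevity.
treat_net = nn.Sequential(nn.Linear(x_dim + z_dim, 64), nn.ReLU(), nn.Linear(64, 2))

def sample_treatment(x, z, n_samples):
    mu, log_sigma = treat_net(torch.cat([x, z], dim=1)).chunk(2, dim=1)
    eps = torch.randn(n_samples, *mu.shape)
    return mu + log_sigma.exp() * eps                      # samples t ~ F_phi(t | x, z)

# Stage 2: outcome network h_theta(t, x), trained with the loss of Eq. (29); the
# integral over the fitted treatment distribution is approximated by Monte Carlo.
outcome_net = nn.Sequential(nn.Linear(1 + x_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def deepiv_loss(x, z, y, n_samples=10):
    with torch.no_grad():                                  # the stage-1 network is already fitted
        t_samples = sample_treatment(x, z, n_samples)      # (n_samples, batch, 1)
    x_rep = x.unsqueeze(0).expand(n_samples, -1, -1)
    h = outcome_net(torch.cat([t_samples, x_rep], dim=-1)) # h_theta(t, x_i)
    return ((y - h.mean(dim=0)) ** 2).mean()               # |D|^{-1} sum_i (y_i - E[h])^2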
Figure 10 shows the timeline of the main deep causal inference methods according to covariate balance-based,
adversarial training-based and proxy variable-based methods. We can see that covariate balance-based methods are the
core of deep causal inference methods; adversarial training-based and proxy variable-based methods have received
increasing attention in recent years.
Although theories of causal identification prove that intervention is not necessary to identify causal effects, intervention is still at the core of causal reasoning at the current stage of research. An intervention analogous to that of an SCM is defined in the computation layer of the GNN: in the graph, the intervention changes the connections to neighboring nodes, removing the edges from the parent nodes in the causal graph. The restrictions in Pearl's causal hierarchy (PCH) on the information obtainable from data (i.e., which level of information the data can provide) still apply to neural networks. Neural networks are universal approximators, so a set of neural networks can be trained on data generated by an SCM to obtain an estimate of that SCM.
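As a toy illustration of this edge-removal view of intervention (not tied to any particular GNN implementation), consider a small linear SCM encoded by a weighted adjacency matrix; the weights and variable ordering are purely illustrative.

import numpy as np

# A tiny linear SCM over (X0, X1, X2); W[i, j] is the weight of the edge X_i -> X_j.
W = np.array([[0.0, 1.5,  0.0],
              [0.0, 0.0, -2.0],
              [0.0, 0.0,  0.0]])

def forward(W, noise, do=None):
    """Compute the variables in topological order; do = (node, value) removes the
    edges coming from the node's parents and clamps its value (the edge-removal intervention)."""
    W = W.copy()
    if do is not None:
        W[:, do[0]] = 0.0                  # cut all incoming edges of the intervened node
    x = np.zeros(len(noise))
    for j in range(len(noise)):            # variables are indexed in topological order here
        x[j] = W[:, j] @ x + noise[j]
        if do is not None and j == do[0]:
            x[j] = do[1]
    return x

noise = np.random.randn(3)
observational = forward(W, noise)                  # a sample from the observational SCM
interventional = forward(W, noise, do=(1, 2.0))    # same noise, but under do(X1 = 2)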
Furthermore, some graph-based variational autoencoders are beginning to merge with the SCM framework. VACA [151] does not require any assumptions about the parametric form; it simulates the necessary properties of an SCM, providing a framework for performing intervention operations. Graph neural networks are used for causal inference: the conditions that the variational graph autoencoder (VGAE) [152] must satisfy to serve as a density estimator for an a priori graph structure are described so that it can simulate the behavior of causal interventions. In VACA, probabilistic models that represent uncertainty are introduced to estimate causality and provide good approximations of the interventional and counterfactual distributions.
To build a deep structural causal model, the ability to model distributions at three levels is required: association, intervention, and counterfactual. At the same time, deep structural causal models are no longer limited to structured data; since deep neural networks are combined with the structural equations, they can often directly process unstructured, high-dimensional data. By fully combining deep mechanisms with the SCM, DSCM [153] can use exogenous variables for counterfactual inference, as in an SCM, via variational inference. Three types of deep mechanisms combined with the SCM have been discussed: explicit likelihood, amortized explicit likelihood, and amortized implicit likelihood. These mechanisms may require variational inference and normalizing flows for modeling. DSCM can also perform the three steps of counterfactual inference described by Pearl [35]: abduction, action, and prediction. First, DSCM uses the available evidence to estimate the exogenous variables. Second, it intervenes on one of the variables while the other mechanisms remain unchanged. Finally, it uses the exogenous variables and the causal mechanisms to obtain the new outcomes. A deep causal graph (DCG) [154] also uses neural networks to model causal relationships. This model fits data sampled from observational or interventional distributions to answer interventional or counterfactual queries. Specifically, DCG uses neural networks to simulate the structural equations, whose elements are deep causal units (DCUs), which can perform three operations: sampling, computing likelihoods, and computing the noise posterior.
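The abduction-action-prediction loop can be illustrated on a toy additive-noise SCM with closed-form mechanisms; in DSCM and DCG the mechanisms are (invertible or amortized) neural networks, but the three steps are the same. The coefficients below are illustrative.

import numpy as np

# Toy additive-noise SCM: x0 = u0,  x1 = 1.5*x0 + u1,  x2 = -2.0*x1 + u2.
def run_scm(u, do_x1=None):
    x0 = u[0]
    x1 = do_x1 if do_x1 is not None else 1.5 * x0 + u[1]
    x2 = -2.0 * x1 + u[2]
    return np.array([x0, x1, x2])

def counterfactual(x_obs, do_x1):
    # 1) Abduction: invert the mechanisms to recover the exogenous noise from the evidence.
    u = np.array([x_obs[0],
                  x_obs[1] - 1.5 * x_obs[0],
                  x_obs[2] + 2.0 * x_obs[1]])
    # 2) Action: intervene on x1 while keeping the other mechanisms unchanged.
    # 3) Prediction: push the recovered noise back through the modified model.
    return run_scm(u, do_x1=do_x1)

x_obs = run_scm(np.random.randn(3))          # an observed (factual) sample
x_cf = counterfactual(x_obs, do_x1=1.0)      # "what would x2 have been had x1 been 1?"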
We classified deep causal inference methods into covariate balance-based, adversarial training-based, and proxy variable-based methods. Finally, we introduced a special class of structural causal models
called deep structural causal models. These models deeply integrate neural networks with causal models and can make
full use of the structural information of SCM and the fitting ability of neural networks.
Although deep learning has brought many changes to causal learning, there are still many problems that must be
addressed. Here, we raise these questions with a brief discussion, hoping to provide researchers with some future directions.
Scarcity of causal knowledge and too many strong assumptions. Although many of us are devoted to the field of
causal learning, as researchers, we must constantly reflect on whether what we think of as causality is truly causality and
whether there is a more suitable form of studying causality than the causal graph or potential outcomes. Because of this
lack of causal knowledge, we rely heavily on untestable assumptions when studying causality. In the causal discovery
field, the correctness of the causal Markov condition assumption and faithfulness assumption still needs to be fully
verified [155]. The SUTVA, unconfoundedness, and positivity assumptions are required when making causal inferences. The conditions under which these assumptions hold or fail need to be fully studied (e.g., due to social connections, one person's treatment may affect another person's outcome through social activities). Most existing methods and applications of causal inference are based on the assumption of directed acyclic graphs [156], but in reality, there may be causal feedback between variables, leading to the emergence of cyclic graphs [95]. In addition, different methods are based on different assumptions about the distribution of noise [26, 157]. How to reasonably relax these assumptions while ensuring accuracy is a very challenging problem. Although these assumptions are convenient, they also bring many
risks, and care must be taken when using them.
Complex unstructured treatments and effects. Existing deep representation learning for causal variables does not
completely solve the problem of complex data. In many scenarios, data are heterogeneous [158], and treatment variables
may be very complex, such as time-series multivariate continuous variables (as opposed to simple binary discrete variables), which pose challenges to many existing methods (e.g., the sampling frequency may affect the results [95]). An
important problem in causal representation learning is how to make the representations as stable and unique as possible
and how to match the representations with human cognitive understanding. In addition, after causal representation
vectors are obtained, it may not make sense to test the independence between vectors; therefore, designing suitable
causal discovery and causal inference methods for different scenarios or designing general powerful methods is a way to
make deep causal learning more applicable.
Lack of causal datasets and suitable metrics. Although there are some commonly used datasets in the fields of
causal discovery and causal inference [25, 62], these small-scale datasets severely limit the performance of neural
networks for deep causal learning methods. Therefore, releasing large-scale causal datasets would be a significant boost
to the entire field of deep causal learning. At the same time, the metrics used to estimate causal effects on different
datasets remain inconsistent. For example, the ε_PEHE metric used on IHDP is only suitable for binary treatment variables, while the R_pol metric used on the Jobs dataset is highly task-specific and lacks universality. Rich and diverse causal "loss
functions" similar to the current stage of deep learning [159-161] would be very helpful for the rapid improvement of the
performance of deep causal learning algorithms.
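For reference, a minimal sketch of the ε_PEHE computation is shown below; it is only possible when both potential outcomes are known, as in the semi-synthetic IHDP benchmark, and some papers report the squared quantity rather than its root.

import numpy as np

def pehe(mu1, mu0, mu1_hat, mu0_hat):
    """Precision in Estimation of Heterogeneous Effect: root mean squared error between
    the true and estimated individual treatment effects tau_i = mu1_i - mu0_i."""
    tau_true = mu1 - mu0
    tau_hat = mu1_hat - mu0_hat
    return np.sqrt(np.mean((tau_true - tau_hat) ** 2))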
Limited scalability and excessive computational consumption. Most current causal discovery methods can only
achieve good results on small-scale datasets, although methods based on continuous optimization can theoretically cope
with hundreds or thousands of high-dimensional variables [25]. Therefore, verifying the efficiency of existing methods
on large-scale data and developing theoretically more efficient methods are urgently needed in the field of causal
discovery. In addition to using purely observational data to discover causal relationships, integrating prior knowledge
into the causal discovery process is also a very meaningful direction for future study [162]. It is not always necessary to
pursue the discovery of a complete causal graph [163]; sometimes it is enough to know part of the causal structure to
solve the problem.
Deeper integration with deep learning. In this article, we mentioned the implicit generation of counterfactual data
and the explicit generation of disentangled causal mechanisms. We can also explore leveraging explicit generative methods to generate counterfactuals and using implicit generative models to approximate causal mechanisms. Other deep models, such as normalizing flow models [153], autoregressive flow models [164], and energy-based models [165], combined with causal discovery and causal inference methods, can be considered. Based on our classification system,
deep causal inference methods are classified into covariate balancing-based, adversarial training-based, and proxy
variable-based methods. Some studies combined covariate balancing with adversarial training and obtained good results
[134-135, 166] (e.g., achieving covariate balance through adversarial training). There are also some methods that use
Transformers for causal inference [167-169]. This inspires us to integrate deep learning ideas with causal methods from multiple perspectives. Finally, a class of methods called neural causal models (NCMs) [148] has been extensively studied in recent years, and many effective algorithms have been developed for causal discovery and inference. NCMs are a subset of SCMs but generally have the same expressive power. Neural causal models usually
combine the characteristics of the SCM data structure and the advantages of the general fitting ability of neural networks.
Therefore, they are considered a research direction that may bring a breakthrough to causal learning.
Causality for deep learning. In this article, we mainly discussed the changes that deep learning brought to causal
learning. At the same time, causal learning is profoundly changing the field of deep learning. Many studies have focused
on how causality can help deep learning address long-standing issues such as interpretability [44-45, 47, 170-171],
generalization [172-174], robustness [175-177], and fairness [178-180].
ACKNOWLEDGMENTS
We thank Xingwei Zhang and Songran Bai for their precious suggestions and comments on this work. We also thank
Haitao Huang and Gang Zhou for invaluable discussion. This work is supported by the Ministry of Science and
Technology of China under Grant No. 2020AAA0108401, and the Natural Science Foundation of China under Grant
Nos. 72225011 and 71621002.
REFERENCES
[1] Md Vasimuddin and Srinivas Aluru. 2017. Parallel Exact Dynamic Bayesian Network Structure Learning with Application to Gene Networks. In
2017 IEEE 24th International Conference on High Performance Computing (HiPC), Dec 18-21, 2017. Jaipur, India. 42-51.
[2] Sofia Triantafillou, Vincenzo Lagani, Christina Heinze-Deml, Angelika Schmidt and Ioannis Tsamardinos. 2017. Predicting Causal Relationships
from Biological Data: Applying Automated Casual Discovery on Mass Cytometry Data of Human Immune Cells. Scientific Reports 7, 1 (2017), 1-12.
[3] Steffen L Lauritzen and David J Spiegelhalter. 1988. Local computations with probabilities on graphical structures and their application to expert
systems. Journal of the Royal Statistical Society: Series B (Methodological) 50, 2 (1988), 157-194.
[4] Guido W Imbens and Donald B Rubin. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
[5] Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American
Statistical Association 113, 523 (2018), 1228-1242.
[6] Subramani Mani and Gregory F Cooper. 2000. Causal discovery from medical textual data. In Proceedings of the AMIA Symposium, 2000. 542.
[7] Cross-Disorder Group of the Psychiatric Genomics Consortium. 2013. Identification of risk loci with shared effects on five major psychiatric
disorders: a genome-wide analysis. The Lancet 381, 9875 (2013), 1371-1379.
[8] Clive WJ Granger. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: journal of the Econometric
Society 37 (1969), 424-438.
[9] Kevin D Hoover. 2006. Causality in economics and econometrics. Springer.
[10] Alberto Abadie and Guido W Imbens. 2016. Matching on the estimated propensity score. Econometrica 84, 2 (2016), 781-807.
[11] Guido W Imbens. 2004. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and statistics 86,
1 (2004), 4-29.
[12] Serge Darolles, Yanqin Fan, Jean-Pierre Florens and Eric Renault. 2011. Nonparametric instrumental regression. Econometrica 79, 5 (2011), 1541-
1565.
[13] Miguel Ángel Hernán, Babette Brumback and James M Robins. 2000. Marginal structural models to estimate the causal effect of zidovudine on the
survival of HIV-positive men. Epidemiology 11, 5 (2000), 561-570.
[14] James M Robins, Miguel Angel Hernan and Babette Brumback. 2000. Marginal structural models and causal inference in epidemiology.
Epidemiology 11, 5 (2000), 550-560.
[15] Miguel A Hernán and James M Robins. 2010. Causal inference: What If. CRC Press.
[16] Miguel A Hernán. 2018. The C-word: scientific euphemisms do not improve causal inference from observational data. American journal of public
health 108, 5 (2018), 616-619.
[17] Michael P Grosz, Julia M Rohrer and Felix Thoemmes. 2020. The taboo against explicit causal inference in nonexperimental psychology.
Perspectives on Psychological Science 15, 5 (2020), 1243-1255.
[18] MJ Vowels. 2020. Limited functional form, misspecification, and unreliable interpretations in psychology and social science. arXiv:2009.10025.
Retrieved from https://arxiv.org/abs/2009.10025
[19] Michael E Sobel. 1998. Causal inference in statistical models of the process of socioeconomic achievement: A case study. Sociological Methods &
Research 27, 2 (1998), 318-348.
[20] Cosma Rohilla Shalizi and Andrew C Thomas. 2011. Homophily and contagion are generically confounded in observational social network studies.
Sociological methods & research 40, 2 (2011), 211-239.
[21] Bernhard Schölkopf. 2022. Causality for machine learning. Probabilistic and Causal Inference: The Works of Judea Pearl (2022), 765-804.
[22] Bernhard Scholkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal and Yoshua Bengio. 2021. Toward
Causal Representation Learning. Proceedings of the IEEE 109, 5 (2021), 612-634.
[23] Ruocheng Guo, Lu Cheng, Jundong Li, P. Richard Hahn and Huan Liu. 2021. A Survey of Learning Causality with Data. ACM Computing Surveys
53, 4 (2021), 1-37.
[24] Clark Glymour, Kun Zhang and Peter Spirtes. 2019. Review of Causal Discovery Methods Based on Graphical Models. Front Genet 10 (2019), 524.
[25] Matthew J Vowels, Necati Cihan Camgoz and Richard Bowden. 2021. D'ya like dags? A survey on structure learning and causal discovery.
arXiv:2103.02582. Retrieved from https://arxiv.org/abs/2103.02582
[26] Peter Spirtes, Clark N Glymour, Richard Scheines and David Heckerman. 2000. Causation, prediction, and search. MIT press.
[27] David Maxwell Chickering. 2002. Optimal structure identification with greedy search. Journal of machine learning research 3, Nov (2002), 507-554.
[28] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, Antti Kerminen and Michael Jordan. 2006. A linear non-Gaussian acyclic model for causal
discovery. Journal of Machine Learning Research 72, 7 (2006), 2003-2030.
[29] Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters and Bernhard Schölkopf. 2008. Nonlinear causal discovery with additive noise models.
In Advances in neural information processing systems, Dec 8-10, 2008. Vancouver, B.C., Canada. 689-696.
[30] Diego Colombo, Marloes H Maathuis, Markus Kalisch and Thomas S Richardson. 2012. Learning high-dimensional directed acyclic graphs with
latent and selection variables. The Annals of Statistics 40, 1 (2012), 294-321.
[31] Dominik Janzing, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel and Bernhard Schölkopf. 2012.
Information-geometric approach to inferring causal directions. Artificial Intelligence 182 (2012), 1-31.
[32] Antti Hyttinen, Patrik O Hoyer, Frederick Eberhardt and Matti Jarvisalo. 2013. Discovering cyclic causal models with latent variables: A general
SAT-based procedure. arXiv:1309.6836. Retrieved from https://arxiv.org/abs/1309.6836
[33] Judea Pearl. 2009. Causal inference in statistics: An overview. Statistics Surveys 3 (2009), 96-146.
[34] Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao and Aidong Zhang. 2021. A survey on causal inference. ACM Transactions on Knowledge
Discovery from Data (TKDD) 15, 5 (2021), 1-46.
[35] Judea Pearl. 2009. Causality. Cambridge university press.
[36] Alberto Abadie, David Drukker, Jane Leber Herr and Guido W Imbens. 2004. Implementing matching estimators for average treatment effects in
Stata. The stata journal 4, 3 (2004), 290-311.
[37] Elizabeth A Stuart. 2010. Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of
Mathematical Statistics 25, 1 (2010), 1-21.
[38] Donald B Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology 66, 5
(1974), 688.
[39] Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1
(1983), 41-55.
[40] Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association 100,
469 (2005), 322-331.
[41] James M Robins, Andrea Rotnitzky and Lue Ping Zhao. 1994. Estimation of regression coefficients when some regressors are not always observed.
Journal of the American statistical Association 89, 427 (1994), 846-866.
[42] Michele Jonsson Funk, Daniel Westreich, Chris Wiesen, Til Stürmer, M Alan Brookhart and Marie Davidian. 2011. Doubly robust estimation of
causal effects. American journal of epidemiology 173, 7 (2011), 761-767.
[43] Youngseo Son, Nipun Bayas and H Andrew Schwartz. 2018. Causal explanation analysis on social media. arXiv:1809.01202. Retrieved from
https://arxiv.org/abs/1809.01202
[44] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer and Stuart Shieber. 2020. Investigating gender bias in
language models using causal mediation analysis. In Advances in Neural Information Processing Systems, Dec 6-12, 2020. 12388-12401.
[45] Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen and Yonatan Belinkov. 2021. Causal analysis of syntactic
agreement mechanisms in neural language models. arXiv:2106.06087. Retrieved from https://arxiv.org/abs/2106.06087
[46] Dong Zhang, Hanwang Zhang, Jinhui Tang, Xian-Sheng Hua and Qianru Sun. 2020. Causal intervention for weakly-supervised semantic
segmentation. Advances in Neural Information Processing Systems 33, (2020), 655-666.
[47] Pranoy Panda, Sai Srinivas Kancheti and Vineeth N Balasubramanian. 2021. Instance-wise Causal Feature Selection for Model Interpretation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 1756-1759.
[48] Markus Kalisch and Peter Bühlman. 2007. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning
Research 8 (2007), 613-636.
[49] Joseph Ramsey, Jiji Zhang and Peter L Spirtes. 2012. Adjacency-faithfulness and conservative causal inference. arXiv:1206.6843. Retrieved from
https://arxiv.org/abs/1206.6843
[50] Peter C Austin. 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate
behavioral research 46, 3 (2011), 399-424.
[51] Rohit Bhattacharya, Razieh Nabi and Ilya Shpitser. 2020. Semiparametric inference for causal effects in graphical models with hidden variables.
arXiv:2003.12659. Retrieved from https://arxiv.org/abs/2003.12659
[52] Niki Kiriakidou and Christos Diou. 2022. An improved neural network model for treatment effect estimation. arXiv:2205.11106. Retrieved from
https://arxiv.org/abs/2205.11106
[53] Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Bernhard Schölkopf, Michael C Mozer, Chris Pal and Yoshua
Bengio. 2019. Learning neural causal models from unknown interventions. arXiv:1910.01075. Retrieved from https://arxiv.org/abs/1910.01075
[54] Christina Heinze-Deml, Marloes H Maathuis and Nicolai Meinshausen. 2018. Causal structure learning. Annual Review of Statistics and Its
Application 5 (2018), 371-391.
[55] Clark Glymour, Kun Zhang and Peter Spirtes. 2019. Review of causal discovery methods based on graphical models. Frontiers in genetics 10, (2019),
524.
[56] Krzysztof Chalupka, Frederick Eberhardt and Pietro Perona. 2016. Causal feature learning: an overview. Behaviormetrika 44, 1 (2016), 137-164.
[57] Yoshua Bengio, Aaron Courville and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern
analysis and machine intelligence 35, 8 (2013), 1798-1828.
[58] Mehrdad Farajtabar, Andrew Lee, Yuanjian Feng, Vishal Gupta, Peter Dolan, Harish Chandran and Martin Szummer. 2020. Balance regularized
neural network models for causal effect estimation. arXiv:2011.11199. Retrieved from https://arxiv.org/abs/2011.11199
[59] Jean Kaddour, Yuchen Zhu, Qi Liu, Matt J Kusner and Ricardo Silva. 2021. Causal effect inference for structured treatments. In Advances in Neural
Information Processing Systems, Dec 6-14, 2021. 24841-24854.
[60] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio. 2014.
Generative adversarial nets. In Advances in neural information processing systems, Dec 8-13, 2014. Montréal Canada. 2672-2680.
[61] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv:1312.6114. Retrieved from https://arxiv.org/abs/1312.6114
[62] Ana Rita Nogueira, Andrea Pugnana, Salvatore Ruggieri, Dino Pedreschi and João Gama. 2022. Methods and tools for causal discovery and causal
inference. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12, 2 (2022), e1449.
[63] Bernhard Schölkopf. 2019. Causality for machine learning. arXiv:1911.10500. Retrieved from https://arxiv.org/abs/1911.10500
[64] Jean Kaddour, Aengus Lynch, Qi Liu, Matt J Kusner and Ricardo Silva. 2022. Causal Machine Learning: A Survey and Open Problems.
arXiv:2206.15475. Retrieved from https://arxiv.org/abs/2206.15475
[65] Bernard Koch, Tim Sainburg, Pablo Geraldo, Song Jiang, Yizhou Sun and Jacob Gates Foster. 2021. Deep learning of potential outcomes.
arXiv:2110.04442. Retrieved from https://arxiv.org/abs/2110.04442
[66] Ilya Shpitser and Judea Pearl. 2008. Complete Identification Methods for the Causal Hierarchy. Journal of Machine Learning Research 9, 9 (2008),
1941-1979.
[67] Jonas Peters, Dominik Janzing and Bernhard Schölkopf. 2017. Elements of Causal Inference - Foundations and Learning Algorithms. The MIT Press.
[68] Keisuke Hirano, Guido W Imbens and Geert Ridder. 2003. Efficient estimation of average treatment effects using the estimated propensity score.
Econometrica 71, 4 (2003), 1161-1189.
[69] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735-1780.
[70] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In
Interspeech, Sep 26-30, 2010. Makuhari, Chiba, Japan. 1045-1048.
[71] Marco Gori, Gabriele Monfardini and Franco Scarselli. 2005. A new model for learning in graph domains. In Proceedings of 2005 IEEE international
joint conference on neural networks, Jul 31-Aug 4, 2005. Montreal, QC, Canada. 729-734.
[72] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907. Retrieved from
https://arxiv.org/abs/1609.02907
[73] Serge Assaad, Shuxi Zeng, Chenyang Tao, Shounak Datta, Nikhil Mehta, Ricardo Henao, Fan Li and Lawrence Carin. 2021. Counterfactual
representation learning with balancing weights. In International Conference on Artificial Intelligence and Statistics, Apr 13-15, 2021. 1972-1980.
[74] Fredrik Johansson, Uri Shalit and David Sontag. 2016. Learning representations for counterfactual inference. In International conference on machine
learning, Jun 19-24, 2016. New York City, NY, USA. 3020-3029.
[75] Uri Shalit, Fredrik D Johansson and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In
International Conference on Machine Learning, Aug 6-11, 2017. Sydney, Australia. 3076-3085.
[76] Cédric Villani. 2009. Optimal transport: old and new. Springer.
[77] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf and Alexander Smola. 2012. A kernel two-sample test. The Journal of
Machine Learning Research 13, 1 (2012), 723-773.
[78] Nicolas Fournier and Arnaud Guillin. 2015. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and
Related Fields 162, 3 (2015), 707-738.
[79] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf and Alex J Smola. 2006. Integrating structured
biological data by kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49-e57.
[80] Claudia Shi, David Blei and Victor Veitch. 2019. Adapting neural networks for the estimation of treatment effects. arXiv:1906.0212. Retrieved from
https://arxiv.org/abs/1906.0212
[81] Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao and Aidong Zhang. 2018. Representation learning for treatment effect estimation from
observational data. In Advances in Neural Information Processing Systems, Dec 2-8, 2018. Montréal Canada. 2638-2648.
[82] Shakir Mohamed and Balaji Lakshminarayanan. 2016. Learning in implicit generative models. arXiv:1610.03483. Retrieved from
https://arxiv.org/abs/1610.03483
[83] Liuyi Yao, Sheng Li, Yaliang Li, Hongfei Xue, Jing Gao and Aidong Zhang. 2019. On the estimation of treatment effect with text covariates. In the
28th International Joint Conference on Artificial Intelligence, Aug 10-16, 2019. Macao, China. 4106-4113.
[84] Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla and Bernhard Schölkopf. 2018. Learning independent causal mechanisms. In
International Conference on Machine Learning, Jul 10-15, 2018. Stockholmsmässan, Stockholm Sweden. 4036-4044.
[85] Raphael Suter, Djordje Miladinovic, Bernhard Schölkopf and Stefan Bauer. 2019. Robustly disentangled causal mechanisms: Validating deep
representations for interventional robustness. In International Conference on Machine Learning, Jun 9-15, 2019. Long Beach, California. 6056-6065.
[86] Antonin Chambolle and Thomas Pock. 2016. An introduction to continuous optimization for imaging. Acta Numerica 25 (2016), 161-319.
[87] Niclas Andréasson, Anton Evgrafov and Michael Patriksson. 2020. An introduction to continuous optimization: foundations and fundamental
algorithms. Courier Dover Publications.
[88] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar and Eric P Xing. 2018. Dags with no tears: Continuous optimization for structure learning. In
Advances in Neural Information Processing Systems, Dec 2-8, 2018. Montréal Canada. 9492-9503.
[89] Yue Yu, Jie Chen, Tian Gao and Mo Yu. 2019. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on
Machine Learning, June 9-15, 2019. Long Beach, California. 7154-7163.
[90] Ignavier Ng, Shengyu Zhu, Zhitang Chen and Zhuangyan Fang. 2019. A graph autoencoder approach to causal structure learning. arXiv:1911.07420.
Retrieved from https://arxiv.org/abs/1911.07420
[91] Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu and Simon Lacoste-Julien. 2019. Gradient-based neural dag learning. arXiv:1906.02226.
Retrieved from https://arxiv.org/abs/1906.02226
[92] Tomer Galanti, Ofir Nabati and Lior Wolf. 2020. A critical view of the structural causal model. arXiv:2002.10007. Retrieved from
https://arxiv.org/abs/2002.10007
[93] Trent Kyono, Yao Zhang and Mihaela van der Schaar. 2020. Castle: Regularization via auxiliary causal graph discovery. In Advances in Neural
Information Processing Systems, Dec 6-12, 2020. 1501-1512.
[94] Raha Moraffah, Bahman Moraffah, Mansooreh Karami, Adrienne Raglin and Huan Liu. 2020. Causal adversarial network for learning conditional
and interventional distributions. arXiv:2008.11376. Retrieved from https://arxiv.org/abs/2008.11376
[95] Jakob Runge, Sebastian Bathiany, Erik Bollt, Gustau Camps-Valls, Dim Coumou, Ethan Deyle, Clark Glymour, Marlene Kretschmer, Miguel D
Mahecha and Jordi Muñoz-Marí. 2019. Inferring causation from time series in Earth system sciences. Nature communications 10, 1 (2019), 1-13.
[96] David Danks and Sergey Plis. 2013. Learning causal structure from undersampled time series. In Twenty-seventh Conference on Neural Information
Processing Systems, Dec 5-10, 2013. Harrahs and Harveys, Lake Tahoe 1-10.
[97] Antti Hyttinen, Sergey Plis, Matti Järvisalo, Frederick Eberhardt and David Danks. 2016. Causal discovery from subsampled time series data by
constraint optimization. In Conference on Probabilistic Graphical Models, Sep 6-9, 2016. Lugano, Switzerland. 216-227.
[98] C. W. J. Granger. 1980. Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control 2, (1980), 329-352.
[99] Lionel Barnett and Anil K Seth. 2014. The MVGC multivariate Granger causality toolbox: a new approach to Granger-causal inference. Journal of
neuroscience methods 223 (2014), 50-68.
[100] Lionel Barnett and Anil K Seth. 2014. The MVGC multivariate Granger causality toolbox: a new approach to Granger-causal inference. Journal of
neuroscience methods 223, (2014), 50-68.
[101] Belkacem Chikhaoui, Mauricio Chiazzaro and Shengrui Wang. 2015. A new granger causal model for influence evolution in dynamic social
networks: The case of dblp. In Proceedings of the AAAI Conference on Artificial Intelligence, January 25-30, 2015. Austin Texas, USA. 51-57.
[102] Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie and Emily Fox. 2021. Neural granger causality. IEEE Transactions on Pattern Analysis and
Machine Intelligence 44, 8 (2021), 4267-4279.
[103] Hongming Zhang, Yintong Huo, Xinran Zhao, Yangqiu Song and Dan Roth. 2021. Learning Contextual Causality between Daily Events from Time-
consecutive Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, December 6-12, 2021. 1752-1755.
[104] George Sugihara, Robert May, Hao Ye, Chih-hao Hsieh, Ethan Deyle, Michael Fogarty and Stephan Munch. 2012. Detecting causality in complex
ecosystems. Science 338, 6106 (2012), 496-500.
[105] Matthew J Vowels, Necati Cihan Camgoz and Richard Bowden. 2021. Shadow-mapping for unsupervised neural causal discovery. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 19-25, 2021. 1740-1743.
[106] SM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari and Geoffrey E Hinton. 2016. Attend, infer, repeat: Fast scene
understanding with generative models. arXiv:1603.08575. Retrieved from https://arxiv.org/abs/1603.08575
[107] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal and Christopher Pal. 2019. A
meta-transfer objective for learning to disentangle causal mechanisms. arXiv:1901.10912. Retrieved from https://arxiv.org/abs/1901.10912
[108] Thien Q Tran, Kazuto Fukuchi, Youhei Akimoto and Jun Sakuma. 2021. Unsupervised Causal Binary Concepts Discovery with VAE for Black-box
Model Explanation. arXiv:2109.04518. Retrieved from https://arxiv.org/abs/2109.04518
[109] Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao and Jun Wang. 2021. CausalVAE: Disentangled representation learning via
neural structural causal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, December 6-12, 2021.
9593-9602.
[110] Nan Rosemary Ke, Jane Wang, Jovana Mitrovic, Martin Szummer and Danilo J Rezende. 2020. Amortized learning of neural causal representations.
arXiv:2008.09301. Retrieved from https://arxiv.org/abs/2008.09301
[111] Bernhard H Korte, Jens Vygen, B Korte and J Vygen. 2011. Combinatorial optimization. Springer.
[112] Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, Isabelle Guyon, David Lopez-Paz and Michele Sebag. 2018. Learning functional causal
models with generative neural networks. Explainable and interpretable models in computer vision and machine learning (2018), 39-80.
[113] Diviyan Kalainathan. 2019. Generative Neural Networks to infer Causal Mechanisms: algorithms and applications. Université Paris-Saclay.
[114] Sindy Löwe, David Madras, Richard Zemel and Max Welling. 2020. Amortized causal discovery: Learning to infer causal graphs from time-series
data. arXiv:2006.10833. Retrieved from https://arxiv.org/abs/2006.10833
[115] Yunzhu Li, Antonio Torralba, Anima Anandkumar, Dieter Fox and Animesh Garg. 2020. Causal discovery in physical systems from videos. In
Advances in Neural Information Processing Systems, Dec 6-12, 2020. 9180-9192.
[116] Shengyu Zhu, Ignavier Ng and Zhitang Chen. 2019. Causal discovery with reinforcement learning. arXiv:1906.04477. Retrieved from
https://arxiv.org/abs/1906.04477
[117] Patrick Schwab, Lorenz Linhardt and Walter Karlen. 2018. Perfect match: A simple method for learning representations for counterfactual inference
with neural networks. arXiv:1810.00656. Retrieved from https://arxiv.org/abs/1810.00656
[118] Fredrik D Johansson, Nathan Kallus, Uri Shalit and David Sontag. 2018. Learning weighted representations for generalization across designs.
arXiv:1802.08598. Retrieved from https://arxiv.org/abs/1802.08598
[119] Negar Hassanpour and Russell Greiner. 2019. CounterFactual Regression with Importance Sampling Weights. In the 28th International Joint
Conference on Artificial Intelligence, Aug 10-16, 2019. Macao, China. 5880-5887.
[120] Liu Qidong, Tian Feng, Ji Weihua and Zheng Qinghua. 2020. A new representation learning method for individual treatment effect estimation: Split
covariate representation network. In Asian Conference on Machine Learning, Nov 17-19, 2020. 811-822.
[121] Ahmed M Alaa, Michael Weisz and Mihaela Van Der Schaar. 2017. Deep counterfactual networks with propensity-dropout. arXiv:1706.05966.
Retrieved from https://arxiv.org/abs/1706.05966
[122] Vikas Ramachandra. 2018. Deep learning for causal inference. arXiv:1803.00149. Retrieved from https://arxiv.org/abs/1803.00149
[123] Onur Atan, James Jordon and Mihaela Van der Schaar. 2018. Deep-treat: Learning optimal personalized treatments from observational data using
neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Feb 2-7, 2018. New Orleans, Louisiana, USA. 2071-2078.
[124] Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao and Aidong Zhang. 2019. Ace: Adaptively similarity-preserved representation learning for
individual treatment effect estimation. In 2019 IEEE International Conference on Data Mining (ICDM), Nov 8-11, 2019. Beijing, China. 1432-1437.
[125] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural
networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929-1958.
[126] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International
conference on machine learning, Jun 19-24, 2016. New York City, NY, USA. 1050-1059.
[127] Jinsung Yoon, James Jordon and Mihaela Van Der Schaar. 2018. GANITE: Estimation of individualized treatment effects using generative
adversarial nets. In International Conference on Learning Representations, Apr 30- May 3, 2018. Vancouver Canada.
[128] Ioana Bica, James Jordon and Mihaela van der Schaar. 2020. Estimating the effects of continuous-valued interventions using generative adversarial
networks. In Advances in Neural Information Processing Systems, Dec 6-12, 2020. 16434-16445.
[129] Chandan Singh, Guha Balakrishnan and Pietro Perona. 2021. Matched sample selection with GANs for mitigating attribute confounding.
arXiv:2103.13455. Retrieved from https://arxiv.org/abs/2103.13455
[130] Qingyu Zhao, Ehsan Adeli and Kilian M Pohl. 2020. Training confounder-free deep learning models for medical applications. Nature
communications 11, 1 (2020), 1-9.
[131] Murat Kocaoglu, Christopher Snyder, Alexandros G Dimakis and Sriram Vishwanath. 2017. Causalgan: Learning causal implicit generative models
with adversarial training. arXiv:1709.02023. Retrieved from https://arxiv.org/abs/1709.02023
[132] Ioana Bica, Ahmed M Alaa, James Jordon and Mihaela van der Schaar. 2020. Estimating counterfactual treatment outcomes over time through
adversarially balanced representations. arXiv:2002.04083. Retrieved from https://arxiv.org/abs/2002.04083
[133] Amelia J Averitt, Natnicha Vanitchanant, Rajesh Ranganath and Adler J Perotte. 2020. The Counterfactual χ-GAN. arXiv:2001.03115.
Retrieved from https://arxiv.org/abs/2001.03115
[134] Nathan Kallus. 2020. Deepmatch: Balancing deep covariate representations for causal inference using adversarial training. In International
Conference on Machine Learning, Jul 12-18, 2020. 5067-5077.
[135] Xin Du, Lei Sun, Wouter Duivesteijn, Alexander Nikolaev and Mykola Pechenizkiy. 2021. Adversarial balancing-based representation learning for
causal effect inference with observational data. Data Mining and Knowledge Discovery 35, 4 (2021), 1713-1738.
[136] Matthew James Vowels, Necati Cihan Camgoz and Richard Bowden. 2020. Targeted VAE: Structured inference and targeted learning for causal
parameter estimation. arXiv.2009.13472. Retrieved from https://arxiv.org/abs/2009.13472
[137] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel and Max Welling. 2017. Causal effect inference with deep latent-variable
models. In Advances in neural information processing systems, Dec 4-9, 2017. Long Beach, CA, USA. 6449-6459.
[138] Carl Doersch. 2016. Tutorial on variational autoencoders. arXiv:1606.05908. Retrieved from https://arxiv.org/abs/1606.05908
[139] Diederik P Kingma and Max Welling. 2019. An introduction to variational autoencoders. arXiv:1906.02691. Retrieved from
https://arxiv.org/abs/1906.02691
[140] Weijia Zhang, Lin Liu and Jiuyong Li. 2020. Treatment effect estimation with disentangled latent factors. arXiv:2001.10652. Retrieved from
https://arxiv.org/abs/2001.10652
[141] Jeffrey Wooldridge. 2009. Should instrumental variables be used as matching variables. Citeseer.
[142] Judea Pearl. 2012. On a class of bias-amplifying variables that endanger effect estimates. arXiv:1203.3503. Retrieved from
https://arxiv.org/abs/1203.3503
[143] Olav Reiersøl. 1945. Confluence analysis by means of instrumental sets of variables. Almqvist & Wiksell.
[144] Jason Hartford, Greg Lewis, Kevin Leyton-Brown and Matt Taddy. 2017. Deep IV: A flexible approach for counterfactual prediction. In
International Conference on Machine Learning, Aug 6-11, 2017. Sydney, Australia. 1414-1423.
[145] Matej Zecevic, Devendra Singh Dhami, Petar Velickovic and Kristian Kersting. 2021. Relating graph neural networks to structural causal models.
arXiv:2109.04173. Retrieved from https://arxiv.org/abs/2109.04173
[146] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner and Gabriele Monfardini. 2008. The graph neural network model. IEEE
transactions on neural networks 20, 1 (2008), 61-80.
[147] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio and Yoshua Bengio. 2017. Graph attention networks.
arXiv:1710.10903. Retrieved from https://arxiv.org/abs/1710.10903
[148] Kevin Xia, Kai-Zhan Lee, Yoshua Bengio and Elias Bareinboim. 2021. The causal-neural connection: Expressiveness, learnability, and inference.
arXiv:2107.00793. Retrieved from https://arxiv.org/abs/2107.00793
[149] Pascal Vincent, Hugo Larochelle, Yoshua Bengio and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising
autoencoders. In Proceedings of the 25th international conference on Machine learning, Jul 5-9, 2008. Helsinki Finland. 1096-1103.
[150] Pierre Baldi. 2012. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML workshop on unsupervised and transfer
learning, Jun 26-Jul 1, 2012. Bellevue, Washington, USA. 37-49.
[151] Pablo Sanchez Martin, Miriam Rateike and Isabel Valera. 2022. Variational Causal Autoencoder for Interventional and Counterfactual Queries. In
The Thirty-Sixth AAAI Conference on Artificial Intelligence, Feb 22- Mar 1, 2022. 8159-8168.
[152] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv:1611.07308. Retrieved from https://arxiv.org/abs/1611.07308
[153] Nick Pawlowski, Daniel Coelho de Castro and Ben Glocker. 2020. Deep structural causal models for tractable counterfactual inference. In Advances
in Neural Information Processing Systems, Dec 6-12, 2020. 857-869.
[154] Álvaro Parafita and Jordi Vitrià. 2020. Causal Inference with Deep Causal Graphs. arXiv:2006.08380. Retrieved from
https://arxiv.org/abs/2006.08380
[155] David Freedman and Paul Humphreys. 1999. Are there algorithms that discover causal structure? Synthese 121, 1 (1999), 29-54.
[156] Thomas C Williams, Cathrine C Bach, Niels B Matthiesen, Tine B Henriksen and Luigi Gagliardi. 2018. Directed acyclic graphs: a tool for causal
studies in paediatrics. Pediatric research 84, 4 (2018), 487-493.
[157] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, Antti Kerminen and Michael Jordan. 2006. A linear non-Gaussian acyclic model for causal
discovery. Journal of Machine Learning Research 7, 10 (2006), 2003-2030.
[158] Biwei Huang, Kun Zhang, Jiji Zhang, Joseph D Ramsey, Ruben Sanchez-Romero, Clark Glymour and Bernhard Schölkopf. 2020. Causal Discovery
from Heterogeneous/Nonstationary Data. J. Mach. Learn. Res. 21, 89 (2020), 1-53.
[159] Ahmed Alaa and Mihaela Van Der Schaar. 2019. Validating causal inference models via influence functions. In Proceedings of the 36th International
Conference on Machine Learning, Jun 9-15, 2019. 191-201.
[160] Yu Luo, David A Stephens, Daniel J Graham and Emma J McCoy. 2021. Bayesian doubly robust causal inference via loss functions.
arXiv:2103.04086. Retrieved from https://arxiv.org/abs/2103.04086
[161] Moritz Willig, Matej Zečević, Devendra Singh Dhami and Kristian Kersting. 2021. The Causal Loss: Driving Correlation to Imply Causation.
arXiv:2110.12066. Retrieved from https://arxiv.org/abs/2110.12066
[162] Uzma Hasan and Md Osman Gani. 2022. KCRL: A Prior Knowledge Based Causal Discovery Framework With Reinforcement Learning.
Proceedings of Machine Learning Research 182, (2022), 1-24.
[163] Wei Wang, Gangqiang Hu, Bo Yuan, Shandong Ye, Chao Chen, Yayun Cui, Xi Zhang and Liting Qian. 2020. Prior-knowledge-driven local causal
structure learning and its application on causal discovery between type 2 diabetes and bone mineral density. IEEE Access 8, (2020), 108798-108810.
[164] Ilyes Khemakhem, Ricardo Monti, Robert Leech and Aapo Hyvarinen. 2021. Causal autoregressive flows. In International conference on artificial
intelligence and statistics, Apr 13-15, 2021. 3520-3528.
[165] Ilyes Khemakhem, Diederik P Kingma, Ricardo Pio Monti and Aapo Hyvärinen. 2020. Ice-beem: Identifiable conditional energy-based deep models.
In Proceedings of the 34th International Conference on Neural Information Processing Systems, Dec 6-12, 2020. 12768–12778.
[166] Michal Ozery-Flato, Pierre Thodoroff, Matan Ninio, Michal Rosen-Zvi and Tal El-Hay. 2018. Adversarial balancing for causal inference.
arXiv:1810.07406. Retrieved from https://arxiv.org/abs/1810.07406
[167] Zhenyu Guo, Shuai Zheng, Zhizhe Liu, Kun Yan and Zhenfeng Zhu. 2021. CETransformer: Casual Effect Estimation via Transformer Based
Representation Learning. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 2021. 524-535.
[168] Valentyn Melnychuk, Dennis Frauen and Stefan Feuerriegel. 2022. Causal Transformer for Estimating Counterfactual Outcomes. arXiv:2204.07258.
Retrieved from https://arxiv.org/abs/2204.07258
[169] Yi-Fan Zhang, Hanlin Zhang, Zachary C Lipton, Li Erran Li and Eric Xing. 2022. Exploring transformer backbones for heterogeneous treatment
effect estimation. arXiv:2202.01336. Retrieved from https://arxiv.org/abs/2202.01336
[170] Tanmayee Narendra, Anush Sankaran, Deepak Vijaykeerthy and Senthil Mani. 2018. Explaining deep learning models using causal inference.
arXiv:1811.04376. Retrieved from https://arxiv.org/abs/1811.04376
[171] Álvaro Parafita and Jordi Vitrià. 2019. Explaining visual models by causal attribution. In 2019 IEEE/CVF International Conference on Computer
Vision Workshop (ICCVW), Oct 27-28, 2019. Seoul, Korea. 4167-4175.
[172] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani and David Lopez-Paz. 2019. Invariant risk minimization. arXiv:1907.02893. Retrieved from
https://arxiv.org/abs/1907.02893
[173] Divyat Mahajan, Shruti Tople and Amit Sharma. 2021. Domain generalization using causal matching. In International Conference on Machine
Learning, Jul 18-24, 2021. 7313-7324.
[174] Yue He, Zimu Wang, Peng Cui, Hao Zou, Yafeng Zhang, Qiang Cui and Yong Jiang. 2022. CausPref: Causal Preference Learning for Out-of-
Distribution Recommendation. In Proceedings of the ACM Web Conference 2022, Apr 25-29, 2022. Lyon France 410-421.
[175] Cheng Zhang, Kun Zhang and Yingzhen Li. 2020. A causal view on robustness of neural networks. Advances in Neural Information Processing
Systems 33, (2020), 289-301.
[176] Sreya Francis, Irene Tenison and Irina Rish. 2021. Towards causal federated learning for enhanced robustness and privacy. arXiv:2104.06557. Retrieved from https://arxiv.org/abs/2104.06557
[177] Haiteng Zhao, Chang Ma, Xinshuai Dong, Anh Tuan Luu, Zhi-Hong Deng and Hanwang Zhang. 2022. Certified Robustness Against Natural
Language Attacks by Causal Intervention. arXiv:2205.12331. Retrieved from https://arxiv.org/abs/2205.12331
[178] Yiquan Wu, Kun Kuang, Yating Zhang, Xiaozhong Liu, Changlong Sun, Jun Xiao, Yueting Zhuang, Luo Si and Fei Wu. 2020. De-biased court’s
view generation with causality. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov 16-20,
2020. 763-780.
[179] Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla and Anupam Datta. 2020. Gender bias in neural natural language processing. Springer.
[180] Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H Chi and Alex Beutel. 2019. Counterfactual fairness in text classification through
robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019. 219-226.