
Toward Causal Representation Learning

This article reviews fundamental concepts of causal inference and relates them to crucial open problems of machine learning, including transfer learning and generalization, thereby assaying how causality can contribute to modern machine learning research.

By Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio

ABSTRACT | The two fields of machine learning and graphical causality arose and are developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In this article, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, that is, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

KEYWORDS | Artificial intelligence; causality; deep learning; representation learning.

Manuscript received August 14, 2020; revised December 29, 2020; accepted February 8, 2021. Date of publication February 26, 2021; date of current version April 30, 2021. (Bernhard Schölkopf and Francesco Locatello contributed equally to this work. Stefan Bauer and Nan Rosemary Ke contributed equally to this work.) (Corresponding author: Francesco Locatello.)
Bernhard Schölkopf and Stefan Bauer are with the Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany (e-mail: [email protected]; [email protected]).
Francesco Locatello was with Google Research Amsterdam, 1082 MD, The Netherlands. He is now with the Computer Science Department, ETH Zürich, 8092 Zürich, Switzerland, and also with the Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany (e-mail: [email protected]).
Nan Rosemary Ke and Anirudh Goyal are with Mila, Montreal, QC H2S 3H1, Canada, and also with the Department of Computer Science and Operational Research, University of Montreal, Montreal, QC H3T 1J4, Canada (e-mail: [email protected]; [email protected]).
Nal Kalchbrenner is with Google Research Amsterdam, 1082 MD, The Netherlands (e-mail: [email protected]).
Yoshua Bengio is with Mila, Montreal, QC H2S 3H1, Canada, with the Department of Computer Science and Operational Research, University of Montreal, Montreal, QC H3T 1J4, Canada, and also with CIFAR, Toronto, ON M5G 1M1, Canada (e-mail: [email protected]).
Digital Object Identifier 10.1109/JPROC.2021.3058954

I. INTRODUCTION
If we compare what machine learning can do to what animals accomplish, we observe that the former is rather limited at some crucial feats where natural intelligence excels. These include transfer to new problems and any form of generalization that is not from one data point to the next (sampled from the same distribution), but rather from one problem to the next—both have been termed generalization, but the latter is a much harder form thereof, sometimes referred to as horizontal, strong, or out-of-distribution generalization. This shortcoming is not too surprising, given that machine learning often disregards information that animals use heavily: interventions in the world, domain shifts, and temporal structure—by and large, we consider these factors a nuisance and try to engineer them away. In accordance with this, the majority of current successes of machine learning boil down to large-scale pattern recognition on suitably collected independent and identically distributed (i.i.d.) data.
To illustrate the implications of this choice and its relation to causal models, we start by highlighting key research challenges.

A. Issue 1—Robustness
With the widespread adoption of deep learning approaches in computer vision [103], [140], natural language processing [55], and speech recognition [86], a substantial body of literature explored the robustness of the prediction of state-of-the-art deep neural network architectures. The underlying motivation originates from the fact that, in the real world, there is often little control over the distribution from which the data come.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/


In computer vision [76], [228], changes in the test distribution may, for instance, come from aberrations, such as camera blur, noise, or compression quality [107], [129], [170], [206], or from shifts, rotations, or viewpoints [7], [12], [64], [282]. Motivated by this, new benchmarks were proposed to specifically test the generalization of classification and detection methods with respect to simple algorithmically generated interventions, such as spatial shifts, blur, changes in brightness or contrast [107], [170], time consistency [95], [227], control over background and rotation [12], as well as images collected in multiple environments [20]. Studying the failure modes of deep neural networks from simple interventions has the potential to lead to insights into the inductive biases of state-of-the-art architectures. So far, there has been no definitive consensus on how to solve these problems, although progress has been made using data augmentation, pretraining, self-supervision, and architectures with suitable inductive biases with respect to a perturbation of interest [60], [64], [137], [170], [206], [233]. It has been argued [188] that such fixes may not be sufficient, and generalizing well outside the i.i.d. setting requires learning not mere statistical associations between variables, but an underlying causal model. The latter contains the mechanisms giving rise to the observed statistical dependences and allows us to model distribution shifts through the notion of interventions [35], [180], [183], [188], [220], [237].

B. Issue 2—Learning Reusable Mechanisms
Infants’ understanding of physics relies upon objects that can be tracked over time and behave consistently [53], [236]. Such a representation allows children to quickly learn new tasks as their knowledge and intuitive understanding of physics can be reused [17], [53], [144], [250]. Similarly, intelligent agents that robustly solve real-world tasks need to reuse and repurpose their knowledge and skills in novel scenarios. Machine learning models that incorporate or learn structural knowledge of an environment have been shown to be more efficient and generalize better [9], [11], [15], [16], [27], [58], [77], [84], [85], [141], [157], [177], [181], [197], [211], [212], [244], [258], [272], [274]. In a modular representation of the world where the modules correspond to physical causal mechanisms, many modules can be expected to behave similarly across different tasks and environments. An agent facing a new environment or task may thus only need to adapt a few modules in its internal representation of the world [85], [219]. When learning a causal model, one should, thus, require fewer examples to adapt as most knowledge, that is, modules, can be reused without further training.

C. Causality Perspective
Causation is a subtle concept that cannot be fully described using the language of Boolean logic [151] or that of probabilistic inference; it requires the additional notion of intervention [183], [237]. The manipulative definition of causation [118], [183], [237] focuses on the fact that conditional probabilities (“seeing people with open umbrellas suggests that it is raining”) cannot reliably predict the outcome of active intervention (“closing umbrellas does not stop the rain”). Causal relations can also be viewed as the components of reasoning chains [151] that provide predictions for situations that are very far from the observed distribution and may even remain purely hypothetical [163], [183] or require conscious deliberation [128]. In that sense, discovering causal relations means acquiring robust knowledge that holds beyond the support of the observed data distribution and a set of training tasks, and it extends to situations involving forms of reasoning.

Our contributions: In this article, we argue that causality, with its focus on representing structural knowledge about the data generating process that allows interventions and changes, can contribute toward understanding and resolving some limitations of current machine learning methods. This would take the field a step closer to a form of artificial intelligence that involves thinking in the sense of Konrad Lorenz, that is, acting in an imagined space [163]. Despite its success, statistical learning provides a rather superficial description of reality that only holds when the experimental conditions are fixed. Instead, the field of causal learning seeks to model the effect of interventions and distribution changes with a combination of data-driven learning and assumptions not already included in the statistical description of a system. This work reviews and synthesizes key contributions that have been made to this end.1

1 The present paper expands [221], leading to partial text overlap.

1) We describe different levels of modeling in physical systems in Section II and present the differences between causal and statistical models in Section III. We do so not only in terms of modeling abilities, but also discuss the assumptions and challenges involved.
2) We expand on the independent causal mechanism (ICM) principle as a key component that enables the estimation of causal relations from data in Section IV. In particular, we state the sparse mechanism shift (SMS) hypothesis as a consequence of the ICM principle and discuss its implications for learning causal models.
3) We review existing approaches to learn causal relations from appropriate descriptors (or features) in Section V. We cover both classical approaches and modern reinterpretations based on deep neural networks, with a focus on the underlying principles that enable causal discovery.
4) We discuss how useful models of reality may be learned from data in the form of causal representations and discuss several current problems of machine learning from a causal point of view in Section VI.


Table 1 Simple Taxonomy of Models. The Most Detailed Model (Top) Is a Mechanistic or Physical One, Usually in Terms of Differential Equations.
At the Other End of the Spectrum (Bottom), We Have a Purely Statistical Model; This Can Be Learned From Data, but It Often Provides Little Insight
Beyond Modeling Associations Between Epiphenomena. Causal Models Can Be Seen as Descriptions That Lie in Between, Abstracting Away From
Physical Realism While Retaining the Power to Answer Certain Interventional or Counterfactual Questions

5) We assay the implications of causality for practical machine learning in Section VII. Using causal language, we revisit robustness and generalization, as well as existing common practices, such as semisupervised learning (SSL), self-supervised learning, data augmentation, and pretraining. We discuss examples at the intersection between causality and machine learning in scientific applications and speculate on the advantages of combining the strengths of both fields to build a more versatile AI.

II. LEVELS OF CAUSAL MODELING
The gold standard for modeling natural phenomena is a set of coupled differential equations modeling physical mechanisms responsible for time evolution. This allows us to predict the future behavior of a physical system, reason about the effect of interventions, and predict statistical dependencies between variables that are generated by coupled time evolution. It also offers physical insights, explaining the functioning of the system, and lets us read off its causal structure. To this end, consider the coupled set of differential equations:

dx/dt = f(x),  x ∈ R^d    (1)

with initial value x(t0) = x0. The Picard–Lindelöf theorem states that, at least locally, if f is Lipschitz, there exists a unique solution x(t). This implies, in particular, that the immediate future of x is implied by its past values.
If we formally write this in terms of infinitesimal differentials dt and dx = x(t + dt) − x(t), we get

x(t + dt) = x(t) + dt · f(x(t)).    (2)

From this, we can ascertain which entries of the vector x(t) mathematically determine the future of others x(t + dt). This tells us that if we have a physical system whose physical mechanisms are correctly described using such an ordinary differential equation (1), solved for dx/dt (i.e., the derivative only appears on the left-hand side), then its causal structure can be directly read off.2

2 Note that this requires that the differential equation system describes the causal physical mechanisms. If, in contrast, we considered a set of differential equations that phenomenologically correctly describe the time evolution of a system without capturing the underlying mechanisms (e.g., due to unobserved confounding or a form of coarse-graining that does not preserve the causal structure [208]), then (2) may not be causally meaningful [186], [217].
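As a concrete illustration of (1) and (2), the following minimal sketch (a hypothetical three-variable system with made-up coefficients, using only NumPy) integrates a coupled ODE with explicit Euler steps; which entries of x(t) enter each component of f is precisely the causal structure that the discretization (2) exposes.

```python
import numpy as np

def f(x):
    # Hypothetical coupled system: x0 evolves on its own,
    # x1 is driven by x0, and x2 is driven by x1.
    # The sparsity pattern of f encodes the graph x0 -> x1 -> x2.
    return np.array([
        -0.5 * x[0],
        1.0 * x[0] - 0.3 * x[1],
        0.8 * x[1] - 0.2 * x[2],
    ])

def euler_integrate(x0, dt=1e-3, steps=5000):
    # Discretization of eq. (2): x(t + dt) = x(t) + dt * f(x(t)).
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(steps):
        x = x + dt * f(x)
        trajectory.append(x.copy())
    return np.stack(trajectory)

traj = euler_integrate([1.0, 0.0, 0.0])
print(traj[-1])  # state after integrating for steps * dt time units
```

The point of the sketch is only that, once the equations are solved for dx/dt, the dependence structure of f can be read off directly; nothing hinges on the particular coefficients chosen here.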
While a differential equation is a rather comprehensive description of a system, a statistical model can be viewed as a much more superficial one. It often does not refer to dynamic processes; instead, it tells us how some of the variables allow the prediction of others as long as experimental conditions do not change. For example, if we drive a differential equation system with certain types of noise, or we average over time, then it may be the case that statistical dependencies between components of x emerge and those can then be exploited by machine learning. Such a model does not allow us to predict the effect of interventions; however, its strength is that it can often be learned from observational data, while a differential equation usually requires an intelligent human to come up with it. Causal modeling lies in between these two extremes. Like models in physics, it aims to provide the understanding and predict the effect of interventions. However, causal discovery and learning try to arrive at such models in a data-driven way, replacing expert knowledge with weak and generic assumptions. The overall situation is summarized in Table 1, adapted from [188]. In the following, we address some of the tasks listed in Table 1 in more detail.

A. Predicting in the i.i.d. Setting
Statistical models are a superficial description of reality as they are only required to model associations. For a given set of input examples X and target labels Y, we may be interested in approximating P(Y |X) to answer questions, such as “what is the probability that this particular image contains a dog?” or “what is the probability of heart failure given certain diagnostic measurements (e.g., blood pressure) carried out on a patient?” Subject to suitable assumptions, these questions can be provably answered by observing a sufficiently large amount of i.i.d. data from P(X, Y) [257]. Despite the impressive advances of machine learning, causality offers an underexplored complement: accurate predictions may not be sufficient to inform decision-making. For example, the frequency of storks is a reasonable predictor for human birth rates in Europe [168]. However, as there is no direct causal link between these two variables, a change to the stork population would not affect the birth rates, even though a statistical model may predict so.


The predictions of a statistical model are only accurate within identical experimental conditions. Performing an intervention changes the data distribution, which may lead to (arbitrarily) inaccurate predictions [183], [188], [220], [237].

B. Predicting Under Distribution Shifts
Interventional questions are more challenging than predictions as they involve actions that take us out of the usual i.i.d. setting of statistical learning. Interventions may affect both the value of a subset of causal variables and their relations. For example, “is increasing the number of storks in a country going to boost its human birth rate?” and “would fewer people smoke if cigarettes were more socially stigmatized?” As interventions change the joint distribution of the variables of interest, classical statistical learning guarantees [257] no longer apply. On the other hand, learning about interventions may allow training predictive models that are robust against the changes in distribution that naturally happen in the real world. Here, interventions do not need to be deliberate actions to achieve a goal. Statistical relations may change dynamically over time (e.g., people’s preferences and tastes), or there may simply be a mismatch between a carefully controlled training distribution and the test distribution of a model deployed in production. The robustness of deep neural networks has recently been scrutinized and become an active research topic related to causal inference. We argue that predicting under distribution shift should not be reduced to just the accuracy on a test set. If we wish to incorporate learning algorithms into human decision-making, we need to trust that the predictions of the algorithm will remain valid if the experimental conditions are changed.

C. Answering Counterfactual Questions
Counterfactual problems involve reasoning about why things happened, imagining the consequences of different actions in hindsight, and determining which actions would have achieved the desired outcome. Answering counterfactual questions can be more difficult than answering interventional questions. However, this may be a key challenge for AI, as an intelligent agent may benefit from imagining the consequences of its actions and understanding in retrospect what led to certain outcomes, at least to some degree of approximation.3 We have mentioned the example of statistical predictions of heart failure above. An interventional question would be “how does the probability of heart failure change if we convince a patient to exercise regularly?” A counterfactual one would be “would a given patient have suffered heart failure if they had started exercising a year earlier?” As we shall discuss in the following, counterfactuals, or approximations thereof, are especially critical in RL. They can enable agents to reflect on their decisions and formulate hypotheses that can be empirically verified in a process akin to the scientific method.

3 Note that the two types of questions occupy a continuum: to this end, consider a probability that is both conditional and interventional, P(A|B, do(C)). If B is an empty set, we have a classical intervention; if B contained all (unobserved) noise terms, we have a counterfactual. If B is not identical to the noise terms, but, nevertheless, informative about them, we get something in between. For instance, reinforcement learning (RL) practitioners may describe Q functions as providing counterfactuals even though they model P[return from t | agent state at time t, do(action at time t)] and are, therefore, closer to an intervention (which is why they can be estimated from data).

D. Nature of Data: Observational, Interventional, and (Un)structured
The data format plays a substantial role in which type of relation can be inferred. We can distinguish two axes of data modalities: observational versus interventional, and hand-engineered versus raw (unstructured) perceptual input.

1) Observational and Interventional Data: An extreme form of data which is often assumed but seldom strictly available is observational i.i.d. data, where each data point is independently sampled from the same distribution. Another extreme is interventional data with known interventions, where we observe data sets sampled from multiple distributions, each of which is the result of a known intervention. In between, we have data with (domain) shifts or unknown interventions. This is observational in the sense that the data is only observed passively, but it is interventional in the sense that there are interventions/shifts, but unknown to us.

2) Hand-Engineered Data Versus Raw Data: Especially in classical AI, data are often assumed to be structured into high-level and semantically meaningful variables, which may partially (modulo some variables being unobserved) correspond to the causal variables of the underlying graph. Raw data, in contrast, are unstructured and do not expose any direct information about causality.
While statistical models are weaker than causal models, they can be efficiently learned from observational data alone on both hand-engineered features and raw perceptual input, such as images, videos, and speech. On the other hand, although methods for learning causal structure from observations exist [18], [37], [83], [113], [123], [139], [161], [174]–[176], [188]–[190], [229], [237], [246], [279], learning causal relations frequently requires collecting data from multiple environments or the ability to perform interventions [251]. In some cases, it is assumed that all common causes of measured variables are also observed (causal sufficiency).4 Overall, a significant amount of prior knowledge is encoded in which variables are measured. Moving forward, one would hope to develop methods that replace expert data collection with suitable inductive biases and learning paradigms, such as metalearning and self-supervision.

4 There are also algorithms that do not require causal sufficiency [237].


If we wish to learn a causal model that is useful for a particular set of tasks and environments, the appropriate granularity of the high-level variables depends on the tasks of interest and on the type of data that we have at our disposal, for example, which interventions can be performed and what is known about the domain.

III. CAUSAL MODELS AND INFERENCE
As discussed, reality can be modeled at different levels, from the physical one to statistical associations between epiphenomena. In this section, we expand on the difference between statistical and causal modeling and review a formal language to talk about interventions and distribution changes.

A. Methods Driven by i.i.d. Data
The machine learning community has produced impressive successes with machine learning applications to big data problems [54], [148], [171], [223], [232]. In these successes, there are several trends at work [215]: 1) we have massive amounts of data, often from simulations or large-scale human labeling; 2) we use high-capacity machine learning systems (i.e., complex function classes with many adjustable parameters); 3) we employ high-performance computing systems; and (often ignored, but crucial when it comes to causality) 4) the problems are i.i.d. The latter can be guaranteed by the construction of a task, including training and test set (e.g., image recognition using benchmark data sets). Alternatively, problems can be made approximately i.i.d., for example, by carefully collecting the right training set for a given application problem, or by methods, such as “experience replay” [171] where an RL agent stores observations in order to later permute them for the purpose of retraining.
For i.i.d. data, strong universal consistency results from statistical learning theory apply, guaranteeing convergence of a learning algorithm to the lowest achievable risk. Such algorithms do exist, for instance, nearest neighbor classifiers, support vector machines, and neural networks [67], [221], [239], [257]. Seen in this light, it is not surprising that we can indeed match or surpass human performance if given enough data. However, current machine learning methods often perform poorly when faced with problems that violate the i.i.d. assumption, yet seem trivial to humans. Vision systems can be grossly misled if an object that is normally recognized with high accuracy is placed in a context that in the training set may be negatively correlated with the presence of the object. Distribution shifts may also arise from simple corruptions that are common in real-world data collection pipelines [10], [107], [129], [170], [206]. An example of this is the impact of socioeconomic factors in clinics in Thailand on the accuracy of a detection system for diabetic retinopathy [19]. More dramatically, the phenomenon of “adversarial vulnerability” [249] highlights how even tiny but targeted violations of the i.i.d. assumption, generated by adding suitably chosen perturbations to images, imperceptible to humans, can lead to dangerous errors, such as confusion of traffic signs. Overall, it is fair to say that much of the current practice (of solving i.i.d. benchmark problems) and most theoretical results (about generalization in i.i.d. settings) fail to tackle the hard open challenge of generalization across problems.
To further understand how the i.i.d. assumption is problematic, let us consider a shopping example. Suppose that Alice is looking for a laptop rucksack on the Internet (i.e., a rucksack with a padded compartment for a laptop). The web shop’s recommendation system suggests that she should buy a laptop to go along with the rucksack. This seems odd because she probably already has a laptop; otherwise, she would not be looking for the rucksack in the first place. In a way, the laptop is the cause, and the rucksack is an effect. Now, suppose that we are told whether a customer has bought a laptop. This reduces our uncertainty about whether she also bought a laptop rucksack, and vice versa—and it does so by the same amount (the mutual information), so the directionality of cause and effect is lost. However, the directionality is present in the physical mechanisms generating statistical dependence, for instance, the mechanism that makes a customer want to buy a rucksack once she owns a laptop.5 Recommending an item to buy constitutes an intervention in a system, taking us outside the i.i.d. setting. We no longer work with the observational distribution but a distribution where certain variables or mechanisms have changed.

5 Note that the physical mechanisms take place in time, and if available, time order may provide additional information about causality.

B. Reichenbach Principle: From Statistics to Causality
Reichenbach [198] clearly articulated the connection between causality and statistical dependence. He postulated the following:

Common cause principle: If two observables X and Y are statistically dependent, then there exists a variable Z that causally influences both and explains all the dependence in the sense of making them independent when conditioned on Z.

As a special case, this variable can coincide with X or Y. Suppose that X is the frequency of storks and Y the human birth rate. If storks bring the babies, then the correct causal graph is X → Y. If babies attract storks, it is X ← Y. If there is some other variable that causes both (such as economic development), we have X ← Z → Y.
Without additional assumptions, we cannot distinguish these three cases using observational data. The class of observational distributions over X and Y that can be realized by these models is the same in all three cases. A causal model, thus, contains genuinely more information than a statistical one.
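A small numerical illustration of the common cause principle (a sketch with made-up coefficients, not taken from the article): if some variable Z, say economic development, drives both the stork frequency X and the birth rate Y, then X and Y are correlated, but the dependence disappears once Z is controlled for, here crudely by regressing both on Z and correlating the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Common cause structure X <- Z -> Y with illustrative coefficients.
Z = rng.normal(size=n)                 # e.g., economic development
X = 2.0 * Z + rng.normal(size=n)       # e.g., stork frequency
Y = -1.5 * Z + rng.normal(size=n)      # e.g., birth rate

def residual(target, covariate):
    # Least-squares regression of target on covariate; return the residuals.
    slope, intercept = np.polyfit(covariate, target, 1)
    return target - (slope * covariate + intercept)

print("corr(X, Y)     =", np.corrcoef(X, Y)[0, 1])  # clearly nonzero
print("corr(X, Y | Z) =",
      np.corrcoef(residual(X, Z), residual(Y, Z))[0, 1])  # close to zero
```

With suitably chosen parameters, the purely observational statistics of (X, Y) produced this way could equally well have come from X → Y or X ← Y, which is exactly why the three cases cannot be told apart from such data alone.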
While causal structure discovery is hard if we have only two observables [190], the case of more observables is surprisingly easier, the reason being that, in that case, there are nontrivial conditional independence properties [52], [75], [238] implied by causal structure.


These generalize the Reichenbach principle and can be described by using the language of causal graphs or structural causal models (SCMs), merging probabilistic graphical models and the notion of interventions [183], [237]. They are best described using directed functional parent–child relationships rather than conditionals. While conceptually simple in hindsight, this constituted a major step in the understanding of causality.

C. Structural Causal Models
The SCM viewpoint considers a set of observables (or variables) X1, . . . , Xn associated with the vertices of a directed acyclic graph (DAG). We assume that each observable is the result of an assignment

Xi := fi(PAi, Ui),  (i = 1, . . . , n)    (3)

using a deterministic function fi depending on Xi’s parents in the graph (denoted by PAi) and on an unexplained random variable Ui. Mathematically, the observables are, thus, random variables, too. Directed edges in the graph represent direct causation since the parents are connected to Xi by directed edges and, through (3), directly affect the assignment of Xi. The noise Ui ensures that the overall object (3) can represent a general conditional distribution P(Xi |PAi), and the set of noises U1, . . . , Un is assumed to be jointly independent. If they were not, then, by the common cause principle, there should be another variable that causes their dependence, and thus, our model would not be causally sufficient.
If we specify the distributions of U1, . . . , Un, recursive application of (3) allows us to compute the entailed observational joint distribution P(X1, . . . , Xn). This distribution has structural properties inherited from the graph [147], [183]: it satisfies the causal Markov condition stating that, conditioned on its parents, each Xj is independent of its nondescendants.
Intuitively, we can think of the independent noises as “information probes” that spread through the graph (much like independent elements of gossip can spread through a social network). Their information gets entangled, manifesting itself in a footprint of conditional dependencies, making it possible to infer aspects of the graph structure from observational data using independence testing. Like in the gossip analogy, the footprint may not be sufficiently characteristic to pin down a unique causal structure. In particular, it certainly is not if there are only two observables since any nontrivial conditional independence statement requires at least three variables. The two-variable problem can be addressed by making additional assumptions, as not only the graph topology leaves a footprint in the observational distribution, but the functions fi do, too. This point is interesting for machine learning, where much attention is devoted to properties of function classes (e.g., priors or capacity measures), and we shall return to it below.

1) Causal Graphical Models: The graph structure along with the joint independence of the noises implies a canonical factorization of the joint distribution entailed by (3) into causal conditionals that we refer to as the causal (or disentangled) factorization

P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | PAi).    (4)

While many other entangled factorizations are possible, for example,

P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | Xi+1, . . . , Xn)    (5)

the factorization (4) yields practical computational advantages during inference, which is, in general, hard, even when it comes to nontrivial approximations [210]. But more interestingly, it is the only one that decomposes the joint distribution into conditionals corresponding to the structural assignments [see (3)]. We think of these as the causal mechanisms that are responsible for all statistical dependencies among the observables. Accordingly, in contrast to (5), the disentangled factorization represents the joint distribution as a product of causal mechanisms.
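To make (3) and (4) concrete, here is a minimal sketch (a hypothetical three-variable SCM with arbitrary functions, not from the article) in which each variable is assigned from its parents and an independent noise term; drawing samples by applying the assignments in topological order (ancestral sampling) is exactly sampling from the causal factorization (4).

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scm(n):
    """Ancestral sampling for the toy SCM X1 -> X2 -> X3.

    Each line is a structural assignment X_i := f_i(PA_i, U_i)
    with jointly independent noises U_i, as in eq. (3)."""
    U1 = rng.normal(size=n)
    U2 = rng.normal(size=n)
    U3 = rng.normal(size=n)
    X1 = U1                                # X1 := f1(U1)
    X2 = np.tanh(2.0 * X1) + 0.5 * U2      # X2 := f2(X1, U2)
    X3 = 0.8 * X2 ** 2 + 0.5 * U3          # X3 := f3(X2, U3)
    return X1, X2, X3

X1, X2, X3 = sample_scm(10_000)
# The entailed joint distribution factorizes causally as
# P(X1, X2, X3) = P(X1) P(X2 | X1) P(X3 | X2), cf. eq. (4).
print(np.corrcoef(np.vstack([X1, X2, X3])))
```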
2) Latent Variables and Confounders: Variables in a causal graph may be unobserved, which can make causal inference particularly challenging. Unobserved variables may confound two observed variables so that they either appear statistically related while not being causally related (i.e., neither of the variables is an ancestor of the other), or their statistical relation is altered by the presence of the confounder (e.g., one variable is a causal ancestor for the other, but the confounder is a causal ancestor of both). Confounders may or may not be known or observed.

3) Interventions: The SCM language makes it straightforward to formalize interventions as operations that modify a subset of assignments (3), for example, changing Ui, setting fi (and thus Xi) to a constant, or changing the functional form of fi (and, thus, the dependence of Xi on its parents) [183], [237].
Several types of interventions may be possible [63], which can be categorized as follows.
1) No intervention: Only observational data are obtained from the causal model.
2) Hard/perfect: The function in the structural assignment [see (3)] of a variable (or, analogously, of multiple variables) is set to a constant (implying that the value of the variable is fixed), and then, the entailed distribution for the modified SCM is computed.
3) Soft/imperfect: The structural assignment (3) for a variable is modified by changing the function or the noise term (this corresponds to changing the conditional distribution given its parents).
4) Uncertain: The learner is not sure which mechanism/variable is affected by the intervention.


Fig. 1. Difference between statistical (left) and causal models (right) on a given set of three variables. While a statistical model specifies a
single probability distribution, a causal model represents a set of distributions, one for each possible intervention (indicated with a ).

One could argue that stating the structural assignments as in (3) is not yet sufficient to formulate a causal model. In addition, one should specify the set of possible interventions on the SCM. This may be done implicitly via the functional form of structural equations by allowing any intervention over the domain of the mechanisms. This becomes relevant when learning a causal model from data, as the SCM depends on the interventions. Pragmatically, we should aim at learning causal models that are useful for specific sets of tasks of interest [208], [266] on appropriate descriptors (in terms of which causal statements they support) that must either be provided or learned. We will return to the assumptions that allow learning causal models and features in Section IV.

D. Difference Between Statistical Models, Causal Graphical Models, and SCMs
An example of the difference between a statistical and a causal model is depicted in Fig. 1. A statistical model may be defined, for instance, through a graphical model, that is, a probability distribution along with a graph such that the former is Markovian with respect to the latter [in which case it can be factorized as (4)]. However, the edges in a (generic) graphical model do not need to be causal [98]. For instance, the two graphs X1 → X2 → X3 and X1 ← X2 ← X3 imply the same conditional independence(s) (X1 and X3 are independent given X2). They are, thus, in the same Markov equivalence class, that is, if a distribution is Markovian with respect to one of the graphs, then it also is with respect to the other graph. Note that the above serves as an example that the Markov condition is not sufficient for causal discovery. Further assumptions are needed (see below and [183], [188], and [237]).
A graphical model becomes causal if the edges of its graph are causal (in which case the graph is referred to as a “causal graph”) [see (3)]. This allows us to compute interventional distributions, as depicted in Fig. 1. When a variable is intervened upon, we disconnect it from its parents, fix its value, and perform ancestral sampling on its children.
An SCM is composed of: 1) a set of causal variables and 2) a set of structural equations with a distribution over the noise variables Ui (or a set of causal conditionals). While both causal graphical models and SCMs allow computing interventional distributions, only the SCMs allow computing counterfactuals. To compute counterfactuals, we need to fix the value of the noise variables. Moreover, there are many ways to represent a conditional as a structural assignment (by picking different combinations of functions and noise variables).
Causal learning and reasoning: The conceptual basis of statistical learning is a joint distribution P(X1, . . . , Xn) (where, often, one of the Xi is a response variable denoted as Y), and we make assumptions about function classes used to approximate, say, a regression E[Y |X]. Causal learning considers a richer class of assumptions and seeks to exploit the fact that the joint distribution possesses a causal factorization [see (4)]. It involves the causal conditionals P(Xi | PAi) [e.g., represented by the functions fi and the distribution of Ui in (3)], how these conditionals relate to each other, and interventions or changes that they admit. Once a causal model is available, either by external human knowledge or a learning process, causal reasoning allows drawing conclusions on the effect of interventions, counterfactuals, and potential outcomes. In contrast, statistical models only allow reasoning about the outcome of i.i.d. experiments.

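The following sketch (a toy two-edge SCM with illustrative functions, not the article’s code) contrasts the three operations discussed above: observational sampling, a hard intervention do(X2 = 1.0) implemented by cutting X2 off from its parent, and a counterfactual obtained by first fixing (abducting) the noise values of an observed unit and then replaying the modified assignments with those noises held constant.

```python
import numpy as np

rng = np.random.default_rng(2)

def run(u1, u2, u3, do_x2=None):
    # Structural assignments of a toy SCM X1 -> X2 -> X3.
    x1 = u1
    x2 = np.tanh(2.0 * x1) + 0.5 * u2 if do_x2 is None else do_x2
    x3 = 0.8 * x2 + 0.5 * u3
    return x1, x2, x3

# Observational sample: draw the noises and apply the assignments.
u1, u2, u3 = rng.normal(size=3)
factual = run(u1, u2, u3)

# Interventional: do(X2 = 1.0); fresh noises, X2 ignores its parent.
interventional = run(*rng.normal(size=3), do_x2=1.0)

# Counterfactual for the observed unit: keep (u1, u2, u3) fixed ("abduction")
# and replay the modified SCM -- "what would X3 have been, had X2 been 1.0?"
counterfactual = run(u1, u2, u3, do_x2=1.0)

print("factual:        ", factual)
print("interventional: ", interventional)
print("counterfactual: ", counterfactual)
```

Note how the counterfactual keeps X1 at its factual value (its noise is held fixed), whereas the interventional sample redraws everything; this is the operational difference between the two queries that a causal graphical model alone cannot express.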

IV. INDEPENDENT CAUSAL MECHANISMS
We now return to the disentangled factorization [see (4)] of the joint distribution P(X1, . . . , Xn). This factorization according to the causal graph is always possible when the Ui are independent, but we will now consider an additional notion of independence relating the factors in (4) to one another.
Whenever we perceive an object, our brain assumes that the object and the mechanism by which the information contained in its light reaches our brain are independent. We can violate this by looking at the object from an accidental viewpoint, which can give rise to optical illusions [188]. The above independence assumption is useful because, in practice, it holds most of the time, and our brain, thus, relies on objects being independent of our vantage point and the illumination. Likewise, there should not be accidental coincidences, such as 3-D structures lining up in 2-D, or shadow boundaries coinciding with texture boundaries. In vision research, this is called the generic viewpoint assumption.
If we move around the object, our vantage point changes, but we assume that the other variables of the overall generative process (e.g., lighting, object position, and structure) are unaffected by that. This is an invariance implied by the above independence, allowing us to infer 3-D information even without stereo vision (“structure from motion”).
For another example, consider a data set that consists of altitude A and average annual temperature T of weather stations [188]. A and T are correlated, which we believe is due to the fact that altitude has a causal effect on temperature. Suppose that we had two such data sets: one for Austria and one for Switzerland. The two joint distributions P(A, T) may be rather different since the marginal distributions P(A) over altitudes will differ. The conditionals P(T |A), however, may be (close to) invariant since they characterize the physical mechanisms that generate temperature from altitude. This similarity is lost upon us if we only look at the overall joint distribution, without information about the causal structure A → T. The causal factorization P(A)P(T |A) will contain a component P(T |A) that generalizes across countries, while the entangled factorization P(T)P(A|T) will exhibit no such robustness. Cum grano salis, the same applies when we consider interventions in a system. For a model to correctly predict the effect of interventions, it needs to be robust to generalizing from an observational distribution to certain interventional distributions.
One can express the above insights as follows [188], [220]:

ICM principle: The causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms.

This principle entails several notions important to causality, including separate intervenability of causal variables, modularity and autonomy of subsystems, and invariance [183], [188]. If we have only two variables, it reduces to independence between the cause distribution and the mechanism producing the effect distribution.
Applied to the causal factorization [see (4)], the principle tells us that the factors should be independent in the sense that the following holds.
1) Changing (or performing an intervention upon) one mechanism P(Xi |PAi) does not change any of the other mechanisms P(Xj |PAj) (i ≠ j) [220].
2) Knowing some other mechanisms P(Xi |PAi) (i ≠ j) does not give us information about a mechanism P(Xj |PAj) [124].
This notion of independence, thus, subsumes two aspects: the former pertaining to influence and the latter to information.
The notion of invariant, autonomous, and independent mechanisms has appeared in various guises throughout the history of causality research [72], [100], [111], [124], [183], [188], [240]. Early work on this was done by Haavelmo [100], stating the assumption that changing one of the structural assignments leaves the other ones invariant. Hoover [111] attributed to Herb Simon the invariance criterion: the true causal order is the one that is invariant under the right sort of intervention. Aldrich [4] discussed the historical development of these ideas in economics. He argued that the “most basic question one can ask about a relation should be: how autonomous is it?” [72, preface]. Pearl [183] discussed autonomy in detail, arguing that a causal mechanism remains invariant when other mechanisms are subjected to external influences. He pointed out that causal discovery methods may best work “in longitudinal studies conducted under slightly varying conditions, where accidental independencies are destroyed and only structural independencies are preserved.” Overviews are provided by Aldrich [4], Hoover [111], Pearl [183], and Peters et al. [188, Section 2.2]. These seemingly different notions can be unified [124], [240].
We view any real-world distribution as a product of causal mechanisms. A change in such a distribution (e.g., when moving from one setting/domain to a related one) will always be due to changes in at least one of those mechanisms. Consistent with implication 1) of the ICM principle, we state the following hypothesis:

SMS: Small distribution changes tend to manifest themselves in a sparse or local way in the causal/disentangled factorization [see (4)], that is, they should usually not affect all factors simultaneously.

In contrast, if we consider a noncausal factorization, for example, (5), then many, if not all, terms will be affected simultaneously as we change one of the physical mechanisms responsible for a system’s statistical dependencies. Such a factorization may, thus, be called entangled, a term that has gained popularity in machine learning [24], [110], [158], [247].
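A small simulation (hypothetical numbers, in the spirit of the altitude–temperature example above) can illustrate both the ICM principle and the SMS hypothesis: two “countries” differ only in the marginal P(A) over altitude (a sparse mechanism shift), while the mechanism P(T|A) is shared, so a regression fit in the causal direction transfers across environments, whereas the anticausal regression does not.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_country(n, altitude_mean, altitude_std):
    # Only the altitude marginal P(A) differs between countries;
    # the mechanism P(T | A) (roughly -6.5 degrees per km) is shared.
    A = rng.normal(altitude_mean, altitude_std, size=n)    # altitude in km
    T = 15.0 - 6.5 * A + rng.normal(0.0, 3.0, size=n)      # temperature in deg C
    return A, T

A_at, T_at = sample_country(50_000, altitude_mean=0.8, altitude_std=0.3)  # "Austria"
A_ch, T_ch = sample_country(50_000, altitude_mean=1.5, altitude_std=0.9)  # "Switzerland"

# Causal direction: the slope of T on A is (nearly) identical in both environments.
print("T|A slope, Austria:    ", np.polyfit(A_at, T_at, 1)[0])
print("T|A slope, Switzerland:", np.polyfit(A_ch, T_ch, 1)[0])

# Anticausal direction: the slope of A on T shifts with the altitude marginal.
print("A|T slope, Austria:    ", np.polyfit(T_at, A_at, 1)[0])
print("A|T slope, Switzerland:", np.polyfit(T_ch, A_ch, 1)[0])
```

In this sketch only one factor of the causal factorization P(A)P(T|A) changes between environments, whereas both factors of the entangled factorization P(T)P(A|T) change, mirroring the contrast between (4) and (5) drawn above.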


The SMS hypothesis was stated in [25], [114], [180], and [217] and in earlier form in [219], [220], and [281]. An intellectual ancestor is Simon’s invariance criterion, that is, that the causal structure remains invariant across changing background conditions [235]. The hypothesis is also related to ideas of looking for features that vary slowly [70], [270]. It has recently been used for learning causal models [131], modular architectures [29], [85], and disentangled representations [159].
We have informally talked about the dependence of two mechanisms P(Xi |PAi) and P(Xj |PAj) when discussing the ICM principle and the disentangled factorization [see (4)]. Note that the dependence of two such mechanisms does not coincide with the statistical dependence of the random variables Xi and Xj. Indeed, in a causal graph, many of the random variables will be dependent even if all mechanisms are independent. Also, the independence of the noise terms Ui does not translate into the independence of the Xi. Intuitively speaking, the independent noise terms Ui provide and parameterize the uncertainty contained in the fact that a mechanism P(Xi |PAi) is nondeterministic6 and, thus, ensure that each mechanism adds an independent element of uncertainty. In this sense, the ICM principle contains the independence of the unexplained noise terms in an SCM [see (3)] as a special case.

6 In the sense that the mapping from PAi to Xi is described by a nontrivial conditional distribution, rather than by a function.

In the ICM principle, we have stated that independence of two mechanisms (formalized as conditional distributions) should mean that the two conditional distributions do not inform or influence each other. The latter can be thought of as requiring that independent interventions are possible. To better understand the former, we next discuss a formalization in terms of algorithmic independence. In a nutshell, we encode each mechanism as a bit string and require that joint compression of these strings does not save space relative to independent compressions.
To this end, first recall that we have, so far, discussed links between causal and statistical structures. Of the two, the more fundamental one is the causal structure since it captures the physical mechanisms that generate statistical dependencies in the first place. The statistical structure is an epiphenomenon that follows if we make the unexplained variables random. It is awkward to talk about statistical information contained in a mechanism since deterministic functions in the generic case neither generate nor destroy information. This serves as a motivation to devise an alternative model of causal structures in terms of the Kolmogorov complexity [124]. The Kolmogorov complexity (or algorithmic information) of a bit string is essentially the length of its shortest compression on a Turing machine and, thus, a measure of its information content. Independence of mechanisms can be defined as vanishing mutual algorithmic information, that is, two conditionals are considered independent if knowing (the shortest compression of) one does not help us achieve a shorter compression of the other.
The algorithmic information theory provides a natural framework for nonstatistical graphical models [120], [124]. Just like the latter is obtained from SCMs by making the unexplained variables Ui random, we obtain algorithmic graphical models by making the Ui bit strings, jointly independent across nodes, and viewing Xi as the output of a fixed Turing machine running the program Ui on the input PAi. Similar to the statistical case, one can define a local causal Markov condition, a global one in terms of d-separation, and an additive decomposition of the joint Kolmogorov complexity in analogy to (4), and prove that they are implied by the SCM [124]. Interestingly, in this case, independence of noises and independence of mechanisms coincide since the independent programs play the role of the unexplained noise terms. This approach shows that causality is not intrinsically bound to statistics.

V. CAUSAL DISCOVERY AND MACHINE LEARNING
Let us turn to the problem of causal discovery from data. Subject to suitable assumptions, such as faithfulness [237], one can sometimes recover aspects of the underlying graph7 from observational data by performing conditional independence tests. However, there are several problems with this approach. One is that our data sets are always finite in practice, and conditional independence testing is a notoriously difficult problem, especially if conditioning sets are continuous and multidimensional. Thus, while, in principle, the conditional independencies implied by the causal Markov condition hold irrespective of the complexity of the functions appearing in an SCM, for finite data sets, conditional independence testing is hard without additional assumptions [225]. Recent progress in (conditional) independence testing heavily relies on kernel function classes to represent probability distributions in reproducing kernel Hilbert spaces [43], [61], [74], [91], [92], [193], [280]. The other problem is that, in the case of only two variables, the ternary concept of conditional independence collapses and the Markov condition, thus, has no nontrivial implications.

7 One can recover the causal structure up to a Markov equivalence class, where DAGs have the same undirected skeleton and “immoralities” (Xi → Xj ← Xk).
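As a minimal, admittedly simplified stand-in for the kernel-based tests mentioned above (a sketch using partial correlation on a linear-Gaussian chain, not from the article): for data generated from X → Y → Z, the marginal dependence between X and Z is strong, but it essentially vanishes once Y is conditioned on, which is the kind of footprint that constraint-based causal discovery exploits.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Linear-Gaussian chain X -> Y -> Z.
X = rng.normal(size=n)
Y = 1.5 * X + rng.normal(size=n)
Z = -0.7 * Y + rng.normal(size=n)

def residual(target, covariate):
    # Regress target on covariate and return residuals; correlating
    # residuals is a crude proxy for a conditional independence test.
    slope, intercept = np.polyfit(covariate, target, 1)
    return target - (slope * covariate + intercept)

print("corr(X, Z)     =", np.corrcoef(X, Z)[0, 1])                          # far from 0
print("corr(X, Z | Y) =", np.corrcoef(residual(X, Y), residual(Z, Y))[0, 1])  # ~ 0
```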


in terms of measures of complexity of function classes structure is assumed to remain invariant. Hence, distri-
[56], [257]. bution shifts, such as observing a system in different
Returning to causality, we provide an intuition why “environments/contexts,” can significantly help to identify
assumptions on the functions in an SCM should be nec- causal structure [188], [251]. These contexts can come
essary to learn about them from data. Consider a toy SCM from interventions [187], [191], [220], nonstationary
with only two observables X → Y . In this case, (3) turns time series [101], [116], [192], or multiple views [90],
into [114]. The contexts can likewise be interpreted as differ-
X = U (6)
Y = f(X, V) (7)

with U ⊥⊥ V. Now, think of V acting as a random selector variable choosing from among a set of functions F = {fv(x) ≡ f(x, v) | v ∈ supp(V)}. If f(x, v) depends on v in a nonsmooth way, it should be hard to glean information about the SCM from a finite data set, given that V is not observed and its value randomly selects among arbitrarily different fv.

This motivates restricting the complexity with which f depends on V. A natural restriction is to assume an additive noise model

X = U (8)
Y = f(X) + V. (9)

If f in (7) depends smoothly on V, and if V is relatively well concentrated, this can be motivated by a local Taylor expansion argument. It drastically reduces the effective size of the function class—without such assumptions, the latter could depend exponentially on the cardinality of the support of V. Restrictions of function classes not only make it easier to learn functions from data, but it turns out that they can break the symmetry between cause and effect in the two-variable case: one can show that, given a distribution over X, Y generated by an additive noise model, one cannot fit an additive noise model in the opposite direction (i.e., with the roles of X and Y interchanged) [18], [113], [139], [175], [190] (see also [246]). This is subject to certain genericity assumptions, and notable exceptions include the case where U and V are Gaussian and f is linear. It generalizes results of Shimizu et al. [229] for linear functions, and it can be generalized to include nonlinear rescalings [279], loops [174], confounders [123], and multivariable settings [189]. Empirically, there is a number of methods that can detect causal direction better than chance [176], some of them building on the above Kolmogorov complexity model [37], some on generative models [83], and some directly learning to classify bivariate distributions into causal versus anticausal [161].
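To make the additive noise idea concrete, the following is a minimal sketch (not the algorithm of any particular reference) of bivariate cause–effect inference: we fit a nonparametric regression in both directions and prefer the direction in which the residual is closer to being independent of the input. The kernel ridge regressor and the rank-correlation proxy for independence are illustrative assumptions; a practical implementation would use a kernel independence test such as HSIC.

```python
# Sketch of additive-noise-model-based cause-effect inference for two variables.
import numpy as np
from scipy.stats import spearmanr
from sklearn.kernel_ridge import KernelRidge

def residual_dependence(x, y):
    """Fit y = f(x) + noise nonparametrically; return |corr(residual, x)| as a
    crude stand-in for an independence test."""
    model = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0)
    model.fit(x.reshape(-1, 1), y)
    residual = y - model.predict(x.reshape(-1, 1))
    rho, _ = spearmanr(x, residual)
    return abs(rho)

def anm_direction(x, y):
    """Return 'X->Y' if the additive noise model fits better in that direction."""
    score_xy = residual_dependence(x, y)   # dependence left over for X -> Y
    score_yx = residual_dependence(y, x)   # dependence left over for Y -> X
    return "X->Y" if score_xy < score_yx else "Y->X"

# Toy data generated from an additive noise model X -> Y.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=500)
y = np.tanh(2 * x) + 0.2 * rng.normal(size=500)
print(anm_direction(x, y))  # expected to print 'X->Y' in most runs
```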
While restrictions of function classes are one possibility to allow identifying the causal structure, other assumptions or scenarios are possible. So far, we have discussed that causal models are expected to generalize under certain distribution shifts since they explicitly model interventions. By the SMS hypothesis, much of the causal structure is assumed to remain invariant across such shifts, and the corresponding contexts can likewise be interpreted as different tasks, which provides a connection to metalearning [23], [68], [213].

The work of Bengio et al. [25] ties the generalization in metalearning to invariance properties of causal models, using the idea that a causal model should adapt faster to interventions than purely predictive models. This was extended to multiple variables and unknown interventions in [131], proposing a framework for causal discovery using neural networks by turning the discrete graph search into a continuous optimization problem. While Bengio et al. [25] and Ke et al. [131] focused on learning a causal model using neural networks with an unsupervised loss, the work of Dasgupta et al. [51] explores learning a causal model using an RL agent. These approaches have in common that semantically meaningful abstract representations are given and do not need to be learned from high-dimensional and low-level (e.g., pixel) data.

VI. LEARNING CAUSAL VARIABLES

Traditional causal discovery and reasoning assume that the units are random variables connected by a causal graph. However, real-world observations are usually not structured into those units to begin with, for example, objects in images [162]. Hence, the emerging field of causal representation learning strives to learn these variables from data, much like machine learning went beyond symbolic AI in not requiring that the symbols that algorithms manipulate be given a priori (see [34]). To this end, we could try to connect causal variables S1, . . . , Sn to observations

X = G(S1, . . . , Sn) (10)

where G is a nonlinear function. An example can be seen in Fig. 2, where high-dimensional observations are the result of a view on the state of a causal system that is then processed by a neural network to extract high-level variables that are useful on a variety of tasks. Although causal models in economics, medicine, or psychology often use variables that are abstractions of underlying quantities, it is challenging to state general conditions under which coarse-grained variables admit causal models with well-defined interventions [42], [208]. Defining objects or variables that can be causally related amounts to coarse-graining of more detailed models of the world, including microscopic structural equation models [208], ordinary differential equations [173], [207], and temporally aggregated time series [79]. The task of identifying suitable units that admit causal models is challenging for both human and machine intelligence.
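As a toy illustration of (10), the following sketch samples causal variables from a small, assumed SCM (S1 → S2, with mechanisms and mixing G chosen arbitrarily for illustration) and maps them into a higher dimensional, entangled observation; the learning problem is then to recover the causal variables and their structure from X alone.

```python
# Toy data-generating process in the spirit of (10): X = G(S1, S2).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
s1 = rng.normal(size=n)                        # S1 := U1
s2 = np.sin(s1) + 0.3 * rng.normal(size=n)     # S2 := f2(S1, U2)
S = np.stack([s1, s2], axis=1)

W = rng.normal(size=(2, 10))
X = np.tanh(S @ W)                             # high-dimensional, entangled view of S

print(X.shape)  # (1000, 10); causal representation learning would aim to recover
                # (functions of) S and their causal structure from X alone
```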


Fig. 2. Illustration of the causal representation learning problem setting. Perceptual data, such as images or other high-dimensional
sensor measurements, can be thought of as entangled views of the state of an unknown causal system, as described in (10). With the
exception of possible task labels, none of the variables describing the causal variables generating the system may be known. The goal of
causal representation learning is to learn a representation (partially) exposing this unknown causal structure (e.g., which variables describe
the system, and their relations). As full recovery may often be unreasonable, neural networks may map the low-level features to some
high-level variables supporting causal statements relevant to a set of downstream tasks of interest. For example, if the task is to detect the
manipulable objects in a scene, the representation may separate intrinsic object properties from their pose and appearance to achieve
robustness to distribution shifts on the latter variables. Usually, we do not get labels for the high-level variables, but the properties of
causal models can serve as useful inductive biases for learning (e.g., the SMS hypothesis).

Still, it aligns with the general goal of modern machine learning to learn meaningful representations of data, where meaningful can include robust, explainable, or fair [130], [134], [142], [259], [275].

To combine structural causal modeling [see (3)] and representation learning, we should strive to embed an SCM into larger machine learning models whose inputs and outputs may be high-dimensional and unstructured, but whose inner workings are at least partly governed by an SCM (that can be parameterized with a neural network). The result may be a modular architecture, where the different modules can be individually fine-tuned and repurposed for new tasks [85], [180], and the SMS hypothesis can be used to enforce the appropriate structure. We visualize an example in Fig. 3 where changes are sparse for the appropriate causal variables (the position of the finger and the cube changed as a result of moving the finger) but dense in other representations, for example, in the pixel space (as finger and cube move, many pixels change their value). At the extreme, all pixels may change as a result of a sparse intervention, for example, if the camera view or the lighting changes.

Fig. 3. Example of the SMS hypothesis where an intervention (which may or may not be intentional/observed) changes the position of one finger, and as a consequence, the object falls. The change in pixel space is entangled (or distributed), in contrast to the change in the causal model.
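The following toy sketch (our own illustration, with an arbitrary two-variable state and a crude renderer) makes the point of Fig. 3 numerically: an intervention that changes the causal state sparsely nevertheless changes many pixels of the observation.

```python
# Sparse change in causal space vs. dense change in pixel space.
import numpy as np

def render(state, size=16):
    """Very crude 'renderer': draws the finger and the cube as bright squares."""
    finger_x, cube_x = state
    img = np.zeros((size, size))
    img[2:5, finger_x:finger_x + 3] = 1.0
    img[10:13, cube_x:cube_x + 3] = 1.0
    return img

state = np.array([3, 3])        # causal variables: finger position, cube position
intervened = np.array([9, 9])   # pushing the finger also moves the cube

changed_vars = (state != intervened).sum()
changed_pixels = (render(state) != render(intervened)).sum()
print(changed_vars, "causal variables changed,", changed_pixels, "pixels changed")
```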
We now discuss three problems of modern machine learning in the light of causal representation learning.

A. Problem 1—Learning Disentangled Representations

We have earlier discussed the ICM principle implying both the independence of the SCM noise terms in (3) and, thus, the feasibility of the disentangled representation

P(S1, . . . , Sn) = ∏_{i=1}^{n} P(Si | PAi) (11)

as well as the property that the conditionals P(Si | PAi) are independently manipulable and largely invariant across related problems. Suppose that we seek to reconstruct such a disentangled representation using independent mechanisms [see (11)] from data, but the causal variables Si are not provided to us a priori. Rather, we are given (possibly high-dimensional) X = (X1, . . . , Xd) (in the following, we think of X as an image with pixels X1, . . . , Xd), as in (10), from


which we should construct causal variables S1, . . . , Sn (n ≪ d) as well as mechanisms [see (3)]

Si := fi(PAi, Ui) (i = 1, . . . , n) (12)

modeling the causal relationships among Si. To this end, as a first step, we can use an encoder q : Rd → Rn taking X to a latent "bottleneck" representation comprising the unexplained noise variables U = (U1, . . . , Un). The next step is the mapping f(U) determined by the structural assignments f1, . . . , fn. Finally, we apply a decoder p : Rn → Rd. For suitable n, the system can be trained using reconstruction error to satisfy p ◦ f ◦ q ≈ id on the observed images. If the causal graph is known, the topology of a neural network implementing f can be fixed accordingly; if not, the neural network decoder learns the composition p̃ = p ◦ f. In practice, one may not know f and, thus, only learn an autoencoder p̃ ◦ q, where the causal graph effectively becomes an unspecified part of the decoder p̃, possibly aided by a suitable choice of architecture [149].
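A minimal sketch of this construction is given below, assuming a simple chain graph over the causal variables purely for illustration: an encoder q produces noise variables U, an SCM layer implements the structural assignments f, and a decoder p reconstructs the observation, trained with reconstruction error so that p ◦ f ◦ q ≈ id. The architecture sizes and the graph are placeholder assumptions, not prescriptions.

```python
# Sketch of an autoencoder with an SCM in the bottleneck.
import torch
import torch.nn as nn

class SCMLayer(nn.Module):
    """Structural assignments S_i = f_i(PA_i, U_i) for a fixed chain graph."""
    def __init__(self, n_vars):
        super().__init__()
        # one small mechanism per variable; mechanism i sees all previous S_j and U_i
        self.mechanisms = nn.ModuleList(
            [nn.Sequential(nn.Linear(i + 1, 16), nn.ReLU(), nn.Linear(16, 1))
             for i in range(n_vars)]
        )

    def forward(self, u):                      # u: (batch, n_vars)
        s = []
        for i, f_i in enumerate(self.mechanisms):
            parents = torch.cat(s, dim=1) if s else u[:, :0]
            s.append(f_i(torch.cat([parents, u[:, i:i + 1]], dim=1)))
        return torch.cat(s, dim=1)             # (batch, n_vars)

class CausalAutoencoder(nn.Module):
    def __init__(self, d_obs=784, n_vars=8):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(d_obs, 256), nn.ReLU(), nn.Linear(256, n_vars))
        self.f = SCMLayer(n_vars)
        self.p = nn.Sequential(nn.Linear(n_vars, 256), nn.ReLU(), nn.Linear(256, d_obs))

    def forward(self, x):
        u = self.q(x)          # unexplained noise variables
        s = self.f(u)          # causal variables
        return self.p(s)       # reconstruction

model = CausalAutoencoder()
x = torch.rand(32, 784)                          # stand-in for flattened images
loss = nn.functional.mse_loss(model(x), x)       # reconstruction error
loss.backward()
```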
Much of the existing work on disentanglement [62], [110], [135], [157]–[159], [202], [256] focuses on independent factors of variation. This can be viewed as the special case where the causal graph is trivial, that is, ∀i : PAi = ∅ in (12). In this case, the factors are functions of the independent exogenous noise variables and, thus, independent themselves.8 However, the ICM principle is more general and contains statistical independence as a special case.

8 For an example to see why this is often not desirable, note that the presence of fork and knife may be statistically dependent, yet we might want a disentangled representation to represent them as separate entities.

Note that the problem of object-centric representation learning [11], [40], [84], [87], [88], [138], [155], [160], [255], [262] can also be considered a special case of disentangled factorization as discussed here. Objects are constituents of scenes that in principle permit separate interventions. A disentangled representation of a scene containing objects should probably use objects as some of the building blocks of an overall causal factorization,9 complemented by mechanisms, such as orientation, viewing direction, and lighting.

9 Objects can be represented at different levels of granularity [208], that is, as a single entity or as a composition of other causal variables encoding parts, properties, and other factors of variation.

The problem of recovering the exogenous noise variables is ill-defined in the i.i.d. case as there are infinitely many equivalent solutions yielding the same observational distribution [117], [158], [188]. Additional assumptions or biases can help favoring certain solutions over others [158], [205]. Leeb et al. [149] propose a structured decoder that embeds an SCM and automatically learns a hierarchy of disentangled factors.

To make (12) causal, we can use the ICM principle, that is, we should make Ui statistically independent, and we should make the mechanisms independent. This could be done by ensuring that they are invariant across problems, exhibit sparse changes to actions, or that they can be independently intervened upon [22], [30], [217]. Locatello et al. [159] showed that the SMS hypothesis stated above is theoretically sufficient when given suitable training data. Furthermore, the SMS hypothesis can be used as supervision signal, in practice, even if PAi = ∅ [252]. However, which factors of variation can be disentangled depends on which interventions can be observed [159], [230]. As discussed by Schölkopf et al. [219] and Shu et al. [230], different supervision signals may be used to identify subsets of factors. Similarly, when learning causal variables from data, which variables can be extracted and their granularity depends on which distribution shifts, explicit interventions, and other supervision signals are available.

B. Problem 2—Learning Transferable Mechanisms

An artificial or natural agent in a complex world is faced with limited resources. This concerns training data, that is, we only have limited data for each task/domain, and, thus, need to find ways of pooling/reusing data, in stark contrast to the current industry practice of large-scale labeling work done by humans. It also concerns computational resources: animals have constraints on the size of their brains, and evolutionary neuroscience knows many examples where brain regions get repurposed. Similar constraints on size and energy apply as ML methods get embedded in (small) physical devices that may be battery-powered. Future AI models that robustly solve a range of problems in the real world will, thus, likely need to reuse components, which requires them to be robust across tasks and environments [219]. An elegant way to do this is to employ a modular structure that mirrors corresponding modularity in the world. In other words, if the world is indeed modular, in the sense that components/mechanisms of the world play roles across a range of environments, tasks, and settings, then it would be prudent for a model to employ corresponding modules [85]. For instance, if variations of natural lighting (the position of the sun, clouds, and so on) imply that the visual environment can appear in brightness conditions spanning several orders of magnitude, then visual processing algorithms in our nervous system should employ methods that can factor out these variations, rather than building separate sets of face recognizers, say, for every lighting condition. If, for example, our nervous system were to compensate for the lighting changes by a gain control mechanism, then this mechanism in itself need not have anything to do with the physical mechanisms bringing about brightness differences. However, it would play a role in a modular structure that corresponds to the role that the physical mechanisms play in the world's modular structure. This could produce a bias toward models that exhibit certain forms of structural homomorphism to a world that we cannot directly recognize, which would be rather intriguing, given that ultimately our brains do nothing but turn neuronal signals into other neuronal


signals. A sensible inductive bias to learn such models is to look for ICMs [182], and competitive training can play a role in this. For pattern recognition tasks, Parascandolo et al. [180] and Goyal et al. [85] suggested that learning causal models that contain independent mechanisms may help in transferring modules across substantially different domains.

C. Problem 3—Learning Interventional World Models and Reasoning

Deep learning excels at learning representations of data that preserve relevant statistical properties [24], [148]. However, it does so without taking into account the causal properties of the variables, that is, it does not care about the interventional properties of the variables that it analyzes or reconstructs. Causal representation learning should move beyond the representation of statistical dependence structures toward models that support intervention, planning, and reasoning, realizing Konrad Lorenz' notion of thinking as acting in an imagined space [163]. This ultimately requires the ability to reflect back on one's actions and envision alternative scenarios, possibly necessitating (the illusion of) free will [184]. The biological function of self-consciousness may be related to the need for a variable representing oneself in one's Lorenzian imagined space, and free will may then be a means to communicate about actions taken by that variable, crucial for social and cultural learning, a topic that has not yet entered the stage of machine learning research although it is at the core of human intelligence [108].

VII. IMPLICATIONS FOR MACHINE LEARNING

All these discussions call for a learning paradigm that does not rest on the usual i.i.d. assumption. Instead, we wish to make a weaker assumption that the data on which the model will be applied comes from a possibly different distribution but involving (mostly) the same causal mechanisms [188]. This raises serious challenges: 1) in many cases, we need to infer abstract causal variables from the available low-level input features; 2) there is no consensus on which aspects of the data reveal causal relations; 3) the usual experimental protocol of training and test set may not be sufficient for inferring and evaluating causal relations on existing data sets, and we may need to create new benchmarks, for example, with access to environmental information and interventions; 4) even in the limited cases that we understand, we often lack scalable and numerically sound algorithms. Despite these challenges, we argue that this endeavor has concrete implications for machine learning and may shed light on desiderata and current practices alike.

A. Semisupervised Learning

Suppose that our underlying causal graph is X → Y, and at the same time, we are trying to learn a mapping X → Y. The causal factorization (4) for this case is

P(X, Y) = P(X)P(Y|X). (13)

The ICM principle posits that the modules in a joint distribution's causal decomposition do not inform or influence each other. This means that, in particular, P(X) should contain no information about P(Y|X), which implies that SSL should be futile, in as far as it is using additional information about P(X) (from unlabelled data) to improve our estimate of P(Y|X = x).

In the opposite (anticausal) direction (i.e., the direction of prediction is opposite to the causal generative process), however, SSL may be possible. To see this, we refer to Daniušis et al. [50] who define a measure of dependence between input P(X) and conditional P(Y|X).10 Assuming that this measure is zero in the causal direction (applying the ICM assumption described in Section IV to the two-variable case), they show that it is strictly positive in the anticausal direction. Applied to SSL in the anticausal direction, this implies that the distribution of the input (now: effect) variable should contain information about the conditional output (cause) given input, that is, the quantity that machine learning is usually concerned with.

10 Other dependence measures have been proposed for high-dimensional linear settings and time series [28], [119], [121], [122], [126], [226].

The study [220] empirically corroborated these predictions, thus establishing an intriguing bridge between the structure of learning problems and certain physical properties (cause–effect direction) of real-world data generating processes. It also led to a range of follow-up work [32], [78], [97], [114], [115], [152], [153], [156], [167], [195], [204], [243], [263], [267], [277], [278], [281], complementing the studies of Bareinboim and Pearl [14], [185], and it inspired a thread of work in the statistics community exploiting invariance for causal discovery and other tasks [105], [106], [114], [187], [191].

On the SSL side, subsequent developments include further theoretical analyses [125], [188, Section 5.1.2] and a form of conditional SSL [261]. The view of SSL as exploiting dependencies between a marginal P(X) and a noncausal conditional P(Y|X) is consistent with the common assumptions employed to justify SSL [45]. The cluster assumption asserts that the labeling function [which is a property of P(Y|X)] should not change within clusters of P(X). The low-density separation assumption posits that the area where P(Y|X) takes the value of 0.5 should have small P(X); the semisupervised smoothness assumption, applicable also to continuous outputs, states that if two points in a high-density region are close, then so should be the corresponding output values. Note, moreover, that some of the theoretical results in the field use assumptions well-known from causal graphs (even if they do not mention causality): the cotraining theorem [33] makes a statement about learnability from unlabelled data and


relies on an assumption of predictors being conditionally independent given the label, which we would normally expect if the predictors are (only) caused by the label, that is, an anticausal setting. This is nicely consistent with the above findings.
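As a minimal illustration of SSL exploiting structure in P(X), the sketch below propagates a handful of labels through the clusters of a toy data set; the data, kernel, and hyperparameters are arbitrary choices. In an anticausal problem, where P(X) may carry information about P(Y|X), this is exactly the kind of information such methods exploit.

```python
# Cluster-assumption SSL on toy data: labels spread within high-density regions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelPropagation

X, y = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=0)
y_partial = np.full_like(y, -1)        # -1 marks unlabeled points
y_partial[:5] = y[:5]                  # only five labeled examples

model = LabelPropagation(kernel="rbf", gamma=0.5).fit(X, y_partial)
print("accuracy on all points:", (model.transduction_ == y).mean())
```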
B. Adversarial Vulnerability

One can hypothesize that the causal direction should also have an influence on whether classifiers are vulnerable to adversarial attacks. These attacks have recently become popular and consist of minute changes to inputs, invisible to a human observer yet changing a classifier's output [249]. This is related to causality in several ways. First, these attacks clearly constitute violations of the i.i.d. assumption that underlies statistical machine learning. If all we want to do is a prediction in an i.i.d. setting, then statistical learning is fine. In the adversarial setting, however, the modified test examples are not drawn from the same distribution as the training examples. The adversarial phenomenon also shows that the kind of robustness current classifiers exhibit is rather different from the one a human exhibits. If we knew both robustness measures, we could try to maximize one, while minimizing the other. Current methods can be viewed as crude approximations to this, effectively modeling the human's robustness as a mathematically simple set, say, an lp ball of radius ε > 0: they often try to find examples that lead to maximal changes in the classifier's output, subject to the constraint that they lie in an lp ball in the pixel metric. As we think of a classifier as the approximation of a function, the large gradients exploited by these attacks are either a property of this function or a defect of the approximation.
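The following is a minimal sketch of the kind of attack described above: a single gradient step that increases the loss while keeping the perturbation inside an l∞ ball. The model and data are placeholders, and practical attacks typically iterate this step.

```python
# One-step (FGSM-style) l_inf-ball perturbation.
import torch
import torch.nn as nn

def fgsm_attack(model, x, y, epsilon=0.03):
    """Perturb x within an l_inf ball of radius epsilon to increase the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    # step in the direction that locally maximizes the loss, then clip to [0, 1]
    return (x + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))   # toy classifier
x = torch.rand(8, 1, 28, 28)                               # stand-in images
y = torch.randint(0, 10, (8,))
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max())   # perturbation stays within the epsilon ball
```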
There are different ways of relating this to causal models. As described in [188, Section 1.4], different causal models can generate the same statistical pattern recognition model. In one of those, we might provide a writer with a sequence of class labels y, with the instruction to produce a set of corresponding images x. It is clear that intervening on y will impact x, but intervening on x will not impact y, so this is an anticausal learning problem. In another setting, we might ask the writer to decide for herself which digits to write and to record the labels alongside the digit (in this case, the classifier would try to predict one effect from another one, a situation that we might call a confounded one). In the last one, we might provide images to a person and ask the person to generate labels by classifying them.

Let us now assume that we are in the causal setting where the causal generative model factorizes into independent components, one of which is (essentially) the classification function. As discussed in Section III, when specifying a causal model, one needs to determine which interventions are allowed, and a structural assignment will then, by definition, be valid under every possible (allowed) intervention. One may, thus, expect that if the predictor approximates the causal mechanism that is inherently transferable and robust, adversarial examples should be harder to find [133], [216].11 Recent work supports this view: it was shown that a possible defense against adversarial attacks is to solve the anticausal classification problem by modeling the causal generative direction, a method that, in vision, is referred to as analysis by synthesis [222]. A related defense method proceeds by reconstructing the input using an autoencoder before feeding it to a classifier [96].

11 Adversarial attacks may still exploit the quality of the (parameterized) approximation of a structural equation.

C. Robustness and Strong Generalization

We can speculate that structures composed of autonomous modules, such as given by a causal factorization [see (4)], should be relatively robust to swapping out or modifying individual components. Robustness should also play a role when studying strategic behavior, that is, decisions or actions that take into account the actions of other agents (including AI agents). Consider a system that tries to predict the probability of successfully paying back a credit, based on a set of features. The set could include, for instance, the current debt of a person, as well as their address. To get a higher credit score, people could, thus, change their current debt (by paying it off), or they could change their address by moving to a more affluent neighborhood. The former probably has a positive causal impact on the probability of paying back; for the latter, this is less likely. Thus, we could build a scoring system that is more robust with respect to such strategic behavior by only using causal features as inputs [132].

To formalize this general intuition, one can consider a form of out-of-distribution generalization, which can be optimized by minimizing the empirical risk over a class of distributions induced by a causal model of the data [5], [169], [187], [204], [220]. To describe this notion, we start by recalling the usual empirical risk minimization setup. We have access to data from a distribution P(X, Y) and train a predictor g in a hypothesis space H (e.g., a neural network with a certain architecture predicting Y from X) to minimize the empirical risk R̂:

g∗ = argmin_{g∈H} R̂_{P(X,Y)}(g) (14)

where

R̂_{P(X,Y)}(g) = Ê_{P(X,Y)}[loss(Y, g(X))]. (15)

Here, we denote by Ê_{P(X,Y)} the empirical mean computed from a sample drawn from P(X, Y). When we refer to "out-of-distribution generalization," we mean having a


small expected risk for a different distribution P†(X, Y):

R^OOD_{P†(X,Y)}(g) = E_{P†(X,Y)}[loss(Y, g(X))]. (16)

It is clear that the gap between R̂_{P(X,Y)}(g) and R^OOD_{P†(X,Y)}(g) will depend on how different the test distribution P† is from the training distribution P. To quantify this difference, we call environments the collection of different circumstances that give rise to the distribution shifts, such as locations, times, and experimental conditions. Environments can be modeled in a causal factorization [see (4)] as they can be seen as interventions on one or several causal variables or mechanisms. As a motivating example, one environment may correspond to where a measurement is taken (e.g., a certain room), and from each environment, we obtain a collection of measurements (images of objects in the same room). It is nontrivial (and, in some cases, provably hard [21]) to learn statistical models that are stable across training environments and generalize to novel testing environments [2], [5], [167], [187], [204] drawn from the same environment distribution.

Using causal language, one could restrict P†(X, Y) to be the result of a certain set of interventions, that is, P†(X, Y) ∈ PG, where PG is a set of interventional distributions over a causal graph G. The worst case out-of-distribution risk then becomes

R^OOD_{PG}(g) = max_{P†∈PG} E_{P†(X,Y)}[loss(Y, g(X))]. (17)

To learn a robust predictor, we should have available a subset of environment distributions E ⊂ PG and solve

g∗ = argmin_{g∈H} max_{P†∈E} Ê_{P†(X,Y)}[loss(Y, g(X))]. (18)

In practice, solving (18) requires specifying a causal model with an associated set of interventions. If the set of observed environments E does not coincide with the set of possible environments PG, we have an additional estimation error that may be arbitrarily large in the worst case [5], [21].
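A minimal sketch of optimizing (18) with a finite set of observed environments is given below: each environment is a toy interventional distribution (here, a different slope), and each update minimizes the risk of the currently worst environment. The environments, model, and hyperparameters are synthetic placeholders.

```python
# Worst-case empirical risk minimization over a set of training environments.
import torch
import torch.nn as nn

def make_env(slope, n=256):
    """Toy environment: Y = slope * X + noise; the slope acts as an intervention."""
    x = torch.randn(n, 1)
    y = slope * x + 0.1 * torch.randn(n, 1)
    return x, y

envs = [make_env(s) for s in (1.0, 0.8, 1.2)]          # observed environments E
model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(200):
    risks = [nn.functional.mse_loss(model(x), y) for x, y in envs]
    worst = torch.stack(risks).max()                    # max over P† in E
    opt.zero_grad()
    worst.backward()                                    # minimize the worst-case risk
    opt.step()
```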
carefully constructing pretext tasks. A central challenge
D. Pretraining, Data Augmentation, and here is to extract features that are indeed informative
Self-Supervision about the data-generating distribution. Ideas from the ICM
Learning predictive models solving the min–max opti- principle could help develop methods that can automate
mization problem of (18) is challenging. We now interpret the process of constructing pretext tasks. Finally, one can
several common techniques in machine learning as means explicitly optimize (18), for example, through adversarial
of approximating (18). training [80]. In that case, PG would contain a set of
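The relaxation described above can be sketched as follows: rather than a maximum over all augmented distributions, each step samples one transformation (crop, flip, rotation) and minimizes the expected loss on the transformed batch. The model, data, and transformation parameters below are placeholders chosen for illustration.

```python
# Augmentation as sampling from a set of interventions.
import torch
import torch.nn as nn
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomCrop(28, padding=2),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
])

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 1, 28, 28)              # stand-in images
y = torch.randint(0, 10, (16,))

for step in range(10):
    x_aug = augment(x)                      # sample one intervention per batch
    loss = nn.functional.cross_entropy(model(x_aug), y)
    opt.zero_grad(); loss.backward(); opt.step()
```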
The third approach is to rely on self-supervision to learn about P(X). Certain pretraining methods [36], [46], [55], [112], [196], [253] have shown that it is possible to achieve good results using only very few class labels by first pretraining on a large unlabeled data set and then fine-tuning on few labeled examples. Similarly, pretraining on large unlabeled image data sets can improve performance by learning representations that can efficiently transfer to a downstream task, as demonstrated by Bachman et al. [8], Chen et al. [47], Grill et al. [93], He et al. [102], and Oord et al. [179]. These methods fall under the umbrella of self-supervised learning, a family of techniques for converting an unsupervised learning problem into a supervised one by using the so-called pretext tasks with artificially generated labels without human annotations. The basic idea behind using pretext tasks is to force the learner to learn representations that contain information about P(X) that may be useful for (an unknown) downstream task. Much of the work on methods that use self-supervision relies on carefully constructing pretext tasks. A central challenge here is to extract features that are indeed informative about the data-generating distribution. Ideas from the ICM principle could help develop methods that can automate the process of constructing pretext tasks. Finally, one can explicitly optimize (18), for example, through adversarial training [80]. In that case, PG would contain a set of attacks that an adversary might perform, while, presently, we consider a set of natural interventions.

An interesting research direction is the combination of all these techniques: large-scale training, data augmentation, self-supervision, and robust fine-tuning on the available data from multiple, potentially simulated environments.


E. Reinforcement Learning

RL is closer to causality research than the machine learning mainstream in which it sometimes effectively directly estimates do-probabilities. For example, on-policy learning estimates do-probabilities for the interventions specified by the policy (note that these may not be hard interventions if the policy depends on other variables). However, as soon as off-policy learning is considered, in particular, in the batch (or observational) setting [146], issues of causality become subtle [82], [165]. An emerging line of work devoted to the intersection of RL and causality includes [1], [13], [22], [38], [51], [165], [276]. Causal learning applied to RL can be divided into two aspects: causal induction and causal inference. Causal induction (discovery) involves learning causal relations from data, for example, an RL agent learning a causal model of the environment. Causal inference learns to plan and act based on a causal model. Causal induction in an RL setting poses different challenges than the classic causal learning settings where the causal variables are often given. However, there is accumulating evidence supporting the usefulness of an appropriate structured representation of the environment [2], [27], [258].

1) World Models: Model-based RL [68], [248] is related to causality as it aims at modeling the effect of actions (interventions) on the current state of the world. Particularly relevant for causal learning are generative world models that capture some of the causal relations underlying the environment and serve as Lorenzian imagined spaces (see INTRODUCTION above) to train RL agents [48], [99], [127], [178], [214], [231], [248], [268], [271]. Structured generative approaches further aim at decomposing an environment into multiple entities with causally correct relations among them, modulo the completeness of the variables, and confounding [15], [44], [59], [136], [264], [265]. However, many of the current approaches (regardless of structure) only build partial models of the environment [89]. Since they do not observe the environment at every time step, the environment may become an unobserved confounder affecting both the agent's actions and the reward. To address this issue, a model can use the backdoor criterion conditioning on its policy [200].
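As a small numerical illustration of the backdoor criterion mentioned above (with made-up probabilities and a single binary confounder Z), the interventional distribution is obtained by averaging the conditional over the confounder's marginal rather than its conditional:

```python
# Backdoor adjustment with discrete variables:
# P(Y | do(X)) = sum_z P(Y | X, Z=z) P(Z=z).
import numpy as np

p_z = np.array([0.7, 0.3])                       # P(Z), synthetic
p_y_given_xz = np.array([[[0.9, 0.1],            # P(Y | X=0, Z=0), ...
                          [0.6, 0.4]],
                         [[0.5, 0.5],
                          [0.2, 0.8]]])          # indexed as [x, z, y]

def p_y_do_x(x):
    """Average over the confounder's marginal, not its conditional given X."""
    return np.einsum("zy,z->y", p_y_given_xz[x], p_z)

print("P(Y | do(X=1)) =", p_y_do_x(1))
print("P(Y | do(X=0)) =", p_y_do_x(0))
```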
2) Generalization, Robustness, and Fast Transfer: While RL has already achieved impressive results, the sample complexity required to achieve consistently good performance is often prohibitively high. Furthermore, RL agents are often brittle (if data is limited) in the face of even tiny changes to the environment (either visual or mechanistic changes) unseen in the training phase. The question of generalization in RL is essential to the field's future both in theory and practice. One proposed solution toward the goal of designing machines that can extrapolate experience across environments and tasks is to learn invariances in a causal graph structure. A key requirement to learn invariances from data may be the possibility to perform and learn from interventions. Work in developmental psychology argues that there is a need to experiment in order to discover causal relationships [81]. This can be modeled as an RL environment, where the agent can discover causal factors through interventions and observing their effects. Furthermore, causal models may allow modeling the environment as a set of underlying ICMs such that, if there is a change in distribution, not all the mechanisms need to be relearned. However, there are still open questions about the right way to think about generalization in RL, the right way to formalize the problem, and the most relevant tasks.

3) Counterfactuals: Counterfactual reasoning has been found to improve the data efficiency of RL algorithms [38], [164] and improve performance [51], and it has been applied to communicate about past experiences in the multiagent setting [69], [241]. These findings are consistent with work in cognitive psychology [65], arguing that counterfactuals allow to reason about the usefulness of past actions and transfer these insights to corresponding behavioral intentions in future scenarios [145], [199], [203].

We argue that future work in RL should consider counterfactual reasoning as a critical component to enable acting in imagined spaces and formulating hypotheses that can be subsequently tested with suitably chosen interventions.

4) Off-Line RL: The success of deep learning methods in the case of supervised learning can be largely attributed to the availability of large data sets and methods that can scale to large amounts of data. In the case of RL, collecting large amounts of high-fidelity diverse data from scratch can be expensive and, hence, becomes a bottleneck. Off-line RL [73], [150] tries to address this concern by learning a policy from a fixed data set of trajectories, without requiring any experimental or interventional data (i.e., without any interaction with the environment). The effective use of observational data (or logged data) may make real-world RL more practical by incorporating diverse prior experiences. To succeed at it, an agent should be able to infer the consequence of different sets of actions compared to those seen during training (i.e., the actions in the logged data), which essentially makes it a counterfactual inference problem. The distribution mismatch between the current policy and the policy that was used to collect off-line data makes off-line RL challenging as this requires us to move well beyond the assumption of independently and identically distributed data. Incorporating invariances by factorizing knowledge in terms of ICMs can help make progress toward the off-line RL setting.

F. Scientific Applications

A fundamental question in the application of machine learning in natural sciences is to which extent we


can complement our understanding of a physical system with machine learning. One interesting aspect is physics simulation with neural networks [94], which can substantially increase the efficiency of hand-engineered simulators [104], [143], [211], [265], [269]. Significant out-of-distribution generalization of learned physical simulators may not be necessary if experimental conditions are carefully controlled, although the simulator has to be completely retrained if the conditions change.

On the other hand, the lack of systematic experimental conditions may become problematic in other applications, such as health care. One example is personalized medicine, where we may wish to build a model of a patient health state through a multitude of data sources, such as electronic health records and genetic information [66], [109]. However, if we train a clinical system on doctors' actions in controlled settings, the system will likely provide little additional insight compared to the doctors' knowledge and may fail in surprising ways when deployed [19]. While it may be useful to automate certain decisions, an understanding of causality may be necessary to recommend treatment options that are personalized and reliable [3], [6], [31], [164], [201], [224], [242], [273].

Causality also has significant potential in helping understand medical phenomena, for example, in the current COVID-19 pandemic, where causal mediation analysis helps disentangle different effects contributing toward case fatality rates when a textbook example of Simpson's paradox was observed [260].

Another example of a scientific application is in astronomy, where causal models were used to identify exoplanets under the confounding of the instrument. Exoplanets are often detected as they partially occlude their host star when they transit in front of it, causing a slight decrease in brightness. Shared patterns in measurement noise across stars light-years apart can be removed in order to reduce the instrument's influence on the measurement [218], which is critical especially in the context of partial technical failures as experienced in the Kepler exoplanet search mission. The application of [218] leads to the discovery of 36 planet candidates [71], of which 21 were subsequently validated as bona fide exoplanets [172]. Four years later, astronomers found traces of water in the atmosphere of the exoplanet K2-18b—the first such discovery for an exoplanet in the habitable zone, that is, allowing for liquid water [26], [254]. This planet turned out to be one that had first been detected in [71, exoplanet candidate EPIC 201912552].

G. Multitask Learning and Continual Learning

State-of-the-art AI is relatively narrow, that is, trained to perform specific tasks, as opposed to the broad, versatile intelligence allowing humans to adapt to a wide range of environments and develop a rich set of skills. The human ability to discover robust, invariant high-level concepts and abstractions and to identify causal relationships from observations appears to be one of the key factors allowing for a successful generalization from prior experiences to new, often quite different, "out-of-distribution" settings.

Multitask learning refers to building a system that can solve multiple tasks across different environments [41], [209]. These tasks usually share some common traits. By learning similarities across tasks, a system could utilize the knowledge acquired from previous tasks more efficiently when encountering a new task. One possibility of learning such similarities across tasks is to learn a shared underlying data-generating process as a causal generative model whose components satisfy the SMS hypothesis [219]. In certain cases, causal models adapt faster to sparse interventions in distribution [131], [194].

At the same time, we have clearly come a long way already without explicitly treating the multitask problem as a causal one. Fuelled by abundant data and compute, AI has made remarkable advances in a wide range of applications, from image processing and natural language processing [36] to beating human world champions in games, such as chess, poker, and Go [223], improving medical diagnoses [166], and generating music [57]. A critical question thus arises: why cannot we just train a huge model that learns environments' dynamics (e.g., in an RL setting) including all possible interventions? After all, distributed representations can generalize to unseen examples, and if we train over a large number of interventions, we may expect that a big neural network will generalize across them.

To address this, we make several points. To begin with, if data were not sufficiently diverse (which is an untestable assumption a priori), the worst case error to unseen shifts may still be arbitrarily high (see Section VII-C). While, in the short term, we can often beat "out-of-distribution" benchmarks by training bigger models on bigger data sets, causality offers an important complement. The generalization capabilities of a model are tied to its assumptions (e.g., how the model is structured and how it was trained). The causal approach makes these assumptions more explicit and aligned with our understanding of physics and human cognition, for instance, by relying on the ICM principle. When these assumptions are valid, a learner that does not use them should fare worse than one that does. Furthermore, if we had a model that was successful in all interventions over a certain environment, we may want to use it in different environments that share similar albeit not necessarily identical dynamics. The causal approach and, in particular, the ICM principle, point to the need to decompose knowledge about the world into independent and recomposable pieces (recomposable depending on the interventions or changes in the environment), which suggests more work on modular ML architectures and other ways to enforce the ICM principle in future ML approaches.

At its core, i.i.d. pattern recognition is but a mathematical abstraction, and causality may be essential to most forms of animate learning. Up until now, machine learning


has neglected a full integration of causality, and this article argues that it would indeed benefit from integrating causal concepts. We argue that combining the strengths of both fields, that is, current deep learning methods and tools and ideas from causality, may be a necessary step on the path toward versatile AI systems.

VIII. CONCLUSION

In this work, we discussed different levels of models, including causal and statistical ones. We argued that this spectrum builds upon a range of assumptions, both in terms of modeling and data collection. In an effort to bring together causality and machine learning research programs, we first presented a discussion on the fundamentals of causal inference. Second, we discussed how the independent mechanism assumptions and related notions, such as invariance, offer a powerful bias for causal learning. Third, we discussed how causal relations might be learned from observational and interventional data when causal variables are observed. Fourth, we discussed the open problem of causal representation learning, including its relation to the recent interest in the concept of disentangled representations in deep learning. Finally, we discussed how some open research questions in the machine learning community may be better understood and tackled within the causal framework, including SSL, domain generalization, and adversarial robustness.

Based on this discussion, we list some critical areas for future research.

A. Learning Nonlinear Causal Relations at Scale

Not all real-world data are unstructured, and the effect of interventions can often be observed, for example, by stratifying the data collection across multiple environments. The approximation abilities of modern machine learning methods may prove useful to model nonlinear causal relations among large numbers of variables. For practical applications, classical tools are not only limited in the linearity assumptions often made, but also in their scalability. The paradigms of metalearning and multitask learning are close to the assumptions and desiderata of causal modeling, and future work should consider: 1) understanding under which conditions nonlinear causal relations can be learned; 2) which training frameworks allow to best exploit the scalability of machine learning approaches; and 3) providing compelling evidence on the advantages over (noncausal) statistical representations in terms of generalization, repurposing, and transfer of causal modules on real-world tasks.

B. Learning Causal Variables

"Disentangled" representations learned by state-of-the-art neural network methods are still distributed in the sense that they are represented in a vector format with an arbitrary ordering in the dimensions. This fixed format implies that the representation size cannot be dynamically changed; for example, we cannot change the number of objects in a scene. Furthermore, structured and modular representations should also arise when a network is trained for (sets of) specific tasks, not only autoencoding. Different high-level variables may be extracted depending on the task and affordances at hand. Understanding under which conditions causal variables can be recovered could provide insights into which interventions are robust to predictive tasks.

C. Understanding the Biases of Existing Deep Learning Approaches

Scaling to massive data sets and relying on data augmentation and self-supervision have all been successfully explored to improve the robustness of the predictions of deep learning models. It is nontrivial to disentangle the benefits of the individual components, and it is often unclear which "trick" should be used when dealing with a new task, even if we have an intuition about useful invariances. The notion of strong generalization over a specific set of interventions may be used to probe existing methods, training schemes, and data sets in order to build a taxonomy of inductive biases. In particular, it is desirable to understand how design choices in pretraining (e.g., which data sets/tasks) positively impact both transfer and robustness downstream in a causal sense.

D. Learning Causally Correct Models of the World and the Agent

In many real-world RL settings, abstract state representations are not available. Hence, the ability to derive abstract causal variables from high-dimensional, low-level pixel representations and then recover causal graphs is important for causal induction in real-world RL settings. Moreover, building a causal description for both a model of the agent and the environment (world models) should be essential for robust and versatile model-based RL.

Acknowledgment

The authors thank the past and present members of the Tübingen Causality Team, without whose work and insights this article would not exist, in particular, Dominik Janzing, Chaochao Lu, and Julius von Kügelgen, who gave helpful comments on [217]. The text has also benefitted from discussions with Elias Bareinboim, Christoph Bohle, Leon Bottou, Isabelle Guyon, Judea Pearl, and Vladimir Vapnik. The authors would like to thank Wouter van Amsterdam for pointing out typos in the first version. They also thank Thomas Kipf, Klaus Greff, and Alexander d'Amour for the useful discussions. Finally, they thank the thorough anonymous reviewers for highly valuable feedback and suggestions.


REFERENCES
[1] O. Ahmed et al., “Causalworld: A robotic in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), [42] K. Chalupka, P. Perona, and F. Eberhardt,
manipulation benchmark for causal structure and 2010, pp. 129–136. “Multi-level cause-effect systems,” 2015,
transfer learning,” in Proc. Int. Conf. Learn. [22] E. Bengio, V. Thomas, J. Pineau, D. Precup, and arXiv:1512.07942. [Online]. Available:
Represent., 2021. Y. Bengio, “Independently controllable features,” http://arxiv.org/abs/1512.07942
[2] OpenAI et al., “Solving Rubik’s cube with a robot 2017, arXiv:1703.07718. [Online]. Available: [43] K. Chalupka, P. Perona, and F. Eberhardt, “Fast
hand,” 2019, arXiv:1910.07113. [Online]. http://arxiv.org/abs/1703.07718 conditional independence test for vector variables
Available: http://arxiv.org/abs/1910.07113 [23] Y. Bengio, S. Bengio, and J. Cloutier, “Learning a with large sample sizes,” 2018, arXiv:1804.02747.
[3] A. Alaa and M. Schaar, “Limits of estimating synaptic learning rule,” in Proc. Seattle Int. Joint [Online]. Available:
heterogeneous treatment effects: Guidelines for Conf. Neural Netw. (IJCNN), vol. 2. IEEE, http://arxiv.org/abs/1804.02747
practical algorithm design,” in Proc. Int. Conf. Jul. 1991, p. 969. [44] M. B. Chang, T. Ullman, A. Torralba, and
Mach. Learn., 2018, pp. 129–138. [24] Y. Bengio, A. Courville, and P. Vincent, J. B. Tenenbaum, “A compositional object-based
[4] J. Aldrich, “Autonomy,” Oxford Econ. Papers, “Representation learning: A review and new approach to learning physical dynamics,” in Proc.
vol. 41, no. 1, pp. 15–34, 1989. perspectives,” 2012, arXiv:1206.5538. [Online]. 5th Int. Conf. Learn. Represent. (ICLR), 2017.
[5] M. Arjovsky, L. Bottou, I. Gulrajani, and Available: http://arxiv.org/abs/1206.5538 [45] O. Chapelle, B. Schölkopf, and A. Zien, Eds.,
D. Lopez-Paz, “Invariant risk minimization,” 2019, [25] Y. Bengio et al., “A meta-transfer objective for Semi-Supervised Learning. Cambridge, MA, USA:
arXiv:1907.02893. [Online]. Available: learning to disentangle causal mechanisms,” MIT Press, 2006.
http://arxiv.org/abs/1907.02893 2019, arXiv:1901.10912. [Online]. Available: [46] M. Chen et al., “Generative pretraining from
[6] O. Atan, J. Jordon, and M. van der Schaar, http://arxiv.org/abs/1901.10912 pixels,” in Proc. 37th Int. Conf. Mach. Learn., 2020,
“Deep-treat: Learning optimal personalized [26] B. Benneke et al., “Water vapor on the pp. 1691–1703.
treatments from observational data using neural habitable-zone exoplanet K2-18b,” 2019, [47] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton,
networks,” in Proc. 32nd AAAI Conf. Artif. Intell., arXiv:1909.04642. [Online]. Available: “A simple framework for contrastive learning of
2018. https://arxiv.org/abs/1909.04642 visual representations,” 2020, arXiv:2002.05709.
[7] A. Azulay and Y. Weiss, “Why do deep [27] OpenAI et al., “Dota 2 with large scale deep [Online]. Available: http://arxiv.org/
convolutional networks generalize so poorly to reinforcement learning,” 2019, arXiv:1912.06680. abs/2002.05709
small image transformations?” J. Mach. Learn. [Online]. Available: http://arxiv.org/abs/ [48] S. Chiappa, S. Racaniere, D. Wierstra, and
Res., vol. 20, no. 184, pp. 1–25, 2019. 1912.06680 S. Mohamed, “Recurrent environment
[8] A. L. Rezaabad and S. Vishwanath, “Learning [28] M. Besserve, N. Shajarisales, B. Schölkopf, and simulators,” in Proc. 5th Int. Conf. Learn.
representations by maximizing mutual D. Janzing, “Group invariance principles for Represent. (ICLR), 2017.
information in variational autoencoders,” in Proc. causal generative models,” in Proc. 21st Int. Conf. [49] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and
IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2020, Artif. Intell. Statist. (AISTATS), 2018, pp. 557–565. Q. V. Le, “AutoAugment: Learning augmentation
pp. 15535–15545. [29] M. Besserve, R. Sun, D. Janzing, and B. Schölkopf, strategies from data,” in Proc. IEEE Conf. Comput.
[9] D. Bahdanau, S. Murty, M. Noukhovitch, “A theory of independent mechanisms for Vis. Pattern Recognit. (CVPR), Jun. 2019,
T. H. Nguyen, H. de Vries, and A. Courville, extrapolation in generative models,” in Proc. 35th pp. 113–123.
“Systematic generalization: What is required and AAAI Conf. Artif. Intell. Virtual Conf., Feb. 2021. [50] P. Daniušis et al., “Inferring deterministic causal
can it be learned,” 2018, arXiv:1811.12889. [30] M. Besserve, A. Mehrjou, R. Sun, and relations,” in Proc. 26th Annu. Conf. Uncertainty
[Online]. Available: https://arxiv.org/ B. Schölkopf, “Counterfactuals uncover the Artif. Intell. (UAI), 2010, pp. 143–150.
abs/1811.12889 modular structure of deep generative models,” [51] I. Dasgupta et al., “Causal reasoning from
[10] H. Baird, “Document image defect models,” in 2018, arXiv:1812.03253. [Online]. Available: meta-reinforcement learning,” 2019,
Proc. IAPR Workshop Syntactic Structural Pattern http://arxiv.org/abs/1812.03253 arXiv:1901.08162. [Online]. Available:
Recognit., Murray Hill, NJ, USA, 1990, [31] I. Bica, A. M. Alaa, and M. van der Schaar, “Time http://arxiv.org/abs/1901.08162
pp. 38–46. series deconfounder: Estimating treatment effects [52] A. P. Dawid, “Conditional independence in
[11] V. Bapst et al., “Structured agents for physical over time in the presence of hidden confounders,” statistical theory,” J. Roy. Stat. Soc. B, Stat.
construction,” in Proc. Int. Conf. Mach. Learn., 2019, arXiv:1902.00450. [Online]. Available: Methodol., vol. 41, no. 1, pp. 1–31, 1979.
2019, pp. 464–474. http://arxiv.org/abs/1902.00450 [53] S. Dehaene, How We Learn: Why Brains Learn
[12] A. Barbu et al., “Objectnet: A large-scale [32] P. Blöbaum, T. Washio, and S. Shimizu, “Error Better Than Any Machine ... for Now. Baltimore,
bias-controlled dataset for pushing the limits of asymmetry in causal and anticausal regression,” MD, USA: Penguin, 2020.
object recognition models,” in Proc. Adv. Neural 2016, arXiv:1610.03263. [Online]. Available: [54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and
Inf. Process. Syst., 2019, pp. 9448–9458. http://arxiv.org/abs/1610.03263 L. Fei-Fei, “ImageNet: A large-scale hierarchical
[13] E. Bareinboim, A. Forney, and J. Pearl, “Bandits [33] A. Blum and T. Mitchell, “Combining labeled and image database,” in Proc. IEEE Conf. Comput. Vis.
with unobserved confounders: A causal unlabeled data with co-training,” in Proc. 11th Pattern Recognit., Jun. 2009, pp. 248–255.