Causal Inference

* [**overview**](#overview)

* [**theory**](#theory)

* [**interesting papers**](#interesting-papers)

---

### overview

["Why Correlation Usually != Causation"](https://gwern.net/Causality) by Gwern Branwen

["Do we still need models or just more data and compute?"](https://staff.fnwi.uva.nl/m.welling/wp-content/uploads/Model-versus-Data-AI-1.pdf) by Max Welling

["ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"](http://inference.vc/untitled) by Ferenc Huszar

["Causal Inference 2: Illustrating Interventions via a Toy Example"](https://inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example) by Ferenc Huszar

["Causal Inference 3: Counterfactuals"](https://inference.vc/causal-inference-3-counterfactuals) by Ferenc Huszar

["Causal Data Science"](https://medium.com/@akelleh/causal-data-science-721ed63a4027) by Adam Kelleher:
- ["If Correlation Doesn’t Imply Causation, Then What Does?"](https://medium.com/@akelleh/if-correlation-doesnt-imply-causation-then-what-does-c74f20d26438)

- ["Understanding Bias: A Prerequisite For Trustworthy Results"](https://medium.com/@akelleh/understanding-bias-a-pre-requisite-for-trustworthy-results-ee590b75b1be)

- ["Speed vs. Accuracy: When Is Correlation Enough? When Do You Need Causation?"](https://medium.com/@akelleh/speed-vs-accuracy-when-is-correlation-enough-when-do-you-need-causation-708c8ca93753)

- ["A Technical Primer on Causality"](https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41)

- ["The Data Processing Inequality"](https://medium.com/@akelleh/the-data-processing-inequality-da242b40800b)

- ["Causal Graph Inference"](https://medium.com/@akelleh/causal-graph-inference-b3e3afd47110)

["If Correlation Doesn’t Imply Causation, then What Does?"](http://michaelnielsen.org/ddi/if-correlation-doesnt-imply-causation-then-what-does) by Michael Nielsen

["Latent Variables and Model Mis-specification"](https://jsteinhardt.wordpress.com/2017/01/10/latent-variables-and-model-mis-specification/) by Jacob Steinhardt

["Causality in Machine Learning"](http://unofficialgoogledatascience.com/2017/01/causality-in-machine-learning.html) by Muralidharan et al.

----
["The Seven Tools of Causal Inference with Reflections on Machine Learning"](https://dl.acm.org/citation.cfm?id=3241036) by Judea Pearl `paper` ([talk](https://youtube.com/watch?v=nWaM6XmQEmU) `video`)

["Theoretical Impediments to Machine Learning"](http://web.cs.ucla.edu/~kaoru/theoretical-impediments.pdf) by Judea Pearl `paper`

["Causality for Machine Learning"](https://arxiv.org/abs/1911.10500) by Bernhard Scholkopf `paper`

["Towards Causal Representation Learning"](https://arxiv.org/abs/2102.11107) by Scholkopf et al. `paper`

["On Pearl’s Hierarchy and the Foundations of Causal Inference"](https://causalai.net/r60.pdf) by Bareinboim, Correa, Ibeling, Icard `paper` ([talk](https://youtube.com/watch?v=fNuMHDrh6AY) `video`)

["Causality"](http://www.homepages.ucl.ac.uk/~ucgtrbd/papers/causality.pdf) by Ricardo Silva `paper`

["Introduction to Causal Inference"](http://jmlr.org/papers/volume11/spirtes10a/spirtes10a.pdf) by Peter Spirtes `paper`

["Graphical Causal Models"](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch22.pdf) by Cosma Shalizi `paper`

----

["The Book of Why: The New Science of Cause and Effect"](https://amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X) by Judea Pearl and Dana Mackenzie `book` ([overview](http://bayes.cs.ucla.edu/WHY/why-intro.pdf))

["Causal Inference in Statistics: A Primer"](https://books.google.co.uk/books/about/Causal_Inference_in_Statistics.html?id=IqCECwAAQBAJ) by Judea Pearl, Madelyn Glymour, Nicholas Jewell `book`

["Causality: Models, Reasoning, and Inference"](https://dropbox.com/s/m2m1935e6tohii9/Pearl%20-%20Causality%3A%20Models%2C%20Reasoning%2C%20and%20Inference.pdf) by Judea Pearl `book` ([epilogue](http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf))

["Elements of Causal Inference"](https://mitpress.mit.edu/books/elements-causal-inference) by Jonas Peters, Dominik Janzing, Bernhard Scholkopf `book`

["Causal Inference Book"](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/) by Miguel Hernan and James Robins `book`

----

[tutorial](https://youtube.com/watch?v=CTcQlRSnvvM) by Bernhard Scholkopf `video`

[tutorial](https://youtube.com/watch?v=zvrcyqcN9Wo) by Jonas Peters `video`

[tutorial](https://youtube.com/watch?v=_wFagI5Fn9I) by Jonas Peters `video`

[course](https://youtube.com/channel/UCbOJ2eEdvf2wOPrAmA72Gzg) by Brady Neal `video`

["Causal Inference in Everyday Machine Learning"](https://youtube.com/watch?v=HOgx_SBBzn0) tutorial by Ferenc Huszar `video`

["Causal Inference in Online Systems: Methods, Pitfalls and Best Practices"](https://mediasite.kellogg.northwestern.edu/Mediasite/Play/8e78dc83c6fb4d20abeeb18028a8f7071d?catalog=1533bdef-0c88-4513-ad97-5fce50c92e62) tutorial by Amit Sharma `video` ([slides](https://github.com/amit-sharma/causal-inference-tutorial))

["Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement"](http://www.cs.cornell.edu/~adith/CfactSIGIR2016/) tutorial by Thorsten Joachims and Adith Swaminathan `video`

["Counterfactual Reasoning and Massive Data Sets"](https://youtube.com/watch?v=s37cIYDOM6s) by Leon Bottou `video`

["Counterfactual Inference"](https://facebook.com/nipsfoundation/videos/1291139774361116) tutorial by Susan Athey `video`

["Causal Inference for Observational Studies"](http://techtalks.tv/talks/causal-inference-for-observational-studies/62355/) tutorial by David Sontag and Uri Shalit `video` ([slides](https://cs.nyu.edu/~shalit/slides.pdf))

["Connections between Causality and Machine Learning"](https://youtube.com/watch?v=9pm0eXuiTZs) by Jonas Peters `video`

----

["Science vs Data: Contesting the Soul of Data Science"](https://youtube.com/watch?v=X_1MG4ViVGM) by Judea Pearl `video`

["The Foundations of Causal Inference with Reflections on Machine Learning and Artificial Intelligence"](https://youtube.com/watch?v=nWaM6XmQEmU) by Judea Pearl `video`

["The New Science of Cause and Effect"](https://youtube.com/watch?v=ZaPV1OSEpHw) by Judea Pearl `video`

["The Mathematics of Causal Inference with Reflections on Machine Learning"](https://youtube.com/watch?v=bcRl7sXR1hE) by Judea Pearl `video`

["The Mathematics of Causal Inference, with Reflections on Machine Learning and the Logic of Science"](https://youtube.com/watch?v=zHjdd--W6o4) by Judea Pearl `video`

["On the Causal Foundations of AI (Explainability & Decision-Making)"](https://youtube.com/watch?v=fNuMHDrh6AY) by Elias Bareinboim `video`

["Causal Data Science: A General Framework for Data Fusion and Causal Inference"](https://youtube.com/watch?v=dUsokjG4DHc) by Elias Bareinboim `video`

"Towards Causal Reinforcement Learning" ([[1]](https://youtube.com/watch?v=QRTgLWfFBMM), [[2]](https://youtube.com/watch?v=2hGvd_9ho6s)) by Elias Bareinboim `video`

["Causal Reinforcement Learning"](https://youtube.com/watch?v=bwz3NpVfz6k) by Elias Bareinboim `video`

["Learning Causal Mechanisms"](https://facebook.com/iclr.cc/videos/2123421684353553?t=294) by Bernhard Scholkopf `video`

["The Role of Causality for Interpretability"](https://vimeo.com/252188186) by Bernhard Scholkopf `video`

["Causal Learning"](https://vimeo.com/238274659#t=13m22s) by Bernhard Scholkopf `video`

["Toward Causal Machine Learning"](https://youtube.com/watch?v=ooeRlw3U2zU) by Bernhard Scholkopf `video`

["Statistical and Causal Approaches to Machine Learning"](https://youtu.be/ek9jwRA2Jio?t=26m) by Bernhard Scholkopf `video`

["The Missing Signal"](https://youtube.com/watch?v=DfJeaa--xO0) by Leon Bottou `video`

["Learning Representations Using Causal Invariance"](https://facebook.com/722677142/posts/10155953319752143?t=714) by Leon Bottou `video`

----

[workshop](https://sites.google.com/view/nips2018causallearning) at NeurIPS 2018 ([videos](https://youtube.com/playlist?list=PLJscN9YDD1bu1dCKuXSV1qYmicx3g9t7A))

[symposium](https://why19.causalai.net) at AAAI 2019

---

### theory

Causal inference is the problem of uncovering cause-effect relations between the variables of a data-generating system. Causal structures provide an understanding of how the system will behave under changing and unseen environments. Knowledge of these causal dynamics makes it possible to answer "what if" questions, describing potential responses of the system under hypothetical manipulations and interventions.

What if some railways are closed, what will passengers do? What if we
incentivize members of a social network to propagate an idea, how
influential can they be? What if some genes in a cell are knocked-out, which
phenotypes can we expect? Such questions need to be addressed via a
combination of experimental and observational data, and require a careful
approach to modelling heterogeneous datasets and structural assumptions
concerning the causal relations among components of the system.

A causal model is a set of assumptions about the data-generating process which cannot be expressed as properties of the joint distribution of the observed variables.
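
To make this concrete, here is a minimal sketch (a toy example of mine, not taken from the references above): the two structural models below induce exactly the same joint distribution over (X, Y), so no property of that joint can distinguish them, yet they disagree about what happens under the intervention do(X = 2).

```python
# Toy illustration (assumptions mine): same joint distribution, different causal models.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Model A: X -> Y with X ~ N(0,1), Y := X + N(0,1)
x_a = rng.normal(size=n)
y_a = x_a + rng.normal(size=n)

# Model B: Y -> X, parameterized so the joint of (X, Y) matches model A:
# Y ~ N(0,2), X := Y/2 + N(0, 1/2)
y_b = rng.normal(scale=np.sqrt(2.0), size=n)
x_b = y_b / 2 + rng.normal(scale=np.sqrt(0.5), size=n)

print(np.cov(x_a, y_a))  # both covariance matrices are ~[[1, 1], [1, 2]]
print(np.cov(x_b, y_b))

# Under do(X = 2) the models disagree about Y:
y_do_a = 2.0 + rng.normal(size=n)                 # model A: Y responds to X
y_do_b = rng.normal(scale=np.sqrt(2.0), size=n)   # model B: Y ignores the intervention
print(y_do_a.mean(), y_do_b.mean())               # ~2.0 vs ~0.0
```
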
----

"In retrospect, my greatest challenge was to break away from probabilistic thinking and accept, first, that people are not probability thinkers but cause-effect thinkers and, second, that causal thinking cannot be captured in the language of probability; it requires a formal language of its own."

"What is more likely, that a daughter will have blue eyes given that her
mother has blue eyes or the other way around — that the mother will have
blue eyes given that the daughter has blue eyes? Most people will say the
former — they'll prefer the causal direction. But it turns out the two
probabilities are the same, because the number of blue-eyed people in every
generation remains stable. I took it as evidence that people think causally,
not probabilistically — they're biased by having easy access to causal
explanations, even though probability theory tells you something different.

There are many biases in our judgment that are created by our inclination
to attribute causal relationships where they do not belong. We see the world
as a collection of causal relationships and not as a collection of statistical or
associative relationships. Most of the time, we can get by, because they are
closely tied together. Once in a while we fail. The blue-eye story is an
example of such failure.

The slogan "Correlation doesn't imply causation" leads to many paradoxes. For instance, the size of a child's thumb is highly correlated with their reading ability. So, naively, if you want to be taller, you should learn to read better. This kind of paradoxical example convinces us that correlation does not imply causation. Still, people fall into that trap quite often because they crave causal explanations. The mind is a causal processor, not an association processor. Once you acknowledge that, the question remains how we reconcile the discrepancies between the two. How do we organize causal relationships in our mind? How do we operate on and update such a mental representation?"

"I now take causal relations as the fundamental building block both of physical reality and of human understanding of that reality, and I regard probabilistic relationships as but the surface phenomena of the causal machinery that underlies and propels our understanding of our world."
*(Judea Pearl)*

----

"If we examine the information that drives machine learning today, we find
that it is almost entirely statistical. In other words, learning machines
improve their performance by optimizing parameters over a stream of
sensory inputs received from the environment. It is a slow process,
analogous in many respects to the evolutionary survival-of-the-fittest process
that explains how species like eagles and snakes have developed superb
vision systems over millions of years. It cannot explain however the super-
evolutionary process that enabled humans to build eyeglasses and
telescopes over barely one thousand years. What humans possessed that
other species lacked was a mental representation, a blue-print of their
environment which they could manipulate at will to imagine alternative
hypothetical environments for planning and learning. Anthropologists like N.
Harari, and S. Mithen are in general agreement that the decisive ingredient
that gave our homo sapiens ancestors the ability to achieve global dominion,
about 40,000 years ago, was their ability to sketch and store a
representation of their environment, interrogate that representation, distort
it by mental acts of imagination and finally answer “What if?” kind of
questions. Examples are interventional questions: “What if I act?” and
retrospective or explanatory questions: “What if I had acted differently?” No
learning machine in operation today can answer such questions about
actions not taken before. Moreover, most learning machines today do not
utilize a representation from which such questions can be answered. We
postulate that the major impediment to achieving accelerated learning
speeds as well as human level performance can be overcome by removing
these barriers and equipping learning machines with causal reasoning tools.
This postulate would have been speculative twenty years ago, prior to the
mathematization of counterfactuals. Not so today. Advances in graphical and
structural models have made counterfactuals computationally manageable
and thus rendered metastatistical learning worthy of serious exploration."
"An extremely useful insight unveiled by the logic of causal reasoning is the
existence of a sharp classification of causal information, in terms of the kind
of questions that each class is capable of answering. The classification forms
a 3-level hierarchy in the sense that questions at level i can only be answered if information from level j (j ≥ i) is available."

- association P(y|x) - seeing (what is?)
  - How would seeing X change my belief in Y?
  - What does a symptom tell me about a disease?

- intervention P(y|do(x),z) - doing (what if?)
  - What if I do X?
  - What if I take aspirin, will my headache be cured?
  - What if we ban cigarettes?

- counterfactuals P(y_x|x',y') - imagining, retrospection (why?)
  - Was it X that caused Y?
  - What if I had acted differently?
  - Was it the aspirin that stopped my headache?
  - What if I had not been smoking the past 2 years?

"The first level, Association, invokes purely statistical relationships, defined by the naked data. For instance, observing a customer who buys toothpaste makes it more likely that he/she buys floss; such association can be inferred directly from the observed data using conditional expectation. Questions at this layer, because they require no causal information, are placed at the bottom level of the hierarchy.

The second level, Intervention, ranks higher than Association because it involves not just seeing what is, but changing what we see. A typical question at this level would be: What happens if we double the price? Such questions cannot be answered from sales data alone, because they involve a change in customers' behavior, in reaction to the new pricing. These choices may differ substantially from those taken in previous price-raising situations, unless we replicate precisely the market conditions that existed when the price reached double its current value.

The third level, Counterfactuals, is placed at the top of the hierarchy because they subsume interventional and associational questions. A typical question in the counterfactual category is "What if I had acted differently?", thus necessitating retrospective reasoning.

If we have a model that can answer counterfactual queries, we can also answer questions about interventions and observations. For example, the interventional question "What will happen if we double the price?" can be answered by asking the counterfactual question: "What would happen had the price been twice its current value?" Likewise, associational questions can be answered once we can answer interventional questions; we simply ignore the action part and let observations take over.

The translation does not work in the opposite direction. Interventional questions cannot be answered from purely observational information (i.e., from statistical data alone). No counterfactual question involving retrospection can be answered from purely interventional information, such as that acquired from controlled experiments; we cannot re-run an experiment on subjects who were treated with a drug and see how they would have behaved had they not been given the drug."

[*(Judea Pearl)*](http://web.cs.ucla.edu/~kaoru/theoretical-impediments.pdf)
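
A minimal sketch of the three levels on a toy structural causal model (the model, X := U_x and Y := X xor U_y, and all numbers are my own illustration, not Pearl's): association is read off the observed data, intervention re-runs the mechanism with X forced, and the counterfactual first infers each unit's noise from what was observed (abduction), then forces X (action) and recomputes Y (prediction).

```python
# Toy SCM:  U_x ~ Bernoulli(0.5),  U_y ~ Bernoulli(0.1),  X := U_x,  Y := X xor U_y.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
u_x = rng.random(n) < 0.5        # exogenous noise for X
u_y = rng.random(n) < 0.1        # exogenous noise for Y
x = u_x
y = x ^ u_y                      # mechanism Y := f(X, U_y)

# Level 1 - association: P(Y=1 | X=1), estimated from the observed data.
print(y[x == 1].mean())                                 # ~0.9

# Level 2 - intervention: P(Y=1 | do(X=1)), re-run the mechanism with X forced to 1.
y_do = np.ones(n, dtype=bool) ^ u_y
print(y_do.mean())                                      # ~0.9 (no confounding here)

# Level 3 - counterfactual: among units observed with X=0, Y=0, what would Y have
# been had X been 1?  Abduction (keep each unit's u_y), action (X:=1), prediction.
mask = (~x) & (~y)                                      # observed X=0, Y=0  =>  u_y=0
print((np.ones(mask.sum(), dtype=bool) ^ u_y[mask]).mean())   # ~1.0
```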

----

tuple (d1, d2, d3, d4) - (population, observational/experimental, sampling, measure)

(Los Angeles, experimental with randomized Z1, selection on Age, (X1, Z1, W, M, Y1))

(New York, observational, selection on SES, (X1, X2, Z1, N, Y2))

(Texas, experimental with randomized Z2, (X2, Z1, W, L, M, Y1))

*statistics - descriptive*:

(d1, samples(observations), d3, d4) -> (d1, distribution(observations), d3, d4) *(Bernoulli, Poisson, Kolmogorov)*

*statistics - experimental*:

(d1, samples(do(X)), d3, d4) -> (d1, distribution(do(X)), d3, d4) *(Fisher, Cox, Goodman)*

*causal inference from observational studies*:

(d1, distribution(observations), d3, d4) -> (d1, distribution(do(X)), d3, d4) *(Rubin, Robins, Dawid, Pearl)*

*experimental inference (generalized instrumental variables)*:

(d1, distribution(do(Z)), d3, d4) -> (d1, distribution(do(X)), d3, d4) *(P. Wright, S. Wright)*

*sampling selection bias*:

(d1, d2, select(Age), d4) -> (d1, d2, {}, d4) *(Heckman)*

*transportability (external validity)*:

(bonobos, d2, d3, d4) -> (humans, d2, d3, d4) *(Shadish, Cook, Campbell)*

[*(Elias Bareinboim)*](https://youtu.be/dUsokjG4DHc?t=8m13s)

----

"Under the probabilistic interpretation of causation from Pearl, the causal structure underlying a set of random variables X = (X1, ..., Xd), with joint distribution P, is often described in terms of a Directed Acyclic Graph, denoted by G = (V, E). In this graph, each vertex Vi ∈ V is associated to the random variable Xi ∈ X, and an edge Eji ∈ E from Vj to Vi denotes the causal relationship “Xi ← Xj”. More specifically, these causal relationships are defined by a structural equation model: each Xi ← fi(Pa(Xi), Ni), where fi is a function, Pa(Xi) is the parental set of Vi ∈ V, and Ni is some independent noise variable. Then, causal inference is the task of recovering G from S ∼ P^n."
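
A small sketch of this definition (the graph, mechanisms and noise scales below are invented for illustration): ancestral sampling over a topological order of G draws S ∼ P^n, and clamping a variable instead of evaluating its mechanism draws from the corresponding interventional distribution.

```python
# Minimal sketch (assumptions mine): sampling from a structural equation model.
import numpy as np

rng = np.random.default_rng(2)

# DAG over (X1, X2, X3):  X1 -> X2 -> X3,  X1 -> X3
parents = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"]}
f = {
    "X1": lambda pa, noise: noise,                              # X1 := N1
    "X2": lambda pa, noise: np.tanh(pa["X1"]) + 0.5 * noise,    # X2 := tanh(X1) + 0.5*N2
    "X3": lambda pa, noise: pa["X1"] - pa["X2"] ** 2 + 0.1 * noise,
}

def sample(n, do=None):
    """Ancestral sampling; `do` optionally clamps variables (an intervention)."""
    do = do or {}
    values = {}
    for v in ["X1", "X2", "X3"]:                 # a topological order of the DAG
        if v in do:
            values[v] = np.full(n, do[v], dtype=float)
        else:
            values[v] = f[v]({p: values[p] for p in parents[v]}, rng.normal(size=n))
    return values

obs = sample(10_000)                       # observational sample S ~ P^n
interv = sample(10_000, do={"X2": 1.0})    # sample from the interventional distribution
print(np.mean(obs["X3"]), np.mean(interv["X3"]))
```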

"Causal graph and the intervention types and targets may be (partially)
unknown. This is a realistic setting in many practical applications. For
example, in biology, many interventions that can be performed on organisms
are known to result in measurable downstream effects, but the exact
mechanism and direct intervention targets are unknown, and therefore it is
not clear whether the knowledge gained may be transferred to other species.
In pharmaceutical research, it is desirable to target the root causes of illness
directly and minimize side-effects; however, as the causal mechanisms are
often poorly understood, it is unclear what exactly a drug is doing and
whether the results of a particular study on a subpopulation of patients (say,
middle-aged males in the US) will generalize to other subpopulations (e.g.,
elderly women with dementia). In policy decisions, changing tax rules may
have different repercussions for different socio-economic classes, but the
exact workings of an economy can only be modeled to a certain extent.
Machine learning may help to make such predictions more data-driven, but
should then correctly take into account the transfer of distributions that
result from interventions and context changes. For prediction in the IID setting, imitating the exterior of a process is enough (i.e., one can disregard causal structure). Anything else can benefit from causal learning."

---

### interesting papers


[recent papers](http://deeplearningpatterns.com/doku.php?id=causal_analysis)

----

#### ["The Seven Tools of Causal Inference with Reflections on Machine Learning"](https://dl.acm.org/citation.cfm?id=3241036) Pearl

> "The dramatic success in machine learning has led to an explosion of artificial intelligence applications and increasing expectations for autonomous systems that exhibit human-level intelligence. These expectations have, however, met with fundamental obstacles that cut across many application areas. One such obstacle is adaptability, or robustness. Machine learning researchers have noted current systems lack the ability to recognize or react to new circumstances they have not been specifically programmed or trained for."

- `video` <https://youtube.com/watch?v=nWaM6XmQEmU> (Pearl)

#### ["Causality for Machine Learning"](https://arxiv.org/abs/1911.10500) Scholkopf

> "Graphical causal inference as pioneered by Judea Pearl arose from research on artificial intelligence, and for a long time had little connection to the field of machine learning. This article discusses where links have been and should be established, introducing key concepts along the way. It argues that the hard open problems of machine learning and AI are intrinsically related to causality, and explains how the field is beginning to understand them."

#### ["Causal Inference and the Data-fusion Problem"](https://pnas.org/content/113/27/7345) Bareinboim, Pearl

> "We review concepts, principles, and tools that unify current
approaches to causal analysis and attend to new challenges presented by
big data. In particular, we address the problem of data fusion - piecing
together multiple datasets collected under heterogeneous conditions (i.e.,
different populations, regimes, and sampling methods) to obtain valid
answers to queries of interest. The availability of multiple heterogeneous
datasets presents new opportunities to big data analysts, because the
knowledge that can be acquired from combined data would not be possible
from any individual source alone. However, the biases that emerge in
heterogeneous environments require new analytical tools. Some of these
biases, including confounding, sampling selection, and cross-population
biases, have been addressed in isolation, largely in restricted parametric
models. We here present a general, nonparametric framework for handling
these biases and, ultimately, a theoretical solution to the problem of data
fusion in causal inference tasks."

- `video` <https://youtube.com/watch?v=_cNbWuErsoI> (Bareinboim)

- `video` <https://youtube.com/watch?v=dUsokjG4DHc> (Bareinboim)

#### ["On Causal and Anticausal Learning"](https://arxiv.org/abs/1206.6471) Schoelkopf et al.

`ICML 2012`

> "We consider the problem of function estimation in the case where an
underlying causal model can be inferred. This has implications for popular
scenarios such as covariate shift, concept drift, transfer learning and semi-
supervised learning. We argue that causal knowledge may facilitate some
approaches for a given problem, and rule out others. In particular, we
formulate a hypothesis for when semi-supervised learning can help, and
corroborate it with empirical results."

- `video` <https://youtu.be/zo4oRqfMrgo?t=15m58s> (Lipton)


#### ["Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising"](https://arxiv.org/abs/1209.2355) Bottou et al.

> "This work shows how to leverage causal inference to understand the
behavior of complex learning systems interacting with their environment and
predict the consequences of changes to the system. Such predictions allow
both humans and algorithms to select the changes that would have
improved the system performance. This work is illustrated by experiments on
the ad placement system associated with the Bing search engine."

- `video` <https://youtube.com/watch?v=qmQceWeYg04> (Bottou)

- `video` <https://youtube.com/watch?v=W8k5KqYqVBw> (Bottou)

- `video` <https://youtube.com/watch?v=isGAY9ELqyo> (Bottou)

- `video` <https://youtu.be/_RtxTpOb8e4?t=52m6s> (Huszar)
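
A hedged sketch of the kind of counterfactual estimator this line of work is built on (the policies, click rates and numbers below are invented): rewards logged under a known randomized policy are reweighted with importance weights (inverse propensity scores) to predict how an alternative policy would have performed.

```python
# Sketch (assumptions mine): offline evaluation of a new policy from logged data.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
actions = np.array([0, 1, 2])

p_log = np.array([0.5, 0.3, 0.2])   # logging policy (known randomization probabilities)
p_new = np.array([0.2, 0.2, 0.6])   # candidate policy to be evaluated offline

a = rng.choice(actions, size=n, p=p_log)
reward = (rng.random(n) < np.array([0.02, 0.05, 0.10])[a]).astype(float)  # logged clicks

w = p_new[a] / p_log[a]                 # importance weights
ips = np.mean(w * reward)               # inverse-propensity estimate of the new policy's value
snips = np.sum(w * reward) / np.sum(w)  # self-normalized variant, usually lower variance
print(ips, snips)                       # both ~0.074 = 0.2*0.02 + 0.2*0.05 + 0.6*0.10
```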

#### ["Causal Bootstrapping"](https://arxiv.org/abs/1910.09648) Little, Badawy

> "To draw scientifically meaningful conclusions and build reliable engineering models of quantitative phenomena, statistical models must take
cause and effect into consideration (either implicitly or explicitly). This is
particularly challenging when the relevant measurements are not obtained
from controlled experimental (interventional) settings, so that cause and
effect can be obscured by spurious, indirect influences. Modern predictive
techniques from machine learning are capable of capturing high-dimensional,
complex, nonlinear relationships between variables while relying on few
parametric or probabilistic modelling assumptions. However, since these
techniques are associational, applied to observational data they are prone to
picking up spurious influences from non-experimental (observational) data,
making their predictions unreliable. Techniques from causal inference, such
as probabilistic causal diagrams and do-calculus, provide powerful
(nonparametric) tools for drawing causal inferences from such observational
data. However, these techniques are often incompatible with modern,
nonparametric machine learning algorithms since they typically require
explicit probabilistic models. Here, we develop causal bootstrapping, a set of
techniques for augmenting classical nonparametric bootstrap resampling
with information about the causal relationship between variables. This makes
it possible to resample observational data such that, if it is possible to
identify an interventional relationship from that data, new data representing
that relationship can be simulated from the original observational data. In
this way, we can use modern machine learning algorithms unaltered to make
statistically powerful, yet causally-robust, predictions. We develop several
causal bootstrapping algorithms for drawing interventional inferences from
observational data, for classification and regression problems, and
demonstrate, using synthetic and real-world examples, the value of this
approach."
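
A hedged sketch of one simple instance of the idea, not the paper's actual algorithms: with a single observed binary confounder and a back-door adjustment, observational data can be resampled with inverse-propensity weights so that estimators applied unaltered to the resampled data recover the interventional contrast.

```python
# Sketch (data-generating process and weights are my illustration, not the paper's).
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
z = (rng.random(n) < 0.5).astype(int)               # observed confounder
p_x_given_z = np.where(z == 1, 0.8, 0.2)
x = (rng.random(n) < p_x_given_z).astype(int)       # cause, confounded by Z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)          # effect; true causal effect of X is 2

# Naive observational contrast is biased by Z:
print(y[x == 1].mean() - y[x == 0].mean())          # ~3.8, not 2

# Causal bootstrap: resample each X-stratum with weights 1 / P(X=x | Z=z) (back-door).
w = 1.0 / np.where(x == 1, p_x_given_z, 1 - p_x_given_z)
idx1 = rng.choice(np.where(x == 1)[0], size=n // 2, p=w[x == 1] / w[x == 1].sum())
idx0 = rng.choice(np.where(x == 0)[0], size=n // 2, p=w[x == 0] / w[x == 0].sum())
print(y[idx1].mean() - y[idx0].mean())              # ~2, the interventional contrast
```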

#### ["Discovering Causal Signals in Images"](https://arxiv.org/abs/1605.08179) Lopez-Paz, Nishihara, Chintala, Scholkopf, Bottou

> "This paper establishes the existence of observable footprints that reveal the "causal dispositions" of the object categories appearing in
collections of images. We achieve this goal in two steps. First, we take a
learning approach to observational causal discovery, and build a classifier
that achieves state-of-the-art performance on finding the causal direction
between pairs of random variables, given samples from their joint
distribution. Second, we use our causal direction classifier to effectively
distinguish between features of objects and features of their contexts in
collections of static images. Our experiments demonstrate the existence of a
relation between the direction of causality and the difference between
objects and their contexts, and by the same token, the existence of
observable signals that reveal the causal dispositions of objects."

> "First, we take a learning approach to observational causal inference, and build a classifier that achieves state-of-the-art performance on finding
the causal direction between pairs of random variables, when given samples
from their joint distribution. Second, we use our causal direction finder to
effectively distinguish between features of objects and features of their
contexts in collections of static images. Our experiments demonstrate the
existence of (1) a relation between the direction of causality and the
difference between objects and their contexts, and (2) observable causal
signals in collections of static images."

> "Causal features are those that cause the presence of the object of
interest in the image (that is, those features that cause the object’s class
label), while anticausal features are those caused by the presence of the
object in the image (that is, those features caused by the class label)."

> "Paper aims to verify experimentally that the higher-order statistics of image datasets can inform about causal relations. Authors conjecture that object features and anticausal features are closely related, and that context features and causal features are not necessarily related. Context features give the background, while object features are what would usually be inside bounding boxes in an image dataset."

> "Better algorithms for causal direction should, in principle, help learning features that generalize better when the data distribution changes.
Causality should help with building more robust features by awareness of the
generating process of the data."

- `video` <https://youtube.com/watch?v=DfJeaa--xO0> (Bottou)

- `post` <http://giorgiopatrini.org/posts/2017/09/06/in-search-of-the-
missing-signals/>

- `notes`
<http://www.shortscience.org/paper?bibtexKey=journals/corr/Lopez-
PazNCSB16>
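
A rough sketch in the spirit of the paper's first step, with all details invented: the paper trains a neural featurizer (NCC) on synthetic cause-effect pairs; here a crude histogram featurization and a logistic regression stand in for it, just to show the shape of the approach (train on scatterplots with known causal direction, then score a new pair).

```python
# Sketch (featurization, mechanisms and sizes are assumptions, not the paper's method).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

def synth_pair(m=300):
    """One scatterplot where the first variable causes the second (additive noise)."""
    x = rng.normal(size=m)
    y = np.tanh(rng.normal() * x) + 0.3 * rng.normal(size=m)
    return x, y

def featurize(x, y, bins=8):
    """Flattened 2D histogram of the standardized scatterplot."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    h, _, _ = np.histogram2d(x, y, bins=bins, range=[[-3, 3], [-3, 3]], density=True)
    return h.ravel()

feats, labels = [], []
for _ in range(1000):
    x, y = synth_pair()
    if rng.random() < 0.5:                    # label 1: plotted in the causal order
        feats.append(featurize(x, y)); labels.append(1)
    else:                                     # label 0: plotted in the anticausal order
        feats.append(featurize(y, x)); labels.append(0)

clf = LogisticRegression(max_iter=1000).fit(np.array(feats), np.array(labels))
x, y = synth_pair()
print(clf.predict_proba([featurize(x, y)])[0, 1])   # estimated probability that X -> Y
```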

#### ["Learning Representations for Counterfactual Inference"](http://arxiv.org/abs/1605.03661) Johansson, Shalit, Sontag

> "Observational studies are rising in importance due to the widespread accumulation of data in fields such as healthcare, education, employment
and ecology. We consider the task of answering counterfactual questions
such as, "Would this patient have lower blood sugar had she received a
different medication?". We propose a new algorithmic framework for
counterfactual inference which brings together ideas from domain adaptation
and representation learning. In addition to a theoretical justification, we
perform an empirical comparison with previous approaches to causal
inference from observational data. Our deep learning algorithm significantly
outperforms the previous state-of-the-art."

> "In this paper we focus on counterfactual inference, which is a widely applicable special case of causal inference. We cast counterfactual inference
as a type of domain adaptation problem, and derive a novel way of learning
representations suited for this problem. Our models rely on a novel type of
regularization criteria: learning balanced representations, representations
which have similar distributions among the treated and untreated
populations. We show that trading off a balancing criterion with standard
data fitting and regularization terms is both practically and theoretically
prudent. Open questions which remain are how to generalize this method for
cases where more than one treatment is in question, deriving better
optimization algorithms and using richer discrepancy measures."

- `video` <http://techtalks.tv/talks/learning-representations-for-
counterfactual-inference/62489/> (Johansson)

- `video` <https://channel9.msdn.com/Events/Neural-Information-
Processing-Systems-Conference/Neural-Information-Processing-Systems-
Conference-NIPS-2016/Deep-Learning-Symposium-Session-3> (Shalit)

- `notes`
<http://www.shortscience.org/paper?bibtexKey=journals/corr/JohanssonSS16
>

- `code` <https://github.com/clinicalml/cfrnet>
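
A hedged sketch of the balancing idea (my simplification, not the paper's objective or architecture): the factual prediction loss is augmented with a penalty on the discrepancy between the representation distributions of treated and control units, here a linear MMD (difference of representation means).

```python
# Sketch (loss form and toy data are assumptions, not the paper's exact criterion).
import numpy as np

def linear_mmd(phi_treated, phi_control):
    """Squared distance between the mean representations of the two groups."""
    return float(np.sum((phi_treated.mean(axis=0) - phi_control.mean(axis=0)) ** 2))

def objective(phi, y, t, y_hat, alpha=1.0):
    """Factual MSE plus alpha * imbalance of the learned representation phi."""
    factual = float(np.mean((y - y_hat) ** 2))
    imbalance = linear_mmd(phi[t == 1], phi[t == 0])
    return factual + alpha * imbalance

# Toy check: a representation that encodes treatment assignment is penalized.
rng = np.random.default_rng(6)
t = rng.integers(0, 2, size=256)
phi_bad = np.column_stack([t + 0.1 * rng.normal(size=256), rng.normal(size=256)])
phi_ok = rng.normal(size=(256, 2))
y = rng.normal(size=256)
print(objective(phi_bad, y, t, y_hat=y), objective(phi_ok, y, t, y_hat=y))
```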

#### ["Causal Effect Inference with Deep Latent-Variable Models"](https://arxiv.org/abs/1705.08821) Louizos, Shalit, Mooij, Sontag, Zemel, Welling

> "Learning individual-level causal effects from observational data, such
as inferring the most effective medication for a specific patient, is a problem
of growing importance for policy makers. The most important aspect of
inferring causal effects from observational data is the handling of
confounders, factors that affect both an intervention and its outcome. A
carefully designed observational study attempts to measure all important
confounders. However, even if one does not have direct access to all
confounders, there may exist noisy and uncertain measurement of proxies
for confounders. We build on recent advances in latent variable modeling to
simultaneously estimate the unknown latent space summarizing the
confounders and the causal effect. Our method is based on Variational
Autoencoders which follow the causal structure of inference with proxies. We
show our method is significantly more robust than existing methods, and
matches the state-of-the-art on previous benchmarks focused on individual
treatment effects."

- `code` <https://github.com/AMLab-Amsterdam/CEVAE>

#### ["Implicit Causal Models for Genome-wide Association Studies"](https://arxiv.org/abs/1710.10742) Tran, Blei

> "Progress in probabilistic generative models has accelerated, developing richer models with neural architectures, implicit densities, and
with scalable algorithms for their Bayesian inference. However, there has
been limited progress in models that capture causal relationships, for
example, how individual genetic factors cause major human diseases. In this
work, we focus on two challenges in particular: How do we build richer causal
models, which can capture highly nonlinear relationships and interactions
between multiple causes? How do we adjust for latent confounders, which
are variables influencing both cause and effect and which prevent learning of
causal relationships? To address these challenges, we synthesize ideas from
causality and modern probabilistic modeling. For the first, we describe
implicit causal models, a class of causal models that leverages neural
architectures with an implicit density. For the second, we describe an implicit
causal model that adjusts for confounders by sharing strength across
examples. In experiments, we scale Bayesian inference on up to a billion
genetic measurements. We achieve state of the art accuracy for identifying
causal factors: we significantly outperform existing genetics methods by an
absolute difference of 15-45.3%."

- `video` <https://vimeo.com/253922904> (Tran)

- `video` <https://youtube.com/watch?v=gi2jZ_bVJuA> (Tran)

- `slides` <http://dustintran.com/talks/Tran_Genomics.pdf>

- `post` <https://www.alexdamour.com/blog/public/2018/05/18/non-
identification-in-latent-confounder-models>

#### ["Learning Functional Causal Models with Generative Neural Networks"](https://arxiv.org/abs/1709.05321) Goudet, Kalainathan, Caillou, Lopez-Paz, Guyon, Sebag, Tritas, Tubaro

`CGNN`

> "We introduce a new approach to functional causal modeling from observational data. The approach, called Causal Generative Neural Networks,
leverages the power of neural networks to learn a generative model of the
joint distribution of the observed variables, by minimizing the Maximum
Mean Discrepancy between generated and observed data. An approximate
learning criterion is proposed to scale the computational cost of the
approach to linear complexity in the number of observations. The
performance of CGNN is studied throughout three experiments. First, we
apply CGNN to the problem of cause-effect inference, where two CGNNs
model P(Y|X,noise) and P(X|Y,noise) identify the best causal hypothesis out of
X → Y and Y → X. Second, CGNN is applied to the problem of identifying v-
structures and conditional independences. Third, we apply CGNN to problem
of multivariate functional causal modeling: given a skeleton describing the
dependences in a set of random variables {X1,…,Xd}, CGNN orients the
edges in the skeleton to uncover the directed acyclic causal graph describing
the causal structure of the random variables. On all three tasks, CGNN is
extensively assessed on both artificial and real-world data, comparing
favorably to the state-of-the-art. Finally, we extend CGNN to handle the case
of confounders, where latent variables are involved in the overall causal
model."
- `video` <https://vimeo.com/252105914#t=37m10s> (Goudet)

- `code` <https://github.com/GoudetOlivier/CGNN>

- `paper` ["Causal Generative Neural Networks"](https://arxiv.org/abs/1711.08936) by Goudet et al.
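
A heavily simplified, hedged sketch of the scoring idea only, not the CGNN architecture: fit a generative model of the pair in each direction, regenerate data, and prefer the direction whose generated sample is closer (in Maximum Mean Discrepancy) to the observed sample; a nearest-neighbour regressor with resampled residuals stands in here for the neural generator.

```python
# Sketch (generator, kernel bandwidth and data are assumptions of mine).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(7)

def mmd_rbf(a, b, gamma=1.0):
    """Biased (V-statistic) estimate of squared RBF-kernel MMD between two samples."""
    def k(u, v):
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

def direction_score(cause, effect):
    """MMD between observed (cause, effect) and data regenerated as cause -> effect."""
    reg = KNeighborsRegressor(n_neighbors=20).fit(cause[:, None], effect)
    fitted = reg.predict(cause[:, None])
    gen_effect = fitted + rng.permutation(effect - fitted)   # resample the residual noise
    obs = np.column_stack([cause, effect])
    gen = np.column_stack([cause, gen_effect])
    return mmd_rbf(obs, gen)

# Toy pair with ground truth X -> Y; the causal direction should score lower.
x = rng.normal(size=400)
y = np.sin(2 * x) + 0.2 * rng.normal(size=400)
print("X->Y score:", direction_score(x, y), " Y->X score:", direction_score(y, x))
```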

#### ["SAM: Structural Agnostic Model, Causal Discovery and Penalized Adversarial Learning"](https://arxiv.org/abs/1803.04929) Kalainathan, Goudet, Guyon, Lopez-Paz, Sebag

> "We present the Structural Agnostic Model, a framework to estimate end-to-end non-acyclic causal graphs from observational data. In a nutshell,
SAM implements an adversarial game in which a separate model generates
each variable, given real values from all others. In tandem, a discriminator
attempts to distinguish between the joint distributions of real and generated
samples. Finally, a sparsity penalty forces each generator to consider only a
small subset of the variables, yielding a sparse causal graph. SAM scales
easily to hundreds of variables. Our experiments show the state-of-the-art
performance of SAM on discovering causal structures and modeling
interventions, in both acyclic and non-acyclic graphs."

#### ["Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search"](https://arxiv.org/abs/1811.06272) Buesing, Weber, Zwols, Racaniere, Guez, Lespiau, Heess

`CF-GPS` `counterfactual inference` `ICLR 2019`

> "Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of
real experience, which is often costly to acquire. However, simulating
plausible experience de novo is a hard problem for many complex
environments, often resulting in biases for model-based policy evaluation
and search. Instead of de novo synthesis of data, here we assume logged,
real experience and model alternative outcomes of this experience under
counterfactual actions, i.e. actions that were not actually taken. Based on
this, we propose the Counterfactually-Guided Policy Search algorithm for
learning policies in POMDPs from off-policy experience. It leverages structural
causal models for counterfactual evaluation of arbitrary policies on individual
off-policy episodes. CF-GPS can improve on vanilla model-based RL
algorithms by making use of available logged data to de-bias model
predictions. In contrast to off-policy algorithms based on Importance
Sampling which re-weight data, CF-GPS leverages a model to explicitly
consider alternative outcomes, allowing the algorithm to make better use of
experience data. We find empirically that these advantages translate into
improved policy evaluation and search results on a non-trivial grid-world
task. Finally, we show that CF-GPS generalizes the previously proposed
Guided Policy Search and that reparameterization-based algorithms such as
Stochastic Value Gradient can be interpreted as counterfactual methods."

> "Instead of relying on data synthesized from scratch by a model, we train policies on model predictions of alternate outcomes of past experience
from the true environment under counterfactual actions, i.e. actions that had
not actually been taken, while everything else remaining the same. At the
heart of CF-GPS are structural causal models which model the environment
with two ingredients: 1) Independent random variables, called scenarios
here, summarize all aspects of the environment that cannot be influenced by
the agent. 2) Deterministic transition functions (also called causal
mechanisms) take these scenarios, together with the agent’s actions, as
input and produce the predicted outcome. The central idea of CF-GPS is that,
instead of running an agent on scenarios sampled de novo from a model, we
infer scenarios in hindsight from given off-policy data, and then evaluate and
improve the agent on these specific scenarios using given or learned causal
mechanisms."

> "We show that CF-GPS generalizes and empirically improves on a vanilla model-based RL algorithm, by mitigating model mismatch via
“grounding” or “anchoring” model-based predictions in inferred scenarios. As
a result, this approach explicitly allows to trade-off historical data for model
bias. CF-GPS differs substantially from standard off-policy RL algorithms
based on Importance Sampling, where historical data is re-weighted with
respect to the importance weights to evaluate or learn new policies. In
contrast, CF-GPS explicitly reasons counterfactually about given off-policy
data."
> "We formulate model-based RL in POMDPs in terms of structural causal
models, thereby connecting concepts from reinforcement learning and causal
inference."

> "We provide the first results, to the best of our knowledge, showing
that counterfactual reasoning in structural causal models on off-policy data
can facilitate solving non-trivial RL tasks."

> "We show that two previously proposed classes of RL algorithms, namely Guided Policy Search and Stochastic Value Gradient methods can be
interpreted as counterfactual methods, opening up possible generalizations."

> "Simulating plausible synthetic experience de novo is a hard problem for many environments, often resulting in biases for model-based RL
algorithms. The main takeaway from this work is that we can improve policy
learning by evaluating counterfactual actions in concrete, past scenarios.
Compared to only considering synthetic scenarios, this procedure mitigates
model bias."

> "We assumed that there are no additional hidden confounders in the
environment and that the main challenge in modelling the environment is
capturing the distribution of the noise sources p(U), whereas we assumed
that the transition and reward kernels given the noise is easy to model. This
seems a reasonable assumption in some environments, such as the partially
observed grid-world considered here, but not all. Probably the most
restrictive assumption is that we require the inference over the noise U given
data hT to be sufficiently accurate. We showed in our example, that we could
learn a parametric model of this distribution from privileged information, i.e.
from joint samples u, hT from the true environment. However, imperfect
inference over the scenario U could result e.g. in wrongly attributing a
negative outcome to the agent’s actions instead of environment factors. This
could in turn result in too optimistic predictions for counterfactual actions.
Future research is needed to investigate if learning a sufficiently strong SCM
is possible without privileged information for interesting RL domains. If,
however, we can trust the transition and reward kernels of the model, we can
substantially improve model-based RL methods by counterfactual reasoning
on off-policy data, as demonstrated in our experiments and by the success of
Guided Policy Search and Stochastic Value Gradient methods."
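
A minimal sketch of the counterfactual replay at the heart of CF-GPS, on a one-step toy environment of my own: abduction infers the scenario noise from the logged transition, action swaps in a counterfactual action, and prediction pushes the same noise back through the mechanism; de novo model rollouts, by contrast, only match on average.

```python
# Sketch (mechanism and numbers invented): abduction / action / prediction on a toy SCM.
import numpy as np

rng = np.random.default_rng(8)

def transition(state, action, noise):
    """Deterministic causal mechanism: next_state := state + action + noise."""
    return state + action + noise

# Logged experience under the behaviour policy:
state, logged_action = 1.0, 0.0
noise = rng.normal()                               # the "scenario", unknown at planning time
logged_next = transition(state, logged_action, noise)

# (1) Abduction: with this additive mechanism the scenario is exactly identified.
inferred_noise = logged_next - state - logged_action

# (2) Action + (3) Prediction: replay the same scenario under the counterfactual action 1.
counterfactual_next = transition(state, 1.0, inferred_noise)

# De novo model rollouts ignore what actually happened and only match on average:
de_novo = transition(state, 1.0, rng.normal(size=10_000)).mean()
print(logged_next, counterfactual_next, de_novo)   # counterfactual_next == logged_next + 1
```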

----

> "The proposed approach here is general but only instantiated (in terms
of inference algorithms and experiments) for when the initial starting state is
unknown in a deterministic POMDP environment, where the dynamics and
reward model is known. The authors show that they can use inference over
the full trajectory (or some multi-time-step subpart) to get a (often delta
function) posterior over the initial starting state, which then allows them to
build a more accurate initial state distribution for use in their model
simulations than approaches that do not use more than 1 step to do so. This
is interesting, but it’s not quite clear where this sort of situation would arise
in practice, and the proposed experimental results are limited to one
simulated toy domain."

#### ["Causal Reasoning from Meta-reinforcement Learning"](https://arxiv.org/abs/1901.08162) Dasgupta et al.

> "Discovering and exploiting the causal structure in the environment is a crucial challenge for intelligent agents. Here we explore whether causal
reasoning can emerge via meta-reinforcement learning. We train a recurrent
network with model-free reinforcement learning to solve a range of problems
that each contain causal structure. We find that the trained agent can
perform causal reasoning in novel situations in order to obtain rewards. The
agent can select informative interventions, draw causal inferences from
observational data, and make counterfactual predictions. Although
established formal causal reasoning algorithms also exist, in this paper we
show that such reasoning can arise from model-free reinforcement learning,
and suggest that causal reasoning in complex settings may benefit from the
more end-to-end learning-based approaches presented here. This work also
offers new strategies for structured exploration in reinforcement learning, by
providing agents with the ability to perform - and interpret - experiments."
> "Agents trained in this manner performed causal reasoning in three
data settings: observational, interventional, and counterfactual. Our
approach did not require explicit encoding of formal principles of causal
inference. Rather, by optimizing an agent to perform a task that depended
on causal structure, the agent learned implicit strategies to generate and use
different kinds of available data for causal reasoning, including drawing
causal inferences from passive observation, actively intervening, and making
counterfactual predictions, all on held out causal CBNs that the agents had
never previously seen. A consistent result in all three data settings was that
our agents learned to perform good experiment design or active learning.
That is, they learned a non-random data collection policy where they actively
chose which nodes to intervene (or condition) on in the information phase,
and thus could control the kinds of data they saw, leading to higher
performance in the quiz phase than that from an agent with a random data
collection policy."

> "We showed that agents learned to perform do-calculus. We saw that
the trained agent with access to only observational data received more
reward than the highest possible reward achievable without causal
knowledge. We further observed that this performance increase occurred
selectively in cases where do-calculus made a prediction distinguishable
from the predictions based on correlations – i.e. where the externally
intervened node had a parent, meaning that the intervention resulted in a
different graph."

> "We showed that agents learned to resolve unobserved confounders using interventions (which is impossible with only observational data). We
saw that agents with access to interventional data performed better than
agents with access to only observational data only in cases where the
intervened node shared an unobserved parent (a confounder) with other
variables in the graph."

> "We showed that agents learned to use counterfactuals. We saw that
agents with additional access to the specific randomness in the test phase
performed better than agents with access to only interventional data. We
found that the increased performance was observed only in cases where the
maximum mean value in the graph was degenerate, and optimal choice was
affected by the latent randomness – i.e. where multiple nodes had the same
value on average and the specific randomness could be used to distinguish
their actual values in that specific case."

#### ["General Identifiability with Arbitrary Surrogate Experiments"](http://auai.org/uai2019/proceedings/papers/144.pdf) Lee, Correa, Bareinboim

`UAI 2019`

> "We study the problem of causal identification from an arbitrary collection of observational and experimental distributions, and substantive
knowledge about the phenomenon under investigation, which usually comes
in the form of a causal graph. We call this problem g-identifiability, or gID for
short. The gID setting encompasses two well-known problems in causal
inference, namely, identifiability and z-identifiability — the former assumes
that an observational distribution is necessarily available, and no
experiments can be performed, conditions that are both relaxed in the gID
setting; the latter assumes that all combinations of experiments are
available, i.e., the power set of the experimental set Z, which gID does not
require a priori. In this paper, we introduce a general strategy to prove non-
gID based on hedgelets and thickets, which leads to a necessary and
sufficient graphical condition for the corresponding decision problem. We
further develop a procedure for systematically computing the target effect,
and prove that it is sound and complete for gID instances. In other words,
failure of the algorithm in returning an expression implies that the target
effect is not computable from the available distributions. Finally, as a
corollary of these results, we show that do-calculus is complete for the task
of g-identifiability."

> "In one line of investigation, this task is formalized through the
question of whether the effect that an intervention on a set of variables X will
have on another set of outcome variables Y (denoted Px(y)) can be uniquely
computed from the probability distribution P over the observed variables V
and a causal diagram G. This is known as the problem of identification, and
has received great attention in the literature, starting with a number of
sufficient conditions, and culminating in a complete graphical and
algorithmic characterization. Despite the generality of such results, it’s the
case that in some real-world applications the quantity Px(y) is not identifiable
(i.e., not uniquely computable) from the observational data and the causal
diagram."

> "On an alternative thread in the literature, causal effects (Px(y)) are
obtained directly through controlled experimentation. In the biomedical
sciences, for instance, considerable resources are spent every year by the
FDA, the NIH, and others, in supporting large-scale, systematic, and
controlled experimentation, which comes under the rubric of Randomized
Controlled Trials. The same method is also leveraged in the context of
reinforcement learning, for example, when an autonomous agent is deployed
in an environment and is given the capability of performing interventions and
observing how they unfold in time. Through this process, experimental data
is gathered, and used in the construction of a strategy, also known as policy,
with the goal of optimizing the agent’s cumulative reward (e.g., survival,
profitability, happiness). Despite all the inferential power entailed by this
approach, there are real-world settings where controlling the variables in X is
not feasible, possibly due to economical, technical, or ethical constraints."

> "In this paper, we note that these two approaches can be seen as
extremes in a spectrum of possible research designs, which can be combined
to solve very natural, albeit non-trivial, causal inference problems. In fact,
this generalized setting has been investigated in the literature under the
rubric of z-identifiability (zID, for short). Formally, zID asks whether Px(y) can

P(V) and the experimental distributions Pz'(V), for all Z'⊆ Z for some Z ⊆ V."
be uniquely computed from the combination of the observational distribution
