Causal Inference
* [**theory**](#theory)
* [**interesting papers**](#interesting-papers)
---
### overview
["Causality in Machine
Learning"](http://unofficialgoogledatascience.com/2017/01/causality-in-
machine-learning.html) by Muralidharan et al.
----
["The Seven Tools of Causal Inference with Reflections on Machine
Learning"](https://dl.acm.org/citation.cfm?id=3241036) by Judea Pearl
`paper` ([talk](https://youtube.com/watch?v=nWaM6XmQEmU) `video`)
["Causality"](http://www.homepages.ucl.ac.uk/~ucgtrbd/papers/causality.pdf)
by Ricardo Silva `paper`
["Introduction to Causal
Inference"](http://jmlr.org/papers/volume11/spirtes10a/spirtes10a.pdf) by
Peter Spirtes `paper`
["Graphical Causal
Models"](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch22.pdf) by
Cosma Shalizi `paper`
----
["Causal Inference
Book"](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/)
by Miguel Hernan and James Robins `book`
----
[tutorial](https://youtube.com/watch?v=CTcQlRSnvvM) by Bernhard Scholkopf `video`
[course](https://youtube.com/channel/UCbOJ2eEdvf2wOPrAmA72Gzg) by Brady Neal `video`
["Counterfactual Inference"](https://facebook.com/nipsfoundation/videos/1291139774361116) tutorial by Susan Athey `video`
----
["Causal Data Science: A General Framework for Data Fusion and Causal
Inference"](https://youtube.com/watch?v=dUsokjG4DHc) by Elias Bareinboim
`video`
["Learning Causal
Mechanisms"](https://facebook.com/iclr.cc/videos/2123421684353553?
t=294) by Bernhard Scholkopf `video`
[workshop](https://sites.google.com/view/nips2018causallearning) at
NeurIPS 2018 ([videos](https://youtube.com/playlist?
list=PLJscN9YDD1bu1dCKuXSV1qYmicx3g9t7A))
---
### theory
If some railways are closed, what will passengers do? If we incentivize members of a social network to propagate an idea, how influential can they be? If some genes in a cell are knocked out, which phenotypes can we expect? Such questions need to be addressed via a combination of experimental and observational data, and require careful modelling of heterogeneous datasets along with structural assumptions concerning the causal relations among the components of the system.
"What is more likely, that a daughter will have blue eyes given that her
mother has blue eyes or the other way around — that the mother will have
blue eyes given that the daughter has blue eyes? Most people will say the
former — they'll prefer the causal direction. But it turns out the two
probabilities are the same, because the number of blue-eyed people in every
generation remains stable. I took it as evidence that people think causally,
not probabilistically — they're biased by having easy access to causal
explanations, even though probability theory tells you something different.
There are many biases in our judgment that are created by our inclination
to attribute causal relationships where they do not belong. We see the world
as a collection of causal relationships and not as a collection of statistical or
associative relationships. Most of the time, we can get by, because they are
closely tied together. Once in a while we fail. The blue-eye story is an
example of such failure.
"I now take causal relations as the fundamental building block that of
physical reality and of human understanding of that reality, and I regard
probabilistic relationships as but the surface phenomena of the causal
machinery that underlies and propels our understanding of our world."
*(Judea Pearl)*
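The blue-eye symmetry is just Bayes' rule with equal marginals: P(daughter | mother) = P(mother | daughter) · P(daughter) / P(mother), and a stable frequency of blue eyes across generations makes the two marginals equal. A minimal numeric check, with illustrative (assumed) frequencies:

```python
# If the frequency of blue eyes is stable across generations, the marginals
# P(mother blue) and P(daughter blue) are equal, so by Bayes' rule the two
# conditional probabilities must coincide.
p_blue = 0.10   # assumed marginal frequency of blue eyes, same in each generation
p_both = 0.06   # assumed joint probability that both mother and daughter are blue-eyed

p_daughter_given_mother = p_both / p_blue
p_mother_given_daughter = p_both / p_blue   # same joint, same marginal

assert p_daughter_given_mother == p_mother_given_daughter   # both 0.6
```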
----
"If we examine the information that drives machine learning today, we find
that it is almost entirely statistical. In other words, learning machines
improve their performance by optimizing parameters over a stream of
sensory inputs received from the environment. It is a slow process,
analogous in many respects to the evolutionary survival-of-the-fittest process
that explains how species like eagles and snakes have developed superb
vision systems over millions of years. It cannot explain, however, the super-evolutionary process that enabled humans to build eyeglasses and
telescopes over barely one thousand years. What humans possessed that
other species lacked was a mental representation, a blue-print of their
environment which they could manipulate at will to imagine alternative
hypothetical environments for planning and learning. Anthropologists like N. Harari and S. Mithen are in general agreement that the decisive ingredient
that gave our homo sapiens ancestors the ability to achieve global dominion,
about 40,000 years ago, was their ability to sketch and store a
representation of their environment, interrogate that representation, distort
it by mental acts of imagination and finally answer “What if?” kind of
questions. Examples are interventional questions: “What if I act?” and
retrospective or explanatory questions: “What if I had acted differently?” No
learning machine in operation today can answer such questions about
actions not taken before. Moreover, most learning machines today do not utilize a representation from which such questions can be answered. We
postulate that the major impediment to achieving accelerated learning
speeds as well as human level performance can be overcome by removing
these barriers and equipping learning machines with causal reasoning tools.
This postulate would have been speculative twenty years ago, prior to the
mathematization of counterfactuals. Not so today. Advances in graphical and
structural models have made counterfactuals computationally manageable
and thus rendered metastatistical learning worthy of serious exploration."
"An extremely useful insight unveiled by the logic of causal reasoning is the
existence of a sharp classification of causal information, in terms of the kind
of questions that each class is capable of answering. The classification forms
a 3-level hierarchy, in the sense that questions at one level can only be answered if information from that level or higher levels is available."
| level | typical activity | typical questions |
|---|---|---|
| 1. Association P(y\|x) | seeing | What is? How would seeing X change my belief in Y? |
| 2. Intervention P(y\|do(x)) | doing | What if? What if I do X? |
| 3. Counterfactuals P(y_x\|x',y') | imagining, retrospection | Why? Was it X that caused Y? What if I had acted differently? |
[*(Judea Pearl)*](http://web.cs.ucla.edu/~kaoru/theoretical-impediments.pdf)
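A minimal runnable sketch of the three levels on a toy linear SCM (the model, its coefficients, and the estimation code are illustrative assumptions, not Pearl's): association conditions on passively observed data, intervention samples from the mutilated graph, and a counterfactual first infers a unit's noise (abduction) and then replays the mechanism under the alternative action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear SCM with a confounder:
#   Z := U_z,   X := Z + U_x,   Y := 2*X + 3*Z + U_y
def sample(n, do_x=None):
    z = rng.normal(size=n)
    x = z + rng.normal(size=n) if do_x is None else np.full(n, do_x)  # do(X) cuts Z -> X
    y = 2 * x + 3 * z + rng.normal(size=n)
    return z, x, y

z, x, y = sample(500_000)

# Level 1, association P(y|x): conditioning on X ~ 1 is confounded by Z
assoc = y[np.abs(x - 1.0) < 0.05].mean()          # ~3.5

# Level 2, intervention P(y|do(x)): sample from the mutilated model
_, _, y_do = sample(500_000, do_x=1.0)
interv = y_do.mean()                              # ~2.0

# Level 3, counterfactual: for one observed unit (z0, x0, y0), what would Y
# have been under do(X=0)? Abduction recovers U_y = y0 - 2*x0 - 3*z0, then
# the mechanism is replayed with X forced to 0 and Z, U_y held fixed.
z0, x0, y0 = z[0], x[0], y[0]
y_cf = 2 * 0 + 3 * z0 + (y0 - 2 * x0 - 3 * z0)    # = y0 - 2*x0

print(assoc, interv, y_cf)
```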
----
Each dataset can be characterized by a tuple (population, experimental condition, sampling selection, measured variables), and a causal inference task maps the tuples describing the available data to the tuple describing the query:

*statistics - descriptive*: (d1, d2, d3, d4) -> (d1, d2, d3, d4)

*statistics - experimental*: (d1, distribution(do(X)), d3, d4) -> (d1, distribution(do(X)), d3, d4)

(d1, distribution(do(Z)), d3, d4) -> (d1, distribution(do(X)), d3, d4) *(P. Wright, S. Wright)*

(d1, d2, select(Age), d4) -> (d1, d2, {}, d4) *(Heckman)*

(bonobos, d2, d3, d4) -> (humans, d2, d3, d4) *(Shadish, Cook, Campbell)*
[*(Elias Bareinboim)*](https://youtu.be/dUsokjG4DHc?t=8m13s)
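A hypothetical Python encoding of this tuple view (the `Setting` class and its field names are my own reading of the slide, not notation from the talk): each dataset records the conditions under which it was collected, and each task above is a mapping between such records.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Setting:
    population: str   # d1, e.g. "bonobos" vs "humans"
    condition: str    # d2, e.g. "observational" or "do(Z)"
    selection: str    # d3, e.g. "select(Age)", or "{}" for no selection
    measured: str     # d4, the observed variables

data = Setting("humans", "do(Z)", "{}", "V")

# surrogate experiments: map do(Z) data to a do(X) query (P. Wright, S. Wright)
query_id = replace(data, condition="do(X)")
# selection bias: map age-selected data to the unselected population (Heckman)
query_sel = replace(Setting("humans", "observational", "select(Age)", "V"), selection="{}")
# transportability: map bonobo data to a query about humans (Shadish, Cook, Campbell)
query_tr = replace(Setting("bonobos", "observational", "{}", "V"), population="humans")

print(data, "->", query_id)
```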
----
"Causal graph and the intervention types and targets may be (partially)
unknown. This is a realistic setting in many practical applications. For
example, in biology, many interventions that can be performed on organisms
are known to result in measurable downstream effects, but the exact
mechanism and direct intervention targets are unknown, and therefore it is
not clear whether the knowledge gained may be transferred to other species.
In pharmaceutical research, it is desirable to target the root causes of illness
directly and minimize side-effects; however, as the causal mechanisms are
often poorly understood, it is unclear what exactly a drug is doing and
whether the results of a particular study on a subpopulation of patients (say,
middle-aged males in the US) will generalize to other subpopulations (e.g.,
elderly women with dementia). In policy decisions, changing tax rules may
have different repercussions for different socio-economic classes, but the
exact workings of an economy can only be modeled to a certain extent.
Machine learning may help to make such predictions more data-driven, but
should then correctly take into account the transfer of distributions that
result from interventions and context changes. For prediction in the IID setting, imitating the exterior of a process is enough (i.e. one can disregard causal structure); anything else can benefit from causal learning."
---
### interesting papers
----
["Causal Inference and the Data-Fusion Problem"](https://doi.org/10.1073/pnas.1510507113) by Bareinboim and Pearl `paper`
> "We review concepts, principles, and tools that unify current
approaches to causal analysis and attend to new challenges presented by
big data. In particular, we address the problem of data fusion - piecing
together multiple datasets collected under heterogeneous conditions (i.e.,
different populations, regimes, and sampling methods) to obtain valid
answers to queries of interest. The availability of multiple heterogeneous
datasets presents new opportunities to big data analysts, because the
knowledge that can be acquired from combined data would not be possible
from any individual source alone. However, the biases that emerge in
heterogeneous environments require new analytical tools. Some of these
biases, including confounding, sampling selection, and cross-population
biases, have been addressed in isolation, largely in restricted parametric
models. We here present a general, nonparametric framework for handling
these biases and, ultimately, a theoretical solution to the problem of data
fusion in causal inference tasks."
["On Causal and Anticausal Learning"](https://arxiv.org/abs/1206.6471) by Scholkopf et al. `paper` `ICML 2012`
> "We consider the problem of function estimation in the case where an
underlying causal model can be inferred. This has implications for popular
scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. We argue that causal knowledge may facilitate some
approaches for a given problem, and rule out others. In particular, we
formulate a hypothesis for when semi-supervised learning can help, and
corroborate it with empirical results."
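A toy illustration of that hypothesis (the two-Gaussian setup and the crude 2-means step are my own, chosen for brevity): in the anticausal direction the class Y causes the feature X, so the unlabeled marginal P(X) is a mixture whose structure reveals the decision boundary; in the causal direction P(X) and P(Y|X) are independent mechanisms, so unlabeled X cannot help.

```python
import numpy as np

rng = np.random.default_rng(0)

# Anticausal data: Y -> X, with X | Y=k ~ N(+-2, 1)
y = rng.integers(0, 2, size=5000)
x = rng.normal(loc=np.where(y == 0, -2.0, 2.0), scale=1.0)

# Using only *unlabeled* x, a crude 1-d 2-means recovers the component means;
# the midpoint between them approximates the Bayes decision boundary at 0.
centers = np.array([x.min(), x.max()])
for _ in range(20):
    assign = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([x[assign == 0].mean(), x[assign == 1].mean()])
boundary = centers.mean()

print(boundary)   # ~0.0: the unlabeled marginal alone located the class boundary
```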
> "This work shows how to leverage causal inference to understand the
behavior of complex learning systems interacting with their environment and
predict the consequences of changes to the system. Such predictions allow
both humans and algorithms to select the changes that would have
improved the system performance. This work is illustrated by experiments on
the ad placement system associated with the Bing search engine."
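The core estimator family in this line of work is importance weighting over logged randomized decisions: E_new[r] = E_log[r * p_new(a|x) / p_log(a|x)]. A minimal sketch on a made-up contextual bandit (the reward model, policies, and numbers are assumptions for illustration, not the paper's system):

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_actions = 100_000, 3
x = rng.normal(size=n)                        # context
a = rng.integers(0, n_actions, size=n)        # logging policy: uniformly random actions
p_log = 1.0 / n_actions                       # propensity of each logged action
best = (x > 0).astype(int)                    # assumed best action per context
r = (a == best).astype(float)                 # assumed reward model

# candidate policy to evaluate offline: deterministically pick `best`;
# p_new(a|x) evaluated at the *logged* action is 1 if it matches, else 0
p_new = (a == best).astype(float)

estimate = np.mean(r * p_new / p_log)         # counterfactual value of the new policy
print(estimate)                               # ~1.0, estimated without deploying it
```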
> "Causal features are those that cause the presence of the object of
interest in the image (that is, those features that cause the object’s class
label), while anticausal features are those caused by the presence of the
object in the image (that is, those features caused by the class label)."
- `post` <http://giorgiopatrini.org/posts/2017/09/06/in-search-of-the-
missing-signals/>
- `notes`
<http://www.shortscience.org/paper?bibtexKey=journals/corr/Lopez-
PazNCSB16>
- `video` <http://techtalks.tv/talks/learning-representations-for-
counterfactual-inference/62489/> (Johansson)
- `video` <https://channel9.msdn.com/Events/Neural-Information-
Processing-Systems-Conference/Neural-Information-Processing-Systems-
Conference-NIPS-2016/Deep-Learning-Symposium-Session-3> (Shalit)
- `notes`
<http://www.shortscience.org/paper?bibtexKey=journals/corr/JohanssonSS16
>
- `code` <https://github.com/clinicalml/cfrnet>
- `code` <https://github.com/AMLab-Amsterdam/CEVAE>
- `slides` <http://dustintran.com/talks/Tran_Genomics.pdf>
- `post` <https://www.alexdamour.com/blog/public/2018/05/18/non-
identification-in-latent-confounder-models>
`CGNN`
- `code` <https://github.com/GoudetOlivier/CGNN>
> "We provide the first results, to the best of our knowledge, showing
that counterfactual reasoning in structural causal models on off-policy data
can facilitate solving non-trivial RL tasks."
> "We assumed that there are no additional hidden confounders in the
environment and that the main challenge in modelling the environment is
capturing the distribution of the noise sources p(U), whereas we assumed
that the transition and reward kernels given the noise is easy to model. This
seems a reasonable assumption in some environments, such as the partially
observed grid-world considered here, but not all. Probably the most
restrictive assumption is that we require the inference over the noise U given data h_T to be sufficiently accurate. We showed in our example that we could learn a parametric model of this distribution from privileged information, i.e. from joint samples (u, h_T) from the true environment. However, imperfect inference over the scenario U could result e.g. in wrongly attributing a negative outcome to the agent’s actions instead of environment factors. This
could in turn result in too optimistic predictions for counterfactual actions.
Future research is needed to investigate if learning a sufficiently strong SCM
is possible without privileged information for interesting RL domains. If,
however, we can trust the transition and reward kernels of the model, we can
substantially improve model-based RL methods by counterfactual reasoning
on off-policy data, as demonstrated in our experiments and by the success of
Guided Policy Search and Stochastic Value Gradient methods."
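A minimal sketch of the abduction-action-prediction loop this rests on (the scalar random-walk environment is my own toy, not the paper's SCM or architecture): with a transition kernel that is deterministic given the noise, the noise is recovered exactly from the logged trajectory, and the same scenario is then replayed under different actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SCM transition, deterministic given noise:  s' = s + a + u,  u ~ N(0, 1)
T = 5
a_obs = rng.integers(-1, 2, size=T).astype(float)    # logged (off-policy) actions
u_true = rng.normal(size=T)                          # scenario noise, unobserved at test time
s = np.zeros(T + 1)
for t in range(T):
    s[t + 1] = s[t] + a_obs[t] + u_true[t]

# Abduction: with this kernel the noise is exactly identified by the trajectory h_T
u_hat = s[1:] - s[:-1] - a_obs

# Action + prediction: replay the *same* scenario under counterfactual actions
a_cf = np.ones(T)
s_cf = np.zeros(T + 1)
for t in range(T):
    s_cf[t + 1] = s_cf[t] + a_cf[t] + u_hat[t]

print(s[-1], s_cf[-1])   # outcomes under identical noise, differing only in the actions
```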
----
> "The proposed approach here is general but only instantiated (in terms
of inference algorithms and experiments) for when the initial starting state is
unknown in a deterministic POMDP environment, where the dynamics and
reward model is known. The authors show that they can use inference over
the full trajectory (or some multi-time-step subpart) to get a (often delta
function) posterior over the initial starting state, which then allows them to
build a more accurate initial state distribution for use in their model
simulations than approaches that do not use more than 1 step to do so. This
is interesting, but it’s not quite clear where this sort of situation would arise
in practice, and the proposed experimental results are limited to one
simulated toy domain."
> "We showed that agents learned to perform do-calculus. We saw that,
the trained agent with access to only observational data received more
reward than the highest possible reward achievable without causal
knowledge. We further observed that this performance increase occurred
selectively in cases where do-calculus made a prediction distinguishable
from the predictions based on correlations – i.e. where the externally
intervened node had a parent, meaning that the intervention resulted in a
different graph."
> "We showed that agents learned to use counterfactuals. We saw that
agents with additional access to the specific randomness in the test phase
performed better than agents with access to only interventional data. We
found that the increased performance was observed only in cases where the
maximum mean value in the graph was degenerate, and optimal choice was
affected by the latent randomness – i.e. where multiple nodes had the same
value on average and the specific randomness could be used to distinguish
their actual values in that specific case."
"General Identifiability with Arbitrary Surrogate Experiments" by Lee, Correa, Bareinboim `paper` `UAI 2019`
> "In one line of investigation, this task is formalized through the
question of whether the effect that an intervention on a set of variables X will
have on another set of outcome variables Y (denoted Px(y)) can be uniquely
computed from the probability distribution P over the observed variables V
and a causal diagram G. This is known as the problem of identification, and
has received great attention in the literature, starting with a number of
sufficient conditions, and culminating in a complete graphical and
algorithmic characterization. Despite the generality of such results, it’s the
case that in some real-world applications the quantity Px(y) is not identifiable
(i.e., not uniquely computable) from the observational data and the causal
diagram."
> "On an alternative thread in the literature, causal effects (Px(y)) are
obtained directly through controlled experimentation. In the biomedical
sciences, for instance, considerable resources are spent every year by the
FDA, the NIH, and others, in supporting large-scale, systematic, and
controlled experimentation, which comes under the rubric of Randomized
Controlled Trials. The same method is also leveraged in the context of
reinforcement learning, for example, when an autonomous agent is deployed
in an environment and is given the capability of performing interventions and
observing how they unfold in time. Through this process, experimental data
is gathered, and used in the construction of a strategy, also known as policy,
with the goal of optimizing the agent’s cumulative reward (e.g., survival,
profitability, happiness). Despite all the inferential power entailed by this
approach, there are real-world settings where controlling the variables in X is
not feasible, possibly due to economical, technical, or ethical constraints."
> "In this paper, we note that these two approaches can be seen as
extremes in a spectrum of possible research designs, which can be combined
to solve very natural, albeit non-trivial, causal inference problems. In fact,
this generalized setting has been investigated in the literature under the
rubric of z-identifiability (zID, for short). Formally, zID asks whether Px(y) can be uniquely computed from the combination of the observational distribution P(V) and the experimental distributions Pz'(V), for all Z' ⊆ Z, for some Z ⊆ V."
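The simplest instance of the identification problem described above is backdoor adjustment; here is a minimal numeric sketch (the graph, probabilities, and sample size are my own illustrative choices): in the diagram Z -> X, Z -> Y, X -> Y, the variable Z blocks every backdoor path, so P(y | do(x)) = Σ_z P(y | x, z) P(z) is computable from purely observational data, while naive conditioning on X is confounded.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1_000_000
z = rng.integers(0, 2, size=n)                                  # confounder
x = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(int)    # Z -> X
y = (rng.random(n) < 0.2 + 0.3 * x + 0.4 * z).astype(int)       # X, Z -> Y

# Level-1 conditioning is biased by the backdoor path through Z:
naive = y[x == 1].mean()                                        # ~0.82

# Backdoor adjustment recovers the interventional quantity P(y=1 | do(x=1)):
adjusted = sum(y[(x == 1) & (z == v)].mean() * (z == v).mean() for v in (0, 1))
print(naive, adjusted)                                          # adjusted ~0.70 = 0.2 + 0.3 + 0.4*0.5
```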