
High-dimensional causal inference, graphical modeling and structural equation models

Peter Bühlmann

Seminar für Statistik, ETH Zürich


cannot do confirmatory causal inference without
randomized intervention experiments...

but we can do better than proceeding naively


Goal

in genomics:
if we were to make an intervention at a single gene, what would
be its effect on a phenotype of interest?

want to infer/predict such effects without actually doing the
intervention, i.e. from observational data
(from observations of a “steady-state system”)

it doesn’t need to be genes

can generalize to interventions at more than one variable/gene
Policy making

James Heckman: Nobel Prize Economics 2000

e.g.:
“Pritzker Consortium on Early Childhood Development
identifies when and how child intervention programs can be
most influential”
Genomics

1. Flowering of Arabidopsis thaliana

phenotype/response variable of interest:
Y = days to bolting (flowering)
“covariates” X = gene expressions from p = 21,326 genes

remark: “gene expression”: the process by which information from
a gene is used in the synthesis of a functional gene product
(e.g. a protein)

question: infer/predict the effect of knocking out/knocking down
(or enhancing) a single gene (expression) on the
phenotype/response variable Y?
2. Gene expressions of yeast

p = 5360 genes
phenotype of interest: Y = expression of first gene
“covariates” X = gene expressions from all other genes
and then
phenotype of interest: Y = expression of second gene
“covariates” X = gene expressions from all other genes
and so on
infer/predict the effects of a single gene knock-down on all
other genes
⇝ consider the framework of an

intervention effect = causal effect

(mathematically defined; see later)
Regression – the “statistical workhorse”: the wrong approach

we could use a linear model (fitted from n observational data points)

Y = Σ_{j=1}^p β_j X^(j) + ε,   Var(X^(j)) ≡ 1 for all j

|β_j| measures the effect of variable X^(j) in terms of “association”,
i.e. the change of Y as a function of X^(j) when keeping all other
variables X^(k) fixed

⇝ not very realistic for the intervention problem:
if we change e.g. one gene, some others will also change,
and these others are not (cannot be) kept fixed
and indeed:

[ROC-type plot (yeast data): number of true positives vs. number of false positives
for IDA, Lasso, Elastic-net and random guessing; IDA clearly dominates]

⇝ can do much better than (penalized) regression!

Effects of single gene knock-downs on all other genes (yeast)
(Maathuis, Colombo, Kalisch & PB, 2010)

• p = 5360 genes (expression of genes)
• 231 gene knock-downs ⇝ 1.2 · 10^6 intervention effects
• the truth is “known in good approximation”
  (thanks to intervention experiments)

goal: prediction of the true large intervention effects
based on observational data with no knock-downs (n = 63)

[ROC-type plot: true positives vs. false positives for IDA, Lasso,
Elastic-net and random guessing, based on the n = 63 observational samples]
A bit more specifically

• univariate response Y
• p-dimensional covariate X

question:
what is the effect of setting the jth component of X to a certain
value x:

do(X^(j) = x)

⇝ this is a question of intervention type,
not the effect of X^(j) on Y when keeping all other variables fixed
(regression effect)
Reichenbach, 1956; Suppes, 1970; Rubin, Dawid, Holland, Pearl,
Glymour, Scheines, Spirtes,...
Intervention calculus (a review)

“dynamic” notion of an effect:
if we set a variable X^(j) to a value x (intervention)
⇝ some other variables X^(k) (k ≠ j) and maybe Y will change

we want to quantify the “total” effect of
X^(j) on Y, including the effect of “all changed” X^(k) on Y

a graph or influence diagram will be very useful

[example DAG on the variables X1, X2, X3, X4 and Y]
for simplicity: just consider DAGs (Directed Acyclic Graphs)
[with hidden variables (Spirtes, Glymour & Scheines (1993);
Colombo et al. (2012)) much more complicated and not validated
with real data]

random variables are represented as nodes in the DAG

assume a Markov condition, saying that
X^(j), given X^(pa(j)), is conditionally independent of its non-descendant variables

⇝ recursive factorization of the joint distribution:

P(Y, X^(1), ..., X^(p)) = P(Y | X^(pa(Y))) · Π_{j=1}^p P(X^(j) | X^(pa(j)))

for intervention calculus: use the truncated factorization (e.g. Pearl)
assume the Markov property for the causal DAG:

non-intervention vs. intervention do(X^(2) = x)

[DAG: X^(3), X^(4) → X^(2); X^(2) → X^(1); X^(1), X^(3) → Y;
under the intervention, the arrows into X^(2) are cut and X^(2) is set to x]

P(Y, X^(1), X^(2), X^(3), X^(4))                P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x))
  = P(Y | X^(1), X^(3))                           = P(Y | X^(1), X^(3))
  × P(X^(1) | X^(2))                              × P(X^(1) | X^(2) = x)
  × P(X^(2) | X^(3), X^(4))                       × P(X^(3))
  × P(X^(3))                                      × P(X^(4))
  × P(X^(4))
truncated factorization for do(X^(2) = x):

P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x))
  = P(Y | X^(1), X^(3)) · P(X^(1) | X^(2) = x) · P(X^(3)) · P(X^(4))

P(Y | do(X^(2) = x))
  = ∫ P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x)) dX^(1) dX^(3) dX^(4)
the truncated factorization is a mathematical consequence of
the Markov condition (with respect to the causal DAG) for the
observational probability distribution P
(plus the assumption that the structural equations are “autonomous”)

the intervention distribution P(Y | do(X^(2) = x)) can be calculated from
• the observational data distribution
  ⇝ need to estimate conditional distributions
• an influence diagram (causal DAG)
  ⇝ need to estimate the structure of a graph/influence diagram

intervention effect:
E[Y | do(X^(2) = x)] = ∫ y P(y | do(X^(2) = x)) dy

intervention effect at x0: (∂/∂x) E[Y | do(X^(2) = x)] |_{x = x0}

in the Gaussian case, (Y, X^(1), ..., X^(p)) ∼ N_{p+1}(µ, Σ):
(∂/∂x) E[Y | do(X^(2) = x)] ≡ θ_2 for all x
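
the truncated factorization can be checked numerically; below is a minimal R sketch
(not from the slides; the linear Gaussian form and the coefficients a, b, c, d, f are
illustrative assumptions) for the example DAG above, comparing a direct simulation of
the intervened system with Monte-Carlo sampling from the truncated factorization:

# example DAG: X4 -> X2, X3 -> X2, X2 -> X1, X1 -> Y, X3 -> Y
set.seed(1)
n <- 1e5
a <- 0.8; b <- 0.5; c <- 1.2; d <- 0.7; f <- -0.4
x <- 2                                   # intervention value do(X2 = x)

## (i) simulate the intervened system directly: X2 is set to x, its equation is cut
x3 <- rnorm(n); x4 <- rnorm(n)
x2 <- rep(x, n)
x1 <- c * x2 + rnorm(n)
y  <- d * x1 + f * x3 + rnorm(n)
mean(y)                                  # Monte-Carlo estimate of E[Y | do(X2 = x)]

## (ii) truncated factorization: sample X3, X4 from their marginals,
## X1 from P(X1 | X2 = x), Y from P(Y | X1, X3); the factor for X2 is dropped
x3t <- rnorm(n); x4t <- rnorm(n)
x1t <- c * x + rnorm(n)
yt  <- d * x1t + f * x3t + rnorm(n)
mean(yt)                                 # agrees with (i); both are close to d*c*x = 1.68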
The backdoor criterion (Pearl, 1993)

we only need to know the local parental set: for Z = X^(pa(X)),

if Y ∉ pa(X):   P(Y | do(X = x)) = ∫ P(Y | X = x, Z) dP(Z)

the parental set might not be the minimal adjustment set, but it always suffices

this is a consequence of the global Markov property:
X_A independent of X_B given X_S whenever A and B are d-separated by S

Gaussian case

in the Gaussian case: (∂/∂x) E[Y | do(X^(j) = x)] ≡ θ_j for all x

for Y ∉ pa(j): θ_j is the regression parameter in

Y = θ_j X^(j) + Σ_{k∈pa(j)} θ_k X^(k) + error

⇝ only need the parental set and regression
[example: j = 2, pa(j) = {3, 4} in the DAG above]
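
a minimal R sketch (same illustrative DAG and coefficients as in the simulation above;
assumptions mine) of this parental adjustment: regressing Y on X^(2) and its parents
recovers the intervention effect θ_2 = d·c, whereas the marginal regression of Y on
X^(2) alone is confounded by X^(3):

set.seed(2)
n <- 1e5
a <- 0.8; b <- 0.5; c <- 1.2; d <- 0.7; f <- -0.4
x3 <- rnorm(n); x4 <- rnorm(n)
x2 <- a * x3 + b * x4 + rnorm(n)         # observational (non-intervened) system
x1 <- c * x2 + rnorm(n)
y  <- d * x1 + f * x3 + rnorm(n)

coef(lm(y ~ x2))["x2"]                   # marginal association: confounded, != d*c
coef(lm(y ~ x2 + x3 + x4))["x2"]         # adjusted for pa(2) = {3, 4}: close to d*c = 0.84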
when having no unmeasured confounder (variable):

intervention effect (as defined) = causal effect

recap:
causal effect = effect from a randomized trial
(but we want to infer it without a randomized study...
because often we cannot do it, or it is too expensive)

structural equation models provide a different (but closely
related) route for quantifying intervention effects
Inferring intervention effects from observational
distribution
main problem: inferring DAG from observational data

impossible! can only infer equivalence class of DAGs


(several DAGs can encode exactly the same conditional
independence relationships)

Example:

X → Y (“X causes Y”)        X ← Y (“Y causes X”)

a lot of work about identifiability:


Verma & Pearl (1991); Spirtes, Glymour & Scheines (1993); Tian &
Pearl (2000–2002); Lauritzen & Richardson (2002); Shpitser & Pearl
(2006–2011); vanderWeele & Robins (2007–2011); Drton, Foygel &
Sullivant (2011);...
Markov equivalence class of DAGs

P a family of probability distributions on R^{p+1}

Definition:

M(D) = {P ∈ P; P is Markov w.r.t. DAG D}

D ∼ D′ ⇔ M(D) = M(D′)

the definition depends on the family P; typical models:
• P = Gaussian distributions
• P = nonparametric model/set of all distributions
• P = set of distributions from an additive structural equation model (see later)
A graphical characterization (Frydenberg, 1990; Verma & Pearl, 1991)

for
P = {Gaussian distributions} or
P = {nonparametric distributions}
it holds:

D ∼ D′ ⇐⇒ D and D′ have the same skeleton and the same v-structures

for a DAG D, we write its corresponding Markov equivalence class as

E(D) = {D′; D′ ∼ D}
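
this characterization can be checked mechanically; below is a small R helper sketch
(my own code, not from the slides) for two DAGs given as binary adjacency matrices
(A[i, j] = 1 for an edge i → j):

v.structures <- function(A) {                    # collect "i -> k <- j" with i, j non-adjacent
  p <- nrow(A); out <- character(0)
  for (k in 1:p) {
    pa <- which(A[, k] > 0)
    if (length(pa) < 2) next
    for (pair in combn(pa, 2, simplify = FALSE)) {
      i <- pair[1]; j <- pair[2]
      if (A[i, j] == 0 && A[j, i] == 0) out <- c(out, paste(i, "->", k, "<-", j))
    }
  }
  sort(out)
}
markov.equivalent <- function(A, B)
  all((A + t(A) > 0) == (B + t(B) > 0)) &&       # same skeleton
  identical(v.structures(A), v.structures(B))    # same v-structures

## example: X -> Y and X <- Y are Markov equivalent (same skeleton, no v-structures)
A <- matrix(c(0, 0, 1, 0), 2, 2)                 # edge 1 -> 2
markov.equivalent(A, t(A))                       # TRUE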
Equivalence class of DAGs

• Several DAGs can encode exactly the same conditional


independence relationships. Such DAGs form an equivalence class.
• Example: unshielded triple X1 – X2 – X3 (X1 and X3 not adjacent):

  orientation          X1 ⊥⊥ X3   X1 ⊥⊥ X3 | X2
  X1 → X2 → X3         false      true            no v-structure
  X1 ← X2 ← X3         false      true            no v-structure
  X1 ← X2 → X3         false      true            no v-structure
  X1 → X2 ← X3         true       false           v-structure

• All DAGs in an equivalence class have the same skeleton and the
  same v-structures
• An equivalence class can be uniquely represented by a completed
  partially directed acyclic graph (CPDAG)

[figure: a CPDAG and the DAGs 1–4 it represents]
we cannot estimate causal/intervention effects from the
observational distribution

but we will be able to estimate lower bounds of causal effects

conceptual “procedure”:
• probability distribution P from a DAG, generating the data
  ⇝ true underlying equivalence class of DAGs (CPDAG)
• find all DAG-members of the true equivalence class (CPDAG):
  D1, ..., Dm
• for every DAG-member Dr and every variable X^(j):
  single intervention effect θ_{r,j}

summarize them by

Θ = {θ_{r,j}; r = 1, ..., m; j = 1, ..., p}   (an identifiable parameter)
IDA (oracle version)

[flow diagram: oracle CPDAG (from the PC-algorithm) → DAG 1, ..., DAG m in its
equivalence class → do-calculus on each DAG → effect 1, ..., effect m → multi-set Θ]
If you want a single number for every variable ...

instead of the multi-set

Θ = {θ_{r,j}; r = 1, ..., m; j = 1, ..., p}

take the minimal absolute value

α_j = min_r |θ_{r,j}|   (j = 1, ..., p),

then |θ_{true,j}| ≥ α_j

⇝ the minimal absolute effect α_j is a lower bound for the true absolute
intervention effect
Computationally tractable algorithm

searching all DAGs is computationally infeasible if p is large
(we actually can do this only up to p ≈ 15–20)

instead of finding all m DAGs within an equivalence class ⇝
compute all intervention effects without finding all DAGs
(Maathuis, Kalisch & PB, 2009)
key idea: exploring local aspects of the graph is sufficient

[flow diagram: data → PC-algorithm → estimated CPDAG → local do-calculus →
effect 1, ..., effect q → multi-set Θ_L]

the local Θ_L = Θ up to multiplicities
(Maathuis, Kalisch & PB, 2009)
Estimation from finite samples

notation: drop the Y-notation (Y = X^(1), X^(2), ..., X^(p))

difficult part: estimation of the CPDAG (equivalence class of DAGs)
⇝ “structural learning”:

P ⇒ CPDAG (equivalence class of DAGs)

[example: estimated CPDAG on 10 nodes from pcAlgo(dm = d, alpha = 0.05)]

two main approaches:
• multiple testing of conditional dependencies:
  PC-algorithm as prime example
• score-based methods: MLE as prime example
Faithfulness assumption
(necessary for conditional dependence testing approaches)

a distribution P is called faithful to a DAG G if all conditional
independencies can be inferred from the graph

(one can infer some conditional independencies from the Markov
assumption alone; but we require here that “all” conditional
independencies are encoded by the graph)

assuming faithfulness ⇝ can infer the CPDAG from a list of
conditional (in-)dependence relations
What does it mean?

consider the linear SEM with DAG 1 → 2, 1 → 3, 2 → 3:

X^(1) ← ε^(1),
X^(2) ← α X^(1) + ε^(2),
X^(3) ← β X^(1) + γ X^(2) + ε^(3),
ε^(1), ε^(2), ε^(3) i.i.d. ∼ N(0, 1)

enforce marginal independence of X^(1) and X^(3):
β + αγ = 0, e.g. α = β = 1, γ = −1

Σ = [[1, 1, 0], [1, 2, −1], [0, −1, 2]],   Σ^{-1} = [[3, −2, −1], [−2, 2, 1], [−1, 1, 1]]

failure of faithfulness due to cancellation of coefficients

failure of exact faithfulness is “rare” (Lebesgue measure zero)

but for statistical estimation (in the Gaussian case) we “often”
require strong faithfulness (Robins, Scheines, Spirtes & Wasserman, 2003):

min{ |ρ(i, j | S)| ; ρ(i, j | S) ≠ 0, i ≠ j, |S| ≤ d } ≥ τ,   τ ≫ √(d log(p)/n)

(d is the maximal degree of the skeleton of the DAG)


... strong faithfulness can be rather severe
(Uhler, Raskutti, PB & Yu, 2013)

[figure: for 3 nodes and a full graph, the unfaithful distributions due to exact
cancellation form a surface in the parameter space; imagine a strip around it
⇝ large volume!]

⇒ strong faithfulness is restrictive in high dimensions
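
the cancellation example above is easy to reproduce; a minimal R sketch (my own code,
with α = β = 1, γ = −1 as on the slide) showing that X^(1) and X^(3) look marginally
independent even though the edge X^(1) → X^(3) exists:

set.seed(3)
n <- 1e5
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)                            # alpha = 1
x3 <- x1 - x2 + rnorm(n)                       # beta = 1, gamma = -1: beta + alpha*gamma = 0
cor(x1, x3)                                    # approx. 0, despite the direct edge X1 -> X3
cor(resid(lm(x1 ~ x2)), resid(lm(x3 ~ x2)))    # partial correlation given X2: clearly nonzero

so a marginal-correlation screening step would wrongly delete the edge X^(1) – X^(3)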
The PC-algorithm (Spirtes & Glymour, 1991)

• crucial assumption:
  the distribution P is faithful to the true underlying DAG
• less crucial but convenient:
  Gaussian assumption for Y, X^(1), ..., X^(p) ⇝ can work with
  partial correlations

• input: Σ̂_MLE
  (but we only need to consider many small sub-matrices of it,
  assuming sparsity of the graph)
• output: based on a clever data-dependent (random)
  sequence of multiple tests ⇝ estimated CPDAG
PC-algorithm: a rough outline
for estimating the skeleton of the underlying DAG

1. start with the full graph

2. remove edge i – j if Ĉor(X^(i), X^(j)) is small
   (Fisher’s Z-transform and null-distribution of zero correlation)

3. partial correlations of order 1:
   remove edge i – j if P̂arcor(X^(i), X^(j) | X^(k)) is small for some k in
   the current neighborhood of i or j (thanks to faithfulness)

4. move up to partial correlations of order 2:
   remove edge i – j if P̂arcor(X^(i), X^(j) | X^(k), X^(ℓ)) is small for some
   k, ℓ in the current neighborhood of i or j (thanks to faithfulness)

5. continue until removal of edges is not possible anymore,
   i.e. stop at the minimal order of partial correlation where
   edge-removal becomes impossible

[figures: full graph → correlation screening → partial correlation of order 1 → stopped]

an additional step of the algorithm is needed for estimating directions
⇝ yields an estimate of the CPDAG (equivalence class of DAGs)

one tuning parameter (cut-off parameter) α for truncation of the
estimated Z-transformed partial correlations

if the graph is “sparse” (few neighbors) ⇝ few iterations only,
and only low-order partial correlations play a role;
thus the estimation algorithm works for p ≫ n problems
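
a minimal R sketch (my own simplified code, not the pcalg implementation) of the first
two stages above, i.e. Fisher’s Z tests for marginal and order-1 partial correlations:

fisher.z.pval <- function(r, n, k) {            # k = size of the conditioning set
  z <- 0.5 * log((1 + r) / (1 - r))
  2 * pnorm(-abs(z) * sqrt(n - k - 3))
}
parcor1 <- function(C, i, j, k) {               # rho(i, j | k) from a correlation matrix C
  (C[i, j] - C[i, k] * C[j, k]) / sqrt((1 - C[i, k]^2) * (1 - C[j, k]^2))
}
skeleton01 <- function(X, alpha = 0.05) {       # correlation screening + order-1 step only
  n <- nrow(X); p <- ncol(X); C <- cor(X)
  adj <- matrix(TRUE, p, p); diag(adj) <- FALSE
  for (i in 1:(p - 1)) for (j in (i + 1):p)     # step 2: marginal correlation screening
    if (fisher.z.pval(C[i, j], n, 0) > alpha) adj[i, j] <- adj[j, i] <- FALSE
  for (i in 1:(p - 1)) for (j in (i + 1):p) {   # step 3: partial correlations of order 1
    if (!adj[i, j]) next
    for (k in setdiff(which(adj[i, ] | adj[j, ]), c(i, j)))
      if (fisher.z.pval(parcor1(C, i, j, k), n, 1) > alpha) {
        adj[i, j] <- adj[j, i] <- FALSE; break
      }
  }
  adj                                           # current estimate of the skeleton
}
# the full PC-algorithm continues with higher-order partial correlations and then
# orients v-structures; pcalg::pc() implements the complete procedure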
Computational complexity

crudely bounded to be polynomial in p

sparser underlying structure ⇝ faster algorithm

we can easily do the computations for
sparse cases with p ≈ 10^4 (≈ 1–2 hrs CPU time)

[log–log plot: processor time vs. p for expected neighborhood sizes E[N] = 2 and E[N] = 8]
IDA (Intervention calculus when the DAG is Absent)
(Maathuis, Colombo, Kalisch & PB, 2010)

1. PC-algorithm ⇝ estimated CPDAG
2. local algorithm ⇝ Θ̂_local
3. lower bounds for absolute causal effects ⇝ α̂_j

R-package: pcalg

this is what we used in the yeast example to score the
importance of the genes according to the size of α̂_j
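
a rough usage sketch of this IDA workflow with the pcalg package; my.data is a
hypothetical n × p data matrix, and argument names follow the package documentation
(details may differ between pcalg versions):

library(pcalg)

X <- scale(my.data)                            # hypothetical observational data matrix
suffStat <- list(C = cor(X), n = nrow(X))

## 1. PC-algorithm  ->  estimated CPDAG
pc.fit <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01, p = ncol(X))

## 2. local algorithm  ->  multiset of possible causal effects of variable 2 on variable 1
effects <- ida(x.pos = 2, y.pos = 1, mcov = cov(X), graphEst = pc.fit@graph,
               method = "local")

## 3. lower bound for the absolute causal effect
alpha.hat <- min(abs(effects))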
Statistical theory (Kalisch & PB, 2007; Maathuis, Kalisch & PB, 2009)
n i.i.d. observational data points; p variables;
high-dimensional setting where p ≫ n

assumptions:
• X^(1), ..., X^(p) ∼ N_p(0, Σ), Markov and faithful to the true DAG
• high-dimensionality: log(p) ≪ n
• sparsity: maximal degree d = max_j |ne(j)| ≪ n
• “coherence”: maximal (partial) correlations bounded,
  max{ |ρ_{i,j|S}| ; i ≠ j, |S| ≤ d } ≤ C < 1
• signal strength/strong faithfulness:
  min{ |ρ_{i,j|S}| ; ρ_{i,j|S} ≠ 0, i ≠ j, |S| ≤ d } ≫ √(d log(p)/n)

Then, for a suitable tuning parameter and 0 < δ < 1:

P[ĈPDAG = true CPDAG] = 1 − O(exp(−C n^{1−δ}))
P[Θ̂_L = Θ as sets] = 1 − O(exp(−C n^{1−δ}))
(i.e. consistency of the lower bounds for causal effects)
Main strategy of a proof
we have to analyze an algorithm... ⇝
• understand the population version:
  in particular, one can show that the algorithm stops after d or d − 1 steps
• analysis of the noisy version: additional errors;
  use a union bound argument to control the overall error
  (seems very rough but leads to an asymptotically “optimal”
  regime, e.g., for the signal strength)
The role of “sparsity” in causal inference
as usual: sparsity is necessary for accurate estimation in
presence of noise

but here: “sparsity” (so-called protectedness) is crucial for


identifiability as well

X → Y (“X causes Y”)        X ← Y (“Y causes X”)

cannot tell from observational data the direction of the arrow

the same situation arises with a full graph with more than 2 nodes

⇝ causal identification really needs “sparsity”

the better the “sparsity”, the tighter the bounds for causal effects
Maximum likelihood estimation

[photo: R.A. Fisher]

a Gaussian DAG model is a Gaussian linear structural equation model;
e.g. for the DAG 1 → 2, 1 → 3, 2 → 3:

X^(1) ← ε^(1)
X^(2) ← β_{21} X^(1) + ε^(2)
X^(3) ← β_{31} X^(1) + β_{32} X^(2) + ε^(3)

in general:

X^(j) ← Σ_{k=1}^p β_{jk} X^(k) + ε^(j)   (j = 1, ..., p),   β_{jk} ≠ 0 ⇔ edge k → j

X = BX + ε,   ε ∼ N_p(0, diag(σ_1^2, ..., σ_p^2))   in matrix notation

⇝ reparameterization:

(Σ̂, D̂) = argmin_{Σ; D a DAG}  −ℓ(Σ, D; data) + λ |D|
        = argmin_{B; {σ_j^2; j}}  −ℓ(B, {σ_j^2; j}; data) + λ ‖B‖_0,   ‖B‖_0 = Σ_{i,j} 1(B_{ij} ≠ 0)

under the non-convex constraint that B corresponds to “no directed cycles”

severe non-convex problem due to the “no directed cycle” constraint
(the ‖·‖_0-penalty rather than e.g. ‖·‖_1 doesn’t make the problem much harder)
toy example:
X^(1) ← β_1 X^(2) + ε_1
X^(2) ← β_2 X^(1) + ε_2

[figure: the corresponding parameter space in (β_1, β_2) around (0, 0)]

non-convex parameter space!
(no straightforward way to do convex relaxation)

computation:
• dynamic programming algorithm (up to p ≈ 20)
  (Silander and Myllymäki, 2006)
• greedy algorithms on equivalence classes
  (Chickering, 2002; Hauser & PB, 2012)

statistical properties of the penalized MLE (for “large” p):
(van de Geer & PB, 2013)
no faithfulness assumption required to obtain the minimal-edge I-MAP!
Successes in biology
Effects of single gene knock-downs on all other genes in yeast
(Maathuis, Colombo, Kalisch & PB, 2010)

[ROC-type plot (n = 63 observational data): true positives vs. false positives
for IDA, Lasso, Elastic-net and random guessing]
Arabidopsis thaliana (Stekhoven, Moraes, Sveinbjörnsson, Hennig,
Maathuis & PB, 2012)

response Y: days to bolting (flowering) of the plant
(aim: fast flowering plants)
covariates X: gene-expression profile

observational data with n = 47 and p = 21,326

⇝ lower bound estimates α̂_j for the causal effect of every
gene/variable on Y (using the PC-algorithm)

apply stability selection (Meinshausen & PB, 2010)
⇝ assigning uncertainties via control of
PCER = E[V]/p ≤ (2π_thr − 1)^{-1} q^2/p^2   (per-comparison error rate)
Causal gene ranking

 #   Gene         summary rank   effect   median expression   error (PCER)   name
1 AT2G45660 1 0.60 5.07 0.0017 AGL20 (SOC1)
2 AT4G24010 2 0.61 5.69 0.0021 ATCSLG1
3 AT1G15520 2 0.58 5.42 0.0017 PDR12
4 AT3G02920 5 0.58 7.44 0.0024 replication protein-related
5 AT5G43610 5 0.41 4.98 0.0101 ATSUC6
6 AT4G00650 7 0.48 5.56 0.0020 FRI
7 AT1G24070 8 0.57 6.13 0.0026 ATCSLA10
8 AT1G19940 9 0.53 5.13 0.0019 AtGH9B5
9 AT3G61170 9 0.51 5.12 0.0034 protein coding
10 AT1G32375 10 0.54 5.21 0.0031 protein coding
11 AT2G15320 10 0.50 5.57 0.0027 protein coding
12 AT2G28120 10 0.49 6.45 0.0026 protein coding
13 AT2G16510 13 0.50 10.7 0.0023 AVAP5
14 AT3G14630 13 0.48 4.87 0.0039 CYP72A9
15 AT1G11800 15 0.51 6.97 0.0028 protein coding
16 AT5G44800 16 0.32 6.55 0.0704 CHR4
17 AT3G50660 17 0.40 7.60 0.0059 DWF4
18 AT5G10140 19 0.30 10.3 0.0064 FLC
19 AT1G24110 20 0.49 4.66 0.0059 peroxidase, putative
20 AT1G27030 20 0.45 10.1 0.0059 unknown protein
• biological validation by gene knockout experiments in progress

red: biologically known genes responsible for flowering

performed validation experiments with mutants corresponding to
these top 20 − 3 = 17 genes
• 14 mutants easily available ⇝ only test for 14 genes
• more than usual: mutants showed low germination or survival...
• 9 among the 14 mutants survived (sufficiently strongly), i.e.
  9 mutants for which we have an outcome
• 3 among the 9 mutants (genes) showed a significant effect
  on Y relative to the wildtype (non-mutated plant)
⇝ that is: besides the three known genes, we find three
additional genes which exhibit a significant difference in terms
of “time to flowering”
in short:
bounds on causal effects (α̂j ’s) based on observational data
lead to interesting predictions for interventions in genomics
(i.e. which genes would exhibit a large intervention effect)
and these predictions have been validated using experiments
Fully identifiable cases

recap:
if P = {Gaussian distributions}, then the cardinality of the
Markov equivalence class of a DAG D is often |E(D)| > 1

the same is true for P = {nonparametric distributions}

but under some additional constraints, the Markov equivalence
class becomes smaller or even consists of a single DAG (i.e.,
the DAG is identifiable)

structural equation model (SEM):

X_j ← f_j(X_{pa(j)}, ε_j)   (j = 1, ..., p)

e.g.

X_j ← Σ_{k∈pa(j)} f_{jk}(X_k) + ε_j   (j = 1, ..., p)
three types of identifiable SEMs, where the
true DAG is identifiable from the observational distribution:

• LiNGAM (Shimizu et al., 2006)

  X_j ← Σ_{k∈pa(j)} B_{jk} X_k + ε_j   (j = 1, ..., p),

  with the ε_j’s non-Gaussian;
  as with independent component analysis (ICA)
  ⇝ Gaussian components are the hardest to identify

• linear Gaussian with equal error variances (Peters & PB, 2014)

  X_j ← Σ_{k∈pa(j)} B_{jk} X_k + ε_j   (j = 1, ..., p),

  Var(ε_j) ≡ σ^2 for all j

• nonlinear SEM with additive noise (Schölkopf and
  co-workers, 2008–2014: Hoyer et al., 2009; Peters et al., 2013)

  X_j ← f_j(X_{pa(j)}) + ε_j   (j = 1, ..., p)
toy example: p = 2 with DAG X^(1) = X → Y = X^(2)

truth: Y = X^3 + ε,  with ε independent of X

[scatter plots with fitted regression curves:
left: the forward model Y = X^3 + ε; the residuals ε are independent of X;
right: the reverse model X = Y^{1/3} + η; the residuals η are not independent of Y]
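
a small numerical sketch of this toy example (my own code; the sample size, the loess
fits and the use of a squared-residual correlation as a crude stand-in for a proper
independence test such as HSIC are assumptions):

set.seed(1)
n <- 500
x <- runif(n, 0, 3)
y <- x^3 + rnorm(n)                 # true direction: additive noise independent of x

res.fwd <- residuals(loess(y ~ x))  # fit in the causal direction  Y = f(X) + noise
res.bwd <- residuals(loess(x ~ y))  # fit in the reverse direction X = g(Y) + noise

cor(res.fwd^2, x)                   # approx. 0: residuals look independent of x
cor(res.bwd^2, y)                   # clearly nonzero: residuals depend on y

only the causal direction admits an additive-noise model with independent residuals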
strategy to infer structure and model parameters
(for identifiable models):

order search (order among the variables)
&
subsequent (sparse) regression

order search: MLE for the best permutation/order

• candidate order/permutation π (of {1, ..., p})
• regressions

  X_{π(j)} versus X_{π(j−1)}, X_{π(j−2)}, ..., X_{π(1)}

  (when sparse: only some of the regressors are selected)

evaluation score for π:
likelihood of all the regressions X_{π(j)} vs. X_{π(j−1)}, ..., X_{π(1)}
⇝ Π_{j=1}^p p(X_{π(j)} | X_{π(j−1)}, ..., X_{π(1)})

need a model for the regressions:
• functional form of the regressions (e.g. additive, linear)
• distribution of the error terms (e.g. Gaussian, nonparametric)
⇝ can compute the likelihood with estimated parameters
and search over the orderings/permutations
search over the orderings/permutations:
there are p! permutations...

trick: preliminary neighborhood selection,
aiming for an undirected graph containing the skeleton of the
DAG (i.e. the Markov blanket or a superset thereof)

[figure: “superset skeleton” on 5 example nodes]

⇝ do order search and estimation restricted to the superset skeleton
• much reduced search space
• can use the unpenalized MLE (for the best order/permutation
  restricted to the superset skeleton)
that is: regularization in neighborhood selection suffices!
CAM: Causal Additive Model (PB, Peters & Ernest, 2013)

X_j ← Σ_{k∈pa(j)} f_{jk}(X_k) + ε_j   (j = 1, ..., p),   ε_j ∼ N(0, σ_j^2)

• underlying DAG is identifiable from the joint distribution
• statistically “feasible” for estimation due to the additive functions
• good empirical performance

the log-likelihood equals (up to a constant term):

−Σ_{j=1}^p log(σ̂_{j,π}),   (σ̂_{j,π})^2 = n^{-1} Σ_{i=1}^n (X_{i,π(j)} − Σ_{k<j} f̂_{j,π(k)}(X_{i,π(k)}))^2
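
a rough R sketch (my own code, not the CAM implementation) of this order score
−Σ_j log(σ̂_{j,π}): regress each X_{π(j)} additively on its predecessors; here with
mgcv::gam smoothing splines, whereas CAM additionally restricts to a preliminary
superset skeleton and uses sparse additive regression:

library(mgcv)

order.score <- function(X, perm) {            # X: numeric data matrix, perm: candidate order
  n <- nrow(X); score <- 0
  for (j in seq_along(perm)) {
    target <- X[, perm[j]]
    if (j == 1) {
      res <- target - mean(target)
    } else {
      preds <- as.data.frame(X[, perm[1:(j - 1)], drop = FALSE])
      names(preds) <- paste0("z", seq_len(ncol(preds)))
      dat <- cbind(y = target, preds)
      form <- as.formula(paste("y ~", paste(sprintf("s(%s)", names(preds)), collapse = " + ")))
      res <- residuals(gam(form, data = dat))  # additive regression on the predecessors
    }
    score <- score - log(sqrt(mean(res^2)))    # contributes -log(sigma.hat_{j,pi})
  }
  score
}
# a higher score corresponds to a better (more likely) order; note that with purely
# linear Gaussian fits all orders would score equally well in population, so the
# nonlinear additive fits are essential for identifying the order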
fitting method
1. preliminary neighborhood selection
   (for a superset of the skeleton of the DAG or the Markov blanket):
   “nodewise” sparse additive regression of
   one variable X_j against all others {X_k; k ≠ j};
   easy (using CV), works well and is established
   (Ravikumar et al. (2009), Meier, van de Geer & PB (2009), ...)
2. estimate a best order by unpenalized MLE restricted to
   the superset skeleton
3. based on the estimated order π̂: additive sparse model
   fitting restricted to the superset skeleton ⇝ edge pruning

[figure: the graph on 5 example nodes after step 1 (superset skeleton),
step 2 (oriented) and step 3 (pruned)]

for step 2: greedy search restricted to the superset skeleton
– exhaustive computation is possible for trees (Bürge, MSc thesis 2014)
– greedy methods are shown to find the optimum for restricted classes of
  DAGs (Peters, Balakrishnan and Wainwright, in progress)

unpenalized MLE for the best order search: very convenient,
since regularization is decoupled
and only used in the sparse additive regressions
statistical consistency: high-dimensional p ≫ n setting
(PB, Peters & Ernest, 2013)
main assumptions:
• maximal degree of the DAG is O(1) (“sparse”)
• sufficiently large identifiability constant (w.r.t. Kullback-Leibler
  divergence) between the true and any wrong ordering:
  ξ_p := p^{-1} min_{π ∉ Π_0} ( E_{θ_0}[−log(p_{θ_0^π}(X))] − E_{θ_0}[−log(p_{θ_0}(X))] ),
  require ξ_p ≫ √(log(p)/n)
• exponential moments for the X_j’s and certain smooth functions thereof

then:
P[π̂ ∈ Π_0] → 1   (n → ∞),   Π_0 = set of true permutations

• consistent recovery of the true functions f_{jk}^0 and the DAG D^0
• consistent estimation of E[Y | do(X = x)]
remark:

preliminary neighborhood selection
• yields empirical and computational improvements
• seems to improve the theoretical results,
  e.g. in the linear Gaussian case ⇝ less stringent conditions
  than for the ℓ_0-regularized MLE (van de Geer & PB, 2013)
Empirical results to illustrate what can be achieved with CAM

p = 100, n = 200
true model is CAM (additive SEM) with Gaussian error

SHD: Structural Hamming Distance


SID: Structural Intervention Distance (Peters & PB, 2013)
[boxplots (p = 100): SHD and SID to the true DAG for CAM, RESIT, LiNGAM,
PC, CPC and GES; for SID, lower and upper bounds are shown for PC, CPC and GES]
RESIT (Mooij et al. 2009) cannot be used for p = 100
CAM method is impressive where true functions are
non-monotone and nonlinear (sampled from Gaussian proc.);
for monotone functions: still good but less impressive gains
Gene expressions from isoprenoid pathways in Arabidopsis
thaliana (Wille et al., 2004)
p = 39, n = 118

top 20 edges from CAM, and stability selection

[pathway diagrams (Chloroplast/MEP pathway, Cytoplasm/MVA pathway,
Mitochondrion): left panel shows the top 20 edges from CAM, right panel the
edges retained by stability selection]
solid edges: estimated from data


stability selection: expected no. of false positives ≤ 2
(Meinshausen & PB, 2010)
When knowing the order of the variables
a new approach...

Theorem (Ernest & PB, 2014)
For a general nonlinear SEM with functions f_j ∈ L_2(P)
and a regularity condition on the joint density/distribution
(for all I ⊂ {1, ..., p}: p(x) ≥ M p(x_I) p(x_{I^c}) for some 0 < M ≤ 1):

E[Y | do(X = x)] = g(x),

g(x) = best additive L_2-approximation of Y versus X, {X_k; k ∈ S(X)},
e.g. S(X) = {k; k < j_X}, j_X = index corresponding to X

that is,

g^app = argmin_{f_j} E[(Y − f_{j_X}(X) − Σ_{k∈S(X)} f_k(X_k))^2],
g(·) = g^app_{j_X}(·)
implication:
only need to run additive model fitting, even for an “arbitrary
nonlinear SEM”,
e.g. even if f_j(X_{pa(j)}, ε_j) is very complicated...!

call it ord-additive modeling/fitting

very robust against model misspecification!
(misspecification could be dependent ε_j’s, i.e. due to hidden variables)

if we were to consider E[Y | do(X_1 = x_1, X_2 = x_2)] ⇝ would need
to fit a first-order interaction model
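
a rough R sketch of ord-additive fitting (assumptions mine: mgcv::gam with spline
smooths, and hypothetical variable names Y, X, Z1, Z2 in a data frame dat): regress Y
additively on X and on the variables preceding X in the known order S(X):

library(mgcv)

fit.ord.additive <- function(dat, y.name, x.name, preceding) {
  rhs <- paste(sprintf("s(%s)", c(x.name, preceding)), collapse = " + ")
  gam(as.formula(paste(y.name, "~", rhs)), data = dat)
}

## example (hypothetical data): Z1, Z2 precede X in the order
# fit <- fit.ord.additive(dat, "Y", "X", c("Z1", "Z2"))
# plot(fit, select = 1)   # fitted additive component of X: E[Y | do(X = x)] up to centering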
Swiss Air Ltd. ticket pricing and revenue
(Ghielmetti et al., in progress)

do not need to worry about complicated non-linearities!


estimation – in practice
additive model fitting of

Y versus X , {Xk ; k ∈ S(X )}

Hastie & Tibshirani (≈ 1990); Wood (2006); ...

high-dimensional scenario when |S(X )| is large


e.g. |S(X )| ≈ 100 − 50 000 and sample size n ≈ 50
consistency and optimal convergence rate results for sparse
additive model fitting if
– problem is sparse: maximal degree of true DAG is small
– identifiability aka restricted eigenvalue conditions
Ravikumar et al. (2009); Meier, van de Geer & PB (2009)

additive model fitting is well-established and works well


including high-dimensional setting
MEP pathway in Arabidopsis thaliana (Wille et al., 2004)

p = 14 expressions of genes, n = 118

order of variables: biological information w.r.t. up-/downstream

[pathway diagram (Chloroplast, MEP pathway) with the genes DXPS1–3, DXR, MCT,
CMK, MECPS, HDS, HDR, IPPI1, GPPS, GGPPS, PPDS1, PPDS2]

• rank directed gene pairs (“X causes Y”)
  by ∫ (Ê[Y | do(X = x)] − Ê[Y])^2 dx
• take the top 10 scoring directed gene pairs and
  check their stability ⇝
  stability selection:
  E[false positives] ≤ 1
  (Meinshausen & PB, 2010)
do not need to worry about complicated nonlinearities!


ord-additive regression: a local operation

inferring E[Y |do(X = x)] when DAG (or order) is known:


• ord-additive regression Y versus X, pa(X): local
• integration over all directed paths from X to Y: global

[plots: squared error of the “Entire Path”, “Partial Path” and “Parent Adjustment”
estimators vs. SHD to the true DAG (percentage of correct edges), for a
nonsparse and a sparse DAG]

ord-additive regression is much more reliable and “robust”


when having imperfect knowledge of the DAG or the order of
variables
and computationally much faster
Conclusions

1. Beware of over-interpretation!

so far, based on current data:
we cannot reliably infer a causal network,
despite theorems...
(perturbation of the data yields unstable networks)

2. Causal inference relies on subtle, uncheckable(!) assumptions
⇝ experimental validations are important (simple organisms in
biology are great for pursuing this!)

statistical (and other) inference is often not confirmatory

3. many technical issues in identifiability, high-dimensional


statistical inference and optimization
4. but there is potential:
for stable ranking/prediction of intervention/causal effects
... “causal inference from purely observed data could have
practical value in the prioritization and design of perturbation
experiments”
Editorial in Nature Methods (April 2010)

this can be very useful in computational biology

and in this sense:


“causal inference from observational data is much further
developed than 30 years ago when it was thought to be
impossible”
Thank you!

R-package: pcalg
(Kalisch, Mächler, Colombo, Maathuis & PB, 2012)
References:
• Ernest, J. and Bühlmann, P. (2014). On the role of additive regression for (high-dimensional) causal
  inference. Preprint arXiv:1405.1868
• Bühlmann, P., Peters, J. and Ernest, J. (2013). CAM: Causal Additive Models, high-dimensional order
  search and penalized regression. Preprint arXiv:1310.1533
• Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error
  variances. Biometrika 101, 219-228.
• Uhler, C., Raskutti, G., Bühlmann, P. and Yu, B. (2013). Geometry of faithfulness assumption in causal
  inference. Annals of Statistics 41, 436-463.
• van de Geer, S. and Bühlmann, P. (2013). ℓ_0-penalized maximum likelihood for sparse directed acyclic
  graphs. Annals of Statistics 41, 536-567.
• Kalisch, M., Mächler, M., Colombo, D., Maathuis, M.H. and Bühlmann, P. (2012). Causal inference using
  graphical models with the R package pcalg. Journal of Statistical Software 47 (11), 1-26.
• Stekhoven, D.J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M.H. and Bühlmann, P. (2011).
  Causal stability ranking. Bioinformatics 28, 2819-2823.
• Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov
  equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13, 2409-2464.
• Maathuis, M.H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale
  systems from observational data. Nature Methods 7, 247-248.
• Maathuis, M.H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from
  observational data. Annals of Statistics 37, 3133-3164.
