
High-dimensional causal inference, graphical modeling and structural equation models

Peter Bühlmann

Seminar für Statistik, ETH Zürich


cannot do confirmatory causal inference without
randomized intervention experiments...

but we can do better than proceeding naively


Goal

in genomics:
if we were to make an intervention at a single gene, what would
be its effect on a phenotype of interest?

want to infer/predict such effects without actually doing the
intervention, i.e. from observational data
(from observations of a “steady-state system”)

it doesn’t need to be genes

can generalize to interventions at more than one variable/gene
Policy making

James Heckman: Nobel Prize Economics 2000

e.g.:
“Pritzker Consortium on Early Childhood Development
identifies when and how child intervention programs can be
most influential”
Genomics

1. Flowering of Arabidopsis thaliana

phenotype/response variable of interest:
Y = days to bolting (flowering)
“covariates” X = gene expressions from p = 21,326 genes

remark: “gene expression”: the process by which information from
a gene is used in the synthesis of a functional gene product
(e.g. a protein)

question: infer/predict the effect of knocking out/knocking down
(or enhancing) a single gene (expression) on the
phenotype/response variable Y?
2. Gene expressions of yeast

p = 5360 genes
phenotype of interest: Y = expression of first gene
“covariates” X = gene expressions from all other genes
and then
phenotype of interest: Y = expression of second gene
“covariates” X = gene expressions from all other genes
and so on
infer/predict the effects of a single gene knock-down on all
other genes
⇝ consider the framework of an

intervention effect = causal effect

(mathematically defined; see later)
Regression – the “statistical workhorse”: the wrong approach

we could use a linear model (fitted from n observational data points)

Y = Σ_{j=1}^p β_j X^(j) + ε,   Var(X^(j)) ≡ 1 for all j

|β_j| measures the effect of variable X^(j) in terms of “association”,
i.e. the change of Y as a function of X^(j) when keeping all other
variables X^(k) fixed

⇝ not very realistic for the intervention problem:
if we change e.g. one gene, some others will also change,
and these others are not (cannot be) kept fixed
and indeed:

[ROC-type plot (yeast data): number of true positives vs. number of false positives
for IDA, Lasso, Elastic-net and random guessing; IDA clearly dominates]

⇝ can do much better than (penalized) regression!

Effects of single gene knock-downs on all other genes (yeast)
(Maathuis, Colombo, Kalisch & PB, 2010)

• p = 5360 genes (expression of genes)
• 231 gene knock-downs ⇝ 1.2 · 10^6 intervention effects
• the truth is “known in good approximation”
  (thanks to intervention experiments)

goal: prediction of the true large intervention effects
based on observational data with no knock-downs (n = 63)

[ROC-type plot: true positives vs. false positives for IDA, Lasso,
Elastic-net and random guessing, based on the n = 63 observational samples]
A bit more specifically

• univariate response Y
• p-dimensional covariate X

question:
what is the effect of setting the jth component of X to a certain
value x:

do(X^(j) = x)

⇝ this is a question of intervention type,
not the effect of X^(j) on Y when keeping all other variables fixed
(regression effect)
Reichenbach, 1956; Suppes, 1970; Rubin, Dawid, Holland, Pearl,
Glymour, Scheines, Spirtes,...
Intervention calculus (a review)

“dynamic” notion of an effect:
if we set a variable X^(j) to a value x (intervention)
⇝ some other variables X^(k) (k ≠ j) and maybe Y will change

we want to quantify the “total” effect of
X^(j) on Y, including the effect of “all changed” X^(k) on Y

a graph or influence diagram will be very useful

[example DAG on the variables X1, X2, X3, X4 and Y]
for simplicity: just consider DAGs (Directed Acyclic Graphs)
[with hidden variables (Spirtes, Glymour & Scheines (1993);
Colombo et al. (2012)) much more complicated and not validated
with real data]

random variables are represented as nodes in the DAG

assume a Markov condition, saying that
X^(j), given X^(pa(j)), is conditionally independent of its non-descendant variables

⇝ recursive factorization of the joint distribution:

P(Y, X^(1), ..., X^(p)) = P(Y | X^(pa(Y))) · Π_{j=1}^p P(X^(j) | X^(pa(j)))

for intervention calculus: use the truncated factorization (e.g. Pearl)
assume the Markov property for the causal DAG:

non-intervention vs. intervention do(X^(2) = x)

[DAG: X^(3), X^(4) → X^(2); X^(2) → X^(1); X^(1), X^(3) → Y;
under the intervention, the arrows into X^(2) are cut and X^(2) is set to x]

P(Y, X^(1), X^(2), X^(3), X^(4))                P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x))
  = P(Y | X^(1), X^(3))                           = P(Y | X^(1), X^(3))
  × P(X^(1) | X^(2))                              × P(X^(1) | X^(2) = x)
  × P(X^(2) | X^(3), X^(4))                       × P(X^(3))
  × P(X^(3))                                      × P(X^(4))
  × P(X^(4))
truncated factorization for do(X^(2) = x):

P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x))
  = P(Y | X^(1), X^(3)) · P(X^(1) | X^(2) = x) · P(X^(3)) · P(X^(4))

P(Y | do(X^(2) = x))
  = ∫ P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x)) dX^(1) dX^(3) dX^(4)
the truncated factorization is a mathematical consequence of
the Markov condition (with respect to the causal DAG) for the
observational probability distribution P
(plus the assumption that the structural equations are “autonomous”)

the intervention distribution P(Y | do(X^(2) = x)) can be calculated from
• the observational data distribution
  ⇝ need to estimate conditional distributions
• an influence diagram (causal DAG)
  ⇝ need to estimate the structure of a graph/influence diagram

intervention effect:
E[Y | do(X^(2) = x)] = ∫ y P(y | do(X^(2) = x)) dy

intervention effect at x0: (∂/∂x) E[Y | do(X^(2) = x)] |_{x = x0}

in the Gaussian case, (Y, X^(1), ..., X^(p)) ∼ N_{p+1}(µ, Σ):
(∂/∂x) E[Y | do(X^(2) = x)] ≡ θ_2 for all x
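
the truncated factorization can be checked numerically; below is a minimal R sketch
(not from the slides; the linear Gaussian form and the coefficients a, b, c, d, f are
illustrative assumptions) for the example DAG above, comparing a direct simulation of
the intervened system with Monte-Carlo sampling from the truncated factorization:

# example DAG: X4 -> X2, X3 -> X2, X2 -> X1, X1 -> Y, X3 -> Y
set.seed(1)
n <- 1e5
a <- 0.8; b <- 0.5; c <- 1.2; d <- 0.7; f <- -0.4
x <- 2                                   # intervention value do(X2 = x)

## (i) simulate the intervened system directly: X2 is set to x, its equation is cut
x3 <- rnorm(n); x4 <- rnorm(n)
x2 <- rep(x, n)
x1 <- c * x2 + rnorm(n)
y  <- d * x1 + f * x3 + rnorm(n)
mean(y)                                  # Monte-Carlo estimate of E[Y | do(X2 = x)]

## (ii) truncated factorization: sample X3, X4 from their marginals,
## X1 from P(X1 | X2 = x), Y from P(Y | X1, X3); the factor for X2 is dropped
x3t <- rnorm(n); x4t <- rnorm(n)
x1t <- c * x + rnorm(n)
yt  <- d * x1t + f * x3t + rnorm(n)
mean(yt)                                 # agrees with (i); both are close to d*c*x = 1.68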
The backdoor criterion (Pearl, 1993)

we only need to know the local parental set: for Z = X^(pa(X)),

if Y ∉ pa(X):   P(Y | do(X = x)) = ∫ P(Y | X = x, Z) dP(Z)

the parental set might not be the minimal adjustment set, but it always suffices

this is a consequence of the global Markov property:
X_A independent of X_B given X_S whenever A and B are d-separated by S

Gaussian case

in the Gaussian case: (∂/∂x) E[Y | do(X^(j) = x)] ≡ θ_j for all x

for Y ∉ pa(j): θ_j is the regression parameter in

Y = θ_j X^(j) + Σ_{k∈pa(j)} θ_k X^(k) + error

⇝ only need the parental set and regression
[example: j = 2, pa(j) = {3, 4} in the DAG above]
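
a minimal R sketch (same illustrative DAG and coefficients as in the simulation above;
assumptions mine) of this parental adjustment: regressing Y on X^(2) and its parents
recovers the intervention effect θ_2 = d·c, whereas the marginal regression of Y on
X^(2) alone is confounded by X^(3):

set.seed(2)
n <- 1e5
a <- 0.8; b <- 0.5; c <- 1.2; d <- 0.7; f <- -0.4
x3 <- rnorm(n); x4 <- rnorm(n)
x2 <- a * x3 + b * x4 + rnorm(n)         # observational (non-intervened) system
x1 <- c * x2 + rnorm(n)
y  <- d * x1 + f * x3 + rnorm(n)

coef(lm(y ~ x2))["x2"]                   # marginal association: confounded, != d*c
coef(lm(y ~ x2 + x3 + x4))["x2"]         # adjusted for pa(2) = {3, 4}: close to d*c = 0.84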
when having no unmeasured confounder (variable):

intervention effect (as defined) = causal effect

recap:
causal effect = effect from a randomized trial
(but we want to infer it without a randomized study...
because often we cannot do it, or it is too expensive)

structural equation models provide a different (but closely
related) route for quantifying intervention effects
Inferring intervention effects from observational
distribution
main problem: inferring DAG from observational data

impossible! can only infer equivalence class of DAGs


(several DAGs can encode exactly the same conditional
independence relationships)

Example:

X → Y (“X causes Y”)        X ← Y (“Y causes X”)

a lot of work about identifiability:


Verma & Pearl (1991); Spirtes, Glymour & Scheines (1993); Tian &
Pearl (2000–2002); Lauritzen & Richardson (2002); Shpitser & Pearl
(2006–2011); vanderWeele & Robins (2007–2011); Drton, Foygel &
Sullivant (2011);...
Markov equivalence class of DAGs

P a family of probability distributions on R^{p+1}

Definition:

M(D) = {P ∈ P; P is Markov w.r.t. DAG D}

D ∼ D′ ⇔ M(D) = M(D′)

the definition depends on the family P; typical models:
• P = Gaussian distributions
• P = nonparametric model/set of all distributions
• P = set of distributions from an additive structural equation model (see later)
A graphical characterization (Frydenberg, 1990; Verma & Pearl, 1991)

for
P = {Gaussian distributions} or
P = {nonparametric distributions}
it holds:

D ∼ D′ ⇐⇒ D and D′ have the same skeleton and the same v-structures

for a DAG D, we write its corresponding Markov equivalence class as

E(D) = {D′; D′ ∼ D}
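
this characterization can be checked mechanically; below is a small R helper sketch
(my own code, not from the slides) for two DAGs given as binary adjacency matrices
(A[i, j] = 1 for an edge i → j):

v.structures <- function(A) {                    # collect "i -> k <- j" with i, j non-adjacent
  p <- nrow(A); out <- character(0)
  for (k in 1:p) {
    pa <- which(A[, k] > 0)
    if (length(pa) < 2) next
    for (pair in combn(pa, 2, simplify = FALSE)) {
      i <- pair[1]; j <- pair[2]
      if (A[i, j] == 0 && A[j, i] == 0) out <- c(out, paste(i, "->", k, "<-", j))
    }
  }
  sort(out)
}
markov.equivalent <- function(A, B)
  all((A + t(A) > 0) == (B + t(B) > 0)) &&       # same skeleton
  identical(v.structures(A), v.structures(B))    # same v-structures

## example: X -> Y and X <- Y are Markov equivalent (same skeleton, no v-structures)
A <- matrix(c(0, 0, 1, 0), 2, 2)                 # edge 1 -> 2
markov.equivalent(A, t(A))                       # TRUE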
Equivalence class of DAGs

• Several DAGs can encode exactly the same conditional


independence relationships. Such DAGs form an equivalence class.
• Example: unshielded triple X1 – X2 – X3 (X1 and X3 not adjacent):

  orientation          X1 ⊥⊥ X3   X1 ⊥⊥ X3 | X2
  X1 → X2 → X3         false      true            no v-structure
  X1 ← X2 ← X3         false      true            no v-structure
  X1 ← X2 → X3         false      true            no v-structure
  X1 → X2 ← X3         true       false           v-structure

• All DAGs in an equivalence class have the same skeleton and the
  same v-structures
• An equivalence class can be uniquely represented by a completed
  partially directed acyclic graph (CPDAG)

[figure: a CPDAG and the DAGs 1–4 it represents]
we cannot estimate causal/intervention effects from the
observational distribution

but we will be able to estimate lower bounds of causal effects

conceptual “procedure”:
• probability distribution P from a DAG, generating the data
  ⇝ true underlying equivalence class of DAGs (CPDAG)
• find all DAG-members of the true equivalence class (CPDAG):
  D1, ..., Dm
• for every DAG-member Dr and every variable X^(j):
  single intervention effect θ_{r,j}

summarize them by

Θ = {θ_{r,j}; r = 1, ..., m; j = 1, ..., p}   (an identifiable parameter)
IDA (oracle version)

[flow diagram: oracle CPDAG (from the PC-algorithm) → DAG 1, ..., DAG m in its
equivalence class → do-calculus on each DAG → effect 1, ..., effect m → multi-set Θ]
If you want a single number for every variable ...

instead of the multi-set

Θ = {θ_{r,j}; r = 1, ..., m; j = 1, ..., p}

take the minimal absolute value

α_j = min_r |θ_{r,j}|   (j = 1, ..., p),

then |θ_{true,j}| ≥ α_j

⇝ the minimal absolute effect α_j is a lower bound for the true absolute
intervention effect
Computationally tractable algorithm

searching all DAGs is computationally infeasible if p is large
(we actually can do this only up to p ≈ 15–20)

instead of finding all m DAGs within an equivalence class ⇝
compute all intervention effects without finding all DAGs
(Maathuis, Kalisch & PB, 2009)
key idea: exploring local aspects of the graph is sufficient

[flow diagram: data → PC-algorithm → estimated CPDAG → local do-calculus →
effect 1, ..., effect q → multi-set Θ_L]

the local Θ_L = Θ up to multiplicities
(Maathuis, Kalisch & PB, 2009)
Estimation from finite samples

notation: drop the Y-notation (Y = X^(1), X^(2), ..., X^(p))

difficult part: estimation of the CPDAG (equivalence class of DAGs)
⇝ “structural learning”:

P ⇒ CPDAG (equivalence class of DAGs)

[example: estimated CPDAG on 10 nodes from pcAlgo(dm = d, alpha = 0.05)]

two main approaches:
• multiple testing of conditional dependencies:
  PC-algorithm as prime example
• score-based methods: MLE as prime example
Faithfulness assumption
(necessary for conditional dependence testing approaches)

a distribution P is called faithful to a DAG G if all conditional
independencies can be inferred from the graph

(one can infer some conditional independencies from the Markov
assumption alone; but we require here that “all” conditional
independencies are encoded by the graph)

assuming faithfulness ⇝ can infer the CPDAG from a list of
conditional (in-)dependence relations
What does it mean?

consider the linear SEM with DAG 1 → 2, 1 → 3, 2 → 3:

X^(1) ← ε^(1),
X^(2) ← α X^(1) + ε^(2),
X^(3) ← β X^(1) + γ X^(2) + ε^(3),
ε^(1), ε^(2), ε^(3) i.i.d. ∼ N(0, 1)

enforce marginal independence of X^(1) and X^(3):
β + αγ = 0, e.g. α = β = 1, γ = −1

Σ = [[1, 1, 0], [1, 2, −1], [0, −1, 2]],   Σ^{-1} = [[3, −2, −1], [−2, 2, 1], [−1, 1, 1]]

failure of faithfulness due to cancellation of coefficients

failure of exact faithfulness is “rare” (Lebesgue measure zero)

but for statistical estimation (in the Gaussian case) we “often”
require strong faithfulness (Robins, Scheines, Spirtes & Wasserman, 2003):

min{ |ρ(i, j | S)| ; ρ(i, j | S) ≠ 0, i ≠ j, |S| ≤ d } ≥ τ,   τ ≫ √(d log(p)/n)

(d is the maximal degree of the skeleton of the DAG)


... strong faithfulness can be rather severe
(Uhler, Raskutti, PB & Yu, 2013)

[figure: for 3 nodes and a full graph, the unfaithful distributions due to exact
cancellation form a surface in the parameter space; imagine a strip around it
⇝ large volume!]

⇒ strong faithfulness is restrictive in high dimensions
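
the cancellation example above is easy to reproduce; a minimal R sketch (my own code,
with α = β = 1, γ = −1 as on the slide) showing that X^(1) and X^(3) look marginally
independent even though the edge X^(1) → X^(3) exists:

set.seed(3)
n <- 1e5
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)                            # alpha = 1
x3 <- x1 - x2 + rnorm(n)                       # beta = 1, gamma = -1: beta + alpha*gamma = 0
cor(x1, x3)                                    # approx. 0, despite the direct edge X1 -> X3
cor(resid(lm(x1 ~ x2)), resid(lm(x3 ~ x2)))    # partial correlation given X2: clearly nonzero

so a marginal-correlation screening step would wrongly delete the edge X^(1) – X^(3)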
The PC-algorithm (Spirtes & Glymour, 1991)

• crucial assumption:
  the distribution P is faithful to the true underlying DAG
• less crucial but convenient:
  Gaussian assumption for Y, X^(1), ..., X^(p) ⇝ can work with
  partial correlations

• input: Σ̂_MLE
  (but we only need to consider many small sub-matrices of it,
  assuming sparsity of the graph)
• output: based on a clever data-dependent (random)
  sequence of multiple tests ⇝ estimated CPDAG
PC-algorithm: a rough outline
for estimating the skeleton of the underlying DAG

1. start with the full graph

2. remove edge i – j if Ĉor(X^(i), X^(j)) is small
   (Fisher’s Z-transform and null-distribution of zero correlation)

3. partial correlations of order 1:
   remove edge i – j if P̂arcor(X^(i), X^(j) | X^(k)) is small for some k in
   the current neighborhood of i or j (thanks to faithfulness)

4. move up to partial correlations of order 2:
   remove edge i – j if P̂arcor(X^(i), X^(j) | X^(k), X^(ℓ)) is small for some
   k, ℓ in the current neighborhood of i or j (thanks to faithfulness)

5. continue until removal of edges is not possible anymore,
   i.e. stop at the minimal order of partial correlation where
   edge-removal becomes impossible

[figures: full graph → correlation screening → partial correlation of order 1 → stopped]

an additional step of the algorithm is needed for estimating directions
⇝ yields an estimate of the CPDAG (equivalence class of DAGs)

one tuning parameter (cut-off parameter) α for truncation of the
estimated Z-transformed partial correlations

if the graph is “sparse” (few neighbors) ⇝ few iterations only,
and only low-order partial correlations play a role;
thus the estimation algorithm works for p ≫ n problems
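
a minimal R sketch (my own simplified code, not the pcalg implementation) of the first
two stages above, i.e. Fisher’s Z tests for marginal and order-1 partial correlations:

fisher.z.pval <- function(r, n, k) {            # k = size of the conditioning set
  z <- 0.5 * log((1 + r) / (1 - r))
  2 * pnorm(-abs(z) * sqrt(n - k - 3))
}
parcor1 <- function(C, i, j, k) {               # rho(i, j | k) from a correlation matrix C
  (C[i, j] - C[i, k] * C[j, k]) / sqrt((1 - C[i, k]^2) * (1 - C[j, k]^2))
}
skeleton01 <- function(X, alpha = 0.05) {       # correlation screening + order-1 step only
  n <- nrow(X); p <- ncol(X); C <- cor(X)
  adj <- matrix(TRUE, p, p); diag(adj) <- FALSE
  for (i in 1:(p - 1)) for (j in (i + 1):p)     # step 2: marginal correlation screening
    if (fisher.z.pval(C[i, j], n, 0) > alpha) adj[i, j] <- adj[j, i] <- FALSE
  for (i in 1:(p - 1)) for (j in (i + 1):p) {   # step 3: partial correlations of order 1
    if (!adj[i, j]) next
    for (k in setdiff(which(adj[i, ] | adj[j, ]), c(i, j)))
      if (fisher.z.pval(parcor1(C, i, j, k), n, 1) > alpha) {
        adj[i, j] <- adj[j, i] <- FALSE; break
      }
  }
  adj                                           # current estimate of the skeleton
}
# the full PC-algorithm continues with higher-order partial correlations and then
# orients v-structures; pcalg::pc() implements the complete procedure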
Computational complexity

crudely bounded to be polynomial in p

sparser underlying structure ⇝ faster algorithm

we can easily do the computations for
sparse cases with p ≈ 10^4 (≈ 1–2 hrs CPU time)

[log–log plot: processor time vs. p for expected neighborhood sizes E[N] = 2 and E[N] = 8]
IDA (Intervention calculus when the DAG is Absent)
(Maathuis, Colombo, Kalisch & PB, 2010)

1. PC-algorithm ⇝ estimated CPDAG
2. local algorithm ⇝ Θ̂_local
3. lower bounds for absolute causal effects ⇝ α̂_j

R-package: pcalg

this is what we used in the yeast example to score the
importance of the genes according to the size of α̂_j
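
a rough usage sketch of this IDA workflow with the pcalg package; my.data is a
hypothetical n × p data matrix, and argument names follow the package documentation
(details may differ between pcalg versions):

library(pcalg)

X <- scale(my.data)                            # hypothetical observational data matrix
suffStat <- list(C = cor(X), n = nrow(X))

## 1. PC-algorithm  ->  estimated CPDAG
pc.fit <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01, p = ncol(X))

## 2. local algorithm  ->  multiset of possible causal effects of variable 2 on variable 1
effects <- ida(x.pos = 2, y.pos = 1, mcov = cov(X), graphEst = pc.fit@graph,
               method = "local")

## 3. lower bound for the absolute causal effect
alpha.hat <- min(abs(effects))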
Statistical theory (Kalisch & PB, 2007; Maathuis, Kalisch & PB, 2009)
n i.i.d. observational data points; p variables;
high-dimensional setting where p ≫ n

assumptions:
• X^(1), ..., X^(p) ∼ N_p(0, Σ), Markov and faithful to the true DAG
• high-dimensionality: log(p) ≪ n
• sparsity: maximal degree d = max_j |ne(j)| ≪ n
• “coherence”: maximal (partial) correlations bounded,
  max{ |ρ_{i,j|S}| ; i ≠ j, |S| ≤ d } ≤ C < 1
• signal strength/strong faithfulness:
  min{ |ρ_{i,j|S}| ; ρ_{i,j|S} ≠ 0, i ≠ j, |S| ≤ d } ≫ √(d log(p)/n)

Then, for a suitable tuning parameter and 0 < δ < 1:

P[ĈPDAG = true CPDAG] = 1 − O(exp(−C n^{1−δ}))
P[Θ̂_L = Θ as sets] = 1 − O(exp(−C n^{1−δ}))
(i.e. consistency of the lower bounds for causal effects)
Main strategy of a proof
we have to analyze an algorithm... ⇝
• understand the population version:
  in particular, one can show that the algorithm stops after d or d − 1 steps
• analysis of the noisy version: additional errors;
  use a union bound argument to control the overall error
  (seems very rough but leads to an asymptotically “optimal”
  regime, e.g., for the signal strength)
The role of “sparsity” in causal inference
as usual: sparsity is necessary for accurate estimation in
presence of noise

but here: “sparsity” (so-called protectedness) is crucial for


identifiability as well

X → Y (“X causes Y”)        X ← Y (“Y causes X”)

cannot tell from observational data the direction of the arrow

the same situation arises with a full graph with more than 2 nodes

⇝ causal identification really needs “sparsity”

the better the “sparsity”, the tighter the bounds for causal effects
Maximum likelihood estimation

[photo: R.A. Fisher]

a Gaussian DAG model is a Gaussian linear structural equation model;
e.g. for the DAG 1 → 2, 1 → 3, 2 → 3:

X^(1) ← ε^(1)
X^(2) ← β_{21} X^(1) + ε^(2)
X^(3) ← β_{31} X^(1) + β_{32} X^(2) + ε^(3)

in general:

X^(j) ← Σ_{k=1}^p β_{jk} X^(k) + ε^(j)   (j = 1, ..., p),   β_{jk} ≠ 0 ⇔ edge k → j

X = BX + ε,   ε ∼ N_p(0, diag(σ_1^2, ..., σ_p^2))   in matrix notation

⇝ reparameterization:

(Σ̂, D̂) = argmin_{Σ; D a DAG}  −ℓ(Σ, D; data) + λ |D|
        = argmin_{B; {σ_j^2; j}}  −ℓ(B, {σ_j^2; j}; data) + λ ‖B‖_0,   ‖B‖_0 = Σ_{i,j} 1(B_{ij} ≠ 0)

under the non-convex constraint that B corresponds to “no directed cycles”

severe non-convex problem due to the “no directed cycle” constraint
(the ‖·‖_0-penalty rather than e.g. ‖·‖_1 doesn’t make the problem much harder)
toy example:
X^(1) ← β_1 X^(2) + ε_1
X^(2) ← β_2 X^(1) + ε_2

[figure: the corresponding parameter space in (β_1, β_2) around (0, 0)]

non-convex parameter space!
(no straightforward way to do convex relaxation)

computation:
• dynamic programming algorithm (up to p ≈ 20)
  (Silander and Myllymäki, 2006)
• greedy algorithms on equivalence classes
  (Chickering, 2002; Hauser & PB, 2012)

statistical properties of the penalized MLE (for “large” p):
(van de Geer & PB, 2013)
no faithfulness assumption required to obtain the minimal-edge I-MAP!
Successes in biology
Effects of single gene knock-downs on all other genes in yeast
(Maathuis, Colombo, Kalisch & PB, 2010)

[ROC-type plot (n = 63 observational data): true positives vs. false positives
for IDA, Lasso, Elastic-net and random guessing]
Arabidopsis thaliana (Stekhoven, Moraes, Sveinbjörnsson, Hennig,
Maathuis & PB, 2012)

response Y: days to bolting (flowering) of the plant
(aim: fast flowering plants)
covariates X: gene-expression profile

observational data with n = 47 and p = 21,326

⇝ lower bound estimates α̂_j for the causal effect of every
gene/variable on Y (using the PC-algorithm)

apply stability selection (Meinshausen & PB, 2010)
⇝ assigning uncertainties via control of
PCER = E[V]/p ≤ (2π_thr − 1)^{-1} q^2/p^2   (per-comparison error rate)
Causal gene ranking

 #   Gene         summary rank   effect   median expression   error (PCER)   name
1 AT2G45660 1 0.60 5.07 0.0017 AGL20 (SOC1)
2 AT4G24010 2 0.61 5.69 0.0021 ATCSLG1
3 AT1G15520 2 0.58 5.42 0.0017 PDR12
4 AT3G02920 5 0.58 7.44 0.0024 replication protein-related
5 AT5G43610 5 0.41 4.98 0.0101 ATSUC6
6 AT4G00650 7 0.48 5.56 0.0020 FRI
7 AT1G24070 8 0.57 6.13 0.0026 ATCSLA10
8 AT1G19940 9 0.53 5.13 0.0019 AtGH9B5
9 AT3G61170 9 0.51 5.12 0.0034 protein coding
10 AT1G32375 10 0.54 5.21 0.0031 protein coding
11 AT2G15320 10 0.50 5.57 0.0027 protein coding
12 AT2G28120 10 0.49 6.45 0.0026 protein coding
13 AT2G16510 13 0.50 10.7 0.0023 AVAP5
14 AT3G14630 13 0.48 4.87 0.0039 CYP72A9
15 AT1G11800 15 0.51 6.97 0.0028 protein coding
16 AT5G44800 16 0.32 6.55 0.0704 CHR4
17 AT3G50660 17 0.40 7.60 0.0059 DWF4
18 AT5G10140 19 0.30 10.3 0.0064 FLC
19 AT1G24110 20 0.49 4.66 0.0059 peroxidase, putative
20 AT1G27030 20 0.45 10.1 0.0059 unknown protein
• biological validation by gene knockout experiments in progress

red: biologically known genes responsible for flowering

performed validation experiments with mutants corresponding to
these top 20 − 3 = 17 genes
• 14 mutants easily available ⇝ only test for 14 genes
• more than usual: mutants showed low germination or survival...
• 9 among the 14 mutants survived (sufficiently strongly), i.e.
  9 mutants for which we have an outcome
• 3 among the 9 mutants (genes) showed a significant effect
  on Y relative to the wildtype (non-mutated plant)
⇝ that is: besides the three known genes, we find three
additional genes which exhibit a significant difference in terms
of “time to flowering”
in short:
bounds on causal effects (α̂j ’s) based on observational data
lead to interesting predictions for interventions in genomics
(i.e. which genes would exhibit a large intervention effect)
and these predictions have been validated using experiments
Fully identifiable cases

recap:
if P = {Gaussian distributions}, then the cardinality of the
Markov equivalence class of a DAG D is often |E(D)| > 1

the same is true for P = {nonparametric distributions}

but under some additional constraints, the Markov equivalence
class becomes smaller or even consists of a single DAG (i.e.,
the DAG is identifiable)

structural equation model (SEM):

X_j ← f_j(X_{pa(j)}, ε_j)   (j = 1, ..., p)

e.g.

X_j ← Σ_{k∈pa(j)} f_{jk}(X_k) + ε_j   (j = 1, ..., p)
three types of identifiable SEMs, where the
true DAG is identifiable from the observational distribution:

• LiNGAM (Shimizu et al., 2006)

  X_j ← Σ_{k∈pa(j)} B_{jk} X_k + ε_j   (j = 1, ..., p),

  with the ε_j’s non-Gaussian;
  as with independent component analysis (ICA)
  ⇝ Gaussian components are the hardest to identify

• linear Gaussian with equal error variances (Peters & PB, 2014)

  X_j ← Σ_{k∈pa(j)} B_{jk} X_k + ε_j   (j = 1, ..., p),

  Var(ε_j) ≡ σ^2 for all j

• nonlinear SEM with additive noise (Schölkopf and
  co-workers, 2008–2014: Hoyer et al., 2009; Peters et al., 2013)

  X_j ← f_j(X_{pa(j)}) + ε_j   (j = 1, ..., p)
toy example: p = 2 with DAG X^(1) = X → Y = X^(2)

truth: Y = X^3 + ε,  with ε independent of X

[scatter plots with fitted regression curves:
left: the forward model Y = X^3 + ε; the residuals ε are independent of X;
right: the reverse model X = Y^{1/3} + η; the residuals η are not independent of Y]
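
a small numerical sketch of this toy example (my own code; the sample size, the loess
fits and the use of a squared-residual correlation as a crude stand-in for a proper
independence test such as HSIC are assumptions):

set.seed(1)
n <- 500
x <- runif(n, 0, 3)
y <- x^3 + rnorm(n)                 # true direction: additive noise independent of x

res.fwd <- residuals(loess(y ~ x))  # fit in the causal direction  Y = f(X) + noise
res.bwd <- residuals(loess(x ~ y))  # fit in the reverse direction X = g(Y) + noise

cor(res.fwd^2, x)                   # approx. 0: residuals look independent of x
cor(res.bwd^2, y)                   # clearly nonzero: residuals depend on y

only the causal direction admits an additive-noise model with independent residuals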
strategy to infer structure and model parameters
(for identifiable models):

order search (order among the variables)
&
subsequent (sparse) regression

order search: MLE for the best permutation/order

• candidate order/permutation π (of {1, ..., p})
• regressions

  X_{π(j)} versus X_{π(j−1)}, X_{π(j−2)}, ..., X_{π(1)}

  (when sparse: only some of the regressors are selected)

evaluation score for π:
likelihood of all the regressions X_{π(j)} vs. X_{π(j−1)}, ..., X_{π(1)}
⇝ Π_{j=1}^p p(X_{π(j)} | X_{π(j−1)}, ..., X_{π(1)})

need a model for the regressions:
• functional form of the regressions (e.g. additive, linear)
• distribution of the error terms (e.g. Gaussian, nonparametric)
⇝ can compute the likelihood with estimated parameters
and search over the orderings/permutations
search over the orderings/permutations:
there are p! permutations...

trick: preliminary neighborhood selection,
aiming for an undirected graph containing the skeleton of the
DAG (i.e. the Markov blanket or a superset thereof)

[figure: “superset skeleton” on 5 example nodes]

⇝ do order search and estimation restricted to the superset skeleton
• much reduced search space
• can use the unpenalized MLE (for the best order/permutation
  restricted to the superset skeleton)
that is: regularization in neighborhood selection suffices!
CAM: Causal Additive Model (PB, Peters & Ernest, 2013)

X_j ← Σ_{k∈pa(j)} f_{jk}(X_k) + ε_j   (j = 1, ..., p),   ε_j ∼ N(0, σ_j^2)

• underlying DAG is identifiable from the joint distribution
• statistically “feasible” for estimation due to the additive functions
• good empirical performance

the log-likelihood equals (up to a constant term):

−Σ_{j=1}^p log(σ̂_{j,π}),   (σ̂_{j,π})^2 = n^{-1} Σ_{i=1}^n (X_{i,π(j)} − Σ_{k<j} f̂_{j,π(k)}(X_{i,π(k)}))^2
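
a rough R sketch (my own code, not the CAM implementation) of this order score
−Σ_j log(σ̂_{j,π}): regress each X_{π(j)} additively on its predecessors; here with
mgcv::gam smoothing splines, whereas CAM additionally restricts to a preliminary
superset skeleton and uses sparse additive regression:

library(mgcv)

order.score <- function(X, perm) {            # X: numeric data matrix, perm: candidate order
  n <- nrow(X); score <- 0
  for (j in seq_along(perm)) {
    target <- X[, perm[j]]
    if (j == 1) {
      res <- target - mean(target)
    } else {
      preds <- as.data.frame(X[, perm[1:(j - 1)], drop = FALSE])
      names(preds) <- paste0("z", seq_len(ncol(preds)))
      dat <- cbind(y = target, preds)
      form <- as.formula(paste("y ~", paste(sprintf("s(%s)", names(preds)), collapse = " + ")))
      res <- residuals(gam(form, data = dat))  # additive regression on the predecessors
    }
    score <- score - log(sqrt(mean(res^2)))    # contributes -log(sigma.hat_{j,pi})
  }
  score
}
# a higher score corresponds to a better (more likely) order; note that with purely
# linear Gaussian fits all orders would score equally well in population, so the
# nonlinear additive fits are essential for identifying the order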
fitting method
1. preliminary neighborhood selection
   (for a superset of the skeleton of the DAG or the Markov blanket):
   “nodewise” sparse additive regression of
   one variable X_j against all others {X_k; k ≠ j};
   easy (using CV), works well and is established
   (Ravikumar et al. (2009), Meier, van de Geer & PB (2009), ...)
2. estimate a best order by unpenalized MLE restricted to
   the superset skeleton
3. based on the estimated order π̂: additive sparse model
   fitting restricted to the superset skeleton ⇝ edge pruning

[figure: the graph on 5 example nodes after step 1 (superset skeleton),
step 2 (oriented) and step 3 (pruned)]

for step 2: greedy search restricted to the superset skeleton
– exhaustive computation is possible for trees (Bürge, MSc thesis 2014)
– greedy methods are shown to find the optimum for restricted classes of
  DAGs (Peters, Balakrishnan and Wainwright, in progress)

unpenalized MLE for the best order search: very convenient,
since regularization is decoupled
and only used in the sparse additive regressions
statistical consistency: high-dimensional p ≫ n setting
(PB, Peters & Ernest, 2013)
main assumptions:
• maximal degree of the DAG is O(1) (“sparse”)
• sufficiently large identifiability constant (w.r.t. Kullback-Leibler
  divergence) between the true and any wrong ordering:
  ξ_p := p^{-1} min_{π ∉ Π_0} ( E_{θ_0}[−log(p_{θ_0^π}(X))] − E_{θ_0}[−log(p_{θ_0}(X))] ),
  require ξ_p ≫ √(log(p)/n)
• exponential moments for the X_j’s and certain smooth functions thereof

then:
P[π̂ ∈ Π_0] → 1   (n → ∞),   Π_0 = set of true permutations

• consistent recovery of the true functions f_{jk}^0 and the DAG D^0
• consistent estimation of E[Y | do(X = x)]
remark:

preliminary neighborhood selection
• yields empirical and computational improvements
• seems to improve the theoretical results,
  e.g. in the linear Gaussian case ⇝ less stringent conditions
  than for the ℓ_0-regularized MLE (van de Geer & PB, 2013)
Empirical results to illustrate what can be achieved with CAM

p = 100, n = 200
true model is CAM (additive SEM) with Gaussian error

SHD: Structural Hamming Distance


SID: Structural Intervention Distance (Peters & PB, 2013)
[boxplots (p = 100): SHD and SID to the true DAG for CAM, RESIT, LiNGAM,
PC, CPC and GES; for SID, lower and upper bounds are shown for PC, CPC and GES]
RESIT (Mooij et al. 2009) cannot be used for p = 100
CAM method is impressive where true functions are
non-monotone and nonlinear (sampled from Gaussian proc.);
for monotone functions: still good but less impressive gains
Gene expressions from isoprenoid pathways in Arabidopsis
thaliana (Wille et al., 2004)
p = 39, n = 118

top 20 edges from CAM, and stability selection

[pathway diagrams (Chloroplast/MEP pathway, Cytoplasm/MVA pathway,
Mitochondrion): left panel shows the top 20 edges from CAM, right panel the
edges retained by stability selection]
solid edges: estimated from data


stability selection: expected no. of false positives ≤ 2
(Meinshausen & PB, 2010)
When knowing the order of the variables
a new approach...

Theorem (Ernest & PB, 2014)
For a general nonlinear SEM with functions f_j ∈ L_2(P)
and a regularity condition on the joint density/distribution
(for all I ⊂ {1, ..., p}: p(x) ≥ M p(x_I) p(x_{I^c}) for some 0 < M ≤ 1):

E[Y | do(X = x)] = g(x),

g(x) = best additive L_2-approximation of Y versus X, {X_k; k ∈ S(X)},
e.g. S(X) = {k; k < j_X}, j_X = index corresponding to X

that is,

g^app = argmin_{f_j} E[(Y − f_{j_X}(X) − Σ_{k∈S(X)} f_k(X_k))^2],
g(·) = g^app_{j_X}(·)
implication:
only need to run additive model fitting, even for an “arbitrary
nonlinear SEM”,
e.g. even if f_j(X_{pa(j)}, ε_j) is very complicated...!

call it ord-additive modeling/fitting

very robust against model misspecification!
(misspecification could be dependent ε_j’s, i.e. due to hidden variables)

if we were to consider E[Y | do(X_1 = x_1, X_2 = x_2)] ⇝ would need
to fit a first-order interaction model
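
a rough R sketch of ord-additive fitting (assumptions mine: mgcv::gam with spline
smooths, and hypothetical variable names Y, X, Z1, Z2 in a data frame dat): regress Y
additively on X and on the variables preceding X in the known order S(X):

library(mgcv)

fit.ord.additive <- function(dat, y.name, x.name, preceding) {
  rhs <- paste(sprintf("s(%s)", c(x.name, preceding)), collapse = " + ")
  gam(as.formula(paste(y.name, "~", rhs)), data = dat)
}

## example (hypothetical data): Z1, Z2 precede X in the order
# fit <- fit.ord.additive(dat, "Y", "X", c("Z1", "Z2"))
# plot(fit, select = 1)   # fitted additive component of X: E[Y | do(X = x)] up to centering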
Swiss Air Ltd. ticket pricing and revenue
(Ghielmetti et al., in progress)

do not need to worry about complicated non-linearities!


estimation – in practice
additive model fitting of

Y versus X , {Xk ; k ∈ S(X )}

Hastie & Tibshirani (≈ 1990); Wood (2006); ...

high-dimensional scenario when |S(X )| is large


e.g. |S(X )| ≈ 100 − 50 000 and sample size n ≈ 50
consistency and optimal convergence rate results for sparse
additive model fitting if
– problem is sparse: maximal degree of true DAG is small
– identifiability aka restricted eigenvalue conditions
Ravikumar et al. (2009); Meier, van de Geer & PB (2009)

additive model fitting is well-established and works well


including high-dimensional setting
MEP pathway in Arabidopsis thaliana (Wille et al., 2004)

p = 14 expressions of genes, n = 118

order of variables: biological information w.r.t. up-/downstream

[pathway diagram (Chloroplast, MEP pathway) with the genes DXPS1–3, DXR, MCT,
CMK, MECPS, HDS, HDR, IPPI1, GPPS, GGPPS, PPDS1, PPDS2]

• rank directed gene pairs (“X causes Y”)
  by ∫ (Ê[Y | do(X = x)] − Ê[Y])^2 dx
• take the top 10 scoring directed gene pairs and
  check their stability ⇝
  stability selection:
  E[false positives] ≤ 1
  (Meinshausen & PB, 2010)
do not need to worry about complicated nonlinearities!


ord-additive regression: a local operation

inferring E[Y |do(X = x)] when DAG (or order) is known:


• ord-additive regression Y versus X, pa(X): local
• integration over all directed paths from X to Y: global

[plots: squared error of the “Entire Path”, “Partial Path” and “Parent Adjustment”
estimators vs. SHD to the true DAG (percentage of correct edges), for a
nonsparse and a sparse DAG]

ord-additive regression is much more reliable and “robust”


when having imperfect knowledge of the DAG or the order of
variables
and computationally much faster
Conclusions

1. Beware of over-interpretation!

so far, based on current data:
we cannot reliably infer a causal network,
despite theorems...
(perturbation of the data yields unstable networks)

2. Causal inference relies on subtle, uncheckable(!) assumptions
⇝ experimental validations are important (simple organisms in
biology are great for pursuing this!)

statistical (and other) inference is often not confirmatory

3. many technical issues in identifiability, high-dimensional


statistical inference and optimization
4. but there is potential:
for stable ranking/prediction of intervention/causal effects
... “causal inference from purely observed data could have
practical value in the prioritization and design of perturbation
experiments”
Editorial in Nature Methods (April 2010)

this can be very useful in computational biology

and in this sense:


“causal inference from observational data is much further
developed than 30 years ago when it was thought to be
impossible”
Thank you!

R-package: pcalg
(Kalisch, Mächler, Colombo, Maathuis & PB, 2012)
References:
• Ernest, J. and Bühlmann, P. (2014). On the role of additive regression for (high-dimensional) causal
  inference. Preprint arXiv:1405.1868
• Bühlmann, P., Peters, J. and Ernest, J. (2013). CAM: Causal Additive Models, high-dimensional order
  search and penalized regression. Preprint arXiv:1310.1533
• Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error
  variances. Biometrika 101, 219-228.
• Uhler, C., Raskutti, G., Bühlmann, P. and Yu, B. (2013). Geometry of faithfulness assumption in causal
  inference. Annals of Statistics 41, 436-463.
• van de Geer, S. and Bühlmann, P. (2013). ℓ_0-penalized maximum likelihood for sparse directed acyclic
  graphs. Annals of Statistics 41, 536-567.
• Kalisch, M., Mächler, M., Colombo, D., Maathuis, M.H. and Bühlmann, P. (2012). Causal inference using
  graphical models with the R package pcalg. Journal of Statistical Software 47 (11), 1-26.
• Stekhoven, D.J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M.H. and Bühlmann, P. (2011).
  Causal stability ranking. Bioinformatics 28, 2819-2823.
• Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov
  equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13, 2409-2464.
• Maathuis, M.H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale
  systems from observational data. Nature Methods 7, 247-248.
• Maathuis, M.H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from
  observational data. Annals of Statistics 37, 3133-3164.
