BOOK - Heuristics, Probability and Causality - A Tribute To Judea Pearl
Table of Contents

List of Contributors .......................................................... ix
Preface ....................................................................... xi
I. Heuristics .................................................................. 1
11. Judea Pearl and Graphical Models for Economics
    Michael Kearns ........................................................... 187
26. Pearl Causality and the Value of Control
    Ross Shachter and David Heckerman ........................................ 445
List of Contributors
Azaria Paz, Computer Science Department, Technion – Israel Institute of Technology,
Israel
Ira Pohl, Computer Science Department, UC Santa Cruz, USA
David Poole, Department of Computer Science, University of British Columbia, Canada
Edward T. Purcell, Los Angeles, USA
Thomas S. Richardson, Department of Statistics, University of Washington, USA
James M. Robins, Departments of Epidemiology and Biostatistics, Harvard University,
USA
Emma Rollon, School of Information and Computer Sciences, UC Irvine, USA
Nelson Rushton, Department of Computer Science, Texas Tech University, USA
Stuart Russell, Department of EECS, UC Berkeley, USA
Jonathan Schaeffer, Department of Computing Science, University of Alberta, Canada
Richard Scheines, Department of Philosophy, CMU, USA
Ross Shachter, Department of Management Science and Engineering, Stanford University,
USA
Yoav Shoham, Computer Science Department, Stanford University, USA
Ilya Shpitser, Department of Epidemiology, Harvard University, USA
David Spiegelhalter, Statistics Laboratory, University of Cambridge, UK
Peter Spirtes, Department of Philosophy, CMU, USA
Wolfgang Spohn, Department of Philosophy, University of Konstanz, Germany
V. S. Subrahmanian, Computer Science Department, University of Maryland, USA
Jin Tian, Department of Computer Science, Iowa State University, USA
Robert Tillman, Department of Philosophy and Machine Learning Department, CMU,
USA
Christopher Winship, Department of Sociology, Harvard University, USA
Ingrid Zukerman, Faculty of Information Technology, Monash University, Australia
Preface
This book is a collection of articles in honor of Judea Pearl written by close col-
leagues and former students. Its three main parts, heuristics, probabilistic rea-
soning, and causality, correspond to the titles of the three ground-breaking books
authored by Judea, and are followed by a section of short reminiscences.
Judea Pearl was born in Tel Aviv and is a graduate of the Technion - Israel In-
stitute of Technology. He came to the United States for postgraduate work in 1960.
He received his Master’s degree in physics from Rutgers University and his Ph.D.
degree in electrical engineering from the Brooklyn Polytechnic Institute, both in
1965. Until 1969, he held research positions at RCA David Sarnoff Research Labo-
ratories in Princeton, New Jersey and at Electronic Memories, Inc. at Hawthorne,
California. In 1969 Pearl joined the UCLA faculty where he is currently an emeritus
professor of computer science and director of the cognitive systems laboratory.
Judea started his research work in artificial intelligence (AI) in the mid-1970s,
not long after joining UCLA. In the eyes of a hard scientist, AI must have been
a fascinating but slippery scientific discipline then; a lot of AI was done through
introspection and programming, building systems that could display some form of
intelligence.
Since then, AI has changed a great deal. Arguably no one has played a larger
role in that change than Judea. Judea Pearl’s work made probability the prevailing
language of modern AI and, perhaps more significantly, it placed the elaboration
of crisp and meaningful models, and of effective computational mechanisms, at
the center of AI research. This work is conveyed in the more than 300 scientific
papers, and in his three landmark books Heuristics (1984), Probabilistic Reasoning
(1988), and Causality (2000), where he deals with the basic questions concerning the
acquisition, representation, and effective use of heuristic, probabilistic, and causal
knowledge. He tackled these issues not as a philosopher or mathematician, but
as an engineer and a cognitive scientist. His “burning question” was (and still is)
how does the human mind “do it”, and he set out to answer this question with an
unusual combination of intuition, passion, intellectual honesty, and technical skill.
Judea is the recipient of numerous scientific awards. In 1996 he was selected
by the UCLA Academic Senate as the 81st Faculty Research Lecturer to deliver an
annual research lecture which presents the university’s most distinguished scholars
to the public. He received the 1999 IJCAI Research Excellence Award in Artificial
Intelligence for “his fundamental work on heuristic search, reasoning under uncer-
tainty, and causality”, the 2001 London School of Economics Lakatos Award for the
“best book in the philosophy of science”, the 2004 ACM Allen Newell Award for
“seminal contributions that extend to philosophy, psychology, medicine, statistics,
econometrics, epidemiology and social science”, and the 2008 Benjamin Franklin
Medal for “creating the first general algorithms for computing and reasoning with
uncertain evidence”.
Judea has had more than 20 PhD students at UCLA, many of whom have
become successful AI researchers in their own right, and many of whom have contributed to this
volume. Chronologically, they are: Antonio Leal (1976), Alan Chrolotte (1977),
Ed Purcell (1978), Joseph Saleh (1980), Jin Kim (1983), Gerard Michon (1983),
Rina Dechter (1985), Ingrid Zukerman (1986), Hector Geffner (1989), Dan Geiger
(1990), Moises Goldszmidt (1992), Tom Verma (1990), Itay Meiri (1992), Rachel
Ben-Eliyahu (1993), Sek-Wah Tan (1995), Alexander Balke (1995), Max Chickering
(1996), Jin Tian (2002), Carlos Brito (2004), Blai Bonet (2004), Mark Hopkins
(2005), Chen Avin (2006), and Ilya Shpitser (2008).
On a sadder note, Judea is the father of slain Wall Street Journal reporter Daniel
Pearl and president of the Daniel Pearl Foundation, which he co-founded with his
wife Ruth in April 2002 to continue Daniel’s life-work of dialogue and understanding
and to address the root causes of his tragic death.
This book will be presented to Judea on March 12, 2010 at a special event at
UCLA honoring his life and work, where many of the contributing authors to this
book will speak. Two of the editors of this volume, Rina and Hector, are former
students of Judea, and the third, Joe, is a close colleague and collaborator. The
three of us would like to thank all the authors whose articles are included in this vol-
ume. Special thanks go to Adnan Darwiche and Rich Korf of the UCLA Computer
Science Department, who helped to organize this event, and to Avi Dechter, Randy
Hess, Nir Lipovetzky, Felix Elwert, and Jane Spurr, who helped in the production
of the book.
Judea, on behalf of those present in the book, and the many of your students and
colleagues who are not, we would like to express our most profound gratitude and
admiration to you, as an advisor, a scientist, and a great human being. It has been
a real privilege to know you, to benefit from your (truly enjoyable!) company, to
watch you, and to learn from you. As students, we couldn’t have hoped for a better
role model. As colleagues, we couldn’t have benefited more from your collaboration
and leadership. We know that you don’t like compliments, but you are certainly
the light in our candle!
Thank you Judea!!!
Part I: Heuristics

Heuristic Search for Planning under Uncertainty
Blai Bonet and Eric A. Hansen

1 Introduction
The artificial intelligence (AI) subfields of heuristic search and automated planning
are closely related, with planning problems often providing a stimulus for developing
and testing search algorithms. Classical approaches to heuristic search and planning
assume a deterministic model of sequential decision making in which a solution
takes the form of a sequence of actions that transforms a start state into a goal
state. The effectiveness of heuristic search for classical planning is illustrated by
the results of the planning competitions organized by the AI planning community,
where optimal planners based on A*, and satisficing planners based on variations
of best-first search and enforced hill climbing, have performed as well as or better
than many other planners in the deterministic track of the competition [Edelkamp,
Hoffmann, and Littman 2004; Gerevini, Bonet, and Givan 2006].
Beginning in the 1990s, AI researchers became increasingly interested in the
problem of planning under uncertainty and adopted Markov decision theory as a
framework for formulating and solving such problems [Boutilier, Dean, and Hanks
1999]. The traditional dynamic programming approach to solving Markov decision
problems (MDPs) [Bertsekas 1995; Puterman 1994] can be viewed as a form of
“blind” or uninformed search. Accordingly, several AI researchers considered how to
generalize well-known heuristic-search techniques in order to develop more efficient
planning algorithms for MDPs. The advantage of heuristic search over traditional,
blind dynamic programming is that it uses an admissible heuristic and intelligent
search control to focus computation on solving the problem for relevant states, given
a start state and goal states, without considering irrelevant or unreachable parts of
the state space.
In this article, we present an overview of research on heuristic search for prob-
lems of sequential decision making where state transitions are stochastic instead
of deterministic, an important class of planning problems that corresponds to the
most basic kind of Markov decision process, called a fully-observable Markov de-
cision process. For this special case of the problem of planning under uncertainty,
a fairly mature theory of heuristic search has emerged over the past decade and a
half. In reviewing this work, we focus on two key issues: how to generalize classic
heuristic search algorithms in order to solve planning problems with stochastic state
transitions, and how to compute admissible heuristics for these search problems.
Judea Pearl’s classic book, Heuristics, provides a comprehensive overview of
heuristic search theory as of its publication date in 1984. One of our goals in
this article is to show that the twin themes of that book, admissible heuristics and
intelligent search control, have been central issues in the subsequent development
of a class of algorithms for problems of planning under uncertainty. In this short
survey, we rely on references to the literature for many of the details of the al-
gorithms we review, including proofs of their properties and experimental results.
Our objective is to provide a high-level overview that identifies the key ideas and
contributions in the field and to show how the new search algorithms for MDPs
relate to the classical search algorithms covered in Pearl’s book.
MDP models with different optimization criteria, and almost all of the algorithms
and results we review in this article apply to other MDPs. The most widely-used
model in the AI community is the discounted infinite-horizon MDP. In this model,
there are rewards instead of costs, r(s, a) denotes the expected reward for taking
action a in state s, which can be positive or negative, γ ∈ (0, 1) denotes a dis-
count factor, and the objective is to maximize expected total discounted reward
over an infinite horizon. Interestingly, any discounted infinite-horizon MDP can be
reduced to an equivalent stochastic shortest-path problem [Bertsekas 1995; Bonet
and Geffner 2009]. Thus, we do not sacrifice any generality by focusing our attention
on stochastic shortest-path problems.
Given a policy π, the expected total cost of following π from a state s is

$$V_\pi(s) = E_\pi\Big[\, \sum_{k=0}^{\infty} c(X_k, \pi(X_k)) \;\Big|\; X_0 = s \,\Big],$$
where the Xk ’s are random variables that denote states of the system at different
time points, distributed according to Pπ , and where Eπ is the expectation with
respect to Pπ . The function Vπ is called the state evaluation function, or simply
the value function, for policy π. For a stochastic shortest-path problem, it is well-
defined as long as π is a proper policy, and Vπ (s) equals the expected cost to reach
a goal state from state s when using policy π.
A policy π for a stochastic shortest-path problem is optimal if its value function
satisfies the Bellman optimality equation:
$$V^*(s) = \begin{cases} 0 & \text{if } s \in G, \\ \min_{a \in A(s)} \Big[\, c(s,a) + \sum_{s' \in S} p(s'|s,a)\, V^*(s') \,\Big] & \text{otherwise.} \end{cases} \qquad (1)$$
The unique solution of this functional equation, denoted V ∗ , is the optimal value
function; hence, all optimal policies have the same value function. Given the optimal
value function, one can recover an optimal policy by acting greedily with respect to
the value function. A greedy policy with respect to a value function V is defined as
follows:
$$\pi_V(s) = \operatorname*{argmin}_{a \in A(s)} \Big[\, c(s,a) + \sum_{s' \in S} p(s'|s,a)\, V(s') \,\Big].$$
Thus, the problem of finding an optimal policy for an MDP is reduced to the
problem of solving the optimality equation.
There are two basic dynamic programming approaches for solving Equation (1):
value iteration and policy iteration. The value iteration approach is used by all
of the heuristic search algorithms we consider, and so we review it here. Starting
with an initial value function V0 , satisfying V0 (s) = 0 for s ∈ G, value iteration
computes a sequence of updated value functions by performing, at each iteration,
the following backup for all states s ∈ S:
$$V_{n+1}(s) := \min_{a \in A(s)} \Big[\, c(s,a) + \sum_{s' \in S} p(s'|s,a)\, V_n(s') \,\Big]. \qquad (2)$$
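To make the backup of Equation (2) concrete, the following is a minimal sketch of a value-iteration loop for an explicit-state stochastic shortest-path MDP. The problem interface (states, goals, actions, cost, prob) is hypothetical and not part of the original text, and the updates are applied in place (Gauss-Seidel style) rather than synchronously.

def value_iteration(states, goals, actions, cost, prob, epsilon=1e-6):
    # Value function initialized to zero everywhere; goal values stay at zero.
    V = {s: 0.0 for s in states}
    while True:
        residual = 0.0
        for s in states:
            if s in goals:
                continue
            # One-step lookahead (the backup of Equation (2)) over all actions.
            best = min(cost(s, a) + sum(p * V[t] for t, p in prob(s, a))
                       for a in actions(s))
            residual = max(residual, abs(best - V[s]))
            V[s] = best                      # in-place (Gauss-Seidel) update
        if residual < epsilon:
            break
    # Recover a greedy policy from the (near-)optimal value function.
    policy = {s: min(actions(s),
                     key=lambda a: cost(s, a) + sum(p * V[t] for t, p in prob(s, a)))
              for s in states if s not in goals}
    return V, policy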
(i) all immediate costs incurred by transitions from non-goal states are positive, and
(ii) the initial state evaluation function is admissible, with all goal states having an
initial value of zero.1
Although we classify RTDP as a heuristic search algorithm, it is also a dynamic
programming algorithm. We consider an algorithm to be a form of dynamic pro-
gramming if it solves a dynamic programming recursion such as Equation (1) and
caches results for subproblems in a table, so that they can be reused without need-
ing to be recomputed. We consider it to be a form of heuristic search if it uses
an admissible heuristic and reachability analysis, beginning from a start state, to
prune parts of the state space. By these definitions, LRTA* and RTDP are both
dynamic programming algorithms and heuristic search algorithms, and so is A*.
We still contrast heuristic search to simple dynamic programming, which solves the
problem for the entire state space. Value iteration and policy iteration are simple
dynamic programming algorithms, as are Dijkstra’s algorithm and Bellman-Ford.
But heuristic search algorithms can often be viewed as a form of enhanced or fo-
cused dynamic programming, and that is how we view the algorithms we consider
in the rest of this survey.2 The relationship between heuristic search and dynamic
programming comes into clearer focus when we consider LAO*, another heuristic
search algorithm for solving MDPs.
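As a complement to the discussion above, here is a minimal sketch of a single RTDP trial, using the same hypothetical interface as the value-iteration sketch; state values are kept in a dictionary V that is lazily initialized from an admissible heuristic h. It is meant only to convey the trial-based, start-state-driven flavor of the algorithm, not its exact formulation.

import random

def rtdp_trial(s0, goals, actions, cost, prob, V, h, max_steps=1000):
    # Follow the greedy policy from the start state, backing up each visited state
    # and sampling the next state according to the transition probabilities.
    s, steps = s0, 0
    while s not in goals and steps < max_steps:
        q = lambda a: cost(s, a) + sum(p * V.setdefault(t, h(t)) for t, p in prob(s, a))
        a = min(actions(s), key=q)
        V[s] = q(a)                                   # backup on the visited state
        succs = prob(s, a)
        s = random.choices([t for t, p in succs], weights=[p for t, p in succs])[0]
        steps += 1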
3.2 LAO*
Whereas RTDP generalizes LRTA*, an online heuristic search algorithm, the next
algorithm we consider, LAO* [Hansen and Zilberstein 2001], generalizes the classic
AO* search algorithm, which is an offline heuristic search algorithm. The ‘L’ in
LAO* indicates that it can find solutions with loops, unlike AO*. Table 1 shows how
various dynamic programming and heuristic search algorithms are related, based on
the structure of the solutions they find. As we will see, the branching and cyclic
behavior specified by a policy for an indefinite-horizon MDP can be represented
explicitly in the form of a cyclic graph.
Both AO* and LAO* represent the search space of a planning problem as an
AND/OR graph. In an AND/OR graph, an OR node represents the choice of an
action and an AND node represents a set of outcomes. AND/OR graph search was
1 Although the convergence proof given by Barto et al. depends on the assumption that all action
costs are positive, Bertsekas and Tsitsiklis [Bertsekas and Tsitsiklis 1996] prove that RTDP also
converges for stochastic shortest-path problems with both positive and negative action costs, given
the additional assumption that all improper policies have infinite cost. If action costs are positive
and negative, however, the assumption that all improper policies have infinite cost is difficult to
verify. In practice, it is often more convenient to assume that all action costs are positive.
2 Not every heuristic search algorithm is a dynamic programming algorithm. Tree-search heuris-
tic search algorithms, in particular, do not cache the results of subproblems and thus do not qualify
as dynamic programming algorithms. For example, IDA*, which explores the tree expansion of a
graph, does not cache the results of subproblems and thus does not qualify as a dynamic program-
ming algorithm. On the other hand, IDA* extended with a transposition table caches the results
of subproblems and thus is a dynamic programming algorithm.
                                         Solution form
                            simple path    acyclic graph          cyclic graph
  Dynamic programming       Dijkstra's     backwards induction    value iteration
  Offline heuristic search  A*             AO*                    LAO*
  Online heuristic search   LRTA*          RTDP                   RTDP
traversing the reachable state space and updating the value function.
Experiments show that Improved LAO* finds a good solution as quickly as RTDP
and converges to an optimal solution much faster; faster convergence is due to its
use of systematic search instead of stochastic simulation to explore the state space.
The test for convergence to an optimal solution generalizes the convergence test for
AO*: the best solution graph is optimal if it is complete (i.e., it does not contain
any unexpanded nodes), and if state values have converged to exact values for all
nodes in the best solution graph. If the state values are not exact, it is possible to
bound the suboptimality of the solution by adapting the error bounds developed
for value iteration.
the course of a depth-first traversal of the graph. Since Improved LAO* expands
and updates the states in the current best solution graph in the course of a depth-
first traversal of the graph, the two algorithms are easily combined. In fact, Bonet
and Geffner [Bonet and Geffner 2003a] present their HDP algorithm as a synthesis
of Tarjan’s algorithm and a depth-first search algorithm, similar to the one used in
Improved LAO*.
The same idea of labeling states as ‘solved’ can also be combined with RTDP.
In Labeled RTDP (LRTDP), trials are very much like RTDP trials except that
they terminate when a solved state is reached. (Initially only the goal states are
solved.) At the end of a trial, a labeling procedure is invoked for each unsolved
state visited in the trial, in reverse order from the last unsolved state to the start
state. For each state s, the procedure performs a depth-first traversal of the states
that are reachable from s by selecting actions greedily based on the current value
function. If the residuals of these states are less than a threshold ǫ, then all of
them are labeled as ‘solved’. Like AO*, Labeled RTDP terminates when the initial
state is labeled as ‘solved’. The labeling procedure used by LRTDP is similar to the
traversal procedures used in HDP and Improved LAO*. However, the innovation
of LRTDP is that instead of always traversing the solution graph from the start
state, it begins the traversal at each state visited in a trial, in backwards order from
the last unsolved state, which allows the convergence of states near the goal to be
recognized before states near the initial state have converged.
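The following is a simplified sketch of the labeling step just described, under the same hypothetical MDP interface as in the earlier sketches. The actual procedure of Bonet and Geffner is more refined (it maintains separate open and closed stacks and updates the states in reverse order on failure), so this should be read only as an illustration of the idea.

def check_solved(s, solved, goals, actions, cost, prob, V, h, epsilon):
    # Traverse the states reachable from s through the greedy policy; if every
    # residual is below epsilon, label all of them as solved.
    all_converged, stack, seen = True, [s], set()
    while stack:
        u = stack.pop()
        if u in seen or u in solved or u in goals:
            continue
        seen.add(u)
        V.setdefault(u, h(u))
        q = lambda a: cost(u, a) + sum(p * V.setdefault(t, h(t)) for t, p in prob(u, a))
        a = min(actions(u), key=q)
        if abs(q(a) - V[u]) > epsilon:
            all_converged = False             # residual too large: cannot label yet
        else:
            stack.extend(t for t, p in prob(u, a))
    if all_converged:
        solved.update(seen)                   # every residual is small: label as solved
    else:
        for u in seen:                        # otherwise, update the visited states
            V[u] = min(cost(u, a) + sum(p * V.setdefault(t, h(t)) for t, p in prob(u, a))
                       for a in actions(u))
    return all_converged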
Experiments show that LRTDP converges much faster than RTDP, and some-
what faster than Improved LAO*, in solving benchmark “racetrack” problems. In
general, the amount of improvement is problem-dependent since it depends on the
extent to which the solution graph decomposes into strongly-connected components.
In the racetrack domain, the improvement over Improved LAO* is due to labeling
states as ‘solved’; the more substantial improvement over RTDP is partly due to
labeling, but also due to the more systematic traversal of the state space.
Lower and upper bounds. Both LRTDP and HDP gradually reduce the Bellman
residual until it falls below a threshold ǫ. If the threshold is sufficiently small, the
policy is optimal. But the residual, by itself, does not bound the suboptimality of
the solution. To bound its suboptimality, we need an upper bound on the value
of the starting state in addition to the lower-bound values computed by heuristic
search. Once a closed policy is found, an obvious way to bound its suboptimality
is to evaluate the policy; its value for the start state is an upper bound that can be
compared to the admissible lower-bound value computed by heuristic search. But
this approach does not allow the suboptimality of an incomplete solution (one for
which the start state is not yet labeled ‘solved’) to be bounded.
McMahan et al. [McMahan, Likhachev, and Gordon 2005] and Smith and Sim-
mons [Smith and Simmons 2006] describe two algorithms, called Bounded RTDP
(BRTDP) and Focused RTDP (FRTDP) respectively, that compute upper bounds
in order to bound the suboptimality of a solution, including incomplete solutions,
and use the difference between the upper and lower bounds on state values to fo-
cus search effort. The key assumption of both algorithms is that in addition to
an admissible heuristic function that returns lower bounds for any state, there is a
function that returns upper bounds for any state. Every time BRTDP or FRTDP
visits a state, it performs two backups: a standard RTDP backup to compute a
lower-bound value and another backup to compute an upper-bound value. In sim-
ulated trials, action outcomes are determined based on their probability and the
largest difference between the upper and lower bound values of the possible succes-
sor states, which has the effect of biasing state exploration to where it is most likely
to improve the value function.
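A sketch of how such bound-gap-weighted sampling might look is given below, assuming dictionaries VL and VU hold the current lower and upper bound values and prob(s, a) returns (successor, probability) pairs; the names are illustrative rather than taken from the cited papers.

import random

def biased_successor(s, a, prob, VL, VU):
    # Weight each successor by its transition probability times the gap between its
    # upper and lower bound values, so sampling favors poorly-known states.
    weighted = [(t, p * (VU[t] - VL[t])) for t, p in prob(s, a)]
    total = sum(w for _, w in weighted)
    if total <= 0.0:
        return None                      # all successor bounds have converged
    r = random.uniform(0.0, total)
    for t, w in weighted:
        r -= w
        if r <= 0.0:
            return t
    return weighted[-1][0]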
This approach has a lot of attractive properties. In particular, being able to
bound the suboptimality of an incomplete solution is useful when it is computa-
tionally prohibitive to compute a policy that is closed with respect to the start
state. However, the approach is based on the assumption that an upper-bound
value function is available and easily computed, and this assumption may not be
realistic for many stochastic shortest-path problems. For discounted MDPs, on the
other hand, such bounds are easily computed, as we show in Section 4.3
4 Admissible heuristics
Heuristic search algorithms require admissible heuristics to prune large state spaces
effectively. As advocated by Pearl, an effective and domain-independent strategy
for obtaining admissible heuristics consists in optimally solving a relaxation of the
problem, an MDP in our case. In this section, we review some relaxation-based
heuristics for MDPs. However, we first consider admissible heuristics that are not
based on relaxations. Although such heuristics are not informative, they are useful
when informative heuristics cannot be easily computed.
bound on the optimal value function. The time required to compute these bounds
is linear in the number of states and actions, but the bounds need to be computed
just once as their value does not depend on the state s.
A more informative admissible heuristic is obtained by relaxing the MDP into a
deterministic shortest-path problem whose optimal value function satisfies

$$V_{\min}(s) = \begin{cases} 0 & \text{if } s \in G, \\ \min_{a \in A(s)} \min_{s' \in S(s,a)} \big[\, c(s,a) + V_{\min}(s') \,\big] & \text{otherwise,} \end{cases}$$

where S(s, a) = {s′ : p(s′|s, a) > 0} is the subset of successor states of s through the
action a. Interestingly, this equation is the optimality equation for a deterministic
shortest-path problem over the graph Gmin = (V, E) where V = S, and there is
an edge (s, s′) with cost c(s, s′) = min{c(s, a) : p(s′|s, a) > 0, a ∈ A(s)} for s′ ∈
S(s, a). The graph Gmin is a relaxation of the MDP in which the non-deterministic
outcomes of an action are split into separate deterministic actions, so that
the agent can choose the most convenient outcome. If this relaxation
is solved optimally, the state values Vmin(s) provide an admissible heuristic for the
MDP. This relaxation is called the min-min relaxation of the MDP [Bonet and
Geffner 2005b]; its optimal value at state s is denoted by Vmin(s).
When the number of states is relatively small and can fit in memory, the state
values Vmin (s) can be obtained using Dijkstra’s algorithm in time polynomial in the
number of states and actions. Otherwise, the values can be obtained, as needed,
using a search algorithm such as A* or IDA* on the graph Gmin . Indeed, the state
value Vmin (s) is the cost of a minimum-cost path from s to any goal state. A* and
IDA* require an admissible heuristic function h(s) for searching Gmin ; if nothing
better is available, the non-informative heuristic h(s) = 0 can be used.
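A small sketch of this computation is shown below, assuming the same explicit-state interface as in the earlier sketches: the relaxed graph Gmin is built edge by edge and Dijkstra's algorithm is run backwards from the goal states, so that Vmin(s) is the cost of a cheapest relaxed path from s to a goal.

import heapq, itertools

def min_min_values(states, goals, actions, cost, prob):
    # Reverse edges of Gmin: for every transition s -> s' possible under action a,
    # record an edge from s' back to s with cost c(s, a).
    rev = {s: [] for s in states}
    for s in states:
        if s in goals:
            continue
        for a in actions(s):
            for t, p in prob(s, a):
                if p > 0:
                    rev[t].append((s, cost(s, a)))
    # Multi-source Dijkstra from the goal states over the reversed graph.
    V = {s: float('inf') for s in states}
    counter = itertools.count()           # tie-breaker for the priority queue
    heap = []
    for g in goals:
        V[g] = 0.0
        heapq.heappush(heap, (0.0, next(counter), g))
    while heap:
        d, _, t = heapq.heappop(heap)
        if d > V[t]:
            continue
        for s, c in rev[t]:
            if d + c < V[s]:
                V[s] = d + c
                heapq.heappush(heap, (V[s], next(counter), s))
    return V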
Given a deterministic relaxation of an MDP, such as this, another approach
to computing admissible heuristics for the original MDP is based on the recogni-
tion that any admissible heuristic for the deterministic relaxation is also admissible
for the original MDP. That is, if an estimate h(s) is a lower bound on the value
Vmin (s), it is also a lower bound on the value V ∗ (s) for the MDP. Therefore, we can
use any method for computing admissible heuristics for deterministic shortest-path
problems in order to compute admissible heuristics for the corresponding stochas-
tic shortest-path problems. Since such methods often rely on state abstraction,
the heuristics can be stored in memory even when the state space of the original
problem is much too large to fit in memory.
Instead of applying relaxation methods for deterministic shortest-path problems
to a deterministic relaxation of an MDP, another approach is to apply similar re-
laxation methods directly to the MDP. This strategy was explored by Dearden and
Boutilier [Dearden and Boutilier 1997], who describe an approach to state abstrac-
tion for factored MDPs that can be used to compute admissible heuristics. Their
approach ignores certain state variables of the original MDP in order to create an
exponentially smaller abstract MDP that can be solved more easily. Such a relax-
ation can be useful when it is not desirable to abstract away all stochastic aspects
of a problem.
variants for probabilistic planning, and how to compute admissible heuristics for it.
A STRIPS planning problem with conditional effects (simply STRIPS) is a tuple
⟨F, I, G, O⟩ where F is a set of fluent symbols, I ⊆ F is the initial state, G ⊆ F
denotes the set of goal states, and O is a set of operators. A state is a valuation
of the fluent symbols, denoted by the subset of fluents that are true in the state.
An operator a ∈ O consists of a precondition Pre ⊆ F and a collection CE of
conditional effects of the form C → L, where C and L are sets of literals that
denote the condition and effect of the conditional effect.
A simple probabilistic STRIPS problem (simply sp-STRIPS) is a STRIPS prob-
lem in which each operator a has a precondition Pre and a list of probabilistic
outcomes of the form ⟨(p_1, CE_1), . . . , (p_n, CE_n)⟩ where p_i > 0, \sum_i p_i ≤ 1, and
each CE_i is a set of conditional effects. In sp-STRIPS, the state that results after
applying action a in state s is equal to the state that results after applying the con-
ditional effects in CE_i to s with probability p_i, or the same state s with probability
1 − \sum_{i=1}^{n} p_i.
In PPDDL, probabilistic effects are expressed using statements of the form
(:action toss-three-coins
:parameters (c1 c2 c3 - coin)
:precondition (and (not (tossed c1)) (not (tossed c2)) (not (tossed c3)))
:effect (and (tossed c1)
(tossed c2)
(tossed c3)
(probabilistic 1/2 (heads c1) 1/2 (tails c1))
(probabilistic 1/2 (heads c2) 1/2 (tails c2))
(probabilistic 1/2 (heads c3) 1/2 (tails c3))))
This action is not an sp-STRIPS action since its outcomes are factored along mul-
tiple probabilistic effects. An equivalent sp-STRIPS action has the same precondition
but effects of the form ⟨(1/8, CE_1), (1/8, CE_2), . . . , (1/8, CE_8)⟩
where each CEi stands for a deterministic outcome of the action; e.g., CE1 =
(and (heads c1) (heads c2) (heads c3)).
Under the assumptions that there are no probabilistic effects inside conditional
effects and that there are no nested conditional effects, a probabilistic planning
problem described with PPDDL can be transformed into an equivalent sp-STRIPS
problem by taking the cross products of the probabilistic effects within each ac-
tion; a translation that takes exponential time in the maximum number of prob-
abilistic effects per action. However, once in sp-STRIPS, the problem can be
further relaxed into (deterministic) STRIPS by converting each action of the form
⟨Pre, ⟨(p_1, CE_1), . . . , (p_n, CE_n)⟩⟩ into n deterministic actions of the form ⟨Pre, CE_i⟩.
This relaxation is the min-min relaxation now implemented at the level of the rep-
resentation language, without the need to explicitly generate the state and action
spaces of the MDP.
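The following sketch illustrates this representation-level relaxation under an assumed, minimal encoding in which an sp-STRIPS action is a pair (precondition, outcomes) and each outcome is a (probability, effects) pair; the encoding and the names are illustrative only and not part of the original text.

def min_min_relaxation(sp_actions):
    # Turn each probabilistic outcome of an sp-STRIPS action into a separate
    # deterministic action with the same precondition.
    det_actions = {}
    for name, (pre, outcomes) in sp_actions.items():
        for i, (p, effects) in enumerate(outcomes):
            if p > 0:
                det_actions[f"{name}#{i}"] = (pre, effects)
    return det_actions

# Example: a single coin toss becomes two deterministic actions.
sp = {"toss-c1": (frozenset({"not-tossed-c1"}),
                  [(0.5, frozenset({"heads-c1", "tossed-c1"})),
                   (0.5, frozenset({"tails-c1", "tossed-c1"}))])}
print(min_min_relaxation(sp))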
The min-min relaxation of a PPDDL problem is a deterministic planning problem
whose optimal solution provides an admissible heuristic for the probabilistic plan-
ning problem. Thus, any admissible heuristic for the deterministic problem provides
an admissible heuristic for the probabilistic problem. (This is the approach used in
the mGPT planner for probabilistic planning [Bonet and Geffner 2005b].)
The above relaxation provides an interesting and fruitful connection with the field of (de-
terministic) automated planning, in which the computation of domain-independent
and admissible heuristics is an important area of research. Over the last decade, the
field has witnessed important progress in the development of novel and powerful
heuristics that can be used for probabilistic planning.
5 Conclusions
We have shown that increased interest in the problem of planning under uncertainty
has led to the development of a new class of heuristic search algorithms for these
planning problems. The effectiveness of these algorithms illustrates the wide appli-
cability of the heuristic search approach. This approach is influenced by ideas that
can be traced back to some of the fundamental contributions in the field of heuristic
search laid down by Pearl.
In this brief survey, we only reviewed search algorithms for the special case of
the problem of planning under uncertainty in which state transitions are uncertain.
Many other forms of uncertainty may need to be considered by a planner. For
example, planning problems with imperfect state information are often modeled as
partially observable Markov decision processes for which there are also algorithms
based on heuristic search [Bonet and Geffner 2000; Bonet and Geffner 2009; Hansen
1998; Smith and Simmons 2005]. For some planning problems, there is uncertainty
about the parameters of the model. For other planning problems, there is uncer-
tainty due to the presence of multiple agents. The development of effective heuristic
search algorithms for these more complex planning problems remains an important
and active area of research.
References
Barto, A., S. Bradtke, and S. Singh (1995). Learning to act using real-time dy-
namic programming. Artificial Intelligence 72 (1), 81–138.
Heuristics, Planning and Cognition
Hector Geffner

1 Introduction
In the book Heuristics, Pearl studies the strategies for the control of problem solving
processes in human beings and machines, pondering how people manage to solve
an extremely broad range of problems with so little effort, and how machines could
do the same [Pearl 1983, pp. vii]. The central concept in the book, as captured
in the title, are the heuristics: the “criteria, methods, or principles for deciding
which among several alternative courses of action promises to be the most effective
in order to achieve some goal” [Pearl 1983, pp. 3]. Pearl places special emphasis on
heuristics that take the form of evaluation functions and which provide quick but
approximate estimates of the distance or cost-to-go from a given state to the goal.
These heuristic evaluation functions provide the search with a sense of direction
with actions resulting in states that are closer to the goal being preferred. An
informative heuristic h(s) in the 15-puzzle, for example, is the well-known ‘sum of
Manhattan distances’, which adds up the Manhattan distance of each tile from its
location in the state s to its goal location.
The book Heuristics laid the foundations for the work in automated problem
solving in Artificial Intelligence (AI) and is still a basic reference in the field. On
the other hand, as an account of human problem solving, the book has not been as
influential. A reason for this is that while the book devotes one chapter to discuss
the derivation of heuristics, most of the book is devoted to the formulation and
analysis of heuristic search algorithms. Most of these algorithms, such as A* and
AO*, are complete and optimal, meaning that they will find a solution if there is
one, and that the solution found will have minimal cost (provided that the heuristic
does not overestimate the true costs). Yet, while people excel at solving a wide
variety of problems almost effortlessly, it is only in puzzle-like problems that they
need to resort to search, and then they are not particularly good at it and are
even worse when solutions must be optimal.
Thus, the account of problem solving in the book exhibits a gap that has been
characteristic of AI systems, which results in programs that rival the best human
experts in specialized domains but are no match for children in their general problem
solving abilities.
In this article, I aim to present recent work in AI Planning, a form of domain-
independent problem solving, that builds on Pearl’s work and bears on this gap.
Planners are general problem solvers aimed at solving an infinite collection of prob-
lems automatically. The problems are instances of various classes of models all of
which are intractable in the worst case. In order to solve these problems effectively,
then, a planner must automatically recognize and exploit their structure. This is
the key challenge in planning and, more generally, in domain-independent problem
solving. In planning, this challenge has been addressed by deriving the heuristic eval-
uation functions automatically from the problems, an idea explored by Pearl and
developed more fully in recent planning research. The resulting domain-independent
planners are not as efficient as specialized solvers but are more general, and thus, be-
have in a way that is closer to people. Moreover, the resulting evaluation functions
often enable the solution of problems with almost no search, and appear to play the
role of the ‘intuitions’ and ‘feelings’ that guide human problem solving and have
been difficult to capture explicitly by means of rules. We will see indeed how such
heuristic evaluation functions are defined and computed in a domain-independent
fashion, and why they can be regarded as relevant from a cognitive point of view.
The organization of the article is the following. We consider in order, planning
models, languages, and algorithms (Section 2), the automatic extraction of heuristic
evaluation functions and other developments in planning (Sections 3 and 4), the
cognitive interpretation of these heuristics (Section 5), and then, more generally,
the relation between AI and Cognitive Science (Section 6).
2 Planning
Planning is an area of AI concerned with the selection of actions for achieving goals.
The first AI planner and one of the first AI programs was the General Problem Solver
(GPS) developed by Newell, Shaw, and Simon in the late 50’s [Newell, Shaw, and
Simon 1958; Newell and Simon 1963]. Since then, planning has remained a central
topic in AI while changing in significant ways: on the one hand, it has become more
mathematical, with a variety of planning problems defined and studied; on the other,
it has become more empirical, with planning algorithms evaluated experimentally
and planning competitions held periodically.
Planning can be understood as representing one of the three main approaches for
selecting the action to do next, a problem that is central in the design of autonomous
systems and often called the control problem in AI.
In the programming-based approach, the programmer solves the control problem
in his or her head and makes the solution explicit in the program. For example, for a robot
moving in an office environment, the program may say to back up when too close to
a wall, to search for a door if the robot has to move to another room, etc. [Brooks
1987; Mataric 2007].
In the learning-based approach, the control knowledge is not provided explicitly by
a programmer but is learned by trial and error, as in reinforcement learning [Sutton
and Barto 1998], or by generalization from examples, as in supervised learning
[Mitchell 1997].
[Figure: a planner receives descriptions of the actions, sensors, and goals, and produces a controller that maps observations from the world into actions.]
observable otherwise. While the model for planning with sensing is a slight varia-
tion of the model for conformant planning, the resulting solution or plan forms are
quite different as observations can and must be taken into account in the selection of
actions. Indeed, solutions to planning with sensing problems can be expressed equiv-
alently as either trees [Weld, Anderson, and Smith 1998], policies mapping beliefs
into actions [Bonet and Geffner 2000], or finite-state controllers [Bonet, Palacios,
and Geffner 2009]. A finite-state controller is an automaton defined by a collection of
tuples of the form ⟨q, o, a, q′⟩ that prescribe to do action a and move to the controller
state q′ after getting the observation o in the controller state q.
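A minimal sketch of how such a controller is executed is given below, where the callbacks get_observation, apply_action, and goal_reached stand for a hypothetical interface to the environment.

def run_controller(tuples, q0, get_observation, apply_action, goal_reached, max_steps=100):
    # tuples is a collection of (q, o, a, q') entries: in controller state q, upon
    # observation o, do action a and switch to controller state q'.
    table = {(q, o): (a, q2) for (q, o, a, q2) in tuples}
    q = q0
    for _ in range(max_steps):
        if goal_reached():
            return True
        o = get_observation()
        if (q, o) not in table:
            return False                  # no entry for this pair: the controller is stuck
        a, q = table[(q, o)]
        apply_action(a)
    return False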
The probabilistic versions of these models are also used in planning. The models
that result when the actions have stochastic effects and the states are fully ob-
servable are the familiar Markov Decision Processes (MDPs) used in Operations
Research and Control Theory [Bertsekas 1995], while the models that result when
actions and sensors are stochastic are the Partially Observable MDPs (POMDPs)
[Kaelbling, Littman, and Cassandra 1998].
In Strips, the actions o ∈ O are represented by three sets of atoms from F called
the Add, Delete, and Precondition lists, denoted as Add(o), Del(o), and Pre(o). The
first describes the atoms that the action o makes true, the second, the atoms that
o makes false, and the third, the atoms that must be true in order for the action
to be applicable. A Strips problem P = ⟨F, O, I, G⟩ encodes in compact form the
state model S(P) where
- the states s ∈ S are the possible subsets of F,
- the initial state s0 is I,
- the goal states are the states s such that G ⊆ s,
- the actions applicable in a state s are the operators o ∈ O with Pre(o) ⊆ s, and
- the state that results from applying o in s is s′ = (s \ Del(o)) ∪ Add(o).
The states in S(P ) represent the possible valuations over the boolean variables
in F. Thus, if the set of variables F has cardinality |F| = n, the number of states
in S(P) is 2^n. A state s represents the valuation where the variables appearing in
s are taken to be true, while the variables not appearing in s are false.
As an example, a planning domain that involves three locations l1 , l2 , and l3 , and
three tasks t1 , t2 , and t3 , where ti can be performed only at li , can be modeled with
a set F of fluents at(li ) and done(ti ), and a set O of actions go(li , lj ) and do(ti ),
i, j = 1, . . . , 3, with precondition, add, and delete lists Pre(a) = {at(li)}, Add(a) =
{at(lj)}, and Del(a) = {at(li)} for a = go(li, lj), and Pre(a) = {at(li)}, Add(a) =
{done(ti)}, and Del(a) = {} for a = do(ti). The problem of doing tasks t1 and t2
starting at location l3 can then be modeled by the tuple P = ⟨F, I, O, G⟩ where
I = {at(l3)} and G = {done(t1), done(t2)}.
The number of states in the problem is 2^6 as there are 6 boolean variables. Still,
it can be shown that many of these states are not reachable from the initial state.
Indeed, the atoms at(li ) for i = 1, 2, 3 are mutually exclusive and exhaustive, mean-
ing that every state reachable from s0 makes one and only one of these atoms
true. These boolean variables encode indeed the possible values of the multi-valued
variable that represents the agent location.
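The example above can be made concrete with a small sketch of the induced state model, where applicable returns the operators whose preconditions hold in a state and progress computes the successor state; the Python encoding (names, data structures) is illustrative and not part of the original text.

def applicable(state, operators):
    # Operators whose precondition is contained in the state (a set of true fluents).
    return [o for o, (pre, add, dele) in operators.items() if pre <= state]

def progress(state, o, operators):
    # Successor state: remove the delete list, then add the add list.
    pre, add, dele = operators[o]
    return (state - dele) | add

# The three-location / three-task example: fluents at(li) and done(ti).
operators = {}
for i in "123":
    for j in "123":
        if i != j:
            operators[f"go({i},{j})"] = ({f"at({i})"}, {f"at({j})"}, {f"at({i})"})
    operators[f"do(t{i})"] = ({f"at({i})"}, {f"done(t{i})"}, set())

s0 = {"at(3)"}
print(applicable(s0, operators))        # go(3,1), go(3,2) and do(t3) are applicable
s = progress(progress(s0, "go(3,1)", operators), "do(t1)", operators)
s = progress(progress(s, "go(1,2)", operators), "do(t2)", operators)
print({"done(t1)", "done(t2)"} <= s)    # True: the goal holds in the resulting state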
Strips is a planning language based on variables that are boolean, yet planning
languages featuring primitive multi-valued variables and richer syntactic constructs
are commonly used for describing both classical and non-classical planning models
[McDermott 1998; Younes, Littman, Weissman, and Asmuth 2005].
GPS, the first AI planner introduced by Newell, Shaw, and Simon, used a tech-
nique called means-ends analysis where differences between the current state and
the goal situation were identified and mapped into operators that could decrease
those differences [Newell and Simon 1963]. Since then, the idea of means-ends anal-
ysis has been refined and extended in many ways, seeking planning algorithms that
are sound (only produce plans), complete (produce a plan if one exists), and effec-
tive (scale up to large problems). By the early 90’s, the state-of-the-art method was
UCPOP [Penberthy and Weld 1992], an elegant algorithm based on partial-order
causal link planning [Sacerdoti 1975; Tate 1977; McAllester and Rosenblitt 1991], a
planning method that is sound and complete, but which doesn’t scale up too well.
The situation in planning changed drastically in the middle 90’s with the in-
troduction of Graphplan [Blum and Furst 1995], a planning algorithm based on
the Strips representation but which otherwise had little in common with previous
approaches, and scaled up better. Graphplan works iteratively in two phases. In
the first phase, Graphplan builds a plan graph in polynomial time, made up of a
sequence of layers F0 , A0 , . . . , Fn−1 , An−1 , Fn where Fi and Ai denote sets of fluents
and actions respectively. F0 is the set of fluents true in the initial situation and
n is a planning horizon, initially the index of the first layer Fi where all the goals
appear. In this construction, certain pairs of actions and certain pairs of fluents are
marked as mutually exclusive or mutex. The meaning of these layers and mutexes
is roughly the following: if a fluent p is not in layer Fi , then no plan can achieve
p in i steps or less, while if the pair p and q is in Fi but marked as mutex, then
no plan can achieve p and q jointly in i steps or less. Graphplan then makes an
attempt to extract a plan from the graph, a computation that is exponential in the
worst case. If the plan extraction fails, the planning horizon n is increased by 1, the
plan graph is extended one level, and the plan extraction procedure is tried again.
Blum and Furst showed that the planning algorithm is sound, complete, and opti-
mal, meaning that the plan obtained minimizes the number of time steps provided
that certain sets of actions can be done in parallel. More importantly, they showed
experimentally that this planning approach scaled up much better than previous
approaches.
Due to the new ideas and the emphasis on the empirical evaluation of planning
algorithms, Graphplan had a great influence on planning research, which has seen two
new approaches in recent years that scale up better than Graphplan using methods
that are not specific to planning.
In the SAT approach to planning [Kautz and Selman 1996], Strips problems are
converted into satisfiability problems expressed as a set of clauses (a formula in
Conjunctive Normal Form) that are fed into state-of-the-art SAT solvers. If for
some horizon n, the clauses are satisfiable, a parallel plan that solves the problem
can be read from the model returned by the solver. If not, like in Graphplan, the
plan horizon is increased by 1 and the process is repeated until a plan is found. The
approach works well when the required horizon is not large and optimal parallel
and the conditions under which they render inference polynomial have been gener-
alized since then in the notion of treewidth, a parameter that measures how tree-like
a graph structure is [Pearl 1988; Dechter 2003].
Research on the automatic derivation of heuristics in planning builds on Pearl’s
intuition but takes a different path. The relaxation P + that underlies most current
heuristics in domain-independent planning is obtained from a Strips problem P
by dropping, not the preconditions, but the delete lists. This relaxation is quite
informative but is not ‘easy’; indeed finding an optimal solution to a delete-free
problem P + is not easier from a complexity point of view than finding an optimal
solution to the original problem P . On the other hand, finding one solution to P + ,
not necessarily optimal, can be done easily, in low polynomial time. The result
is that heuristics obtained from P + are informative but not admissible (they may
overestimate the true cost), and hence, they can be used effectively for finding plans
but not for finding optimal plans.
If P(s) refers to a planning problem that is like P = ⟨F, I, O, G⟩ but with I = s,
and π(s) is the solution found for the delete-relaxation P+(s), the heuristic h(s)
that estimates the cost of the problem P(s) is defined as

$$h(s) = Cost(\pi(s)) = \sum_{a \in \pi(s)} cost(a).$$
The plans π(s) for the relaxation P + (s) are called relaxed plans, and there have
been many proposals for defining and computing them. We explain below one such
method that corresponds to running Graphplan on the delete-relaxation P + (s)
[Hoffmann and Nebel 2001]. In delete-free problems, Graphplan runs in polynomial
time and its plan graph construction is simplified as there are no mutex relations
to keep track of.
The layers F0 , A0 , F1 , . . . , Fn−1 , An−1 , Fn in the plan graph for P + (s) are
computed starting with F0 = s, by placing in Ai , i = 1, . . . , n − 1, all the actions
a in P whose preconditions P re(a) are in Fi , and placing in Fi+1 , the add effects
of those actions along with the fluents in Fi . This construction is terminated when
the goals G are all in Fn, or when Fn = Fn+1. Then if G ⊈ Fn, h(s) = ∞, as
it can be shown then that the relaxed problem P + (s) and the original problem
P (s) have no solution. Otherwise, a (relaxed) parallel plan π(s) for P + (s) can be
obtained backwards from the layer Fn by collecting the actions that add the goals,
and recursively, the actions that add the preconditions of those actions that are not
true in the state s.
More precisely, for Gn = G and i from n − 1 to 0, Bi is set to a minimal
collection of actions in Ai that add all the atoms in Gi+1 \ Fi , and Gi is set to
Pre(Bi) ∪ (Gi+1 ∩ Fi) where Pre(Bi) is the collection of fluents that are preconditions
of actions in Bi . It can be shown then that π(s) = B0 , . . . , Bn−1 is a parallel plan
for the relaxation P + (s); the plan being parallel because the actions in each set Bi
are assumed to be done in parallel. The heuristic h(s) is then just Cost(π(s)). This
is indeed the heuristic introduced in the FF planner [Hoffmann and Nebel 2001],
which is suitable when action costs are uniform. For non-uniform action costs, other
heuristics are more convenient [Keyder and Geffner 2008].

Figure 2. A simple planning problem involving three blocks with initial and goal situations
I and G as shown. The actions allow to move a clear block on top of another clear block
or to the table. A plan for the problem is a path that connects I with G in the directed
graph partially shown. In this example, the plan can be found greedily by taking in each
state s, starting with s = I, the action that results in a state s′ that is closer to the goal
according to the heuristic. The heuristic values (shown) are derived automatically from
the problem as described in the text.
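The construction just described can be summarized in a short sketch of the relaxed-plan computation. The encoding is illustrative (each action maps to a (precondition, add-list) pair, deletes having been dropped), the backward phase picks an arbitrary achiever per open goal rather than a minimal set, and unit action costs are assumed unless a cost function is given.

def relaxed_plan_heuristic(state, goal, actions, cost=lambda a: 1):
    # Forward phase: build the fluent layers F0, F1, ... and action layers A0, A1, ...
    # of the delete-free plan graph until the goals appear or a fixpoint is reached.
    F, A = [frozenset(state)], []
    while not goal <= F[-1]:
        Ai = [a for a, (pre, add) in actions.items() if pre <= F[-1]]
        Fi = F[-1] | {f for a in Ai for f in actions[a][1]}
        if Fi == F[-1]:
            return float('inf')        # goals unreachable even without delete lists
        A.append(Ai)
        F.append(frozenset(Fi))
    # Backward phase: collect, layer by layer, actions that add the open goals.
    G, total = set(goal), 0
    for i in range(len(A) - 1, -1, -1):
        Bi = set()
        for g in G - F[i]:             # goals not already true at layer i
            Bi.add(next(a for a in A[i] if g in actions[a][1]))
        total += sum(cost(a) for a in Bi)
        G = set().union(*(actions[a][0] for a in Bi)) | (G & F[i])
    return total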
on the evaluation of the states that result from those actions only. This type of
action pruning has been shown to be quite effective [Hoffmann and Nebel 2001],
and in slightly different form is part of state-of-the-art planners [Richter, Helmert,
and Westphal 2008].
goal. Figure 3 shows one such problem (left) and the resulting controller (right).
The problem, motivated by the work on deictic representations in the selection of
actions [Chapman 1989; Ballard, Hayhoe, Pook, and Rao 1997], is about placing a
visual marker on top of a green block in a blocks-world scene where the location of
the green blocks is not known. The visual marker, initially at the lower left corner
of the scene (shown as an eye), can be moved in the four directions, one cell at a
time. The observations are whether the cell beneath the marker is empty (‘C’), a
non-green block (‘B’), or a green block (‘G’), and whether it is on the table (‘T’)
or not (‘-’). The controller shown on the right has been derived by running a clas-
sical planner over a classical problem obtained by an automatic translation from
the original problem that involves both uncertainty and sensing. In the figure, the
controller states qi are shown in circles while the label o/a on an edge connecting
two states q to q ′ means to do action a when observing o in q and then switching
to q ′ . In the classical planning problem obtained from the translation, the actions
are tuples (fq , fo , a, fq′ ) whose effects are those of the action a but conditional on
the fluents fq and fo representing the controller state q and observation o being
true. In such a case, the fluent fq′ representing the controller state q ′ is made true
and fq is made false. The two appealing features of this formulation are that the
resulting classical plans encode very succinct closed-loop controllers, and that these
controllers are quite general. Indeed, the controller shown in the figure not only
solves the problem for the configuration of blocks shown, but for any configuration
involving any number of blocks. The controller prescribes to move the ‘eye’ up until
there are no blocks, then to move it down until reaching the table and right, and
to repeat this process until a green block is found (‘G’). Likewise, the ‘eye’ must
move right when there are no blocks in a given spot (both ‘T’ and ‘C’ observed).
See [Bonet, Palacios, and Geffner 2009] for details.
Figure 3. Left: The visual marker shown as an ‘eye’ must be placed on a green block in
the blocks-world scene shown, where the locations of the green blocks are not known. The
visual marker can be moved in the four directions, one cell at a time. The observations
are whether the cell beneath the marker is empty (‘C’), a non-green block (‘B’), or a green
block (‘G’), and whether the marker is on the table (‘T’) or not (‘-’). Right: The controller
derived for this problem using a classical planner over a suitable automatic transformation.
The controller states qi are shown in circles while the label o/a on an edge connecting q to
q′ means to do a when observing o in q, switching then to q′. The controller works not only
for the problem instance shown on the left, but for any instance resulting from changes in
the configuration or in the number of blocks.
puted by model-based methods where suitable relaxations are solved from scratch.
The technique has been shown to work over large problems involving hundreds of
actions and fluents. Here I want to argue that these methods also have features that
make them interesting from a cognitive point of view as a plausible basis for an ac-
count of ‘feelings’, ‘emotions’, or ‘appraisals’ in high-level human problem solving.
I focus on three of these features.
First, domain-independent heuristics are fast (low-polynomial time) and effective,
as the ‘fast and frugal’ heuristics advocated by Gigerenzer and others [Gigerenzer
and Todd 1999; Gigerenzer 2007], and yet, they are general too: they apply indeed
to all the problems that fit the (classical planning) model and to problems that can
be cast in that form (like the visual-marker problem above).
Second, the derivation of these heuristics sheds light on why appraisals may be
opaque from a cognitive point of view, and thus not conscious. This is because
the heuristic values are obtained from a relaxed model where the meaning of the
symbols is different from the meaning of the symbols in the ‘true’ model. For
example, the action of moving an object from one place to another, deletes the old
place in the true model but not in the delete-relaxation where an object can thus
appear in multiple places at the same time. Thus, if the agent selecting the actions
with the resulting heuristic does not have access to the relaxation, it won’t be able
to explain how the heuristic evaluations are produced nor what they stand for.
The importance of the unconscious in everyday cognition is a topic that has been
receiving increased attention in recent years, with conscious, deliberate reasoning,
appearing to rely heavily on unconscious processing and representing just the tip
of the ‘cognitive iceberg’ [Wilson 2002; Hassin, Uleman, and Bargh 2005; Evans
2008]. While this is evident in vision and natural language processing, where it is
clear that one does not have access to how one ‘sees’ or ‘understands’, this is likely
to be true in most cognitive tasks, including apparently simple problems such as
the Blocks World where our ability to find reasons for the actions selected does
not explain how such actions are selected in the first place. In this sense, the focus
of cognitive psychology on puzzles such as the Tower of Hanoi may be misplaced:
simple problems, such as the Blocks World, are not simple for domain-independent
solvers, and there is no question that people are capable of solving domains that
they have never seen where the combinatorics would defy a naive, blind solver.
Third, the heuristics provide the agent with a sense of direction or ‘gut feelings’
that guide the action selection in the presence of many alternatives, while avoiding
an infinite regress in the decision process. Indeed, emotions, long held to interfere
with the decision process and rationality, are now widely perceived as a requisite
in contexts where it is not possible to consider all alternatives. Emotions and gut
feelings are thus perceived as the ‘invisible hand’ that successfully guides us out of
these mental labyrinths [Ketelaar and Todd 2001; Evans 2002].1 The ‘rationality of
the emotions’ has been defended on theoretical grounds by philosophers [De Sousa
1990; Elster 1999], and on empirical grounds by neuroscientists who have studied
the impairments in the decision process that result from lesions in the frontal lobes
[Damasio 1995]. The link between emotions and evaluation functions points to their
computational role as well.
While emotions are currently thought of as providing the appraisals that are nec-
essary for navigating in a complex world, there are actually very few accounts of
how such appraisals may be computed. Reinforcement learning methods provide one
such account that works well in low-level tasks without requiring a model. Heuris-
tic planning methods provide another account that works well in high-level tasks
where the model is known. Moreover, as discussed above, heuristic planning meth-
ods provide an account not only of the appraisals, but also of the actions that are
worth evaluating. These are the actions a in the state s that are deemed relevant
to the goal in the computation of the heuristic h(s); the so-called helpful actions
[Hoffmann and Nebel 2001]. This form of action pruning may account for a key
difference between programs and humans in games such as Chess: while the former
consider all possible moves and responses (up to a certain depth), the latter focus on
the analysis and evaluation of a few moves and countermoves. Domain-independent
heuristics can account in principle for both the focus and the evaluation, the latter
in the value of the heuristic function h(s), the former in its structure.
1 Some philosophers and cognitive scientists refer to this combinatorial problem as the ‘frame
problem’ in AI. This terminology, however, is not accurate. The frame problem in AI [McCarthy
and Hayes 1969] refers to the problem that arises in logical accounts of actions and change where
the description of the action effects does not suffice to capture what does not change. E.g., the
number of chairs in the room does not change if the bell rings. The frame problem is the problem
of capturing what does not change from a concise logical description of what changes [Ford and
Pylyshyn 1996].
References
Ballard, D., M. Hayhoe, P. Pook, and R. Rao (1997). Deictic codes for the em-
bodiment of cognition. Behavioral and Brain Sciences 20 (4), 723–742.
Athena Scientific.
Blum, A. and M. Furst (1995). Fast planning through planning graph analysis.
In Proceedings of IJCAI-95, pp. 1636–1642. Morgan Kaufmann.
Bonet, B. and H. Geffner (2000). Planning with incomplete information as heuris-
tic search in belief space. In Proc. of AIPS-2000, pp. 52–61. AAAI Press.
Bonet, B. and H. Geffner (2001). Planning as heuristic search. Artificial Intelli-
gence 129 (1–2), 5–33.
Bonet, B., G. Loerincs, and H. Geffner (1997). A robust and fast action selection
mechanism for planning. In Proceedings of AAAI-97, pp. 714–719. MIT Press.
Bonet, B., H. Palacios, and H. Geffner (2009). Automatic derivation of memory-
less policies and finite-state controllers using classical planners. In Proc. Int.
Conf. on Automated Planning and Scheduling (ICAPS-09).
Brooks, R. (1987). A robust layered control system for a mobile robot. IEEE J.
of Robotics and Automation 2, 14–27.
Brooks, R. (1991). Intelligence without representation. Artificial Intelli-
gence 47 (1–2), 139–159.
Chapman, D. (1989). Penguins can make cake. AI magazine 10 (4), 45–50.
Damasio, A. (1995). Descartes’ Error: Emotion, Reason, and the Human Brain.
Quill.
De Sousa, R. (1990). The rationality of emotion. MIT Press.
Dechter, R. (2003). Constraint Processing. Morgan Kaufmann.
Dechter, R. and J. Pearl (1985). The anatomy of easy problems: a constraint-
satisfaction formulation. In Proc. International Joint Conference on Artificial
Intelligence (IJCAI-85), pp. 1066–1072.
Elster, J. (1999). Alchemies of the Mind: Rationality and the Emotions. Cam-
bridge University Press.
Evans, D. (2002). The search hypothesis of emotion. British J. Phil. Science 53,
497–509.
Evans, J. (2008). Dual-processing accounts of reasoning, judgment, and social
cognition. Annual Review of Psychology 59, 255–278.
Fikes, R. and N. Nilsson (1971). STRIPS: A new approach to the application of
theorem proving to problem solving. Artificial Intelligence 2 (3–4), 189–208.
Ford, K. and Z. Pylyshyn (1996). The robot’s dilemma revisited: the frame prob-
lem in artificial intelligence. Ablex Publishing.
Ghallab, M., D. Nau, and P. Traverso (2004). Automated Planning: theory and
practice. Morgan Kaufmann.
3
Mechanical Generation of Admissible Heuristics
Robert Holte, Jonathan Schaeffer, and Ariel Felner
1 Introduction
This chapter takes its title from Section 4.2 of Judea Pearl’s landmark book Heuris-
tics [Pearl 1984], and explores how the vision outlined there has unfolded in the
quarter-century since its appearance. As the book’s title suggests, it is an in-depth
summary of classical artificial intelligence (AI) heuristic search, a subject to which
Pearl and his colleagues contributed substantially in the early 1980s.
The purpose of heuristic search is to find a least-cost path in a state space from
a given start state to a goal state. In principle, such problems can be solved by
classical shortest path algorithms, such as Dijkstra’s algorithm [Dijkstra 1959], but
in practice the state spaces of interest in AI are far too large to be solved in this way.
One of the seminal insights in AI was recognizing that even extremely large search
problems can be solved quickly if the search algorithm is provided with additional
information in the form of a heuristic function h(s) that estimates the distance
from any given state s to the nearest goal state [Doran and Michie 1966; Hart,
Nilsson, and Raphael 1968]. A heuristic function h(s) is said to be admissible if,
for every state s, h(s) is a lower bound on the true cost of reaching the nearest goal
from state s. Admissibility is desirable because it guarantees the optimality of the
solution found by the most widely-used heuristic search algorithms.
Most of the chapters in Heuristics contain mathematically rigorous definitions
and analysis. In contrast, Chapter 4 offers a conceptual account of where heuristic
functions come from, and a vision of how one might create algorithms for automat-
ically generating effective heuristics from a problem description. An early version
of the chapter had been published previously in the widely circulated AI Maga-
zine [Pearl 1983].
Chapter 4’s key idea is that distances in the given state space can be estimated
by computing exact distances in a “simplified” version of the state space. There
are many different ways a state space can be simplified. Pearl focused almost
exclusively on relaxation, which is done by weakening or eliminating one or more of
the conditions that restrict how one is allowed to move from one state to another.
For example, to estimate the driving distance between two cities, one can ignore
the constraint that driving must be done on roads. In this relaxed version of the
problem, the distance between two cities is simply the straight-line distance.
Figure 1. The 15-puzzle: a goal state (left) and a scrambled state (right).
It is easy to see, in general, that distances in a relaxed space cannot exceed distances
in the given state space, and therefore the heuristic functions defined in this way
are guaranteed to be admissible. An alternate way of looking at this is to view the
elimination of conditions as equivalent to adding new edges to the search graph.
Therefore, the cost of an optimal solution in the relaxed graph (with the additional
edges) must be a lower bound on the cost of an optimal solution to the original problem.
As a second example of relaxation, consider the 15-puzzle shown in Figure 1,
which consists of a set of tiles numbered 1-15 placed in a 4 × 4 grid, leaving one
square in the grid unoccupied (called the “blank” and shown as a black square). The
only moves that are permitted are to slide a tile that is adjacent to the blank into
the blank position, effectively exchanging the tile with the blank. For example, four
moves are possible in the right-hand side of Figure 1: tile 10 can be moved down, tile
11 can be moved right, tile 8 can be moved left, and tile 12 can be moved up. To solve
the puzzle is to find a sequence of moves that transforms a given scrambled state
(right side of Figure 1) into a goal state (such as the one on the left). One possible
relaxation of the 15-puzzle state space can be defined by removing the restriction
that a tile must be adjacent to the blank to be moveable. In this relaxation any tile
can move from its current position to any adjacent position at any time, regardless
of whether the adjacent position is occupied or not. The number of moves required
to solve this relaxed version (called the Manhattan Distance) is clearly less than or
equal to the number of moves required to solve the 15-puzzle itself. Note that in
this case the relaxed state space has many more states than the original 15-puzzle
(many tiles can now occupy a single location) but it is easier to solve, at least for
humans (tiles move entirely independently of one another).
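As a minimal illustration of this heuristic (states are assumed to be 16-tuples listing the tile in each grid position, with 0 standing for the blank), the Manhattan Distance can be computed directly from the tile positions:

```python
def manhattan_distance(state, goal):
    """Sum, over all tiles except the blank, of the grid distance between the
    tile's current position and its position in the goal state."""
    total = 0
    for pos, tile in enumerate(state):
        if tile == 0:                      # the blank does not contribute
            continue
        goal_pos = goal.index(tile)
        total += abs(pos // 4 - goal_pos // 4) + abs(pos % 4 - goal_pos % 4)
    return total
```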
Pearl observes that in AI a state space is almost always defined implicitly by a set
of operators that describe a successor relation between states. Each operator has
a precondition defining the states to which it can be applied and a postcondition
describing how the operator changes the values of the variables used to describe a
state. This implies that relaxing a state space description by eliminating one or more
preconditions is a simple syntactic operation, and the set of all possible relaxations
of a state space description (by eliminating combinations of preconditions) is well-
defined and, in fact, easy to enumerate. Hence it is entirely feasible for a mechanical
system to generate heuristic functions and, indeed, to search through the space of
heuristic functions defined by eliminating preconditions in all possible ways.
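To see how simple this syntactic operation is, here is a small sketch that enumerates every relaxation obtainable by deleting a subset of preconditions; the operator encoding (a mapping from operator names to (preconditions, postconditions) pairs) is an assumption of the sketch, and the enumeration is of course exponential in the total number of preconditions:

```python
from itertools import chain, combinations

def powerset(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def relaxations(operators):
    """Yield every relaxed description obtained by deleting some subset of
    (operator, precondition) pairs from the given state space description."""
    pairs = [(name, p) for name, (pre, _post) in operators.items() for p in pre]
    for dropped in powerset(pairs):
        dropped = set(dropped)
        yield {name: ([p for p in pre if (name, p) not in dropped], post)
               for name, (pre, post) in operators.items()}
```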
The mechanical search through a space of heuristic functions has as its goal, in
Pearl’s view, a heuristic function with two properties. First, the heuristic function
should return values that are as close to the true distances as possible (Chapter 6
in Heuristics justifies this). Second, the heuristic function must be efficiently com-
putable, otherwise the reduction in search effort that the heuristic function produces
might be outweighed by the increase in computation time caused by the calculation
of the heuristic function. Pearl saw the second requirement as the more difficult to
detect automatically and proposed that mechanically-recognizable forms of decom-
posability of the relaxed state space would be the key to mechanically generating
efficiently-computable heuristic functions. Pearl recognized that the search for a
good heuristic function might itself be quite time-consuming, but argued that this
cost was justified because it could be amortized over an arbitrarily large number
of problem instances that could all be solved much more efficiently using the same
heuristic function.
The preceding paragraphs summarize Pearl’s vision for how effective heuristics
might be generated automatically from a state space description. The remainder
of our chapter contains a brief look at the research efforts directed towards real-
izing Pearl’s vision. We conclude that Pearl correctly anticipated a fundamental
breakthrough in heuristic search in the general terms he set out in Chapter 4 of
Heuristics although not in all of its specifics. Our discussion is informal and the
ideas presented and their references are illustrative, not exhaustive.
tions would be both admissible and consistent.1 To make the computation of such
heuristic functions efficient, the Milan group envisaged a hierarchy of relaxed spaces,
with search at one level being guided by a heuristic function defined by distances
in the level above. The Milan group also foresaw the possibility of algorithms for
searching through the space of possible simplified state spaces, although the first
detailed articulation of this idea, albeit in a somewhat different context, was by
Richard Korf [1980].
John Gaschnig [1979] picked up on the Milan work. He made the key observation
that if a heuristic function is calculated by searching in a relaxed space, the total
time required to solve the problem using the heuristic function could exceed the
time required to solve the problem directly with breadth-first search (i.e. without
using the heuristic function). This was formally proven shortly afterwards by Marco
Valtorta [1981, 1984]. This observation led to a focus on the efficiency with which
distances in the simplified space could be computed. The favorite approach to doing
this (as exemplified in Heuristics) was to search for simplified spaces that could be
decomposed.
Figure 2. Abstractions of the two 15-puzzle states of Figure 1, in which only tiles 9–15 and the blank are distinguished (tiles 1–8 are erased).
of permuting tiles 1-8 among the remaining positions produce 15-puzzle states that
map to the same abstract state, even though they would all be distinct states in the
original state space. For example, the abstract state in the left part of Figure 2 is
the abstraction of the goal state in the original 15-puzzle (left part of Figure 1), but
it is also the abstraction of all the non-goal states in the original puzzle in which
tiles 9-15 and the blank are in their goal positions but some or all of tiles 1-8 are
not. Using this abstraction, the distance from the 15-puzzle state in the right part
of Figure 1 to the 15-puzzle goal state would be estimated by calculating the true
distance, in the abstract space, from the abstract state in the right part of Figure 2
to the state in the left part of Figure 2.
In addition to abstracting transformations, Absolver's library contained “opti-
mizing” transformations, which would create an equivalent description of a given
STRIPS representation in which search could be completed more quickly. This in-
cluded the “factor” transformation that would, if possible, decompose the state
space into independent subproblems, one of the methods Pearl had suggested.
Absolver was applied to thirteen state spaces and found effective heuristic func-
tions in six of them. Five of the functions it discovered were novel, including a
simple, effective heuristic for Rubik’s Cube that had been overlooked by experts:
    after extensive study, Korf was unable to find a single good heuristic
    evaluation function for Rubik’s Cube [Korf 1985]. He concluded that “if
    there does exist a heuristic, its form is probably quite complex.”
    ([Mostow and Prieditis 1989], page 701)
To do this, it is necessary to precompute all the distances to the goal state in the
abstract space. This is typically done by a backwards breadth-first search starting
at the abstract goal state. Each abstract state reached in this way is associated with
a specific storage location in the PDB, and the state’s distance from the abstract
goal is stored in this location as the value in the PDB.
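To make the construction concrete, here is a minimal sketch of a PDB built by backwards breadth-first search for the abstraction of Figure 2, in which only tiles 9–15 and the blank are distinguished; since sliding-tile moves are their own inverses, backward and forward moves coincide. (For this particular abstraction the table would have roughly 5 × 10^8 entries, so in practice one would use a smaller pattern or a compact encoding; the code is illustrative only.)

```python
from collections import deque

KEPT = frozenset([0, 9, 10, 11, 12, 13, 14, 15])   # the blank plus tiles 9-15

def abstract_state(state, kept=KEPT):
    """Map a 15-puzzle state (tuple of 16 entries, 0 = blank) to its abstraction:
    tiles outside `kept` become an indistinguishable marker 'x'."""
    return tuple(t if t in kept else 'x' for t in state)

def successors(state):
    """Slide any tile adjacent to the blank into the blank position."""
    succ = []
    b = state.index(0)
    r, c = divmod(b, 4)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 4 and 0 <= nc < 4:
            s = list(state)
            n = nr * 4 + nc
            s[b], s[n] = s[n], s[b]
            succ.append(tuple(s))
    return succ

def build_pdb(goal):
    """Backwards breadth-first search from the abstract goal; the PDB maps each
    abstract state to its exact distance from the abstract goal."""
    start = abstract_state(goal)
    pdb = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in successors(s):       # the same moves apply to abstract states
            if t not in pdb:
                pdb[t] = pdb[s] + 1
                queue.append(t)
    return pdb

def pdb_heuristic(state, pdb):
    """Admissible heuristic: exact solution cost of the state's abstraction."""
    return pdb[abstract_state(state)]
```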
Precomputing abstract distances to create a lookup-table heuristic function was
actually one of the optimizing transformations in absolver, but Culberson and
Schaeffer had independently come up with the idea. Unlike the absolver work,
they validated it by producing a two orders of magnitude reduction in the search
effort (measured in nodes expanded) needed to solve instances of the 15-puzzle, as
compared to the then state-of-the-art search algorithms using an enhanced Man-
hattan Distance heuristic. To achieve this they used two PDBs totaling almost one
gigabyte of memory, a very large amount in 1994 when the experiments were per-
formed [Culberson and Schaeffer 1994]. The paper’s referees were sharply critical of
the exorbitant memory usage, rejecting the paper three times before it finally was
accepted [Culberson and Schaeffer 1996].
Such impressive results on the 15-puzzle could not go unnoticed. The fundamen-
tal importance of PDBs was established beyond doubt in 1997 when Richard Korf
used PDBs to enable standard heuristic search techniques to find optimal solutions
to instances of Rubik’s Cube for the first time [Korf 1997].
Since then, PDBs have been used to build effective heuristic functions in numer-
ous applications, including various combinatorial puzzles [Felner, Korf, and Hanan
2004; Felner, Korf, Meshulam, and Holte 2007; Korf and Felner 2002], multiple se-
quence alignment [McNaughton, Lu, Schaeffer, and Szafron 2002; Zhou and Hansen
2004], pathfinding [Anderson, Holte, and Schaeffer 2007], model checking [Edelkamp
2007], planning [Edelkamp 2001; Edelkamp 2002; Haslum, Botea, Helmert, Bonet,
and Koenig 2007], and vertex cover [Felner, Korf, and Hanan 2004].
5 Current Status
The use of abstraction to create heuristic functions has profoundly advanced the
fields of planning and heuristic search. But the current state of the art is not
entirely as Pearl envisaged. Although he recognized that there were other types
of state space abstraction, Pearl emphasized relaxation. In this detail, he was
too narrowly focused. Researchers have largely abandoned relaxation in favor of
homomorphic abstractions, of which many types have been developed and shown
useful for defining heuristic functions, such as domain abstraction [Hernádvölgyi
and Holte 2000], h-abstraction [Haslum and Geffner 2000], projection [Edelkamp
2001], constrained abstraction [Haslum, Bonet, and Geffner 2005], and synchronized
products [Helmert, Haslum, and Hoffmann 2007].
Pearl argued for the automatic creation of effective heuristic functions by search-
ing through a space of abstractions. There has been some research in this direc-
tion [Prieditis 1993; Hernádvölgyi 2003; Edelkamp 2007; Haslum, Botea, Helmert,
Bonet, and Koenig 2007; Helmert, Haslum, and Hoffmann 2007], but more is needed.
However, important progress has been made on the subproblem of evaluating the
effectiveness of a heuristic function, with the development of a generic, practi-
cal method for accurately predicting how many nodes IDA* (a standard heuristic
search algorithm) will expand for any given heuristic function [Korf and Reid 1998;
Korf, Reid, and Edelkamp 2001; Zahavi, Felner, Burch, and Holte 2008].
Finally, Pearl anticipated that efficiency in calculating the heuristic function
would be achieved by finding abstract state spaces that were decomposable in some
way. This has not come to pass, although there is now a general theory of when it is
admissible to add the values returned by two or more different abstractions [Yang,
Culberson, Holte, Zahavi, and Felner 2008]. Instead, the efficiency of the heuristic
calculation has been achieved either by precomputing the heuristic function’s values
and storing them in a lookup table, as PDBs do, or by creating a hierarchy of
abstractions and using distances at one level as a heuristic function to guide the
calculation of distances at the level below [Holte, Perez, Zimmer, and MacDonald
1996; Holte, Grajkowski, and Tanner 2005], as anticipated by the Milan group.
6 Conclusion
Judea Pearl has received numerous accolades for his prodigious research and its
impact. Amidst this impressive body of work are his often-overlooked contributions
to the idea of the automatic discovery of heuristic functions. Even though Heuristics
is over 25 years old (ancient by Computing Science standards), Pearl’s ideas still
resonate today.
Acknowledgments: The authors gratefully acknowledge the support they have
received over the years for research in this area from Canada’s Natural Sciences and
Engineering Research Council (NSERC), Alberta’s Informatics Circle of Research
Excellence (iCORE), and the Israeli Science Foundation (ISF).
References
Anderson, K., R. Holte, and J. Schaeffer (2007). Partial pattern databases. In
Symposium on Abstraction, Reformulation and Approximation, pp. 20–34.
Springer-Verlag LNAI #4612.
Culberson, J. and J. Schaeffer (1994). Efficiently searching the 15-puzzle. Tech-
nical Report 94-08, Department of Computing Science, University of Alberta.
Culberson, J. and J. Schaeffer (1996). Searching with pattern databases. In
G. McCalla (Ed.), AI’96: Advances in Artificial Intelligence, pp. 402–416.
Springer-Verlag LNAI #1081.
Dijkstra, E. (1959). A note on two problems in connexion with graphs. Nu-
merische Mathematik 1, 269–271.
Doran, J. and D. Michie (1966). Experiments with the graph traverser program.
In Proceedings of the Royal Society A, Volume 294, pp. 235–259.
4
Space Complexity of Combinatorial Search
Richard E. Korf
2 Depth-First Search
One solution to this problem in some settings is depth-first search (DFS), which
requires memory that is only linear in the maximum search depth. The reason
is that at any point in time, it saves only the path from the start node to the
current node being expanded, either on an explicit node stack, or on the call stack
of a recursive implementation. As a result, the memory requirement of DFS is
almost never a limitation. For finite tree-structured problem space graphs, where
all solutions are equally desirable, DFS solves the memory problem. For example,
chronological backtracking is a DFS for constraint satisfaction problems, and does
not suffer any memory limitations in practice.
With an infinite search tree, or when we want a shortest solution path, however,
DFS has significant drawbacks. In an infinite search tree, which can result from a
depth-first search of a finite graph with multiple paths to the same state, DFS is not
complete, but can traverse a single path until it exhausts memory. For example,
the problem space graphs of the well-known sliding-tile puzzles are finite, but a
depth-first search of these spaces explores a tree-expansion of the graph, which is
infinite. Even with a finite search tree, the first solution found by DFS will not be
a shortest solution in general.
3 Iterative Deepening
3.1 Depth-First Iterative Deepening
One solution to these limitations of DFS is depth-first iterative-deepening (DFID)
[Korf 1985a]. DFID performs a series of depth-first searches, each to a successively
greater depth. DFID simulates BFS, but using memory that is only linear in the
maximum search depth. It is guaranteed to find a solution if one exists, even on an
infinite tree, and the first solution it finds will be a shortest one.
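A minimal sketch of DFID as just described; it assumes a finite branching factor, and it also terminates with failure when a depth-limited search completes without cutting off any branch (i.e., when the whole finite tree has been examined):

```python
def dfid(start, successors, is_goal):
    """Depth-first iterative deepening: run depth-limited DFS with increasing
    depth bounds; the first solution found is a shortest one."""
    def dls(node, path, limit):
        nonlocal cut_off
        if is_goal(node):
            return path
        if limit == 0:
            cut_off = True            # some node was not expanded at this bound
            return None
        for child in successors(node):
            found = dls(child, path + [child], limit - 1)
            if found is not None:
                return found
        return None

    depth = 0
    while True:
        cut_off = False
        found = dls(start, [start], depth)
        if found is not None:
            return found
        if not cut_off:               # whole (finite) tree searched: no solution
            return None
        depth += 1
```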
DFID is essentially the same as iterative-deepening searches used in two-player
game programs [Slate and Atkin 1977], but is used to solve a completely different
problem. In a two-player game, iterative deepening is used to determine the search
depth, because moves must be made within a given time period, and it is difficult
to predict how long it will take to search to a given depth. In contrast, DFID is
applied to single-agent problems where a shortest solution is required, in order to
avoid the memory limitation of BFS.
DFID first appeared in a Columbia University technical report [Korf 1984]. It was
independently published in two different papers in IJCAI-85 [Korf 1985b; Stickel
and Tyson 1985], and called “consecutively bounded depth-first search” in the latter.
3.2 Iterative-Deepening-A*
During a discussion of DFID with Judea Pearl on a trip to UCLA in 1984, Pearl suggested
its extension to heuristic search, which became Iterative-Deepening-A* (IDA*) [Korf
1985a]. IDA* overcomes the memory limitation of the A* algorithm [Hart, Nilsson,
and Raphael 1968] for heuristic searches the same way that DFID overcomes the
memory limitation of BFS for brute-force searches. In particular, it uses the A*
cost function of f (n) = g(n) + h(n), where g(n) is the cost of the current path from
the root to node n, and h(n) is a heuristic estimate of the lowest cost of any path
from node n to a goal node. IDA* performs a series of depth-first search iterations,
where each branch of the search is terminated when the cost of the last node on that
branch exceeds a cost threshold for that iteration. The cost threshold of the first
iteration is set to the heuristic estimate of the start state, and the cost threshold of
each successive iteration is set to the minimum cost of all nodes generated but not
expanded on the previous iteration. Like A*, IDA* guarantees an optimal solution
if the heuristic function is admissible, or never overestimates actual cost. IDA*
was the first algorithm to find optimal solutions to the Fifteen Puzzle, the famous
four-by-four sliding-tile puzzle [Korf 1985a]. It was also the first algorithm to find
have non-negative cost. The A* cost function f (n) = g(n) + h(n) is monotonic
if the heuristic function h(n) is consistent, meaning that h(n) ≤ k(n, n′ ) + h(n′ ),
where n′ is a child of node n, and k(n, n′ ) is the cost of the edge from n to n′ .
Many heuristic functions are both admissible and consistent. If the cost function is
monotonic, then the order in which nodes are first expanded by IDA* is the same
as for a best-first search with the same cost function.
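For concreteness, a minimal sketch of the IDA* iteration scheme described above; successors is assumed to yield (child, edge cost) pairs and h to be the heuristic function:

```python
import math

def ida_star(start, successors, h, is_goal):
    """Iterative-Deepening-A*: depth-first iterations bounded by f = g + h; the
    next threshold is the smallest f-value that exceeded the current one."""
    def search(node, g, bound, path):
        f = g + h(node)
        if f > bound:
            return f                  # candidate for the next cost threshold
        if is_goal(node):
            return path
        minimum = math.inf
        for child, cost in successors(node):
            result = search(child, g + cost, bound, path + [child])
            if isinstance(result, list):
                return result
            minimum = min(minimum, result)
        return minimum

    bound = h(start)                  # first threshold: heuristic of the start state
    while True:
        result = search(start, 0, bound, [start])
        if isinstance(result, list):
            return result
        if result == math.inf:
            return None               # no solution exists
        bound = result
```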
Not all useful cost functions are monotonic, however. For example, Weighted
A* (WA*) is a best-first search with the cost function f (n) = g(n) + w ∗ h(n).
If w is greater than one, then WA* usually finds solutions much faster than A*,
but at a small cost in solution quality. With w greater than one, however, f (n) =
g(n)+w∗h(n) is not monotonic, even with a consistent h(n). With a non-monotonic
cost-function, IDA* does not expand nodes in best-first order. In particular, in parts
of the search tree where the cost of nodes is lower than the cost threshold for the
current iteration, IDA* behaves as a brute-force search, expanding nodes in the
order in which they are generated.
Can any linear-space search algorithm simulate a best-first search with a non-
monotonic cost function? Surprisingly, the answer is yes. Recursive best-first search
[Korf 1993] (RBFS) maintains a path from the start node to the last node generated,
along with the siblings of nodes on that path. Stored with each node is a cost value.
If the node has never been expanded before, its stored cost is its original cost. If it
has been expanded before, its stored cost is the minimum cost of all its descendants
that have been generated but not yet expanded, which are not stored in memory.
The sibling node off the current path of lowest cost is the ancestor of the next leaf
node that would be expanded by a best-first search. By propagating these values
up the tree, and inheriting these values down the tree as a previously explored
path is regenerated, RBFS always finds the next leaf node expanded by a best-
first search. Thus, it simulates a best-first search even with a non-monotonic cost
function. Furthermore, with a monotonic cost function it can outperform IDA* if
there are many different unique cost values in the search tree. For details on RBFS,
see [Korf 1993].
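The control structure can be sketched as follows (a compact rendering of the idea; see [Korf 1993] for the precise algorithm and its analysis). Each child carries a backed-up value, inherited from its parent when that value is larger than its own f = g + h, and a subtree is abandoned, with its value updated, as soon as that value exceeds the best alternative elsewhere:

```python
import math

def rbfs(start, successors, h, is_goal):
    """Recursive best-first search: linear-space best-first search. Each child
    stores a backed-up value; a subtree is explored only while its value does
    not exceed the lowest value available elsewhere (the bound)."""
    def recurse(node, g, F, bound, path):
        if is_goal(node):
            return path, F
        children = []
        for child, cost in successors(node):
            gc = g + cost
            # value inheritance: a re-expanded parent's backed-up F is a tighter
            # lower bound than g + h for its descendants
            children.append([max(gc + h(child), F), child, gc])
        if not children:
            return None, math.inf
        while True:
            children.sort(key=lambda entry: entry[0])
            best = children[0]
            if best[0] > bound or best[0] == math.inf:
                return None, best[0]
            alternative = children[1][0] if len(children) > 1 else math.inf
            result, best[0] = recurse(best[1], best[2], best[0],
                                      min(bound, alternative), path + [best[1]])
            if result is not None:
                return result, best[0]

    result, _ = recurse(start, 0, h(start), math.inf, [start])
    return result
```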
such a graph, as each node is generated it is checked to see if the same state already
appears on the Open or Closed lists. If so, only the node reached by a shortest path
is stored, and the duplicate node is eliminated. Thus, by detecting and rejecting
such duplicate nodes, a breadth-first search to a radius of r on a grid graph would
expand O(r^2) nodes.
A linear-space algorithm, however, does not store most of the nodes it generates,
and hence cannot detect most duplicate nodes. In a grid graph, each node has
four neighbors. A linear-space search will not normally regenerate its immediate
parent as one of its children, reducing the number of children to three for all but
the start node. Thus, a depth-first search of a grid graph to a radius r will generate
O(3^r) nodes, compared to O(r^2) nodes for a best-first search. This is an enormous
overhead on graphs with many paths to the same state, rendering linear-space
algorithms completely impractical in such problem spaces.
7 Frontier Search
Fortunately, there is another technique that can significantly reduce the memory
required by a search algorithm on problem spaces with many duplicate nodes. The
basic idea is to save only the Open list and not the Closed list. This algorithm
schema is called frontier search, since the Open list represents the frontier of nodes
that have been generated but not yet expanded [Korf, Zhang, Thayer, and Hohwald
2005]. When a node is expanded in frontier search, it is simply deleted rather than
being moved to a Closed list.
The advantage of this technique is that the Open list can be much smaller than
the Closed list. In the grid graph, for example, the Closed list grows as O(r^2),
whereas the Open list grows only as O(r), where r is the radius of the search.
For ease of explanation, we’ll assume a problem space with reversible operators,
but the method applies to some directed problem-space graphs as well. There
are two reasons to save the Closed list. One is to detect duplicate nodes, and the
other is to return the solution path. We first consider duplicate detection.
is unbroken, when this happens the Open node being expanded would first have to
generate another Open node on the frontier of the search before generating a Closed
node on the interior. When this happens, the duplicate Open node is detected, and
the union of the used-operator bits set in each of the two copies is stored with
the single copy retained. In other words, one part of the frontier cannot invade
another part of the interior without passing through another part of the frontier
first, where the intrusion is detected. By storing and managing such used-operator
bits, frontier search detects all duplicate node generations and prevents a node from
being expanded more than once.
An alternative to used-operator bits is to save several levels of the search at a
time [Zhou and Hansen 2003]. In particular, Closed nodes are stored until all their
children are expanded, and then deleted.
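A minimal sketch of frontier breadth-first search with used-operator bits, specialized to undirected, unit-cost graphs in which every edge connects consecutive search levels (as in the grid graph discussed above); only the current and next frontiers are kept, and reconstructing the solution path, the other role of the Closed list, is not handled by the sketch:

```python
def frontier_bfs(start, neighbors, max_depth):
    """Frontier breadth-first search: the Closed list is never stored. Each
    frontier node keeps the set of neighbors through which it has already been
    reached (its 'used-operator bits'); expansion never crosses a used link, so
    the deleted interior is not regenerated, and duplicates within the new
    frontier are merged by taking the union of their used links."""
    frontier = {start: set()}            # node -> neighbors already used
    for _ in range(max_depth):
        next_frontier = {}
        for node, used in frontier.items():
            for nb in neighbors(node):
                if nb in used:           # would step back into the interior
                    continue
                next_frontier.setdefault(nb, set()).add(node)
        frontier = next_frontier
        if not frontier:
            break
    return frontier                      # the nodes reached at the final depth
```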
of unique nodes, while frontier search performs very well, reducing the memory
required from quadratic to linear space.
8 Disk-Based Search
Even on problems where frontier search is effective, memory is still the resource
that limits its applicability. An additional approach to this memory limitation is
to use magnetic disk to store nodes rather than semiconductor memory. While
semiconductor memory has gotten much larger and cheaper over time, it still costs
about $30 per gigabyte. In contrast, magnetic disk storage costs about $100 per
terabyte, which is 300 times cheaper. The problem with simply replacing semicon-
ductor memory with magnetic disks, however, is that random access of a byte on
disk can take up to ten milliseconds, which is five orders of magnitude slower than
for memory. Thus, disk access must be sequential for efficiency.
Consider a simple breadth-first search (BFS), which is usually implemented with
a first-in first-out queue. Nodes are read from the head of the queue, expanded,
and their children are written to the tail of the queue. Such a queue can efficiently
be stored on disk, since all accesses are sequential.
In order to detect duplicate nodes efficiently, however, the nodes are also stored
in a hash table. Nodes are looked up in the hash table as they are generated, and
duplicate nodes are discarded. Such a hash table cannot be directly implemented
on magnetic disk, however, due to the long latency of random access.
A solution to this problem is called delayed duplicate detection [Korf 2008] or DDD
for short. The BFS queue is stored on disk, but nodes are not checked for duplicates
as they are generated. Rather, duplicate nodes are appended to the queue, and are
only eliminated periodically, such as at the end of each depth iteration. There are
several ways to eliminate duplicate nodes from a large file stored on disk.
The simplest way is to sort the nodes based on their state representation. This
will bring duplicate nodes to adjacent positions. Then, a simple linear scan of the
file can be used to detect and merge duplicate nodes. The drawback of this approach
is that the sorting takes O(n log n) time, where n is the number of nodes. With a
terabyte of storage, and four bytes per state, n can be as large as 250 billion, and
hence log n as large as 38.
An alternative is to use hash-based DDD. This scheme relies on two orthogonal
hash functions defined on the state representation. In the first phase, the input file
is read, and nodes are output to separate files based on the value of the first hash
function. Thus, any set of duplicate nodes will be confined to the same file. In
the second phase, the nodes in each individual file are hashed into memory using
the second hash function, and duplicates are detected and merged in memory. The
advantage of this approach is that the time complexity is only linear in the number
of nodes, rather than O(n log n) time for sorting-based DDD.
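The two phases can be sketched as follows; for illustration, Python's built-in hash plays the role of the first (partitioning) hash function and an in-memory set plays the role of the second, and states are assumed to be stored one per line in a disk file:

```python
import os

def hash_based_ddd(input_path, output_path, num_buckets=256):
    """Hash-based delayed duplicate detection. Phase 1: stream the input file and
    append each state to one of `num_buckets` bucket files chosen by the first
    hash function, so that all copies of a state land in the same bucket.
    Phase 2: deduplicate each bucket in memory and stream the unique states to
    the output file. All disk accesses are sequential."""
    buckets = [open(f"{input_path}.bucket{i}", "w") for i in range(num_buckets)]
    with open(input_path) as infile:                 # phase 1: sequential read
        for line in infile:
            state = line.rstrip("\n")
            buckets[hash(state) % num_buckets].write(state + "\n")
    for b in buckets:
        b.close()
    with open(output_path, "w") as out:              # phase 2: per-bucket dedup
        for i in range(num_buckets):
            path = f"{input_path}.bucket{i}"
            with open(path) as bucket:
                for state in set(line.rstrip("\n") for line in bucket):
                    out.write(state + "\n")
            os.remove(path)
```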
The overall DDD algorithm proceeds in alternating phases of node expansion
followed by merging duplicate nodes. Combined with frontier search, DDD has
References
Chakrabarti, P., S. Ghose, A. Acharya, and S. de Sarkar (1989, December).
Heuristic search in restricted memory. Artificial Intelligence 41 (2), 197–221.
Ghosh, R., A. Mahanti, and D. Nau (1994, August). An efficient limited-memory
heuristic tree search algorithm. In Proceedings of the Twelfth National Con-
ference on Artificial Intelligence (AAAI-94), Seattle, WA, pp. 1353–1358.
Hart, P., N. Nilsson, and B. Raphael (1968, July). A formal basis for the heuristic
determination of minimum cost paths. IEEE Transactions on Systems Science
and Cybernetics SSC-4 (2), 100–107.
Korf, R. (1984). The complexity of brute-force search. technical report, Computer
Science Department, Columbia University, New York, NY.
5
Paranoia versus Overconfidence in
Imperfect-Information Games
Austin Parker, Dana Nau, and V.S. Subrahmanian
1 Introduction
In minimax game-tree search, the min part of the minimax backup rule derives
from what we will call the paranoid assumption: the assumption that the opponent
will always choose a move that minimizes our payoff and maximizes his/her payoff
(or our estimate of the payoff, if we cut off the search before reaching the end of
the game). A potential criticism of this assumption is that the opponent may not
have the ability to decide accurately what move this is. But in several decades
of experience with game-tree search in chess, checkers, and other zero-sum perfect-
information games, the paranoid assumption has worked so well that such criticisms
are generally ignored.
In game-tree search algorithms for imperfect-information games, the backup rules
are more complicated. Many of them (see Section 6) involve computing a weighted
average over the opponent’s possible moves (or a Monte Carlo sample of them),
where each move’s weight is an estimate of the probability that this is the opponent’s
best possible move. Although such backup rules do not take a min at the opponent’s
move, they still tacitly encode the paranoid assumption, by assuming that the
opponent will choose optimally from the set of moves he/she is actually capable of
making.
Intuitively, one might expect the paranoid assumption to be less reliable in
imperfect-information games than in perfect-information games; for without per-
fect information, it may be more difficult for the opponent to judge which move is
best. The purpose of this paper is to examine whether it is better to err on the side
of paranoia or on the side of overconfidence. Our contributions are as follows:
This work was influenced by Judea Pearl’s invention of P-games [Pearl 1981; Pearl
1984], and his suggestion of investigating backup rules other than minimax [Pearl
1984]. We also are grateful for his encouragement of the second author’s early work
on game-tree search (e.g., [Nau 1982a; Nau 1983]).
2 Basics
Our definitions and notation are based on [Osborne and Rubinstein 1994]. We con-
sider games having the following characteristics: two players, finitely many moves
and states, determinism, turn taking, zero-sum utilities, imperfect information ex-
pressed via information sets (explained in Section 2.1), and perfect recall (explained
in Section 2.3). We will let G be any such game, and a1 and a2 be the two players.
We assume that when two histories h, h′ produce the same sequence of observa-
tions, they also produce the same set of available moves, i.e., if Oi (h) = Oi (h′ ),
then M (s(h)) = M (s(h′ )). The rationale for this is that if the current history
is h, ai ’s observations won’t tell ai whether the history is h or h′ , so ai may
attempt to make a move m that is applicable in s(h′ ) but not in s(h). If ai
does so, then m will produce some kind of outcome, even if the outcome is just
an announcement that ai must try a different move. Consequently, we can
easily make m applicable in s(h), by defining a new state m(s(h)) in which
this outcome occurs.
1 Nondeterministic initial states, outcomes, and observations can be modeled by introducing an
additional player a0 who makes a nondeterministic move at the start of the game and after each
of the other players’ moves. To avoid affecting the other players’ payoffs, a0 ’s payoff in terminal
states is always 0.
2 Some game-theory textbooks define information sets without even using the notion of an
“observation.” They simply let a player’s information sets be the equivalence classes of a partition
over the set of possible histories.
We assume that terminal histories with distinct utilities always provide dis-
tinct observations, i.e., for terminal histories h, h′ ∈ T, if Ui(h) ≠ Ui(h′) then
Oi(h) ≠ Oi(h′).
We define ai ’s information set for h to be the set of all histories that give ai the
same observations that h gives, i.e., [h]i = {h′ ∈ H : Oi (h′ ) = Oi (h)}. The set of
all possible information sets for ai is Ii = {[h]i : h ∈ H}. It is easy to show that Ii
is a partition of H.
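As a small illustration (the observation function below is a stand-in for Oi), the information sets can be obtained by grouping histories according to the observations they produce, which yields a partition automatically:

```python
from collections import defaultdict

def information_sets(histories, observe):
    """Group histories by the observation sequence observe(h), standing in for
    O_i(h); each group is one information set [h]_i, and together the groups
    partition the given histories. Observation sequences must be hashable."""
    groups = defaultdict(list)
    for h in histories:
        groups[observe(h)].append(h)
    return list(groups.values())
```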
Figure 1 shows an example game tree illustrating the correspondence between
information sets and histories. In that game, player a1 makes the first move, which
is hidden to player a2 . Thus player a2 knows that the history is either hLi or hRi,
which is denoted by putting a dotted box around the nodes for those histories.
2.2 Strategies
In a perfect-information game, a player ai ’s strategy is a function σi (m|s) that
returns the probability p that ai will make move m in state s. For imperfect-
information games, where ai will not always know the exact state he/she is in, σi is
a function of an information set rather than a state; hence σi (m|I) is the probability
that ai will make move m when their information set is I. We let M (I) be the set
of moves available in information set I.
If σi is a mixed strategy, then for every information set I ∈ Ii where it is ai ’s
move, there may be more than one move m ∈ M (I) for which σi (m|I) > 0. But
if σi is a pure strategy, then there will be a unique move mI ∈ M (I) such that
σi(m|I) = 0 for all m ≠ mI and σi(mI |I) = 1; and in this case we will use the notation
σi (I) to refer to mI .
If h = ⟨m1, m2, . . . , mn⟩ is a history, then its probability P(h) can be calculated
from the players’ strategies. Suppose a1’s and a2’s strategies are σ1 and σ2. In the
special case where a1 has the first move and the players move in strict alternation,

    P(h \mid \sigma_1, \sigma_2) = \sigma_1(m_1 \mid h_0)\,\sigma_2(m_2 \mid h_1)\cdots\sigma_1(m_j \mid h_{j-1})\,\sigma_2(m_{j+1} \mid h_j)\cdots   \qquad (1)

    P(h \mid I, \sigma_1, \sigma_2) = \frac{P(h \mid \sigma_1, \sigma_2)}{\sum_{h' \in I} P(h' \mid \sigma_1, \sigma_2)}.   \qquad (3)
where T is the set of all terminal histories, and P (h|σ1 , σ2 ) is as in Eq. (2). Since
the game is zero-sum, it follows that a2 ’s expected utility is −EU (σ1 , σ2 ).
For the expected utility of an individual history h, there are two cases:
Case 1: History h is terminal. Then h’s expected utility is just its actual utility,
i.e.,
EU (h|σ1 , σ2 ) = EU (h) = U (h). (6)
Case 2: History h ends at a state where it is ai ’s move. Then h’s expected utility
is a weighted sum of the expected utilities for each of ai ’s possible moves,
weighted by the probabilities of ai making those moves:
    EU(h \mid \sigma_1, \sigma_2) = \sum_{m \in M(h)} \sigma_i(m \mid h) \cdot EU(h \circ m \mid \sigma_1, \sigma_2)
                                  = \sum_{m \in M(h)} \sigma_i(m \mid [h]_i) \cdot EU(h \circ m \mid \sigma_1, \sigma_2),   \qquad (7)
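The recursion of Eqs. (6)–(7) translates directly into code; in the following sketch the game interface (is_terminal, utility, player_to_move, information_set, moves) and the strategy functions are hypothetical stand-ins for the definitions above:

```python
def expected_utility(h, strategies, game):
    """Expected utility of history h under the strategy profile `strategies`
    (Eqs. 6-7): a terminal history returns its utility; otherwise the children's
    values are summed, weighted by the mover's strategy at h's information set."""
    if game.is_terminal(h):
        return game.utility(h)
    i = game.player_to_move(h)
    info_set = game.information_set(i, h)          # [h]_i
    return sum(strategies[i](m, info_set) *
               expected_utility(h + (m,), strategies, game)
               for m in game.moves(h))
```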
The following lemma shows that the recursive formulation in Eqs. (6–7) matches
the notion of expected utility given in Eq. 5.
LEMMA 1. For any strategies σ1 and σ2, EU(⟨⟩ | σ1, σ2) (the expected utility of
the empty initial history as computed via the recursive Equations 6 and 7) equals
EU(σ1, σ2).
where k is one greater than the size of h and n is the size of each h′ as appropriate.
The base case occurs when h is terminal, and the inductive case assumes Eq. 8 holds
for histories of length m + 1 to show algebraically that Eq. 8 holds for histories of
length m. □
COROLLARY 2. For any strategies σ1 and σ2, and player ai, EU([⟨⟩]i | σ1, σ2) (the
expected utility of player ai’s initial information set) equals EU(σ1, σ2).
3 Finding a Strategy
We now develop the theory for a game-tree search technique that exploits an oppo-
nent model.
Suppose a1 ’s and a2 ’s strategies are σ1 and σ2 , and let I be any information set for
a1 . Let M ∗ (I|σ1 , σ2 ) be the set of all moves in M (I) that maximize a1 ’s expected
utility at I, i.e.,
Since we are considering only finite games, every history has finite length. Thus
by starting at the terminal states and going backwards up the game tree, applying
Eqs. (7) and (9) at each move, one can compute a strategy σ1∗ such that:
    \sigma_1^*(m \mid I) = \begin{cases} 1/|M^*(I \mid \sigma_1^*, \sigma_2)|, & \text{if } m \in M^*(I \mid \sigma_1^*, \sigma_2), \\ 0, & \text{otherwise.} \end{cases}   \qquad (11)
THEOREM 3. Let σ2 be a strategy for a2 , and σ1∗ be as in Eq. (11). Then σ1∗ is
σ2 -optimal.
Sketch of proof. Let σ̄1 be any σ2 -optimal strategy. The basic idea is to show, by
induction on the lengths of histories in an information set I, that EU (I|σ1∗ , σ2 ) ≥
EU (I|σ̄1 , σ2 ).
The induction goes backwards from the end of the game: the base case is where
I contains histories of maximal length, while the inductive case assumes the in-
equality holds when I contains histories of length k + 1, and shows it holds when I
contains histories of length k. The induction suffices to show that EU([⟨⟩]1 | σ1∗, σ2) ≥
EU([⟨⟩]1 | σ̄1, σ2), whence from Lemma 1, EU(σ1∗, σ2) ≥ EU(σ̄1, σ2). □
    EU_d(h \mid \sigma_1^*, \sigma_2) = \begin{cases}
        E(h), & \text{if } d = 0, \\
        U(h), & \text{if } h \text{ is terminal}, \\
        \sum_{m \in M(h)} \sigma_2(m \mid [h]_2) \cdot EU_{d-1}(h \circ m \mid \sigma_1^*, \sigma_2), & \text{if it is } a_2\text{'s move}, \\
        EU_{d-1}\!\left(h \circ \arg\max_{m \in M(h)} EU_d([h \circ m]_1 \mid \sigma_1^*, \sigma_2)\right), & \text{if it is } a_1\text{'s move},
    \end{cases}   \qquad (12)

    EU_d(I \mid \sigma_1^*, \sigma_2) = \sum_{h \in I} P(h \mid I, \sigma_1^*, \sigma_2) \cdot EU_d(h \mid I, \sigma_1^*, \sigma_2).   \qquad (13)
If the algorithm searches to a limited depth (Eq. 12 with d < maxh∈H |h|), we
will refer to the resulting strategy as limited-depth overconfident. If the algorithm
searches to the end of the game (i.e., d ≥ maxh∈H |h|), we will refer to the resulting
strategy as full-depth overconfident; and in this case we will usually write OU (h)
rather than OUd (h).
Paranoia. The paranoid model assumes that a2 will always make the worst possi-
ble move for a1 , i.e., the move that will produce the minimum expected utility over
all of the histories in a1 ’s information set. This model replaces the summation in
    PU_d(h) = \begin{cases}
        E(I), & \text{if } d = 0, \\
        U(h), & \text{if } h \text{ is terminal}, \\
        PU_{d-1}\!\left(h \circ \arg\min_{m \in M(h)} \min_{h' \in [h]_1} PU_d([h \circ m])\right), & \text{if it is } a_2\text{'s move}, \\
        PU_{d-1}\!\left(h \circ \arg\max_{m \in M(h)} \min_{h' \in [h]_1} PU_d([h \circ m])\right), & \text{if it is } a_1\text{'s move},
    \end{cases}   \qquad (16)
As with overconfident search, we will use the terms limited-depth and full-depth
to refer to the cases where d < max_{h∈H} |h| and d ≥ max_{h∈H} |h|, respectively;
and for a full-depth paranoid search, we will usually write PU(h) rather than PUd(h).
In perfect-information games, PU(h) equals h’s minimax value. But in imperfect-
information games, h’s minimax value is the minimum of Eq. (11) over all possible
values of σ2; and consequently PU(h) may be less than h’s minimax value.
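To contrast the two backup rules, the following deliberately simplified sketch drops the information-set aggregation of Eqs. (12)–(13) and (16) and works directly on histories: overconfidence then reduces to averaging over the opponent's moves, and paranoia to minimizing over them. E is a static evaluation function, and the game interface is a hypothetical stand-in:

```python
def overconfident_value(h, d, game, E):
    """Simplified overconfident backup: a2 is modeled as choosing uniformly at
    random among its moves, so its nodes average the children's values."""
    if game.is_terminal(h):
        return game.utility(h)
    if d == 0:
        return E(h)
    values = [overconfident_value(h + (m,), d - 1, game, E) for m in game.moves(h)]
    return sum(values) / len(values) if game.player_to_move(h) == 2 else max(values)

def paranoid_value(h, d, game, E):
    """Simplified paranoid backup: a2 is assumed to pick the move that is worst
    for a1, so its nodes take the minimum of the children's values."""
    if game.is_terminal(h):
        return game.utility(h)
    if d == 0:
        return E(h)
    values = [paranoid_value(h + (m,), d - 1, game, E) for m in game.moves(h)]
    return min(values) if game.player_to_move(h) == 2 else max(values)
```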
2. Iterative: Until the available time runs out, repeatedly pick a random h ∈ I,
compute Γ({h}) and aggregate that result with all previous picks.
4 Analysis
Since paranoid and overconfident play both depend on opponent models that may
be unrealistic, which of them is better in practice? The answer is not completely
obvious. Even in games where each player’s moves are completely hidden from the
other player, it is not hard to create games in which the paranoid strategy outplays
the overconfident strategy and vice-versa. We now give examples of games with
these properties.
Figures 2 and 3, respectively, are examples of situations in which paranoid play
outperforms overconfident play and vice versa. As in Figure 1, the games are shown
in tree form in which each dotted box represents an information set. At each leaf
node, U is the payoff for player 1. Based on these values of U, the table gives
the probabilities of moving left (L) and right (R) at each information set in the
tree, for both the overconfident and paranoid strategies. At each leaf node, pr1
is the probability of reaching that node when player 1 is overconfident and player
2 is paranoid, and pr2 is the probability of reaching that node when player 2 is
overconfident and player 1 is paranoid.
In Figure 2, the paranoid strategy outperforms the overconfident strategy, be-
cause of the differing choices the strategies will make at the information set I2:
(Figure 2: game tree with leaf nodes C, D, E, and F reached by moves L3, R3, L4, and R4; the diagram is not reproduced here.)
Figure 3 shows a game in which the overconfident strategy outperforms the para-
noid strategy. Again, the pertinent information set is I2:
(Figure 3: game tree with leaf nodes C, D, E, and F reached by moves L3, R3, L4, and R4; the leaf utilities U and the reaching probabilities pr1 and pr2 are not reproduced here.)
paranoid play, assuming the worst, believes both moves L2 and R2 are losses.
R2 is a loss because the opponent may have made move R1, resulting in a
forced loss for player 2 at node F, and L2 is a loss because the opponent may
have made move L1 and then may make move R4, resulting in a loss for player
2. Since there is a potential loss in either case, paranoid play chooses between
the two moves with equal probability.
These two examples show that neither strategy is guaranteed to be better in all
cases: sometimes paranoid play outperforms overconfident play, and sometimes vice
versa. So to determine their relative worth, deeper analysis is necessary.
Sketch of proof. This is proven by induction on the height of the state s under
consideration. The base case occurs with terminal nodes of height 0, for which
the lemma follows trivially. The inductive case supposes the lemma holds for all
states of height k and shows algebraically that it holds for states s of height k + 1
in each of four possible cases: (1) if it is a1’s move and µ(s) = −1 then OC(s) ∈ [−1, 1),
(2) if it is a1’s move and µ(s) = 1 then OC(s) = 1, (3) if it is a2’s move and µ(s) = −1 then
OC(s) ∈ [−1, 1), and (4) if it is a2’s move and µ(s) = 1 then OC(s) = 1. Since
the game allows only wins and losses (so that µ(s) is 1 or −1), these are all the
possibilities. □
4.2 Discussion
Paranoid play. When using paranoid play, a1 assumes that a2 has always made and will
always make the worst move possible for a1, but a1 does this given only a1’s infor-
mation set. This means that for any given information set, the paranoid player will
find the history in the information set that is least advantageous to itself and make
moves as though that were the game’s actual history even when the game’s actual
history is any other member of the information set. There is a certain intuitively
appealing protectionism occurring here: an opponent that happens to have made
the perfect moves cannot trap the paranoid player. However, it really is not clear
exactly how well a paranoid player will do in an imperfect-information game, for
the following reasons:
There is no reason to necessarily believe that the opponent has made those
“perfect” moves. In imperfect-information games, the opponent has differ-
ent information than the paranoid player, which may not give the opponent
enough information to make the perfect moves paranoid play expects.
Against non-perfect players, the paranoid player may lose a lot of potentially
winnable games. The information set could contain thousands of histories in
which a particular move m is a win; if that move is a loss on just one history,
and there is another move m′ which admits no losses (and no wins), then m
will not be chosen.3
In games such as kriegspiel, in which there are large and diverse information
sets, usually every information set will contain histories that are losses, hence
paranoid play will evaluate all of the information sets as losses. In this case,
all moves will look equally terrible to the paranoid player, and paranoid play
becomes equivalent to random play.4
We should also note the relationship paranoid play has to the “imperfection” of
the information in the game. A game with large amounts of information and small
information sets should see better play from a paranoid player than a game with
large information sets. The reason for this is that as we get more information about
the actual game state, we can be more confident that the move the paranoid player
designates as “worst” is a move the opponent can discover and make in the actual
game. The extreme of this is a perfect information game, where paranoid play has
proven quite effective: it is minimax search. But without some experimentation, it
is not clear to what extent smaller amounts of information degrade paranoid play.
Overconfident play. Overconfident play assumes that a2 will, with equal prob-
ability, make all available moves regardless of what the available information tells
a2 about each move’s expected utility. The effect this has on game play depends
on the extent to which a2 ’s moves diverge from random play. Unfortunately for
overconfidence, many interesting imperfect-information games implicitly encourage
3 This argument assumes that the paranoid player examines the entire information set rather
than a statistical sample as discussed in Section 3.4. If the paranoid player examines a statistical
sample of the information set, there is a good chance that the statistical sample will not contain
the history for which m is a loss. Hence in this case, statistical sampling would actually improve
the paranoid player’s play.
4 We have verified this experimentally in several of the games in the following section, but omit
non-random play. In these games the overconfident player will not adequately con-
sider the risks of its moves. The overconfident player, acting under the theory that
the opponent is unlikely to make a particular move, will often fail to protect itself
from a potential loss.
However, depending on the amount of information in the imperfect-information
game, the above problem may not be as bad as it seems. For example, consider a
situation where a1 , playing overconfidently, assumes the opponent is equally likely
to make each of the ten moves available in a1 ’s current information set. Suppose
that each move is clearly the best move in exactly one tenth of the available histories.
Then, despite the fact that the opponent is playing a deterministic strategy, random
play is a good opponent model given the information set. This sort of situation,
where the model of random play is reasonable despite it being not at all related
to the opponent’s actual mixed strategy, is more likely to occur in games where
there is less information. The larger the information set, the more likely it is that
every move is best in enough histories to make that move as likely to occur as any
other. Thus in games where players have little information, there may be a slight
advantage to overconfidence.
Comparative performance. The above discussion suggests that (1) paranoid
play should do better in games with “large” amounts of information, and (2) over-
confident play might do better in games with “small” amounts of information. But
will overconfident play do better than paranoid play? Suppose we choose a game
with small amounts of information and play a paranoid player against an overconfi-
dent player: what should the outcome be? Overconfident play has the advantage of
probably not diverging as drastically from the theoretically correct expected utility
of a move, while paranoid play has the advantage of actually detecting and avoiding
bad situations – situations to which the overconfident player will not give adequate
weight.
Overall, it is not at all clear from our analysis how well a paranoid player and an
overconfident player will do relative to each other in a real imperfect-information
game. Instead, experimentation is needed.
5 Experiments
In this section we report on our experimental comparisons of overconfident versus
paranoid play in several imperfect-information games.
One of the games we used was kriegspiel, an imperfect-information version of
chess [Li 1994; Li 1995; Ciancarini, DallaLibera, and Maran 1997; Sakuta and Iida
2000; Parker, Nau, and Subrahmanian 2005; Russell and Wolfe 2005]. In kriegspiel,
neither player can observe anything about the other player’s moves, except in cases
where the players directly interact with each other. For example, if a1 captures one
of a2 ’s pieces, a2 now knows that a1 has a piece where a2 ’s piece used to be. For
more detail, see Section 5.2.
In addition, we created imperfect-information versions of three perfect-
information games: P-games [Pearl 1984], N-games [Nau 1982a], and a simplified
version of kalah [Murray 1952]. We did this by hiding some fraction 0 ≤ h ≤ 1 of
each player’s moves from the other player. We will call h the hidden factor, because
it is the fraction of information that we hide from each player: when h = 0, each
player can see all of the other player’s moves; when h = 1, neither player can see
any of the other player’s moves; when h = 0.2, each player can see 20% of the other
player’s moves; and so forth.
In each experiment, we played two players head-to-head for some number of
trials, and averaged the results. Each player went first on half of the trials.
Figure 4. Average scores for overconfident (OC) play against paranoid (PAR) play, plotted against the hidden factor for branching factors (b/f) 2, 3, and 4: (a) hidden-move P-games, each data point an average of at least 72 trials; (b) hidden-move N-games, each data point an average of at least 39 trials; (c) hidden-move kalah, averaged over randomly generated initial states.
1982a], and we wanted to ascertain whether this property might have influenced
our experimental results on hidden-move P-games. N-games are similar to P-games
but do not exhibit game-tree pathology, so we did a similar set of experiments on
hidden-move N-games.
An N-game is specified by a triple (d, b, P0 ), where d is the game length, b is the
branching factor, and P0 is a probability. An N-game specified by this triple has
a game tree of height d and branching factor b, and each arc in the game tree is
randomly assigned a value of +1 with probability P0 , or −1 otherwise. A leaf node
is a win for player 1 (and a loss for player 2) if the sum of the values on the arcs
between the root and the leaf node is greater than zero; otherwise the leaf node is
We vary the number of pits on the board. This varies the branching factor.
We end the game after 10 ply, to ensure that the algorithms can search the
entire tree.
We start with a random number of stones in each pit to ensure that at each
branching factor there will be games with non-trivial decisions.
Since randomized kalah is directly motivated by a very old game that people still
play, its game trees are arguably much less “artificial” than those of P-games or
N-games.
The results of playing overconfidence versus paranoia in hidden-move versions of
randomized kalah are shown in Figure 4(c). The results are roughly similar to the
P-game and N-game results, in the sense that overconfidence generally outperforms
paranoia; but the results also differ from the P-game and N-game results in several
ways. First, overconfidence generally does better at high hidden factors than at low
ones. Second, paranoia does slightly better than overconfidence at hidden factor 0
(which does not conflict with Theorem 5, since kalah allows ties). Third, paranoia
does better than overconfidence when the branching factor is 2 and the hidden
factor is 0.2 or 0.4. These are the only results we saw where paranoia outperformed
overconfidence.
The fact that, with the same branching factor, overconfidence outperforms paranoia at hidden factor 0.6 supports the hypothesis that as the amount of information in the game decreases, paranoid play performs worse relative to overconfident play. The rest of the results support that hypothesis as well: over-
confidence generally increases in performance against paranoia as the hidden factor
increases.
Table 1. Average scores for overconfident play against paranoid play, in 500 kriegspiel games using the ICC ruleset. d is the search depth.

  Overconfident    Paranoid d = 1    Paranoid d = 2    Paranoid d = 3
  d = 1            +0.084            +0.186            +0.19
  d = 2            +0.140            +0.120            +0.156
  d = 3            +0.170            +0.278            +0.154

Table 2. Average scores for overconfident and paranoid play against HS, with 95% confidence intervals. d is the search depth.

  d    Paranoid           Overconfident
  1    –0.066 ± 0.02      +0.194 ± 0.038
  2    +0.032 ± 0.035     +0.122 ± 0.04
  3    +0.024 ± 0.038     +0.012 ± 0.042
We used identical time controls throughout and always ran the two players in a given game on the same hardware, to ensure the results were not biased by hardware differences. The algorithms were written in C++. The
code used for overconfident and paranoid play is the same, with the exception of
the opponent model. We used a static evaluation function that was developed to
reward conservative kriegspiel play, as our experience suggests such play is generally
better. It uses position, material, protection and threats as features.
The algorithms used for kriegspiel are depth-limited versions of the paranoid and
overconfident players. To handle the immense information-set sizes in kriegspiel,
we used iterative statistical sampling (see Section 3.4). To get a good sample with
time control requires limiting the search depth to at most three ply. Because time
controls remain constant, the lower search depths are able to sample many more
histories than the higher search depths.
Head-to-head overconfident vs. paranoid play. We did experiments compar-
ing overconfident play to paranoid play by playing the two against each other. We
gave the algorithms 30 seconds per move and played searches of depth one, two, and three against each other. The results are in Table 1. In these results, we
notice that overconfident play consistently beats paranoid play, regardless of the
depth of either search. This is consistent with our earlier results for hidden-move
games (Section 5.1); and, in addition, it shows overconfident play doing better than
paranoid play in a game that people actually play.
HS versus overconfidence and paranoia. We also compared overconfident
and paranoid play to the hybrid sampling (HS) algorithm from our previous work
[Parker, Nau, and Subrahmanian 2005]. Table 2 presents the results of the exper-
iments, which show overconfidence playing better than paranoia except in depth
three search, where the results are inconclusive. The inconclusive results at depth
three (which are an average over 500 games) may be due to the sample sizes achieved
via iterative sampling. We measured an average of 67 histories in each sample at
depth three, which might be compared to an average of 321 histories in each sample
at depth two and an average of 1683 histories at depth one. Since both algorithms
use iterative sampling, it could be that at depth three, both algorithms examine too few histories to produce reliable comparisons.
6 Related Work
There are several imperfect-information game-playing algorithms that work by
treating an imperfect-information game as if it were a collection of perfect-
information games [Smith, Nau, and Throop 1998; Ginsberg 1999; Parker, Nau,
and Subrahmanian 2005]. This approach is useful in imperfect-information games
such as bridge, where it is not the players’ moves that are hidden, but instead
some information about the initial state of the game. The basic idea is to choose
at random a collection of states from the current information set, do conventional
minimax searches on those states as if they were the real state, then aggregate the
minimax values returned by those searches to get an approximation of the utility of
the current information set. This approach has some basic theoretical flaws [Frank
and Basin 1998; Frank and Basin 2001], but has worked well in games such as
bridge.
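A minimal sketch of this idea (hypothetical helper names; minimax_value stands in for whatever perfect-information search the caller supplies):

import random

def sampled_move(info_set, legal_moves, minimax_value, n_samples=20, rng=random):
    # Determinized search: sample states from the current information set,
    # evaluate each move by conventional minimax in each sampled state as if
    # it were the real state, and aggregate (here, average) the values.
    states = rng.sample(list(info_set), min(n_samples, len(info_set)))
    def aggregate(move):
        return sum(minimax_value(state, move) for state in states) / len(states)
    return max(legal_moves, key=aggregate)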
Poker-playing computer programs can be divided into two major classes. The
first are programs which attempt to approximate a Nash equilibrium. The best
examples of these are PsOpti [Billings, Burch, Davidson, Holte, Schaeffer, Schauen-
berg, and Szafron 2003] and GS1 [Gilpin and Sandholm 2006b]. The algorithms
use an intuitive approximation technique to create a simplified version of the poker
game that is small enough to make it feasible to find a Nash equilibrium. The equi-
librium can then be translated back into the original game, to get an approximate
Nash equilibrium for that game. These algorithms have had much success but differ
from the approach in this paper: rather than attempting to find a Nash equilibrium, information-set search simply tries to find the optimal strategy against a given opponent model. The second class of poker-playing programs includes Poki [Billings,
Davidson, Schaeffer, and Szafron 2002] which uses expected value approximations
and opponent modeling to estimate the value of a given move and Vexbot [Billings,
Davidson, Schauenberg, Burch, Bowling, Holte, Schaeffer, and Szafron 2004] which
uses search and adaptive opponent modeling.
The above works have focused specifically on creating successful programs for
card games (bridge and poker) in which the opponents’ moves (card plays, bets) are
observable. In these games, the hidden information is which cards went to which
players when the cards were dealt. Consequently, the search techniques are less
general than information-set search, and are not directly applicable to hidden-move
games such as kriegspiel and the other games we have considered in this paper.
7 Conclusion
We have introduced a recursive formulation of the expected value of an information
set in an imperfect information game. We have provided analytical results showing
that this expected utility formulation plays optimally against any opponent if we
have an accurate model of the opponent’s strategy.
Since it is generally not the case that the opponent’s strategy is known, the
question then arises as to what the recursive search should assume about an op-
ponent. We have studied two opponent models, a “paranoid” model that assumes
the opponent will choose the moves that are best for them, hence worst for us; and
an “overconfident” model that assumes the opponent is making moves purely at
random.
We have compared the overconfident and paranoid models in kriegspiel, in an
imperfect-information version of kalah, and in imperfect-information versions of P-
games [Pearl 1984] and N-games [Nau 1982a]. In each of these games, the overcon-
fident strategy consistently outperformed the paranoid strategy. The overconfident
strategy even outperformed the best of the kriegspiel algorithms in [Parker, Nau,
and Subrahmanian 2005].
These results suggest that the usual assumption in perfect-information game tree
search—that the opponent will choose the best move possible—is not as effective in
imperfect-information games.
Acknowledgments: This work was supported in part by AFOSR grant
FA95500610405, NAVAIR contract N6133906C0149, DARPA IPTO grant FA8650-
06-C-7606, and NSF grant IIS0412812. The opinions in this paper are those of the
authors and do not necessarily reflect the opinions of the funders.
References
Applegate, D., G. Jacobson, and D. Sleator (1991). Computer analysis of sprouts.
Technical report, Carnegie Mellon University.
Smith, S. J. J., D. S. Nau, and T. Throop (1998). Computer bridge: A big win
for AI planning. AI Magazine 19 (2), 93–105.
von Neumann, J. and O. Morgenstern (1944). Theory of Games and Economic
Behavior. Princeton University Press.
6
Heuristic Search: Pearl’s Significance from
a Personal Perspective
Ira Pohl
1 Introduction
This paper is about heuristics, and the significance of Judea Pearl’s work to the
field. The impact of Pearl’s monograph was transformative. It heralded a third
wave in the practice and theory of heuristic search. First there were the pioneering
search algorithms without a theoretical basis, such as GPS or GT. Second came the Nilsson [1980] work, which formulated a basic theory of A*. Third, in 1984, came Judea Pearl, adding depth and breadth to this theory and setting it in a more sophisticated probabilistic context. Judea Pearl’s book, Heuristics: Intelligent Search Strategies
for Computer Problem Solving, was a tour de force summarizing the work of three
decades.
Heuristic Search is a holy grail of Artificial Intelligence. It attempts to be a
universal methodology to achieve AI, demonstrating early success in an era when
AI was largely experimental. Pre-1970 programs that were notable include the
Doran-Michie Graph Traverser [Doran and Michie 1966], the Art Samuel checker
program [Samuel 1959], and the Newell, Simon, and Shaw General Problem Solver
[Newell and Simon 1972]. These programs showed what could be called intelligence
across a gamut of puzzles, games, and logic problems. Having no theoretical basis
for predicting success, they were tested and compared to human performance.
The lack of theory for heuristic search changed in 1968 with the publication
by Hart, Nilsson, and Raphael [1968] of their A* algorithm and its analysis. A*
was provably optimum under some theoretical assumptions. The outcome of this
work at the SRI robotics group fueled a series of primary results including my own,
and later Pearl’s. Pearl’s book [Pearl 1984] captured and synthesized much of the
A* work, including my work from the late 1960’s [Pohl 1967; Pohl 1969] through
1977 [Pohl 1970a; Pohl 1970b; Pohl 1971; Pohl 1973; Pohl 1977]. It built a solid
theoretical structure for heuristic search, and inspired much of my own and others’ subsequent work [Ratner and Pohl 1986; Ratner and Warmuth 1986; Kaindl and
Kaintz 1997; Politowski 1984].
2 Early Experimentation
In the 1960’s there were three premier AI labs at US universities: CMU, MIT and
Stanford; there was one such lab in England: the Machine Intelligence group at
Edinburgh; and there was the AI group at SRI. Each had particular strengths and
visions. The CMU group led by Allen Newell and Herb Simon [Newell and Simon
1972] took a cognitive simulation approach. Their primary algorithmic framework
was GPS, the General Problem Solver. This algorithm could be viewed as an ele-
mentary divide and conquer strategy. Guidance was based on detecting differences between partial solutions and goal states. To make progress, this algorithm attempted to apply operators to partial solutions so as to reduce their difference from the goal state.
It demonstrated that a heuristic search strategy could be applied to a wide array
of problems that were associated with human intelligence. These included combi-
natorial puzzles such as crypt-arithmetic and towers of Hanoi.
The Graph Traverser [Doran and Michie 1966] reduced AI questions to the task
of heuristic search. It was deployed on the 8 puzzle, a classic combinatorial puzzle
typical of mildly challenging human amusements. “The objective of the 8-Puzzle
is to rearrange a given initial configuration of eight numbered tiles arranged on a
3 x 3 board into a given final configuration called the goal state [Pearl 1984, p.
6].” Michie and Doran tried to obtain efficient search by discovering useful and
computationally simple heuristics that measured perceived effort toward a solution.
An example was “how many tiles are out of their goal space.” The Graph Traverser
demonstrated that a graph representation was a useful general perspective in prob-
lem solving, and that computationally knowledgeable heuristics could efficiently
guide search.
The MIT AI lab pioneered many projects in both robotics and advanced problem
solving. Marvin Minsky and his collaborators solved relatively difficult mathemat-
ical problems such as word algebra problems and calculus problems [Slagle 1963].
They used context, mathematical models and search to solve these problems. Ul-
timately this work led to programs such as Mathematica. They gave a plausible
argument for what could be described as knowledge + deduction = intelligence.
The Stanford AI lab was led by John McCarthy, who was important in two re-
gards: computational environment and logical representations. McCarthy pioneered
with LISP and time sharing: tools and schemes that would profoundly impact the
entire computational community. He championed predicate logic as a uniform and
complete representation of what was needed to express and reason about the world.
The SRI lab, originally affiliated with Stanford, but later independent, was en-
gaged in robotics. A concern was how an autonomous mobile robot could efficiently
navigate harsh terrain, such as the moon. Here the emphasis was more on engineering and algorithms: the question was how something could be made to work efficiently, not whether it simulated human intelligence or generated a comprehensive theory of inference.
3 Early Theory
The Graph Traverser was a heuristic path finding algorithm attempting to minimize
search steps to a goal node. This approach was experimental and explored how to
formulate and test for useful heuristics. The Dijkstra shortest path algorithm [Dijk-
stra 1959] was a combinatorial algorithm. It was an improvement on earlier graph
theoretic and operations research algorithms for the pure shortest path optimization
problem. A*, developed in Hart, Nilsson, and Raphael [1968], is the adaptation of Dijkstra’s shortest path algorithm to incorporate admissible heuristics. It preserved the guarantee of finding an optimal path, while using heuristics that attempted to minimize
search.
In the period 1966-1969, I worked on graph theory algorithms, such as the Di-
jkstra algorithm. The Stanford Computer Science department was dominated by
numerical analysts with a strong algorithmic approach. The AI group was meta-
mathematical and inference oriented. Don Knuth had yet to come to Stanford, so
there was not yet any systematic work or courses offered on combinatorial algo-
rithms. I was asked to evaluate PL1 for use at Stanford and SLAC and decided to
test it out by building a graph algorithm library.
I enjoyed puzzle-like problems and had an idea for improving Warnsdorf’s rule
for finding a Knight’s tour. A Knight’s tour is a Hamiltonian path on the 64 square
chessboard whose edge connectivity is determined by Knight moves. In graph theory
terms the rule was equivalent to going to an unvisited node of minimum out-degree.
Warnsdorf’s rule is an example of a classic greedy heuristic. I modified it to find a Hamiltonian path in the 46-node Tutte graph [Pohl 1967; Tutte 1946], as well as showing that it worked well on Knight’s tours.
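As an illustration of the rule (my own sketch in Python, not the original implementation):

def knights_tour(start=(0, 0), n=8):
    # Warnsdorf's rule: always move to the unvisited square from which the
    # fewest onward moves are available (minimum out-degree among unvisited
    # nodes).  Ties are broken arbitrarily; the greedy rule can fail.
    deltas = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
    def onward(sq, visited):
        r, c = sq
        return [(r + dr, c + dc) for dr, dc in deltas
                if 0 <= r + dr < n and 0 <= c + dc < n and (r + dr, c + dc) not in visited]
    tour, visited = [start], {start}
    while len(tour) < n * n:
        options = onward(tour[-1], visited)
        if not options:
            return None
        nxt = min(options, key=lambda sq: len(onward(sq, visited)))
        tour.append(nxt)
        visited.add(nxt)
    return tour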
At that time, the elegance of Dijkstra’s work had impressed me, and I sought to
improve his shortest path algorithm by implementing it bidirectionally. An early
attempt by Nicholson [1966] had proved incorrect because of an error in the stopping
criteria. These ideas conspired to lead me to test my methods further as variations
on A* [Pohl 1971].
The first half of Pearl [1984], Chapter 3 is a sophisticated and succinct summation
of search results through 1978. Three theorems principally due to Hart, Nilsson,
and Raphael [1968] are central to early theory:
• If A*2 is more informed than A*1 , then A*2 dominates A*1 [Pearl 1984,
Theorem 7, p. 81].
Much of this early work relied on the heuristic being consistent. But as Pearl [1984,
p. 111] notes: “The property of monotonicity was introduced by Pohl [1977] to
replace that of consistency. Surprisingly, the equivalence of the two have not been
previously noted in the literature.” These theorems and monotonicity provide a first
attempt at a mathematical foundation for heuristic search.
These theorems suggest that A* is robust and that heuristics that are more
informed lead to more efficient searches. What is missing is how to relate the effort
and accuracy in computing the heuristic to its computational benefit.
This generalization is also adopted in Pearl [1984], Section 3.2.1. Once one has this generalization, one can ask questions similar to those studied by numerical analysts. How reliably the heuristic function estimates effort leads to a notion of error, and this error then affects the convergence rate to a solution. This question
was first taken up by Pohl [1970a] and later by Gaschnig [1979]. Pearl’s results
as summarized in [Pearl 1984, Chapters 6–7], provide a detailed treatment of the
effects of error on search.
it is best to pick from the urn with fewest balls. Think of the urn as the collection
of open nodes and selection as finding the next node along an optimum path. This
leads to the cardinality comparison rule.
Bidirectional search works well for the standard graph shortest path problem.
Here, bidirectional search, exclusive of memory constraints, dominates unidirec-
tional search when the metric is nodes expanded. But when search is guided by
highly selective heuristics there can be a “wandering in the desert” problem when
the two frontiers do not meet in the middle.
To address this problem, I first proposed a parallel computation of all front-to-
front node values in 1975. Two of my students implemented and tested this method [De Champeaux and Sint 1977]. It had some important theoretical advantages, such as retaining admissibility, but it used an expensive front-to-front computation that involved the square of the number of nodes in the open set. Using only nodes expanded as the measure of efficiency was misleading as to the method’s true computational effort [Davis et al. 1984].
[Korf 1985a]. The empirical results, which come from a test on a standard set of 50 problems [Politowski and Pohl 1984], show that the algorithms outperform other then-published methods within stated time limits.
In order to bound the effort of each local search by a constant, each local search will have the start and goal nodes reside on the path, with the distance between them bounded by dmax, a constant independent of n and of the nodes. Then we apply A* with admissible heuristics to find a shortest path between the two nodes. These two conditions generally guarantee that each use of A* requires at most some constant time. More precisely, if the branching degrees of all the nodes in G are bounded by a constant c which is independent of n, then A* will generate at most c(c − 1)^(dmax − 1) nodes.
Theoretically c(c − 1)^(dmax − 1) is a constant, but it can be very large. Nevertheless, most heuristics prune most of the nodes [Pearl 1984]. The fact that not many nodes are generated is supported by experiments.
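As an illustration with hypothetical values, taking c = 3 and dmax = 10 gives a bound of 3 · 2^9 = 1536 nodes per local search; as noted above, in practice the heuristic prunes most of these.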
The results in [Ratner and Pohl 1986] demonstrate the effectiveness of using LPA*. When applicable, this algorithm achieves a good solution with small execution time. This method requires an approximation algorithm as a starting
point. Typically, when one has a heuristic function, one has adequate knowledge
about the problem to be able to construct an approximation algorithm. Therefore,
this method should be preferred in most cases to earlier heuristic search algorithms.
paths meet in the middle. The problem with Pohl’s bidirectional algorithm is that
each search tree is ’aimed’ at the root of the opposite tree. What is needed is some
way of aiming at the front of the opposite tree rather than at its root. There are
two advantages to this. First, there is a better chance of meeting the opposite front
if you are aiming at it. Second, for most heuristics the aim is better when the
target is closer. However, aiming at a front rather than a single node is somewhat
troublesome since the heuristic function is only designed to estimate the distance
between two nodes. One way to overcome this difficulty is to choose from each front
a representative node which will be used as a target for nodes in the opposite tree.
We call such nodes d-nodes.
Consider a partially developed search tree. The growth of the tree is guided
by the heuristic function used in the search, and thus the whole tree is inclined,
at least to some degree, towards the goal. This means that one can expect that
on the average those nodes furthest from the root will also be closest to the goal.
These nodes are the best candidates for the target to be aimed at from the opposite
tree. In particular, the very farthest node out from the root should be the one
chosen. D-node selection based on this criterion costs only one comparison per
node generated.
We incorporated this idea into a bidirectional version of HPA in the following
fashion:
3. After n moves, if the g value of the furthest node out is greater than the g
value of the last d-node in this tree, then the furthest node out becomes the
new d-node. Each time this occurs, all of the nodes in the opposite front
should be re-aimed at the new d-node.
The above algorithm does not specify a value for n. Sufficient analysis may enable
one to choose a good value based on other search parameters such as branching rate,
quality of heuristic, etc. Otherwise, an empirical choice can be made on the basis
of some sample problems. In our work good results were obtained with values of n
ranging from 25 to 125.
It is instructive to consider what happens when n is too large or too small, because
it provides insight into the behavior of the d-node algorithm. A value of n which
is too large will lead to performance similar to unidirectional search. This is not
surprising since for a sufficiently large n, a path will be found unidirectionally, before
any reversal occurs. A value of n which is too small will lead to poor performance
in two respects. First, the runtime will be high because the overhead to re-aim the
opposite tree is incurred too often. Second, the path quality will be lower (i.e., the solution paths found will be longer).
The evaluation function used by the d-node search algorithm is the same as that
used by HPA, namely f = (1 − w)g + w ∗ h, except that h is now the heuristic
estimate of the distance from a particular node to the d-node of the opposite tree.
This is in contrast to the original bidirectional search algorithm, where h estimates
the distance to the root of the opposite tree, and to unidirectional heuristic search,
where h estimates the distance to the goal. The d-node algorithm’s aim is to perform
well for a variety of heuristics and over a range of w values.
The exponential nature of the problem space makes it highly probable that ran-
domly generated puzzles will be relatively hard, i.e. their shortest solution paths
will be relatively long with respect to the diameter of the state space. The four
functions used to compute h are listed below. These functions were originally de-
veloped by Doran and Michie [Doran and Michie 1966], and they are the same functions
as those used by Pohl and de Champeaux.
1. h = P
2. h = P + 20*R
3. h = S
4. h = S + 20*R
The three basic terms P, S, and R have the following definitions:
1. P is the sum, over all tiles i, of the Manhattan distance between the position of tile i in the current state and its position in the goal.
2. S is a relationship between each tile and the blank square, defined in [Doran and Michie 1966].
3. R is the number of reversals in the current state with respect to the goal. For example, if tile 2 is first and tile 1 is next, this is a reversal.
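As an illustrative sketch of these terms (my own reading; in particular, R is computed here as pairs of tiles that occupy each other's goal squares, which matches the example above):

def manhattan_sum(state, goal):
    # P: sum over tiles of the Manhattan distance from current to goal position.
    # `state` and `goal` map tile number -> (row, col); the blank is excluded.
    return sum(abs(r - goal[t][0]) + abs(c - goal[t][1])
               for t, (r, c) in state.items())

def reversals(state, goal):
    # R: number of tile pairs sitting in each other's goal positions
    # (e.g. tile 2 where tile 1 belongs, and tile 1 where tile 2 belongs).
    pos_to_tile = {pos: t for t, pos in state.items()}
    count = 0
    for t, pos in state.items():
        other = pos_to_tile.get(goal[t])      # tile currently on t's goal square
        if other is not None and other != t and goal[other] == pos:
            count += 1
    return count // 2                          # each reversal is counted twice

def h2(state, goal):
    # Heuristic 2 from the list above: h = P + 20*R.
    return manhattan_sum(state, goal) + 20 * reversals(state, goal)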
Finally, the w values which we used were 0.5, 0.75, and 1.0. This covers the entire
’interesting’ range from w = 0.5, which will result in admissible search with a
suitable heuristic, to w = 1.0, which is pure heuristic search.
The detailed results of our test of the d-node algorithm are found in [Politowski
1986]. The most significant result is that the d-node method dominates both pre-
viously published bidirectional techniques, regardless of heuristic or weighting. In
comparison to de Champeaux’s BHFFA, the d-node method is typically 10 to 20
times faster. This is chiefly because the front-to-front calculations required by
BHFFA are computationally expensive, even though the number of nodes expanded
is roughly comparable for both methods. In comparison to Pohl’s bidirectional al-
gorithm, the d-node method typically solves far more problems, and when solving
the same problems it expands approximately half as many nodes.
8 Next Steps
Improving problem decomposition remains an important underutilized strategy
within heuristic search. Divide and conquer remains a central instrument of intelli-
gent deduction in many formats and arenas. Here the bidirectional search theme and
the LPA* search are instances of naive but effective first steps. Planning is a further
manifestation of divide and conquer. Intuitively, planning amounts to a good choice of lemmas when attempting to construct a difficult proof. Korf’s [Korf 1985a; Korf 1985b] macro operators can be seen as a first step, or the Pohl-Politowski d-node selection criterion as an automated attempt at problem decomposition.
Expanding problem selection to hard domains is vital to demonstrating the rel-
evance of heuristic search. Historically these techniques were developed on puzzle
domains. Here Korf’s recent work [Korf and Zhang 2000] in applying search to genomics problems is welcome. Computational genomics is an unambiguously complex and non-trivial domain that almost certainly requires heuristic search.
Finally, what seems underexploited is the use of a theory of error linked to efficiency. Here my early work and the experimental work of Gaschnig contributed to the deeper studies of Pearl and of Davis [1990]. Questions of worst-case and average-case complexity, and their relation to complexity theory, though followed up by Pearl and by Chenoweth and Davis [1992], are only a beginning. A deeper theory that applies both adversary techniques and metric space analysis is needed.
References
Chenoweth, S., and Davis, H. (1992). New Approaches for Understanding the
Asymptotic Complexity of A* Tree Searching. Annals of Mathematics and AI
5, 133–162.
Davis, H. (1990). Cost-error Relationships in A* Tree Searching. Journal of the
ACM 37 , 195–199.
Davis, H., Pollack, R., and Sudkamp, T. (1984). Toward a Better Understanding
of Bidirectional Search. Proceedings AAAI , pp. 68–72.
De Champeaux, D., and Sint, L. (1977). An improved bi-directional heuristic
search algorithm. Journal of the ACM 24 , 177–191.
Dechter, R., and Pearl, J. (1985). Generalized Best-First Search Strategies and
the Optimality of A*. Journal of the ACM 32 , 505–536.
Dijkstra, E. (1959). A Note on Two Problems in Connexion with Graphs. Numerische Mathematik 1, 269–271.
Doran, J., and Michie, D. (1966). Experiments with the Graph Traverser Program. Proc. of the Royal Society of London 294A, 235–259.
Field, R., Mohyeldin-Said, K., and Pohl, I. (1984). An Investigation of Dynamic
Weighting in Heuristic Search, Proc. 6th ECAI , pp. 277–278.
Part II: Probability
7
Inference in Bayesian Networks:
A Historical Perspective
Adnan Darwiche
1 Introduction
Judea Pearl introduced Bayesian networks as a representational device in the early
1980s, allowing one to systematically and locally assemble probabilistic beliefs into
a coherent whole. While some of these beliefs could be read off directly from the
Bayesian network, many were implied by this representation and required compu-
tational work to be made explicit. Computing and explicating such beliefs has
been the subject of much research and became known as the problem of inference
in Bayesian networks. This problem is critical to the practical utility of Bayesian
networks as the computed beliefs form the basis of decision making, which typically
dictates the need for Bayesian networks in the first place.
Over the last few decades, the interest in inference algorithms for Bayesian net-
works has remained great and has witnessed a number of shifts in emphasis with regard to the adopted computational paradigms and the types of queries addressed. My
goal in this paper is to provide a historical perspective on this line of work and the
associated shifts, where we shall see the key role that Judea Pearl has played in
initiating and inspiring many of the technical developments that have formed and
continue to form the basis of work in this area.
Messages that communicated information from parents to their children were said to quantify the causal support from parents to these children. On
the other hand, messages that communicated information from children to their
parents were said to quantify the diagnostic support from children to parents. The
notions of causal and diagnostic supports were rooted in the causal interpretation
of Bayesian network structures that Pearl insisted on, where parents are viewed as
direct causes of their children. According to this interpretation, the distribution
associated with a node in the Bayesian network is called the belief in that node,
and is a function of the causal support it receives from its direct causes, the diag-
nostic support it receives from its direct effects, and the local information available
about that node. This is why the algorithm is also known as the belief propagation
algorithm, a name which is more common today.
The polytree algorithm has had considerable impact and is of major historical
significance for a number of reasons. First, it was the very first exact inference
algorithm for this class of Bayesian networks. Second, its time and space complexity
were quite modest, being linear in the size of the network. Third, the algorithm
formed the basis for a number of other algorithms, both exact and approximate,
that will be discussed later. In addition, the algorithm provided a first example of
reading off independence information from a network structure, and then using it
to decompose a complex computation into smaller and independent computations.
It formally showed the importance of independence, as portrayed by a network
structure, in driving computation and in reducing the complexity of inference.
One should also note that, according to Pearl, this algorithm was motivated by
the work of [Rumelhart 1976] on reading comprehension, which provided compelling
evidence that text comprehension must be a distributed process that combines both
top-down and bottom-up inferences. This dual mode of inference, so characteristic
of Bayesian analysis, did not match the capabilities of the ruling paradigms for
uncertainty management in the 1970s. This led Pearl to develop the polytree algo-
rithm [Pearl 1986b], which, as mentioned earlier, appeared first in [Pearl 1982] with
a restriction to trees, and then in [Kim and Pearl 1983] for polytrees.
[Figure: a DAG over variables A–H shown alongside a corresponding jointree with clusters DFG, ACE, ADF, AEF, ABD, and EFH.]
(Footnote: The treewidth of a DAG as used here corresponds to the treewidth of its moralized graph, which is obtained by connecting every pair of nodes that share a child in the DAG and then dropping the directionality of all edges.)
[Figure: a chain-structured network over variables A, B, C, D, and E.]
Treewidth provides a measure of how similar a DAG structure is to a tree structure, as it puts a lower bound on the width of any tree clustering (jointree) of the DAG.
The connection between the complexity of inference algorithms and treewidth
is actually the central complexity result that we have today for exact inference
[Dechter 1996]. In particular, given a jointree whose width is w, node marginals
can be computed in time and space that is exponential only in w. Note that a
network treewidth of w guarantees the existence of such a jointree, but finding
it is generally known to be hard. Hence, much work on this topic concerns the
construction of jointrees with minimal width using both heuristics and complete
search methods (see [Darwiche 2009] for a survey).
Recursive conditioning [Darwiche 2001] is a conditioning algorithm that decomposes the network recursively by conditioning on selected variables. In particular, one can guarantee that the space and time requirements of the algorithm are at most exponential in the treewidth of the underlying network structure. This
result assumes that one has access to a decomposition structure, known as a dtree,
which is used to control the decomposition process at each level of the recursive
process [Darwiche 2001]. Similar to a jointree, finding an optimal dtree (i.e., one
that realizes the treewidth guarantee on complexity) is hard. Yet, one can easily
construct such a dtree given an optimal jointree, and vice versa [Darwiche 2009].
Even though recursive conditioning and the jointree algorithm are equivalent from
this complexity viewpoint, recursive conditioning provided some new contributions
to inference. On the theoretical side, it showed that conditioning as an inference
paradigm can indeed reach the same complexity as the jointree algorithm — a
question that was open for some time. Second, the algorithm provided a flexible
paradigm for time-space tradeoffs: by simply controlling the degree of caching,
the space requirements of the algorithm can be made to range from being only
linear in the network size to being exponential in the network treewidth (given an
appropriate dtree). Moreover, the algorithm provided a convenient framework for
exploiting local structure as we shall discuss later.
On another front, and in the continued search of an alternative for the jointree
algorithm, a sequence of efforts culminated into what is known today as the variable
elimination algorithm [Zhang and Poole 1994; Dechter 1996]. According to this al-
gorithm, one maintains the probability distribution of the Bayesian network as a set
of factors (initially the set of CPTs) and then successively eliminates variables from
this set one variable at a time.2 The elimination of a variable can be implemented
by simply combining all factors that mention that variable and then removing the
variable from the combined factor. After eliminating a variable, the resulting fac-
tors represent a distribution over all remaining (un-eliminated) variables. Hence,
by repeating this elimination process, one can obtain the marginal distribution over
any subset of variables, including, for example, marginals over single variables.
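A minimal sketch of this elimination step (binary variables only; the (variables, table) factor representation is an illustrative choice, not the formulation of [Zhang and Poole 1994] or [Dechter 1996]):

from itertools import product

def multiply(f, g):
    # Pointwise product of two factors.  A factor is a pair (variables, table)
    # where table maps an assignment tuple (ordered as in `variables`) to a number.
    fv, ft = f
    gv, gt = g
    vars_ = list(dict.fromkeys(fv + gv))          # union of variables, order-preserving
    table = {}
    for assign in product((0, 1), repeat=len(vars_)):
        a = dict(zip(vars_, assign))
        table[assign] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return vars_, table

def sum_out(f, var):
    # Eliminate `var` from factor f by summing it out.
    fv, ft = f
    idx = fv.index(var)
    keep = [v for v in fv if v != var]
    table = {}
    for assign, value in ft.items():
        key = tuple(x for i, x in enumerate(assign) if i != idx)
        table[key] = table.get(key, 0.0) + value
    return keep, table

def eliminate(factors, order):
    # Variable elimination: for each variable in `order`, combine the factors
    # that mention it and sum the variable out; multiplying the remaining
    # factors yields the (unnormalized) marginal over un-eliminated variables.
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        combined = touching[0]
        for f in touching[1:]:
            combined = multiply(combined, f)
        factors = [f for f in factors if var not in f[0]] + [sum_out(combined, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result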
The main attraction of this computational paradigm is its simplicity — at least as
compared to the initial formulations of the jointree algorithm. Variable elimination,
however, turned out to be no more efficient than the jointree algorithm in the worst
case. In particular, the ideal time and space complexities of the algorithm also
depend on the treewidth — in particular, they are exponential in treewidth when
computing the marginal over a single variable. To achieve this complexity, however,
one needs to use an optimal order for eliminating variables [Bertele and Brioschi
1972]. Again, constructing an optimal elimination order that realizes the treewidth
complexity is hard in general. Yet, one can easily construct such an optimal order
from an optimal jointree or dtree, and vice versa.
Even though variable elimination proved to have the same treewidth complexity
2A factor is a function that maps the instantiations of some set of variables into numbers; see
Figure 5. In this sense, each probability distribution is a factor and so is the marginal of such a
distribution on any set of variables.
[Figure 5. A factor f over binary variables X, Y, and Z, shown as a table and as an Algebraic Decision Diagram with internal nodes labeled X, Y, Z and terminal values .1, .9, and .5.]
  X Y Z   f(.)
  F F F   0.9
  F F T   0.1
  F T F   0.9
  F T T   0.1
  T F F   0.1
  T F T   0.9
  T T F   0.5
  T T T   0.5
as the jointree algorithm, it better explained the semantics of the jointree algorithm,
which can now be understood as a sophisticated form of variable elimination. In
particular, one can interpret the jointree algorithm as a refinement on variable elim-
ination in which: (1) multiple variables can be eliminated simultaneously instead
of one variable at a time; (2) a tree structure is used to control the elimination
process and to save the results of intermediate elimination steps. In particular,
each message passed by the jointree algorithm can be interpreted as the result of
an elimination process, which is saved for re-use when computing marginals over
different sets of variables [Darwiche 2009]. As a result of this refinement, the join-
tree algorithm is able to perform successive invocations of the variable elimination
algorithm, for computing multiple marginals, while incurring the cost of only one
invocation, due mainly to the re-use of results across multiple invocations.
Given our current understanding of the variable elimination and jointree algo-
rithms, one now speaks of only two main computational paradigms for exact prob-
abilistic inference: conditioning algorithms (including loop-cutset conditioning and
recursive conditioning) and elimination algorithms (including variable elimination
and the jointree algorithm).
Exploiting the local structure of a Bayesian network can speed up inference to the point of beating the treewidth barrier, where
local structure refers to the specific properties attained by the probabilities quan-
tifying the network. One of the main intuitions here is that local structure can
imply independence that is not visible at the structural level and this independence
may be utilized computationally [Boutilier et al. 1996]. Another insight is that
determinism in the form of 0/1 probabilities can also be computationally useful as
it allows one to prune possibilities from consideration [Jensen and Andersen 1990].
There are many realizations of these principles today. For elimination algorithms
— which rely heavily on factors and their operations — local structure permits one
to have more compact representations of these factors than representations based
on tables [Zhang and Poole 1996], leading to a more efficient implementation of the
elimination process. One example of this would be the use of Algebraic Decision
Diagrams [R.I. Bahar et al. 1993] and associated operations to represent and ma-
nipulate factors; see Figure 5. For conditioning algorithms, local structure reduces
the number of cases one needs to consider during inference and the number of sub-
computations one needs to cache. As an example of the first, suppose that we have
an and-gate whose output and one of its inputs belong to a loop cutset. When
conditioning the output on 1, both inputs must be 1 as well. Hence, there is no
need to consider multiple values for the input in this case during the conditioning
process [Allen and Darwiche 2003]. This would no longer be true, however, if we
had an or-gate. Moreover, the difference between the two cases is only visible if we
exploit the local structure of corresponding Bayesian networks.
Another effective technique for exploiting local structure, which proved to be a
turning point in speeding up inference, is based on encoding Bayesian networks using
logical constraints and then applying logical inference techniques to the resulting
knowledge base [Darwiche 2002]. One can indeed efficiently encode the network
structure and some of its local structure, including determinism, using knowledge
bases in conjunctive normal form (CNF). One can then either compile the CNF
to produce a circuit representation of the Bayesian network (see below), or apply
model counting techniques and use the results to recover answers to probabilistic
queries [Sang, Beame, and Kautz 2005].
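As a toy illustration of the model-counting route (a brute-force sketch; practical systems along the lines of [Sang, Beame, and Kautz 2005] use dedicated weighted model counters rather than enumeration, together with a specific CNF encoding of the network):

from itertools import product

def weighted_model_count(clauses, weights, n_vars):
    # Sum, over all assignments that satisfy every clause, the product of the
    # per-literal weights.  `clauses` is a list of lists of signed variable
    # indices (e.g. [1, -2] means x1 OR NOT x2); weights[lit] is a literal weight.
    total = 0.0
    for bits in product((False, True), repeat=n_vars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        if all(any(assign[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses):
            w = 1.0
            for v, b in assign.items():
                w *= weights[v if b else -v]
            total += w
    return total

For instance, weighted_model_count([[1, 2]], {1: 0.6, -1: 0.4, 2: 0.7, -2: 0.3}, 2) returns 0.88, the probability of x1 OR x2 when the literal weights encode independent probabilities 0.6 and 0.7.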
Realizations of the above techniques became practically viable long after the ini-
tial observations about local structure, but have allowed one to reason efficiently
with some networks whose treewidth can be quite large (e.g., [Chavira, Darwiche,
and Jaeger 2006]). Although there is some understanding of the kind of networks
that tend to lend themselves to these techniques, we still do not have strong theoret-
ical results that characterize these classes of networks and the savings that one may
expect from exploiting their local structure. Moreover, not enough work exists on
complexity measures that are sensitive to both network structure and parameters
(the treewidth is only sensitive to structure).
One step in this direction has been the use of arithmetic circuits to compactly
represent the probability distributions of Bayesian networks [Darwiche 2003]. This
[Figure 6. A Bayesian network and a corresponding arithmetic circuit. The network has variables A, B, and C, with B and C children of A, and CPTs: Pr(A = true) = .5; Pr(B | A) = 1 when B = A and 0 otherwise; Pr(C | A) = .8 when C = A and .2 otherwise. The circuit is built from + and * nodes over indicator inputs (e.g. !a, !b, !c) and the network parameters (.5, .8, .2).]
The MPE problem (Most Probable Explanation) asks for a most likely instantiation of all network variables, given evidence that fixes some of the variables to some given value. Pearl actually proposed the first algorithm for this purpose,
which was a variation on the polytree algorithm [Pearl 1987a].
A more general problem is MAP which stands for Maximum a Posteriori hypoth-
esis. This problem searches for an instantiation of a subset of the network variables
that is most probable. Interestingly, MAP and MPE are complete for two different
complexity classes, which are also distinct from the class to which node marginals
is complete for. In particular, given the standard assumptions of complexity the-
ory, MPE is the easiest and MAP is the most difficult, with node marginals in the
middle.4
The standard techniques based on variable elimination and conditioning can solve
MPE and MAP as well [Dechter 1999]. MPE can be solved with the standard
treewidth guarantee. MAP, however, has a worse complexity in terms of what
is known as constrained treewidth, which depends on both the network topology
and MAP variables (that is, variables for which we are trying to find a most likely
instantiation) [Park and Darwiche 2004]. The constrained treewidth can be much
larger than treewidth, depending on the set of MAP variables.
MPE and MAP problems have search components which lend themselves to
branch-and-bound techniques [Kask and Dechter 2001]. Over the years, many so-
phisticated MPE and MAP bounds have been introduced, allowing branch-and-
bound solvers to prune the search space more effectively. Consequently, this allows
one to solve some MPE and MAP problems efficiently, even when the network
treewidth or constrained treewidth are relatively high. In fact, only relatively re-
cently did practical MAP algorithms surface, due to some innovative bounds that
were employed in branch-and-bound algorithms [Park and Darwiche 2003].
MPE algorithms have traditionally received more attention than MAP algo-
rithms. Recently, techniques based on LP relaxations, in addition to reductions
to the MAXSAT problem, have been employed successfully for solving MPE. LP
relaxations are based on the observation that MPE has a straightforward formu-
lation in terms of integer programming, which is known to be hard [Wainwright,
Jaakkola, and Willsky 2005; Yanover, Meltzer, and Weiss 2006]. By relaxing the
integral constraints, the problem becomes a linear program, which is tractable but
provides only a bound for MPE. Work in this area has been focused on techniques
that compensate partially for the lost integral constraints using larger linear pro-
grams, and on developing refined algorithms for handling the resulting “specialized”
linear programs.5 The MAXSAT problem has also been receiving a lot of attention
in the logic community [Bonet, Levy, and Manyà 2007; Larrosa, Heras, and de Givry
2008], which developed effective techniques for this purpose. In fact, reductions of
certain MPE problems (those with excessive logical constraints) to MAXSAT seem to be particularly effective.
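As a hedged sketch of the kind of LP relaxation mentioned above (written for a pairwise decomposition of the log-probability, in the spirit of the cited LP work, rather than the exact formulation of any particular solver):

\begin{align*}
\max_{\mu}\ & \sum_{i}\sum_{x_i} \theta_i(x_i)\,\mu_i(x_i) \;+\; \sum_{(i,j)}\sum_{x_i,x_j} \theta_{ij}(x_i,x_j)\,\mu_{ij}(x_i,x_j) \\
\text{s.t.}\ & \sum_{x_i}\mu_i(x_i)=1,\qquad \sum_{x_j}\mu_{ij}(x_i,x_j)=\mu_i(x_i),\qquad \mu\in\{0,1\}.
\end{align*}

Relaxing the integrality constraint to \mu \ge 0 turns the integer program into a linear program whose optimal value upper-bounds the MPE value.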
4 The decision problems for MPE, node marginals, and MAP are NP-complete, PP-complete, and NP^PP-complete, respectively.
[Figure: a small network over nodes A, B, C, D, and E with a numbered sequence (1–10) of messages passed along its edges.]
to light around the mid 1990s, almost a decade after Pearl first suggested the al-
gorithm. Work on LBP and related methods has been dominating the field of
approximate inference for more than a decade now. One of the central questions
was: if LBP converges, what is it converging to? This question was answered in a
number of ways [Minka 2001; Wainwright, Jaakkola, and Willsky 2003; Choi and
Darwiche 2006], but the first characterization was put forth in [Yedidia, Freeman,
and Weiss 2000]. According to this characterization, one can understand LBP as
approximating the distribution of a Bayesian network by a distribution that has
a polytree structure [Yedidia, Freeman, and Weiss 2003]. The iterations of the
algorithm can then be interpreted as searching for the node marginals of that ap-
proximate distribution, while minimizing the KL–divergence between the original
and approximate distributions.
LBP actually has two built-in components. The first corresponds to a particular
approximation that it seeks, which is formally characterized as discussed before. The
second component is a particular method for seeking the approximation, through
a process of message passing. One can try to seek the same approximation using
other optimization methods, which has also been the subject of much research. Even
the message passing scheme leaves a lot of room for variation, which is captured
formally using the notion of a message passing schedule — for example, messages
can be passed sequentially, in parallel, or in combinations thereof. One therefore
talks about the “convergence” properties of such algorithms, where the goal is to
seek methods that have better convergence properties.
LBP turns out to be an example of a more general class of approximation algo-
rithms that poses the approximate inference problem as a constrained optimization
problem. These methods, which are sometimes known as variational algorithms,
assume a tractable class of distributions, and seek to find an instance in this
class that best fits the original distribution [Jordan et al. 1999; Jaakkola 2001].
For example, we may want to assume an approximating Bayesian network that is
fully-disconnected, and that the distribution it induces should have as small a KL–
divergence as possible, when compared to the distribution being approximated. The
goal of the constrained optimization problem is then to find the CPT parameters of
the approximate network that minimize the KL–divergence between it and the orig-
inal network (subject to the appropriate normalization constraints). Work in this
area typically varies across two dimensions: proposing forms for the approximating
distribution, and devising methods for solving the corresponding optimization prob-
lem. Moreover, by varying these two dimensions, we are given access to a spectrum
of approximations, where we are able to trade the quality of an approximation with
the complexity of computing it.
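To make the fully-disconnected example concrete, a standard way of writing that optimization (a sketch in the generic notation of the variational literature, e.g. [Jordan et al. 1999]) takes q(\mathbf{x}) = \prod_i q_i(x_i) and solves

\begin{align*}
q^{\star} \;=\; \arg\min_{q}\ \mathrm{KL}(q \,\|\, p) \;=\; \arg\min_{q} \sum_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})},
\end{align*}

where holding all q_j, j \ne i, fixed yields the familiar coordinate update q_i(x_i) \propto \exp\!\big(\mathbb{E}_{q_{-i}}[\log p(x_i, \mathbf{x}_{-i})]\big).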
8 Closing Remarks
During the first decade or two after Pearl’s introduction of Bayesian networks, infer-
ence research was very focused on exact algorithms. The efforts on these algorithms
slowed down towards the mid to late 1990s, to pick up again early in the century.
The slowdown was mostly due to the treewidth barrier, at a time when networks were being constructed that were large enough to make the standard algorithms impractical. The main developments leading to the revival of exact inference algorithms have been the extended reach of conditioning methods, the deeper understanding of
elimination methods, and the more effective exploitation of local structure. Even
though these developments have increased the reach of exact algorithms consid-
erably, we still do not understand the extent to which this reach can be pushed
further. In particular, the main hope appears to be in further utilization of local
structure to speed up inference, but we clearly need better theories for providing
guarantees on such speedups and a better characterization of the networks that lend
themselves to such techniques.
On the approximate inference side, stochastic simulation methods witnessed a
surge after the initial work on this subject, with continued interest throughout, yet
not to the level enjoyed recently by methods based on belief propagation and related
methods. This class of algorithms remains dominant, with many questions begging
for answers. On the theoretical side, we do not seem to know enough about when approximations tend to give good answers, especially as this seems to be tied not only to the given network but also to the posed query. On the practical side,
we have yet to translate some of the theoretical results on generalizations of belief
propagation — which provide a spectrum that trades off approximation quality
with computational resources — into tools that are used routinely by practitioners.
There has been a lot of progress on inference in Bayesian networks since Pearl
first made this computational problem relevant. There is clearly a lot more to be
done as we seem to always exceed the ability of existing algorithms by building
more complex networks. In my opinion, however, what is greatly missed since
Pearl’s initial work on this subject is his insistence on semantics, where he spared no
effort in establishing connections to cognition, and in grounding the most intricate
mathematical manipulations in human intuition. The derivation of the polytree
algorithm stands as a great example of this research methodology, as it provided
high level and cognitive interpretations of almost all intermediate computations
performed by the algorithm. It is no wonder then that the polytree algorithm not
only started the area of inference in Bayesian networks a few decades ago, but it
also remains a basis for some of the latest developments and inspirations in this
area of research.
Acknowledgments: I wish to thank Arthur Choi for many valuable discussions
while writing this article.
References
Allen, D. and A. Darwiche (2003). New advances in inference by recursive con-
ditioning. In Proceedings of the Conference on Uncertainty in Artificial Intel-
ligence, pp. 2–10.
Bahar, R. I., E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, and F. Somenzi (1993). Algebraic Decision Diagrams and Their Applications. In IEEE/ACM International Conference on CAD, Santa Clara, California, pp. 188–191. IEEE Computer Society Press.
Robertson, N. and P. D. Seymour (1986). Graph minors. II. Algorithmic aspects
of tree-width. J. Algorithms 7, 309–322.
Rumelhart, D. (1976). Toward an interactive model of reading. Technical Report
CHIP-56, University of California, La Jolla, La Jolla, CA.
Sang, T., P. Beame, and H. Kautz (2005). Solving Bayesian networks by weighted
model counting. In Proceedings of the Twentieth National Conference on Ar-
tificial Intelligence (AAAI-05), Volume 1, pp. 475–482. AAAI Press.
Wainwright, M. J., T. Jaakkola, and A. S. Willsky (2003). Tree-based reparam-
eterization framework for analysis of sum-product and related algorithms.
IEEE Transactions on Information Theory 49 (5), 1120–1146.
Wainwright, M. J., T. Jaakkola, and A. S. Willsky (2005). Map estimation via
agreement on trees: message-passing and linear programming. IEEE Trans-
actions on Information Theory 51 (11), 3697–3717.
Yanover, C., T. Meltzer, and Y. Weiss (2006). Linear programming relaxations
and belief propagation — an empirical study. Journal of Machine Learning
Research 7, 1887–1907.
Yedidia, J. S., W. T. Freeman, and Y. Weiss (2000). Generalized belief propaga-
tion. In NIPS, pp. 689–695.
Yedidia, J. S., W. T. Freeman, and Y. Weiss (2003). Understanding belief propa-
gation and its generalizations. In G. Lakemeyer and B. Nebel (Eds.), Exploring
Artificial Intelligence in the New Millennium, Chapter 8, pp. 239–269. Morgan
Kaufmann.
Zhang, N. L. and D. Poole (1994). A simple approach to Bayesian network com-
putations. In Proceedings of the Tenth Conference on Uncertainty in Artificial
Intelligence (UAI), pp. 171–178.
Zhang, N. L. and D. Poole (1996). Exploiting causal independence in Bayesian
network inference. Journal of Artificial Intelligence Research 5, 301–328.
8
Graphical Models of the Visual Cortex
Thomas Dean
Moreover, the theory makes no mention of how a robot might learn such a model,
and, from years of working with robots, I was convinced that building a model by
hand would turn out to be a lot of work and very likely prove to be unsuccessful.
Here it was Judea’s graphical-models perspective that, initially, made it easy for me
to think about David’s work, and, later, extend it. I also came to appreciate the
relevance of Judea’s work on causality and, in particular, the role of intervention
in thinking about how biological systems engage the world to resolve perceptual
ambiguity.
This chapter concerns how probabilistic graphical models might be used to model
the visual cortex, and how the challenges faced in developing such models suggest
areas where current theory falls short and might be extended. A graphical model is a
useful formalism for compactly describing a joint probability distribution character-
ized by a very large number of random variables. We are taking what is known about
the anatomy and physiology of the primate visual cortex and attempting to apply
that knowledge to construct probabilistic graphical models that we can ultimately
use to simulate some functions of primate vision. It may be that the resulting prob-
abilistic model also captures some important characteristics of individual neurons
or their ensembles. For practical purposes, this need not be the case, though clearly
we believe there are potential advantages to incorporating some lessons from biol-
ogy into our models. Graphical models also suggest, but do not dictate, how one
might use such a model along with various algorithms and computing hardware to
perform inference and thereby carry out practical simulations. It is this latter use
of graphical models that we refer to when we talk about implementing a model of
the visual cortex.
P(xO, xV1, xV2, xV4, xIT) = P(xO | xV1) P(xV1 | xV2) P(xV2 | xV4) P(xV4 | xIT) P(xIT)
4 Temporal Relationships
Each neuron in the visual cortex indirectly receives input from some, typically con-
tiguous, region of retinal ganglion cells. This region is called the neuron’s receptive
field . By introducing lags and thereby retaining traces of earlier stimuli, a neuron
can be said to have a receptive field that spans both space and time — it has a
spatiotemporal receptive field. A large fraction of the cells in visual cortex and V1
in particular have spatiotemporal receptive fields. Humans, like most animals, are
very attentive to motion and routinely exploit motion to resolve visual ambiguity,
visual experience that biology has evolved to exploit to its advantage. However, in
this chapter, I want to explore a different facet of how we make sense of and, in some
cases, take advantage of spatial and temporal structure to survive and thrive, and
how these aspects of our environment offer new challenges for applying graphical
models.
segments as part of inference is that this model is potentially more elegant, and
even biologically plausible, in that the recursive process might be represented as a
single hierarchical graphical model allowing inference over the entire graph, rather
than over sequences of ever more refined graphs.
The above discussion of segmentation is but one example in which nodes in a
graphical model might serve as generic variables that are bound as required by
circumstances. But perhaps this view is short sighted; why not just assume that
there are enough nodes that every possible (visual) concept corresponds to a unique
combination of existing nodes. In this view, visual interpretation is just mapping
visual stimuli to the closest visual “memory”. Given the combinatorics, the only way
this could be accomplished is to use a hierarchy of features whose base layer consists
of small image fragments at many different spatial scales, and all subsequent layers
consist of compositions of features at layers lower in the hierarchy [Bienenstock and
Geman 1995; Ullman and Soloviev 1999; Ullman, Vidal-Naquet, and Sali 2002].
This view accords well with the idea that most visual stimuli are not determined
to be novel and, hence, we construct our reality from bits and pieces of existing
memories [Hoffman 1998]. Our visual memories are so extensive that we can almost
always create a plausible interpretation by recycling old memories. It may be that
in some aspects of cognition we have to employ generic neural structures to perform
the analog of binding variables, but for much of visual intelligence this may not be
necessary given a large enough memory of reusable fragments. This raises the
question of how we might implement a graphical model that has anywhere near the
capacity of the visual cortex.
ular, evidence suggests the inter-columnar connection graph has low diameter (the
length of the longest shortest path separating a pair of vertices in the graph) thereby
enabling relatively low-latency communication between any two cortical columns.
It is estimated that there are about a quarter of a billion neurons in the primary
visual cortex — think V1 through V4 — counting both hemispheres, but probably
only around a million or so cortical columns. If we could roughly model each cortical
column with a handful of random variables, then it is at least conceivable that we
could implement a graphical model of early vision.
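To make the scale concrete, the rough figures quoted above can be turned into a back-of-envelope estimate; in the sketch below the five variables per column is an assumed stand-in for "a handful":

```python
# Back-of-envelope arithmetic using the rough figures quoted in the text above.
neurons = 250_000_000      # roughly a quarter of a billion neurons, V1 through V4
columns = 1_000_000        # roughly a million cortical columns
vars_per_column = 5        # assumed stand-in for "a handful"

print(neurons // columns, "neurons per column, on average")
print(columns * vars_per_column, "random variables in the hypothetical model")
```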
To actually implement a graphical model of visual cortex using current technol-
ogy, the computations would have to be distributed over many machines. Training
such a model might not take as long as raising a child, but it could take many
days — if not years — using current computer technology, and, once trained,
we presumably would like to apply the learned model for much longer. Given such
extended intervals of training and application, since the mean-time-til-failure for
the commodity-hardware-plus-software that comprise most distributed processing
clusters is relatively short, we would have to allow for some means of periodically
saving local state in the form of the parameters quantifying the model.
The data centers that power the search engines of Google, Yahoo! and Microsoft
are the best bet that we currently have for such massive and long-lived computa-
tions. Software developed to run applications on such large server farms already
have tools that could opportunistically allocate resources to modify the structure of
graphical model in an analog of neurogenesis. These systems are also resistant to
both software and equipment failures and capable of reallocating resources in the
aftermath of catastrophic failure to mimic neural plasticity in the face of cell death.
In their current configuration, industrial data centers may not be well suited to
the full range of human visual processing. Portions of the network that handle very
early visual processing will undoubtedly require shorter latencies than is typical in
such server farms, even among machines on the same rack connected with high-
speed Ethernet. Riesenhuber and Poggio [1999] use the term immediate recognition
to refer to object recognition and scene categorization that occur in the first 100-
200ms or so from the onset of the stimuli. In that short span of time, less
than the time it takes for a typical saccade, we do an incredibly accurate job of
recognizing objects and inferring the gist of a scene. The timing suggests that only
a few steps of neural processing are involved in this form of recognition, assuming
10–20ms per synaptic transmission, though given the small diameter of the inter-
columnar connection graph, many millions of neurons are likely involved in the
processing. It would seem that at least the earliest stages of visual processing will
have to be carried out in architectures capable of performing an enormous number
of computations involving a large amount of state — corresponding to existing
pattern memory — with very low latencies among the processing units. Hybrid
architectures that combine conventional processors with co-processors that provide
fast matrix-matrix and matrix-vector operations will likely be necessary to handle
A typical saccade of, say, 18° of visual angle takes 60–80ms to complete [Harwood,
Mezey, and Harris 1999], a period during which we are essentially blind. During
the subsequent 200–500ms interval until the next saccade, the image on the fovea is
relatively stable, aside from small adjustments due to microsaccades. So even
a rough model for the simplest sort of human visual processing has to be set against
the background of two or three fixations per second, each spanning less than half a
second, and separated by short — less than 1/10 of a second — periods of blindness.
During each fixation we have 200–500ms in which to make sense of the events
projected on the fovea; simplifying enormously, that’s time enough to view around
10–15 frames of a video shown at 30 frames per second. In most of our experience,
during such a period there is a lot going on in our visual field; our eyes, head
and body are often moving and the many objects in our field of view are also in
movement, more often than not, moving independently of one another. Either by
focusing on a small patch of an object that is motionless relative to our frame
of reference or by performing smooth pursuit, we have a brief period in which
to analyze what amounts to a very short movie as seen through a tiny aperture.
Most individual neurons have receptive fields that span an even smaller spatial and
temporal extent.
If we try to interpret movement with too restrictive a spatial extent, we can
mistake the direction of travel of a small patch of texture. If we try to work on
too restrictive a temporal extent, then we are inundated with small movements
many of which are due to noise or uninteresting as they arise from the analog of
smooth camera motion. During that half second or so we need to identify stable
artifacts, consisting of the orientation, direction, velocity, etc., of small patches
of texture and color, and then combine these artifacts to capture features of the
somewhat larger region of the fovea we are fixating on. Such a combination need
not entail recognizing shape; it could, for example, consist of identifying a set of
candidate patches, that may or may not belong to the same object, and summarizing
the processing performed during the fixation interval as a collection of statistics
pertaining to such patches, including their relative — but not absolute — positions,
velocities, etc.
In parallel with processing foveal stimuli, attentional machinery in several neural
circuits and, in particular, the lateral intraparietal cortex — which is retinotopically
mapped when the eyes are fixated — estimates the saliency of spatial locations
throughout the retina, including its periphery where acuity and color sensitivity
are poor. These estimates of “interestingness” are used to decide what location to
saccade to next. The oculomotor system keeps track of the dislocations associated
with each saccade, and this locational information can be fused together using
statistics collected over a series of saccades. How such information is combined and
the exact nature of the resulting internal representations is largely a mystery.
The main point of the above discussion is that, while human visual processing
may begin early in the dorsal and ventral pathways with something vaguely related
world in which we operate. The brain maintains detailed maps of the body and
its surrounding physical space in the hippocampus and somatosensory, motor, and
parietal cortex [Rizzolatti, Sinigaglia, and Anderson 2007; Blakeslee and Blakeslee
2007]. Recall that the dorsal — “where” and “how” — visual pathway leads to the
parietal cortex, which plays an important role in visual attention and our perception
of shape. These maps are dynamic, constantly adapting to changes in the body as
well as reflecting both short- and long-term knowledge of our surroundings and
related spatial relationships.
When attempting to gain insight from biology in building engineered vision sys-
tems, it is worth keeping in mind the basic tasks of evolved biological vision sys-
tems. Much of primate vision serves three broad and overlapping categories of tasks:
recognition, navigation and manipulation. Recognition for foraging, mating, and a
host of related social and survival tasks; navigation for exploration, localization and
controlling territory; manipulation for grasping, climbing, throwing, tool making,
etc.
The view [Lengyel 1998] that computer vision is really just inverse graphics ig-
nores the fact that most of these tasks don’t require you to be able to construct an
accurate 3-D representation of your visual experience. For many recognition tasks
it suffices to identify objects, faces, and landmarks you’ve seen before and associate
with these items task-related knowledge gained from prior experience. Navigation
to avoid obstacles requires the ability to determine some depth information but
not necessarily to recover full 3-D structure. Manipulation is probably the most
demanding task in terms of the richness of shape information apparently required,
but even so it may be that we are over-emphasizing the role of static shape memory
and under-emphasizing the role of dynamic visual servoing — see the discussion
in [Rizzolatti, Sinigaglia, and Anderson 2007] for an excellent introduction to what
is known about how we understand shape in terms of affordances for manipulation.
But when it comes right down to it, we don’t know a great deal about how
the visual system handles shape [Tarr and Bülthoff 1998] despite some tantalizing
glimpses into what might be going on in the inferotemporal cortex [Tsunoda, Yamane,
Nishizaki, and Tanifuji 2001; Yamane, Tsunoda, Matsumoto, Phillips, and Tanifuji
2006]. Let’s suppose for the sake of discussion that we can build a graphical model
of the cortex that handles much of the low-level feature extraction managed by the
early visual pathways (V1 through V4) using existing algorithms for performing
inference on Markov and conditional random fields and related graphical models.
How might we construct a graphical model that captures the part of visual memory
that pools together all these low-level features to provide us with such a rich visual
experience? Lacking any clear direction from computational neuroscience, we’ll take
a somewhat unorthodox path from here on out.
As mentioned earlier, several popular web sites offer rich visual experiences that
are constructed by combining large image corpora. Photo-sharing web sites like
Flickr, Google Picasa and Microsoft Live Labs PhotoSynth are able to combine
some fixed-width receptive field and relates them by using low-level features ex-
tracted in V1 through V4 as keypoints to estimate geometric and other meaningful
relationships among patches? The use of the word “novel” in this context is meant
to convey that some method for statistical pooling of similar patches is required
to avoid literally storing every possible patch. This is essentially what Jing and
Baluja [2008] do by taking a large corpus of images, extracting low-level features
from each image, and then quantifying the similarity between pairs of images by
analyzing the features that they have in common. The result is a large graph whose
vertices are images and whose edges quantify pair-wise similarity (see Figure 6). By
using the low-level features as indices, Jing and Baluja only have to search a small
subset of the possible pairs of images, and of those only the ones that pass a specified
threshold for similarity are connected by edges. Jing and Baluja further enhance
the graph by using a form of spectral graph analysis to rank images in much the
same way as Google ranks web pages. Torralba et al. [2007] have demonstrated that
even small image patches contain a great deal of useful information, and further-
more that very large collections of images can be quickly and efficiently searched
to retrieve semantically similar images given a target image as a query [Torralba,
Fergus, and Weiss 2008].
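As a rough illustration of the graph-building-and-ranking recipe just described, the sketch below builds an image graph from bags of quantized features and then ranks its vertices with a PageRank-style recursion. It is a hypothetical toy, not Jing and Baluja's implementation; the Jaccard-style similarity, the 0.2 threshold, and the damping factor are assumptions made for the example.

```python
# Hypothetical toy: build an image similarity graph and rank it PageRank-style.
from collections import defaultdict

def build_image_graph(features, threshold=0.2):
    """features: dict image_id -> set of quantized feature ids (assumed input)."""
    posting = defaultdict(set)            # feature -> images containing it
    for img, feats in features.items():
        for f in feats:
            posting[f].add(img)
    edges = defaultdict(dict)
    for imgs in posting.values():         # only compare images sharing a feature
        for a in imgs:
            for b in imgs:
                if a < b:
                    sim = len(features[a] & features[b]) / len(features[a] | features[b])
                    if sim >= threshold:
                        edges[a][b] = sim
                        edges[b][a] = sim
    return edges

def rank(edges, damping=0.85, iters=50):
    """PageRank-style ranking; transition weights follow edge similarity."""
    nodes = list(edges)
    r = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        r = {v: (1 - damping) / len(nodes)
                + damping * sum(r[u] * edges[u][v] / sum(edges[u].values())
                                for u in edges[v])
             for v in nodes}
    return sorted(r.items(), key=lambda kv: -kv[1])

toy = {"img1": {1, 2, 3}, "img2": {2, 3, 4}, "img3": {3, 4, 5}}
print(rank(build_image_graph(toy)))
```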
In principle, such a graph could be represented as a probabilistic graphical model
and the spectral analysis reformulated in terms of inference on graphical models.
The process whereby the graph is grown over time, incorporating new images and
new relationships, currently cannot be formulated as inference on a graphical model,
but it is interesting to speculate about very large, yet finite graphs that could evolve
over time in response to new evidence. Learning the densities used to quantify the
edges in graphical models can be formulated in terms of hyper-parameters
directly incorporated into the model and carried out by traditional inference algo-
rithms [Buntine 1994; Heckerman 1995]. Learning graphs whose size and topol-
ogy change over time is somewhat more challenging to cast in terms of traditional
methods for learning graphical models. Graph size is probably not the determining
technical barrier however. Very large graphical models consisting of documents,
queries, genes, and other entities are now quite common, and, while exact inference
in such graphs is typically infeasible, approximate inference is often good enough
to provide the foundation for industrial-strength tools.
Unfortunately, there is no way to tie up the many loose ends which have been
left dangling in this short survey. Progress depends in part on our better under-
standing the brain and in particular the parts of the brain that are further from
the periphery of the body where our senses are directly exposed to external stimuli.
Neuroscience has made significant progress in understanding the brain at the cel-
lular and molecular level, even to the point that we are now able to run large-scale
simulations with some confidence that our models reflect important properties of
the biology. Computational neuroscientists have also made considerable progress
developing models — and graphical models in particular — that account for fea-
tures that appear to play an important role in early visual processing. The barrier
to further progress seems to be the same impediment that we run into in so many
other areas of computer vision, machine learning and artificial intelligence more
generally, namely the problem of representation. How and what does the brain rep-
resent about the blooming, buzzing world in which we are embedded? The answer
to that question will take some time to figure out, but no doubt probabilistic graph-
ical models will continue to provide a powerful tool in this inquiry, thanks in no
small measure to the work of Judea Pearl, his students and his many collaborators.
References
Anderson, J. and J. Sutton (1997). If we compute faster, do we understand better?
Behavior Research Methods, Instruments and Computers 29, 67–77.
Bengio, Y., P. Lamblin, D. Popovici, and H. Larochelle (2007). Greedy layer-
wise training of deep networks. In Advances in Neural Information Processing
Systems 19, pp. 153–160. Cambridge, MA: MIT Press.
Bienenstock, E. and S. Geman (1995). Compositionality in neural systems. In
M. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, pp.
223–226. Bradford Books/MIT Press.
Blakeslee, S. and M. Blakeslee (2007). The Body Has a Mind of Its Own. Random
House.
Brady, N. and D. J. Field (2000). Local contrast in natural images: normalisation
and coding efficiency. Perception 29 (9), 1041–1055.
Buntine, W. L. (1994). Operations for learning with graphical models. Journal
of Artificial Intelligence Research 2, 159–225.
Cadieu, C. and B. Olshausen (2008). Learning transformational invariants from
time-varying natural images. In D. Schuurmans and Y. Bengio (Eds.), Ad-
vances in Neural Information Processing Systems 21. Cambridge, MA: MIT
Press.
Dean, T. (2005). A computational model of the cerebral cortex. In Proceedings
of AAAI-05, Cambridge, Massachusetts, pp. 938–943. MIT Press.
Dean, T. (2006, August). Learning invariant features using inertial priors. Annals
of Mathematics and Artificial Intelligence 47 (3-4), 223–250.
Dean, T., G. Corrado, and R. Washington (2009, December). Recursive sparse,
spatiotemporal coding. In Proceedings of the Fifth IEEE International Work-
shop on Multimedia Information Processing and Retrieval.
Dean, T. and M. Wellman (1991). Planning and Control. San Francisco, Califor-
nia: Morgan Kaufmann Publishers.
DeGroot, M. (1970). Optimal Statistical Decisions. New York: McGraw-Hill.
Jing, Y. and S. Baluja (2008). Pagerank for product image search. In Proceedings
of the 17th World Wide Web Conference.
Kivinen, J. J., E. B. Sudderth, and M. I. Jordan (2007). Learning multiscale
representations of natural scenes using Dirichlet processes. In Proceedings of
the 11th IEEE International Conference on Computer Vision.
Konen, C. S. and S. Kastner (2008). Two hierarchically organized neural systems
for object information in human visual cortex. Nature Neuroscience 11 (2),
224–231.
LeCun, Y. and Y. Bengio (1995). Convolutional networks for images, speech, and
time-series. In M. Arbib (Ed.), The Handbook of Brain Theory and Neural
Networks. Bradford Books/MIT Press.
Lee, T. S. and D. Mumford (2003). Hierarchical Bayesian inference in the visual
cortex. Journal of the Optical Society of America 2 (7), 1434–1448.
Lengyel, J. (1998). The convergence of graphics and vision. Computer 31 (7),
46–53.
Mountcastle, V. B. (2003, January). Introduction to the special issue on compu-
tation in cortical columns. Cerebral Cortex 13 (1), 2–4.
Mumford, D. (1991). On the computational architecture of the neocortex I: The
role of the thalamo-cortical loop. Biological Cybernetics 65, 135–145.
Mumford, D. (1992). On the computational architecture of the neocortex II: The
role of cortico-cortical loops. Biological Cybernetics 66, 241–251.
Newman, M., D. Watts, and S. Strogatz (2002). Random graph models of social
networks. Proceedings of the National Academy of Science 99, 2566–2572.
Olshausen, B. and C. Cadieu (2007). Learning invariant and variant components
of time-varying natural images. Journal of Vision 7 (9), 964–964.
Olshausen, B. A., A. Anderson, and D. C. Van Essen (1993). A neurobiological
model of visual attention and pattern recognition based on dynamic routing
of information. Journal of Neuroscience 13 (11), 4700–4719.
Olshausen, B. A. and D. J. Field (2005). How close are we to understanding V1?
Neural Computation 17, 1665–1699.
Osindero, S. and G. Hinton (2008). Modeling image patches with a directed
hierarchy of Markov random fields. In J. Platt, D. Koller, Y. Singer, and
S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, pp.
1121–1128. Cambridge, MA: MIT Press.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plau-
sible Inference. San Francisco, California: Morgan Kaufmann.
Ranzato, M., Y. Boureau, and Y. LeCun (2007). Sparse feature learning for deep
belief networks. In Advances in Neural Information Processing Systems 20.
Cambridge, MA: MIT Press.
Riesenhuber, M. and T. Poggio (1999, November). Hierarchical models of object
recognition in cortex. Nature Neuroscience 2 (11), 1019–1025.
Rizzolatti, G., C. Sinigaglia, and F. Anderson (2007). Mirrors in the Brain: How
Our Minds Share Actions, Emotions, and Experience. Oxford, UK: Oxford
University Press.
Rumelhart, D. E. and J. L. McClelland (Eds.) (1986). Parallel Distributed Pro-
cessing: Explorations in the Microstructure of Cognition, Volume I: Founda-
tions. Cambridge, Massachusetts: MIT Press.
Russell, S. and P. Norvig (2003). Artificial Intelligence: A Modern Approach.
Upper Saddle River, NJ: Prentice Hall, second edition.
Saxena, A., S. Chung, and A. Ng (2007). 3-D depth reconstruction from a single
still image. International Journal of Computer Vision 76 (1), 53–69.
Sporns, O. and J. D. Zwi (2004). The small world of the cerebral cortex. Neu-
roinformatics 2 (2), 145–162.
Tarr, M. and H. Bülthoff (1998). Image-based object recognition in man, monkey
and machine. Cognition 67, 1–20.
Tenenbaum, J. and H. Barrow (1977). Experiments in interpretation-guided seg-
mentation. Artificial Intelligence 8, 241–277.
Torralba, A., R. Fergus, and W. Freeman (2007). Object and scene recognition
in tiny images. Journal of Vision 7 (9), 193–193.
Torralba, A., R. Fergus, and Y. Weiss (2008). Small codes and large image
databases for recognition. In Proceedings of IEEE Computer Vision and Pat-
tern Recognition, pp. 1–8. IEEE Computer Society.
Tsunoda, K., Y. Yamane, M. Nishizaki, and M. Tanifuji (2001). Complex ob-
jects are represented in macaque inferotemporal cortex by the combination of
feature columns. Nature Neuroscience 4, 832–838.
Ullman, S. and S. Soloviev (1999). Computation of pattern invariance in brain-
like structures. Neural Networks 12, 1021–1036.
Ullman, S., M. Vidal-Naquet, and E. Sali (2002). Visual features of intermediate
complexity and their use in classification. Nature Neuroscience 5 (7), 682–687.
Wiskott, L. and T. Sejnowski (2002). Slow feature analysis: Unsupervised learn-
ing of invariances. Neural Computation 14 (4), 715–770.
Yamane, Y., K. Tsunoda, M. Matsumoto, N. A. Phillips, and M. Tanifuji (2006).
Representation of the spatial relationship among object parts by neurons in
macaque inferotemporal cortex. Journal of Neurophysiology 96, 3147–3156.
1 Introduction
In his seminal paper, Pearl [1986] introduced the notion of Bayesian networks and
the first processing algorithm, Belief Propagation (BP), that computes posterior
marginals, called beliefs, for each variable when the network is singly connected.
The paper provided the foundation for the whole area of Bayesian networks. It was
the first in a series of influential papers by Pearl, his students and his collaborators
that culminated a few years later in his book on probabilistic reasoning [Pearl 1988].
In his early paper Pearl showed that for singly connected networks (e.g., polytrees)
the distributed message-passing algorithm converges to the correct marginals in a
number of iterations equal to the diameter of the network. In his book Pearl goes
further to suggest the use of BP for loopy networks as an approximation algorithm
(see page 195 and exercise 4.7 in [Pearl 1988]). During the decade that followed
researchers focused on extending BP to general loopy networks using two principles.
The first is tree-clustering, namely, the transformation of a general network into a
tree of large-domain variables called clusters on which BP can be applied. This led
to the join-tree or junction-tree clustering and to the bucket-elimination schemes
[Pearl 1988; Dechter 2003] whose time and space complexity is exponential in the
tree-width of the network. The second principle is that of cutset-conditioning that
decomposes the original network into a collection of independent singly-connected
networks all of which must be processed by BP. The cutset-conditioning approach
is time exponential in the network's loop-cutset size and requires linear space [Pearl
1988; Dechter 2003].
The idea of applying belief propagation directly to multiply connected networks
caught on only a decade after the book was published, when it was observed by
researchers in coding theory that high performing probabilistic decoding algorithms
such as turbo codes and low density parity-check codes, which significantly outper-
formed the best decoders at the time, are equivalent to an iterative application of
Pearl’s belief propagation algorithm [McEliece, MacKay, and Cheng 1998]. This
success intrigued researchers and started massive explorations of the potential of
these local computation algorithms for general applications. There is now a signifi-
cant body of research seeking the understanding and improvement of the inference
power of iterative belief propagation (IBP).
The early work on IBP showed its convergence for a single loop, provided empir-
ical evidence of its successes and failures on various classes of networks [Rish, Kask,
and Dechter 1998; Murphy, Weiss, and Jordan 2000] and explored the relationship
between energy minimization and belief propagation, shedding light on convergence
and stable points [Yedidia, Freeman, and Weiss 2000]. The current state of the art in
convergence analysis consists of the works by [Ihler, Fisher, and Willsky 2005; Mooij and
Kappen 2007] that characterize convergence in networks having no determinism.
The work by [Roosta, Wainwright, and Sastry 2008] also includes an analysis of
the possible effects of strong evidence on convergence which can act to suppress
the effects of cycles. As far as accuracy is concerned, the work of [Ihler 2007] considers how
weak potentials can make the graph sufficiently tree-like to provide error bounds, a
work which is extended and improved in [Mooij and Kappen 2009]. For additional
information see [Koller 2010].
While significant progress has been made in understanding the relationship be-
tween belief propagation and energy minimization, and while many extensions and
variations were proposed, some with remarkable performance (e.g., survey propa-
gation for solving satisfiability for random SAT problems), the following questions
remain even now:
• Can we assess the quality of the algorithm's performance once and if it converges?
In this paper we try to shed light on the power (and limits) of belief propagation
algorithms and on the above questions by explicating its relationship with constraint
propagation algorithms such as arc-consistency. Our results are relevant primarily
to networks that have determinism and extreme probabilities. Specifically, we show
that: (1) Belief propagation converges for zero beliefs; (2) All IBP-inferred zero be-
liefs are correct; (3) IBP’s power to infer zero beliefs is as weak and as strong as that
of arc-consistency; (4) Evidence and inferred singleton beliefs act like cutsets during
IBP’s performance. From points (2) and (4) it follows that if the inferred evidence
breaks all the cycles, then IBP converges to the exact beliefs for all variables.
Subsequently, we investigate empirically the behavior of IBP for inferred near-
zero beliefs. Specifically, we explore the hypothesis that: (5) If IBP infers that the
belief of a variable is close to zero then this inference is relatively accurate. We will
see that while our empirical results support the hypothesis on benchmarks having
no determinism, the results are quite mixed for networks with determinism.
Finally, (6) we investigate whether variables that have extreme probabilities in all
their domain values (i.e., extreme support) also nearly cut off information flow. If that
hypothesis is true, whenever the set of variables with extreme support constitute a
loop-cutset, IBP is likely to converge and, if the inferred beliefs for those variables
are sound, it will converge to accurate beliefs throughout the network.
On coding networks that possess significant determinism, we do see this desired
behavior. So, we could view this hypothesis as the first to provide a plausible
explanation of the success of belief propagation on coding networks. In coding networks
the channel noise is modeled through a normal distribution centered at the trans-
mitted character and controlled by a small standard deviation. The problem is
modeled as a layered belief network whose sink nodes are all evidence that trans-
mit extreme support to their parents, which constitute all the rest of the variables.
The remaining dependencies are functional and arc-consistency on this type of net-
works is strong and often complete. Alas, as we show, on some other deterministic
networks IBP’s performance inferring near zero values is utterly inaccurate, and
therefore the strength of this explanation is questionable.
The paper is based for the most part on [Dechter and Mateescu 2003] and also on
[Bidyuk and Dechter 2001]. The empirical portion of the paper includes significant
new analysis of recent empirical evaluations carried out in UAI 2006 and UAI 2008.
2 Arc-consistency
DEFINITION 1 (constraint network). A constraint network $\mathcal{C}$ is a triple $\mathcal{C} = \langle X, D, C \rangle$, where $X = \{X_1, \ldots, X_n\}$ is a set of variables associated with a set of discrete-valued domains $D = \{D_1, \ldots, D_n\}$ and a set of constraints $C = \{C_1, \ldots, C_r\}$. Each constraint $C_i$ is a pair $\langle S_i, R_i \rangle$ where $R_i$ is a relation $R_i \subseteq D_{S_i}$ defined on a subset of variables $S_i \subseteq X$ and $D_{S_i}$ is the Cartesian product of the domains of the variables in $S_i$. The relation $R_i$ denotes all tuples of $D_{S_i}$ allowed by the constraint. The projection operator $\pi$ creates a new relation, $\pi_{S_j}(R_i) = \{x \mid x \in D_{S_j} \text{ and } \exists y,\ y \in D_{S_i \setminus S_j} \text{ and } x \cup y \in R_i\}$, where $S_j \subseteq S_i$. Constraints can be combined with the join operator $\bowtie$, resulting in a new relation, $R_i \bowtie R_j = \{x \mid x \in D_{S_i \cup S_j} \text{ and } \pi_{S_i}(x) \in R_i \text{ and } \pi_{S_j}(x) \in R_j\}$.
DEFINITION 2 (constraint satisfaction problem). The constraint satisfaction problem (CSP) defined over a constraint network $\mathcal{C} = \langle X, D, C \rangle$ is the task of finding a solution, that is, an assignment of values to all the variables $x = (x_1, \ldots, x_n)$, $x_i \in D_i$, such that $\forall C_i \in C$, $\pi_{S_i}(x) \in R_i$. The set of all solutions of the constraint network $\mathcal{C}$ is $sol(\mathcal{C}) = \bowtie_i R_i$.
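A minimal sketch of the two operators may help fix the notation; it assumes a relation is stored as an ordered scope plus a set of value tuples, and the representation and names are illustrative only:

```python
# Projection and join over relations stored as (scope, set of value tuples).
def project(scope, rel, target):
    """pi_target(rel): restrict every tuple of rel to the variables in target."""
    idx = [scope.index(v) for v in target]
    return {tuple(t[i] for i in idx) for t in rel}

def join(scope1, rel1, scope2, rel2):
    """rel1 |><| rel2: consistent combinations over the union of the two scopes."""
    shared = [v for v in scope1 if v in scope2]
    out_scope = scope1 + [v for v in scope2 if v not in scope1]
    out = set()
    for t1 in rel1:
        for t2 in rel2:
            if all(t1[scope1.index(v)] == t2[scope2.index(v)] for v in shared):
                row = dict(zip(scope1, t1))
                row.update(zip(scope2, t2))
                out.add(tuple(row[v] for v in out_scope))
    return out_scope, out

# sol(C) for a tiny network: the unary relation {A=1} joined with A != B.
print(join(["A"], {(1,)}, ["A", "B"], {(1, 2), (2, 1)}))   # (['A', 'B'], {(1, 2)})
```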
[Figure 1: part of the execution of relational distributed arc-consistency on the graph coloring problem of Example 5. The left side shows the constraint graph over A, B, C, D, F, G; the right side shows the dual graph with nodes R_A, R_AB, R_AC, R_ABD, R_BCF, R_DFG, their initial relations, and some of the messages h exchanged along the labeled arcs.]
DEFINITION 3 (arc-consistency). Given a binary constraint network $\mathcal{C} = \langle X, D, C \rangle$, $\mathcal{C}$ is arc-consistent iff for every binary constraint $R_i \in C$ s.t. $S_i = \{X_j, X_k\}$, every value $x_j \in D_j$ has a value $x_k \in D_k$ s.t. $(x_j, x_k) \in R_i$.
When a binary constraint network is not arc-consistent, arc-consistency algo-
rithms remove values from the domains of the variables until an arc-consistent net-
work is generated. A variety of such algorithms were developed over the past three
decades [Dechter 2003]. We will consider here a simple and not the most efficient
version, which we call relational distributed arc-consistency algorithm. Rather than
defining it on binary constraint networks we will define it directly over the dual
graph, extending the arc-consistency condition to non-binary networks.
DEFINITION 4 (dual graph). Given a set of functions/constraints $F = \{f_1, \ldots, f_r\}$ over scopes $S_1, \ldots, S_r$, the dual graph of $F$ is a graph $D_F = (V, E, L)$ that associates a node with each function, namely $V = F$, and an arc connects any two nodes whose scopes share a variable, $E = \{(f_i, f_j) \mid S_i \cap S_j \neq \emptyset\}$. $L$ is a set of labels for the arcs, where each arc is labeled by the shared variables of its nodes, $L = \{l_{ij} = S_i \cap S_j \mid (i, j) \in E\}$.
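Definition 4 translates almost directly into code. The small sketch below uses illustrative names, with the constraint scopes of the graph-coloring example of Figure 1 as test data:

```python
# One node per constraint, an arc wherever two scopes intersect, labeled by the
# shared variables.
from itertools import combinations

def dual_graph(scopes):
    """scopes: dict constraint_name -> set of variables."""
    nodes = list(scopes)
    arcs = {(ci, cj): scopes[ci] & scopes[cj]
            for ci, cj in combinations(nodes, 2)
            if scopes[ci] & scopes[cj]}
    return nodes, arcs

scopes = {"R_A": {"A"}, "R_AB": {"A", "B"}, "R_AC": {"A", "C"},
          "R_ABD": {"A", "B", "D"}, "R_BCF": {"B", "C", "F"}, "R_DFG": {"D", "F", "G"}}
print(dual_graph(scopes)[1])
```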
Algorithm Relational distributed arc-consistency (RDAC) is a message-passing
algorithm defined over the dual graph $D_{\mathcal{C}}$ of a constraint network $\mathcal{C} = \langle X, D, C \rangle$. It
enforces what is known as relational arc-consistency [Dechter 2003]. Each node of $D_{\mathcal{C}}$,
corresponding to a constraint $C_i \in C$, maintains a current set of viable tuples $R_i$.
Let $ne(i)$ be the set of neighbors of $C_i$ in $D_{\mathcal{C}}$. Every node $C_i$ sends a message to
every node $C_j \in ne(i)$, which consists of the tuples over their label variables $l_{ij}$ that
are allowed by the current relation $R_i$. Formally, let $R_i$ and $R_j$ be two constraints
sharing scopes, whose arc in $D_{\mathcal{C}}$ is labeled by $l_{ij}$. The message that $R_i$ sends to $R_j$, denoted $h_i^j$, is defined by:
$$h_i^j = \pi_{l_{ij}}\left( R_i \bowtie \left( \bowtie_{k \in ne(i)} h_k^i \right) \right)$$
where $h_k^i$ is the most recent message received by $R_i$ from its neighbor $R_k$.
EXAMPLE 5. Figure 1 describes part of the execution of RDAC for a graph col-
oring problem, having the constraint graph shown on the left. All variables have
the same domain, {1,2,3}, except for variable C whose domain is {2} and variable G
whose domain is {3}. The arcs correspond to not-equal constraints, and the relations
are RA , RAB , RAC , RABD , RBCF , RDF G , where the subscript corresponds to their
scopes. The dual graph of this problem is given on the right side of the figure,
and each table shows the initial constraints (there are unary, binary and ternary
constraints). To initialize the algorithm, the first messages sent out by each node
are universal relations over the labels. For this example, RDAC actually solves the
problem and finds the unique solution A=1, B=3, C=2, D=2, F=1, G=3.
Relational distributed arc-consistency algorithm converges after O(r·t) iterations
to the largest relational arc-consistent network that is equivalent to the original
network, where r is the number of constraints and t bounds the number of tuples
in each constraint. Its complexity can be shown to be $O(r^2 t^2 \log t)$ [Dechter 2003].
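The scheme is simple enough to prototype. The sketch below is an illustrative prototype rather than the algorithm exactly as specified above: it updates all messages synchronously, and each node combines the messages from all of its neighbors, which only tightens the filtering and remains sound.

```python
# Prototype of relational distributed arc-consistency on a dual graph.
def project(scope, tuples, target):
    idx = [scope.index(v) for v in target]
    return {tuple(t[i] for i in idx) for t in tuples}

def restrict(scope, tuples, incoming):
    """Keep the tuples consistent with every incoming (label, relation) message."""
    keep = set()
    for t in tuples:
        row = dict(zip(scope, t))
        if all(tuple(row[v] for v in lab) in rel for lab, rel in incoming):
            keep.add(t)
    return keep

def rdac(relations, labels, iterations=10):
    """relations: {node: (scope, tuples)}; labels: {(i, j): label} per dual arc i -> j."""
    msgs = {}                                    # (i, j) -> (label, relation over label)
    for _ in range(iterations):
        new = {}
        for (i, j), lab in labels.items():
            scope, tuples = relations[i]
            incoming = [msgs[(k, n)] for (k, n) in labels if n == i and (k, n) in msgs]
            alive = restrict(scope, tuples, incoming)
            new[(i, j)] = (lab, project(scope, alive, lab))
        msgs = new
    return msgs

relations = {"R_A": (["A"], {(1,)}), "R_AB": (["A", "B"], {(1, 2), (2, 1)})}
labels = {("R_A", "R_AB"): ["A"], ("R_AB", "R_A"): ["A"]}
print(rdac(relations, labels))   # R_AB ends up allowing only A=1 on the shared label
```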
Algorithm IBP
Input: An arc-labeled dual join-graph $DJ = (V, E, L)$ for a belief network $\mathcal{B} = \langle X, D, G, P \rangle$. Evidence $e$.
Output: An augmented graph whose nodes include the original CPTs and the messages received from neighbors. Approximations of $P(X_i|e)$ and $P(fa(X_i)|e)$, $\forall X_i \in X$.
Denote by: $h_u^v$ the message from $u$ to $v$; $ne(u)$ the neighbors of $u$ in $V$; $ne_v(u) = ne(u) - \{v\}$; $l_{uv}$ the label of $(u, v) \in E$; $elim(u, v) = fa(X_i) - fa(X_j)$, where $u$ and $v$ are the vertices of families $fa(X_i)$ and $fa(X_j)$ in $DJ$, respectively.
• One iteration of IBP
For every node u in DJ in a topological order and back, do:
1. Process observed variables
Assign evidence variables to each $p_u$ and remove them from the labeled arcs.
2. Compute and send to $v$ the function:
$$h_u^v = \sum_{elim(u,v)} \left( p_u \cdot \prod_{i \in ne_v(u)} h_i^u \right)$$
EndFor
• Compute approximations of $P(X_i|e)$ and $P(fa(X_i)|e)$:
For every $X_i \in X$ (let $u$ be the vertex of family $fa(X_i)$ in $DJ$), do:
$$P(fa(X_i)|e) = \alpha \left( \prod_{v \in ne(u)} h_v^u \right) \cdot p_u$$
$$P(X_i|e) = \alpha \sum_{fa(X_i) \setminus \{X_i\}} P(fa(X_i)|e)$$
EndFor
intersection property.
In IBP each node in the dual join-graph sends a message over an adjacent arc
whose scope is identical to its label. Pearl’s original algorithm sends messages whose
scopes are singleton variables only. It is easy to show that any dual graph (which
itself is a dual join-graph) has an arc-minimal singleton dual join-graph which can
be constructed directly by labeling the arc between the CPT of a variable and the
CPT of its parent, by its parent variable. Algorithm IBP defined for any dual join-
graph is given in Figure 2. One iteration of IBP is time and space linear in the size
of the belief network, and when IBP is applied to the singleton labeled dual graph
it coincides with Pearl's belief propagation. The inferred approximation of the belief
$P(X|e)$ output by IBP will be denoted by $P_{IBP}(X|e)$.
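A small numpy sketch of the core message update may help fix ideas; the factor representation (a list of variable names plus an array) is an assumption made for the example, and normalization is omitted:

```python
# Sketch of the IBP message update
#   h_u^v = sum_{elim(u,v)} ( p_u * prod_{i in ne_v(u)} h_i^u ).
import numpy as np

def multiply(f1, f2):
    """Pointwise product of two factors given as (vars, array)."""
    v1, a1 = f1
    v2, a2 = f2
    out_vars = list(v1) + [v for v in v2 if v not in v1]
    def expand(vs, a):
        # Transpose/reshape so the array broadcasts over the union of the scopes.
        perm = [vs.index(v) for v in out_vars if v in vs]
        shape = [a.shape[vs.index(v)] if v in vs else 1 for v in out_vars]
        return np.transpose(a, perm).reshape(shape)
    return out_vars, expand(list(v1), a1) * expand(list(v2), a2)

def marginalize(f, keep):
    vs, a = f
    axes = tuple(i for i, v in enumerate(vs) if v not in keep)
    return [v for v in vs if v in keep], a.sum(axis=axes)

def ibp_message(p_u, incoming, label):
    """p_u: cluster CPT; incoming: messages from all neighbors except the recipient."""
    f = p_u
    for h in incoming:
        f = multiply(f, h)
    return marginalize(f, label)

# Toy cluster with CPT p(x | a) and one incoming message over {A}.
p = (["X", "A"], np.array([[0.9, 0.2], [0.1, 0.8]]))
h_in = (["A"], np.array([0.3, 0.7]))
print(ibp_message(p, [h_in], ["X"]))   # (['X'], array([0.41, 0.59]))
```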
The correspondence between IBP-RDAC and IBP yields the main claims and provides insight into
the behavior of IBP for inferred zero beliefs. In particular, this relationship justifies
the iterative application of belief propagation algorithms, while also illuminating
their “distance” from being complete.
More precisely, in this section we will show that: (a) If a variable-value pair
is assessed in some iteration by IBP as having a zero-belief, it remains zero in
subsequent iterations; (b) Any IBP-inferred zero-belief is correct with respect to
the corresponding belief network’s marginal; and (c) IBP converges in finite time
for all its inferred zeros.
Proof. $P_{\mathcal{B}}(x) > 0 \Leftrightarrow \prod_{i=1}^{n} P(x_i \mid x_{pa(X_i)}) > 0 \Leftrightarrow \forall i \in \{1, \ldots, n\},\ P(x_i \mid x_{pa(X_i)}) > 0 \Leftrightarrow \forall i \in \{1, \ldots, n\},\ (x_i, x_{pa(X_i)}) \in R_{F_i} \Leftrightarrow x \in sol(flat(\mathcal{B}))$. □
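The flat network used here is easy to picture in code: each CPT contributes the relation of its nonzero-probability tuples. A hypothetical sketch, with an illustrative data layout:

```python
# Flattening a CPT into the relation R_F of tuples with nonzero probability.
def flatten_cpt(cpt):
    """cpt: dict mapping value tuples over the family scope to probabilities."""
    return {t for t, p in cpt.items() if p > 0}

# P(B | A) with a structural zero: the tuple (A=1, B=1) is impossible.
cpt_b = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 0.4, (2, 2): 0.6}
print(flatten_cpt(cpt_b))   # {(1, 2), (2, 1), (2, 2)}
```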
[Figure: the belief network corresponding to the graph-coloring example, with CPTs P(A), P(B|A), P(C|A), P(D|A,B), P(F|B,C), P(G|D,F), shown alongside dual join-graphs whose relations keep only the tuples with nonzero probability.]
3. Instead of the summation operator $\sum$, we use the projection operator $\pi$.
Proof. The proof is by induction. The base case is trivially true since messages h
in IBP are initialized to a uniform distribution and messages h in IBP-RDAC are
initialized to complete relations.
For the inductive step, let $h^{IBP}_{(u,v)}$ be the message sent by IBP and $h^{IBP\text{-}RDAC}_{(u,v)}$ the message sent by IBP-RDAC from $u$ to $v$. Assume that the claim holds
for all messages received by $u$ from its neighbors. Let $f \in u$ in IBP and $R_f$
be the corresponding relation in IBP-RDAC, and let $t$ be an assignment of values
to the variables in $elim(u, v)$. We have $h^{IBP}_{(u,v)}(x) \neq 0 \Leftrightarrow \sum_{elim(u,v)} \prod_f f(x) \neq 0
\Leftrightarrow \exists t, \prod_f f(x, t) \neq 0 \Leftrightarrow \exists t, \forall f,\ f(x, t) \neq 0 \Leftrightarrow \exists t, \forall f,\ \pi_{scope(R_f)}(x, t) \in R_f \Leftrightarrow
\exists t,\ \pi_{elim(u,v)}(\bowtie_{R_f} \pi_{scope(R_f)}(x, t)) \in h^{IBP\text{-}RDAC}_{(u,v)} \Leftrightarrow x \in h^{IBP\text{-}RDAC}_{(u,v)}$. □
Moving from tuples to domain values, we will show that whenever IBP computes
a marginal probability PIBP (xi |e) = 0, IBP-RDAC removes xi from the domain of
variable Xi , and vice-versa.
PROPOSITION 17. Given a belief network $\mathcal{B}$ and evidence $e$, IBP applied to $\mathcal{B}$
derives $P_{IBP}(x_i|e) = 0$ iff IBP-RDAC over $flat(\mathcal{B})$ decides that $x_i \notin D_i$.
Proof. According to Proposition 16, the messages computed by IBP and IBP-RDAC are identical in terms of zero probabilities. Let $f \in cluster(u)$ in IBP and $R_f$ be the corresponding relation in IBP-RDAC, and let $t$ be an assignment of values to the variables in $\chi(u) \setminus X_i$. We will show that when IBP computes $P(X_i = x_i) = 0$ (upon convergence), then IBP-RDAC computes $x_i \notin D_i$. We have $P(X_i = x_i) = \sum_{X \setminus X_i} \prod_f f(x_i) = 0 \Leftrightarrow \forall t, \prod_f f(x_i, t) = 0 \Leftrightarrow \forall t, \exists f,\ f(x_i, t) = 0 \Leftrightarrow \forall t, \exists R_f,\ \pi_{scope(R_f)}(x_i, t) \notin R_f \Leftrightarrow \forall t,\ (x_i, t) \notin (\bowtie_{R_f} R_f(x_i, t)) \Leftrightarrow x_i \notin D_i \cap \pi_{X_i}(\bowtie_{R_f} R_f(x_i, t)) \Leftrightarrow x_i \notin D_i$. Since arc-consistency is sound, so is the decision of zero probabilities. □
Proof. By Proposition 17, if IBP over $\mathcal{B}$ computes $P_{IBP}(x_i|e) = 0$, then IBP-RDAC over $flat(\mathcal{B})$ removes the value $x_i$ from the domain $D_i$. Therefore, $x_i \in D_i$ is a no-good of the constraint network $flat(\mathcal{B})$, and from Theorem 11 it follows that $Bel(x_i) = 0$. □
Next, we show that the time it takes IBP to find its inferred zeros is bounded.
PROPOSITION 19. Given a belief network $\mathcal{B}$ and evidence $e$, IBP finds all its $x_i$ for which $P_{IBP}(x_i|e) = 0$ in finite time, that is, there exists a number $k$ such that no $P_{IBP}(x_i|e) = 0$ will be generated after $k$ iterations.
Proof. This follows from the fact that the number of iterations it takes for IBP to compute $P_{IBP}(X_i = x_i|e) = 0$ over $\mathcal{B}$ is exactly the same number of iterations IBP-RDAC needs to remove $x_i$ from the domain $D_i$ over $flat(\mathcal{B})$ (Propositions 16 and 17) and the fact that IBP-RDAC's number of iterations is bounded (Proposition 15). □
[Figure 4: a belief network over X1, X2, X3 with auxiliary variables H1, H2, H3, its dual join-graph, and the IBP beliefs Bel(Xi) as a function of the number of iterations, which reach (.5, .5, 0) after about 300 iterations while the true belief is (0, 0, 1).]
There are three CPTs over the scopes $\{H_1, X_1, X_2\}$, $\{H_2, X_2, X_3\}$, and $\{H_3, X_1, X_3\}$.
The values of the CPTs for every triplet of variables $\{H_k, X_i, X_j\}$ are:
$$P(h_k = 1 \mid x_i, x_j) = \begin{cases} 1, & \text{if } (3 \neq x_i \neq x_j \neq 3); \\ 1, & \text{if } x_i = x_j = 3; \\ 0, & \text{otherwise}; \end{cases}$$
$$P(h_k = 0 \mid x_i, x_j) = 1 - P(h_k = 1 \mid x_i, x_j).$$
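Written out directly, the piecewise definition amounts to the small illustrative helper below (the domain {1, 2, 3} is assumed, as in the example):

```python
# The CPT P(h_k | x_i, x_j) from the example, written as a function.
def p_h_given_x(h, x_i, x_j):
    one = 1.0 if ((x_i != 3 and x_j != 3 and x_i != x_j) or (x_i == x_j == 3)) else 0.0
    return one if h == 1 else 1.0 - one

assert p_h_given_x(1, 1, 2) == 1.0   # 3 != x_i != x_j != 3
assert p_h_given_x(1, 3, 3) == 1.0   # x_i = x_j = 3
assert p_h_given_x(1, 2, 2) == 0.0   # otherwise
```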
The belief for any of the X variables as a function of the number of iterations is
given in Figure 4b. After about 300 iterations, the finite precision of our computer
is not able to represent the value for Bel(Xi = 3), and this appears to be zero,
yielding the final updated belief (.5, .5, 0), when in fact the true updated belief
should be (0, 0, 1). Notice that (.5, .5, 0) cannot be regarded as a legitimate fixed
point for IBP. Namely, if we would initialize IBP with the values (.5, .5, 0), then
the algorithm would maintain them, appearing to have a fixed point. However,
initializing IBP with zero values cannot be expected to be correct. Indeed, when
we initialize with zeros we forcibly introduce determinism in the model, and IBP
will always maintain it afterwards.
However, this example does not contradict our theory because, mathematically,
Bel(Xi = 3) never becomes a true zero, and IBP never reaches a quiescent state.
The example shows however that a close to zero inferred belief by IBP can be
arbitrarily inaccurate. In this case the inaccuracy seems to be due to the initial
prior beliefs, which are so different from the posterior ones.
1. IJGP generates all its $P_{IJGP}(x_i|e) = 0$ in finite time, that is, there exists a
number $k$, such that no $P_{IJGP}(x_i|e) = 0$ will be generated after $k$ iterations.
2. Whenever IJGP determines $P_{IJGP}(x_i|e) = 0$, it stays 0 during all subsequent
iterations.
Cases of weak inference power. Consider the belief network described in Ex-
ample 20. The flat constraint network of that belief network is defined over the
scopes S1 ={H1 , X1 , X2 }, S2 ={H2 , X2 , X3 }, S3 ={H3 , X1 , X3 }. The constraints are
defined by: RSi = {(1, 1, 2), (1, 2, 1), (1, 3, 3), (0, 1, 1), (0, 1, 3), (0, 2, 2), (0, 2, 3),
(0, 3, 1), (0, 3, 2)}. The prior probabilities for Xi ’s imply unary constraints equal
to the full domain {1,2,3}. An arc-minimal dual join-graph that is identical to the
constraint network is given in Figure 4b. In this case, IBP-RDAC sends as messages
the full domains of the variables and thus no tuple is removed from any constraint.
Since IBP infers the same zeros as arc-consistency, IBP will also not infer any zeros.
Since the true probability of most tuples is zero, we can conclude that the inference
power of IBP on this example is weak or non-existent.
The weakness of arc-consistency in this example is not surprising. Arc-consistency
is known to be far from complete. Since every constraint network can be expressed
as a belief network (by adding a variable for each constraint as we did in the above
example) and since arc-consistency can be arbitrarily weak on some constraint net-
works, so could be IBP.
Cases of strong inference power. The relationship between IBP and arc-
consistency ensures that IBP is zero-complete, whenever arc-consistency is. In
general, if for a flat constraint network of a belief network B, arc-consistency re-
moves all the inconsistent domain values, then IBP will also discover all the true
zeros of B. Examples of constraint networks that are complete for arc-consistency
are max-closed constraints. These constraints have the property that if two tuples are
in the relation then so is their coordinate-wise maximum. Linear constraints are often max-closed and
so are Horn clauses (see [Dechter 2003]). Clearly, IBP is zero complete for acyclic
networks which include binary trees, polytrees and networks whose dual graph is a
hypertree [Dechter 2003]. This is not too illuminating though as we know that IBP
is fully complete (not only for zeros) for such networks.
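As a quick illustration of max-closedness in the sense it is usually defined (closure under the coordinate-wise maximum of any two allowed tuples), the hypothetical checker below accepts x ≤ y and rejects x ≠ y:

```python
# Check whether a relation (set of value tuples) is max-closed.
from itertools import product

def is_max_closed(rel):
    return all(tuple(map(max, t1, t2)) in rel for t1, t2 in product(rel, repeat=2))

leq = {(a, b) for a in (1, 2, 3) for b in (1, 2, 3) if a <= b}   # x <= y
neq = {(a, b) for a in (1, 2, 3) for b in (1, 2, 3) if a != b}   # x != y
print(is_max_closed(leq), is_max_closed(neq))   # True False
```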
An interesting case is when the belief network has no evidence. In this case,
the flat network always corresponds to the causal constraint network defined in
[Dechter and Pearl 1991]. The inconsistent tuples or domain values are already
explicitly described in each relation and no new zeros can be inferred. What is
more interesting is that in the absence of evidence IBP is also complete for non-zero
beliefs for many variables as we show later.
erties of observed and unobserved nodes, automatically, without any outside inter-
vention for network transformation. As a result, the correctness and convergence of
IBP on a node Xi in a multiply-connected belief network will be determined by the
structure restricted to Xi ’s relevant subgraph. If the relevant subnetwork of Xi is
singly-connected relative to the evidence (observed or inferred), IBP will converge
to the correct beliefs for node Xi .
6 Experimental Evaluation
The goal of the experiments is two-fold. First, since zero values inferred by IBP/IJGP
are proved correct, we want to explore the behavior of IBP/IJGP for near zero
inferred beliefs. Second, we want to explore the hypothesis that the loop-cutset
impact on IBP’s performance, as discussed in Section 5.2, also extends to variables
with extreme support. The next two subsections are devoted to these two issues,
respectively.
[Figure: results on coding networks with channel noise 0.20, 0.40, and 0.60; each panel shows the exact and IBP histograms of belief values together with the Recall and Precision absolute errors.]
by IBP are extreme. The Recall and Precision are very small, of the order of $10^{-11}$.
So, in this case, all the beliefs are very small (i.e., ε-small) and IBP is able to infer
them correctly, resulting in almost perfect accuracy (IBP is indeed perfect in this
case for the bit error rate). As noise increases, the Recall and Precision get closer
to a bell shape, indicating higher error for values close to 0.5 and smaller error for
extreme values. The histograms show that fewer belief values are extreme as noise
increases.
Grid networks. Grid networks are characterized by two parameters (N, D), where
N × N is the size of the network and D is the percentage of determinism (i.e., the
percentage of values in all CPTs assigned to either 0 or 1). We experiment with
grids2 instances from the UAI08 competition. They are characterized by parameters
({16, . . . , 42}, {50, 75, 90}). For each parameter configuration, there are samples of
size 10 generated by randomly assigning value 1 to one leaf node.
Figure 7 and Figure 8 report the results. IJGP correctly infers all 0 beliefs.
[Figure 6 plots: exact and IJGP belief histograms with Recall and Precision absolute errors for pedigree1, pedigree23, pedigree37, and pedigree38, at i-bound 3 and 7.]
Figure 6. Results on pedigree instances. Each row is the result for one instance.
Each column is the result of running IJGP with i-bound equal to 3 and 7, respec-
tively. The number of variables N, number of evidence variables NE, and induced
width w* of each instance is as follows. Pedigree1: N = 334, NE = 36 and w* = 21;
pedigree23: N = 402, NE = 93 and w* = 30; pedigree37: N = 1032, NE = 306 and
w* = 30; pedigree38: N = 724, NE = 143 and w* = 18.
[Figure 7 plots: exact and IJGP belief histograms with Recall and Precision absolute errors for grids2 configurations (16, 50) and (16, 75), at i-bound 3, 5, and 7.]
Figure 7. Results on grids2 instances. First row shows the results for parameter
configuration (16, 50). Second row shows the results for (16, 75). Each column is
the result of running IJGP with i-bound equal to 3, 5, and 7, respectively. Each
plot indicates the mean value for up to 10 instances. Both parameter configurations
have 256 variables, one evidence variable, and induced width w*=22.
However, its performance for ε-small beliefs is quite poor. Only for networks with
parameters (16, 50) is the Precision error relatively small (less than 0.05). If we fix
the size of the network and the i-bound, both Precision and Recall errors increase
as the determinism level D increases. The histograms clearly show the gap between
the number of true ε-small beliefs and the ones inferred by IJGP. As before, the
accuracy of IJGP improves as the value of the control parameter i-bound increases.
Two-layer noisy-OR networks. Variables are organized in two layers, where the
ones in the second layer have 10 parents. Each probability table represents a noisy-OR
function. Each parent variable $y_j$ has a value $P_j \in [0..P_{noise}]$. The CPT for each
variable in the second layer is then defined as
$$P(x = 0 \mid y_1, \ldots, y_P) = \prod_{y_j = 1} P_j, \qquad P(x = 1 \mid y_1, \ldots, y_P) = 1 - P(x = 0 \mid y_1, \ldots, y_P).$$
We experiment on bn2o instances from the UAI08 competition.
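A small sketch of that CPT (the random draw of the $P_j$ values and the variable names are illustrative only):

```python
# Noisy-OR CPT: P(x=0 | y) multiplies the P_j of the parents that are on.
import random

def noisy_or(parent_states, p_j):
    """parent_states: 0/1 values of the parents; p_j: their noise parameters."""
    p_x0 = 1.0
    for y, p in zip(parent_states, p_j):
        if y == 1:
            p_x0 *= p
    return {0: p_x0, 1: 1.0 - p_x0}

p_noise = 0.4
p_j = [random.uniform(0, p_noise) for _ in range(10)]
print(noisy_or([1, 0, 1, 1, 0, 0, 0, 1, 0, 0], p_j))
```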
Figure 9 reports the results for 3 instances. In this case, IJGP is very accurate
for all instances. In particular, the accuracy in ε-small beliefs is very high.
CPCS networks. These are medical diagnosis networks derived from the Computer-
Based Patient Care Simulation system (CPCS) expert system. We tested on two
[Figure 8 plots: exact and IJGP belief histograms with Recall and Precision absolute errors for grids2 configurations (26, 75) and (26, 90), at i-bound 3, 5, and 7.]
Figure 8. Results on grids2 instances. First row shows the results for parameter
configuration (26, 75). Second row shows the results for (26, 90). Each column is
the result of running IJGP with i-bound equal to 3, 5 and 7, respectively. Each
plot indicates the mean value for up to 10 instances. Both parameter configurations
have 676 variables, one evidence variable, and induced width w*=40.
networks, cpcs54 and cpcs360, with 54 and 360 variables, respectively. For the first
network, we generate samples of size 100 by randomly assigning 10 variables as
evidence. For the second network, we also generate samples of the same size by
randomly assigning 20 and 30 variables as evidence.
Figure 10 shows the results. The histograms show opposing trends in the distri-
bution of beliefs. Although irregular, the absolute error tends to increase towards
0.5 for cpcs54. In general, the error is quite small throughout all intervals and, in
particular, for inferred extreme marginals.
[Figure 9 plots: exact and IJGP belief histograms with Recall and Precision absolute errors for the bn2o-30-15-150-1a, bn2o-30-20-200-1a, and bn2o-30-25-250-1a instances.]
Figure 9. Results on bn2o instances. Each row is the result for one instance. Each
column in each row is the result of running IJGP with i-bound equal to 3, 5 and
7, respectively. The number of variables N , number of evidence variables N E,
and induced width w* of each instance is as follows. bn2o-30-15-150-1a: N = 45,
N E = 15, and w*=24; bn2o-30-20-200-1a: N = 50, N E = 20, and w*=27; bn2o-
30-25-250-1a: N = 55, N E = 25, and w*=26.
cutset of the graph, IBP converges and computes beliefs that approach exact ones.
We will briefly recap the empirical evidence supporting the hypothesis in 2-layer
noisy-OR networks. The number of root nodes m and the total number of nodes n
were fixed in each test set (indexed m-n). When generating the networks, each leaf
node Yj was added to the list of children of a root node Ui with probability 0.5.
All nodes were bi-valued. All leaf nodes were observed. We used average absolute
error in the posterior marginals (averaged over all unobserved variables) to measure
IBP’s accuracy and the percent of variables for which IBP converged as a measure
of convergence. In each group of experiments, the results were averaged over 100
instances.
In one set of experiments, we measured the performance of IBP while changing
[Figure 10 plots: exact and IBP belief histograms with Recall and Precision absolute errors for cpcs360 with 20 and 30 evidence variables and for cpcs54 with 10 evidence variables.]
the number of observed loop-cutset variables (we fixed all priors to (.5, .5) and
picked the observed values for the loop-cutset variables at random). The results are shown
in Figure 11, top. As expected, the number of converged nodes increased and the
absolute average error decreased monotonically as the number of observed loop-cutset
nodes increased.
[Figure 11 plots: average error and number of converged nodes for the 5-20, 7-23, and 10-40 test sets as a function of the number of observed loop-cutset nodes.]
Figure 11. Results for 2-layer Noisy-OR networks. The average error and the
number of converged nodes vs the number of truly observed loop-cutset nodes (top)
and the size of the ε-cutset (bottom).
7 Conclusion
The paper provides insight into the power of the Iterative Belief Propagation (IBP)
algorithm by making its relationship with constraint propagation explicit. We show
that the power of belief propagation for zero beliefs is identical to the power of arc-
consistency in removing inconsistent domain values. Therefore, the strength and
weakness of this scheme can be gleaned from understanding the inference power of
arc-consistency. In particular we show that the inference of zero beliefs (marginals)
by IBP and IJGP is always sound. These algorithms are guaranteed to converge
for inferred zeros and are as efficient as the corresponding constraint propagation
algorithms.
Then the paper empirically investigates whether the sound inference of zeros by
IBP extends to near-zeros. We show that while the inference of near-zeros is
often quite accurate, it can sometimes be extremely inaccurate for networks hav-
ing significant determinism. Specifically, for networks without determinism IBP's
near-zero inference was sound in the sense that the average absolute error was con-
tained within the length of the 0.05 interval (see the two-layer noisy-OR and CPCS
benchmarks). However, the behavior was different on benchmark networks having
determinism. For example, experiments on coding networks show that IBP is al-
most perfect, while for pedigree and grid networks the results are quite inaccurate
[Figure 12 plots: for the 5-20, 7-23, and 10-40 two-layer noisy-OR test sets (100 instances each), the average error in P(x|e) and the percentage of converged nodes as a function of the root priors.]
Figure 12. Results for 2-layer Noisy-OR networks. The average error and the
percent of converged nodes vs ε-support.
near zeros.
Finally, we show that evidence, observed or inferred, automatically acts as a
cycle-cutting mechanism and improves the performance of IBP. We also provide a
preliminary empirical evaluation showing that the effect of the loop-cutset on the
accuracy of IBP extends to variables that have extreme probabilities.
References
Bidyuk, B. and R. Dechter (2001). The epsilon-cutset effect in Bayesian networks,
R97, R97a in http://www.ics.uci.edu/~dechter/publications. Technical report,
University of California, Irvine.
Dechter, R. (2003). Constraint Processing. Morgan Kaufmann Publishers.
Dechter, R. and R. Mateescu (2003). A simple insight into iterative belief propa-
gation’s success. In Proceedings of the Nineteenth Conference on Uncertainty
in Artificial Intelligence (UAI’03), pp. 175–183.
Dechter, R., R. Mateescu, and K. Kask (2002). Iterative join-graph propaga-
tion. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial
Intelligence (UAI’02), pp. 128–136.
Dechter, R. and J. Pearl (1991). Directed constraint networks: A relational frame-
work for causal reasoning. In Proceedings of the Twelfth International Joint
Conferences on Artificial Intelligence (IJCAI’91), pp. 1164–1170.
Ihler, A. T. (2007). Accuracy bounds for belief propagation. In Proceedings of the
Twenty Third Conference on Uncertainty in Artificial Intelligence (UAI’07).
Ihler, A. T., J. W. Fisher, III, and A. S. Willsky (2005). Loopy belief propagation:
Convergence and effects of message errors. J. Machine Learning Research 6,
905–936.
10
Bayesian Nonparametric Learning
Michael I. Jordan
1 Introduction
One of the milestones in the development of artificial intelligence (AI) is the em-
brace of uncertainty and inductive reasoning as primary concerns of the field. This
embrace has been a surprisingly slow process, perhaps because the naive interpre-
tation of “uncertain” seems to convey an image that is the opposite of “intelligent.”
That the field has matured beyond this naive opposition is one of the singular
achievements of Judea Pearl. While the pre-Pearl AI researcher tended to focus on
mimicking the deductive capabilities of human intelligence, a post-Pearl researcher
has been sensitized to the inevitable uncertainty that intelligent systems face in
any realistic environment, and the need to explicitly represent that uncertainty so
as to be able to mitigate its effects. Not only does this embrace of uncertainty
accord more fully with the human condition, but it also recognizes that the first ar-
tificially intelligent systems—necessarily limited in their cognitive capabilities—will
be if anything more uncertain regarding their environments than us humans. It is
only by embracing uncertainty that a bridge can be built from systems of limited
intelligence to those having robust human-level intelligence.
A computational perspective on uncertainty has two aspects: the explicit rep-
resentation of uncertainty and the algorithmic manipulation of this representation
so as to transform and (often) to reduce uncertainty. In his seminal 1988 book,
Probabilistic Reasoning in Intelligent Systems, Pearl showed that these aspects are
intimately related. In particular, obtaining a compact representation of uncer-
tainty has important computational consequences, leading to efficient algorithms
for marginalization and conditioning. Moreover, marginalization and conditioning
are the core inductive operations that tend to reduce uncertainty. Thus, by devel-
oping an effective theory of the representation of uncertainty, Pearl was able to also
develop an effective computational approach to probabilistic reasoning.
Uncertainty about an environment can also be reduced by simply observing that
environment; i.e., by learning from data. Indeed, another response to the early focus
on deduction in AI has been to emphasize learning as a pathway to the development
of intelligent systems. In the 1980’s, concurrently with Pearl’s work on probabilis-
tic expert systems, this perspective was taken up in earnest, building on an earlier
tradition in pattern recognition (which itself built on even earlier traditions in statis-
tics). The underlying inductive principle was essentially the law of large numbers,
a principle of probability theory which states that the statistical aggregation of
independent, identically distributed samples yields a decrease of uncertainty that
goes (roughly speaking) at a rate inversely proportional to the square root of the
number of samples. The question has been how to perform this “aggregation,” and
the learning field has been avidly empirical, exploring a variety of computational
architectures, including extremely simple representations (e.g., nearest neighbor),
ideas borrowed from deductive traditions (e.g., decision trees), ideas closely related
to classical statistical models (e.g., boosting and the support vector machine), and
architectures motivated at least in part by complex biological and physical systems
(e.g., neural networks). Several of these architectures have factorized or graphical
representations, and numerous connections to graphical models have been made.
A narrow reader of Pearl’s book might wish to argue that learning is not distinct
from the perspective on reasoning presented in that book; in particular, observing
the environment is simply a form of conditioning. This perspective on learning is
indeed reasonable if we assume that a learner maintains an explicit probabilistic
model of the environment; in that case, making an observation merely involves
instantiating some variable in the model. However, many learning researchers do
not wish to make the assumption that the learner maintains an explicit probabilistic
model of the environment, and many algorithms developed in the learning field
involve some sort of algorithmic procedure that is not necessarily interpretable as
computing a conditional probability. These procedures are instead justified in terms
of their unconditional performance when used again and again on various data sets.
Here we are of course touching on the distinction between the Bayesian and the
frequentist approaches to statistical inference. While this is not the place to develop
that distinction in detail, it is worth noting that statistics—the field concerned
with the theory and practice of inference—involves the interplay of the conditional
(Bayesian) and the unconditional (frequentist) perspectives and this interplay also
underlies many developments in AI research. Indeed, the trend since Pearl’s work
in the 1980’s has been to blend reasoning and learning: put simply, one does not
need to learn (from data) what one can infer (from the current model). Moreover,
one does not need to infer what one can learn (intractable inferential procedures
can be circumvented by collecting data). Thus learning (whether conditional or
not) and reasoning interact. The most difficult problems in AI are currently being
approached with methods that blend reasoning with learning. While the extremes
of classical expert systems and classical tabula rasa learning are still present and
still have their value in specialized situations, they are not the centerpieces of the
field. Moreover, the caricatures of probabilistic reasoning and statistical inference
that fed earlier ill-informed debates in AI have largely vanished. For this we owe
much to Judea Pearl.
There remain, however, a number of limitations—both perceived and real—of
probabilistic and statistical approaches to AI. In this essay, I wish to focus on some
take this literature as a point of departure for the development of expressive data
structures for computationally efficient reasoning and learning.
One general way to use stochastic processes in inference is to take a Bayesian per-
spective and replace the parametric distributions used as priors in classical Bayesian
analysis with stochastic processes. Thus, for example, we could consider a model
in which the prior distribution is a stochastic process that ranges over trees of ar-
bitrary depth and branching factor. Combining this prior with a likelihood, we
obtain a posterior distribution that is also a stochastic process that ranges over
trees of arbitrary depth and branching factor. Bayesian learning amounts to up-
dating one flexible representation (the prior stochastic process) into another flexible
representation (the posterior stochastic process).
This idea is not new; indeed, it is the core idea in an area of research known as
Bayesian nonparametrics, and there is a small but growing community of researchers
who work in the area. The word “nonparametrics” needs a bit of explanation. The
word does not mean “no parameters”; indeed, many stochastic processes can be
usefully viewed in terms of parameters (often, infinite collections of parameters).
Rather, it means “not parametric,” in the sense that Bayesian nonparametric in-
ference is not restricted to objects whose dimensionality stays fixed as more data is
observed. The spirit of Bayesian nonparametrics is that of flexible data structures—
representations can grow as needed. Moreover, stochastic processes yield a much
broader class of probability distributions than the class of exponential family distri-
butions that is the focus of the graphical model literature. In this sense, Bayesian
nonparametric learning is less assumption-laden than classical Bayesian parametric
learning.
In this paper we offer an invitation to Bayesian nonparametrics. Our presenta-
tion is meant to evoke Pearl’s presentation of Bayesian networks in that our focus
is on foundational representational issues. As in the case of graphical models, if
the representational issues are handled well, then there are favorable algorithmic
consequences. Indeed, the parallel is quite strong—in the case of graphical mod-
els, these algorithmic consequences are combinatorial in nature (they involve the
combinatorics of sums and products), and in the case of Bayesian nonparametrics
favorable algorithmic consequences also arise from the combinatorial properties of
certain stochastic process priors.
into groups that have the same color. This distribution on partitions is known as the
Chinese restaurant process [Aldous 1985]. As we discuss in more detail in Section 4,
the Chinese restaurant process and the Pólya urn model can be used as the basis of
a Bayesian nonparametric model of clustering where the random partition provides
a prior on clusterings and the color associated with a given cell can be viewed as a
parameter vector for a distribution associated with a given cluster.
The exchangeability of the Pólya urn model implies—by De Finetti’s theorem—
the existence of an underlying random element G that renders the ball colors con-
ditionally independent. This random element is not a classical fixed-dimension
random variable; rather, it is a stochastic process known as the Dirichlet process.
In the following section we provide a brief introduction to the Dirichlet process.
βk ∼ Beta(1, α0 ),   k = 1, 2, . . . ,                          (3)
πk = βk ∏_{l=1}^{k−1} (1 − βl ),   k = 1, 2, . . . ,            (4)
G = Σ_{k=1}^{∞} πk δφk ,   φk ∼ G0 ,   k = 1, 2, . . . ,        (5)
where δφk is a unit mass at the point φk . Clearly G is a measure. Indeed, for any
measurable subset B of Ω, G(B) just adds up the values πk for those k such that
φk ∈ B, and this process satisfies the countable additivity needed in the definition
of a measure. Moreover, G is a probability measure, because G(Ω) = 1.
Note that G is random in two ways—the weights πk are obtained by a random
process, and the locations φk are also obtained by a random process. While it seems
clear that such an object is not a classical finite-dimensional random variable, in
from which the needed consistency properties follow immediately from classical
properties of the Dirichlet distribution. For this reason, the stochastic process
defined by Eq. (5) is known as a Dirichlet process. Eq. (6) can be summarized as
saying that a Dirichlet process has Dirichlet marginals.
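The construction can also be simulated directly. The following is a minimal sketch of a truncated stick-breaking draw (the truncation level K, the concentration value α0 = 2, and the base measure G0 = N(0, 1) are illustrative choices, not prescriptions from the text):

    import numpy as np

    def stick_breaking_dp(alpha0, base_sampler, K=1000, rng=None):
        # Truncated draw of G ~ DP(alpha0, G0): weights pi_k via the stick-breaking
        # recipe of Eqs. (3)-(4), atom locations phi_k drawn i.i.d. from G0.
        rng = np.random.default_rng(rng)
        beta = rng.beta(1.0, alpha0, size=K)                                  # Eq. (3)
        pi = beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))      # Eq. (4)
        phi = base_sampler(K, rng)
        return pi, phi

    pi, phi = stick_breaking_dp(2.0, lambda k, r: r.normal(size=k))
    print(pi.sum())   # close to 1: G is (approximately) a probability measure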
Having defined a stochastic process G, we can now turn De Finetti’s theorem
around and ask what distribution is induced on a sequence (X1 , X2 , . . . , XN ) if
we draw these variables independently from G and then integrate out G. The
answer: the Pólya urn. We say that the Dirichlet process is the De Finetti mixing
distribution underlying the Pólya urn.
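A small sketch of this marginal process, drawing a sequence directly from the urn (the base measure and concentration below are illustrative); repeated values in the output correspond to shared atoms of the integrated-out G:

    import numpy as np

    def polya_urn(n, alpha0, base_sampler, rng=None):
        # X_i repeats one of the earlier draws with probability proportional to its
        # multiplicity, and is a fresh draw from G0 with probability proportional
        # to alpha0; this is the urn obtained by integrating out G.
        rng = np.random.default_rng(rng)
        draws = []
        for _ in range(n):
            if draws and rng.random() < len(draws) / (len(draws) + alpha0):
                draws.append(draws[rng.integers(len(draws))])
            else:
                draws.append(base_sampler(rng))
        return draws

    # Illustrative base measure: N(0, 1), rounded so that repeats are easy to spot.
    print(polya_urn(15, alpha0=1.0, base_sampler=lambda r: round(r.normal(), 3)))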
In the remainder of this chapter, we denote the stochastic process defined by
Eq. (5) as follows:
G ∼ DP(α0 , G0 ). (7)
Combining these ingredients yields the Dirichlet process mixture model:
G ∼ DP(α0 , G0 )
θi | G ∼ G, i = 1, . . . , N
xi | θi ∼ p(xi | θi ), i = 1, . . . , N,
[Figure: Graphical representation of the Dirichlet process mixture model, with base
measure G0 , concentration parameter α0 , random measure G, parameters θi , and
observations xi .]
where the first factor is the Pólya urn model. This can be viewed as a product of
a prior (the first factor) and a likelihood (the remaining factors).
The variable x is held fixed in inference (it is the observed data) and the goal is
to sample θ. We develop a Gibbs sampler for this purpose. The main problem is to
sample a particular component θi while holding all of the other components fixed. It
is here that the property of exchangeability is essential. Because the joint probability
of (θ1 , . . . , θN ) is invariant to permutation, we can permute the vector to move θi to
the end of the list. But the prior probability of the last component given all of the
preceding variables is given by the urn model specification in Eq. (2). We multiply
each of the distributions in this expression by the likelihood p(xi | θ) and integrate
with respect to θ. (We are assuming that G0 and the likelihood are conjugate, so that
this integral can be done in closed form.) The result is the conditional distribution
of θi given the other components and given xi . This conditional is sampled to yield
the updated value of θi . This is done for all of the indices i ∈ {1, . . . , N } and the
process iterates.
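A minimal sketch of one such Gibbs sweep, specialized to an illustrative conjugate pair (a Gaussian likelihood with known variance and a Gaussian base measure); the derivation in the text is general and does not depend on this particular choice:

    import numpy as np

    def normal_pdf(x, mean, sd):
        return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

    def gibbs_sweep(theta, x, alpha0, sigma=1.0, tau=1.0, rng=None):
        # One sweep of the collapsed Gibbs sampler described above, for a likelihood
        # x_i ~ N(theta_i, sigma^2) and base measure G0 = N(0, tau^2), a conjugate
        # pair so that the required integral is available in closed form.
        rng = np.random.default_rng(rng)
        for i in range(len(x)):
            others = np.delete(theta, i)
            # Reuse theta_j with weight proportional to the likelihood of x_i.
            w = normal_pdf(x[i], others, sigma)
            # Fresh draw from G0 with weight alpha0 times the marginal likelihood.
            w_new = alpha0 * normal_pdf(x[i], 0.0, np.sqrt(sigma ** 2 + tau ** 2))
            probs = np.append(w, w_new)
            probs /= probs.sum()
            k = rng.choice(len(probs), p=probs)
            if k < len(others):
                theta[i] = others[k]
            else:
                var = 1.0 / (1.0 / tau ** 2 + 1.0 / sigma ** 2)
                theta[i] = rng.normal(var * x[i] / sigma ** 2, np.sqrt(var))
        return theta

    x = np.array([-2.1, -1.9, 2.0, 2.2, 2.1])
    theta = x.copy()                      # a simple initialization
    for _ in range(50):
        theta = gibbs_sweep(theta, x, alpha0=1.0)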
This link between exchangeability and an efficient inference algorithm is an im-
portant one. In other more complex Bayesian nonparametric models, while we may
no longer assume exchangeability, we generally aim to maintain some weaker notion
(e.g., partial exchangeability) so as to have some hope of tractable inference.
One way to tie together multiple Dirichlet processes is to let them share a random
base measure:
G0 | γ, H ∼ DP(γ, H)
Gi | α0 , G0 ∼ DP(α0 , G0 ),   i = 1, . . . , m,
where γ and H are concentration and base measure parameters at the top of the
hierarchy. This construction—which is known as a hierarchical Dirichlet process
(HDP)—yields an interesting kind of “shrinkage.” Recall that G0 is a discrete
random measure, with its support on a countably infinite set of atoms. Drawing
Gi ∼ DP(α0 , G0 ) means that Gi will also have its support on the same set of atoms,
and this will be true for each of {G1 , G2 , . . . , Gm }. Thus these measures will share
atoms. They will differ in the weights assigned to these atoms. The weights are
obtained via conditionally independent stick-breaking processes.
One application of this sharing of atoms is to share mixture components across
multiple clustering problems. Consider in particular a problem in which we have
m groups of data, {(x11 , x12 , . . . , x1N1 ), . . . , (xm1 , xm2 , . . . xmNm )}, where we wish
to cluster the points {xij } in the ith group. Suppose, moreover, that we view the
groups as related, and we think that clusters discovered in one group might also be
useful in other groups. To achieve this, we define the following hierarchical Dirichlet
process mixture model (HDP-MM):
G0 | γ, H ∼ DP(γ, H)
Gi | α0 , G0 ∼ DP(α0 , G0 ),   i = 1, . . . , m,
θij | Gi ∼ Gi j = 1, . . . , Ni ,
xij | θij ∼ F (xij , θij ) j = 1, . . . , Ni .
This model is shown in graphical form in Figure 2. To see how the model achieves
our goal of sharing clusters across groups, recall that the Dirichlet process clusters
points within a single group by assigning the same parameter vector to those points.
That is, if θij = θij ′ , the points xij and xij ′ are viewed as belonging to the same
cluster. This equality of parameter vectors is possible because both θij and θij ′ are
drawn from Gi , and Gi is a discrete measure. Now if Gi and Gi′ share atoms, as
they do in the HDP-MM, then points in different groups can be assigned to the
same cluster. Thus we can share clusters across groups.
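A truncated numerical sketch of this sharing of atoms (the truncation level, the base measure H = N(0, 1), and the concentration values are illustrative): each group-level measure places its mass on the same K atoms as G0, but with its own weights.

    import numpy as np

    def stick_weights(alpha, K, rng):
        beta = rng.beta(1.0, alpha, size=K)
        return beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))

    def truncated_hdp(gamma, alpha0, m, K=50, rng=None):
        # G0 places weight g0 on K atoms drawn from H; each group-level G_i draws
        # its own atoms *from G0*, so all of its mass ends up on the same shared
        # atoms, only with different weights.
        rng = np.random.default_rng(rng)
        atoms = rng.normal(size=K)
        g0 = stick_weights(gamma, K, rng)
        groups = []
        for _ in range(m):
            w = stick_weights(alpha0, K, rng)
            picks = rng.choice(K, size=K, p=g0 / g0.sum())
            groups.append(np.bincount(picks, weights=w, minlength=K))
        return atoms, g0, groups

    atoms, g0, groups = truncated_hdp(gamma=1.0, alpha0=1.0, m=3)
    print([np.flatnonzero(g > 0.01).tolist() for g in groups])  # overlapping supports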
The HDP was introduced by Teh, Jordan, Beal and Blei [2006] and it has since
appeared as a building block in a variety of applications. One application is to the
class of models known as grade of membership models [Erosheva 2003], an instance
of which is the latent Dirichlet allocation (LDA) model [Blei, Ng, and Jordan 2003].
In these models, each entity is associated not with a single cluster but with a
set of clusters (in LDA terminology, each “document” is associated with a set of
“topics”). To obtain a Bayesian nonparametric version of these models, the DP
does not suffice; rather, the HDP is required. In particular, the topics for the ith
document are drawn from a random measure Gi , and the random measures Gi are
drawn from a DP with a random base measure G0 ; this allows the same topics to
appear in multiple documents.
Another application is to the hidden Markov model (HMM) where the number of
states is unknown a priori. At the core of the HMM is the transition matrix, each
row of which contains the conditional probabilities of transitioning to the “next
state” given the “current state.” Viewing states as clusters, we obtain a set of
clustering problems, one for each row of the transition matrix. Using a DP for each
row, we obtain a model in which the number of next states is open-ended. Using
an HDP to couple these DPs, the same pool of next states is available from each of
the current states. The resulting model is known as the HDP-HMM [Teh, Jordan,
Beal, and Blei 2006]. Marginalizing out the HDP component of this model yields an
urn model that is known as the infinite HMM [Beal, Ghahramani, and Rasmussen
2002].
[Figure 2: Graphical representation of the hierarchical Dirichlet process mixture
model, with global parameters γ and G0 , concentration parameter α0 , group-level
measures Gi , cluster parameters θij , and observations xij .]
Similarly, it is also possible to use the HDP to define an architecture known as
the HDP hidden Markov tree (HDP-HMT), a Markovian tree in which the number
of states at each node in the tree is unknown a priori and the state space is shared
across the nodes. The HDP-HMT has been shown to be useful in image denoising
and scene recognition problems [Kivinen, Sudderth, and Jordan 2007].
Let us also mention that the HDP can be also used to develop a Bayesian non-
parametric approach to probabilistic context free grammars. In particular, the
HDP-PCFG of Liang, Jordan and Klein [2010] involves an HDP-based lexicalized
grammar in which the number of nonterminal symbols is open-ended and inferred
from data (see also Finkel, Grenager and Manning [2007] and Johnson, Griffiths and
Goldwater [2007]). When a new nonterminal symbol is created at some location in
a parse tree, the tying achieved by the HDP makes this symbol available at other
locations in the parse tree.
There are other ways to connect multiple Dirichlet processes. One broadly useful
idea is to use a Dirichlet process to define a distribution on Dirichlet processes.
In particular, let {G∗1 , G∗2 , . . .} be independent draws from a Dirichlet process,
DP(γ, H), and then let G be equal to G∗k with probability πk , where the weights
{πk } are drawn from the stick-breaking process in Eq. (4). This construction (which
can be extended to multiple levels) is known as a nested Dirichlet process [Rodríguez,
Dunson, and Gelfand 2008]. Marginalizing over the Dirichlet processes, the resulting
urn model is known as the nested Chinese restaurant process [Blei, Griffiths, and
Jordan 2010], which is a model that can be viewed as a tree of Chinese restaurants.
A customer enters the tree at a root Chinese restaurant and sits at a table. This
points to another Chinese restaurant, where the customer goes to dine on the fol-
lowing evening. The construction then recurses. Thus a given customer follows a
path through the tree of restaurants, and successive customers tend to follow the
same paths, eventually branching off.
These nested constructions differ from the HDP in that they do not share atoms
among the multiple instances of lower-level DPs. That is, the draws {G∗1 , G∗2 , . . .}
involve disjoint sets of atoms. The higher-level DP involves a choice among these
disjoint sets.
A general discussion of some of these constructions involving multiple DPs and
their relationships to directed graphical model representations can be found in
Welling, Porteous and Bart [2008]. Finally, let us mention the work of MacEach-
ern [1999], whose dependent Dirichlet processes provide a general formalism for
expressing probabilistic dependencies among both the stick-breaking weights and
the atom locations in the stick-breaking representation of the Dirichlet process.
The fact that we obtain an infinite collection of atoms is due to the fact that we
have used a beta density that integrates to infinity. This construction is depicted
graphically in Figure 3.
If we replace the beta density in this construction with other densities (generally
defined on the positive real line rather than the unit interval (0,1)), we obtain
other completely random measures. In particular, we obtain the gamma process
by using an improper gamma density in place of the beta density. The gamma
process provides a natural latent representation for models in which entities are
a binary-valued matrix in which the rows are customers and the columns are the
dishes, and where Zn,k = 1 if customer n samples dish k. Customer n samples
dish k with probability mk /n, where mk is the number of customers who have
previously sampled dish k; that is, Zn,k ∼ Ber(mk /n). (Note that this rule can
be interpreted in terms of classical Bayesian analysis as sampling the predictive
distribution obtained from a sequence of Bernoulli draws based on an improper
beta prior.) Having sampled from the dishes previously sampled by other customers,
customer n then goes on to sample an additional number of new dishes determined
by a draw from a Poiss(α/n) distribution.
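A direct simulation of this generative process (the number of customers and the value of α below are illustrative):

    import numpy as np

    def indian_buffet(n_customers, alpha, rng=None):
        # Z[n, k] = 1 if customer n samples dish k. Each customer first revisits old
        # dishes with probability m_k / n, then tries Poisson(alpha / n) new ones.
        rng = np.random.default_rng(rng)
        taken = []      # taken[n] = list of dish indices chosen by customer n
        counts = []     # counts[k] = m_k
        for n in range(1, n_customers + 1):
            dishes = [k for k, m in enumerate(counts) if rng.random() < m / n]
            new = rng.poisson(alpha / n)
            dishes += list(range(len(counts), len(counts) + new))
            counts += [0] * new
            for k in dishes:
                counts[k] += 1
            taken.append(dishes)
        Z = np.zeros((n_customers, len(counts)), dtype=int)
        for n, dishes in enumerate(taken):
            Z[n, dishes] = 1
        return Z

    print(indian_buffet(6, alpha=2.0, rng=0))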
The connection to the beta process delineated by Thibaux and Jordan [2007] is
as follows (see Teh and Jordan [2010] for an expanded discussion). Dishes in the
IBP correspond to atoms in the beta process, and the independent beta/Bernoulli
updating of the dish probabilities in the IBP reflects the independent nature of
the atoms in the beta process. Moreover, the fact that a Poisson distribution is
adopted for the number of dishes in the IBP reflects the fact that the beta process
is defined in terms of an underlying Poisson process. The exchangeability of the
IBP (which requires considering equivalence classes of matrices if argued directly
on the IBP representation) follows immediately from the beta process construction
(by the conditional independence of the rows of Z given the underlying draw from
the beta process).
It is also possible to define hierarchical beta processes for models involving mul-
tiple beta processes that are tied in some manner [Thibaux and Jordan 2007]. This
is done by simply letting the base measure for the beta process itself be drawn from
the beta process:
B0 ∼ BP(c0 , B00 )
B ∼ BP(c, B0 ),
where BP(c, B0 ) denotes the beta process with concentration parameter c and base
measure B0 . This construction can be used in a manner akin to the hierarchical
Dirichlet process; for example, we can use it to model groups of entities that are
described by sparse binary vectors, where we wish to share the sparsity pattern
among groups.
8 Conclusions
Judea Pearl’s work on probabilistic graphical models yielded a formalism that was
significantly more expressive than existing probabilistic representations in AI, yet
retained enough mathematical structure that it was possible to design efficient
computational procedures for a wide class of useful models. In this short article,
we have argued that Bayesian nonparametrics provides a framework in which this
agenda can be taken further. By replacing the traditional parametric prior distri-
butions of Bayesian analysis with stochastic processes, we obtain a rich vocabulary,
References
Aldous, D. (1985). Exchangeability and related topics. In Ecole d’Eté de Proba-
bilités de Saint-Flour XIII–1983, pp. 1–198. Springer, Berlin.
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to
Bayesian nonparametric problems. Annals of Statistics 2, 1152–1174.
Beal, M. J., Z. Ghahramani, and C. E. Rasmussen (2002). The infinite hidden
Markov model. In Advances in Neural Information Processing Systems, Vol-
ume 14, Cambridge, MA. MIT Press.
Liang, P., M. I. Jordan, and D. Klein (2010). Probabilistic grammars and hier-
archical Dirichlet processes. In The Handbook of Applied Bayesian Analysis,
Oxford, UK. Oxford University Press.
Lo, A. (1984). On a class of Bayesian nonparametric estimates: I. Density esti-
mates. Annals of Statistics 12, 351–357.
MacEachern, S. (1999). Dependent nonparametric processes. In Proceedings of
the Section on Bayesian Statistical Science. American Statistical Association.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture
models. Journal of Computational and Graphical Statistics 9, 249–265.
Pitman, J. (2002). Combinatorial stochastic processes. Technical Report 621,
Department of Statistics, University of California at Berkeley.
Rodríguez, A., D. B. Dunson, and A. E. Gelfand (2008). The nested Dirichlet
process. Journal of the American Statistical Association 103, 1131–1154.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica
Sinica 4, 639–650.
Teh, Y. W. and M. I. Jordan (2010). Hierarchical Bayesian nonparametric mod-
els with applications. In Bayesian Nonparametrics: Principles and Practice.
Cambridge, UK: Cambridge University Press.
Teh, Y. W., M. I. Jordan, M. J. Beal, and D. M. Blei (2006). Hierarchical Dirichlet
processes. Journal of the American Statistical Association 101, 1566–1581.
Thibaux, R. and M. I. Jordan (2007). Hierarchical beta processes and the Indian
buffet process. In Proceedings of the International Workshop on Artificial In-
telligence and Statistics, Volume 11, San Juan, Puerto Rico.
Welling, M., I. Porteous, and E. Bart (2008). Infinite state Bayesian networks for
structured domains. In Advances in Neural Information Processing Systems,
Volume 20, Cambridge, MA. MIT Press.
11
Graphical Models for Economics
Michael Kearns
Judea Pearl’s tremendous influence on the fields of artificial intelligence and ma-
chine learning began with the fundamental insight that much of traditional statis-
tical modeling lacked expressive means for articulating known or learned structure
and relationships between probabilistic entities. Judea and his early colleagues
focused their efforts on a type of structure that proved to be particularly impor-
tant — namely network structure, or the graph-theoretic structure that arises from
pairwise influences between random variables. Judea’s legacy includes not only the
introduction of Bayesian networks — perhaps the most important class of proba-
bilistic graphical models — but a rich series of results establishing firm semantics
for inference, independence and causality, and efficient algorithms and heuristics for
fundamental probabilistic computations. His body of work is one of those rare in-
stances in which the contributions range from the most conceptual and philosophical
to the eminently practical.
Inspired by the program established by Judea for statistical models, about a
decade ago a number of us became intrigued by the possibility of replicating it in
the domains of strategic, economic and game-theoretic modeling. At its highest
level, the proposed metaphor was both simple and natural. Rather than a large
number of random variables related by a joint distribution, imagine we have a
large number of players in a (normal-form) game. Instead of the edges of a network
representing direct probabilistic influences between random variables, they represent
direct influences on payoffs by the actions of neighboring players. As opposed to
being concerned primarily with conditional inferences on the joint distribution, we
are interested in the computation of Nash and other types of equilibrium for the
game. As with probabilistic graphical models, although the network succinctly
articulates only local influences, in the game-theoretic setting, at equilibrium there
are certainly global influences and coordination via the propagation of local effects.
And finally, if we were lucky, we might hope to capture for game theory some of
the algorithmic benefits that models like Bayesian networks brought to statistical
modeling.
The early work following this metaphor was broadly successful in its goals. The
first models proposed, which included graphical games [Kearns, Littman, and Singh
2001; Kearns 2007] and Multi-Agent Influence Diagrams [Koller and Milch 2003;
Vickrey and Koller 2002], provided succinct languages for expressing strategic struc-
ture in the form of networks over the players. The NashProp algorithm for com-
puting (approximate) Nash equilibria in graphical games was the strategic analogue
of the belief propagation algorithm developed by Judea and others, and like that
algorithm it came both in a provably efficient form for restricted network topologies
and in a more heuristic but more general form for “loopy” or highly cyclical struc-
tures [Ortiz and Kearns 2003]. There are also works carefully relating probabilistic
and game-theoretic graphical models in interesting ways, as in a result showing
that the distributions forming the correlated equilibria of a graphical game can be
succinctly represented by a (probabilistic) Markov network using (almost) the same
underlying graph structure [Kakade, Kearns, Langford, and Ortiz 2003]. Graphical
games have also played an important role in some recent complexity-theoretic work,
most notably the breakthrough proof establishing that the problem of computing
Nash equilibria in general games for even 2 players is PPAD-complete and thus
potentially intractable [Daskalakis, Goldberg, and Papadimitriou 2006].
In short, we now have a rather rich set of network-based models for game the-
ory, and a firm understanding of their semantic and algorithmic properties. The
execution of this agenda relied on Judea’s work in many places for inspiration and
guidance, from the very conception of the models studied to the usage of cutset
conditioning and distributed dynamic programming techniques in the development
of NashProp and its variants.
Encouraged by this success, more recent works have sought to expand its scope
to include more specifically economic models, developing networked variants of
the classical exchange economies studied by Arrow and Debreu, Fisher, and oth-
ers [Kakade, Kearns, and Ortiz 2004]. Now network structure represents permissible
trading partners or relationships, and again the primary solution concept of interest
is an equilibrium — but now an equilibrium in prices or exchange rates that permits
self-interested traders to clear the market in all goods. While, as to be expected,
there are different technical details, we can again establish the algorithmic benefits
of such models in the form of a price propagation algorithm for computing an ap-
proximate equilibrium. Perhaps more interesting are examinations of how network
topology and equilibrium properties interact. It is worth noting that for probabilis-
tic graphical models such as Bayesian networks, the question of what the “typical”
structure looks like is somewhat nonsensical — the reply might be that there is no
“typical” structure, and topology will depend highly on the domain (whether it be
machine vision, medical diagnosis, and so on). In contrast, the emerging literature
on social and economic networks is indeed beginning to establish at least broad
topological features that arise frequently in empirical networks. This invites, for
example, results establishing that if the network structure exhibits a heavy-tailed
distribution of connectivity (degrees), agent wealths at equilibrium will also be
distributed in highly unequal fashion [Kakade, Kearns, Ortiz, Pemantle, and Suri
2005]. Thus social network structure may be (just one) explanation for observed
disparities in wealth.
The lines of research sketched above continue to grow and deepen, and have
become one of the many topics of mutual interest between computer scientists,
economists and sociologists. Those of us who were exposed to and inspired by
Judea’s work in probabilistic graphical models were indeed most fortunate to have
had the opportunity to help initiate a fundamental and interdisciplinary subject
only shortly before social, economic and technological network structure became a
topic of such general interest.
Thank you Judea!
References
Daskalakis, C., P. Goldberg, and C. Papadimitriou (2006). The complexity of
computing a Nash equilibrium. In Proceedings of the Thirty-Eighth ACM Sym-
posium on the Theory of Computing, pp. 71–78. ACM Press.
Kakade, S., M. Kearns, J. Langford, and L. Ortiz (2003). Correlated equilibria
in graphical games. In Proceedings of the 4th ACM Conference on Electronic
Commerce, pp. 42–47. ACM Press.
Kakade, S., M. Kearns, and L. Ortiz (2004). Graphical economics. In Proceed-
ings of the 17th Annual Conference on Learning Theory, pp. 17–32. Springer
Berlin.
Kakade, S., M. Kearns, L. Ortiz, R. Pemantle, and S. Suri (2005). Economic
properties of social networks. In L. Saul, Y. Weiss, and L. Bottou (Eds.),
Advances in Neural Information Processing Systems 17, pp. 633–640. MIT
Press.
Kearns, M. (2007). Graphical games. In N. Nisan, T. Roughgarden, É. Tardos, and
V. Vazirani (Eds.), Algorithmic Game Theory. Cambridge University Press.
Kearns, M., M. Littman, and S. Singh (2001). Graphical models for game theory.
In Proceedings of the 17th Annual Conference on Uncertainty in Artificial
Intelligence, pp. 253–260. Morgan Kaufmann.
Koller, D. and B. Milch (2003). Multi-agent influence diagrams for representing
and solving games. Games and Economic Behavior 45 (1), 181–221.
Ortiz, L. and M. Kearns (2003). Nash propagation for loopy graphical games. In
S. Becker, S. Thrun, and K. Obermayer (Eds.), Advances in Neural Informa-
tion Processing Systems 15, pp. 793–800. MIT Press.
Vickrey, D. and D. Koller (2002). Multi-agent algorithms for solving graphical
games. In Proceedings of the 18th National Conference on Artificial Intelli-
gence, pp. 345–351. AAAI Press.
12
Belief Propagation in Loopy Graphs
Daphne Koller
tional probability queries, where we wish to infer the probability distribution over
some (small) subset of variables given evidence concerning some of the others; for
example, in the medical diagnosis setting, we might want to infer the distribution
over each possible disease given observations about the patient’s predisposing fac-
tors, symptoms, and some test results. A second common type of query is the
maximum a posteriori (or MAP ) query, where we wish to find the most likely joint
assignment to all of our random variables; for example, we often wish to find the
most likely joint segmentation to all of the pixels in an image.
In general, it is not difficult to show that both of these inference problems are
NP-hard [Cooper 1990; Shimony 1994], yet (as always) this is not the end of the story.
In their seminal paper, Kim and Pearl [1983] presented an algorithm that passes
messages between the nodes in the Bayesian network graph to propagate beliefs
between them. The algorithm was developed in the context of singly connected
directed graphs, also known as polytrees, where there is at most one path (ignoring
edge directionality) between each pair of nodes. In this case, the message passing
process produces correct posterior beliefs for each node in the graph.
Pearl also considered what happens when the algorithm is executed (without
change) over a loopy (multiply connected) graph. In his seminal book, Pearl [1988]
says:
When loops are present, the network is no longer singly connected and
local propagation schemes will invariably run into trouble . . . If we ignore
the existence of loops and permit the nodes to continue communicating
with each other as if the network were singly connected, messages may
circulate indefinitely around the loops and the process may not con-
verge to a stable equilibrium . . . Such oscillations do not normally occur
in probabilistic networks . . . which tend to bring all messages to some
stable equilibrium as time goes on. However, this asymptotic equilib-
rium is not coherent, in the sense that it does not represent the posterior
probabilities of all nodes of the networks.
As a consequence of these problems, the idea of loopy belief propagation was largely
abandoned for many years.
The revival of this approach is surprisingly due to a seemingly unrelated advance
in coding theory. The area of coding addresses the problem of sending a message over
a noisy channel and recovering it from the garbled result. We send a k-bit message,
redundantly coded using n bits. These n bits are sent over the noisy channel, so the
received bits are possibly corrupted. The decoding task is to recover the original
message from the bits received. The bit error rate is the probability that a bit is
ultimately decoded incorrectly. This error rate depends on the code and decoding
algorithm used and on the amount of noise in the channel. The rate of a code is
k/n — the ratio between the number of bits in the message and the number of bits
used to transmit it. In 1948, Claude Shannon provided a theoretical analysis of
the coding problem [Shannon 1948]. For a given rate, Shannon provided an upper
bound on the maximum noise level that can be tolerated while still achieving a
certain bit error rate, no matter which code is used. Shannon also showed that
there exist channel codes that achieve this limit, but his proof was nonconstructive
— he did not present practical encoders and decoders that achieve this limit.
Since Shannon’s landmark result, multiple codes were suggested. However, de-
spite a gradual improvement in the quality of the code (bit-error rate for a given
noise level), none of the codes even came close to the Shannon limit. The big
breakthrough came in the early 1990s, when Berrou, Glavieux, and Thitimajshima
[1993] came up with a new scheme that they called a turbocode, which, empirically,
came much closer to achieving the Shannon limit than any other code proposed up
to that point. However, their decoding algorithm had no theoretical justification,
and, while it seemed to work well in real examples, could be made to diverge or
converge to the wrong answer. The second big breakthrough was the subsequent
realization, due to McEliece, MacKay, and Cheng [1998] and Frey and MacKay
[1997] that the turbocoding procedure was simply performing loopy belief propaga-
tion message passing on a Bayesian network representing the probability model for
the code and the channel noise!
This revelation had a tremendous impact on both the coding theory community
and the graphical models community. For the former, loopy belief propagation
provides a general-purpose algorithm for decoding a large family of codes. By sep-
arating the algorithmic question of decoding from the question of the code design,
it allowed the development of many new coding schemes with improved properties.
These codes have come much, much closer to the Shannon limit than any previous
codes, and they have revolutionized both the theory and the practice of coding.
For the graphical models community, it was the astounding success of loopy belief
propagation for this application that led to the resurgence of interest in these ap-
proaches. Subsequent work showed that this algorithm works very well in practice
on a broad range of other problems (see, for example, Weiss [1996] and Murphy,
Weiss, and Jordan [1999]), leading to a large amount of work on this topic. In this
short paper, we review only some of the key ideas underlying this important class
of methods; see section 6 for some discussion and further references.
2 Background
2.1 Probabilistic Graphical Models
Probabilistic graphical models are a general family of representations for probability
distributions over high-dimensional spaces. Specifically, our goal is to encode a joint
probability distribution over the possible assignments to a set of random variables
X = {X1 , . . . , Xn }. We focus on the discrete setting, where each random variable
Xi takes values in some set Val(Xi ). In this case, the number of possible assignments
grows exponentially with the number of variables n, making an explicit enumeration
of the joint distribution infeasible.
set of factors, the partition function is guaranteed to be 1, and so we can now say
that a distribution P factorizes over G if it can be written as:
P (X1 , . . . , Xn ) = ∏_i P (Xi | PaGXi ),
We now use the clique tree structure to pass messages. The message from C i
to another clique C j is computed using the following sum-product message passing
operation:
δi→j = Σ_{Ci − Si,j} ( πi0 × ∏_{k∈(Ni −{j})} δk→i ).        (2)
In words, the clique C i multiplies all incoming messages from its other neighbors
with its initial clique potential, resulting in a factor ψ whose scope is the clique.
It then sums out all variables except those in the sepset between C i and C j , and
sends the resulting factor as a message to C j .
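A minimal sketch of this message computation over small table factors (the Factor class, the variable names, and the potentials below are illustrative constructions, not part of the text's formalism):

    import numpy as np

    class Factor:
        # A table factor over named binary variables (binary only to keep the sketch short).
        def __init__(self, variables, table):
            self.variables = list(variables)
            self.table = np.asarray(table, dtype=float)

        def _aligned(self, out_vars):
            # Reorder our axes to match out_vars, inserting length-1 axes for the
            # variables we do not mention, so that broadcasting lines up.
            perm = [self.variables.index(v) for v in out_vars if v in self.variables]
            shape = [2 if v in self.variables else 1 for v in out_vars]
            return np.transpose(self.table, perm).reshape(shape)

        def __mul__(self, other):
            out_vars = self.variables + [v for v in other.variables if v not in self.variables]
            return Factor(out_vars, self._aligned(out_vars) * other._aligned(out_vars))

        def marginalize_to(self, keep):
            axes = tuple(i for i, v in enumerate(self.variables) if v not in keep)
            return Factor([v for v in self.variables if v in keep], self.table.sum(axis=axes))

    def message(initial_potential, incoming, sepset):
        # Multiply the initial clique potential with every incoming message except
        # the recipient's, then sum out all variables outside the sepset (Eq. (2)).
        prod = initial_potential
        for delta in incoming:
            prod = prod * delta
        return prod.marginalize_to(sepset)

    # Clique {B, C, D} sends a message over the sepset {B, D}.
    pi0 = Factor(['B', 'C', 'D'], np.full((2, 2, 2), 0.5))
    delta_in = Factor(['B', 'C'], [[0.9, 0.1], [0.2, 0.8]])
    print(message(pi0, [delta_in], ['B', 'D']).table)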
This computation can be scheduled in a variety of ways. Most generally, we say
that C i is ready to transmit a message to C j when C i has messages from all of its
neighbors except from C j . In such a setting, C i can compute the message δi→j (S i,j )
by multiplying its initial potential with all of its incoming messages except the one
from C j , and then eliminate the variables in C i − S i,j . Although the algorithm is
This algorithm, when applied to a clique tree that satisfies the family preservation
and running intersection property, computes messages and beliefs repesenting well-
defined expressions. In particular, we can show that the message passed from C i
to C j is the product of all the factors in F≺(i→j) , marginalized over the variables
in the sepset (that is, summing out all the others):
δi→j (Si,j ) = Σ_{V≺(i→j)} ∏_{φ∈F≺(i→j)} φ .
It then follows that, when the algorithm terminates, we have, for each clique i
πi [Ci ] = Σ_{X −Ci} P̃Φ (X ),        (3)
that is, the value of the unnormalized measure P̃Φ , marginalized over the variables
in C i .
We note that this expression holds for all cliques; thus, in one upward-downward
pass of the algorithm, we obtain all of the marginals of all of the cliques in the
network, from which we can also obtain the marginals over all variables: to compute
the marginal probability over a particular variable X, we can select a clique whose
scope contains X, and marginalize all variables other than X. This capability is
very valuable in many applications; for example, in a medical-diagnosis setting, we
generally want the probability of several possible diseases.
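As a concrete illustration of this last step, a calibrated clique belief stored as a table can be reduced to a single-variable marginal by summing over the other axes (the clique, the values, and the variable ordering below are illustrative):

    import numpy as np

    # A calibrated belief over the clique {A, B, C}, stored with one axis per variable.
    belief = np.array([[[0.05, 0.10], [0.15, 0.05]],
                       [[0.20, 0.10], [0.25, 0.10]]])
    # Marginal over B: sum out A (axis 0) and C (axis 2), then renormalize.
    p_b = belief.sum(axis=(0, 2))
    print(p_b / p_b.sum())   # -> [0.45 0.55]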
An important consequence of (3) is that we obtain the same marginal distribution
over X regardless of the clique from which we extracted it. More generally, for any
two adjacent cliques C i and C j , we must have that
Σ_{Ci − Si,j} πi [Ci ] = Σ_{Cj − Si,j} πj [Cj ].
[Figure 1: Two generalized cluster graphs over the clusters 1: {A, B, C}, 2: {B, C, D},
3: {B, D, F}, 4: {B, E}, 5: {D, E}, with edges labeled by their sepsets; in (a) the edge
between clusters 1 and 2 carries only C, whereas in (b) it carries B, C.]
the clique tree message passing algorithm we described is precisely Pearl’s belief
propagation algorithm. The more general case of this particular algorithm was
developed by Shafer and Shenoy [1990], who described it in a much broader form
that applies to many factored models other than probabilistic graphical models.
An alternative but ultimately equivalent message passing scheme (which uses a
sum-product-divide sequence for each message passing step) was developed in
parallel, in a series of papers by Lauritzen and Spiegelhalter [1988] and Jensen,
Olesen, and Andersen [1990].
about X to flow between all clusters that contain it, so that, in a calibrated cluster
graph, all clusters must agree about the marginal distribution of X. The fact
that there is at most one path prevents loops in the cluster graph where all of the
clusters contain X. In graphs that contain such loops, a message passing algorithm
can propagate information about X endlessly around the loop, making the beliefs
more extreme due to “cyclic arguments.”
Importantly, however, since the graph is not necessarily a tree, the same pair of
clusters might also be connected by other paths. For example, in the cluster graph
of figure 1a, we see that the edges labeled with B form a subtree that spans all
the clusters that contain B. However, there are loops in the graph. For example,
there are two paths from C 3 = {B, D, F } to C 2 = {B, C, D}. The first, through
C 4 , propagates information about B, and the second, through C 5 , propagates
information about D. Thus, we can still get circular reasoning, albeit less directly
than we would in a graph that did not satisfy the running intersection property.
Note that while in the case of trees the definition of running intersection implied
that S i,j = C i ∩ C j , in a graph this equality is no longer enforced by the running
intersection property. For example, cliques C 1 and C 2 in figure 1a have B in
common, but S 1,2 = {C}.
We note that there are many possible choices for the cluster graph, and the
decision on which to use can make a significant difference to the algorithm. In
particular, different graphs can lead to very different computational cost, different
convergence behavior and even different answers.
EXAMPLE 1. Consider, for example, the cluster graphs U1 and U2 of figure 1a and
figure 1b. Both are fairly similar, yet in U2 the edge between C 1 and C 2 involves
the marginal distribution over B and C. On the other hand, in U1 , we propagate
the marginal only over C. Intuitively, we expect inference in U2 to better capture
the dependencies between B and C. For example, assume that the potential of
C 1 introduces strong correlations between B and C (say B = C). In U2 , this
correlation is conveyed to C 2 directly. In U1 , the marginal on C is conveyed on
the edge (1–2), while the marginal on B is conveyed through C 4 . In this case, the
strong dependency between the two variables is lost. In particular, if the marginal
on C is diffuse (close to uniform), then the message C 1 sends to C 4 will also have
a uniform distribution on B, and from C 2 ’s perspective the messages on B and C
will appear as two independent variables.
One class of networks for which a simple cluster graph construction exists is the
class of pairwise Markov networks. In these networks, we have a univariate potential
φi [Xi ] over each variable Xi , and in addition a pairwise potential φ(i,j) [Xi , Xj ] over
some pairs of variables. These pairwise potentials correspond to edges in the Markov
network. Many problems are naturally formulated as pairwise Markov networks,
such as the grid networks common in computer vision applications. Indeed, if we
are willing to transform our variables, any distribution can be reformulated as a
199
Procedure CGraph-SP-Calibrate (
    Φ,   // Set of factors
    U    // Generalized cluster graph for Φ
)
1   for each cluster C i
2       πi0 ← ∏_{φ : α(φ)=i} φ
3   for each edge (i–j) ∈ EU
4       δi→j ← 1; δj→i ← 1
5
6   while graph is not calibrated
7       Select (i–j) ∈ EU
8       δi→j (Si,j ) ← Σ_{Ci − Si,j} πi0 × ∏_{k∈(Ni −{j})} δk→i
9
10  for each clique i
11      πi ← πi0 × ∏_{k∈Ni} δk→i
12  return {πi }
then pass messages to their neighbors, summarizing the current beliefs derived from
their own initial potentials and from the messages received from their neighbors. The
algorithm is shown in figure 2. Convergence is achieved when the cluster graph is
calibrated ; that is, if for each edge (i–j), connecting the clusters C i and C j , we
have that
Σ_{Ci − Si,j} πi = Σ_{Cj − Si,j} πj .
Note that this definition is weaker than cluster tree calibration, since the clusters do
not necessarily agree on the joint marginal of all the variables they have in common,
but only on those variables in the sepset. However, if a calibrated cluster graph
satisfies the running intersection property, then the marginal of a variable X is
identical in all the clusters that contain it. This algorithm clearly generalizes the
clique-tree message-passing algorithm described earlier.
EXAMPLE 2. With this framework in hand, we can now revisit the message de-
coding task. Assume that we wish to send a k-bit message u1 , . . . , uk . We code
the message using a number of bits x1 , . . . , xn , which are then sent over the noisy
channel, resulting in a set of (possibly corrupted) outputs y1 , . . . , yn . The message
decoding task is to recover an estimate û1 , . . . , ûk from y1 , . . . , yn . We first observe
that message decoding can easily be reformulated as a probabilistic inference task:
We have a prior over the message bits U = ⟨U1 , . . . , Uk ⟩, a (usually deterministic)
function that defines how a message is converted into a sequence of transmitted
bits X1 , . . . , Xn , and another (stochastic) model that defines how the channel ran-
domly corrupts the Xi ’s to produce Yi ’s.
[Figure 3: Bayesian networks for message decoding: (a) a code relating message bits Ui ,
transmitted bits Xi , and received bits Yi ; (b) the turbocode network, with state variables
Wi , Zi and a permuter.]
The decoding task can then be viewed
as finding the most likely joint assignment to U given the observed message bits
y = ⟨y1 , . . . , yn ⟩, or (alternatively) as finding the posterior P (Ui | y) for each bit
Ui . The first task is a MAP inference task, and the second task one of computing
posterior probabilities. Unfortunately, the probability distribution is of high dimen-
sion, and the network structure of the associated graphical model is quite densely
connected, with many loops.
The turbocode approach, as first proposed, comprised both a particular coding
scheme, and the use of a message passing algorithm to decode it. The coding
scheme transmits two sets of bits: one set comprises the original message bits X a =
⟨X1a , . . . , Xka ⟩ = u, and the second some set X b = ⟨X1b , . . . , Xkb ⟩ of transformed bits
(like the parity check bits, but more complicated). The received bits then can also
be partitioned into the noisy y a , y b . Importantly, the code is designed so that the
message can be decoded (albeit with errors) using either y a or y b . The turbocoding
algorithm then works as follows: It uses the model of X a (trivial in this case)
and of the channel noise to compute a posterior probability over U given y a . It
then uses that posterior πa (U1 ), . . . , πa (Uk ) as a prior over U and computes a new
posterior over U , using the model for X b and the channel, and y b as the evidence,
to compute a new posterior πb (U1 ), . . . , πb (Uk ). The “new information,” which is
πb (Ui )/πa (Ui ), is then transmitted back to the first decoder, and the process repeats
until a stopping criterion is reached. In effect, the turbocoding idea was to use two
weak coding schemes, but to “turbocharge” them using a feedback loop. Each
decoder is used to decode one subset of received bits, generating a more informed
distribution over the message bits to be subsequently updated by the other. The
specific method proposed used a particular coding scheme for the X b bits, illustrated
in figure 3b.
This process looked a lot like black magic, and in the beginning, many people
did not even believe that the algorithm worked. However, when its empirical success
was demonstrated conclusively, an attempt was made to understand its theoretical
properties. McEliece, MacKay, and Cheng [1998] and
Frey and MacKay [1997] subsequently showed that the specific message passing
procedure proposed by Berrou et al. is precisely an application of belief propagation
(with a particular message passing schedule) to the Bayesian network representing
the turbocode (as in figure 3b).
we are not interested in individual beliefs, but rather in some aggregate over the
entire network, for example, in a learning setting.
A second observation is that nonconvergence is often due to oscillations in the
beliefs. As proposed by Murphy, Weiss, and Jordan [1999] and Heskes [2002], we
can dampen the oscillations by reducing the difference between two subsequent
updates. In particular, we can replace the update rule in (2) by a smoothed version
that averages the update δi→j with the previous message between the two cliques:
δi→j ← λ ( Σ_{Ci − Si,j} πi0 × ∏_{k≠j} δk→i ) + (1 − λ) δ^old_{i→j} ,        (4)
where λ is the damping weight and δ^old_{i→j} is the previous value of the message. When
λ = 1, this update is equivalent to standard belief propagation. For 0 < λ < 1,
the update is partial and although it shifts πj toward agreement with πi , it leaves
some momentum for the old value of the belief, a dampening effect that in turn
reduces the fluctuations in the beliefs. It turns out that this smoothed update rule
is “equivalent” to the original update rule, in that a set of beliefs is a convergence
point of the smoothed update if and only if it is a convergence point of standard
updates. Moreover, one can show that, if run from a point close enough to a
stable convergence point of the algorithm, with a sufficiently small λ, this smoothed
update rule is guaranteed to converge. Of course, this guarantee is not very useful
in practice, but there are indeed many cases where the smoothed update rule is
convergent, whereas the original update rule oscillates indefinitely (see figure 4).
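A minimal sketch of the damping rule in (4), treating each message as a vector of values (the messages and the value of λ shown are illustrative):

    import numpy as np

    def damped_update(delta_new, delta_old, lam=0.5):
        # Eq. (4): convex combination of the freshly computed message and its
        # previous value; lam = 1 recovers the standard (undamped) update.
        return lam * np.asarray(delta_new, dtype=float) + (1.0 - lam) * np.asarray(delta_old, dtype=float)

    print(damped_update([0.9, 0.1], [0.3, 0.7]))   # -> [0.6 0.4]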
A broader-spectrum heuristic, which plays an important role not only in en-
suring convergence but also in speeding it up considerably, is intelligent message
scheduling. The simplest and perhaps most natural approach is to implement BP
message passing as a synchronous algorithm, where all messages are updated at
once. Asynchronous message passing updates messages one at a time, using the
most recent version of the incoming messages to generate the outgoing message. It
turns out that, in most cases, the synchronous schedule is far from optimal, both in
terms of reaching convergence, and in the number of messages required for conver-
gence. As one simple example, consider a cluster graph with m edges and diameter
d: synchronous message passing requires m(d − 1) messages to pass information
from one side of the graph to the other. By contrast, asynchronous message pass-
ing, appropriately scheduled, can pass information between two clusters at opposite
ends of the graph using d − 1 messages. Moreover, the fact that, in synchronous
message passing, each cluster uses messages from its neighbors that are based on
their previous beliefs appears to increase the chances of oscillatory behavior and
nonconvergence in general.
In practice, an asynchronous message passing schedule works significantly better
than the synchronous approach (see figure 4). Moreover, even greater improvements
can be obtained by scheduling messages in a guided way. One approach, called tree
[Figure 4: Synchronous vs. asynchronous belief propagation: the percentage of converged
messages over time, and the trajectories of several single-variable marginals, e.g.,
P (X7 = 0), P (X10 = 0), P (X17 = 0), P (X115 = 0), over time.]
The remaining panels illustrate the progression of the marginal beliefs over the
course of the algorithm. (b) shows a marginal where both the synchronous and
asynchronous updates converge quite rapidly and are close to the true marginal (thin
solid black). Such behavior is atypical, and it comprises only around 10 percent
of the marginals in this example. In the vast majority of the cases (almost 80
percent in this example), the synchronous beliefs oscillate around the asynchronous
ones ((c)–(e)). In many cases, such as the ones shown in (e), the entropy of the
synchronous beliefs is quite significant. For about 10 percent of the marginals (for
example (f)), both the asynchronous and synchronous marginals are inaccurate. In
these cases, using more informed message schedules can significantly improve the
algorithm's performance.
These qualitative differences between the BP variants are quite consistent across
many random and real-life models. Typically, the more complex the inference prob-
lem, the larger the gaps in performance. For very complex real-life networks in-
As for sum-product message passing, the algorithm will converge after a single
upward and downward pass. After those steps, the resulting clique tree T will
contain the appropriate max-marginal in every clique. In particular, for each clique
Ci and each assignment ci to Ci , the clique belief contains the (unnormalized)
measure P̃Φ (x) of the most likely assignment x consistent with ci . Note that,
because the max-product message passing process does not compute
the partition function, we cannot derive from these max-marginals the actual prob-
ability of any assignment; however, because the partition function is a constant,
we can still compare the values associated with different assignments, and therefore
compute the assignment x that maximizes P̃Φ (x).
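For reference, the max-marginal property described above can be written (with βi denoting the belief of clique Ci, a notational assumption made here since the original displayed equation is not reproduced) as

    βi (ci ) = max{ P̃Φ (x) : x consistent with ci }.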
Because max-product message passing over a clique tree produces max-marginals
in every clique, and because max-marginals must agree, it follows that any two
adjacent cliques must agree on their sepset; in this case, the clusters are said to be
max-calibrated. We say that a clique tree is
max-calibrated if all pairs of adjacent cliques are max-calibrated.
The same transformation from sum-product to max-product can be applied to
the case of loopy belief propagation. Here, the algorithm is the same as in figure 2,
except that we replace the sum-product message computation with the max-product
computation of (6). As for sum-product, there are no guarantees that this algorithm
will converge. Indeed, in practice, it tends to converge somewhat less often than
the sum-product algorithm, perhaps because the averaging effect of the summation
operation tends to smooth out messages, and reduce oscillations. The same ideas
that we discussed in section 4.3 can be used to improve convergence in this algorithm
as well.
At convergence, the result will be a set of calibrated clusters: As for sum-product,
if the clusters are not calibrated, convergence has not been achieved, and the algo-
rithm will continue iterating. However, the resulting beliefs will not generally be
the exact max-marginals; these beliefs are often called pseudo-max-marginals.
This condition prevents symmetric cases like the one in the preceding example.
Indeed, it is not difficult to show that the following two conditions are equivalent:
For generic probability measures, the assumption of unambiguity is not overly strin-
gent, since we can always break ties by introducing a slight random perturbation
into all of the factors, making all of the elements in the joint distribution have
slightly different probabilities. However, if the distribution has special structure —
deterministic relationships or shared parameters — that we want to preserve, this
type of ambiguity may be unavoidable.
The situation where there are ties in the node beliefs is more complex. In this
case, we say that an assignment x∗ has the local optimality property if, for each
cluster Ci in the tree, the assignment to Ci in x∗ optimizes the Ci belief. The task of finding
a locally optimal assignment x∗ given a max-calibrated set of beliefs is called the
decoding task.
Importantly, for approximate max-marginals derived from loopy belief propaga-
tion, a locally optimal joint assignment may not exist:
EXAMPLE 4. Consider a cluster graph with the three clusters {A, B}, {B, C}, {A, C}
and the beliefs
   {A, B}:  a1  a0        {B, C}:  b1  b0        {A, C}:  a1  a0
        b1    1   2             c1    1   2             c1    1   2
        b0    2   1             c0    2   1             c0    2   1
These beliefs are max-calibrated, in that all messages are (2, 2). However, there is
no joint assignment that maximizes all of the cluster beliefs simultaneously. For
example, if we select a0 , b1 , we maximize the value in the A, B belief. We can
now select c0 to maximize the value in the B, C belief. However, we now have a
nonmaximizing assignment a0 , c0 in the A, C belief. No matter which assignment
of values we select in this example, we do not obtain a single joint assignment that
maximizes all three beliefs. Loops such as this are often called frustrated.
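The frustration is easy to verify by brute force; the sketch below encodes a1 = 1, a0 = 0 (and likewise for B and C), so each pairwise belief above equals 2 when its two variables differ and 1 when they agree.

    # Enumerate all joint assignments of A, B, C and check whether any of them attains
    # the maximal value 2 in all three cluster beliefs simultaneously.
    def belief(u, v):
        return 2 if u != v else 1

    best = max(belief(a, b) + belief(b, c) + belief(a, c)
               for a in (0, 1) for b in (0, 1) for c in (0, 1))
    print(best)  # prints 5 < 3 * 2: no joint assignment maximizes all three beliefs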
How do we find a locally optimal joint assignment, if one exists? Recall from the
definition that an assignment is locally optimal if and only if it selects one of the
optimizing assignments in every single cluster. Thus, we can essentially label the
assignments in each cluster as either “legal” if they optimize the belief or “illegal” if
they do not. We now must search for an assignment to X that results in a legal value
for each cluster. This problem is precisely an instance of a constraint satisfaction
problem (CSP). A constraint satisfaction problem can be defined in terms of a
Markov network (or factor graph) where all of the entries in the beliefs are either 0
or 1. The CSP problem is now one of finding an assignment whose (unnormalized)
measure is 1, if one exists, and otherwise reporting failure. In other words, the
CSP problem is simply that of finding the MAP assignment in this model with
{0, 1}-valued beliefs. The field of CSP algorithms is a large one, and a detailed
treatment is outside the scope of this paper; see Dechter [2003] for a survey.
Interestingly, it is an area to which Pearl also made important early contributions
[Dechter and Pearl 1987]. Recent work has reinvigorated this trajectory, studying
the surprisingly deep connections between CSP methods and belief propagation, and
exploiting them (for example, within the context of the survey propagation algorithm
[Maneva, Mossel, and Wainwright 2007]).
Thus, given a max-product calibrated cluster graph, we can convert it to a
discrete-valued CSP by simply taking the belief in each cluster, changing each as-
signment that locally optimizes the belief to 1 and all other assignments to 0. We
then run some CSP solution method. If the outcome is an assignment that achieves
1 in every belief, this assignment is guaranteed to be a locally optimal assignment.
Otherwise, there is no locally optimal assignment. Importantly, as we discuss below,
for the case of calibrated clique trees, we are guaranteed that this approach finds a
globally optimal assignment.
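A direct, if naive, instantiation of this reduction is sketched below: the locally optimizing assignments of each cluster become the legal (value-1) entries of a CSP, which is then solved by exhaustive search; a real CSP solver would replace the final loop, and the dictionary conventions are assumptions of the sketch.

    from itertools import product

    # beliefs: {cluster (tuple of variables): {assignment tuple: belief value}}
    # domains: {variable: tuple of possible values}
    def decode_locally_optimal(beliefs, domains):
        legal = {}
        for cluster, table in beliefs.items():
            m = max(table.values())
            legal[cluster] = {a for a, v in table.items() if v == m}  # 1-valued CSP entries
        variables = sorted(domains)
        for values in product(*(domains[v] for v in variables)):
            x = dict(zip(variables, values))
            if all(tuple(x[v] for v in cluster) in ok for cluster, ok in legal.items()):
                return x      # achieves 1 in every belief: a locally optimal assignment
        return None           # no locally optimal assignment exists

Applied to the beliefs of Example 4, this returns None, consistent with the frustrated loop discussed above.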
In the case where there is no locally optimal assignment, we must resort to the
use of alternative solution methods. One heuristic in this latter situation is to use
information obtained from the max-product propagation to construct a partial as-
signment. For example, assume that a variable Xi is unambiguous in the calibrated
cluster graph, so that the only value that locally optimizes its node marginal is xi .
In this case, we may decide to restrict attention only to assignments where Xi = xi .
In many real-world problems, a large fraction of the variables in the network are
unambiguous in the calibrated max-product cluster graph. Thus, this heuristic can
greatly simplify the model, potentially even allowing exact methods (such as clique
tree inference) to be used for the resulting restricted model. We note, however, that
the resulting assignment would not necessarily satisfy the local optimality condition,
and all of the guarantees we will present hold only under that assumption.
Figure 5. Two induced subgraphs derived from figure 1a. (a) Graph over
{B, C}; (b) Graph over {C, E}.
6 Conclusions
This paper has reviewed a small fraction of the recent results regarding the belief
propagation algorithm. This line of work has been hugely influential in the area
of probabilistic modeling, both in practice and in theory. On the practical side,
belief propagation algorithms are among the most commonly used for inference in
graphical models for which exact inference is intractable. They have been used
successfully for a broad range of applications, including message decoding, natural
language processing, computer vision, computational biology, web analysis, and
many more. There have also been tremendous developments on the algorithmic
side, with many important extensions to the basic approach.
On the theoretical side, the work of many people has served to provide a much
deeper understanding of the theoretical foundations of this algorithm, which has
tremendously influenced our entire perspective on probabilistic inference. One
seminal line of work along these lines was initiated by the landmark paper of Yedidia,
Freeman, and Weiss [2000, 2005], showing that beliefs obtained as fixed points of
the belief propagation algorithm are also solutions to an optimization problem; this
problem is an approximation to another optimization problem whose solutions are
the exact marginals that would be obtained from clique tree inference. Thus, both
exact (clique tree) and approximate (cluster graph) inference can be viewed in terms
of optimization of an objective. This observation was the basis for the development
of a whole range of novel methods that explored different variations on the for-
mulation of the optimization problem, or different algorithms for performing the
optimization. One such line of work uses convex versions of the optimization prob-
lem underlying belief propagation, a trajectory initiated by Wainwright, Jaakkola,
and Willsky [2002]. Algorithms based on this approach (e.g., [Heskes 2006; Hazan
and Shashua 2008]) can also guarantee convergence as well as provide bounds on
the partition function.
For the MAP problem, a similar optimization-based view has also recently come
to dominate the field. Here, the original MAP problem is reformulated as an inte-
ger programming problem, where the (discrete-valued) variables in the optimization
represent the space of possible assignments x. This discrete optimization is then re-
laxed to produce a continuous-valued optimization problem that is a linear program
(LP). This LP-relaxation approach was first proposed by Schlesinger [1976], and
then subsequently rediscovered independently by several researchers. Most notably,
Wainwright, Jaakkola, and Willsky [2005] established the first connection between
the dual problem to this LP and message passing algorithms, and proposed a new
message-passing algorithm (TRW) based on this connection. Many recent works
build on these ideas and develop a suite of increasingly better algorithms for solv-
ing the MAP inference problem. Some of these algorithms utilize message-passing
techniques; others merely adopt the idea of using the LP dual but utilize other
optimization methods for solving it. Importantly, for several of these algorithms, a
solution, if one is found, is guaranteed to be the optimal MAP assignment.
In summary, the simple message-passing algorithm first proposed by Pearl has
recently returned to revolutionize the world of inference in graphical models. It
has dramatically affected the practice in the field and has led to a new,
optimization-based perspective on the foundations of the inference task. This new
understanding has, in turn, given rise to the development of much better algorithms,
which continue to improve our ability to apply probabilistic graphical models to
challenging, real-world applications.
Acknowledgments The material in this review paper is extracted from the book
of Koller and Friedman [2009], published by MIT Press. Some of this material
is based on contributions by Nir Friedman and Gal Elidan. I also thank Amir
Globerson, David Sontag, and Yair Weiss for useful discussions regarding MAP
inference.
References
Berrou, C., A. Glavieux, and P. Thitimajshima (1993). Near Shannon limit error-
correcting coding: Turbo codes. In Proc. International Conference on Com-
munications, pp. 1064–1070.
Cooper, G. (1990). Probabilistic inference using belief networks is NP-hard. Ar-
tificial Intelligence 42, 393–405.
Dechter, R. (2003). Constraint Processing. Morgan Kaufmann.
Dechter, R., K. Kask, and R. Mateescu (2002). Iterative join-graph propagation.
In Proc. 18th Conference on Uncertainty in Artificial Intelligence (UAI), pp.
128–136.
Dechter, R. and J. Pearl (1987). Network-based heuristics for constraint-
satisfaction problems. Artificial Intelligence 34 (1), 1–38.
Elidan, G., I. McGraw, and D. Koller (2006). Residual belief propagation: In-
formed scheduling for asynchronous message passing. In Proc. 22nd Confer-
ence on Uncertainty in Artificial Intelligence (UAI).
Frey, B. and D. MacKay (1997). A revolution: Belief propagation in graphs with
cycles. In Proc. 11th Conference on Neural Information Processing Systems
(NIPS).
Hazan, T. and A. Shashua (2008). Convergent message-passing algorithms for
inference over general graphs with convex free energies. In Proc. 24th Confer-
ence on Uncertainty in Artificial Intelligence (UAI).
Heckerman, D., E. Horvitz, and B. Nathwani (1992). Toward normative ex-
pert systems: Part I. The Pathfinder project. Methods of Information in
Medicine 31, 90–105.
Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima
of the Bethe free energy. In Proc. 16th Conference on Neural Information
Processing Systems (NIPS), pp. 359–366.
Heskes, T. (2006). Convexity arguments for efficient minimization of the Bethe
and Kikuchi free energies. Journal of Machine Learning Research 26, 153–190.
Ihler, A. T., J. W. Fisher, and A. S. Willsky (2005). Loopy belief propagation:
Convergence and effects of message errors. Journal of Machine Learning Re-
search 6, 905–936.
Jensen, F. V., K. G. Olesen, and S. K. Andersen (1990, August). An algebra
of Bayesian belief universes for knowledge-based systems. Networks 20 (5),
637–659.
Kim, J. and J. Pearl (1983). A computational model for combined causal and
diagnostic reasoning in inference systems. In Proc. 7th International Joint
Conference on Artificial Intelligence (IJCAI), pp. 190–193.
Wainwright, M., T. Jaakkola, and A. Willsky (2005). MAP estimation via agree-
ment on trees: Message-passing and linear programming. IEEE Transactions
on Information Theory.
Wainwright, M., T. Jaakkola, and A. S. Willsky (2002). A new class of upper
bounds on the log partition function. In Proc. 18th Conference on Uncertainty
in Artificial Intelligence (UAI).
Weiss, Y. (1996). Interpreting images by propagating Bayesian beliefs. In Proc.
10th Conference on Neural Information Processing Systems (NIPS), pp. 908–
914.
Weiss, Y. and W. Freeman (2001). On the optimality of solutions of the max-
product belief propagation algorithm in arbitrary graphs. IEEE Transactions
on Information Theory 47 (2), 723–735.
Yedidia, J., W. Freeman, and Y. Weiss (2005). Constructing free-energy approx-
imations and generalized belief propagation algorithms. IEEE Trans. Infor-
mation Theory 51, 2282–2312.
Yedidia, J. S., W. T. Freeman, and Y. Weiss (2000). Generalized belief propa-
gation. In Proc. 14th Conference on Neural Information Processing Systems
(NIPS), pp. 689–695.
13
Open-Universe Probability Models
Brian Milch and Stuart Russell
1 Introduction
One of Judea Pearl’s now-classic examples of a Bayesian network involves a home
alarm system that may be set off by a burglary or an earthquake, and two neighbors
who may call the homeowner if they hear the alarm. Like most scenarios modeled
with BNs, this example involves a known set of objects (one house, one alarm,
and two neighbors) with known relations between them (the alarm is triggered by
events that affect this house; the neighbors can hear this alarm). These objects and
relations determine the relevant random variables and their dependencies, which
are then represented by nodes and edges in the BN.
In many real-world scenarios, however, the relevant objects and relations are ini-
tially unknown. For instance, suppose we have a set of ASCII strings containing
irregularly formatted and possibly erroneous academic citations extracted from on-
line documents, and we wish to make a list of the distinct publications that are
referred to, with correct author names and titles. In this case, the publications,
authors, venues, and so on are not known in advance, nor is the mapping between
publications and citations. The same challenge of making inferences about unknown
objects is called coreference resolution in natural language processing, data associ-
ation in multitarget tracking, and record linkage in database systems. The issue is
actually much more widespread than this short list suggests; it arises in any data
interpretation problem in which objects or events come without unique identifiers.
In this chapter, we show how the Bayesian network (BN) formalism that Judea
Pearl pioneered has been extended to handle such scenarios. The key contribution
on which we build is the use of acyclic directed graphs of local conditional distri-
butions to generate well-defined, global probability distributions. We begin with a
review of relational probability models (RPMs), which specify how to construct a BN
for a given set of objects and relations. We then describe open-universe probability
models, or OUPMs, which represent uncertainty about what objects exist. OUPMs
may not boil down to finite, acyclic BNs; we present results from Milch [2006]
showing how to extend the factorization and conditional independence semantics
of BNs to models that are only context-specifically finite and acyclic. Finally, we
discuss how Markov chain Monte Carlo (MCMC) methods can be used to perform
approximate inference on OUPMs and briefly describe some applications.
Figure 1. A BN for a bibliography scenario where we know that citations Cit1 and
Cit3 refer to Pub1, while citation Cit2 refers to Pub2.
Title(p) ∼ TitlePrior()
CitTitle(c) ∼ TitleEditCPD(Title(PubCited(c)))
The symbols of the logical language (including constant and predicate symbols) are divided
into a set of nonrandom function symbols whose interpretations are specified by
the relational skeleton, and a set of random function symbols whose interpretations
vary between possible worlds. An RPM includes one dependency statement for each
random function symbol.
Each RPM M defines a set of basic random variables VM , one for each application
of a random function to a tuple of arguments. We will write X(ω) for the value
of a random variable X in world ω. If X represents the value of the random
function f on some arguments, then the dependency statement for f defines a
parent set and CPD for X. The parent set for X, denoted Pa(X), is the set of basic
variables that are needed to evaluate the expressions in the dependency statement
in any possible world. For instance, if we know that PubCited(Cit1) = Pub1, then
the dependency statement in Figure 2 yields the single parent Title(Pub1) for the
variable CitTitle(Cit1). The CPD for a basic variable X is a function ϕX (x, pa)
that defines a conditional probability distribution over values x of X given each
instantiation pa of Pa(X). We obtain this CPD by evaluating the expressions in
the dependency statement (such as Title(PubCited(Cit1))) and passing them to an
elementary distribution function such as TitleEditCPD.
Thus, an RPM defines a BN over its basic random variables. If this BN is
acyclic, it defines a joint distribution for the basic RVs. Since there is a one-to-
one correspondence between full instantiations of VM and worlds in ΩM , this BN
also gives us a probability measure over ΩM . We define this to be the probability
measure represented by the RPM.
The dependency statement for PubCited says that each citation refers to a publication chosen uniformly at
random from the set of all publications p. The dependency statement for CitTitle
in Figure 2 now represents a context-specific dependency: for a given citation Ci ,
the Title(p) variable that CitTitle(Ci ) depends on varies from world to world.
In the BN defined by this model, shown in Figure 3, the parents of each CitTitle(c)
variable include all variables that might be needed to evaluate the dependency
statement for CitTitle(c) in any possible world. This includes PubCited(c) and all
the Title(p) variables. The CPD in the BN is a multiplexer that conditions on the
appropriate Title(p) variable for each value of PubCited(c). If the BN constructed
this way is still finite and acyclic, the usual BN semantics hold.
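A multiplexer CPD of this kind can be sketched as a function that ignores every Title(p) variable except the one selected by PubCited(c); title_edit_cpd is a placeholder for whatever string-corruption model is used.

    # Multiplexer CPD for CitTitle(c): condition only on the Title variable selected
    # by the current value of PubCited(c).
    def cit_title_prob(cit_title, pub_cited, titles, title_edit_cpd):
        true_title = titles[pub_cited]          # the selected parent Title(PubCited(c))
        return title_edit_cpd(cit_title, true_title)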
Number variables can also depend on other variables; we will consider an example
of this below.
In the RPM where we had a fixed set of publications, the relational skeleton
specified a constant symbol such as Pub1 for each publication. In an OUPM where
the set of publications is unknown, it does not make sense for the language to include
such constant symbols. The possible worlds contain publication objects—which we
will assume are pairs ⟨Pub, 1⟩, ⟨Pub, 2⟩, etc.—but now they are not necessarily in
one-to-one correspondence with any constant symbols.
The set of basic variables now includes the number variable #Pub itself, and
variables for the application of each random function to all arguments that exist in
any possible world. Figure 4 shows the BN over these variables. Note that we have
an infinite sequence of Title variables: if we had a finite number, our BN would
not define probabilities for worlds with more than that number of publications. We
stipulate that if a basic variable has an object o as an argument, then in worlds
where o does not exist, the variable takes on the special value null. Thus, #Pub
is a parent of each Title(p) variable, determining whether that variable takes the
value null or not. The set of publications available for selection in the dependency
#Researcher ∼ NumResearchersPrior()
#Pub(FirstAuthor = r) ∼ NumPubsCPD(Position(r))
instantiation where #Pub = 100, but Title(p) takes on a non-null value for 200
publications.
To facilitate using the basic variables to define a probability measure over the
possible worlds, we would like to have a one-to-one mapping between ΩM and a
set of achievable instantiations of VM . This is straightforward in cases like our
first OUPM, where there is only one number variable for each type of object. Then
our semantics specifies that the non-guaranteed objects of each type—that is, the
objects that exist in some possible worlds and not others, like the publications in our
example—are pairs ⟨Pub, 1⟩, ⟨Pub, 2⟩, . . . . In each world, the set of non-guaranteed
objects of each type that exist is required to be a prefix of this numbered sequence.
Thus, if we know that #Pub = 4 in a world ω, we know that the publications in ω
are ⟨Pub, 1⟩ through ⟨Pub, 4⟩, not some other set of non-guaranteed objects.
Things are more complicated when we have multiple number variables for a type,
as in our example with researchers generating publications. Given values for all the
number variables of the form #Pub(FirstAuthor = r), we do not want there to be
any uncertainty about which non-guaranteed objects have each FirstAuthor value.
We can achieve this by letting the non-guaranteed objects be nested tuples that
encode their generation history. For the publications with ⟨Researcher, 5⟩ as their
first author, we use tuples
⟨Pub, ⟨FirstAuthor, ⟨Researcher, 5⟩⟩, 1⟩
⟨Pub, ⟨FirstAuthor, ⟨Researcher, 5⟩⟩, 2⟩
and so on. As before, in each possible world, the set of tuples in each sequence must
form a prefix of the sequence. This construction yields the following lemma.
LEMMA 1. In any OUPM M , each complete instantiation of VM is consistent
with at most one possible world in ΩM .
Section 4.3 of Milch [2006] gives a more rigorous formulation and proof of this
result. Given this lemma, the probability measure defined by an OUPM M on ΩM
is well-defined if the OUPM specifies a joint probability distribution for VM that
is concentrated on the set of achievable instantiations. Since the OUPM’s CPDs
implicitly force a variable to take the value null when any of its arguments do not
exist, any distribution consistent with the CPDs will indeed put probability one on
achievable instantiations.
Informally, the probability distribution for the basic random variables can be de-
fined by a generative process that builds up an instantiation step-by-step, sampling
a value for each variable according to its dependency statement. In the next section,
we show how this intuitive semantics can be formalized using an extended version
of Bayesian networks.
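As an illustration of this generative view, the sketch below samples a world for the citation model top-down; the specific priors used (a uniform number of publications, arbitrary title strings, and a crude corruption model) are stand-ins rather than the ones assumed in the chapter.

    import random

    def sample_citation_world(num_citations, seed=0):
        rng = random.Random(seed)
        num_pubs = 1 + rng.randrange(10)                     # #Pub ~ NumPubsPrior()
        titles = {p: "title-%d" % rng.randrange(1000)        # Title(p) ~ TitlePrior()
                  for p in range(1, num_pubs + 1)}
        world = {"#Pub": num_pubs, "Title": titles, "PubCited": {}, "CitTitle": {}}
        for c in range(1, num_citations + 1):
            p = rng.randrange(1, num_pubs + 1)               # PubCited(c): uniform over publications
            observed = titles[p] if rng.random() < 0.9 else titles[p] + "?"  # noisy CitTitle(c)
            world["PubCited"][c] = p
            world["CitTitle"][c] = observed
        return world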
4 Extending BN semantics
There are two equivalent ways of defining the probability distribution represented
by a BN B. The first is based on conditional independence statements; specifically, the directed local Markov property asserts that each variable is conditionally independent of its non-descendants given its parents.
#Pub ∼ NumPubsPrior()
Title(p) ∼ TitlePrior()
Date(c) ∼ DatePrior()
Figure 6. Dependency statements for a model where each citation was written on
some date, and a citation may copy the title from an earlier citation of the same
publication rather than copying the publication title directly.
The remarkable property of BNs is that if the graph is finite and acyclic, then there
is guaranteed to be exactly one joint distribution that satisfies these conditions.
Note that in the BN in Figure 4, the CitTitle(c) variables have infinitely many par-
ents. The fact that the BN has infinitely many nodes means that we can no longer
use the standard product-expression semantics for the BN, because the product of
the CPDs for all variables is an infinite product, and will typically be zero for all
values of the variables. We would like to specify probabilities for certain partial, fi-
nite instantiations of the variables that are sufficient to define the joint distribution.
As noted by Kersting and De Raedt [2001], if it is possible to number the nodes of
the BN in topological order, then it suffices to specify the product expression for
each finite prefix of this numbering. However, if a variable has infinitely many par-
ents, then the BN has no topological numbering—if we try numbering the nodes in
topological order, we will spend forever on X’s parents and never reach X.
Figure 7. Part of the BN defined by the OUPM in Figure 6, for two citations.
In OUPMs and even RPMs with relational uncertainty, it is fairly easy to write
dependency statements that define a cyclic BN. For instance, suppose that some
citations are composed by copying another citation, and we do not know who copied
whom. We can specify a model where each citation was written at some unknown
date, and with probability 0.1, a citation copies an earlier citation to the same
publication if one exists. Figure 6 shows the dependency statements for this model.
(Note that Date here is the date the citation was written, i.e., the date of the citing
paper, not the date of the paper being cited.)
The difficult aspect of semantics for this class of cyclic BNs is the directed local
Markov property. It is no longer sufficient to assert that X is independent of its non-
descendants in the full BN given its parents, because its set of non-descendants in
the full BN may be too small. In this model, all the CitTitle nodes are descendants of
each other, so the standard directed local Markov property would yield no assertions
of conditional independence between them.
PM (X = x | λ) = ϕX (x, λ) (1)
PROPERTY 3 (Directed local Markov property for an OUPM M ). For each basic
random variable X ∈ VM , each block λ ∈ ΛX , and each self-supporting instanti-
ation σ on VM such that X ∉ vars(σ), X is conditionally independent of σ given
λ.
Under what conditions is there a unique probability measure PM on ΩM that
satisfies Properties 2 and 3? In the BN case, it suffices for the graph to admit a
topological numbering. We can define a similar notion that is specific to individual
worlds: a supportive numbering for a world ω ∈ ΩM is a numbering X0 , X1 , . . . of
VM such that for each natural number n, the instantiation (X0 (ω), . . . , Xn−1 (ω))
supports Xn .
THEOREM 4. Let M be an OUPM such that for every world ω ∈ ΩM , either:
[Figure: contingent BN for the citation model, with edges labeled by the conditions
under which they are active, such as PubCited(C1) = ⟨Pub, 1⟩, PubCited(C2) =
⟨Pub, 2⟩, Source(C1) = C2, and Source(C2) = C1.]
A CBN (contingent Bayesian network) can be viewed as a partition-based model where the partition ΛX for each
random variable X is defined by a decision tree. The internal nodes in this decision
tree are labeled with random variables; the edges are labeled with variable values;
and the leaves specify conditional probability distributions for X. The blocks in
ΛX correspond to the leaves in this tree (we assume the tree has no infinite paths,
so the leaves cover all possible worlds). The restriction to decision trees allows us
to define a notion of a parent being active in a particular world: if we walk along
X’s tree from the root, following edges consistent with a given world ω, then the
random variables on the nodes we visit are the active parents of X in ω. The label
on an edge W → X in a CBN is the event consisting of those worlds where W is an
active parent of X. (In diagrams, we omit the trivial label A = ΩM , which indicates
that the dependency is always active.)
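For the simpler CitTitle(c) dependency of Figure 2, the decision-tree walk can be sketched as follows: the root tests PubCited(c), and its value selects which Title(p) variable is consulted next. The dictionary-based world representation is an assumption of the sketch.

    # Active parents of CitTitle(c) in a given world: walk the decision tree from the
    # root, following edges consistent with the world's values.
    def active_parents_cit_title(c, world):
        visited = [("PubCited", c)]          # root of the decision tree
        p = world[("PubCited", c)]           # its value selects the next variable
        visited.append(("Title", p))         # the leaf's CPD conditions on Title(p)
        return visited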
5 Inference
Given an OUPM, we would like to be able to compute the probability of a query
event Q given an evidence event E. For example, Q could be the event that
PubCited(Cit1) = PubCited(Cit2) and E could be the event that CitTitle(Cit1) =
“Learning Probabilistic Relational Models” and CitTitle(Cit2) = “Learning Prob-
abilitsic Relation Models”. The ideas we present can be extended to other tasks
such as computing the posterior distribution of a random variable, or finding the
maximum a posteriori (MAP) assignment of values to a set of random variables.
With probability α, we accept the proposal and let ωt+1 = ω ′ ; otherwise we reject
the proposal and let ωt+1 = ωt .
The difficulty in OUPMs is that each world may be very large. For instance, if we
have a world where #Pub = 1000, but only 100 publications are referred to by our
observed citations, then the world must also specify the titles of the 900 unobserved
publications. Sampling values for these 900 Title variables and computing their
probabilities will slow down our algorithm unnecessarily. In scenarios where some
possible worlds have infinitely many objects, specifying a possible world completely
may be impossible.
Thus, we would like to run MCMC over partial descriptions that specify values
only for certain random variables. The set of instantiated variables may vary from
world to world. Since a partial instantiation σ defines an event (the set of worlds
that are consistent with it), a Markov chain over partial instantiations can be viewed
as a chain over events. Thus, we use the acceptance probability:
α = min{ 1, PM (σ ′ ) q(σt | σ ′ ) / [ PM (σt ) q(σ ′ | σt ) ] }
where PM (σ) is the probability of the event σ. As long as the set Σ of partial
instantiations that can be returned by q forms a partition of E, and each partial
instantiation is specific enough to determine whether Q is true, we can estimate
P (Q|E) using a Markov chain on Σ with stationary distribution proportional to
PM (σ) [Milch and Russell 2006].
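A generic Metropolis–Hastings step over partial instantiations might then look like the following sketch; propose and prob_of_event are placeholders for model-specific components.

    import random

    def mh_step(sigma_t, propose, prob_of_event, rng=random):
        # propose(sigma_t) returns (sigma_prime, q_fwd, q_bwd): the proposed partial
        # instantiation and the densities q(sigma'|sigma_t) and q(sigma_t|sigma').
        sigma_prime, q_fwd, q_bwd = propose(sigma_t)
        ratio = (prob_of_event(sigma_prime) * q_bwd) / (prob_of_event(sigma_t) * q_fwd)
        return sigma_prime if rng.random() < min(1.0, ratio) else sigma_t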
In general, computing the probability PM (σ) involves summing over all the vari-
ables not instantiated in σ—which is precisely what we want to avoid by using
a Monte Carlo inference algorithm. Fortunately, if each instantiation in Σ is self-
supporting, we can compute its probability using the product expression from Prop-
erty 2. Thus, our partial worlds are self-supporting instantiations that include the
query and evidence variables. We also make sure to use minimal instantiations sat-
isfying this condition—that is, instantiations that would cease to be self-supporting
if we removed any non-query, non-evidence variable. It can be shown that in a
CBN, such minimal self-supporting instantiations are mutually exclusive. So if our
set of partial worlds Σ covers all of E, we are guaranteed to have a partition of E,
as required. An example of a partial world in our bibliography scenario is:
∃ distinct x, y
#Pub = 50, CitTitle(Cit1) = “Calculus”, CitTitle(Cit2) = “Intro to Calculus”,
PubCited(Cit1) = x, PubCited(Cit2) = y,
Title(x) = “Calculus”, Title(y) = “Intro to Calculus”
where Pc (σ) is the probability of any one of the “concrete” instantiations obtained
by substituting distinct tuple representations for the logical variables in σ.
where Paσ (X) is the set of parents of X whose edge conditions are entailed by
σ. This expression is daunting, because even though the instantiations σt and
σ ′ are only partial descriptions of possible worlds, they may still assign values to
large sets of random variables — and the number of instantiated variables grows at
least linearly with the number of observations we have. Since we may want to run
millions of MCMC steps, having each step take time proportional to the number of
observations would make inference prohibitively expensive.
Fortunately, with most proposal distributions used in practice, each step changes
the values of only a small set of random variables. Furthermore, if the edges that
are active in any given possible world are fairly sparse, then σ ′ [Paσ′ (X)] will also
be the same as σt [Paσt (X)] for many variables X. Thus, many factors will cancel
out in the ratio above.
We need to compute the “new” and “old” probability factors for a variable X
only if either σ ′ [X] ≠ σt [X], or there is some active parent W ∈ Paσt (X) such that
σ ′ [W ] ≠ σt [W ]. (We take these inequalities to include the case where σ ′ assigns a
value to the variable and σt does not, or vice versa.) Note that it is not possible
for Paσ′ (X) to be different from Paσt (X) unless one of the “old” active parents in
Paσt (X) has changed: given that σt is a self-supporting instantiation, the values
of X’s instantiated parents in σt determine the truth values of the conditions on
all the edges into X, so the set of active edges into X cannot change unless one of
these parent variables changes.
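The cancellation can be exploited along the following lines; the representation of instantiations and the helper names (phi, children) are assumptions made for this sketch.

    # Compute the probability ratio from only the "touched" variables: those whose
    # value changed, plus the children of changed variables along active edges.
    def acceptance_ratio(sigma_t, sigma_p, changed, phi, children):
        touched = set(changed)
        for x in changed:
            touched.update(children(x, sigma_t))
        num = den = 1.0
        for x in touched:
            num *= phi(x, sigma_p)   # phi(x, sigma): the CPD factor of x under sigma
            den *= phi(x, sigma_t)
        return num / den             # the factors of all other variables cancel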
This fact is exploited in the Blog system [Milch and Russell 2006] to efficiently
detect which probability factors need to be computed for a given proposal. The
system maintains a graph of the edges that are active in the current instantiation σt .
The proposer provides a list of the variables that are changed in σ ′ , and the system
follows the active edges in the graph to identify the children of these variables,
whose probability factors also need to be computed. Thus, the graphical locality
that is central to many other BN inference algorithms also plays a role in MCMC
over relational structures.
6 Related work
The connection between probability and first-order languages was first studied by
Carnap [1950]. Gaifman [1964] and Scott and Krauss [1966] defined a formal se-
mantics whereby probabilities could be associated with first-order sentences and for
which models were probability measures on possible worlds. Within AI, this idea
was developed for propositional logic by Nilsson [1986] and for first-order logic by
Halpern [1990]. The basic idea is that each sentence constrains the distribution over
possible worlds; one sentence entails another if it expresses a stronger constraint.
For example, the sentence ∀x P (Hungry(x)) > 0.2 rules out distributions in which
any object is hungry with probability less than 0.2; thus, it entails the sentence
∀x P (Hungry(x)) > 0.1. Bacchus [1990] investigated knowledge representation is-
sues in such languages. It turns out that writing a consistent set of sentences in
these languages is quite difficult and constructing a unique probability model nearly
impossible unless one adopts the representational approach of Bayesian networks
by writing suitable sentences about conditional probabilities.
The impetus for the next phase of work came from researchers working with
BNs directly. Rather than laboriously constructing large BNs by hand, they built
them by composing and instantiating “templates” with logical variables that de-
scribed local causal models associated with objects [Breese 1992; Wellman et al.
1992]. The most important such language was Bugs (Bayesian inference Using
Gibbs Sampling) [Gilks et al. 1994], which combined Bayesian networks with the
indexed-random-variable notation common in statistics. These languages inherited
the key property of Bayesian networks: every well-formed knowledge base defines a
unique, consistent probability model. Languages with well-defined semantics based
on unique names and domain closure drew on the representational capabilities of
logic programming [Poole 1993; Sato and Kameya 1997; Kersting and De Raedt
2001] and semantic networks [Koller and Pfeffer 1998; Pfeffer 2000]. Initially, in-
ference in these models was performed on the equivalent Bayesian network. Lifted
inference techniques borrow from first-order logic the idea of performing an inference
once to cover an entire equivalence class of objects [Poole 2003; de Salvo Braz et al.
2007; Kisynski and Poole 2009]. MCMC over relational structures was introduced
by Pasula and Russell [2001]. Getoor and Taskar [2007] collect many important
papers on first-order probability models and their use in machine learning.
Probabilistic reasoning about identity uncertainty has two distinct origins. In
statistics, the problem of record linkage arises when data records do not contain
standard unique identifiers—for example, in financial, medical, census, and other
data [Dunn 1946; Fellegi and Sunter 1969]. In control theory, the problem of data
association arises in multitarget tracking when each detected signal does not identify
the object that generated it [Sittler 1964]. For most of its history, work in symbolic
AI assumed erroneously that sensors could supply sentences with unique identifiers
for objects. The issue was studied in the context of language understanding by
Charniak and Goldman [1993] and in the context of surveillance by Huang and
Russell [1998] and Pasula et al. [1999]. Pasula et al. [2003] developed a complex
generative model for authors, papers, and citation strings, involving both relational
and identity uncertainty, and demonstrated high accuracy for citation information
extraction. The first formally defined language for open-universe probability models
was Blog [Milch et al. 2005], from which the material in the current chapter was
developed. Laskey [2008] describes another open-universe modeling language called
multi-entity Bayesian networks.
Another important thread goes under the name of probabilistic programming
languages, which include Ibal [Pfeffer 2007] and Church [Goodman et al. 2008].
These languages represent first-order probability models using a programming lan-
guage extended with a randomization primitive; any given “run” of a program can
be seen as constructing a possible world, and the probability of that world is the
probability of all runs that construct it.
The OUPMs we have described here bear some resemblance to probabilistic pro-
grams, since each dependency statement can be viewed as a program fragment for
sampling a value for a child variable. However, expressions in dependency state-
ments have different semantics from those in a probabilistic functional language.
7 Discussion
This chapter has stressed the importance of unifying probability theory with first-
order logic—particularly for cases with unknown objects—and has presented one
possible approach based on open-universe probability models, or OUPMs. OUPMs
draw on the key idea introduced into AI by Judea Pearl: generative probability
models based on local conditional distributions. Whereas BNs generate worlds by
assigning values to variables one at a time, relational models can assign values to a
whole class of variables through a single dependency assertion, while OUPMs add
object creation as one of the generative steps.
OUPMs appear to enable the straightforward representation of a wide range
of situations. In addition to the citation model mentioned in this chapter (see
Milch [2006] for full details), models have been written for multitarget tracking,
plan recognition, sybil attacks (a security threat in which a reputation system is
compromised by individuals who create many fake identities), and detection of
nuclear explosions using networks of seismic sensors [Russell and Vaidya 2009]. In
each case, the model is essentially a transliteration of the obvious English description
of the generative process.
Inference, however, is another matter. The generic Metropolis–Hastings inference
engine written for Blog in 2006 is far too slow to support any of the applications
described in the preceding paragraph. For the citation problem, Milch [2006] de-
References
Bacchus, F. (1990). Representing and Reasoning with Probabilistic Knowledge.
MIT Press.
Breese, J. S. (1992). Construction of belief and decision networks. Computational
Intelligence 8 (4), 624–647.
Carnap, R. (1950). Logical Foundations of Probability. Univ. of Chicago Press.
Charniak, E. and R. P. Goldman (1993). A Bayesian model of plan recognition.
Artificial Intelligence 64 (1), 53–79.
de Salvo Braz, R., E. Amir, and D. Roth (2007). Lifted first-order probabilis-
tic inference. In L. Getoor and B. Taskar (Eds.), Introduction to Statistical
Relational Learning. MIT Press.
Dunn, H. L. (1946). Record linkage. Am. J. Public Health 36 (12), 1412–1416.
Fellegi, I. and A. Sunter (1969). A theory for record linkage. J. Amer. Stat. As-
soc. 64, 1183–1210.
Gaifman, H. (1964). Concerning measures in first order calculi. Israel J. Math. 2,
1–18.
Getoor, L. and B. Taskar (Eds.) (2007). Introduction to Statistical Relational
Learning. MIT Press.
Gilks, W. R., A. Thomas, and D. J. Spiegelhalter (1994). A language and program
for complex Bayesian modelling. The Statistician 43 (1), 169–177.
Goodman, N. D., V. K. Mansinghka, D. Roy, K. Bonawitz, and J. B. Tenenbaum
(2008). Church: A language for generative models. In Proc. 24th Conf. on
Uncertainty in AI.
Halpern, J. Y. (1990). An analysis of first-order logics of probability. Artificial
Intelligence 46, 311–350.
Huang, T. and S. J. Russell (1998). Object identification: A Bayesian analysis
with application to traffic surveillance. Artificial Intelligence 103, 1–17.
14
A Heuristic Procedure for Finding Hidden
Variables
Azaria Paz
1 Introduction
This paper investigates probability distribution (PD) induced independency rela-
tions which are representable by Directed Acyclic Graphs (DAGs), and are marginal-
ized over a subset of their variables. PD-induced relations have been shown in the
literature to be representable as relations that can be defined on various graphical
models. All those graphical models have two basic properties: They are compact,
i.e., the space required for storing such a model is polynomial in the number of vari-
ables, and they are decidable, i.e., a polynomial algorithm exists for testing whether
a given independency is represented in the model. In particular, two such mod-
els will be encountered in this paper; the DAG model and the Annotated Graph
(AG) model. The reader is supposed to be familiar with the DAG-model which was
studied extensively in the literature. An ample introduction to the DAG model is
included in Pearl [7, 1988], Pearl [8, 2000], and Lauritzen [2, 1996].1
The AG-model in a general form was introduced by Paz, Geva, and Studeny in
[5, 2000] and a restricted form of this model, which is all we need for this paper,
was introduced by Paz [3, 2003a] and investigated further in Paz [4, 2003b]. For
the sake of completeness, we shall reproduce here some of the basic definitions and
properties of those models which are relevant for this paper.
Given a DAG-representable PD-induced relation it is often the case that we need
to marginalize the relation over a subset of variables. Unfortunately it is seldom
the case that such a marginalized relation can be represented by a DAG, which is
an easy-to-manage and well-understood model.
In the paper [4, 2003b] the author proved a set of necessary conditions for a given
AG to be equivalent to a DAG (see Lemma 1 in the next section). In the same paper
a decision procedure is given for checking whether a given AG which satisfies the
necessary conditions is equivalent to a DAG. Moreover, if the answer is “yes”, the
procedure constructs a DAG equivalent to the given AG. In a subsequent paper
[6, 2006], the author generalizes the AG model and gives a procedure which enables
the representation of any marginalized DAG representable relation by a generalized
model.
1 The main part of this work was done while the author visited the Cognitive Systems Laboratory
at UCLA and was supported in part by grants from Air Force, NSF and ONR (MURI).
2 Preliminaries
2.1 Definitions and notations
UGs will denote undirected graphs G = (V, E) where V is a set of vertices and E is
a set of undirected edges connecting between two vertices. Two vertices connected
by an edge are adjacent or neighbors. A path in G of length k is a sequence of
vertices v1 . . . vk+1 such that (vi , vi+1 ) is an edge in E for i = 1, . . . , k. A DAG is
an acyclic directed graph D = (V, E) where V is a set of vertices and E is a set of
directed arcs connecting between two vertices in V . A trail of length k in D is a
sequence v1 . . . vk+1 of vertices in V such that either (vi , vi+1 ) or (vi+1 , vi ) is an arc in E for i = 1, . . . , k.
If all the arcs on the trail are directed in the same direction then the trail is called
a directed path. If a directed path exists in D from vi to vj then vj is a descendant
of vi and vi is a predecessor or ancestor of vj . If the path is of length one then vi
is a parent of vj who is a child of vi .
The skeleton of a DAG is the UG derived from the DAG when the orientations
of the arcs are removed. A pattern of the form vi → vj ← vk is a collider pattern
where vj is the collider. If there is no arc between vi and vk then vj is an uncoupled
collider. The moralizing procedure is the procedure generating a UG from a DAG,
by first joining both parents of uncoupled colliders in the DAG by an arc, and then
removing the orientation of all arcs. The edges resulting from the coupling of the
uncoupled collider are called moral edges. As mentioned in the introduction UG’s
and DAG’s represent PD-induced relations whose elements are triplets t = (X; Y |Z)
over the set of vertices of the graphs. For a given triplet t we denote by v(t) the set
of vertices v(t) = X ∪ Y ∪ Z. Two graph models are equivalent if they represent
the same relation.
2.2 DAG-model
Let D = (V, E) be a DAG whose vertices are V and whose arcs are E. D represents
the relation R(D) = {t = (X; Y |Z)|t ∈ D} where X, Y, Z are disjoint subsets of V ,
the vertices in V represent variables in a PD, t is interpreted as “X is independent
of Y given Z” and t ∈ D means: t is represented in D. To check whether a given
triplet t is represented in D we use the Algorithm L1 below due to Lauritzen et al.
[1, 1990].
Algorithm L1:
Input: D = (V, E) and t = (X; Y |Z).
1. Remove from D all vertices that are neither in v(t) = X ∪ Y ∪ Z nor ancestors
of vertices in v(t), together with their incident arcs; denote the resulting DAG by
D′ (t).
2. Moralize D′ (t) (i.e., join all uncoupled parents of uncoupled colliders in D′ (t)).
Denote the resulting graph by D′′ (t).
3. Remove all orientations in D′′ (t) and denote the resulting UG by G(D′′ (t)).
4. t is represented in D (t ∈ D) if and only if Z separates X from Y in G(D′′ (t)).
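A straightforward implementation of this criterion can be sketched as follows, assuming the DAG is given as a map from each vertex to its set of parents.

    # A sketch of Algorithm L1 (the criterion of Lauritzen et al. [1, 1990]).
    def ancestral_subgraph(parents, keep):
        result, stack = set(keep), list(keep)
        while stack:                              # collect keep and all its ancestors
            v = stack.pop()
            for p in parents.get(v, set()):
                if p not in result:
                    result.add(p)
                    stack.append(p)
        return {v: parents.get(v, set()) & result for v in result}

    def represented_in_dag(parents, X, Y, Z):
        sub = ancestral_subgraph(parents, set(X) | set(Y) | set(Z))   # step 1
        und = {v: set() for v in sub}
        for v, ps in sub.items():                 # steps 2-3: moralize, drop orientations
            for p in ps:
                und[v].add(p); und[p].add(v)
            for p in ps:
                for q in ps:
                    if p != q:
                        und[p].add(q); und[q].add(p)
        blocked = set(Z)                          # step 4: does Z separate X from Y?
        frontier = [v for v in X if v not in blocked]
        seen = set(frontier)
        while frontier:
            v = frontier.pop()
            if v in Y:
                return False
            for w in und[v]:
                if w not in seen and w not in blocked:
                    seen.add(w)
                    frontier.append(w)
        return True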
In order to check whether t ∈ A we use the algorithm L2 due to Paz [3, 2003a]
below.
Algorithm L2
Input: An AG A = (G, K) and t = (X; Y |Z).
1. For every element e = (d, r(d)) in K such that r(d) ∩ v(t) = ∅ (where v(t) = X ∪ Y ∪
Z), disconnect the edge (a, b) in G corresponding to d and remove from G
all the vertices in r(d) and their incident edges. Denote the resulting UG by G(t).
REMARK 2. It is clear from the definitions and from the L2 Algorithm that the
AG model is both compact and decidable. In addition, it was shown in [3, 2003a]
that the AG model has the following uniqueness property: R(A1 ) = R(A2 ) implies
that A1 = A2 when A1 and A2 are AG’s. This property does not hold for DAG
models where it is possible for two different (and equivalent) DAGs to define the
same relation. In fact the AG (D) derived from a DAG D represents the equivalence
class of all DAGs which are equivalent to the given DAG D.
REMARK 3. The AGs derived from DAG’s are a particular case of AGs as defined
in Paz et al. [5, 2000] and there are additional ways to derive AGs that represent
PD-induced relations which are not DAG-representable. Consider e.g., the example
below. It was shown by Pearl [7, 1988 Ch. 3] that every DAG representable relation
is a PD-induced relation. Therefore the relation defined by the DAG in Fig. 1
represents a PD-induced relation.
[Figure 1. A DAG over the vertices a, b, c, d, e, f .]
If we marginalize this relation over the vertices e and f we get another rela-
tion, PD-induced, that can be represented by the AG A in Fig. 2, as will be
shown in the sequel, under the semantics of the L2 Algorithm, with R(A) =
{(a; b|∅), (b; d|c)} plus their symmetric images.
[Figure 2. The AG A: a UG over the vertices a, b, c, d with annotation K =
{ ((b, d), {a}), ((a, b), {c, d}) }.]
But R(A) above cannot be represented by
a DAG. This follows from the following lemma that was proven in [4, 2003b].
1. For every element ((a, b), r) ∈ K(D), there is a vertex v ∈ r which is a child
of both a and b and every vertex w ∈ r is connected to some vertex v in r
whose parents are both a and b.
4. The set of elements K(D) is a poset (= partially ordered set) with regard to
the relation “≻” defined as follows: for any two elements (dp , rp ) and (dq , rq ),
if dp ∩ rq ≠ ∅ then (dp , rp ) ≻ (dq , rq ), in words “(dp , rp ) is strictly greater than
(dq , rq )”. Moreover (dp , rp ) ≻ (dq , rq ) implies that rp ⊂ rq .
5. For any two elements (d1 , r1 ) and (d2 , r2 ): if r1 ∩ r2 ≠ ∅ and neither of r1 , r2
is a subset of the other, then there is an element (d3 , r3 ) in K(D) such that
r3 ⊆ r1 ∩ r2 .
As is easy to see, the annotation K in Fig. 2 does not satisfy condition 4 of
the lemma, since the first element in K is bigger than the second but its range is not
a subset of the range of the second element. Therefore A is not DAG-representable.
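This failure of condition 4 is mechanical enough to check in a few lines; representing each element of K as a pair of frozensets is an assumption of the sketch.

    # Condition 4 of Lemma 1: whenever dp intersects rq, the element (dp, rp) must be
    # strictly greater, which requires rp to be a proper subset of rq.
    def condition4_violation(K):
        for dp, rp in K:
            for dq, rq in K:
                if (dp, rp) != (dq, rq) and dp & rq and not (rp < rq):
                    return (dp, rp), (dq, rq)
        return None

    K = [(frozenset("bd"), frozenset("a")),       # ((b,d), {a})
         (frozenset("ab"), frozenset("cd"))]      # ((a,b), {c,d})
    print(condition4_violation(K))                # reports a violating pair: A is not DAG-representable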
REMARK 4. An algorithm is provided in [3, 2003a] that tests whether a given AG,
possibly derived from a marginalized DAG relation, which satisfies the (necessary
but not sufficient) conditions in Lemma 1 above, is DAG-representable. The main
result of this work is to provide a polynomial algorithm which generates a “gener-
alized annotated graph” representation (concept to be defined in the sequel) which
is both compact and decidable. In some cases the generalized annotated graph re-
duces to a regular annotated graph which satisfies the conditions of Lemma 1. If
this is the case then, using the testing algorithm in [4, 2003b], we can check whether
the given AG is DAG-representable. It is certainly not DAG-representable if the
generalized annotated graph is not a regular AG or is a regular AG but does not
satisfy the conditions of Lemma 1.
REMARK 5. When a given AG A is derived from a DAG then the annotation set
K = {(d, r(d))} can be interpreted as follows: the edge (a, b) in G corresponding
to d (a moral edge) represents a conditional dependency. That is, there is some
set of vertices Sab , disjoint from r(d), such that (a; b|Sab ) is represented in A, but a
and b become dependent if any nonempty subset of r(d) is observed, i.e., ¬(a; b|S) if
∅ ≠ S ⊆ r(d).
In this paper we are concerned with the following problem: given a relation which
is represented by a generalized AG model and is not representable by a DAG (see
[4, 2003b]), is it possible to find hidden variables such that the given relation results
from the marginalization of a DAG-representable relation over the expanded set
of variables, i.e., the hidden variables in addition to the given AG variables? We do
not have a full solution to this problem, which is so far an open problem. We
present only a heuristic procedure, illustrated by several examples, for partially
solving the problem.
assume that there is a DAG D with n variables, including x, y and z such that
when D is marginalized over {x, y, z}, the marginalized DAG represents R. This
assumption leads to a contradiction: Since (x; z|∅) and (y; z|∅) are not in R there
must be trails πxz and πyz in D with no colliders included in them. Let πxy be the
concatenation of the two trails πxz and πzy (which is the trail πyz reversed). Then
πxy connects between x and y and has no colliders on it except perhaps the vertex
z. If z is a collider then (x; y|z) is not represented in D. If z is not a collider then
πxy has no colliders on it and therefore (x; y|∅) is not represented in D. Therefore
R cannot be represented by marginalizing D over {x, y, z}, a contradiction. That R
is a PD-induced relation was shown by Milan Studeny [9, private communication,
2000] as follows:
Consider the PD over the binary variables x, y and the ternary variable z. The
probability of the three variables for the different values of x, y, z is given below
p(0, 0, 0) = p(0, 0, 1) = p(1, 0, 1) = p(1, 0, 2) = 1/8
p(0, 1, 0) = p(1, 1, 1) = 1/4
p(x, y, z) = 0 for all other configurations
The reader can convince himself that the relation induced by the above PD is
the relation R = {(x; y|∅), (x; y|z)}. Notice however that the relation R above is
represented by the annotated graph below
4.1 Example 1
Consider the DAG D shown in Fig. 3. An equivalent AG, representing the same
relation [4, 2003b] is shown in Fig. 4 where the broken lines (edges) represent con-
ditional dependencies and correspond to uncoupled parents of colliders in Fig. 3.
Assume now that we want to marginalize the relation represented by the AG
shown in Fig. 4 over the variables p and q. Using the procedure given in [6, 2006]
we get a relation represented by the AG shown in Fig. 5 below.
The derivation of the AG in Fig. 5 can be explained as follows:
Figure 3. DAG D
[Figure 4. The AG G(D) over the vertices p, q, a, b, c, e, f , with K = { ((p, q), {b, e}),
((a, b), {e}), ((a, p), {e}), ((q, c), {f }) }.]
[Figure 5. The marginalized AG over the vertices a, b, c, e, f , with K = { ((a, b), {e}),
((b, c), {f }), ((a, f ), {e}), ((e, c), {f }), ((a, c), {e ∧ f }) }.]
• The solid edge b-f is induced by the path, in the AG shown in Fig. 4, b-q-f ,
through the marginalized variable q.
• Similarly, the solid edge e-f in Fig. 5 is induced by the path, in Fig. 4, e −
p − q − f through the marginalized variables p and q.
• The element ((a, b), {e}) in Fig. 5 was transferred from Fig. 4 since it involves
only non-marginalized variables.
[Figure 6. Extended AG: the AG of Fig. 5 with an added hidden variable u, with
K = { ((a, b), {e}), ((b, c), {f }), ((a, u), {e}), ((u, c), {f }), ((u, b), {e, f }) }.]
[Figure 7. The equivalent DAG over the vertices a, b, c, e, f and the hidden variable u.]
Given the fact that the relation is not DAG-representable, we ask ourselves
whether it is possible to add hidden variables to the variables of the relation
such that the extended relation is DAG-representable and is such that when it
is marginalized over the added (hidden) variables, it reduces to the given non-DAG-
representable relation. Consider first the fifth element (see Fig. 5), ((a, c), {e ∧ f }),
of K for the given relation. The conditional {e ∧ f } of this element does not fit a
DAG-representable relation.
We notice that this element can be eliminated if we add a hidden variable u that
is connected by solid lines (unconditionally dependent) to the variables e and f and
is conditionally dependent on a with conditional e and is conditionally dependent
on c with conditional f . The resulting graph is shown in Fig. 6.
The reader will easily convince himself that the annotation K shown in Fig. 6
fits the extended graph, where the first 2 elements are inherited from Fig. 5 and
the other 3 elements are induced by the (hidden) new variable u. The solid edge
(e, f ) in Fig. 5 is removed in Fig. 6 since it is implied by the path e − u − f when
the extended relation is marginalized over u. The reader will also easily convince
himself that if the relation shown in Fig. 6 is marginalized over u we get back the
relation shown in Fig. 5. Moreover the annotation K in Fig. 6 complies with the
necessary conditions of Lemma 1. Indeed the relation represented by the AG in Fig.
6 is DAG representable: Just direct all solid edges incident with e into e, direct all
solid edges incident with f into f, and remove all broken edges. The result is shown
in Fig. 7.
Notice that the DAG in Fig. 7 is quite different from the DAG shown in Fig. 3,
but both reduce to the same relation when marginalized over their extra variables.

[Figure 8. The DAG of Example 2, over the vertices 1 through 10.]
[Figure 9. The UG obtained by marginalizing over the vertices 5, 6 and 7, with annotation set
K = {((1,4), {8 ∧ 9}), ((1,9), {8,10}), ((8,4), {9,10})}.]
4.2 Example 2
Consider the DAG shown in Fig. 8. Using methods similar to those we used
in the previous example, we can get an equivalent AG which, when marginalized over
the vertices 5, 6 and 7, results in the UG shown in Fig. 9.
Here again we can check and verify that K does not satisfy the conditions of
Lemma 1; in particular, the first element in K is a compound statement that does
not fit DAG-representable relations. So the relation represented by the AG in Fig. 9
is not DAG-representable. Trying to eliminate the first element in K, we may assume
the existence of a hidden variable u which is connected by solid edges to both 8
and 9, but is conditionally dependent on 1 with conditional 8 and conditionally
dependent on 4 with conditional 9. The resulting extended AG is shown in Fig. 10.
The K in Fig. 10 satisfies the conditions of Lemma 1, and one can see that the AG
is DAG-equivalent. Moreover, marginalizing it over u reduces it to the UG shown
in Fig. 9. The equivalent DAG is shown in Fig. 11.
Notice that the DAG in Fig. 11 is quite different from the DAG in Fig. 8, but both
result in the same AG when marginalizing over their corresponding extra variables.
[Figures 10 and 11: the extended AG with the hidden variable u, and the equivalent DAG,
both over the vertices 1, 2, 3, 4, 8, 9, 10 and u. The annotations shown are
K = {((1,u), {8,10}), ((u,4), {9,10}), ((8,9), {10})} and
K = {((1,u), {8,10}), ((u,4), {9,10}), ((8,9), {u})}.]
4.3 Example 3
In this last example we consider the AG shown in Fig. 2 which is reproduced here,
for convenience as Fig. 12.
While the K sets in the previous examples included elements with compound
conditionals, both conditionals in this example are simple, but the conditions of
Lemma 1 are not satisfied: d in the first element is included in the second conditional,
and a of the second element is included in the first conditional, but neither conditional
is a subset of the other. Consider first the second element. We can introduce
an additional variable u so that u will be conditionally dependent on b, and the
element (u, b), {c, d} will replace the element (a, b), {c, d}. The resulting AG is
shown in Fig. 13 below.
It is easy to see that the graph in Fig. 13 reduces to the graph in Fig. 12 when
marginalized over u. We still need to take care of the first element, since it does not
satisfy the conditions of Lemma 1. We can now add an additional new variable v
and replace the first element (b, d), {a} by the element (u, v), {a}. The resulting
larger AG is shown in Fig. 14. Notice that the graph in Fig. 14 will reduce to the
graph in Fig. 12.
To verify this, we observe that marginalization over u and v induces the element
(b, d), {a}: when a and d are observed, b gets connected to u (by d and the
second element), and u is connected to v by the first element, so that the path
b − u − v − d is activated through the extra variables (b, d and a exist in Fig. 12).
One can also verify that the AG in Fig. 14 is DAG-equivalent after some simple
[Figure 12. AG from Fig. 2, over the vertices a, b, c, d, with
K = {((b,d), {a}), ((a,b), {c,d})}.]
[Figure 13. The AG extended with the hidden variable u, with
K = {((b,d), {a}), ((u,b), {c,d})}.]
modifications: we can remove the edges (a, c) and (a, d) since they are implied when
marginalizing over u and v and we need to add the element (v, c), {d} and a corre-
sponding broken edge between v and c, which will be discarded when marginalizing.
The equivalent DAG is shown in Fig. 15 and is identical with the DAG shown in
Fig. 1.
Acknowledgment
I would like to thank the editors for allowing me to contribute this paper to the
book honoring Judea Pearl, a friend and collaborator for more than 20 years now.
Much of my scientific work in the past 20 years was influenced by discussions I
had with him. His sharp intuition and his ability to pinpoint important problems
and ask the right questions helped me in choosing the problems to investigate and
finding their solutions, resulting in several scientific papers, including this one.
References
[1] S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H. G. Leimer. Independence
properties of directed Markov fields. Networks, 20:491–505, 1990.
[2] S. L. Lauritzen. Graphical Models. Clarendon, Oxford, U.K., 1996.
[3] A. Paz. An alternative version of Lauritzen et al.'s algorithm for checking rep-
resentation of independencies. Journal of Soft Computing, pages 491–505,
2003a.
[4] A. Paz. The annotated graph model for representing DAG-representable re-
lations – algorithmic approach. Technical Report R-312,
[Figure 14. The AG extended with the hidden variables u and v; annotation shown:
K = {((u,v), {a}), ((a,b), {c,d})}.]
[Figure 15. The equivalent DAG, over the vertices a, b, c, d.]
15
Probabilistic Programming Languages:
Independent Choices and Deterministic
Systems
David Poole
Pearl [2000, p. 26] attributes to Laplace [1814] the idea of a probabilistic model as a
deterministic system with stochastic inputs. Pearl defines causal models in terms of
deterministic systems with stochastic inputs. In this paper, I show how determinis-
tic systems with (independent) probabilistic inputs can also be seen as the basis of
modern probabilistic programming languages. Probabilistic programs can be seen
as consisting of independent choices (over which there are probability distributions)
and deterministic programs that give the consequences of these choices. The work
on developing such languages has gone in parallel with the development of causal
models, and many of the foundations are remarkably similar. Most of the work in
probabilistic programming languages has been in the context of specific languages.
This paper abstracts the work on probabilistic programming languages from specific
languages and explains some of the design choices in these languages.
Probabilistic programming languages have a rich history starting from the use of
simulation languages such as Simula [Dahl and Nygaard 1966]. Simula was designed
for discrete event simulations, and the built-in random number generator allowed
for stochastic simulations. Modern probabilistic programming languages bring three
extra features:
conditioning: the ability to make observations about some variables in the simu-
lation and to compute the posterior probability of arbitrary propositions given
these observations. The semantics can be seen in terms of rejection sampling:
accept only the simulations that produce the observed values, but there are
other (equivalent) semantics that have been developed.
inference: more efficient inference for determining posterior probabilities than re-
jection sampling.
learning: the ability to learn the probabilities of the choices from data, as discussed
in Section 5.
In this paper, I explain how we can get from Bayesian networks [Pearl 1988] to in-
dependent choices plus a deterministic system (by augmenting the set of variables).
I explain the results from [Poole 1991; Poole 1993b], abstracted to be language
independent, and show how they can form the foundations for a diverse set of
probabilistic programming languages.
[Figure: a belief network with structure A → B → C.]
There are 5 free parameters to be assigned for this model; for concreteness assume
the following values (where A = true is written as a, and similarly for the other
variables):
P(a) = 0.1
P(b|a) = 0.8
P(b|¬a) = 0.3
P(c|b) = 0.4
P(c|¬b) = 0.75
language specifies what follows from them. For example, in Simula [Dahl and Ny-
gaard 1966], this could be represented as:
begin
   Boolean a, b, c;
   a := draw(0.1);
   if a then
      b := draw(0.8)
   else
      b := draw(0.3);
   if b then
      c := draw(0.4)
   else
      c := draw(0.75);
end
where draw(p) is a Simula system predicate that returns true with probability p;
each time it is called, there is an independent drawing.
Suppose c was observed, and we want the posterior probability of b. The con-
ditional probability P (b|c) is the proportion of those runs with c true that have b
true. This could be computed using the Simula compiler by doing rejection sam-
pling: running the program many times, and rejecting those runs that do not assign
c to true. Out of the non-rejected runs, it returns the proportion that have b true.
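A minimal Python sketch of this rejection-sampling computation (the probabilities are
those of the example network; the function names are mine):

import random

def run_model():
    # One simulated run: each draw is an independent Bernoulli choice.
    a = random.random() < 0.1
    b = random.random() < (0.8 if a else 0.3)
    c = random.random() < (0.4 if b else 0.75)
    return a, b, c

def estimate_b_given_c(num_runs=100000):
    accepted = with_b = 0
    for _ in range(num_runs):
        a, b, c = run_model()
        if c:                       # reject the runs that do not assign c to true
            accepted += 1
            with_b += b
    return with_b / accepted        # proportion of non-rejected runs with b true

print(estimate_b_given_c())         # approximately 0.22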
Of course, conditioning does not need to be implemented that way; much of the de-
velopment of probabilistic programming languages over the last twenty years is in
devising more efficient ways to implement conditioning.
Another equivalent model to the Simula program can be given in terms of logic.
There can be 5 random variables, corresponding to the independent draws; let's
call them A, Bifa, Bifna, Cifb and Cifnb. These are independent, with P(a) = 0.1,
P(bifa) = 0.8, P(bifna) = 0.3, P(cifb) = 0.4, and P(cifnb) = 0.75. The other
variables can be defined in terms of these:

b ↔ (a ∧ bifa) ∨ (¬a ∧ bifna)
c ↔ (b ∧ cifb) ∨ (¬b ∧ cifnb)
These two formulations are essentially the same; they differ only in how the determin-
istic system is specified, whether in Simula or in logic.
Any discrete belief network can be represented as a deterministic system with
independent inputs. This was proven by Poole [1991, 1993b] and Druzdzel and
Simon [1993]. These papers used different languages for the deterministic systems,
but gave essentially the same construction.
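A sketch of the construction in Python (the variable names follow the logical formulation
above; the enumeration is mine): the five independent inputs carry the free parameters,
and a deterministic function gives the original variables.

from itertools import product

# Independent inputs and their probabilities of being true.
priors = {'A': 0.1, 'Bifa': 0.8, 'Bifna': 0.3, 'Cifb': 0.4, 'Cifnb': 0.75}

def consequences(w):
    # The deterministic system: A, B, C as a function of the independent inputs.
    a = w['A']
    b = w['Bifa'] if a else w['Bifna']
    c = w['Cifb'] if b else w['Cifnb']
    return a, b, c

def prob(w):
    # The probability of one augmented world is the product over the inputs.
    p = 1.0
    for name, value in w.items():
        p *= priors[name] if value else 1 - priors[name]
    return p

# Exact P(b | c), summing over all 2**5 assignments to the inputs.
worlds = [dict(zip(priors, values)) for values in product([False, True], repeat=5)]
p_c = sum(prob(w) for w in worlds if consequences(w)[2])
p_b_and_c = sum(prob(w) for w in worlds if consequences(w)[1] and consequences(w)[2])
print(p_b_and_c / p_c)              # 0.14 / 0.6275, approximately 0.223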
• The rejection sampling semantics: running the program with a random num-
ber generator, removing those runs that do not predict the observations, the
posterior probability of a proposition is the limit, as the number of runs in-
creases, of the proportion of the non-rejected runs that have the proposition
true.
In the logical definition of the belief network (or in the Simula definition if the
draws are named), there are 32 worlds in the independent choice semantics:
World   A      Bifa   Bifna  Cifb   Cifnb  Probability
w0      false  false  false  false  false  0.9 × 0.2 × 0.7 × 0.6 × 0.25
w1      false  false  false  false  true   0.9 × 0.2 × 0.7 × 0.6 × 0.75
...
w30     true   true   true   true   false  0.1 × 0.8 × 0.3 × 0.4 × 0.25
w31     true   true   true   true   true   0.1 × 0.8 × 0.3 × 0.4 × 0.75
The probability of each world is the product of the probability of each variable (as
each of these variables is assumed to be independent). Note that in worlds w30 and
w31, the original variables A, B and C are all true; the value of Cifnb is not used
when B is true. These variables are also all true in the worlds that only differ in
the value of Bifna, as again, Bifna is not used when A is true.
In the program-trace semantics there are 8 worlds for this example, but not all
of the augmented variables are defined in all worlds.
World   A      Bifa   Bifna  Cifb   Cifnb  Probability
w0      false  ⊥      false  ⊥      false  0.9 × 0.7 × 0.25
w1      false  ⊥      false  ⊥      true   0.9 × 0.7 × 0.75
...
w6      true   true   ⊥      false  ⊥      0.1 × 0.8 × 0.6
w7      true   true   ⊥      true   ⊥      0.1 × 0.8 × 0.4
where ⊥ means the variable is not defined in this world. These worlds cover all 8
assignments of truth values to the original variables A, B and C. The
values of A, B and C can be obtained from the program. The idea is that a run
of the program is never going to encounter an undefined value. The augmented
worlds can be obtained from the worlds defined by the program trace by splitting
the worlds on each value of the undefined variables. Thus each augmented world
corresponds to a set of possible worlds, where the distinctions that are ignored do
not make a difference in the probability of the original variables.
While it may seem that we have not made any progress (after all, this is just a
simple Bayesian network), we can do the same thing for any program with prob-
abilistic inputs. We just need to define the independent inputs (often these are
called noise inputs), and a deterministic program that gives the consequences of
the choices of values for these inputs. It is reasonably easy to see that any belief
network can be represented in this way, where the number of independent inputs is
equal to the number of free parameters of the belief network. However, we are not
restricted to belief networks. The programs can be arbitrarily complex. We also
do not need special “original variables”, but can define the augmented worlds with
respect to any variables of interest. Observations and queries (about which we want
the posterior probability) can be propositions about the behavior of the program
(e.g., that some assignment of the program variables becomes true).
When the language is Turing-equivalent, a single world can involve countably in-
finitely many draws, and thus there can be uncountably many worlds. A typical assumption is that
the program eventually infers the observations and the query, that is, each run of
the program will eventually (with probability 1) assign a truth value to any given
observation and query. This is not always the case, such as when the query is to
determine the fixed point of a Markov chain (see e.g., Pfeffer and Koller [2000]). We
could also have non-discrete choices using continuous variables, which complicates
but does not invalidate the discussion here.
A probability measure is over sets of possible worlds that form an algebra or a
σ-algebra, depending on whether we want finite additivity or countable additivity
[Halpern 2003]. For a programming language, we typically want countable additiv-
ity, as this allows us to not have a bound on the number of steps it takes to prove
a query. For example, consider a person who plays the lottery until they win. The
person will win eventually. This case is easy to represent as a probabilistic program,
but requires reasoning about an infinite set of worlds.
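A Python sketch of such a lottery program (the win probability is an arbitrary
illustration): it halts with probability 1, yet no bound on the number of draws can be
given in advance.

import random

def plays_until_win(p_win=0.001):
    # Keep playing independent lotteries until the first win.
    plays = 0
    while True:
        plays += 1
        if random.random() < p_win:
            return plays

print(plays_until_win())            # a geometrically distributed number of plays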
The typical σ-algebra consists of the sets of worlds that can be finitely described, and
their (countable) unions. Finitely describable means there is a finite set of draws that have
their outcomes specified. Thus the probability measure is over sets of worlds that
all have the same outcomes for a finite set of draws, and the union of such sets of
worlds. We have a measure over such sets by treating the draws as independent.
3 Abductive Characterization
Abduction is a form of reasoning characterized by “reasoning to the best explana-
tion”. It is typically characterized by finding a minimal consistent set of assumables
that imply some observation. This set of assumables is called an explanation of the
observation.
Poole [1991, 1993b] gave an abductive characterization of a probabilistic pro-
gramming language, which gave a mapping between the independent possible world
structure, and the descriptions of the worlds produced by abduction. This notion of
abduction lets us construct a concise set of sets of possible worlds that is adequate
to infer the posterior probability of a query.
The idea is that the independent inputs become assumables. Given a prob-
abilistic program, a particular observation obs and a query q, we characterize a
(minimal) partition of possible worlds, where
• in each partition the same (finite) set of choices for the values of some of the
inputs is made.
This is similar to the program-trace semantics, but will only need to make distinc-
tions relevant to computing P (q|obs). Given a probabilistic program, an observation
and a query, the “explanations” of the observation conjoined with the query or its
negation, produces such a partition of possible worlds.
In the example above, if the observation was C = true and the query was B, we
want the minimal assignments of values to the independent choices that give
C = true ∧ B = true or C = true ∧ B = false. There are 4 such explanations:

{a, bifa, cifb},  {¬a, bifna, cifb},  {a, ¬bifa, cifnb},  {¬a, ¬bifna, cifnb}
The probability of each of these explanations is the product of the choices made,
as these choices are independent. The posterior probability P (B|C = true) can be
easily computed by the weighted sum of the explanations in which B is true. Note
also that the same explanations would be true even if C has unobserved descendants.
As the number of descendants could be infinite if they were generated by a program,
it is better to construct the finite relevant parts than prune the infinite irrelevant
parts.
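Continuing the example in Python (a sketch; the explanations are written out by hand
here rather than produced by an abduction procedure):

priors = {'A': 0.1, 'Bifa': 0.8, 'Bifna': 0.3, 'Cifb': 0.4, 'Cifnb': 0.75}

# Each explanation is a partial assignment to the independent choices, paired with
# the truth value it yields for the query B (the observation C = true holds in all).
explanations = [
    ({'A': True,  'Bifa': True,   'Cifb': True},  True),
    ({'A': False, 'Bifna': True,  'Cifb': True},  True),
    ({'A': True,  'Bifa': False,  'Cifnb': True}, False),
    ({'A': False, 'Bifna': False, 'Cifnb': True}, False),
]

def weight(choices):
    # The probability of an explanation is the product of its independent choices.
    p = 1.0
    for name, value in choices.items():
        p *= priors[name] if value else 1 - priors[name]
    return p

p_obs = sum(weight(ch) for ch, _ in explanations)             # P(C = true)
p_b_and_obs = sum(weight(ch) for ch, b in explanations if b)  # P(B = true, C = true)
print(p_b_and_obs / p_obs)                                    # approximately 0.223

Only the four explanations are touched, however many unobserved descendants C might have.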
In an analogous way to how the probability of a real-valued variable is defined as a
limit of discretizations, we can compute the posterior probability of a query given
a probabilistic programming language. This may seem unremarkable until it is
realized that for programs that are guaranteed to halt, there can be countably
many possible worlds, and so there are uncountably many sets of possible worlds,
over which to place a measure. For programs that are not guaranteed to halt, such
as a sequence of lotteries, there are uncountably many possible worlds, and even
more sets of possible worlds upon which to place a measure. Abduction gives us the
sets of possible worlds in which to answer a conditional probability query. When
the programs are not guaranteed to halt, the posterior probability of a query can
be defined as the limit of the sets of possible worlds created by abduction, as long
as the query can be derived in finite time for all but a set of worlds with measure
zero.
In terms of the Simula program, explanations correspond to execution paths. In
particular, an explanation corresponds to the outcomes of the draws in one trace
of the program that infers the observations and a query or its negation. The set
of traces of the program gives a set of possible worlds from which to compute
probabilities.
When the program is a logic program, it isn’t obvious what the program-trace
semantics is. However, the semantics in terms of independent choices and abduction
is well-defined. Thus it seems like the semantics in terms of abduction is more
general than the program-trace semantics, as it is more generally applicable. It is
also possible to define the abductive characterization independently of the details of
the programming language, whereas defining a trace or run of a program depends
on the details of the programming language.
Note that this abductive characterization is unrelated to MAP or MPE queries;
we are defining the marginal posterior probability distribution over the query vari-
ables.
4 Inference
Earlier algorithms (e.g. Poole [1993a]) extract the minimal explanations and com-
pute conditional probabilities from these. Later algorithms, such as used in IBAL
[Pfeffer 2001], use sophisticated variable elimination to carry out inference in this
space. IBAL’s computation graph corresponds to a graphical representation of the
explanations. Problog [De Raedt, Kimmig, and Toivonen 2007] compiles the com-
putation graph into BDDs.
In algorithms that exploit the conditional independence structure, like variable
elimination or recursive conditioning, the order that variables are summed out or
split on makes a big difference to efficiency. In the independent choice semantics,
there are more options available for summing out variables, thus there are more
options available for making inference efficient. For example, consider the following
fragment of a Simula program:
begin
   Boolean x;
   x := draw(0.1);
   if x then
   begin
      Boolean y;
      y := draw(0.2);
      ...
   end
   else
   begin
      Boolean z;
      z := draw(0.7);
      ...
   end;
   ...
end
Here y is only defined when x is true and z is only defined when x is false. In
the program-trace semantics, y and z are never both defined in any world. In
the independent choice semantics, both are defined in every world.
5 Learning Probabilities
The other aspect of modern probabilistic programming languages is the ability to
learn the probabilities. As the input variables are rarely observed, the standard
way to learn the probabilities is to use EM. Learning probabilities using EM in
probabilistic programming languages is described by Sato [1995] and Koller and
Pfeffer [1997]. In terms of available programming languages, EM forms the basis
for learning in Prism [Sato and Kameya 1997; Sato and Kameya 2001], IBAL [Pfeffer
2001; Pfeffer 2007] and many subsequent languages.
One can do EM learning in either of the semantic structures. The difference is
whether some data updates the probabilities of parameters that were not involved
in computing the data. By making this choice explicit, it is easy to see that one
should use the abductive characterization to only update the probabilities of the
choices that were used to derive the data.
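A generic sketch of what such an EM update might look like for the independent-choice
representation (this is not the algorithm of any particular system; the explanations of
each datum are assumed to have been computed already, as in Section 3):

def explanation_weight(choices, params):
    # Probability of an explanation: the product over the choices it uses.
    p = 1.0
    for name, value in choices.items():
        p *= params[name] if value else 1 - params[name]
    return p

def em_step(data_explanations, params):
    # data_explanations holds, for each observed datum, the list of explanations
    # (partial assignments to the independent choices) of that datum.
    true_count = {name: 0.0 for name in params}
    used_count = {name: 0.0 for name in params}
    for explanations in data_explanations:
        weights = [explanation_weight(ch, params) for ch in explanations]
        norm = sum(weights)
        for ch, w in zip(explanations, weights):
            for name, value in ch.items():
                used_count[name] += w / norm     # only choices used for this datum
                true_count[name] += w / norm if value else 0.0
    # Choices that appear in no explanation keep their current probability.
    return {name: true_count[name] / used_count[name] if used_count[name] else params[name]
            for name in params}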
Structure learning for probabilistic programming languages has really only been
explored in the context of logic programs, where the techniques of inductive logic
programming can be applied. De Raedt, Frasconi, Kersting, and Muggleton [2008]
overview this active research area.
6 Causal Models
It is interesting that research on causal modelling and on probabilistic programming
languages has gone on in parallel, with similar foundations, but only recently have
researchers started to combine them by adding causal constructs to probabilistic
programming languages [Finzi and Lukasiewicz 2003; Baral and Hunsaker 2007;
Vennekens, Denecker, and Bruynooghe 2009].
In some sense, the programming languages can be seen as representations for
all of the counterfactual situations. A programming language gives a model when
some condition is true, but also defines the “else” part of a condition; what happens
when the condition is false.
In the future, I expect that programming languages will be the preferred way to
specify causal models, and for interventions and counterfactual reasoning to become
part of the repertoire of probabilistic programming languages.
7 Observation Languages
These languages can be used to compute conditional probabilities by having an
“observer” (either humans or sensors) making observations of the world that are
conditioned on. One problem that has long gone unrecognized is that it is often
not obvious how to condition when the language allows for multiple individuals and
relations among them. There are two main problems:
• The observer has to know what to specify and what vocabulary to use. Unfor-
tunately, we can’t expect an observer to “tell us everything that is observed”.
First, there are an unbounded number of true things one can say about a
world. Second, the observer does not know what vocabulary to use to describe
their observations of the world. As probabilistic models get more integrated
into society, the models have to be able to use observations from multiple
people and sensors. Often these observations are historic, or are created asyn-
chronously by people who don’t even know the model exists and are unknown
when the model is being built.
• When there are unique names, so that the observer knows which object(s)
a model is referring to, an observer can provide a value to the random variable
corresponding to the property of the individual. However, models often refer
to roles [Poole 2007]. The problem is that the observer does not know which
individual in the world fills a role referred to in the program (indeed there
is often a probability distribution over which individuals fill a role). There
needs to be some mechanism other than asking for the observed value
of a random variable or program variable, or the value of a property of an
individual.
There are two areas that have spun off from the expert systems work of the 1970's and 1980's.
One is the probabilistic revolution pioneered by Pearl. The other is often under the
umbrella of knowledge representation and reasoning [Brachman and Levesque 2004].
A major aspect of this work is in the representation of ontologies that specify the
meaning of symbols. An ontology language that has come to prominence recently is
the language OWL [Hitzler, Krötzsch, Parsia, Patel-Schneider, and Rudolph 2009]
which is one of the foundations of the semantic web [Berners-Lee, Hendler, and
Lassila 2001]. There has recently been work on representing ontologies to integrate
with probabilistic inference [da Costa, Laskey, and Laskey 2005; Lukasiewicz 2008;
Poole, Smyth, and Sharma 2009]. This is important for Bayesian reasoning, where
we need to condition on all available evidence; potentially applicable evidence is
(or should be) published all over the world. Finding and using this evidence is a
major challenge. This problem is being investigated under the umbrella of semantic
science [Poole, Smyth, and Sharma 2008].
To understand the second problem, suppose we want to build a probabilistic
program to model what apartments people will like for an online apartment finding
service. This is an example where models of what people want and descriptions
of the world are built asynchronously. Rather than modelling people’s preferences,
suppose we want to model whether they would want to move in and be happy
there in 6 months time (this is what the landlord cares about, and presumably
what the tenant wants too). Suppose Mary is looking for an apartment for her and
her daughter, Sam. Whether Mary likes an apartment depends on the existence
and the properties of Mary’s bedroom and of Sam’s bedroom (and whether they
are the same room). Whether Mary likes a room depends on whether it is large.
Whether Sam likes a room depends on whether it is green. Figure 1 gives one
possible probability model, using a belief network, that follows the above story.
If we observe a particular apartment, such as the one on the right of Figure
1, it isn’t obvious how to condition on the observations to determine the posterior
probability that the apartment is suitable for Mary. The problem is that apartments
don’t come labelled with Mary’s bedroom and Sam’s bedroom. We need some role
assignment that specifies which bedroom is Mary’s and which bedroom is Sam’s.
However, which room Sam chooses depends on the colour of the room. We may
also like to know the probability that a bachelor’s apartment (that contains no
bedrooms) would be suitable.
To solve the second problem, we need a representation of observations. These
observations and the programs need to refer to interoperating ontologies. The ob-
servations need to refer to the existence of objects, and so would seem to need some
subset of the first-order predicate calculus. However, we probably don’t want to
allow arbitrary first-order predicate calculus descriptions of observations. Arguably,
people do not observe arbitrary disjunctions. One simple, yet powerful, observation
language, based on RDF [Manola and Miller 2004] was proposed by Sharma, Poole,
and Smyth [2010]. It is designed to allow for the specification of observations of
9 Conclusion
This paper has concentrated on the similarities, rather than the differences, between
probabilistic programming languages. Much of the research in the area has con-
centrated on specific languages, and this paper is an attempt to put a unifying
structure on this work, in terms of independent choices and abduction.
References
Andre, D. and S. Russell (2002). State abstraction for programmable reinforce-
ment learning agents. In Proc. AAAI-02.
Baral, C. and M. Hunsaker (2007). Using the probabilistic logic programming
language P-log for causal and counterfactual reasoning and non-naive condi-
tioning. In Proc. IJCAI 2007, pp. 243–249.
Berners-Lee, T., J. Hendler, and O. Lassila (2001). The semantic web: A new
1 We actually use it to teach logic programming to beginning students. They use it for assign-
ments before they learn that it can also handle probabilities. The language shields the students
from the non-declarative aspects of languages such as Prolog, and has many fewer built-in predi-
cates to encourage students to think about the problem they are trying to solve.
16
Arguing with a Bayesian Intelligence
Ingrid Zukerman
1 Introduction
Bayesian Networks (BNs) [Pearl 1988] constitute one of the most influential ad-
vances in Artificial Intelligence, with applications in a wide range of domains, e.g.,
meteorology, agriculture, medicine and environment. To further capitalize on its
clear technical advantages, a Bayesian intelligence (a computer system that em-
ploys a BN as its knowledge representation and reasoning formalism) should be
able to communicate with its users, i.e., users should be able to put forward their
views, and the system should be able to generate responses in turn. However, com-
munication between a Bayesian and a human intelligence poses some challenges, as
people generally do not engage in normative probabilistic reasoning when faced with
uncertainty [Evans, Barston, and Pollard 1983; Lichtenstein, Fischhoff, and Phillips
1982; Tversky and Kahneman 1982]. In addition, human discourse is typically en-
thymematic (i.e., it omits easily inferable information), and usually the beliefs and
inference patterns of conversational partners are not perfectly synchronized. As a
result, an addressee’s understanding may differ from the message intended by his
or her conversational partner.
In this chapter, we offer a mechanism that enables a Bayesian intelligence to
interpret human arguments for or against a proposition. This mechanism, which is
implemented in a system called bias (Bayesian Interactive Argumentation System),
constitutes a building block of a future system that will enable a Bayesian reasoner
to communicate with people.1
In order to address the above challenges, we adopt the view that discourse inter-
pretation is the process of integrating the contribution of a conversational partner
into the addressee’s mental model [Kashihara, Hirashima, and Toyoda 1995; Kintsch
1994], which in bias’s case is a BN. Notice, however, that when performing such
an integration, one cannot be sure that one is drawing the intended inferences or
reinstating the exact information omitted by the user. All an addressee can do is
construct an account of the conversational partner’s discourse that makes sense to
him or her. An interpretation of an argument that makes sense to bias is a subnet
of its BN and a set of beliefs.
To illustrate these ideas, consider the argument in Figure 1(a) regarding the guilt
1 The complementary building block, a mechanism that generates arguments from BNs, is
described in [Korb, McConachy, and Zukerman 1997; Zukerman, McConachy, and Korb 1998].
Fingerprints being found on the gun, and forensics matching the fin-
gerprints with Mr Green implies that Mr Green probably had the means
to murder Mr Body.
The Bayesian Times reporting that Mr Body seduced Mr Green’s
girlfriend implies that Mr Green possibly had a motive to murder Mr Body.
Since Mr Green probably had the means to murder Mr Body, and Mr Green
possibly had a motive to murder Mr Body, then Mr Green possibly murdered
Mr Body.
[Figure 1. (a) the argument quoted above; (b) a fragment of the domain BN, with nodes
including FoundGun, ForensicMatchBulletsWithFoundGun, BulletsFoundInBody'sBody,
FoundGunIsMurderWeapon, BodyWasMurdered, BayesTimesReportBodySeduceGreen'sGirlfriend
and BodySeduceGreen'sGirlfriend.]
2 We use the following linguistic terms, which are similar to those used in [Elsaesser 1987], to
convey degree of belief: Very Probable, Probable, Possible and their negations, and Even Chance.
According to our surveys, these terms are the most consistently understood by people [George,
Zukerman, and Niemann 2007].
3 The observable evidence nodes are boxed, and the evidence nodes that were actually ob-
served by the user are boldfaced, as are the evidence nodes employed in the argument. The
nodes corresponding to the consequents in the user's argument (GreenHasMeans, GreenHasMotive
and GreenMurderedBody) are italicized and oval-shaded.
2 What is an Interpretation?
As mentioned in Section 1, we view the interpretation of an argument as a “self
explanation” — an account of the argument that makes sense to the addressee.
For bias, such an account is specified by a tuple {IG, SC, EE}, where IG is an
interpretation graph, SC is a supposition configuration, and EE are explanatory
extensions.4
4 In our initial work, our interpretations contained only interpretation graphs [Zukerman and
George 2005]. Subsequent trials with users demonstrated the need for supposition configurations
and explanatory extensions [George, Zukerman, and Niemann 2007].
the interpretation graph. Note that explanatory extensions do not affect the beliefs
in an interpretation, as they simply state previously held beliefs.
3 Proposing Interpretations
The problem of finding the best interpretation is exponential, as there are many
candidates for each component of an interpretation, and complex interactions be-
tween supposition configurations and interpretation graphs. For example, making
a supposition could invalidate an otherwise sound line of reasoning.
In order to generate reasonable interpretations in real time, we apply Algorithm 1
— an (almost) anytime algorithm [Dean and Boddy 1988; Horvitz, Suermondt, and
Cooper 1989] that iteratively proposes interpretations until time runs out, i.e., until
the system has to act upon a preferred interpretation or show the user one or more
interpretations for validation [George, Zukerman, and Niemann 2007; Zukerman and
George 2005]. At present, our interaction with the user stops when interpretations
Argument (connected propositions): Mr Green probably being in the garden at 11
implies that Mr Green possibly had the opportunity to kill Mr Body.

(c) Interpretation (SC1, IG11, EE11), with GreenInGardenAtTimeOfDeath [EvenChance]:
I know this is not quite what you said, but it is the best I could do given what I
believe. Since it is probable that Mr Green was in the garden at 11, and it is even
chance that the time of death was 11, it is even chance that Mr Green was in the
garden at the time of death, which implies that it is even chance that he had the
opportunity to kill Mr Body.

(d) Interpretation (SC2, IG21, EE21), with GreenInGardenAtTimeOfDeath [Probably]:
Supposing that the time of death was 11, Mr Green probably being in the garden
at 11 implies that he probably was in the garden at the time of death, which implies
that he possibly had the opportunity to kill Mr Body.
user’s belief in the consequent of the argument differs from the belief obtained by
bias by means of Bayesian propagation from the evidence nodes in the domain BN.
As indicated above, bias attempts to address this problem by making suppositions
about the user’s beliefs. The first level of the sample search tree in Figure 3(b)
contains three supposition configurations SC1, SC2 and SC3. SC1 posits no beliefs
that differ from those in the domain BN, thereby retaining the mismatch between
the user’s belief in the consequent and bias’s belief; SC2 posits that the user believes
that the time of death is 11; and SC3 posits that the user believes that Mr Green
visited Mr Body last night.
The best interpretation graph for SC1 is IG11 (the evaluation of the goodness
of an interpretation is described in Section 4). Here the belief in the consequent
differs from that stated by the user, prompting the generation of a preface that
acknowledges this fact. In addition, the interpretation graph has a large jump in
belief (from Probably to EvenChance), which causes bias to add the mutually be-
lieved proposition TimeOfDeath11[EvenChance] as an explanatory extension. The
resultant interpretation and its gloss appear in Figure 3(c). The best interpre-
tation graph for SC2 is IG21, which matches the beliefs in the user’s argument.
The resultant interpretation and its gloss appear in Figure 3(d). Note that both
(SC1, IG11, EE11) and (SC2, IG21, EE21) mention TimeOfDeath11, but in the first
interpretation this proposition is used as an explanatory extension (with a belief of
EvenChance obtained by Bayesian propagation), while in the second interpretation
it is used as a supposition (with a belief of true). Upon completion of this pro-
cess, bias retains the K most probable interpretations. In this example, the best
interpretation is {SC2, IG21, EE21}.
tecedent yields a high/low probability for the consequent. We posit similar preferences for inverse
inferences, where a low/high probability antecedent yields a high/low probability consequent. The
work described in [George, Zukerman, and Niemann 2007] contains additional, more fine-grained
categories of inferences, but here we restrict our discussion to the main ones.
4 Probabilistic Formalism
As mentioned in Section 1, the Minimum Message Length (MML) principle [Wallace
2005] selects the simplest model that explains the observed data. In our case, the
data are the argument given by a user, and the candidate models are interpretations
of this argument. In addition to the data and the model, the MML principle requires
the specification of background knowledge — information shared by the system and
the user prior to the argument, e.g., domain knowledge (including shared beliefs)
and dialogue history.
We posit that the best interpretation is that with the highest posterior probabil-
ity.
IntBest = argmax_{i=1,...,q} Pr(IGi, SCi, EEi | Argument, Background)
By Bayes rule, this posterior is proportional to Pr(IGi, SCi, EEi | Background) ×
Pr(Argument | IGi, SCi, EEi, Background).
The first factor, which is also known as model complexity, represents the prior
probability of the model, and the second factor represents data fit.
• Data fit measures how similar the data (argument) are to the model (inter-
pretation). The closer the data are to the model, the higher the probability
of the data given the model (i.e., the probability that the user uttered the
argument when he or she intended the interpretation in question).
Both the argument and its interpretation contain structural information and be-
liefs. The beliefs are simply those stated in the argument and in the interpretation,
and suppositions made as part of the interpretation. The structural part of the
argument comprises the stated propositions and the relationships between them,
while the structural part of the interpretation comprises the interpretation graph
and explanatory extensions. As stated above, smaller, simpler structures usually
have a higher prior probability than larger, more complex ones. However, the
simplest structure is not necessarily the best overall. For instance, the simplest
possible interpretation for any argument consists of a single proposition, but this
interpretation usually yields a poor data fit with most arguments. An increase in
structural complexity (and corresponding reduction in probability) may reduce the
discrepancy between the argument structure and the structure of the interpretation
graph, thereby improving data fit. If this improvement overcomes the reduction
in probability due to the higher model complexity, we obtain a higher-probability
interpretation overall.
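The balance can be sketched as follows (the numbers are purely hypothetical; the point
is only that a better data fit can pay for the extra model complexity):

import math

def log_posterior(prior, data_fit):
    # Score of an interpretation: model complexity (prior) plus data fit, in log space.
    return math.log(prior) + math.log(data_fit)

# A single-proposition interpretation: simple, but it fits the argument poorly.
simple = log_posterior(0.20, 0.001)
# A larger interpretation: lower prior, but a much better fit to the argument.
larger = log_posterior(0.05, 0.100)

print('larger' if larger > simple else 'simple')   # larger wins despite its complexity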
The techniques employed to calculate the prior probability and data fit for both
types of information are outlined below (a detailed description appears in [George,
Zukerman, and Niemann 2007; Zukerman and George 2005]).
The factors in Equation 2 are described below (we consider them from last to
first for clarity of exposition).
Supposition configuration: Pr(SCi |Background)
A supposition configuration addresses mismatches between the beliefs expressed
in an argument and those in an interpretation. It comprises beliefs attributed
to the user in light of the beliefs shared with the system, which are encoded in
the background knowledge. Making suppositions has a lower probability than not
making suppositions (which has no discrepancy with the background knowledge).
However, as seen in the example in Figure 2(c), making a supposition that reduces
or eliminates the discrepancy between the beliefs stated in the argument and those
in the interpretation increases the data fit for beliefs.
Pr(SCi |Background) reflects how close the suppositions in a supposition config-
uration are to the current beliefs in the background knowledge. The closer they
are, the higher the probability of the supposition configuration. Assuming condi-
tional independence between the supposition for each node given the background
knowledge yields
Pr(SCi | Background) = ∏_{j=1}^{N} Pr(sij | BelBkgrd(j))

where N is the number of nodes in the BN, sij is the supposition made for node j
in supposition configuration SCi, and BelBkgrd(j) is the belief in node j according
to the background knowledge. Pr(sij | BelBkgrd(j)) is estimated by means of the
heuristic function H.
where Type(sij) is the type of supposition sij (supposing nothing, supposing evi-
dence, or forgetting evidence), and Bel(sij) is the value of the supposition (true or
false when evidence is supposed for node j; and the belief in node j obtained from
belief propagation in the BN when evidence is forgotten for node j). Specifically,
we posit that supposing nothing has the highest probability, and supposing the
truth or falsehood of an inferred value is more probable than forgetting seen evi-
dence [George, Zukerman, and Niemann 2007]. In addition, strongly believed (high
probability) propositions are more likely to be supposed true than weakly believed
(lower probability) propositions, and weakly believed propositions are more likely
to be supposed false than strongly believed propositions [Lichtenstein, Fischhoff,
and Phillips 1982].
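A sketch of the calculation (the heuristic values returned by H below are illustrative
placeholders, not the distributions used by bias, and the node names are invented for the
example):

def h(supposition_type, background_belief):
    # Hypothetical heuristic H: supposing nothing is most probable; supposing an
    # inferred value is more probable the more strongly it is already believed;
    # forgetting seen evidence is least probable.
    if supposition_type == 'none':
        return 0.9
    if supposition_type == 'suppose_evidence':
        return 0.09 * background_belief
    return 0.01                                    # 'forget_evidence'

def pr_supposition_configuration(suppositions, background_beliefs):
    # Product over the nodes of the BN, assuming conditional independence of the
    # suppositions given the background knowledge.
    p = 1.0
    for node, belief in background_beliefs.items():
        p *= h(suppositions.get(node, 'none'), belief)
    return p

background = {'TimeOfDeath11': 0.5, 'GreenInGardenAt11': 0.8}
print(pr_supposition_configuration({'TimeOfDeath11': 'suppose_evidence'}, background))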
Structure of an interpretation: Pr(struct of IGi |SCi , Background)
Pr(struct of IGi |SCi , Background) is the probability of selecting the nodes and arcs
in IGi from the domain BN (which is part of the background knowledge). The
calculation of this probability is described in detail in [Zukerman and George 2005].
In brief, the prior probability of the structure of an interpretation graph is estimated
using the combinatorial notion of selecting the nodes and arcs in the graph from
those in the domain BN. To implement this idea, we specify an interpretation graph
IGi by indicating the number of nodes in it (ni ), the number of arcs (ai ), and the
actual nodes and arcs in it (Nodesi and Arcsi respectively). Thus, the probability
of the structure of IGi in the context of the domain BN (composed of A arcs and
tion generally prefers small models to larger ones.7 In [Zukerman and George
2005], we considered salience — obtained from dialogue history, which is part
of the background knowledge — to moderate the probability of selecting a
node. According to this scheme, recently mentioned nodes are more salient
(and have a higher probability of being selected) than nodes mentioned less
recently.
and its consequent (rather than to connect between the propositions in an argu-
ment). These expectations, which are part of the background knowledge, were
obtained from our user studies. Explanatory extensions have no belief component,
as the nodes in them do not provide additional evidence, and hence do not affect
the beliefs in a BN.
Interpretations with explanatory extensions are more complex, and hence have a
lower probability, than interpretations without such extensions. At the same time,
as shown in the example in Figure 2(d), an explanatory extension that overcomes
an expectation violation regarding the consequent of an inference improves the
acceptance of the interpretation, thereby increasing the probability of the model.
According to our surveys, explanatory extensions that yield BothSides inferences
are preferred to those that yield SameSide inferences. In addition, as for interpreta-
tion graphs, shorter explanatory extensions are preferred to longer ones. Thus, our
estimate of the structural probability of explanatory extensions balances the size
of explanatory extensions (number of propositions) against their type (inference
category), as follows.8
Beliefs in an interpretation:
Pr(beliefs in IGi |struct of IGi , SCi , EEi , Background)
The beliefs in an interpretation IGi are estimated by performing Bayesian propa-
gation from the beliefs in the domain BN and the suppositions. This is an algo-
rithmic process, hence the probability of obtaining the beliefs in IGi is 1. However,
the background knowledge has another aspect, viz users’ expectations regarding
inferred beliefs. In our preliminary trials, users objected to inferences that had
increases in certainty or large changes in belief from their antecedents to their con-
sequent [Zukerman and George 2005].
Thus, interpretations that contain objectionable inferences have a lower prob-
ability than interpretations where the beliefs in the consequents of the inferences
fall within an “acceptable range” of the beliefs in their antecedents. We use the
categories of acceptable inferences obtained from our surveys to estimate the prob-
ability of each inference in an interpretation — these categories define an acceptable
range of beliefs for the consequent of an inference given its antecedents. For ex-
ample, an inference with antecedents A[Probably] & B[Possibly] has the acceptable
belief range {Probably, Possibly, EvenChance} for its consequent. The probability
of an inference whose consequent falls within the acceptable range is higher than
the probability of an inference whose consequent falls outside this range. In addi-
tion, we extrapolate from the results of our surveys, and posit that the probability
of an unacceptable inference decreases as the distance of its consequent from the
acceptable range increases. We use the Zipf distribution to model the probability of
an inference, where the “rank” is the distance between the belief in the consequent
and the acceptable belief range.
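A sketch of this use of the Zipf distribution (the belief scale, the exponent and the lack
of normalization are illustrative assumptions):

BELIEFS = ['VeryProbablyNot', 'ProbablyNot', 'PossiblyNot', 'EvenChance',
           'Possibly', 'Probably', 'VeryProbably']

def inference_probability(consequent_belief, acceptable_range, s=1.0):
    # Zipf-like score: rank 1 inside the acceptable range, decaying with the
    # distance of the consequent's belief from that range.
    idx = BELIEFS.index(consequent_belief)
    rank = 1 + min(abs(idx - BELIEFS.index(b)) for b in acceptable_range)
    return 1.0 / rank ** s

acceptable = ['Probably', 'Possibly', 'EvenChance']
print(inference_probability('Possibly', acceptable))          # inside the range: 1.0
print(inference_probability('VeryProbablyNot', acceptable))   # far outside: 0.25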
As mentioned above, explanatory extensions are generated to satisfy people’s
expectations about the relationship between the beliefs in the antecedents of infer-
ences and the belief in their consequent (i.e., they bring the consequent into the
acceptable range of an inference, or at least closer to this range). Thus, they in-
crease the belief probability of an interpretation at the expense of its structural
probability.
5 Evaluation
Our evaluation was designed to determine whether our approach to argument in-
terpretation by a Bayesian intelligence yields interpretations that are acceptable to
users. However, our target users are not those who constructed the argument, but
those who read the argument. Specifically, our evaluation determines whether peo-
ple reading someone else’s argument find bias’s highest-probability interpretations
acceptable (and better than other options).
9 The arguments were generated by project team members. We also conducted experiments
where people not associated with the project entered arguments, but interface problems affected
the evaluation (Section 6.4).
the same time, about half of the participants felt that the extended interpretations
were too verbose. This problem may be partially attributed to the presentation
of the nodes as direct renditions of their propositional content, which makes the
interpretations appear repetitive in style. The generation of stylistically diverse text
is the subject of active research in Natural Language Generation, e.g., [Gardent
and Kow 2005].
6 Discussion
This chapter offers a probabilistic approach to argument interpretation by a system
that uses a BN as its knowledge representation and reasoning formalism. An inter-
pretation of a user’s argument is represented as beliefs in the BN (suppositions) and
a Bayesian subnet (interpretation graph and explanatory extensions). Our evalu-
ations show that people found bias’s interpretations generally acceptable, and its
suppositions and explanatory extensions both necessary and reasonable.
Our approach casts the generation of an interpretation as a model selection task,
and employs an (almost) anytime algorithm to generate candidate interpretations.
Our model selection approach balances the probability of the model in light of
background knowledge against its data fit (similarity between the model and the
data). In other words, our formalism balances the cost of adding extra elements
to an interpretation (e.g., suppositions) against the benefits obtained from these
elements. The calculations that implement this idea are based on three main ele-
ments: (1) combinatoric principles for extracting an interpretation graph from the
domain BN, and an argument from an interpretation; (2) known distributions, such
as Poisson for the number of nodes in an interpretation graph or explanatory ex-
tension, and Zipf for modeling discrepancies in belief; and (3) manually-generated
distributions for suppositions and for preferences regarding different types of in-
ferences. The parameterization of these distributions requires specific information.
For instance, the mean of the Poisson distribution, which determines the “penalty”
for having too many nodes in an interpretation or explanatory extension, must be
empirically determined. Similarly, the hand-tailored distributions for supposition
configurations and explanatory extensions require experimental fine-tuning or user
studies to gather these probabilities.
The applicability of our approach is mainly affected by our assumption that the
nodes in the domain BN are binary. Other factors to be considered when applying
our formalism are: the characteristics of the domain, the expressive power of BNs
vis a vis human reasoning, and the ability of users to interact with the system.
rather than its absolute value, e.g., an increase from 6% to 10% in the probability
of a patient having cancer may require an explanatory extension, even though both
probabilities belong to the VeryProbablyNot belief category.
Acknowledgments
The author thanks her collaborators on the research described in this chapter:
Sarah George and Michael Niemann. This research was supported in part by grant
DP0878195 from the Australian Research Council (ARC) and by the ARC Centre
for Perceptive and Intelligent Machines in Complex Environments.
References
Dean, T. and M. Boddy (1988). An analysis of time-dependent planning. In
AAAI88 – Proceedings of the 7th National Conference on Artificial Intelli-
gence, St. Paul, Minnesota, pp. 49–54.
Elsaesser, C. (1987). Explanation of probabilistic inference for decision support
systems. In Proceedings of the AAAI-87 Workshop on Uncertainty in Artificial
Intelligence, Seattle, Washington, pp. 394–403.
Evans, J., J. Barston, and P. Pollard (1983). On the conflict between logic and
belief in syllogistic reasoning. Memory and Cognition 11, 295–306.
Gardent, C. and E. Kow (2005). Generating and selecting grammatical para-
phrases. In ENLG-05 – Proceedings of the 10th European Workshop on Nat-
ural Language Generation, Aberdeen, Scotland, pp. 49–57.
George, S., I. Zukerman, and M. Niemann (2007). Inferences, suppositions and
explanatory extensions in argument interpretation. User Modeling and User-
Adapted Interaction 17(5), 439–474.
Getoor, L., N. Friedman, D. Koller, and B. Taskar (2001). Learning probabilistic
models of relational structure. In Proceedings of the 18th International Con-
ference on Machine Learning, Williamstown, Massachusetts, pp. 170–177.
Horvitz, E., H. Suermondt, and G. Cooper (1989). Bounded conditioning: flexible
inference for decision under scarce resources. In UAI89 – Proceedings of the
1989 Workshop on Uncertainty in Artificial Intelligence, Windsor, Canada,
pp. 182–193.
Kashihara, A., T. Hirashima, and J. Toyoda (1995). A cognitive load application
in tutoring. User Modeling and User-Adapted Interaction 4 (4), 279–303.
Kintsch, W. (1994). Text comprehension, memory and learning. American Psy-
chologist 49 (4), 294–303.
Korb, K. B., R. McConachy, and I. Zukerman (1997). A cognitive model of ar-
gumentation. In Proceedings of the 19th Annual Conference of the Cognitive
Science Society, Stanford, California, pp. 400–405.
Lichtenstein, S., B. Fischhoff, and L. Phillips (1982). Calibrations of probabilities:
The state of the art to 1980. In D. Kahneman, P. Slovic, and A. Tversky
(Eds.), Judgment under Uncertainty: Heuristics and Biases, pp. 306–334.
Cambridge University Press.
Part III: Causality
17
Instrumental Sets
Carlos Brito
1 Introduction
The research of Judea Pearl in the area of causality has been very much acclaimed.
Here we highlight his contributions for the use of graphical languages to represent
and reason about causal knowledge.1
The concept of causation seems to be fundamental to our understanding of the
world. Philosophers like J. Carroll put it in these terms: "With regard to our total
conceptual apparatus, causation is the center of the center" [Carroll 1994]. Perhaps
more dramatically, David Hume states that causation together with resemblance
and contiguity are "the only ties of our thoughts, ... for us the cement of the
universe" [Hume 1978]. In view of these observations, the need for an adequate
language to talk about causation becomes clear and evident.
The use of graphical languages was present in the early times of causal modelling.
Already in 1934, Sewall Wright [Wright 1934] represented the causal relation among
several variables with diagrams formed by points and arrows (i.e., a directed graph),
and noted that the correlations observed between the variables could be associated
with the various paths between them in the diagram. From this observation he
obtained a method to estimate the strength of the causal connections known as
The Method of Path Coefficients, or simply Path Analysis.
With the development of the research in the field, the graphical representation
gave way to a mathematical language, in which causal relations are represented by
equations of the form Y = α + βX + e. This movement was probably motivated by an
increasing interest in the quantitative aspects of the model, or by the rigorous and
formal appearance offered by the mathematical language. However it may be, the
consequence was a progressive departure from our basic causal intuitions. Today
people ask whether such an equation represents a functional or a causal relation
[Reiss 2005]. Sewall Wright and Judea Pearl would presumably answer: "Causal,
of course!".
in the form of an acyclic causal diagram, which contains both arrows and bidirected
arcs [Pearl 1995; Pearl 2000a]. The arrows represent the potential existence of di-
rect causal relationships between the corresponding variables, and the bidirected
arcs represent spurious correlations due to unmeasured common causes. All inter-
actions among variables are assumed to be linear. Our task is to decide whether the
assumptions represented in the diagram are sufficient for assessing the strength of
causal effects from non-experimental data, and, if sufficiency is proven, to express
the target causal effect in terms of estimable quantities.
This decision problem has been tackled in the past half century, primarily by
econometricians and social scientists, under the rubric "The Identification Prob-
lem" [Fisher 1966]; it is still unsolved. Certain restricted classes of models are
nevertheless known to be identifiable, and these are often assumed by social scien-
tists as a matter of convenience or convention [Duncan 1975]. A hierarchy of three
such classes is given in [McDonald 1997]: (1) no bidirected arcs, (2) bidirected arcs
restricted to root variables, and (3) bidirected arcs restricted to variables that are
not connected through directed paths.
In a further development [Brito and Pearl 2002], we have shown that the identifi-
cation of the entire model is ensured if variables standing in direct causal relationship
(i.e., variables connected by arrows in the diagram) do not have correlated errors;
no restrictions need to be imposed on errors associated with indirect causes. This
class of models was called "bow-free", since their associated causal diagrams are
free of any "bow-pattern" [Pearl 2000a] (see Figure 1).
Most existing conditions for identification in general models are based on the
concept of Instrumental Variables (IV) [Pearl 2000b; Bowden and Turkington 1984].
IV methods take advantage of conditional independence relations implied by the
model to prove the identification of specific causal effects. When the model is not
rich in conditional independence relations, these methods are not informative. In
[Brito and Pearl 2002] we proposed a new graphical criterion for identification which
does not make direct use of conditional independence, and thus can be successfully
applied to models in which the IV method would fail.
The result presented in this paper is a generalization of the graphical version
4 Graph Background
DEFINITION 1.
2. A directed path is a path composed only by directed edges, all of them oriented
The idea is simple. If the path is closed, then it is naturally blocked by its colliders.
However, if a collider, or one of its descendants, belongs to Z, then it ceases to be
an obstruction. But if a non-collider of p belongs to Z, then the path is definitely
blocked.
DEFINITION 3. A set of nodes Z d-separates X and Y if Z simultaneously blocks
all the paths between X and Y . If Z is empty, then we simply say that X and Y
are d-separated.
The significance of this definition comes from a result showing that if X and Y
are d-separated by Z in the causal diagram of a linear model, then the variables X
and Y are conditionally independent given Z [Pearl 2000a]. It is this sort of result
that makes the connection between the mathematical and graphical languages, and
allows us to express our conditions for identification in graphical terms.
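To make this connection concrete, here is a small simulation sketch (my own illustration, not
part of the original text): in the chain Z → X → Y the variable X d-separates Z and Y, so in a
linear Gaussian model the partial correlation of Z and Y given X should vanish while their
marginal correlation does not. The coefficients 0.8 and 1.5 below are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Z = rng.normal(size=n)
X = 0.8 * Z + rng.normal(size=n)            # Z -> X
Y = 1.5 * X + rng.normal(size=n)            # X -> Y

def partial_corr(a, b, given):
    # correlation of a and b after linearly regressing out 'given'
    ra = a - np.polyval(np.polyfit(given, a, 1), given)
    rb = b - np.polyval(np.polyfit(given, b, 1), given)
    return np.corrcoef(ra, rb)[0, 1]

print(np.corrcoef(Z, Y)[0, 1])   # clearly non-zero: Z and Y are d-connected
print(partial_corr(Z, Y, X))     # approximately 0: X d-separates Z and Y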
DEFINITION 4. Let p1 , . . . , pn be unblocked paths connecting the variables Z1 , . . . , Zn
and the variables X1 , . . . , Xn , respectively. We say that the set of paths p1 , . . . , pn is
incompatible if we cannot rearrange their edges to form a different set of unblocked
paths p′1 , . . . , p′n between the same variables.
A set of disjoint paths (i.e., paths with no common nodes) is a simple example
of an incompatible set of paths.
IV-2. Z is independent of all error terms that have an influence on Y that is not
mediated by X.
The first condition simply states that there is a correlation between Z and X.
The second condition says that the only source of correlation between Z and Y
is due to a covariation between Z and X that subsequently affects Y through the
causal connection X → Y (with coefficient c).
If we can find a variable Z with these properties, then the causal effect of X on
Y is identified and given by c = σZY /σZX .
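The following short simulation (an illustration of the formula above; the particular structural
equations and coefficients are my own assumptions) shows the estimate c = σZY /σZX recovering
the causal effect where a naive regression of Y on X does not, because of an unmeasured common
cause U.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
U = rng.normal(size=n)                       # unmeasured common cause of X and Y
Z = rng.normal(size=n)                       # instrument: satisfies IV-1 and IV-2
X = 0.7 * Z + 1.0 * U + rng.normal(size=n)
c_true = 2.0
Y = c_true * X + 1.0 * U + rng.normal(size=n)

sigma_ZY = np.cov(Z, Y)[0, 1]
sigma_ZX = np.cov(Z, X)[0, 1]
print(sigma_ZY / sigma_ZX)                   # close to c_true = 2.0
print(np.cov(X, Y)[0, 1] / np.var(X))        # naive estimate, biased upward by U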
Using the notion of d-separation we can express the conditions (1) and (2) of
the IV method in graphical terms, thus obtaining a criterion for identification that
can be applied directly to the causal diagram of the model. Let G be the graph
representing the causal diagram of the model, and let Gc be the graph obtained
after removing the edge X → Y (with coefficient c) from G (see Figure 4). Then,
Z is an instrumental variable relative to X → Y if:
2. Z is d-separated from Y in Gc .
3. W d-separates Z from Y in Gc .
Y = c1 X1 + . . . + ck Xk + e
Figure 5. The causal diagram G of a linear model and the graph Ḡ.
Next, we develop our graphical intuition and obtain a graphical criterion for
identification that corresponds to the full version of the IV method.
Consider the model in Figure 5a. Here, the variables Z1 and Z2 do not qualify
as instrumental variables (or even conditional IVs) with respect to either X1 → Y
(coefficient c1 ) or X2 → Y (coefficient c2 ). But, following ideas similar to the
ones developed in the previous
sections, in Figure 5b we show the graph obtained by removing edges X1 → Y and
X2 → Y from the causal diagram. Observe that now both d-separation conditions
for an instrumental variable hold for Z1 and Z2 . This leads to the idea that Z1 and
Z2 could be used together as instruments to prove the identification of parameters
c1 and c2 . Indeed, next we give a graphical criterion that is sufficient to guarantee
the identification of a subset of parameters of the model.
Fix a variable Y , and consider the edges X1 → Y, . . . , Xk → Y , with coefficients
c1 , . . . , ck , in the causal diagram G of the model. Let Ḡ be the graph obtained
after removing the edges X1 → Y, . . . , Xk → Y from G. The variables Z1 , . . . , Zk
are instruments relative to X1 → Y, . . . , Xk → Y if
where the summation ranges over all unblocked paths p between X and Y , and each
term T (p) represents the contribution of the path p to the total correlation between
X and Y . The term T (p) is given by the product of the parameters of the edges
along the path p. We refer to Equation 3 as Wright’s equation for X and Y .
Wright’s method of path coefficients for identification consists in forming Wright’s
equations for each pair of variables in the model, and then solving for the parameters
in terms of the observed correlations. Whenever there is a unique solution for a
parameter c, this parameter is identified.
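As a small worked illustration (not from the original text), consider a model with Z → X
(coefficient a), X → Y (coefficient c) and a bidirected arc X ↔ Y (parameter ε), all variables
standardized. Wright's equations for the three pairs of variables are

σZX = a,        σZY = a · c,        σXY = c + ε.

The bow-pattern equation σXY = c + ε alone cannot separate c from ε, but the first two equations
give c = σZY /σZX (the instrumental variable formula discussed earlier), after which ε = σXY − c
is determined as well; hence all three parameters are identified.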
7 Proof of Theorem 1
7.1 Notation
Fix a variable Y in the model. Let X = {X1 , . . . , Xn } be the set of all non-
descendants of Y which are connected to Y by an edge. Define the following set of
edges incoming to Y :
Note that for some Xi ∈ X there may be more than one edge between Xi and Y
(one directed and one bidirected). Thus, |Inc(Y )| ≥ |X|. Let λ1 , . . . , λm , m ≥ k,
denote the parameters of the edges in Inc(Y ).
It follows that the edges X1 → Y, . . . , Xk → Y (with coefficients c1 , . . . , ck ) all
belong to Inc(Y ), because X1 , . . . , Xk are clearly non-descendants of Y . We assume
that λi = ci , for i = 1, . . . , k, while λk+1 , . . . , λm are the parameters of the remaining
edges of Inc(Y ).
Let Z be any non-descendant of Y . Wright’s equation for the pair (Z, Y ) is given
by:
(5) σZY = Σp T (p)
where each term T (p) corresponds to an unblocked path p between Z and Y . The
next lemma proves a property of such paths.
LEMMA 6. Any unblocked path between Y and one of its non-descendants Z must
include exactly one edge from Inc(Y ).
Lemma 6 allows us to write equation (5) as:
(6) σZY = a1 · λ1 + . . . + am · λm
The conclusion from Lemma 7 is that the expression for ρZi Y is a linear function
only of parameters λ1 , . . . , λk :
(8) ρZi Y = ai1 · λ1 + . . . + aik · λk
Our goal now is to show that Φ can be solved uniquely for the parameters λi , and
so prove the identification of λ1 , . . . , λk . The next lemma proves an important result in
this direction.
Let A denote the matrix of coefficients of Φ.
LEMMA 8. det(A) is a non-trivial polynomial in the parameters of the model.
Proof. The determinant of A is defined as the weighted sum, for all permutations
π of ⟨1, . . . , k⟩, of the product of the entries selected by π. Entry aij is selected by a
permutation π if the ith element of π is j. The weights are either 1 or -1, depending
on the parity of the permutation.
Now, observe that each diagonal entry aii is a sum of terms associated with
unblocked paths between Zi and Xi . Since pi is one such path, we can write
aii = T (pi ) + âii . From this, it is easy to see that the term
(10) T ∗ = T (p1 ) · T (p2 ) · . . . · T (pk )
appears in the product of the permutation π = ⟨1, . . . , k⟩, which selects all the diagonal
entries of A.
We prove that det(A) does not vanish by showing that T ∗ is not cancelled out
by any other term in the expression for det(A).
Let τ be any other term appearing in the summation that defines the determinant
of A. This term appears in the product of some permutation π, and has as factors
exactly one term from each entry aij selected by π. Thus, associated with each such factor
there is an unblocked path between Zi and Xj . Let p′1 , . . . , p′k be the unblocked paths
associated with the factors of τ .
We conclude the proof observing that, since p1 , . . . , pk is an incompatible set,
its edges cannot be rearranged to form a different set of unblocked paths between
the same variables, and so τ ≠ T ∗ . Hence, the term T ∗ is not cancelled out in the
summation, and the expression for det(A) does not vanish. □
7.4 Identification of λ1 , . . . , λk
Lemma 8 gives that det(A) is a non-trivial polynomial in the parameters of the
model. Thus, det(A) vanishes only on the roots of this polynomial. However,
[Okamoto 1973] has shown that the set of roots of a non-trivial polynomial has
Lebesgue measure zero. Thus, the system Φ has a unique solution almost everywhere.
It just remains to show that we can estimate the entries of the matrix of coeffi-
cients A from the data. But this is implied by the following observation.
Once again, coefficient aij is given by a sum of terms associated with unblocked
paths between Zi and Xj . But, in principle, not every unblocked path between Zi
and Xj contributes a term to the sum; only those which can be extended
by the edge Xj → Y to form an unblocked path between Zi and Y . However, since
the edge Xj → Y does not point to Xj , every unblocked path between Zi and Xj
can be extended by the edge Xj → Y without creating a collider. Hence, the terms
of all unblocked paths between Zi and Xj appear in the expression for aij , and by
the method of path coefficients, we have aij = ρZi Xj .
We conclude that each entry of matrix A can be estimated from data, and we
can solve the system of linear equations Φ to obtain the parameters λ1 , . . . , λk .
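A numerical sketch of this procedure follows (the model, the coefficients and the use of covariances
rather than correlations are my own illustrative choices; with standardized variables the same system
written in terms of correlations recovers the standardized path coefficients). Two instruments Z1 and
Z2 are used jointly to identify c1 and c2, even though Z2 alone is not a valid instrument for either
coefficient.

import numpy as np

rng = np.random.default_rng(2)
n = 500_000
Z1, Z2 = rng.normal(size=n), rng.normal(size=n)
U1, U2 = rng.normal(size=n), rng.normal(size=n)      # unobserved common causes
X1 = 0.9 * Z1 + 0.5 * Z2 + U1 + rng.normal(size=n)
X2 = 0.6 * Z2 + U2 + rng.normal(size=n)
c1, c2 = 2.0, -1.0
Y = c1 * X1 + c2 * X2 + U1 + U2 + rng.normal(size=n)

# System analogous to Phi: entries cov(Zi, Xj), right-hand side cov(Zi, Y).
A = np.array([[np.cov(Z, X)[0, 1] for X in (X1, X2)] for Z in (Z1, Z2)])
b = np.array([np.cov(Z, Y)[0, 1] for Z in (Z1, Z2)])
print(np.linalg.solve(A, b))                 # approximately [ 2.0, -1.0 ]
# Z2 used alone in the single-IV formula fails here (Z2 also influences X1):
print(np.cov(Z2, Y)[0, 1] / np.cov(Z2, X2)[0, 1])    # far from c2 = -1.0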
References
Bollen, K. (1989). Structural Equations with Latent Variables. John Wiley, New
York.
Bowden, R. and D. Turkington (1984). Instrumental Variables. Cambridge Univ.
Press.
Brito, C. and J. Pearl (2002). A graphical criterion for the identification of
causal effects in linear models. In Proc. of the AAAI Conference, Edmon-
ton, Canada.
Carroll, J. (1994). Laws of Nature. Cambridge University Press.
18
Seeing and Doing
PHILIP DAWID
1 Introduction
It is relatively recently that much attention has focused on what, for want of a better term,
we might call “statistical causality”, and the subject has developed in a somewhat haphaz-
ard way, without a very clear logical basis. There is in fact a variety of current conceptions
and approaches [Campaner and Galavotti 2007; Hitchcock 2007; Galavotti 2008]—here we
shall distinguish in particular agency, graphical, probabilistic and modular conceptions of
causality—that tend to be mixed together in an informal and half-baked way, based on
“definitions” that often do not withstand detailed scrutiny. In this article I try to unpick this
tangle and expose the various different strands that contribute to it. Related points, with a
somewhat different emphasis, are made in a companion paper [Dawid 2009].
The approach of Judea Pearl [2009] cuts through this Gordian knot like the sword of
Alexander. Whereas other conceptions of causality may be philosophically questionable,
definitionally unclear, pragmatically unhelpful, theoretically skimpy, or simply confused,
Pearl’s theory is none of these. It provides a valuable framework, founded on a rich and
fruitful formal theory, by means of which causal assumptions about the world can be mean-
ingfully represented, and their implications developed. Here we will examine both the rela-
tionships of Pearl’s theory with the other conceptions considered, and its differences from
them. We extract the essence of Pearl’s approach as an assumption of “modularity”, the
transferability of certain probabilistic properties between observational and interventional
regimes: so, in particular, forging a synthesis between the very different activities of “see-
ing” and “doing”. And we describe a generalisation of this framework that releases it from
any necessary connexion to graphical models.
The plan of the paper is as follows. In § 2, I describe the agency, graphical and proba-
bilistic conceptions of causality, and their connexions and distinctions. Section 3 introduces
Pearl’s approach, showing its connexions with, and differences from, the other theories.
Finally, in § 4, I present the generalisation of that approach, emphasising the modularity
assumptions that underlie it, and the usefulness of the theory of “extended conditional in-
dependence” for describing and manipulating these.
Disclaimer I have argued elsewhere [Dawid 2000, 2007a, 2010] that it is important to dis-
tinguish arguments about “Effects of Causes” (EoC, otherwise termed “type”, or “generic”
causality”), from those about “Causes of Effects” (CoE, also termed “token”, or “indi-
vidual” causality); and that these demand different formal frameworks and analyses. My
concern here will be entirely focused on problems of generic causality, EoC. A number of
the current frameworks for statistical causality, such as Rubin’s “potential response mod-
els” [Rubin 1974, 1978], or Pearl’s “probabilistic causal models” [Pearl 2009, Chapter 7],
are more especially suited for handling CoE type problems, and will not be discussed fur-
ther here. There are also numerous other conceptions of causality, such as mechanistic
causality [Salmon 1984; Dowe 2000], that I shall not be considering here.
“X has no effect on Y ”1
as holding whenever, considering regimes that manipulate only X, the resulting value of Y
(or some suitable codification of uncertainty about Y , such as its probability distribution)
does not depend on the value x assigned to X. When this fails, X has an effect on Y ; we
might then go on to quantify this dependence in various ways.
We could likewise interpret
as the property that, considering regimes where we manipulate both W and X, when we
manipulate W to some value w and X to some value x, the ensuing value (or uncertainty)
for Y will depend only on w, and not further on x.
Now suppose that, explicitly or implicitly, we restrict consideration to some collection
V of manipulable variables. Then we might interpret the statement
1 Just as “zero” is fundamental to arithmetic and “independence” is fundamental to probability, so the concept
(where V might be left unmentioned, but must be clearly understood) as the negation of
“X has no direct effect on Y , after controlling for V \ {X, Y }”.2
It is important to bear in mind that all these assertions relate to properties of the real
world under the various regimes considered: in particular, they can not be given purely
mathematical definitions. And in real world problems there are typically various ways of
manipulating variables, so we must be very clear as to exactly what is intended.
EXAMPLE 1. Ideal gas law
Consider the “ideal gas law”:
(1) P V = kN T
where P is the absolute pressure of the gas, V is its volume, N is the number of molecules
of gas present, k is Boltzmann’s constant, and T is the absolute temperature. For our
current purposes this will be supposed to be universally valid, no matter how the values of
the variables in (1) may have come to arise.
Taking a fixed quantity N of gas in an impermeable container, we might consider inter-
ventions on any of P , V and T . (Note however that, because of the constraint (1), we can
not simultaneously and arbitrarily manipulate all three variables.)
An intervention that sets V to v and T to t will lead to the unique value p = kN t/v for
P . Because this depends on both v and t, we can say that there is a direct effect of each of
V and T on P (relative to V = {V, P, T }). Similarly, P has a direct effect on each of V
and T .
What if we wish to quantify, say, “the causal effect of V on P ”? Any attempt to do
this must take account of the fact that the problem requires additional specification to be
well-defined. Suppose the volume of the container can be altered by applying a force to
a piston. Initially the gas has V = v0 , P = p0 , T = t0 . We wish to manipulate V to a
new value v1 . If we do this isothermally, i.e. by sufficiently slow movement of the piston
that, through flow of heat through the walls of the container, the temperature of the gas
always remains the same as that of the surrounding heat bath, we will end up with V = v1 ,
P = p1 = v0 p0 /v1 , T = t1 = t0 . But if we move the piston adiabatically, i.e. so fast that
no heat can pass through the walls of the container, the relevant law is P V γ = constant,
where γ = 5/3 for a monatomic gas. Then we get V = v1 , P = p∗1 = p0 (v0 /v1 )γ ,
T = t∗1 = p∗1 v1 /kN .
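A few lines of arithmetic (my own illustration, with assumed initial values) make the contrast
between the two manipulations concrete:

k = 1.380649e-23          # Boltzmann's constant (J/K)
N = 6.022e23              # number of molecules (roughly one mole)
v0, t0 = 0.0240, 293.0    # assumed initial volume (m^3) and temperature (K)
p0 = k * N * t0 / v0      # initial pressure from PV = kNT (about 1 atm)

v1 = v0 / 2               # intervene: halve the volume
gamma = 5.0 / 3.0         # monatomic gas

p1_isothermal = p0 * v0 / v1                 # T held at t0, Boyle's law
p1_adiabatic = p0 * (v0 / v1) ** gamma       # P V^gamma constant
t1_adiabatic = p1_adiabatic * v1 / (k * N)   # T adjusts via PV = kNT

print(p1_isothermal, p1_adiabatic, t1_adiabatic)
# The same intervention "set V to v0/2" yields different values of P (and T)
# depending on how it is carried out, so "the causal effect of V on P" is not
# well defined without that extra specification.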
D over a collection V ⊆ A. These ingredients are “known to Nature”, though not neces-
sarily to us: D is “Nature’s DAG”. Given such a causal DAG D, for X, Y ∈ V we interpret
“X is a direct cause of Y ” as synonymous with “X is a parent of Y in D”, and similarly
equate “cause” with “ancestor in D”. One can also use the causal DAG to introduce further
graphically defined causal terms, such as “causal chain”, “intermediate variable”, . . .
The concepts of causal ambit and causal DAG might be regarded as primitive notions,
or attempts might be made to define them in terms of pre-existing understandings of causal
concepts. In either case, it would be good to have criteria to distinguish a putative causal
ambit from a non-causal ambit, and a causal DAG from a non-causal DAG.
For example, we typically read [Hernán and Robins 2006]:
“A causal DAG D is a DAG in which:
(i). the lack of an arrow from Vj to Vm can be interpreted as the absence of a direct causal effect of Vj
on Vm (relative to the other variables on the graph)
(ii). all common causes, even if unmeasured, of any pair of variables on the graph are themselves on the
graph.”4
If we start with a DAG D over V that we accept as being a causal DAG, and interpret
“direct cause” etc. in terms of that, then conditions (i) and (ii) will be satisfied by definition.
However, this begs the question of how we are to tell a causal from a non-causal DAG.
More constructively, suppose we start with a prior understanding of the term “direct
cause” (relative to V)—for example, though by no means necessarily,5 based on the agency
interpretation described in § 2.1 above. It appears that we could then use the above defini-
tion to check whether a proposed DAG D is indeed “causal”. But while this is essentially
straightforward so far as condition (i) is concerned (except that there is no obvious rea-
son to require a DAG representation), interpretation and implementation of condition (ii)
is more problematic. First, what is a “common cause”? Spirtes et al. [2000, p. 44] say
that a variable X is a common cause of variables Y and Z if and only if X is both a direct
cause of Y and a direct cause of Z — but in each case relative to the set {X, Y, Z}, so
that this definition is not dependent on the causal ambit V. Neapolitan [2003, p. 57] has
a different interpretation, which apparently is relative to an essentially arbitrary set V —
but then states that problems can arise when at least one common cause is not in V, a
possibility that seems to be precluded by his definition.
As another attempt at clarification, Spirtes and Scheines [2004] require “that the set
of variables in the causal graph be causally sufficient, i.e. if V is the set of variables in
the causal graph, that there is no variable L not in V that is a direct cause (relative to
V ∪ {L}) of two variables in V”. If “L ∉ V is not a direct cause of V ∈ V” is interpreted
in agency terms, it would mean that V would not respond to manipulations of L, when
holding fixed all the other variables in V. But whatever the interpretation of direct cause,
such a “definition” of causal sufficiency is ineffective when the range of possible choices
4 The motivation for this requirement is not immediately obvious, but is related to the defensibility of the causal
for the additional variable L is entirely unrestricted—for then how could we ever be sure
that it holds, without conducting an infinite search over all unmentioned variables L? That
is why we posit an appropriate clearly-defined “causal ambit” A: we can then restrict the
search to L ∈ A.
It seems to me that we should, realistically, allow that “causality” can operate, in parallel,
at several different levels of granularity. Thus while it may or may not be possible to
describe the medical effects of aspirin treatment in terms of quantum theory, even if we
could, it would be a category error to try and do so in the context of a clinical trial. So there
may be various different causal descriptions of the world, all operating at different levels,
each with its associated causal ambit A of variables and various causal DAGs D over sets
V ⊆ A. The meaning of any causal terms used should then be understood in relation to the
appropriate level of description.
The obvious questions to ask about graphical causality, which are however not at all easy
to answer, are: “When can a collection A of variables be regarded as a causal ambit?”, and
“When can a DAG be regarded as a causal DAG?”.
In summary, so long as we start with a DAG D over V that we are willing to accept as a
causal DAG (taken as a primitive concept), we can take V itself as our causal ambit, and use
the structure of D to define causal terms. Without having a prior primitive notion of what
constitutes a “causal DAG”, however, conditions such as (i) and (ii) are unsatisfactory as a
definition. At the very least, they require that we have specified (but how?) an appropriate
causal ambit A, relevant to our desired level of description, and have a clear pre-existing
understanding (i.e. not based on the structure of D, since that would be logically circular)
of the terms “direct causal effect”, “common cause” (perhaps relative to a set V).
Agency causality and graphical causality
It is tempting to use the agency theory as a basis for such prior causal understanding. How-
ever, graphical causality does not really sit well with agency causality. For, as seen clearly
in Example 1, in the agency interpretation it is perfectly possible for two variables each to
have a direct effect on the other—which could not hold under any DAG representation.
Similarly [Halpern and Pearl 2005; Hall 2000] there is no obvious reason to expect agency
causality to be a transitive relation, which would again be a requirement under the graphical
conception. For better or worse, the agency theory does not currently seem to be endowed
with a sufficiently rich axiomatic structure to guide manipulations of its causal properties;
and however such a general axiomatic structure might look, it would seem unduly restric-
tive to relate it closely to DAG models.
(ii). the joint probability distribution P of the variables in V is Markov over D, i.e. its
probabilistic conditional independence (CI) properties are represented by the same
DAG D, according to the “d-separation” semantics described by Pearl [1986], Verma
and Pearl [1990], Lauritzen et al. [1990].
In particular, from (ii), for any V ∈ V, V is independent of its non-descendants, nd(V ), in
D, given its parents, pa(V ), in D. Given the further interpretation (i) of D as a causal DAG,
this can be expressed as “V is independent of its non-effects, given its direct causes in V”—
the so-called causal Markov assumption. Also, (ii) implies that, for any sets of variables X
and Y in D, X ⊥⊥ Y | an(X) ∩ an(Y ) (where an(X) denotes the set of ancestors of X in
D, including X itself): again with D interpreted as causal, this can be read as saying “X and
Y are conditionally independent, given their common causes in V”. In particular, marginal
independence (where X ⊥⊥ Y is represented in D) holds if and only if an(X) ∩ an(Y ) = ∅,
i.e. (using (i)) “X and Y have no common cause” (including each other) in V; in the
“if” direction, this has been termed the weak causal Markov assumption [Scheines and
Spirtes 2008]. Many workers regard the causal and weak causal Markov assumptions as
compelling—but this must depend on making the “right” choice for V (essentially, through
appropriate delineation of the causal ambit).
Note that this conception of causality involves, simultaneously, two very different ways
of interpreting the DAG D (see Dawid [2009] for more on this). The d-separation seman-
tics by means of which we relate D to conditional independence properties of the joint
distribution P , while clearly defined, are somewhat subtle: in particular, the arrows in D
are somewhat incidental “construction lines”, that only play a small rôle in the semantics.
But as soon as we also give D an interpretation as a “causal DAG” we are into a completely
different way of interpreting it, where the arrows themselves are regarded as directly car-
rying causal meaning. Probabilistic causality can thus be thought of as the progeny of a
shotgun wedding between two ill-matched parties.
Causal discovery
The enterprise of Causal Discovery [Spirtes et al. 2000; Glymour and Cooper 1999;
Neapolitan 2003] is grounded in this probabilistic-cum-graphical conception of causality.
There are many variations, but all share the same basic philosophy. Essentially, one anal-
yses observational data in an attempt to identify conditional independencies (possibly in-
volving unobserved variables) in the distribution from which they arise. Some of these
might be discarded as “accidental” (perhaps because they are inconsistent with an a priori
causal order); those that remain might be represented by a DAG. The hope is that this dis-
covered conditional independence DAG can also be interpreted as a causal DAG. When,
as is often the case, there are several Markov equivalent DAG representations of the dis-
covered CI relationships, which, moreover, cannot be causally distinguished on a priori
grounds (e.g. in terms of an assumed causal order), this hope can not be fully realised; but
if we can assume that one of these, at least, is a causal DAG, then at least an arrow common
to all of them can be interpreted causally.
Pearl [2009], starting with Chapter 7), he has moved to an interpretation of DAG models based on deterministic
functional relationships, with stochasticity deriving solely from unobserved exogenous variables. That interpre-
tation does however imply all the properties of the stochastic theory, and can be regarded as a specialisation of it.
We shall not here be considering any features (such as the possibility of counterfactual analysis) dependent on the
additional structure of Pearl’s deterministic approach, since these only become relevant when analysing “causes
of effects”—see Dawid [2000, 2002] for more on this.
7 We have already remarked that probabilistic causality is itself the issue of an uneasy alliance between two
quite different ways of interpreting graphs. Further miscegenation with the agency conception of causality looks
like a eugenically risky endeavour!
8 For this to be effective, the variables in V should have clearly-defined meanings and be observable in the
real-world. Some Pearlian models incorporate unobservable latent variables without clearly identified external
referents, in which case only the implications of such a model for the behaviour of observables can be put to
empirical test.
model thus has the great virtue, all too rare in treatments of causality, of being totally clear
and explicit about what is being said—allowing one to accord it, in a principled way, ac-
ceptance or rejection, as deemed appropriate, in any given application. And when a system
can indeed be described by a Pearlian DAG, it is straightforward to learn (not merely qual-
itatively, but quantitatively too), from purely observational data, about the (probabilistic)
effects of any interventions on variables in the system.
3.1 Justification
The falsifiability of the property of being a Pearlian DAG (unlike, for example, the some-
what ill-defined property of being a “causal DAG”) is at once a great strength of the the-
ory (especially for those with a penchant for Karl Popper’s “falsificationist” Philosophy
of Science), and something of an Achilles’ heel. For all too often it will be impossible,
for a variety of pragmatic, ethical or financial reasons, to conduct the experiments that
would be needed to falsify the Pearlian assumptions. A lazy reaction might then simply
be to assume that a DAG found, perhaps by “causal discovery”, to represent observational
conditional independencies, but without any interventions having been applied, is indeed
Pearlian—and so also describes what would happen under interventions. While this may
well be an interesting working hypothesis to guide further experimental investigations, it
would be an illogical and dangerous point at which to conclude our studies. In particular,
further experimental investigations could well result in rejection of our assumed Pearlian
model.
Nevertheless, if forced to make a tentative judgment on the Pearlian nature, or other-
wise, of a putative DAG model9 of a system, there are a number of more or less reasonable,
more or less intuitive, arguments that can be brought to bear. As a very simple example, we
would immediately reject any putative “Pearlian DAG” in which an arrow goes backwards
in time,10 or otherwise conflicts with an accepted causal order. As another, if an “obser-
vational” regime itself involves an imposed physical randomisation to generate the value
of some variable X, in a way that might possibly take account of variables Z temporally
prior to X, we might reasonably regard the conditional distribution of some later variable
Y , given X and Z, as a modular component, that would be the same in a regime that in-
tervenes to set the value of X as it is in the (observational) randomisation regime.11 Such
arguments can be further extended to “natural experiments”, where it is Nature that im-
posed the external randomisation. This is the case for “Mendelian randomisation” [Didelez
and Sheehan 2007], which capitalises on the random assortment of genes under Mendelian
genetics. Other natural experiments rely on other causal assumptions about Nature: thus
the “discontinuity design” [Trochim 1984] assumes that Nature supplies continuous dose-
response cause-effect relationships. But all such justifications are, and must be, based on
(what we think are) properties of the real world, and not solely on the internal structure of
9 Assumed, for the sake of non-triviality, already to be a Markov model of its observational probabilistic
properties.
10 Assuming, as most would accept, that an intervention in a variable at some time can not affect any variable
the putative Pearlian DAG. In particular, they are founded on pre-existing ideas we have
about causal and non-causal processes in the world, even though these ideas may remain
unformalised and woolly: the important point is that we have enough, perhaps tacit, shared
understanding of such processes to convince both ourselves and others that they can serve as
external justification for a suggested Pearlian model. Unless we have sufficient justification
of this kind, all the beautiful analysis (e.g. in Pearl [2009]) that develops the implications
of a Pearlian model will be simply irrelevant. To echo Cartwright [1994, Chapter 2], “No
causes in, no causes out”.
(2)  Y ⊥⊥ F | X
can be interpreted as asserting that the conditional distribution of Y , for specified regime
F = f and given observed value X = x, depends only on x and not further on the
regime f that is operating: in terms of densities we could write p(y | f, x) = p(y |
x). If F had been a stochastic variable this would be entirely equivalent to stochastic
conditional independence of Y and F given X; but it remains meaningful, with the above
interpretation, even when F is a non-stochastic regime indicator. Indeed, it asserts exactly
the modular nature of the conditional distribution p(y | x), as being the same across all the
regimes indicated by values of F . Such modularity properties, when expressed in terms of
ECI, can be formally manipulated—and, in those special cases where this is possible and
appropriate, represented and manipulated graphically—in essentially the same fashion as
for regular probabilistic conditional independence.
For applications of ECI to causal inference, we would typically want one or more of the
regimes indicated by F to represent the behaviour of the system when subjected to an inter-
vention of a specified kind—thus linking up nicely with the agency interpretation; and one
12 More generally, we could usefully identify features of the different regimes other than conditional
regime to describe the undisturbed system on which observations are made—thus allowing
the possibility of “causal inference” and making links with probabilistic causality, but in a
non-graphical setting. Modularity/ECI assumptions can now be introduced, as considered
appropriate, and their implications extracted by algebraic or graphical manipulations, using
the established theory of conditional independence. We emphasise that, although the nota-
tion and technical machinery of conditional independence is being used here, this is applied
in a way that is very different from the approach of probabilistic causality: no assumptions
need be made connecting causal relationships with ordinary probabilistic conditional inde-
pendence.
Because it concerns the probabilistic behaviour of a system under interventions—a par-
ticular interpretation of agency causality—this general approach can be termed “decision-
theoretic” causality. With the emphasis now on modularity, intuitive or graphically mo-
tivated causal terms such as “direct effect” or “causal pathway” are best dispensed with
(and with them such assumptions as the causal Markov property). The decision-theoretic
approach should not be regarded as providing a philosophical foundation for “causality”,
or even as a way of interpreting causal terms, but rather as very useful machinery for ex-
pressing and manipulating whatever modularity assertions one might regard as appropriate
in a given problem.
conditional distribution of X given pa(X) is the same, no matter how the other variables
are set (or left idle).
(U, Z) ⊥⊥ FX (3)
U ⊥⊥ Z | FX (4)
Y ⊥⊥ FX | (X, U ) (5)
Y ⊥⊥ Z | (X, U ; FX ) (6)
X ⊥̸⊥ Z | FX = ∅ (7)
Property (3) is to be interpreted as saying that the joint distribution of (U, Z) is independent
of the regime FX : i.e., it is the same in all three regimes. That is to say, it is entirely
unaffected by whether, and if so how, we intervene to set the value of X. The identity of
this joint distribution across the two interventional regimes, FX = 0 and FX = 1, can be
interpreted as expressing a causal property: manipulating X has no (probabilistic) effect
13 In addition to these core conditions, precise identification of a causal effect by means of an instrumental
variable requires further modelling assumptions, such as linear regressions [Didelez and Sheehan 2007].
on the pair of variables (U, Z). Moreover, since this common joint distribution is also
supposed the same in the idle regime, FX = ∅, we could in principle use observational
data to estimate it—thus opening up the possibility of causal inference.
Property (4) asserts that, in their (common) joint distribution in any regime, U and Z are
independent (this however is a purely probabilistic, not a causal, property).
Property (5) says that the conditional distribution of Y given (X, U ) is the same in both
interventional regimes, as well as in the observational regime, and can thus be considered
as a modular component, fully transferable between the three regimes—again, I regard this
as expressing a causal property.
Property (6) asserts that this common conditional distribution is unaffected by further
conditioning on Z (not in itself a causal property).
Finally, property (7) requires that Z be genuinely associated with X in the observational
regime.
Of course, these ECI properties should not simply be assumed without some attempt at
justification: for example, Mendelian randomisation attempts this in the case that Z is an
inherited gene. But because we have no need to consider interventions at any node other
than X, less by way of justification is required than if we were to do so.
Once expressed in terms of ECI, these core conditions can be manipulated algebraically
using the general theory of conditional independence [Dawid 1979]. Depending on what
further modelling assumptions are made, it may then be possible to identify, or to bound,
the desired causal effect in terms of properties of the observational joint distribution of
(X, Y, Z) [Dawid 2007b, Chapter 11].
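To make this concrete, the following simulation sketch (the structural equations and numbers are my
own assumptions, and, as noted above, point identification relies on an additional constant-additive-
effect assumption) generates data under the idle regime FX = ∅ and under the two interventional
regimes FX = 0 and FX = 1; the core conditions (3)–(7) hold by construction, and the interventional
contrast in Y is recovered from observational data alone by the usual instrumental-variable ratio.

import numpy as np

rng = np.random.default_rng(3)
n = 500_000

def simulate(regime):
    # regime: "idle" (observational), or 0 / 1 (interventions FX = x)
    U = rng.normal(size=n)                   # unobserved; same law in all regimes
    Z = rng.binomial(1, 0.5, size=n)         # instrument; independent of U
    if regime == "idle":
        X = (0.8 * Z + 1.0 * U + rng.normal(size=n) > 0.5).astype(float)
    else:
        X = np.full(n, float(regime))        # X set by intervention, ignoring Z, U
    Y = 2.0 * X + 1.5 * U + rng.normal(size=n)   # modular component p(y | x, u)
    return Z, X, Y

Z, X, Y = simulate("idle")
_, _, Y1 = simulate(1)
_, _, Y0 = simulate(0)
print(Y1.mean() - Y0.mean())                 # interventional contrast, about 2.0
wald = ((Y[Z == 1].mean() - Y[Z == 0].mean()) /
        (X[Z == 1].mean() - X[Z == 0].mean()))
print(wald)                                  # recovered from the idle regime alone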
In this particular case, although the required ECI conditions are expressed without ref-
erence to any graphical representation, it is possible (though not obligatory!) to give them
one. This is shown in Figure 1. Properties (3)–(6) can be read off this DAG directly using
the standard d-separation semantics. (Property (7) is only represented under a further as-
sumption that the graphical representation is faithful.) We term such a DAG an augmented
DAG: it differs from a Pearlian DAG in that some, but not necessarily all, variables have
associated intervention indicators.
Just as for regular CI, it is possible for a collection of ECI properties, constituting a
regimes we are not able to observe. This would be required for example when we want to consider the effect of
setting the value of a continuous “dose” variable. At this very general level we can even dispense entirely with
the assumption of modular conditional distributions [Duvenaud et al. 2009].
the χ2-test) for conditional independence. Such discovered ECI properties (whether or not
they can be expressed graphically) can then be used to model the “causal structure” of the
problem.
5 Conclusion
Over many years, Judea Pearl's original and insightful approaches to understanding uncer-
tainty and causality have had an enormous influence on these fields. They have certainly
had a major influence on my own research directions: I have often—as evidenced by this
paper—found myself following in his footsteps, picking up a few crumbs here and there
for further digestion.
Pearl’s ideas do not however exist in a vacuum, and I believe it is valuable both to relate
them to their precursors and to assess the ways in which they may develop. In attempting
this task I fully acknowledge the leadership of a peerless researcher, whom I feel honoured
to count as a friend.
References
Campaner, R. and M. C. Galavotti (2007). Plurality in causality. In P. K. Machamer
and G. Wolters (Eds.), Thinking About Causes: From Greek Philosophy to Modern
Physics, pp. 178–199. Pittsburgh: University of Pittsburgh Press.
Cartwright, N. (1994). Nature’s Capacities and Their Measurement. Oxford: Clarendon
Press.
Dawid, A. P. (1979). Conditional independence in statistical theory (with Discussion).
Journal of the Royal Statistical Society, Series B 41, 1–31.
Dawid, A. P. (2000). Causal inference without counterfactuals (with Discussion). Jour-
nal of the American Statistical Association 95, 407–448.
Dawid, A. P. (2002). Influence diagrams for causal modelling and inference. Interna-
tional Statistical Review 70, 161–189. Corrigenda, ibid., 437.
Dawid, A. P. (2007a). Counterfactuals, hypotheticals and potential responses: A philo-
sophical examination of statistical causality. In F. Russo and J. Williamson (Eds.),
Causality and Probability in the Sciences, Volume 5 of Texts in Philosophy, pp.
503–32. London: College Publications.
Dawid, A. P. (2007b). Fundamentals of statistical causality. Research Re-
port 279, Department of Statistical Science, University College London.
<http://www.ucl.ac.uk/Stats/research/reports/psfiles/rr279.pdf>
Salmon, W. C. (1984). Scientific Explanation and the Causal Structure of the World.
Princeton: Princeton University Press.
Scheines, R. and P. Spirtes (2008). Causal structure search: Philosophical foundations
and future problems. Paper presented at NIPS 2008 Workshop “Causality: Objec-
tives and Assessment”, Whistler, Canada.
Spirtes, P., C. Glymour, and R. Scheines (2000). Causation, Prediction and Search (Sec-
ond ed.). New York: Springer-Verlag.
Spirtes, P. and R. Scheines (2004). Causal inference of ambiguous manipulations. Phi-
losophy of Science 71, 833–845.
Spohn, W. (1976). Grundlagen der Entscheidungstheorie. Ph.D. thesis, University of
Munich. (Published: Kronberg/Ts.: Scriptor, 1978).
Spohn, W. (2001). Bayesian nets are all there is to causal dependence. In M. C. Galavotti,
P. Suppes, and D. Costantini (Eds.), Stochastic Dependence and Causality, Chap-
ter 9, pp. 157–172. Chicago: University of Chicago Press.
Suppes, P. (1970). A Probabilistic Theory of Causality. Amsterdam: North Holland.
Trochim, W. M. K. (1984). Research Design for Program Evaluation: The Regression-
Discontinuity Approach. SAGE Publications.
Verma, T. and J. Pearl (1990). Causal networks: Semantics and expressiveness. In R. D.
Shachter, T. S. Levitt, L. N. Kanal, and J. F. Lemmer (Eds.), Uncertainty in Artificial
Intelligence 4, Amsterdam, pp. 69–76. North-Holland.
Woodward, J. (2003). Making Things Happen: A Theory of Causal Explanation. Ox-
ford: Oxford University Press.
19
Effect Heterogeneity and Bias in
Main-Effects-Only Regression Models
FELIX ELWERT AND CHRISTOPHER WINSHIP
1 Introduction
The overwhelming majority of OLS regression models estimated in the social sciences,
and in sociology in particular, enter all independent variables as main effects. Few re-
gression models contain many, if any, interaction terms. Most social scientists would
probably agree that the assumption of constant effects that is embedded in main-effects-
only regression models is theoretically implausible. Instead, they would maintain that
regression effects are historically and contextually contingent; that effects vary across
individuals, between groups, over time, and across space. In other words, social scien-
tists doubt constant effects and believe in effect heterogeneity.
But why, if social scientists believe in effect heterogeneity, are they willing to substan-
tively interpret main-effects-only regression models? The answer—not that it’s been
discussed explicitly—lies in the implicit assumption that the main-effects coefficients in
linear regression represent straightforward averages of heterogeneous individual-level
causal effects.
The belief in the averaging property of linear regression has previously been chal-
lenged. Angrist [1998] investigated OLS regression models that were correctly specified
in all conventional respects except that effect heterogeneity in the main treatment of in-
terest remained unmodeled. Angrist showed that the regression coefficient for this
treatment variable gives a rather peculiar type of average—a conditional variance
weighted average of the heterogeneous individual-level treatment effects in the sample. If
the weights differ greatly across sample members, the coefficient on the treatment vari-
able in an otherwise well-specified model may differ considerably from the arithmetic
mean of the individual-level effects among sample members.
In this paper, we raise a new concern about main-effects-only regression models.
Instead of considering models in which heterogeneity remains unmodeled in only one
effect, we consider standard linear path models in which unmodeled heterogeneity is
potentially pervasive.
Using simple examples, we show that unmodeled effect heterogeneity in more than one
structural parameter may mask confounding and selection bias, and thus lead to biased
estimates. In our simulations, this heterogeneity is indexed by latent (unobserved) group
membership. We believe that this setup represents a fairly realistic scenario—one in
which the analyst has no choice but to resort to a main-effects-only regression model
because she cannot include the desired interaction terms since group-membership is un-
observed. Drawing on Judea Pearl’s theory of directed acyclic graphs (DAG) [1995,
2009] and VanderWeele and Robins [2007], we then show that the specific biases we
report can be predicted from an analysis of the appropriate DAG. This paper is intended
as a serious warning to applied regression modelers to beware of unmodeled effect het-
erogeneity, as it may lead to gross misinterpretation of conventional path models.
We start with a brief discussion of conventional attitudes toward effect heterogeneity in
the social sciences and in sociology in particular, formalize the notion of effect heteroge-
neity, and briefly review results of related work. In the core sections of the paper, we use
simulations to demonstrate the failure of main-effects-only regression models to recover
average causal effects in certain very basic three-variable path models where unmodeled
effect heterogeneity is present in more than one structural parameter. Using DAGs, we
explain which constellations of unmodeled effect heterogeneity will bias conventional
regression estimates. We conclude with a summary of findings.
1 Whether a model requires an interaction depends on the functional form of the dependent and/or
independent variables. For example, a model with no interactions in which the independent vari-
ables are entered in log form would require a whole series of interactions in order to approximate
this function if the independent variables were entered in nonlog form.
their failure to anticipate and incorporate all dimensions of effect heterogeneity into re-
gression analysis simply shifts the interpretation of regression coefficients from
individual-level causal effects to average causal effects, without imperiling the causal
nature of the estimate.
2 This presentation follows Angrist [1998] and Angrist and Pischke [2009].
3 The denominator of the OLS estimator is just a normalizing constant that does not aid intuition.
If the treatment effect is constant across strata, these weights make good sense. OLS
gives the minimum variance linear unbiased estimator of the model parameters under
homoscedasticity assuming correct specification of the model. Thus in a model without
interactions between treatment and covariates X the OLS estimator gives the most weight
to strata with the smallest variance for the estimated within-stratum treatment effect,
which, not considering the size of the strata, are those strata with the largest treatment
variance, i.e. with the Px that are closest to .5. However, if effects are heterogeneous
across strata, this weighting scheme makes little substantive sense: in order to compute
the average causal effect, δ, as defined above, we would want to give the same weight to
every individual in the sample. As a variance-weighted estimator, however, regression
estimates under conditions of unmodeled effect heterogeneity do not give the same
weight to every individual in the sample and thus do not converge to the (unweighted)
average treatment effect.
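A brief simulation sketch (assumed numbers, following the logic just described) illustrates the
conditional-variance weighting: two equal-sized strata with treatment probabilities .5 and .1 and
stratum-specific effects 1 and 3 have an unweighted average effect of 2, but the main-effects-only
OLS coefficient lands near 1.53.

import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
X = rng.binomial(1, 0.5, size=n)                     # two equal-sized strata
p_treat = np.where(X == 1, 0.1, 0.5)                 # treatment probabilities Px
D = rng.binomial(1, p_treat)
delta = np.where(X == 1, 3.0, 1.0)                   # heterogeneous effects
Y = delta * D + X + rng.normal(size=n)

design = np.column_stack([np.ones(n), D, X])         # main-effects-only model
coef_D = np.linalg.lstsq(design, Y, rcond=None)[0][1]
print(coef_D)                                        # about 1.53, not the mean 2.0
w = np.array([0.5 * 0.5, 0.1 * 0.9])                 # weights Px(1 - Px)
print((w * np.array([1.0, 3.0])).sum() / w.sum())    # also about 1.53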
Figure 1. Path diagram of the simulated model: A → B (coefficient α), B → C (β), and a
direct path A → C (γ).
1. For simulations in which one or more parameters do not vary by group, we set the
constant parameter(s) to the average of the group specific parameters, e.g. α = (α0 + α1)/2.
Finally, we estimate a conventional linear regression model for the effects of A and B
on C using the conventional default specification, in which all variables enter as main
effects only, C = Aγ + Bβ + ε. (Note that G is latent and therefore cannot be included in
the model.) The parameter γ refers to the direct effect of A on C holding B constant, and
β refers to the total effect of B on C.4 In much sociological and social science research,
this main-effects regression model is intended to recover average structural (causal)
effects, and is commonly believed to be well suited for the purpose.
Results: Table 2 shows the regression estimates for the main effect parameters across
the eight scenarios of effect heterogeneity. We see that the main effects regression model
correctly recovers the desired (average) parameters, γ=1 and β=1.5 if none of the pa-
rameters vary across groups (column 1), or if only one of the three parameters varies
(columns 2-4).
Other constellations of effect heterogeneity, however, produce biased estimates. If αG
and βG (column 5); or αG and γG (column 6); or αG, βG, and γG (column 8) vary across
groups, the main-effects-only regression model fails to recover the true (average) pa-
rameter values known to underlie the simulations. For our specific parameter values, the
estimated (average) effect of B on C in these troubled scenarios is always too high, and
the estimated average direct effect of A on C is either too high or too low. Indeed, if we
set γ=0 but let αG and βG vary across groups, the estimate for γ in the main-effects-only
regression model would suggest the presence of a direct effect of A on C even though it
is known by design that no such direct effect exists (not shown).
Failure of the regression model to recover the known path parameters is not merely a
function of the number of paths that vary. Although none of the scenarios in which fewer
than two parameters vary yield incorrect estimates, and the scenario in which all three
parameters vary is clearly biased, results differ for the three scenarios in which exactly
two parameters vary. In two of these scenarios (columns 5 and 6), regression fails to
recover the desired (average) parameters, while regression does recover the correct
average parameters in the third scenario (column 7).
4 The notion of direct and indirect effects is receiving deserved scrutiny in important recent work
by Robins and Greenland [1992]; Pearl [2001]; Robins [2003]; Frangakis and Rubin [2002]; Sobel
[2008]; and VanderWeele [2008].
In sum, the naïve main-effects-only linear regression model recovers the correct (aver-
age) parameter values only under certain conditions of limited effect heterogeneity, and it
fails to recover the true average effects in certain other scenarios, including the scenario
we consider most plausible in the majority of sociological applications, i.e., where all
three parameters vary across groups. If group membership is latent—because group
membership is unknown to or unmeasured by the analyst— and thus unmodeled, linear
regression generally will fail to recover the true average effects.
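A minimal simulation sketch of the column-5 scenario (αG and βG vary, γ constant; all numerical
values are assumptions chosen for illustration) shows the bias directly:

import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
G = rng.binomial(1, 0.5, size=n)             # latent group membership
A = rng.normal(size=n)                       # randomized treatment
alpha = np.where(G == 1, 1.5, 0.5)           # varies by group; average 1.0
beta = np.where(G == 1, 2.0, 1.0)            # varies by group; average 1.5
gamma = 1.0                                  # constant direct effect
B = alpha * A + rng.normal(size=n)
C = gamma * A + beta * B + rng.normal(size=n)

# Main-effects-only regression of C on A and B (G is latent, so it is omitted).
design = np.column_stack([np.ones(n), A, B])
coefs = np.linalg.lstsq(design, C, rcond=None)[0]
print(coefs[1], coefs[2])   # about 1.05 and 1.70, not the true averages 1.0 and 1.5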
of Y on Z, Z = Yξ + YXψ + εZ.5 In the language of VanderWeele and Robins [2007], who
provide the most extensive treatment of effect heterogeneity using DAGs to date, one
may call X a “direct effect modifier” of the effect of Y on Z. The point is that a variable
that modifies the effect of Y on Z is causally associated with Z, as represented by the
arrow from X to Z.
Figure 2. DAG in which X is a direct effect modifier of the effect of Y on Z: arrows from
Y and from X point into Z, with error term εZ.
Returning to our simulation, one realizes that the social science path model of Figure 1,
although a useful tool for informally illustrating the data generation process, does not,
generally, provide a sufficiently rigorous description of the causal structure underlying
the simulations. Figure 1, although truthfully representing the separate data generating
mechanism for each group and each individual in the simulated population, is not the
correct DAG for the pooled population containing groups G = 0 and G = 1 for all of the
heterogeneity scenarios considered above. Specifically, in order to turn the informal
social science path model of Figure 1 into a DAG, one would have to integrate the source
of heterogeneity, G, into the picture. How this is to be done depends on the structure of
heterogeneity. If only βG (the effect of B on C) and/or γG (the direct effect of A on C
holding B constant) varied with G, then one would add an arrow from G into C. If αG
(the effect of A on B) varied with G, then one would add an arrow from G into B. The
DAG in Figure 3 thus represents those scenarios in which αG as well as either βG or γG, or
both, vary with G (columns 5, 6, and 8). Interpreted in terms of a linear path model, this
DAG is consistent with the following two structural equations: B = Aα0 + AGα1 + εB and
C = Aγ0 + AGγ1 + Bβ0 + BGβ1 + εC (where the iid errors, ε, have been omitted from the
DAG and are assumed to be uncorrelated).6
In our analysis, mimicking the reality of limited observational data with weak substan-
tive theory, we have assumed that A, B, and C are observed, but that G is not observed. It
is immediately apparent that the presence of G in Figure 3 means that, first, G is a
confounder for the effect of B on C; and, second, that B is a “collider” [Pearl 2009] on
5 It is also consistent with an equation that adds a main effect of X. For the purposes of this paper
it does not matter whether the main effect is present.
6 By construction of the example, we assume that A is randomized and thus marginally
independent of G. Note, however, that even though G is mean independent of B and C (no main
effect of G on either B or C), G is not marginally independent of B or C because
var(B|G=1)≠var(B|G=0) and var(C|G=1)≠var(C|G=0), which explains the arrows from G into B and
C. Adding main effects of G on B and C would not change the arguments presented here.
the path from A to C via B and G. Together, these two facts explain the failure of the
main-effects-only regression model to recover the true parameters in panels 5, 6, and 8:
First, in order to recover the effect of B on C, β, one would need to condition on the con-
founders A and G. But G is latent so it cannot be conditioned on. Second, conditioning on
the collider B in the regression opens a “backdoor path” from A to C via B and G (when
G is not conditioned on), i.e. it induces a non-causal association between A and C,
creating selection bias in the estimate for the direct effect of A on C, γ [Pearl 1995, 2009;
Hernán et al 2004]. Hence, both coefficients in the main-effects-only regression model
will be biased for the true (average) parameters.
Figure 3. DAG for the biased scenarios: A → B → C and A → C, with arrows from the
latent group variable G into both B and C.
By contrast, if G modifies neither β nor γ, then the DAG would not contain an arrow
from G into C; and if G does not modify α then the DAG would not contain an arrow
from G into B. Either way, if one (or both) of the arrows emanating from G is
missing, then G is not a confounder for the effect of B on C, and conditioning on B will
not induce selection bias by opening a backdoor path from A to C. Only then would the
main effects regression model be unbiased and recover the true (average) parameters, as
seen in panels 1-4 and 7.
In sum, Pearl’s DAGs neatly display the structural information encoded in effect het-
erogeneity [VanderWeele and Robins 2007]. Consequently, Pearl’s DAGs immediately
draw attention to problems of confounding and selection bias that can occur when more
than one effect in a causal system varies across sample members. Analyzing the appro-
priate DAG, the failure of main-effects-only regression models to recover average struc-
tural parameters in certain constellations of effect heterogeneity becomes predictable.
5 Conclusion
This paper considered a conventional structural model of a kind commonly used in the
social sciences and explored its performance under various basic scenarios of effect het-
erogeneity. Simulations show that the standard social science strategy of dealing with
effect heterogeneity—by ignoring it—is prone to failure. In certain situations, the main-
effects-only regression model will recover the desired quantities, but in others it will not.
We believe that effect heterogeneity in all arrows of a path model is plausible in many, if
not most, substantive applications. Since the sources of heterogeneity are often not theo-
rized, known, or measured, social scientists continue routinely to estimate main-effects-
only regression models in hopes of recovering average causal effects. Our examples
demonstrate that the belief in the averaging powers of main-effects-only regression mod-
els may be misplaced if heterogeneity is pervasive, as estimates can be mildly or wildly
off the mark. Judea Pearl’s DAGs provide a straightforward explanation for these diffi-
culties—DAGs remind analysts that effect heterogeneity may encode structural infor-
mation about confounding and selection bias that requires consideration when designing
statistical strategies for recovering the desired average causal effects.
References
Amato, Paul R., and Alan Booth. (1997). A Generation at Risk: Growing Up in an Era of
Family Upheaval. Cambridge, MA: Harvard University Press.
Angrist, Joshua D. (1998). “Estimating the Labor Market Impact of Voluntary Military
Service Using Social Security Data on Military Applicants.” Econometrica 66: 249-288.
Angrist, Joshua D. and Jörn-Steffen Pischke. (2009). Mostly Harmless Econometrics: An
Empiricist’s Companion. Princeton, NJ: Princeton University Press.
Elwert, Felix, and Nicholas A. Christakis. (2006). “Widowhood and Race.” American
Sociological Review 71: 16-41.
Frangakis, Constantine E., and Donald B. Rubin. (2002). “Principal Stratification in
Causal Inference.” Biometrics 58: 21–29.
Greenland, Sander, Judea Pearl, and James M. Robins. (1999). “Causal Diagrams for
Epidemiologic Research.” Epidemiology 10: 37-48.
Hernán, Miguel A., Sonia Hernández-Díaz, and James M. Robins. (2004). “A Structural
Approach to Selection Bias.” Epidemiology 15 (5): 615-625.
Morgan, Stephen L. and Christopher Winship. (2007). Counterfactuals and Causal Infer-
ence: Methods and Principles of Social Research. Cambridge: Cambridge University
Press.
Pearl, Judea. (1995). “Causal Diagrams for Empirical Research.” Biometrika 82 (4): 669-
710.
Pearl, Judea. (2001). "Direct and Indirect Effects." In Proceedings of the Seventeenth
Conference on Uncertainty in Artificial Intelligence. San Francisco, CA: Morgan
Kaufmann, 411-420.
Pearl, Judea. (2009). Causality: Models, Reasoning, and Inference. Second Edition.
Cambridge: Cambridge University Press.
Robins, James M. (2001). “Data, Design, and Background Knowledge in Etiologic Infer-
ence.” Epidemiology 12 (3): 313-320.
Robins, James M. (2003). “Semantics of Causal DAG Models and the Identification of
Direct and Indirect Effects.” In: Highly Structured Stochastic Systems, P. Green, N.
Hjort and S. Richardson, Eds. Oxford: Oxford University Press.
Robins, James M., and Sander Greenland. (1992). “Identifiability and Exchangeability for
Direct and Indirect Effects.” Epidemiology 3:143-155.
Sobel, Michael. (2008). “Identification of Causal Parameters in Randomized Studies with
Mediating Variables,” Journal of Educational and Behavioral Statistics 33 (2): 230-
251.
VanderWeele, Tyler J. (2008). “Simple Relations Between Principal Stratification and
Direct and Indirect Effects.” Statistics and Probability Letters 78: 2957-2962.
VanderWeele, Tyler J. and James M. Robins. (2007). “Four Types of Effect Modifica-
tion: A Classification Based on Directed Acyclic Graphs.” Epidemiology 18 (5): 561-
568.
20
Reasoning in P-log
MICHAEL GELFOND AND NELSON RUSHTON
1 Introduction
In this paper we give an overview of the knowledge representation (KR) language P-
log [Baral, Gelfond, and Rushton 2009] whose design was greatly influenced by work
of Judea Pearl. We introduce the syntax and semantics of P-log, give a number of
examples of its use for knowledge representation, and discuss the role Pearl’s ideas
played in the design of the language. Most of the technical material presented in
the paper is not new. There are however two novel technical contributions which
could be of interest. First we expand P-log semantics to allow domains with infinite
Herbrand bases. This allows us to represent infinite sequences of random variables
and (indirectly) continuous random variables. Second we generalize the logical base
of P-log which improves the degree of elaboration tolerance of the language.
The goal of the P-log designers was to create a KR-language allowing natural
and elaboration tolerant representation of commonsense knowledge involving logic
and probabilities. The logical framework of P-log is Answer Set Prolog (ASP) —
a language for knowledge representation and reasoning based on the answer set se-
mantics (aka stable model semantics) of logic programs [Gelfond and Lifschitz 1988;
Gelfond and Lifschitz 1991]. ASP has roots in declarative programming, the syntax
and semantics of standard Prolog, disjunctive databases, and non-monotonic logic.
The semantics of ASP captures the notion of possible beliefs of a reasoner who
adheres to the rationality principle which says that “One shall not believe anything
one is not forced to believe”. The entailment relation of ASP is non-monotonic1 ,
which facilitates a high degree of elaboration tolerance in ASP theories. ASP allows
natural representation of defaults and their exceptions, causal relations (including
effects of actions), agents’ intentions and obligations, and other constructs of natural
language. ASP has a number of efficient reasoning systems, a well developed math-
ematical theory, and a well tested methodology of representing and using knowledge
for computational tasks (see, for instance, [Baral 2003]). This, together with the
fact that some of the designers of P-log came from the ASP community, made the
choice of a logical foundation for P-log comparatively easy.
1 Roughly speaking, a language L is monotonic if whenever Π1 and Π2 are collections of statements of L with Π1 ⊂ Π2 , and W is a model of Π2 , then W is a model of Π1 . A language which is not monotonic is said to be nonmonotonic.
The choice of a probabilistic framework was more problematic and that is where
Judea’s ideas played a major role. Our first problem was to choose from among
various conceptualizations of probability: classical, frequentist, subjective, etc. Un-
derstanding the intuitive readings of basic language constructs is crucial for a soft-
ware/knowledge engineer — probably more so than for a mathematician who may
be primarily interested in their mathematical properties. Judea Pearl in [Pearl 1988]
introduced the authors to the subjective view of probability — i.e. understanding
of probabilities as degrees of belief of a rational agent — and to the use of subjective
probability in AI. This matched well with the ASP-based logic side of the language.
The ASP part of a P-log program can be used for describing possible beliefs, while
the probabilistic part would allow knowledge engineers to quantify the degrees of
these beliefs.
After deciding on an intuitive reading of probabilities, the next question was
which sorts of probabilistic statements to allow. Fortunately, the question of concise
and transparent representation of probability distributions was already addressed by
Judea in [Pearl 1988], where he showed how Bayesian nets can be successfully used
for this purpose. The concept was extended in [Pearl 2000] where Pearl introduced
the notion of Causal Bayesian Nets (CBN’s). Pearl’s definition of CBN’s is pioneer-
ing in three respects. First, he gives a framework where nondeterministic causal
relations are the primitive relations among random variables. Second, he shows how
relationships of correlation and (classical) independence emerge from these causal
relationships in a natural way; and third he shows how this emergence is faithful to
our intuitions about the difference between causality and (mere) correlation.
As we mentioned above, one of the primary desired features in the design of P-log
was elaboration tolerance — defined as the ability of a representation to incorpo-
rate new knowledge with minimal revision [McCarthy 1999]. P-log inherited from
ASP the ability to naturally incorporate many forms of new logical knowledge. An
extension of ASP, called CR-Prolog, further improved this ability [Balduccini and
Gelfond 2003]. The term “elaboration tolerance” is less well known in the field of
probabilistic reasoning, but one of the primary strengths of Bayes nets as a repre-
sentation is the ability to systematically and smoothly incorporate new knowledge
through conditioning, using Bayes Theorem as well as algorithms given by Pearl
[Pearl 1988] and others. Causal Bayesian Nets carry this a step further, by allowing
us to formalize interventions in addition to (and as distinct from) observations, and
smoothly incorporate either kind of new knowledge in the form of updates. Thus
from the standpoint of elaboration tolerance, CBN’s were a natural choice as a
probabilistic foundation for P-log.
Another reason for choosing CBN’s is that we simply believe Pearl’s distinction
between observations and interventions to be central to commonsense probabilistic
reasoning. It gives a precise mathematical basis for distinguishing between the
following questions: (1) what can I expect to happen given that I observe X = x,
and (2) what can I expect to happen if I intervene in the normal operation of the
system and set X to x?
P-log carries things another step. There are many actions one could take to
manipulate a system besides fixing the values of (otherwise random) variables —
and the effects of such actions are well studied under headings associated with
ASP. Moreover, besides actions, there are many sorts of information one might
gain besides those which simply eliminate possible worlds: one may gain knowledge
which introduces new possible worlds, alters the probabilities of possible worlds,
introduces new logical rules, etc. ASP has been shown to be a good candidate for
handling such updates in non-probabilistic settings, and our hypothesis was that it
would serve as well when combined with a probabilistic representation. Thus some of
the key advantages of Bayesian nets, which are amplified by CBN’s, show plausible
promise of being even further amplified by their combination with ASP. This is the
methodology of P-log: to combine a well studied method for elaboration tolerant
probabilistic representations (CBN’s) with a well studied method for elaboration
tolerant logical representations (ASP).
Finally let us say a few words about the current status of the language. It is com-
paratively new. The first publication on the subject appeared in [Baral, Gelfond,
and Rushton 2004], and the full journal paper describing the language appeared
only recently in [Baral, Gelfond, and Rushton 2009]. The use of P-log for knowl-
edge representation was also explored in [Baral and Hunsaker 2007] and [Gelfond,
Rushton, and Zhu 2006]. A prototype reasoning system based on ASP computa-
tion allowed the use of the language for a number of applications (see, for instance,
[Baral, Gelfond, and Rushton 2009; Pereira and Ramli 2009]). We are currently
working on the development and implementation of a more efficient system, and on
expanding it to allow rules of CR-Prolog. Finding ways for effectively combining
ASP-based computational methods of P-log with recent advanced algorithms for
Bayesian nets is probably one of the most interesting open questions in this area.
2 Preliminaries
This section contains a description of syntax and semantics of both ASP and CR-
Prolog. In what follows we use a standard notion of a sorted signature from classical
logic. Terms and atoms are defined as usual. An atom p(t) and its negation ¬p(t)
are referred to as literals. Literals of the form p(t) and ¬p(t) are called contrary.
ASP and CR-Prolog also contain connectives not and or which are called default
negation and epistemic disjunction respectively. Literals possibly preceded by de-
fault negation are called extended literals.
An ASP program is a pair consisting of a signature σ and a collection of rules of
the form
l0 or . . . or lm ← lm+1 , . . . , lk , not lk+1 , . . . , not ln (1)
where l’s are literals. The right-hand side of the rule is often referred to as the
rule’s body, the left-hand side as the rule’s head.
The answer set semantics of a logic program Π assigns to Π a collection of answer
sets – partial interpretations2 corresponding to possible sets of beliefs which can be
built by a rational reasoner on the basis of rules of Π. In the construction of such
a set S, the reasoner is assumed to be guided by the following informal principles:
• the reasoner should satisfy the rules of Π;
• the reasoner should adhere to the rationality principle, which says that one
shall not believe anything one is not forced to believe.
The definition is given in two steps. Consider first a program Π whose rules contain
no default negation, i.e. rules of the form
l0 or . . . or li ← li+1 , . . . , lm .
An answer set of such a program is a consistent set S of literals which satisfies the
rules of Π and is minimal (with respect to set-theoretic inclusion) among the sets
satisfying this property.
The rationality principle here is captured by the minimality condition. For example,
it is easy to see that { } is the only answer set of the program consisting of the single
rule p ← p, and hence the reasoner associated with it knows nothing about the
truth or falsity of p. The program consisting of rules
p(a).
q(a) or q(b) ← p(a).
has two answer sets: {p(a), q(a)} and {p(a), q(b)}. Note that no rule requires the
reasoner to believe in both q(a) and q(b). Hence he believes that the two formulas
p(a) and (q(a) or q(b)) are true, and that ¬p(a) is false. He remains undecided,
however, about, say, the two formulas p(b) and (¬q(a) or ¬q(b)). Now let us consider
an arbitrary program:
DEFINITION 2 (Answer Sets, Part II). Let Π be an arbitrary collection of rules
(1) and S a set of literals. By ΠS we denote the program obtained from Π by
Here d1 (P, C) is the name of the default rule and ab(d1 (P, C)) says that default
d1 (P, C) is not applicable to the pair ⟨P, C⟩. The second rule above stops the
application of the default in cases where the class is logic and P may be a math
professor. Used in conjunction with rules:
member(john, cs).
member(mary, math).
member(bob, ee).
¬member(P, D) ← not member(P, D).
course(logic, cs).
course(data structures, cs).
the program will entail that Mary does not teach data structures while she may
teach logic; Bob teaches neither logic nor data structures, and John may teach both
classes.
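A formulation of this default and its weak exception in the spirit of the example
(the predicate teaches, the variable names, and the exact rule bodies are assumptions
made here for illustration, not the rules of the original program) might be:
¬teaches(P, C) ← member(P, D1), course(C, D2), D1 ≠ D2,
                 not ab(d1 (P, C)),
                 not teaches(P, C).
ab(d1 (P, C)) ← C = logic, not ¬member(P, math).
The first rule says that, by default, a professor does not teach courses offered outside
her own department; the second withholds the default for the logic class whenever P
may be a member of the math department.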
The previous examples illustrate the representation of defaults and their strong and
weak exceptions. There is another type of possible exception to defaults, sometimes
referred to as an indirect exception. Intuitively, these are rare exceptions that
come into play only as a last resort, to restore the consistency of the agent’s world
view when all else fails. The representation of indirect exceptions seems to be
beyond the power of ASP. This observation led to the development of a simple but
powerful extension of ASP called CR-Prolog (or ASP with consistency-restoring
rules). To illustrate the problem let us consider the following example.
Consider an ASP representation of the default “elements of class c normally have
property p”:
p(X) ← c(X),
not ab(d(X)),
not ¬p(X).
together with the rule
q(X) ← p(X).
and the facts c(a) and ¬q(a). Let us denote this program by E, where E stands for
“exception”.
It is not difficult to check that E is inconsistent. No rules allow the reasoner to
prove that the default is not applicable to a (i.e. to prove ab(d(a))) or that a
does not have property p. Hence the default must conclude p(a). The second rule
implies q(a) which contradicts one of the facts. However, there seems to exist a
natural solution: to view a as a rare, indirect exception to the default. To express
such exceptions, CR-Prolog expands ASP with rules of the form
l0 +← l1 , . . . , lk , not lk+1 , . . . , not ln (2)
where l’s are literals. Rules of this type are called consistency restoring rules
(CR-rules).
Intuitively, rule (2) says that if the reasoner associated with the program believes the
body of the rule, then he “may possibly” believe its head. However, this possibility
may be used only if there is no way to obtain a consistent set of beliefs by using
only regular rules of the program. The partial order over sets of CR-rules will be
used to select preferred possible resolutions of the conflict. Currently the inference
engine of CR-Prolog [Balduccini 2007] supports two such relations, denoted ≤1 and
≤2 . One is based on the set-theoretic inclusion (R1 ≤1 R2 holds iff R1 ⊆ R2 ).
The other is defined by the cardinality of the corresponding sets (R1 ≤2 R2 holds
iff |R1 | ≤ |R2 |). To give the precise semantics we will need some terminology and
notation.
The set of regular rules of a CR-Prolog program Π will be denoted by Πr , and the
set of CR-rules of Π will be denoted by Πcr . By α(r) we denote the regular rule
obtained from a consistency restoring rule r by replacing +← with ←. If R is a set of
CR-rules then α(R) = {α(r) : r ∈ R}. As in the case of ASP, the semantics of
CR-Prolog will be given for ground programs. A rule with variables will be viewed
as a shorthand for a set of ground rules.
DEFINITION 3. (Abductive Support)
A minimal (with respect to the preference relation of the program) collection R of
CR-rules of Π such that Πr ∪ α(R) is consistent (i.e. has an answer set) is called
an abductive support of Π.
DEFINITION 4. (Answer Sets of CR-Prolog)
A set A is called an answer set of Π if it is an answer set of a regular program
Πr ∪ α(R) for some abductive support R of Π.
Now let us show how CR-Prolog can be used to represent defaults and their indirect
exceptions. The CR-Prolog representation of the default d(X), which we attempted
to represent in ASP program E, may look as follows
p(X) ← c(X),
not ab(d(X)),
not ¬p(X).
¬p(X) +← c(X).
The first rule is the standard ASP representation of the default, while the second rule
expresses the Contingency Axiom for the default d(X)4 . Consider now a program
obtained by combining these two rules with an atom c(a).
Assuming that a is the only constant in the signature of this program, the program’s
unique answer set will be {c(a), p(a)}. Of course this is also the answer set of the
regular part of our program. (Since the regular part is consistent, the Contingency
Axiom is ignored.) Let us now expand this program by the rules
q(X) ← p(X).
¬q(a).
The regular part of the new program is inconsistent. To save the day we need to
use the Contingency Axiom for d(a) to form the abductive support of the program.
As a result the new program has the answer set {¬q(a), c(a), ¬p(a)}. The new
information does not produce inconsistency, as it did in ASP program E. Instead the
program withdraws its previous conclusion and recognizes a as a (strong) exception
to default d(a).
3 The Language
A P-log program consists of its declarations, logical rules, random selection rules,
probability atoms, observations, and actions. We will begin this section with a
brief description of the syntax and informal readings of these components of the
programs, and then proceed to an illustrative example.
The declarations of a P-log program give the types of objects and functions in
the program. Logical rules are “ordinary” rules of the underlying logical language
4 In this form of Contingency Axiom, we treat X as a strong exception to the default. Sometimes
it may be useful to also allow weak indirect exceptions; this can be achieved by adding the rule
ab(d(X)) +← c(X).
written using light syntactic sugar. For purposes of this paper, the underlying
logical language is CR-Prolog.
P-log uses random selection rules to declare random attributes (essentially ran-
dom variables) of the form a(t), where a is the name of the attribute and t is a
vector of zero or more parameters. In this paper we consider random selection rules
of the form
[ r ] random(a(t)) ← B. (3)
where r is a term used to name the random causal process associated with the rule
and B is a conjunction of zero or more extended literals. The name [ r ] is optional
and can be omitted if the program contains exactly one random selection rule for
a(t). Statement (3) says that if B were to hold, the value of a(t) would be selected at
random from its range by process r, unless this value is fixed by a deliberate action.
More general forms of random selection rules, where the values may be selected from
a range which depends on context, are discussed in [Baral, Gelfond, and Rushton
2009].
Knowledge of the numeric probabilities of possible values of random attributes is
expressed through causal probability atoms, or pr-atoms. A pr-atom takes the form
prr (a(t) = y |c B) = v
where r is the name of a random selection rule for a(t), B is a conjunction of zero or
more extended literals, y is a possible value of a(t), and v ∈ [0, 1]. The statement says
that if the value of a(t) were selected by process r, and B were to hold, then the
probability of a(t) = y would be v.
Finally, observations and deliberate actions are recorded by statements of the form
obs(l) and do(a(t) = y), where l is a literal, a(t) a random attribute, and y a possible
value of a(t). obs(l) is read l is observed to be true. The action do(a(t) = y) is read
the value of a(t), instead of being random, is set to y by a deliberate action.
EXAMPLE 5. [Circuit]
A circuit has a motor, a breaker, and a switch. The switch may be open or closed.
The breaker may be tripped or not; and the motor may be turning or not. The
operator may toggle the switch or reset the breaker. If the switch is closed and the
system is functioning normally, the motor turns. The motor never turns when the
switch is open, the breaker is tripped, or the motor is burned out. The system may
break and if so the break could consist of a tripped breaker, a burned out motor,
or both, with respective probabilities .9, .09, and .01. Breaking, however, is rare,
and should be considered only in the absence of other explanations.
Let us show how to represent this knowledge in P-log. First we give declarations of
sorts and functions relevant to the domain. As typical for representation of dynamic
domains we will have sorts for actions, fluents (properties of the domain which can
be changed by actions), and time steps. Fluents will be partitioned into inertial
fluents and defined fluents. The former are subject to the law of inertia [Hayes and
McCarthy 1969] (which says that things stay the same by default), while the latter
are specified by explicit definitions in terms of already defined fluents. We will also
have a sort for possible types of breaks which may occur in the system. In addition
to declared sorts P-log contains a number of predefined sorts, e.g. a sort boolean.
Here are the sorts of the domain for the circuit example:
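In a schematic syntax (the exact form of the sort and function declarations in [Baral,
Gelfond, and Rushton 2009] may differ slightly; the names below follow those used in
the rest of the example), the declarations can be written as:
breaks = {trip, burn, both}.
action = {toggle, reset, break}.
inertial fluent = {closed, tripped, burned}.
defined fluent = {faulty, turning}.
fluent = inertial fluent ∪ defined fluent.
step = {0, 1}.
holds : fluent × step → boolean.
occurs : action × step → boolean.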
Here holds(f, T ) says that fluent f is true at time step T and occurs(a, T ) indicates
that action a was executed at T .
The last function we need to declare is a random attribute type of break(T ) which
denotes the type of an occurrence of action break at step T .
The first two logical rules of the program define the direct effects of action toggle.
holds(closed, T + 1) ← occurs(toggle, T ),
¬holds(closed, T ).
¬holds(closed, T + 1) ← occurs(toggle, T ),
holds(closed, T ).
They simply say that toggling opens and closes the switch. The next rule says that
resetting the breaker untrips it.
¬holds(tripped, T + 1) ← occurs(reset, T ).
The effects of action break are described by the rules
holds(tripped, T + 1) ← occurs(break, T ),
type of break(T ) = trip.
holds(burned, T + 1) ← occurs(break, T ),
type of break(T ) = burn.
holds(tripped, T + 1) ← occurs(break, T ),
type of break(T ) = both.
holds(burned, T + 1) ← occurs(break, T ),
type of break(T ) = both.
The next two rules express the inertia axiom which says that by default, things stay
as they are. They use default negation not — the main nonmonotonic connective
of ASP — and can be viewed as typical representations of defaults in ASP and its
extensions.
holds(F, T + 1) ← inertial f luent(F ),
holds(F, T ),
not ¬holds(F, T + 1).
¬holds(F, T + 1) ← inertial f luent(F ),
¬holds(F, T ),
not holds(F, T + 1).
Next we explicitly define fluents faulty and turning.
holds(f aulty, T ) ← holds(tripped, T ).
holds(f aulty, T ) ← holds(burned, T ).
¬holds(f aulty, T ) ← not holds(f aulty, T ).
The rules above say that the system is functioning abnormally if and only if the
breaker is tripped or the motor is burned out. Similarly the next definition says
that the motor turns if and only if the switch is closed and the system is functioning
normally.
holds(turning, T ) ← holds(closed, T ),
¬holds(f aulty, T ).
¬holds(turning, T ) ← not holds(turning, T ).
The above rules are sufficient to define causal effects of actions. For instance, if
we assume that at Step 0 the motor is turning and the breaker then trips, i.e.
action break of the type trip occurred at 0, then in the resulting state we will have
holds(tripped, 1) as the direct effect of this action; while ¬holds(turning, 1) will be
its indirect effect5 .
We will next have a default saying that for each action A and time step T , in the
absence of a reason to believe otherwise we assume A does not occur at T .
¬occurs(A, T ) ← action(A), not occurs(A, T ).
We next state a CR-rule representing possible exceptions to this default. The rule
says that a break to the system may be considered if necessary (that is, necessary
in order to reach a consistent set of beliefs).
+
occurs(break, 0) ← .
The next collection of facts describes the initial situation of our story.
¬holds(closed, 0). ¬holds(burned, 0). ¬holds(tripped, 0). occurs(toggle, 0).
Next, we state a random selection rule which captures the non-determinism in the
description of our circuit.
random(type of break(T )) ← occurs(break, T ).
The rule says that if action break occurs at step T then the type of break will be
selected at random from the range of possible types of breaks, unless this type is
fixed by a deliberate action. Intuitively, break can be viewed as a non-deterministic
action, with non-determinism coming from the lack of knowledge about the precise
type of break.
Let π0 be the circuit program given so far. Next we will give a sketch of the formal
semantics of P-log, using π0 as an illustrative example.
The logical part of a P-log program Π consists of its declarations, logical rules,
random selection rules, observations, and actions; while its probabilistic part consists
of its pr-atoms (though the above program does not have any). The semantics of
P-log describes a translation of the logical part of Π into an “ordinary” CR-Prolog
program τ (Π). The semantics of Π is then given by the answer sets of τ (Π), called the
possible worlds of Π, together with a probability measure over these possible worlds,
described below.
5 It is worth noticing that, though short, our formalization of the circuit is non-trivial. It is
obtained using the general methodology of representing dynamic systems modeled by transition
diagrams whose nodes correspond to physically possible states of the system and whose arcs are
labeled by actions. A transition ⟨σ0 , a, σ1 ⟩ indicates that state σ1 may be a result of execution of
a in σ0 . The problem of finding a concise and mathematically accurate description of such diagrams
has been a subject of research for over 30 years. Its solution requires a good understanding of the
nature of causal effects of actions in the presence of complex interrelations between fluents. An
additional level of complexity is added by the need to specify what is not changed by actions. As
noticed by John McCarthy, the latter, known as the Frame Problem, can be reduced to finding
a representation of the Inertia Axiom which requires the ability to represent defaults and to do
non-monotonic reasoning. The representation of this axiom as well as that of the interrelations
between fluents we used in this example is a simple special case of the general theory of action and
change based on logic programming under the answer set semantics.
To obtain τ (π0 ) we represent sorts as collections of facts. For instance, sort step
would be represented in CR-Prolog as
step(0). step(1).
For a non-boolean function type of break the occurrences of atoms of the form
type of break(T ) = trip in π0 are replaced by type of break(T, trip). Similarly for
burn and both. The translation also contains the axiom
¬type of break(T, V1 ) ← breaks(V1 ), breaks(V2 ), V1 ≠ V2 ,
type of break(T, V2 ).
to guarantee that type of break is a function. In general, the same transformation
is performed for all non-boolean functions.
Logical rules of π0 are simply inserted into τ (π0 ). Finally, the random selection rule
is transformed into a disjunctive rule of the form
type of break(T, trip) or type of break(T, burn) or type of break(T, both) ←
    occurs(break, T ), not intervene(type of break(T )).
It is worth pointing out here that while CBN’s represent the notion of intervention in
terms of transformations on graphs, P-log axiomatizes the semantics of intervention
by including not intervene(. . . ) in the body of the translation of each random
selection rule. This amounts to a default presumption of randomness, overridable
by intervention. We will see next how actions using do can defeat this presumption.
Observations and actions are translated as follows. For each literal l in π0 , τ (π0 )
contains the rule
← obs(l), not l.
and, for each atom of the form a(t) = y (for instance type of break(0) = trip), the rules
a(t, y) ← do(a(t, y)).
intervene(a(t)) ← do(a(t, y)).
The first rule eliminates possible worlds of the program failing to satisfy l. The
second rule makes sure that interventions affect their intervened-upon variables in
the expected way. The third rule defines the relation intervene which, for each
action, cancels the randomness of the corresponding attribute.
It is not difficult to check that under the semantics of CR-Prolog, τ (π0 ) has a unique
possible world W containing holds(closed, 1) and holds(turning, 1), the direct and
indirect effects, respectively, of the action close. Note that the collection of regular
ASP rules of τ (π0 ) is consistent, i.e., has an answer set. This means that the CR-rule
occurs(break, 0) +← is not activated, break does not occur, and the program contains
no randomness.
Now we will discuss how probabilities are computed in P-log. Let Π be a P-log
program containing the random selection rule [r] random(a(t)) ← B1 and the pr-
atom prr (a(t) = y |c B2 ) = v. Then if W is a possible world of Π satisfying B1 and
B2 , the assigned probability of a(t) = y in W is defined 6 to be v. In case W satisfies
B1 and a(t) = y, but there is no pr-atom prr (a(t) = y |c B2 ) = v of Π such that
W satisfies B2 , then the default probability of a(t) = y in W is computed using the
“indifference principle”, which says that two possible values of a random selection
are equally likely if we have no reason to prefer one to the other (see [Baral, Gelfond,
and Rushton 2009] for details). The probability of each random atom a(t) = y
occurring in each possible world W of program Π, written PΠ (W, a(t) = y), is now
defined to be the assigned probability or the default probability, as appropriate.
Let W be a possible world of Π. The unnormalized probability, µ̂Π (W ), of a
possible world W induced by Π is
µ̂Π (W ) =def ∏_{a(t,y) ∈ W} PΠ (W, a(t) = y)
where the product is taken only over atoms for which PΠ (W, a(t) = y) is defined.
Suppose Π is a P-log program having at least one possible world with nonzero
unnormalized probability, and let Ω be the set of possible worlds of Π. The measure,
µΠ (W ), of a possible world W induced by Π is the unnormalized probability of W
divided by the sum of the unnormalized probabilities of all possible worlds of Π,
i.e.,
µΠ (W ) =def µ̂Π (W ) / ∑_{Wi ∈ Ω} µ̂Π (Wi )
When the program Π is clear from context we may simply write µ̂ and µ instead of
µ̂Π and µΠ respectively.
This completes the discussion of how probabilities of possible worlds are defined in
P-log. Now let us return to the circuit example. Let program π1 be the union of π0
with the single observation
obs(¬holds(turning, 1))
The observation contradicts our previous conclusion holds(turning, 1) reached by
using the effect axiom for toggle, the definitions of f aulty and turning, and the
6 For the sake of well-definedness, we consider only programs in which at most one v satisfies this definition.
inertia axiom for tripped and burned. The program τ (π1 ) will resolve this contradiction
by using the CR-rule occurs(break, 0) +← to conclude that the action break
occurred at Step 0. Now type of break randomly takes one of its possible values.
Accordingly, τ (π1 ) has three answer sets: W1 , W2 , and W3 . All of them contain
occurs(break, 0), holds(f aulty, 1), ¬holds(turning, 1). One of them, say W1 , will contain
type of break(0) = trip and holds(tripped, 1); W2 will contain type of break(0) = burn
and holds(burned, 1); and W3 will contain type of break(0) = both together with both
holds(tripped, 1) and holds(burned, 1).
In accordance with our general definition, π1 will have three possible worlds, W1 ,
W2 , and W3 . The probabilities of each of these three possible worlds can be com-
puted as 1/3, using the indifference principle.
Now let us add some quantitative probabilities to our program. If π2 is the union
of π1 with the following three pr-atoms
pr(type of break(T ) = trip) = 0.9.
pr(type of break(T ) = burn) = 0.09.
pr(type of break(T ) = both) = 0.01.
then program π2 has the same possible worlds as π1 . Not surprisingly, Pπ2 (W1 ) =
0.9. Similarly Pπ2 (W2 ) = 0.09 and Pπ2 (W3 ) = 0.01. This demonstrates how a P-log
program may be written in stages, with quantitative probabilities added as they are
needed or become available.
Typically we are interested not just in the probabilities of individual possible worlds,
but in the probabilities of certain interesting sets of possible worlds, e.g., those
described by formulae. For current purposes a rather simple definition suffices.
Viz., recalling that possible worlds are sets of literals, for an arbitrary set C of literals
we define
PΠ (C) =def ∑_{W : C ⊆ W} µΠ (W ).
Our example is in some respects rather simple. For instance, every possible world
of our program contains at most one atom of the form a(t) = y where a(t) is a
random attribute. We hope, however, that this example gives the reader some insight
into the syntax and semantics of P-log. It is worth noticing that the example shows
the ability of P-log to mix logical and probabilistic reasoning, including reasoning
about causal effects of actions and explanations of observations. In addition it
demonstrates the non-monotonic character of P-log, i.e. its ability to react to new
knowledge by changing probabilistic models of the domain and creating new possible
worlds.
The ability to introduce new possible worlds as a result of conditioning is of
interest from two standpoints. First, it reflects the common sense semantics of
utterances such as “the motor might be burned out.” Such a sentence does not
eliminate existing possible beliefs, and so there is no classical (i.e., monotonic)
semantics in which the statement would be informative. If it is informative, as
common sense suggests, then its content seems to introduce new possibilities into
the listener’s thought process.
Second, nonmonotonicity can improve performance. Possible worlds tend to pro-
liferate exponentially with the size of a program, quickly making computations in-
tractable. The ability to consider only those random selections which may explain
our abnormal observations may make computations tractable for larger programs.
Even though our current solver is in its early stages of development, it is based on
well researched answer set solvers which efficiently eliminate impossible worlds from
consideration based on logical reasoning. Thus even our early prototype has shown
promising performance on problems where logic may be used to exclude possible
worlds from consideration in the computation of probabilities [Gelfond, Rushton,
and Zhu 2006].
4 Spider Example
In this section, we consider a variant of Simpson’s paradox, to illustrate the for-
malization of interventions in P-log. The story we would like to formalize is as
follows:
In Stan’s home town there are two kinds of poisonous spider, the creeper and the
spinner. Bites from the two are equally common in Stan’s area — though spinner
bites are more common on a worldwide basis. An experimental anti-venom has
been developed to treat bites from either kind of spider, but its effectiveness is
questionable.
One morning Stan wakes to find he has a bite on his ankle, and drives to the
emergency room. A doctor examines the bite, and concludes it is a bite from either
a creeper or a spinner. In deciding whether to administer the anti-venom, the
doctor examines the data he has on bites from the two kinds of spiders: out of 416
people bitten by the creeper worldwide, 312 received the anti-venom and 104 did
not. Among those who received the anti-venom, 187 survived, while 73 of those who
did not receive it survived. The spinner is more deadly and tends to inhabit
areas where the treatment is less available. Of 924 people bitten by the spinner,
168 received the anti-venom, 34 of whom survived. Of the 756 spinner bite victims
who did not receive the experimental treatment, only 227 survived.
For a random individual bitten by a creeper or spinner, let s, a, and c denote the
events of survival, administering anti-venom, and creeper bite. Based on the fact
that the two sorts of bites are equally common in Stan’s region, the doctor assigns a
0.5 probability to either kind of bite. He also computes a probability of survival, with
and without treatment, from each kind of bite, based on the sampling distribution
of the available data. He similarly computes the probabilities that victims of each
kind of bite received the anti-venom. We may now imagine the doctor uses Bayes’
Theorem to compute P (s | a) = 0.522 and P (s | ¬a) = 0.394.
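One way to arrive at these figures (a sketch of the computation, using the counts above
together with the 0.5 probability of either kind of bite; values are rounded):
P (a | c) = 312/416 = 0.75, P (a | ¬c) = 168/924 ≈ 0.18,
P (s | c, a) ≈ 0.6, P (s | ¬c, a) ≈ 0.2, P (s | c, ¬a) ≈ 0.7, P (s | ¬c, ¬a) ≈ 0.3,
P (c | a) = (0.5 · 0.75)/(0.5 · 0.75 + 0.5 · 0.18) ≈ 0.80,
P (s | a) ≈ 0.6 · 0.80 + 0.2 · 0.20 ≈ 0.52,
P (c | ¬a) = (0.5 · 0.25)/(0.5 · 0.25 + 0.5 · 0.82) ≈ 0.23,
P (s | ¬a) ≈ 0.7 · 0.23 + 0.3 · 0.77 ≈ 0.39.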
Thus we see that if we choose a historical victim, in such a way that he has a
50/50 chance of either kind of bite, those who received anti-venom would have a
substantially higher chance of survival. Stan is in the situation of having a 50/50
chance of either sort of bite; however, he is not a historical victim. Since we are
intervening in the decision of whether he receives anti-venom, the computation
above is not germane (as readers of [Pearl 2000] already know) — though we can
easily imagine the doctor making such a mistake. A correct solution is as follows.
Formalizing the relevant parts of the story in a P-log program Π gives
survive, antivenom : boolean.
spider : {creeper, spinner}.
random(spider).
random(survive).
random(antivenom).
pr(spider = creeper) = 0.5.
pr(survive |c spider = creeper, antivenom) = 0.6.
pr(survive |c spider = creeper, ¬antivenom) = 0.7.
pr(survive |c spider = spinner, antivenom) = 0.2.
pr(survive |c spider = spinner, ¬antivenom) = 0.3.
and so, according to our semantics,
PΠ∪{do(antivenom)} (survive) = 0.4
PΠ∪{do(¬antivenom)} (survive) = 0.5
Thus, the correct decision, assuming we want to intervene to maximize Stan’s chance
of survival, is to not administer antivenom.
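These interventional values can be checked by averaging the survival probabilities over
the two equally likely kinds of bite, with antivenom fixed by the intervention rather
than selected at random:
P (survive | do(antivenom)) = 0.5 · 0.6 + 0.5 · 0.2 = 0.4,
P (survive | do(¬antivenom)) = 0.5 · 0.7 + 0.5 · 0.3 = 0.5.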
In order to reach this conclusion by classical probability, we would need to consider
separate probability measures P1 and P2 , on the sets of patients who received or did
not receive antivenom, respectively. If this is done correctly, we obtain P1 (s) = 0.4
and P2 (s) = 0.5, as in the P-log program.
Thus we can get a correct classical solution using separate probability measures.
Note however, that we could also get an incorrect classical solution using separate
measures, since there exist probability measures P̂1 and P̂2 on the sets of histor-
ical bite victims which capture classical conditional probabilities given a and ¬a
respectively. We may define
P̂1 (E) =def P (E ∩ a)/0.3582
P̂2 (E) =def P (E ∩ ¬a)/0.6418
It is well known that each of these is a probability measure. They are seldom seen
only because classical conditional probability gives us simple notations for them in
terms of a single measure capturing common background knowledge. This allows us
to refer to probabilities conditioned on observations without defining a new measure
for each such observation. What we do not have, classically, is a similar mechanism
for probabilities conditioned on intervention — which is sometimes of interest as
the example shows. The ability to condition on interventions in this way has been a
fundamental contribution of Pearl; and the inclusion in P-log of such conditioning-
on-intervention is a direct result of the authors’ reading of his book.
5 Infinite Programs
The definitions given so far for P-log apply only to programs with finite numbers
of random selection rules. In this section we state a theorem which allows us to
extend these semantics to programs which may contain infinitely many random
selection rules. No changes are required from the syntax given in [Baral, Gelfond,
and Rushton 2009], and the probability measure described here agrees with the one
in [Baral, Gelfond, and Rushton 2009] whenever the former is defined.
We begin by defining the class of programs for which the new semantics are
applicable. The reader is referred to [Baral, Gelfond, and Rushton 2009] for the
definitions of causally ordered, unitary, and strict probabilistic levelling.
random(a) : boolean.
pr(a) = 1/2.
pr(¬a) = 2/3.
random(a) : boolean.
random(b) : boolean.
prr (a|c b) = 1/3.
prr (a|c ¬b) = 2/3.
prr (b|c a) = 1/5.
p ← not q.
q ← not p.
since it has two answer sets which arise from circularity of defaults, rather than
random selections. The following program is both unitary and causally ordered, but
not admissible, because atLeastOneT ail depends on infinitely many coin tosses.
Recall that the semantic value of a P-log program Π consists of (1) a set of possible
worlds of Π and (2) a probability measure on those possible worlds. The proposition
now puts us in position to give semantics for programs with infinitely many random
selection rules. The possible worlds of the program are the answer sets of the
associated (infinite) CR-Prolog program, as determined by the usual definition —
while the probability measure is PΠ , as defined in Proposition 8.
Note that the declaration for coin is not written in the current syntax of P-log; but
to save space we use set-builder notation here as a shorthand for the more lengthy
formal declaration. Similarly, the notation ⟨M, N ⟩ is also a shorthand. From this
point on we will write coin(M, N ) instead of coin(⟨M, N ⟩).
Π also contains a declaration to say that the throws are random and the coin is
known to be fair:
random(coin(M, N )).
pr(coin(M, N ) = head) = 1/2.
winnings(0) = 0.
winnings(N + 1) = winnings(N ) − 1 + 2^(N+1) ← win(N ).
winnings(N + 1) = winnings(N ) − 1 ← lose(N ).
winnings(N + 1) = winnings(N ) ← ¬play(N ).
Finally the program contains rules which describe the agent’s strategy in choosing
which games to play. Note that the agent’s expected winnings in the N th game are
given by (1/2^N)(2^(N+1) − 1) − (1 − 1/2^N) = 1, so each game has positive expectation
for the player. Thus a reasonable strategy might be to play every game, represented
as
play(N ).
This completes program Π. It can be shown to be admissible, and hence there is
a unique probability measure PΠ satisfying the conclusion of Proposition 1. Thus,
for example, PΠ (coin(3, 2) = head) = 1/2, and PΠ (win(10)) = 1/2^10. Each of
these probabilities can be computed from finite sub-programs. As a more interesting
example, let S be the set of possible worlds in which the agent wins infinitely many
games. The probability of this event cannot be computed from any finite sub-
program of Π. However, S is a countable intersection of countable unions of sets
whose probabilities are defined by finite subprograms. In particular,
S = ∩_{N=1}^{∞} ∪_{J=N}^{∞} {W | win(J) ∈ W }
and, for every N ,
PΠ (S) ≤ PΠ ( ∪_{J=N}^{∞} {W | win(J) ∈ W } )
      ≤ ∑_{J=N}^{∞} PΠ ({W | win(J) ∈ W })
      = ∑_{J=N}^{∞} 1/2^J
      = 1/2^(N−1).
Now since the right-hand side can be made arbitrarily small by choosing a sufficiently
large N , it follows that PΠ (S) = 0. Consequently, with probability 1, our agent
will lose all but finitely many of the games he plays. Since he loses one dollar per
play indefinitely after his final win, his winnings converge to −∞ with probability
1, even though each of his wagers has positive expectation!
Acknowledgement
The first author was partially supported in this research by iARPA.
References
Balduccini, M. (2007). CR-MODELS: An inference engine for CR-Prolog. In
C. Baral, G. Brewka, and J. Schlipf (Eds.), Proceedings of the 9th International
Conference on Logic Programming and Nonmonotonic Reasoning (LPNMR 2007).
21
On Computers Diagnosing Computers
MOISES GOLDSZMIDT
1 Introduction
I came to UCLA in the fall of 1987 and immediately enrolled in the course titled
“Probabilistic Reasoning in Intelligent Systems” where we, as a class, went over the
draft of Judea’s book of the same title [Pearl 1988]. The class meetings were fun
and intense. Everybody came prepared, having read the draft of the appropriate
chapter and having struggled through the list of homework exercises that were due
that day. There was a high degree of discussion and participation, and I was very
impressed by Judea’s attentiveness and interest in our suggestions. He was fully
engaged in these discussions and was ready to incorporate our comments and change
the text accordingly. The following year, I was a teaching assistant (TA) for that
class. The tasks involved with being a TA gave me a chance to rethink and really
digest the contents of the course. It dawned on me then what a terrific insight
Judea had to focus on formalizing the notion of conditional independence: All the
“juice” he got in terms of making “reasoning under uncertainty” computationally
effective came from that formalization. Shortly thereafter, I had a chance to chat
with Judea about these and related thoughts. I was in need of formalizing a notion
of “relevance” for my own research and thought that I could adapt some ideas from
the graphoid models [Pearl 1988]. On that occasion Judea shared another of his
great insights with me. After hearing me out, Judea said one word: “causality”.
I don’t remember the exact words he used to elaborate, but the gist of what he
said to me was: “we as humans perform extraordinarily complex reasoning tasks,
being able to select the relevant variables, circumscribe the appropriate context,
and reduce the number of factors that we should manipulate. I believe that our
intuitive notions of causality enable us to do so. Causality is the holy grail [for
Artificial Intelligence]”.
In this short note, I would like to pay tribute to Judea’s scientific work by specu-
lating on the very realistic possibility of computers using his formalization of causal-
ity for automatically performing a nontrivial reasoning task commonly reserved for
humans: namely, designing, generating, and executing experiments in order to con-
duct a proper diagnosis and identify the causes of performance problems on code
being executed in large clusters of computers. What follows in the next two sections
is not a philosophical exposition on the meaning of “causality” or on the reasoning
powers of automatons. It is rather a brief description of the current state of the art
There has been a recent research surge in systems directed at providing program-
mers with the ability to write efficient parallel and distributed applications [Hadoop
2008; Dean and Ghemawat 2004; Isard et al. 2007]. Programs written in these envi-
ronments are automatically parallelized and executed on large clusters of commodity
machines. The task of enabling programmers to effectively write and deploy parallel
and distributed applications has of course been a long-standing problem. Yet,
the relatively recent emergence of large-scale internet services, which depend on
clusters of hundreds of thousands of general-purpose servers, has given the area
a forceful push. Indeed, this is not merely an academic exercise; code written in
these environments has been deployed and is very much in everyday use at com-
panies such as Google, Microsoft, and Yahoo (and many others). These programs
process web pages in order to feed the appropriate data to the search and news
summarization engines; render maps for route planning services; and update usage
and other statistics from these services. Year-old figures estimate that Dryad, the
specific such environment created at Microsoft [Isard et al. 2007], is used to crunch
on the order of a petabyte a day at Microsoft. In addition, in our lab at Microsoft
Research, a cluster of 256 machines controlled by Dryad runs daily at 100% uti-
lization. This cluster mostly runs tests and experiments on research algorithms in
machine learning, privacy, and security that process very large amounts of data.
The intended model in Dryad is for the programmer to build code as if she were
programming one computer. The system then takes care of a) distributing the code
to the actual cluster and b) managing the execution of the code in the cluster. All
aspects of execution, including data partition, communications, and fault tolerance,
are the responsibility of Dryad.
With these new capabilities comes the need for new tools for debugging code,
profiling execution performance, and diagnosing system faults. By the mere fact
that clusters of large numbers of computers are being employed, rare bugs will
manifest themselves more often, and devices will fail in more runs (due to both
software and hardware problems). In addition, as the code will be executed in a
networked environment and the data will be partitioned (usually according to some
hash function), communication bandwidth, data location, contention for shared
disks, and data skewness will impact the performance of the programs. Most of
the time, the impact of these factors will be hard to reproduce on a single machine,
making it an imperative that the diagnosis, profiling, and debugging be performed
in the same environment and conditions as those in which the code is running.
References
Blake, R. and J. Breese (1995). Automatic bottleneck detection. In UAI’95: Pro-
ceedings of the Conference on Uncertainty in Artificial Intelligence.
Breese, J. and D. Heckerman (1996). Decision theoretic troubleshooting. In
UAI’96: Proceedings of the Conference on Uncertainty in Artificial Intelli-
gence.
Cohen, I., M. Goldszmidt, T. Kelly, J. Symons, and J. Chase (2004). Correlating
instrumentation data to systems states: A building block for automated diag-
nosis and control. In OSDI’04: Proceedings of the 6th conference on Sympo-
sium on Operating Systems Design & Implementation. USENIX Association.
Creţu-Ciocârlie, G. F., M. Budiu, and M. Goldszmidt (2008). Hunting for prob-
lems with Artemis. In USENIX Workshop on the Analysis of System Logs
(WASL).
Dean, J. and S. Ghemawat (2004). Mapreduce: simplified data processing on
large clusters. In OSDI’04: Proceedings of the 6th conference on Symposium
on Operating Systems Design & Implementation. USENIX Association.
Hadoop (2008). The hadoop project. http://hadoop.apache.org.
Isard, M., M. Budiu, Y. Yu, A. Birrell, and D. Fetterly (2007). Dryad: dis-
tributed data-parallel programs from sequential building blocks. In EuroSys
’07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on
Computer Systems 2007. ACM.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plau-
sible Inference. Morgan Kaufmann.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge
Univ. Press.
22
Overthrowing the Tyranny of Null Hypotheses
Hidden in Causal Diagrams
SANDER GREENLAND
1 Introduction
Graphical models have a long history before and outside of causal modeling.
Mathematical graph theory extends back to the 1700s and was used for circuit analysis in
the 19th century. Its application in probability and computer science dates back at least to
the 1960s (Biggs et al., 1986), and by the 1980s graphical models had become fully
developed tools for these fields (e.g., Pearl, 1988; Hajek et al., 1992; Lauritzen, 1996).
As Bayesian networks, graphical models are carriers of direct conditional independence
judgments, and thus represent a collection of assumptions that confine prior support to a
lower dimensional manifold of the space of prior distributions over the nodes. Such
dimensionality reduction was recognized as essential in formulating explicit and
computable algorithms for digital-machine inference, a central task of artificial-
intelligence (AI) research. By the 1990s, these models had been merged with causal path
diagrams long used in observational health and social science (OHSS) (Wright, 1934;
Duncan, 1975), resulting in a formal theory of causal diagrams (Spirtes et al., 1993;
Pearl, 1995, 2000).
It should be no surprise that some of the most valuable and profound contributions to
these developments were from Judea Pearl, a renowned AI theorist. He motivated causal
diagrams as causal Bayesian networks (Pearl, 2000), in which the basis for the
dimensionality reduction is grounded in judgments of causal independence (and
especially, autonomy) rather than mere probabilistic independence. Beyond his extensive
technical and philosophical contributions, Pearl fought steadfastly to roll back prejudice
against causal modeling and causal graphs in statistics. Today, only a few statisticians
still regard causality as a metaphysical notion to be banned from formal modeling (Lad,
1999). While a larger minority still reject some aspects of causal-diagram or potential-
outcome theory (e.g., Dawid, 2000, 2008; Shafer, 2002), the spreading wake of
applications displays the practical value of these theories, and formal causal diagrams
have advanced into applied journals and books (e.g., Greenland et al., 1999; Cole and
Hernán, 2002; Hernán et al., 2002; Jewell, 2004; Morgan and Winship, 2007; Glymour
and Greenland, 2008) – although their rapid acceptance in OHSS may well have been
facilitated by the longstanding informal use of path diagrams to represent qualities of
causal systems (e.g., Susser, 1973; Duncan, 1975).
Graphs are unsurpassed tools for illustrating certain mathematical results that hold in
functional systems (whether stochastic or not, or causal or not). Nonetheless, it is
essential to recognize that many if not most causal judgments in OHSS are based on
observational (purely associational) data, with little or nothing in the way of manipulative
(or “surgical”) experiment to test these judgments. Time order is usually known, which
insures that the chosen arrow directions are correct; but rarely is there a sound basis for
deleting an arrow, leaving autonomy in question. When all empirical constraints encoded
by the causal network come from passive frequency observations rather than
experiments, the primacy of causal independence judgments has to be questioned. In
these situations (which characterize observational research), we should not neglect
associational models (including graphs) that encode frequency-based judgments, for these
models may be all that are identified by available data. Indeed, a deep philosophical
commitment to statistically identified quantities seems to drive the arguments of certain
critics of potential outcomes and causal diagrams (Dawid, 2000, 2008). Even if we reject
this philosophy, however, we should retain the distinction between levels of identification
provided by our data, for even experimental data will not identify everything we would
like to know.
I will argue that, in some ways, the distinction of nonidentification from identification
is as fundamental to modeling and statistical inference about causal effects as is the
distinction of causation from association (Gustafson, 2005; Greenland, 2005a, 2009a,
2009b). Indeed, I believe that some of the controversy and confusion over causation
versus association stems from the inability of statistical observations to point identify
(consistently estimate) many of the causal parameters that astute scientists legitimately
ask about. Furthermore, if we consider strategies that force identification from available
data (such as node or arrow deletions from graphical models) we will find that
identification may arise only by declaring some types of joint frequencies as justifying
the corresponding conditional independence assumptions. This leads directly into the
complex topic of pruning algorithms, including the choice of target or loss function.
I will outline these problems in their most basic forms, for I think that in the rush to
adopt causal diagrams some realism has been lost by neglecting problems of
nonidentification and pruning. My exposition will take the form of a series of vignettes
that illustrate some basic points of concern. I will not address equally important concerns
that many of the nodes offered as “treatments” may be ill-defined or nonmanipulable, or
may correspond poorly to the treatments they ostensibly represent (Greenland, 2005b;
Hernán, 2005; Cole and Frangakis, 2009; VanderWeele, 2009).
algorithm was employed by Spirtes et al. (1993), unfortunately with a very naïve
Neyman-Pearsonian criterion (basically, allowing removal of arrows when a P-value
exceeds an α level). These and related graphical algorithms (Pearl and Verma, 1991)
produce what appear to be results in conflict with practical intuitions, namely causal
“discovery” algorithms for single observational data sets, with no need for experimental
evidence. These algorithms have been criticized philosophically on grounds related to the
identification problem (Freedman and Humphreys, 1999; Robins and Wasserman,
1999ab), and there are also objections based on statistical theory (Robins et al., 2003).
One controversial assumption in these algorithms is faithfulness (or stability), which requires that all
connected nodes are associated. Although arguments have been put forward in its favor
(e.g., Spirtes et al., 1993; Pearl, 2000, p. 63), this assumption coheres poorly with prior
beliefs of some experienced researchers. Without faithfulness, two nodes may be
independent even if there is an arrow linking them directly, if that arrow represents the
presence of causal effects among units in a target population. A classic example of such
unfaithfulness appeared in the debates between Fisher and Neyman in the 1930s, in
which they disagreed on how to formulate the causal null hypothesis (Senn, 2004). The
framework of their debate would be recognized today as the potential-outcome or
counterfactual model, although in that era the model (when named) was called the
randomization model. This model illustrates the benefit of randomization as a means of
detecting a signal by injecting white noise into a system to drown out uncontrolled
influences.
To describe the model, suppose we are to study the effect of a treatment X on an
outcome Yobs observable on units in a specific target population. Suppose further we can
fully randomize X, so X will equal the randomized node R. In the potential-outcome
formulation, the outcome becomes a vector Y indexed by X. Specifically, X determines
which component Yx of Y is observable conditional on X=x: Yobs = Yx given X=x. To say
X can causally affect a unit makes no reference to observation, however; it merely means
that some components of Y are unequal. With a binary treatment X and a binary
outcome, the binary potential-outcome vector Y defines four types of units in the target
population (Copas, 1973):
1) Noncausal units with outcomes Y=(1,1) under X=1,0 (“doomed” to Yobs=1);
2) Causal units with outcomes Y=(1,0) under X=1,0 (X=1 causes Yobs=1);
3) Causal units with outcomes Y=(0,1) under X=1,0 (X=1 prevents Yobs=1); and
4) Noncausal units with outcomes Y=(0,0) under X=1,0 (“immune” to Yobs=1).
Suppose the proportion of type i in the trial population is pi. There are now two null
hypotheses:
Hs: There are no causal units: p2=p3=0 (sharp or strong null),
Hw: There is no net effect of treatment on the distribution of Yobs: p2=p3 (weak null).
Under the randomization distribution we have
E(Yobs|X=1) = Pr(Yobs=1|do[X=1]) = Pr(Y1=1) = p1+p2 and
E(Yobs|X=0) = Pr(Yobs=1|do[X=0]) = Pr(Y0=1) = p1+p3;
hence Hw: p2=p3 is equivalent to the hypothesis that the expected outcome is the same for
both treatment groups, and that the proportions with Yobs=1 under the extreme population
intervention do[X=1] to every unit and do[X=0] to every unit are equal. Note however
that only Hs entails that the proportion with Yobs=1 would be the same under every
possible allocation of treatment X among the units; this property implies that the Y
margin is fixed under Hs, and thus provides a direct causal rationale for Fisher’s exact test
of Hs (Greenland, 1991).
Hs also entails Hw (or, in terms of parameter subspaces, Hs ⊂ Hw). The converse is
false; but, under any of the “optimal” statistical tests that can be formulated from data on
X and Yobs only, power is identical to the test size on all alternatives to the sharp null with
p2=p3, i.e., Hs is not identifiable within Hw, so within Hw the power of any valid test of Hs
will not exceed its nominal alpha level. Thus, following Neyman, it is only relevant to
think in terms of Hw, because Hw could be rejected whenever Hs could be rejected.
Furthermore, some later authors would disallow Hw – Hs: p2 = p3 ≠ 0 because it violates
faithfulness (Spirtes et al., 2001) or because it represents an extreme treatment-by-unit
interaction with no main effect (Senn, 2004).
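The nonidentifiability of Hs within Hw can be made concrete with a small numerical sketch (Python; all proportions are invented for illustration). One hypothetical population satisfies the sharp null, the other has p2 = p3 = 0.2, yet both induce exactly the same randomization distribution for (X, Yobs), so no test based on X and Yobs alone can tell them apart.

```python
# Hypothetical populations of unit types (p1, p2, p3, p4):
# "doomed", causal (X=1 causes Yobs=1), preventive (X=1 prevents Yobs=1), "immune".
sharp_null = (0.3, 0.0, 0.0, 0.7)   # Hs: no causal units at all
weak_null  = (0.1, 0.2, 0.2, 0.5)   # Hw only: p2 = p3 = 0.2, causal units present

def observed_risks(p):
    """Return Pr(Yobs=1 | X=1) and Pr(Yobs=1 | X=0) under full randomization of X."""
    p1, p2, p3, p4 = p
    return p1 + p2, p1 + p3          # E(Yobs|X=1) = p1+p2,  E(Yobs|X=0) = p1+p3

print(observed_risks(sharp_null))    # (0.3, 0.3)
print(observed_risks(weak_null))     # (0.3, 0.3): identical observable distribution
```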
There is also a Bayesian argument for focusing exclusively on Hw. Hw is of Lebesgue
measure zero, so under the randomization model, distinctions within Hw can be ignored
by inferences based on an absolutely continuous prior on p = (p1,p2,p3) (Spirtes et al.,
1993). More generally, any distinction that remains a posteriori can be traced to the prior.
A more radical stance would dismiss both Hs and the model defined by 1-4 above as
“metaphysical,” because it invokes constraints on the joint distribution of the components
Y1 and Y0, and that joint distribution is not identified by randomization of X if only X
and Yobs are observed (Dawid, 2000).
On the other hand, following Fisher one can argue that the null of key scientific and
practical interest is Hs, and that Hw − Hs: p2 = p3 ≠ 0 is a scientifically important and
distinct hypothesis. For instance, p2>0, p3>0 entails the existence of units who should be
treated quite differently, and provides an imperative to seek covariates that discriminate
between the two causal types, even if p2=p3. Furthermore, rejection of the stronger Hs is a
weaker inference than rejection of the weaker Hw, and thus rejecting only Hs would be a
conservative interpretation of a “significant” test statistic. Thus, focusing on Hs is
compatible with a strictly falsificationist view of testing in which acceptance of the null is
disallowed. Finally, there are real examples in which X=1 causes Y=1 in some units and
causes Y=0 in others; in some of these cases there may be near-perfect balance of
causation and prevention, as predicted by certain physical explanations for the
observations (e.g., as in Neutra et al., 1980).
To summarize, identification problems arose in the earliest days of formal causal
modeling, even when considering only the simplest of trials. Those problems pivoted not
on whether one should attempt formal modeling of causation as distinct from association,
but rather on what could be identified by standard experimental designs. In the face of
limited (and limiting) design strategies, these problems initiated a long history of
attempts to banish identification problems based on idealized inference systems and
absolute philosophical assertions. But a counter-tradition of arguments, both practical and
philosophical, has regarded identification problems as carriers of valuable scientific
information: They are signs of study limitations which need to be recognized and can
p(u,c,x,y,c*,x*,y*,s) =
p(u)p(c|u)p(x|u,c)p(y|u,c,x)p(c*|u,c,x,y)p(x*|u,c,x,y)p(y*|u,c,x,y)p(s|u,c,x,y,c*,x*,y*),
which involves both S=0 events (not selected) and S=1 events (selected), i.e., the
lowercase “s” is used when S can be either 0 or 1.
The marginal (total-population) potential-outcome distribution for Y after intervention
on X, p(yx), equals p(y|do[X=x]), which under fig. 2 equals the standardized (mixing)
distribution of Y given X standardized to (weighted by or mixed over) p(c,u) =
p(c|u)p(u):
p(yx) = p(y|do[x]) = Σu,c p(y|u,c,x)p(c|u)p(u).
This estimand involves only three factors in the decomposition, but none of them are
identified if U is unobserved and no further assumptions are made. Analysis of the causal
estimand p(yx) must somehow relate it to the observed distribution p(c*,x*,y*|S=1) using
known or estimable quantities, or else remain purely speculative (i.e., a sensitivity
analysis).
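As a purely illustrative sketch (Python; the probabilities and function names are invented, not taken from any study), the following enumerates a binary toy version of the factorization displayed above, computes p(y=1|do[x=1]) by the standardization formula, and contrasts it with p(y*=1|x*=1, S=1), the association actually available from the selected, mismeasured data.

```python
from itertools import product

def bern(q, v):
    """Probability that a binary variable with success probability q takes value v."""
    return q if v == 1 else 1.0 - q

# Invented conditionals following the factorization above (all variables binary).
def p_u(u):            return bern(0.4, u)
def p_c(c, u):         return bern(0.7 if u else 0.2, c)
def p_x(x, u, c):      return bern(0.2 + 0.5 * c + 0.2 * u, x)
def p_y(y, u, c, x):   return bern(0.1 + 0.2 * x + 0.3 * u + 0.1 * c, y)
def p_cs(cs, c):       return bern(0.9 if c else 0.1, cs)     # mismeasured C*
def p_xs(xs, x):       return bern(0.85 if x else 0.15, xs)   # mismeasured X*
def p_ys(ys, y):       return bern(0.95 if y else 0.05, ys)   # mismeasured Y*
def p_s1(u, c, x, y, cs, xs, ys):                             # Pr(S=1 | everything)
    return 0.9 if (xs == 1 or y == 0) else 0.4

# Causal estimand by standardization: p(y=1|do[x=1]) = sum_{u,c} p(y=1|u,c,1) p(c|u) p(u)
p_do1 = sum(p_y(1, u, c, 1) * p_c(c, u) * p_u(u) for u, c in product((0, 1), repeat=2))

# What the data deliver: p(y*=1 | x*=1, S=1), obtained by enumerating the full joint.
num = den = 0.0
for u, c, x, y, cs, xs, ys in product((0, 1), repeat=7):
    joint = (p_u(u) * p_c(c, u) * p_x(x, u, c) * p_y(y, u, c, x)
             * p_cs(cs, c) * p_xs(xs, x) * p_ys(ys, y)
             * p_s1(u, c, x, y, cs, xs, ys))
    if xs == 1:
        den += joint
        num += joint * ys

print(round(p_do1, 3), round(num / den, 3))  # causal estimand vs. selected, mismeasured association
```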
It is a long, hard road from p(c*,x*,y*|S=1) to p(yx), much longer than the current
“causal inference” literature often makes it look. To appreciate the distance, rewrite the
summand of the standardization formula for p(yx) as an inverse-probability-weighted
(IPW) term derived from an observation (c*,x*,y*|S=1): From fig. 2,
p(y|u,c,x)p(c|u)p(u) =
p(c*,x*,y*|S=1)p(S=1)p(u,c,x,y|c*,x*,y*,S=1) /
[p(x|u,c)p(c*|u,c,x,y)p(x*|u,c,x,y)p(y*|u,c,x,y)p(S=1|u,c,x,y,c*,x*,y*)].
The latter expression includes
1) the exposure dependence on its parents, p(x|u,c);
2) the measurement distributions p(c*|u,c,x,y), p(x*|u,c,x,y), p(y*|u,c,x,y); and
3) the fully conditioned selection probability p(S=1|u,c,x,y,c*,x*,y*).
The absence of effects corresponding to 1−3 from graphs offered as “causal” suggests
that “causal inference” from observational data using formal causal models remains a
theoretical and largely speculative exercise (albeit often presented without explicit
acknowledgement of that fact).
When adjustments for these effects are attempted, we are usually forced to use crude
empirical counterparts of terms like those in 1−3, with each substitution demanding
nonidentified assumptions. Consider that, for valid inference under figure 2,
1) Propensity scoring and IPW for treatment need p(x|u,c), but all we get from data
is p(x*|c*). Absence of u and c is usually glossed over by assuming “no
unmeasured confounders” or “no residual confounding.” These are not credible
assumptions in OHSS.
2) IPW for selection and censoring needs p(S=1|u,c,x,y,c*,x*,y*), but usually the
most we get from a cohort study or nested study is p(S=1|c*,x*). We do not even
get that much in a case-control study.
3) Measurement-error correction needs conditional distributions from
p(c*,x*,y*,u,c,x,y|S=1), but even when a “validation” study is done, we obtain
only alternative measurements c†,x†,y† (which are rarely error-free) on a tiny and
This statement describes Bayes factors (Good, 1983) conditioned on the model. That model may
include an unknown parameter that indexes a finite number of submodels scattered over some high-
dimensional subspace, in which case the Bayesian analysis is called “model averaging,” usually
with an implicit uniform prior over the models. Model averaging may also operate over continuous
parameters via priors on those parameters.
process behaves, no statistical analysis can provide more, no matter how much causal
modeling is done.
5 Predictive Analysis
If current models for observed-data generators (whether logistic, structural, or propensity
models) can’t be taken seriously as “causal”, what can we make of their outputs? It is
hard to believe the usual excuses offered for regression outputs (e.g., that they are
“descriptive”) when the fitted model is asserted to be causal or “structural.” Are we to
consider the outputs of (say) an IPW-fitted MSM to be some sort of data summary? Or
will it function as some kind of optimal predictor of outcomes in a purely predictive
context? No serious case has been made for causal models in either role, and it seems that
some important technical improvements are needed before causal modeling methods
become credible predictive tools.
Nonetheless, graphical models remain useful (and might be less misleading) even when
they are not “causal,” serving instead as mere carriers of conditional independence
assumptions within a time-ordered framework. In this usage, one may still employ
presumed causal independencies as prior judgments for specification. In particular, for
predictive purposes, some or all of the arrows in the graph may retain informal causal
interpretations; but they may be causally wrong, and yet the graph can still be correct for
predictive purposes.
In this regard, most of the graphical modeling literature in statistics imposes little in the
way of causal burden on the graph, as when graphs are used as influence diagrams, belief
and information networks, and so on without formal causal interpretation (that is, without
representing a formal causal model, e.g., Pearl, 1988; Hajek et al., 1992; Cox and
Wermuth, 1996; Lauritzen, 1996). DAG rules remain valid for prediction if the absence
of an open path from X to Y is interpreted as entailing X ⫫ Y, or equivalently if the
absence of a directed path from X to Y (in causal terms, X is not a cause of Y;
equivalently, Y is not affected by X) is interpreted as entailing X ⫫ Y | paX, the noncausal
Markov condition (where paX is the set of parents of X). In that case, X→Y can be used
in the graph even if X has no effect on Y, or vice-versa.
As an example, suppose X and Y are never observed without them affecting selection
S, as when X affects miscarriage S and Y is congenital malformation. If the target
population is births, X predicts malformations Y among births (which have S=1). As
another example, suppose X and Y are never observed without an uncontrolled,
ungraphed confounder U, as when X is diet and Y is health status. If one wishes to target
those at high risk for screening or actuarial purposes it does not matter if X→Y
represents a causally confounded relation. Lack of a directed path from X to Y now
corresponds to lack of additional predictive value for Y from X given paX. Arrow
directions in temporal (time-ordered) predictive graphs correspond to point priors about
time order, just as they do in causal graphs.
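The diet example can be made concrete with a minimal numerical sketch (Python; all probabilities invented). Here X has no effect on Y at all, yet an ungraphed confounder U makes X a strong predictor of Y, so keeping the arrow X→Y in a purely predictive graph is harmless for p(Y|X) even though it would badly mislead anyone reading the arrow causally.

```python
from itertools import product

def bern(q, v):
    return q if v == 1 else 1.0 - q

# Invented binary example: U confounds X and Y, and X has NO effect on Y.
def p_u(u):        return bern(0.5, u)
def p_x(x, u):     return bern(0.8 if u else 0.2, x)
def p_y(y, u, x):  return bern(0.7 if u else 0.1, y)   # note: x does not appear

def joint(u, x, y):
    return p_u(u) * p_x(x, u) * p_y(y, u, x)

# Predictive quantity p(Y=1 | X=x): useful for screening, despite no causal effect.
for x in (0, 1):
    num = sum(joint(u, x, 1) for u in (0, 1))
    den = sum(joint(u, x, y) for u, y in product((0, 1), repeat=2))
    print("p(Y=1 | X=%d) =" % x, round(num / den, 3))

# Causal quantity p(Y=1 | do[X=x]) = sum_u p(Y=1|u,x) p(u): identical for x = 0 and x = 1.
for x in (0, 1):
    print("p(Y=1 | do[X=%d]) =" % x,
          round(sum(p_y(1, u, x) * p_u(u) for u in (0, 1)), 3))
```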
Of course, if misinterpreted as causal, predictive inferences from graphs (or any
predictive modeling) may be potentially disastrous for judging interventions on X. But, in
OHSS, the causality represented by a directed path in a so-called causal diagram rarely
where much background information is available) are not well represented by the
dimensions constrained by the model, considerable efficiency can be lost for estimating
parameters of interest. A striking example given by Whittemore and Keller (1986)
displayed the poor small-sample performance for estimating a survival curve when using
an unsmoothed nonparametric hazard estimator (Kaplan-Meier or Nelson-Altschuler
estimation), relative to spline smoothing of the hazard.
Here then is a final challenge to the “causal modeling” literature: If we know our
observations are just a dim and distant projection of the causal structure and we can only
identify associations among observed quantities, how can we interpret the outputs of
“structural modeling” (such as confidence limits for ostensibly causal estimands which
are not in fact identified) as data summaries? We should want to see answers that are
sensible when the targets are effects in a context at least as complex as in fig. 2.
Despite my view and similar ones (e.g., Gustafson, 2005), I suspect the bulk of causal-
inference statistics will trundle on relying exclusively on artificially identified models. It
will thus be particularly important to remember that just because a method is labeled a
“causal modeling” method does not mean it gives us estimates and tests of actual causal
effects. For those who find identification too hard to abandon in formal analysis, the only
honest recourse is to separate identified and nonidentified components of the model,
focus technique on the identified portion, and leave the latent residual as a topic for
sensitivity analysis, speculative modeling, and further study. In this task, graphs can be
used without the burden of causality if we allow them a role as pure prediction tools, and
they can also be used as causal diagrams of the largely latent structure that generates the
data.
References
Biggs, N., Lloyd, E. and Wilson, R. (1986). Graph Theory, 1736-1936. Oxford University Press.
Box, G.E.P. (1980). Sampling and Bayes inference in scientific modeling and robustness. Journal of the Royal Statistical Society, Series A 143, 383–430.
Cole, S.R. and M.A. Hernán (2002). Fallibility in estimating direct effects. International Journal of Epidemiology 31, 163–165.
Cole, S.R. and C.E. Frangakis (2009). The consistency assumption in causal inference: a definition or an assumption? Epidemiology 20, 3–5.
Cole, S.R., L.P. Jacobson, P.C. Tien, L. Kingsley, J.S. Chmiel and K. Anastos (2010). Using marginal structural measurement-error models to estimate the long-term effect of antiretroviral therapy on incident AIDS or death. American Journal of Epidemiology 171, 113-122.
Copas, J.G. (1973). Randomization models for matched and unmatched 2x2 tables. Biometrika 60, 267-276.
Cox, D.R. and N. Wermuth (1996). Multivariate Dependencies: Models, Analysis and Interpretation. Boca Raton, FL: CRC/Chapman and Hall.
Dawid, A.P. (2000). Causal inference without counterfactuals (with discussion). Journal of the American Statistical Association 95, 407-448.
Dawid, A.P. (2008). Beware of the DAG! In: NIPS 2008 Workshop Causality: Objectives and Assessment. JMLR Workshop and Conference Proceedings.
Duncan, O.D. (1975). Introduction to Structural Equation Models. New York: Academic Press.
Fisher, R.A. (1943; reprinted 2003). Note on Dr. Berkson’s criticism of tests of significance. Journal of the American Statistical Association 38, 103–104. Reprinted in the International Journal of Epidemiology 32, 692.
Freedman, D.A. and Humphreys, P. (1999). Are there algorithms that discover causal
structure? Synthese 121, 29–54.
Glymour, M.M. and S. Greenland (2008). Causal diagrams. Ch. 12 in: Rothman, K.J., S.
Greenland and T.L. Lash, eds. Modern Epidemiology, 3rd ed. Philadelphia: Lippincott.
Good, I.J. (1983). Good thinking. Minneapolis: U. Minnesota Press.
Greenland, S. (1990). Randomization, statistics, and causal inference. Epidemiology 1,
421-429.
Greenland, S. (1991). On the logical justification of conditional tests for two-by-two-
contingency tables. The American Statistician 45, 248-251.
Greenland, S. (1993). Summarization, smoothing, and inference. Scandinavian Journal of
Social Medicine 21, 227-232.
Greenland, S. (1998). The sensitivity of a sensitivity analysis. In: 1997 Proceedings of the
Biometrics Section. Alexandria, VA: American Statistical Association, 19-21.
Greenland, S. (2005a). Epidemiologic measures and policy formulation: Lessons from
potential outcomes (with discussion). Emerging Themes in Epidemiology (online
journal) 2:1–4. (Originally published as “Causality theory for policy uses of
epidemiologic measures,” Chapter 6.2 in: Murray, C.J.L., J.A. Salomon, C.D. Mathers
and A.D. Lopez, eds. (2002) Summary Measures of Population Health. Cambridge,
MA: Harvard University Press/WHO, 291-302.)
Greenland, S. (2005b). Multiple-bias modeling for analysis of observational data (with
discussion). Journal of the Royal Statistical Society, Series A 168, 267–308.
Greenland, S. (2009a). Bayesian perspectives for epidemiologic research. III. Bias
analysis via missing-data methods. International Journal of Epidemiology 38, 1662–
1673.
Greenland, S. (2009b). Relaxation penalties and priors for plausible modeling of
nonidentified bias sources. Statistical Science 24, 195-210.
Greenland, S. (2009c). Dealing with uncertainty about investigator bias: disclosure is
informative. Journal of Epidemiology and Community Health 63, 593-598.
Greenland, S. (2010). The need for syncretism in applied statistics (comment on “The
future of indirect evidence” by Bradley Efron). Statistical Science 25, in press.
Greenland, S., J. Pearl, and J.M. Robins (1999). Causal diagrams for epidemiologic
research. Epidemiology 10, 37-48.
Greenland, S., M. Gago-Dominguez, and J.E. Castellao (2004). The value of risk-factor
("black-box") epidemiology (with discussion). Epidemiology 15, 519-535.
Gustafson, P. (2005). On model expansion, model contraction, identifiability, and prior
information: two illustrative scenarios involving mismeasured variables (with
discussion). Statistical Science 20, 111-140.
Hajek, P., T. Havranek and R. Jirousek (1992). Uncertain Information Processing in
Expert Systems. Boca Raton, FL: CRC Press.
Hastie, T., R. Tibshirani and J. Friedman (2009). The elements of statistical learning:
Data mining, inference, and prediction, 2nd ed. New York: Springer.
23. Actual Causation and the Art of Modeling
Joseph Y. Halpern and Christopher Hitchcock
1 Introduction
In The Graduate, Benjamin Braddock (Dustin Hoffman) is told that the future can
be summed up in one word: “Plastics”. One of us (Halpern) recalls that in roughly
1990, Judea Pearl told him that the future was in causality. Pearl’s own research
was largely focused on causality in the years after that; his seminal contributions
are widely known. We were among the many influenced by his work. We discuss one
aspect of it, actual causation, in this article, although a number of our comments
apply to causal modeling more generally.
Pearl introduced a novel account of actual causation in Chapter 10 of Causality,
which was later revised in collaboration with one of us [Halpern and Pearl 2005].
In some ways, Pearl’s approach to actual causation can be seen as a contribution to
the philosophical project of trying to analyze actual causation in terms of counter-
factuals, a project associated most strongly with David Lewis [1973a]. But Pearl’s
account was novel in at least two important ways. The first was his use of struc-
tural equations as a tool for modeling causality. In the philosophical literature,
causal structures were often represented using so-called neuron diagrams, but these
are not (and were never intended to be) all-purpose representational tools. (See
[Hitchcock 2007b] for a detailed discussion of the limitations of neuron diagrams.)
We believe that the lack of a more adequate representational tool had been a se-
rious obstacle to progress. Second, while the philosophical literature on causality
has focused almost exclusively on actual causality, for Pearl, actual causation was
a rather specialized topic within the study of causation, peripheral to many issues
involving causal reasoning and inference. Thus, Pearl’s work placed the study of
actual causation within a much broader context.
The use of structural equations as a model for causal relationships was well
known long before Pearl came on the scene; it seems to go back to the work of
Sewall Wright in the 1920s (see [Goldberger 1972] for a discussion). However, the
details of the framework that have proved so influential are due to Pearl. Besides
the Halpern-Pearl approach mentioned above, there have been a number of other
closely-related approaches for using structural equations to model actual causation;
see, for example, [Glymour and Wimberly 2007; Hall 2007; Hitchcock 2001; Hitch-
cock 2007a; Woodward 2003]. The goal of this paper is to look more carefully at
the modeling of causality using structural equations. For definiteness, we use the
Halpern-Pearl (HP) version [Halpern and Pearl 2005] here, but our comments apply
equally well to the other variants.
It is clear that the structural equations can have a major impact on the conclu-
sions we draw about causality—it is the equations that allow us to conclude that
lower air pressure is the cause of the lower barometer reading, and not the other
way around; increasing the barometer reading will not result in higher air pressure.
The structural equations express the effects of interventions: what happens to the
bottle if it is hit with a hammer; what happens to a patient if she is treated with
a high dose of the drug, and so on. These effects are, in principle, objective; the
structural equations can be viewed as describing objective features of the world.
However, as pointed out by Halpern and Pearl [2005] and reiterated by others [Hall
2007; Hitchcock 2001; Hitchcock 2007a], the choice of variables and their values can
also have a significant impact on causality. Moreover, these choices are, to some
extent, subjective. This, in turn, means that judgments of actual causation are
subjective.
Our view of actual causation being at least partly subjective stands in contrast to
the prevailing view in the philosophy literature, where the assumption is that the job
of the philosopher is to analyze the (objective) notion of causation, rather like that
of a chemist analyzing the structure of a molecule. This may stem, at least in part,
from failing to appreciate one of Pearl’s lessons: actual causality is only part of the
bigger picture of causality. There can be an element of subjectivity in ascriptions
of actual causality without causation itself being completely subjective. In any
case, the experimental evidence certainly suggests that people’s views of causality
are subjective, even when there is no disagreement about the relevant structural
equations. For example, a number of experiments show that broadly normative
considerations, including the subject’s own moral beliefs, affect causal judgment.
(See, for example, [Alicke 1992; Cushman 2009; Cushman, Knobe, and Sinnott-
Armstrong 2008; Hitchcock and Knobe 2009; Knobe and Fraser 2008].) Even in
relatively non-controversial cases, people may want to focus on different aspects
of a problem, and thus give different answers to questions about causality. For
example, suppose that we ask for the cause of a serious traffic accident. A traffic
engineer might say that the bad road design was the cause; an educator might
focus on poor driver’s education; a sociologist might point to the pub near the
highway where the driver got drunk; a psychologist might say that the cause is the
driver’s recent breakup with his girlfriend.1 Each of these answers is reasonable.
By appropriately choosing the variables, the structural equations framework can
accommodate them all.
Note that we said above “by appropriately choosing the variables”. An obvious
question is “What counts as an appropriate choice?”. More generally, what makes
a model an appropriate model? While we do want to allow for subjectivity, we need
to be able to justify the modeling choices made. A lawyer in court trying to argue
that faulty brakes were the cause of the accident needs to be able to justify his
model; similarly, his opponent will need to understand what counts as a legitimate
attack on the model. In this paper we discuss what we believe are reasonable bases
for such justifications. Issues such as model stability and interactions between the
events corresponding to variables turn out to be important.
Another focus of the paper is the use of defaults in causal reasoning. As we hinted
above, the basic structural equations model does not seem to suffice to completely
capture all aspects of causal reasoning. To explain why, we need to briefly outline
how actual causality is defined in the structural equations framework. Like many
other definitions of causality (see, for example, [Hume 1739; Lewis 1973b]), the HP
definition is based on counterfactual dependence. Roughly speaking, A is a cause of
B if, had A not happened (this is the counterfactual condition, since A did in fact
happen) then B would not have happened. As is well known, this naive definition
does not capture all the subtleties involved with causality. Consider the following
example (due to Hall [2004]): Suzy and Billy both pick up rocks and throw them at
a bottle. Suzy’s rock gets there first, shattering the bottle. Since both throws are
perfectly accurate, Billy’s would have shattered the bottle had Suzy not thrown.
Thus, according to the naive counterfactual definition, Suzy’s throw is not a cause
of the bottle shattering. This certainly seems counterintuitive.
The HP definition deals with this problem by taking A to be a cause of B if B
counterfactually depends on A under some contingency. For example, Suzy’s throw
is the cause of the bottle shattering because the bottle shattering counterfactually
depends on Suzy’s throw, under the contingency that Billy doesn’t throw. (As we
will see below, there are further subtleties in the definition that guarantee that, if
things are modeled appropriately, Billy’s throw is not also a cause.)
While the definition of actual causation in terms of structural equations has been
successful at dealing with many of the problems of causality, examples of Hall [2007],
Hiddleston [2005], and Hitchcock [2007a] show that it gives inappropriate answers in
cases that have structural equations isomorphic to ones where it arguably gives the
appropriate answer. This means that, no matter how we define actual causality in
the structural-equations framework, the definition must involve more than just the
structural equations. Recently, Hall [2007], Halpern [2008], and Hitchcock [2007a]
have suggested that using defaults might be a way of dealing with the problem.
As the psychologists Kahneman and Miller [1986, p. 143] observe, “an event is
more likely to be undone by altering exceptional than routine aspects of the causal
chain that led to it”. This intuition is also present in the legal literature. Hart and
Honoré [1985] observe that the statement “It was the presence of oxygen that caused
the fire” makes sense only if there were reasons to view the presence of oxygen as
abnormal.
As shown by Halpern [2008], we can model this intuition formally by combining a
well-known approach to modeling defaults and normality, due to Kraus, Lehmann,
and Magidor [1990] with the structural-equation model. Moreover, doing this leads
to a straightforward solution to the problem above. The idea is that, when showing
that if A hadn’t happened then B would not have happened, we consider only
contingencies that are at least as normal as the actual world. For example, if
someone typically leaves work at 5:30 PM and arrives home at 6, but, due to
unusually bad traffic, arrives home at 6:10, the bad traffic is typically viewed as the
cause of his being late, not the fact that he left at 5:30 (rather than 5:20).
But once we add defaults to the model, the problem of justifying the model be-
comes even more acute. We not only have to justify the structural equations and the
choice of variables, but also the default theory. The problem is exacerbated by the
fact that default and “normality” have a number of interpretations. Among other
things, they can represent moral obligations, societal conventions, prototypicality
information, and statistical information. All of these interpretations are relevant to
understanding causality; this makes justifying default choices somewhat subtle.
The rest of this paper is organized as follows. In Sections 2 and 3, we review the
notion of causal model and the HP definition of actual cause; most of this material is
taken from [Halpern and Pearl 2005]. In Section 4, we discuss some issues involved
in the choice of variables in a model. In Section 5, we review the approach of
[Halpern 2008] for adding considerations of normality to the HP framework, and
discuss some modeling issues that arise when we do so. We conclude in Section 6.
2 Causal Models
In this section, we briefly review the HP definition of causality. The description of
causal models given here is taken from [Halpern 2008], which in turn is based on
that of [Halpern and Pearl 2005].
The HP approach assumes that the world is described in terms of random vari-
ables and their values. For example, if we are trying to determine whether a forest
fire was caused by lightning or an arsonist, we can take the world to be described
by three random variables:
• F for forest fire, where F = 1 if there is a forest fire and F = 0 otherwise;
• L for lightning, where L = 1 if lightning occurred and L = 0 otherwise; and
• ML for the arsonist dropping a lit match, where ML = 1 if the match was dropped
and ML = 0 otherwise.
Some random variables may have a causal influence on others. This influence
is modeled by a set of structural equations. For example, to model the fact that
if either a match is lit or lightning strikes, then a fire starts, we could use the
random variables ML, F , and L as above, with the equation F = max(L, ML).
(Alternately, if a fire requires both causes to be present, the equation for F becomes
F = min(L, ML).) The equality sign in this equation should be thought of more like
an assignment statement in programming languages; once we set the values of ML
and L, then the value of F is set to their maximum. However, despite the equality,
if a forest fire starts some other way, that does not force the value of either ML or
L to be 1.
It is conceptually useful to split the random variables into two sets: the exoge-
nous variables, whose values are determined by factors outside the model, and the
endogenous variables, whose values are ultimately determined by the exogenous
variables. For example, in the forest-fire example, the variables ML, L, and F are
endogenous. However, we want to take as given that there is enough oxygen for
the fire and that the wood is sufficiently dry to burn. In addition, we do not want
to concern ourselves with the factors that make the arsonist drop the match or the
factors that cause lightning. These factors are all determined by the exogenous
variables.
Formally, a causal model M is a pair (S, F), where S is a signature, which explic-
itly lists the endogenous and exogenous variables and characterizes their possible
values, and F defines a set of modifiable structural equations, relating the values
of the variables. A signature S is a tuple (U, V, R), where U is a set of exogenous
variables, V is a set of endogenous variables, and R associates with every variable
Y ∈ U ∪ V a nonempty set R(Y ) of possible values for Y (that is, the set of values
over which Y ranges). F associates with each endogenous variable X ∈ V a func-
tion denoted FX such that FX : (×U∈U R(U)) × (×Y∈V−{X} R(Y)) → R(X). This
mathematical notation just makes precise the fact that FX determines the value
of X, given the values of all the other variables in U ∪ V. If there is one exoge-
nous variable U and three endogenous variables, X, Y , and Z, then FX defines the
values of X in terms of the values of Y , Z, and U . For example, we might have
FX(u, y, z) = u + y, which is usually written as X ← U + Y.2 Thus, if Y = 3 and
U = 2, then X = 5, regardless of how Z is set.
In the running forest-fire example, suppose that we have an exogenous random
variable U that determines the values of L and ML. Thus, U has four possible values
of the form (i, j), where both of i and j are either 0 or 1. The i value determines
the value of L and the j value determines the value of ML. Although FL gets as
arguments the value of U, ML, and F, in fact, it depends only on the (first component
of) the value of U ; that is, FL ((i, j), m, f ) = i. Similarly, FML ((i, j), l, f ) = j. The
value of F depends only on the value of L and ML. How it depends on them depends
on whether either cause by itself is sufficient for the forest fire or whether both are
necessary. If either one suffices, then FF ((i, j), l, m) = max(l, m), or, perhaps more
comprehensibly, F = max(L, ML); if both are needed, then F = min(L, ML). For
future reference, call the former model the disjunctive model, and the latter the
conjunctive model.
The key role of the structural equations is to define what happens in the presence
of external interventions. For example, we can explain what happens if the arsonist
2 The fact that X is assigned U + Y (i.e., the value of X is the sum of the values of U and Y )
does not imply that Y is assigned X − U ; that is, FY (U, X, Z) = X − U does not necessarily hold.
does not drop the match. In the disjunctive model, there is a forest fire exactly
if there is lightning; in the conjunctive model, there is definitely no fire. Setting
the value of some variable X to x in a causal model M = (S, F) results in a new
causal model denoted MX←x . In the new causal model, the equation for X is very
simple: X is just set to x; the remaining equations are unchanged. More formally,
MX←x = (S, F X←x ), where F X←x is the result of replacing the equation for X in
F by X = x.
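A minimal computational sketch of these semantics (Python; the helper name solve and the dictionary encoding of the equations are ours, purely for illustration, not part of the formalism): each endogenous variable is computed from its equation in order, unless an intervention X ← x replaces that equation, exactly as in the definition of MX←x.

```python
def solve(equations, order, context, interventions=None):
    """Solve an acyclic model given a context (values of the exogenous variables).
    Each endogenous variable is computed from its equation in topological order,
    unless an intervention V <- v replaces that variable's equation."""
    interventions = interventions or {}
    vals = dict(context)
    for v in order:
        vals[v] = interventions.get(v, equations[v](vals))
    return vals

# Forest-fire example: exogenous U = (i, j) sets lightning L and the arsonist's match ML.
order = ["L", "ML", "F"]
disjunctive = {"L": lambda v: v["U"][0],
               "ML": lambda v: v["U"][1],
               "F": lambda v: max(v["L"], v["ML"])}
conjunctive = dict(disjunctive, F=lambda v: min(v["L"], v["ML"]))

context = {"U": (1, 1)}                                   # lightning strikes, match is dropped
print(solve(disjunctive, order, context))                 # {'U': (1, 1), 'L': 1, 'ML': 1, 'F': 1}
print(solve(disjunctive, order, context, {"ML": 0}))      # do(ML = 0): the fire still occurs
print(solve(conjunctive, order, context, {"ML": 0}))      # do(ML = 0): definitely no fire
```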
The structural equations describe objective information about the results of in-
terventions, that can, in principle, be checked. Once the modeler has selected a set
of variables to include in the model, the world determines which equations among
those variables correctly represent the effects of interventions.3 By contrast, the
choice of variables is subjective; in general, there need be no objectively “right” set
of exogenous and endogenous variables to use in modeling a problem. We return to
this issue in Section 4.
It may seem somewhat circular to use causal models, which clearly already encode
causal information, to define actual causation. Nevertheless, as we shall see, there
is no circularity. The equations of a causal model do not represent relations of
actual causation, the very concept that we are using them to define. Rather, the
equations characterize the results of all possible interventions (or at any rate, all
of the interventions that can be represented in the model) without regard to what
actually happened. Specifically, the equations do not depend upon the actual values
realized by the variables. For example, the equation F = max(L, ML), by itself,
does not say anything about whether the forest fire was actually caused by lightning
or by an arsonist, or, for that matter, whether a fire even occurred. By contrast,
relations of actual causation depend crucially on how things actually play out.
A sequence X1, . . . , Xn of endogenous variables is a directed path from X1 to Xn if the
value of Xi+1 (as given by FXi+1) depends on the value of Xi, for i = 1, . . . , n − 1.
In this paper, following HP, we restrict our discussion to acyclic causal models,
where causal influence can be represented by an acyclic Bayesian network. That is,
there is no cycle X1 , . . . , Xn , X1 of endogenous variables that forms a directed path
from X1 to itself. If M is an acyclic causal model, then given a context, that is,
a setting ~u for the exogenous variables in U, there is a unique solution for all the
equations.
3 In general, there may be uncertainty about the causal model, as well as about the true setting
of the exogenous variables in a causal model. Thus, we may be uncertain about whether smoking
causes cancer (this represents uncertainty about the causal model) and uncertain about whether
a particular patient actually smoked (this is uncertainty about the value of the exogenous variable
that determines whether the patient smokes). This uncertainty can be described by putting a
probability on causal models and on the values of the exogenous variables. We can then talk
about the probability that A is a cause of B.
This is essentially the but-for test, perhaps the most widely used test of actual
causation in tort adjudication. The but-for test states that an act is a cause of
injury if and only if, but for the act (i.e., had the act not occurred), the injury
would not have occurred.
There are two well-known problems with this definition. The first can be seen
by considering the disjunctive causal model for the forest fire again. Suppose that
the arsonist drops a match and lightning strikes. Which is the cause? According
to a naive interpretation of the counterfactual definition, neither is. If the match
hadn’t dropped, then the lightning would still have struck, so there would have been
a forest fire anyway. Similarly, if the lightning had not occurred, there still would
have been a forest fire. As we shall see, the HP definition declares both lightning
and the arsonist causes of the fire. (In general, there may be more than one actual
cause of an outcome.)
A more subtle problem is what philosophers have called preemption, which is
illustrated by the rock-throwing example from the introduction. As we observed,
according to a naive counterfactual definition of causality, Suzy’s throw would not
be a cause.
The HP definition deals with the first problem by defining causality as coun-
terfactual dependency under certain contingencies. In the forest-fire example, the
forest fire does counterfactually depend on the lightning under the contingency that
the arsonist does not drop the match; similarly, the forest fire depends counterfac-
tually on the dropping of the match under the contingency that the lightning does
not strike.
Unfortunately, we cannot use this simple solution to treat the case of preemp-
tion. We do not want to make Billy’s throw the cause of the bottle shattering by
considering the contingency that Suzy does not throw. So if our account is to yield
the correct verdict in this case, it will be necessary to limit the contingencies that
can be considered. The reason that we consider Suzy’s throw to be the cause and
Billy’s throw not to be the cause is that Suzy’s rock hit the bottle, while Billy’s did
not. Somehow the definition of actual cause must capture this obvious intuition.
With this background, we now give the preliminary version of the HP definition
of causality. Although the definition is labeled “preliminary”, it is quite close to
the final definition, which is given in Section 5. The definition is relative to a causal
model (and a context); A may be a cause of B in one causal model but not in
another. The definition consists of three clauses. The first and third are quite
simple; all the work is going on in the second clause.
The types of events that the HP definition allows as actual causes are ones of
the form X1 = x1 ∧ . . . ∧ Xk = xk —that is, conjunctions of primitive events; this is
often abbreviated as X~ = ~x. The events that can be caused are arbitrary Boolean
combinations of primitive events. The definition does not allow statements of the
form “A or A′ is a cause of B”, although this could be treated as being equivalent
to “either A is a cause of B or A′ is a cause of B”. On the other hand, statements
such as “A is a cause of B or B′” are allowed; this is not equivalent to “either A is
a cause of B or A is a cause of B′”.
DEFINITION 1. (Actual cause; preliminary version) [Halpern and Pearl 2005] X~ = ~x
is an actual cause of φ in (M, ~u) if the following three conditions hold:
AC1. (M, ~u) |= (X~ = ~x) ∧ φ; that is, both X~ = ~x and φ hold in the actual world.
AC2. There is a partition of V (the set of endogenous variables) into two subsets Z~
and W~ with X~ ⊆ Z~, and a setting ~x′ and ~w of the variables in X~ and W~,
the causal path were set to their original values in the context ~u, φ would still be
true, as long as X~ = ~x.
EXAMPLE 2. For the forest-fire example, let M be the disjunctive model for the
forest fire sketched earlier, with endogenous variables L, ML, and F . We want to
show that L = 1 is an actual cause of F = 1. Clearly (M, (1, 1)) |= F = 1 and
(M, (1, 1)) |= L = 1; in the context (1,1), the lightning strikes and the forest burns
down. Thus, AC1 is satisfied. AC3 is trivially satisfied, since X~ consists of only one
element, L, so must be minimal. For AC2, take Z~ = {L, F} and take W~ = {ML},
let x′ = 0, and let w = 0. Clearly, (M, (1, 1)) |= [L ← 0, ML ← 0](F ≠ 1); if the
lightning does not strike and the match is not dropped, the forest does not burn
down, so AC2(a) is satisfied. To see the effect of the lightning, we must consider the
contingency where the match is not dropped; the definition allows us to do that by
setting ML to 0. (Note that here setting L and ML to 0 overrides the effects of U;
this is critical.) Moreover, (M, (1, 1)) |= [L ← 1, ML ← 0](F = 1); if the lightning
strikes, then the forest burns down even if the lit match is not dropped, so AC2(b)
is satisfied. (Note that since Z~ = {L, F}, the only subsets of Z~ − X~ are the empty
set and the singleton set consisting of just F.)
It is also straightforward to show that the lightning and the dropped match are
also causes of the forest fire in the context where U = (1, 1) in the conjunctive
model. Again, AC1 and AC3 are trivially satisfied and, again, to show that AC2
holds in the case of lightning we can take Z~ = {L, F}, W~ = {ML}, and x′ = 0, but
now we let w = 1. In the conjunctive scenario, if there is no lightning, there is no
forest fire, while if there is lightning (and the match is dropped) there is a forest
fire, so AC2(a) and AC2(b) are satisfied; similarly for the dropped match.
EXAMPLE 3. Now consider the Suzy-Billy example.4 We get the desired result—
that Suzy’s throw is a cause, but Billy’s is not—but only if we model the story
appropriately. Consider first a coarse causal model, with three endogenous variables:
• ST for “Suzy throws”, with values 0 (Suzy does not throw) and 1 (she does);
• BT for “Billy throws”, with values 0 (he doesn’t) and 1 (he does);
• BS for “bottle shatters”, with values 0 (it doesn’t shatter) and 1 (it does).
(We omit the exogenous variable here; it determines whether Billy and Suzy throw.)
Take the formula for BS to be such that the bottle shatters if either Billy or Suzy
throw; that is BS = max(BT , ST ). (We assume that Suzy and Billy will not
miss if they throw.) BT and ST play symmetric roles in this model; there is
nothing to distinguish them. Not surprisingly, both Billy’s throw and Suzy’s throw
are classified as causes of the bottle shattering in this model. The argument is
essentially identical to that in the disjunctive model of the forest-fire example in
4 The discussion of this example is taken almost verbatim from HP.
the context U = (1, 1), where both the lightning and the dropped match are causes
of the fire.
The trouble with this model is that it cannot distinguish the case where both
rocks hit the bottle simultaneously (in which case it would be reasonable to say
that both ST = 1 and BT = 1 are causes of BS = 1) from the case where Suzy’s
rock hits first. To allow the model to express this distinction, we add two new
variables to the model:
• BH for “Billy’s rock hits the (intact) bottle”, with values 0 (it doesn’t) and
1 (it does); and
• SH for “Suzy’s rock hits the bottle”, again with values 0 and 1.
The equations are then modified to incorporate these new variables:
• SH = ST;
• BH = min(BT , 1 − SH ); and
• BS = max(SH , BH ).
Now it is the case that, in the context where both Billy and Suzy throw, ST = 1
is a cause of BS = 1, but BT = 1 is not. To see that ST = 1 is a cause, note
that, as usual, it is immediate that AC1 and AC3 hold. For AC2, choose Z~ =
{ST, SH, BH, BS}, W~ = {BT}, and w = 0. When BT is set to 0, BS tracks ST:
if Suzy throws, the bottle shatters and if she doesn’t throw, the bottle does not
shatter. To see that BT = 1 is not a cause of BS = 1, we must check that there
is no partition Z~ ∪ W~ of the endogenous variables that satisfies AC2. Attempting
the symmetric choice with Z~ = {BT, BH, SH, BS}, W~ = {ST}, and w = 0 violates
AC2(b). To see this, take Z~′ = {BH}. In the context where Suzy and Billy both
throw, BH = 0. If BH is set to 0, the bottle does not shatter if Billy throws
and Suzy does not. It is precisely because, in this context, Suzy’s throw hits the
bottle and Billy’s does not that we declare Suzy’s throw to be the cause of the
bottle shattering. AC2(b) captures that intuition by allowing us to consider the
contingency where BH = 0, despite the fact that Billy throws. We leave it to the
reader to check that no other partition of the endogenous variables satisfies AC2
either.
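The check left to the reader can be automated by brute force. The sketch below (Python; the helper names are ours and purely illustrative) enumerates partitions and settings for the clauses exactly as they are used above: AC1, AC2(a), and AC2(b) with W~ set in full and Z~′ ranging over subsets of Z~ − X~; AC3 is automatic for a single-conjunct candidate cause. It is a sketch of the preliminary definition as described here, not a full implementation of every published variant. On the finer-grained rock-throwing model it reports ST = 1 as a cause of BS = 1 and BT = 1 as not a cause.

```python
from itertools import chain, combinations, product

def powerset(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def solve(equations, order, context, interventions):
    """Evaluate an acyclic model; an intervention V <- v replaces V's equation."""
    vals = dict(context)
    for v in order:
        vals[v] = interventions[v] if v in interventions else equations[v](vals)
    return vals

def is_actual_cause(x, x_val, phi, equations, order, context, domains):
    """Brute-force check of the preliminary clauses as used in the examples above:
    AC1; AC2(a); AC2(b) with W set in full and Z' ranging over subsets of Z - {X}.
    AC3 is trivial for a single conjunct X = x. (A sketch, not a complete variant.)"""
    actual = solve(equations, order, context, {})
    if actual[x] != x_val or not phi(actual):                         # AC1
        return False
    others = [v for v in order if v != x]
    for w_vars in powerset(others):                                   # choose W; Z is the rest
        z_rest = [v for v in others if v not in w_vars]
        for w_vals in product(*[domains[v] for v in w_vars]):
            w = dict(zip(w_vars, w_vals))
            for x_alt in domains[x]:
                if x_alt == x_val:
                    continue
                if phi(solve(equations, order, context, {x: x_alt, **w})):
                    continue                                           # AC2(a) fails for this witness
                if all(phi(solve(equations, order, context,
                                 {x: x_val, **w, **{z: actual[z] for z in z_sub}}))
                       for z_sub in powerset(z_rest)):                 # AC2(b)
                    return True
    return False

# Rock-throwing model with SH and BH, as in Example 3.
order = ["ST", "BT", "SH", "BH", "BS"]
equations = {"ST": lambda v: v["u_ST"], "BT": lambda v: v["u_BT"],
             "SH": lambda v: v["ST"],
             "BH": lambda v: min(v["BT"], 1 - v["SH"]),
             "BS": lambda v: max(v["SH"], v["BH"])}
context = {"u_ST": 1, "u_BT": 1}                     # both Suzy and Billy throw
domains = {v: [0, 1] for v in order}
bottle_shatters = lambda vals: vals["BS"] == 1

print(is_actual_cause("ST", 1, bottle_shatters, equations, order, context, domains))  # True
print(is_actual_cause("BT", 1, bottle_shatters, equations, order, context, domains))  # False
```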
This example emphasizes an important moral. If we want to argue in a case of
preemption that X = x is the cause of φ rather than Y = y, then there must be
a random variable (BH in this case) that takes on different values depending on
whether X = x or Y = y is the actual cause. If the model does not contain such
a variable, then it will not be possible to determine which one is in fact the cause.
This is certainly consistent with intuition and the way we present evidence. If we
want to argue (say, in a court of law) that it was A’s shot that killed C rather than
B’s, then we present evidence such as the bullet entering C from the left side (rather
than the right side, which is how it would have entered had B’s shot been the lethal
one). The side from which the shot entered is the relevant random variable in this
case. Note that the random variable may involve temporal evidence (if Y ’s shot
had been the lethal one, the death would have occurred a few seconds later), but it
certainly does not have to.
highest-ranking officer. Both a sergeant and a major issue the order to march, and
the soldiers march. Let us put aside the morals that Schaffer attempts to draw from
this example (with which we disagree; see [Halpern and Pearl 2005] and [Hitchcock
2010]), and consider only the modeling problem. We will presumably want variables
S, M , and A, corresponding to the sergeant’s order, the major’s order, and the sol-
diers’ action. We might let S = 1 represent the sergeant’s giving the order to march
and S = 0 represent the sergeant’s giving no order; likewise for M and A. But this
would not be adequate. If the only possible order is the order to march, then there
is no way to capture the principle that in the case of conflicting orders, the soldiers
obey the major. One way to do this is to replace the variables M , S, and A by
variables M′, S′ and A′ that take on three possible values. Like M, M′ = 0 if the
major gives no order and M′ = 1 if the major gives the order to march. But now
we allow M′ = 2, which corresponds to the major giving some other order. S′ and
A′ are defined similarly. We can now write an equation to capture the fact that if
M′ = 1 and S′ = 2, then the soldiers march, while if M′ = 2 and S′ = 1, then the
soldiers do not march.
The appropriate set of values of a variable will depend on the other variables
in the picture, and the relationship between them. Suppose, for example, that a
hapless homeowner comes home from a trip to find that his front door is stuck. If
he pushes on it with a normal force then the door will not open. However, if he
leans his shoulder against it and gives a solid push, then the door will open. To
model this, it suffices to have a variable O with values either 0 or 1, depending on
whether the door opens, and a variable P , with values 0 or 1 depending on whether
or not the homeowner gives a solid push.
On the other hand, suppose that the homeowner also forgot to disarm the security
system, and that the system is very sensitive, so that it will be tripped by any push
on the door, regardless of whether the door opens. Let A = 1 if the alarm goes off,
A = 0 otherwise. Now if we try to model the situation with the same variable P , we
will not be able to express the dependence of the alarm on the homeowner’s push.
To deal with both O and A, we need to extend P to a 3-valued variable P′, with
values 0 if the homeowner does not push the door, 1 if he pushes it with normal
force, and 2 if he gives it a solid push.
These considerations parallel issues that arise in philosophical discussions about
the metaphysics of “events”.6 Suppose that our homeowner pushed on the door with
enough force to open it. Is there just one event, the push, that can be described
at various levels of detail, such as a “push” or a “hard push”? This is the view of
Davidson [1967]. Or are there rather many different events corresponding to these
different descriptions, as argued by Kim [1973] and Lewis [1986b]? And if we take
the latter view, which of the many events that occur should be counted as causes of
the door’s opening? These strike us as pseudoproblems. We believe that questions
6 This philosophical usage of the word “event” is different from the typical usage of the word in
computer science and probability, where an event is just a subset of the state space.
about causality are best addressed by dealing with the methodological problem of
constructing a model that correctly describes the effects of interventions in a way
that is not misleading or ambiguous.
A slightly different way in which one variable may constrain the values that
another may take is by its implicit presuppositions. For example, a counterfactual
theory of causation seems to have the somewhat counterintuitive consequence that
one’s birth is a cause of one’s death. This sounds a little odd. If Jones dies suddenly
one night, shortly before his 80th birthday, the coroner’s inquest is unlikely to list
“birth” as among the causes of his death. Typically, when we investigate the causes
of death, we are interested in what makes the difference between a person’s dying
and his surviving. So our model might include a variable D such that D = 1 holds if
Jones dies shortly before his 80th birthday, and D = 0 holds if he continues to
live. If our model also includes a variable B, taking the value 1 if Jones is born, 0
otherwise, then there simply is no value that D would take if B = 0. Both D = 0
and D = 1 implicitly presuppose that Jones was born (i.e., B = 1). Our conclusion
is that if we have chosen to include a variable such as D in our model, then we
cannot conclude that Jones’ birth is a cause of his death!
possible situations in which both rocks hit or neither rock hits the bottle. In partic-
ular, this representation does not allow us to consider independent interventions on
the rocks hitting the bottle. As the discussion in Example 3 shows, it is precisely
such an intervention that is needed to establish that Suzy’s throw (and not Billy’s)
is the actual cause of the bottle shattering.
While these rules are simple in principle, their application is not always trans-
parent.
EXAMPLE 4. Consider cases of “switching”, which have been much discussed in
the philosophical literature. A train is heading toward the station. An engineer
throws a switch, directing the train down the left track, rather than the right track.
The tracks re-converge before the station, and the train arrives as scheduled. Was
throwing the switch a cause of the train’s arrival? HP consider two causal models
of this scenario. In the first, there is a random variable S which is 1 if the switch
is thrown (so the train goes down the left track) and 0 otherwise. In the second,
in addition to S, there are variables LT and RT , indicating whether or not the
train goes down the left track and right track, respectively. Note that with the first
representation, there is no way to model the train not making it to the arrival point.
With the second representation, we have the problem that LT = 1 and RT = 1
are arguably not independent; the train cannot be on both tracks at once. If we
want to model the possibility of one track or another being blocked, we should use,
instead of LT and RT , variables LB and RB , which indicate whether the left track
or right track, respectively, is blocked. This allows us to represent all the relevant
possibilities without running into independence problems. Note that if we have
only S as a random variable, then S = 1 cannot be a cause of the train arriving;
it would have arrived no matter what. With RB in the picture, the preliminary
HP definition of actual cause rules that S = 1 can be an actual cause of the train’s
arrival; for example, under the contingency that RB = 1, the train does not arrive
if S = 0. (However, once we extend the definition to include defaults, as we will in
the next section, it becomes possible once again to block this conclusion.)
These rules will have particular consequences for how we should represent events
that might occur at different times. Consider the following simplification of an
example introduced by Bennett [1987], and also considered in HP.
EXAMPLE 5. Suppose that the Careless Camper (CC for short) has plans to go
camping on the first weekend in June. He will go camping unless there is a fire in
the forest in May. If he goes camping, he will leave a campfire unattended, and
there will be a forest fire. Let the variable C take the value 1 if CC goes camping,
and 0 otherwise. How should we represent the state of the forest?
There appear to be at least three alternatives. The simplest proposal would be
to use a variable F that takes the value 1 if there is a forest fire at some time, and 0
otherwise.7 But now how are we to represent the dependency relations between F
7 This is, in effect, how effects have been represented using “neuron diagrams” in late preemption
and C? Since CC will go camping only if there is no fire (in May), we would want to
have an equation such as C = 1−F . On the other hand, since there will be a fire (in
June) just in case CC goes camping, we will also need F = C. This representation is
clearly not rich enough, since it does not let us make the clearly relevant distinction
between whether the forest fire occurs in May or June. The problem is manifested
in the fact that the equations are cyclic, and have no consistent solution.8
The third way to model the scenario is to use two separate variables, F1 and F2 ,
to represent the state of the forest at separate times. F1 = 1 will represent a fire
in May, and F1 = 0 represents no fire in May; F2 = 1 represents a fire in June and
F2 = 0 represents no fire in June. Now we can write our equations as C = 1 − F1
and F2 = C × (1 − F1 ). This representation is free from the defects that plague the
other two representations. We have no cycles, and hence there will be a consistent
solution for any value of the exogenous variables. Moreover, this model correctly
tells us that only an intervention on the state of the forest in May will affect CC’s
camping plans.
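A small sketch of this third representation (Python; the solve helper is ours and purely illustrative) confirms the claim just made: intervening on the June fire F2 leaves C untouched, while intervening on the May fire F1 changes it.

```python
def solve(equations, order, context, interventions=None):
    """Evaluate the acyclic model; an intervention replaces a variable's equation."""
    interventions = interventions or {}
    vals = dict(context)
    for v in order:
        vals[v] = interventions.get(v, equations[v](vals))
    return vals

order = ["F1", "C", "F2"]                        # F1: fire in May, C: camping, F2: fire in June
equations = {"F1": lambda v: v["u"],             # exogenous u determines the May fire
             "C":  lambda v: 1 - v["F1"],
             "F2": lambda v: v["C"] * (1 - v["F1"])}

print(solve(equations, order, {"u": 0}))               # no May fire: C = 1 and F2 = 1
print(solve(equations, order, {"u": 1}))               # May fire: C = 0 and F2 = 0
print(solve(equations, order, {"u": 0}, {"F2": 0}))    # do(F2 = 0) leaves C = 1
print(solve(equations, order, {"u": 0}, {"F1": 1}))    # do(F1 = 1) changes C to 0
```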
Once again, our discussion of the methodology of modeling parallels certain meta-
physical discussions in the philosophy literature. If heavy rains delay the onset of
a fire, is it the same fire that would have occurred without the rains, or a different
fire? It is hard to see how to gain traction on such an issue by direct metaphysical
speculation. By contrast, when we recast the issue as one about what kinds of
variables to include in causal models, it is possible to say exactly how the models
will mislead you if you make the wrong choice.
7 This is, in effect, how effects have been represented using “neuron diagrams” in late preemption cases. In such a model, BH is a cause of BS, even though it is the earlier shattering of the bottle that prevents Billy’s rock from hitting. Halpern and Pearl [2005] note this problem and offer a dynamic model akin to the one recommended below. As it turns out, this does not affect the analysis of the example offered above.
EXAMPLE 7. Consider the following story, taken from (an early version of) [Hall
2004]: Suppose that Billy is hospitalized with a mild illness on Monday; he is
treated and recovers. In the obvious causal model, the doctor’s treatment is a cause
of Billy’s recovery. Moreover, if the doctor does not treat Billy on Monday, then
the doctor’s omission to treat Billy is a cause of Billy’s being sick on Tuesday. But
now suppose that there are 100 doctors in the hospital. Although only doctor 1 is
assigned to Billy (and he forgot to give medication), in principle, any of the other
99 doctors could have given Billy his medication. Is the nontreatment by doctors
2–100 also a cause of Billy’s being sick on Tuesday?
Suppose that in fact the hospital has 100 doctors and there are variables
A1 , . . . , A100 and T1 , . . . , T100 in the causal model, where Ai = 1 if doctor i is
assigned to treat Billy and Ai = 0 if he is not, and Ti = 1 if doctor i actually treats
Billy on Monday, and Ti = 0 if he does not. Doctor 1 is assigned to treat Billy;
the others are not. However, in fact, no doctor treats Billy. Further assume that,
typically, no doctor is assigned to a given patient; if doctor i is not assigned to
treat Billy, then typically doctor i does not treat Billy; and if doctor i is assigned
to Billy, then typically doctor i treats Billy. We can capture this in an extended
causal model where the world where no doctor is assigned to Billy and no doctor
treats him has rank 0; the 100 worlds where exactly one doctor is assigned to Billy,
and that doctor treats him, have rank 1; the 100 worlds where exactly one doctor is
assigned to Billy and no one treats him have rank 2; and the 100 × 99 worlds where
exactly one doctor is assigned to Billy but some other doctor treats him have rank
3. (The ranking given to other worlds is irrelevant.) In this extended model, in the
context where doctor i is assigned to Billy but no one treats him, i is the cause of
Billy’s sickness (the world where i treats Billy has lower rank than the world where
i is assigned to Billy but no one treats him), but no other doctor is a cause of Billy’s
sickness. Moreover, in the context where i is assigned to Billy and treats him, then
i is the cause of Billy’s recovery (for AC2(a), consider the world where no doctor is
assigned to Billy and none treat him).
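The ranking argument can be checked mechanically; the sketch below (ours, not the chapter's, with a handful of doctors standing in for 100) assigns ranks to worlds as described and compares the actual world with the witness world needed for each doctor:

    # Worlds are pairs (assigned, treater); each is a doctor index or None.
    def rank(assigned, treater):
        if assigned is None and treater is None: return 0   # typical world
        if assigned is not None and treater == assigned: return 1
        if assigned is not None and treater is None: return 2
        return 3   # a doctor other than the assigned one treats Billy

    actual = (1, None)        # doctor 1 assigned, nobody treats Billy
    print(rank(*actual))      # 2
    # Witness for "doctor 1's omission is a cause": doctor 1 treats Billy.
    print(rank(1, 1))         # 1: more normal than the actual world, so allowed
    # Witness for "doctor 2's omission is a cause": doctor 2 treats Billy.
    print(rank(1, 2))         # 3: less normal than the actual world, so blocked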
Adding a normality theory to the model gives the HP account of actual causation
greater flexibility to deal with these kinds of cases. This raises the worry, however,
that this gives the modeler too much flexibility. After all, the modeler can now
render any claim that A is an actual cause of B false, simply by choosing a nor-
mality order that assigns the actual world a lower rank than any witness world needed
to satisfy AC2. Thus, the introduction of normality exacerbates the problem of
motivating and defending a particular choice of model. Fortunately, the literature
on the psychology of counterfactual reasoning and causal judgment goes some way
toward enumerating the sorts of factors that constitute normality. (See, for exam-
ple, [Alicke 1992; Cushman 2009; Cushman, Knobe, and Sinnott-Armstrong 2008;
Hitchcock and Knobe 2009; Kahneman and Miller 1986; Knobe and Fraser 2008;
Kahneman and Tversky 1982; Mandel, Hilton, and Catellani 1985; Roese 1997].)
These factors include the following:
• Statistical norms concern what happens most often, or with the greatest fre-
quency. Kahneman and Tversky [1982] gave subjects a story in which Mr.
Jones usually leaves work at 5:30, but occasionally leaves early to run errands.
Thus, a 5:30 departure is (statistically) “normal”, and an earlier departure
“abnormal”. This difference affected which alternate possibilities subjects
were willing to consider when reflecting on the causes of an accident in which
Mr. Jones was involved.
• Policies adopted by social institutions can also be norms. For instance, Knobe
and Fraser [2008] presented subjects with a hypothetical situation in which
a department had implemented a policy allowing administrative assistants
to take pens from the department office, but prohibiting faculty from doing so.
The law suggests a variety of principles for determining the norms that are used
in the evaluation of actual causation. In criminal law, norms are determined by
direct legislation. For example, if there are legal standards for the strength of seat
belts in an automobile, a seat belt that did not meet this standard could be judged
a cause of a traffic fatality. By contrast, if a seat belt complied with the legal
standard, but nonetheless broke because of the extreme forces it was subjected to
during a particular accident, the fatality would be blamed on the circumstances of
the accident, rather than the seat belt. In such a case, the manufacturers of the
seat belt would not be guilty of criminal negligence. In contract law, compliance
with the terms of a contract has the force of a norm. In tort law, actions are often
judged against the standard of “the reasonable person”. For instance, if a bystander
was harmed when a pedestrian who was legally crossing the street suddenly jumped
out of the way of an oncoming car, the pedestrian would not be held liable for
damages to the bystander, since he acted as the hypothetical “reasonable person”
would have done in similar circumstances. (See, for example, [Hart and Honoré
1985, pp. 142ff.] for discussion.) There are also a number of circumstances in
which deliberate malicious acts of third parties are considered to be “abnormal”
interventions, and affect the assessment of causation. (See, for example, [Hart and
Honoré 1985, pp. 68ff.].)
As with the choice of variables, we do not expect that these considerations will
always suffice to pick out a uniquely correct theory of normality for a causal model.
They do, however, provide resources for a rational critique of models.
6 Conclusion
As HP stress, causality is relative to a model. That makes it particularly important
to justify whatever model is chosen, and to enunciate principles for what makes a
reasonable causal model. We have taken some preliminary steps in investigating
this issue with regard to the choice of variables and the choice of defaults. However,
we hope that we have convinced the reader that far more needs to be done if causal
models are actually going to be used in applications.
Acknowledgments: We thank Wolfgang Spohn for useful comments. Joseph
Halpern was supported in part by NSF grants IIS-0534064 and IIS-0812045, and by
AFOSR grants FA9550-08-1-0438 and FA9550-05-1-0055.
References
Adams, E. (1975). The Logic of Conditionals. Dordrecht, Netherlands: Reidel.
Alicke, M. (1992). Culpable causation. Journal of Personality and Social Psy-
chology 63, 368–378.
Bennett, J. (1987). Event causation: the counterfactual analysis. In Philosophical
Perspectives, Vol. 1, Metaphysics, pp. 367–386. Atascadero, CA: Ridgeview
Publishing Company.
Cushman, F. (2009). The role of moral judgment in causal and intentional attri-
bution: What we say or how we think? Unpublished manuscript.
Cushman, F., J. Knobe, and W. Sinnott-Armstrong (2008). Moral appraisals
affect doing/allowing judgments. Cognition 108 (1), 281–289.
Davidson, D. (1967). Causal relations. Journal of Philosophy LXIV (21), 691–703.
Glymour, C. and F. Wimberly (2007). Actual causes and thought experiments.
In J. Campbell, M. O’Rourke, and H. Silverstein (Eds.), Causation and Ex-
planation, pp. 43–67. Cambridge, MA: MIT Press.
Goldberger, A. S. (1972). Structural equation methods in the social sciences.
Econometrica 40 (6), 979–1001.
Hall, N. (2004). Two concepts of causation. In J. Collins, N. Hall, and L. A. Paul
(Eds.), Causation and Counterfactuals. Cambridge, Mass.: MIT Press.
Hall, N. (2007). Structural equations and causation. Philosophical Studies 132,
109–136.
Halpern, J. Y. (2008). Defaults and normality in causal structures. In Principles
of Knowledge Representation and Reasoning: Proc. Eleventh International
Conference (KR ’08), pp. 198–208.
Halpern, J. Y. and J. Pearl (2005). Causes and explanations: A structural-model
approach. Part I: Causes. British Journal for Philosophy of Science 56 (4),
843–887.
Hanson, N. R. (1958). Patterns of Discovery. Cambridge, U.K.: Cambridge Uni-
versity Press.
Hart, H. L. A. and T. Honoré (1985). Causation in the Law (second ed.). Oxford,
U.K.: Oxford University Press.
Hiddleston, E. (2005). Causal powers. British Journal for Philosophy of Sci-
ence 56, 27–59.
Hitchcock, C. (2001). The intransitivity of causation revealed in equations and
graphs. Journal of Philosophy XCVIII (6), 273–299.
Hitchcock, C. (2007a). Prevention, preemption, and the principle of sufficient
reason. Philosophical Review 116, 495–532.
24
From C-Believed Propositions to the
Causal Calculator
Vladimir Lifschitz
1 Introduction
Default rules, unlike inference rules of classical logic, allow us to derive a new
conclusion only when it does not conflict with the other available information. The
best known example is the so-called commonsense law of inertia: in the absence
of information to the contrary, properties of the world can be presumed to be the
same as they were in the past. Making the idea of commonsense inertia precise is
known as the frame problem [Shanahan 1997]. Default reasoning is nonmonotonic,
in the sense that we may be forced to retract a conclusion derived using a default
when additional information becomes available.
The idea of a default first attracted the attention of AI researchers in the 1970s.
Developing a formal semantics of defaults turned out to be a difficult task. For
instance, the attempt to describe commonsense inertia in terms of circumscription
outlined in [McCarthy 1986] was unsatisfactory, as we learned from the Yale Shoot-
ing example [Hanks and McDermott 1987].
In this note, we trace the line of work on the semantics of defaults that started
with Judea Pearl’s 1988 paper on the difference between “E-believed” and “C-
believed” propositions. That paper has led other researchers first to the invention
of several theories of nonmonotonic causal reasoning, then to designing action lan-
guages C and C+, and then to the creation of the Causal Calculator—a software
system for automated reasoning about action and change.
almost every default rule falls into one of two categories: expectation-
evoking or explanation-evoking. The former describes association among
events in the outside world (e.g., fire is typically accompanied by smoke);
the latter describes how we reason about the world (e.g., smoke normally
suggests fire).
Thus the rule fire ⇒ smoke is an expectation-evoking, or “causal” default; the rule
smoke ⇒ fire is explanation-evoking, or “evidential.” To take another example,
the rule
(1) rained ⇒ grass wet
is a causal default; the rule
(2) grass wet ⇒ sprinkler on
is an evidential default.
To discuss the distinction between properties of causal and evidential defaults,
Pearl labels believed propositions by distinguishing symbols C and E. A proposi-
tion P is E-believed, written E(P ), if it is a direct consequence of some evidential
rule. Otherwise, if P can be established as a direct consequence of only causal rules,
it is said to be C-believed, written C(P ). The labels are used to prevent certain
types of inference chains; in particular, C-believed propositions are prevented in
Pearl’s paper from triggering evidential defaults. For example, both causal rule (1)
and evidential rule (2) are reasonable, but using them to infer sprinkler on from
rained is not.
We will see that the idea of using the distinguishing symbols C and E had
a significant effect on the study of commonsense reasoning over the next twenty
years.
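As a toy illustration of the labeling discipline (ours, not Pearl's formalism), the following sketch chains rules forward but lets only E-believed propositions trigger evidential rules, so the unwanted inference is blocked:

    # Rules are (body, head, kind); kind is "causal" or "evidential".
    rules = [("rained", "grass_wet", "causal"),            # rule (1)
             ("grass_wet", "sprinkler_on", "evidential")]  # rule (2)

    beliefs = {"rained": "E"}   # assumption: the observed evidence is E-believed
    changed = True
    while changed:
        changed = False
        for body, head, kind in rules:
            if body not in beliefs or head in beliefs:
                continue
            if kind == "evidential" and beliefs[body] == "C":
                continue        # C-believed propositions may not trigger evidential rules
            beliefs[head] = "C" if kind == "causal" else "E"
            changed = True

    print(beliefs)   # {'rained': 'E', 'grass_wet': 'C'} -- sprinkler_on is not inferred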
A rule such as “rain causes the grass to be wet” may thus be expressed
as a sentence
rain → C grass wet,
which can then be read as saying that if rain is true, grass wet is
explained [Geffner 1990].
The paper defined, for a set of axioms of this kind, which propositions are “causally
entailed” by it.
Geffner showed how this modal language can be used to describe effects of actions.
We can express that e(x) is an effect of an action a(x) with precondition p(x) by
the axiom
p(x)t ∧ a(x)t → C e(x)t+1 ,
where p(x)t expresses that fluent p(x) holds at time t, and e(x)t+1 is understood in
a similar way; a(x)t expresses that action a(x) is executed between times t and t+1.
Such axioms explain the value of a fluent at some point in time (t + 1 in the
consequent of the implication) in terms of the past (t in the antecedent). Geffner
gives also an example of explaining the value of a fluent in terms of the values of
other fluents at the same point in time: if all ducts are blocked at time t, that causes
the room to be stuffy at time t. Such “static” causal dependencies are instrumental
when actions with indirect effects are involved. For instance, blocking a duct can
indirectly cause the room to become stuffy. We will see another example of this
kind in the next section.
4 Predicate “Caused”
Fangzhen Lin showed a few years later that the intuitions explored by Pearl and
Geffner can be made precise without introducing a new nonmonotonic semantics.
Circumscription [McCarthy 1986] will do if we employ, instead of the modal oper-
ator C, a new predicate.
Lin acknowledges his intellectual debt to [Pearl 1988] by noting that his approach
echoes the theme of Pearl’s paper—the need for a primitive notion of causality in
default reasoning.
The proposal to circumscribe Caused was a major event in the history of research
on the use of circumscription for solving the frame problem. As we mentioned
before, the original method [McCarthy 1986] turned out to be unsatisfactory; the
improvement described in [Haugh 1987; Lifschitz 1987] is only applicable when
actions have no indirect effects. The method of [Lin 1995] is free of this limitation.
The main example of that paper is a suitcase with two locks and a spring loaded
mechanism that opens the suitcase instantaneously when both locks are in the
up position; opening the suitcase may thus become an indirect effect of toggling
a switch. The static causal relationship between the fluents up(l) and open is
expressed in Lin’s language by the axiom
F → CG,
where F and G do not contain C. (Such formulas are particularly useful; for in-
stance, (3) has this form.) The authors wrote such a formula as
(6) F ⇒ G,
The authors note that the principle of universal causation represents a strong philo-
sophical commitment that is rewarded by the mathematical simplicity of the non-
monotonic semantics that it leads to. The definition of their semantics is indeed
surprisingly simple, or at least short. They note also that in applications this strong
commitment can be easily relaxed.
The extension of [McCain and Turner 1997] described in [Giunchiglia, Lee, Lif-
schitz, McCain, and Turner 2004] allows F and G in (6) to be slightly more general
than propositional formulas, which is convenient when non-Boolean fluents are in-
volved. In the language of that paper we can write, for instance,
(7) a_t ⇒ f_{t+1} = v
6 Action Descriptions
An action description is a formal expression representing a transition system—a
directed graph such that its vertices can be interpreted as states of the world,
with edges corresponding to the transitions caused by the execution of actions.
In [Giunchiglia and Lifschitz 1998], the nonmonotonic causal logic from [McCain
and Turner 1997] was used to define an action description language, called C. The
language C+ [Giunchiglia, Lee, Lifschitz, McCain, and Turner 2004] is an extension
of C that accommodates non-Boolean fluents and is also more expressive in some other
ways.
The distinguishing syntactic feature of action description languages is that they
do not involve symbols for time instants. For example, the counterpart of (7) in C+
is
a causes f = v.
The C+ keyword causes implicitly indicates a shift from the time instant t when the
execution of action a begins to the next time instant t+1 when fluent f is evaluated.
This keyword represents a combination of three elements: material implication, the
Pearl-Geffner causal operator, and time shift.
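Spelling this out with (6) and (7): for every time instant t, the C+ law a causes f = v stands for a_t ⇒ f_{t+1} = v, which in turn abbreviates the causal-logic formula a_t → C(f_{t+1} = v).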
8 Conclusion
As we have seen, Judea Pearl’s idea of labeling the propositions that are derived
using causal rules has suggested to Geffner, Lin and others a way to recast the condition
“G is caused by F.”
Eliminating the binary “is caused by” in favor of the unary “is caused” turned out
to be a remarkably useful technical device.
9 Acknowledgements
Thanks to Selim Erdoğan, Hector Geffner, and Joohyung Lee for comments on
a draft of this note. This work was partially supported by the National Science
Foundation under Grant IIS-0712113.
References
Akman, V., S. Erdoğan, J. Lee, V. Lifschitz, and H. Turner (2004). Representing
the Zoo World and the Traffic World in the language of the Causal Calculator.
Artificial Intelligence 153(1–2), 105–140.
25
Analysis of the Binary Instrumental
Variable Model
Thomas S. Richardson and James M. Robins
1 Introduction
Pearl’s seminal work on instrumental variables [Chickering and Pearl 1996; Balke
and Pearl 1997] for discrete data represented a leap forward in understanding: Pearl
showed that, contrary to what many had supposed based on linear
models, in the discrete case the assumption that a variable was an instrument
could be subjected to empirical test. In addition, Pearl improved on earlier bounds
[Robins 1989] for the average causal effect (ACE) in the absence of any monotonic-
ity assumptions. Pearl’s approach was also innovative insofar as he employed a
computer algebra system to derive analytic expressions for the upper and lower
bounds.
In this paper we build on and extend Pearl’s work in two ways. First we show
the geometry underlying Pearl’s bounds. As a consequence we are able to derive
bounds on the average causal effect for all four compliance types. Our analysis
also makes it possible to perform a sensitivity analysis using the distribution over
compliance types. Second our analysis provides a clear geometric picture of the
instrumental inequalities, and allows us to isolate the counterfactual assumptions
necessary for deriving these tests. This may be seen as analogous to the geometric
study of models for two-way tables [Fienberg and Gilbert 1970; Erosheva 2005].
Among other things this allows us to clarify which are the alternative hypotheses
against which Pearl’s test has power. We also relate these tests to recent work of
Pearl’s on bounding direct effects [Cai, Kuroki, Pearl, and Tian 2008].
2 Background
We consider three binary variables, X, Y and Z, where Z is the treatment assigned,
X is the treatment received, and Y is the response.
For X and Z, we will use 0 to indicate placebo, and 1 to indicate drug. For Y
we take 1 to indicate a desirable outcome, such as survival. Xz is the treatment a
patient would receive if assigned Z = z, and Yx is the response that would result if
treatment X = x were received; we assume the exclusion restriction
Yx,z = Yx (1)
holds
for each patient, so that a patient’s outcome only depends on treatment assigned
via the treatment received. One consequence of the analysis below is that these
equations may be tested separately. We may thus similarly enumerate four types
of patient in terms of their response to received treatment:
(Here and elsewhere we use angle brackets ⟨tX , tY ⟩ to indicate an ordered pair.) Let
πtX ≡ p(tX ) denote the marginal probability of a given compliance type tX ∈ DX ,
and let
πX ≡ {πtX | tX ∈ DX }
denote a marginal distribution on DX . Similarly we use πtY |tX ≡ p(tY | tX ) to
denote the probability of a given response type within the sub-population of in-
dividuals of compliance type tX , and πY |X to indicate a specification of all these
conditional probabilities:
πY |X ≡ {πtY |tX | tX ∈ DX , tY ∈ DY }.
The randomization of Z is expressed by the independence
Z ⊥⊥ {Xz=0 , Xz=1 , Yx=0 , Yx=1 }. (2)
A graph corresponding to the model given by (1) and (2) is shown in Figure 1.
Notation
In places we will make use of the following compact notation for probability distri-
butions:
pyi,xj|zk ≡ p(y = i, x = j | z = k),  pxj|zk ≡ p(x = j | z = k),  pyi|xj,zk ≡ p(y = i | x = j, z = k).
There are several simple geometric constructions that we will use repeatedly. In
consequence we introduce these in a generic setting.
2.1 Joints compatible with fixed margins
Consider a bivariate random variable U = ⟨U1 , U2 ⟩ ∈ {0, 1} × {0, 1}. Now for fixed
c1 , c2 ∈ [0, 1] consider the set
Pc1,c2 = { p | Σu2 p(1, u2 ) = c1 ;  Σu1 p(u1 , 1) = c2 },
in other words, Pc1 ,c2 is the set of joint distributions on U compatible with fixed
margins p(Ui = 1) = ci , i = 1, 2.
It is not hard to see that Pc1 ,c2 is a one-dimensional subset (line segment) of
the 3-dimensional simplex of distributions for U . We may describe it explicitly as
follows:
p(1, 1) = t,
p(1, 0) = c1 − t,
p(0, 1) = c2 − t,
p(0, 0) = 1 − c1 − c2 + t,
where t ∈ [ max{0, (c1 + c2 ) − 1}, min{c1 , c2 } ]. (3)
Figure 2. The four regions corresponding to different supports for t in (3); see Table
1.
See also [Pearl 2000] Theorem 9.2.10. The range of t, or equivalently the support
for p(1, 1), is one of four intervals, as shown in Table 1. These cases correspond to
the four regions shown in Figure 2.

             c1 ≤ 1 − c2                 c1 ≥ 1 − c2
c1 ≤ c2      (i)  t ∈ [0, c1 ]           (ii)  t ∈ [c1 + c2 − 1, c1 ]
c1 ≥ c2      (iii) t ∈ [0, c2 ]          (iv)  t ∈ [c1 + c2 − 1, c2 ]

Table 1. The support for t in (3) in each of the four cases relating c1 and c2 .
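A minimal numerical check of (3) (ours, not the chapter's):

    def joint_range(c1, c2):
        # Feasible range for p(1,1) = t given margins p(U1=1)=c1, p(U2=1)=c2; eq. (3).
        return max(0.0, c1 + c2 - 1.0), min(c1, c2)

    print(joint_range(0.3, 0.4))   # (0.0, 0.3) -- case (i) of Table 1
    print(joint_range(0.8, 0.9))   # (0.7, 0.8) -- case (ii) of Table 1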
Qc,α ≡ { ⟨u, v⟩ ∈ [0, 1]² | αu + (1 − α)v = c },
where c, α ∈ [0, 1]. In words, Qc,α is the set of pairs of values ⟨u, v⟩ in [0, 1] which
are such that the weighted average αu + (1 − α)v is c.
It is simple to see that this describes a line segment in the unit square. Further
consideration shows that for any value of α ∈ [0, 1], the segment will pass through
the point ⟨c, c⟩ and will be contained within the union of two rectangles:
The slope of the line is negative for α ∈ (0, 1). For α ∈ (0, 1) the line segment may
be parametrized as follows:
u = (c − t(1 − α))/α,   v = t,   t ∈ [ max{0, (c − α)/(1 − α)}, min{c/(1 − α), 1} ],
and one endpoint is
⟨u, v⟩ = ⟨ min{c/α, 1}, max{0, (c − α)/(1 − α)} ⟩.
Q(c1,α1)(c2,α2) ≡ { ⟨u, v, w⟩ ∈ [0, 1]³ | α1 u + (1 − α1 )w = c1 , α2 v + (1 − α2 )w = c2 }.
In words, this consists of the set of triples ⟨u, v, w⟩ ∈ [0, 1]³ for which pre-specified
averages of u and w (via α1 ), and v and w (via α2 ) are equal to c1 and c2 respectively.
If this set is not empty, it is a line segment in [0, 1]3 obtained by the intersection
of two rectangles:
( {⟨u, w⟩ ∈ Qc1,α1 } × {v ∈ [0, 1]} ) ∩ ( {⟨v, w⟩ ∈ Qc2,α2 } × {u ∈ [0, 1]} ); (4)
see Figures 4 and 5. For α1 , α2 ∈ (0, 1) we may parametrize the line segment (4) as
follows:
u = (c1 − t(1 − α1 ))/α1 ,   v = (c2 − t(1 − α2 ))/α2 ,   w = t,   t ∈ [tl , tu ],
where
tl ≡ max{ 0, (c1 − α1 )/(1 − α1 ), (c2 − α2 )/(1 − α2 ) },   tu ≡ min{ 1, c1 /(1 − α1 ), c2 /(1 − α2 ) }.
Thus Q(c1,α1)(c2,α2) ≠ ∅ if and only if tl ≤ tu . It follows directly that for fixed
c1 , c2 the set of pairs ⟨α1 , α2 ⟩ ∈ [0, 1]² for which Q(c1,α1)(c2,α2) is not empty may
be characterized thus:
Rc1,c2 ≡ { ⟨α1 , α2 ⟩ | Q(c1,α1)(c2,α2) ≠ ∅ }
       = [0, 1]² ∩ ⋂i∈{1,2}, i∗=3−i { ⟨α1 , α2 ⟩ | (αi − ci )(αi∗ − (1 − ci∗ )) ≤ ci∗ (1 − ci ) }. (5)
Rc1,c2 = [0, 1]² ∩ { ⟨α1 , α2 ⟩ | (αk − ck )(αk∗ − (1 − ck∗ )) ≤ ck∗ (1 − ck ) }.
([0, c1 ] × [0, c2 ] × [c2 , 1]) ∪ ([0, c1 ] × [c2 , 1] × [c1 , c2 ]) ∪ ([c1 , 1] × [c2 , 1] × [0, c1 ]).
Figure 5. Q(c1 ,α1 )(c2 ,α2 ) corresponds to the section of the line between the two
marked points; (a) view towards u-w plane; (b) view from v-w plane. (Here c1 < c2 .)
([0, c1 ] × [0, c2 ] × [c1 , 1]) ∪ ([c1 , 1] × [0, c2 ] × [c2 , c1 ]) ∪ ([c1 , 1] × [c2 , 1] × [0, c2 ]).
(iii) If c1 = c2 = c then the ‘middle’ box disappears and we are left with
In this case the two boxes touch at the point ⟨c, c, c⟩.
Note however, that the number of ‘boxes’ within which the line segment (4) lies
may be 1, 2 or 3 (or 0 if Q(c1 ,α1 )(c2 ,α2 ) = ∅). This is in contrast to the simpler
case considered in §2.2 where the line segment Qc,α always intersected exactly two
rectangles; see Figure 3.
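A small sketch (ours, not the chapter's) of the non-emptiness test for Q(c1,α1)(c2,α2):

    def t_range(c1, a1, c2, a2):
        # Endpoints [tl, tu] of the segment (4); assumes 0 < a1, a2 < 1.
        # Q(c1,a1)(c2,a2) is non-empty iff tl <= tu.
        tl = max(0.0, (c1 - a1) / (1 - a1), (c2 - a2) / (1 - a2))
        tu = min(1.0, c1 / (1 - a1), c2 / (1 - a2))
        return tl, tu

    print(t_range(0.6, 0.5, 0.3, 0.5))   # (0.2, 0.6): non-empty
    print(t_range(0.1, 0.8, 0.9, 0.2))   # (0.875, 0.5): tl > tu, so the segment is empty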
Figure 6. Rc1 ,c2 corresponds to the shaded region. The hyperbola of which one
arm forms a boundary of this region corresponds to the active constraint; the other
hyperbola to the inactive constraint.
2. Next we use the technique used for Step 1 to reduce the problem of character-
izing distributions πY |X compatible with p(x, y | z) to that of characterizing
the values of p(yx = 1 | tX ) compatible with p(x, y | z).
5. Finally we describe the values for p(yx = 1 | tX ) compatible with the distri-
butions π over DX found at the previous step.
that are compatible with p(x, y | z). Note that p(yx=i = 1 | tX ) is written as
p(y = 1 | do(x = i), tX ) using Pearl’s do(·) notation. It is then clear that the set of
possible distributions πY |X that are compatible with p(x, y | z) simply follows from
the analysis in §2.1, since
γ⁰_tX = p(yx=0 = 1 | tX ) = p(HU | tX ) + p(AR | tX ),
γ¹_tX = p(yx=1 = 1 | tX ) = p(HE | tX ) + p(AR | tX ).
It follows from the discussion at the end of §2.1 that the values of γ⁰_tX and γ¹_tX are
not restricted by the requirement that there exists a distribution p(· | tX ) on DY .
Consequently we may proceed in two steps: first we derive the set of values for the
eight parameters {γⁱ_tX } and the distribution πX (jointly) without consideration
of the parameters for πY |X ; second we then derive the parameters πY |X , as described
above.
Finally we note that many causal quantities of interest, such as the average causal
effect (ACE), and relative risk (RR) of X on Y , for a given response type tX , may
We will call a specification of values for πX feasible for the observed distribution if
(a) πX lies within the set described in §3.1 of distributions compatible with p(x | z)
and (b) there exists a set of values for γⁱ_tX which results in the distribution p(y | x, z).
In the next section we give an explicit characterization of the set of feasible
distributions πX ; in this section we characterize the set of values of γⁱ_tX compatible
with a fixed feasible distribution πX and p(y | x, z).
PROPOSITION 1. The following equations relate πX , γ⁰_CO , γ⁰_DE , γ⁰_NT to p(y | x = 0, z):
p(y = 1 | x = 0, z = 0) = (γ⁰_CO πCO + γ⁰_NT πNT )/(πCO + πNT ), (8)
p(y = 1 | x = 0, z = 1) = (γ⁰_DE πDE + γ⁰_NT πNT )/(πDE + πNT ), (9)
Similarly, the following relate πX , γ¹_CO , γ¹_DE , γ¹_AT to p(y | x = 1, z):
p(y = 1 | x = 1, z = 0) = (γ¹_DE πDE + γ¹_AT πAT )/(πDE + πAT ), (10)
p(y = 1 | x = 1, z = 1) = (γ¹_CO πCO + γ¹_AT πAT )/(πCO + πAT ). (11)
Equations (8)–(11) are represented in Figure 9. Note that the parameters γ⁰_AT and
γ¹_NT are completely unconstrained by the observed distribution since they describe,
respectively, the effect of non-exposure (X = 0) on Always Takers, and exposure
(X = 1) on Never Takers, neither of which ever occur. Consequently, the set
of possible values for each of these parameters is always [0, 1]. Graphically this
corresponds to the disconnection of γ⁰_AT and γ¹_NT from the remainder of the graph.
As shown in Proposition 1 the remaining six parameters may be divided into two
groups, {γ⁰_NT , γ⁰_DE , γ⁰_CO } and {γ¹_AT , γ¹_DE , γ¹_CO }, depending on whether they relate to
unexposed subjects, or exposed subjects. Furthermore, as the graph indicates, for
a fixed feasible value of πX , compatible with the observed distribution p(x, y | z)
(assuming such exists), these two sets are variation independent. Thus, for a fixed
feasible value of πX we may analyze each of these sets separately.
A geometric picture of equations (8)–(11) is given in Figure 10: there is one square
for each compliance type, with axes corresponding to γ⁰_tX and γ¹_tX ; the specific value
of ⟨γ⁰_tX , γ¹_tX ⟩ is given by a cross in the square. There are four lines corresponding
to the four observed quantities p(y = 1 | x, z). Each of these observed quantities,
which is denoted by a cross on the respective line, is a weighted average of two γⁱ_tX
parameters, with weights given by πX (the weights are not depicted explicitly).

Figure 10. Geometric picture illustrating the relation between the γⁱ_tX parameters
and p(y | x, z). See also Figure 9.
Proof of Proposition 1: We prove (8); the other proofs are similar. Subjects for
Here the first equality is by the chain rule of probability; the second follows by
consistency; the third follows since Compliers and Never Takers have X = 0 when
Z = 0; the fourth follows by randomization (2). 2
Values for γ⁰_CO , γ⁰_DE , γ⁰_NT compatible with a feasible πX
Since (8) and (9) correspond to three quantities with two averages specified, we may
apply the analysis in §2.3, taking α1 = πCO /(πCO + πNT ), α2 = πDE /(πDE + πNT ),
ci = p(y = 1 | x = 0, z = i − 1) for i = 1, 2, u = γ⁰_CO , v = γ⁰_DE and w = γ⁰_NT .
Under this substitution, the set of possible values for ⟨γ⁰_CO , γ⁰_DE , γ⁰_NT ⟩ is then given
by Q(c1,α1)(c2,α2) .
Values for γ¹_CO , γ¹_DE , γ¹_AT compatible with a feasible πX
Likewise since (10) and (11) contain three quantities with two averages specified we
again apply the analysis from §2.3, taking α1 = πCO /(πCO + πAT ), α2 = πDE /(πDE + πAT ),
ci = p(y = 1 | x = 1, z = 2 − i) for i = 1, 2, u = γ¹_CO , v = γ¹_DE and w = γ¹_AT .
The set of possible values for ⟨γ¹_CO , γ¹_DE , γ¹_AT ⟩ is then given by Q(c1,α1)(c2,α2) .
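A small sketch (ours, not the chapter's) of the first substitution, giving the range of γ⁰_NT for one illustrative (made-up) feasible πX and observed p(y = 1 | x = 0, z); it simply instantiates the [tl, tu] formulas of §2.3:

    def gamma0_NT_range(pi_CO, pi_DE, pi_NT, p_y1_x0z0, p_y1_x0z1):
        # Substitution from the text: c_i = p(y=1 | x=0, z=i-1), alphas as below.
        a1 = pi_CO / (pi_CO + pi_NT)
        a2 = pi_DE / (pi_DE + pi_NT)
        c1, c2 = p_y1_x0z0, p_y1_x0z1
        tl = max(0.0, (c1 - a1) / (1 - a1), (c2 - a2) / (1 - a2))   # t_l of section 2.3
        tu = min(1.0, c1 / (1 - a1), c2 / (1 - a2))                 # t_u of section 2.3
        return tl, tu

    print(gamma0_NT_range(0.3, 0.1, 0.5, 0.7, 0.65))   # roughly (0.58, 0.78)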
compatible with p(y | x = 0, z) (i.e. for which the corresponding set of values for
⟨γ⁰_CO , γ⁰_DE , γ⁰_NT ⟩ is non-empty) is given by Rc∗1,c∗2 , where c∗i = p(y = 1 | x = 0, z =
i − 1), i = 1, 2 (see §2.3). The inequalities defining Rc∗1,c∗2 may be translated into
upper bounds on t ≡ πAT in (6), as follows:
t ≤ min{ 1 − Σj∈{0,1} p(y = j, x = 0 | z = j), 1 − Σk∈{0,1} p(y = k, x = 0 | z = 1 − k) }. (13)
Proof: The analysis in §3.3 implied that for Rc∗1,c∗2 ≠ ∅ we require
(c∗1 − α1 )/(1 − α1 ) ≤ c∗2 /(1 − α2 )   and   (c∗2 − α2 )/(1 − α2 ) ≤ c∗1 /(1 − α1 ). (14)
Taking the first of these and plugging in the definitions of c∗1 , c∗2 , α1 and α2 from
(12) gives:
(py1|x0,z0 − (πCO /px0|z0 )) / (1 − (πCO /px0|z0 )) ≤ py1|x0,z1 / (1 − (πDE /px0|z1 ))
(⇔) (py1|x0,z0 − (πCO /px0|z0 ))(1 − (πDE /px0|z1 )) ≤ py1|x0,z1 (1 − (πCO /px0|z0 ))
(⇔) (py1,x0|z0 − πCO )(px0|z1 − πDE ) ≤ py1,x0|z1 (px0|z0 − πCO ).
But px0 |z1 − πDE = px0 |z0 − πCO = πNT , hence these terms may be cancelled to
give:
as required. 2
The proof that these inequalities are implied is very similar to the derivation of the
upper bounds on πAT arising from p(y | x = 0, z) considered above.
where
uπAT = min{ p(x = 1 | z = 0), p(x = 1 | z = 1),
            1 − Σj p(y = j, x = 0 | z = j), 1 − Σk p(y = k, x = 0 | z = 1 − k),
            Σj p(y = j, x = 1 | z = j), Σk p(y = k, x = 1 | z = 1 − k) }.
Observe that unlike the upper bound, the lower bound on πAT (and πNT ) obtained
from p(x, y | z) is the same as the lower bound derived from p(x | z) alone.
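As a concrete check (ours, not the chapter's), these lower and upper bounds on πAT can be computed directly from the flu vaccine counts of Table 3 below; the result, roughly [0, 0.19], matches the range of the πAT axes shown in Figures 11–13:

    # Counts from Table 3 (below), indexed as counts[z][x][y].
    counts = {0: {0: {0: 99, 1: 1027}, 1: {0: 30, 1: 233}},
              1: {0: {0: 84, 1: 935}, 1: {0: 31, 1: 422}}}
    n = {z: sum(counts[z][x][y] for x in (0, 1) for y in (0, 1)) for z in (0, 1)}
    pxy_z = lambda y, x, z: counts[z][x][y] / n[z]                       # p(y, x | z)
    px_z = lambda x, z: sum(counts[z][x][y] for y in (0, 1)) / n[z]      # p(x | z)

    l_AT = max(0.0, px_z(1, 0) + px_z(1, 1) - 1.0)
    u_AT = min(px_z(1, 0), px_z(1, 1),
               1 - sum(pxy_z(j, 0, j) for j in (0, 1)),
               1 - sum(pxy_z(k, 0, 1 - k) for k in (0, 1)),
               sum(pxy_z(j, 1, j) for j in (0, 1)),
               sum(pxy_z(k, 1, 1 - k) for k in (0, 1)))
    print(round(l_AT, 3), round(u_AT, 3))   # approximately 0.0 and 0.189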
We define πX (πAT ) ≡ ⟨πNT (πAT ), πCO (πAT ), πDE (πAT ), πAT ⟩, for use below. Note
the following:
PROPOSITION 2. When πAT (equivalently πNT ) is minimized then either πNT = 0
or πAT = 0.
Proof: This follows because, by the expression for lπAT , either lπAT = 0, or lπAT =
p(x = 1 | z = 0) + p(x = 1 | z = 1) − 1, in which case lπNT = 0 by (16). 2
4 Projections
The analysis in §3 provides a complete description of the set of distributions over
D compatible with a given observed distribution. In particular, equation (16) de-
scribes the one dimensional set of compatible distributions over DX ; in §3.3 we
first gave a description of the one dimensional set of values over ⟨γ⁰_CO , γ⁰_DE , γ⁰_NT ⟩
compatible with the observed distribution and a specific feasible distribution πX
over DX ; we then described the one dimensional set of values for ⟨γ¹_CO , γ¹_DE , γ¹_AT ⟩.
Varying πX over the set PX of feasible distributions over DX , describes a set of
lines, forming two two-dimensional manifolds which represent the space of possible
values for ⟨γ⁰_CO , γ⁰_DE , γ⁰_NT ⟩ and likewise for ⟨γ¹_CO , γ¹_DE , γ¹_AT ⟩. As noted previously,
the parameters γ⁰_AT and γ¹_NT are unconstrained by the observed data. Finally, if
there is interest in distributions over response types, there is a one-dimensional set
of such distributions associated with each possible pair of values from γ⁰_tX and γ¹_tX .
αⁱʲ_tX (πX ) ≡ p(tX | Xz=i = j),
while if πNT = 0 then we define lγ⁰_NT (πX ) ≡ 0 and uγ⁰_NT (πX ) ≡ 1. Similarly, if πX
is such that πAT > 0 then we define:
lγ¹_AT (πX ) ≡ max{ 0, (py1|x1z1 − α¹¹_CO (πX ))/α¹¹_AT (πX ), (py1|x1z0 − α⁰¹_DE (πX ))/α⁰¹_AT (πX ) },
γ⁰_NT ∈ [ lγ⁰_NT (πX ), uγ⁰_NT (πX ) ]
γ⁰_CO ∈ [ (py1|x0z0 − uγ⁰_NT (πX ) · α⁰⁰_NT )/α⁰⁰_CO , (py1|x0z0 − lγ⁰_NT (πX ) · α⁰⁰_NT )/α⁰⁰_CO ]
γ⁰_DE ∈ [ (py1|x0z1 − uγ⁰_NT (πX ) · α¹⁰_NT )/α¹⁰_DE , (py1|x0z1 − lγ⁰_NT (πX ) · α¹⁰_NT )/α¹⁰_DE ]
γ⁰_AT ∈ [ 0, 1 ]
γ¹_NT ∈ [ 0, 1 ]
γ¹_CO ∈ [ (py1|x1z1 − uγ¹_AT (πX ) · α¹¹_AT )/α¹¹_CO , (py1|x1z1 − lγ¹_AT (πX ) · α¹¹_AT )/α¹¹_CO ]
γ¹_DE ∈ [ (py1|x1z0 − uγ¹_AT (πX ) · α⁰¹_AT )/α⁰¹_DE , (py1|x1z0 − lγ¹_AT (πX ) · α⁰¹_AT )/α⁰¹_DE ]
γ¹_AT ∈ [ lγ¹_AT (πX ), uγ¹_AT (πX ) ]

Table 2. Lower and upper bounds on the γⁱ_tX parameters as functions of a feasible πX .
We note that the bounds on γˣ_CO and γˣ_DE need not be monotonic in πAT .
PROPOSITION 4. Let πXᵐⁱⁿ be the distribution in PX for which πAT and πNT are
minimized; then either:
(1) πNT = 0, hence lγ⁰_NT (πXᵐⁱⁿ ) = 0 and uγ⁰_NT (πXᵐⁱⁿ ) = 1; or
(2) πAT = 0, hence lγ¹_AT (πXᵐⁱⁿ ) = 0 and uγ¹_AT (πXᵐⁱⁿ ) = 1.
Proof: This follows from Proposition 2, and the fact that if πtX = 0 then γⁱ_tX is not
identified (for any i). 2
4.2 Upper and Lower bounds on p(AT) as a function of γ⁰_NT
The expressions given in Table 2 allow the range of values for each γⁱ_tX to be
determined as a function of πX , giving the upper and lower bounding curves in
Figure 11. However it follows directly from (8) and (9) that there is a bijection
between the three shapes shown for γ⁰_CO , γ⁰_DE and γ⁰_NT (top row of Figure 11).
similarly we have
uπAT (γ¹_AT ) ≡ max{ πAT | γ¹_AT ∈ [ lγ¹_AT (πX ), uγ¹_AT (πX ) ] }
= min{ uπAT , px1|z1 py1|x1z1 /γ¹_AT , px1|z0 py1|x1z0 /γ¹_AT , px1|z1 py0|x1z1 /(1 − γ¹_AT ), px1|z0 py0|x1z0 /(1 − γ¹_AT ) }.
The curves added to the unexposed plots for Compliers and Defiers in Figure 11
are as follows:
γ⁰_CO (πX , γ⁰_NT ) ≡ (py1|x0z0 − γ⁰_NT · α⁰⁰_NT )/α⁰⁰_CO ,
cγ⁰_CO (πAT , γ⁰_NT ) ≡ { ⟨πAT , γ⁰_CO (πX (πAT ), γ⁰_NT )⟩ }; (17)
γ⁰_DE (πX , γ⁰_NT ) ≡ (py1|x0z1 − γ⁰_NT · α¹⁰_NT )/α¹⁰_DE ,
cγ⁰_DE (πAT , γ⁰_NT ) ≡ { ⟨πAT , γ⁰_DE (πX (πAT ), γ⁰_NT )⟩ }; (18)
for γ⁰_NT ∈ [ lγ⁰_NT (πXᵐⁱⁿ ), uγ⁰_NT (πXᵐⁱⁿ ) ]; πAT ∈ [ lπAT , uπAT (γ⁰_NT ) ]. The curves added
Table 3. Flu Vaccine Data from [McDonald, Hiu, and Tierney 1992].
Z   X   Y    count
0   0   0       99
0   0   1     1027
0   1   0       30
0   1   1      233
1   0   0       84
1   0   1      935
1   1   0       31
1   1   1      422
Total         2,861
to the exposed plots for Compliers and Defiers in Figure 11 are given by:
γ¹_CO (πX , γ¹_AT ) ≡ (py1|x1z1 − γ¹_AT · α¹¹_AT )/α¹¹_CO ,
cγ¹_CO (πAT , γ¹_AT ) ≡ { ⟨πAT , γ¹_CO (πX (πAT ), γ¹_AT )⟩ }; (19)
γ¹_DE (πX , γ¹_AT ) ≡ (py1|x1z0 − γ¹_AT · α⁰¹_AT )/α⁰¹_DE ,
cγ¹_DE (πAT , γ¹_AT ) ≡ { ⟨πAT , γ¹_DE (πX (πAT ), γ¹_AT )⟩ }; (20)
for γ¹_AT ∈ [ lγ¹_AT (πXᵐⁱⁿ ), uγ¹_AT (πXᵐⁱⁿ ) ]; πAT ∈ [ lπAT , uπAT (γ¹_AT ) ].
Figure 11. Depiction of the set of values for πX vs. ⟨γ⁰_CO , γ⁰_DE , γ⁰_NT ⟩ (upper row)
and ⟨γ¹_CO , γ¹_DE , γ¹_AT ⟩ (lower row) for the flu data. The panels show the possible
values of Pr[Y = 1 | do(X = x), tX ] for each compliance type; along the horizontal
axis the compliance type proportions vary with πAT (AT from 0 to 0.19, CO from
0.31 to 0.12, DE from 0.19 to 0, NT from 0.5 to 0.69).
Following [Pearl 2000; Robins 1989; Manski 1990; Robins and Rotnitzky 2004]
we also consider the average causal effect on the entire population:
ACEglobal (πX , {γˣ_tX }) ≡ ΣtX ∈DX (γ¹_tX (πX ) − γ⁰_tX (πX )) πtX ;
upper and lower bounds taken over {γˣ_tX } are defined similarly. The bounds given
for ACEtX in Table 5 are an immediate consequence of equations (8)–(11) which
relate p(y | x, z) to πX and {γˣ_tX }. Before deriving the ACE bounds we need the
following observation:
LEMMA 5. For a given feasible πX and p(y, x | z),
Proof: (21) follows from the definition of ACEglobal and the observation that
py1,x1|z1 = πCO γ¹_CO + πAT γ¹_AT and py1,x0|z0 = πCO γ⁰_CO + πNT γ⁰_NT . The proof of
(22) is similar.
2
PROPOSITION 6. For a given feasible πX and p(y, x | z), the compatible distribu-
tion which minimizes [maximizes] ACEglobal has
ACENT :  lower  0 − uγ⁰_NT (πX ),   upper  1 − lγ⁰_NT (πX )
ACECO :  lower  lγ¹_CO (πX ) − uγ⁰_CO (πX ) = γ¹_CO (πX , uγ¹_AT (πX )) − γ⁰_CO (πX , lγ⁰_NT (πX )),
         upper  uγ¹_CO (πX ) − lγ⁰_CO (πX ) = γ¹_CO (πX , lγ¹_AT (πX )) − γ⁰_CO (πX , uγ⁰_NT (πX ))
ACEDE :  lower  lγ¹_DE (πX ) − uγ⁰_DE (πX ) = γ¹_DE (πX , uγ¹_AT (πX )) − γ⁰_DE (πX , lγ⁰_NT (πX )),
         upper  uγ¹_DE (πX ) − lγ⁰_DE (πX ) = γ¹_DE (πX , lγ¹_AT (πX )) − γ⁰_DE (πX , uγ⁰_NT (πX ))
ACEAT :  lower  lγ¹_AT (πX ) − 1,   upper  uγ¹_AT (πX ) − 0
ACEglobal :  lower  py1,x1|z1 − py1,x0|z0 + πDE · lACEDE (πX ) − πAT
                    = py1,x1|z0 − py1,x0|z1 + πCO · lACECO (πX ) − πAT ,
             upper  py1,x1|z1 − py1,x0|z0 + πDE · uACEDE (πX ) + πNT
                    = py1,x1|z0 − py1,x0|z1 + πCO · uACECO (πX ) + πNT

Table 4. Upper and Lower bounds on average causal effects for different groups, as
a function of a feasible πX . Here πᶜ_NT ≡ 1 − πNT .
⟨γ⁰_NT , γ¹_AT ⟩ = ⟨lγ⁰_NT , uγ¹_AT ⟩ [⟨uγ⁰_NT , lγ¹_AT ⟩] and ⟨γ¹_NT , γ⁰_AT ⟩ = ⟨0, 1⟩ [⟨1, 0⟩], and
thus also minimizes [maximizes] ACECO and ACEDE , and conversely maximizes
[minimizes] ACEAT and ACENT .
Proof: The claims follow from equations (21) and (22), together with the fact that
γ⁰_AT and γ¹_NT are unconstrained, so ACEglobal is minimized by taking γ⁰_AT = 1 and
γ¹_NT = 0, and maximized by taking γ⁰_AT = 0 and γ¹_NT = 1. 2
It is of interest here that although the definition of ACEglobal treats the four
compliance types symmetrically, the compatible distribution which minimizes [max-
imizes] this quantity (for a given πX ) does not: it always corresponds to the scenario
in which the treatment has the smallest [greatest] effect on Compliers and Defiers.
The bounds on the global ACE for the flu vaccine data of [Hirano, Imbens, Rubin,
and Zhou 2000] are shown in Figure 13.
Finally we note that it would be simple to develop similar bounds for other
measures such as the Causal Relative Risk and Causal Odds Ratio.
Figure 12. Depiction of the set of values for πX vs. ACEtX (πX ) for tX ∈ DX for
the flu data.
6 Instrumental inequalities
The expressions involved in the upper bound on πAT in (16) appear similar to those
which occur in Pearl’s instrumental inequalities. Here we show that the requirement
that PX 6= ∅, or equivalently, lπAT ≤ uπAT is in fact equivalent to the instrumental
inequality. This also provides an interpretation as to what may be inferred from
the violation of a specific inequality.
THEOREM 7. The following conditions place equivalent restrictions on p(x | z)
and p(y | x = 0, z):
Similarly, the following place equivalent restrictions on p(x | z) and p(y | x = 1, z):
Figure 13. Depiction of the set of values for πX vs. the global ACE for the flu data.
The horizontal lines represent the overall bounds on the global ACE due to Pearl.
Thus the instrumental inequality (a2) corresponds to the requirement that the
upper bounds on p(AT) resulting from p(x | z) and p(y = 1 | x = 0, z) be greater
than the lower bound on p(AT) (derived solely from p(x | z)). Similarly for (b2)
and the upper bounds on p(AT) resulting from p(y = 1 | x = 1, z).
which always holds. By a symmetric argument we can show that it always holds
that:
1 − Σj p(y = j, x = 0 | z = 1 − j) ≥ Σj p(x = 1 | z = j) − 1.
[(b1) ⇔ (b2)] It is clear that neither of the sums on the RHS of (b1) are negative,
hence if (b1) does not hold then max{0, p(x = 1 | z = 0) + p(x = 1 | z = 1) − 1} =
Σj p(x = 1 | z = j) − 1. Now
Σj p(y = j, x = 1 | z = j) < Σj p(x = 1 | z = j) − 1
⇔ 1 < Σj p(y = j, x = 1 | z = 1 − j).
Likewise
Σj p(y = j, x = 1 | z = 1 − j) < Σj p(x = 1 | z = j) − 1
⇔ 1 < Σj p(y = j, x = 1 | z = j).
This equivalence should not be seen as surprising since [Bonet 2001] states that
the instrument inequalities (a2) and (b2) are sufficient for a distribution to be
compatible with the binary IV model. This is not the case if, for example, X takes
more than 2 states.
Takers, but not for Compliers. It should be noted that tests of the instrument
inequalities have no power to detect failures of the exclusion restriction for Compliers
or Defiers.
We first prove the following Lemma, which also provides another characterization
of the instrument inequalities:
LEMMA 9. Suppose (RX) holds and Y ⊥⊥ Z | tX = NT; then (a2) holds. Similarly,
if (RX) holds and Y ⊥⊥ Z | tX = AT then (b2) holds.
Note that the conditions in the antecedent make no assumption regarding the exis-
tence of counterfactuals for Y .
Proof: We prove the result for Never Takers; the other proof is similar. By hypoth-
esis we have:
p(Y = 1 | Z = 0, tX = NT) = p(Y = 1 | Z = 1, tX = NT) ≡ γ⁰_NT . (23)
In addition,
p(Y = 1 | Z = 0, X = 0)
= p(Y = 1 | Z = 0, X = 0, Xz=0 = 0)
= p(Y = 1 | Z = 0, Xz=0 = 0)
The first three equalities here follow from consistency, the definition of the compli-
ance types and the law of total probability. The final equality uses (RX). Similarly,
it may be shown that
p(Y = 1 | Z = 1, X = 0)
Equations (24) and (25) specify two averages of three quantities, thus taking
u = p(Y = 1 | Z = 0, tX = CO), v = p(Y = 1 | Z = 1, tX = DE) and w = γ⁰_NT , we
may apply the analysis of §2.3. This then leads to the upper bound on πAT given
by equation (15). (Note that the lower bounds on πAT are derived from p(x | z)
and hence are unaffected by dropping the exclusion restriction.) The requirement
that there exist some feasible distribution πX then implies equation (a2) which is
shown in Theorem 7 to be equivalent to (b2) as required. 2
Figure 14. Graphical representation of the model given by the randomization as-
sumption (2) alone. It is no longer assumed that Z does not have a direct effect on
Y.
Proof of Theorem 8: We establish that (RX), (RYX=0 ), (EX=0 ) ⇒ (a2). The proof
of the other implication is similar. By Lemma 9 it is sufficient to establish that
Y ⊥⊥ Z | tX = NT.
p(Y = 1 | Z = 0, tX = NT)
= p(Y = 1 | Z = 0, X = 0, tX = NT) definition of NT;
= p(Yx=0,z=0 = 1 | Z = 0, X = 0, tX = NT) consistency;
= p(Yx=0,z=0 = 1 | Z = 0, tX = NT) definition of NT;
= p(Yx=0,z=0 = 1 | tX = NT) by (RYX=0 );
= p(Yx=0,z=1 = 1 | tX = NT) by (EX=0 );
= p(Yx=0,z=1 = 1 | Z = 1, tX = NT) by (RYX=0 );
= p(Y = 1 | Z = 1, tX = NT) consistency, NT.
2
A similar result is given in [Cai, Kuroki, Pearl, and Tian 2008], who consider the
Average Controlled Direct Effect, given by:
ACDE(x) ≡ p(Yx,z=1 = 1) − p(Yx,z=0 = 1),
under the model given solely by the equation (2), which corresponds to the graph
in Figure 14. Cai et al. prove that under this model the following bounds obtain:
It is simple to see that ACDE(x) will be bounded away from 0 for some x iff one
of the instrumental inequalities is violated. This is as we would expect: the IV
model of Figure 1 is a sub-model of Figure 14, but if ACDE(x) is bounded away
from 0 then the Z → Y edge is present, and hence the exclusion restriction (1) is
incompatible with the observed distribution.
Figure 15. Illustration of the possible values for p(y | x, z) compatible with the
instrument inequalities, for a given distribution p(x | z). The darker shaded region
satisfies the inequalities: (a) X = 0, inequalities (a2); (b) X = 1, inequalities
(b2). In this example p(x = 1 | z = 0) = 0.84, p(x = 1 | z = 1) = 0.32. Since
0.84/(1 − 0.32) > 1, (a2) is trivially satisfied; see proof of Theorem 10.
giving two half-planes in (θ00 , θ10 )-space (see Figure 15(a)). Since the lines defin-
ing the half-planes are parallel, it is sufficient to show that the half-planes always
intersect, and hence that the regions in which (a2.1) and (a2.2) are violated are
disjoint. However, this is immediate since the (non-empty) set of points for which
θ10 · px0 |z1 − θ00 · px0 |z0 = 0 always satisfy both inequalities.
The proof that at most one of (b2.1) and (b2.2) may be violated is symmetric.
We now show that the inequalities (a2.1) and (a2.2) place non-trivial restric-
tions on (θ00 , θ10 ) iff (b2.1) and (b2.2) place trivial restrictions on (θ01 , θ11 ). The
line corresponding to (a2.1) passes through (θ00 , θ10 ) = (−px1 |z0 /px0 |z0 , 0) and
(0, px1 |z0 /px0 |z1 ); since the slope of the line is non-negative, it has non-empty inter-
section with [0, 1]2 iff px1 |z0 /px0 |z1 ≤ 1. Thus there are values of (θ01 , θ11 ) ∈ [0, 1]2
which fail to satisfy (a2.1) iff px1 |z0 /px0 |z1 < 1. By a similar argument it may
be shown that (a2.2) is non-trivial iff px1 |z1 /px0 |z0 < 1, which is equivalent to
px1 |z0 /px0 |z1 < 1.
The proof is completed by showing that (b2.1) and (b2.2) are non-trivial if and
only if px1 |z0 /px0 |z1 > 1. 2
For the data in Table 3, all of the instrument inequalities hold. Consequently there is
no evidence of a direct effect of Z on Y . (Again we emphasize that unlike [Hirano,
Imbens, Rubin, and Zhou 2000], we are not using any information on baseline
covariates in the analysis.) Finally we note that, since all of the instrumental
inequalities hold, maximum likelihood estimates for the distribution p(x, y | z) under
the IV model are given by the empirical distribution. However, if one of the IV
inequalities were to be violated then the MLE would not be equal to the empirical
distribution, since the latter would not be a law within the IV model. In such a
circumstance a fitting procedure would be required; see [Ramsahai 2008, Ch. 5].
7 Conclusion
We have built upon and extended the work of Pearl, displaying how the range of
possible distributions over types compatible with a given observed distribution may
be characterized and displayed geometrically. Pearl’s bounds on the global ACE
are sometimes objected to on the grounds that they are too extreme, since for
example, the upper bound presupposes a 100% success rate among Never Takers if
they were somehow to receive treatment, likewise a 100% failure rate among Always
Takers were they not to receive treatment. Our analysis provides a framework for
performing a sensitivity analysis. Lastly, our analysis relates the IV inequalities to
the bounds on direct effects.
Acknowledgements
This research was supported by the U.S. National Science Foundation (CRI 0855230)
and U.S. National Institutes of Health (R01 AI032475) and Jesus College, Oxford
where Thomas Richardson was a Visiting Senior Research Fellow in 2008. The
authors used Avitzur’s Graphing Calculator software (www.pacifict.com) to con-
struct two and three dimensional plots. We thank McDonald, Hiu and Tierney for
giving us permission to use their flu vaccine data.
References
Balke, A. and J. Pearl (1997). Bounds on treatment effects from studies with
imperfect compliance. Journal of the American Statistical Association 92,
1171–1176.
Bonet, B. (2001). Instrumentality tests revisited. In Proceedings of the 17th Con-
ference on Uncertainty in Artificial Intelligence, pp. 48–55.
Cai, Z., M. Kuroki, J. Pearl, and J. Tian (2008). Bounds on direct effects in the
presence of confounded intermediate variables. Biometrics 64, 695–701.
Chickering, D. and J. Pearl (1996). A clinician’s tool for analyzing non-
compliance. In AAAI-96 Proceedings, pp. 1269–1276.
Erosheva, E. A. (2005). Comparing latent structures of the Grade of Membership,
Rasch, and latent class models. Psychometrika 70, 619–628.
Fienberg, S. E. and J. P. Gilbert (1970). The geometry of a two by two contin-
gency table. Journal of the American Statistical Association 65, 694–701.
Hirano, K., G. W. Imbens, D. B. Rubin, and X.-H. Zhou (2000). Assessing the
effect of an influenza vaccine in an encouragement design. Biostatistics 1 (1),
69–88.
Manski, C. (1990). Non-parametric bounds on treatment effects. American Eco-
nomic Review 80, 351–374.
McDonald, C., S. Hiu, and W. Tierney (1992). Effects of computer reminders for
influenza vaccination on morbidity during influenza epidemics. MD Comput-
ing 9, 304–312.
Pearl, J. (2000). Causality. Cambridge, UK: Cambridge University Press.
Ramsahai, R. (2008). Causal Inference with Instruments and Other Supplemen-
tary Variables. Ph. D. thesis, University of Oxford, Oxford, UK.
Robins, J. (1989). The analysis of randomized and non-randomized AIDS treat-
ment trials using a new approach to causal inference in longitudinal studies.
In L. Sechrest, H. Freeman, and A. Mulley (Eds.), Health Service Research
Methodology: A focus on AIDS. Washington, D.C.: U.S. Public Health Ser-
vice.
Robins, J. and A. Rotnitzky (2004). Estimation of treatment effects in randomised
trials with non-compliance and a dichotomous outcome using structural mean
models. Biometrika 91 (4), 763–783.
26
Pearl Causality and the Value of Control
Ross Shachter and David Heckerman
1 Introduction
We welcome this opportunity to acknowledge the significance of Judea Pearl’s con-
tributions to uncertain reasoning and in particular to his work on causality. In
the decision analysis community causality had long been “taboo” even though it
provides a natural framework to communicate with decision makers and experts
[Shachter and Heckerman 1986]. Ironically, while many of the concepts and meth-
ods of causal reasoning are foundational to decision analysis, scholars went to great
lengths to avoid causal terminology in their work. Judea Pearl’s work is helping
to break this barrier, allowing the exploration of some fundamental principles. We
were inspired by his work to understand exactly what assumptions are being made
in his causal models, and we would like to think that our subsequent insights have
contributed to his and others’ work as well.
In this paper, we revisit our previous work on how a decision analytic perspective
helps to clarify some of Pearl’s notions, such as those of the do operator and atomic
intervention. In addition, we show how influence diagrams [Howard and Matheson
1984] provide a general graphical representation for cause. Decision analysis can be
viewed simply as determining what interventions we want to make in the world to
improve the prospects for us and those we care about, an inherently causal concept.
As we shall discuss, causal models are naturally represented within the framework
of decision analysis, although the causal aspects of issues about counterfactuals and
causal mechanisms that arise in computing the value of clairvoyance [Howard 1990],
were first presented by Heckerman and Shachter [1994, 1995]. We show how this
perspective helps clarify decision-analytic measures of sensitivity, such as the value
of control and the value of revelation [Matheson 1990; Matheson and Matheson
2005].
2 Decision-Theoretic Foundations
In this section we introduce the relevant concepts from [Heckerman and Shachter
1995], the framework for this paper, along with some extensions to those concepts.
Our approach rests on a simple but powerful primitive concept of unresponsive-
ness. An uncertain variable is unresponsive to a set of decisions if its value is
unaffected by our choice for the decisions. It is unresponsive to those decisions in
worlds limited by other variables if the decisions cannot affect the uncertain variable
without also changing one of the other variables.
including any decisions that are direct interventions on it, and a mapping variable.
As an example, the influence diagram shown in Figure 3b is in canonical form.
In the next section we apply these concepts to define and contrast different mea-
sures for the value to a decision maker of manipulating (or observing) an uncertain
variable.
3 Value of Control
When assisting a decision maker developing a model, sensitivity analysis measures
help the decision maker to validate the model. One popular measure is the value
of clairvoyance, the most a decision maker should be willing to pay to observe a
set of uncertain variables before making particular decisions [Howard 1967]. Our
focus of attention is another measure, the value of control (or wizardry), the most a
decision maker should be willing to pay a hypothetical wizard to optimally control
the distribution of an uncertain variable [Matheson 1990], [Matheson and Matheson
2005]. We consider and contrast the value of control with two other measures, the
value of do, and the value of revelation, and we develop the conditions under which
the different measures are equal.
In formalizing the value of control, it is natural to consider the value of an atomic
intervention on uncertain variable x, in particular do(x∗ ), that would set it to x∗ ,
the instance yielding the most valuable decision situation, rather than to idle. We
call the most the decision maker should be willing to pay for such an intervention
the value of do and compute it as the difference in the values of the diagrams.
DEFINITION 7 (Value of Do). Given a decision problem including an atomic in-
tervention on uncertain variable x ∈ U , the value of do for x, denoted by V oD(x∗ ),
is the most one should be willing to pay for an atomic intervention on uncertain
variable x to the best possible deterministic instance, do(x∗ ), instead of to idle.
Our goal in general is to value the optimal manipulation of the conditional distri-
bution of a target uncertain variable x in a causal influence diagram, P {x|Y }, and
the most we should be willing to pay for such an intervention is the value of control.
The simplest case is when {do(x)} is a cause of x with respect to D, Y = {}, so the
optimal distribution is equivalent to an atomic intervention on x to x∗ , and control
and do are the same intervention. Otherwise, the do operation effectively severs the
arcs from Y to x and replaces the previous causal mechanism with the new atomic
one. By contrast, the control operation is an atomic intervention on the mapping
variable x(Y ) to its optimal value do(x∗ (Y )) rather than to idle.
DEFINITION 8 (Value of Control). Given a decision problem including variables
Y , a mapping variable x(Y ) for uncertain variable x ∈ U , and atomic interventions
do(x) and do(x(Y )) such that Y ∪ {do(x), do(x(Y ))} is a cause of x with respect
to D, the value of control for x, denoted by V oC(x∗ (Y )), is the most one should
be willing to pay for an atomic intervention on the mapping variable for uncertain
variable x to the best possible deterministic function of Y , do(x∗ (Y )), instead of
to idle.
If Y = {}, then do(x) is the same atomic intervention as do(x(Y )), and the values
of do and control for x are equal, V oD(x∗ ) = V oC(x∗ ()).
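The contrast between the two measures shows up as soon as x has an uncertain parent. In the
sketch below, Y is a single uncertain parent of x; the distributions and values are assumptions
chosen only so that the gap between do(x∗) and do(x∗(Y)) is visible, and the value of control
is never smaller than the value of do:

```python
# Minimal sketch contrasting the value of do with the value of control when x
# has an uncertain parent Y. All numbers are illustrative assumptions.
p_y = {"rain": 0.4, "sun": 0.6}                       # P{Y}
value = {("rain", 0): 30.0, ("rain", 1): 10.0,        # value as a function of (Y, x)
         ("sun", 0): 5.0, ("sun", 1): 40.0}
p_x_given_y = {y: {0: 0.5, 1: 0.5} for y in p_y}      # P{x|Y} when the intervention is idle

value_idle = sum(p_y[y] * sum(p_x_given_y[y][x] * value[(y, x)] for x in (0, 1))
                 for y in p_y)
# do(x*): one constant instance of x, chosen without knowing Y
value_do = max(sum(p_y[y] * value[(y, x)] for y in p_y) for x in (0, 1))
# do(x*(Y)): the best deterministic function of Y, i.e. the best x for each y
value_control = sum(p_y[y] * max(value[(y, x)] for x in (0, 1)) for y in p_y)
print(value_do - value_idle, value_control - value_idle)   # VoD = 6.5, VoC = 14.5
```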
While it is tempting to assume atomic interventions, in many cases they can be
cumbersome or implausible. In an attempt to avoid such issues, Ronald A. Howard
has suggested an alternative passive measure, the value of revelation: how much
better off the decision maker should be by observing that the uncertain variable
in question obtained its most desirable value. This is only well-defined for vari-
ables unresponsive to D, except for those atomic interventions that are set to idle,
because otherwise the observation would be made before decisions it might be re-
sponsive to. Under our assumptions this can be computed as the difference in value
between two situations, but it is hard to describe it as a willingness to pay for this
difference as it is more passive than intentional. (The value of revelation is in fact
an intermediate term in the computation of the value of clairvoyance.)
DEFINITION 9 (Value of Revelation). Given a decision problem including uncer-
tain variable x ∈ U and a (possibly empty) set of atomic interventions, A, that is a
cause for x with respect to D, the value of revelation for uncertain variable x ∈ U ,
denoted by V oR(x∗ ), is the increase in the value of the situation with d = idle
for all d ∈ A, if one observed that uncertain variable x = x∗ , the best possible
deterministic instance, instead of not observing x.
To illustrate these three measures, we consider a partial causal influence diagram
including x and its parents, Y , which we assume for this example are uncertain
and nonempty, as shown in Figure 4a. There are atomic interventions do(x) on
x, do(x(Y )) on mapping variable x(Y ), and do(y) on each y ∈ Y represented as
do(Y ). The variable x is a deterministic function of Y , do(x) and x(Y ). In this
model, Y ∪ {do(x), do(x(Y ))} is a cause of x with respect to D. The dashed line
from x to values V suggests that there might be some directed path from x to V .
If not, V would be unresponsive to do(x) and do(x(Y )) and the values of do and
control would be zero.
To obtain the reference diagram for our proposed changes, we set all of the atomic
interventions to idle as shown in Figure 4b1. We can compute the value of this
diagram by eliminating the idle decisions and absorbing the mapping variable into
x, yielding the simpler diagram shown in (b2). To compute the value of do for x,
we can compute the value of the diagram with do(x∗ ) by setting the other atomic
interventions to idle, as shown in (c1). But since that is making the optimal choice
for x with no interventions on Y or x(Y ), we can now think of x as a decision
variable as indicated in the diagram shown in (c2). We shall use this shorthand
in many of the examples that we consider. To compute the value of control for
x, we can compute the value of the diagram with do(x∗ (Y )) by setting the other
atomic interventions to idle, as shown in (d1). But since that is making the optimal
choice for x(Y ) with none of the other interventions, we can compute its value with
in that
P {V |Y, x∗ } = P {V |Y, do(x∗ )} = P {V |Y, do(x∗ (Y ))}.
However, in valuing the decision situation we do not get to observe Y and thus
P {V |x∗ } might not be equal to P {V |do(x∗ )}. Consider the diagrams shown in
Figure 9. Because Income satisfies the back door criterion relative to Income Tax,
the values of do, control and revelation on Income Tax would all be the same if we
observed Income. But we do not know what our Income will be and the values of
do, control, and revelation can all be different.
Nonetheless, if we make a stronger assumption, that Y is d-separated from V by
x, the three measures will be equal. The atomic intervention on x or its mapping
variable only affects the value V through the descendants of x in a causal model,
and all other variables are unresponsive to the intervention in worlds limited by x.
However, the atomic interventions might not be independent of V given x unless Y
is d-separated from V by x. Otherwise, observing x or an atomic intervention on
the mapping variable for x can lead to a different value for the diagram than an
atomic intervention on x.
We establish this result in two steps for both general situations and for Pearl
causal models. By assuming that do(x) is independent of V given x, we first show
that the values of do and revelation are equal. If we then assume that Y is d-
separated from V by x, we show that the values of do and control are equal. The
conditions under which these two different comparisons can be made are not iden-
tical either. To be able to compute the value of revelation for x we must set to idle
all interventions that x is responsive to, while to compute the value of control for
x we need to ensure that we have an atomic intervention on a mapping variable
for x.
THEOREM 10 (Equal Values of Do and Revelation). Given a decision problem
including uncertain variable x ∈ U , if there is a set of atomic interventions A,
including do(x), that is a cause of x with respect to D, and do(x) is independent of V
given x, then the values of do and revelation for x are equal, V oD(x∗ ) = V oR(x∗ ).
If {do(x)} is a cause of x with respect to D, then they are also equal to the value
of control for x, V oC(x∗ ()) = V oD(x∗ ) = V oR(x∗ ).
Proof. Consider the probability of V after the intervention do(x∗) with all other
interventions in A set to idle. Because x is determined by do(x∗), and do(x) is
independent of V given x, P{V | do(x∗)} = P{V | do(x∗), x∗} = P{V | x∗}, so the
values of do and revelation for x are equal. If {do(x)} is a cause of x with respect
to D, then the values of do and control for x are equal. ⊔⊓
COROLLARY 11. Given a decision problem described by a Pearl causal model in-
cluding uncertain variable x ∈ U, if Pa(x) is d-separated from V by x, then the
As a result,
P{V | do(x∗(Y))} = Σ_Y P{V, Y | do(x∗(Y))}
                 = Σ_Y P{V | do(x∗(Y)), Y} P{Y | do(x∗(Y))}
                 = Σ_Y P{V | x∗} P{Y | do(x∗(Y))}
                 = P{V | x∗} Σ_Y P{Y | do(x∗(Y))}
                 = P{V | x∗}. ⊔⊓
27
Cause for Celebration, Cause for Concern
Yoav Shoham
While it is hard to argue that our definition (or any other definition, for
that matter) is the “right definition”, we show that it deals well with
the difficulties that have plagued other approaches in the past, especially
those exemplified by the rather extensive compendium of [Hall 2004]1 .
References
Fagin, R., J. Y. Halpern, Y. Moses, and M. Y. Vardi (1994). Reasoning about
Knowledge. MIT Press.
Gettier, E. L. (1963). Is justified true belief knowledge? Analysis 23, 121–123.
Hall, N. (2004). Two concepts of causation. In J. Collins, N. Hall, and L. A. Paul
(Eds.), Causation and Counterfactuals. MIT Press.
2 The discussion there is done in the context of formal models of intention, but the considerations
28
_________________________________________________________________
Automated Search for Causal Relations
PETER SPIRTES, CLARK GLYMOUR, RICHARD SCHEINES, ROBERT TILLMAN
1 Introduction
The rapid spread of interest in the last two decades in principled methods of search or
estimation of causal relations has been driven in part by technological developments,
especially the changing nature of modern data collection and storage techniques, and the
increases in the speed and storage capacities of computers. Statistics books from 30 years
ago often presented examples with fewer than 10 variables, in domains where some
background knowledge was plausible. In contrast, in new domains, such as climate
research where satellite data now provide daily quantities of data unthinkable a few
decades ago, fMRI brain imaging, and microarray measurements of gene expression, the
number of variables can range into the tens of thousands, and there is often limited
background knowledge to reduce the space of alternative causal hypotheses. In such
domains, non-automated causal discovery techniques appear to be hopeless, while the
availability of faster computers with larger memories and disk space allows for the
practical implementation of computationally intensive automated search algorithms over
large search spaces. Contemporary science is not your grandfather’s science, or Karl
Popper’s.
Causal inference without experimental controls has long seemed as if it must
somehow be capable of being cast as a kind of statistical inference involving estimators
with some kind of convergence and accuracy properties under some kind of assumptions.
Until recently, the statistical literature said not. While parameter estimation and
experimental design for the effective use of data developed throughout the 20th century,
as recently as 20 years ago the methodology of causal inference without experimental
controls remained relatively primitive. Besides a cessation of hostilities from the
majority of the statistical and philosophical communities (which has still only partially
happened), several things were needed for theories of causal estimation to appear and to
flower: well defined mathematical objects to represent causal relations; well defined
connections between aspects of these objects and sample data; and a way to compute
those connections. A sequence of studies beginning with Dempster’s work on the
factorization of probability distributions [Dempster 1972] and culminating with Kiiveri
and Speed’s [Kiiveri & Speed 1982] study of linear structural equation models, provided
the first, in the form of directed acyclic graphs, and the second, in the form of the “local”
Markov condition. Pearl and his students [Pearl 1988], and independently, Steffen
Lauritzen and his collaborators [Lauritzen, Dawid, Larsen, & Leimer 1990], provided the
third, in the form of the “global” Markov condition, or d-separation in Pearl’s
formulation, and the assumption of its converse, which came to be known as “stability”
or “faithfulness.” Further fundamental conceptual and computational tools were needed,
many of them provided by Pearl and his associates; for example, the characterization and
representation of Markov equivalence classes and the idea of “inducing paths,” essential
to understanding the properties of models with unrecorded variables. Initially, most of
these authors, including Pearl, did not connect directed graphical models with a causal
interpretation (in the sense of representing outcomes of interventions). This connection
between graphs and interventions was drawn from an earlier tradition in econometrics
[Strotz & Wold 1960], and in our work [Spirtes, Glymour, & Scheines 1993]. With this
connection, and the pieces Speed, Lauritzen, Pearl and others had established, a
principled theory of causal estimation could, and did, begin around 1990, and Pearl and
his students have made important contributions to it. Pearl has become the foremost
advocate in the universe for reconceiving the relations between causality and statistics.
Once begun for special cases, the understanding of search methods for causal relations
has expanded to a variety of scientific and statistical settings, and in many scientific
enterprises—neuroimaging for example—causal representations and search are treated as
almost routine.
The theory of interventions also provided a coherent normative theory of inference
using causal premises. That effort can also be traced back to Strotz and Wold [Strotz &
Wold 1960], then to our own work [Spirtes, Glymour, & Scheines 1993] on prediction
from classes of causal graphs, and then to the full development of a non-parametric
theory of prediction for graphical models by Pearl and his collaborators [Shpitser & Pearl
2008]. Pearl brilliantly turned philosopher and developed the theory of interventions into
a general account of counterfactual reasoning. Although we will not discuss it further, we
think there remain interesting open problems about prediction algorithms for various
parametric classes of graphical causal models.
The following paper surveys a broad range of causal estimation problems and
algorithms, concentrating especially on those that can be illustrated with empirical
examples that we and our students and collaborators have analyzed. This has naturally led
to a concentration on the algorithms and tools that we have developed. The kinds of
causal estimation problems and algorithms discussed are broadly representative of the
most important developments in methods for estimating causal structure since 1990, but
it is not a comprehensive survey. There have been so many improvements to the basic
algorithms that we describe here there is not room to discuss them all. A good resource
for a description of further research in this area is the Proceedings of the Conferences on
Uncertainty in Artificial Intelligence, at http://uai.sis.pitt.edu.
The dimensions of the problems, as we have long understood them, are these:
1. Finding computationally and statistically feasible methods for discovering
causal information for large numbers of variables, provably correct under
standard sampling assumptions, assuming no confounding by unrecorded
variables.
2 Assumptions
We assume the reader’s familiarity with the standard notions used in discussions of
graphical causal model search: conditional independence, Markov properties, d-
separation, Markov equivalence, patterns, distribution equivalence, causal sufficiency,
etc. The appendix gives a brief review of the essential definitions, assumptions and
theorems required for known proofs of correctness of the algorithms we will discuss.
[Figure: the true causal graph G1 and the true causal pattern P1 over A, B, C, D, and E, and
the adjacency phase of the PC algorithm, with separating sets I(A,D|∅): ∅; I(B,D|∅): ∅;
I(A,C|{B}): {B}; I(A,E|{B}): {B}; I(B,E|{C}): {C}; I(D,E|{C}): {C}]
After the adjacency phase of the algorithm, the orientation phase is performed; it is
illustrated in Figure 2.
[Figure 2: the orientation phase of the PC algorithm, panels (i) and (ii), using
C ∉ Sepset(B,D), B ∈ Sepset(A,C), C ∈ Sepset(B,E), and C ∈ Sepset(D,E)]
Assuming that the causal relations can be represented by a directed acyclic graph, the
Causal Markov Assumption, the Causal Faithfulness Assumption, and consistent tests of
conditional independence, in the large sample (i.i.d.) limit for a causally sufficient set of
variables, the PC algorithm outputs a pattern that represents the true causal graph.
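The two phases are simple enough to sketch in a few lines. The following Python fragment
assumes an oracle for conditional independence in place of statistical tests and omits the
refinements of the full PC algorithm; given an oracle for the graph of the example above, it
recovers the adjacencies A–B, B–C, C–D, and C–E together with the separating sets listed in
the figure.

```python
# Minimal sketch of the adjacency phase of the PC algorithm. indep(x, y, S)
# should return True when x and y are judged independent conditional on S
# (an oracle here; in practice a statistical test). Many refinements of the
# full algorithm are omitted.
from itertools import combinations

def pc_adjacency(variables, indep):
    adjacent = {v: set(variables) - {v} for v in variables}   # start complete
    sepset = {}
    size = 0
    while any(len(adjacent[v]) - 1 >= size for v in variables):
        for x in variables:
            for y in list(adjacent[x]):
                # look for a separating set among the other neighbors of x
                for S in combinations(adjacent[x] - {y}, size):
                    if indep(x, y, set(S)):
                        adjacent[x].discard(y)
                        adjacent[y].discard(x)
                        sepset[frozenset((x, y))] = set(S)
                        break
        size += 1
    return adjacent, sepset
```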
The PC algorithm has been shown to apply to very high dimensional data sets (under
a stronger version of the Causal Faithfulness Assumption), both for finding causal
structure [Kalisch & Buhlmann 2007] and for classification [Aliferis, Tsamardinos, &
Statnikov 2003]. A version of the algorithm controlling the false discovery rate is
available [Junning & Wang 2009].
3.1.1 Example - Foreign Investment
This example illustrates how the PC algorithm can find plausible alternatives to a model
built from domain knowledge. Timberlake and Williams used regression to claim foreign
investment in third-world countries promotes dictatorship [Timberlake & Williams
1984]. They measured political exclusion (PO) (i.e., dictatorship), foreign investment
penetration in 1973 (FI), energy development in 1975 (EN), and civil liberties (CV) for
72 countries. CV was measured on an ordered scale from 1 to 7, with lower values
indicating greater civil liberties.
Their inference is unwarranted. Their model (with the relations between the
regressors omitted) and the pattern obtained from the PC algorithm (using a 0.12
significance level to test for vanishing partial correlations) are shown in Figure 3.1 We
typically run the algorithms at a variety of different significance levels, and compare the
results to see if any of the features of the output are constant.
[Figure 3: the regression model of Timberlake and Williams (coefficients FI .762, EN –.478,
CV 1.061 into PO) and the pattern output by the PC algorithm over FI, EN, PO, and CV]
1Searches at lower significance levels remove the adjacency between FI and EN.
as well supported as the regression model, and hence serve to cast doubt upon
conclusions drawn from that model.
3.1.2 Example - Spartina Biomass
This example illustrates a case where the PC algorithm output received some
experimental confirmation. A textbook on regression [Rawlings 1988] skillfully
illustrates regression principles and techniques for a biological study from a dissertation
[Linthurst 1979] in which it is reasonable to think there is a causal process at work
relating the variables. The question at issue is plainly causal: among a set of 14 variables,
which have the most influence on an outcome variable, the biomass of Spartina grass?
Since the example is the principal application given for an entire textbook on regression,
the reader who reaches the 13th chapter may be surprised to find that the methods yield
almost no useful information about that question.
According to Rawlings, Linthurst obtained five samples of Spartina grass and soil
from each of nine sites on the Cape Fear Estuary of North Carolina. Besides the mass of
Spartina (BIO), fourteen variables were measured for each sample:
• Free Sulfide (H2S)
• Salinity (SAL)
• Redox potentials at pH 7 (EH7)
• Soil pH in water (PH)
• Buffer acidity at pH 6.6 (BUF)
• Phosphorus concentration (P)
• Potassium concentration (K)
• Calcium concentration (CA)
• Magnesium concentration (MG)
• Sodium concentration (NA)
• Manganese concentration (MN)
• Zinc concentration (ZN)
• Copper concentration (CU)
• Ammonium concentration (NH4)
The aim of the data analysis was to determine for a later experimental study which of
these variables most influenced the biomass of Spartina in the wild. Greenhouse
experiments would then try to estimate causal dependencies out in the wild. In the best
case one might hope that the statistical analyses of the observational study would
correctly select variables that influence the growth of Spartina in the greenhouse. In the
worst case, one supposes, the observational study would find the wrong causal structure,
or would find variables that influence growth in the wild (e.g., by inhibiting or promoting
growth of a competing species) but have no influence in the greenhouse.
Using the SAS statistical package, Rawlings analyzed the variable set with a multiple
regression and then with two stepwise regression procedures from the SAS package. A
search through all possible subsets of regressors was not carried out, presumably because
the candidate set of regressors is too large. The results were as follows:
(i) a multiple regression of BIO on all other variables gives only K and CU
significant regression coefficients;
(ii) two stepwise regression procedures 2 both yield a model with PH, MG, CA
and CU as the only regressors, and multiple regression on these variables alone gives
them all significant coefficients;
(iii) simple regressions one variable at a time give significant coefficients to PH,
BUF, CA, ZN and NH4.
What is one to think? Rawlings reports that "None of the results was satisfying to
the biologist; the inconsistencies of the results were confusing and variables expected to
be biologically important were not showing significant effects." (p. 361).
This analysis is supplemented by a ridge regression, which increases the stability of
the estimates of coefficients, but the results for the point at issue--identifying the
important variables--are much the same as with least squares. Rawlings also provides a
principal components factor analysis and various geometrical plots of the components.
These calculations provide no information about which of the measured variables
influence Spartina growth.
Noting that PH, for example, is highly correlated with BUF, and using BUF instead
of PH along with MG, CA and CU would also result in significant coefficients, Rawlings
effectively gives up on this use of the procedures his book is about:
Ordinary least squares regression tends either to indicate that none of the
variables in a correlated complex is important when all variables are in the
model, or to arbitrarily choose one of the variables to represent the complex
when an automated variable selection technique is used. A truly important
variable may appear unimportant because its contribution is being usurped
by variables with which it is correlated. Conversely, unimportant variables
may appear important because of their associations with the real causal
factors. It is particularly dangerous in the presence of collinearity to use the
regression results to impart a "relative importance," whether in a causal
sense or not, to the independent variables. (p. 362)
Rawlings's conclusion is correct in spirit, but misleading and even wrong in detail. If
we apply the PC algorithm to the Linthurst data then there is one robust conclusion: the
only variable that may directly influence biomass in this population3 is PH; PH is
distinguished from all other variables by the fact that the correlation of every other
variable (except MG) with BIO vanishes or vanishes when PH is conditioned on.4 The
relation is not symmetric; the correlation of PH and BIO, for example, does not vanish
when BUF is controlled. The algorithm finds PH to be the only variable adjacent to BIO
2The "maximum R-square" and "stepwise" options in PROC REG in the SAS program.
3Although the definition of the population in this case is unclear, and must in any case be
drawn quite narrowly.
4More exactly, at .05, with the exception of MG the partial correlation of every regressor
with BIO vanishes when some set containing PH is controlled for; the correlation of MG
with BIO vanishes when CA is controlled for.
no matter whether we use a significance level of .05 to test for vanishing partial
correlations, or a level of 0.1, or a level of 0.2. In all of these cases, the PC algorithm
(and the FCI algorithm, which allows for the possibility of latent variables in section 4.2 )
yields the result that PH and only PH can be directly connected with BIO. If the system is
linear normal and the Causal Markov Assumption obtains, then in this population any
influence of the other regressors on BIO would be blocked if PH were held constant. Of
course, over a larger range of values of the variables there is little reason to think that
BIO depends linearly on the regressors, or that factors that have no influence in
producing variation within this sample would continue to have no influence.
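The vanishing partial correlation tests used throughout this example can be computed directly
from the correlation matrix. The helper below is a generic sketch, not the implementation used
for the analyses reported here; it takes an assumed n x p data matrix and column indices and
returns the partial correlation together with its two-sided Fisher z p-value.

```python
# Minimal sketch of a test for a vanishing partial correlation via the Fisher z
# transformation. data is an (n x p) array; i, j, and given are column indices.
import numpy as np
from scipy.stats import norm

def partial_corr_test(data, i, j, given=()):
    cols = [i, j] + list(given)
    prec = np.linalg.inv(np.corrcoef(data[:, cols], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])     # partial correlation
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(given) - 3)
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return r, p_value      # "vanishing" is rejected when p_value < alpha
```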
Although the analysis cannot conclusively rule out the possibility that PH and BIO are
confounded by one or more unmeasured common causes, in this case the principles of the
theory and the data argue against it. If PH and BIO have a common unmeasured cause T,
say, and any other variable, Zi, among the 13 others either causes PH or has a common
unmeasured cause with PH (Figure 4, in which we do not show connections among the Z
variables), then Zi and BIO should be correlated conditional on PH, which is statistically
not the case.
[Figure 4: PH and BIO confounded by an unmeasured common cause T, with other variables
Z1, Z2, Z3 related to PH]
The program and theory lead us to expect that if PH is forced to have values like
those in the sample--which are almost all either below PH 5 or above PH 7-- then
manipulations of other variables within the ranges evidenced in the sample will have no
effect on the growth of Spartina. The inference is a little risky, since growing plants in a
greenhouse under controlled conditions may not be a direct manipulation of the variables
relevant to growth in the wild. If, for example, in the wild variations in PH affect
Spartina growth chiefly through their influence on the growth of competing species not
present in the greenhouse, a greenhouse experiment will not be a direct manipulation of
PH for the system.
The fourth chapter of Linthurst's thesis partly confirms the PC algorithm's analysis.
In the experiment Linthurst describes, samples of Spartina were collected from a salt
marsh creek bank (presumably at a different site than those used in the observational
study). Using a 3 x 4 x 2 (PH x SAL x AERATION) randomized complete block design
with four blocks, after transplantation to a greenhouse the plants were given a common
nutrient solution with varying values of PH, SAL, and AERATION. The AERATION
variable turned out not to matter in this experiment. Acidity values were PH 4, 6 and 8.
SAL for the nutrient solutions was adjusted to 15, 25, 35 and 45 ‰.
Linthurst found that growth varied with SAL at PH 6 but not at the other PH values,
4 and 8, while growth varied with PH at all values of SAL (p. 104). Each variable was
At pH 4 and 8, salinity had little effect on the performance of the species. The
pH appeared to be more dominant in determining the growth response.
However, there appears to be no evidence for any causal effects of high or low
tissue concentrations on plant performance unless the effects of pH and salinity
are also accounted for. (p.108)
The overall effect of pH at the two extremes is suggestive of damage to the
root, thereby modifying its membrane permeability and subsequently its
capacity for selective uptake. (p. 109).
The question of interest is what the causes of college plans are. This data set is of
interest because it has been used by a variety of different search algorithms that make
different assumptions. The different results illustrate the role that the different assumptions
play in the output and are discussed in subsequent sections.
5Examples of the analysis of the Sewell and Shah data using Bayesian networks are given
in Spirtes et al. (2001), and Heckerman (1998).
[Figure 5: Model of Causes of College Plans (pattern over SES, SEX, IQ, PE, and CP)]
The pattern produced as the output of the PC algorithm is shown in Figure 5. The model
predicts that SEX affects CP only indirectly via PE.
It is possible to predict the effects of some manipulations from the pattern, but not
others. For example, because the pattern is compatible both with SES → IQ and with SES
← IQ, it is not possible to determine if SES is a cause or an effect of IQ, and hence it is
not possible to predict the effect of manipulating SES on IQ from the pattern. On the
other hand, it can be shown that all of the models in the conditional independence
equivalence class represented by the pattern entail the same predictions about the
quantitative effects of manipulating PE on CP. When PE is manipulated, in the
manipulated distribution: P(CP=0|PE=0) = .095; P(CP=1|PE=0) = .905; P(CP=0|PE=1)
= .484; P(CP=1|PE=1) = .516 [Spirtes, Scheines, Glymour, & Meek 2004].
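The quantitative prediction for manipulating PE is an instance of the general recipe of
adjusting for the parents of the manipulated variable: P(CP | do(PE)) = Σz P(CP | PE, Z = z)
P(Z = z), where Z stands for the parents of PE. The sketch below applies that formula to an
assumed single adjustment variable with assumed probabilities; it is meant only to show the
computation and does not reproduce the Sewell and Shah estimates quoted above.

```python
# Minimal sketch of predicting a manipulation by adjusting for the parents Z of
# the manipulated variable PE. The probabilities are illustrative assumptions.
p_z = {0: 0.6, 1: 0.4}                                   # P(Z)
p_cp1_given_pe_z = {(0, 0): 0.05, (0, 1): 0.20,          # P(CP=1 | PE, Z)
                    (1, 0): 0.40, (1, 1): 0.70}

def p_cp1_do_pe(pe):
    return sum(p_z[z] * p_cp1_given_pe_z[(pe, z)] for z in p_z)

print(p_cp1_do_pe(0), p_cp1_do_pe(1))   # P(CP=1|do(PE=0)) = 0.11, P(CP=1|do(PE=1)) = 0.52
```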
GES has proved especially valuable in searches for latent structure (GESMIMBuild)
and in searches with multiple data sets (IMaGES). Examples are discussed in sections 4.4
and 5.3 .
3.3 LiNGAM
Standard implementations of the constraint-based and score-based algorithms above
usually assume that continuous variables have multivariate Gaussian distributions. This
assumption is inappropriate in many contexts such as EEG analysis where variables are
known to deviate from Gaussianity.
The LiNGAM (Linear Non-Gaussian Acyclic Model) algorithm [Shimizu, Hoyer,
Hyvärinen, & Kerminen 2006] is appropriate specifically for cases where each variable in
a set of measured variables can be written as a linear function of other measured variables
plus an independent noise component, where at most one of the measured variables’
noise components may be Gaussian. For example, consider the system with the causal
graph shown in Figure 6 and assume X, Y, and Z are determined as follows, where a, b,
and c are real-valued coefficients and εX, εY, and εZ are independent noise components of
which at least two are non-Gaussian.
(1) X = εX
(2) Y = aX + εY
(3) Z = bX + cY + εZ
[Figure 6: (i) the causal graph over X, Y, and Z with noise terms εX, εY, εZ and coefficients
a, b, c; (ii) the reduced form]
(4) X = εX
(5) Y = aεX + εY
(6) Z = bεX + acεX + cεY + εZ
The standard Independent Components Analysis (ICA) procedure [Hyvärinen & Oja,
2000] can be used to recover a matrix containing the real-valued coefficients a, b, and c
from an i.i.d. sample of data generated from the above system of equations. The
LiNGAM algorithm finds the correct matching of coefficients in this ICA matrix to
variables and prunes away any insignificant coefficients using statistical criteria.
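As a rough illustration of the ICA step (not the full LiNGAM search, which must still permute
and prune the recovered matrix), one can generate data from equations (1)-(3) with
non-Gaussian noise and let ICA estimate the mixing matrix; the coefficient values below are
assumptions chosen for the example.

```python
# Minimal sketch: simulate equations (1)-(3) with uniform (non-Gaussian) noise
# and recover a mixing matrix with ICA. LiNGAM's remaining work is to find the
# row/column permutation that makes the matrix lower triangular and to prune
# insignificant entries. The coefficients a, b, c are illustrative assumptions.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 5000
a, b, c = 0.8, 0.5, -0.7
eps = rng.uniform(-1.0, 1.0, size=(n, 3))    # independent noise eps_X, eps_Y, eps_Z
X = eps[:, 0]
Y = a * X + eps[:, 1]
Z = b * X + c * Y + eps[:, 2]
data = np.column_stack([X, Y, Z])

ica = FastICA(n_components=3, random_state=0)
ica.fit(data)
print(ica.mixing_)     # estimated mixing matrix, up to permutation and scaling
```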
The procedure yields correct values even if the coefficients were to cancel perfectly,
so that variables such as X and Z above were uncorrelated. Since coefficients
are determined for each variable, we can always reconstruct the true unique DAG, instead
of its Markov equivalence class. The procedure converges (at least) pointwise to the true
DAG and coefficients assuming: (1) there are no unmeasured common causes; (2) the
dependencies among measured variables are linear; (3) none of the relations among
measured variables are deterministic; (4) i.i.d. sampling; (5) the Markov Condition; (6) at
most one error or disturbance term is Gaussian. We do not know its complexity
properties.
The LiNGAM procedure can be generalized to estimate causal relations among
observables when there are latent common causes [Hoyer, Shimizu, & Kerminen 2006],
although the result is not in general a unique DAG, and LiNGAM has been combined
[Shimizu, Hoyer, & Hyvarinen 2009] with Silva’s clustering procedure (section 4.4 ) for
locating latent variables to estimate a unique DAG among latent variables, and also with
search for cyclic graphs [Lacerda, Spirtes, Ramsey, & Hoyer 2008], and combined with
the PC and GES algorithms when more than one disturbance term is Gaussian [Hoyer et
al. 2008].
equivalence class. In many cases, only a few variables need be nonlinear or non-Gaussian
to obtain a unique DAG using kPC.
kPC requires an additional assumption, weak additivity.
This assumption does not rule out invertible additive noise models or many cases where
noise may not be additive, only the hypothetical case where we can fit an additive noise
model to the data, but only in the incorrect direction. Weak additivity can be considered
an extension of the simplicity intuitions underlying the causal faithfulness assumption,
i.e. a complicated true model will not generate data resembling a different simpler model.
Faithfulness can fail, but under a broad range of distributions, violations are Lebesgue
measure zero [Spirtes, Glymour, & Scheines 2000]. Whether a similar justification can
be given for the weak additivity assumption is an open question.
kPC is both correct and complete, i.e. it converges to the correct DAG or smallest
possible equivalence class of DAGs in the limit under weak additivity and the
assumptions of the PC algorithm.
3.4.1 Example - Auto MPG
Figure 9 shows the structures learned for the Auto MPG dataset, which records MPG fuel
consumption of 398 automobiles in 1983 with 8 characteristics from the UCI database
(Asuncion & Newman, 2007). The nominal variables Year and Origin were excluded.
[Figure 9: structures learned by PC and kPC for the Auto MPG data, with variables including
Displacement, Weight, and Horsepower]
structures learned by PC and kPC for this dataset. kPC finds every variable other than
Area is a cause of Area, which is sensible since each of these variables was included in
the dataset by domain experts as a predictor of the total area burned by
forest fires.
The PC structure, however, indicates that Area is not associated with any of the
variables, which are all assumed to be predictors by experts.
6The FCI algorithm is similar to Pearl’s IC* algorithm [Pearl 2000] in many respects, and
uses concepts based on IC*; however, IC* is computationally and statistically feasible
only for a few variables.
[Figure: a causal graph over SEX, IQ, PE, and CP with latent variables L1, L2, L4, and L5,
and the corresponding PAG output by the FCI algorithm]
7See http://oli.web.cmu.edu/openlearning/forstudents/freecourses/csr
Using data from 2002, and some background knowledge about causal order, the
output of the FCI algorithm was the PAG shown in Figure 12a. That model predicts that
interventions that stop students from printing out the text and encourage students to use
the online interactive exercises should raise the final grade in the class.
In 2003, students were advised that completing the voluntary exercises seemed to be
important in helping grades, but that printing out the modules seemed to prevent
completing the voluntary exercises. They were advised that, if they printed out the text
they should make extra effort to go online and complete the interactive online exercises.
Data on the same variables was gathered in 2003, and the output of the FCI algorithm is
shown in Figure 12b. The interventions to discourage printing and encourage the use of the
online interactive exercises were largely successful, and the PAG output by the FCI
algorithm from the 2003 data is exactly the PAG one would expect after intervening on
the PAG output by the FCI algorithm from the 2002 data.
[Figure 12: Online Course Printing. PAGs over print, voluntary exercises, pre, quiz, and
final for (a) 2002 (edge coefficients .302*, -.41**, .353*, .75**, .323*) and (b) 2003
(edge coefficients -.08, -.16, .41*, .25*)]
covariates, they were faced with a variable selection problem to which they applied
backwards-stepwise variable selection, arriving at a final regression model involving lead
and five of the original 40 covariates. The covariates were measures of genetic
contributions to the child’s IQ (the parent’s IQ), the amount of environmental stimulation
in the child’s early environment (the mother’s education), physical factors that might
compromise the child’s cognitive endowment (the number of previous live births), and
the parent’s age at the birth of the child, which might be a proxy for many factors. The
measured variables they used are as follows:
8 The covariance data for this reanalysis was originally obtained from Needleman by
Steve Klepper, who generously forwarded it. In this, and all subsequent analyses
described, the correlation matrix was used.
for lead’s influence on IQ moved up and down slightly, its sign and significance (the 95%
central region in the posterior over the lead-iq connection always included zero) were
robust.
[Figure 13: errors-in-variables models with latent true values L1–L6 and L1–L3 of the
measured regressors]
A reanalysis using the FCI algorithm produced different results [Scheines 2000].
Scheines first used the FCI algorithm to generate a PAG, which was subsequently used as
the basis for constructing an errors-in-variables model. The FCI algorithm produced a
PAG that indicated that mab, fab, and nlb are not adjacent to ciq, contrary to
Needleman’s regression.9 If we construct an errors-in-variables model compatible with
the PAG produced by the FCI algorithm, the model does not contain mab, fab, or nlb. See
Figure 13. (We emphasize that there are other models compatible with the PAG, which
are not errors-in-variables models; the selection of an error-in-variables model from the
9 The fact that mab had a significant regression coefficient indicates that mab and ciq are
correlated conditional on the other variables; the FCI algorithm concluded that mab is not
a cause of ciq because mab and ciq are unconditionally uncorrelated.
set of models represented by the PAG is an assumption.) In fact the variables that the FCI
algorithm eliminated were precisely those which required unreasonable measurement
error assumptions in Klepper's analysis. With the remaining regressors, Scheines
specified an errors-in-variables model to parameterize the effect of actual lead exposure
on children’s IQ. This model is still underidentified but under several priors, nearly all
the mass in the posterior was over negative values for the effect of actual lead exposure
(now a latent variable) on measured IQ. In addition, applying Klepper’s bounds analysis
to this model indicated that the effect of actual lead exposure on ciq was bounded below
zero given reasonable assumptions about the degree of measurement error.
[Figure: two histograms of the posterior distribution over the effect of actual lead exposure,
with the horizontal axis ranging from about -0.56 to 0.17]
independence relations among the X variables alone - the only entailed conditional
independencies require conditioning on an unobserved common cause. Hence the FCI
algorithm would return a completely unoriented PAG in which every pair of variables in
X is adjacent. Such a PAG makes no predictions at all about the effects of manipulations
of the observed variables.
Furthermore, in this case, the effects of manipulating the observed variables (answers
to test questions) are of no interest - the interesting questions are about the effects of
manipulating the unobserved variables and the qualitative causal relationships between
them.
Although PAGs can reveal the existence of latent common causes (as by the double-
headed arrows in Figure 11 for example), before one could make a prediction about the
effect of manipulating an unobserved variable(s), one would have to identify what the
variable (or variables) is, which is never possible from a PAG.
[Figure 16: the multiple indicator model S, with latent variables Care, Emotionality, and
Self-defeating and indicators X2, X3, X5, X6, X7, X8, X9, X10, X11, X14, X16, and X18]
Models such as S are multiple indicator models, and can be divided into two parts:
the measurement model, which contains the edges between the unobserved variables and
the observed variables (e.g. Emotionality → X2), and the structural model, which contains
the edges between the unobserved variables (e.g. Emotionality → Care).
The X variables in S ({X2, X3, X5, X6, X7, X8, X9, X10, X11, X14, X16, X18}) were chosen
with the idea that they indirectly measure some psychological trait that cannot be directly
observed. Ideally, the X variables can be broken into clusters, where each variable in the
cluster is caused by one unobserved cause common to the members of the cluster, and a
unique error term uncorrelated with the other error terms, and nothing else. From the
values of the variables in the cluster, it is then possible to make inferences about the
value of the unobserved common cause. Such a measurement model is called pure. In
psychometrics, pure measurement models satisfy the property of local independence:
each measured variable is independent of all other variables, conditional on the
unobserved variable it measures. In Figure 16, the measurement model of S is pure.
If the measurement model is impure (i.e. there are multiple common causes of a pair
of variables in X, or some of the X variables cause each other) then drawing inferences
about the values of the common causes is much more difficult. Consider the set X’ = X ∪
{X15}. If X15 indirectly measured (was a direct effect of) the unobserved variable Care,
but X10 directly caused X15, then the measurement model over the expanded set of
variables would not be pure. If a measurement model for a set X’ of variables is not pure,
it is nevertheless possible that some subset of X’ (such as X) has a pure measurement
model. If the only reason that the measurement model is impure is that X10 causes X15
then X = X’\{X15} does have a pure measurement model, because all the “impurities”
have been removed. S does not contain all of the questions on the survey precisely
because various tests described below indicated that some of them needed to be
excluded in order to have a pure measurement model.
The task of searching for a multiple indicator model can then be broken into two
parts: first, finding clusters of variables so that the measurement model is pure; second,
using the pure measurement model to make inferences about the structural model.
Factor analysis is often used to determine the number of unmeasured common causes
in a multiple indicator model, but there are important theoretical and practical problems
in using factor analysis in this way. Factor analysis constructs models with unobserved
common causes (factors) of the observed X variables. However, factor analysis models
typically connect each unobserved common cause (factor) to each X variable, so the
measurement model is not pure. A major difficulty with giving a causal interpretation to
factor analytic models is that the observed distribution does not determine the covariance
matrix among the unobserved factors. Hence, a number of different factor analytic
models are compatible with the same observed data [Harman 1976]. In order to reduce
the underdetermination of the factor analysis model by the data, it is often assumed that
the unobserved factors are independent of each other; however, this is clearly not an
appropriate assumption for unobserved factors that are supposed to represent actual
causes that may causally interact with each other. In addition, simulation studies indicate
that factor analysis is not a reliable tool for estimating the correct number of unobserved
common causes [Glymour 1998].
On this data set, factor analysis indicates that there are 2 unobserved direct common
causes, rather than 3 unobserved direct common causes [Bartholomew, Steele, Moustaki,
& Galbraith 2002]. If a pure measurement model is constructed from the factor analytic
model by associating each observed X variable only with the factor that it is most
strongly associated with, the resulting model fails a statistical test (has a p-value of zero)
[Silva, Scheines, Glymour, & Spirtes 2006]. A search for pure measurement models that
depends upon testing vanishing tetrad constraints is an alternative to factor analysis.
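To give a sense of the constraints involved (this is not the BuildPureClusters algorithm
itself), the sketch below simulates four indicators of a single latent factor with assumed
loadings and checks that the sample tetrad differences are close to zero, as they are in the
population for a pure one-factor cluster.

```python
# Minimal sketch of vanishing tetrad constraints for a pure one-factor cluster:
# cov(x1,x2)cov(x3,x4) - cov(x1,x3)cov(x2,x4) = 0 in the population, and
# similarly for the other tetrads. Loadings below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 20000
latent = rng.normal(size=n)
loadings = [0.9, 0.8, 0.7, 0.6]
xs = np.vstack([lam * latent + rng.normal(size=n) for lam in loadings])

c = np.cov(xs)                                       # 4 x 4 sample covariance
tetrad_a = c[0, 1] * c[2, 3] - c[0, 2] * c[1, 3]
tetrad_b = c[0, 2] * c[1, 3] - c[0, 3] * c[1, 2]
print(tetrad_a, tetrad_b)                            # both near zero
```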
Conceptually, the task of building a pure measurement model from the observed
variables can be broken into 3 separate tasks:
1. Select a subset of the observed variables that form a pure measurement model.
2. Determine the number of clusters (i.e. the number of unobserved common
causes) that the observed variables measure.
3. Cluster the observed variables into the proper groups (so each group has exactly
one unobserved direct common cause.)
unobserved common causes that does not create a cycle is also compatible with the
pattern.) The resulting model (or set of models) passes a statistical test with a p-value of
0.47.
4.4.1 Example - Religion and Depression
Data relating religion and depression provides an example that shows how an
automated causal search produces a model that is compatible with background
knowledge, but fits much better than a model that was built from theories about the
domain.
Bongjae Lee from the University of Pittsburgh organized a study to investigate
religious/spiritual coping and stress in graduate students [Silva & Scheines 2004]. In
December of 2003, 127 Masters in Social Work students answered a questionnaire
intended to measure three main factors:
• Stress, measured with 21 items, each using a 7-point scale (from “not at all
stressful” to “extremely stressful”) according to situations such as: “fulfilling
responsibilities both at home and at school”; “meeting with faculty”; “writing
papers”; “paying monthly expenses”; “fear of failing”; “arranging childcare”;
• Depression, measured with 20 items, each using a 4-point scale (from “rarely or
none” to “most or all the time”) according to indicators such as: “my appetite was
poor”; “I felt fearful”; “I enjoyed life”; “I felt that people disliked me”; “my
sleep was restless”;
• Spiritual coping, measured with 20 items, each using a 4-point scale (from “not
at all” to “a great deal”) according to indicators such as: “I think about how my
life is part of a larger spiritual force”; “I look to God (high power) for strength in
crises”; “I wonder whether God (high power) really exists”; “I pray to get my
mind off of my problems”;
The goal of the original study was to use graphical models to quantify how Spiritual
coping moderates the association of Stress and Depression; the hypothesis was that
Spiritual coping reduces the association of Stress and Depression. The theoretical model
(Figure 17) fails a chi-square test: p = 0. The measurement model produced by
BuildPureClusters is shown in Figure 18. Note that the variables selected automatically
are proper subsets of Lee’s substantive clustering. The full model automatically produced
with GESMIMBuild with the prior knowledge that Stress is not an effect of other latent
variables is given in Figure 19. This model passes a chi square test, p = 0.28, even though
the algorithm itself does not try to directly maximize the fit. Note that it supports the
hypothesis that Depression causes Spiritual Coping rather than the other way around.
Although this finding is not conclusive, the example does illustrate how the algorithm
can find a theoretically plausible alternative model that fits the data well.
[Figure 17: Lee's theoretical model, with Stress measured by St1–St21, Depression by
Dep1–Dep21, and Coping]
[Figure 18: the measurement model produced by BuildPureClusters, with indicators St3, St4,
St16, St18, St20 for Stress and Dep9, Dep13, Dep19 for Depression]
[Figure 19: the full model produced by GESMIMBuild over Stress, Depression, and Coping,
with the same indicators]
it by methods described in [Glymour, Scheines, Spirtes, & Kelly 1987], some of the work
whose aims and methods Cartwright previously sought to demonstrate is impossible
[Cartwright 1994]. But a chain model of contemporaneous causes is far too special a
case. Hoover & Demiralp, and later, Moneta & Spirtes, proposed applying PC to the
residuals [Hoover & Demiralp 2003; Moneta & Spirtes 2006]. (Moneta also worked out
the statistical corrections to the correlations required by the fact that they are obtained as
residuals from regressions.) When that is done for the model above, the result is the unit
structure of the time series: QBO SOI → WP → PDO → AO ← NA.
averages of BIC scores to a function of posteriors is only available when the sample sizes
of several data sets are equal. Nonetheless, IMaGES has been applied to fMRI data from
multiple subjects with remarkably good results. For example, an fMRI study of
responses to visually presented rhyming and non-rhyming words and non-words should
produce a left hemisphere cascade leading to right hemisphere effects, which is exactly
what IMaGES finds, using only the prior knowledge that the input variable is not an
effect of other variables.
6 Conclusion
The discovery of d-separation, and the development of several related notions, has
made possible principled search for causal relations from observational and quasi-
experimental data in a host of disciplines. New insights, algorithms and applications have
appeared almost every year since 1990, and they continue. We are seeing a revolution in
understanding of what is and is not possible to learn from data, but the insights and
methods have seeped into statistics and applied science only slowly. We hope that pace
will quicken.
7 Appendix
A directed graph (e.g. G1 of Figure 22) consists of a set of vertices and a set of
directed edges, where each edge is an ordered pair of vertices. In G1, the vertices are
{A,B,C,D,E}, and the edges are {B → A, B → C, D → C, C → E}. In G1, B is a parent of
A, A is a child of B, and A and B are adjacent because there is an edge B → A. A path in a
directed graph is a sequence of adjacent edges (i.e. edges that share a single common
endpoint). A directed path in a directed graph is a sequence of adjacent edges all pointing
in the same direction. For example, in G1, B → C → E is a directed path from B to E. In
contrast, B → C ← D is a path, but not a directed path in G1 because the two edges do not
point in the same direction; in addition, C is a collider on the path because both edges on
the path are directed into C. A triple of vertices <B,C,D> is a collider if there are edges B
→ C ← D in G1; <B,C,D> is an unshielded collider if in addition there is no edge
between B and D. E is a descendant of B (and B is an ancestor of E) because there is a
directed path from B to E; in addition, by convention, each vertex is a descendant (and
ancestor) of itself. A directed graph is acyclic when there is no directed path from any
vertex to itself: in that case the graph is a directed acyclic graph, or DAG for short.
[Figure 22: directed graphs G1 and G2 and the pattern P1, over vertices A, B, C, D, and E]
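These definitions are straightforward to check mechanically. The fragment below encodes G1
and recovers parents, descendants, and unshielded colliders exactly as defined; it is offered
only as a convenience illustration, not as part of any of the algorithms discussed in the
chapter.

```python
# Minimal sketch of the appendix definitions applied to G1, whose edges are
# B -> A, B -> C, D -> C, and C -> E.
edges = {("B", "A"), ("B", "C"), ("D", "C"), ("C", "E")}
vertices = {v for e in edges for v in e}

parents = {v: {x for (x, y) in edges if y == v} for v in vertices}

def descendants(v):
    """v together with every vertex reachable from v along directed edges."""
    result, frontier = {v}, [v]
    while frontier:
        x = frontier.pop()
        for (s, t) in edges:
            if s == x and t not in result:
                result.add(t)
                frontier.append(t)
    return result

unshielded_colliders = [(a, c, b)
                        for c in vertices for a in parents[c] for b in parents[c]
                        if a < b and (a, b) not in edges and (b, a) not in edges]

print(parents["C"])            # the parents of C: B and D
print(descendants("B"))        # B, A, C, E
print(unshielded_colliders)    # [('B', 'C', 'D')], i.e. B -> C <- D
```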
output a single DAG. A reliable algorithm could at best output the DAG conditional
independence equivalence class, e.g. {G1, G2}.
Fortunately, Theorem 1 is also the basis of a simple representation called a pattern
[Verma & Pearl 1990] of a DAG conditional independence equivalence class. Patterns
can be used to determine which predicted effects of a manipulation are the same in every
member of a DAG conditional independence equivalence class and which are not.
The adjacency phase of the PC algorithm is based on the following two theorems,
where Parents(G,A) is the set of parents of A in G.
Theorem 2: If A and B are d-separated conditional on any subset Z in DAG G, then A
and B are not adjacent in G.
Theorem 3: A and B are not adjacent in DAG G if and only if A and B are d-separated
conditional on Parents(G,A) or Parents(G,B) in G.
The justification of the orientation phase of the PC algorithm is based on Theorem 4.
Theorem 4: If in a DAG G, A and B are adjacent, B and C are adjacent, but A and C are
not adjacent, either B is in every subset of variables Z such that A and C are d-separated
conditional on Z, in which case <A,B,C> is not a collider, or B is in no subset of variables
Z such that A and C are d-separated conditional on Z, in which case <A,B,C> is a collider.
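A correspondingly minimal sketch of the orientation step licensed by Theorem 4, given the
adjacencies and separating sets found in the adjacency phase (the further orientation
propagation rules of the PC algorithm are omitted):

```python
# Minimal sketch of unshielded-collider orientation (Theorem 4). adjacent maps
# each vertex to its current neighbors and sepset maps unordered nonadjacent
# pairs to the separating set recorded in the adjacency phase.
def orient_colliders(adjacent, sepset):
    colliders = []
    for b in adjacent:
        for a in adjacent[b]:
            for c in adjacent[b]:
                if a < c and c not in adjacent[a]:           # unshielded triple a - b - c
                    if b not in sepset.get(frozenset((a, c)), set()):
                        colliders.append((a, b, c))          # orient a -> b <- c
    return colliders
```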
= f(<G2, θ2>). In that case say that <G1, θ1> and <G2, θ2> are distributionally equivalent
(relative to the parametric family). Whether two models are distributionally equivalent
depends not only on the graphs in the models, but also on the parameterization families of
the models. A set of models that are all distributionally equivalent to each other is a
distributional equivalence class. If the graphs are all restricted to be DAGs, then they
form a DAG distributional equivalence class.
In contrast to conditional independence equivalence, distribution equivalence
depends upon the parameterization families as well as the graphs. Conditional
independence equivalence of G1 and G2 is a necessary, but not always sufficient
condition for the distributional equivalence of <G1, A> and <G2, B>.
Algorithms that rely on constraints beyond conditional independence may be able to
output subsets of conditional independence equivalence classes, although without further
background knowledge or stronger assumptions they could at best reliably output a DAG
distribution equivalence class. In general, it would be preferable to take advantage of the
non conditional independence constraints to output a subset of the conditional
independence equivalence class, rather than simply outputting the conditional
independence equivalence class. For some parametric families it is known how to take
advantage of the non conditional independence constraints (sections 3.4 and 4.4 );
however in other parametric families, either there are no non conditional independence
constraints, or it is not known how to take advantage of the non conditional independence
constraints.
Acknowledgements: Clark Glymour and Robert Tillman thank the James S. McDonnell
Foundation for support of their research.
References
Ali, A. R., Richardson, T. S., & Spirtes, P. (2009). Markov Equivalence for Ancestral
Graphs. Annals of Statistics, 37(5B), 2808-2837.
Aliferis, C. F., Tsamardinos, I., & Statnikov, A. (2003). HITON: A Novel Markov
Blanket Algorithm for Optimal Variable Selection. Proceedings of the 2003
American Medical Informatics Association Annual Symposium, Washington, DC,
21-25.
Andersson, S. A., Madigan, D., & Perlman, M. D. (1997). A characterization of Markov
equivalence classes for acyclic digraphs. Ann Stat, 25(2), 505-541.
Asuncion, A. & Newman, D. J. (2007). UCI Machine Learning Repository.
Bartholomew, D. J., Steele, F., Moustaki, I., & Galbraith, J. I. (2002). The Analysis and
Interpretation of Multivariate Data for Social Scientists (Texts in Statistical
Science Series). Chapman & Hall/CRC.
Cartwright, N. (1994). Nature's Capacities and Their Measurements (Clarendon
Paperbacks). Oxford University Press, USA.
Chickering, D. M. (2002). Optimal Structure Identification with Greedy Search. Journal
of Machine Learning Research, 3, 507-554.
Tillman, R. E., Gretton, A., & Spirtes, P. (2009). Nonlinear directed acyclic structure
learning with weakly additive noise models. In Advances in Neural Information
Processing Systems 22.
Timberlake, M. & Williams, K. R. (1984). Dependence, Political Exclusion And
Government Repression - Some Cross-National Evidence. Am Sociol Rev, 49(1),
141-146.
Verma, T. S. & Pearl, J. (1990). Equivalence and Synthesis of Causal Models. In
Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence, 220-
227.
Zhang, K. & Hyvarinen, A. (2009). On the Identifiability of the Post-Nonlinear Causal
Model. Proceedings of the 26th International Conference on Uncertainty in
Artificial Intelligence, 647-655.
29
––––––––––––––––––––––––––––––––––––––––––
The Structural Model and the Ranking Theoretic
Approach to Causation: A Comparison
WOLFGANG SPOHN
1 Introduction
Large parts of Judea Pearl’s very rich work lie outside philosophy; moreover, basically
being a computer scientist, his natural interest was in computational efficiency, which, as
such, is not a philosophical virtue. Still, the philosophical impact of Judea Pearl’s work is
tremendous and often immediate; for the philosopher of science and the formal episte-
mologist few writings are as relevant as his. Fully deservedly, this fact is reflected in
some philosophical contributions to this Festschrift; I am glad I can contribute as well.
For decades, Judea Pearl and I were pondering some of the same topics. We both re-
alized the importance of the Bayesian net structure and elaborated on it; his emphasis on
the graphical part was crucial, though. We both saw the huge potential of this structure
for causal theorizing, in particular for probabilistic causation. We both felt the need for
underpinning the probabilistic account by a theory of deterministic causation; this is, after
all, the primary notion. And we both came up with relevant proposals. Judea Pearl ap-
proached these topics from the Artificial Intelligence side, I from the philosophy side.
Given our different proveniences, overlap and congruity are surprisingly large.
Nevertheless, it slowly dawned upon me that the glaring similarities are deceptive,
and that we fill the same structure with quite different contents. It is odd how much di-
vergence can hide underneath so much similarity. I have identified no less than fifteen
different, though interrelated, points of divergence, and, to be clear, I am referring here
only to our accounts of deterministic causation, the structural model approach so richly
developed by Judea Pearl and my (certainly idiosyncratic) ranking-theoretic approach. In
this brief paper I just want to list the points of divergence in a more or less descriptive
mood, without much argument. Still, the paper may serve as a succinct reference list of
the many crucial points that are at issue when dealing with causation and may thus help
future discussion.
At bottom, my comparison refers, on the one hand, to the momentous book of Pearl
(2000), the origins of which reach back to the other momentous book of Pearl (1988) and
many important papers in the 80's and 90's, and, on the other hand, to the chapters 14
and 15 of Spohn (forthcoming) on causation, the origins of which reach back to Spohn
(1978, sections 3.2 - 3, and 1983) and a bunch of subsequent papers. For ease of access,
though, I shall substantially refer to Halpern, Pearl (2005) and Spohn (2006) where the
relevant accounts are presented in a more compact way. Let me start with reproducing the
basic explications in section 2 and then proceed to my list of points of comparison in
section 3. Section 4 concludes with a brief moral.
A set F of structural equations is just a set of functions F_Y that specifies for each variable Y in some subset V of U how Y (essentially) functionally depends on some subset X of U; thus F_Y maps X into Y. V is the set of endogenous variables, and U − V is the set of exogenous variables. The only condition on F is that no Y in V indirectly functionally depends on itself via the equations in F. Thus, F induces a DAG on U such that, if F_Y maps X into Y, X is the set of parents of Y. (In their appendix A.4 Halpern, Pearl (2005) generalize their account by dropping the assumption of the acyclicity of the structural equations.) The idea is that F provides a set of laws that govern the variables in U, though, again, the precise interpretation of F will have to be discussed below. ⟨U, F⟩ is then called a structural model (SM). Note that a SM does not fix the values of any variables. However, once we fix the values u of all the exogenous variables in U − V, the equations in F determine the values v of all the endogenous variables in V. Let us call ⟨U, F, u⟩ a contextualized structural model (CSM). Thus, each CSM determines a specific world or course of events ω = ⟨u, v⟩ in Ω. Accordingly, each proposition A in A is true or false in a CSM ⟨U, F, u⟩, depending on whether or not ω ∈ A for the ω thereby determined.
This is not as complicated as it may look. Condition (1) says that the cause and the effect actually occur in the relevant CSM ⟨U, F, u⟩ and, indeed, had to occur given the structural equations in F and the context u. Condition (2a) says that if the cause variables in X had been set differently, the effect {Y = y} would not have occurred. It is indeed more liberal in allowing that also the variables in W outside X are set to different values, the reason being that the effect of X on Y may be hidden, as it were, by the actual values of W, and uncovered only by setting W to different values. However, this alone would be too liberal; perhaps the failure of the effect {Y = y} to occur is due only to the change of W rather than that of X. Condition (2b) counteracts this permissiveness, and ensures that basically the change in X alone brings about the change of Y. Condition (3), finally, is to guarantee that the cause {X = x} does not contain irrelevant parts; for the change described in (2a) all the variables in X are required. Note that X is a set of variables, so that {X = x} should be called a total cause of {Y = y}; its parts {Xi = xi} for Xi ∈ X may then be called contributory causes.
The details of the SM definition are mainly motivated by an adequate treatment of
various troubling examples much discussed in the literature. It would take us too far to go
into all of them. I should also mention that the SM definition is only preliminary in
Halpern, Pearl (2005); but again, the details of their more refined definition presented on
p. 870 will not be relevant for the present discussion.
The basics of the ranking-theoretic account may be explained in an equally brief way:
A negative ranking function κ for Ω is just a function κ from Ω into N ∪ {∞} such that κ(ω) = 0 for at least one ω ∈ Ω. It is extended to propositions in A by defining κ(A) = min{κ(ω) | ω ∈ A} and κ(∅) = ∞; and it is extended to conditional ranks by defining κ(B | A) = κ(A ∩ B) − κ(A) for κ(A) ≠ ∞. Negative ranks express degrees of disbelief: κ(A) > 0 says that A is disbelieved, so that κ(Ā) > 0 expresses that A is believed in κ; however, we may well have κ(A) = κ(Ā) = 0. It is useful to have both belief and disbelief represented in one function. Hence, we define the two-sided rank τ(A) = κ(Ā) − κ(A), so that A is believed, disbelieved, or neither according to whether τ(A) > 0, < 0, or = 0. Again, we have conditional two-sided ranks: τ(B | A) = κ(B̄ | A) − κ(B | A). The positive relevance of a proposition A to a proposition B is then defined by τ(B | A) > τ(B | Ā), i.e., by the fact that B is more firmly believed or less firmly disbelieved given A than given Ā; we might also say in this case that A confirms or is a reason for B. Similarly for negative relevance and irrelevance (= independence).
Like a set of structural equations, a ranking function κ induces a DAG on the frame U conforming with the given temporal order. The procedure is the same as with probabilities: we simply define the set of parents of a variable Y as the unique minimal set X of variables preceding Y such that Y is independent, relative to κ, of all the other preceding variables given X. If X is empty, Y is exogenous; if X ≠ ∅, Y is endogenous. The reading that Y directly causally depends on its parents will be justified later on.
Now, for me, being a cause is just being a special kind of conditional reason, i.e., being a reason given the past. In order to express this, for a subset X of U and a course of events ω ∈ Ω let [X] denote the proposition that the variables in X behave as they do in ω. (So far, we could denote such a proposition by {X = x}, if X(ω) = x, but we shall see in a moment that this notation is now impractical.) Then the basic definition of the ranking-theoretic account is this:
It is obvious that the SM and the RT definition deal more or less with the same explicandum; both are after actual causes, where actuality is represented either by the context u of a CSM ⟨U, F, u⟩ in the SM definition or by the course of events ω in the RT definition. A noticeable difference is that in the RT definition the cause {X ∈ A} refers only to a single variable X. Thus, the RT definition grasps what has been called a contributory cause, a total cause of {Y ∈ B} then being something like the conjunction of its contributory causes. As mentioned, the SM definition proceeds the other way around.
Of course, the major differences lie in the explicantia; this will be discussed in the next section. A further noticeable difference in the definienda is that the RT definition 1 explains only direct causation; indeed, if {X ∈ A} were an indirect cause of {Y ∈ B}, we could not expect {X ∈ A} to be positively relevant to {Y ∈ B} conditional on the rest of the past of Y in ω, since that condition would not keep open the causal path from X to Y, but would fix it to its actual state in ω. Hence, the RT definition 1 is restricted accordingly.
As the required extension, I propose the following:
One consequence of RT definition 3 is that the set of parents of Y in the DAG generated by κ and the temporal order consists precisely of all the variables on which Y directly causally depends.
So much for the two accounts to be compared. There are all the differences that meet the eye. As we shall see, there are even more. Still, let me conclude this section by pointing out that there are also fewer differences than meet the eye. I have already mentioned that both accounts make use of the DAG structure of causal graphs. And when we supplement the probabilistic versions of the two accounts, they further converge. In the structural-model approach we would then replace the context u of a CSM ⟨U, F, u⟩ by a probability distribution over the exogenous variables rendering them independent and extending via the structural equations to a distribution for the whole of U, thus forming a pseudo-indeterministic system, as Spirtes et al. (1993, pp. 38f.) call it, and hence a Bayesian net in which the probabilities agree with the causal graph. In the ranking-theoretic approach, we would replace the ranking function by a probability measure for U (or over A) that, together with the temporal order of the variables, would again induce a DAG or a
causal graph so as to form a Bayesian net. In this way, the basic ingredient of both ac-
counts would become the same: a probability measure; the remaining differences appear
to be of a merely technical nature.
Indeed, as I see the recent history of the theory of causation, this large agreement ini-
tially dominated the picture of probabilistic causation. However, the need for underpin-
ning the probabilistic by a deterministic account was obvious; after all, the longer history
of the notion was an almost entirely deterministic one up to the recent counterfactual
accounts following Lewis (1973). And so the surprising ramification sketched above
came about, both branches of which well agree with their probabilistic origins. The rami-
fication is revealing since it makes explicit dividing lines that were hard to discern within
the probabilistic harmony. Indeed, the points of divergence between the structural-model
and the ranking-theoretic approach to be discussed in the next section apply to their prob-
abilistic sisters as well, a claim that is quite suggestive, though I shall not elaborate on it.
(1) The most obvious instances provoking comparison and divergence are provided
by examples, about preemption and prevention, overdetermination and switches, etc. The
literature abounds in cases challenging all theories of causation and examples designed
for discriminating among them, a huge bulk still awaiting systematic classification
(though I attempted one in my (1983, ch. 3) as far as possible at that time). A theory of
causation must do well with these examples in order to be acceptable. No theory, though,
will reach a perfect score, all the more as many examples are contested by themselves,
and do not provide a clear-cut criterion of adequacy. And what a ‘good score’ would be
cannot but be vague. Therefore, I shall not even open this unending field of comparison
regarding the two theories at hand.
(2) The main reason why examples provide only a soft criterion is that it is ultimately
left to intuition to judge whether an example has been adequately treated. There are
strong intuitions and weak ones. They often agree and often diverge. And they are often
hard to compromise. Indeed, intuitions play an indispensable and important role in as-
sessing theories of causation; they seem to provide the ultimate unquestionable grounds
for that assessment.
Still, I have become cautious about the role of intuitions. Quite often I felt that the
intuitions authors claim to have are guided by their theory; their intuitions seem to be
what their theory suggests they should be. Indeed, the more I dig into theories of causa-
tion and develop my own, the harder it is for me to introspectively discern whether or not
I share certain intuitions independently of any theorizing. So, again, the appeal to intui-
tions must be handled with care, and I shall not engage in a comparison of the relevant
theories on an intuitive level.
(3) Another large field of comparison is the proximity to and the applicability in sci-
entific practice. No doubt, the SM account fares much better in this respect than the RT
approach. Structural modeling is something many scientists really do, whereas ranking
theory is unknown in the sciences and it may be hard to say why it should be known
outside epistemology. The point applies to other accounts as well. The regularity theory
of causation seems close to the sciences, since they seem to state laws and regularities,
whereas counterfactual analyses seem remote, since counterfactual claims are not an
official part of scientific theories, even though, unofficially, counterfactual talk is ubiq-
uitous. And probabilistic theories maintain their scientific appearance by ecumenically
hiding disputes about the interpretation of probability.
Again, the importance of this criterion is undeniable; the causal theorist is well ad-
vised to appreciate the great expertise of the sciences, in general and specifically con-
cerning causation. Still, I tend to downplay this criterion, not only in order to keep the RT
account as a running candidate. The point is rather that the issue of causation is of a kind
for which the sciences are not so well prepared. The counterfactual analysis is a case in
point. If it should be basically correct, then the counterfactual idiom can no longer be
treated as a second-rate vernacular (to use Quine’s term), as the sciences do, but must be
squarely faced in a systematic way, as, e.g., Pearl (2000, ch. 7) does, but qua philosopher,
not qua scientist. Probabilities are a similar case. Mathematicians and statisticians know best, by far, how to deal with them. However, when it comes to saying what probabilities mean, they are not in a privileged position.
The point of these three remarks is to claim primacy for theoretical issues about cau-
sation as such. External considerations are relevant and helpful, but they cannot release
us from the task of taking some stance or other towards these theoretical issues. So, let us
turn to them.
(4) Both the SM and the RT account are based on a frame providing a framework of
variables and appertaining facts. I am not sure, however, whether we interpret it in the
same way. A (random) variable is a function from some state space into some range of
values, usually the reals; this is mathematical standard. That a variable takes a certain
value is a proposition, and if the value is the true one (in some model), the proposition is
a fact (in that model); so much is clear. However, the notion of a variable is ambiguous,
and it has been so since its statistical origins. A variable may vary over a given population as its
state space and take on a certain value for each item in the population. E.g., size varies
among Germans and takes (presently) the value 6' 0'' for me. This is what I call a generic
variable. Or a variable may vary over a set of possibilities as its state space and take
values accordingly. For example, my (present) size is a variable in this sense and actually
takes the value 6' 0'', though it takes other values in other possibilities; I might (presently)
have a different size. I call this a singular variable representing the possibility range of a
given single case. For each German (and time), size is such a singular variable. The ge-
neric variable of size, then, is formed by the actual values of all these singular variables.
The above RT account exclusively speaks about singular variables and their realiza-
tions; generic variables simply are out of the picture. By contrast, the ambiguity seems to
afflict the SM account. I am sure everybody is fully clear about the ambiguity, but this
clarity seems insufficiently reflected in the terminology. For instance, the equations of a
SM represent laws or ceteris paribus laws or invariances in Woodward’s (2003) terms or
statistical laws, if supplemented by statistical ‘error’ terms, and thus state relations be-
tween generic variables. It is contextualization by which the model gets applied to a
given single case; then, the variables should rather be taken as singular ones; their taking
certain values then constitutes specific facts. There is, however, no terminological distinction of
the two interpretations; somehow, the notion of a variable seems to be intended to play
both roles. In probabilistic extensions we find the same ambiguity, since probabilities
may be interpreted as statistical distributions over populations or as realization propensi-
ties of the single case.
(5) I would not belabor the point if it did not extend to the causal relations we try to
capture. We have causation among facts, as analyzed in the SM definition and the RT
definitions 1 - 2; they are bound to apply to the single case. And we have causal relations
among variables, i.e., causal dependence (though often and in my view confusingly the
term “cause” is used here as well), and we find here the same ambiguity. Causal depend-
ence between generic variables is a matter of causal laws or of general causation. How-
ever, there is also causal dependence between singular variables, something rarely made
explicit, and it is a matter of singular causation applying to the single case just as much
as causation between facts. Since its inception the discussion of probabilistic causality
was caught in this ambiguity between singular and general causation; and I am wonder-
ing whether we can still observe the aftermath of that situation.
In any case, structural equations are intended to capture causal order, and the order
among generic variables thus given pertains to general causation. Derivatively these
equations may be interpreted as stating causal dependencies also between singular vari-
ables. In the SM account, though, singular causation is explicitly treated only as pertain-
ing to facts. By contrast, the RT definition 3 explicates only causal dependence between
singular variables. The RT account is so far silent about general causation and can grasp
it only by generalizing over the causal relations in the single case. These remarks are not
just pedantry; I think it is important to observe these differences for an adequate compari-
son of the accounts.
(6) I see these differences related to the issue of the role of time in an analysis of cau-
sation. The point is simply that generic variables as such are not temporally ordered,
since their arguments, the items to which they apply, may have varying temporal posi-
tions; usually, statistical data do not come temporally ordered. By contrast, singular vari-
ables are temporally ordered, since their variable realizability across possibilities is tied
to a fixed time. As a consequence, the SM definition makes no explicit reference to time,
whereas the RT definitions make free use of that reference. While I think that this point
has indeed disposed Judea Pearl and me to our diverging perspectives on the relation
between time and causation, it must be granted that the issue takes on much larger dimen-
sions that open enough room for indecisive defenses of both perspectives.
Many points are involved: (i) Issues of analytic adequacy: while Pearl (2000, pp.
249ff.) argues that reference to time does not sufficiently further the analytic project and
proposes ingenious alternatives (sections 2.3-2.4 and 2.8-2.9), I am much more optimistic
about the analytic prospects of referring to time (see my 1990, section 3, and forthcom-
ing, section 14.4). (ii) Issues of analytic policy (see also point 10 below): Is it legitimate
to refer to time in an analysis of causation? I was never convinced by the objections. Or
should the two notions be analytically decoupled? Or should the analytic order be even
reversed by constructing a causal theory of time? Pearl (2000, section 2.8) shows sym-
pathies for the latter project, although he suggests an evolutionary explanation, rather
than Reichenbach’s (1956) physical explanation for relating temporal direction with
causal directionality. (iii) The issue of causal asymmetry: Is the explanation of causal
asymmetry by temporal asymmetry illegitimate? Or incomplete? Or too uninformative, as
far as it goes? If any of these, what is the alternative?
(7) Causation always is causation within given circumstances. What do the accounts
say what the circumstances are? The RT definition 1 explicitly takes the entire past of the
effect except the cause as the circumstances of a direct causal relationship, something
apparently much too large and hence inadequate, but free of conceptual circularity, as I
have continuously emphasized. In contrast, Pearl (2000, pp. 250ff.) endorses the circular
explanation of Cartwright (1979) that those circumstances consist of the other causes of
the effect and hence, in the case of direct causation, of the realizations of the other par-
ents of the effect variable in the causal graph. Pearl thus accepts also Cartwright’s con-
clusion that the reference to the obtaining circumstances does not help explicating causa-
tion; he thinks that this reference at best provides a kind of consistency test. I argue that
the explicatory project is not doomed thereby, since Cartwright’s circular explanation
may be derived from my apparently inadequate definition (cf. Spohn 1990, section 4). As
for the circumstances of indirect causation, the RT definition 2 is entirely silent, since it
relies on transitivity; however, in Spohn (1990, Theorems 14 and 16) I explored how
much I can say about them. In contrast, the SM definition contains an implicit account of
the circumstances that applies to indirect causal relationships as well; it is hidden in the partition ⟨Z, W⟩ of the set V of endogenous variables. However, it still accepts Cart-
wright’s circular explanation, since it presupposes the causal graph generated by the
structural equations. So, this is a further respect in which our accounts are diametrically
opposed.
(8) The preceding point contains two further issues. One concerns the distinction of
direct and indirect causation. The SM approach explicates causation without attending to
this distinction. Of course, it could account for it, but it does not acquire a basic impor-
tance. By contrast, the distinction receives analytic significance within the RT approach
that first defines direct causation and then, only on that basis, indirect causation. The
reason is that, in this way, the RT approach hopes to reach a non-circular explication of
causation, whereas the SM approach has given up on this hope (see also point 10 below)
and thus sees no analytic rewards in this distinction.
(9) The other issue already alluded to in (7) is the issue of transitivity. This is a most
vexed topic, and the community seems unable to find a stable attitude. Transitivity had to
be given up, it seemed, within probabilistic causation (cf. Suppes 1970, p. 58), while it
was derivable from a regularity account and was still defended by Lewis (1973) for de-
terministic causation. In the meantime the situation has reversed; transitivity has become
more respectable within the probabilistic camp; e.g., Spirtes et al. (1993, p. 44) simply
assume it in their definition of “indirect cause”. By contrast, more and more tend to reject
it for deterministic causation (cf., e.g., McDermott 1995 and Hitchcock 2001).
This uncertainty is also reflected in the present comparison. Pearl (2000, p. 237) re-
jects transitivity of causal dependence among variables, but, as the argument shows, only
in the sense of what Woodward (2003, p. 51) calls “total cause”. Still, Woodward (2003,
p. 59), in his concluding explication M, accepts the transitivity of causal dependence
among variables in the sense of “contributory cause”, and I have not found any indication
in Pearl (2000) or Halpern, Pearl (2005) that they would reject Woodward’s account of
contributory causation. However, all of them deny the transitivity of actual causation
between facts.
I see it just the other way around. The RT definition 2 stipulates the transitivity of
causation (with arguments, though; cf. Spohn 1990, p. 138, and forthcoming, section
14.12), whereas the RT definition 3 entails the transitivity of causal dependence among
variables in the contributory sense only under (mild) additional assumptions. Another
diametrical opposition.
(10) A much grander issue is looming behind the previous points, the issue of analytic
policy. The RT approach starts defining direct causation between singular facts, proceeds
to indirect causation and then to causal dependence between singular variables, and fi-
nally only hopes to thereby grasp general causation as well. It thus claims to give a non-
circular explication or a reductive analysis of causation. The SM approach proceeds in
the opposite direction. It presupposes an account of general causation that is contained in
the structural equations, transfers this to causal dependence between singular variables (I
mentioned in points 4 and 5 that this step is not fully explicit), and finally arrives at actual
causation between facts. The claim is thereby to give an illuminating analysis of causa-
tion, but not a reductive one.
Now, one may have an argument about conceptual order: which causal notions to ex-
plicate on the basis of which? I admit I am bewildered by the SM order. The deeper issue,
though, or perhaps the deepest, is the feasibility of reductive analysis. Nobody doubts that
it would be most welcome to have one; therefore the history of the topic is full of at-
tempts at such an analysis. Perhaps, though, they are motivated by wishful thinking. How
to decide? One way of assessing the issue is by inspecting the proposals. The proponents
are certainly confident of their analyses, but their inspection revealed so many problems
that doubts preponderate. However, this does not prove their failure. Also, one may ad-
vance principled arguments such as Cartwright’s (1979) that one cannot avoid being
entangled in conceptual circles. For such reasons, the majority, it seems, has acquiesced
in non-reductive analysis; cf., e.g., Woodward (2003, pp. 104ff.) for an apology of non-
reductivity or Glymour (2004) for a eulogy of the, as he calls it, Euclidean as opposed to
the Socratic ideal.
Another way of assessing the issue is more philosophical. Are there any more basic
features of reality to which causation may reduce? One may well say no, and thereby
justify the rejection of reductive analysis. Or one may say yes. Laws may be such a more
basic feature; this, however, threatens to result either in an inadequate regularity theory of
causation or in an inability to say what laws are beyond regularities. Objective probabili-
ties may be such a feature – if we only knew what they are. What else is there on offer?
On the other hand, it is not so easy to simply accept causation as a basic phenomenon;
after all, the point has deeply worried philosophers for centuries after Hume.
In any case, all these issues are involved in settling on a certain analytic policy. It will
become clearer in the subsequent points why I nevertheless maintain the possibility of
reductive analysis.
(11) The most conspicuous difference of the SM and the RT approach is a direct con-
sequence of their different policies. The SM account bases its analysis on structural mod-
els or equations, whereas the RT account explicates causation in terms of ranking func-
tions. These are entirely different things!
Prima facie, structural equations are easier to grasp. Despite its non-reductive proce-
dure the SM approach incurs the obligation, though, to somehow explain how the struc-
tural equations can establish causal order among generic variables. They can do this,
because Pearl (2000, pp. 157ff.) explicitly gives them an interventionistic interpretation
that, in turn, is basically a counterfactual one, as is entirely clear to Pearl; most interven-
tions are only counterfactual. Woodward (2003) repeatedly emphasizes the point that the
interventionistic account clarifies the counterfactual approach by forcing a specific inter-
pretation of the multiply ambiguous counterfactual idiom. Still, despite Woodward’s
(2003, pp. 121f.) claim to use counterfactuals only when they are clearly true or false,
and despite Pearl’s (2000, section 7.1) attempt to account for counterfactuals within
structural models, the issue of how counterfactuals acquire truth conditions remains a mys-
tery in my view.
By contrast, it is quite bewildering to base an analysis of causation on ranking func-
tions that are avowedly to be understood only as doxastic states, i.e., in a purely episte-
mological way. One of my reasons for doing so is that the closer inspection envisaged in
(10) comes out, on the whole, more satisfactorily than for other accounts, that is, the
overall score in dealing with examples is better. The other reason why I find ranking
functions not so implausible a starting point lies in my profoundly Humean strategy in
dealing with causation. There is no more basic feature of reality to which causation might
reduce. The issue rather is how modal facts come into the world – where modal facts
(12) A basic idea in our notion of causation between facts is, very roughly, that the
cause does something for its effect, contributes to it, makes it possible or necessary or
more likely, in short: that the cause is somehow positively relevant to its effect. One fact
could also be negatively relevant to another, in which case the second obtains despite the
first. As for causal dependence between variables, it is only required that the one is rele-
vant for the other. What are the notions of relevance and positive relevance provided by
the SM and the RT approach?
Ranking theory has a rich notion of positive and negative relevance, analogous and
equivalent in formal behavior to the probabilistic notions. Its relevance notion is much
richer and, I find, more adequate to the needs of causal theorizing than those provided by
the key terms of other approaches to deterministic causation: laws, counterfactuals, inter-
ventions, structural equations, or whatever. This fact grounds my optimism that the RT
approach is, on the whole, better able to cope with all the examples and problem cases.
I just said that the relevance notion provided by the SM approach is poorer. What is
it? Clause (2b) of the SM definition says, in a way, that the effect {Y = y} had to occur given that the cause {X = x} occurs, and clause (2a) says that the effect might not have occurred if the cause does not occur and, indeed, would not have occurred if the cause variable(s) X had been realized in a suitable alternative way. In traditional
terms, we could say that the cause is a necessary and sufficient condition of the effect
provided the circumstances – where the subtleties of the SM approach lie in the proviso;
that’s the SM positive relevance notion. So, roughly, in SM terms, the only ‘action’ a
cause can do is making its effect necessary, whereas ranking theory allows many more
‘actions’. This is what I mean by the SM approach being poorer. For instance, it is not
clear how a fact could be negatively relevant to another fact in the SM approach, or how
one fact could be positively and another negatively relevant to a third one. And so forth.
(13) Let’s take a closer look at what “action” could mean in the previous paragraph.
In the RT approach it means comparing ranks conditional on the cause {X ∈ A} and on its negation {X ∈ Ā}; the rank raising showing up in that comparison is what the cause ‘does’. In the SM approach we do not conditionalize on the cause {X = x} and some alternative {X = x′}; rather, in clauses (2a-b) of the SM definition we look at the consequences of the interventions X ← x and X ← x′, i.e., by replacing the structural equation(s) for X by the stipulation X = x or, respectively, X = x′. The received view
by now is that intervention is quite different from conditionalization (cf., e.g.,
Goldszmidt, Pearl 1992, and Meek, Glymour 1994), the suggestion being that interven-
tion is what causal theorizing requires, and that all approaches relying on conditionaliza-
tion such as the RT approach therefore are misguided (cf. also Pearl 2000, section 3.2).
The difference looks compelling: intervention is a real activity, whereas conditionali-
zation is only a mental, suppositional activity. But once we grant that intervention is
mostly counterfactual (i.e., also suppositional), the difference shrinks. Indeed, I tend to
say that there never is a real intervention in a given single case; after a real intervention
we deal with a different single case than before. Hence, I think the difference the received
view assumes is spurious; rather, interventions may be construed in terms of conditionalization.
Of course, the intervention X ← x differs from conditioning on {X = x}; in this, the received view is correct. However, the RT and other conditioning approaches do not simply conditionalize on the cause, but on much more. What the intervention X1 ← x1 on the single variable X1 does is change the value of X1 to x1 while at the same time keeping fixed the values of all temporally preceding variables as they are in the given context, or, if only a causal graph and not a temporal order is available, either of all ancestors of X1 or of all non-descendants of X1 (which comes to the same thing in structural models, and also in probabilistic terms given the common cause principle). Thus, the intervention is equivalent to conditioning on {X1 = x1} and on the fixed values of those other variables. Similarly for a double intervention ⟨X1, X2⟩ ← ⟨x1, x2⟩. For assessing the behavior of the variables temporally between X1 and X2 (or being descendants of X1, but not of X2) under the double intervention, we have to look at the same conditionalization as in the single intervention X1 ← x1, whereas for the variables later than X2 (or descending from both X1 and X2) we have to condition on {X1 = x1}, {X2 = x2}, the past of X1 as it is in the given context, and on those intermediate variables taking the values as they are after the intervention X1 ← x1. And so forth for multiple interventions (that are so crucial for the SM approach).
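As a minimal worked instance of this translation (assuming, purely for illustration, a frame consisting of just three variables X0 ≺ X1 ≺ Y and a given context ω), assessing the effect of the single intervention X1 ← x1 on Y amounts, on the proposed reading, to evaluating the conditional two-sided rank τ({Y = y} | {X1 = x1} ∩ {X0 = ω(X0)}): we condition on the intervened value of X1 together with the actual value of the one temporally preceding variable X0, while the later variable Y is left free to respond to the supposition.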
Given this translation, this kind of difference between the SM and the RT approach
vanishes, I think. Consider, e.g., the definition of direct causal dependence of Woodward
(2003, p. 55): Y directly causally depends on X iff an intervention on X can make a differ-
ence to Y, provided the values of all other variables in the given frame U are somehow
fixed by intervention. Translate this as proposed, and you arrive at the conditionalization
I use in the above RT definitions to characterize direct causation.
(14) The preceding argument has a gap that emerges when we attend to another topic
that I find crucial, but nowhere thoroughly discussed: the frame-relativity of causation.
Everybody agrees that the distinction between direct and indirect causation is frame-
relative; of course, a direct causal relationship relative to a coarse-grained frame may turn
indirect under refinements. What about causation itself, though? One may try some mod-
erate antirealism, e.g., general thoughts to the effect that science only produces models of
reality and never truly represents reality as it really is; then causation would be model-
relative, too.
However, this is not what I have in mind. The point is quite specific: The RT defini-
tion 1 refers, in a way I had explained in point 7, to the obtaining circumstances, however
only insofar as they are represented in the given frame U. This entails a genuine frame-
relativity of causation as such; {X = x} may be a (direct) cause of {Y = y} within one
frame, but not within another or more refined frame. As Halpern, Hitchcock (2010, Sec-
tion 4.1) argue, this phenomenon may also show up within the SM approach.
I do not think that this agrees with Pearl’s intention in pursuing the SM account; an
actual cause should not cease to be an actual cause simply by refining the frame. Perhaps,
the intention was to arrive at a frame-independent notion of causation by assuming a
frame-independent notion of intervention. My translation of the intervention X1 ← x1 into
conditionalization referred to the past (or the ancestors or the non-descendants) of X1 as
far as they are represented in the given frame U, and thus reproduced only a frame-
relative notion of intervention. However, the intention presumably is to refer to the entire
past of X1 absolutely, not leaving any hole for the supposition of {X1 = x1} to backtrack.
If so, there is another sharp difference between the SM and the RT approach with reper-
cussions on the previous point.
Of course, I admit that our intuitive notion of causation is not frame-relative; we aim
at an absolute notion. However, this aim bars us from having a reductive analysis of cau-
sation, since the analysis would have to refer then to the rest of the world, as it were, to
many things outside the frame that are thus prevented from entering the analysis. In fact,
any rigorous causal theorizing is thereby frustrated in my view. For, how can you theo-
retically deal with all those don’t-know-what’s? For this reason I always preferred to
work with a fixed frame, to pretend that this frame is all there is, and then to say every-
thing about causation that can be said within this frame. This procedure at least allows a
reductive analysis of a frame-relative notion.
How, then, can we get rid of the frame-relativity? I propose, by ever more fine-
graining and extending the frame, studying the frame-relative causal relations within all
these well-defined frames, and finding out what remains stable across all these refine-
ments; we may hope, then, that these stable features are preserved even in the maximally
refined, universal frame (cf. Spohn forthcoming, section 14.9; for Halpern, Hitchcock
(2010, Section 4.1) this stability is also crucial). I would not know how else to deal with
the challenge posed by frame-relativity, and I suspect that considerable problems in
causal theorizing result from not explicitly facing this challenge.
(15) The various points may be summarized in the final opposition: whether causation
is to be subjectivistically or objectivistically conceived. Common sense, Judea Pearl, and
many others are on the objectivistic side: “I now take causal relationships to be the fun-
damental building blocks both of physical reality and of human understanding of that
reality” (Pearl 2000, pp. xiiif.). And insofar as structural equations are objective, the SM
approach shares this objectivism. By contrast, frame-relativity is an element of subject-
relativity; frames are chosen by us. And the use of only epistemically interpretable rank-
ing functions involves a much deeper subjectivization of the topic of causation. (The
issue of relevance, point 12, is related, by the way, since in my view only epistemic rele-
vance is rich enough a concept.)
The motive of the subjectivistic RT approach was, I said, Hume’s challenge. And the
gain, I claimed, is the feasibility of a reductive analysis. Any objectivistic approach has to
tell how else to cope with that challenge and how to make peace with non-reductivity.
Still, we cannot simply acquiesce in subjectivism, since it flies in the face of everyone
keeping some sense of reality. The general philosophical strategy to escape pure subjec-
tivism has been aptly described by Blackburn (1993, part I) as Humean projectivism
leading to so-called quasi-realism that is indistinguishable from ‘real’ realism.
This general strategy may be precisely explicated in the case of causation: I had indi-
cated in the previous point how I propose to get rid of frame-relativity. And in Spohn
(forthcoming, ch. 15) I develop an objectification theory for ranking functions, according
to which some ranking functions, the objectifiable ones, may be said, to truly (or falsely)
represent causal relations. No doubt, this objectification theory is disputable, but it shows
that the subjectivistic starting point need not preclude us from objectivistic aims. Maybe,
though, these aims are more convincingly served by approaching them in a more direct
and realistic way, as the SM account does.
4 Conclusion
On none of the fifteen differences above could I seriously start a discussion; obviously nothing below book length would do. Indeed, discussing these points was not my aim at all, let alone treating any one of them conclusively (though, of course, I could not hide where my sympathies lie). My first intention was simply to display the differences, not all of which
are clearly seen in the literature; already the sheer number is surprising. And I expressed
my second intention between point 3 and point 4: namely to show that there are many
internal theoretical issues in the theory of causation. On all of them one must take and
argue a stance, a most demanding requirement. My hunch is that those theoretical consid-
erations will eventually override issues of exemplification and application. It is all the more important, then, to take some stance; no less will do for reaching a considered judgment.
Judea Pearl has paradigmatically shown how to do this. His brilliant theoretical develop-
ments have not closed, but tremendously advanced our understanding of all these issues
pertaining to causation.
References
Blackburn, S. (1993). Essays in Quasi-Realism, Oxford: Oxford University Press.
Cartwright, N. (1979). Causal laws and effective strategies. Noûs 13, 419-437.
Glymour, C. (2004). Critical notice on: James Woodward, Making Things Happen,
British Journal for the Philosophy of Science 55, 779-790.
Goldszmidt, M., and J. Pearl (1992). Rank-based systems: A simple approach to belief revision, belief update, and reasoning about evidence and actions. In B. Nebel, C. Rich, and W. Swartout (eds.), Principles of Knowledge Representation and Reasoning: Proceedings of the Third International Conference, San Mateo, CA: Morgan Kaufmann, 661-672.
30
––––––––––––––––––––––––––––––––––––––––––
On Identifying Causal Effects
JIN TIAN AND ILYA SHPITSER
1 Introduction
This paper deals with the problem of inferring cause-effect relationships from a
combination of data and theoretical assumptions. This problem arises in diverse
fields such as artificial intelligence, statistics, cognitive science, economics, and the
health and social sciences. For example, investigators in the health sciences are
often interested in the effects of treatments on diseases; policymakers are concerned
with the effects of policy decisions; AI research is concerned with effects of actions
in order to design intelligent agents that can make effective plans under uncertainty;
and so on.
To estimate causal effects, scientists normally perform randomized experiments
where a sample of units drawn from the population of interest is subjected to the
specified manipulation directly. In many cases, however, such a direct approach is
not possible due to expense or ethical considerations. Instead, investigators have
to rely on observational studies to infer effects. A fundamental question in causal
analysis is to determine when effects can be inferred from statistical information,
encoded as a joint probability distribution, obtained under normal, intervention-
free behavior. A key point here is that it is not possible to draw causal conclusions
from purely probabilistic premises – it is necessary to make causal assumptions.
This is because without any assumptions it is possible to construct multiple “causal
stories” which can disagree wildly on what effect a given intervention can have, but
agree precisely on all observables. For instance, smoking may be highly correlated
with lung cancer either because it causes lung cancer, or because people who are
genetically predisposed to smoke may also have a gene responsible for a higher cancer
incidence rate. In the latter case there will be no effect of smoking on cancer.
In this paper, we assume that the causal assumptions will be represented by
directed acyclic causal graphs [Pearl, 2000; Spirtes et al., 2001] in which arrows
represent the potential existence of direct causal relationships between the corre-
sponding variables and some variables are presumed to be unobserved. Our task will
be to decide whether the qualitative causal assumptions represented in any given
graph are sufficient for assessing the strength of causal effects from nonexperimental
data.
This problem of identifying causal effects has received considerable attention
in the statistics, epidemiology, and causal inference communities [Robins, 1986;
Robins, 1987; Pearl, 1993; Robins, 1997; Kuroki and Miyakawa, 1999; Glymour
and Cooper, 1999; Pearl, 2000; Spirtes et al., 2001]. In particular Judea Pearl and
his colleagues have made major contributions in solving the problem. In his seminal
paper Pearl (1995) established a calculus of interventions known as do-calculus –
three inference rules by which probabilistic sentences involving interventions and
observations can be transformed into other such sentences, thus providing a syntac-
tic method of deriving claims about interventions. Later, do-calculus was shown to
be complete for identifying causal effects, that is, every causal effect that can be
identified can be derived using the three do-calculus rules [Shpitser and Pearl, 2006a;
Huang and Valtorta, 2006b]. Pearl (1995) also established the popular “back-door”
and “front-door” criteria – sufficient graphical conditions for ensuring identifica-
tion of causal effects. Using do-calculus as a guide, Pearl and his collaborators
developed a number of sufficient graphical criteria: a criterion for identifying causal
effects between singletons that combines and expands the front-door and back-door
criteria [Galles and Pearl, 1995], and a condition for evaluating the effects of plans in
the presence of unmeasured variables, each plan consisting of several concurrent
or sequential actions [Pearl and Robins, 1995]. More recently, an approach based
on c-component factorization has been developed in [Tian and Pearl, 2002a; Tian
and Pearl, 2003] and complete algorithms for identifying causal effects have been
established [Tian and Pearl, 2003; Shpitser and Pearl, 2006b; Huang and Valtorta,
2006a]. Finally, a general algorithm for identifying arbitrary counterfactuals has
been developed in [Shpitser and Pearl, 2007], while the special case of effects of
treatment on the treated has been considered in [Shpitser and Pearl, 2009].
In this paper, we summarize the state of the art in identification of causal effects.
The rest of the paper is organized as follows. Section 2 introduces causal models
and gives a formal definition of the identifiability problem. Section 3 presents Pearl's
do-calculus and a number of easy to use graphical criteria. Section 4 presents the
results on identifying (unconditional) causal effects. Section 5 shows how to iden-
tify conditional causal effects. Section 6 considers identification of counterfactual
quantities which arise when we consider effects of relative interventions. Section 7
concludes the paper.
P(v) = Π_i P(vi | pai),   (1)
where pai are (values of) the parents of variable Vi in the graph. Here we use
uppercase letters to represent variables or sets of variables, and use corresponding
lowercase letters to represent their values (instantiations).
The set of conditional independences implied by the causal Bayesian network
can be obtained from the causal diagram G according to the d-separation criterion
[Pearl, 1988].
DEFINITION 1 (d-separation). A path p is said to be blocked by a set of nodes Z if and only if
1. p contains a chain i → m → j or a fork i ← m → j such that the middle node m is in Z, or
2. p contains a collider i → m ← j such that the middle node m is not in Z and no descendant of m is in Z.
A set Z is said to d-separate X from Y if and only if Z blocks every path from a node in X to a node in Y.
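As an aside, the criterion is easy to check mechanically. The following Python sketch is ours, not the chapter's (the graph encoding and the function names are invented); it uses the standard equivalent test via the moralized ancestral graph: Z d-separates X from Y exactly when X and Y become disconnected after restricting the DAG to the ancestors of X ∪ Y ∪ Z, connecting ("marrying") parents of a common child, dropping edge directions, and deleting the nodes in Z.

from itertools import combinations

def ancestral(nodes, parents):
    """The given nodes together with all their ancestors (child -> set of parents)."""
    result, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in result:
            result.add(v)
            stack.extend(parents.get(v, ()))
    return result

def d_separated(X, Y, Z, parents):
    """True iff Z d-separates X from Y in the DAG encoded by `parents`
    (tested via the moralized ancestral graph)."""
    keep = ancestral(set(X) | set(Y) | set(Z), parents)
    # Undirected edges: every kept parent-child pair, plus an edge between any
    # two kept parents of a common child ("marrying" the parents).
    adj = {v: set() for v in keep}
    for child in keep:
        pas = [p for p in parents.get(child, ()) if p in keep]
        for p in pas:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pas, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Delete Z and check whether some node of X can still reach some node of Y.
    blocked, targets = set(Z), set(Y)
    stack, seen = [v for v in X if v not in blocked], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v in targets:
            return False
        stack.extend(adj[v] - blocked - seen)
    return True

# The graph of Figure 1(a): U -> X, X -> Z, Z -> Y, and U -> Y.
parents = {"U": set(), "X": {"U"}, "Z": {"X"}, "Y": {"Z", "U"}}
print(d_separated({"X"}, {"Y"}, {"Z", "U"}, parents))  # True: {Z, U} blocks every path
print(d_separated({"X"}, {"Y"}, {"Z"}, parents))       # False: X <- U -> Y remains open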
[Figure 1: (a) the causal diagram G of the smoking example, with U unobserved, U → X (Smoking), X → Z (Tar in lungs), Z → Y (Cancer), and U → Y; (b) the manipulated diagram Gdo(X=False); (c) the same model with the unobserved confounder drawn as a bidirected edge between X and Y.]
Eq. (2) represents a truncated factorization of (1), with factors corresponding to the
manipulated variables removed. This truncation follows immediately from (1) since,
assuming modularity, the post-intervention probabilities P (vi |pai ) corresponding
to variables in T are either 1 or 0, while those corresponding to unmanipulated
variables remain unaltered. If T stands for a set of treatment variables and Y for
an outcome variable in V \ T , then Eq. (2) permits us to calculate the probability
Pt (y) that event Y = y would occur if treatment condition T = t were enforced
uniformly over the population. This quantity, often called the “causal effect” of T
on Y , is what we normally assess in a controlled experiment with T randomized, in
which the distribution of Y is estimated for each level t of T .
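A small numerical sketch may help make the truncated factorization concrete. The Python fragment below is only an illustration (all conditional probability tables are invented); it uses a four-variable model of the same shape as the smoking example discussed next, U → X → Z → Y with U → Y, and compares the observational conditional P(y | x) with the post-intervention quantity Px(y) obtained by removing the factor of the manipulated variable and summing over the rest.

from itertools import product

# Invented parameters for a model shaped like the smoking example below:
# U -> X, X -> Z, Z -> Y, and U -> Y, with U unobserved and all variables binary.
P_u = {0: 0.5, 1: 0.5}                      # P(u)
P_x1_u = {0: 0.2, 1: 0.8}                   # P(X=1 | u)
P_z1_x = {0: 0.1, 1: 0.9}                   # P(Z=1 | x)
P_y1_zu = {(0, 0): 0.1, (0, 1): 0.5,        # P(Y=1 | z, u)
           (1, 0): 0.4, (1, 1): 0.9}

def bern(p1, value):
    """Probability of a binary variable taking `value`, given P(value = 1) = p1."""
    return p1 if value == 1 else 1.0 - p1

def joint(u, x, z, y):
    """Pre-intervention factorization P(u, x, z, y) = P(u) P(x|u) P(z|x) P(y|z, u)."""
    return P_u[u] * bern(P_x1_u[u], x) * bern(P_z1_x[x], z) * bern(P_y1_zu[(z, u)], y)

def p_y_given_x(y, x):
    """Observational conditional P(y | x), obtained from the joint distribution."""
    num = sum(joint(u, x, z, y) for u, z in product((0, 1), repeat=2))
    den = sum(joint(u, x, z, yy) for u, z, yy in product((0, 1), repeat=3))
    return num / den

def p_y_do_x(y, x):
    """Post-intervention P_x(y): the factor P(x|u) of the manipulated variable is
    removed (truncated product) and the remaining factors are summed over u and z."""
    return sum(P_u[u] * bern(P_z1_x[x], z) * bern(P_y1_zu[(z, u)], y)
               for u, z in product((0, 1), repeat=2))

print(p_y_given_x(1, 1))   # about 0.762: merely observing X = 1
print(p_y_do_x(1, 1))      # 0.615: setting X = 1; the two differ because U confounds X and Y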
As an example, consider the model shown in Figure 1(a) from [Pearl, 2000] that
concerns the relation between smoking (X) and lung cancer (Y ), mediated by the
amount of tar (Z) deposited in a person’s lungs. The model makes qualitative
causal assumptions that the amount of tar deposited in the lungs depends on the
level of smoking (and external factors) and that the production of lung cancer
depends on the amount of tar in the lungs but smoking has no effect on lung cancer
except as mediated through tar deposits. There might be (unobserved) factors (say
some unknown carcinogenic genotype) that affect both smoking and lung cancer,
but the genotype nevertheless has no effect on the amount of tar in the lungs
except indirectly (through smoking). Quantitatively, the model induces the joint distribution factorized as
P(x, z, y) = Σ_u P(u) P(x|u) P(z|x) P(y|z, u).
3 [Pearl, 1995; Pearl, 2000] used the notation P(v|set(t)), P(v|do(t)), or P(v|t̂) for the post-intervention distribution Pt(v).
where P ai and U i stand for the sets of the observed and unobserved parents of
Vi respectively, and the summation ranges over all the U variables. The post-
intervention distribution, likewise, will be given as a mixture of truncated products:
Pt(v) = Σ_u Π_{i | Vi ∉ T} P(vi | pai, ui) P(u)   if v is consistent with t,   and   Pt(v) = 0   if v is inconsistent with t.   (6)
And, the question of identifiability arises, i.e., whether it is possible to express some
causal effect Pt (s) as a function of the observed distribution P (v), independent of
the unknown quantities, P (u) and P (vi |pai , ui ).
4 Whether or not any actual action is an ideal manipulation of a variable (or is feasible at all)
where Z(W ) is the set of Z-nodes that are not ancestors of any W -node in
GX .
Similarly, if X and Y are two disjoint sets of nodes in G, then Z is said to satisfy
the back-door criterion relative to (X, Y ) if it satisfies the criterion relative to any
pair (Xi , Xj ) such that Xi ∈ X and Xj ∈ Y .
THEOREM 6 (Back-Door Criterion). [Pearl, 1995] If a set of variables Z satisfies the back-door criterion relative to (X, Y), then the causal effect of X on Y is identifiable and is given by the formula
Px(y) = Σ_z P(y|x, z) P(z).   (10)
For example, in Figure 1(c) X satisfies the back-door criterion relative to (Z, Y) and we have
Pz(y) = Σ_x P(y|x, z) P(x).   (11)
For example, in Figure 1(c) Z satisfies the front-door criterion relative to (X, Y )
and the causal effect Px (y) is given by Eq. (12).
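Both criteria are straightforward to evaluate once an observational joint distribution is available. The Python sketch below is ours, with an invented joint P(x, z, y) standing in for data from the model of Figure 1(c); it computes the back-door adjustment of Eq. (11) and a front-door estimate of Px(y). The front-door expression written in the code is the standard front-door adjustment formula rather than a quotation of Eq. (12).

from itertools import product

# An invented observational joint P(x, z, y) over binary variables, standing in
# for data generated by the model of Figure 1(c).
values = [0.20, 0.05, 0.09, 0.01, 0.02, 0.08, 0.05, 0.50]
P = dict(zip(product((0, 1), repeat=3), values))   # keys are (x, z, y)

def marg(**fixed):
    """Sum of P over all entries agreeing with the fixed coordinates, e.g. marg(x=1, z=0)."""
    return sum(p for (x, z, y), p in P.items()
               if all({"x": x, "z": z, "y": y}[k] == v for k, v in fixed.items()))

def cond(target, given):
    """Conditional probability P(target | given); both arguments are dicts of coordinates."""
    return marg(**target, **given) / marg(**given)

def backdoor_Pz_y(y, z):
    """Eq. (11): P_z(y) = sum_x P(y | x, z) P(x), adjusting for X on the back-door path."""
    return sum(cond({"y": y}, {"x": x, "z": z}) * marg(x=x) for x in (0, 1))

def frontdoor_Px_y(y, x):
    """Standard front-door adjustment for the graph of Figure 1(c):
    P_x(y) = sum_z P(z | x) * sum_x' P(y | x', z) P(x')."""
    return sum(cond({"z": z}, {"x": x}) *
               sum(cond({"y": y}, {"x": xp, "z": z}) * marg(x=xp) for xp in (0, 1))
               for z in (0, 1))

print(backdoor_Pz_y(1, 1))    # effect of setting Z = 1 on Y, via Eq. (11)
print(frontdoor_Px_y(1, 1))   # effect of setting X = 1 on Y, via the front-door formula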
There is a simple yet powerful graphical criterion for identifying the causal effects
of a singleton. For any set S, let An(S) denote the union of S and the set of ancestors
of the variables in S. For any set C, let GC denote the subgraph of G composed
only of variables in C. Let a path composed entirely of bidirected edges be called a
bidirected path.
THEOREM 9. [Tian and Pearl, 2002a] The causal effect Px (s) of a variable X on
a set of variables S is identifiable if there is no bidirected path connecting X to any
of its children in GAn(S) .
In fact, for X and S being singletons, this criterion covers both back-door and
front-door criteria, and also the criterion in [Galles and Pearl, 1995].
These criteria are simple to use but are not necessary for identification. In the
next sections we present complete systematic procedures for identification.
The lemma says that for each c-component Si the causal effect Q[Si ] = Pv\si (si )
is identifiable. For example, in Figure 1(c), we have Px,y (z) = Q[{Z}] = P (z|x)
and Pz (x, y) = Q[{X, Y }] = P (y|x, z)P (x).
Lemma 10 can be generalized to the subgraphs of G as given in the following
lemma.
LEMMA 11 (Generalized C-component Decomposition). [Tian and Pearl, 2003]
Let H ⊆ V , and assume that H is partitioned into c-components H1 , . . . , Hl in the
subgraph GH . Then we have
(i) Q[H] decomposes as
Q[H] = Π_i Q[Hi].   (15)
(ii) Each Q[Hi] is computable from Q[H]. Let k be the number of variables in H, and let a topological order of the variables in H be Vm1 < · · · < Vmk in GH. Let H^(i) = {Vm1, . . . , Vmi} be the set of variables in H ordered before Vmi (including Vmi), i = 1, . . . , k, and H^(0) = ∅. Then each Q[Hj], j = 1, . . . , l, is given by
Q[Hj] = Π_{i | Vmi ∈ Hj} Q[H^(i)] / Q[H^(i−1)].   (16)
Lemma 11 says that if the causal effect Q[H] = Pv\h (h) is identifiable, then for
each c-component Hi of the subgraph GH , the causal effect Q[Hi ] = Pv\hi (hi ) is
identifiable.
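Because c-components carry the weight in these lemmas, it may help to see how they are computed. The short Python sketch below is an illustration of ours (the encoding of the graph as node and bidirected-edge lists is an assumption, not the chapter's notation): it partitions the observed variables into c-components, i.e., into the connected components induced by the bidirected (confounding) edges alone, and recovers the two c-components {X, Y} and {Z} used in the Figure 1(c) example above.

def c_components(nodes, bidirected):
    """Partition `nodes` into c-components: the connected components of the graph
    whose only edges are the bidirected (latent-confounder) edges."""
    neighbours = {v: set() for v in nodes}
    for a, b in bidirected:
        neighbours[a].add(b)
        neighbours[b].add(a)
    components, seen = [], set()
    for v in nodes:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            w = stack.pop()
            if w not in comp:
                comp.add(w)
                stack.extend(neighbours[w] - comp)
        components.append(comp)
        seen |= comp
    return components

# Figure 1(c) has directed edges X -> Z -> Y and one bidirected edge X <-> Y;
# only the bidirected edge matters for the c-component partition.
print(c_components(["X", "Z", "Y"], [("X", "Y")]))   # [{'X', 'Y'}, {'Z'}]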
Next, we show how to use the c-component decomposition to identify causal
effects.
Note that we always have Σ_c Q[C] = 1.
Next, we show how to use Lemmas 10–12 to identify the causal effect Pt (s) where
S and T are arbitrary (disjoint) subsets of V . We have
Pt(s) = Σ_{(v\t)\s} Pt(v \ t) = Σ_{(v\t)\s} Q[V \ T].   (19)
Algorithm Identify(C, T, Q)
INPUT: C ⊆ T ⊆ V , Q = Q[T ]. GT and GC are both composed of one single
c-component.
OUTPUT: Expression for Q[C] in terms of Q or FAIL.
Let A = An(C)_{GT}.
• IF A = C, output Q[C] = Σ_{t\c} Q.
• IF A = T , output FAIL.
• IF C ⊂ A ⊂ T
Algorithm ID(s, t)
INPUT: two disjoint sets S, T ⊂ V .
OUTPUT: the expression for Pt (s) or FAIL.
Phase-1:
Phase-2:
For each set Di such that Di ⊆ Sj :
Compute Q[Di ] from Q[Sj ] by calling Identify(Di , Sj , Q[Sj ]) in Figure 2. If the
function returns FAIL, then stop and output FAIL.
Phase-3: Output Pt(s) = Σ_{d\s} Π_i Q[Di].
conditional causal effects are written as Px(y|z), and defined just as regular conditional distributions:
Px(y|z) = Px(y, z) / Px(z).
Complete closed form algorithms for identifying effects of this type have been
developed. One approach [Tian, 2004] generalizes the algorithm for identifying
unconditional causal effects Px (y) found in Section 4. There is, however, an easier
approach which works.
The idea is to reduce the expression Px(y|z), which we don't know how to handle, to something like Px′(y′), which we do know how to handle via the algorithm already presented. This reduction would have to find a way to get rid of the variables Z in the
conditional effect expression.
Ridding ourselves of some variables in Z can be accomplished via rule 2 of do-
calculus. Recall that applying rule 2 to an expression allows us to replace condi-
tioning on some variable set W ⊆ Z by fixing W instead. Rule 2 states that this
is possible in the expression Px (y|z) whenever W contains no back-door paths to
Y conditioned on the remaining variables in Z and X (that is X ∪ Z \ W ), in the
graph where all incoming arrows to X have been cut.
It’s not difficult to show the following uniqueness lemma.
LEMMA 14. [Shpitser and Pearl, 2006a] For every conditional effect Px (y|z) there
exists a unique maximal W ⊆ Z such that Px (y|z) is equal to Px,w (y|z \w) according
to rule 2 of do-calculus.
Lemma 14 states that we only need to apply rule 2 once to rid ourselves of as
many conditioned variables as possible in the effect of interest. However, even after
this is done, we may be left with some variables in Z \ W past the conditioning
bar in our effect expression. If we insist on using unconditional effect identification,
we may try to identify the joint distribution Px,w (y, z \ w) to obtain an expression
α, and obtain the conditional distribution Px,w(y|z \ w) by taking α / Σ_y α. But what
if Px,w (y, z \ w) is not identifiable? Are there cases where Px,w (y, z \ w) is not
identifiable, but Px,w (y|z \ w) is? Fortunately, it turns out the answer is no.
LEMMA 15. [Shpitser and Pearl, 2006a] Let Px (y|z) be a conditional effect of inter-
est, and W ⊆ Z the unique maximal set such that Px (y|z) is equal to Px,w (y|z \ w).
Then Px (y|z) is identifiable if and only if Px,w (y, z \ w) is identifiable.
Lemma 15 gives us a simple algorithm for identifying arbitrary conditional effects
by first reducing the problem into one of identifying an unconditional effect – and
then invoking the complete algorithm ID in Figure 3. This simple algorithm is
actually complete since the statement in Lemma 15 is if and only if. The algorithm
itself is shown in Fig. 4. The algorithm as shown picks elements of W one at a
time, although the set it picks as it iterates will equal the maximal set W due to
the following lemma.
Algorithm IDC(y, x, z)
INPUT: disjoint sets X, Y, Z ⊂ V .
OUTPUT: Expression for Px (y|z) in terms of P or FAIL.
LEMMA 16. Let Px (y|z) be a conditional effect of interest in a causal model induc-
ing G, and W ⊆ Z the unique maximal set such that Px (y|z) is equal to Px,w (y|z\w).
Then W = {W ′ |Px (y|z) = Px,w′ (y|z \ {w′ })}.
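Lemma 16 suggests a direct way to compute W in code: test each W′ ∈ Z separately using the d-separation form of rule 2. The sketch below is illustrative only. It expands every bidirected edge into an explicit latent parent so that the classical moral-ancestral-graph test for d-separation applies, and the condition it checks is the standard statement of rule 2 (Y independent of W′ given X and Z \ {W′} in the graph with arrows into X removed and arrows out of W′ removed); all names and the graph representation are our own assumptions.

from collections import defaultdict

def _d_separated(dir_edges, A, B, C):
    """A and B d-separated given C in the DAG with edges dir_edges
    (moralized ancestral graph test)."""
    parents = defaultdict(set)
    for u, v in dir_edges:
        parents[v].add(u)
    relevant = set(A) | set(B) | set(C)           # ancestral closure of A, B, C
    stack = list(relevant)
    while stack:
        v = stack.pop()
        for p in parents[v]:
            if p not in relevant:
                relevant.add(p)
                stack.append(p)
    und = defaultdict(set)                         # moral graph on the relevant nodes
    for v in relevant:
        ps = [p for p in parents[v] if p in relevant]
        for p in ps:
            und[v].add(p)
            und[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                und[ps[i]].add(ps[j])
                und[ps[j]].add(ps[i])
    start = set(A) - set(C)                        # drop conditioned nodes, test connectivity
    seen, stack = set(start), list(start)
    while stack:
        v = stack.pop()
        if v in B:
            return False
        for w in und[v]:
            if w not in set(C) and w not in seen:
                seen.add(w)
                stack.append(w)
    return True

def maximal_w(dir_edges, bi_edges, X, Y, Z):
    """Per Lemma 16: W is the set of W' in Z passing the rule 2 test individually."""
    latent = []                                    # make each bidirected edge an explicit latent parent
    for k, (a, b) in enumerate(bi_edges):
        u = ("U", k)
        latent += [(u, a), (u, b)]
    W = set()
    for w in Z:
        edges = [(u, v) for (u, v) in list(dir_edges) + latent
                 if v not in X and u != w]         # cut arrows into X and out of w
        if _d_separated(edges, set(Y), {w}, set(X) | (set(Z) - {w})):
            W.add(w)
    return W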
Completeness of the algorithm easily follows from the results we presented.
THEOREM 17. [Shpitser and Pearl, 2006a] The algorithm IDC is complete.
We note that the procedures ID and IDC served as a means to prove the com-
pleteness of do-calculus (Theorem 4). The proof [Shpitser and Pearl, 2006b] pro-
ceeds by reducing the steps in these procedures to sequences of do-calculus deriva-
tions.
where U is the set of unobserved variables in the model. In other words, a joint
counterfactual probability is obtained by adding up the probabilities of every setting
of unobserved variables in the model that results in the observed values of each
counterfactual event Yx in the expression. The query with the conflict we considered
above can then be expressed as a conditional distribution derived from such a joint,
specifically P(Yf(x) = y | X = x) = P(Yf(x) = y, X = x) / P(X = x). Queries of this form are well
known in the epidemiology literature as the effect of treatment on the treated (ETT)
[Heckman, 1992; Robins et al., 2006].
In fact, relative interventions aren’t quite the same as ETT since we don’t actually
know the original levels of X. To obtain effects of relative interventions, we simply
average over possible values of X, weighted by the prior distribution P (x) of X.
In other words, the relative causal effect P(y | do(f(X))) is equal to Σ_x P(Yf(x) = y | X = x) P(X = x).
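A tiny numerical illustration of this averaging, with made-up numbers for a binary X (the function name and the figures are ours, for illustration only):

def relative_effect(p_ett, p_x):
    """P(y | do(f(X))) = sum_x P(Y_{f(x)} = y | X = x) * P(X = x)."""
    return sum(p_ett[x] * p_x[x] for x in p_x)

# relative_effect({0: 0.40, 1: 0.70}, {0: 0.3, 1: 0.7})
#   == 0.40 * 0.3 + 0.70 * 0.7 == 0.61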
Since relative interventions reduce to ETT, and because ETT questions are of in-
dependent interest, identification of ETT is an important problem. If interventions
are performed over multiple variables, it turns out that identifying ETT questions is
almost as intricate as general counterfactual identification [Shpitser and Pearl, 2009;
Shpitser and Pearl, 2007]. However, in the case of a singleton intervention, there is a
formulation which bypasses most of the complexity of counterfactual identification.
This formulation is the subject of this section.
We want to approach identification of ETT in the same way we approached iden-
tification of causal effects in the previous sections, namely by providing a graphical
representation of conditional independences in joint distributions of interest, and
then expressing the identification algorithm in terms of this graphical representa-
tion. In the case of causal effects, we were given as input the causal diagram rep-
resenting the original, pre-intervention world, and we were asking questions about
the post-intervention world where arrows pointing to intervened variables were cut.
In the case of counterfactuals we are interested in joint distributions that span mul-
tiple worlds each with its own intervention. We want to construct a graph for these
distributions.
The intuition is that each interventional world is represented by a copy of the
original causal diagram, with the appropriate incoming arrows cut to represent the
changes in the causal structure due to the intervention. All worlds are assumed to
share history up to the moment of divergence due to differing interventions. This
is represented by all worlds sharing unobserved variables U . In the special case of
two interventional worlds the resulting graph is known as the twin network graph
[Balke and Pearl, 1994b; Balke and Pearl, 1994a].
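A minimal sketch of the two-world (twin network) construction, assuming the causal diagram is given as directed edges plus bidirected edges and using a starred copy of each variable name for the interventional world; the naming scheme and helper are illustrative, not the authors' notation.

def twin_network(dir_edges, bi_edges, X):
    """Directed edges of the two-world graph for an intervention on the set X:
    both worlds share an explicit latent parent per bidirected edge, and
    arrows into members of X are cut in the interventional copy."""
    star = lambda v: f"{v}*"
    edges = []
    for k, (a, b) in enumerate(bi_edges):          # shared unobserved parents U
        u = f"U{k}"
        edges += [(u, a), (u, b)]
        if a not in X:
            edges.append((u, star(a)))
        if b not in X:
            edges.append((u, star(b)))
    for u, v in dir_edges:
        edges.append((u, v))                       # "natural" world, unchanged
        if v not in X:                             # interventional world: no arrows into X
            edges.append((star(u), star(v)))
    return edges

# Example: twin_network([("X", "Y")], [("X", "Y")], {"X"}) yields
# U0 -> X, U0 -> Y, U0 -> Y*, X -> Y, X* -> Y*: the two worlds share U0,
# and the arrow into the intervened copy X* is cut.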
In the general case, a refinement of the resulting graph (to account for the possi-
bility of duplicate random variables) is known as the counterfactual graph [Shpitser
and Pearl, 2007]. The counterfactual graph represents conditional independences
in the corresponding counterfactual distribution via the d-separation criterion just
as the causal diagram represents conditional independences in the observed distri-
bution of the original world. The graph in Figure 5(b) is a counterfactual graph for
the query P (Yx = y|X = x′ ) obtained from the original causal diagram shown in
Figure 5(a).

Figure 5. (a) A causal diagram G. (b) The counterfactual graph for P(Yx = y|x′) in G. (c) The graph G′ from Theorem 18.
There exists a rather complicated general algorithm for identifying arbitrary
counterfactual distributions from either interventional or observational data [Sh-
pitser and Pearl, 2007; Shpitser and Pearl, 2008], based on ideas from the causal
effect identification algorithms given in the previous sections, only applied to the
counterfactual graph, rather than the causal diagram. It turns out that while iden-
tifying ETT of a single variable X can be represented as an identification problem
of ordinary causal effects, ETT of multiple variables is significantly more complex
[Shpitser and Pearl, 2009]. In this paper, we will concentrate on single variable
ETT with multiple outcome variables Y .
What makes single variable ETT P (Yx = y|X = x′ ) particularly simple is the
form of its counterfactual graph. For the case of all ETTs, this graph will have
variables from two worlds – the “natural” world where X is observed to have taken
the value x′ and the interventional world, where X is fixed to assume the value x.
There are two key points that simplify matters. The first is that no descendant
of X (including variables in Y ) is of interest in the “natural” world, since we are
only interested in the outcome Y in the interventional world. The second is that
all non-descendants of X behave the same in both worlds (since interventions do
not affect non-descendants). Thus, when constructing the counterfactual graph we
don’t need to make copies of non-descendants of X, and we can ignore descendants
of X in the “natural” world. But this means the only variable in the “natural”
world we will construct is a copy of X itself.
What this implies is that a problem of identifying the ETT P (Yx = y|X = x′ )
can be rephrased as a problem of identifying a certain conditional causal effect.
THEOREM 18. [Shpitser and Pearl, 2009] For a singleton variable X, and a set
Y , P (Yx = y|X = x′ ) is identifiable in G if and only if Px (y|w) is identifiable in G′ ,
where G′ is obtained from G by adding a new node W with the same set of parents
(both observed and unobserved) as X, and no children. Moreover, the estimand for
P(Yx = y|X = x′) is obtained from the estimand for Px(y|w) by substituting x′ for w.
If neither the back-door nor the front-door criteria hold, we must invoke general
causal effect identification algorithms from the previous sections. However, in the
case of ETT of a single variable, there is a simple complete graphical criterion which
works.
THEOREM 21. [Shpitser and Pearl, 2009] For a singleton variable X, and a set
Y , P (Yx = y|X = x′ ) is identifiable in G if and only if there is no bidirected path
from X to a child of X in Gan(y) . Moreover, if there is no such bidirected path,
the estimand for P(Yx = y|X = x′) is obtained by multiplying the estimand for
Σ_{an(y)\(y∪{x})} Px(an(y) \ x) (which exists by Theorem 9) by Q[S^x]′ / (P(x′) Σ_x Q[S^x]), where
S^x is the c-component in G containing X, and Q[S^x]′ is obtained from the expression
for Q[S^x] by replacing all occurrences of x with x′.
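The graphical criterion in Theorem 21 is easy to check mechanically: restrict attention to the ancestors of Y and ask whether some child of X is reachable from X along bidirected edges there. A sketch, using the same illustrative graph representation as above (X is a single node name, Y a collection of outcome nodes):

from collections import defaultdict

def ett_criterion_holds(dir_edges, bi_edges, X, Y):
    """True iff there is NO bidirected path from X to a child of X in G
    restricted to An(Y), i.e. the identifiability condition of Theorem 21."""
    parents, children = defaultdict(set), defaultdict(set)
    for u, v in dir_edges:
        parents[v].add(u)
        children[u].add(v)
    anc, stack = set(Y), list(Y)                   # ancestors of Y, including Y
    while stack:
        v = stack.pop()
        for p in parents[v]:
            if p not in anc:
                anc.add(p)
                stack.append(p)
    if X not in anc:
        return True                                # X does not appear in G_an(y)
    bi_adj = defaultdict(set)                      # bidirected edges within An(Y)
    for a, b in bi_edges:
        if a in anc and b in anc:
            bi_adj[a].add(b)
            bi_adj[b].add(a)
    reach, stack = {X}, [X]                        # bidirected reachability from X
    while stack:
        v = stack.pop()
        for w in bi_adj[v]:
            if w not in reach:
                reach.add(w)
                stack.append(w)
    return not (reach & children[X] & anc)

# Example: in the bow-arc graph X -> Y with X <-> Y, the child Y of X is
# reachable from X along a bidirected edge inside An(Y) = {X, Y}, so the
# criterion fails and P(Y_x = y | X = x') is not identifiable there.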
7 Conclusion
In this paper we described the state of the art in identification of causal effects and
related quantities in the framework of graphical causal models. We have shown
how this framework, developed over the period of two decades by Judea Pearl and
his collaborators, and presented in Pearl’s seminal work [Pearl, 2000], can sharpen
causal intuition into mathematical precision for a variety of causal problems faced
by scientists.
Acknowledgments: Jin Tian was partly supported by NSF grant IIS-0347846.
Ilya Shpitser was partly supported by AFOSR grant #F49620-01-1-0055, NSF grant
#IIS-0535223, MURI grant #N00014-00-1-0617, and NIH grant #R37AI032475.
References
A. Balke and J. Pearl. Counterfactual probabilities: Computational methods,
bounds, and applications. In R. Lopez de Mantaras and D. Poole, editors,
Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 1994.
J. Tian and J. Pearl. On the testable implications of causal models with hid-
den variables. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence (UAI), 2002.
J. Tian and J. Pearl. On the identification of causal effects. Technical Report
R-290-L, Department of Computer Science, University of California, Los An-
geles, 2003.
J. Tian. Identifying conditional causal effects. In Proceedings of the Conference
on Uncertainty in Artificial Intelligence (UAI), 2004.
Part IV: Reminiscences
Return to TOC
31
Questions and Answers
Nils J. Nilsson
Few people have contributed as much to artificial intelligence (AI) as has Judea
Pearl. Among his several hundred publications, several stand out as among the
historically most significant and influential in the theory and practice of AI. With
my few pages in this celebratory volume, I join many of his colleagues and former
students in showing our gratitude and respect for his inspiration and exemplary
career. He is a towering figure in our field.
Certainly one key to Judea’s many outstanding achievements (beyond dedication
and hard work) is his keen ability to ask the right questions and follow them up
with insightful intuitions and penetrating mathematical analyses. His overarching
question, it seems to me, is “how is it that humans can do so much with simplistic,
unreliable, and uncertain information?” The very name of his UCLA laboratory,
the Cognitive Systems Laboratory, seems to proclaim his goal: understanding and
automating the most cognitive of all systems, namely humans.
In this essay, I’ll focus on the questions and inspirations that motivated his
ground-breaking research in three major areas: heuristics, uncertain reasoning, and
causality. He has collected and synthesized his work on each of these topics in three
important books [Pearl 1984; Pearl 1988; Pearl 2000].
1 Heuristics
Pearl is explicit about what inspired his work on heuristics [Pearl 1984, p. xi]:
The study of heuristics draws its inspiration from the ever-amazing ob-
servation of how much people can accomplish with that simplistic, un-
reliable information source known as intuition. We drive our cars with
hardly any thought of how they function and only a vague mental pic-
ture of the road conditions ahead. We write complex computer programs
while attending to only a fraction of the possibilities and interactions
that may take place in the actual execution of these programs. Even
more surprisingly, we maneuver our way successfully in intricate social
situations having only a guesswork expectation of the behavior of other
persons around and even less certainty of their expectations of us.
Pearl defines heuristics as “criteria, methods, or principles for deciding which among several alternative courses of action promises to be the most effective in order to achieve some goal.” “For example,” he writes, “a popular
method for choosing [a] ripe cantaloupe involves pressing the spot on the candidate
cantaloupe where it was attached to the plant, and then smelling the spot. If the
spot smells like the inside of a cantaloupe, it is most probably ripe [Pearl 1984, p.
3].”
Although heuristics, in several forms, were used in AI before Pearl’s book on the
subject, no one had analyzed them as profitably and in as much detail as did Pearl.
Besides focusing on several heuristic search procedures, including A*, his book
beneficially tackles the question of how heuristics can be discovered. He proposes
a method: consult “simplified models of the problem domain” particularly those
“generated by removing constraints which forbid or penalize certain moves in the
original problem [Pearl 1984, p. 115].”
2 Uncertain Reasoning
Pearl was puzzled by the contrast between, on the one hand, the ease with which hu-
mans reason and make inferences based on uncertain information and, on the other
hand, the computational difficulties of duplicating those abilities using probability
calculations. Again the question, “How do humans reason so effectively with un-
certain information?” He was encouraged in his search for answers by the following
observations [Pearl 1993]:
Some ideas about how to proceed came to him in the late 1970s after reading a
paper on reading comprehension by David Rumelhart [Rumelhart 1976]. In Pearl’s
words [Pearl 1988, p. 50]:
Pearl’s key insight was that beliefs about propositions and other quantities could
often be regarded as “direct causes” of other beliefs and that these causal linkages
could be used to construct the graphical structures he was interested in. Most
importantly, this method of constructing them would automatically encode the key
conditional independence assumptions among probabilities which he regarded as so
important for simplifying probabilistic reasoning.
Out of these insights, and after much hard work by Pearl and others, we get one
of the most important sets of inventions in all of AI – Bayesian networks and their
progeny.
3 Causality
Pearl’s work on causality was inspired by his notion that beliefs could be regarded as
causes of other beliefs. He came to regard “causal relationships [as] the fundamental
building blocks both of physical reality and of human understanding of that reality”
and that “probabilistic relationships [were] but the surface phenomena of the causal
machinery that underlies and propels our understanding of the world.” [Pearl 2000,
p. xiii]
In a Web page describing the genesis of his ideas about causality, Pearl writes
[Pearl 2000]:
I got my first hint of the dark world of causality during my junior year
of high school.
My science teacher, Dr. Feuchtwanger, introduced us to the study of
logic by discussing the 19th century finding that more people died from
smallpox inoculations than from smallpox itself. Some people used this
information to argue that inoculation was harmful when, in fact, the
data proved the opposite, that inoculation was saving lives by eradicat-
ing smallpox.
“And here is where logic comes in,” concluded Dr. Feuchtwanger, “To
protect us from cause-effect fallacies of this sort.” We were all enchanted
by the marvels of logic, even though Dr. Feuchtwanger never actually
showed us how logic protects us from such fallacies.
References
Pearl, J. (1984). Heuristics: Intelligent Search Strategies for Computer Problem
Solving, Reading, MA: Addison-Wesley Publishing Company.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Infer-
ence, San Francisco: Morgan Kaufmann Publishers.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference, New York: Cam-
bridge University Press (second edition, 2009).
Pearl, J. (1993). Belief networks revisited. Artificial Intelligence 59, 49–56.
Rumelhart, D. (1976). Toward an interactive model of reading. Tech. Rept.
#CHIP-56. University of California at San Diego, La Jolla, CA.
Pearl, J. (2000). http://bayes.cs.ucla.edu/BOOK-2K/why.html.
Daniel Pearl Foundation. http://www.danielpearl.org/.
Return to TOC
32
Fond Memories from an Old Student
Edward T. Purcell
I was very lucky to have been Professor Judea Pearl’s first graduate student
advisee in the UCLA Computer Science Department. Now I am further honored to
be invited to contribute – in distinguished company – some fond memories of those
early days studying under Professor Pearl.
In January 1972, after completing the core coursework for the M.S. degree, I took
my first class in Artificial Intelligence from Professor Pearl. Thirty-eight calendar
years seems like cyber centuries ago, such has been the incredible pace of growth of
computer technologies and Computer Science and AI as academic disciplines.
The ARPAnet maps posted on the Boelter Hall corridor walls only showed a few
dozen nodes, and AI was still considered an “ad hoc” major field of study, requiring
additional administrative paperwork of prospective students. (Some jested, unfairly,
this was because AI was one step ahead of AH — ad hoc.)
The UCLA Computer Science Department had become a separate Department
in the School of Engineering only two and a half years earlier, in the Fall of 1969,
at the same time it became the birthplace of the Internet with the deployment of
the first ARPAnet Interface Message Processor node in room 3420 of Boelter Hall.
The computers available were “big and blue,” IBM S/360 and S/370 mainframes
of the Campus Computing Network, located on the fourth floor of the Mathemat-
ical Sciences Building, access tightly controlled. Some campus laboratories were
fortunate to have their own DEC PDP minicomputers.
Programs were written in languages like Assembly Language, Fortran, APL,
PL/1, and Pascal, delimited by Job Control Language commands. Programs were
communicated via decks of punched cards fed to card readers at the Campus Com-
puting Network facility. A few hours later, the user could examine the program’s
output on print-out paper. LISP was not available at the Campus Computing Net-
work. Time-sharing terminals and computers were just beginning to introduce a
radical change in human-computer interaction: on screen programming, both input
and output.
Professor Pearl’s first “Introduction to AI” course was based on Nils Nilsson’s
Problem-Solving Methods in AI, a classic 1971 textbook focusing on the then two
core (definitely non-ad-hoc) problem-solving methodologies in AI: search and logic.
(As with the spectacular growth of computer technology, it is wondrous to regard
how much Judea’s research has extended and fortified these foundations of AI.)
Supplemental study material included Edward Feigenbaum’s 1963 compilation of
articles on early AI systems, Computers and Thought, and a 1965 book by Nils
Nilsson, Learning Machines.
In class I was immediately impressed and enchanted by Judea’s knowledge, in-
telligence, brilliance, warmth, and humor. His teaching style was engaging, interactive,
informative and fun. My interest in AI, dating back to pre-Computer Science un-
dergraduate days, was much stimulated.
After enjoying this first AI class, I asked Professor Pearl if he would serve as my
M.S. Advisor, and was very happy when he agreed.
Other textbooks Professor Pearl used in subsequent AI classes and seminars in-
cluded Howard Raiffa’s 1968 Decision Analysis: Introductory Lectures on Choices
under Uncertainty, Duncan Luce and Howard Raiffa’s 1957 Games and Decisions,
George Polya’s How to Solve It, and the challenging 1971 three-volume Founda-
tions of Measurement, by David Krantz, Duncan Luce, Patrick Suppes and Amos
Tversky. The subtitles and chapter headings in this three-volume opus hint at
Professor Pearl’s future research on Bayesian networks: Volume I: Additive and
Polynomial Representations; Volume II: Geometrical, Threshold, and Probabilistic
Representations; and Volume III: Representation, Axiomatization, and Invariance.
It was always fun to visit Professor Pearl in his office. Along with the academic
consultation, Judea had time to talk about assorted extra-curricular topics, and
became like a family friend. One time, I found Judea strumming a guitar in his
office, singing a South American folk song, “Carnavalito,” which I happened to know
from my upbringing in South America as a U.S. diplomat’s son. I was happy to
help with the pronunciation of the song’s lyrics. It was nice to discover that we
shared a love of music, Judea more in tune with classical music, myself more a jazz
fan. Now and then I would see Judea and his wife Ruth at Royce Hall concerts, for
example, a recital by the classical guitarist Narciso Yepes.
Judea’s musical orientation (and humor) appeared in the title of a presentation a
few years later at a Decision Analysis workshop, with the title acronym “AIDA” standing for
Artificial Intelligence and Decision Analysis. The titles of other Pearl papers also
revealed wry humor: “How to Do with Probabilities What People Say You Can’t,”
and “Reverend Bayes on Inference Engines: a Distributed Hierarchical Approach.”
My M.S. thesis title was “A Game-Playing Procedure for a Game of Induction,”
and included results from a (PL/1) program for the induction game Patterns, a
pattern sampling and guessing game introduced by Martin Gardner in his November
1969 Scientific American “Mathematical Games” column. (After sending Martin
Gardner a copy of my M.S. thesis, I received a letter of appreciation from the game
wizard himself.)
At a small public demonstration of the Patterns game-playing program in early
1973, a distinguished elderly scholar was very interested and asked many questions.
After the presentation Professor Pearl asked if I knew who the inquisitive gentleman
was. “No,” I said. “That was Jacob Marschak,” said Judea. Whenever I attend a
Marschak Colloquium presentation at the UCLA Anderson School of Management,
Return to TOC
33
Reverend Bayes and inference engines
David Spiegelhalter
As a bonus, the Danish group finally introduced me to Pearl [1982] and Kim and
Pearl [1983]. These came as a shock: looking beneath the poor typography revealed
fundamental and beautiful ideas on local computation that made me doubt we could con-
tribute more. But Judea was working solely with directed graphs, and we felt the connec-
tion with undirected graphs was worth pursuing in the search for a general algorithm for
probability propagation in arbitrary graphs.
I wrote to Judea who replied in a typically enthusiastic and encouraging way, and so
at a 1985 workshop at Bell Labs I was able to try and put together his work with our
current focus on triangulated graphs, clique separations, potential representations and so
on [Spiegelhalter, 1986]. Then in July 1986 we finally met in Paris at the conference
mentioned at the start of this article, where Judea was introducing the audience to d-
separation. I have mentioned that I was nervous, but Judea was as embracing as ever.
We ended up in a pavement café in the Latin quarter, with Judea drawing graphs on the
paper napkin and loudly claiming that anyone could see that observations on a particular
node rendered two others independent – grabbing a passer-by, Judea demanded to know
whether this unfortunate Frenchman could recognise this obvious property, but the poor
innocent man just muttered something and walked briskly away, pleased to have escaped
these lunatics.
We continued to meet at conferences as he developed his propagation techniques
based on directed graphs [Pearl, 1986] and we published our algorithm based on embed-
ding the directed graph in a triangulated undirected graph that could be represented as a
tree of cliques [Lauritzen and Spiegelhalter, 1988]. We even jointly presented a tutorial
on probabilistic reasoning at the 1989 IJCAI meeting in Detroit, which I particularly
remember as my bus got stuck in traffic and I was late arriving, but Judea had just carried
on, extemporising from a massive pile of overhead slides from which he would appar-
ently draw specimens at random.
Then I started on MCMC on graphical models, and he began on causality, which was
too difficult for me. But I look back on that time in the mid 1980s as perhaps the most
exciting and creative period of my working life, continually engaged in a certain amount
of friendly rivalry with Judea, who always responded with characteristic generosity of
spirit.
References
Darroch, J. N., Lauritzen, S. L. and Speed, T. P. (1980) Markov fields and log-linear
models for contingency tables. Ann. Statist., 8, 522-539.
Kim, J. H. and Pearl, J. (1983) A computational model for causal and diagnostic rea-
soning in inference systems. In Proc. 8th International Joint Conference on Artifi-
cial Intelligence, Karlsruhe, pp. 190-193.
Lauritzen, S. L. and Spiegelhalter, D. J. (1988) Local Computations with Probabili-
ties on Graphical Structures and Their Application to Expert Systems. Journal of
the Royal Statistical Society. Series B (Methodological), 50, 157-224.
Return to TOC
34
Hector Geffner
philosophers like Hume, Newton, and Leibniz felt centuries ago, although I doubt
that they were as much fun to be with.
In the late 80s, Judea had a small group of students, and we all used to meet
weekly for the seminars. Judea got along well with everyone, and had a lot of
patience, in particular with me, who was a mix of rebel and dilettante, and couldn’t
get my research focused as Judea expected (and much less on the topics he was
interested in, even if he was paying my research assistantship!). I remember telling
him during the first couple of years that I didn’t feel I was ready for research and
preferred to learn more AI first. His answer was characteristic: “you do research
now, you learn later — after your PhD”. I told him also that I wanted to do
something closer to the mainstream, something closer to Schankian AI for example,
then in fashion. Judea wouldn’t get offended at all. He would answer calmly,
“We will get there eventually”, and he certainly meant it. Judea was probably a bit
frustrated with me, but he never showed it; quite the opposite, he was sympathetic
to my explorations, gave me full confidence and support, and eventually let me
do my thesis in the area of non-monotonic reasoning using ideas from probability
theory, something that actually attracted his interest at the time.
Since Judea was not an expert in this area (although, unsurprisingly, he quickly
became one), I didn’t get much technical guidance from him in my specific disser-
tation research. By that time, however, I had learned from him something much
more important: I learned how science is done, and the passion and attitude that
go into it. Well, maybe I didn’t learn this at all; rather, he managed to infect
me with the ‘virus’; in seminars, in conversations, by watching him work and ask
questions, by osmosis. If so, by now, I’m a proud and grateful carrier. In any case,
I have been extremely privileged and fortunate to have had the chance to benefit
from Judea’s generosity, passion, and wisdom, and from his example in both science
and life. I know I wouldn’t be the same person if I hadn’t met him.
Return to TOC
35
Sticking With the Crowd of Four
Rina Dechter
I joined Judea’s lab at UCLA at about the same time that Hector did, and his
words echo my experience and impressions so very well. In particular, I know I
wouldn’t be the same person, scientist, and educator if I hadn’t met Judea.
Interestingly, when I started this journey I was working in industry (with a
company named Perceptronics). We had just come to the U.S. then, my husband
Avi started his Ph.D. studies, and I was the breadwinner in our family. When I
discussed my plans to go back to school for a PhD, I was given a warning by three
former students of Judea who worked in that company (Chrolotte, Saleh, and Leal).
They all said that working with Judea was fun, but not practical. “If you want a
really good and lucrative career,” they said, “you should work with Len Kleinrock.”
This was precisely what I did. I was a student of Kleinrock for three years (and
even wrote a paper with him), and took AI only as a minor. During my 3rd year,
I decided to ignore practical considerations and follow my interests. I switched to
working with Judea.
At that time, Judea was giving talks about games and heuristic search to whoever
was willing to listen. I remember one talk that he gave at UCLA where the audience
consisted of me, Avi, and two professors from the math department. Judea spoke
enthusiastically just like he was speaking in front of the Hollywood Bowl. Even the
two math professors were mesmerized.
Kleinrock was a star already, and his students were getting lucrative positions
in Internet companies. I congratulate myself for sticking with the crowd of four,
fascinated by how machines can generate their own heuristics. Who could tell that
those modest seminars would eventually give birth to the theories of heuristics,
Bayesian networks, and causal reasoning?
Judea once told me that when he faces a really hard decision, a crossroad, he
asks himself “What would Rabbi Akiva do?”. Today, when I face a hard decision,
I ask “What would Judea do?”.
Thanks Judea for being such a wonderful (though quite a challenging) role model!