Introduction to Statistical Relational Learning

Adaptive Computation and Machine Learning


Thomas Dietterich, Editor
Christopher M. Bishop, David Heckerman, Michael I. Jordan, and Michael Kearns, Associate
Editors
Bioinformatics: The Machine Learning Approach
Pierre Baldi and Søren Brunak, 1998
Reinforcement Learning: An Introduction
Richard S. Sutton and Andrew G. Barto, 1998
Graphical Models for Machine Learning and Digital Communication
Brendan J. Frey, 1998
Learning in Graphical Models
Michael I. Jordan, ed., 1998
Causation, Prediction, and Search, 2nd Edition
Peter Spirtes, Clark Glymour, and Richard Scheines, 2001
Principles of Data Mining
David Hand, Heikki Mannila, and Padhraic Smyth, 2001
Bioinformatics: The Machine Learning Approach, 2nd Edition
Pierre Baldi and Søren Brunak, 2001
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
Bernhard Schölkopf and Alexander J. Smola, 2001
Learning Kernel Classifiers: Theory and Algorithms
Ralf Herbrich, 2001
Introduction to Machine Learning
Ethem Alpaydin, 2004
Gaussian Processes for Machine Learning
Carl Edward Rasmussen and Christopher K. I. Williams, 2005
Semi-Supervised Learning
Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, eds., 2006
The Minimum Description Length Principle
Peter D. Grünwald, 2007
Introduction to Statistical Relational Learning
Lise Getoor and Ben Taskar, eds., 2007

Introduction to Statistical Relational Learning

edited by
Lise Getoor
Ben Taskar

The MIT Press


Cambridge, Massachusetts
London, England

© 2007 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic
or mechanical means (including photocopying, recording, or information storage and retrieval)
without permission in writing from the publisher.

Typeset by the authors using LaTeX 2ε


Printed and bound in the United States of America

Library of Congress Cataloging-in-Publication Data


Introduction to statistical relational learning / edited by Lise Getoor, Ben Taskar.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-262-07288-5 (hardcover : alk. paper)
1. Relational databases. 2. Machine learning--Statistical methods. 3. Computer algorithms. I.
Getoor, Lise. II. Taskar, Ben.
QA76.9.D3I68 2007
006.3'1--dc22
2007000951
10 9 8 7 6 5 4 3 2 1

Contents

Series Foreword
Preface

1 Introduction
Lise Getoor, Ben Taskar
1.1 Overview
1.2 Brief History of Relational Learning
1.3 Emerging Trends
1.4 Statistical Relational Learning
1.5 Chapter Map
1.6 Outlook

2 Graphical Models in a Nutshell
Daphne Koller, Nir Friedman, Lise Getoor, Ben Taskar
2.1 Introduction
2.2 Representation
2.3 Inference
2.4 Learning
2.5 Conclusion

3 Inductive Logic Programming in a Nutshell
Sašo Džeroski
3.1 Introduction
3.2 Logic Programming
3.3 Inductive Logic Programming: Settings and Approaches
3.4 Relational Classification Rules
3.5 Relational Decision Trees
3.6 Relational Association Rules
3.7 Relational Distance-Based Methods
3.8 Recent Trends in ILP and RDM

4 An Introduction to Conditional Random Fields for Relational Learning
Charles Sutton, Andrew McCallum
4.1 Introduction
4.2 Graphical Models
4.3 Linear-Chain Conditional Random Fields
4.4 CRFs in General
4.5 Skip-Chain CRFs
4.6 Conclusion

5 Probabilistic Relational Models
Lise Getoor, Nir Friedman, Daphne Koller, Avi Pfeffer, Ben Taskar
5.1 Introduction
5.2 PRM Representation
5.3 The Difference between PRMs and Bayesian Networks
5.4 PRMs with Structural Uncertainty
5.5 Probabilistic Model of Link Structure
5.6 PRMs with Class Hierarchies
5.7 Inference in PRMs
5.8 Learning
5.9 Conclusion

6 Relational Markov Networks
Ben Taskar, Pieter Abbeel, Ming-Fai Wong, Daphne Koller
6.1 Introduction
6.2 Relational Classification and Link Prediction
6.3 Graph Structure and Subgraph Templates
6.4 Undirected Models for Classification
6.5 Learning the Models
6.6 Experimental Results
6.7 Discussion and Conclusions

7 Probabilistic Entity-Relationship Models, PRMs, and Plate Models
David Heckerman, Chris Meek, Daphne Koller
7.1 Introduction
7.2 Background: Graphical Models
7.3 The Basic Ideas
7.4 Probabilistic Entity-Relationship Models
7.5 Plate Models
7.6 Probabilistic Relational Models
7.7 Technical Details
7.8 Extensions and Future Work

8 Relational Dependency Networks
Jennifer Neville, David Jensen
8.1 Introduction
8.2 Dependency Networks
8.3 Relational Dependency Networks
8.4 Experiments
8.5 Related Work
8.6 Discussion and Future Work

9 Logic-based Formalisms for Statistical Relational Learning
James Cussens
9.1 Introduction
9.2 Representation
9.3 Inference
9.4 Learning
9.5 Conclusion

10 Bayesian Logic Programming: Theory and Tool
Kristian Kersting, Luc De Raedt
10.1 Introduction
10.2 On Bayesian Networks and Logic Programs
10.3 Bayesian Logic Programs
10.4 Extensions of the Basic Framework
10.5 Learning Bayesian Logic Programs
10.6 Balios - The Engine for Basic Logic Programs
10.7 Related Work
10.8 Conclusions

11 Stochastic Logic Programs: A Tutorial
Stephen Muggleton, Niels Pahlavi
11.1 Introduction
11.2 Mixing Deterministic and Probabilistic Choice
11.3 Stochastic Grammars
11.4 Stochastic Logic Programs
11.5 Learning Techniques
11.6 Conclusion

12 Markov Logic: A Unifying Framework for Statistical Relational Learning
Pedro Domingos, Matthew Richardson
12.1 The Need for a Unifying Framework
12.2 Markov Networks
12.3 First-Order Logic
12.4 Markov Logic
12.5 SRL Approaches
12.6 SRL Tasks
12.7 Inference
12.8 Learning
12.9 Experiments
12.10 Conclusion

13 BLOG: Probabilistic Models with Unknown Objects
Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, Andrey Kolobov
13.1 Introduction
13.2 Examples
13.3 Syntax and Semantics: Possible Worlds
13.4 Syntax and Semantics: Probabilities
13.5 Evidence and Queries
13.6 Inference
13.7 Related Work
13.8 Conclusions and Future Work

14 The Design and Implementation of IBAL: A General-Purpose Probabilistic Language
Avi Pfeffer
14.1 Introduction
14.2 The IBAL Language
14.3 Examples
14.4 Semantics
14.5 Desiderata for Inference
14.6 Related Approaches
14.7 Inference
14.8 Lessons Learned and Conclusion

15 Lifted First-Order Probabilistic Inference
Rodrigo de Salvo Braz, Eyal Amir, Dan Roth
15.1 Introduction
15.2 Language, Semantics and Inference problem
15.3 The First-Order Variable Elimination (FOVE) algorithm
15.4 An experiment
15.5 Auxiliary operations
15.6 Applicability of lifted inference
15.7 Future Directions
15.8 Conclusion

16 Feature Generation and Selection in Multi-Relational Statistical Learning
Alexandrin Popescul, Lyle H. Ungar
16.1 Introduction
16.2 Detailed Methodology
16.3 Experimental Evaluation
16.4 Related Work and Discussion
16.5 Conclusion

17 Learning a New View of a Database: With an Application in Mammography
Jesse Davis, Elizabeth Burnside, Inês Dutra, David Page, Raghu Ramakrishnan, Jude Shavlik, Vítor Santos Costa
17.1 Introduction
17.2 View Learning for Mammography
17.3 Naive View Learning Framework
17.4 Initial Experiments
17.5 Integrated View Learning Framework
17.6 Further Experiments and Results
17.7 Related Work
17.8 Conclusions and Future Work

18 Reinforcement Learning in Relational Domains: A Policy-Language Approach
Alan Fern, SungWook Yoon, Robert Givan
18.1 Introduction
18.2 Problem Setup
18.3 Approximate Policy Iteration with a Policy Language Bias
18.4 API for Relational Planning
18.5 Bootstrapping
18.6 Relational Planning Experiments
18.7 Related Work
18.8 Summary and Future Work

19 Statistical Relational Learning for Natural Language Information Extraction
Razvan C. Bunescu, Raymond J. Mooney
19.1 Introduction
19.2 Background on Natural Language Processing
19.3 Information Extraction
19.4 Collective Information Extraction with RMNs
19.5 Future Research on SRL for NLP
19.6 Conclusions

20 Global Inference for Entity and Relation Identification via a Linear Programming Formulation
Dan Roth, Wen-tau Yih
20.1 Introduction
20.2 The Relational Inference Problem
20.3 Integer Linear Programming Inference
20.4 Solving Integer Linear Programming
20.5 Experiments
20.6 Comparison with Other Inference Methods
20.7 Conclusion

Contributors

Index

Series Foreword

The goal of building systems that can adapt to their environments and learn from
their experience has attracted researchers from many fields, including computer
science, engineering, mathematics, physics, neuroscience, and cognitive science.
Out of this research has come a wide variety of learning techniques that have
the potential to transform many scientific and industrial fields. Recently, several
research communities have converged on a common set of issues surrounding
supervised, unsupervised, and reinforcement learning problems. The MIT Press series
on Adaptive Computation and Machine Learning seeks to unify the many diverse
strands of machine learning research and to foster high-quality research and
innovative applications.
Thomas Dietterich

Preface

The goal of this book is to bring together important research at the intersection
of statistical, logical and relational learning. The material in the collection is
aimed at graduate students and researchers in machine learning and artificial
intelligence. While by no means exhaustive, the articles introduce a wide variety of
recent approaches to combining expressive knowledge representation and statistical
learning.
The idea for this book emerged from a series of successful workshops addressing
these issues:
Learning Statistical Models from Relational Data (SRL2000) at the National Conference on Artificial Intelligence (AAAI-2000), organized by Lise Getoor and David Jensen.
Learning Statistical Models from Relational Data (SRL2003) at the International Joint Conference on Artificial Intelligence (IJCAI-2003), organized by Lise Getoor and David Jensen.
Statistical Relational Learning and its Connections to Other Fields (SRL2004) at the International Conference on Machine Learning (ICML2004), organized by Tom Dietterich, Lise Getoor and Kevin Murphy.
Probabilistic, Logical and Relational Learning - Towards a Synthesis, Dagstuhl Seminar 2005, organized by Luc De Raedt, Thomas Dietterich, Lise Getoor and Stephen Muggleton.
Open Problems in Statistical Relational Learning (SRL2006) at the International Conference on Machine Learning (ICML2006), organized by Alan Fern, Lise Getoor, and Brian Milch.
We would like to thank all of the participants at these workshops for their
intellectual contributions and also for creating a warm and welcoming research
community coming together from several distinct research areas.
In addition, there have been several other closely related workshops, including
the series of workshops on Multi-Relational Data Mining held in conjunction with
the Knowledge Discovery and Data Mining Conference beginning in 2002 organized
by Sašo Džeroski, Luc De Raedt, Stefan Wrobel, and Hendrik Blockeel.

This volume contains invited contributions from leading researchers in this new
research area. Each chapter has been reviewed by at least two anonymous reviewers.
We are very grateful to all the authors for their high quality contributions and to
all the reviewers for helping to clarify and improve this work.
In addition to thanking the workshop participants, book contributors and reviewers,
we would like to thank our advisors: Daphne Koller, our PhD advisor;
Stuart Russell, Lise Getoor's MS advisor; and Michael Jordan, Ben Taskar's
postdoctoral advisor. Lise Getoor would also like to thank David Jensen; besides being
one of the people responsible for the name "Statistical Relational Learning," David
has been a great mentor, workshop co-organizer and friend. We would also like
to thank Tom Dietterich, Pedro Domingos, and David Heckerman, who have been
very encouraging in developing this book. Luc De Raedt, Kristian Kersting, Stephen
Muggleton, Sašo Džeroski and Hendrik Blockeel have been especially encouraging
members from the inductive logic programming and relational learning community.
Lise would also like to thank her inquisitive graduate students, members of the
LINQs group at the University of Maryland, College Park, for their participation
in this project. Finally, on a more personal note, Lise would like to thank Pete for
his unwavering support and Ben would like to thank Anat for being his rock.

1 Introduction

Lise Getoor and Ben Taskar

We outline the major themes, problems and approaches that define the subject of
the book: statistical relational learning. While the problems of statistical learning
and relational representation and reasoning have a fairly long history on their own
in Artificial Intelligence research, the synthesis of the approaches is currently a
burgeoning field. We briefly sketch the background and the recent developments
presented in the book.

1.1 Overview

The vast majority of statistical learning literature assumes the data is represented
by points in a high-dimensional space. For any particular isolated task, such as
learning to detect a face in an image or classify an email message as spam or not,
we can usually construct the relevant low-level features (e.g., pixels, filters, words,
URLs) and solve the problem using standard tools for the vector representation.
While extremely useful for development of elegant and general algorithms and
analysis, this abstraction hides the rich logical structure of the underlying data
that is crucial for solving more general and complex problems. We may want not
only to detect a face in an image but also to recognize that, for example, it is the face
of a tall woman who is spiking a volleyball or a little boy jumping into a puddle.
Or, in the case of email, we might want to detect that an email message is
not only not spam but is a request from our supervisor to meet tomorrow with
three colleagues or an invitation to the downstairs neighbor's birthday party next
Sunday. We are ultimately interested in not just answering an isolated yes/no
question, but in producing and manipulating structured representations of the data,
involving objects described by attributes and participating in relationships, actions,
and events. The challenge is to develop formalisms, models, and algorithms that
enable effective and robust reasoning about this type of object-relational structure
of the data.

Dealing with real data, like images and text, inevitably requires the ability to
handle the uncertainty that arises from noise and incomplete information (e.g.,
occlusions, misspellings). In relational problems, uncertainty arises on many levels.
Beyond uncertainty about the attributes of an object, there may be uncertainty
about an object's type, the number of objects, and the identity of an object (what
kind, which, and how many entities are depicted or written about), as well as
relationship membership, type, and number (which entities are related, how, and
how many times). Solving interesting relational learning tasks robustly requires
sophisticated treatment of uncertainty at these multiple levels of representation.
In this book, we present the growing body of work on a variety of statistical
models that target relational learning tasks. The goal of these representations is
to express probabilistic models in a compact and intuitive way that reflects the
relational structure of the domain and, ideally, supports efficient learning and
inference. The majority of these models are based on combinations of graphical
models, probabilistic grammars, and logical formulae.

1.2 Brief History of Relational Learning

Early work on machine learning often focused on learning deterministic logical
concepts. Methods were typically sensitive to noise and mostly applied to toy domains. One
of the earliest relational learning systems is Winston's arch learning system [49].
This online-style system was trained using a sequence of instances labeled as positive
and negative examples of arches. The system maintained a current hypothesis,
represented as a semantic network. When a new example was presented, the system
made a prediction using the current hypothesis. If the prediction was correct, no
changes were made to the hypothesis. If it was incorrect, then the set of differences
between the current hypothesis and the example was identified. If the example
was a positive instance, the differences were used to generalize the concept; if the
example was a negative instance, they were used to specialize the concept. Following
this there were a number of more advanced relational learning systems [8, 18, 45],
but all used a similar logic-based representation for the concepts.
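
As a rough illustration of the online loop described above, the following Python sketch keeps a current hypothesis, predicts, and revises the hypothesis only when the prediction is wrong. It is not Winston's system: the semantic network is replaced by an assumed flat set of relational facts, and the names `Hypothesis`, `required`, and `forbidden`, as well as the toy arch examples, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """Toy concept: facts that must be present and facts that must be absent."""
    required: set = field(default_factory=set)
    forbidden: set = field(default_factory=set)
    initialized: bool = False

    def predict(self, example):
        # Positive iff every required fact holds and no forbidden fact holds.
        if not self.initialized:
            return False
        return self.required <= example and not (self.forbidden & example)

    def update(self, example, label):
        if not self.initialized:
            if label:  # the first positive example seeds the hypothesis
                self.required = set(example)
                self.initialized = True
            return
        if self.predict(example) == label:
            return  # correct prediction: leave the hypothesis unchanged
        if label:
            # Missed a positive: generalize by dropping conflicting constraints.
            self.required &= example
            self.forbidden -= example
        else:
            # Accepted a negative: specialize by forbidding the extra facts.
            self.forbidden |= example - self.required


base = {("supports", "left", "top"), ("supports", "right", "top"),
        ("left_of", "left", "right")}
arch_brick = base | {("shape", "top", "brick")}
near_miss = arch_brick | {("touches", "left", "right")}
arch_wedge = base | {("shape", "top", "wedge")}

h = Hypothesis()
for example, label in [(arch_brick, True), (near_miss, False), (arch_wedge, True)]:
    h.update(example, label)
```

In this toy run the near miss adds a forbidden fact (the pillars must not touch), and the wedge-topped arch drops the requirement that the top be a brick.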
This approach to machine learning (ML) fell out of vogue for many years
because of problems handling noise and large-scale data. During that time, the ML
community shifted attention to statistical methods that ignored relational aspects
of the data (e.g., neural networks, decision trees, and generalized linear models).
These methods led to major boosts in accuracy in many problems in low-level
vision and natural language processing [11, 28]. However, their focus was on the
propositional or attribute-value representation.
The major exception has been the inductive logic programming (ILP) community.
The ILP community has concentrated its efforts on learning (deterministic)
first-order rules from relational data [27, 30]. Initially the ILP community focused its
attention solely on the task of program synthesis from examples and background
knowledge. However, recent research has tackled the discovery of useful rules
from larger databases [9]. These rules are often used for prediction and may
have a probabilistic interpretation. The ILP community has had successes in
a number of application areas including discovery of 2D structural alerts for
mutagenicity/carcinogenicity [22], 3D pharmacophore discovery for drug design
[10], and analysis of chemical databases [7].

1.3 Emerging Trends

Recently, both the ILP community and the statistical ML community have begun
to incorporate aspects of the complementary technology. Many ILP researchers are
developing stochastic and probabilistic representations and algorithms [31, 21, 6]. In
more traditional ML circles, researchers who have in the past focused on attribute-value or propositional learning algorithms are exploring methods for incorporating
relational information [5, 32, 4]. It is our hope that this trend will continue, and
that the work presented in this book will provide a bridge connecting relational and
statistical learning.
Among the strong motivations for using a relational model is its ability to
model dependencies between related instances. Intuitively, we would like to use
our information about one object to help us reach conclusions about other, related
objects. For example, in web data, we should be able to propagate information
about the topic of a document to documents it has links to and documents that link
to it. These, in turn, would propagate information to yet other documents. Many
researchers have proposed a process along the lines of this relational influence
propagation idea [3, 44, 32]. Chakrabarti et al. [3] describe a relaxation labeling
algorithm that makes use of the neighboring link information. The algorithm begins
with the labeling given by a text-based classifier constructed from the training set. It
then uses the estimated class of neighboring documents to update the distribution of
the document being classified. The intuitions underlying these procedural systems
can be given declarative semantics using probabilistic graphical models [46, 15, 47].
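
The propagation idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of Chakrabarti et al.: it assumes that text-only class probabilities are already available for every document, treats links as undirected, and uses a hypothetical compatibility table `compat` that scores how well a document's class agrees with a neighbor's class. The function name `relaxation_labeling` and all parameter names are assumptions made for this sketch.

```python
def relaxation_labeling(text_probs, links, compat, n_iters=10):
    """Iteratively re-estimate class distributions from neighbors' estimates.

    text_probs: dict doc -> list of class probabilities from a text-only
                classifier (assumed given and strictly positive).
    links:      list of (src, dst) hyperlinks, treated as undirected here.
    compat:     compat[c][d] is an assumed affinity between a document of
                class c and a linked document of class d.
    """
    neighbors = {doc: set() for doc in text_probs}
    for src, dst in links:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    n_classes = len(compat)
    beliefs = {doc: list(p) for doc, p in text_probs.items()}
    for _ in range(n_iters):
        new_beliefs = {}
        for doc, prior in text_probs.items():
            scores = []
            for c in range(n_classes):
                score = prior[c]  # start from the text-based estimate
                for nb in neighbors[doc]:
                    # Expected compatibility with the neighbor's current estimate.
                    score *= sum(compat[c][d] * beliefs[nb][d]
                                 for d in range(n_classes))
                scores.append(score)
            total = sum(scores)
            new_beliefs[doc] = [s / total for s in scores]
        beliefs = new_beliefs  # synchronous update of all documents
    return beliefs


text_probs = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.8, 0.2]}
links = [("a", "b"), ("b", "c")]
compat = [[0.8, 0.2], [0.2, 0.8]]  # linked documents tend to share a class
beliefs = relaxation_labeling(text_probs, links, compat)
```

In this toy graph, document `b` is ambiguous from its text alone, but both of its neighbors lean toward class 0, so its updated distribution also leans toward class 0.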

1.4 Statistical Relational Learning

We refer to this emerging area of research as statistical relational learning (SRL).
SRL research attempts to represent, reason, and learn in domains with complex
relational and rich probabilistic structure. Other terms that have been used recently
include probabilistic logic learning and multi-relational data mining. Many of the
tasks known as structured prediction problems also overlap greatly with problems
addressed by SRL research.
The majority of proposed SRL systems can be distinguished along several dimensions.
The most common representation formalisms are based on either logic
(e.g., rule-based formalisms) or frame-based (e.g., object-oriented) formalisms.
The probabilistic semantics are mostly based on graphical models or stochastic
grammars; early SRL approaches were often defined in terms of directed graphical
models (e.g., Bayesian networks) whereas recently there has been a growing interest
in undirected models (e.g., Markov networks). The directed models can represent
complex generative models while the undirected models can represent non-causal
dependencies. Other alternatives, such as dependency networks [19] and mixed directed and undirected models,6 are also possible.
The logical interpretation for most SRL languages (e.g., probabilistic relational
models, Bayesian logic programs, relational Markov networks) is often in terms of
least Herbrand models and the probabilistic semantics is most often in terms of a
possible worlds semantics. Some of the early approaches, such as knowledge-based
model construction (KBMC) [48], relied on procedural semantics. There are other
possibilities, described in greater detail in the upcoming chapters.
The semantics of many of the SRL systems is given in terms of an unrolled or
ground graphical model. Thus, one approach to doing inference in these models
is to perform the appropriate probabilistic inference in the base-level model. One
simple KBMC-style optimization is to make use of the query in the construction of
the network. Rather than constructing the entire base-level model, the construction
may be made more efficient by constructing only the portion of the network required
to answer the query. But this does not exploit any of the inherent structure in the
probabilistic model. Pfeffer et al. [38] observe that in many cases the models can
be decomposed into loosely coupled systems, and show how the interfaces between
the components can be used to encapsulate inference within the components. This
allows the reuse and caching of inferences and can lead to significant improvements
in efficiency during inference. More general approaches, such as first-order variable
elimination [41, 1], combine variable elimination with unification and allow a lifted
inference to be performed (see chapter 15 for details).
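The query-directed construction described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of any particular KBMC system: it assumes the ground network is directed and is given as a hypothetical `parents` mapping, and it keeps only the query and evidence nodes together with their ancestors, which is the part of a directed model needed to answer the query.

```python
def relevant_nodes(parents, query, evidence):
    """Return the query and evidence nodes together with all of their ancestors."""
    needed = set()
    stack = list(query) + list(evidence)
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            # Walk upward: a node's distribution depends on its parents.
            stack.extend(parents.get(node, ()))
    return needed


# Hypothetical ground network; each node maps to the list of its parents.
parents = {
    "a": [],
    "b": [],
    "c": ["a", "b"],
    "d": ["c"],
    "e": ["b"],
}

# Only the ancestors of the query are constructed; "d" and "e" are never built.
needed = relevant_nodes(parents, query=["c"], evidence=[])
print(sorted(needed))
```

Adding evidence enlarges the constructed portion: with `evidence=["e"]` the node `e` would be kept as well.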
Not surprisingly, learning is a fundamental component in any SRL approach. The
power of the structured representation is the hierarchical nature of the statistical
models. The advantage of the hierarchical models, and what distinguishes them
from flat statistical models, is parameter sharing or parameter tying. Parameter
sharing occurs when potentially distinct parameters of the model are constrained
to be the same. A simple example occurs in a hidden Markov model: because of the
Markovian assumption, the parameters determining the next state are the same at
each time instance; hence we do not require distinct parameters indexed by specific
values of t, and simply have one set of parameters $\theta_{t+1|t}$.
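To make the effect of tying concrete, here is a minimal sketch under simplifying assumptions: the states are observed directly (a plain Markov chain rather than a full hidden Markov model), and the function name and the toy sequence are invented for illustration. Because one transition table is shared by every time step, the counts from all positions of a single sequence can be pooled to estimate it.

```python
from collections import Counter


def estimate_tied_transitions(sequence, states):
    """Estimate one shared transition table by pooling counts over all time steps."""
    counts = Counter(zip(sequence, sequence[1:]))
    table = {}
    for s in states:
        total = sum(counts[(s, nxt)] for nxt in states)
        # Relative frequency of each successor of s (zeros if s was never left).
        table[s] = {
            nxt: (counts[(s, nxt)] / total if total else 0.0) for nxt in states
        }
    return table


# A single observed sequence is enough, because every step reuses the same table.
sequence = ["a", "a", "b", "a", "b", "b", "a", "a"]
theta = estimate_tied_transitions(sequence, states=["a", "b"])
print(theta)
```

Without tying, each time step would have its own table and would be observed exactly once, which leaves nothing to average over.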
This parameter tying not only gives us a compact model for rich classes of
distributions but is also what enables robust parameter estimation to even be
feasible. Unlike traditional ML scenarios, where the learning system is given as
input a sequence of i.i.d. observations, the input to an SRL learning algorithm is
most often just a single, richly connected, instance. If there were no parameter
sharing, this instance would be of little use for performing statistical inference.
But, because the same parameters are used in multiple places in the model, we can
still extract meaningful statistics from the data to use in our statistical inference
procedures.
Model selection is a challenging SRL problem. Similar to work in propositional
graphical models, many approaches make use of some type of heuristic search
through the model space. Methods for scoring propositional graphical models have
been extended for SRL learning [12, 13]. The search can make use of certain biases
defined over the model space, such as allowing dependencies only among attributes
of related instances according to the entity relationship model or the use of binding
patterns to constrain clauses to consider adding to the probabilistic rules.
Certain common issues arise repeatedly, in different guises, in a number of the
SRL systems. One of the most common issues is feature construction and aggregation. The rich variety in structure combined with the need for a compact parameterization gives rise to the need to construct relational features or aggregates [12]
which capture the local neighborhood of a random variable. Because it is infeasible to explicitly define factors over all potential neighborhoods, aggregates provide
an intuitive way of describing the relational neighborhood. Common aggregates
include taking the mean or mode of some neighboring attribute, taking the min
or the max, or simply counting the number of neighbors. More complex, domain-specific aggregates are also possible. Aggregation has also been studied as a means
for propositionalizing a relational classification problem [25, 23, 26]. Within the SRL
community, Perlich and Provost [36, 37] have studied aggregation extensively and
Popescul and Ungar [42] have worked on statistical predicate invention.
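The common aggregates listed above are easy to state in code. The following is a small sketch, not taken from any of the cited systems; the function name and the example attribute values are hypothetical. It maps a variable-sized bag of neighbor values to a fixed-length feature record.

```python
from collections import Counter
from statistics import mean


def aggregate_neighbors(values):
    """Summarize a bag of neighboring attribute values with common aggregates."""
    if not values:
        # An object with no neighbors still gets a well-defined record.
        return {"count": 0, "mean": None, "mode": None, "min": None, "max": None}
    return {
        "count": len(values),
        "mean": mean(values),
        "mode": Counter(values).most_common(1)[0][0],
        "min": min(values),
        "max": max(values),
    }


# Hypothetical example: a numeric attribute of the objects linked to one object.
features = aggregate_neighbors([3, 5, 5, 10])
print(features)
```

Whatever the number of neighbors, the output has the same five fields, which is what allows a single compact parameterization to be shared across objects with different neighborhoods.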
Structural uncertainty is another common issue that researchers have begun
investigating. Many of the early SRL approaches consider the case where there is a
single logical interpretation, or relational skeleton, which defines the set of random
variables, and there is a probability distribution over the instantiations of the
random variables. Structural uncertainty supports uncertainty over the relational
interpretation. Koller and Pfeffer [24] introduced several forms, including number
uncertainty, where there is a distribution over the number of related objects. Getoor
et al. [16] studied learning models with structural uncertainty, and showed how
these representations could be supported by a probabilistic logic-based system [14].
Pasula and Russell [35] studied identity uncertainty, a form of structural uncertainty
which allows modeling uncertainty about the identity of a reference. Most of these
models rely on a closed world assumption to define the semantics for the models.
More recently, Milch et al. [29] have investigated the use of nonparametric models
which allow an infinite number of objects and support an open-world model (see
chapter 13 for details). Other recent flexible approaches include the infinite
relational models of Kemp et al. [20] and Xu et al. [50].

1.5

Chapter Map
The book begins with several introductory chapters providing tutorials for the material which many of the later chapters build upon. Chapter 2 is on graphical models
and covers the basics of representation, inference, and learning in both directed and
undirected models. Chapter 3 by Dzeroski describes ILP. ILP, unlike many other ML
approaches, has traditionally dealt with multi-relational data. The learned models
are typically described by sets of relational rules called logic programs, and the
methods can make use of logical background knowledge. Chapter 4 by Sutton and
McCallum covers conditional random fields (CRFs), a very popular class of models
for structured supervised learning. An advantage of CRFs is that the models are
optimized for predictive performance on only the subset of variables of interest.
The chapter provides a tutorial on training and inference in CRFs, with particular
attention given to the important special case of linear CRFs. The chapter concludes
with a discussion of applications to information extraction.
The next set of chapters describes several frame-based SRL approaches. Chapter 5 provides an introduction to probabilistic relational models (PRMs). PRMs are
directed graphical models which can capture dependencies among objects and uncertainty over the relational structure. In addition to describing the representation,
the chapter describes algorithms for inference and learning. Chapter 6 describes
relational Markov networks (RMNs), which are essentially CRFs lifted to the relational setting. A particularly relevant advantage of RMNs over PRMs is that
acyclicity requirements do not hinder modeling complex, non-causal correlations
concisely; however, as in the non-relational case, this comes at the price of more expensive parameter estimation. Another advantage of RMNs, like CRFs, is that they
are well suited to discriminative training. Algorithms for inference and learning are
given. Chapter 7, by Heckerman et al., describes a graphical language for probabilistic entity-relationship models (PERs). One of the contributions of this chapter
is its discussion of the relationship between PERs, PRMs, and plate models. Plate
models [2, 17] were introduced in the statistics community as a graphical representation for hierarchical models. They can represent the repeated, shared, or tied
parameters in a hierarchical graphical model. PERs synthesize these approaches.
The chapter describes a directed version of PERs, DAPERs, and gives a number of illustrative examples. Chapter 8, by Neville and Jensen, describes relational
dependency networks (RDNs). RDNs extend propositional dependency networks
to relational domains, and, like dependency networks, have some advantages over
directed graphical models and undirected models. This chapter describes the representation, inference, and learning algorithms and presents results on several data
sets.
The next four chapters describe logic-based formalisms for SRL. An introductory
chapter, chapter 9 by Cussens, surveys this area, describing work on some of the
early logic-based formalisms such as Poole's work on probabilistic Horn abduction
[39] and independent choice logic [40], Ngo and Haddawy's work on probabilistic
knowledge bases [34] and Sato's work on the PRISM system [43], and Ng and Subrahmanian's work on probabilistic logic programming [33]. Cussens compares and
contrasts these approaches and describes some of the common representational issues, making connections to approaches described in later chapters. Chapter 10, by
Kersting and De Raedt, describes Bayesian logic programs (BLPs). Their approach
combines Bayesian networks and logic programs to upgrade them to a representation which overcomes the propositional nature of Bayesian networks and the purely
logical nature of logic programs. This chapter gives an introduction to BLPs, describing both a Bayesian logic programming tool and a graphical representation for
them. Chapter 11, by Muggleton and Pahlavi, describes stochastic logic programs
(SLPs). SLPs were originally introduced as a means of extending the expressiveness
of stochastic grammars to the level of logic programs. The chapter provides several
example programs and describes both parameter estimation and structure learning. Chapter 12, by Domingos and Richardson, describes Markov logic. Markov
logic combines Markov networks and first-order logic. First-order logic formulae
are given weights; the formulae define a log-linear model with a feature for each
grounding of the logical formulae with the appropriate weights. The relationship
between many of the other SRL approaches and Markov logic networks (MLNs) is
discussed, along with several common SRL tasks such as collective inference, link
prediction, and object identification. Inference and learning in MLNs are presented.
Many of the approaches discussed so far have made, either implicitly or explicitly, several practical assumptions (the closed-world assumption, domain closure,
unique names) about the underlying logical interpretation in order to define the
underlying semantics. Chapter 13, by Milch et al., describes BLOG, a system especially tailored toward cases in which these assumptions are not appropriate. BLOG
models define stochastic processes for generating worlds; inference in these models
is done via a sampling process. Chapter 14, by Pfeffer, describes IBAL, a functional
programming language for probabilistic AI. IBAL supports a rich decision-theoretic
framework which includes probabilistic reasoning and utility maximization. The
chapter describes the syntax and semantics of IBAL, along with a sophisticated inference algorithm which exploits both lazy evaluation and memoization for
efficient inference.
One of the issues that comes up in many of the approaches is the need to perform
effective inference in large-scale probabilistic models. Many of the approaches can
make use of lifted inference, inference which is done at the level of the first-order
representation directly, rather than at the propositional level. Chapter 15 describes
first-order variable elimination, an algorithm for lifted probabilistic inference, and
presents recent results.
One of the issues that comes up in each of the learning algorithms is the need
for feature generation and selection. Chapter 16, by Popescul and Ungar, examines
this issue in the context of structured generalized linear regression (SGLR). They
address the need for an integrated approach to feature generation and selection.
Chapter 17, by Davis et al., addresses a related issue, the need for view learning to
support feature generation and selection. They describe two approaches and present
results on a mammography analysis system.
Chapter 18, by Fern et al., surveys recent work in reinforcement learning in relational domains. There has been a lot of recent work on relational learning within the
reinforcement learning setting and our collection does not try to comprehensively
cover its scope. Instead we have chosen a representative contribution describing a
novel approach to approximate policy iteration which is applicable to very large
relational Markov decision problems.
One of the domains which naturally lends itself to SRL techniques is natural
language processing. Chapter 19 by Bunescu and Mooney shows how RMNs can be
used for information extraction. An advantage of their approach is that inference
and learning support collective information extraction in which dependencies
between extractions are exploited. They present results on extracting protein names
from biomedical abstracts. Chapter 20, by Roth and Yih, also investigates SRL
approaches for information extraction, specifically for combining named entity and
relation extraction. They show how a linear programming formulation can capture
the required global inference.

1.6

Outlook
In this introduction we have touched on a number of the common themes and
issues that will be developed in greater detail in the following chapters. While a
single unified framework has yet to emerge, we believe that the book highlights
the commonalities, and clarifies some of the important differences among proposed
approaches. Along the way, important representational and algorithmic issues are
identified.
Statistical relational learning is a young and exciting field. There are many
opportunities to develop new methods and apply the tools to compelling real-world
problems. We hope this book will provide an introduction to the field, and stimulate
further research, development, and applications.

References
[1] R. Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In
Proceedings of the International Joint Conference on Artificial Intelligence,
2005.
[2] W. Buntine. Operations for learning with graphical models. Journal of
Artificial Intelligence Research, 3:159-225, 1994.
[3] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization
using hyperlinks. In Proceedings of ACM International Conference on Management of Data, 1998.
[4] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document
content and hypertext connectivity. In Proceedings of Neural Information
Processing Systems, 2001.
[5] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam,
and S. Slattery. Learning to extract symbolic knowledge from the World Wide
Web. In Proceedings of the National Conference on Artificial Intelligence, 1998.

[6] J. Cussens. Loglinear models for first-order probabilistic reasoning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
[7] L. Dehaspe, H. Toivonen, and R.D. King. Finding frequent substructures in
chemical compounds. In International Conference on Knowledge Discovery
and Data Mining, 1998.
[8] T. Dietterich and R. S. Michalski. Inductive learning of structural descriptions:
Evaluation criteria and comparative review of selected methods. Artificial
Intelligence, 16:257-294, 1986.
[9] S. Dzeroski and N. Lavrac, editors. Relational Data Mining. Kluwer, Berlin,
2001.
[10] P. Finn, S. Muggleton, D. Page, and A. Srinivasan. Discovery of pharmacophores using the inductive logic programming system Progol. Machine Learning, 30(1-2):241-270, 1998.
[11] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice
Hall, Upper Saddle River, NJ, 2002.
[12] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[13] L. Getoor. Learning Statistical Models from Relational Data. PhD thesis,
Stanford University, Stanford, CA, 2001.
[14] L. Getoor and J. Grant. PRL: A probabilistic relational language. Machine
Learning Journal, 62(1-2):7-31, 2006.
[15] L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic models of text
and link structure for hypertext classification. In Proceedings of the IJCAI
Workshop on Text Learning: Beyond Supervision, 2001.
[16] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic
models of link structure. Journal of Machine Learning Research, 3:679-707,
2002.
[17] W. Gilks, A. Thomas, and D. Spiegelhalter. A language and program for
complex Bayesian modeling. The Statistician, 43:169-177, 1994.
[18] F. Hayes-Roth and J. McDermott. Knowledge acquisition from structural
descriptions. In Proceedings of the International Joint Conference on Artificial
Intelligence, 1997.
[19] D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering and data visualization.
Journal of Machine Learning Research, 1:49-75, 2000.
[20] C. Kemp, J. Tenenbaum, T. Griffiths, T. Yamada, and N. Ueda. Learning
systems of concepts with an infinite relational model. In Proceedings of the
National Conference on Artificial Intelligence, 2006.

[21] K. Kersting, L. De Raedt, and S. Kramer. Interpreting Bayesian logic
programs. In Proceedings of the AAAI-2000 Workshop on Learning Statistical
Models from Relational Data, 2000.
[22] R. King, S. Muggleton, A. Srinivasan, and M. Sternberg. Structure-activity
relationships derived by machine learning: The use of atoms and their bond
connectivities to predict mutagenicity by inductive logic programming. In
Proceedings of the National Academy of Sciences of the United States of
America, volume 93, pages 438-442, 1996.
[23] A. Knobbe, M. deHaas, and A. Siebes. Propositionalisation and aggregates.
In Proceedings of the Fifth European Conference on Principles of Data Mining
and Knowledge Discovery, 2001.
[24] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings
of the National Conference on Artificial Intelligence, 1998.
[25] S. Kramer, N. Lavrac, and P. Flach. Propositionalization approaches to
relational data mining. In S. Dzeroski and N. Lavrac, editors, Relational Data
Mining. Springer-Verlag, New York, 2001.
[26] M. Krogel, S. Rawles, F. Železný, P. Flach, N. Lavrac, and S. Wrobel. Comparative evaluation of approaches to propositionalization. In Proceedings of the
International Conference on Inductive Logic Programming, 2003.
[27] N. Lavrac and S. Dzeroski. Inductive Logic Programming: Techniques and
Applications. Ellis Horwood, New York, 1994.
[28] C. Manning and H. Schütze. Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, MA, 1999.
[29] B. Milch, B. Marthi, S. Russell, D. Sontag, D. Ong, and A. Kolobov. BLOG:
Probabilistic models with unknown objects. In Proceedings of the International
Joint Conference on Artificial Intelligence, 2005.
[30] S. Muggleton, editor. Inductive Logic Programming. Academic Press, London,
1992.
[31] S. Muggleton. Learning stochastic logic programs. In Proceedings of the
AAAI-2000 Workshop on Learning Statistical Models from Relational Data,
2000.
[32] J. Neville and D. Jensen. Iterative classification in relational data. In
Proceedings of the AAAI-2000 Workshop on Learning Statistical Models from
Relational Data, 2000.
[33] R. Ng and V.S. Subrahmanian. Probabilistic logic programming. Information
and Computation, 101(2):150-201, 1992.
[34] L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic knowledge bases. Theoretical Computer Science, 171:147-171, 1997.
[35] H. Pasula and S. Russell. Approximate inference for first-order probabilistic
languages. In Proceedings of the International Joint Conference on Artificial
Intelligence, 2001.

[36] C. Perlich and F. Provost. Aggregation-based feature invention and relational
concept classes. In International Conference on Knowledge Discovery and Data
Mining, 2003.
[37] C. Perlich and F. Provost. Distribution-based aggregation for relational
learning with identifier attributes. Machine Learning Journal, 62(1-2):65-105,
2006.
[38] A. Pfeffer, D. Koller, B. Milch, and K. Takusagawa. SPOOK: A system for
probabilistic object-oriented knowledge representation. In Proceedings of the
Conference on Uncertainty in Artificial Intelligence, 1999.
[39] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial
Intelligence, 64(1):81-129, 1993.
[40] D. Poole. The independent choice logic for modelling multiple agents under
uncertainty. Artificial Intelligence, 94(1-2):5-56, 1997.
[41] D. Poole. First-order probabilistic inference. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
[42] A. Popescul and L. Ungar. Structural logistic regression for link analysis. In
KDD Workshop on Multi-Relational Data Mining, 2003.
[43] T. Sato. A statistical learning method for logic programs with distribution
semantics. In Proceedings of the International Conference on Inductive Logic
Programming, 1995.
[44] S. Slattery and T. Mitchell. Discovering test set regularities in relational
domains. In Proceedings of the International Conference on Machine Learning,
2000.
[45] R. Stepp and R. S. Michalski. Conceptual Clustering: Inventing goal-oriented
classifications of structured objects. Technical Report 940, University of
Illinois, Urbana, 1985.
[46] B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering
in relational data. In Proceedings of the International Joint Conference on
Artificial Intelligence, 2001.
[47] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2002.
[48] M. Wellman, J. Breese, and R. Goldman. From knowledge bases to decision
models. Knowledge Engineering Review, 7(1):35-53, 1992.
[49] P. Winston. Learning structural descriptions from examples. In P. H.
Winston, editor, The Psychology of Computer Vision, pages 157-209. McGraw-Hill, New York, 1975.
[50] Z. Xu, V. Tresp, K. Yu, and H. Kriegel. Infinite hidden relational models. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2006.

2 Graphical Models in a Nutshell

Daphne Koller, Nir Friedman, Lise Getoor and Ben Taskar

Probabilistic graphical models are an elegant framework which combines uncertainty (probabilities) and logical structure (independence constraints) to compactly
represent complex, real-world phenomena. The framework is quite general in that
many of the commonly proposed statistical models (Kalman filters, hidden Markov
models, Ising models) can be described as graphical models. Graphical models have
enjoyed a surge of interest in the last two decades, due both to the flexibility and
power of the representation and to the increased ability to effectively learn and
perform inference in large networks.

2.1

Introduction
Graphical models [11, 3, 5, 9, 7] have become an extremely popular tool for modeling uncertainty. They provide a principled approach to dealing with uncertainty
through the use of probability theory, and an effective approach to coping with
complexity through the use of graph theory. The two most common types of graphical models are Bayesian networks (also called belief networks or causal networks)
and Markov networks (also called Markov random fields (MRFs)).
At a high level, our goal is to efficiently represent a joint distribution P over
some set of random variables $X = \{X_1, \ldots, X_n\}$. Even in the simplest case where
these variables are binary-valued, a joint distribution requires the specification of
$2^n$ numbers: the probabilities of the $2^n$ different assignments of values $x_1, \ldots, x_n$.
However, it is often the case that there is some structure in the distribution that
allows us to factor the representation of the distribution into modular components.
The structure that graphical models exploit is the independence properties that
exist in many real-world phenomena.
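A quick count shows how much a factored representation can save. The sketch below is illustrative only: the function names are invented, and the five-variable example with parent counts 0, 0, 2, 1, 1 is a hypothetical structure chosen for the arithmetic. It compares the size of an explicit joint table with the number of entries needed when each binary variable stores one probability per assignment to its parents.

```python
def full_joint_size(n):
    """Number of entries in an explicit joint table over n binary variables."""
    return 2 ** n


def factored_size(parent_counts):
    """Entries needed when each binary variable stores one probability
    (that of being true) for every assignment to its parents."""
    return sum(2 ** k for k in parent_counts)


# Hypothetical network: two roots, one node with two parents, two with one parent.
parent_counts = [0, 0, 2, 1, 1]
print(full_joint_size(len(parent_counts)))  # 32
print(factored_size(parent_counts))         # 10
```

The gap widens quickly: the explicit table doubles with every added variable, while the factored count grows only with the number of parents each variable has.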
The independence properties in the distribution can be used to represent such
high-dimensional distributions much more compactly. Probabilistic graphical models provide a general-purpose modeling language for exploiting this type of structure
in our representation. Inference in probabilistic graphical models provides us with
the mechanisms for gluing all these components back together in a probabilistically
coherent manner. Effective learning, both parameter estimation and model selection, in probabilistic graphical models is enabled by the compact parameterization.
This chapter provides a compact graphical models tutorial based on [8]. We cover
representation, inference, and learning. Our tutorial is not comprehensive; for more
details see [8, 11, 3, 5, 9, 4, 6].

2.2

Representation
The two most common classes of graphical models are Bayesian networks and
Markov networks. The underlying semantics of Bayesian networks are based on
directed graphs and hence they are also called directed graphical models. The
underlying semantics of Markov networks are based on undirected graphs; Markov
networks are also called undirected graphical models. It is possible, though less
common, to use a mixed directed and undirected representation (see, for example,
the work on chain graphs [10, 2]); however, we will not cover them here.
Basic to our representation is the notion of conditional independence:
Definition 2.1
Let X, Y, and Z be sets of random variables. X is conditionally independent of
Y given Z in a distribution P if
$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)\, P(Y = y \mid Z = z)$$
for all values $x \in \mathrm{Val}(X)$, $y \in \mathrm{Val}(Y)$, and $z \in \mathrm{Val}(Z)$.
In the case where P is understood, we use the notation $(X \perp Y \mid Z)$ to say that X
is conditionally independent of Y given Z. If it is clear from the context, sometimes
we say "independent" when we really mean "conditionally independent."
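The definition can be checked numerically on a small joint table. The following is a sketch under assumptions of our own: single variables rather than sets, a joint stored as a dictionary keyed by `(x, y, z)`, and two made-up distributions, one built to satisfy the condition and one built to violate it.

```python
from itertools import product


def is_cond_independent(joint, tol=1e-9):
    """Check P(x, y | z) = P(x | z) P(y | z) for every x, y and every z with P(z) > 0."""
    xs = sorted({x for x, _, _ in joint})
    ys = sorted({y for _, y, _ in joint})
    zs = sorted({z for _, _, z in joint})
    for z in zs:
        p_z = sum(joint.get((x, y, z), 0.0) for x, y in product(xs, ys))
        if p_z == 0.0:
            continue  # conditioning on a zero-probability value is undefined
        for x, y in product(xs, ys):
            p_xy = joint.get((x, y, z), 0.0) / p_z
            p_x = sum(joint.get((x, y2, z), 0.0) for y2 in ys) / p_z
            p_y = sum(joint.get((x2, y, z), 0.0) for x2 in xs) / p_z
            if abs(p_xy - p_x * p_y) > tol:
                return False
    return True


def bern(p_true, value):
    """Probability that a Boolean variable takes `value` when P(true) = p_true."""
    return p_true if value else 1.0 - p_true


bools = [False, True]
# Built as P(z) P(x | z) P(y | z), so X and Y are independent given Z by construction.
ci_joint = {
    (x, y, z): 0.5 * bern(0.9 if z else 0.2, x) * bern(0.7 if z else 0.1, y)
    for x, y, z in product(bools, repeat=3)
}
# X always equals Y, whatever Z is, so the condition fails.
dep_joint = {(x, y, z): (0.25 if x == y else 0.0) for x, y, z in product(bools, repeat=3)}

print(is_cond_independent(ci_joint), is_cond_independent(dep_joint))
```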
2.2.1

Bayesian Networks

The core of the Bayesian network representation is a directed acyclic graph (DAG)
G. The nodes of G are the random variables in our domain and the edges correspond,
intuitively, to direct influence of one node on another. One way to view this graph is
as a data structure that provides the skeleton for representing the joint distribution
compactly in a factorized way.
Let G be a BN graph over the variables $X_1, \ldots, X_n$. Each random variable $X_i$
in the network has an associated conditional probability distribution (CPD) or local
probabilistic model. The CPD for $X_i$, given its parents in the graph (denoted $\mathrm{Pa}_{X_i}$),
is $P(X_i \mid \mathrm{Pa}_{X_i})$. It captures the conditional probability of the random variable,
given its parents in the graph. CPDs can be described in a variety of ways. A
common, but not necessarily compact, representation for a CPD is a table which
contains a row for each possible set of values for the parents of the node describing
the probability of different values for $X_i$. These are often referred to as table CPDs,
and are tables of multinomial distributions. Other possibilities are to represent
the distributions via a tree structure (called, appropriately enough, tree-structured
CPDs), or using an even more compact representation such as a noisy-OR or noisy-MAX.

[Figure 2.1, panel (a): Pneumonia and Tuberculosis each point to Lung Infiltrates; Lung Infiltrates points to XRay; Tuberculosis points to Sputum Smear. Panel (b): the same network with its tables: P(P) = 0.05; P(T) = 0.02; P(I | P, T) with entries 0.8, 0.6, 0.2, 0.01; P(X | I) with entries 0.8, 0.6; P(S | T) with entries 0.8, 0.6.]

Figure 2.1 (a) A simple Bayesian network showing two potential diseases, Pneumonia and Tuberculosis, either of which may cause a patient to have Lung Infiltrates.
The lung infiltrates may show up on an XRay; there is also a separate Sputum
Smear test for tuberculosis. All of the random variables are Boolean. (b) The same
Bayesian network, together with the conditional probability tables. The probabilities shown are the probability that the random variable takes the value true (given
the values of its parents); the conditional probability that the random variable is
false is simply 1 minus the probability that it is true.
Example 2.1
Consider the simple Bayesian network shown in figure 2.1. This is a toy example
indicating the interactions between two potential diseases, pneumonia and tuberculosis. Both of them may cause a patient to have lung infiltrates. There are two
tests that can be performed. An x-ray can be taken, which may indicate whether
the patient has lung infiltrates. There is a separate sputum smear test for tuberculosis. Figure 2.1(a) shows the dependency structure among the variables. All of the
variables are assumed to be Boolean. Figure 2.1(b) shows the conditional probability
distributions for each of the random variables. We use initials P, T, I, X, and S
for shorthand. At the roots, we have the prior probability of the patient having
each disease. The probability that the patient does not have the disease a priori
is simply 1 minus the probability he or she has the disease; for simplicity only the
probabilities for the true case are shown. Similarly, the conditional probabilities
for the non-root nodes give the probability that the random variable is true, for
different possible instantiations of the parents.


Definition 2.2
Let G be a Bayesian network graph over the variables $X_1, \ldots, X_n$. We say that a
distribution $P_B$ over the same space factorizes according to G if $P_B$ can be expressed
as a product
$$P_B(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i}). \qquad (2.1)$$
A Bayesian network is a pair $(G, \theta_G)$ where $P_B$ factorizes over G, and where $P_B$ is
specified as a set of CPDs associated with G's nodes, denoted $\theta_G$.
The equation above is called the chain rule for Bayesian networks. It gives us a
method for determining the probability of any complete assignment to the set of
random variables: any entry in the joint can be computed as a product of factors,
one for each variable. Each factor represents a conditional probability of the variable
given its parents in the network.
Example 2.2
The Bayesian network in figure 2.1(a) describes the following factorization:
P (P, T, I, X, S) = P (P )P (T )P (I | P, T )P (X | I)P (S | T ).
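The factorization in Example 2.2 translates directly into code. The sketch below uses the table entries from figure 2.1(b); note the assumption that the rows of each table are ordered from parents-true to parents-false (for P(I | P, T): both true, only P true, only T true, neither), which is the usual layout but is not spelled out in the text. The function names are our own.

```python
from itertools import product

# Probability that each variable is true, given its parents (figure 2.1(b)).
# Assumption: table rows are ordered from parents-true to parents-false.
P_P = 0.05
P_T = 0.02
P_I = {(True, True): 0.8, (True, False): 0.6, (False, True): 0.2, (False, False): 0.01}
P_X = {True: 0.8, False: 0.6}
P_S = {True: 0.8, False: 0.6}


def bern(p_true, value):
    """Probability that a Boolean variable takes `value` when P(true) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(p, t, i, x, s):
    """Chain rule for this network: P(P) P(T) P(I | P, T) P(X | I) P(S | T)."""
    return (
        bern(P_P, p)
        * bern(P_T, t)
        * bern(P_I[(p, t)], i)
        * bern(P_X[i], x)
        * bern(P_S[t], s)
    )


# One factor per variable; the 32 entries of the joint sum to 1.
total = sum(joint(*values) for values in product([False, True], repeat=5))
print(joint(False, False, False, False, False), total)
```

Each call multiplies exactly five numbers, one per variable, so any entry of the joint is available without ever storing the full table.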
Sometimes it is useful to think of the Bayesian network as describing a generative
process. We can view the graph as encoding a generative sampling process executed
by nature, where the value for each variable is selected by nature using a distribution
that depends only on its parents. In other words, each variable is a stochastic
function of its parents.
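The generative view corresponds to sampling the variables in an order where parents come before children. Here is a minimal sketch for the network of figure 2.1, again assuming the table rows run from parents-true to parents-false; `sample_once` and the seed are illustrative choices, not part of the chapter.

```python
import random

# Probability that each variable is true, given its parents (figure 2.1(b)).
# Assumption: table rows are ordered from parents-true to parents-false.
P_P = 0.05
P_T = 0.02
P_I = {(True, True): 0.8, (True, False): 0.6, (False, True): 0.2, (False, False): 0.01}
P_X = {True: 0.8, False: 0.6}
P_S = {True: 0.8, False: 0.6}


def sample_once(rng):
    """Draw one complete assignment, sampling each variable after its parents."""
    p = rng.random() < P_P
    t = rng.random() < P_T
    i = rng.random() < P_I[(p, t)]  # depends only on P and T
    x = rng.random() < P_X[i]       # depends only on I
    s = rng.random() < P_S[t]       # depends only on T
    return {"P": p, "T": t, "I": i, "X": x, "S": s}


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_once(rng) for _ in range(20000)]
freq_p = sum(sample["P"] for sample in samples) / len(samples)
print(freq_p)  # should be close to P(P) = 0.05
```

Each line of `sample_once` reads only the values of that variable's parents, which is exactly the sense in which a variable is a stochastic function of its parents.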
2.2.2

Conditional Independence Assumptions in Bayesian Networks

Another way to view a Bayesian network is as a compact representation for a set
of conditional independence assumptions about a distribution. These conditional
independence assumptions are called the local Markov assumptions. While we will not
go into the full details here, this view is, in a strong sense, equivalent to the view
of the Bayesian network as providing a factorization of the distribution.
Definition 2.3
Given a BN structure G over random variables $X_1, \ldots, X_n$, let $\mathrm{NonDescendants}_{X_i}$
denote the variables in the graph that are not descendants of $X_i$. Then G encodes
the following set of conditional independence assumptions, called the local Markov
assumptions:
For each variable $X_i$, we have that
$(X_i \perp \mathrm{NonDescendants}_{X_i} \mid \mathrm{Pa}_{X_i})$.
In other words, the local Markov assumptions state that each node $X_i$ is independent of its nondescendants given its parents.


Figure 2.2 (a) An indirect causal effect; (b) an indirect evidential effect; (c) a
common cause; (d) a common effect.

Example 2.3
The BN in figure 2.1(a) describes the following local Markov assumptions:
$(P \perp T \mid \emptyset)$, $(T \perp P \mid \emptyset)$, $(X \perp \{P, T, S\} \mid I)$, and $(S \perp \{P, I, X\} \mid T)$.
These are not the only independence assertions that are encoded by a network.
A general procedure called d-separation (which stands for directed separation) can
answer whether an independence assertion must hold in any distribution consistent
with the graph $G$. However, note that other independencies may hold in some
distributions consistent with $G$; these are due to flukes in the particular choice of
parameters of the network (and this is why they hold in only some of the distributions).
Returning to our definition of d-separation, it is useful to view probabilistic
influence as a flow in the graph. Our analysis here tells us when influence from
$X$ can flow through $Z$ to affect our beliefs about $Y$. We will consider flow along
(undirected) paths in the graph.
Consider a simple three-node path $X - Z - Y$. If influence can flow from $X$ to $Y$
via $Z$, we say that the path $X - Z - Y$ is active. There are four cases:
Causal path $X \to Z \to Y$: active if and only if $Z$ is not observed.
Evidential path $X \leftarrow Z \leftarrow Y$: active if and only if $Z$ is not observed.
Common cause $X \leftarrow Z \to Y$: active if and only if $Z$ is not observed.
Common effect $X \to Z \leftarrow Y$: active if and only if either $Z$ or one of $Z$'s
descendants is observed.
A structure where $X \to Z \leftarrow Y$ (as in figure 2.2(d)) is also called a v-structure.
Example 2.4
In the BN from figure 2.1(a), the path $P \to I \to X$ is active if $I$ is not
observed. On the other hand, the path $P \to I \leftarrow T$ is active if $I$ is observed.
Now consider a longer path $X_1 - \cdots - X_n$. Intuitively, for influence to flow
from $X_1$ to $X_n$, it needs to flow through every single node on the trail. In other
words, $X_1$ can influence $X_n$ if every two-edge path $X_{i-1} - X_i - X_{i+1}$ along the trail
allows influence to flow. We can summarize this intuition in the following definition:


Definition 2.4
Let $G$ be a BN structure, and $X_1 - \cdots - X_n$ a path in $G$. Let $\mathbf{E}$ be a subset of
nodes of $G$. The path $X_1 - \cdots - X_n$ is active given evidence $\mathbf{E}$ if
whenever we have a v-structure $X_{i-1} \to X_i \leftarrow X_{i+1}$, then $X_i$ or one of its
descendants is in $\mathbf{E}$;
no other node along the path is in $\mathbf{E}$.
Our flow intuition carries through to graphs in which there is more than one
path between two nodes: one node can influence another if there is any path along
which influence can flow. Putting these intuitions together, we obtain the notion
of d-separation, which provides us with a notion of separation between nodes in a
directed graph (hence the term d-separation, for directed separation):
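Definition 2.5
Let $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$ be three sets of nodes in $G$. We say that $\mathbf{X}$ and $\mathbf{Y}$ are d-separated
given $\mathbf{Z}$, denoted $\text{d-sep}_G(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z})$, if there is no active path between any node
$X \in \mathbf{X}$ and $Y \in \mathbf{Y}$ given $\mathbf{Z}$.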
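The two definitions above translate almost directly into code. The following is a minimal sketch, assuming the edge set implied by the factorization in Example 2.2 (P and T are parents of I, I is the parent of X, T is the parent of S); the function names are illustrative. It enumerates simple paths, which is adequate for a toy graph but exponential in general.

```python
# Directed edges as (parent, child) pairs, read off the factorization in Example 2.2.
EDGES = {("P", "I"), ("T", "I"), ("I", "X"), ("T", "S")}


def descendants(node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in EDGES:
            if parent == current and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def is_active(path, evidence):
    """Active-path test: v-structures need an observed middle node or
    descendant; every other node on the path must be unobserved."""
    evidence = set(evidence)
    if path[0] in evidence or path[-1] in evidence:
        return False
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        if (prev, node) in EDGES and (nxt, node) in EDGES:   # prev -> node <- nxt
            if not (({node} | descendants(node)) & evidence):
                return False
        elif node in evidence:
            return False
    return True


def simple_paths(start, goal, visited=()):
    """Enumerate undirected simple paths from start to goal."""
    visited = visited + (start,)
    if start == goal:
        yield visited
        return
    for a, b in EDGES:
        for u, v in ((a, b), (b, a)):
            if u == start and v not in visited:
                yield from simple_paths(v, goal, visited)


def d_separated(x, y, evidence):
    """Single nodes x and y are d-separated if no path between them is active."""
    return not any(is_active(p, evidence) for p in simple_paths(x, y))
```

For instance, `d_separated("P", "T", [])` holds, while observing `X` (a descendant of the common effect `I`) activates the path between `P` and `T`.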
Finally, an important theorem which relates the independencies which hold in a
distribution to the factorization of a distribution is the following:
Theorem 2.6
Let G be a BN graph over a set of random variables X and let P be a joint
distribution over the same space. If all the local Markov properties associated with
G hold in P , then P factorizes according to G.
Theorem 2.7
Let G be a BN graph over a set of random variables X and let P be a joint
distribution over the same space. If P factorizes according to G, then all the local
Markov properties associated with G hold in P .
2.2.3 Markov Networks

The second common class of probabilistic graphical models is called a Markov network or a Markov random field. The models are based on undirected graphical
models. These models are useful in modeling a variety of phenomena where one
cannot naturally ascribe a directionality to the interaction between variables. Furthermore, the undirected models also offer a different and often simpler perspective
on directed models, both in terms of the independence structure and the inference
task.
A representation that implements this intuition is that of an undirected graph.
As in a Bayesian network, the nodes in the graph of a Markov network
$H$ represent the variables, and the edges correspond to some notion of direct
probabilistic interaction between the neighboring variables.
The remaining question is how to parameterize this undirected graph. The graph
structure represents the qualitative properties of the distribution. To represent the


distribution, we need to associate the graph structure with a set of parameters, in


the same way that CPDs were used to parameterize the directed graph structure.
However, the parameterization of Markov networks is not as intuitive as that of
Bayesian networks, as the factors do not correspond either to probabilities or to
conditional probabilities.
The most general parameterization is a factor:
Definition 2.8
Let $\mathbf{D}$ be a set of random variables. We define a factor to be a function from
$\mathrm{Val}(\mathbf{D})$ to $\mathbb{R}^+$.
Definition 2.9
Let $H$ be a Markov network structure. A distribution $P_H$ factorizes over $H$ if it is
associated with
a set of subsets $\mathbf{D}_1, \ldots, \mathbf{D}_m$, where each $\mathbf{D}_i$ is a complete subgraph of $H$;
factors $\pi_1[\mathbf{D}_1], \ldots, \pi_m[\mathbf{D}_m]$,
such that
$$P_H(X_1, \ldots, X_n) = \frac{1}{Z} P'_H(X_1, \ldots, X_n),$$
where
$$P'_H(X_1, \ldots, X_n) = \pi_1[\mathbf{D}_1] \times \pi_2[\mathbf{D}_2] \times \cdots \times \pi_m[\mathbf{D}_m]$$
is an unnormalized measure and
$$Z = \sum_{X_1, \ldots, X_n} P'_H(X_1, \ldots, X_n)$$
is a normalizing constant called the partition function. A distribution $P$ that
factorizes over $H$ is also called a Gibbs distribution over $H$. (The naming convention
has roots in statistical physics.)
Note that this definition is quite similar to the factorization definition for
Bayesian networks: There, we decomposed the distribution as a product of CPDs.
In the case of Markov networks, the only constraint on the parameters in the factor
is non-negativity.
As every complete subgraph is a subset of some clique, we can simplify the
parameterization by introducing factors only for cliques, rather than for subcliques.
More precisely, let $\mathbf{C}_1, \ldots, \mathbf{C}_k$ be the cliques in $H$. We can parameterize $P$ using a
set of factors $\pi_1[\mathbf{C}_1], \ldots, \pi_k[\mathbf{C}_k]$. These factors are called clique potentials (in the
context of the Markov network $H$). It is tempting to think of the clique potentials
as representing the marginal probabilities of the variables in their scope. However,
this is incorrect. It is important to note that, although conceptually somewhat
simpler, the parameterization using clique potentials can obscure the structure that


is present in the original parameterization, and can possibly lead to an exponential


increase in the size of the representation.
It is often useful to consider a slightly different way of specifying potentials, by
using a logarithmic transformation. In particular, we can rewrite a factor $\pi[\mathbf{D}]$ as
$$\pi[\mathbf{D}] = \exp(-\epsilon[\mathbf{D}]),$$
where $\epsilon[\mathbf{D}] = -\ln \pi[\mathbf{D}]$ is often called an energy function. The use of the word
energy derives from statistical physics, where the probability of a physical state
(e.g., a configuration of a set of electrons) depends inversely on its energy.
In this logarithmic representation, we have that
$$P_H(X_1, \ldots, X_n) \propto \exp\left[-\sum_{i=1}^{m} \epsilon_i[\mathbf{D}_i]\right].$$
The logarithmic representation ensures that the probability distribution is positive. Moreover, the logarithmic parameters can take any real value.
A subclass of Markov networks that arises in many contexts is that of pairwise
Markov networks, representing distributions where all of the factors are over single
variables or pairs of variables. More precisely, a pairwise Markov network over a
graph $H$ is associated with a set of node potentials $\{\pi[X_i] : i = 1, \ldots, n\}$ and a set of
edge potentials $\{\pi[X_i, X_j] : (X_i, X_j) \in H\}$. The overall distribution is (as always)
the normalized product of all of the potentials (both node and edge). Pairwise
MRFs are attractive because of their simplicity, and because interactions on edges
are an important special case that often arises in practice.
Example 2.5
Figure 2.3(a) shows a simple Markov network. This toy example has random
variables describing the tuberculosis status of four patients. Patients that have been
in contact are linked by undirected edges. The edges indicate the possibilities for
disease transmission. For example, Patient 1 has been in contact with Patient 2
and Patient 3, but has not been in contact with Patient 4. Figure 2.3(b) shows the
same Markov network, along with the node and edge potentials. We use P1, P2,
P3, and P4 for shorthand. In this case, all of the node and edge potentials are the
same, but this is not a requirement. The node potentials show that the patients
are much more likely to be uninfected. The edge potentials capture the intuition
that it is most likely for two people to have the same infection state: either both
infected, or both not. Furthermore, it is more likely that they are both not infected.
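To make the normalized product of potentials concrete, here is a small brute-force sketch for a four-patient pairwise network with edges P1-P2, P1-P3, P2-P4, and P3-P4. The potential values are assumptions chosen to match the qualitative description above (uninfected strongly preferred, equal states preferred); they are not a transcription of figure 2.3(b).

```python
import itertools
import math

NODES = ["P1", "P2", "P3", "P4"]
EDGES = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P3", "P4")]

# Hypothetical potentials; True means "infected".
node_pot = {True: 0.2, False: 100.0}
edge_pot = {(True, True): 2.0, (False, False): 3.0,
            (True, False): 0.5, (False, True): 0.5}


def unnormalized(assign):
    """Product of all node and edge potentials for one joint assignment."""
    value = 1.0
    for n in NODES:
        value *= node_pot[assign[n]]
    for a, b in EDGES:
        value *= edge_pot[(assign[a], assign[b])]
    return value


def energy(assign):
    """Total energy: minus the log of the unnormalized measure."""
    return -math.log(unnormalized(assign))


assignments = [dict(zip(NODES, vals))
               for vals in itertools.product([False, True], repeat=len(NODES))]

# Partition function: sum of the unnormalized measure over all assignments.
Z = sum(unnormalized(a) for a in assignments)


def prob(assign):
    return unnormalized(assign) / Z


most_likely = max(assignments, key=prob)
```

Enumerating all assignments is only feasible for tiny networks; the cost of computing `Z` this way doubles with every added binary variable.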
2.2.4 Independencies in Markov Networks

As in the case of Bayesian networks, the graph structure in a Markov network can
be viewed as encoding a set of independence assumptions. Intuitively, in Markov
networks, probabilistic influence flows along the undirected paths in the graph,
but is blocked if we condition on the intervening nodes. We can define two sets


Figure 2.3 (a) A simple Markov network describing the tuberculosis status of four
patients. The links between patients indicate which patients have been in contact
with each other. (b) The same Markov network, together with the node and edge
potentials.

of independence assumptions, the local Markov properties and the global Markov
properties.
The local Markov properties are associated with each node in the graph and are
based on the intuition that we can block all influences on a node by conditioning
on its immediate neighbors.
Definition 2.10
Let $H$ be an undirected graph. Then for each node $X \in \mathcal{X}$, the Markov blanket of
$X$, denoted $N_H(X)$, is the set of neighbors of $X$ in the graph (those that share an
edge with $X$). We define the local Markov independencies associated with $H$ to be
$$I_\ell(H) = \{(X \perp \mathcal{X} - \{X\} - N_H(X) \mid N_H(X)) : X \in \mathcal{X}\}.$$
In other words, the Markov assumptions state that $X$ is independent of the rest of
the nodes in the graph given its immediate neighbors.
Example 2.6
The MN in figure 2.3(a) describes the following local Markov assumptions:
$(P_1 \perp P_4 \mid \{P_2, P_3\})$, $(P_2 \perp P_3 \mid \{P_1, P_4\})$, $(P_3 \perp P_2 \mid \{P_1, P_4\})$, $(P_4 \perp P_1 \mid \{P_2, P_3\})$.
To define the global Markov properties, we begin by defining active paths in
undirected graphs.
Definition 2.11
Let $H$ be a Markov network structure, and $X_1 - \cdots - X_k$ be a path in $H$. Let
$\mathbf{E} \subseteq \mathcal{X}$ be a set of observed variables. The path $X_1 - \cdots - X_k$ is active given $\mathbf{E}$ if
none of the $X_i$'s, $i = 1, \ldots, k$, is in $\mathbf{E}$.


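Using this notion, we can define a notion of separation in the undirected graph.
This is the analogue of d-separation; note how much simpler it is.
Definition 2.12
We say that a set of nodes $\mathbf{Z}$ separates $\mathbf{X}$ and $\mathbf{Y}$ in $H$, denoted $\text{sep}_H(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z})$,
if there is no active path between any node $X \in \mathbf{X}$ and $Y \in \mathbf{Y}$ given $\mathbf{Z}$. We define
the global Markov assumptions associated with $H$ to be
$$I(H) = \{(\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z}) : \text{sep}_H(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z})\}.$$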
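Separation in an undirected graph reduces to plain reachability once the observed nodes are deleted. A minimal sketch, assuming the four-patient graph of Example 2.5 (edges P1-P2, P1-P3, P2-P4, P3-P4) and an illustrative function name `separated`:

```python
from collections import deque

# Adjacency sets for the four-patient network (assumed edge set).
ADJ = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P4"},
    "P3": {"P1", "P4"},
    "P4": {"P2", "P3"},
}


def separated(xs, ys, zs, adj=ADJ):
    """True if every path from a node in xs to a node in ys passes through zs."""
    zs = set(zs)
    frontier = deque(x for x in xs if x not in zs)
    seen = set(frontier)
    while frontier:                      # breadth-first search avoiding zs
        node = frontier.popleft()
        if node in ys:
            return False                 # found an active path
        for nb in adj[node]:
            if nb not in zs and nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return True
```

With this graph, `separated({"P1"}, {"P4"}, {"P2", "P3"})` holds, matching the first assumption in Example 2.6, whereas conditioning on `P2` alone leaves the path through `P3` active.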
As in the case of Bayesian networks, we can make a connection between the local
Markov properties and the global Markov properties. The assumptions are in fact
equivalent, but only for positive distributions. (Informally, a distribution is positive
if every possible joint instantiation has probability > 0.)
We begin with the analogue to theorem 2.7, which asserts that a Gibbs distribution satisfies the global independencies associated with the graph.
Theorem 2.13
Let P be a distribution over X , and H a Markov network structure over X . If P is
a Gibbs distribution over H, then all the local Markov properties associated with
H hold in P .
The other direction, which goes from the global independence properties of a
distribution to its factorization, is known as the Hammersley-Clifford theorem.
Unlike for Bayesian networks, this direction does not hold in general. It only holds
under the additional assumption that P is a positive distribution.
Theorem 2.14
Let P be a positive distribution over X , and H a Markov network graph over X .
If all of the independence constraints implied by H hold in P , then P is a Gibbs
distribution over H.
This result shows that, for positive distributions, the global Markov property
implies that the distribution factorizes according to the network structure. Thus,
for this class of distributions, we have that a distribution P factorizes over a Markov
network H if and only if all of the independencies implied by H hold in P . The
positivity assumption is necessary for this result to hold.

2.3 Inference
Both directed and undirected graphical models represent a full joint probability
distribution over X . We describe some of the main query types one might expect
to answer with a joint distribution, and discuss the computational complexity of
answering such queries using a graphical model.
The most common query type is the standard conditional probability query,
P (Y | E = e). Such a query consists of two parts: the evidence, a subset E of


random variables in the network, and an instantiation $\mathbf{e}$ to these variables; and
the query, a subset $\mathbf{Y}$ of random variables in the network. Our task is to compute
$P(\mathbf{Y} \mid \mathbf{E} = \mathbf{e}) = \frac{P(\mathbf{Y}, \mathbf{e})}{P(\mathbf{e})}$, i.e., the probability distribution over the values $\mathbf{y}$ of $\mathbf{Y}$,
conditioned on the fact that $\mathbf{E} = \mathbf{e}$.
Another type of query that often arises is that of finding the most probable
assignment to some subset of variables. As with conditional probability queries,
we have evidence $\mathbf{E} = \mathbf{e}$. In this case, however, we are trying to compute the most
likely assignment to some subset of the remaining variables. This problem has two
variants, where the first variant is an important special case of the second. The
simplest variant of this task is the most probable explanation (MPE) query. An
MPE query tries to find the most likely assignment to all of the (non-evidence)
variables. More precisely, if we let $\mathbf{W} = \mathcal{X} - \mathbf{E}$, our task is to find the most likely
assignment to the variables in $\mathbf{W}$ given the evidence $\mathbf{E} = \mathbf{e}$: $\arg\max_{\mathbf{w}} P(\mathbf{w}, \mathbf{e})$,
where, in general, $\arg\max_x f(x)$ represents the value of $x$ for which $f(x)$ is maximal.
Note that there might be more than one assignment that has the highest posterior
probability. In this case, we can either decide that the MPE task is to return the
set of possible assignments, or to return an arbitrary member of that set.
In the second variant, the maximum a posteriori (MAP) query, we have a
subset of variables $\mathbf{Y}$ which forms our query. The task is to find the most likely
assignment to the variables in $\mathbf{Y}$ given the evidence $\mathbf{E} = \mathbf{e}$: $\arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{e})$.
This class of queries is clearly more general than MPE queries, so it might not
be clear why the class of MPE queries is sufficiently interesting to consider as a
special case. The difference becomes clearer if we explicitly write out the expression
for a general MAP query. If we let $\mathbf{Z} = \mathcal{X} - \mathbf{Y} - \mathbf{E}$, the MAP task is to
compute: $\arg\max_{\mathbf{Y}} \sum_{\mathbf{Z}} P(\mathbf{Y}, \mathbf{Z} \mid \mathbf{e})$. MAP queries contain both summations and
maximizations; in a way, they contain elements of both a conditional probability
query and an MPE query. This combination makes the MAP task harder than
either of these other tasks. In particular, there are techniques and analysis for the
MPE task that do not generalize to the MAP task. This observation, combined
with the fact that the MPE case is reasonably common, makes it worthwhile to
consider MPE as a separate task. Note that in statistics literature, as well as in
some work on graphical models, the term MAP is often used to mean MPE, but
the distinction can be made clear from the context.
In principle, a graphical model can be used to answer all of the query types
described above. We simply generate the joint distribution, and exhaustively sum
out the joint (in the case of a conditional probability query), search for the most
likely entry (in the case of an MPE query), or both (in the case of an MAP query).
However, this approach to the inference problem is not very satisfactory, as it
results in the exponential blowup of the joint distribution that the graphical model
representation was precisely designed to avoid.
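The exhaustive approach is easy to write down for a tiny joint table, and it also shows why MAP is not just a projection of MPE. The joint distribution below over two binary variables (Y, Z) is a made-up example with no evidence; the variable names are illustrative.

```python
# Hypothetical joint distribution over (y, z); the entries sum to 1.
joint = {(0, 0): 0.04, (0, 1): 0.36, (1, 0): 0.30, (1, 1): 0.30}

# MPE: the single most likely joint assignment to all variables.
mpe = max(joint, key=joint.get)

# MAP over Y alone: first sum out Z, then maximize over Y.
p_y = {}
for (y, z), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p
map_y = max(p_y, key=p_y.get)

# Conditional probability query P(Y | Z = 1): restrict to the evidence, renormalize.
p_z1 = sum(p for (y, z), p in joint.items() if z == 1)
cond_y_given_z1 = {y: joint[(y, 1)] / p_z1 for y in (0, 1)}
```

Here the MPE assignment sets Y to 0, yet the MAP assignment to Y alone is 1, because the marginal over Y aggregates probability mass from both values of Z.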


We assume that we are dealing with a set of factors $F$ over a set of variables $\mathcal{X}$.
This set of factors defines a possibly unnormalized function
$$P_F(\mathcal{X}) = \prod_{\phi \in F} \phi. \tag{2.2}$$

For a Bayesian network without evidence, the factors are simply the CPDs, and the
distribution $P_F$ is a normalized distribution. For a Bayesian network $B$ with evidence $\mathbf{E} = \mathbf{e}$, the factors are the CPDs restricted to $\mathbf{e}$, and $P_F(\mathcal{X}) = P_B(\mathcal{X}, \mathbf{e})$. For
a Markov network $H$ (with or without evidence), the factors are the (restricted)
compatibility potentials, and $P_F$ is the unnormalized distribution $P'_H$ before dividing by the partition function. It is important to note, however, that most of
the operations that one can perform on a normalized distribution can also be performed on an unnormalized one. Thus, we can marginalize $P_F$ on a subset of the
variables by summing out the others. We can also consider a conditional probability
$P_F(\mathbf{X} \mid \mathbf{Y}) = P_F(\mathbf{X}, \mathbf{Y}) / P_F(\mathbf{Y})$. Thus, for the purposes of this section, we treat
$P_F$ as a distribution, ignoring the fact that it may not be normalized.
In the worst case, the high complexity of probabilistic inference is unavoidable. Below,
we assume that the set of factors $\{\phi \in F\}$ of the graphical model defining the desired
distribution can be specified in a polynomial number of bits (in terms of the number
of variables).
Theorem 2.15
The following decision problems are NP-complete:
Given a distribution $P_F$ over $\mathcal{X}$, a variable $X \in \mathcal{X}$, and a value $x \in \mathrm{Val}(X)$,
decide whether $P_F(X = x) > 0$.
Given a distribution $P_F$ over $\mathcal{X}$ and a number $\alpha$, decide whether there exists an
assignment $\mathbf{x}$ to $\mathcal{X}$ such that $P_F(\mathbf{x}) > \alpha$.
The following problem is #P-complete:
Given a distribution $P_F$ over $\mathcal{X}$, a variable $X \in \mathcal{X}$, and a value $x \in \mathrm{Val}(X)$,
compute $P_F(X = x)$.
These results seem like very bad news: every type of inference in graphical
models is NP-hard or harder. In fact, even the simple problem of computing
the distribution over a single binary variable is NP-hard. Assuming (as seems
increasingly likely) that the best computational performance we can achieve for
NP-hard problems is exponential in the worst case, there seems to be no hope for
efficient algorithms for even the simplest type of inference. However, as we discuss
below, the worst-case blowup can often be avoided. For all other models, we will
resort to approximate inference techniques. Note that the worst-case results for
approximate inference are also negative:


Theorem 2.16
The following problem is NP-hard for any $\epsilon \in (0, 1/2)$: Given a distribution $P_F$
over $\mathcal{X}$, a variable $X \in \mathcal{X}$, and a value $x \in \mathrm{Val}(X)$, find a number $\tau$, such that
$|P_F(X = x) - \tau| \le \epsilon$.
Fortunately, many types of exact inference can be performed efficiently for a
very important class of graphical models (those with low treewidth), which we define below. For a
large number of models, however, exact inference is intractable and we resort to
approximations. Broadly speaking, there are two major frameworks for probabilistic
inference: optimization-based and sampling-based. Exact inference algorithms have
been historically derived from the dynamic programming perspective, by carefully
avoiding repeated computations. We take a somewhat unconventional approach here
by presenting exact and approximate inference in a unified optimization framework.
We thus start out by considering approximate inference and then present conditions
under which it yields exact results.
2.3.1 Inference as Optimization

The methods that fall into an optimization framework are based on a simple
conceptual principle: define a target class of easy distributions $\mathcal{Q}$, and then search
for a particular instance $Q$ within that class which is the best approximation to
$P_F$. Queries can then be answered using inference on $Q$ rather than on $P_F$. The
specific algorithms that have been considered in the literature differ in many details.
However, most of them can be viewed as optimizing a target function for measuring
the quality of approximation.
Suppose that we want to approximate $P_F$ with another distribution $Q$. Intuitively,
we want to choose the approximation $Q$ to be close to $P_F$. There are many
possible ways to measure the distance between two distributions, such as the
Euclidean distance ($L_2$), or the $L_1$ distance. Our main challenge, however, is that
our aim is to avoid performing inference with the distribution $P_F$; in particular, we
cannot effectively compute marginal distributions in $P_F$. Hence, we need methods
that allow us to optimize the distance (technically, divergence) between $Q$ and
$P_F$ without answering hard queries in $P_F$. A priori, this requirement may seem
impossible to satisfy. However, it turns out that there exists a distance measure,
the relative entropy (or KL-divergence), that allows us to exploit the structure
of $P_F$ without performing reasoning with it.
Recall that the relative entropy between $P_1$ and $P_2$ is defined as
$\mathbb{D}(P_1 \| P_2) = \mathbb{E}_{P_1}\left[\ln \frac{P_1(\mathcal{X})}{P_2(\mathcal{X})}\right]$. The relative entropy is always non-negative, and equal to 0 if and
only if $P_1 = P_2$. Thus, we can use it as a distance measure, and choose to find an
approximation $Q$ to $P_F$ that minimizes the relative entropy. However, the relative
entropy is not symmetric: in general, $\mathbb{D}(P_1 \| P_2) \neq \mathbb{D}(P_2 \| P_1)$. A priori, it might appear that
$\mathbb{D}(P_F \| Q)$ is a more appropriate measure for approximate inference, as one of the
main information-theoretic justifications for relative entropy is the number of bits
lost when coding a true message distribution $P_F$ using an (approximate) estimate $Q$.


However, computing the so-called M-projection $Q$ of $P_F$, the $\arg\min_Q \mathbb{D}(P_F \| Q)$,
is actually equivalent to running inference in $P_F$. Somewhat surprisingly, as we
show in the subsequent discussion, this does not apply to the so-called I-projection:
we can exploit the structure of $P_F$ to optimize $\arg\min_Q \mathbb{D}(Q \| P_F)$ efficiently, without
running inference in $P_F$.
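The asymmetry of the relative entropy is easy to check numerically. The following sketch uses two arbitrary discrete distributions over three outcomes (made-up numbers); the helper `kl` assumes that the second argument is nonzero wherever the first is.

```python
import math


def kl(p, q):
    """Relative entropy D(p || q) in nats for two discrete distributions
    given as aligned lists of probabilities (q must be > 0 where p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

d_pq = kl(p, q)   # the direction minimized by an M-projection of p
d_qp = kl(q, p)   # the direction minimized by an I-projection onto p
```

Both values are positive and they differ, so which argument plays the role of the "true" distribution changes the answer.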
An additional reason for using relative entropy as our distance measure is based
on the following result, which relates the relative entropy $\mathbb{D}(Q \| P_F)$ with the
partition function $Z$:
Proposition 2.17
$$\ln Z = F[P_F, Q] + \mathbb{D}(Q \| P_F), \tag{2.3}$$
where $F[P_F, Q]$ is the energy functional $F[P_F, Q] = \sum_{\phi \in F} \mathbb{E}_Q[\ln \phi] + \mathbb{H}_Q(\mathcal{X})$.
This proposition has several important ramifications. Note that the term $\ln Z$ does
not depend on $Q$. Hence, minimizing the relative entropy $\mathbb{D}(Q \| P_F)$ is equivalent
to maximizing the energy functional $F[P_F, Q]$. This latter term relates to concepts
from statistical physics, and it is the negative of what is referred to in that field as
the Helmholtz free energy. While explaining the physics-based motivation for this
term is out of the scope of this chapter, we continue to use the standard terminology
of energy functional.
In the remainder of this section, we pose the problem of finding a good approximation $Q$ as one of maximizing the energy functional, or, equivalently, minimizing
the relative entropy. Importantly, the energy functional involves expectations in $Q$.
As we show, by choosing approximations $Q$ that allow for efficient inference, we can
both evaluate the energy functional and optimize it effectively.
Moreover, as $\mathbb{D}(Q \| P_F) \ge 0$, we have that $\ln Z \ge F[P_F, Q]$. That is, the energy
functional is a lower bound on the value of the logarithm of the partition function
$Z$, for any choice of $Q$. Why is this fact significant? Recall that, in directed models,
the partition function $Z$ is the probability of the evidence. Computing the partition
function is often the hardest part of inference. And so this proposition shows that if
we have a good approximation (that is, $\mathbb{D}(Q \| P_F)$ is small), then we can get a good
lower bound approximation to $Z$. The fact that this approximation is a lower bound
plays an important role in learning parameters of graphical models.
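The identity in Proposition 2.17 and the resulting lower bound can be verified by brute force on a toy model. The sketch below assumes two hypothetical factors over binary variables A and B and an arbitrary fully factorized approximation Q; all numbers and names are illustrative.

```python
import itertools
import math

# Hypothetical factors: phi1 over (A, B) and phi2 over B.
phi1 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
phi2 = {0: 1.0, 1: 4.0}
states = list(itertools.product([0, 1], repeat=2))

# Unnormalized measure, partition function, and normalized distribution.
p_tilde = {(a, b): phi1[(a, b)] * phi2[b] for a, b in states}
Z = sum(p_tilde.values())
p = {s: v / Z for s, v in p_tilde.items()}

# An arbitrary approximating distribution Q(A, B) = Q(A) Q(B).
q_a = {0: 0.6, 1: 0.4}
q_b = {0: 0.3, 1: 0.7}
q = {(a, b): q_a[a] * q_b[b] for a, b in states}

# Energy functional: expected log-factors under Q plus the entropy of Q.
expected_log_factors = sum(
    q[(a, b)] * (math.log(phi1[(a, b)]) + math.log(phi2[b])) for a, b in states
)
entropy_q = -sum(v * math.log(v) for v in q.values())
energy_functional = expected_log_factors + entropy_q

# Relative entropy D(Q || P_F) against the normalized distribution.
kl_q_p = sum(q[s] * math.log(q[s] / p[s]) for s in states)
```

Up to floating-point error, `math.log(Z)` equals `energy_functional + kl_q_p`, and since the relative entropy is non-negative the energy functional never exceeds `math.log(Z)`.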
2.3.2 Exact Inference as Optimization

Before considering approximate inference methods, we illustrate the use of a variational approach to derive an exact inference procedure. The concepts we introduce
here will serve in discussion of the following approximate inference methods.
The goal of exact inference here will be to compute marginals of the distribution.
To achieve this goal, we will need to make sure that the set of distributions Q is
expressive enough to represent the target distribution PF . Instead of approximating
PF , the solution of the optimization problem transforms the representation of the


Figure 2.4 (a) Chain-structured Bayesian network and equivalent Markov network. (b) Grid-structured Markov network.

distribution from a product of factors into a more useful form $Q$ that directly yields
the desired marginals.
To accomplish this, we will need to optimize over a set of distributions $\mathcal{Q}$ that
includes $P_F$. Then, if we search over this set, we are guaranteed to find a distribution
$Q^*$ for which $\mathbb{D}(Q^* \| P_F) = 0$, which is therefore the unique global optimum of our
energy functional. We will represent this set using an undirected graphical model
called the clique tree, for reasons that will be clear below.
Consider the undirected graph corresponding to the set of factors $F$. In this
graph, nodes are connected if they appear together in a factor. Note that if a factor
is the CPD of a directed graphical model, then the family will be a clique in the
graph, so its connectivity is denser than the original directed graph since parents
have been connected (moralized). The key property for exact inference in the graph
is chordality:
Definition 2.18
Let $X_1 - X_2 - \cdots - X_k - X_1$ be a loop in the graph; a chord in the loop is an edge
connecting $X_i$ and $X_j$ for two nonconsecutive nodes $X_i$, $X_j$. An undirected graph
$H$ is said to be chordal if any loop $X_1 - X_2 - \cdots - X_k - X_1$ for $k \ge 4$ has a chord.
In other words, the longest minimal loop (one that has no shortcut) is a triangle.
Thus, chordal graphs are often also called triangulated.
The simplest (and most commonly used) chordal graphs are chain-structured
(see figure 2.4(a)). What if the graph is not chordal? For example, grid-structured
graphs are commonly used in computer vision for pixel-labeling problems (see figure 2.4(b)). To make a graph chordal (triangulate it), fill-in edges are added to
short-circuit loops. There are generally many ways to do this and finding the least
number of edges to fill is NP-hard. However, good heuristic algorithms for this
problem exist [12, 1].
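Chordality itself can be tested cheaply. One standard characterization (not stated in the text, so treat it as an added assumption) is that a graph is chordal exactly when its nodes can be removed one at a time so that each removed node's remaining neighbors form a clique. The sketch below applies that greedy elimination to small graphs given as adjacency sets; `is_chordal` is an illustrative name.

```python
def is_chordal(adj):
    """Greedy simplicial elimination on an undirected graph given as
    {node: set_of_neighbors}. Returns True iff the graph is chordal."""
    adj = {v: set(nbs) for v, nbs in adj.items()}   # work on a copy
    while adj:
        # A simplicial node is one whose neighbors are pairwise adjacent.
        simplicial = next(
            (v for v, nbs in adj.items()
             if all(b in adj[a] for a in nbs for b in nbs if a != b)),
            None,
        )
        if simplicial is None:
            return False          # stuck: some chordless loop remains
        for nb in adj[simplicial]:
            adj[nb].discard(simplicial)
        del adj[simplicial]
    return True


chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
square = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
# The same square with one fill-in edge A-C added.
square_filled = {"A": {"B", "C", "D"}, "B": {"A", "C"},
                 "C": {"A", "B", "D"}, "D": {"A", "C"}}
```

The chain and the filled square pass the test, while the plain four-node loop does not, which mirrors the role of fill-in edges described above.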
We now define a cluster graph, the backbone of the graphical data structure
needed to perform inference. Each node in the cluster graph is a cluster, which


is associated with a subset of variables; the graph contains undirected edges that
connect clusters whose scopes have some nonempty intersection.
Definition 2.19
A cluster graph $K$ for a set of factors $F$ over $\mathcal{X}$ is an undirected graph, each of
whose nodes $i$ is associated with a subset $\mathbf{C}_i \subseteq \mathcal{X}$. A cluster graph must be family-preserving: each factor $\phi \in F$ must be associated with a cluster $\mathbf{C}_i$, denoted
$\alpha(\phi)$, such that $\mathrm{Scope}[\phi] \subseteq \mathbf{C}_i$. Each edge between a pair of clusters $\mathbf{C}_i$ and $\mathbf{C}_j$ is
associated with a sepset $\mathbf{S}_{i,j} = \mathbf{C}_i \cap \mathbf{C}_j$. A singly connected cluster graph (a tree)
is called a cluster tree.
Definition 2.20
Let $T$ be a cluster tree over a set of factors $F$. We say that $T$ has the running
intersection property if, whenever there is a variable $X$ such that $X \in \mathbf{C}_i$ and
$X \in \mathbf{C}_j$, then $X$ is also in every cluster in the (unique) path in $T$ between $\mathbf{C}_i$ and
$\mathbf{C}_j$. A cluster tree that satisfies the running intersection property is called a clique
tree.
Theorem 2.21
Every chordal graph G has a clique tree T .
Constructing the clique tree from a chordal graph is actually relatively easy:
(1) find maximal cliques of the graph (this is easy in chordal graphs) and (2)
run a maximum spanning tree algorithm on the appropriate clique graph. More
specifically, we build an undirected graph whose nodes are the maximal cliques,
and where every pair of nodes $\mathbf{C}_i$, $\mathbf{C}_j$ is connected by an edge whose weight is
$|\mathbf{C}_i \cap \mathbf{C}_j|$.
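Step (2) can be sketched with Kruskal's algorithm run on descending edge weights. The code assumes the maximal cliques are already known and that the underlying graph is connected; `clique_tree` and the example cliques are illustrative, and zero-weight edges are skipped so that disconnected inputs yield a forest.

```python
import itertools


def clique_tree(cliques):
    """Maximum-weight spanning tree over cliques, where the weight of an
    edge is the size of the intersection of its two cliques.
    Returns a list of (i, j, sepset) with i, j indexing into `cliques`."""
    cliques = [frozenset(c) for c in cliques]
    candidates = sorted(
        ((len(cliques[i] & cliques[j]), i, j)
         for i, j in itertools.combinations(range(len(cliques)), 2)),
        reverse=True,                      # heaviest edges first
    )
    parent = list(range(len(cliques)))     # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    tree = []
    for weight, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if weight > 0 and root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j, cliques[i] & cliques[j]))
    return tree


# Maximal cliques of a small chordal graph (hypothetical example).
example = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "D", "E"}]
edges = clique_tree(example)
```

For the example, the tree links the first clique to the second and the second to the third, with sepsets {B, C} and {C, D}; the shared variable C then lies on the whole path between the first and third cliques, as the running intersection property requires.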
Because of this correspondence, we can define a very important characteristic of
a graph, which is critical to the complexity of exact inference:
Definition 2.22
The treewidth of a chordal graph is the size of the largest clique minus 1. The
treewidth of an untriangulated graph is the minimum treewidth of all of its triangulations.
Note that the treewidth of a chain in figure 2.4(a) is 1 and the treewidth of the
grid in figure 2.4(b) is 4.
2.3.2.1 The Optimization Problem

Suppose we are given a clique tree $T$ for $P_F$. That is, $T$ satisfies the running
intersection property and the family preservation property. Moreover, suppose we
are given a set of potentials $\mathbf{Q} = \{\pi_i\} \cup \{\mu_{i,j} : (\mathbf{C}_i - \mathbf{C}_j) \in T\}$, where $\mathbf{C}_i$ denotes
clusters in $T$, $\mathbf{S}_{i,j}$ denote separators along edges in $T$, $\pi_i$ is a potential over $\mathbf{C}_i$, and
$\mu_{i,j}$ is a potential over $\mathbf{S}_{i,j}$. The set of potentials defines a distribution $Q$ according


to $T$ by the formula
$$Q(\mathcal{X}) = \frac{\prod_{\mathbf{C}_i \in T} \pi_i}{\prod_{(\mathbf{C}_i - \mathbf{C}_j) \in T} \mu_{i,j}}. \tag{2.4}$$
Note that by construction, $Q$ can represent $P_F$ by simply letting the appropriate
potentials $\pi_i$ equal factors $\pi_i^0$ and letting $\mu_{i,j}$ equal 1. However, we will consider a
different, more useful, representation.
Definition 2.23
The set of potentials $\mathbf{Q}$ is calibrated when for each $(\mathbf{C}_i - \mathbf{C}_j) \in T$ the potential $\mu_{i,j}$
on $\mathbf{S}_{i,j}$ is the marginal of $\pi_i$ (and $\pi_j$).
Proposition 2.24
Let $\mathbf{Q}$ be a set of calibrated potentials for $T$, and let $Q$ be the distribution defined
by (2.4). Then $\pi_i[\mathbf{c}_i] = Q(\mathbf{c}_i)$ and $\mu_{i,j}[\mathbf{s}_{i,j}] = Q(\mathbf{s}_{i,j})$.
In other words, the potentials correspond to marginals of the distribution $Q$ defined
by (2.4). Now if $\mathbf{Q}$ is a set of uncalibrated potentials for $T$, and $Q$ is the distribution
defined by (2.4), we can construct $\mathbf{Q}'$, a set of calibrated potentials which represent
$Q$ by simply using the appropriate marginals of $Q$.
Once we decide to focus our attention on calibrated clique trees, we can rewrite
the energy functional in a factored form, as a sum of terms each of which depends
directly only on one of the potentials in Q. This form reveals the structure in the
distribution, and is therefore a much better starting point for further analysis. As
we shall see, this form will also be the basis for our approximations in subsequent
sections.
Definition 2.25
Given a clique tree $T$ with a set of potentials, $\mathbf{Q}$, and an assignment $\alpha$ that maps
factors in $P_F$ to clusters in $T$, we define the factored energy functional
$$\tilde{F}[P_F, \mathbf{Q}] = \sum_i \mathbb{E}_{\pi_i}\left[\ln \pi_i^0\right] + \sum_{\mathbf{C}_i \in T} \mathbb{H}_{\pi_i}(\mathbf{C}_i) - \sum_{(\mathbf{C}_i - \mathbf{C}_j) \in T} \mathbb{H}_{\mu_{i,j}}(\mathbf{S}_{i,j}), \tag{2.5}$$
where $\pi_i^0 = \prod_{\phi,\, \alpha(\phi) = i} \phi$.
Before we prove that the energy functional is equivalent to its factored form, let
us first understand its form. The first term is a sum of terms of the form $\mathbb{E}_{\pi_i}[\ln \pi_i^0]$.
Recall that $\pi_i^0$ is a factor (not necessarily a distribution) over the scope $\mathbf{C}_i$, that is,
a function from $\mathrm{Val}(\mathbf{C}_i)$ to $\mathbb{R}^+$. Its logarithm is therefore a function from $\mathrm{Val}(\mathbf{C}_i)$
to $\mathbb{R}$. The clique potential $\pi_i$ is a distribution over $\mathrm{Val}(\mathbf{C}_i)$. We can therefore
compute the expectation, $\sum_{\mathbf{c}_i} \pi_i[\mathbf{c}_i] \ln \pi_i^0[\mathbf{c}_i]$. The last two terms are entropies of the
distributions (the potentials and messages) associated with the clusters and
sepsets in the tree.


Proposition 2.26
If $\mathbf{Q}$ is a set of calibrated potentials for $T$, and $Q$ is defined by (2.4), then
$\tilde{F}[P_F, \mathbf{Q}] = F[P_F, Q]$.
Using this form of the energy, we can now define the optimization problem. We first
need to define the space over which we are optimizing. If $Q$ is factorized according
to $T$, we can represent it by a set of calibrated potentials. Calibration is essentially
a constraint on the potentials, as a clique tree is calibrated if neighboring potentials
agree on the marginal distribution on their joint subset. Thus, we pose the following
constrained optimization procedure:
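CTree-Optimize
Find $\mathbf{Q}$ that maximizes $\tilde{F}[P_{\mathcal{F}}, \mathbf{Q}]$ subject to
\[
\sum_{\mathbf{C}_i \setminus \mathbf{S}_{i,j}} \pi_i = \mu_{i,j}, \quad \forall (\mathbf{C}_i - \mathbf{C}_j) \in \mathcal{T}; \tag{2.6}
\]
\[
\sum_{\mathbf{C}_i} \pi_i = 1, \quad \forall \mathbf{C}_i \in \mathcal{T}. \tag{2.7}
\]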

The constraints (2.6) and (2.7) ensure that the potentials in $\mathbf{Q}$ are calibrated and
represent legal distributions. It can be shown that the objective function is strictly
concave in the variables $\pi$, $\mu$. The constraints define a convex set (linear subspace),
so this optimization problem has a unique maximum. Since $\mathbf{Q}$ can represent $P_{\mathcal{F}}$,
this maximum is attained when $\mathbb{D}(Q \| P_{\mathcal{F}}) = 0$.
2.3.2.2

Fixed-Point Characterization

We can now prove that the stationary points of this constrained optimization
problem (the points at which the gradient is orthogonal to all the constraints)
can be characterized by a set of self-consistent equations.
Recall that a stationary point of a function is either a local maximum, a local
minimum, or a saddle point. In this optimization problem, there is a single global
maximum. Although we do not show it here, this maximum is also the
single stationary point. We can therefore define the global optimum declaratively,
as a set of equations, using standard methods based on Lagrange multipliers.
As we now show, this declarative formulation gives rise to a set of equations
which precisely corresponds to message-passing steps in the clique tree, a standard
inference procedure usually derived via dynamic programming.

2.3

Inference

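Theorem 2.27
A set of potentials $\mathbf{Q}$ is a stationary point of CTree-Optimize if and only if there
exists a set of factors $\{\delta_{i \to j}[\mathbf{S}_{i,j}] : (\mathbf{C}_i - \mathbf{C}_j) \in \mathcal{T}\}$ such that
\[
\delta_{i \to j} \propto \sum_{\mathbf{C}_i - \mathbf{S}_{i,j}} \pi_i^0 \prod_{k \in N_{\mathbf{C}_i} - \{j\}} \delta_{k \to i} \tag{2.8}
\]
\[
\pi_i \propto \pi_i^0 \prod_{j \in N_{\mathbf{C}_i}} \delta_{j \to i} \tag{2.9}
\]
\[
\mu_{i,j} = \delta_{j \to i} \, \delta_{i \to j}, \tag{2.10}
\]
where $N_{\mathbf{C}_i}$ are the neighboring cliques of $\mathbf{C}_i$ in $\mathcal{T}$.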


Theorem 2.27 illustrates themes that appear in many approaches that turn
variational problems into message-passing schemes. It provides a characterization of
the solution of the optimization problem in terms of fixed-point equations that must
hold when we find a maximal $\mathbf{Q}$. These fixed-point equations define the relationships
that must hold between the different parameters involved in the optimization
problem. Most importantly, (2.8) defines each $\delta_{i \to j}$ in terms of the messages $\delta_{k \to i}$ other than
$\delta_{j \to i}$. The other parameters are all defined in a noncyclic way in terms of the
$\delta_{i \to j}$'s.
The form of the equations resulting from the theorem suggests an iterative
procedure for finding a fixed point, in which we view the equations as assignments,
and iteratively apply the equations to the current values of the right-hand side to define
a new value for the left-hand side. We initialize all of the $\delta_{i \to j}$'s to 1, and then
iteratively apply (2.8), computing the left-hand side $\delta_{i \to j}$ of each equality in terms
of the right-hand side (essentially converting each equality sign to an assignment).
Clearly, a single iteration of this process does not usually suffice to make the
equalities hold; however, under certain conditions (which hold in this particular
case) we can guarantee that this process converges to a solution satisfying all of the
equations in (2.8); the other equations are now easy to satisfy.
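The iterative procedure just described can be sketched in a few lines. The following is a minimal illustration, not the chapter's implementation: the three-clique chain, the variable names, and the numeric potentials are all assumptions chosen for the example. It initializes every message to 1, repeatedly applies (2.8) as an assignment, and then forms the clique potentials with (2.9).

```python
import itertools

# Hypothetical chain-shaped clique tree over binary variables:
#   C0 = {A, B}  -  C1 = {B, C}  -  C2 = {C, D}
SCOPES = [("A", "B"), ("B", "C"), ("C", "D")]
EDGES = [(0, 1), (1, 2)]
# Initial potentials pi_i^0: arbitrary positive tables (illustrative values).
PI0 = [
    {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0},
    {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0},
    {(0, 0): 1.0, (0, 1): 5.0, (1, 0): 2.0, (1, 1): 1.0},
]


def neighbors(i):
    return [b if a == i else a for a, b in EDGES if i in (a, b)]


def sepset(i, j):
    return tuple(v for v in SCOPES[i] if v in SCOPES[j])


def project(scope, assignment, sub):
    """Restrict an assignment over `scope` to the variables in `sub`."""
    return tuple(assignment[scope.index(v)] for v in sub)


def normalize(table):
    z = sum(table.values())
    return {key: value / z for key, value in table.items()}


def incoming_product(i, c, delta, skip=None):
    """pi_i^0[c] times all messages into clique i, except the one from `skip`."""
    value = PI0[i][c]
    for k in neighbors(i):
        if k != skip:
            value *= delta[(k, i)][project(SCOPES[i], c, sepset(k, i))]
    return value


def update_message(i, j, delta):
    """Right-hand side of (2.8): sum out the variables in C_i - S_ij."""
    out = {}
    for c in PI0[i]:
        s = project(SCOPES[i], c, sepset(i, j))
        out[s] = out.get(s, 0.0) + incoming_product(i, c, delta, skip=j)
    return normalize(out)


def run_fixed_point(iterations=10):
    directed = [(a, b) for a, b in EDGES] + [(b, a) for a, b in EDGES]
    # All messages start at 1, as in the text.
    delta = {
        (i, j): {s: 1.0 for s in itertools.product((0, 1), repeat=len(sepset(i, j)))}
        for i, j in directed
    }
    for _ in range(iterations):
        delta = {(i, j): update_message(i, j, delta) for i, j in directed}
    return delta


def clique_belief(i, delta):
    """Equation (2.9): pi_i is proportional to pi_i^0 times all incoming messages."""
    return normalize({c: incoming_product(i, c, delta) for c in PI0[i]})


def exact_marginal(i):
    """Brute-force marginal over clique i, for comparison only."""
    names = ("A", "B", "C", "D")
    table = {}
    for x in itertools.product((0, 1), repeat=len(names)):
        weight = 1.0
        for scope, potential in zip(SCOPES, PI0):
            weight *= potential[project(names, x, scope)]
        key = project(names, x, SCOPES[i])
        table[key] = table.get(key, 0.0) + weight
    return normalize(table)


delta = run_fixed_point()
for i in range(len(SCOPES)):
    belief, exact = clique_belief(i, delta), exact_marginal(i)
    print(i, max(abs(belief[c] - exact[c]) for c in belief))
```

Messages are normalized after each update, which is consistent with the proportionality in (2.8). On this small tree the clique potentials should agree with the brute-force marginals up to floating-point error.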
2.3.3

Loopy Belief Propagation in Pairwise Markov Networks

We focus on the class of pairwise Markov networks. In these networks, we have
a univariate potential $\phi_i[X_i]$ over each variable $X_i$, and in addition a pairwise
potential $\phi_{(i,j)}[X_i, X_j]$ over some pairs of variables. These pairwise potentials
correspond to edges in the Markov network. Examples of such networks include
our simple tuberculosis example in figure 2.3 and the grid networks we discussed
above.
The transformation of such a network into a cluster graph is fairly straightforward. For each potential, we introduce a corresponding cluster, and put edges
between the clusters that have overlapping scope. In other words, there is an edge
between the cluster $\mathbf{C}_{(i,j)}$ that corresponds to the edge $X_i - X_j$ and the clusters
$\mathbf{C}_i$ and $\mathbf{C}_j$ that correspond to the univariate factors over $X_i$ and $X_j$.

[Figure 2.5, panel (a) $K_3$: clusters 1: {A, B, C}; 2: {B, C, D}; 3: {B, D, F}; 4: {B, E}; 5: {D, E}; 6: {A}; 7: {B}; 8: {C}; 9: {D}; 10: {E}; 11: {F}. Panel (b) $K_4$: the same clusters plus 12: {B, C}.]

Figure 2.5 Two additional examples of generalized cluster graphs for a Markov
network with potentials over {A, B, C}, {B, C, D}, {B, D, F}, {B, E}, and {D, E}. (a)
Bethe factorization. (b) Capturing interactions between {A, B, C} and {B, C, D}.

As there is a direct correspondence between the clusters in the cluster graphs and
variables or edges in the original Markov network, it is often convenient to think
of the propagation steps as operations on the original network. Moreover, as each
pairwise cluster has only two neighbors, we consider two propagation steps along
the path $\mathbf{C}_i - \mathbf{C}_{(i,j)} - \mathbf{C}_j$ as propagating information between $X_i$ and $X_j$. Indeed,
early versions of generalized belief propagation were stated in these terms. This
algorithm is known as loopy belief propagation, as it uses the propagation steps used by
algorithms for Markov trees, except that it is applied to networks with loops.
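As a concrete illustration of propagating information directly between $X_i$ and $X_j$, here is a minimal sketch of loopy belief propagation on a three-variable loop. The network, its potentials, and the function names are assumptions made for the example. Each message from $X_i$ to $X_j$ collapses the two propagation steps through the pairwise cluster into one.

```python
import itertools

# Illustrative pairwise Markov network on a 3-cycle of binary variables.
NODE_POT = {0: [1.0, 2.0], 1: [1.0, 1.0], 2: [2.0, 1.0]}
EDGE_POT = [[2.0, 1.0], [1.0, 2.0]]   # same symmetric table on every edge
EDGES = [(0, 1), (1, 2), (0, 2)]


def neighbors(i):
    return [b if a == i else a for a, b in EDGES if i in (a, b)]


def normalize(v):
    z = sum(v)
    return [x / z for x in v]


def loopy_bp(iterations=50):
    # msg[(i, j)][xj] is the message from X_i to X_j, initialized to all ones.
    msg = {(i, j): [1.0, 1.0] for a, b in EDGES for i, j in ((a, b), (b, a))}
    for _ in range(iterations):
        new = {}
        for (i, j) in msg:
            out = [0.0, 0.0]
            for xi in (0, 1):
                # Node potential times messages into X_i from everyone but X_j.
                incoming = NODE_POT[i][xi]
                for k in neighbors(i):
                    if k != j:
                        incoming *= msg[(k, i)][xi]
                for xj in (0, 1):
                    out[xj] += incoming * EDGE_POT[xi][xj]
            new[(i, j)] = normalize(out)
        msg = new
    beliefs = {}
    for i in NODE_POT:
        b = list(NODE_POT[i])
        for k in neighbors(i):
            b = [b[x] * msg[(k, i)][x] for x in (0, 1)]
        beliefs[i] = normalize(b)
    return beliefs


def exact_marginals():
    """Brute-force single-variable marginals, for comparison only."""
    marg = {i: [0.0, 0.0] for i in NODE_POT}
    for x in itertools.product((0, 1), repeat=3):
        w = 1.0
        for i in NODE_POT:
            w *= NODE_POT[i][x[i]]
        for a, b in EDGES:
            w *= EDGE_POT[x[a]][x[b]]
        for i in NODE_POT:
            marg[i][x[i]] += w
    return {i: normalize(m) for i, m in marg.items()}


beliefs, exact = loopy_bp(), exact_marginals()
for i in sorted(beliefs):
    print(i, [round(v, 3) for v in beliefs[i]], [round(v, 3) for v in exact[i]])
```

Because the network has a loop, the resulting beliefs are only approximations of the exact marginals. With the weak couplings assumed here the two should be close, but that is a property of this example, not a general guarantee.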
A natural question is how to extend this method to networks that are more
complex than pairwise Markov networks. Once we have larger potentials, they may
overlap in ways that result in complex interactions among them.
One simple construction creates a bipartite graph. The first layer consists of
large clusters, with one cluster $\mathbf{C}_\phi$ for each factor $\phi$ in $\mathcal{F}$, whose scope is $\mathrm{Scope}[\phi]$.
These clusters ensure that we satisfy the family-preservation property. The second
layer consists of small univariate clusters, one for each random variable. Finally,
we place an edge between each univariate cluster $X$ on the second layer and each
cluster in the first layer that includes $X$; the scope of this edge is $X$ itself. For a
concrete example, see figure 2.5(a).
We can easily verify that this is a proper cluster graph. First, by construction it
satisfies the family-preservation property. Second, the edges that mention a variable
$X$ form a star-shaped subgraph with edges from the univariate cluster with scope
$X$ to all the large clusters that contain $X$. We will call this construction the Bethe
approximation (for reasons that will be clarified below). The construction of this
cluster graph is simple and can easily be automated.
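To show how the construction can be automated, here is a minimal sketch. The function name `bethe_cluster_graph` and the representation of clusters as frozensets are assumptions for illustration; the factor scopes are those listed in the caption of figure 2.5.

```python
def bethe_cluster_graph(factor_scopes):
    """Return (clusters, edges) for the bipartite construction.

    clusters: list of frozensets, factor clusters first, then univariate ones.
    edges: list of (factor_index, univariate_index, variable) triples; the
    variable is the scope of the edge.
    """
    variables = sorted(set().union(*factor_scopes))
    large = [frozenset(scope) for scope in factor_scopes]
    small = [frozenset([v]) for v in variables]
    offset = len(large)
    edges = [
        (i, offset + variables.index(v), v)
        for i, scope in enumerate(large)
        for v in sorted(scope)
    ]
    return large + small, edges


# Factor scopes from the caption of figure 2.5.
scopes = [("A", "B", "C"), ("B", "C", "D"), ("B", "D", "F"), ("B", "E"), ("D", "E")]
clusters, edges = bethe_cluster_graph(scopes)
print(len(clusters), len(edges))

# The edges that mention B form a star centered on the univariate cluster {B}.
print([edge for edge in edges if edge[2] == "B"])
```

Every edge joins one factor cluster to one univariate cluster, so all edges mentioning a given variable share the same univariate endpoint, which is the star shape described above.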
So far, our discussion of belief propagation has been entirely procedural, and motivated purely by similarity to message-passing algorithms for cluster trees. Is there
any formal justification for this approach? Is there a sense in which we can view
this algorithm as providing an approximation to the exact inference task? In this
section, we show that belief propagation can be justified using the energy functional
formulation. Specifically, the messages passed by generalized belief propagation can
be derived from fixed-point equations for the stationary points of an approximate
version of the energy functional of (2.3). As we shall see, this formulation provides
significant insight into the generalized belief propagation algorithm. It allows us to
better understand the convergence properties of generalized belief propagation, and
to characterize its convergence points. It also suggests generalizations of the algorithm which have better convergence properties, or that optimize a more accurate
approximation to the energy functional.
Our construction will be similar to the one in section 2.3.2 for exact inference.
However, there are some differences. As we saw, the calibrated cluster graph
maintains the information in $P_{\mathcal{F}}$. However, the resulting cluster potentials are
not, in general, the marginals of $P_{\mathcal{F}}$. In fact, these cluster potentials may not
represent the marginals of any single coherent joint distribution over $\mathcal{X}$. Thus, we
can think of generalized belief propagation as constructing a set of pseudo-marginal
distributions, each one over the variables in one cluster. These pseudo-marginals are
calibrated, and therefore locally consistent with each other, but are not necessarily
marginals of a single underlying joint distribution.
The energy functional $F[P_{\mathcal{F}}, Q]$ has terms involving the entropy of an entire joint
distribution; thus, it cannot be used to evaluate the quality of an approximation
defined in terms of (possibly incoherent) pseudo-marginals. However, the factored
free energy functional $\tilde{F}[P_{\mathcal{F}}, \mathbf{Q}]$ is defined in terms of entropies of clusters and
messages, and is therefore well-defined for pseudo-marginals $\mathbf{Q}$. Thus, we can write
down an optimization problem as before:
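CGraph-Optimize
Find $\mathbf{Q}$ that maximizes $\tilde{F}[P_{\mathcal{F}}, \mathbf{Q}]$ subject to
\[
\sum_{\mathbf{C}_i \setminus \mathbf{S}_{i,j}} \pi_i = \mu_{i,j}, \quad \forall (\mathbf{C}_i - \mathbf{C}_j) \in \mathcal{K}; \tag{2.11}
\]
\[
\sum_{\mathbf{C}_i} \pi_i = 1, \quad \forall \mathbf{C}_i \in \mathcal{K}. \tag{2.12}
\]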

Importantly, however, unlike for clique trees, $\tilde{F}[P_{\mathcal{F}}, \mathbf{Q}]$ is no longer simply a
reformulation of the free energy, but rather an approximation of it. Thus, our
optimization problem contains two approximations: we are using an approximate,
rather than an exact, energy functional; and we are optimizing it over the space of
pseudo-marginals, which is a relaxation (a superspace) of the space of all coherent
probability distributions that factorize over the cluster graph. The approximate
energy functional in this case is a restricted form of an approximation known as
the Kikuchi free energy in statistical physics.
We noted that the energy functional is a lower bound of the log-partition function;
thus, by maximizing it, we get better approximations of $P_{\mathcal{F}}$. Unfortunately, the
factored energy functional, which is only an approximation to the true energy
functional, is not necessarily also a lower bound. Nonetheless, it is still a reasonable
strategy to maximize the approximate energy functional.
Our maximization problem is the natural analogue of CTree-Optimize to the case
of cluster graphs. Not surprisingly, we can derive a similar analogue to theorem 2.27.
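Theorem 2.28
A set of potentials $\mathbf{Q}$ is a stationary point of CGraph-Optimize if and only if for
every edge $(\mathbf{C}_i - \mathbf{C}_j) \in \mathcal{K}$ there are auxiliary potentials $\delta_{i \to j}(\mathbf{S}_{i,j})$ and $\delta_{j \to i}(\mathbf{S}_{j,i})$
so that
\[
\delta_{i \to j} \propto \sum_{\mathbf{C}_i - \mathbf{S}_{i,j}} \pi_i^0 \prod_{k \in N_{\mathbf{C}_i} - \{j\}} \delta_{k \to i} \tag{2.13}
\]
\[
\pi_i \propto \pi_i^0 \prod_{j \in N_{\mathbf{C}_i}} \delta_{j \to i} \tag{2.14}
\]
\[
\mu_{i,j} = \delta_{j \to i} \, \delta_{i \to j}. \tag{2.15}
\]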

This theorem shows that we can characterize convergence points of the energy
function in terms of the original potentials and messages between clusters. We
can, once again, define a procedural variant, in which we initialize $\delta_{i \to j}$, and then
iteratively use (2.13) to redefine each $\delta_{i \to j}$ in terms of the current values of other
$\delta_{k \to i}$. Theorem 2.28 shows that convergence points of this procedure are related to
stationary points of $\tilde{F}[P_{\mathcal{F}}, \mathbf{Q}]$.
It is relatively easy to verify that $\tilde{F}[P_{\mathcal{F}}, \mathbf{Q}]$ is bounded from above, and thus
this function must have a maximum. There are two cases: the maximum is either
an interior point or a boundary point (some of the probabilities in $\mathbf{Q}$ are 0). In the
former case the maximum is also a stationary point, which implies that it satisfies
the condition of theorem 2.28. In the latter case, the maximum is not necessarily
a stationary point. This situation, however, is very rare in practice, and can be
guaranteed not to arise if we make some fairly benign assumptions.
It is important to understand what these results imply, and what they do not.
The results imply only that the convergence points of generalized belief propagation
are stationary points of the free energy function. They do not imply that we can
reach these convergence points by applying belief propagation steps. In fact, there
is no guarantee that the message-passing steps of generalized belief propagation
necessarily improve the free energy objective: a message-passing step may increase
or decrease the energy functional. (In fact, if generalized belief propagation were
guaranteed to monotonically improve the functional, then it would necessarily
always converge.)
What are the implications of this result? First, it provides us with a declarative
semantics for generalized belief propagation in terms of optimization of a target
functional. This declarative semantics opens the way to investigate other computational approaches for optimizing the same functional. We discuss some of these
approaches below.
This result also allows us to understand what properties are important for this
type of approximation, and subsequently to design other approximations that may
be more accurate, or better in some other way. As a concrete example, recall that,
in our discussion of generalized cluster graphs, we required the running intersection

property. This property has two important implications. First, the set of
clusters that contain some variable $X$ is connected; hence, the marginal over $X$
will be the same in all of these clusters at the calibration point. Second, there
is no cycle of clusters and sepsets all of which contain $X$. We can motivate this
assumption intuitively, by noting that it prevents us from allowing information
about $X$ to cycle endlessly through a loop. The free energy function analysis
provides a more formal justification. To understand it, consider first the form of the
factored free energy functional when our cluster graph $\mathcal{K}$ has the form of the Bethe
approximation. Recall that in the Bethe approximation graph there are two layers:
one consisting of clusters that correspond to factors in $\mathcal{F}$, and the other consisting
of univariate clusters. When the cluster graph is calibrated, these univariate clusters
have the same distribution as the separators between them and the factors in the
first layer. As such, we can combine together the entropy terms for all the separators
labeled by $X$ and the associated univariate cluster and rewrite the free energy, as
follows:
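Proposition 2.29
If $\mathbf{Q} = \{\pi_\phi : \phi \in \mathcal{F}\} \cup \{\pi_i(X_i)\}$ is a calibrated set of potentials for $\mathcal{K}$ for a Bethe
approximation cluster graph with clusters $\{\mathbf{C}_\phi : \phi \in \mathcal{F}\} \cup \{X_i : X_i \in \mathcal{X}\}$, then
\[
\tilde{F}[P_{\mathcal{F}}, \mathbf{Q}] = \sum_{\phi \in \mathcal{F}} \mathbb{E}_{\pi_\phi}[\ln \phi] + \sum_{\phi \in \mathcal{F}} \mathbb{H}_{\pi_\phi}(\mathbf{C}_\phi) - \sum_{i} (d_i - 1)\, \mathbb{H}_{\pi_i}(X_i), \tag{2.16}
\]
where $d_i = |\{\phi : X_i \in \mathrm{Scope}[\phi]\}|$ is the number of factors that contain $X_i$.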


Note that (2.16) is equivalent to the factored free energy only when $\mathbf{Q}$ is calibrated.
However, as we are interested only in such cases, we can freely alternate between
the two forms for the purpose of finding fixed points of the factored free energy
functional. Equation (2.16) is known as the Bethe free energy, and again has a
history in statistical mechanics. The Bethe approximation we discussed above is a
construction in terms of cluster graphs that is designed to match the Bethe free
energy.
As we can see in this alternative form, if the variable $X_i$ appears in $d_i$ clusters in
the cluster graph, then it appears in an entropy term with a positive sign exactly
$d_i$ times. Due to the running intersection property, the number of separators that
contain $X_i$ is $d_i - 1$ (the number of edges in a tree with $k$ vertices is $k - 1$), so that
$X_i$ appears in an entropy term with a negative sign exactly $d_i - 1$ times. In this
case, we say that the counting number of $X_i$ is 1. Thus, our approximation does not
over- or undercount the entropy of $X_i$. It is not difficult to show that the counting
number result holds for any approximation that satisfies the running intersection
property. Thus, one motivation for the running intersection property is that cluster
graphs satisfying it provide a better approximation to the free energy functional.
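The counting argument is easy to check mechanically. The sketch below uses a hypothetical helper, `counting_numbers`, and two made-up cluster graphs; it subtracts the number of sepsets containing each variable from the number of clusters containing it.

```python
def counting_numbers(clusters, sepsets):
    """Clusters containing X minus sepsets containing X, for every variable X."""
    variables = set().union(*clusters)
    return {
        v: sum(v in c for c in clusters) - sum(v in s for s in sepsets)
        for v in variables
    }


# A chain of clusters satisfying the running intersection property.
tree_clusters = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
tree_sepsets = [{"B"}, {"C"}]
print(counting_numbers(tree_clusters, tree_sepsets))

# Three clusters joined in a cycle whose sepsets all contain B:
# this violates the running intersection property.
loop_clusters = [{"A", "B"}, {"B", "C"}, {"B", "D"}]
loop_sepsets = [{"B"}, {"B"}, {"B"}]
print(counting_numbers(loop_clusters, loop_sepsets))
```

In the first graph every variable has counting number 1. In the second, B appears in three clusters and three sepsets, so its counting number drops to 0 and its entropy is undercounted.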
This intuition forms the basis for improved approximations. Specifically, we
can construct energy functionals (called Kikuchi free energy approximations) that
resemble (2.5), in which we introduce additional entropy terms, with both positive
and negative signs, in a way that ensures that the counting number for all variables
is 1. Somewhat remarkably, the same analysis we performed in this section
(defining a set of fixed-point equations for stationary points of the approximate free
energy) also leads to message-passing algorithms for these richer approximations.
The propagation rules for these approximations, which also fall under the heading
of generalized belief propagation, are more elaborate, and we do not discuss them
here.
2.3.4

Sampling-Based Approximate Inference

As we discussed above, another approach to dealing with the worst-case combinatorial explosion of exact inference in graphical models is via sampling-based methods.
In these methods, we approximate the joint distribution as a set of instantiations
to all or some of the variables in the network. These instantiations, often called
samples, represent part of the probability mass.
The general framework for most of the discussion is as follows. Consider some
distribution $P(\mathcal{X})$, and assume we want to estimate the probability of some event
$\mathbf{Y} = y$ relative to $P$, for some $\mathbf{Y} \subseteq \mathcal{X}$ and $y \in \mathrm{Val}(\mathbf{Y})$. More generally, we might
want to estimate the expectation of some function $f(\mathcal{X})$ relative to $P$; this task
is a generalization, as we can choose $f(\xi) = \mathbf{1}\{\xi\langle \mathbf{Y} \rangle = y\}$. We approximate this
expectation by generating a set of $M$ samples, estimating the value of the function
or its expectation relative to each of the generated samples, and then aggregating
the results.
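A minimal sketch of this framework follows, assuming we can sample directly from a small, explicitly tabulated joint distribution. The table `JOINT`, the function names, and the sample size are illustrative choices. The event probability is estimated as the average of an indicator function over the samples.

```python
import random

# Illustrative joint distribution P over two binary variables (X1, X2).
JOINT = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}


def estimate_expectation(f, num_samples, rng):
    """Average f over samples drawn independently from JOINT."""
    states = list(JOINT)
    weights = [JOINT[s] for s in states]
    samples = rng.choices(states, weights=weights, k=num_samples)
    return sum(f(x) for x in samples) / num_samples


def indicator_x1_is_1(x):
    """f(xi) = 1{X1 = 1}: its expectation is P(X1 = 1)."""
    return 1.0 if x[0] == 1 else 0.0


rng = random.Random(0)
estimate = estimate_expectation(indicator_x1_is_1, 20000, rng)
print(estimate)  # should be close to P(X1 = 1) = 0.2 + 0.4 = 0.6
```

In realistic models direct sampling of this kind is exactly what is unavailable, which is what motivates the Markov chain methods that follow.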
2.3.4.1

Markov Chain Monte Carlo Methods

Markov chain Monte Carlo (abbreviated MCMC) is an approach for generating
samples from the posterior distribution. As we discussed, we cannot typically sample
from the posterior directly; however, we can construct a process which gradually
samples from distributions that are closer and closer to the posterior. Intuitively,
we define a state graph whose nodes are the states of the system, i.e., possible
instantiations $\mathrm{Val}(\mathcal{X})$. (This graph is very different from the graphical model that
defines the distribution $P(\mathcal{X})$, whose nodes correspond to variables.) We then define
a process that randomly traverses this graph, moving from one state to another.
This process is defined so that, ultimately (after enough steps), the probability of
being in any particular state is given by the desired posterior distribution.
We begin by describing the general framework of Markov chains, and then
describe their application to approximate inference in graphical models. We note
that, unlike forward sampling methods (including likelihood weighting), Markov
chain methods apply equally well to directed and to undirected models.
A Markov chain is defined in terms of a set of states, and a transition model
from one state to another. The chain defines a process that evolves stochastically
from state to state.
Definition 2.30
A Markov chain is defined via a state space $\mathrm{Val}(X)$ and a transition probability
model, which defines, for every state $x \in \mathrm{Val}(X)$, a next-state distribution over
$\mathrm{Val}(X)$. The transition probability of going from $x$ to $x'$ is denoted $T(x \to x')$.
This transition probability applies whenever the chain is in state $x$.
We note that, in this definition and in the subsequent discussion, we restrict
attention to homogeneous Markov chains, where the system dynamics do not change
over time.
We can imagine a random sampling process that defines a sequence of states
$x^{(0)}, x^{(1)}, x^{(2)}, \ldots$. As the transition model is random, the state of the process at
step $t$ can be viewed as a random variable $X^{(t)}$. We assume that the initial state
$X^{(0)}$ is distributed according to some initial state distribution $P^{(0)}(X^{(0)})$. We can
now define distributions over the subsequent states $P^{(1)}(X^{(1)}), P^{(2)}(X^{(2)}), \ldots$ using
the chain dynamics:
\[
P^{(t+1)}(X^{(t+1)} = x') = \sum_{x \in \mathrm{Val}(X)} P^{(t)}(X^{(t)} = x)\, T(x \to x'). \tag{2.17}
\]

Intuitively, the probability of being at state $x'$ at time $t + 1$ is the sum over all
possible states $x$ that the chain could have been in at time $t$ of the probability of
being in state $x$ times the probability that the chain took a transition from $x$ to
$x'$.
As the process converges, we would expect $P^{(t+1)}$ to be close to $P^{(t)}$. Using (2.17),
we obtain
\[
P^{(t)}(x') \approx P^{(t+1)}(x') = \sum_{x \in \mathrm{Val}(X)} P^{(t)}(x)\, T(x \to x').
\]

At convergence, we would expect the resulting distribution $\pi(X)$ to be an equilibrium relative to the transition model; i.e., the probability of being in a state
is the same as the probability of transitioning into it from a randomly sampled
predecessor. Formally:
Definition 2.31
A distribution $\pi(X)$ is a stationary distribution for a Markov chain $T$ if it satisfies
\[
\pi(X = x') = \sum_{x \in \mathrm{Val}(X)} \pi(X = x)\, T(x \to x'). \tag{2.18}
\]
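The chain dynamics (2.17) and the stationarity condition (2.18) can be checked numerically on a small example. The three-state transition model below is an assumption made for illustration, as are the function names `step` and `is_stationary`.

```python
# Illustrative 3-state transition model; T[x][y] = T(x -> y), rows sum to 1.
T = [
    [0.25, 0.00, 0.75],
    [0.00, 0.70, 0.30],
    [0.50, 0.50, 0.00],
]


def step(p, T):
    """One application of (2.17): p'(y) = sum_x p(x) T(x -> y)."""
    n = len(T)
    return [sum(p[x] * T[x][y] for x in range(n)) for y in range(n)]


def is_stationary(p, T, tol=1e-9):
    """Check (2.18): p is unchanged by one step of the chain."""
    return all(abs(a - b) < tol for a, b in zip(p, step(p, T)))


p = [1.0, 0.0, 0.0]   # initial distribution P^(0): start in state 0
for _ in range(100):
    p = step(p, T)
print([round(v, 4) for v in p], is_stationary(p, T))
```

For this particular chain, the distribution (0.2, 0.5, 0.3) satisfies (2.18), and repeated application of `step` approaches it from the chosen starting point.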

We wish to restrict attention to Markov chains that have a unique stationary
distribution, which is reached from any starting distribution $P^{(0)}$. There are various
conditions that suffice to guarantee this property. The condition most commonly
used is a fairly technical one, that the chain be ergodic. In the context of Markov
chains where the state space $\mathrm{Val}(X)$ is finite, the following condition is equivalent
to this requirement:
Definition 2.32
A Markov chain is said to be regular if there exists some number $k$ such that, for
every $x, x' \in \mathrm{Val}(X)$, the probability of getting from $x$ to $x'$ in exactly $k$ steps is
greater than 0.
The following result can be shown to hold:
Theorem 2.33
If a finite-state Markov chain $T$ is regular, then it has a unique stationary distribution,
which is reached from any starting distribution.
Ensuring regularity is usually straightforward. Two simple conditions that guarantee regularity in finite-state Markov chains are:
• It is possible to get from any state to any state using a positive probability path
in the state graph.
• For each state $x$, there is a positive probability of transitioning from $x$ to $x$ in
one step (a self-loop).
These two conditions together are sufficient but not necessary to guarantee regularity. However, they often hold in the chains used in practice.
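For small chains, regularity can also be tested directly from the definition. The sketch below is illustrative; the function name `is_regular` and the example chains are assumptions. It tracks only which entries of $T^k$ are positive, and it stops after $(n-1)^2 + 1$ powers, which is Wielandt's bound for an $n$-state chain.

```python
def is_regular(T):
    """True if some power T^k has all entries strictly positive.

    Only the zero pattern matters, so reachability is tracked with booleans.
    For n states it suffices to try k up to (n - 1)**2 + 1 (Wielandt's bound).
    """
    n = len(T)
    one_step = [[T[x][y] > 0 for y in range(n)] for x in range(n)]
    power = one_step
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        power = [
            [any(power[x][z] and one_step[z][y] for z in range(n)) for y in range(n)]
            for x in range(n)
        ]
    return all(all(row) for row in power)


# A chain in which state 2 has no self-loop, yet every state reaches every
# other state in exactly two steps.
chain = [
    [0.25, 0.00, 0.75],
    [0.00, 0.70, 0.30],
    [0.50, 0.50, 0.00],
]
# A chain that alternates deterministically between its two states.
periodic = [[0.0, 1.0], [1.0, 0.0]]

print(is_regular(chain), is_regular(periodic))
```

The first chain shows that the two conditions above are not necessary: it lacks a self-loop at one state but is still regular. The second chain is connected but periodic, so no single $k$ works for all pairs of states.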
2.3.4.2

Markov Chains for Graphical Models

The theory of Markov chains provides a general framework for generating samples
from a target distribution $\pi$. In this section, we discuss the application of this
framework to the sampling tasks encountered in probabilistic graphical models. In
this case, we typically wish to generate samples from the posterior distribution
$P(\mathcal{X} \mid \mathbf{E} = e)$. Thus, we wish to define a chain for which $P(\mathcal{X} \mid e)$ is the stationary
distribution. Clearly, there are many ways of defining such a chain. We focus on
the most common approaches.
In graphical models, we define the states of the Markov chain to be instantiations
$\xi$ to $\mathcal{X}$, which are compatible with $e$; i.e., all of the states $\xi$ in the Markov chain
satisfy $\xi\langle \mathbf{E} \rangle = e$. The states in our Markov chain are therefore some subset of
the possible assignments to the variables $\mathcal{X}$. In order to define a Markov chain, we
need to define a process that transitions from one state to the other, converging to
a stationary distribution $\pi(\xi)$ which is the desired posterior distribution $P(\xi \mid e)$.
In the case of graphical models, our state space has a factorized structure:
each state is an assignment to several variables. When defining a transition model
over this state space, we can consider a fully general case, where a transition can
go from any state to any state. However, it is often convenient to decompose the
transition model, considering transitions that only update a single component of
the state vector at a time, i.e., only a value for a single variable. In this case,
as in several other settings, we often define a set of transition models $T_1, \ldots, T_k$,
each with its own dynamics. In certain cases, the different transition models are
necessary, because no single transition model on its own suffices to ensure regularity.
In other cases, having multiple transition models simply makes the state space more
connected, and therefore speeds the convergence to a stationary distribution.
There are several ways of combining these multiple transition models into a single
chain. One common approach is simply to randomly select between them at each
step, using any distribution. Thus, for example, at each step, we might select one
of $T_1, \ldots, T_k$, each with probability $1/k$. Alternatively, we can simply cycle over the
different transition models, taking each one in turn. Clearly, this approach does not
define a homogeneous chain, as the transition model used in step $i$ is different from
the one used in step $i + 1$. However, we can simply view the process as defining a
single transition model $T$ each of whose steps is an aggregate step, consisting of
first taking $T_1$, then $T_2$, \ldots, through $T_k$.
In the case of graphical models, we define $\mathbf{X} = \mathcal{X} - \mathbf{E} = \{X_1, \ldots, X_k\}$. We
define a multiple transition chain, where we have a local transition model $T_i$ for
each variable $X_i \in \mathbf{X}$. Let $\mathbf{U}_i = \mathbf{X} - \{X_i\}$, and let $u_i$ denote an instantiation
to $\mathbf{U}_i$. The model $T_i$ takes a state $(u_i, x_i)$ and transitions to a state of the form
$(u_i, x_i')$. As we discussed above, we can combine the different local transition models
into a single global model in various ways.
2.3.4.3

Gibbs Sampling

Gibbs sampling is one simple yet effective Markov chain for factored state spaces,
which is particularly efficient for graphical models. We define the local transition
model $T_i$ as follows. Intuitively, we simply forget the value of $X_i$ in the current
state, and sample a new value for $X_i$ from its posterior given the rest of the current
state. More precisely, let $(u_i, x_i)$ be a state in the chain. We define
\[
T((u_i, x_i) \to (u_i, x_i')) = P(x_i' \mid u_i). \tag{2.19}
\]
Note that the transition probability does not depend on the current value $x_i$ of $X_i$,
but only on the remaining state $u_i$.
The Gibbs chain is defined via a set of local transition models; we use the
multistep transition model to combine them. Note that the different local transitions
are taken consecutively; i.e., having changed the value for a variable $X_1$, the value
for $X_2$ is sampled based on the new value. Also note that we are only collecting a
single sample for every sequence where each local transition has been taken once.
This chain is guaranteed to be regular whenever the distribution is positive,
so that every value of $X_i$ has positive probability given an assignment $u_i$ to the
remaining variables. In this case, we can get from any state to any state in at most
$k$ local transition steps, where $k = |\mathcal{X} - \mathbf{E}|$. Positivity is, however, not necessary;
there are many examples of nonpositive distributions where the Gibbs chain is
regular. It is also easy to show that the posterior distribution $P(\mathcal{X} \mid e)$ is a
stationary distribution of this process.
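The stationarity claim can be verified numerically on a tiny example. The sketch below assumes a two-variable joint distribution with no evidence, so the posterior is the joint itself; the table `P` and all function names are illustrative. It builds the two local Gibbs kernels of (2.19) and checks that `P` is unchanged by each kernel, by their cyclic composition, and by an equal-weight random choice between them.

```python
import itertools

# Illustrative positive joint distribution over two binary variables (X1, X2).
P = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
STATES = list(itertools.product((0, 1), repeat=2))


def gibbs_kernel(i):
    """Local transition T_i: resample variable i from P(X_i | rest), as in (2.19)."""
    T = {}
    for x in STATES:
        # Total mass of states that agree with x on every variable except i.
        rest_mass = sum(P[z] for z in STATES
                        if all(z[k] == x[k] for k in range(2) if k != i))
        for y in STATES:
            if any(x[k] != y[k] for k in range(2) if k != i):
                T[(x, y)] = 0.0   # T_i never changes the other variables
            else:
                T[(x, y)] = P[y] / rest_mass
    return T


def apply_kernel(dist, T):
    """One step of (2.17) under kernel T."""
    return {y: sum(dist[x] * T[(x, y)] for x in STATES) for y in STATES}


def compose(T1, T2):
    """Aggregate step: first T1, then T2."""
    return {(x, y): sum(T1[(x, z)] * T2[(z, y)] for z in STATES)
            for x in STATES for y in STATES}


def mixture(T1, T2, w=0.5):
    """Pick T1 with probability w, otherwise T2."""
    return {key: w * T1[key] + (1 - w) * T2[key] for key in T1}


T1, T2 = gibbs_kernel(0), gibbs_kernel(1)
for name, T in [("T1", T1), ("T2", T2),
                ("cycle", compose(T1, T2)), ("mixture", mixture(T1, T2))]:
    after = apply_kernel(P, T)
    print(name, max(abs(after[s] - P[s]) for s in STATES))
```

All four printed deviations should be zero up to floating-point error. With evidence, the same check would apply to the conditional distribution over the unobserved variables.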
Gibbs sampling is particularly well suited to many graphical models, where we
can compute the transition probability $P(X_i \mid u_i)$ very efficiently. In particular, as
we now show, this distribution can be computed based only on the Markov blanket of $X_i$.
We show this analysis for a Markov network; the extension to Bayesian networks is
straightforward. In general, we can decompose the probability of an instantiation
as follows:
\[
P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{j} \phi_j[\mathbf{C}_j] = \frac{1}{Z} \prod_{j : X_i \in \mathbf{C}_j} \phi_j[\mathbf{C}_j] \prod_{j : X_i \notin \mathbf{C}_j} \phi_j[\mathbf{C}_j].
\]
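For shorthand, let $\phi_j[x_i, u_i]$ denote $\phi_j[x_i, u_i\langle \mathbf{C}_j \rangle]$. We can now compute
\[
P(x_i' \mid u_i) = \frac{P(x_i', u_i)}{\sum_{x_i''} P(x_i'', u_i)} = \frac{\prod_{\mathbf{C}_j \ni X_i} \phi_j[x_i', u_i]}{\sum_{x_i''} \prod_{\mathbf{C}_j \ni X_i} \phi_j[x_i'', u_i]}. \tag{2.20}
\]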

This last expression uses only the clique potentials involving $X_i$, and depends only
on the instantiation in $u_i$ of $X_i$'s Markov blanket. In the case of Bayesian networks,
this expression reduces to a formula involving only the CPDs of $X_i$ and its children,
and its value, again, depends only on the assignment in $u_i$ to the Markov blanket
of $X_i$. It can thus be computed very efficiently.
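The following sketch puts (2.19) and (2.20) together for a small pairwise Markov network. The network, its potentials, the burn-in length, and the function names are assumptions made for illustration. The function `conditional` touches only the potentials that mention $X_i$, so it reads only the values of $X_i$'s neighbors.

```python
import itertools
import random

# Illustrative pairwise Markov network on a 3-cycle of binary variables.
NODE_POT = {0: [1.0, 2.0], 1: [1.0, 1.0], 2: [2.0, 1.0]}
EDGE_POT = [[2.0, 1.0], [1.0, 2.0]]   # same symmetric table on every edge
EDGES = [(0, 1), (1, 2), (0, 2)]


def neighbors(i):
    return [b if a == i else a for a, b in EDGES if i in (a, b)]


def conditional(i, state):
    """P(X_i | u_i) via (2.20): only potentials that mention X_i are used."""
    weights = []
    for xi in (0, 1):
        w = NODE_POT[i][xi]
        for k in neighbors(i):
            w *= EDGE_POT[xi][state[k]]
        weights.append(w)
    z = sum(weights)
    return [w / z for w in weights]


def gibbs_marginals(num_sweeps, burn_in, rng):
    """Estimate P(X_i = 1) for each i from one Gibbs trajectory."""
    state = [0, 0, 0]
    counts = [0, 0, 0]
    for sweep in range(burn_in + num_sweeps):
        for i in range(3):              # one aggregate step: T_1, then T_2, then T_3
            p1 = conditional(i, state)[1]
            state[i] = 1 if rng.random() < p1 else 0
        if sweep >= burn_in:            # one sample per full sweep
            for i in range(3):
                counts[i] += state[i]
    return [c / num_sweeps for c in counts]


def exact_marginals():
    """Brute-force P(X_i = 1), for comparison only."""
    total, ones = 0.0, [0.0, 0.0, 0.0]
    for x in itertools.product((0, 1), repeat=3):
        w = 1.0
        for i in range(3):
            w *= NODE_POT[i][x[i]]
        for a, b in EDGES:
            w *= EDGE_POT[x[a]][x[b]]
        total += w
        for i in range(3):
            ones[i] += w * x[i]
    return [v / total for v in ones]


rng = random.Random(0)
print(gibbs_marginals(20000, 500, rng))
print(exact_marginals())
```

All potentials here are strictly positive, so the chain is regular. The sampled frequencies should be close to the brute-force marginals, up to sampling noise.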
We note that the Markov chain defined by a graphical model is not necessarily
regular, and might not converge to a unique stationary distribution. It turns out
that this type of situation can only arise if the distribution defined by the graphical
model is nonpositive, i.e., if the CPDs or clique potentials have entries with the
value 0.
Theorem 2.34
Let H be a Markov network such that all of the clique potentials are strictly positive.
Then the Gibbs-sampling Markov chain is regular.
2.3.4.4

Building a Markov Chain

As we discussed, the use of MCMC methods relies on the construction of a
Markov chain that has the desired properties: regularity, and the target stationary
distribution. Above, we described the Gibbs chain, a simple Markov chain that is
guaranteed to have these properties under certain assumptions. However, Gibbs
sampling is only applicable in certain circumstances; in particular, we must be able
to sample from the distribution $P(X_i \mid u_i)$. Although this sampling step is easy for
discrete graphical models, there are other types of models where this step is not
practical, and the Gibbs chain is not applicable. Unfortunately, it is beyond the
scope of this chapter to discuss the Metropolis-Hastings algorithm, a more general
method of constructing a Markov chain that is guaranteed to converge to the desired
stationary distribution.
2.3.4.5

Generating Samples

The burn-in time for a large Markov chain is often quite large. Thus, the naive
algorithm described above has to execute a large number of sampling steps for

every usable sample. However, a key observation is that, if $x^{(t)}$ is sampled from $\pi$,
then $x^{(t+1)}$ is also sampled from $\pi$. Thus, once we have run the chain long enough
that we are sampling from the stationary distribution (or a distribution close to it),
we can continue generating samples from the same trajectory, and obtain a large
number of samples from the stationary distribution.
More formally, assume that we use $x^{(0)}, \ldots, x^{(T)}$ as our burn-in phase, and then
collect $M$ samples $x^{(T+1)}, \ldots, x^{(T+M)}$. Thus, we have collected a data set $\mathcal{D}$ where
$x_m = x^{(T+m)}$, for $m = 1, \ldots, M$. Assume, for simplicity, that $x^{(T+1)}$ is sampled
from $\pi$, and hence so are all of the samples in $\mathcal{D}$. It follows that for any function
$f$, $\frac{1}{M} \sum_{m=1}^{M} f(x_m)$ is an unbiased estimator for $\mathbb{E}_{\pi(X)}[f(X)]$.
The key problem, of course, is that consecutive samples from the same trajectory
are correlated. Thus, we cannot expect the same performance as we would from
$M$ independent samples from $\pi$. In other words, the variance of the estimator is
significantly higher than that of an estimator generated by $M$ independent samples
from $\pi$, as discussed above.
One solution to this problem is not to collect consecutive samples from the chain.
Rather, having collected a sample $x^{(T)}$, we let the chain run for a while, and collect
a second sample $x^{(T+d)}$ for some appropriate choice of $d$. For $d$ large enough, $x^{(T)}$
and $x^{(T+d)}$ are only slightly correlated, and we can view them as independent
samples from $\pi$. However, the time $d$ required for forgetting the correlation is
clearly related to the mixing time of the chain. Thus, chains that are slow to mix
initially also require larger $d$ in order to produce close-to-independent samples.
Nevertheless, the samples do come from the correct distribution for any value of $d$,
and hence it is often better to compromise and use a shorter $d$ than it is to use a
shorter burn-in time $T$. This method thus allows us to collect a larger number of
usable samples with fewer transitions of the Markov chain.
In fact, we can often make even better use of the samples generated using this
single-chain approach. Although the samples between $x^{(T)}$ and $x^{(T+d)}$ are not
independent samples, there is no reason to discard them. That is, using all of the
samples $x^{(T)}, x^{(T+1)}, \ldots, x^{(T+d)}$ produces a provably better estimator than using
just the two samples $x^{(T)}$ and $x^{(T+d)}$: our variance is always no higher if we use all
of the samples we generated rather than a subset. Thus, the strategy of picking only
a subset of the samples is useful primarily in settings where there is a significant
cost associated with using each sample (e.g., the evaluation of $f$ is costly), so that
we might want to reduce the overall number of samples used.
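The two strategies can be compared on a toy chain. The sketch below is illustrative only: the two-state transition model, the burn-in length, the number of samples $M$, and the spacing $d$ are all assumed values. It discards a burn-in prefix, then estimates the stationary probability of state 1 once from all retained samples and once from every $d$-th sample.

```python
import random

# Illustrative two-state chain; its stationary distribution is (5/6, 1/6).
T = [[0.9, 0.1], [0.5, 0.5]]


def run_chain(num_steps, rng, start=0):
    """Return the trajectory x^(1), ..., x^(num_steps)."""
    x, trajectory = start, []
    for _ in range(num_steps):
        x = 0 if rng.random() < T[x][0] else 1
        trajectory.append(x)
    return trajectory


def estimate(samples, f=float):
    """(1/M) * sum over m of f(x_m)."""
    return sum(f(x) for x in samples) / len(samples)


rng = random.Random(0)
burn_in, M, d = 1000, 50000, 10
trajectory = run_chain(burn_in + M, rng)
kept = trajectory[burn_in:]   # all M post-burn-in samples
thinned = kept[::d]           # every d-th sample only
print(estimate(kept), estimate(thinned))   # both should be near pi(1) = 1/6
```

Both estimates target the same quantity. The thinned one uses a tenth of the samples, which only pays off when evaluating `f` on each sample is expensive.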
2.3.4.6

Discussion

This description of the use of Markov chains is quite abstract: It contains no
specification of the number of chains to run, the metrics for evaluating mixing,
techniques for determining the delay between samples that would allow them
to be considered independent, and more. Unfortunately, at this point, there is
little theoretical analysis that can help answer these questions for the chains that
are of interest to us. Thus, the application of Markov chains is more of an art
42

Graphical Models in a Nutshell

than a science, and often requires signicant experimentation and hand-tuning of


parameters.
Nevertheless, MCMC methods are, for many probabilistic models, the only
technique that can achieve reasonable performance. Specifically, unlike forward
sampling methods, MCMC does not degrade when the probability of the evidence is low,
or when the posterior is very different from the prior. Furthermore, unlike forward
sampling, it applies to undirected models as well as to directed models. As such, it
is an important component in the suite of approximate inference techniques.

2.4

Learning
Next, we turn our attention to learning graphical models [4, 6]. There are two
variants of the learning task: parameter estimation and structure learning. In the
parameter estimation task, we assume that the qualitative dependency structure
of the graphical model is known; i.e., in the directed model case, G is given, and
in the undirected case, H is given. In this case, the learning task is simply to fill
in the parameters that define the CPDs of the attributes or the parameters which
define the potential functions of the Markov network. In the structure learning task,
there is no additional required input (although the user can, if available, provide
prior knowledge about the structure, e.g., in the form of constraints). The goal is
to extract a Bayesian network or Markov network, structure as well as parameters,
from the training data alone. We discuss each of these problems in turn.
2.4.1

Parameter Estimation in Bayesian Networks

We begin with learning the parameters for a Bayesian network where the dependency structure is known. In other words, we are given the structure G that determines the set of parents for each random variable, and our task is to learn the
parameters $\theta_G$ that define the CPDs for this structure. Our learning is based on a
particular training set $D = \{x^1, \ldots, x^m\}$, which, for now, we will assume is complete
(i.e., each instance is fully observed, there are no missing values). While this task is
relatively straightforward, it is of interest in and of itself. In addition, it is a crucial
component in the structure learning algorithm described in section 2.4.3.
There are two approaches to parameter estimation: maximum likelihood estimation (MLE) and Bayesian approaches. The key ingredient for both is the likelihood
function: the probability of the data given the model. This function captures the
response of the probability distribution to changes in the choice of parameters.
For a Bayesian network structure G, the likelihood of a parameter set $\theta_G$ is
\[ L(\theta_G : D) = P(D \mid \theta_G). \]

2.4.1.1

Maximum Likelihood Parameter Estimation

Given the above, one approach to parameter estimation is maximum likelihood
parameter estimation. Here, our goal is to find the parameter setting $\theta_G$ that
maximizes the likelihood $L(\theta_G : D)$. For Bayesian networks, the likelihood can
be decomposed as follows:
\begin{align*}
L(\theta_G : D) &= \prod_{j=1}^{m} P(x^j : \theta_G) \\
&= \prod_{j=1}^{m} \prod_{i=1}^{n} P(x_i^j \mid \mathrm{pa}_i^j : \theta_G) \\
&= \prod_{i=1}^{n} \prod_{j=1}^{m} P(x_i^j \mid \mathrm{pa}_i^j : \theta_G).
\end{align*}

We will use $\theta_{X_i|\mathrm{Pa}_i}$ to denote the subset of parameters that determine $P(X_i \mid \mathrm{Pa}_i)$.
In the case where the parameters are disjoint (each CPD is parameterized by a
separate set of parameters that do not overlap), we can maximize each
parameter set independently. We can write the likelihood as follows:
\[ L(\theta_G : D) = \prod_{i=1}^{n} L_i(\theta_{X_i|\mathrm{Pa}_i} : D), \]
where the local likelihood function for $X_i$ is
\[ L_i(\theta_{X_i|\mathrm{Pa}_i} : D) = \prod_{j=1}^{m} P(x_i^j \mid \mathrm{pa}_i^j : \theta_{X_i|\mathrm{Pa}_i}). \]

The simplest parameterization for the CPDs is as a table. Suppose we have a
variable X with parents U. If we represent that CPD P(X | U) as a table, then we
will have a parameter $\theta_{x|u}$ for each combination of $x \in \mathrm{Val}(X)$ and $u \in \mathrm{Val}(U)$.
In this case, we can write the local likelihood function as follows:
\[ L_X(\theta_{X|U} : D) = \prod_{j=1}^{m} \theta_{x^j|u^j} = \prod_{u \in \mathrm{Val}(U)} \left[ \prod_{x \in \mathrm{Val}(X)} \theta_{x|u}^{N_{u,x}} \right], \tag{2.21} \]
where $N_{u,x}$ is the number of times X = x and U = u in D. That is, we have
grouped together all the occurrences of $\theta_{x|u}$ in the product over all instances.
We need to maximize this term under the constraints that, for each choice of
value for the parents U, the conditional probability is legal:
\[ \sum_{x \in \mathrm{Val}(X)} \theta_{x|u} = 1 \quad \text{for all } u. \]

These constraints imply that the choice of value for $\theta_{x|u}$ can impact the choice of
value for $\theta_{x'|u}$. However, the choices of parameters given different values u of U
are independent of each other. Thus, we can maximize each of the terms in square
brackets in (2.21) independently.
We can thus further decompose the local likelihood function for a tabular CPD
into a product of simple likelihood functions. It is easy to see that each of these
likelihood functions is a multinomial likelihood. The counts in the data for the
different outcomes x are simply $\{N_{u,x} : x \in \mathrm{Val}(X)\}$. We can then immediately
use the MLE parameters for a multinomial, which are simply
\[ \hat{\theta}_{x|u} = \frac{N_{u,x}}{N_u}, \]
where we use the fact that $N_u = \sum_x N_{u,x}$.
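The counting estimate above is easy to state in code. The sketch below is illustrative only: it assumes the data for one family is given as a list of hypothetical (u, x) pairs, where u is the parent assignment and x the child value, and it returns the ratio of the joint count to the parent count for each pair that occurs in the data.

```python
from collections import Counter

def mle_tabular_cpd(data):
    """Estimate a tabular CPD by counting.

    data: list of (u, x) pairs (parent assignment, child value).
    Returns {(x, u): N_{u,x} / N_u} for every pair seen in the data.
    """
    joint_counts = Counter(data)                  # N_{u,x}
    parent_counts = Counter(u for u, _ in data)   # N_u
    return {(x, u): n / parent_counts[u] for (u, x), n in joint_counts.items()}

# Toy data set: three instances with U = u0 and one with U = u1.
data = [("u0", "x0"), ("u0", "x0"), ("u0", "x1"), ("u1", "x1")]
theta = mle_tabular_cpd(data)
```

Pairs that never occur in the data get no entry here, which corresponds to an estimated probability of zero; this is the lack of robustness that the Bayesian approach in the next subsection addresses.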
2.4.1.2

Bayesian Parameter Estimation

In many cases, maximum likelihood parameter estimation is not robust, as it overfits the training data. The Bayesian approach uses a prior distribution over the
parameters to smooth the irregularities in the training data, and is therefore significantly more robust. As we will see in section 2.4.3, the Bayesian framework also
gives us a good metric for evaluating the quality of different candidate structures.
Roughly speaking, the Bayesian approach introduces a prior over the unknown
parameters, allowing us to specify a joint distribution over the unknown parameters
and the data instances, and performs Bayesian conditioning, using the data as
evidence, to compute a posterior distribution over these parameters.
Consider the following simple example: we want to estimate parameters for a
simple network with two variables X and Y , where X is the parent of Y . Our
training data consists of observations $x^j, y^j$ for j = 1, . . . , m. In addition, assume
that our CPDs are represented as multinomials and we have unknown parameter
vectors $\theta_X$, $\theta_{Y|x^0}$, and $\theta_{Y|x^1}$.
The dependencies between these variables are described in the network of figure 2.6. This is the meta-Bayesian network that describes our learning setup. This
Bayesian network structure immediately reveals several points. For example, the
instances are independent given the unknown parameters. In addition, a common
assumption made is that the individual parameter variables are a priori independent. That is, we believe that knowing the value of one parameter tells us nothing
about another. This is called parameter independence. The suitability of this assumption depends on the domain, and it should be considered with care.
If we accept parameter independence, we can draw an important conclusion.
Complete data d-separates the parameters for different CPDs. Given the data set
D, we can determine the posterior over $\theta_X$ independently of the posterior over
$\theta_{Y|X}$. Once we solve each problem separately, we can combine the results. This
result is analogous to the likelihood decomposition for maximum likelihood estimation in
section 2.4.1.1.

Figure 2.6 The Bayesian network for parameter estimation for a simple two-node
Bayesian network. [Figure: parameter nodes $\theta_X$, $\theta_{Y|x^0}$, and $\theta_{Y|x^1}$, and instance nodes X[1], Y[1], X[2], Y[2], ..., X[M], Y[M].]

Consider, for example, the learning setting described in figure 2.6, where we take
both X and Y to be binary. We need to represent the posterior over $\theta_X$ and $\theta_{Y|X}$ given
the data. If we use a Dirichlet prior over $\theta_X$, $\theta_{Y|x^0}$, and $\theta_{Y|x^1}$, then the posterior
$P(\theta_X \mid x^1, \ldots, x^M)$ can also be represented as a Dirichlet distribution.
Suppose that $P(\theta_X)$ is a Dirichlet prior with hyperparameters $\alpha_{x^0}$ and $\alpha_{x^1}$,
$P(\theta_{Y|x^0})$ is a Dirichlet prior with hyperparameters $\alpha_{y^0|x^0}$ and $\alpha_{y^1|x^0}$, and $P(\theta_{Y|x^1})$
is a Dirichlet prior with hyperparameters $\alpha_{y^0|x^1}$ and $\alpha_{y^1|x^1}$.
As in the decomposition of the likelihood function in section 2.4.1.1, the likelihood
terms that involve $\theta_{Y|x^0}$ depend on all the data instances such that $x^j = x^0$, and
the terms that involve $\theta_{Y|x^1}$ depend on all the data instances such that $x^j = x^1$.
We can decompose the joint distribution over parameters and data as follows:
\begin{align*}
P(\theta_G, D) = {} & P(\theta_X)\, L_X(\theta_X : D) \\
& \cdot P(\theta_{Y|x^1}) \prod_{j : x^j = x^1} P(y^j \mid x^j : \theta_{Y|x^1}) \\
& \cdot P(\theta_{Y|x^0}) \prod_{j : x^j = x^0} P(y^j \mid x^j : \theta_{Y|x^0}).
\end{align*}

Thus, this joint distribution is a product of three separate joint distributions,
each with a Dirichlet prior for some multinomial parameter and data drawn from this
multinomial. We can conclude that the posterior $P(\theta_X \mid D)$ is Dirichlet with
hyperparameters $\alpha_{x^0} + N_{x^0}$ and $\alpha_{x^1} + N_{x^1}$; the posterior $P(\theta_{Y|x^0} \mid D)$ is Dirichlet
with hyperparameters $\alpha_{y^0|x^0} + N_{x^0,y^0}$ and $\alpha_{y^1|x^0} + N_{x^0,y^1}$; and the posterior
$P(\theta_{Y|x^1} \mid D)$ is Dirichlet with hyperparameters $\alpha_{y^0|x^1} + N_{x^1,y^0}$ and $\alpha_{y^1|x^1} + N_{x^1,y^1}$.
The same pattern of reasoning applies to the general case. Let D
be a complete data set for $\mathcal{X}$, and let G be a network structure over these variables
with table CPDs. If the prior $P(\theta_G)$ satisfies parameter independence, then
\[ P(\theta_G \mid D) = \prod_{i} \prod_{\mathrm{pa}_i} P(\theta_{X_i|\mathrm{pa}_i} \mid D). \]
If $P(\theta_{X|u})$ is a Dirichlet prior with hyperparameters $\alpha_{x^1|u}, \ldots, \alpha_{x^K|u}$, then the
posterior $P(\theta_{X|u} \mid D)$ is a Dirichlet distribution with hyperparameters
$\alpha_{x^1|u} + N_{u,x^1}, \ldots, \alpha_{x^K|u} + N_{u,x^K}$.

This induces a predictive model in which, for the next instance, we have that
\[ P(X[m+1] = x^i \mid U[m+1] = u, D) = \frac{\alpha_{x^i|u} + N_{u,x^i}}{\sum_{k=1}^{K} \left( \alpha_{x^k|u} + N_{u,x^k} \right)}. \tag{2.22} \]
Putting this all together, we can see that for computing the probability of a
new instance, we can use a single network parameterized as usual, via a set of
multinomials, but ones computed as in (2.22).
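The Dirichlet update and the predictive probability of (2.22) amount to adding counts to hyperparameters and normalizing. The sketch below handles a single parent assignment u; the dictionaries of hyperparameters and counts are hypothetical inputs chosen for illustration, not data from the text.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior hyperparameters: prior hyperparameter plus observed count."""
    return {x: alpha[x] + counts.get(x, 0) for x in alpha}

def predictive(alpha, counts):
    """Probability of each value for the next instance, as in (2.22)."""
    posterior = dirichlet_posterior(alpha, counts)
    total = sum(posterior.values())
    return {x: a / total for x, a in posterior.items()}

# Uniform prior Dirichlet(1, 1); the value x1 was never observed.
alpha = {"x0": 1.0, "x1": 1.0}
counts = {"x0": 3}
pred = predictive(alpha, counts)
```

With these inputs the MLE would assign probability zero to x1, while the Bayesian predictive distribution still reserves some mass for it, which illustrates the smoothing effect of the prior.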
2.4.2

Parameter Estimation in Markov Networks

Unfortunately, for general Markov networks, the likelihood function cannot be
decomposed. A notable exception is chordal Markov networks, but we will focus
on the general case here. For a network with a set of cliques $D_1, \ldots, D_n$, the
likelihood function is given by
\[ L(\theta, D) = \prod_{j=1}^{m} \frac{1}{Z} \exp\left[ \sum_{i=1}^{n} \theta_i[x_i^j] \right], \]
where $x_i^j$ is the value of the variables $D_i$ in the instance $x^j$ and $Z = \sum_x \exp\left[ \sum_{i=1}^{n} \theta_i[x_i] \right]$
is the normalization constant. This normalization constant is responsible for coupling the estimation parameters and effectively ruling out a closed-form solution.
Luckily, the logarithm of this objective function is concave in $\theta$, so we have an unconstrained concave maximization problem, which can be solved by simple gradient ascent or
second-order methods.
More concretely, for each $u_i \in \mathrm{Val}(D_i)$ we have a parameter $\theta_{i,u_i} \in \mathbb{R}$. This is
the simplest case of complete parameterization. Often, however, parameters may
be tied or clamped to zero. This does not change the fundamental complexity or
method of estimation. The derivative of the log-likelihood with respect to $\theta_{i,u_i}$ is
given by
\[ \frac{\partial \log L(\theta, D)}{\partial \theta_{i,u_i}} = \sum_{j=1}^{m} \left[ \mathbf{1}\{x_i^j = u_i\} - P(u_i \mid \theta) \right] = N_{u_i} - m P(u_i \mid \theta). \]
Note that the gradient is zero when the counts of the data correspond exactly
with the expected counts predicted by the model. In practice, a prior on the
parameters is used to help avoid overfitting. The standard prior is a diagonal
Gaussian, $\theta \sim N(0, \sigma^2 I)$, which adds an additional term of $-\theta_{i,u_i}/\sigma^2$ to the gradient.
To compute the probability $P(u_i \mid \theta)$ needed to evaluate the gradient, we need
to perform inference in the Markov network. Unlike in Bayesian networks, where
parameters of intractable (large treewidth) graphs can be estimated by simple
counting because of local normalization, the undirected case requires inference
even during the learning stage. This is one of the prices of the flexibility of global
normalization in Markov networks. See further discussion in chapter 4. Because
of this added complexity, maximum-likelihood learning of the Markov network
structure is much more expensive and much less investigated; we will focus below
on Bayesian networks.
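To make the need for inference inside the learning loop concrete, the sketch below runs plain gradient ascent on a tiny log-linear Markov network. Everything specific here is an assumption made for illustration: a hypothetical chain A - B - C over binary variables with cliques {A, B} and {B, C}, one parameter per clique assignment, a fixed step size, no prior, and inference by brute-force enumeration of all joint states (feasible only for very small networks). The update uses the gradient of the log-likelihood, the empirical count minus m times the model marginal.

```python
import itertools
import math

CLIQUES = [(0, 1), (1, 2)]                          # hypothetical chain A - B - C
STATES = list(itertools.product([0, 1], repeat=3))  # all joint assignments

def empty_table(value):
    """One entry per assignment of a two-variable clique."""
    return dict.fromkeys(itertools.product([0, 1], repeat=2), value)

def clique_marginals(theta):
    """Exact inference by enumeration: P(u_i | theta) for every clique."""
    weights = [math.exp(sum(theta[i][tuple(x[v] for v in c)]
                            for i, c in enumerate(CLIQUES))) for x in STATES]
    z = sum(weights)                                # normalization constant Z
    marginals = [empty_table(0.0) for _ in CLIQUES]
    for x, w in zip(STATES, weights):
        for i, c in enumerate(CLIQUES):
            marginals[i][tuple(x[v] for v in c)] += w / z
    return marginals

def fit(data, steps=2000, step_size=0.5):
    """Gradient ascent on the log-likelihood (scaled by 1/m)."""
    m = len(data)
    theta = [empty_table(0.0) for _ in CLIQUES]
    counts = [empty_table(0) for _ in CLIQUES]
    for x in data:
        for i, c in enumerate(CLIQUES):
            counts[i][tuple(x[v] for v in c)] += 1  # N_{u_i}
    for _ in range(steps):
        marginals = clique_marginals(theta)         # inference at every step
        for i in range(len(CLIQUES)):
            for u in theta[i]:
                gradient = counts[i][u] - m * marginals[i][u]
                theta[i][u] += step_size * gradient / m
    return theta, counts

data = ([(0, 0, 0)] * 3 + [(0, 1, 1)] * 2 + [(1, 0, 1)] * 2
        + [(1, 1, 0)] + [(1, 1, 1)] * 2)
theta, counts = fit(data)
fitted = clique_marginals(theta)
```

At convergence the fitted clique marginals should match the empirical clique frequencies, which is the zero-gradient condition noted above.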
2.4.3

Learning the Bayesian Network Structure

Next we consider the problem of learning the structure of a Bayesian network. There
are three broad classes of algorithms for BN structure learning:
Constraint-based approaches These approaches view a Bayesian network as
a representation of independencies. They try to test for conditional dependence
and independence in the data, and then find a network that best explains these
dependencies and independencies. Constraint-based methods are quite intuitive;
they closely follow the definition of a Bayesian network. A potential disadvantage
of these methods is that they can be sensitive to failures in individual independence
tests.
Score-based approaches These approaches view a Bayesian network as specifying a statistical model, and then address learning as a model selection problem.
These all operate on the same principle: We define a hypothesis space of potential models (the set of possible network structures we are willing to consider)
and a scoring function that measures how well the model fits the observed
data. Our computational task is then to find the highest-scoring network structure. The space of Bayesian networks is a combinatorial space, consisting of a
superexponential number of structures, $2^{O(n^2)}$. Therefore, even with a scoring
function, it is not clear how one can find the highest-scoring network. There are
very special cases where we can find the optimal network. In general, however,
the problem is NP-hard, and we resort to heuristic search techniques. Score-based
methods consider the whole structure at once, and are therefore less sensitive to
individual failures and are better at making compromises between the extent to
which variables are dependent in the data and the cost of adding the edge.
The disadvantage of the score-based approaches is that they are in general not
guaranteed to find the optimal solution.
Bayesian model averaging approaches The third class of approaches does not
attempt to learn a single structure. They are based on a Bayesian framework
describing a distribution over possible structures and try to average the prediction
of all possible structures. Since the number of structures is immense, performing
this task seems impossible. For some classes of models this can be done efficiently,
and for others we need to resort to approximations.
In this chapter, we focus on the second approach, score-based approaches to
structure selection. For details about the other approaches, see [8].

2.4.3.1

Structure Scores

As discussed above, score-based methods approach the problem of structure learning as an optimization problem. We define a score function that can score each
candidate structure with respect to the training data, and then search for a high-scoring structure. As can be expected, one of the most important decisions we
must make in this framework is the choice of scoring function. In this subsection,
we discuss two of the most obvious choices.

The Likelihood Score A natural choice for scoring function is the likelihood
function, which we used for parameter estimation. This measures the probability
of the data given a model; thus, it seems intuitive to find a model that would make
the data as probable as possible.
Assume that we want to maximize the likelihood of the model. Our goal is to
find both a graph G and parameters $\theta_G$ that maximize the likelihood. It is easy to
show that to find the maximum-likelihood $(G, \theta_G)$ pair, we should find the graph
structure G that achieves the highest likelihood when we use the MLE parameters
for G. We therefore define
\[ \mathrm{score}_L(G : D) = \ell(\langle G, \hat{\theta}_G \rangle : D), \]
where $\ell(\langle G, \hat{\theta}_G \rangle : D)$ is the logarithm of the likelihood function, and $\hat{\theta}_G$ are the
maximum-likelihood parameters for G. (It is typically easier to deal with the
logarithm of the likelihood.)
The problem with the likelihood score is that it overfits the training data. It
will learn a model that precisely fits the specifics of the empirical distribution in
our training set. This model captures both dependencies that are true of the
underlying distribution, and dependencies that are artifacts of the specific set of
instances that were given as training data. It therefore fails to generalize well to
new data cases: these are sampled from the underlying distribution, which is not
identical to the empirical distribution in our training set.
However, it is reasonable to use the maximum-likelihood score when there are additional mechanisms that disallow overcomplicated structures, for example, when learning networks with a fixed indegree. Such a limitation can constrain the tendency
to overfit when using the maximum-likelihood score.

Bayesian Score An alternative scoring function is based on Bayesian considerations. Recall that the main principle of the Bayesian approach is that, whenever we
have uncertainty over anything, we should place a distribution over it. In this case,
we have uncertainty both over structure and over parameters. We therefore define
a structure prior P(G) that puts a prior probability on different graph structures,
and a parameter prior $P(\theta_G \mid G)$ that puts a probability on different choices of
parameters once the graph is given. By Bayes rule, we have
\[ P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}, \]
where, as usual, the denominator is simply a normalizing factor that does not help
distinguish between different structures. Then, we define the Bayesian score as
\[ \mathrm{score}_B(G : D) = \log P(D \mid G) + \log P(G). \tag{2.23} \]

The ability to ascribe a prior over structures gives us a way of preferring some
structures over others. For example, we can penalize dense structures more than
sparse ones. It turns out, however, that this prior term in the score is almost irrelevant
compared to the first term. The first term, P(D | G), takes into consideration
our uncertainty over the parameters:
\[ P(D \mid G) = \int P(D \mid \theta_G, G)\, P(\theta_G \mid G)\, d\theta_G, \tag{2.24} \]
where $P(D \mid \theta_G, G)$ is the likelihood of the data given the network $\langle G, \theta_G \rangle$ and
$P(\theta_G \mid G)$ is our prior distribution over different parameter values for the network
G. This term is the marginal likelihood of the data given the structure, since we
marginalize out the unknown parameters.
Note that the marginal likelihood is different from the maximum-likelihood score.
Both terms examine the likelihood of the data given the structure. The maximum-likelihood score returns the maximum of this function. In contrast, the marginal
likelihood is the average value of this function, where we average based on the prior
measure $P(\theta_G \mid G)$.
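Instantiating this further, if we consider a network with Dirichlet priors, such
that $P(\theta_{X_i|\mathrm{pa}_i} \mid G)$ has hyperparameters $\{\alpha^G_{x_i^j|u_i} : j = 1, \ldots, |X_i|\}$, then we have
that
\[ P(D \mid G) = \prod_{i} \prod_{u_i \in \mathrm{Val}(\mathrm{Pa}^G_{X_i})} \frac{\Gamma(\alpha^G_{X_i|u_i})}{\Gamma(\alpha^G_{X_i|u_i} + N_{u_i})} \prod_{x_i^j \in \mathrm{Val}(X_i)} \frac{\Gamma(\alpha^G_{x_i^j|u_i} + N_{u_i,x_i^j})}{\Gamma(\alpha^G_{x_i^j|u_i})}, \]
where $\alpha^G_{X_i|u_i} = \sum_j \alpha^G_{x_i^j|u_i}$. In practice, we use the logarithm of this formula, which
is more manageable to compute numerically.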
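In log form, the contribution of one family to the marginal likelihood is a sum of log-gamma terms. The sketch below computes that contribution with `math.lgamma` from the Python standard library; the nested-dictionary layout for hyperparameters and counts (parent assignment, then child value) is an assumption made for the example.

```python
from math import lgamma, log

def log_marginal_family(alpha, counts):
    """Log of one family's factor in P(D | G) under Dirichlet priors.

    alpha:  {u: {x: hyperparameter}} for each parent assignment u.
    counts: {u: {x: count}}; missing entries are treated as zero.
    """
    total = 0.0
    for u, alpha_u in alpha.items():
        counts_u = counts.get(u, {})
        alpha_sum = sum(alpha_u.values())
        n_u = sum(counts_u.get(x, 0) for x in alpha_u)
        total += lgamma(alpha_sum) - lgamma(alpha_sum + n_u)
        for x, a in alpha_u.items():
            total += lgamma(a + counts_u.get(x, 0)) - lgamma(a)
    return total

# A binary variable with no parents (single empty parent assignment),
# a Dirichlet(1, 1) prior, and three observations: x0 twice, x1 once.
alpha = {(): {"x0": 1.0, "x1": 1.0}}
counts = {(): {"x0": 2, "x1": 1}}
value = log_marginal_family(alpha, counts)
```

For this toy input the marginal likelihood is the integral of $\theta^2 (1 - \theta)$ over [0, 1], which equals 1/12, so the function should return log(1/12). The log marginal likelihood of a whole network is the sum of this quantity over all families.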
The Bayesian score is biased toward simpler structures, but as it gets more data,
it is willing to recognize that a more complex structure is necessary. In other words,
it trades off fit to data with model complexity. To understand this behavior, it is useful to
consider an approximation to the Bayesian score that better exposes its fundamental
properties.

Theorem 2.35
If we use a Dirichlet parameter prior for all parameters in our network, then, as
$M \to \infty$, we have that
\[ \log P(D \mid G) = \ell(\hat{\theta}_G : D) - \frac{\log M}{2} \mathrm{Dim}[G] + O(1), \]
where Dim[G] is the number of independent parameters in G.

From this we see that the Bayesian score tends precisely to trade off the likelihood
(fit to data) on the one hand, and the model complexity on the other.
This approximation is called the Bayesian information criterion (BIC) score:
\[ \mathrm{score}_{BIC}(G : D) = \ell(\hat{\theta}_G : D) - \frac{\log M}{2} \mathrm{Dim}[G]. \]
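The BIC score is simple to compute once the maximized log-likelihood and the parameter count are known. The sketch below is a minimal illustration with hypothetical helper names; it assumes tabular CPDs, for which the number of independent parameters of one family is (|Val(X)| - 1) times the number of parent assignments, and it uses `math.prod` (Python 3.8 or later).

```python
import math

def dim_tabular(child_cardinality, parent_cardinalities):
    """Independent parameters of a tabular CPD: (|X| - 1) per parent assignment."""
    return (child_cardinality - 1) * math.prod(parent_cardinalities)

def bic_score(log_likelihood, num_params, num_samples):
    """Maximized log-likelihood minus (log M / 2) * Dim[G]."""
    return log_likelihood - 0.5 * math.log(num_samples) * num_params

# A binary child with one binary and one ternary parent.
dim = dim_tabular(2, [2, 3])
score = bic_score(log_likelihood=-100.0, num_params=dim, num_samples=100)
```

Dim[G] for a whole network is the sum of `dim_tabular` over its families, so the penalty grows with every added edge while the log-likelihood term grows with the amount of data.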
Our next task is to define the actual priors that are used in the Bayesian score. In
the case of the prior of network structures, P(G), note that although this term seems
to describe our bias for a certain structure, in fact it plays a relatively minor role. As
we can see in theorem 2.35, the logarithm of the marginal likelihood grows linearly
with the number of examples, while the prior over structures remains constant.
Thus, the structure prior does not play an important role in asymptotic analysis as
long as it does not rule out (i.e., assign probability 0) any structure.
In part because of this, it is common to use a uniform prior over structures.
Nonetheless, the structure prior can make some difference when we consider small
samples. Thus, we might want to encode some of our preferences in this prior. For
example, we might penalize edges in the graph, and use a prior
\[ P(G) \propto c^{|G|}, \]
where c is some constant smaller than 1, and |G| is the number of edges in the
graph. In both these choices (the uniform, and the penalty per edge) it suffices to
use a value that is proportional to the prior, since the normalizing constant is the
same for all choices of G and hence can be ignored.
It is mathematically convenient to assume that the structure prior satisfies
structure modularity. This condition requires that the prior P(G) is proportional to
a product of terms, where each term relates to one family. Formally,
\[ P(G) \propto \prod_{i} P(\mathrm{Pa}_{X_i} = \mathrm{Pa}^G_{X_i}), \]
where $P(\mathrm{Pa}_{X_i} = \mathrm{Pa}^G_{X_i})$ denotes the prior probability assigned to choosing the
specific set of parents for $X_i$. Structure priors that satisfy this property do not
penalize for global properties of the graph (such as its depth), but only for local
properties (such as the indegree of each node).
Next we need to represent our parameter priors. The number of possible structures is superexponential, which makes it difficult to elicit separate parameters for
each one.

A simple approach is simply to take some fixed Dirichlet distribution, e.g.,
Dirichlet$(\alpha, \alpha, \ldots, \alpha)$, for every parameter, where $\alpha$ is a predetermined constant.
A typical choice is $\alpha = 1$. This prior is often referred to as the K2 prior, referring
to the name of the system where it was first used.
A more sophisticated approach is called the BDe prior. We elicit a prior distribution $P'$ over the entire probability space and an equivalent sample size $M'$ for
the set of imaginary samples. We then set the parameters as follows:
\[ \alpha_{x_i|\mathrm{pa}_i} = M' \cdot P'(x_i, \mathrm{pa}_i). \]
This choice avoids certain inconsistencies exhibited by the K2 prior. We can
represent $P'$ as a Bayesian network, whose structure can represent our prior about
the domain structure. Most simply, when we have no prior knowledge, we set $P'$
to be the uniform distribution, i.e., the empty Bayesian network with a uniform
marginal distribution for each variable.
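For the uniform choice of $P'$, the BDe hyperparameters for one family have a simple closed form: every joint assignment of the child and its parents is equally likely, so each receives the same share of the equivalent sample size. The sketch below illustrates this case only; the function name and argument layout are hypothetical.

```python
def bde_uniform_hyperparameters(child_values, parent_configs, equivalent_sample_size):
    """alpha_{x|pa} = M' * P'(x, pa), with P' uniform over joint assignments."""
    p_uniform = 1.0 / (len(child_values) * len(parent_configs))
    return {(x, pa): equivalent_sample_size * p_uniform
            for x in child_values for pa in parent_configs}

# Binary child with one ternary parent, equivalent sample size M' = 6.
alphas = bde_uniform_hyperparameters(["x0", "x1"], ["u0", "u1", "u2"], 6.0)

# The same child with no parents: a single empty parent configuration.
alphas_root = bde_uniform_hyperparameters(["x0", "x1"], [()], 6.0)
```

Unlike the K2 prior, the hyperparameters of a family always sum to $M'$ here, no matter how many parents the variable has, so adding parents spreads the same imaginary sample over more cells rather than adding imaginary data.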
The BDe score turns out to satisfy an important property. Two networks are
said to be I-equivalent if they encode the same set of independence statements.
Hence based on observed independencies we cannot distinguish between I-equivalent
networks. This suggests that based on observing data cases, we do not expect to
distinguish between equivalent networks. The BDe score has the desirable property
that I-equivalent networks have the same score, or are score-equivalent.
2.4.3.2

Search

We now have a well-defined optimization problem. Our input is
training set D;
scoring function (including priors, if needed);
a set $\mathcal{G}$ of possible network structures (incorporating any prior knowledge).
Our desired output is a network structure (from the set of possible structures) that
maximizes the score.
It turns out that, for this discussion, we can ignore the specific choice of score.
Our search algorithms will apply unchanged to all three of these scores.
An important property of the scores that affects the efficiency of search is their
decomposability. A score is decomposable if we can write the score of a network
structure G as a sum of family scores:
\[ \mathrm{score}(G : D) = \sum_{i} \mathrm{FamScore}(X_i \mid \mathrm{Pa}^G_i : D). \]
All of the scores we have considered are decomposable. Another property that is
shared by all these scores is score equivalence; if G is independence-equivalent to
$G'$, then $\mathrm{score}(G : D) = \mathrm{score}(G' : D)$.

There are several special cases where structure learning is tractable. We won't go
into full details, but two important cases are: (1) learning tree-structured networks
and (2) learning networks with known ordering over the variables.
A network is tree-structured if each variable has at most one parent. In this case,
for decomposable, score-equivalent scores, we can construct an undirected graph,
where the weight on an edge $X_i - X_j$ is the change in network score if we add
$X_i$ as the parent of $X_j$ (note that, because of score-equivalence, this is the same as
the change if we add $X_j$ as parent of $X_i$). We can find a maximum-weight spanning tree of
this graph in polynomial time. We can transform the undirected spanning tree into
a directed spanning tree by choosing an arbitrary root, and directing edges away
from the root.
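The tree-structured case can be sketched with Kruskal's algorithm followed by a traversal that directs edges away from a root. This is an illustration under stated assumptions, not the chapter's own procedure: variables are numbered from 0, the edge weights (score changes) are supplied in a hypothetical dictionary, and edges whose weight is not positive are skipped, so the result may be a forest in which some variables have no parent.

```python
def max_spanning_tree(num_vars, weights):
    """Kruskal's algorithm, heaviest edges first, with a union-find structure.

    weights: {(i, j): score change for linking variables i and j}.
    """
    parent = list(range(num_vars))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda item: -item[1]):
        root_i, root_j = find(i), find(j)
        if root_i != root_j and w > 0:      # skip cycles and non-improving edges
            parent[root_i] = root_j
            tree.append((i, j))
    return tree

def orient(num_vars, tree, root=0):
    """Direct every edge away from the root of its component."""
    adjacent = {v: [] for v in range(num_vars)}
    for i, j in tree:
        adjacent[i].append(j)
        adjacent[j].append(i)
    directed, seen = [], set()
    for start in [root] + [v for v in range(num_vars) if v != root]:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for nxt in adjacent[v]:
                if nxt not in seen:
                    seen.add(nxt)
                    directed.append((v, nxt))   # v becomes the parent of nxt
                    stack.append(nxt)
    return directed

weights = {(0, 1): 3.0, (1, 2): 2.0, (0, 2): 1.0, (2, 3): -1.0}
tree = max_spanning_tree(4, weights)
edges = orient(4, tree, root=0)
```

In this toy input the edge between variables 2 and 3 lowers the score, so variable 3 is left without a parent.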
Another interesting tractable case is the problem of learning a BN structure
consistent with some known total order $\prec$ over $\mathcal{X}$ and bounded indegree d. In other
words, we restrict attention to structures G where if $X_i \in \mathrm{Pa}^G_{X_j}$ then $X_i \prec X_j$
and $|\mathrm{Pa}^G_{X_j}| \le d$. For some domains, finding an ordering such as this is relatively
straightforward; for example, a temporal flow over the order in which variables take
on their values. In this case, for each $X_i$ we can evaluate each possible parent set
of size at most d from $\{X_1, \ldots, X_{i-1}\}$. This is polynomial in n (but exponential in d).
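With a known ordering and a decomposable score, each variable's parent set can be chosen independently by enumeration. The sketch below assumes a hypothetical callable `fam_score(x, parents)` that returns the family score for a candidate parent set; the toy score used in the example is made up purely so the code can run.

```python
from itertools import combinations

def best_parents_given_order(order, max_indegree, fam_score):
    """Pick, for each variable, the best-scoring parent set among its predecessors."""
    structure = {}
    for position, x in enumerate(order):
        predecessors = order[:position]
        candidates = (parents
                      for k in range(min(max_indegree, position) + 1)
                      for parents in combinations(predecessors, k))
        structure[x] = max(candidates, key=lambda parents: fam_score(x, parents))
    return structure

# Hypothetical family score: penalize each disagreement with a target structure.
target = {"A": (), "B": ("A",), "C": ("A", "B")}

def toy_fam_score(x, parents):
    return -len(set(parents) ^ set(target[x]))

structure = best_parents_given_order(["A", "B", "C"], 2, toy_fam_score)
```

Because every parent precedes its child in the ordering, the resulting graph is acyclic by construction, so no legality check is needed.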
Unfortunately, the general case, finding an optimally scoring structure $G^*$ for bounded
degree $d \ge 2$, is NP-hard. Instead of aiming for an algorithm that will always find
the highest-scoring network, we resort to heuristic algorithms that attempt to find
the best network, but are not guaranteed to do so.
To define the heuristic search algorithm, we must define the search space and
search procedure. We can think of a search space as a graph, where each vertex or
node is a candidate network structure to be considered, and edges denote possible
moves that the search procedure can perform. The search procedure defines an
algorithm that explores the search space, without necessarily seeing all of it. The
simplest search procedure is the greedy one: whenever it is at a node, it chooses to
move to the neighbor that has the highest score, until it reaches a node that has a
better score than all of its neighbors.
To elaborate further, in our case a node in the search space is a complete network
structure G over $\mathcal{X}$. There is a tradeoff between how densely each node is connected and
how effective the search will be. If each node has few neighbors, then the search
procedure has to consider only few options at each point of the search. Thus, it can
afford to evaluate each of these options. However, paths from the initial node to a
good one might be long and complex. On the other hand, if each node has many
neighbors, there are short paths from each point to another, but we might not be
able to find them, because we don't have time to evaluate all of the options at each
step.
A good tradeoff for this problem chooses reasonably few neighbors for each node,
but ensures that the diameter of the search space remains small. A natural choice
of neighbors of a network structure is a set of structures that are identical to it
except for small local modifications. The most commonly used operators which
define the local modifications are
add an edge;
delete an edge;
reverse an edge.
In other words, if we consider the node G, then the neighboring nodes in the
search space are those where we change one edge, either by adding one, deleting
one, or reversing the orientation of one. We only consider operations that result
in legal networks (i.e., acyclic networks satisfying any constraints such as bounded
indegree).

Procedure Greedy-Structure-Search (
    G, // initial network structure
    D, // fully observed data set
    score, // score
    O // a set of search operators
)
    G_best ← G
    do
        G ← G_best
        Progress ← false
        for each operator o ∈ O
            G_o ← o(G) // result of applying o on G
            if G_o is a legal structure then
                if score(G_o : D) > score(G_best : D) then
                    G_best ← G_o
                    Progress ← true
    while Progress
    return G_best

Figure 2.7 Greedy structure search algorithm, with an arbitrary scoring function
score(G : D).

This definition of search space is quite natural and has several desirable properties. First, notice that the diameter of the search space is at most $n^2$. That is,
there is a relatively short path between any two networks we choose. To see this,
note that if we consider traversing a path from $G_1$ to $G_2$, we can start by deleting
all edges in $G_1$ that do not appear in $G_2$, and then we can add the edges that are
in $G_2$ and not in $G_1$. Clearly, the number of steps we take is bounded by the total
number of edges we can have, $n^2$.
Second, recall that the score of a network G is a sum of local scores. The operations
we consider result in changing only one local score term (in the case of addition
or deletion of an edge) or two (in the case of edge reversal). Thus, they result in a
local change in the score; the main mass of the score remains the same. This
implies that there is some sense of continuity in the score of neighboring nodes.
The search methods most commonly used are local search procedures. Such
search procedures are characterized by the following design: they keep a current
candidate node. At each iteration they explore some of the neighboring nodes, and
then decide to make a step to one of the neighbors and make it the current
candidate. These iterations are repeated until some termination condition. In other
words, local search procedures can be thought of as keeping one pointer into the
search space and moving it around.
One of the simplest, and often used, search procedures is the greedy hill-climbing
procedure. The intuition is simple. As the name suggests, at each step we take the
step that leads to the largest improvement in the score. The actual details of the
procedure are shown in figure 2.7. We pick an initial network structure G as a
starting point; this network can be the empty one, a random choice, the best tree,
or a network obtained from some prior knowledge. We compute its score. We then
consider all of the neighbors of G in the space (all of the legal networks obtained
by applying a single operator to G) and compute the score for each of them. We
then apply the change that leads to the best improvement in the score. We continue
this process until no modification improves the score.
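The hill-climbing loop described above can be sketched in a few lines. The representation is an assumption made for illustration: a structure is a frozenset of directed edges (parent, child), legality means acyclicity only, and `score` is any hypothetical callable on an edge set. The toy score in the example simply counts disagreements with a made-up target structure; it stands in for a real likelihood, Bayesian, or BIC score.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes with no incoming edges; a cycle leaves some behind."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        sources = {v for v in remaining if not any(j == v for _, j in edges)}
        if not sources:
            return False
        remaining -= sources
        edges = {(i, j) for i, j in edges if i not in sources}
    return True

def neighbors(nodes, edges):
    """Structures reachable by one edge addition, deletion, or reversal."""
    for i, j in permutations(nodes, 2):
        if (i, j) in edges:
            yield edges - {(i, j)}                  # delete an edge
            yield (edges - {(i, j)}) | {(j, i)}     # reverse an edge
        elif (j, i) not in edges:
            yield edges | {(i, j)}                  # add an edge

def greedy_structure_search(nodes, initial_edges, score):
    best = frozenset(initial_edges)
    best_score = score(best)
    while True:
        current, improved = best, False
        for candidate in neighbors(nodes, current):
            candidate = frozenset(candidate)
            if is_acyclic(nodes, candidate):
                s = score(candidate)
                if s > best_score:                  # keep the best neighbor so far
                    best, best_score, improved = candidate, s, True
        if not improved:
            return best, best_score

target = frozenset({("A", "B"), ("B", "C")})

def toy_score(edges):
    return -len(edges ^ target)   # hypothetical stand-in for a structure score

found, found_score = greedy_structure_search(["A", "B", "C"], [], toy_score)
```

This version rescores every neighbor from scratch. With a decomposable score, only the one or two family terms touched by an operator need to be recomputed, which is the main practical saving suggested by the locality argument above.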
We can improve on the performance of greedy hill-climbing by using more clever
search algorithms. Some common extensions are:
TABU search: Keep a list of K most recently visited structures and avoid them,
i.e., apply the best move that leads to a structure not on the list. This approach
deals with local maxima whose hill has fewer than K structures.
Random restarts: Once stuck, apply some fixed number of random edge changes
and then restart the greedy search. At the end of the search, select the best
structure encountered anywhere on the trajectory. This approach can escape from
the basin of one local maximum to another.
Simulated annealing: Evaluate operators in random order. If the randomly
selected operator induces an uphill step, move to the resulting structure. (Note:
it does not have to be the best of the current neighbors.) If the operator induces a
downhill step, apply it with probability inversely proportional to the reduction in
score. A temperature parameter determines the probability of taking downhill
steps. As the search progresses, the temperature decreases, and the algorithm
becomes less likely to take a downhill step.
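The acceptance step of simulated annealing can be sketched as follows. The text does not give an exact formula for the downhill probability; this sketch swaps in the common Metropolis rule, exp(delta / temperature), which is an assumption and not necessarily the rule the authors have in mind. The function name and its arguments are hypothetical.

```python
import math
import random

def accept_move(delta, temperature, rng=random):
    """Decide whether to take a randomly proposed move.

    delta: score(new structure) - score(current structure).
    Uphill (or neutral) moves are always taken; downhill moves are taken
    with probability exp(delta / temperature) (Metropolis rule, assumed).
    """
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

rng = random.Random(0)
always_uphill = accept_move(1.0, temperature=1.0, rng=rng)
cold_downhill = accept_move(-1000.0, temperature=0.01, rng=rng)
hot_downhill = sum(accept_move(-1.0, temperature=1e6, rng=rng) for _ in range(1000))
```

At a high temperature nearly every downhill move is accepted, so the search behaves like a random walk; as the temperature is lowered, the rule approaches strict hill-climbing.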

2.5

Conclusion
This chapter presented a condensed description of graphical models, including their
representation, inference algorithms, and learning algorithms. Many topics have not
been covered; we refer the reader to [8] for a more complete description.

References
[1] A. Becker and D. Geiger. A sufficiently fast algorithm for finding close to
optimal clique trees. Artificial Intelligence, 125(1-2):3–17, 2001.
[2] W. Buntine. Chain graphs for learning. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1995.
[3] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic
Networks and Expert Systems. Springer-Verlag, New York, 1999.
[4] D. Heckerman. A tutorial on learning with Bayesian networks. Technical
Report MSR-TR-95-06, Microsoft Research, Seattle, WA, 1996.
[5] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer-Verlag, New
York, 2001.
[6] M. I. Jordan, editor. Learning in Graphical Models. The MIT Press, Cambridge,
MA, 1998.
[7] M. I. Jordan. Graphical models. Statistical Science (Special issue on Bayesian
Statistics), 19:140–155, 2004.
[8] D. Koller and N. Friedman. BNs and beyond, 2007. To appear.
[9] S. Lauritzen. Graphical Models. Oxford University Press, New York, 1996.
[10] S. Lauritzen and N. Wermuth. Graphical models for association between
variables, some of which are qualitative and some quantitative. Annals of
Statistics, 17(1):31–57, 1989.
[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann,
San Mateo, CA, 1988.
[12] K. Shoikhet and D. Geiger. A practical algorithm for finding optimal
triangulations. In Proceedings of the National Conference on Artificial Intelligence,
1997.

3 Inductive Logic Programming in a Nutshell

Sašo Džeroski

Inductive logic programming (ILP) is concerned with the development of techniques
and tools for relational data mining (RDM). Besides the ability to deal with data
stored in multiple tables, ILP systems are usually able to take into account generally
valid background (domain) knowledge in the form of a logic program. They also
use the powerful language of logic programs for describing discovered patterns. This
chapter introduces the basics of ILP and RDM. First it introduces the basics of logic
programming and relates logic programming terminology to database terminology.
It then discusses the major settings for, tasks of, and approaches to ILP and RDM.
The tasks of learning relational classification rules, decision trees, and association
rules and approaches to solving them are discussed next, followed by relational
distance-based approaches. The chapter also briefly discusses recent trends in ILP
and RDM research.

3.1 Introduction
From a knowledge discovery in databases (KDD) perspective, we can say that
inductive logic programming (ILP) is concerned with the development of techniques
and tools for relational data mining (RDM). While typical data mining approaches
find patterns in a given single table, relational data mining approaches find patterns
in a given relational database. In a typical relational database, data resides in
multiple tables. ILP tools can be applied directly to such multi-relational data to
find patterns that involve multiple relations. This is a distinguishing feature of ILP
approaches: most other data mining approaches can only deal with data that resides
in a single table and require preprocessing to integrate data from multiple tables
(e.g., through joins or aggregation) into a single table before they can be applied.
Integrating data from multiple tables through joins or aggregation can cause loss
of meaning or information. Suppose we are given the relation customer(CustID, Name,
Age, SpendsALot) and the relation purchase(CustID, ProductID, Date, Value,
PaymentMode), where each customer can make multiple purchases, and we are
interested in characterizing customers that spend a lot. Integrating the two relations
via a natural join will give rise to a relation purchase1 where each row corresponds
to a purchase and not to a customer. One possible aggregation would give rise to
the relation customer1(CustID, Age, NofPurchases, TotalValue, SpendsALot).
In this case, however, some information has clearly been lost during the aggregation process.
The following pattern can be discovered by an ILP system if the relations
customer and purchase are considered together.
customer(CID, Name, Age, yes) ←
    Age > 30 ∧
    purchase(CID, PID, D, Value, PM) ∧
    PM = credit_card ∧ Value > 100.
This pattern says: a customer spends a lot if she is older than 30, has purchased
a product of value more than 100, and paid for it by credit card. It would not
be possible to induce such a pattern from either of the relations purchase1 and
customer1 considered on their own.
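To make the point concrete, the following sketch evaluates the body of the pattern over two small tables. The rows are invented for illustration; only the schemas of customer and purchase come from the text. Note that customer 3 has a high total purchase value and has used a credit card, yet no single purchase satisfies both conditions, which is exactly the distinction that the aggregated relation customer1 cannot express.

```python
# Hypothetical toy data following the customer and purchase schemas.
customers = [
    # (CustID, Name, Age, SpendsALot)
    (1, "ann", 45, "yes"),
    (2, "bob", 25, "no"),
    (3, "eve", 52, "no"),
]
purchases = [
    # (CustID, ProductID, Date, Value, PaymentMode)
    (1, "p1", "2005-01-10", 150, "credit_card"),
    (1, "p2", "2005-02-01", 20, "cash"),
    (2, "p3", "2005-01-15", 300, "credit_card"),
    (3, "p4", "2005-03-02", 500, "cash"),
    (3, "p5", "2005-03-09", 40, "credit_card"),
]

def body_holds(cust_id, age):
    """Age > 30 and some single purchase paid by credit card with Value > 100."""
    return age > 30 and any(
        cid == cust_id and mode == "credit_card" and value > 100
        for cid, _pid, _date, value, mode in purchases
    )

# Customers for which the pattern predicts SpendsALot = yes.
predicted = {cid for cid, _name, age, _label in customers if body_holds(cid, age)}
actual = {cid for cid, _name, _age, label in customers if label == "yes"}
```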
Besides the ability to deal with data stored in multiple tables directly, ILP systems are usually able to take into account generally valid background (domain)
knowledge in the form of a logic program. The ability to take into account background knowledge and the expressive power of the language of discovered patterns
are also distinctive for ILP.
Note that data mining approaches that find patterns in a given single table are
referred to as attribute-value or propositional learning approaches, as the patterns
they find can be expressed in propositional logic. ILP approaches are also referred to
as first-order learning approaches, or relational learning approaches, as the patterns
they find are expressed in the relational formalism of first-order logic. A more
detailed discussion of the single table assumption, the problems resulting from it,
and how a relational representation alleviates these problems can be found in [49]
and in chapter 4 of [15].
The remainder of this chapter first introduces the basics of logic programming and
relates logic programming terminology to database terminology. It then discusses
the major settings for, tasks of, and approaches to ILP and RDM. The tasks of
learning relational classification rules, decision trees, and association rules and
approaches to solving them are discussed in the following three sections. Relational
distance-based approaches are covered next. The chapter concludes with a brief
discussion of recent trends in ILP and RDM research.

3.2 Logic Programming
We first briefly describe the basic logic programming terminology and relate it
to database terminology, then proceed with a more complete introduction to logic
programming. The latter discusses both the syntax and semantics of logic programs.
While syntax defines the language of logic programs, semantics is concerned with
assigning meaning (truth-values) to such statements. Proof theory focuses on
(deductive) reasoning with such statements.
For a thorough treatment of logic programming we refer to the standard textbook
of Lloyd [31]. The overview below is mostly based on the comprehensive and easily
readable text by Hogger [22].
3.2.1 The Basics of Logic Programming

Logic programs consist of clauses. We can think of clauses as first-order rules, where
the conclusion part is termed the head and the condition part the body of the clause.
The head and body of a clause consist of atoms, an atom being a predicate applied
to some arguments, which are called terms. In Datalog, terms are variables and
constants, while in general they may consist of function symbols applied to other
terms. Ground clauses have no variables.
Consider the clause father(X, Y) ∨ mother(X, Y) ← parent(X, Y). It reads: if
X is a parent of Y, then X is the father of Y or X is the mother of Y (∨ stands for
logical or). parent(X, Y) is the body of the clause, and father(X, Y) ∨ mother(X, Y)
is the head. parent, father, and mother are predicates, X and Y are variables,
and parent(X, Y), father(X, Y), and mother(X, Y) are atoms. We adopt the
Prolog [4] syntax and start variable names with capital letters. Variables in clauses
are implicitly universally quantified. The above clause thus stands for the logical
formula ∀X∀Y : father(X, Y) ∨ mother(X, Y) ∨ ¬parent(X, Y). Clauses are also
viewed as sets of literals, where a literal is an atom or its negation. The above clause
is then the set {father(X, Y), mother(X, Y), ¬parent(X, Y)}.
As opposed to full clauses, definite clauses contain exactly one atom in the
head. As compared to definite clauses, program clauses can also contain negated
atoms in the body. The clause in the paragraph above is a full clause; the clause
ancestor(X, Y) ← parent(Z, Y) ∧ ancestor(X, Z) is a definite clause (∧ stands
for logical and). It is also a recursive clause, since it defines the relation ancestor in
terms of itself and the relation parent. The clause mother(X, Y) ← parent(X, Y) ∧
not male(X) is a program clause.
A set of clauses is called a clausal theory. Logic programs are sets of program
clauses. A set of program clauses with the same predicate in the head is called a
predicate definition. Most ILP approaches learn predicate definitions.
A predicate in logic programming corresponds to a relation in a relational
database. An n-ary relation p is formally defined as a set of tuples [47], i.e., a
subset of the Cartesian product of n domains D1 × D2 × ... × Dn, where a domain
(or a type) is a set of values. It is assumed that a relation is finite unless stated
otherwise. A relational database (RDB) is a set of relations.
Thus, a predicate corresponds to a relation, and the arguments of a predicate
correspond to the attributes of a relation. The major difference is that the attributes
of a relation are typed (i.e., a domain is associated with each attribute). For
example, in the relation lives_in(X, Y), we may want to specify that X is of type
person and Y is of type city. Database clauses are typed program clauses.

Table 3.1 Database and logic programming terms

DB terminology                   LP terminology
relation name p                  predicate symbol p
attribute of relation p          argument of predicate p
tuple ⟨a1, ..., an⟩              ground fact p(a1, ..., an)
relation p, a set of tuples      predicate p, defined extensionally by a set of ground facts
relation q, defined as a view    predicate q, defined intensionally by a set of rules (clauses)

A deductive database (DDB) is a set of database clauses. In DDBs, relations
can be defined extensionally as sets of tuples (as in RDBs) or intensionally as
sets of database clauses. Database clauses use variables and function symbols in
predicate arguments and the language of DDBs is substantially more expressive
than the language of RDBs [31, 47]. A deductive Datalog database consists of
definite database clauses with no function symbols.
Table 3.1 relates basic database and logic programming terms. For a full treatment of logic programming, RDBs, and DDBs, we refer the reader to [31] and [47].
3.2.2 The Syntax and Semantics of Logic Programs

The basic concepts of logic programming include the language (syntax) of logic
programs, as well as notions from model and proof theory (semantics). The syntax defines what are legal sentences/statements in the language of logic programs.
Model theory (semantics) is concerned with assigning meaning (truth-values) to
such statements. Proof theory focuses on (deductive) reasoning with such statements.
3.2.2.1 Syntax: The Language

A first-order alphabet consists of variables, predicate symbols, and function symbols
(which include constants). A variable is a term, and a function symbol immediately
followed by a bracketed n-tuple of terms is a term. Thus f(g(X), h) is a term when
f, g, and h are function symbols and X is a variable (strings starting with lowercase
letters denote predicate and function symbols, while strings starting with uppercase
letters denote variables). A constant is a function symbol of arity 0 (i.e., followed
by a bracketed 0-tuple of terms, which is often left implicit). A predicate symbol
immediately followed by a bracketed n-tuple of terms is called an atomic formula
or atom. For example, mother(maja, filip) and father(X, Y) are atoms.
A well-formed formula (also called a sentence or statement) is either an atomic
formula or takes one of the following forms: F, (F), ¬F, F ∨ G, F ∧ G, F ← G,
F ↔ G, ∀X : F, and ∃X : F, where F and G are well-formed formulae and X
is a variable. ¬F denotes the negation of F, ∨ denotes logical disjunction (or), and
∧ logical conjunction (and). F ← G stands for implication (F if G, F ∨ ¬G) and
F ↔ G stands for equivalence (F if and only if G). ∀ and ∃ are the universal (for
all X F holds) and the existential quantifier (there exists an X such that F holds).
In the formulae ∀X : F and ∃X : F, all occurrences of X are said to be bound.
A sentence or a closed formula is a well-formed formula in which every occurrence
of every variable symbol is bound. For example, ∀Y∃X father(X, Y) is a sentence,
while father(X, andy) is not.
The clausal form is a normal form for first-order sentences. A clause is a
disjunction of literals (a positive literal is an atom, a negative literal the negation
of an atom) preceded by a prefix of universal quantifiers, one for each variable
appearing in the disjunction. In other words, a clause is a formula of the form
∀X1∀X2...∀Xs (L1 ∨ L2 ∨ ... ∨ Lm), where each Li is a literal and X1, X2, ..., Xs are
all the variables occurring in L1 ∨ L2 ∨ ... ∨ Lm.
A clause can also be represented as a finite set (possibly empty) of literals.
The set {A1, A2, ..., Ah, ¬B1, ¬B2, ..., ¬Bb}, where Ai and Bi are atoms, stands for
the clause (A1 ∨ ... ∨ Ah ∨ ¬B1 ∨ ... ∨ ¬Bb), which is equivalently represented as
A1 ∨ ... ∨ Ah ← B1 ∧ ... ∧ Bb. Most commonly, this same clause is written as
A1, ..., Ah ← B1, ..., Bb, where A1, ..., Ah is called the head and B1, ..., Bb the body
of the clause. Commas in the head of the clause denote logical disjunction, while
commas in the body of the clause denote logical conjunction. A set of clauses is
called a clausal theory and represents the conjunction of its clauses.
A clause is a Horn clause if it contains at most one positive literal; it is a definite
clause if it contains exactly one positive literal. A set of definite clauses is called
a definite logic program. A fact is a definite clause with an empty body, e.g.,
parent(mother(X), X) ←, also written simply as parent(mother(X), X). A goal
(also called a query) is a Horn clause with no positive literals.
A program clause is a clause of the form A ← L1, ..., Lm where A is an atom,
and each of L1, ..., Lm is a positive or negative literal. A negative literal in the body
of a program clause is written in the form not B, where B is an atom. A normal
program (or logic program) is a set of program clauses. A predicate definition is a
set of program clauses with the same predicate symbol (and arity) in their heads.
Let us now illustrate the above definitions with some examples. The clause
daughter(X, Y) ← female(X), mother(Y, X)
is a definite program clause, while the clause
daughter(X, Y) ← not male(X), father(Y, X)
is a normal program clause. Together, the two clauses constitute a predicate
definition of the predicate daughter/2. This predicate definition is also a normal
logic program. The first clause is an abbreviated representation of the formula
∀X∀Y : daughter(X, Y) ∨ ¬female(X) ∨ ¬mother(Y, X)
and can also be written in set notation as
{daughter(X, Y), ¬female(X), ¬mother(Y, X)}.
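The set notation lends itself to a direct encoding. The sketch below represents a clause as a set of signed literals and classifies it by counting positive literals, following the definitions of Horn clause, definite clause, fact, and goal given above. The Literal type and the helper names are assumptions of this example; negation as failure (not B in a program clause body) is outside its scope.

```python
from typing import NamedTuple

class Literal(NamedTuple):
    positive: bool      # True for an atom, False for a negated atom
    predicate: str
    args: tuple

def clause(head, body):
    """Build the literal set {A1, ..., Ah, ¬B1, ..., ¬Bb} from head and body atoms."""
    return frozenset([Literal(True, *atom) for atom in head] +
                     [Literal(False, *atom) for atom in body])

def classify(c):
    """Return the clause kinds that apply to the literal set c."""
    positives = sum(1 for literal in c if literal.positive)
    negatives = len(c) - positives
    kinds = []
    if positives <= 1:
        kinds.append("Horn")
    if positives == 1:
        kinds.append("definite")
    if positives == 1 and negatives == 0:
        kinds.append("fact")
    if positives == 0:
        kinds.append("goal")
    return kinds

daughter = clause(head=[("daughter", ("X", "Y"))],
                  body=[("female", ("X",)), ("mother", ("Y", "X"))])
full = clause(head=[("father", ("X", "Y")), ("mother", ("X", "Y"))],
              body=[("parent", ("X", "Y"))])
fact = clause(head=[("parent", ("mother(X)", "X"))], body=[])
```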
The set of variables in a term, atom, or clause F is denoted by vars(F). A
substitution θ = {V1/t1, ..., Vn/tn} is an assignment of terms ti to variables Vi.
Applying a substitution θ to a term, atom, or clause F yields the instantiated term,
atom, or clause Fθ where all occurrences of the variables Vi are simultaneously
replaced by the term ti. A term, atom, or clause F is called ground when there is
no variable occurring in F, i.e., vars(F) = ∅. The fact daughter(mary, ann) is thus
ground.
A clause or clausal theory is called function-free if it contains only variables
as terms, i.e., contains no function symbols (this also means no constants). The
clause daughter(X, Y) ← female(X), mother(Y, X) is function-free and the clause
even(s(s(X))) ← even(X) is not. A Datalog clause (program) is a definite clause
(program) that contains no function symbols of nonzero arity. This means that only
variables and constants can be used as predicate arguments. The size of a term,
atom, clause, or a clausal theory T is the number of symbols that appear in T,
i.e., the number of all occurrences in T of predicate symbols, function symbols, and
variables.
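The notions of vars(F), substitution, groundness, and size can be sketched as follows. The encoding is an assumption of this example: variables are strings starting with an uppercase letter, constants are lowercase strings, and compound terms and atoms are tuples whose first element is the function or predicate symbol.

```python
def is_variable(t):
    return isinstance(t, str) and t[:1].isupper()

def variables(t):
    """vars(F): the set of variables occurring in a term or atom."""
    if is_variable(t):
        return {t}
    if isinstance(t, tuple):
        return set().union(*(variables(arg) for arg in t[1:]))
    return set()

def apply_substitution(theta, t):
    """Compute F theta, replacing all variables simultaneously."""
    if is_variable(t):
        return theta.get(t, t)
    if isinstance(t, tuple):
        return (t[0],) + tuple(apply_substitution(theta, arg) for arg in t[1:])
    return t

def is_ground(t):
    return not variables(t)

def size(t):
    """Number of occurrences of predicate symbols, function symbols, and variables."""
    if isinstance(t, tuple):
        return 1 + sum(size(arg) for arg in t[1:])
    return 1

atom = ("daughter", "X", "Y")
theta = {"X": "mary", "Y": "ann"}
ground_atom = apply_substitution(theta, atom)   # daughter(mary, ann)
nested = ("even", ("s", ("s", "X")))            # even(s(s(X)))
```

Because the replacement is simultaneous, the substitution {X/Y, Y/X} swaps the two arguments of an atom rather than collapsing them into one variable.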
3.2.2.2 Semantics: Model Theory

Model theory is concerned with attributing meaning (truth-value) to sentences (well-formed formulae) in a first-order language. Informally, the sentence is mapped to
some statement about a chosen domain through a process known as interpretation.
An interpretation is determined by the set of ground facts (ground atomic formulae)
to which it assigns the value true. Sentences involving variables and quantifiers are
interpreted by using the truth-values of the ground atomic formulae and a fixed set
of rules for interpreting logical operations and quantifiers, such as "¬F is true if and
only if F is false."
An interpretation which gives the value true to a sentence is said to satisfy the
sentence; such an interpretation is called a model for the sentence. An interpretation
which does not satisfy a sentence is called a counter-model for that sentence. By
extension, we also have the notion of a model (counter-model) for a set of sentences
(e.g., for a clausal theory): an interpretation is a model for the set if and only if it
is a model for each of the set's members. A sentence (set of sentences) is satisfiable
if it has at least one model; otherwise it is unsatisfiable.

A sentence F logically implies a sentence G if and only if every model for F is
also a model for G. We denote this by F ⊨ G. Alternatively, we say that G is a
logical (or semantic) consequence of F. By extension, we have the notion of logical
implication between sets of sentences.
A Herbrand interpretation over a first-order alphabet is a set of ground facts
constructed with the predicate symbols in the alphabet and the ground terms from
the corresponding Herbrand domain of function symbols; this is the set of ground
atoms considered to be true by the interpretation. A Herbrand interpretation I is
a model for a clause c if and only if for all substitutions θ such that cθ is ground,
body(c)θ ⊆ I implies head(c)θ ∩ I ≠ ∅. In that case, we say c is true in I. A Herbrand
interpretation I is a model for a clausal theory T if and only if it is a model for all
clauses in T. We say that I is a Herbrand model of c, respectively T.
Roughly speaking, the truth of a clause c in a (finite) interpretation I can
be determined by running the goal (query) body(c), not head(c) on a database
containing I, using a theorem prover such as Prolog [4]. If the query succeeds, the
clause is false in I; if it fails, the clause is true. Analogously, one can determine the
truth of a clause c in the minimal (least) Herbrand model of a theory T by running
the goal body(c), not head(c) on a database containing T.
To illustrate the above notions, consider the Herbrand interpretation i =
{parent(saso, filip), parent(maja, filip), son(filip, saso), son(filip, maja)}.
The clause c = parent(X, Y) ← son(Y, X) is true in i, i.e., i is a model of c.
On the other hand, i is not a model of the clause parent(X, X) ← (which means
that everybody is their own parent).
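The model check in this example can be reproduced mechanically. The sketch below handles only function-free clauses and assumes that the Herbrand domain consists of the constants passed in by the caller; atoms are tuples whose first element is the predicate symbol, and variables are strings starting with an uppercase letter. It enumerates all ground substitutions θ and looks for one where body(c)θ ⊆ I holds but head(c)θ ∩ I is empty.

```python
from itertools import product

def is_variable(t):
    return isinstance(t, str) and t[:1].isupper()

def substitute(atom, theta):
    return (atom[0],) + tuple(theta.get(arg, arg) for arg in atom[1:])

def is_model(interpretation, head, body, constants):
    """True iff no ground instance of the clause has a true body and a false head."""
    variables = sorted({arg for atom in head + body
                        for arg in atom[1:] if is_variable(arg)})
    for values in product(sorted(constants), repeat=len(variables)):
        theta = dict(zip(variables, values))
        body_true = all(substitute(b, theta) in interpretation for b in body)
        head_true = any(substitute(h, theta) in interpretation for h in head)
        if body_true and not head_true:
            return False          # this theta is a counterexample
    return True

i = {("parent", "saso", "filip"), ("parent", "maja", "filip"),
     ("son", "filip", "saso"), ("son", "filip", "maja")}
constants = {"saso", "maja", "filip"}

# c = parent(X, Y) <- son(Y, X)
c_is_true = is_model(i, [("parent", "X", "Y")], [("son", "Y", "X")], constants)
# parent(X, X) <-   (a clause with an empty body)
reflexive_is_true = is_model(i, [("parent", "X", "X")], [], constants)
```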
3.2.3 Semantics: Proof Theory

Proof theory focuses on (deductive) reasoning with logic programs. Whereas model
theory considers the assignment of meaning to sentences, proof theory considers
the generation of sentences (conclusions) from other sentences (premises). More
specifically, proof theory considers the derivability of sentences in the context of
some set of inference rules, i.e., rules for sentence derivation. Formally, an inference
system consists of an initial set S of sentences (axioms) and a set R of inference
rules.
Using the inference rules, we can derive new sentences from S and/or other
derived sentences. The fact that sentence s can be derived from S is denoted S ⊢ s.
A proof is a sequence s_1, s_2, ..., s_n, such that each s_i is either in S or derivable using
R from S and s_1, ..., s_{i-1}. Such a proof is also called a derivation or deduction. Note
that the above notions are of an entirely syntactic nature. They are directly relevant
to the computational aspects of automated deductive inference.
The set of inference rules R defines the derivability relation ⊢. A set of inference
rules is sound if the corresponding derivability relation is a subset of the logical
implication relation, i.e., for all S and s, if S ⊢ s, then S ⊨ s. It is complete if
the other direction of the implication holds, i.e., for all S and s, if S ⊨ s, then
S ⊢ s. The properties of soundness and completeness establish a relation between
the notions of syntactic (⊢) and semantic (⊨) entailment in logic programming and
first-order logic. When the set of inference rules is both sound and complete, the
two notions coincide.
Resolution comprises a single inference rule applicable to clausal-form logic. From
any two clauses having an appropriate form, resolution derives a new clause as their
consequence. For example, the clauses daughter(X, Y) ← female(X), parent(Y, X)
and female(sonja) ← resolve into daughter(sonja, Y) ← parent(Y, sonja). Resolution
is sound: every resolvent is implied by its parents. It is also refutation complete:
the empty clause is derivable by resolution from any set S of Horn clauses if
S is unsatisfiable.
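The resolution step in this example can be traced with a small sketch. It is deliberately restricted: it resolves a definite clause against a single fact, handles only function-free atoms, and assumes the two clauses share no variable names. The tuple encoding of atoms and the helper names are assumptions of this example, not a general resolution procedure.

```python
def is_variable(t):
    return isinstance(t, str) and t[:1].isupper()

def unify_atoms(a, b):
    """Most general unifier of two function-free atoms, or None if none exists."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    theta = {}

    def walk(t):
        # Follow variable bindings until an unbound variable or a constant.
        while is_variable(t) and t in theta:
            t = theta[t]
        return t

    for s, t in zip(a[1:], b[1:]):
        s, t = walk(s), walk(t)
        if s == t:
            continue
        if is_variable(s):
            theta[s] = t
        elif is_variable(t):
            theta[t] = s
        else:
            return None           # two different constants clash
    return {v: walk(v) for v in theta}

def substitute(atom, theta):
    return (atom[0],) + tuple(theta.get(arg, arg) for arg in atom[1:])

def resolve_with_fact(clause, fact):
    """Resolve the first body literal that unifies with the fact; return the resolvent."""
    head, body = clause
    for index, literal in enumerate(body):
        theta = unify_atoms(literal, fact)
        if theta is not None:
            rest = body[:index] + body[index + 1:]
            return substitute(head, theta), [substitute(b, theta) for b in rest]
    return None

daughter = (("daughter", "X", "Y"), [("female", "X"), ("parent", "Y", "X")])
resolvent = resolve_with_fact(daughter, ("female", "sonja"))
```

The resolvent corresponds to daughter(sonja, Y) ← parent(Y, sonja), as in the text.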

3.3 Inductive Logic Programming: Settings and Approaches
Logic programming as a subset of first-order logic is mostly concerned with deductive inference. ILP, on the other hand, is concerned with inductive inference. It
generalizes from individual instances/observations in the presence of background
knowledge, finding regularities/hypotheses about yet unseen instances.
In this section, we discuss the different ILP settings as well as the different
relational learning tasks, starting with the induction of logic programs (sets of
relational rules). We also discuss the two major approaches to solving relational
learning tasks, namely transforming relational problems to propositional form and
upgrading propositional algorithms to a relational setting.
3.3.1 Logical Settings for Concept Learning

One of the most basic and most often considered tasks in machine learning is the
task of inductive concept learning (table 3.2). Given U, a universal set of objects
(observations), a concept C is a subset of objects in U, C ⊆ U. For example, if U is
the set of all patients in a given hospital, C could be the set of all patients diagnosed
with hepatitis A. The task of inductive concept learning is defined as follows: Given
instances and non-instances of concept C, find a hypothesis (classifier) H able to
tell whether x ∈ C, for each x ∈ U.
To define the task of inductive concept learning more precisely, we need to specify
U, the space of instances (examples), as well as the space of hypotheses considered.
This is done through specifying the languages of examples (LE) and concept
descriptions (LH). In addition, a coverage relation covers(H, e) has to be specified,
which tells us when an example e is considered to belong to the concept represented
by hypothesis H. Examples that belong to the target concept are termed positive;
those that do not are termed negative. Given positive and negative examples, we
want hypotheses that are complete (cover all positive examples) and consistent (do
not cover negative examples).

Table 3.2 The task of inductive concept learning

Given:
    a language of examples LE
    a language of concept descriptions LH
    a covers relation between LH and LE, defining when
    an example e is covered by a hypothesis H: covers(H, e)
    sets of positive P and negative N examples described in LE
Find a hypothesis H from LH, such that
    completeness: H covers all positive examples p ∈ P
    consistency: H does not cover any negative example n ∈ N

Looking at concept learning in a logical framework, De Raedt [11] considers
three settings for concept learning. The key aspect that varies in these settings is
the notion of coverage, but the languages LE and LH vary as well. We characterize
these for each of the three settings below.
In learning from entailment, the coverage relation is defined as covers(H, e) iff
H ⊨ e. The hypothesis logically entails the example. Here H is a clausal theory
and e is a clause.
In learning from interpretations, we have covers(H, e) iff e is a model of H. The
example has to be a model of the hypothesis. H is a clausal theory and e is a
Herbrand interpretation.
In learning from satisfiability, covers(H, e) iff H ∧ e ⊭ □. The example and the
hypothesis taken together have to be satisfiable. Here both H and e are clausal
theories.
The setting of learning from entailment, introduced by Muggleton [34], is the
one that has received the most attention in the field of ILP. The alternative ILP
setting of learning from interpretations was proposed by De Raedt and Džeroski
[14]: this setting is a natural generalization of propositional learning. Many learning
algorithms for propositional learning have been upgraded to the learning from
interpretations ILP setting. Finally, the setting of learning from satisfiability was
introduced by Wrobel and Džeroski [50], but has rarely been used in practice due
to computational complexity problems.
De Raedt [11] also discusses the relationships among the three settings for concept
learning. Learning from finite interpretations reduces to learning from entailment.
Learning from entailment reduces to learning from satisfiability. Learning from
interpretations is thus the easiest and learning from satisfiability the hardest of the
three settings.
As introduced above, the logical settings for concept learning do not take into
account background knowledge, one of the essential ingredients of ILP. However,
the definitions of the settings are easily extended to take it into account. Given
background knowledge B, which in its most general form can be a clausal theory,
the definition of coverage should be modified by replacing H with B ∧ H for all
three settings.
3.3.2 The ILP Task of Relational Rule Induction

The most commonly addressed task in ILP is the task of learning logical definitions
of relations [40], where tuples that belong or do not belong to the target relation
are given as examples. From training examples ILP then induces a logic program
(predicate definition) corresponding to a view that defines the target relation in
terms of other relations that are given as background knowledge. This classical
ILP task is addressed, for instance, by the seminal MIS system [44] (rightfully
considered as one of the most influential ancestors of ILP) and one of the best
known ILP systems FOIL [40].
Given is a set of examples, i.e., tuples that belong to the target relation p (positive
examples) and tuples that do not belong to p (negative examples). Given are also
background relations (or background predicates) qi that constitute the background
knowledge and can be used in the learned definition of p. Finally, a hypothesis
language, specifying syntactic restrictions on the definition of p, is also given (either
explicitly or implicitly). The task is to find a definition of the target relation p that
is consistent and complete, i.e., explains all the positive and none of the negative
tuples.
Formally, given is a set of examples E = P ∪ N, where P contains positive and N
negative examples, and background knowledge B. The task is to find a hypothesis
H such that ∀e ∈ P : B ∧ H ⊨ e (H is complete) and ∀e ∈ N : B ∧ H ⊭ e (H
is consistent), where ⊨ stands for logical implication or entailment. This setting,
introduced by Muggleton [34] (and discussed in the previous section), is thus also
called learning from entailment.
In the most general formulation, each e, as well as B and H, can be a clausal
theory. In practice, each e is most often a ground example (tuple), B is a relational
database (which may or may not contain views), and H is a definite logic program.
The semantic entailment (⊨) is in practice replaced with syntactic entailment (⊢) or
provability, where the resolution inference rule (as implemented in Prolog) is most
often used to prove examples from a hypothesis and the background knowledge. In
learning from entailment, a positive fact is explained if it can be found among the
answer substitutions for h produced by a query ?- b on database B, where h ← b
is a clause in H. In learning from interpretations, a clause h ← b from H is true in
the minimal Herbrand model of B if the query b ∧ ¬h fails on B.
As an illustration, consider the task of defining relation daughter(X, Y), which
states that person X is a daughter of person Y, in terms of the background knowledge
relations female and parent. These relations are given in table 3.3. There are two
positive and two negative examples of the target relation daughter. In the hypothesis
language of definite program clauses it is possible to formulate the following
definition of the target relation:
daughter(X, Y) ← female(X), parent(Y, X),
which is consistent and complete with respect to the background knowledge and
the training examples.

Table 3.3 A simple ILP problem: learning the daughter relation. Positive examples
are denoted by ⊕ and negative by ⊖

Training examples            Background knowledge
daughter(mary, ann). ⊕       parent(ann, mary).    female(ann).
daughter(eve, tom).  ⊕       parent(ann, tom).     female(mary).
daughter(tom, ann).  ⊖       parent(tom, eve).     female(eve).
daughter(eve, ann).  ⊖       parent(tom, ian).
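
The completeness and consistency of this definition can be checked directly against the data of table 3.3. The sketch below hard-codes coverage for two candidate clauses by evaluating their bodies on the background facts; a real ILP system would instead prove each example from B and H by resolution. The overly general second hypothesis is an addition made here for contrast.

```python
# Background knowledge and training examples from table 3.3.
female = {"ann", "mary", "eve"}
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
positives = [("mary", "ann"), ("eve", "tom")]
negatives = [("tom", "ann"), ("eve", "ann")]

def covers(x, y):
    """daughter(X, Y) <- female(X), parent(Y, X)."""
    return x in female and (y, x) in parent

def covers_too_general(x, y):
    """daughter(X, Y) <- female(X): a hypothetical, overly general clause."""
    return x in female

def is_complete(hypothesis):
    return all(hypothesis(x, y) for x, y in positives)

def is_consistent(hypothesis):
    return not any(hypothesis(x, y) for x, y in negatives)

result = (is_complete(covers), is_consistent(covers))
general_result = (is_complete(covers_too_general), is_consistent(covers_too_general))
```

The overly general clause is complete but covers the negative example daughter(eve, ann), so it is not consistent.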

In general, depending on the background knowledge, the hypothesis language, and
the complexity of the target concept, the target predicate definition may consist of
a set of clauses, such as
daughter(X, Y) ← female(X), mother(Y, X),
daughter(X, Y) ← female(X), father(Y, X),
if the relations mother and father were given in the background knowledge instead
of the parent relation.
The hypothesis language is typically a subset of the language of program clauses.
As the complexity of learning grows with the expressiveness of the hypothesis language, restrictions have to be imposed on hypothesized clauses. Typical restrictions
are the exclusion of recursion and restrictions on variables that appear in the body
of the clause but not in its head (so-called new variables).
Declarative bias [38] explicitly specifies the language of hypotheses (clauses)
considered by the ILP system at hand. This is input to the learning system (and
not hard-wired in the learning algorithm). Various types of declarative bias have
been used by different ILP systems, such as argument types and input/output
modes, parameterized language bias (e.g., maximum number of variables, literals,
depth of variables, arity, etc.), clause templates, and grammars. For example, a
suitable clause template for learning family relationships would be P(X, Y) ←
Q(X, Z), R(Z, Y). Here P, Q, and R are second-order variables that can be replaced
by predicates, e.g., grandmother, mother, and parent. The same template can be
used to learn the notions of grandmother and grandfather.
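As a small illustration of how a clause template delimits the hypothesis space, the sketch below enumerates all clauses obtained by replacing the second-order variables P, Q, and R with candidate predicates. The candidate lists and the tuple encoding are assumptions of this example, and the sketch assumes each second-order variable occurs once in the template.

```python
from itertools import product

# Template P(X, Y) <- Q(X, Z), R(Z, Y); P, Q, R are second-order variables.
TEMPLATE = (("P", "X", "Y"), [("Q", "X", "Z"), ("R", "Z", "Y")])

def instantiate(template, candidates):
    """Yield every clause obtained by filling the predicate slots of the template."""
    head, body = template
    slots = [head[0]] + [atom[0] for atom in body]
    for choice in product(*(candidates[slot] for slot in slots)):
        binding = dict(zip(slots, choice))
        yield ((binding[head[0]],) + head[1:],
               [(binding[atom[0]],) + atom[1:] for atom in body])

# Hypothetical candidate predicates for each slot.
candidates = {
    "P": ["grandmother", "grandfather"],
    "Q": ["mother", "father"],
    "R": ["parent"],
}
hypotheses = list(instantiate(TEMPLATE, candidates))
```

Among the four clauses generated is grandmother(X, Y) ← mother(X, Z), parent(Z, Y), the instantiation mentioned in the text.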


3.3.3 Other Tasks of Relational Learning

Initial efforts in ILP focused on relational rule induction, more precisely on concept
learning in first-order logic and synthesis of logic programs; cf. [34]. An overview
of early work is given in the textbook on ILP by Lavrač and Džeroski [30].
Representative early ILP systems addressing this task are Cigol [36], Foil [40],
Golem [37], and Linus [29]. More recent representative ILP systems are Progol
[35] and Aleph [46].
State-of-the-art ILP approaches now span most of the spectrum of data mining
tasks and use a variety of techniques to address these. The distinguishing features
of using multiple relations directly and discovering patterns expressed in first-order
logic are present throughout: the ILP approaches can thus be viewed as upgrades
of traditional approaches. Van Laer and De Raedt [48] (chapter 10 of [15]) present
a case study of upgrading a propositional approach to classification rule induction
to first-order logic. Note, however, that upgrading to first-order logic is non-trivial:
the expressive power of first-order logic implies computational costs and much work
is needed in balancing the expressive power of the pattern languages used and the
computational complexity of the data mining algorithm looking for such patterns.
This search for a balance between the two has occupied much of the ILP research
in the last ten years.
Present ILP approaches to multi-class classification involve the induction of
relational classification rules (ICL [48]), as well as first-order logical decision trees in
Tilde [3] and S-Cart [26]. ICL upgrades the propositional rule inducer CN2 [6].
Tilde and S-Cart upgrade decision tree induction as implemented in C4.5 [41] and
Cart [5]. A nearest-neighbor approach to relational classification is implemented
in Ribl [21] and its successor Ribl2.
Relational regression approaches upgrade propositional regression tree and rule
approaches. Tilde and S-Cart, as well as Ribl2, can handle continuous classes.
Fors [23] learns decision lists (ordered sets of rules) for relational regression.
The main nonpredictive or descriptive data mining tasks are clustering and
discovery of association rules. These have also been addressed in a first-order
logic setting. The Ribl distance measure has been used to perform hierarchical
agglomerative clustering in Rdbc, as well as k-means clustering (see section 3.7).
Section 3.6 describes a relational approach to the discovery of frequent queries and
query extensions, a first-order version of association rules.
With such a wide arsenal of RDM techniques, there is also a variety of practical
applications. ILP has been successfully applied to discover knowledge from relational data and background knowledge in the areas of molecular biology (including
drug design, protein structure prediction, and functional genomics), environmental sciences, traffic control, and natural language processing. An overview of such
applications is given by Džeroski [19] (chapter 14 of [15]).

3.3.4

Transforming ILP Problems to Propositional Form

One of the early approaches to ILP, implemented in the ILP system LINUS [29],
is based on the idea that the use of background knowledge can introduce new
attributes for learning. The learning problem is transformed from relational to attribute-value form and solved by an attribute-value learner. An advantage of this
approach is that data mining algorithms that work on a single table (and this
is the majority of existing data mining algorithms) become applicable after the
transformation.
This approach, however, is feasible only for a restricted class of ILP problems.
Thus, the hypothesis language of LINUS is restricted to function-free program
clauses which are typed (each variable is associated with a predetermined set of
values), constrained (all variables in the body of a clause also appear in the head),
and nonrecursive (the predicate symbol in the head does not appear in any of the
literals in the body).
The LINUS algorithm, which solves ILP problems by transforming them into
propositional form, consists of the following three steps:
The learning problem is transformed from relational to attribute-value form.
The transformed learning problem is solved by an attribute-value learner.
The induced hypothesis is transformed back into relational form.
The above algorithm allows for a variety of approaches developed for propositional problems, including noise-handling techniques in attribute-value algorithms,
such as CN2 [7], to be used for learning relations. It is illustrated on the simple
ILP problem of learning family relations. The task is to define the target relation
daughter(X, Y), which states that person X is a daughter of person Y, in terms of
the background knowledge relations female, male, and parent.
Table 3.4 Nonground background knowledge for learning the daughter relation

Training examples           Background knowledge
daughter(mary, ann).  ⊕     parent(X, Y) ← mother(X, Y).   mother(ann, mary).   female(ann).
daughter(eve, tom).   ⊕     parent(X, Y) ← father(X, Y).   mother(ann, tom).    female(mary).
daughter(tom, ann).   ⊖                                    father(tom, eve).    female(eve).
daughter(eve, ann).   ⊖                                    father(tom, ian).

All the variables are of the type person, defined as person = {ann, eve,
ian, mary, tom}. There are two positive and two negative examples of the target
relation. The training examples and the relations from the background knowledge
are given in table 3.3. However, since the LINUS approach can use nonground
background knowledge, let us assume that the background knowledge from table 3.4
is given.



Table 3.5 Propositional form of the daughter relation problem

Class  Variables    Propositional features
C      X     Y      f(X)   f(Y)   m(X)   m(Y)   p(X,X)  p(X,Y)  p(Y,X)  p(Y,Y)
⊕      mary  ann    true   true   false  false  false   false   true    false
⊕      eve   tom    true   false  false  true   false   false   true    false
⊖      tom   ann    false  true   true   false  false   false   true    false
⊖      eve   ann    true   true   false  false  false   false   false   false

The first step of the algorithm, i.e., the transformation of the ILP problem into
attribute-value form, is performed as follows. The possible applications of the background predicates on the arguments of the target relation are determined, taking
into account argument types. Each such application introduces a new attribute. In
our example, all variables are of the same type person. The corresponding attribute-value learning problem is given in table 3.5, where f stands for female, m for male,
and p for parent. The attribute-value tuples are generalizations (relative to the given
background knowledge) of the individual facts about the target relation.
In table 3.5, variables stand for the arguments of the target relation, and propositional features denote the newly constructed attributes of the propositional learning
task. When learning function-free clauses, only the new attributes (propositional
features) are considered for learning.
In the second step, an attribute-value learning program induces the following
if-then rule from the tuples in table 3.5:
Class = ⊕ if [female(X) = true] ∧ [parent(Y, X) = true]
In the last step, the induced if-then rules are transformed into clauses. In our
example, we get the following clause:
daughter(X, Y) ← female(X), parent(Y, X).
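The first and last steps of this example can be sketched in a few lines of Python. This is only an illustrative sketch, not the LINUS implementation: the function names propositionalize and rule are hypothetical, and the male facts are an assumption (they belong to table 3.3, which is not reproduced here, but they agree with the m(X) and m(Y) columns of table 3.5).

```python
from itertools import product

# Background knowledge from table 3.4.
mother = {("ann", "mary"), ("ann", "tom")}
father = {("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}
male = {"tom", "ian"}        # assumption: consistent with table 3.5
parent = mother | father     # parent(X, Y) holds if mother(X, Y) or father(X, Y)

# Training examples as (X, Y, is_positive).
examples = [("mary", "ann", True), ("eve", "tom", True),
            ("tom", "ann", False), ("eve", "ann", False)]

def propositionalize(x, y):
    """Apply every background predicate to the target arguments X and Y."""
    args = {"X": x, "Y": y}
    row = {}
    for name, rel in (("f", female), ("m", male)):
        for v in ("X", "Y"):
            row[f"{name}({v})"] = args[v] in rel
    for v1, v2 in product("XY", repeat=2):
        row[f"p({v1},{v2})"] = (args[v1], args[v2]) in parent
    return row

# Step 1: one attribute-value tuple per training example (table 3.5).
table = [(propositionalize(x, y), label) for x, y, label in examples]

def rule(row):
    """Step 2 result: Class = positive if f(X) = true and p(Y,X) = true."""
    return row["f(X)"] and row["p(Y,X)"]

for (x, y, label), (row, _) in zip(examples, table):
    print(x, y, "predicted:", rule(row), "actual:", label)
```

The rule separates the two positive from the two negative tuples, which is what allows it to be read back as the clause daughter(X, Y) ← female(X), parent(Y, X).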
The LINUS approach has been extended to handle determinate clauses [16, 30],
which allow the introduction of determinate new variables (which have a unique
value for each training example). There also exist a number of other approaches to
propositionalization, some of them very recent: an overview is given by Kramer et
al. [28] (chapter 11 of [15]).
Let us emphasize again, however, that it is in general not possible to transform an
ILP problem into a propositional (attribute-value) form efficiently. De Raedt [12]
treats the relation between attribute-value learning and ILP in detail, showing that
propositionalization of some more complex ILP problems is possible, but results
in attribute-value problems that are exponentially large. This has also been the
main reason for the development of a variety of new RDM and ILP techniques by
upgrading propositional approaches.
3.3.5

Upgrading Propositional Approaches

ILP/RDM algorithms have many things in common with propositional learning
algorithms. In particular, they share the learning as search paradigm, i.e., they
search for patterns valid in the given data. The key differences lie in the representation of data and patterns, refinement operators/generality relationships, and
testing coverage (i.e., whether a rule explains an example).
Van Laer and De Raedt [48] explicitly formulate a recipe for upgrading propositional algorithms to deal with relational data and patterns. The key idea is to keep
as much of the propositional algorithm as possible and upgrade only the key notions. For rule induction, the key notions are the refinement operator and coverage
relationship. For distance-based approaches, the notion of distance is the key one.
By carefully upgrading the key notions of a propositional algorithm, an RDM/ILP
algorithm can be developed that has the original propositional algorithm as a special case.
The recipe has been followed (more or less exactly) to develop ILP systems for
rule induction, well before it was formulated explicitly. The well-known FOIL [40]
system can be seen as an upgrade of the propositional rule induction program CN2
[7]. Another well-known ILP system, PROGOL [35], can be viewed as upgrading
the AQ approach [33] to rule induction.
More recently, the upgrading approach has been used to develop a number of
RDM approaches that address data mining tasks other than binary classification.
These include the discovery of frequent Datalog patterns and relational association
rules [9] (chapter 8 of [15]), [8], the induction of relational decision trees (structural
classification and regression trees [27] and first-order logical decision trees [3]), and
relational distance-based approaches to classification and clustering ([25], chapter
9 of [15], [21]). The algorithms developed have as special cases well-known propositional algorithms, such as the APRIORI algorithm for finding frequent patterns;
the CART and C4.5 algorithms for learning decision trees; k-nearest neighbor classification, hierarchical and k-medoids clustering. In the following two sections, we
briefly review how the propositional approaches for association rule discovery and
decision tree induction have been lifted to a relational framework, highlighting the
key differences between the relational algorithms and their propositional counterparts.

3.4

Relational Classification Rules

The first and still most commonly addressed problem in ILP is the one of learning
logic programs (sets of relational rules for binary classification). This section first
describes the covering algorithm for inducing sets of rules, then the induction
of individual rules. In particular, we discuss how the space of rules/clauses is
structured and searched.
3.4.1

The Covering Approach to Relational Rule Induction

From a data mining perspective, the task described above is a binary classification
task, where one of two classes is assigned to the examples (tuples): ⊕ (positive) or
⊖ (negative). Classification is one of the most commonly addressed tasks within
the data mining community and includes approaches for rule induction. Rules can
be generated from decision trees [41] or induced directly [33, 6].

Table 3.6 Top-down (general-to-specific) search of refinement graphs

hypothesis H := ∅
repeat {covering}
    clause c := p(X1, ..., Xn) ←
    repeat {specialization}
        build the set S of all refinements of c
        c := the best element of S (according to a heuristic)
    until stopping criterion is satisfied (B ∪ H ∪ {c} is consistent)
    add c to H
    delete all examples from P entailed by B ∪ H ∪ {c}
until stopping criterion is satisfied (B ∪ H ∪ {c} is complete)

ILP systems dealing with the classification task typically adopt the covering
approach of rule induction systems (table 3.6). In a main loop, a covering algorithm
constructs a set of clauses. Starting from an empty set of clauses, it constructs a
clause explaining some of the positive examples, adds this clause to the hypothesis,
and removes the positive examples explained. These steps are repeated until all
positive examples have been explained (the hypothesis is complete).
In the inner loop of the covering algorithm, individual clauses are constructed by
(heuristically) searching the space of possible clauses, structured by a specialization
or generalization operator. Typically, search starts with a very general rule (clause
with no conditions in the body), then proceeds to add literals (conditions) to this
clause until it only covers (explains) positive examples (the clause is consistent).
When dealing with incomplete or noisy data, which is most often the case, the
criteria of consistency and completeness are relaxed. Statistical criteria are typically
used instead. These are based on the number of positive and negative examples
explained by the definition and the individual constituent clauses.
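The two nested loops of the covering approach can be sketched as follows for the daughter problem. This is a simplified sketch under stated assumptions, not the algorithm of any particular ILP system: candidate literals are restricted to the head variables X and Y (no new variables), the heuristic is the fraction of covered examples that are positive, and the names LITERALS, covers, precision, learn_clause, and covering are all hypothetical.

```python
female = {"ann", "mary", "eve"}
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}

# Candidate body literals over the head variables X and Y only
# (a simplification: no new variables are introduced).
LITERALS = {
    "female(X)": lambda x, y: x in female,
    "female(Y)": lambda x, y: y in female,
    "parent(X, Y)": lambda x, y: (x, y) in parent,
    "parent(Y, X)": lambda x, y: (y, x) in parent,
}

def covers(body, example):
    """A clause covers an example if every body literal holds for it."""
    return all(LITERALS[lit](*example) for lit in body)

def precision(body, pos, neg):
    """Assumed heuristic: share of covered examples that are positive."""
    p = sum(covers(body, e) for e in pos)
    n = sum(covers(body, e) for e in neg)
    return p / (p + n) if p + n else 0.0

def learn_clause(pos, neg):
    """Inner loop: greedily specialize until no negative example is covered."""
    body = []
    while any(covers(body, e) for e in neg):
        candidates = [lit for lit in LITERALS if lit not in body]
        if not candidates:
            break
        best = max(candidates, key=lambda lit: precision(body + [lit], pos, neg))
        body.append(best)
    return body

def covering(pos, neg):
    """Outer loop: add clauses until every positive example is covered."""
    hypothesis, remaining = [], list(pos)
    while remaining:
        body = learn_clause(remaining, neg)
        covered = [e for e in remaining if covers(body, e)]
        if not covered:      # no progress: stop rather than loop forever
            break
        hypothesis.append(body)
        remaining = [e for e in remaining if e not in covered]
    return hypothesis

pos = [("mary", "ann"), ("eve", "tom")]
neg = [("tom", "ann"), ("eve", "ann")]
print(covering(pos, neg))
```

On these four examples a single clause is enough, so the outer loop runs once; with noisy data the two stopping tests would be replaced by statistical criteria, as noted above.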
3.4.2

Structuring the Space of Clauses

Having described how to learn sets of clauses by using the covering algorithm for
clause/rule set induction, let us now look at some of the mechanisms underlying
single clause/rule induction. In order to search the space of relational rules (program
clauses) systematically, it is useful to impose some structure upon it, e.g., an
ordering. One such ordering is based on θ-subsumption, defined below.
A substitution θ = {V1/t1, ..., Vn/tn} is an assignment of terms ti to variables
Vi. Applying a substitution θ to a term, atom, or clause F yields the instantiated
term, atom, or clause Fθ where all occurrences of the variables Vi are simultaneously
replaced by the term ti. Let c and c′ be two program clauses. Clause c θ-subsumes
c′ if there exists a substitution θ, such that cθ ⊆ c′ [39].
To illustrate the above notions, consider the clause c = daughter(X, Y) ←
parent(Y, X). Applying the substitution θ = {X/mary, Y/ann} to clause c yields
cθ = daughter(mary, ann) ← parent(ann, mary).
Clauses can be viewed as sets of literals: the clausal notation daughter(X, Y) ←
parent(Y, X) thus stands for {daughter(X, Y), ¬parent(Y, X)} where all variables
are assumed to be universally quantified, ¬ denotes logical negation, and the commas denote disjunction. According to the definition, clause c θ-subsumes c′ if there is
a substitution θ that can be applied to c such that every literal in the resulting clause
occurs in c′. Clause c θ-subsumes c′ = daughter(X, Y) ← female(X), parent(Y, X)
under the empty substitution θ = ∅, since {daughter(X, Y), ¬parent(Y, X)} is a
proper subset of {daughter(X, Y), ¬female(X), ¬parent(Y, X)}. Furthermore, under the substitution θ = {X/mary, Y/ann}, clause c θ-subsumes the clause c′ =
daughter(mary, ann) ← female(mary), parent(ann, mary), parent(ann, tom).
θ-subsumption introduces a syntactic notion of generality. Clause c is at least
as general as clause c′ (c ≤ c′) if c θ-subsumes c′. Clause c is more general than
c′ (c < c′) if c ≤ c′ holds and c′ ≤ c does not. In this case, we say that c′ is a
specialization of c and c is a generalization of c′. If the clause c′ is a specialization
of c, then c′ is also called a refinement of c.
Under a semantic notion of generality, c is more general than c′ if c logically
entails c′ (c |= c′). If c θ-subsumes c′, then c |= c′. The reverse is not always true.
The syntactic, θ-subsumption-based, generality is computationally more feasible.
Namely, semantic generality is in general undecidable. Thus, syntactic generality is
frequently used in ILP systems.
The relation ≤ defined by θ-subsumption introduces a lattice on the set of
reduced clauses [39]: this enables ILP systems to prune large parts of the search
space. θ-subsumption also provides the basis for clause construction by top-down
searching of refinement graphs and bounding the search of refinement graphs
from below by using a bottom clause (which can be constructed as least general
generalizations, i.e., least upper bounds of example clauses in the θ-subsumption
lattice).
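The definition of θ-subsumption translates into a small backtracking search for a suitable substitution. The sketch below is illustrative only and rests on several assumptions: clauses are function-free, a literal is encoded as a (sign, predicate, arguments) triple with "+" for the head and "-" for negated body literals, variables are the terms that start with an uppercase letter, and the names is_var, match, and theta_subsumes are hypothetical.

```python
def is_var(term):
    """Prolog convention: variables start with an uppercase letter."""
    return term[:1].isupper()

def match(lit, target, theta):
    """Extend theta so that lit with theta applied equals target, else None."""
    (sign, pred, args), (tsign, tpred, targs) = lit, target
    if (sign, pred, len(args)) != (tsign, tpred, len(targs)):
        return None
    theta = dict(theta)
    for a, t in zip(args, targs):
        if is_var(a):
            if theta.setdefault(a, t) != t:   # clashes with earlier binding
                return None
        elif a != t:
            return None
    return theta

def theta_subsumes(c, d, theta=None):
    """Return a substitution theta with c*theta a subset of d, or None."""
    theta = {} if theta is None else theta
    c = list(c)
    if not c:
        return theta
    first, rest = c[0], c[1:]
    for target in d:                 # try every literal of d, backtracking
        extended = match(first, target, theta)
        if extended is not None:
            result = theta_subsumes(rest, d, extended)
            if result is not None:
                return result
    return None

# c  = daughter(X, Y) <- parent(Y, X)
c = [("+", "daughter", ("X", "Y")), ("-", "parent", ("Y", "X"))]
# c1 = daughter(X, Y) <- female(X), parent(Y, X)
c1 = c + [("-", "female", ("X",))]
# c2 = daughter(mary, ann) <- female(mary), parent(ann, mary), parent(ann, tom)
c2 = [("+", "daughter", ("mary", "ann")), ("-", "female", ("mary",)),
      ("-", "parent", ("ann", "mary")), ("-", "parent", ("ann", "tom"))]

print(theta_subsumes(c, c2))   # a substitution: c is at least as general as c2
print(theta_subsumes(c2, c))   # None: c2 does not theta-subsume c
```

Because the test succeeds in one direction only, c is more general than c2 in the sense defined above (c < c2). For c and c1 the substitution found maps each variable to itself, which corresponds to the empty substitution in the text.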
3.4.3

Searching the Space of Clauses

Most ILP approaches search the hypothesis space of program clauses in a top-down manner, from general to specific hypotheses, using a θ-subsumption-based
specialization operator. A specialization operator is usually called a refinement
operator [44]. Given a hypothesis language L, a refinement operator ρ maps a
clause c to a set of clauses ρ(c) which are specializations (refinements) of c:
ρ(c) = {c′ | c′ ∈ L, c < c′}.

Figure 3.1 Part of the refinement graph for the family relations problem.
[Figure: the root clause daughter(X, Y) ← has, among others, the refinements
daughter(X, Y) ← X = Y, daughter(X, Y) ← female(X), daughter(X, Y) ← parent(Y, X),
and daughter(X, Y) ← parent(X, Z). The clause daughter(X, Y) ← female(X) is refined
further into daughter(X, Y) ← female(X), female(Y) and
daughter(X, Y) ← female(X), parent(Y, X).]

A refinement operator typically computes only the set of minimal (most general)
specializations of a clause under θ-subsumption. It employs two basic syntactic
operations:
apply a substitution to the clause, and
add a literal to the body of the clause.
The hypothesis space of program clauses is a lattice, structured by the θ-subsumption generality ordering. In this lattice, a refinement graph can be defined
as a directed, acyclic graph in which nodes are program clauses and arcs correspond
to the basic refinement operations: substituting a variable with a term, and adding
a literal to the body of a clause.
Figure 3.1 depicts a part of the refinement graph for the family relations problem
defined in table 3.3, where the task is to learn a definition of the daughter relation
in terms of the relations female and parent.
At the top of the refinement graph (lattice) is the clause with an empty body
c = daughter(X, Y) ←. The refinement operator ρ generates the refinements of c,
which are of the form ρ(c) = {daughter(X, Y) ← L}, where L is one of the following
literals:
literals having as arguments the variables from the head of the clause: X = Y (applying a substitution X/Y), female(X), female(Y), parent(X, X), parent(X, Y),
parent(Y, X), and parent(Y, Y), and
literals that introduce a new distinct variable Z (Z ≠ X and Z ≠ Y) in the clause
body: parent(X, Z), parent(Z, X), parent(Y, Z), and parent(Z, Y).
This assumes that the language is restricted to definite clauses, hence literals of
the form not L are not considered; and nonrecursive clauses, hence literals with the
predicate symbol daughter are not considered.
The search for a clause starts at the top of the lattice, with the clause daughter(X, Y) ←
that covers all examples (positive and negative). Its refinements are then considered,
then their refinements in turn, and this is repeated until a clause is found which
covers only positive examples. In the example above, the clause daughter(X, Y) ←
female(X), parent(Y, X) is such a clause. Note that this clause can be reached
in several ways from the top of the lattice, e.g., by first adding female(X), then
parent(Y, X), or vice versa.
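The literals listed above can be enumerated mechanically. The following sketch is an assumption-laden illustration rather than the refinement operator of any specific system: literals are plain strings, the dictionary PREDICATES and the function refinements are hypothetical names, and, as in the list above, a new variable is introduced only in literals that also contain an existing variable.

```python
from itertools import combinations, product

PREDICATES = {"female": 1, "parent": 2}  # background predicates with arities

def refinements(clause_vars, body, new_var="Z"):
    """Candidate literals for refining a clause with variables clause_vars.

    Only definite, nonrecursive clauses are generated: no negated literals
    and no literal with the target predicate.
    """
    # Substitutions unifying two existing variables, written as X = Y.
    candidates = [f"{a} = {b}" for a, b in combinations(clause_vars, 2)]
    for pred, arity in PREDICATES.items():
        # Literals over the existing variables only.
        for args in product(clause_vars, repeat=arity):
            candidates.append(f"{pred}({', '.join(args)})")
        # Literals introducing exactly one new variable next to an old one.
        if arity > 1:
            for args in product(clause_vars + [new_var], repeat=arity):
                if args.count(new_var) == 1:
                    candidates.append(f"{pred}({', '.join(args)})")
    return [lit for lit in candidates if lit not in body]

for literal in refinements(["X", "Y"], []):
    print("daughter(X, Y) <-", literal)
```

For the clause with an empty body this yields the seven literals over X and Y plus the four literals with the new variable Z, so even this tiny language gives a branching factor of eleven at the root.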
The refinement graph is typically searched heuristically levelwise, using heuristics
based on the number of positive and negative examples covered by a clause. As
the branching factor is very large, greedy search methods are typically applied
which only consider a limited number of alternatives at each level. Hill-climbing
considers only one best alternative at each level, while beam search considers n
best alternatives, where n is the beam width. Occasionally, complete search is used,
e.g., A* best-first search or breadth-first search. This search can be bounded from
below by using so-called bottom clauses, which can be constructed by least general
generalization [37] or inverse resolution/entailment [35].

3.5

Relational Decision Trees


Decision tree induction is one of the major approaches to data mining. Upgrading
this approach to a relational setting has thus been of great importance. In this
section, we first look into what relational decision trees are, i.e., how they are
defined, then discuss how such trees can be induced from multi-relational data.
3.5.1

Relational Classification, Regression, and Model Trees

Without loss of generality, we can say the task of relational prediction is defined
by a two-place target predicate target(ExampleID, ClassVar), which has as arguments an example ID and the class variable, and a set of background knowledge
predicates/relations. Depending on whether the class variable is discrete or continuous, we talk about relational classification or regression. Relational decision trees
are one approach to solving this task.
An example of a relational decision tree is given in figure 3.3. It predicts the maintenance action A to be taken on machine M (maintenance(M, A)), based on parts
the machine contains (haspart(M, X)), their condition (worn(X)), and ease of replacement (irreplaceable(X)). The target predicate here is maintenance(M, A),
the class variable is A, and background knowledge predicates are haspart(M, X),
worn(X), and irreplaceable(X).

Figure 3.2 A relational regression tree for predicting the degradation time
LogHLT of a chemical compound C (target predicate degrades(C, LogHLT)).

atom(C, A1, cl)
  true:  bond(C, A1, A2, BT), atom(C, A2, n)
           true:  LogHLT = 7.82
           false: LogHLT = 7.51
  false: atom(C, A3, o)
           true:  LogHLT = 6.08
           false: LogHLT = 6.73

Relational decision trees have much the same structure as propositional decision trees. Internal nodes contain tests, while leaves contain predictions for the
class value. If the class variable is discrete/continuous, we talk about relational
classification/regression trees. For regression, linear equations may be allowed in
the leaves instead of constant class-value predictions: in this case we talk about
relational model trees.
The tree in figure 3.3 is a relational classification tree, while the tree in
figure 3.2 is a relational regression tree. The latter predicts the degradation
time (the logarithm of the mean half-life time in water [18]) of a chemical
compound from its chemical structure, where the latter is represented by the
atoms in the compound and the bonds between them. The target predicate is
degrades(C, LogHLT), the class variable LogHLT, and the background knowledge predicates are atom(C, AtomID, Element) and bond(C, A1, A2, BondType).
The test at the root of the tree atom(C, A1, cl) asks if the compound C has a
chlorine atom A1 and the test along the left branch checks whether the chlorine
atom A1 is connected to a nitrogen atom A2.
As can be seen from the above examples, the major difference between propositional and relational decision trees is in the tests that can appear in internal nodes.
In the relational case, tests are queries, i.e., conjunctions of literals with existentially
quantified variables, e.g., atom(C, A1, cl) and haspart(M, X), worn(X). Relational
trees are binary: each internal node has a left (yes) and a right (no) branch. If the
query succeeds, i.e., if there exists an answer substitution that makes it true, the
yes branch is taken.
It is important to note that variables can be shared among nodes, i.e., a variable
introduced in a node can be referred to in the left (yes) subtree of that node. For
example, the X in irreplaceable(X) refers to the machine part X introduced in the
root node test haspart(M, X), worn(X). Similarly, the A1 in bond(C, A1, A2, BT)
refers to the chlorine atom introduced in the root node atom(C, A1, cl). One cannot
refer to variables introduced in a node in the right (no) subtree of that node.
For example, referring to the chlorine atom A1 in the right subtree of the tree in
figure 3.2 makes no sense, as going along the right (no) branch means that the
compound contains no chlorine atoms.

Table 3.7 A decision list representation of the relational decision tree in figure 3.3

maintenance(M, A) ← haspart(M, X), worn(X),
    irreplaceable(X), !, A = send_back
maintenance(M, A) ← haspart(M, X), worn(X), !,
    A = repair_in_house
maintenance(M, A) ← A = no_maintenance

The actual test that has to be executed in a node is the conjunction of the
literals in the node itself and the literals on the path from the root of the tree
to the node in question. For example, the test in the node irreplaceable(X)
in figure 3.3 is actually haspart(M, X), worn(X), irreplaceable(X). In other
words, we need to send the machine back to the manufacturer for maintenance only if it has a part which is both worn and irreplaceable. Similarly,
the test in the node bond(C, A1, A2, BT), atom(C, A2, n) in figure 3.2 is in fact
atom(C, A1, cl), bond(C, A1, A2, BT), atom(C, A2, n). As a consequence, one cannot transform relational decision trees to logic programs in the fashion one clause
per leaf (unlike propositional decision trees, where a transformation one rule per
leaf is possible).
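How a relational tree is applied to an example can be sketched with a tiny query evaluator. The sketch below encodes the regression tree of figure 3.2 and, at every node, executes the conjunction of the tests on the yes branches taken so far and the test of the node itself. It is an illustration under stated assumptions: the compounds c1 to c4 and their atoms and bonds are invented toy data, facts are tuples, variables are the terms starting with an uppercase letter, and solve, predict, TREE, and FACTS are hypothetical names.

```python
def is_var(term):
    """Prolog convention: variables start with an uppercase letter."""
    return term[:1].isupper()

def solve(query, facts, theta):
    """Yield every answer substitution for a conjunction of atoms."""
    if not query:
        yield theta
        return
    (pred, *args), rest = query[0], query[1:]
    for fact in facts:
        if fact[0] != pred or len(fact) != len(args) + 1:
            continue
        new = dict(theta)
        if all(new.setdefault(a, v) == v if is_var(a) else a == v
               for a, v in zip(args, fact[1:])):
            yield from solve(rest, facts, new)

# Internal node: (test, yes-subtree, no-subtree); leaf: predicted LogHLT.
TREE = ([("atom", "C", "A1", "cl")],
        ([("bond", "C", "A1", "A2", "BT"), ("atom", "C", "A2", "n")], 7.82, 7.51),
        ([("atom", "C", "A3", "o")], 6.08, 6.73))

def predict(tree, facts, compound):
    path = []                          # tests on the yes branches taken so far
    while isinstance(tree, tuple):
        test, yes, no = tree
        query = path + test            # the actual test executed in this node
        if next(solve(query, facts, {"C": compound}), None) is not None:
            path, tree = query, yes    # variables stay shared in the yes subtree
        else:
            tree = no                  # nothing from this node carries over
    return tree

# Invented toy compounds (assumption, for illustration only).
FACTS = [
    ("atom", "c1", "a1", "cl"), ("atom", "c1", "a2", "n"),
    ("bond", "c1", "a1", "a2", "single"),
    ("atom", "c2", "b1", "cl"), ("atom", "c2", "b2", "c"),
    ("bond", "c2", "b1", "b2", "single"),
    ("atom", "c3", "d1", "o"),
    ("atom", "c4", "e1", "c"),
]

for compound in ("c1", "c2", "c3", "c4"):
    print(compound, predict(TREE, FACTS, compound))
```

Compound c2 has a chlorine atom, but that atom is bonded to carbon rather than nitrogen, so the second test fails only because A1 is the same variable in both nodes.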
Table 3.8 A decision list representation of the relational regression tree for
predicting the biodegradability of a compound, given in figure 3.2
degrades(C, LogHLT) ← atom(C, A1, cl),
    bond(C, A1, A2, BT), atom(C, A2, n), LogHLT = 7.82, !
degrades(C, LogHLT) ← atom(C, A1, cl),
    LogHLT = 7.51, !
degrades(C, LogHLT) ← atom(C, A3, o),
    LogHLT = 6.08, !
degrades(C, LogHLT) ← LogHLT = 6.73.

Table 3.9 A logic program representation of the relational decision tree in figure 3.3
a(M) ← haspart(M, X), worn(X), irreplaceable(X)
b(M) ← haspart(M, X), worn(X)
maintenance(M, A) ← not b(M), A = no_maintenance
maintenance(M, A) ← b(M), not a(M), A = repair_in_house
maintenance(M, A) ← a(M), A = send_back


Relational decision trees can be easily transformed into first-order decision lists,
which are ordered sets of clauses (clauses in logic programs are unordered). When
applying a decision list to an example, we always take the first clause that applies
and return the answer produced. When applying a logic program, all applicable
clauses are used and a set of answers can be produced. First-order decision lists can
be represented by Prolog programs with cuts (!) [4]: cuts ensure that only the first
applicable clause is used.
A decision list is produced by traversing the relational regression tree in a depth-first fashion, going down left branches first. At each leaf, a clause is output that
contains the prediction of the leaf and all the conditions along the left (yes) branches
leading to that leaf. A decision list obtained from the tree in figure 3.3 is given in
table 3.7. For the first clause (send_back), the conditions in both internal nodes
are output, as the left branches out of both nodes have been followed to reach
the corresponding leaf. For the second clause, only the condition in the root is
output: to reach the repair_in_house leaf, the left (yes) branch out of the root has
been followed, but the right (no) branch out of the irreplaceable(X) node has been
followed. A decision list produced from the relational regression tree in figure 3.2
is given in table 3.8.
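The traversal just described is short enough to sketch directly. The code below is an illustration, not part of SCART or TILDE: the tree of figure 3.3 is encoded as nested tuples as described in the text, literals are plain strings, and tree_to_decision_list, to_prolog, and TREE are hypothetical names. The clauses are printed in Prolog syntax with a cut placed as in table 3.7.

```python
# Internal node: (test, yes-subtree, no-subtree); leaf: the predicted action.
TREE = (["haspart(M, X)", "worn(X)"],
        (["irreplaceable(X)"], "send_back", "repair_in_house"),
        "no_maintenance")

def tree_to_decision_list(tree, conditions=()):
    """Depth-first, left-first traversal that emits one clause per leaf.

    Only the tests on yes branches are collected; following a no branch
    adds nothing, because the clause order and the cuts take care of it.
    """
    if not isinstance(tree, tuple):                 # leaf
        return [(list(conditions), tree)]
    test, yes, no = tree
    return (tree_to_decision_list(yes, conditions + tuple(test))
            + tree_to_decision_list(no, conditions))

def to_prolog(conditions, action):
    """Render one clause of the decision list, with a cut after the tests."""
    body = conditions + ["!"] if conditions else []
    body.append(f"A = {action}")
    return "maintenance(M, A) :- " + ", ".join(body) + "."

for conditions, action in tree_to_decision_list(TREE):
    print(to_prolog(conditions, action))
```

The order of the output matters: the second clause is only correct because the first one, with its extra irreplaceable(X) condition, has already been tried.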
Generating a logic program from a relational decision tree is more complicated. It
requires the introduction of new predicates. We will not describe the transformation
process in detail, but rather give an example. A logic program, corresponding to
the tree in figure 3.3, is given in table 3.9.
3.5.2

Induction of Relational Decision Trees

The two major algorithms for inducing relational decision trees are upgrades of
the two most famous algorithms for inducing propositional decision trees. SCART
[26, 27] is an upgrade of CART [5], while TILDE [3, 13] is an upgrade of C4.5
[41]. According to the upgrading recipe, both SCART and TILDE have their
propositional counterparts as special cases. The actual algorithms thus closely follow
CART and C4.5. Here we illustrate the differences between SCART and CART by
looking at the TDIDT (top-down induction of decision trees) algorithm of SCART
(table 3.10).

Table 3.10 The TDIDT part of the SCART algorithm for inducing relational
decision trees

procedure DivideAndConquer(TestsOnYesBranchesSofar, DeclarativeBias, Examples)
if TerminationCondition(Examples)
then
    NewLeaf = CreateNewLeaf(Examples)
    return NewLeaf
else
    PossibleTestsNow = GenerateTests(TestsOnYesBranchesSofar, DeclarativeBias)
    BestTest = FindBestTest(PossibleTestsNow, Examples)
    (Split1, Split2) = SplitExamples(Examples, TestsOnYesBranchesSofar, BestTest)
    LeftSubtree = DivideAndConquer(TestsOnYesBranchesSofar ∧ BestTest, DeclarativeBias, Split1)
    RightSubtree = DivideAndConquer(TestsOnYesBranchesSofar, DeclarativeBias, Split2)
    return [BestTest, LeftSubtree, RightSubtree]

Given a set of examples, the TDIDT algorithm first checks if a termination
condition is satisfied, e.g., if all examples belong to the same class c. If yes, a
leaf is constructed with an appropriate prediction, e.g., assigning the value c to the
class variable. Otherwise a test is selected among the possible tests for the node at
hand, examples are split into subsets according to the outcome of the test, and tree
construction proceeds recursively on each of the subsets. A tree is thus constructed
with the selected test at the root and the subtrees resulting from the recursive calls
attached to the respective branches.
The major difference in comparison to the propositional case is in the possible
tests that can be used in a node. While in CART these remain (more or less)
the same regardless of where the node is in the tree (e.g., A = v or A < v for
each attribute and attribute value), in SCART the set of possible tests crucially
depends on the position of the node in the tree. In particular, it depends on the tests
along the path from the root to the current node, more precisely on the variables
appearing in those tests and the declarative bias. To emphasize this, we can think
of a GenerateTests procedure being separately employed before evaluating the
tests. The inputs to this procedure are the tests on positive branches from the root
to the current node and the declarative bias. These are also inputs to the top level
TDIDT procedure.
The declarative bias in SCART contains statements of the form
schema(CofL, TandM), where CofL is a conjunction of literals and TandM is a
list of type and mode declarations for the variables in those literals. Two such
statements, used in the induction of the regression tree in figure 3.2, are as follows:
schema((bond(V, W, X, Y), atom(V, X, Z)), [V:chemical:'+', W:atomid:'+',
X:atomid:'-', Y:bondtype:'-', Z:element:'=']), and schema(bond(V, W, X,
Y), [V:chemical:'+', W:atomid:'+', X:atomid:'-', Y:bondtype:'=']). In the
lists, each variable in the conjunction is followed by its type and mode declaration:
'+' denotes that the variable must be bound (i.e., appear in TestsOnYesBranchesSofar), '-' that it must not be bound, and '=' that it must be replaced by a constant
value.
Assuming we have taken the left branch out of the root in figure 3.2,
TestsOnYesBranchesSofar = atom(C, A1, cl). Taking the declarative bias with
the two schema statements above, the only choice for replacing the variables V
and W in the schemata are the variables C and A1, respectively. The possible
tests at this stage are thus of the form bond(C, A1, A2, BT), atom(C, A2, E),
where E is replaced with an element (such as cl - chlorine, s - sulphur, or n - nitrogen), or of the form bond(C, A1, A2, BT), where BT is replaced with a bond
type (such as single, double, or aromatic). Among the possible tests, the test
bond(C, A1, A2, BT), atom(C, A2, n) is chosen.
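The way the schemata and modes restrict the candidate tests can be sketched as follows. This is a hypothetical reading of the mode declarations, not SCART's actual GenerateTests procedure: the encoding of schemata as Python tuples, the names SCHEMATA, CONSTANTS, and generate_tests, and the explicit list of fresh variable names are all assumptions, and the constant lists contain only the elements and bond types named in the text.

```python
from itertools import product

# Each schema: (literals, declarations), a declaration is (variable, type, mode).
SCHEMATA = [
    ([("bond", "V", "W", "X", "Y"), ("atom", "V", "X", "Z")],
     [("V", "chemical", "+"), ("W", "atomid", "+"), ("X", "atomid", "-"),
      ("Y", "bondtype", "-"), ("Z", "element", "=")]),
    ([("bond", "V", "W", "X", "Y")],
     [("V", "chemical", "+"), ("W", "atomid", "+"), ("X", "atomid", "-"),
      ("Y", "bondtype", "=")]),
]
# Assumed constant values per type (only those mentioned in the text).
CONSTANTS = {"element": ["cl", "s", "n"],
             "bondtype": ["single", "double", "aromatic"]}

def generate_tests(bound_vars, schemata, fresh_names):
    """bound_vars maps the variables on the yes path to their types."""
    tests = []
    for literals, decls in schemata:
        fresh = iter(fresh_names)
        choices = []
        for var, typ, mode in decls:
            if mode == "+":      # an already bound variable of this type
                choices.append([v for v, t in bound_vars.items() if t == typ])
            elif mode == "-":    # a new, so far unbound variable
                choices.append([next(fresh)])
            else:                # "=": replaced by a constant of this type
                choices.append(CONSTANTS[typ])
        names = [var for var, _, _ in decls]
        for combo in product(*choices):
            sub = dict(zip(names, combo))
            tests.append([(p, *[sub[a] for a in args]) for p, *args in literals])
    return tests

# After taking the yes branch of atom(C, A1, cl): C and A1 are bound.
tests = generate_tests({"C": "chemical", "A1": "atomid"}, SCHEMATA, ["A2", "BT"])
for test in tests:
    print(test)
```

With only C and A1 bound, each '+' position has a single possible filler, so the candidate tests differ only in the constant chosen for the '=' position.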
The approaches to relational decision tree induction are among the fastest
multi-relational data mining approaches. They have been successfully applied to a
number of practical problems. These include learning to predict the biodegradability
of chemical compounds [18] and learning to predict the structure of diterpene
compounds from their nuclear magnetic resonance spectra [17].

3.6

Relational Association Rules


The discovery of frequent patterns and association rules is one of the most commonly studied tasks in data mining. Here we first describe frequent relational patterns (frequent Datalog patterns) and relational association rules (query extensions). We then look into how a well-known algorithm for finding frequent itemsets
has been upgraded to discover frequent relational patterns.
3.6.1 Frequent Datalog Queries and Query Extensions

Dehaspe and colleagues [8], [9] (chapter 8 of [15]) consider patterns in the form
of Datalog queries, which reduce to SQL queries. A Datalog query has the form
?- A1, A2, ..., An, where the Ai are logical atoms.
An example Datalog query is
?- person(X), parent(X, Y), hasPet(Y, Z).
This query on a Prolog database containing predicates person, parent, and hasPet
is equivalent to the SQL query
select Person.Id, Parent.Kid, HasPet.Aid
from Person, Parent, HasPet
where Person.Id = Parent.Pid
and Parent.Kid = HasPet.Pid
on a database containing relations Person with argument Id, Parent with
arguments Pid and Kid, and HasPet with arguments Pid and Aid. This query
finds triples (x, y, z), where child y of person x has pet z.
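
The correspondence between the Datalog query and the SQL query can be tried out on a small in-memory database. The following is a minimal sketch using Python's standard sqlite3 module; the table contents (ann, bob, carl, rex) are made up for illustration and are not taken from the chapter.

```python
import sqlite3

# Hypothetical toy data: ann is the parent of bob and carl; bob owns the pet rex.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Person (Id text);
    create table Parent (Pid text, Kid text);
    create table HasPet (Pid text, Aid text);
    insert into Person values ('ann'), ('bob'), ('carl');
    insert into Parent values ('ann', 'bob'), ('ann', 'carl');
    insert into HasPet values ('bob', 'rex');
""")

# SQL counterpart of ?- person(X), parent(X, Y), hasPet(Y, Z).
rows = conn.execute("""
    select Person.Id, Parent.Kid, HasPet.Aid
    from Person, Parent, HasPet
    where Person.Id = Parent.Pid
      and Parent.Kid = HasPet.Pid
""").fetchall()

print(rows)  # [('ann', 'bob', 'rex')]
```

Each returned row is one answering substitution for (X, Y, Z): person ann has child bob, who has pet rex.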
Datalog queries can be viewed as a relational version of itemsets (which are sets
of items occurring together). Consider the itemset {person, parent, child, pet}. The
market-basket interpretation of this pattern is that a person, a parent, a child, and
a pet occur together. This is also partly the meaning of the above query. However,
the variables X, Y , and Z add extra information: the person and the parent are
the same, the parent and the child belong to the same family, and the pet belongs
to the child. This illustrates the fact that queries are a more expressive variant of
itemsets.
To discover frequent patterns, we need to have a notion of frequency. Given
that we consider queries as patterns and that queries can have variables, it is not
immediately obvious what the frequency of a given query is. This is resolved by
specifying an additional parameter of the pattern discovery task, called the key. The
key is an atom which has to be present in all queries considered during the discovery
process. It determines what is actually counted. In the above query, if person(X) is
the key, we count persons; if parent(X, Y) is the key, we count (parent, child) pairs;
and if hasPet(Y, Z) is the key, we count (owner, pet) pairs. This is described more
precisely below.
Submitting a query Q = ?- A1, A2, ..., An with variables {X1, ..., Xm} to a Datalog
database r corresponds to asking whether a grounding substitution exists (which
replaces each of the variables in Q with a constant), such that the conjunction
A1, A2, ..., An holds in r. The answer to the query produces answering substitutions
θ = {X1/a1, ..., Xm/am} such that Qθ succeeds. The set of all answering
substitutions obtained by submitting a query Q to a Datalog database r is denoted
answerset(Q, r).
The absolute frequency of a query Q is the number of answer substitutions θ
for the variables in the key atom for which the query Qθ succeeds in the given
database, i.e., a(Q, r, key) = |{θ ∈ answerset(key, r) | Qθ succeeds w.r.t. r}|. The
relative frequency (support) can be calculated as f(Q, r, key) = a(Q, r, key)/|{θ ∈
answerset(key, r)}|. Assuming the key is person(X), the absolute frequency for
our query involving parents, children, and pets can be calculated by the following
SQL statement:
select count(distinct Person.Id)
from Person, Parent, HasPet
where Person.Id = Parent.Pid
and Parent.Kid = HasPet.Pid
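
To make the role of the key concrete, the frequency definitions can be sketched over relations stored as Python sets of tuples. The data and the function names absolute_frequency and relative_frequency are illustrative assumptions; the query and the key person(X) are the ones used in the text.

```python
# Hypothetical toy database: each relation is a set of tuples.
person = {("ann",), ("bob",), ("carl",), ("dora",)}
parent = {("ann", "bob"), ("ann", "carl"), ("dora", "ann")}
has_pet = {("bob", "rex"), ("carl", "tom")}

def absolute_frequency():
    """Number of key answers X for which
    ?- person(X), parent(X, Y), hasPet(Y, Z) succeeds."""
    return len({x
                for (x,) in person
                for (p, y) in parent if p == x
                for (owner, _pet) in has_pet if owner == y})

def relative_frequency():
    """Absolute frequency divided by the number of answers to ?- person(X)."""
    return absolute_frequency() / len(person)

print(absolute_frequency(), relative_frequency())  # 1 0.25
```

The query has two answering substitutions here, (ann, bob, rex) and (ann, carl, tom), but with person(X) as the key only the distinct person ann is counted, so the absolute frequency is 1.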
Association rules have the form A → C and the intuitive market-basket interpretation
"customers that buy A typically also buy C." If itemsets A and C have
supports fA and fC, respectively, the confidence of the association rule is defined
to be c(A→C) = fC/fA. The task of association rule discovery is to find all association
rules A → C, where fC and c(A→C) exceed prespecified thresholds (minsup and
minconf).
Association rules are typically obtained from frequent itemsets. Suppose we have
two frequent itemsets A and C, such that A ⊂ C, where C = A ∪ B. If the support
of A is fA and the support of C is fC, we can derive an association rule A → B,
which has confidence fC/fA. Treating the arrow as implication, note that we can
derive A → C from A → B (A → A and A → B implies A → A ∪ B, i.e., A → C).
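
As a small worked illustration of support and confidence, the following sketch computes both on a hypothetical list of market baskets. The transactions and item names are invented for the example; support here is the relative frequency of an itemset.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

A = {"bread"}
B = {"milk"}
C = A | B                               # C = A ∪ B

f_A = support(A)                        # 3 of 4 baskets contain bread
f_C = support(C)                        # 2 of 4 baskets contain bread and milk
confidence = f_C / f_A                  # confidence of the rule A → B

print(f_A, f_C, round(confidence, 3))   # 0.75 0.5 0.667
```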
Relational association rules can be derived in a similar manner from frequent
Datalog queries. From two frequent queries Q1 = ?- l1, ..., lm and Q2 = ?- l1, ..., lm,
lm+1, ..., ln, where Q2 θ-subsumes Q1, we can derive a relational association rule
Q1 → Q2. Since Q2 extends Q1, such a relational association rule is named a query
extension.


A query extension is thus an existentially quantified implication of the form
?- l1, ..., lm → ?- l1, ..., lm, lm+1, ..., ln (since variables in queries are existentially
quantified). A shorthand notation for the above query extension is ?- l1, ..., lm ⇝
lm+1, ..., ln. We call the query ?- l1, ..., lm the body and the subquery lm+1, ..., ln
the head of the query extension. Note, however, that the head of the query extension
does not correspond to its conclusion (which is ?- l1, ..., lm, lm+1, ..., ln).
Assume the queries Q1 = ?- person(X), parent(X, Y) and Q2 = ?- person(X),
parent(X, Y), hasPet(Y, Z) are frequent, with absolute frequencies of 40 and 30,
respectively. The query extension E, where E is defined as E = ?- person(X),
parent(X, Y) ⇝ hasPet(Y, Z), can be considered a relational association rule with
a support of 30 and confidence of 30/40 = 75%. Note the difference in meaning
between the query extension E and two obvious, but incorrect, attempts at defining
relational association rules. The clause person(X), parent(X, Y) → hasPet(Y, Z)
(which stands for the logical formula ∀X Y Z: person(X) ∧ parent(X, Y) →
hasPet(Y, Z)) would be interpreted as follows: "if a person has a child, then this
child has a pet." The implication ?- person(X), parent(X, Y) → ?- hasPet(Y, Z),
which stands for (∃X Y: person(X) ∧ parent(X, Y)) → (∃Y Z: hasPet(Y, Z)),
is trivially true if at least one person in the database has a pet. The correct
interpretation of the query extension E is: "if a person has a child, then this person
also has a child that has a pet."
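
The difference between the query extension and the clause reading can be checked on a few facts. This is a minimal sketch on invented data: it counts persons (the key) for the query extension and, for contrast, counts (parent, child) pairs for the clause reading. All names and values below are assumptions made for the illustration.

```python
# Hypothetical facts: ann has two children, only one of whom has a pet.
person = {"ann", "bob", "carl", "dora", "eve"}
parent = {("ann", "bob"), ("ann", "carl"), ("dora", "eve")}
has_pet = {("bob", "rex")}
pet_owners = {owner for (owner, _pet) in has_pet}

# Query extension E: ?- person(X), parent(X, Y) ~> hasPet(Y, Z), key person(X).
body = {x for (x, _y) in parent if x in person}           # persons with a child
conclusion = {x for (x, y) in parent
              if x in person and y in pet_owners}          # ... with a child that has a pet
confidence = len(conclusion) / len(body)

# Clause reading: every child of a person has a pet, judged per (X, Y) pair.
pairs = {(x, y) for (x, y) in parent if x in person}
clause_rate = len({(x, y) for (x, y) in pairs if y in pet_owners}) / len(pairs)

print(confidence, round(clause_rate, 3))  # 0.5 0.333
```

Person ann satisfies the conclusion of E even though her child carl has no pet, which is why the two readings give different numbers.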
3.6.2 Discovering Frequent Queries: WARMR

The task of discovering frequent queries is addressed by the RDM system WARMR
[8]. WARMR takes as input a database r, a frequency threshold minfreq, and
declarative language bias L. L specifies a key atom and input-output modes for
predicates/relations, discussed below.
WARMR upgrades the well-known APRIORI algorithm for discovering frequent
patterns, which performs levelwise search [1] through the lattice of itemsets.
APRIORI starts with the empty set of items and at each level l considers sets of items
of cardinality l. The key to the efficiency of APRIORI lies in the fact that a large
frequent itemset can only be generated by adding an item to a frequent itemset.
Candidates at level l + 1 are thus generated by adding items to frequent itemsets
obtained at level l. Further efficiency is achieved using the fact that all subsets of a
frequent itemset have to be frequent: only candidates that pass this test have their
frequency determined by scanning the database.
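
The levelwise search just described can be sketched in a few lines. This is a simplified illustration rather than the algorithm of [1] in full: it starts directly at level 1 (single items) instead of the empty set, uses absolute counts, and the function name apriori and its parameters are assumptions made here.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Levelwise search over itemsets.

    transactions: list of sets of items; min_support: absolute count threshold.
    Returns a dict mapping each frequent itemset (frozenset) to its count.
    """
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]          # level-1 candidates
    while level:
        # One scan of the database per level to count the candidates.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        # Level l+1 candidates: add one item to a frequent itemset of level l ...
        candidates = {c | {i} for c in current for i in items if i not in c}
        # ... and keep only those whose subsets of size l are all frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in current
                        for s in combinations(c, len(c) - 1))]
    return frequent

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori(baskets, min_support=2)
print(sorted("".join(sorted(s)) for s in result))  # ['a', 'ab', 'ac', 'b', 'c']
```

In the example, the candidate {a, b, c} is never counted against the database, because its subset {b, c} is not frequent.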
In analogy to APRIORI, WARMR searches the lattice of Datalog queries for
queries that are frequent in the given database r. In analogy to itemsets, a more
complex (specific) frequent query Q2 can only be generated from a simpler (more
general) frequent query Q1 (where Q1 is more general than Q2 if Q1 θ-subsumes
Q2; see section 3.4.2 for a definition of θ-subsumption). WARMR thus starts with
the query ?- key at level 1 and generates candidates for frequent queries at level
l + 1 by refining (adding literals to) frequent queries obtained at level l.

Table 3.11 An example specification of declarative language bias settings for WARMR

warmode_key(person(-)).
warmode(parent(+, -)).
warmode(hasPet(+, cat)).
warmode(hasPet(+, dog)).
warmode(hasPet(+, lizard)).

Suppose we are given a Prolog database containing the predicates person, parent,
and hasPet, and the declarative bias in table 3.11. The latter contains the key atom
person(X) and input-output modes for the relations parent and hasPet. Input-output
modes specify whether a variable argument of an atom in a query has to
appear earlier in the query (+), must not (−), or may, but need not (±). Input-output
modes thus place constraints on how queries can be refined, i.e., what atoms
may be added to a given query.
Given the above, WARMR starts the search of the refinement graph of queries
at level 1 with the query ?- person(X). At level 2, the literals parent(X, Y),
hasPet(X, cat), hasPet(X, dog), and hasPet(X, lizard) can be added to this query,
yielding the queries ?- person(X), parent(X, Y); ?- person(X), hasPet(X, cat);
?- person(X), hasPet(X, dog); and ?- person(X), hasPet(X, lizard). Taking the
first of the level 2 queries, the following literals are added to obtain level 3 queries:
parent(Y, Z) (note that parent(Y, X) cannot be added, because X already appears
in the query being refined), hasPet(Y, cat), hasPet(Y, dog), and hasPet(Y, lizard).
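
How the mode declarations of table 3.11 constrain refinement can be sketched as follows. This is a naive illustration, not WARMR's actual refinement operator: it enumerates every literal the modes allow, so for the level 3 step it also proposes literals on X (such as parent(X, Z)) that the walkthrough above does not list, and it makes no attempt to prune equivalent queries. The names MODES, variables, and refinements, and the representation of atoms as (predicate, arguments) pairs, are assumptions of this sketch.

```python
# Modes from table 3.11: '+' = variable must already occur in the query,
# '-' = a new variable; any other entry is a constant to be inserted as-is.
MODES = [
    ("parent", ("+", "-")),
    ("hasPet", ("+", "cat")),
    ("hasPet", ("+", "dog")),
    ("hasPet", ("+", "lizard")),
]

def variables(query):
    """Variables (uppercase names) of a query, in order of first occurrence."""
    seen = []
    for _pred, args in query:
        for a in args:
            if a[:1].isupper() and a not in seen:
                seen.append(a)
    return seen

def refinements(query):
    """All queries obtained by adding one mode-conforming atom to the query."""
    old = variables(query)
    # One fresh name suffices because no mode above has more than one '-'.
    fresh = next(v for v in "XYZUVW" if v not in old)
    result = []
    for pred, modes in MODES:
        partial = [()]
        for m in modes:                      # expand argument positions one by one
            if m == "+":
                choices = old
            elif m == "-":
                choices = [fresh]
            else:
                choices = [m]
            partial = [p + (c,) for p in partial for c in choices]
        for args in partial:
            atom = (pred, args)
            if atom not in query:
                result.append(query + [atom])
    return result

level1 = [("person", ("X",))]
level2 = refinements(level1)
print([q[-1] for q in level2])
```

For the level 1 query ?- person(X), the sketch adds exactly the four literals named in the text: parent(X, Y), hasPet(X, cat), hasPet(X, dog), and hasPet(X, lizard). Refining ?- person(X), parent(X, Y) yields parent(Y, Z) but never parent(Y, X), since the second argument of parent must be a new variable.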
While all subsets of a frequent itemset must be frequent in APRIORI, not all
subqueries of a frequent query need be frequent queries in WARMR. Consider the
query ?- person(X), parent(X, Y), hasPet(Y, cat) and assume it is frequent. The
subquery ?- person(X), hasPet(Y, cat) is not allowed, as it violates the declarative
bias constraint that the first argument of hasPet has to appear earlier in the query.
This causes some complications in pruning the generated candidates for frequent
queries: WARMR keeps a list of infrequent queries and checks whether the generated
candidates are subsumed by a query in this list. The WARMR algorithm is given
in table 3.12.
WARMR upgrades APRIORI to a multi-relational setting following the upgrading
recipe (see section 3.3.5). The major differences are in finding the frequency of
queries (where we have to count answer substitutions for the key atom) and the
candidate query generation (by using a refinement operator and declarative bias).
WARMR has APRIORI as a special case: if we only have predicates of zero arity
(with no arguments), which correspond to items, WARMR can be used to discover
frequent itemsets.
More importantly, WARMR has as special cases a number of approaches that
extend the discovery of frequent itemsets with, e.g., hierarchies on items [45], as
well as approaches to discovering sequential patterns [2], including general episodes
[32]. The individual approaches mentioned make use of the specific properties
of the patterns considered (very limited use of variables) and are more efficient
than WARMR for the particular tasks they address. The high expressive power
of the language of patterns considered has its computational costs, but it also has
the important advantage that a variety of different pattern types can be explored
without any changes in the implementation.

Table 3.12 The WARMR algorithm for discovering frequent Datalog queries.

Algorithm WARMR(r, L, key, minfreq; Q)
Input: Database r; declarative language bias L and key; threshold minfreq
Output: All queries Q ∈ L with frequency ≥ minfreq
1. Initialize level d := 1
2. Initialize the set of candidate queries Q1 := {?- key}
3. Initialize the set of (in)frequent queries F := ∅; I := ∅
4. While Qd not empty
5.    Find frequency of all queries Q ∈ Qd
6.    Move those with frequency below minfreq to I
7.    Update F := F ∪ Qd
8.    Compute new candidates: Qd+1 = WARMRgen(L; I; F; Qd)
9.    Increment d
10. Return F

Function WARMRgen(L; I; F; Qd)
1. Initialize Qd+1 := ∅
2. For each Qj ∈ Qd, and for each refinement Q'j ∈ L of Qj:
      Add Q'j to Qd+1, unless:
      (i) Q'j is more specific than some query ∈ I, or
      (ii) Q'j is equivalent to some query ∈ Qd+1 ∪ F
3. Return Qd+1
WARMR can be (and has been) used to perform propositionalization, i.e., to
transform MRDM problems to propositional (single table) form. WARMR is first
used to discover frequent queries. In the propositional form, examples correspond
to answer substitutions for the key atom and the binary attributes are the frequent
queries discovered. An attribute is true for an example if the corresponding query
succeeds for the corresponding answer substitution. This approach has been applied
with considerable success to the tasks of predictive toxicology [10] and genome-wide
prediction of protein functional class [24].
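
The propositionalization step can be sketched with a tiny backtracking query evaluator. The sketch assumes the frequent queries have already been found; the database, the three queries, and the function names answers and propositionalize are invented for illustration, and the key variable is assumed to be named X.

```python
def answers(query, db, subst=None):
    """Yield every substitution (dict) under which all atoms of the query hold.

    query: list of (predicate, args); uppercase args are variables.
    db: dict mapping predicate name to a set of ground tuples.
    """
    subst = subst or {}
    if not query:
        yield subst
        return
    (pred, args), rest = query[0], query[1:]
    for fact in db.get(pred, ()):
        s = dict(subst)
        # Bind unbound variables, check bound variables and constants.
        if all((s.setdefault(a, v) == v) if a[:1].isupper() else (a == v)
               for a, v in zip(args, fact)):
            yield from answers(rest, db, s)

def propositionalize(db, key_pred, queries):
    """One row per key answer, one 0/1 attribute per query."""
    table = {}
    for (x,) in sorted(db[key_pred]):
        table[x] = [int(any(True for _ in answers(q, db, {"X": x})))
                    for q in queries]
    return table

db = {
    "person": {("ann",), ("bob",), ("dora",)},
    "parent": {("ann", "bob"), ("dora", "ann")},
    "hasPet": {("bob", "cat"), ("ann", "dog")},
}
queries = [
    [("parent", ("X", "Y"))],                            # X has a child
    [("hasPet", ("X", "cat"))],                          # X has a cat
    [("parent", ("X", "Y")), ("hasPet", ("Y", "cat"))],  # X has a child with a cat
]
table = propositionalize(db, "person", queries)
print(table)  # {'ann': [1, 0, 1], 'bob': [0, 1, 0], 'dora': [1, 0, 0]}
```

The resulting table is an ordinary single-table dataset that any propositional learner could consume.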

3.7 Relational Distance-Based Methods

To upgrade distance-based approaches to learning, including prediction and clustering,
it is necessary to upgrade the key notion of a distance measure from the
propositional to the relational case. Such a measure could then be used within
standard statistical approaches, such as nearest-neighbor prediction or hierarchical
agglomerative clustering. In their system RIBL, Emde and Wettschereck [21] propose
a relational distance measure. Below we first briefly discuss this measure, then
outline how it has been used for relational classification and clustering [25].

Figure 3.3 A relational decision tree, predicting the class variable A in the target
predicate maintenance(M, A). [Tree: the root tests haspart(M, X), worn(X); its no
branch is the leaf A = no maintenance; its yes branch tests irreplaceable(X), with
yes leading to A = send back and no to A = repair in house.]
3.7.1 The RIBL Distance Measure

Propositional distance measures are defined between examples that have the form
of vectors of attribute values. They essentially sum up the differences between the
examples' values along each of the dimensions of the vectors. Given two examples
x = (x1, ..., xn) and y = (y1, ..., yn), their distance might be calculated as

\[
\mathrm{distance}(x, y) = \sum_{i=1}^{n} \mathrm{difference}(x_i, y_i) / n,
\]

where the difference between attribute values is defined as

\[
\mathrm{difference}(x_i, y_i) =
\begin{cases}
|x_i - y_i| & \text{if continuous,} \\
0 & \text{if discrete and } x_i = y_i, \\
1 & \text{otherwise.}
\end{cases}
\]
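
A direct transcription of this propositional distance is given below as a minimal sketch. It assumes that continuous attributes have already been scaled to the interval [0, 1], so that every per-attribute difference lies between 0 and 1; the example vectors are invented.

```python
def difference(a, b):
    """Per-attribute difference: absolute difference for numbers, 0/1 otherwise."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0.0 if a == b else 1.0

def distance(x, y):
    """Average of the per-attribute differences of two equal-length vectors."""
    return sum(difference(a, b) for a, b in zip(x, y)) / len(x)

# Hypothetical examples: (scaled age, gender, scaled income).
x = (0.45, "male", 0.20)
y = (0.30, "female", 0.20)
print(round(distance(x, y), 4))  # 0.3833
```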
In a relational representation, an example (also called instance or case) can
be described by a set of facts about multiple relations. A fact of the target
predicate of the form target(ExampleID, A1, ..., An) specifies an instance
through its ID and properties, and additional information can be specified through
background knowledge predicates. In table 3.13, the target predicate
member(PersonID, A, G, I, MT) specifies information on members of a particular club,
which includes age, gender, income, and membership type. The background predicates
car(OwnerID, CT, TS, M) and house(OwnerID, DistrictID, Y, S) provide
information on property owned by club members: for cars this includes car
type, top speed, and manufacturer; for houses the district, construction year,
and size. Additional information is available on districts through the predicate
district(DistrictID, P, S, C), i.e., the popularity, size, and country of the district.
Table 3.13 Two examples on which to study a relational distance measure

member(person1, 45, male, 20, gold)
member(person2, 30, female, 10, platinum)
car(person1, wagon, 200, volkswagen)
car(person1, sedan, 220, mercedesbenz)
car(person2, roadster, 240, audi)
car(person2, coupe, 260, bmw)
house(person1, murgle, 1987, 560)
house(person1, montecarlo, 1990, 210)
house(person2, murgle, 1999, 430)
district(montecarlo, famous, large, monaco)
district(murgle, famous, small, slovenia)

The basic idea behind the RIBL [21] distance measure is as follows. To calculate
the distance between two objects/examples, their properties are taken into account
first (at depth 0). Next (at depth 1), objects immediately related to the two
original objects are taken into account, or more precisely, the distances between
the corresponding related objects. At depth 2, objects related to those at depth 1
are taken into account, and so on, until a user-specified depth limit is reached.
In our example, when calculating the distance between e1 = member(person1, 45,
male, 20, gold) and e2 = member(person2, 30, female, 10, platinum), the properties
of the persons (age, gender, income, membership type) are first compared and
differences between them calculated and summed (as in the propositional case). At
depth 1, cars and houses owned by the two persons are compared, i.e., distances
between them are calculated. At depth 2, the districts where the houses reside are
taken into account when calculating the distances between houses. Before beginning
to calculate distances, RIBL collects all facts related to a person into a so-called
case. The case for person1 generated with a depth limit of 2 is given in figure 3.4.
Let us calculate the distance between the two club members according to the distance
measure: d(e1, e2) = 1/5 · (d(person1, person2) + d(45, 30) + d(male, female) +
d(20, 10) + d(gold, platinum)). With a depth limit of 0, the identifiers person1
and person2 are treated as discrete values, d(person1, person2) = 1, and we have
d(e1, e2) = (1 + (45 − 30)/100 + 1 + (20 − 10)/50 + 1)/5 = 0.67; the denominators
100 and 50 denote the highest possible differences in age and income.
member(person1, 45, male, 20, gold)
- car(person1, wagon, 200, volkswagen)
- car(person1, sedan, 220, mercedesbenz)
- house(person1, murgle, 1987, 560)
  - district(murgle, famous, small, slovenia)
- house(person1, montecarlo, 1990, 210)
  - district(montecarlo, famous, large, monaco)

Figure 3.4 All facts related to member(person1, 45, male, 20, gold), constructed
with respect to the background knowledge in table 3.13 and a depth limit of 2.

To calculate d(person1, person2) at level 1, we collect the facts directly related
to the two persons and partition them according to the predicates. Thus we have
F_{1,car} = {car(person1, wagon, 200, volkswagen),
car(person1, sedan, 220, mercedesbenz)}
F_{2,car} = {car(person2, roadster, 240, audi),
car(person2, coupe, 260, bmw)}
F_{1,house} = {house(person1, murgle, 1987, 560),
house(person1, montecarlo, 1990, 210)}
F_{2,house} = {house(person2, murgle, 1999, 430)}.
Then d(person1, person2) = (d(F_{1,car}, F_{2,car}) + d(F_{1,house}, F_{2,house}))/2.
Distances between sets of facts are calculated as follows. We take the smaller set
of facts (or the first, if they are of the same size): for d(F_{1,house}, F_{2,house}), we
take F_{2,house}. For each fact in this set, we calculate its distance to the nearest
element of the other set, e.g., F_{1,house}, summing up these distances (the house of
person2 is closer to the house of person1 in murgle than to the one in montecarlo).
We add a penalty for the possible mismatch in cardinality and normalize with the
cardinality of the larger set:

\[
\begin{aligned}
d(F_{1,\text{house}}, F_{2,\text{house}})
&= \bigl[\,1 + \min\bigl(
   d(\text{house}(\text{person2}, \text{murgle}, 1999, 430),\ \text{house}(\text{person1}, \text{murgle}, 1987, 560)), \\
&\qquad\qquad\;\;
   d(\text{house}(\text{person2}, \text{murgle}, 1999, 430),\ \text{house}(\text{person1}, \text{montecarlo}, 1990, 210))
   \bigr)\bigr] / 2 \\
&= 0.5 \cdot \bigl[\,1 + \min\bigl((0 + (1999 - 1987)/100 + |430 - 560|/1000)/3, \\
&\qquad\qquad\qquad\;\; (1 + (1999 - 1990)/100 + (430 - 210)/1000)/3\bigr)\bigr] \\
&= 0.5 + 0.5 \cdot \min(0.25/3,\ 1.31/3) = 13/24.
\end{aligned}
\]
For calculating d(F_{1,car}, F_{2,car}), we take F_{1,car} and note that both cars
of person1 are closer to the audi of person2 than to the bmw. We thus have
d(F_{1,car}, F_{2,car}) = 0.5 · [min_{c ∈ F_{2,car}} d(car(person1, wagon, 200, volkswagen), c) +
min_{c ∈ F_{2,car}} d(car(person1, sedan, 220, mercedesbenz), c)] = 0.5 · [(1 + |200 −
240|/100 + 1)/3 + (1 + |220 − 240|/100 + 1)/3] = 11/15. Thus, at level 1, d(person1,
person2) = 0.5 · (13/24 + 11/15) = 0.6375 and d(e1, e2) = (0.6375 + (45 − 30)/100 +
1 + (20 − 10)/50 + 1)/5 = 0.5975.
Finally, at level 2, the distance between the two districts is taken into account
when calculating d(F_{1,house}, F_{2,house}). We have d(murgle, montecarlo) = (0 +
1 + 1)/3 = 2/3. However, since the house of person2 is closer to the house of person1
in murgle than to the one in montecarlo, the value of d(F_{1,house}, F_{2,house}) does
not change, as it equals 0.5 · [1 + min((0 + (1999 − 1987)/100 + |430 − 560|/1000)/3,
(2/3 + (1999 − 1990)/100 + (430 − 210)/1000)/3)] = 0.5 + 0.5 ·
min(0.25/3, (2/3 + 0.31)/3) = 13/24. d(e1, e2) is thus the same at level 1 and
level 2 and is equal to 0.5975.
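
The house computation at levels 1 and 2 can be reproduced with a short sketch. It follows the set-distance description given earlier (match each fact of the smaller set to its nearest fact in the larger set, penalize the cardinality mismatch, normalize by the larger cardinality). Two points are assumptions of this sketch rather than statements from the text: the penalty is taken to be 1 per unmatched fact, which agrees with the worked example where exactly one house is unmatched, and the normalizing constants 100 (years) and 1000 (size) are read off the example. Function and variable names are illustrative.

```python
def discrete(a, b):
    """0 if the two values are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

# District properties from table 3.13: popularity, size, country.
DISTRICTS = {
    "murgle": ("famous", "small", "slovenia"),
    "montecarlo": ("famous", "large", "monaco"),
}

def district_distance(a, b):
    """Level-2 comparison: average discrete difference of district properties."""
    pa, pb = DISTRICTS[a], DISTRICTS[b]
    return sum(discrete(x, y) for x, y in zip(pa, pb)) / len(pa)

def house_distance(h1, h2, district_dist):
    """Average difference over district, construction year, and size."""
    (d1, y1, s1), (d2, y2, s2) = h1, h2
    return (district_dist(d1, d2) + abs(y1 - y2) / 100 + abs(s1 - s2) / 1000) / 3

def set_distance(a, b, dist):
    """Nearest-element matching from the smaller set, plus cardinality penalty."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    matched = sum(min(dist(x, y) for y in large) for x in small)
    return (matched + len(large) - len(small)) / len(large)

f1_house = [("murgle", 1987, 560), ("montecarlo", 1990, 210)]   # person1
f2_house = [("murgle", 1999, 430)]                              # person2

level1 = set_distance(f1_house, f2_house,
                      lambda g, h: house_distance(g, h, discrete))
level2 = set_distance(f1_house, f2_house,
                      lambda g, h: house_distance(g, h, district_distance))
print(round(level1, 4), round(level2, 4))  # 0.5417 0.5417
```

Both values equal 13/24: refining the district comparison at level 2 only changes the distance to the montecarlo house, which is not the nearest one.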
We should note here that the RIBL distance measure is not a metric [42].
However, some relational distance measures that are metrics have been proposed
recently [43]. Designing distance measures for relational data is still a largely open
and lively research area. Since distances and kernels are strongly related, this area
is also related to designing kernels for structured data.
3.7.2 Relational Distance-Based Learning

Once we have a relational distance measure, we can easily adapt classical statistical
approaches to prediction and clustering, such as the nearest-neighbor method and
hierarchical agglomerative clustering, to work on relational data. This is precisely
what has been done with the RIBL distance measure.
The original RIBL [21] addresses the problem of prediction, more precisely
classification. It uses the k-nearest neighbor method in conjunction with the RIBL
distance measure. RIBL was successfully applied to
the practical problem of diterpene structure elucidation [17], where it outperformed
propositional approaches as well as a number of other relational approaches.
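
Because the k-nearest neighbor method only needs pairwise distances, it can be written against an arbitrary distance function. The sketch below illustrates this point; it is not RIBL's implementation, and the function name knn_predict, the majority vote, and the numeric toy data are assumptions. A relational distance such as the one described above could be passed in place of the absolute difference.

```python
from collections import Counter

def knn_predict(query, examples, distance, k=3):
    """Predict the majority label among the k examples nearest to the query.

    examples: list of (case, label) pairs; distance: any pairwise function.
    """
    nearest = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _case, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical one-dimensional cases with a plain absolute-difference distance.
examples = [(1, "a"), (2, "a"), (10, "b"), (11, "b"), (12, "b")]
prediction = knn_predict(3, examples, lambda p, q: abs(p - q), k=3)
print(prediction)  # a
```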


RIBL2 [25] upgrades the RIBL distance measure by considering lists and
terms as elementary types, much like discrete and numeric values. Edit distances
are used for these, while the RIBL distance measure is followed otherwise.
RIBL2 has been used to predict mRNA signal structure and to automatically
discover previously uncharacterized mRNA signal structure classes [25].
Two clustering approaches have been developed that use the RIBL distance
measure [25]. RDBC uses hierarchical agglomerative clustering, while FORC adapts
the k-means approach. The latter relies on finding cluster centers, which is easy for
numeric vectors but far from trivial in the relational case. FORC thus uses the
k-medoids method, which defines a cluster center as the existing case/example that
has the smallest sum of squared distances to all other cases in the cluster and only
uses distance information.
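
The medoid definition above needs nothing but a distance function, which the following minimal sketch makes explicit. The function name medoid and the numeric toy cluster are illustrative assumptions; this shows only the center-selection step, not FORC itself.

```python
def medoid(cluster, distance):
    """Existing case with the smallest sum of squared distances to the others."""
    return min(cluster,
               key=lambda c: sum(distance(c, other) ** 2 for other in cluster))

# Hypothetical cluster of numbers with absolute difference as the distance.
cluster = [1, 2, 3, 10]
center = medoid(cluster, lambda p, q: abs(p - q))
print(center)  # 3
```

The sums of squared distances are 86, 66, 54, and 194 for the cases 1, 2, 3, and 10, so the case 3 is chosen; unlike a mean, the center is always one of the existing cases.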

3.8 Recent Trends in ILP and RDM

Hot topics and recent advances in ILP and RDM mirror the hot topics in data
mining and machine learning. These include scalability issues, ensemble methods,
and kernel methods.
Scalability issues do indeed deserve a lot of attention when learning in a relational
setting, as the complexity of learning increases with the expressive power of the
hypothesis language. Scalability methods for ILP include classical ones, such as
sampling or turning the loop of hypothesis evaluation inside out (going through
each example once) in decision tree induction. Methods more specific to ILP, such
as query packs, have also been considered.
Boosting was the first ensemble method to be used on top of a relational learning
system. This was followed by bagging. More recently, methods for learning random
forests have been adapted to the relational setting.
Kernel methods have become the mainstream of research in machine learning and
data mining in recent years. The development of kernel methods for learning in a
relational setting has thus emerged as a natural research direction. Significant effort
has been devoted to the development of kernels for structured/relational data, such
as graphs and sequences.
The latest developments in ILP and RDM are discussed in a special issue of
SIGKDD Explorations [20]. Besides the topics mentioned above, the hottest research
topic in ILP and RDM is the study of probabilistic representations and learning
methods. A variety of these have been recently considered. A comprehensive survey
of such methods is presented in this book.

References
[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast
discovery of association rules. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining,
pages 307–328. AAAI Press, Menlo Park, CA, 1996.
[2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the
Eleventh International Conference on Data Engineering, 1995.
[3] H. Blockeel and L. De Raedt. Top-down induction of first order logical decision
trees. Artificial Intelligence, 101: 285–297, 1998.
[4] I. Bratko. Prolog Programming for Artificial Intelligence, 3rd edition.
Addison-Wesley, Harlow, UK, 2001.
[5] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and
Regression Trees. Wadsworth, Belmont, CA, 1984.
[6] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements.
In Proceedings of the Fifth European Working Session on Learning, 1991.
[7] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning,
3(4): 261–283, 1989.
[8] L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data
Mining and Knowledge Discovery, 3(1): 7–36, 1999.
[9] L. Dehaspe and H. Toivonen. Discovery of relational association rules. In [15],
pages 189–212, 2001.
[10] L. Dehaspe, H. Toivonen, and R. D. King. Finding frequent substructures in
chemical compounds. In Proceedings of the Fourth International Conference on
Knowledge Discovery and Data Mining, 1998.
[11] L. De Raedt. Logical settings for concept learning. Artificial Intelligence, 95:
187–201, 1997.
[12] L. De Raedt. Attribute-value learning versus inductive logic programming: the
missing links. In Proceedings of the Eighth International Conference on Inductive
Logic Programming, 1998.
[13] L. De Raedt, H. Blockeel, L. Dehaspe, and W. Van Laer. Three companions
for data mining in first order logic. In [15], pages 105–139, 2001.
[14] L. De Raedt and S. Džeroski. First order jk-clausal theories are PAC-learnable.
Artificial Intelligence, 70: 375–392, 1994.
[15] S. Džeroski and N. Lavrač, editors. Relational Data Mining. Springer-Verlag,
Berlin, 2001.
[16] S. Džeroski, S. Muggleton, and S. Russell. PAC-learnability of determinate
logic programs. In Proceedings of the Fifth ACM Workshop on Computational
Learning Theory, 1992.
[17] S. Džeroski, S. Schulze-Kremer, K. Heidtke, K. Siems, D. Wettschereck,
and H. Blockeel. Diterpene structure elucidation from 13C NMR spectra with
inductive logic programming. Applied Artificial Intelligence, 12: 363–383, 1998.
[18] S. Džeroski, H. Blockeel, B. Kompare, S. Kramer, B. Pfahringer, and W.
Van Laer. Experiments in predicting biodegradability. In Proceedings of the
International Workshop on Inductive Logic Programming, 1999.
[19] S. Džeroski. Relational data mining applications: An overview. In [15], pages
339–364, 2001.
[20] S. Džeroski and L. De Raedt, editors. SIGKDD Explorations, Special Issue
on Multi-Relational Data Mining, 5(1), 2003.
[21] W. Emde and D. Wettschereck. Relational instance-based learning. In
Proceedings of the Thirteenth International Conference on Machine Learning,
1996.
[22] C. Hogger. Essentials of Logic Programming. Clarendon Press, Oxford, UK,
1990.
[23] A. Karalič and I. Bratko. First order regression. Machine Learning, 26: 147–176, 1997.
[24] R. D. King, A. Karwath, A. Clare, and L. Dehaspe. Genome scale prediction
of protein functional class from sequence using data mining. In Proceedings of the
Sixth International Conference on Knowledge Discovery and Data Mining, 2000.
[25] M. Kirsten, S. Wrobel, and T. Horváth. Distance based approaches to
relational learning and clustering. In [15], pages 213–232, 2001.
[26] S. Kramer. Structural regression trees. In Proceedings of the Thirteenth
National Conference on Artificial Intelligence, 1996.
[27] S. Kramer and G. Widmer. Inducing classification and regression trees in first
order logic. In [15], pages 140–159, 2001.
[28] S. Kramer, N. Lavrač, and P. Flach. Propositionalization approaches to
relational data mining. In [15], pages 262–291, 2001.
[29] N. Lavrač, S. Džeroski, and M. Grobelnik. Learning nonrecursive definitions
of relations with LINUS. In Proceedings of the Fifth European Working Session
on Learning, 1991.
[30] N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques
and Applications. Ellis Horwood, Chichester, UK, 1994. Freely available at
http://www-ai.ijs.si/SasoDzeroski/ILPBook/.
[31] J. Lloyd. Foundations of Logic Programming, 2nd edition. Springer-Verlag,
Berlin, 1987.
[32] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal
occurrences. In Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining, 1996.
[33] R. Michalski, I. Mozetič, J. Hong, and N. Lavrač. The multi-purpose incremental
learning system AQ15 and its testing application on three medical domains.
In Proceedings of the Fifth National Conference on Artificial Intelligence,
1986.
[34] S. Muggleton. Inductive logic programming. New Generation Computing,
8(4): 295–318, 1991.

[35] S. Muggleton. Inverse entailment and Progol. New Generation Computing,
13: 245–286, 1995.
[36] S. Muggleton and W. Buntine. Machine invention of first-order predicates
by inverting resolution. In Proceedings of the Fifth International Conference on
Machine Learning, 1988.
[37] S. Muggleton and C. Feng. Efficient induction of logic programs. In
Proceedings of the First Conference on Algorithmic Learning Theory, 1990.
[38] C. Nédellec, C. Rouveirol, H. Adé, F. Bergadano, and B. Tausend. Declarative
bias in inductive logic programming. In L. De Raedt, editor, Advances in
Inductive Logic Programming, pages 82–103. IOS Press, Amsterdam, 1996.
[39] G. Plotkin. A note on inductive generalization. In B. Meltzer and D. Michie,
editors, Machine Intelligence 5, pages 153–163. Edinburgh University Press,
Edinburgh, 1969.
[40] J. R. Quinlan. Learning logical definitions from relations. Machine Learning,
5(3): 239–266, 1990.
[41] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann,
San Mateo, CA, 1993.
[42] J. Ramon. Clustering and Instance Based Learning in First Order Logic. PhD
Thesis. Katholieke Universiteit Leuven, Belgium, 2002.
[43] J. Ramon and M. Bruynooghe. A polynomial time computable metric between
point sets. Acta Informatica, 37(10): 765–780.
[44] E. Shapiro. Algorithmic Program Debugging. MIT Press, Cambridge, MA,
1983.
[45] R. Srikant and R. Agrawal. Mining generalized association rules. In
Proceedings of the Twenty-first International Conference on Very Large Data Bases,
1995.
[46] A. Srinivasan. The Aleph manual. Technical Report, Computing Laboratory,
Oxford University, Oxford, UK, 2000.
[47] J. Ullman. Principles of Database and Knowledge Base Systems, volume 1.
Computer Science Press, Rockville, MD, 1988.
[48] W. Van Laer and L. De Raedt. How to upgrade propositional learners to first
order logic: A case study. In [15], pages 235–261, 2001.
[49] S. Wrobel. Inductive logic programming for knowledge discovery in databases.
In [15], pages 74–101, 2001.
[50] S. Wrobel and S. Džeroski. The ILP description learning problem: Towards
a general model-level definition of data mining in ILP. In Proceedings
Fachgruppentreffen Maschinelles Lernen. University of Dortmund, Germany, 1995.

4 An Introduction to Conditional Random Fields for Relational Learning

Charles Sutton and Andrew McCallum

Conditional random fields (CRFs) combine the modeling flexibility of graphical
models with the ability to use rich, nonindependent features of the input. In this
tutorial, we review modeling, inference, and parameter estimation in CRFs, both
on linear chains and on general graphical structures. We discuss differences between
generative and discriminative modeling, latent-variable conditional models, and
practical aspects of CRF implementations. Finally, we present a case study applying
a loopy CRF to a relational problem in natural language processing.

4.1

Introduction
Relational data has two characteristics: first, statistical dependencies exist between
the entities we wish to model, and second, each entity often has a rich set of features
that can aid classification. For example, when classifying web documents, the page's
text provides much information about the class label, but hyperlinks define a
relationship between pages that can improve classification [55]. Graphical models
are a natural formalism for exploiting the dependence structure among entities.
Traditionally, graphical models have been used to represent the joint probability
distribution p(y, x), where the variables y represent the attributes of the entities
that we wish to predict, and the input variables x represent our observed knowledge
about the entities. But modeling the joint distribution can lead to difficulties when
using the rich local features that can occur in relational data, because it requires
modeling the distribution p(x), which can include complex dependencies. Modeling
these dependencies among inputs can lead to intractable models, but ignoring them
can lead to reduced performance.
A solution to this problem is to directly model the conditional distribution p(y|x),
which is sufficient for classification. This is the approach taken by conditional
random fields (CRFs) [24]. A CRF is simply a conditional distribution p(y|x) with
an associated graphical structure. Because the model is conditional, dependencies
among the input variables x do not need to be explicitly represented, affording the
use of rich, global features of the input. For example, in natural language tasks,
useful features include neighboring words and word bigrams, prefixes and suffixes,
capitalization, membership in domain-specific lexicons, and semantic information
from sources such as WordNet. Recently there has been an explosion of interest
in CRFs, with successful applications including text processing [55, 37, 48, 49],
bioinformatics [47, 25], and computer vision [18, 23].
This chapter is divided into two parts. First, we present a tutorial on current
training and inference techniques for CRFs. We discuss the important special case
of linear-chain CRFs, and then we generalize these to arbitrary graphical structures.
We include a brief discussion of techniques for practical CRF implementations.
Second, we present an example of applying a general CRF to a practical relational
learning problem. In particular, we discuss the problem of information extraction,
that is, automatically building a relational database from information contained
in unstructured text. Unlike linear-chain models, general CRFs can capture
long-distance dependencies between labels. For example, if the same name is mentioned
more than once in a document, all mentions probably have the same label, and it
is useful to extract them all, because each mention may contain different
complementary information about the underlying entity. To represent these long-distance
dependencies, we propose a skip-chain CRF, a model that jointly performs
segmentation and collective labeling of extracted mentions. On a standard problem
of extracting speaker names from seminar announcements, the skip-chain CRF has
better performance than a linear-chain CRF.

4.2 Graphical Models

4.2.1 Definitions

We consider probability distributions over sets of random variables $V = X \cup Y$,
where $X$ is a set of input variables that we assume are observed, and $Y$ is a set of
output variables that we wish to predict. Every variable $v \in V$ takes outcomes from
a set $\mathcal{V}$, which can be either continuous or discrete, although we discuss only the
discrete case in this chapter. We denote an assignment to $X$ by $x$, and we denote
an assignment to a set $A \subset X$ by $x_A$, and similarly for $Y$. We use the notation
$\mathbf{1}_{\{x = x'\}}$ to denote an indicator function of $x$ which takes the value 1 when $x = x'$
and 0 otherwise.
A graphical model is a family of probability distributions that factorize according
to an underlying graph. The main idea is to represent a distribution over a large
number of random variables by a product of local functions that each depend on
only a small number of variables. Given a collection of subsets $A \subset V$, we define
an undirected graphical model as the set of all distributions that can be written in
the form

$$p(x, y) = \frac{1}{Z} \prod_A \Psi_A(x_A, y_A), \tag{4.1}$$

for any choice of factors $F = \{\Psi_A\}$, where $\Psi_A : \mathcal{V}^n \to \mathbb{R}^+$. (These functions are
also called local functions or compatibility functions.) We will occasionally use the
term random field to refer to a particular distribution among those defined by an
undirected model. To reiterate, we will consistently use the term model to refer to a
family of distributions, and random field (or more commonly, distribution) to refer
to a single one.
The constant $Z$ is a normalization factor defined as

$$Z = \sum_{x, y} \prod_A \Psi_A(x_A, y_A), \tag{4.2}$$

which ensures that the distribution sums to 1. The quantity $Z$, considered as a
function of the set $F$ of factors, is called the partition function in the statistical
physics and graphical models communities. Computing $Z$ is intractable in general,
but much work exists on how to approximate it.
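As a minimal sketch of (4.1) and (4.2), the following Python computes Z by brute-force enumeration. The toy model (three binary variables, two pairwise factors) and all factor values are assumptions made up for illustration. The enumeration visits every joint assignment, so its cost grows exponentially with the number of variables, which is why this direct approach does not scale.

```python
import itertools

# Hypothetical toy model: three binary variables and two pairwise factors.
# The factor values are illustrative assumptions, not taken from the text.
def psi_a(v0, v1):
    return 2.0 if v0 == v1 else 1.0

def psi_b(v1, v2):
    return 3.0 if v1 == v2 else 1.0

def unnormalized(v):
    """Product of local functions, the numerator of (4.1)."""
    return psi_a(v[0], v[1]) * psi_b(v[1], v[2])

def partition_function():
    """Z from (4.2): sum the factor product over every joint assignment."""
    return sum(unnormalized(v) for v in itertools.product([0, 1], repeat=3))

def prob(v):
    """Normalized probability of one joint assignment."""
    return unnormalized(v) / partition_function()

print(partition_function())  # a sum over 2**3 = 8 assignments
print(prob((0, 0, 0)))
```

Dividing by Z is what turns the arbitrary positive factor values into probabilities that sum to 1.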
Graphically, we represent the factorization (4.1) by a factor graph [21]. A factor
graph is a bipartite graph $G = (V, F, E)$ in which a variable node $v_s \in V$ is
connected to a factor node $\Psi_A \in F$ if $v_s$ is an argument to $\Psi_A$. An example of
a factor graph is shown graphically in figure 4.1 (right). In that figure, the circles
are variable nodes, and the shaded boxes are factor nodes.
In this chapter, we will assume that each local function has the form

$$\Psi_A(x_A, y_A) = \exp\left\{ \sum_k \theta_{Ak} f_{Ak}(x_A, y_A) \right\}, \tag{4.3}$$

for some real-valued parameter vector $\theta_A$, and for some set of feature functions or
sufficient statistics $\{f_{Ak}\}$. This form ensures that the family of distributions over $V$
parameterized by $\theta$ is an exponential family. Much of the discussion in this chapter
actually applies to exponential families in general.
A directed graphical model, also known as a Bayesian network, is based on a
directed graph $G = (V, E)$. A directed model is a family of distributions that
factorize as

$$p(y, x) = \prod_{v \in V} p(v \mid \pi(v)), \tag{4.4}$$

where $\pi(v)$ are the parents of $v$ in $G$. An example of a directed model is shown in
figure 4.1 (left).
We use the term generative model to refer to a directed graphical model in which
the outputs topologically precede the inputs, that is, no $x \in X$ can be a parent of
an output $y \in Y$. Essentially, a generative model is one that directly describes how
the outputs probabilistically generate the inputs.

Figure 4.1 The naive Bayes classifier, as a directed model (left), and as a factor
graph (right).

4.2.2 Applications of Graphical Models

In this section we discuss a few applications of graphical models to natural language
processing (NLP). Although these examples are well-known, they serve both to
clarify the definitions in the previous section, and to illustrate some ideas that will
arise again in our discussion of CRFs. We devote special attention to the hidden
Markov model (HMM), because it is closely related to the linear-chain CRF.
4.2.2.1 Classification

First we discuss the problem of classification, that is, predicting a single class
variable $y$ given a vector of features $x = (x_1, x_2, \ldots, x_K)$. One simple way to
accomplish this is to assume that once the class label is known, all the features
are independent. The resulting classifier is called the naive Bayes classifier. It is
based on a joint probability model of the form

$$p(y, x) = p(y) \prod_{k=1}^{K} p(x_k \mid y). \tag{4.5}$$

This model can be described by the directed model shown in figure 4.1 (left). We
can also write this model as a factor graph, by defining a factor $\Psi(y) = p(y)$, and
a factor $\Psi_k(y, x_k) = p(x_k \mid y)$ for each feature $x_k$. This factor graph is shown in
figure 4.1 (right).
Another well-known classifier that is naturally represented as a graphical model is
logistic regression (sometimes known as the maximum entropy classifier in the NLP
community). In statistics, this classifier is motivated by the assumption that the log
probability, log p(y|x), of each class is a linear function of x, plus a normalization
constant. This leads to the conditional distribution:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j \right\}, \tag{4.6}$$

where $Z(x) = \sum_y \exp\{\lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j\}$ is a normalizing constant, and $\lambda_y$ is a
bias weight that acts like log p(y) in naive Bayes. Rather than using one vector per
class, as in (4.6), we can use a different notation in which a single set of weights is
shared across all the classes. The trick is to define a set of feature functions that are
nonzero only for a single class. To do this, the feature functions can be defined as
$f_{y',j}(y, x) = \mathbf{1}_{\{y' = y\}} x_j$ for the feature weights and $f_{y'}(y, x) = \mathbf{1}_{\{y' = y\}}$ for the bias
weights. Now we can use $f_k$ to index each feature function $f_{y',j}$, and $\lambda_k$ to index
its corresponding weight $\lambda_{y',j}$. Using this notational trick, the logistic regression
model becomes:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \sum_{k=1}^{K} \lambda_k f_k(y, x) \right\}. \tag{4.7}$$

We introduce this notation because it mirrors the usual notation for CRFs.
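The equivalence between the per-class form (4.6) and the shared-weight form (4.7) can be sketched in a few lines of Python. The class names, weights, and input vector below are made-up values, and the helper names (`p_per_class`, `build_shared`, `p_shared`) are hypothetical; this is an illustration of the notational trick, not a reference implementation.

```python
import math

def softmax_from_scores(scores):
    """Normalize exp(score) over classes; subtracting the max avoids overflow."""
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def p_per_class(x, bias, weights):
    """Form (4.6): one bias weight and one weight vector per class."""
    scores = {y: bias[y] + sum(w * xj for w, xj in zip(weights[y], x))
              for y in bias}
    return softmax_from_scores(scores)

def build_shared(bias, weights):
    """Flatten per-class parameters into (lambda_k, f_k) pairs as in (4.7)."""
    params = []
    for yp in bias:
        # Bias feature f_{y'}(y, x) = 1{y' = y}
        params.append((bias[yp], lambda y, x, yp=yp: 1.0 if y == yp else 0.0))
        for j, w in enumerate(weights[yp]):
            # Input feature f_{y',j}(y, x) = 1{y' = y} * x_j
            params.append((w, lambda y, x, yp=yp, j=j: x[j] if y == yp else 0.0))
    return params

def p_shared(x, labels, params):
    """Form (4.7): a single weight list shared across all classes."""
    scores = {y: sum(lam * f(y, x) for lam, f in params) for y in labels}
    return softmax_from_scores(scores)

# Illustrative values only.
bias = {"pos": 0.5, "neg": -0.2}
weights = {"pos": [1.0, -2.0], "neg": [0.3, 0.7]}
x = [2.0, 1.0]
p46 = p_per_class(x, bias, weights)
p47 = p_shared(x, list(bias), build_shared(bias, weights))
print(p46, p47)
```

Because each feature function is zero for every class but one, the shared sum in (4.7) picks out exactly the terms that (4.6) would use for that class.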
4.2.2.2 Sequence Models

Classifiers predict only a single class variable, but the true power of graphical
models lies in their ability to model many variables that are interdependent. In this
section, we discuss perhaps the simplest form of dependency, in which the output
variables are arranged in a sequence. To motivate this kind of model, we discuss
an application from NLP, the task of named-entity recognition (NER). NER is the
problem of identifying and classifying proper names in text, including locations,
such as China; people, such as George Bush; and organizations, such as the United
Nations. The NER task is, given a sentence, first to segment which words are part
of entities, and then to classify each entity by type (person, organization, location,
and so on). The challenge of this problem is that many named entities are too rare
to appear even in a large training set, and therefore the system must identify them
based only on context.
One approach to NER is to classify each word independently as one of either
Person, Location, Organization, or Other (meaning not an entity). The
problem with this approach is that it assumes that given the input, all of the
named-entity labels are independent. In fact, the named-entity labels of neighboring words
are dependent; for example, while New York is a location, New York Times is an
organization.
This independence assumption can be relaxed by arranging the output variables
in a linear chain. This is the approach taken by HMMs [42]. An HMM models a
sequence of observations $X = \{x_t\}_{t=1}^{T}$ by assuming that there is an underlying
sequence of states $Y = \{y_t\}_{t=1}^{T}$ drawn from a finite state set $S$. In the named-entity
example, each observation $x_t$ is the identity of the word at position $t$, and each state
$y_t$ is the named-entity label, that is, one of the entity types Person, Location,
Organization, and Other.
To model the joint distribution p(y, x) tractably, an HMM makes two independence
assumptions. First, it assumes that each state depends only on its immediate
predecessor, that is, each state $y_t$ is independent of all its ancestors $y_1, y_2, \ldots, y_{t-2}$
given its previous state $y_{t-1}$. Second, an HMM assumes that each observation
variable $x_t$ depends only on the current state $y_t$. With these assumptions, we can
specify an HMM using three probability distributions: first, the distribution $p(y_1)$
over initial states; second, the transition distribution $p(y_t \mid y_{t-1})$; and finally, the
observation distribution $p(x_t \mid y_t)$. That is, the joint probability of a state sequence
y and an observation sequence x factorizes as

$$p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t), \tag{4.8}$$

where, to simplify notation, we write the initial state distribution $p(y_1)$ as $p(y_1 \mid y_0)$.
In NLP, HMMs have been used for sequence labeling tasks such as part-of-speech
tagging, named-entity recognition, and information extraction.
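The factorization (4.8) translates directly into a short loop. The sketch below assumes a two-label toy HMM with made-up initial, transition, and observation tables; the function name `hmm_joint` and all probabilities are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def hmm_joint(states, words, init, trans, emit):
    """p(y, x) = prod_t p(y_t | y_{t-1}) p(x_t | y_t), as in (4.8).

    The first factor uses the initial distribution p(y_1) in place of
    p(y_1 | y_0).
    """
    prob = 1.0
    prev = None
    for y, x in zip(states, words):
        p_state = init[y] if prev is None else trans[prev][y]
        prob *= p_state * emit[y][x]
        prev = y
    return prob

# Illustrative tables (assumptions): trans[a][b] = p(y_t = b | y_{t-1} = a).
init = {"Loc": 0.4, "Other": 0.6}
trans = {"Loc": {"Loc": 0.5, "Other": 0.5},
         "Other": {"Loc": 0.2, "Other": 0.8}}
emit = {"Loc": {"New": 0.5, "York": 0.4, "is": 0.1},
        "Other": {"New": 0.1, "York": 0.1, "is": 0.8}}

p = hmm_joint(["Loc", "Loc", "Other"], ["New", "York", "is"], init, trans, emit)
print(p)  # (0.4 * 0.5) * (0.5 * 0.4) * (0.5 * 0.8)
```

Note that the only property of each word the model can use is its identity, through the table `emit`; this limitation is what motivates the richer features discussed next.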
4.2.3 Discriminative and Generative Models

An important difference between naive Bayes and logistic regression is that naive
Bayes is generative, meaning that it is based on a model of the joint distribution
p(y, x), while logistic regression is discriminative, meaning that it is based on
a model of the conditional distribution p(y|x). In this section, we discuss the
differences between generative and discriminative modeling, and the advantages of
discriminative modeling for many tasks. For concreteness, we focus on the examples
of naive Bayes and logistic regression, but the discussion in this section actually
applies in general to the differences between generative models and CRFs.
The main difference is that a conditional distribution p(y|x) does not include
a model of p(x), which is not needed for classification anyway. The difficulty in
modeling p(x) is that it often contains many highly dependent features, which are
difficult to model. For example, in named-entity recognition, an HMM relies on only
one feature, the word's identity. But many words, especially proper names, will not
have occurred in the training set, so the word-identity feature is uninformative. To
label unseen words, we would like to exploit other features of a word, such as its
capitalization, its neighboring words, its prefixes and suffixes, its membership in
predetermined lists of people and locations, and so on.
To include interdependent features in a generative model, we have two choices:
enhance the model to represent dependencies among the inputs, or make simplifying
independence assumptions, such as the naive Bayes assumption. The first approach,
enhancing the model, is often difficult to do while retaining tractability. For
example, it is hard to imagine how to model the dependence between the capitalization of
a word and its suffixes, nor do we particularly wish to do so, since we always observe
the test sentences anyway. The second approach, adding independence assumptions
among the inputs, is problematic because it can hurt performance. For example,
although the naive Bayes classifier performs surprisingly well in document
classification, it performs worse on average across a range of applications than logistic
regression [7].
Figure 4.2 Diagram of the relationship between naive Bayes, logistic regression,
HMMs, linear-chain CRFs, generative models, and general CRFs.

Furthermore, even when naive Bayes has good classification accuracy, its
probability estimates tend to be poor. To understand why, imagine training
naive Bayes on a data set in which all the features are repeated, that is,
$x = (x_1, x_1, x_2, x_2, \ldots, x_K, x_K)$. This will increase the confidence of the naive Bayes
probability estimates, even though no new information has been added to the data.
Assumptions like naive Bayes can be especially problematic when we generalize
to sequence models, because inference essentially combines evidence from different
parts of the model. If probability estimates at a local level are overconfident, it
might be difficult to combine them sensibly.
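The repeated-feature thought experiment can be checked numerically. The sketch below assumes binary features and made-up class-conditional probabilities; duplicating a feature is modeled by giving the copy the same conditional probability table, which is what maximum-likelihood training would produce on duplicated data. The function name `nb_posterior` is hypothetical.

```python
def nb_posterior(x, prior, cond):
    """Posterior p(y | x) under naive Bayes (4.5) for binary features.

    cond[y][k] is the assumed probability p(x_k = 1 | y).
    """
    scores = {}
    for y in prior:
        s = prior[y]
        for k, xk in enumerate(x):
            p1 = cond[y][k]
            s *= p1 if xk == 1 else 1.0 - p1
        scores[y] = s
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Illustrative parameters (assumptions).
prior = {"a": 0.5, "b": 0.5}
cond = {"a": [0.8, 0.6], "b": [0.3, 0.4]}
x = [1, 1]
once = nb_posterior(x, prior, cond)

# Repeat every feature: x becomes (x1, x1, x2, x2) with identical tables.
cond_dup = {y: [p for p in ps for _ in range(2)] for y, ps in cond.items()}
x_dup = [v for v in x for _ in range(2)]
twice = nb_posterior(x_dup, prior, cond_dup)

print(once["a"], twice["a"])  # the posterior for "a" moves closer to 1
```

The evidence is counted twice, so the posterior sharpens although the data carry no additional information.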
Actually, the difference in performance between naive Bayes and logistic regression
is due only to the fact that the first is generative and the second discriminative;
the two classifiers are, for discrete input, identical in all other respects. Naive Bayes
and logistic regression consider the same hypothesis space, in the sense that any
logistic regression classifier can be converted into a naive Bayes classifier with the
same decision boundary, and vice versa. Another way of saying this is that the naive
Bayes model (4.5) defines the same family of distributions as the logistic regression
model (4.7), if we interpret it generatively as

$$p(y, x) = \frac{\exp\left\{ \sum_k \lambda_k f_k(y, x) \right\}}{\sum_{\tilde{y}, \tilde{x}} \exp\left\{ \sum_k \lambda_k f_k(\tilde{y}, \tilde{x}) \right\}}. \tag{4.9}$$

This means that if the naive Bayes model (4.5) is trained to maximize the
conditional likelihood, we recover the same classifier as from logistic regression.
Conversely, if the logistic regression model is interpreted generatively, as in (4.9), and is
trained to maximize the joint likelihood p(y, x), then we recover the same classifier
as from naive Bayes. In the terminology of Ng and Jordan [36], naive Bayes and
logistic regression form a generative-discriminative pair.
The principal advantage of discriminative modeling is that it is better suited to
including rich, overlapping features. To understand this, consider the family of naive
Bayes distributions (4.5). This is a family of joint distributions whose conditionals
all take the logistic regression form (4.7). But there are many other joint models,
some with complex dependencies among x, whose conditional distributions also
have the form (4.7). By modeling the conditional distribution directly, we can
remain agnostic about the form of p(x). This may explain why it has been observed
that CRFs tend to be more robust than generative models to violations of their
independence assumptions [24]. Simply put, CRFs make independence assumptions
among y, but not among x.
Another way to make the same point is due to Minka [34]. Suppose we have a
generative model $p_g$ with parameters $\theta$. By definition, this takes the form

$$p_g(y, x; \theta) = p_g(y; \theta)\, p_g(x \mid y; \theta). \tag{4.10}$$

But we could also rewrite $p_g$ using Bayes' rule as

$$p_g(y, x; \theta) = p_g(x; \theta)\, p_g(y \mid x; \theta), \tag{4.11}$$

where $p_g(x; \theta)$ and $p_g(y \mid x; \theta)$ are computed by inference, i.e., $p_g(x; \theta) = \sum_y p_g(y, x; \theta)$
and $p_g(y \mid x; \theta) = p_g(y, x; \theta) / p_g(x; \theta)$.
Now, compare this generative model to a discriminative model over the same
family of joint distributions. To do this, we define a prior p(x) over inputs, such that p(x)
could have arisen from $p_g$ with some parameter setting. That is, $p(x) = p_c(x; \theta') =
\sum_y p_g(y, x; \theta')$. We combine this with a conditional distribution $p_c(y \mid x; \theta)$ that
could also have arisen from $p_g$, that is, $p_c(y \mid x; \theta) = p_g(y, x; \theta) / p_g(x; \theta)$. Then the
resulting distribution is

$$p_c(y, x) = p_c(x; \theta')\, p_c(y \mid x; \theta). \tag{4.12}$$

By comparing (4.11) with (4.12), it can be seen that the conditional approach has
more freedom to fit the data, because it does not require that $\theta = \theta'$. Intuitively,
because the parameters $\theta$ in (4.11) are used in both the input distribution and the
conditional, a good set of parameters must represent both well, potentially at the
cost of trading off accuracy on p(y|x), the distribution we care about, for accuracy
on p(x), which we care less about.
In this section, we have discussed the relationship between naive Bayes and
logistic regression in detail because it mirrors the relationship between HMMs and
linear-chain CRFs. Just as naive Bayes and logistic regression are a
generative-discriminative pair, there is a discriminative analogue to HMMs, and this analogue
is a particular type of CRF, as we explain next. The analogy between naive Bayes,
logistic regression, generative models, and CRFs is depicted in figure 4.2.

4.3 Linear-Chain Conditional Random Fields

In the previous section, we have seen advantages both to discriminative modeling
and sequence modeling. So it makes sense to combine the two. This yields a
linear-chain CRF, which we describe in this section. First, in section 4.3.1, we define
linear-chain CRFs, motivating them from HMMs. Then, we discuss parameter estimation
(section 4.3.2) and inference (section 4.3.3) in linear-chain CRFs.

Figure 4.3 Graphical model of an HMM-like linear-chain CRF.

Figure 4.4 Graphical model of a linear-chain CRF in which the transition score
depends on the current observation.

4.3.1 From HMMs to CRFs

To motivate our introduction of linear-chain CRFs, we begin by considering the
conditional distribution p(y|x) that follows from the joint distribution p(y, x) of an
HMM. The key point is that this conditional distribution is in fact a CRF with a
particular choice of feature functions.
First, we rewrite the HMM joint (4.8) in a form that is more amenable to
generalization. This is

$$p(y, x) = \frac{1}{Z} \exp\left\{ \sum_t \sum_{i,j \in S} \lambda_{ij} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{y_{t-1} = j\}} + \sum_t \sum_{i \in S} \sum_{o \in O} \mu_{oi} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{x_t = o\}} \right\}, \tag{4.13}$$

where $\theta = \{\lambda_{ij}, \mu_{oi}\}$ are the parameters of the distribution, and can be any real
numbers. Every HMM can be written in this form, as can be seen simply by setting
$\lambda_{ij} = \log p(y' = i \mid y = j)$ and so on. Because we do not require the parameters to
be log probabilities, we are no longer guaranteed that the distribution sums to 1,
unless we explicitly enforce this by using a normalization constant $Z$. Despite this
added flexibility, it can be shown that (4.13) describes exactly the class of HMMs
in (4.8); we have added flexibility to the parameterization, but we have not added
any distributions to the family.
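To make the correspondence concrete, the sketch below sets the weights of (4.13) to log probabilities of a small HMM and confirms that the exponentiated score reproduces the HMM joint, with Z equal to 1. The two-state HMM, its tables, and the fixed initial state `Y0` are assumptions made up for illustration.

```python
import itertools
import math

STATES = ["A", "B"]
OBS = ["u", "v"]
Y0 = "A"  # assumed fixed initial state y_0
# TRANS[j][i] = p(y_t = i | y_{t-1} = j); EMIT[i][o] = p(x_t = o | y_t = i)
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {"A": {"u": 0.9, "v": 0.1}, "B": {"u": 0.2, "v": 0.8}}

# lambda_ij = log p(y_t = i | y_{t-1} = j), mu_oi = log p(x_t = o | y_t = i)
LAM = {(i, j): math.log(TRANS[j][i]) for i in STATES for j in STATES}
MU = {(o, i): math.log(EMIT[i][o]) for i in STATES for o in OBS}

def score(ys, xs):
    """Exponent of (4.13): the indicators select one lambda and one mu per step."""
    total, prev = 0.0, Y0
    for y, x in zip(ys, xs):
        total += LAM[(y, prev)] + MU[(x, y)]
        prev = y
    return total

def hmm_joint(ys, xs):
    """HMM joint (4.8) computed directly from the probability tables."""
    p, prev = 1.0, Y0
    for y, x in zip(ys, xs):
        p *= TRANS[prev][y] * EMIT[y][x]
        prev = y
    return p

# Normalization constant of (4.13) for sequences of length T.
T = 3
Z = sum(math.exp(score(ys, xs))
        for ys in itertools.product(STATES, repeat=T)
        for xs in itertools.product(OBS, repeat=T))
print(Z)
```

With arbitrary real weights in `LAM` and `MU`, the same enumeration would generally give a value of Z different from 1, which is why the explicit normalization constant is needed.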
We can write (4.13) more compactly by introducing the concept of feature
functions, just as we did for logistic regression in (4.7). Each feature function
has the form $f_k(y_t, y_{t-1}, x_t)$. In order to duplicate (4.13), there needs to be one
feature $f_{ij}(y, y', x) = \mathbf{1}_{\{y = i\}} \mathbf{1}_{\{y' = j\}}$ for each transition $(i, j)$ and one feature
$f_{io}(y, y', x) = \mathbf{1}_{\{y = i\}} \mathbf{1}_{\{x = o\}}$ for each state-observation pair $(i, o)$. Then we can write
an HMM as

$$p(y, x) = \frac{1}{Z} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}. \tag{4.14}$$

Again, (4.14) defines exactly the same family of distributions as (4.13), and therefore
as the original HMM equation (4.8).
The last step is to write the conditional distribution p(y|x) that results from the
HMM (4.14). This is

$$p(y \mid x) = \frac{p(y, x)}{\sum_{y'} p(y', x)} = \frac{\exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}}{\sum_{y'} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y'_t, y'_{t-1}, x_t) \right\}}. \tag{4.15}$$

This conditional distribution (4.15) is a linear-chain CRF, in particular one that
includes features only for the current word's identity. But many other linear-chain
CRFs use richer features of the input, such as prefixes and suffixes of the current
word, the identity of surrounding words, and so on. Fortunately, this extension
requires little change to our existing notation. We simply allow the feature functions
$f_k(y_t, y_{t-1}, x_t)$ to be more general than indicator functions. This leads to the general
definition of linear-chain CRFs, which we present now.
Definition 4.1
Let $Y, X$ be random vectors, $\Lambda = \{\lambda_k\} \in \mathbb{R}^K$ be a parameter vector, and
$\{f_k(y, y', x_t)\}_{k=1}^{K}$ be a set of real-valued feature functions. Then a linear-chain
conditional random field is a distribution p(y|x) that takes the form

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}, \tag{4.16}$$

where $Z(x)$ is an instance-specific normalization function

$$Z(x) = \sum_{y} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}. \tag{4.17}$$

We have just seen that if the joint p(y, x) factorizes as an HMM, then the
associated conditional distribution p(y|x) is a linear-chain CRF. This HMM-like
CRF is pictured in figure 4.3. Other types of linear-chain CRFs are also useful,
however. For example, in an HMM, a transition from state $i$ to state $j$ receives the
same score, $\log p(y_t = j \mid y_{t-1} = i)$, regardless of the input. In a CRF, we can allow
the score of the transition $(i, j)$ to depend on the current observation vector, simply
by adding a feature $\mathbf{1}_{\{y_t = j\}} \mathbf{1}_{\{y_{t-1} = i\}} \mathbf{1}_{\{x_t = o\}}$. A CRF with this kind of transition
feature, which is commonly used in text applications, is pictured in figure 4.4.
To indicate in the definition of linear-chain CRF that each feature function
can depend on observations from any time step, we have written the observation
argument to $f_k$ as a vector $x_t$, which should be understood as containing all the
components of the global observations x that are needed for computing features
at time $t$. For example, if the CRF uses the next word $x_{t+1}$ as a feature, then the
feature vector $x_t$ is assumed to include the identity of word $x_{t+1}$.
Finally, note that the normalization constant $Z(x)$ sums over all possible state
sequences, an exponentially large number of terms. Nevertheless, it can be computed
efficiently by forward-backward, as we explain in section 4.3.3.
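The definition can be illustrated with a tiny linear-chain CRF whose normalizer Z(x) is computed by explicit enumeration of all label sequences. Everything specific here is an assumption for illustration: the two labels, the start symbol standing in for y_0, the three feature functions, and their weights. Enumeration is exponential in the sequence length and is used only because the example is small.

```python
import itertools
import math

LABELS = ["Per", "Other"]
START = "<s>"  # hypothetical fixed initial state y_0

# Hypothetical feature functions f_k(y_t, y_{t-1}, x_t); x_t is the current word.
def f_cap_per(y, y_prev, x):
    return 1.0 if y == "Per" and x[0].isupper() else 0.0

def f_per_per(y, y_prev, x):
    return 1.0 if y == "Per" and y_prev == "Per" else 0.0

def f_lower_other(y, y_prev, x):
    return 1.0 if y == "Other" and x.islower() else 0.0

FEATURES = [f_cap_per, f_per_per, f_lower_other]
WEIGHTS = [1.5, 0.8, 1.0]  # illustrative lambda_k values

def score(ys, xs):
    """Exponent of (4.16): sum over positions t and features k."""
    total, prev = 0.0, START
    for y, x in zip(ys, xs):
        total += sum(w * f(y, prev, x) for w, f in zip(WEIGHTS, FEATURES))
        prev = y
    return total

def crf_prob(ys, xs):
    """p(y | x) with Z(x) from (4.17) computed by brute force."""
    z = sum(math.exp(score(cand, xs))
            for cand in itertools.product(LABELS, repeat=len(xs)))
    return math.exp(score(ys, xs)) / z

xs = ["George", "Bush", "spoke"]
print(crf_prob(("Per", "Per", "Other"), xs))
```

Unlike the HMM, the capitalization feature looks at a property of the word other than its identity, and no model of how words are generated is required.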
4.3.2 Parameter Estimation

In this section we discuss how to estimate the parameters $\theta = \{\lambda_k\}$ of a
linear-chain CRF. We are given i.i.d. training data $D = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, where each
$x^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$ is a sequence of inputs, and each $y^{(i)} = \{y_1^{(i)}, y_2^{(i)}, \ldots, y_T^{(i)}\}$
is a sequence of the desired predictions. Thus, we have relaxed the i.i.d. assumption
within each sequence, but we still assume that distinct sequences are independent.
(In section 4.4, we will see how to relax this assumption as well.)
Parameter estimation is typically performed by penalized maximum likelihood.
Because we are modeling the conditional distribution, the following log-likelihood,
sometimes called the conditional log-likelihood, is appropriate:

$$\ell(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}). \tag{4.18}$$

One way to understand the conditional likelihood $p(y \mid x; \theta)$ is to imagine combining
it with some arbitrary prior $p(x; \theta')$ to form a joint p(y, x). Then when we optimize
the joint log-likelihood

$$\log p(y, x) = \log p(y \mid x; \theta) + \log p(x; \theta'), \tag{4.19}$$

the two terms on the right-hand side are decoupled, that is, the value of $\theta'$ does
not affect the optimization over $\theta$. If we do not need to estimate p(x), then we can
simply drop the second term, which leaves (4.18).
After substituting the CRF model (4.16) into the likelihood (4.18), we get the
following expression:

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)}). \tag{4.20}$$

Before we discuss how to optimize this, we mention regularization. It is often the
case that we have a large number of parameters. As a measure to avoid overfitting,
we use regularization, which is a penalty on weight vectors whose norm is too
large. A common choice of penalty is based on the Euclidean norm of $\theta$ and on a
regularization parameter $1/2\sigma^2$ that determines the strength of the penalty. Then
the regularized log likelihood is

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)}) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}. \tag{4.21}$$

The notation for the regularizer is intended to suggest that regularization can also
be viewed as performing maximum a posteriori estimation of $\theta$, if $\theta$ is assigned
a Gaussian prior with mean 0 and covariance $\sigma^2 I$. The parameter $\sigma^2$ is a free
parameter which determines how much to penalize large weights. Determining the
best regularization parameter can require a computationally intensive parameter
sweep. Fortunately, often the accuracy of the final model does not appear to be
sensitive to changes in $\sigma^2$, even when $\sigma^2$ is varied up to a factor of 10. An alternative
choice of regularization is to use the $\ell_1$ norm instead of the Euclidean norm, which
corresponds to an exponential prior on parameters [17]. This regularizer tends to
encourage sparsity in the learned parameters.
In general, the function $\ell(\theta)$ cannot be maximized in closed form, so numeric
optimization is used. The partial derivatives of (4.21) are

$$\frac{\partial \ell}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y, y', x_t^{(i)})\, p(y, y' \mid x^{(i)}) - \frac{\lambda_k}{\sigma^2}. \tag{4.22}$$

The first term is the expected value of $f_k$ under the empirical distribution:

$$\tilde{p}(y, x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{\{y = y^{(i)}\}} \mathbf{1}_{\{x = x^{(i)}\}}. \tag{4.23}$$

The second term, which arises from the derivative of $\log Z(x)$, is the expectation
of $f_k$ under the model distribution $p(y \mid x; \theta)\, \tilde{p}(x)$. Therefore, at the unregularized
maximum likelihood solution, when the gradient is zero, these two expectations are
equal. This pleasing interpretation is a standard result about maximum-likelihood
estimation in exponential families.
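The "observed minus expected" structure of the gradient can be verified numerically on a toy problem. The sketch below is unregularized and uses a single training sequence; the binary labels, the three feature functions, and the weights are made-up assumptions. The expectation is taken by enumerating whole label sequences, which is equivalent to summing the pairwise marginals in (4.22) but only feasible for very short sequences.

```python
import itertools
import math

LABELS = [0, 1]
START = 0  # assumed fixed initial state y_0
K = 3

def feats(y, y_prev, x):
    """Hypothetical feature vector f(y_t, y_{t-1}, x_t)."""
    return [float(y == x), float(y == y_prev), float(y == 1)]

def seq_feats(ys, xs):
    """Feature totals sum_t f_k(y_t, y_{t-1}, x_t) for one labeling."""
    total, prev = [0.0] * K, START
    for y, x in zip(ys, xs):
        total = [a + b for a, b in zip(total, feats(y, prev, x))]
        prev = y
    return total

def dot(lam, f):
    return sum(l * v for l, v in zip(lam, f))

def log_lik(lam, ys, xs):
    """Unregularized conditional log-likelihood (4.20) of one sequence."""
    cands = itertools.product(LABELS, repeat=len(xs))
    log_z = math.log(sum(math.exp(dot(lam, seq_feats(c, xs))) for c in cands))
    return dot(lam, seq_feats(ys, xs)) - log_z

def gradient(lam, ys, xs):
    """Observed feature totals minus their expectation under p(y | x)."""
    cands = list(itertools.product(LABELS, repeat=len(xs)))
    weights = [math.exp(dot(lam, seq_feats(c, xs))) for c in cands]
    z = sum(weights)
    expected = [sum(w / z * seq_feats(c, xs)[k] for w, c in zip(weights, cands))
                for k in range(K)]
    return [o - e for o, e in zip(seq_feats(ys, xs), expected)]

lam = [0.5, -0.3, 0.2]              # illustrative weights
xs, ys = [1, 0, 1, 1], (1, 0, 0, 1)  # illustrative training sequence
grad = gradient(lam, ys, xs)
print(grad)
```

Comparing `grad` against a central finite difference of `log_lik` is a common sanity check when implementing CRF training; adding the Gaussian penalty of (4.21) would subtract $\lambda_k / \sigma^2$ from each component.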
Now we discuss how to optimize $\ell(\theta)$. The function $\ell(\theta)$ is concave, which follows
from the convexity of functions of the form $g(x) = \log \sum_i \exp x_i$. Concavity is
extremely helpful for parameter estimation, because it means that every local
optimum is also a global optimum. Adding regularization ensures that $\ell$ is strictly
concave, which implies that it has exactly one global optimum.
Perhaps the simplest approach to optimize $\ell$ is steepest ascent along the gradient
(4.22), but this requires too many iterations to be practical. Newton's method
converges much faster because it takes into account the curvature of the likelihood,
but it requires computing the Hessian, the matrix of all second derivatives. The size
of the Hessian is quadratic in the number of parameters. Since practical applications
often use tens of thousands or even millions of parameters, even storing the full
Hessian is not practical.
Instead, current techniques for optimizing (4.21) make approximate use of
second-order information. Particularly successful have been quasi-Newton methods such
as BFGS [3], which compute an approximation to the Hessian from only the first
derivative of the objective function. A full $K \times K$ approximation to the Hessian still
requires quadratic size, however, so a limited-memory version of BFGS is used, due
to Byrd et al. [6]. As an alternative to limited-memory BFGS, conjugate gradient
is another optimization technique that also makes approximate use of second-order
information and has been used successfully with CRFs. Either can be thought of as
a black-box optimization routine that is a drop-in replacement for vanilla gradient
ascent. When such second-order methods are used, gradient-based optimization is
much faster than the original approaches based on iterative scaling in Lafferty et al.
[24], as shown experimentally by several authors [49, 61, 26, 35]. Recently, stochastic
gradient methods, which make updates based on subsets of the training instances,
have been shown to be highly effective [58], and may be an attractive alternative
to second-order methods, which tend to evaluate the gradient over all the training
instances before making an update.
Finally, it is important to remark on the computational cost of training. Both the
partition function $Z(x)$ in the likelihood and the marginal distributions $p(y_t, y_{t-1} \mid x)$
in the gradient can be computed by forward-backward, which has computational
complexity $O(TM^2)$. However, each training instance will have a different partition
function and marginals, so we need to run forward-backward for each training
instance for each gradient computation, for a total training cost of $O(TM^2NG)$,
where $N$ is the number of training examples, and $G$ the number of gradient
computations required by the optimization procedure. For many data sets, this
cost is reasonable, but if the number of states is large, or the number of training
sequences is very large, then this can become expensive. For example, on a standard
named-entity data set, with eleven labels and 200,000 words of training data, CRF
training finishes in under two hours on current hardware. However, on a
part-of-speech tagging data set, with forty-five labels and 1 million words of training data,
CRF training requires over a week.
4.3.3 Inference

There are two common inference problems for CRFs. First, during training,
computing the gradient requires marginal distributions for each edge $p(y_t, y_{t-1} \mid x)$, and
computing the likelihood requires $Z(x)$. Second, to label an unseen instance, we
compute the most likely (Viterbi) labeling $y^* = \arg\max_y p(y \mid x)$. In linear-chain
CRFs, both inference tasks can be performed efficiently and exactly by variants
of the standard dynamic-programming algorithms for HMMs. In this section, we
briefly review the HMM algorithms, and extend them to linear-chain CRFs. These
standard inference algorithms are described in more detail by Rabiner [42].
First, we introduce notation which will simplify the forward-backward recursions.
An HMM can be viewed as a factor graph $p(y, x) = \prod_t \Psi_t(y_t, y_{t-1}, x_t)$ where $Z = 1$,
and the factors are defined as

$$\Psi_t(j, i, x) \overset{\text{def}}{=} p(y_t = j \mid y_{t-1} = i)\, p(x_t = x \mid y_t = j). \tag{4.24}$$

If the HMM is viewed as a weighted finite-state machine, then $\Psi_t(j, i, x)$ is the
weight on the transition from state $i$ to state $j$ when the current observation is $x$.

Now, we review the HMM forward algorithm, which is used to compute the
probability p(x) of the observations. The idea behind forward-backward is to first
rewrite the naive summation $p(x) = \sum_y p(x, y)$ using the distributive law:

$$p(x) = \sum_{y} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x_t) \tag{4.25}$$

$$= \sum_{y_T} \sum_{y_{T-1}} \Psi_T(y_T, y_{T-1}, x_T) \sum_{y_{T-2}} \Psi_{T-1}(y_{T-1}, y_{T-2}, x_{T-1}) \sum_{y_{T-3}} \cdots \tag{4.26}$$

Now we observe that each of the intermediate sums is reused many times during
the computation of the outer sum, and so we can save an exponential amount of
work by caching the inner sums.
This leads to defining a set of forward variables $\alpha_t$, each of which is a vector
of size M (where M is the number of states) which stores one of the intermediate
sums. These are defined as

$$
\begin{aligned}
\alpha_t(j) &\overset{\text{def}}{=} p(x_{1 \ldots t}, y_t = j) && (4.27) \\
            &= \sum_{y_{1 \ldots t-1}} \Psi_t(j, y_{t-1}, x_t) \prod_{t'=1}^{t-1} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}), && (4.28)
\end{aligned}
$$

where the summation over $y_{1 \ldots t-1}$ ranges over all assignments to the sequence
of random variables $y_1, y_2, \ldots, y_{t-1}$. The alpha values can be computed by the
recursion

$$ \alpha_t(j) = \sum_{i \in S} \Psi_t(j, i, x_t)\, \alpha_{t-1}(i), \tag{4.29} $$

with initialization $\alpha_1(j) = \Psi_1(j, y_0, x_1)$. (Recall that $y_0$ is the fixed initial state of
the HMM.) It is easy to see that $p(x) = \sum_{y_T} \alpha_T(y_T)$ by repeatedly substituting the
recursion (4.29) to obtain (4.26). A formal proof would use induction.
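The forward recursion (4.29) can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the chapter: the array layout `psi[t, i, j]` (factor for moving from state i to state j at 0-based position t), the function names, and the random test factors are all assumptions made for the example. The brute-force function is included only to check the result against the naive summation (4.25).

```python
import itertools
import numpy as np

def forward(psi, y0=0):
    """Forward variables for a chain with factors psi[t, i, j].

    Assumed layout (not from the text): psi[t, i, j] holds the factor for
    moving from state i to state j at 0-based position t, and y0 is the
    index of the fixed initial state.
    """
    T, M, _ = psi.shape
    alpha = np.zeros((T, M))
    alpha[0] = psi[0, y0, :]              # initialization from the fixed state y0
    for t in range(1, T):
        # alpha_t(j) = sum_i psi_t(j, i, x_t) * alpha_{t-1}(i)
        alpha[t] = alpha[t - 1] @ psi[t]
    return alpha

def brute_force_total(psi, y0=0):
    """Naive sum over all M**T state sequences, for checking only."""
    T, M, _ = psi.shape
    total = 0.0
    for ys in itertools.product(range(M), repeat=T):
        prev, weight = y0, 1.0
        for t, y in enumerate(ys):
            weight *= psi[t, prev, y]
            prev = y
        total += weight
    return total

rng = np.random.default_rng(0)
psi = rng.random((5, 3, 3))               # arbitrary positive factors
p_x = forward(psi)[-1].sum()              # sum over y_T of alpha_T(y_T)
```

The recursion costs on the order of T M^2 operations, whereas the brute-force sum visits all M^T sequences.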
The backward recursion is exactly the same, except that in (4.26), we push in
the summations in reverse order. This results in the definition

$$
\begin{aligned}
\beta_t(i) &\overset{\text{def}}{=} p(x_{t+1 \ldots T} \mid y_t = i) && (4.30) \\
           &= \sum_{y_{t+1 \ldots T}} \prod_{t'=t+1}^{T} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}), && (4.31)
\end{aligned}
$$

and the recursion

$$ \beta_t(i) = \sum_{j \in S} \Psi_{t+1}(j, i, x_{t+1})\, \beta_{t+1}(j), \tag{4.32} $$

which is initialized $\beta_T(i) = 1$. Analogously to the forward case, we can compute
$p(x)$ using the backward variables as $p(x) = \beta_0(y_0) \overset{\text{def}}{=} \sum_{y_1} \Psi_1(y_1, y_0, x_1)\, \beta_1(y_1)$.


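By combining results from the forward and backward recursions, we can compute
the marginal distributions needed for the gradient (4.22). Applying the distributive
law again, we see that

$$ p(y_{t-1}, y_t \mid x) \propto \Psi_t(y_t, y_{t-1}, x_t) \left( \sum_{y_{1 \ldots t-2}} \prod_{t'=1}^{t-1} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \right) \left( \sum_{y_{t+1 \ldots T}} \prod_{t'=t+1}^{T} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \right), \tag{4.33} $$

which can be computed from the forward and backward recursions as

$$ p(y_{t-1}, y_t \mid x) \propto \alpha_{t-1}(y_{t-1})\, \Psi_t(y_t, y_{t-1}, x_t)\, \beta_t(y_t), \tag{4.34} $$

where the constant of proportionality is $1/p(x)$.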
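As an illustration of (4.34), the following sketch combines the forward and backward variables into normalized edge marginals. The array layout `psi[t, i, j]` (state i to state j at 0-based position t), the function names, and the random factors are assumptions made for this example only.

```python
import numpy as np

def forward(psi, y0=0):
    """alpha[t, j], with the fixed initial state y0 (assumed index)."""
    T, M, _ = psi.shape
    alpha = np.zeros((T, M))
    alpha[0] = psi[0, y0, :]
    for t in range(1, T):
        alpha[t] = alpha[t - 1] @ psi[t]
    return alpha

def backward(psi):
    """beta[t, i], initialized to 1 at the last position."""
    T, M, _ = psi.shape
    beta = np.ones((T, M))
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j psi_{t+1}(j, i, x_{t+1}) * beta_{t+1}(j)
        beta[t] = psi[t + 1] @ beta[t + 1]
    return beta

def edge_marginals(psi, y0=0):
    """Array of shape (T-1, M, M): entry [k, i, j] is the marginal of the
    pair (state i at position k, state j at position k+1)."""
    alpha, beta = forward(psi, y0), backward(psi)
    total = alpha[-1].sum()               # p(x) for an HMM, Z(x) for a CRF
    # alpha_{t-1}(i) * psi_t(j, i, x_t) * beta_t(j), divided by the total
    return alpha[:-1, :, None] * psi[1:] * beta[1:, None, :] / total

rng = np.random.default_rng(0)
psi = rng.random((6, 4, 4))               # arbitrary positive factors
marg = edge_marginals(psi)
```

Each slice `marg[k]` is a proper joint distribution over a pair of neighboring labels, and adjacent slices agree on the single-label marginal they share.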

Finally, to compute the globally most probable assignment $y^* = \arg\max_y p(y \mid x)$,
we observe that the trick in (4.26) still works if all the summations are replaced by
maximization. This yields the Viterbi recursion:

$$ \delta_t(j) = \max_{i \in S} \Psi_t(j, i, x_t)\, \delta_{t-1}(i). \tag{4.35} $$
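A minimal sketch of the Viterbi recursion (4.35) with back-pointers follows. The factor layout `psi[t, i, j]` (state i to state j at 0-based position t), the function names, and the random factors are assumptions for illustration; the enumeration routine exists only to check the dynamic program on a small example.

```python
import itertools
import numpy as np

def viterbi(psi, y0=0):
    """Return the highest-scoring state sequence and its unnormalized score."""
    T, M, _ = psi.shape
    delta = np.zeros((T, M))
    back = np.zeros((T, M), dtype=int)
    delta[0] = psi[0, y0, :]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * psi[t]   # scores[i, j]
        back[t] = scores.argmax(axis=0)           # best predecessor i for each j
        delta[t] = scores.max(axis=0)             # max_i psi_t(j, i, x_t) * delta_{t-1}(i)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                 # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

def best_by_enumeration(psi, y0=0):
    """Exhaustive search over all M**T sequences, for checking only."""
    T, M, _ = psi.shape
    best_path, best_score = None, -1.0
    for ys in itertools.product(range(M), repeat=T):
        prev, weight = y0, 1.0
        for t, y in enumerate(ys):
            weight *= psi[t, prev, y]
            prev = y
        if weight > best_score:
            best_path, best_score = list(ys), weight
    return best_path, best_score

rng = np.random.default_rng(0)
psi = rng.random((5, 3, 3))
path, score = viterbi(psi)
```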

Now that we have described the forward-backward and Viterbi algorithms for
HMMs, the generalization to linear-chain CRFs is fairly straightforward. The
forward-backward algorithm for linear-chain CRFs is identical to the HMM version,
except that the transition weights $\Psi_t(j, i, x_t)$ are defined differently. We observe that
the CRF model (4.16) can be rewritten as

$$ p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x_t), \tag{4.36} $$

where we define

$$ \Psi_t(y_t, y_{t-1}, x_t) = \exp\left\{ \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}. \tag{4.37} $$

With that definition, the forward recursion (4.29), the backward recursion (4.32),
and the Viterbi recursion (4.35) can be used unchanged for linear-chain CRFs.
Instead of computing $p(x)$ as in an HMM, in a CRF the forward and backward
recursions compute $Z(x)$.
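To make the connection between (4.37) and the forward recursion concrete, here is a small sketch that builds CRF factors from weighted feature functions and computes Z(x). The two feature functions, their weights, the two-label set, and the example sentence are all hypothetical; only the form of the factors follows the text.

```python
import itertools
import numpy as np

def crf_factors(x, weights, features, M):
    """psi[t, i, j] = exp(sum_k lambda_k * f_k(y_t=j, y_{t-1}=i, x_t))."""
    psi = np.empty((len(x), M, M))
    for t in range(len(x)):
        for i in range(M):
            for j in range(M):
                score = sum(lam * f(j, i, x[t]) for lam, f in zip(weights, features))
                psi[t, i, j] = np.exp(score)
    return psi

def partition_function(psi, y0=0):
    """Z(x) via the forward recursion, with an assumed fixed start state y0."""
    alpha = psi[0, y0, :]
    for t in range(1, len(psi)):
        alpha = alpha @ psi[t]
    return alpha.sum()

# Hypothetical features over two labels (0 = OTHER, 1 = NAME).
features = [
    lambda y, y_prev, word: float(y == 1 and word[:1].isupper()),
    lambda y, y_prev, word: float(y == y_prev),
]
weights = [1.5, 0.5]
x = ["Robert", "Booth", "speaks"]

psi = crf_factors(x, weights, features, M=2)
Z = partition_function(psi)

def prob(y, y0=0):
    """p(y | x) as in (4.36): product of factors divided by Z(x)."""
    prev, weight = y0, 1.0
    for t, y_t in enumerate(y):
        weight *= psi[t, prev, y_t]
        prev = y_t
    return weight / Z

total = sum(prob(y) for y in itertools.product(range(2), repeat=len(x)))
```

Because the factors are exponentials of arbitrary real scores, they need not sum to one locally; the division by Z(x) is what makes `prob` a distribution over labelings.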
A final inference task that is useful in some applications is to compute a marginal
probability $p(y_t, y_{t+1}, \ldots, y_{t+k} \mid x)$ over a range of nodes. For example, this is useful
for measuring the model's confidence in its predicted labeling over a segment of
input. This marginal probability can be computed efficiently using constrained
forward-backward, as described by Culotta and McCallum [12].

4.4 CRFs in General
In this section, we define CRFs with general graphical structure, as they were
introduced originally [24]. Although initial applications of CRFs used linear chains,
there have been many later applications of CRFs with more general graphical
structures. Such structures are especially useful for relational learning, because
they allow relaxing the i.i.d. assumption among entities. Also, although CRFs have
typically been used for across-network classification, in which the training and
testing data are assumed to be independent, we will see that CRFs can be used for
within-network classification as well, in which we model probabilistic dependencies
between the training and testing data.
The generalization from linear-chain CRFs to general CRFs is fairly straightforward. We simply move from using a linear-chain factor graph to a more general
factor graph, and from forward-backward to more general (perhaps approximate)
inference algorithms.
4.4.1 Model

First we present the general definition of a CRF.

Definition 4.2
Let G be a factor graph over Y. Then $p(y \mid x)$ is a CRF if for any fixed x, the
distribution $p(y \mid x)$ factorizes according to G.

Thus, every conditional distribution $p(y \mid x)$ is a CRF for some, perhaps trivial,
factor graph. If $F = \{\Psi_A\}$ is the set of factors in G, and each factor takes the
exponential family form (4.3), then the conditional distribution can be written as

$$ p(y \mid x) = \frac{1}{Z(x)} \prod_{\Psi_A \in G} \exp\left\{ \sum_{k=1}^{K(A)} \lambda_{Ak} f_{Ak}(y_A, x_A) \right\}. \tag{4.38} $$

In addition, practical models rely extensively on parameter tying. For example, in the linear-chain case, often the same weights are used for the factors
$\Psi_t(y_t, y_{t-1}, x_t)$ at each time step. To denote this, we partition the factors of G
into $\mathcal{C} = \{C_1, C_2, \ldots, C_P\}$, where each $C_p$ is a clique template whose parameters are
tied. This notion of clique template generalizes that in Taskar et al. [55], Sutton
et al. [54], and Richardson and Domingos [43]. Each clique template $C_p$ is a set
of factors which has a corresponding set of sufficient statistics $\{f_{pk}(x_p, y_p)\}$ and
parameters $\theta_p \in \mathbb{R}^{K(p)}$. Then the CRF can be written as

$$ p(y \mid x) = \frac{1}{Z(x)} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, y_c; \theta_p), \tag{4.39} $$

where each factor is parameterized as

$$ \Psi_c(x_c, y_c; \theta_p) = \exp\left\{ \sum_{k=1}^{K(p)} \lambda_{pk} f_{pk}(x_c, y_c) \right\}, \tag{4.40} $$

and the normalization function is

$$ Z(x) = \sum_{y} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, y_c; \theta_p). \tag{4.41} $$

For example, in a linear-chain CRF, typically one clique template $C = \{\Psi_t(y_t, y_{t-1}, x_t)\}_{t=1}^{T}$ is used for the entire network.
Several special cases of CRFs are of particular interest. First, dynamic conditional
random fields [54] are sequence models which allow multiple labels at each time
step, rather than single labels as in linear-chain CRFs. Second, relational Markov
networks [55] are a type of general CRF in which the graphical structure and
parameter tying are determined by an SQL-like syntax. Finally, Markov logic
networks [43, 50] are a type of probabilistic logic in which there are parameters
for each first-order rule in a knowledge base.
4.4.2 Applications of CRFs

CRFs have been applied to a variety of domains, including text processing, computer vision, and bioinformatics. In this section, we discuss several applications,
highlighting the different graphical structures that occur in the literature.
One of the first large-scale applications of CRFs was by Sha and Pereira [49], who
matched state-of-the-art performance on segmenting noun phrases in text. Since
then, linear-chain CRFs have been applied to many problems in NLP, including
named-entity recognition [30], feature induction for NER [28], identifying protein
names in biology abstracts [48], segmenting addresses in webpages [13], finding
semantic roles in text [45], identifying the sources of opinions [8], Chinese word
segmentation [38], Japanese morphological analysis [22], and many others.
In bioinformatics, CRFs have been applied to RNA structural alignment [47]
and protein structure prediction [25]. Semi-Markov CRFs [46] add somewhat more
flexibility in choosing features, which may be useful for certain tasks in information
extraction and especially bioinformatics.
General CRFs have also been applied to several tasks in NLP. One promising
application is to perform multiple labeling tasks simultaneously. For example,
Sutton et al. [54] show that a two-level dynamic CRF for part-of-speech tagging and
noun phrase chunking performs better than solving the tasks one at a time. Another
application is to multilabel classification, in which each instance can have multiple
class labels. Rather than learning an independent classifier for each category,
Ghamrawi and McCallum [16] present a CRF that learns dependencies between
the categories, resulting in improved classification performance. Finally, the skip-chain CRF, which we present in section 4.5, is a general CRF that represents
long-distance dependencies in information extraction.
An interesting graphical CRF structure has been applied to the problem of proper
noun coreference, that is, of determining which mentions in a document, such as
"Mr. President" and "he", refer to the same underlying entity. McCallum and Wellner
[31] learn a distance metric between mentions using a fully connected CRF in
which inference corresponds to graph partitioning. A similar model has been used
to segment handwritten characters and diagrams [11, 40].
In some applications of CRFs, efficient dynamic programs exist even though
the graphical model is difficult to specify. For example, McCallum et al. [33] learn
the parameters of a string-edit model in order to discriminate between matching
and nonmatching pairs of strings. Also, there is work on using CRFs to learn
distributions over the derivations of a grammar [44, 9, 51, 57]. A potentially useful
unifying framework for this type of model is provided by case-factor diagrams [27].
In computer vision, several authors have used grid-shaped CRFs [18, 23] for
labeling and segmenting images. Also, for recognizing objects, Quattoni et al.
[41] use a tree-shaped CRF in which latent variables are designed to recognize
characteristic parts of an object.
4.4.3 Parameter Estimation

Parameter estimation for general CRFs is essentially the same as for linear chains,
except that computing the model expectations requires more general inference
algorithms. First, we discuss the fully observed case, in which the training and
testing data are independent, and the training data is fully observed. In this case
the conditional log-likelihood is given by

$$ \ell(\theta) = \sum_{C_p \in \mathcal{C}} \sum_{\Psi_c \in C_p} \sum_{k=1}^{K(p)} \lambda_{pk} f_{pk}(x_c, y_c) - \log Z(x). \tag{4.42} $$

It is worth noting that the equations in this section do not explicitly sum over
training instances, because if a particular application happens to have i.i.d. training
instances, they can be represented by disconnected components in the graph G.
The partial derivative of the log-likelihood with respect to a parameter $\lambda_{pk}$
associated with a clique template $C_p$ is

$$ \frac{\partial \ell}{\partial \lambda_{pk}} = \sum_{\Psi_c \in C_p} f_{pk}(x_c, y_c) - \sum_{\Psi_c \in C_p} \sum_{y'_c} f_{pk}(x_c, y'_c)\, p(y'_c \mid x). \tag{4.43} $$

The function $\ell(\theta)$ has many of the same properties as in the linear-chain case.
First, the zero-gradient conditions can be interpreted as requiring that the sufficient
statistics $F_{pk}(x, y) = \sum_{\Psi_c} f_{pk}(x_c, y_c)$ have the same expectations under
the empirical distribution and under the model distribution. Second, the function
$\ell(\theta)$ is concave, and can be efficiently maximized by second-order techniques such
as conjugate gradient and L-BFGS. Finally, regularization is used just as in the
linear-chain case.
Now, we discuss the case of within-network classification, where there are dependencies between the training and testing data. That is, the random variables y are
partitioned into a set $y_{tr}$ that is observed during training and a set $y_{tst}$ that is
unobserved during training. It is assumed that the graph G contains connections
between $y_{tr}$ and $y_{tst}$.
Within-network classification can be viewed as a kind of latent variable problem,
in which certain variables, in this case $y_{tst}$, are not observed in the training data.
It is more difficult to train CRFs with latent variables, because optimizing the
likelihood $p(y_{tr} \mid x)$ requires marginalizing out the latent variables $y_{tst}$. Because of
this difficulty, the original work on CRFs focused on fully observed training data,
but recently there has been increasing interest in training latent-variable CRFs
[41, 33].
Suppose we have a CRF with inputs x in which the output variables y are
observed in the training data, but we have additional variables w that are latent,
so that the CRF has the form

$$ p(y, w \mid x) = \frac{1}{Z(x)} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, w_c, y_c; \theta_p). \tag{4.44} $$

The objective function to maximize during training is the marginal likelihood

$$ \ell(\theta) = \log p(y \mid x) = \log \sum_{w} p(y, w \mid x). \tag{4.45} $$

The first question is how even to compute the marginal likelihood $\ell(\theta)$, because if
there are many variables w, the sum cannot be computed directly. The key is to
realize that we need to compute $\log \sum_w p(y, w \mid x)$ not for any possible assignment
y, but only for the particular assignment that occurs in the training data. This
motivates taking the original CRF (4.44), and clamping the variables Y to their
observed values in the training data, yielding a distribution over w:

$$ p(w \mid y, x) = \frac{1}{Z(y, x)} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, w_c, y_c; \theta_p), \tag{4.46} $$

where the normalization factor is

$$ Z(y, x) = \sum_{w} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, w_c, y_c; \theta_p). \tag{4.47} $$

This new normalization constant $Z(y, x)$ can be computed by the same inference
algorithm that we use to compute $Z(x)$. In fact, $Z(y, x)$ is easier to compute,
because it sums only over w, while $Z(x)$ sums over both w and y. Graphically, this
amounts to saying that clamping the variables y in the graph G can simplify the
structure among w.


Once we have $Z(y, x)$, the marginal likelihood can be computed as

$$ p(y \mid x) = \frac{1}{Z(x)} \sum_{w} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, w_c, y_c; \theta_p) = \frac{Z(y, x)}{Z(x)}. \tag{4.48} $$

Now that we have a way to compute $\ell$, we discuss how to maximize it with
respect to $\theta$. Maximizing $\ell(\theta)$ can be difficult because $\ell$ is no longer concave in
general (intuitively, log-sum-exp is convex, but the difference of two log-sum-exp
functions might be neither convex nor concave), so optimization procedures are typically guaranteed to find
only local maxima. Whatever optimization technique is used, the model parameters
must be carefully initialized in order to reach a good local maximum.
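The identity (4.48) can be checked by brute force on a tiny model. The sketch below is purely illustrative: the toy structure (two binary outputs, two binary latent variables, three pairwise factors) and the random factor tables are assumptions, and a real model would replace the explicit sums with an inference algorithm.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy model: outputs y = (y1, y2) and latent w = (w1, w2),
# all binary, connected in a chain y1 - w1 - w2 - y2.
psi_y1w1 = rng.random((2, 2))
psi_w1w2 = rng.random((2, 2))
psi_w2y2 = rng.random((2, 2))

def unnormalized(y, w):
    """Product of all factors for one joint assignment (y, w)."""
    return psi_y1w1[y[0], w[0]] * psi_w1w2[w[0], w[1]] * psi_w2y2[w[1], y[1]]

assignments = list(itertools.product(range(2), repeat=2))

def Z_clamped(y):
    """Z(y, x): sum over the latent variables only, with y held fixed."""
    return sum(unnormalized(y, w) for w in assignments)

# Z(x): sum over both y and w.
Z = sum(Z_clamped(y) for y in assignments)

def marginal_likelihood(y):
    """p(y | x) = Z(y, x) / Z(x)."""
    return Z_clamped(y) / Z

def log_marginal_likelihood(y):
    """The training objective: a difference of two log-sum-exp terms."""
    return np.log(Z_clamped(y)) - np.log(Z)
```

Writing the objective as `log Z_clamped(y) - log Z` makes the source of non-concavity visible: both terms are log-sum-exp functions of the parameters, and they enter with opposite signs.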
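We discuss two different ways to maximize $\ell$: directly using the gradient, as in
Quattoni et al. [41]; and using expectation maximization (EM), as in McCallum
et al. [33]. To maximize $\ell$ directly, we need to calculate its gradient. The simplest
way to do this is to use the following fact. For any function $f(\theta)$, we have

$$ \frac{df}{d\theta} = f(\theta)\, \frac{d \log f}{d\theta}, \tag{4.49} $$

which can be seen by applying the chain rule to $\log f$ and rearranging. Applying
this to the marginal likelihood $\ell(\theta) = \log \sum_w p(y, w \mid x)$ yields

$$
\begin{aligned}
\frac{\partial \ell}{\partial \lambda_{pk}} &= \frac{1}{\sum_{w} p(y, w \mid x)} \sum_{w} \frac{\partial}{\partial \lambda_{pk}} \big[ p(y, w \mid x) \big] && (4.50) \\
&= \sum_{w} p(w \mid y, x)\, \frac{\partial}{\partial \lambda_{pk}} \big[ \log p(y, w \mid x) \big]. && (4.51)
\end{aligned}
$$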

This is the expectation of the fully observed gradient, where the expectation is
taken over w. This expression simplifies to

$$ \frac{\partial \ell}{\partial \lambda_{pk}} = \sum_{\Psi_c \in C_p} \sum_{w'_c} p(w'_c \mid y, x)\, f_k(y_c, x_c, w'_c) - \sum_{\Psi_c \in C_p} \sum_{w'_c, y'_c} p(w'_c, y'_c \mid x_c)\, f_k(y'_c, x_c, w'_c). \tag{4.52} $$

This gradient requires computing two different kinds of marginal probabilities.
The first term contains a marginal probability $p(w'_c \mid y, x)$, which is exactly a
marginal distribution of the clamped CRF (4.46). The second term contains a
different marginal $p(w'_c, y'_c \mid x_c)$, which is the same marginal probability required
in a fully observed CRF. Once we have computed the gradient, $\ell$ can be maximized
by standard techniques such as conjugate gradient. In our experience, conjugate
gradient tolerates violations of convexity better than limited-memory BFGS, so it
may be a better choice for latent-variable CRFs.
Alternatively, $\ell$ can be optimized using EM. At each iteration j in the EM
algorithm, the current parameter vector $\theta^{(j)}$ is updated as follows. First, in the
E-step, an auxiliary function $q(w)$ is computed as $q(w) = p(w \mid y, x; \theta^{(j)})$. Second,
in the M-step, a new parameter vector $\theta^{(j+1)}$ is chosen as

$$ \theta^{(j+1)} = \arg\max_{\theta'} \sum_{w'} q(w') \log p(y, w' \mid x; \theta'). \tag{4.53} $$

The direct maximization algorithm and the EM algorithm are strikingly similar.
This can be seen by substituting the definition of q into (4.53) and taking derivatives. The gradient is almost identical to the direct gradient (4.52). The only difference is that in EM, the distribution $p(w \mid y, x)$ is obtained from a previous, fixed
parameter setting rather than from the argument of the maximization. We are unaware of any empirical comparison of EM to direct optimization for latent-variable
CRFs.
4.4.4 Inference

In general CRFs, just as in the linear-chain case, gradient-based training requires
computing marginal distributions $p(y_c \mid x)$, and testing requires computing the most
likely assignment $y^* = \arg\max_y p(y \mid x)$. This can be accomplished using any
inference algorithm for graphical models. If the graph has small treewidth, then the
junction tree algorithm can be used to exactly compute the marginals, but because
both inference problems are NP-hard for general graphs, this is not always possible.
In such cases, approximate inference must be used to compute the gradient. In this
section, we mention various approximate inference algorithms that have been used
successfully with CRFs. Detailed discussion of these is beyond the scope of this
chapter.
When choosing an inference algorithm to use within CRF training, the important
thing to understand is that it will be invoked repeatedly, once for each time that
the gradient is computed. For this reason, sampling-based approaches which may
take many iterations to converge, such as Markov chain Monte Carlo (MCMC),
have not been popular, although they might be appropriate in some circumstances.
Indeed, contrastive divergence [19], in which an MCMC sampler is run for only a
few samples, has been successfully applied to CRFs in vision [18].
Because of their computational efficiency, variational approaches have been most
popular for CRFs. Several authors [55, 54] have used loopy belief propagation. Belief
propagation is an exact inference algorithm for trees which generalizes forward-backward. Although the generalizations of the forward-backward recursions, which
are called message updates, are neither exact nor even guaranteed to converge if
the model is not a tree, they are still well-defined, and they have been empirically
successful in a wide variety of domains, including text processing, vision, and error-correcting codes. In the past five years, there has been much theoretical analysis of
the algorithm as well. We refer the reader to [63] for more information.


4.4.5 Discussion

This section contains miscellaneous remarks about CRFs. First, it is easily seen that
the logistic regression model (4.7) is a CRF with a single output variable. Thus,
CRFs can be viewed as an extension of logistic regression to arbitrary graphical
structures.
Linear-chain CRFs were originally introduced as an improvement to the maximum-entropy Markov model (MEMM) [32], which is essentially a Markov model in which
the transition distributions are given by a logistic regression model. MEMMs can
exhibit the problems of label bias [24] and observation bias [20]. Both of these
problems can be readily understood graphically: the directed model of an MEMM
implies that for all time steps t, the observation $x_t$ is marginally independent of
the labels $y_{t-1}$, $y_{t-2}$, and so on, an independence assumption which is usually
strongly violated in sequence modeling. Sometimes this assumption can be effectively avoided by including information from previous time steps as features, and
this explains why MEMMs have had success in some NLP applications.
Although we have emphasized the view of a CRF as a model of the conditional
distribution, one could view it as an objective function for parameter estimation of
joint distributions. As such, it is one objective among many, including generative
likelihood, pseudolikelihood [4], and the maximum-margin objective [56, 2]. Another
related discriminative technique for structured models is the averaged perceptron,
which has been especially popular in the natural language community [10], in large
part because of its ease of implementation. To date, there has been little careful
comparison of these, especially CRFs and max-margin approaches, across different
structures and domains.
Given this view, it is natural to imagine training directed models by conditional
likelihood, and in fact this is commonly done in the speech community, where it is
called maximum mutual information training. However, it is no easier to maximize
the conditional likelihood in a directed model than an undirected model, because in
a directed model the conditional likelihood requires computing log p(x), which plays
the same role as Z(x) in the CRF likelihood. In fact, training is more complex in a
directed model, because the model parameters are constrained to be probabilities,
constraints which can make the optimization problem more difficult. This is in stark
contrast to the joint likelihood, which is much easier to compute for directed models
than undirected models (although recently several efficient parameter estimation
techniques have been proposed for undirected factor graphs, such as Abbeel et al.
[1] and Wainwright et al. [60]).
4.4.6 Implementation Concerns

There are a few implementation techniques that can help both training time and
accuracy of CRFs, but are not always fully discussed in the literature. Although
these apply especially to language applications, they are also useful more generally.


First, when the predicted variables are discrete, the features $f_{pk}$ are ordinarily
chosen to have a particular form:

$$ f_{pk}(y_c, x_c) = \mathbf{1}_{\{y_c = \tilde{y}_c\}}\, q_{pk}(x_c). \tag{4.54} $$

In other words, each feature is nonzero only for a single output configuration $\tilde{y}_c$, but
as long as that constraint is met, then the feature value depends only on the input
observation. Essentially, this means that we can think of our features as depending
only on the input $x_c$, but that we have a separate set of weights for each output
configuration. This feature representation is also computationally efficient, because
computing each $q_{pk}$ may involve nontrivial text or image processing, and it need be
evaluated only once for every feature that uses it. To avoid confusion, we refer to
the functions $q_{pk}(x_c)$ as observation functions rather than as features. Examples of
observation functions are "word $x_t$ is capitalized" and "word $x_t$ ends in ing".
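The factorization (4.54) suggests a simple implementation pattern: evaluate each observation function once per position, then keep one weight per (output configuration, observation function) pair. The sketch below illustrates this for single-label (node) factors only; the label set, the weight values, and the example sentence are hypothetical, while the two observation functions are the ones named in the text.

```python
import numpy as np

# Observation functions q_k(x, t); both examples come from the text.
observation_functions = [
    lambda x, t: float(x[t][:1].isupper()),     # "word x_t is capitalized"
    lambda x, t: float(x[t].endswith("ing")),   # "word x_t ends in ing"
]
labels = ["O", "NAME"]                          # hypothetical label set

def observe(x):
    """Evaluate every q_k once per position: Q[t, k] = q_k(x, t)."""
    return np.array([[q(x, t) for q in observation_functions]
                     for t in range(len(x))])

def node_scores(Q, W):
    """scores[t, y] = sum_k W[y, k] * Q[t, k].

    W[y, k] is the weight of the feature 1{y_t = y} * q_k(x, t), so each
    observation value is reused across all output configurations.
    """
    return Q @ W.T

x = ["Robert", "is", "speaking"]
W = np.array([[0.0, 0.3],      # weights for label "O"      (made up)
              [1.2, -0.5]])    # weights for label "NAME"   (made up)
Q = observe(x)
scores = node_scores(Q, W)
```

Exponentiating `scores[t, y]` would give a node factor value; edge factors follow the same pattern with one weight row per label pair.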
This representation can lead to a large number of features, which can have
significant memory and time requirements. For example, matching state-of-the-art results on a standard natural language task, Sha and Pereira [49] use 3.8 million features. Not
all of these features are ever nonzero in the training data. In particular, some
observation functions $q_{pk}$ are nonzero only for certain output configurations. This
point can be confusing: One might think that such features can have no effect on
the likelihood, but actually they do affect $Z(x)$, so putting a negative weight on
them can improve the likelihood by making wrong answers less likely. In order to
save memory, however, sometimes these unsupported features, that is, those which
never occur in the training data, are removed from the model. In practice, however,
including unsupported features typically results in better accuracy.
In order to get the benefits of unsupported features with less memory, we have
had success with an ad hoc technique for selecting only a few unsupported features.
The main idea is to add unsupported features only for likely paths, as follows: first
train a CRF without any unsupported features, stopping after only a few iterations;
then add unsupported features $f_{pk}(y_c, x_c)$ for cases where $x_c$ occurs in the training
data, and $p(y_c \mid x) > \epsilon$. McCallum [28] presents a more principled method of feature
selection for CRFs.
Second, if the observations are categorical rather than ordinal, that is, if they
are discrete but have no intrinsic order, it is important to convert them to binary
features. For example, it makes sense to learn a linear weight on $f_k(y, x_t)$ when $f_k$
is 1 if $x_t$ is the word "dog" and 0 otherwise, but not when $f_k$ is the integer index
of word $x_t$ in the text's vocabulary. Thus, in text applications, CRF features are
typically binary; in other application areas, such as vision and speech, they are
more commonly real-valued.
Third, in language applications, it is sometimes helpful to include redundant
factors in the model. For example, in a linear-chain CRF, one may choose to include
both edge factors $\Psi_t(y_t, y_{t-1}, x_t)$ and variable factors $\Psi_t(y_t, x_t)$. Although one could
define the same family of distributions using only edge factors, the redundant node
factors provide a kind of backoff, which is useful when there is too little data.


In language applications, there is always too little data, even when hundreds of
thousands of words are available.
Finally, often the probabilities involved in forward-backward and belief propagation become too small to be represented within numeric precision. There are two
standard approaches to this common problem. One approach is to normalize each
of the vectors $\alpha_t$ and $\beta_t$ to sum to 1, thereby magnifying small values. A second
approach is to perform computations in the logarithmic domain, e.g., the forward
recursion becomes

$$ \log \alpha_t(j) = \bigoplus_{i \in S} \big( \log \Psi_t(j, i, x_t) + \log \alpha_{t-1}(i) \big), \tag{4.55} $$

where $\oplus$ is the operator $a \oplus b = \log(e^a + e^b)$. At first, this does not seem much of
an improvement, since numeric precision is lost when computing $e^a$ and $e^b$. But
$\oplus$ can be computed as

$$ a \oplus b = a + \log(1 + e^{b-a}) = b + \log(1 + e^{a-b}), \tag{4.56} $$

which can be much more numerically stable, particularly if we pick the version of
the identity with the smaller exponent. CRF implementations often use the log-space approach because it makes computing $Z(x)$ more convenient, but in some
applications, the computational expense of taking logarithms is an issue, making
normalization preferable.
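A minimal sketch of the log-domain forward recursion (4.55), using the stable form (4.56) of the log-add operator, is shown below. The factor layout `log_psi[t, i, j]` (state i to state j at 0-based position t), the function names, and the random factors are assumptions for illustration. The implementation always exponentiates the non-positive difference, which is one way to "pick the version with the smaller exponent".

```python
import numpy as np

def log_add(a, b):
    """a (+) b = log(e^a + e^b), computed as in (4.56) without overflow."""
    if a == -np.inf:
        return b
    if b == -np.inf:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + np.log1p(np.exp(lo - hi))     # the exponent lo - hi is <= 0

def log_forward(log_psi, y0=0):
    """Return the log of the total weight (log p(x) or log Z(x)), using the
    recursion (4.55) on log-factors."""
    T, M, _ = log_psi.shape
    log_alpha = log_psi[0, y0, :].copy()
    for t in range(1, T):
        new = np.full(M, -np.inf)
        for j in range(M):
            for i in range(M):
                new[j] = log_add(new[j], log_psi[t, i, j] + log_alpha[i])
        log_alpha = new
    total = -np.inf
    for value in log_alpha:
        total = log_add(total, value)
    return total

rng = np.random.default_rng(0)
log_psi = np.log(rng.random((6, 3, 3)))
log_Z = log_forward(log_psi)

# Shifting every log-factor by -1000 would underflow in the probability
# domain (factors around e^-1000), but is harmless in log space.
log_Z_tiny = log_forward(log_psi - 1000.0)
```

In practice the double loop would usually be replaced by a vectorized log-sum-exp routine; it is written out here to mirror (4.55) term by term.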

4.5 Skip-Chain CRFs
In this section, we present a case study of applying a general CRF to a practical
natural language problem. In particular, we consider a problem in information
extraction, the task of building a database automatically from unstructured text.
Recent work in extraction has often used sequence models, such as HMMs and
linear-chain CRFs, which model dependencies only between neighboring labels, on
the assumption that those dependencies are the strongest.
But sometimes it is important to model certain kinds of long-range dependencies
between entities. One important kind of dependency within information extraction
occurs on repeated mentions of the same field. When the same entity is mentioned
more than once in a document, such as "Robert Booth", in many cases all mentions
have the same label, such as Seminar-Speaker. We can take advantage of this
fact by favoring labelings that treat repeated words identically, and by combining
features from all occurrences so that the extraction decision can be made based on
global information. Furthermore, identifying all mentions of an entity can be useful
in itself, because each mention might contain different useful information. However,
most extraction systems, whether probabilistic or not, do not take advantage of
this dependency, instead treating the separate mentions independently.

Figure 4.5 Graphical representation of a skip-chain CRF. Identical words are
connected because they are likely to have the same label.

To perform collective labeling, we need to represent dependencies between distant
terms in the input. But this reveals a general limitation of sequence models,
whether generatively or discriminatively trained. Sequence models make a Markov
assumption among labels, that is, that any label $y_t$ is independent of all previous
labels given its immediate predecessors $y_{t-k} \ldots y_{t-1}$. This represents dependence
only between nearby nodes (for example, between bigrams and trigrams) and
cannot represent the higher-order dependencies that arise when identical words
occur throughout a document.
To relax this assumption, we introduce the skip-chain CRF, a conditional model
that collectively segments a document into mentions and classifies the mentions by
entity type, while taking into account probabilistic dependencies between distant
mentions. These dependencies are represented in a skip-chain model by augmenting
a linear-chain CRF with factors that depend on the labels of distant but similar
words. This is shown graphically in figure 4.5.
Even though the limitations of n-gram models have been widely recognized within
NLP, long-distance dependencies are difficult to represent in generative models,
because full n-gram models have too many parameters if n is large. We avoid this
problem by selecting which skip edges to include based on the input string. This kind
of input-specific dependence is difficult to represent in a generative model, because
it makes generating the input more complicated. In other words, conditional models
have been popular because of their flexibility in allowing overlapping features;
skip-chain CRFs take advantage of their flexibility in allowing input-specific model
structure.
4.5.1 Model

The skip-chain CRF is essentially a linear-chain CRF with additional long-distance
edges between similar words. We call these additional edges skip edges. The features
on skip edges can incorporate information from the context of both endpoints, so
that strong evidence at one endpoint can influence the label at the other endpoint.

Table 4.1 Input features $q_k(x, t)$ for the seminars data. Here $w_t$ is the word
at position t, $T_t$ is the part-of-speech tag at position t, w ranges over all words in the
training data, and T ranges over all part-of-speech tags returned by the Brill tagger.
The "appears to be" features are based on hand-designed regular expressions that
can span several tokens.

$w_t = w$
$w_t$ matches [A-Z][a-z]+
$w_t$ matches [A-Z][A-Z]+
$w_t$ matches [A-Z]
$w_t$ matches [A-Z]+
$w_t$ matches [A-Z]+[a-z]+[A-Z]+[a-z]
$w_t$ appears in list of first names, last names, honorifics, etc.
$w_t$ appears to be part of a time followed by a dash
$w_t$ appears to be part of a time preceded by a dash
$w_t$ appears to be part of a date
$T_t = T$
$q_k(x, t + \delta)$ for all k and $\delta \in [-4, 4]$

When applying the skip-chain model, we must choose which skip edges to include.
The simplest choice is to connect all pairs of identical words, but more generally we
can connect any pair of words that we believe to be similar, for example, pairs of
words that belong to the same stem class, or have small edit distance. In addition,
we must be careful not to include too many skip edges, because this could result
in a graph that makes approximate inference difficult. So we need to use similarity
metrics that result in a sufficiently sparse graph. In the experiments below, we focus
on named-entity recognition, so we connect pairs of identical capitalized words.
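The edge-selection rule described above (connect pairs of identical capitalized words) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: "capitalized" is taken to mean that the first character is an uppercase letter, the function name is hypothetical, and the token sequence is made up from the chapter's running example.

```python
from collections import defaultdict
from itertools import combinations

def skip_edges(tokens):
    """Return sorted position pairs (u, v), u < v, whose tokens are
    identical and capitalized (first character uppercase, by assumption)."""
    positions = defaultdict(list)
    for t, word in enumerate(tokens):
        if word[:1].isupper():
            positions[word].append(t)
    # Every pair of occurrences of the same word receives a skip edge.
    return sorted(pair
                  for occurrences in positions.values()
                  for pair in combinations(occurrences, 2))

tokens = ("Speaker : Robert Booth . Robert Booth is manager "
          "of control engineering").split()
edges = skip_edges(tokens)
```

A word that occurs n times contributes n(n-1)/2 edges under this rule, which is one reason the choice of similarity criterion matters for keeping the graph sparse.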
Formally, the skip-chain CRF is defined as a general CRF with two clique
templates: one for the linear-chain portion, and one for the skip edges. For a sentence
x, let $\mathcal{I} = \{(u, v)\}$ be the set of all pairs of sequence positions for which there are
skip edges. For example, in the experiments reported here, $\mathcal{I}$ is the set of indices of
all pairs of identical capitalized words. Then the probability of a label sequence y
given an input x is modeled as

$$ p_\theta(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x) \prod_{(u,v) \in \mathcal{I}} \Psi_{uv}(y_u, y_v, x), \tag{4.57} $$

where $\Psi_t$ are the factors for linear-chain edges, and $\Psi_{uv}$ are the factors over skip
edges. These factors are defined as

$$ \Psi_t(y_t, y_{t-1}, x) = \exp\left\{ \sum_k \lambda_{1k} f_{1k}(y_t, y_{t-1}, x, t) \right\} \tag{4.58} $$

$$ \Psi_{uv}(y_u, y_v, x) = \exp\left\{ \sum_k \lambda_{2k} f_{2k}(y_u, y_v, x, u, v) \right\}, \tag{4.59} $$

where $\theta_1 = \{\lambda_{1k}\}_{k=1}^{K_1}$ are the parameters of the linear-chain template, and $\theta_2 = \{\lambda_{2k}\}_{k=1}^{K_2}$ are the parameters of the skip template. The full set of model parameters
is $\theta = \{\theta_1, \theta_2\}$.
As described in section 4.4.6, both the linear-chain features and skip-chain
features are factorized into indicator functions of the outputs and observation
functions, as in (4.54). In general the observation functions $q_k(x, t)$ can depend
on arbitrary positions of the input string. For example, a useful feature for NER is
$q_k(x, t) = 1$ if and only if $x_{t+1}$ is a capitalized word.
The observation functions for the skip edges are chosen to combine the observations from each endpoint. Formally, we define the feature functions for the skip
edges to factorize as

\[
f_k(y_u, y_v, x, u, v) = 1_{\{y_u = \tilde{y}_u\}} \, 1_{\{y_v = \tilde{y}_v\}} \, q_k(x, u, v). \tag{4.60}
\]


This choice allows the observation functions $q_k(x, u, v)$ to combine information from
the neighborhood of $y_u$ and $y_v$. For example, one useful feature is $q_k(x, u, v) = 1$
if and only if $x_u = x_v =$ "Booth" and $x_{v-1} =$ "Speaker:". This can be a useful
feature if the context around $x_u$, such as "Robert Booth is manager of control
engineering...", may not make clear whether or not Robert Booth is presenting a
talk, but the context around $x_v$ is clear, such as "Speaker: Robert Booth."¹
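The factorization in (4.60) can be sketched as a small feature factory: an observation function over the two endpoints is multiplied by indicators on the two labels. The label name "SPEAKER" and both helper functions are hypothetical, introduced only for this example.

```python
def q_booth_speaker(x, u, v):
    """Observation function from the text: x_u = x_v = "Booth" and x_{v-1} = "Speaker:"."""
    return int(x[u] == x[v] == "Booth" and v >= 1 and x[v - 1] == "Speaker:")

def make_skip_feature(label_u, label_v, q):
    """Build f(y_u, y_v, x, u, v) = 1{y_u = label_u} * 1{y_v = label_v} * q(x, u, v)."""
    def f(y_u, y_v, x, u, v):
        return int(y_u == label_u) * int(y_v == label_v) * q(x, u, v)
    return f
```

With a positive weight on the feature built from ("SPEAKER", "SPEAKER"), clear evidence at position v raises the score of labeling the ambiguous position u as a speaker as well.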
Because the loops in a skip-chain CRF can be long and overlapping, exact
inference is intractable for the data we consider. The running time required by exact
inference is exponential in the size of the largest clique in the graph's junction tree.
In junction trees created from the seminars data, 29 of the 485 instances have a
maximum clique size of 10 or greater, and 11 have a maximum clique size of 14
or greater. (The worst instance has a clique with 61 nodes.) These cliques are far
too large to perform inference exactly. For reference, representing a single factor
that depends on 14 variables requires more memory than can be addressed in a
32-bit architecture. Instead, we perform approximate inference using loopy belief
propagation, which was mentioned in section 4.4.4. We use an asynchronous tree-based schedule known as tree-based reparameterization (TRP) [59].
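The memory claim can be checked with a short calculation. The number of labels per variable is not stated in this passage, so the value of 5 below (four fields plus an "other" label) is an assumption of this sketch.

```python
NUM_LABELS = 5     # assumption: four fields plus an "other" label
CLIQUE_SIZE = 14   # clique size quoted in the text

# A factor over 14 variables stores one value per joint assignment.
table_entries = NUM_LABELS ** CLIQUE_SIZE
addressable_bytes = 2 ** 32  # size of a 32-bit address space

print(table_entries)                      # 6103515625
print(table_entries > addressable_bytes)  # True
```

Under this assumption the table exceeds the address space even at one byte per entry, and a 61-node clique is out of reach by many orders of magnitude.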
4.5.2

Results

We evaluate skip-chain CRFs on a collection of 485 email messages announcing
seminars at Carnegie Mellon University. The messages are annotated with the
seminar's starting time, ending time, location, and speaker. This data set is due to
Freitag [15], and has been used in much previous work.
Often the fields are listed multiple times in the message. For example, the speaker
name might be included both near the beginning and later on, in a sentence like "If
you would like to meet with Professor Smith..." As mentioned earlier, it can be

1. This example is taken from an actual error made by a linear-chain CRF on the seminars
data set. We present results from this data set in section 4.5.2.



Table 4.2 Comparison of F1 performance on the seminars data. The top line gives
a dynamic Bayes net that has been previously used on this data set. The skip-chain
CRF beats the previous systems in overall F1 and on the speaker field, which has
proved to be the hardest field of the four. Overall F1 is the average of the F1 scores
for the four fields.

System                          stime   etime   location   speaker   overall
BIEN Peshkin and Pfeffer [39]   96.0    98.8    87.1       76.9      89.7
Linear-chain CRF                97.5    97.5    88.3       77.3      90.2
Skip-chain CRF                  96.7    97.2    88.1       80.4      90.6

Table 4.3 Number of inconsistently mislabeled tokens, that is, tokens that are
mislabeled even though the same token is labeled correctly elsewhere in the document. Learning long-distance dependencies reduces this kind of error in the speaker
and location fields. Numbers are averaged over five folds.

Field      Linear-chain   Skip-chain
stime      12.6           17
etime      3.2            5.2
location   6.4            0.6
speaker    30.2           4.8

useful to find both such mentions, because different information can occur in the
surrounding context of each mention: for example, the first mention might be near
an institutional affiliation, while the second mentions that Smith is a professor.
We evaluate a skip-chain CRF with skip edges between identical capitalized
words. The motivation for this is that the hardest aspect of this data set is
identifying speakers and locations, and capitalized words that occur multiple times
in a seminar announcement are likely to be either speakers or locations.
Table 4.1 shows the list of input features we used. For a skip edge $(u, v)$, the
input features we used were the disjunction of the input features at $u$ and $v$, that
is,

\[
q_k(x, u, v) = q_k(x, u) \oplus q_k(x, v), \tag{4.61}
\]

where $\oplus$ is binary or. All of our results are averaged over five-fold cross-validation
with an 80/20 split of the data. We report results from both a linear-chain CRF
and a skip-chain CRF with the same set of input features.
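The disjunction in (4.61) amounts to lifting any single-position observation function to a pair of positions. A minimal sketch, assuming observation functions return 0 or 1 and using a hypothetical helper name:

```python
def pairwise_from_unary(q):
    """Lift q(x, t) to a skip-edge observation q(x, u, v) = q(x, u) or q(x, v)."""
    return lambda x, u, v: q(x, u) | q(x, v)

def next_is_capitalized(x, t):
    """Single-position feature from the text: x_{t+1} is a capitalized word."""
    return int(t + 1 < len(x) and x[t + 1][:1].isupper())
```

Because the lifted feature fires when either endpoint shows the evidence, an informative context at one mention becomes visible to the factor that also touches the other mention.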
We calculate precision and recall as²

2. Previous work on this data set has traditionally measured precision and recall per
document, that is, from each document the system extracts only one field of each type.
Because the goal of the skip-chain CRF is to extract all mentions in a document, these


\[
P = \frac{\#\ \text{tokens extracted correctly}}{\#\ \text{tokens extracted}},
\qquad
R = \frac{\#\ \text{tokens extracted correctly}}{\#\ \text{true tokens of field}}.
\]

As usual, we report $F_1 = (2PR)/(P + R)$.
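These per-token definitions translate directly into code. The sketch below computes them for one field from parallel lists of gold and predicted labels; the function name and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def token_prf(gold, pred, field):
    """Token-level precision, recall, and F1 for a single field label."""
    extracted = sum(p == field for p in pred)      # tokens extracted
    true_tokens = sum(g == field for g in gold)    # true tokens of field
    correct = sum(g == p == field for g, p in zip(gold, pred))
    precision = correct / extracted if extracted else 0.0
    recall = correct / true_tokens if true_tokens else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Counting tokens rather than documents rewards a system for finding every mention of a field, which is the behavior the skip edges are meant to encourage.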


Table 4.2 compares a skip-chain CRF to a linear-chain CRF and to a dynamic
Bayes net used in previous work [39]. The skip-chain CRF performs much better
than all the other systems on the Speaker field, which is the field for which the
skip edges would be expected to make the most difference. On the other fields,
however, the skip-chain CRF does slightly worse than the linear-chain CRF (less than 1% absolute F1).
We expected that the skip-chain CRF would do especially well on the speaker
field, because speaker names tend to appear multiple times in a document, and a
skip-chain CRF can learn to label the multiple occurrences consistently. To test
this hypothesis, we measure the number of inconsistently mislabeled tokens, that
is, tokens that are mislabeled even though the same token is classified correctly
elsewhere in the document. Table 4.3 compares the number of inconsistently mislabeled tokens in the test set between linear-chain and skip-chain CRFs. For the
linear-chain CRF, on average 30.2 true speaker tokens are inconsistently mislabeled.
Because the linear-chain CRF mislabels 121.6 true speaker tokens, this situation
accounts for 24.7% of the missed speaker tokens.
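One way to count inconsistently mislabeled tokens for a single document is sketched below. It assumes that "the same token" means an identical word string and that "labeled correctly elsewhere" means some other occurrence received the correct field label; the function name is hypothetical.

```python
def inconsistently_mislabeled(tokens, gold, pred, field):
    """Count true `field` tokens that are mislabeled although the same word
    is correctly labeled as `field` somewhere else in the document."""
    correct_words = {w for w, g, p in zip(tokens, gold, pred) if g == p == field}
    return sum(1 for w, g, p in zip(tokens, gold, pred)
               if g == field and p != field and w in correct_words)
```

A word that is missed at every occurrence does not count, so the measure isolates exactly the errors that consistent labeling across mentions could repair.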
The skip-chain CRF shows a dramatic decrease in inconsistently mislabeled
tokens on the speaker field, from 30.2 tokens to 4.8. Consequently, the skip-chain
CRF also has much better recall on speaker tokens than the linear-chain CRF (70.0
R linear chain, 76.8 R skip chain). This explains the increase in F1 from linear-chain to skip-chain CRFs, because the two have similar precision (86.5 P linear
chain, 85.1 P skip chain). These results support the original hypothesis that treating
repeated tokens consistently especially benefits recall on the Speaker field.
On the Location field, on the other hand, where we might also expect skip-chain CRFs to perform better, there is no benefit. We explain this by observing
in table 4.3 that inconsistent misclassification occurs much less frequently in this
field.
4.5.3

Related Work

Recently, Bunescu and Mooney [5] have used a relational Markov network to
collectively classify the mentions in a document, achieving increased accuracy by
learning dependencies between similar mentions. In their work, however, candidate
phrases are extracted heuristically, which can introduce errors if a true entity is

metrics are inappropriate, so we cannot compare with this previous work. Peshkin and
Pfeffer [39] do use the per-token metric (personal communication), so our comparison is
fair in that respect.


not selected as a candidate phrase. Our model performs collective segmentation
and labeling simultaneously, so that the system can take into account dependencies
between the two tasks. The skip-chain CRF itself has also been presented elsewhere
[52].
As an extension to our work, Finkel et al. [14] augment the skip-chain model with
richer kinds of long-distance factors than just over pairs of words. These factors
are useful for modeling exceptions to the assumption that similar words tend to
have similar labels. For example, in named-entity recognition, the word "China" is
a place name when it appears alone, but when it occurs within the phrase "The
China Daily," it should be labeled as an organization. Because this model is more
complex than the original skip-chain model, Finkel et al. estimate its parameters in
two stages, first training the linear-chain component as a separate CRF, and then
heuristically selecting parameters for the long-distance factors. Finkel et al. report
improved results both on the seminars data set that we consider in this chapter,
and on several other standard information extraction data sets.
Finally, the skip-chain CRF can also be viewed as performing extraction while
taking into account a simple form of coreference information, since the reason that
identical words are likely to have similar tags is that they are likely to be coreferent.
Thus, this model is a step toward joint probabilistic models for extraction and data
mining as advocated by McCallum and Jensen [29]. An example of such a joint
model is the one of Wellner et al. [62], which jointly segments citations in research
papers and predicts which citations refer to the same paper.

4.6

Conclusion
CRFs are a natural choice for many relational problems because they allow both
graphically representing dependencies between entities, and including rich observed
features of entities. In this chapter, we have presented a tutorial on CRFs, covering
both linear-chain models and general graphical structures. Also, as a case study in
CRFs for collective classification, we have presented the skip-chain CRF, a type of
general CRF that performs joint segmentation and collective labeling on a practical
language understanding task.
The main disadvantage of CRFs is the computational expense of training. Although CRF training is feasible for many real-world problems, the need to perform
inference repeatedly during training becomes a computational burden when there
are a large number of training instances, when the graphical structure is complex,
when there are latent variables, or when the output variables have many outcomes.
One focus of current research [1, 53, 60] is on more efficient parameter estimation
techniques.


Acknowledgments
We thank Tom Minka and Jerod Weinman for helpful conversations, and we thank
Francine Chen and Benson Limketkai for useful comments. This work was supported
in part by the Center for Intelligent Information Retrieval; in part by the Defense
Advanced Research Projects Agency (DARPA), the Department of the Interior,
NBC, Acquisition Services Division, under contract number NBCHD030010; and
in part by the Central Intelligence Agency, the National Security Agency, and the
National Science Foundation under NSF grants #IIS-0427594 and #IIS-0326249.
Any opinions, findings and conclusions or recommendations expressed in this
material are the authors' and do not necessarily reflect those of the sponsors.

References
[1] P. Abbeel, D. Koller, and A. Y. Ng. Learning factor graphs in polynomial time
and sample complexity. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, 2005.
[2] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector
machines. In Proceedings of the International Conference on Machine Learning,
2003.
[3] D. Bertsekas. Nonlinear Programming. Athena Scientific, Nashua, NH, 2nd
edition, 1999.
[4] J. Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields.
Biometrika, 64(3):616-618, 1977.
[5] R. Bunescu and R. J. Mooney. Collective information extraction with relational
Markov networks. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 2004.
[6] R. Byrd, J. Nocedal, and R. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming,
63(2):129-156, 1994. ISSN 0025-5610.
[7] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised
learning algorithms using different performance metrics. Technical Report
TR2005-1973, Cornell University, Ithaca, NY, 2005.
[8] Y. Choi, C. Cardie, E. Riloff, and S. Patwardhan. Identifying sources of
opinions with conditional random fields and extraction patterns. In Proceedings
of Human Language Technology Conference and North American Chapter of
the Association for Computational Linguistics, 2005.
[9] S. Clark and J. Curran. Parsing the WSJ using CCG and log-linear models.
In Proceedings of the Annual Meeting of the Association for Computational
Linguistics, 2004.


[10] M. Collins. Discriminative training methods for hidden Markov models:
Theory and experiments with perceptron algorithms. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing, 2002.
[11] P. Cowans and M. Szummer. A graphical model for simultaneous partitioning
and labeling. In Tenth International Workshop on Artificial Intelligence and
Statistics, 2005.
[12] A. Culotta and A. McCallum. Confidence estimation for information extraction. In Proceedings of Human Language Technology Conference and North
American Chapter of the Association for Computational Linguistics, 2004.
[13] A. Culotta, R. Bekkerman, and A. McCallum. Extracting social networks and
contact information from email and the web. In Proceedings of the Conference
on Email and Anti-Spam, 2004.
[14] J. Finkel, T. Grenager, and C. Manning. Incorporating non-local information
into information extraction systems by Gibbs sampling. In Proceedings of the
Annual Meeting of the Association for Computational Linguistics, 2005.
[15] D. Freitag. Machine Learning for Information Extraction in Informal Domains. PhD thesis, Carnegie Mellon University, Pittsburgh, 1998.
[16] N. Ghamrawi and A. McCallum. Collective multi-label classification. In
Proceedings of the Conference on Information and Knowledge Management,
2005.
[17] J. Goodman. Exponential priors for maximum entropy models. In Proceedings
of Human Language Technology Conference and North American Chapter of
the Association for Computational Linguistics, 2004.
[18] X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional
random fields for image labelling. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 2004.
[19] G. Hinton. Training products of experts by minimizing contrastive divergence.
Technical Report 2000-004, Gatsby Computational Neuroscience Unit, London,
2000.
[20] D. Klein and C. Manning. Conditional structure versus conditional estimation
in NLP models. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, 2002.
[21] F. Kschischang, B. Frey, and H. Loeliger. Factor graphs and the sum-product
algorithm. IEEE Transactions on Information Theory, 47(2):498-519, 2001.
[22] T. Kudo, K. Yamamoto, and Y. Matsumoto. Applying conditional random
fields to Japanese morphological analysis. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing, 2004.
[23] S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependencies in natural images. In Proceedings of Neural Information Processing
Systems, 2003.


[24] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the
International Conference on Machine Learning, 2001.
[25] Y. Liu, J. Carbonell, P. Weigele, and V. Gopalakrishnan. Segmentation conditional random fields (SCRFs): A new approach for protein fold recognition.
In Proceedings of the ACM International Conference on Research in Computational Molecular Biology, 2005.
[26] R. Malouf. A comparison of algorithms for maximum entropy parameter
estimation. In Proceedings of the Conference on Natural Language Learning,
2002.
[27] D. McAllester, M. Collins, and F. Pereira. Case-factor diagrams for structured
probabilistic modeling. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, 2004.
[28] A. McCallum. Efficiently inducing features of conditional random fields. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2003.
[29] A. McCallum and D. Jensen. A note on the unification of information
extraction and data mining using conditional-probability, relational models.
In IJCAI'03 Workshop on Learning Statistical Models from Relational Data,
2003.
[30] A. McCallum and W. Li. Early results for named entity recognition with
conditional random fields, feature induction and web-enhanced lexicons. In
Proceedings of the Conference on Natural Language Learning, 2003.
[31] A. McCallum and B. Wellner. Conditional models of identity uncertainty
with application to noun coreference. In Proceedings of Neural Information
Processing Systems, 2005.
[32] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models
for information extraction and segmentation. In Proceedings of the International Conference on Machine Learning, 2000.
[33] A. McCallum, K. Bellare, and F. Pereira. A conditional random field for
discriminatively-trained finite-state string edit distance. In Proceedings of the
Conference on Uncertainty in Artificial Intelligence, 2005.
[34] T. Minka. Discriminative models, not discriminative training. Technical Report MSR-TR-2005-144, Microsoft Research, October 2005.
ftp://ftp.research.microsoft.com/pub/tr/TR-2005-144.pdf.
[35] T. P. Minka. A comparison of numerical optimizers for logistic regression.
Technical report, Dept. of Statistics, Carnegie Mellon University, Pittsburgh,
2003.
[36] A. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Proceedings of Neural Information
Processing Systems, 2002.


[37] F. Peng and A. McCallum. Accurate information extraction from research
papers using conditional random fields. In Proceedings of Human Language
Technology Conference and North American Chapter of the Association for
Computational Linguistics, 2004.
[38] F. Peng, F. Feng, and A. McCallum. Chinese segmentation and new word
detection using conditional random fields. In Proceedings of the International
Conference on Computational Linguistics, 2004.
[39] L. Peshkin and A. Pfeffer. Bayesian information extraction network. In
Proceedings of the International Joint Conference on Artificial Intelligence,
2003.
[40] Y. Qi, M. Szummer, and T. Minka. Diagram structure recognition by Bayesian
conditional random fields. In Proceedings of the International Conference on
Computer Vision and Pattern Recognition, 2005.
[41] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object
recognition. In Proceedings of Neural Information Processing Systems, 2005.
[42] L. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[43] M. Richardson and P. Domingos. Markov logic networks. Machine Learning,
62(1-2):107-136, 2006.
[44] S. Riezler, T. King, R. Kaplan, R. Crouch, J. Maxwell, and M. Johnson.
Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proceedings of the Annual Meeting of the
Association for Computational Linguistics, 2002.
[45] D. Roth and W. Yih. Integer linear programming inference for conditional
random fields. In Proceedings of the International Conference on Machine
Learning, 2005.
[46] S. Sarawagi and W. Cohen. Semi-Markov conditional random fields for information extraction. In Proceedings of Neural Information Processing Systems,
2005.
[47] K. Sato and Y. Sakakibara. RNA secondary structural alignment with
conditional random fields. Bioinformatics, 21:237-242, 2005.
[48] B. Settles. Abner: An open source tool for automatically tagging genes,
proteins, and other entity names in text. Bioinformatics, 21(14):3191-3192,
2005.
[49] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In
Proceedings of Human Language Technology Conference and North American
Chapter of the Association for Computational Linguistics, 2003.
[50] P. Singla and P. Domingos. Discriminative training of Markov logic networks.
In Proceedings of the National Conference on Artificial Intelligence, 2005.
[51] C. Sutton. Conditional probabilistic context-free grammars. Master's thesis,
University of Massachusetts, Amherst, 2004. URL publications/cscfg.pdf.


http://www.cs.umass.edu/ casutton/publications.html.
[52] C. Sutton and A. McCallum. Collective segmentation and labeling of distant
entities in information extraction. Technical Report TR # 04-49, University of
Massachusetts, 2004. Presented at ICML Workshop on Statistical Relational
Learning and Its Connections to Other Fields.
[53] C. Sutton and A. McCallum. Piecewise training of undirected models. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2005.
[54] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic conditional random
fields: Factorized probabilistic models for labeling and segmenting sequence
data. In Proceedings of the International Conference on Machine Learning,
2004.
[55] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2002.
[56] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In
Proceedings of Neural Information Processing Systems, 2004.
[57] P. Viola and M. Narasimhan. Learning to extract information from semistructured text using a discriminative context free grammar. In Proceedings of
the ACM International Conference on Information Retrieval, 2005.
[58] S.V.N. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic meta-descent. In
Proceedings of the International Conference on Machine Learning, 2006.
[59] M. Wainwright, T. Jaakkola, and A. Willsky. Tree-based reparameterization
for approximate estimation on graphs with cycles. In Proceedings of Neural
Information Processing Systems, 2001.
[60] M. Wainwright, T. Jaakkola, and A. Willsky. Tree-reweighted belief propagation and approximate ML estimation by pseudo-moment matching. In Ninth
Workshop on Artificial Intelligence and Statistics, 2003.
[61] H. Wallach. Efficient training of conditional random fields. MSc thesis,
University of Edinburgh, 2002.
[62] B. Wellner, A. McCallum, F. Peng, and M. Hay. An integrated, conditional
model of information extraction and coreference with application to citation
graph construction. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2004.
[63] J. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. Technical Report
TR2004-040, Mitsubishi Electric Research Laboratories, Cambridge, MA, 2004.

5 Probabilistic Relational Models

Lise Getoor, Nir Friedman, Daphne Koller, Avi Pfeffer, and Ben Taskar

Probabilistic relational models (PRMs) are a rich representation language for structured statistical models. They combine a frame-based logical representation with
probabilistic semantics based on directed graphical models (Bayesian networks).
This chapter gives an introduction to probabilistic relational models, describing semantics for attribute uncertainty, structural uncertainty, and class uncertainty. For
each case, learning algorithms and some sample results are presented.

5.1

Introduction
Over the last decade, Bayesian networks have been used with great success in
a wide variety of real-world and research applications. However, despite their
success, Bayesian networks are often inadequate for representing large and complex
domains. A Bayesian network for a given domain involves a prespecified set of
random variables, whose relationship to each other is fixed in advance. Hence, a
Bayesian network cannot be used to deal with domains where we might encounter a
varying number of entities in a variety of configurations. This limitation of Bayesian
networks is a direct consequence of the fact that they lack the concept of an object
(or domain entity). Hence, they cannot represent general principles about multiple
similar objects which can then be applied in multiple contexts.
Probabilistic relational models (PRMs) [13, 18] extend Bayesian networks with
the concepts of objects, their properties, and relations between them. In a way,
they are to Bayesian networks as relational logic is to propositional logic. A PRM
specifies a template for a probability distribution over a database. The template
includes a relational component that describes the relational schema for our domain,
and a probabilistic component that describes the probabilistic dependencies that
hold in our domain. A PRM has a coherent formal semantics in terms of probability
distributions over sets of relational logic interpretations. Given a set of ground
objects, a PRM specifies a probability distribution over a set of interpretations
involving these objects (and perhaps other objects as well). A PRM, together with


a particular database of objects and relations, defines a probability distribution
over the attributes of the objects.
In this chapter, we describe the semantics for PRMs with different types of
uncertainty, and at the same time we describe the basic learning algorithms for
PRMs. We propose an algorithm for automatically constructing or learning a PRM
from an existing database. The learned PRM describes the patterns of interactions
between attributes. In the learning problem, our input contains a relational schema
that specifies the basic vocabulary in the domain: the set of classes, the attributes
associated with the different classes, and the possible types of relations between
objects in the different classes. The training data consists of a fully specified instance
of that schema in the form of a relational database. Once we have learned a PRM,
it serves as a tool for exploratory data analysis and can be used to make predictions
and complex inferences in new situations. For additional details, including proofs
of all of the theorems, see [9].

5.2

PRM Representation
The two components of PRM syntax are a logical description of the domain
of discourse and a probabilistic graphical model template which describes the
probabilistic dependencies in the domain. Here we describe the logical description
of the domain as a relational schema, although it can be transformed into either a
frame-based representation or a logic-based syntax in a relatively straightforward
manner. Our probabilistic graphical component is depicted pictorially, although
it can also be represented in a logical formalism; for example, in the probabilistic
relational language of [10]. We begin by describing the syntax and semantics for
PRMs which have the simplest form of uncertainty, attribute uncertainty, and then
move on to describing various forms of structural uncertainty.
5.2.1

Relational Language

The relational language allows us to describe the kinds of objects in our domain.
For example, figure 5.1(a) shows the schema for a simple domain that we will be
using as our running example. The domain is that of a university, and contains
professors, students, courses, and course registrations. The classes in the schema
are Professor, Student, Course, and Registration.
More formally, a schema for a relational model describes a set of classes, $\mathcal{X} =
\{X_1, \ldots, X_n\}$. Each class is associated with a set of descriptive attributes. For
example, professors may have descriptive attributes such as popularity and teaching
ability; courses may have descriptive attributes such as rating and difficulty.
The set of descriptive attributes of a class $X$ is denoted $\mathcal{A}(X)$. Attribute $A$ of
class $X$ is denoted $X.A$, and its space of values is denoted $\mathcal{V}(X.A)$. We assume
here that value spaces are finite. For example, the Student class has the descriptive


[Figure 5.1, panel (a): classes Professor (Popularity, Teaching-Ability), Student (Intelligence, Ranking), Course (Instructor, Rating, Difficulty), and Registration (Course, Student, Grade, Satisfaction). Panel (b): an instance with Prof. Gump, students John Doe and Jane Doe, courses Phil142 and Phil101, and three Registration objects.]

Figure 5.1 (a) A relational schema for a simple university domain. The underlined
attributes are reference slots of the class and the dashed lines indicate the types
of objects referenced. (b) An example instance of this schema. Here we do not
show the values of the reference slots; we simply use dashed lines to indicate the
relationships that hold between objects.

attributes Intelligence and Ranking. The value space for Student.Intelligence in this
example is {high, low}.
In addition, we need a method for allowing an object to refer to another object.
For example, we may want a course to have a reference to the instructor of the
course. And a registration record should refer both to the associated course and to
the student taking the course.
The simplest way of achieving this effect is using reference slots. Specifically, each
class is associated with a set of reference slots. The set of reference slots of a class $X$
is denoted $\mathcal{R}(X)$. We use $X.\rho$ to denote the reference slot $\rho$ of $X$. Each reference slot
is typed, i.e., the schema specifies the range type of object that may be referenced.
More formally, for each $\rho$ in $X$, the domain type Dom[$\rho$] is $X$ and the range type
Range[$\rho$] is $Y$ for some class $Y$ in $\mathcal{X}$. For example, the class Course has reference
slot Instructor with range type Professor, and class Registration has reference slots
Course and Student. In figure 5.1(a) the reference slots are underlined.
There is a direct mapping between our representation and that of relational
databases. Each class corresponds to a single table and each attribute corresponds
to a column. Our descriptive attributes correspond to standard attributes in the
table, and our reference slots correspond to attributes that are foreign keys (key
attributes of another table).
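This correspondence can be sketched with an in-memory SQLite database. The column names, key choices, and the attribute values inserted for Phil101 are assumptions made for illustration; only the class names, attribute names, and reference slots come from the schema in figure 5.1(a).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Professor (name TEXT PRIMARY KEY, popularity TEXT, teaching_ability TEXT);
CREATE TABLE Student (name TEXT PRIMARY KEY, intelligence TEXT, ranking TEXT);
CREATE TABLE Course (
    name TEXT PRIMARY KEY, rating TEXT, difficulty TEXT,
    instructor TEXT REFERENCES Professor(name)   -- reference slot Instructor
);
CREATE TABLE Registration (
    id INTEGER PRIMARY KEY, grade TEXT, satisfaction INTEGER,
    course TEXT REFERENCES Course(name),         -- reference slot Course
    student TEXT REFERENCES Student(name)        -- reference slot Student
);
INSERT INTO Professor VALUES ('Prof. Gump', 'high', 'medium');
INSERT INTO Course VALUES ('Phil101', 'high', 'low', 'Prof. Gump');  -- illustrative values
""")

# Following the Instructor slot corresponds to a join on the foreign key.
row = conn.execute(
    "SELECT p.name, p.popularity FROM Course c "
    "JOIN Professor p ON c.instructor = p.name WHERE c.name = 'Phil101'"
).fetchone()
```

Descriptive attributes become ordinary columns, while each reference slot becomes a foreign-key column whose target table is the slot's range type.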
For each reference slot $\rho$, we can define an inverse slot $\rho^{-1}$, which is interpreted
as the inverse function of $\rho$. For example, we can define an inverse slot for the
Student slot of Registration and call it Registered-In. Note that this is not a one-to-one relation, but returns a set of Registration objects. More formally, if Dom[$\rho$]
is $X$ and Range[$\rho$] is $Y$, then Dom[$\rho^{-1}$] is $Y$ and Range[$\rho^{-1}$] is $X$.
Finally, we define the notion of a slot chain, which allows us to compose slots,
defining functions from objects to other objects to which they are indirectly related. More precisely, we define a slot chain $\rho_1, \ldots, \rho_k$ to be a sequence of slots


(inverse or otherwise) such that for all $i$, Range[$\rho_i$] = Dom[$\rho_{i+1}$]. For example,
Student.Registered-In.Course.Instructor can be used to denote a student's instructors. Note that a slot chain describes a set of objects from a class.¹
The relational framework we have just described is motivated primarily by the
concepts of relational databases, although some of the notation is derived from
frame-based and object-oriented systems. However, the framework is a fully general
one, and is equivalent to the standard vocabulary and semantics of relational logic.
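Reference slots, inverse slots, and slot chains can be sketched with plain Python objects. The class Obj, the helper functions, and the registration names are hypothetical; the objects and relations follow the running example, in which Jane Doe is registered for Phil101 only and the other student for both courses.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # eq=False keeps identity-based hashing, so objects can live in sets
class Obj:
    class_name: str
    name: str
    slots: dict = field(default_factory=dict)  # reference slot name -> referenced object

def follow(objs, slot):
    """Apply a reference slot to every object in `objs`."""
    return {o.slots[slot] for o in objs}

def follow_inverse(objs, slot, candidates):
    """Apply the inverse slot: every x in `candidates` whose x.slot lies in `objs`."""
    return {x for x in candidates if x.slots.get(slot) in objs}

gump = Obj("Professor", "Prof. Gump")
phil101 = Obj("Course", "Phil101", {"Instructor": gump})
phil142 = Obj("Course", "Phil142", {"Instructor": gump})
jane, john = Obj("Student", "Jane Doe"), Obj("Student", "John Doe")
registrations = [  # hypothetical registration names
    Obj("Registration", "reg1", {"Course": phil101, "Student": jane}),
    Obj("Registration", "reg2", {"Course": phil101, "Student": john}),
    Obj("Registration", "reg3", {"Course": phil142, "Student": john}),
]

# Student.Registered-In.Course.Instructor, evaluated for John Doe.
registered_in = follow_inverse({john}, "Student", registrations)
instructors = follow(follow(registered_in, "Course"), "Instructor")
```

Because each step returns a set, the two courses taught by the same professor collapse to a single instructor, matching the set (rather than multi-set) reading of slot chains.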
5.2.2

Schema Instantiation

An instance I of a schema is simply a standard relational logic interpretation of
this vocabulary. It specifies: for each class X, the set of objects in the class, I(X);
a value for each attribute x.A (in the appropriate domain) for each object x; and
a value y for each reference slot x.ρ, which is an object in the appropriate range
type, i.e., y ∈ Range[ρ]. Conversely, y.ρ⁻¹ = {x | x.ρ = y}. We use 𝒜(x) as a
shorthand for 𝒜(X), where x is of class X. For each object x in the instance and
each of its attributes A, we use I_{x.A} to denote the value of x.A in I. For example,
figure 5.1(b) shows an instance of the schema from our running example. In this
(simple) instance there is one Professor, two Courses, three Registrations, and two
Students. The relations between them show that the professor is the instructor in
both courses, and that one student (Jane Doe) is registered for only one course
(Phil101), while the other student is registered for both courses.
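The instance just described, together with an inverse slot and a slot chain, can be sketched in a few lines of Python. This is only an illustration: the object names are taken from the figures of this chapter, while the registration identifiers, the name of the second student, and the helper functions `registered_in` and `instructors` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Professor:
    name: str


@dataclass(frozen=True)
class Course:
    name: str
    instructor: Professor  # reference slot Course.Instructor


@dataclass(frozen=True)
class Student:
    name: str


@dataclass(frozen=True)
class Registration:
    reg_id: int        # hypothetical key, for illustration only
    course: Course     # reference slot Registration.Course
    student: Student   # reference slot Registration.Student


# One professor, two courses, two students, three registrations.
gump = Professor("Prof. Gump")
phil101 = Course("Phil101", gump)
phil142 = Course("Phil142", gump)
jane = Student("Jane Doe")
john = Student("John Doe")  # assumed name of the second student
registrations = [
    Registration(1, phil101, jane),
    Registration(2, phil101, john),
    Registration(3, phil142, john),
]


def registered_in(student):
    """Inverse slot of Registration.Student: a set of Registration objects."""
    return {r for r in registrations if r.student == student}


def instructors(student):
    """Slot chain Student.Registered-In.Course.Instructor (a set, not a multiset)."""
    return {r.course.instructor for r in registered_in(student)}
```

Because the slot chain returns a set, `instructors(john)` contains Prof. Gump once even though he is reached through two registrations.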
5.2.3 Probabilistic Model

A PRM defines a probability distribution over a set of instances of a schema. Most
simply, we assume that the set of objects and the relations between them are fixed,
i.e., external to the probabilistic model. Then, the PRM defines only a probability
distribution over the attributes of the objects in the model. The relational skeleton
defines the possible instantiations that we consider; the PRM defines a distribution
over the possible worlds consistent with the relational skeleton.
Definition 5.1
A relational skeleton σ_r of a relational schema is a partial specification of an
instance of the schema. It specifies the set of objects σ_r(X_i) for each class and
the relations that hold between the objects. However, it leaves the values of the
attributes unspecified.
Figure 5.2(a) shows a relational skeleton for our running example. The relational
skeleton defines the random variables in our domain; we have a random variable for

1. It is also possible to define slot chains as multi-sets of objects; here we have found it
sufficient to make them sets of objects, but there may be domains where multi-sets are
desirable.


[Figure 5.2 graphic omitted. Panel (a) shows the objects Prof. Gump (Professor),
Phil101 and Phil142 (Course), John Doe and Jane Doe (Student), and the Registration
objects linking them. Panel (b) shows the attributes Teaching-Ability, Popularity,
Difficulty, Rating, Intelligence, Ranking, Grade, and Satisfaction; the label AVG
marks the two aggregate dependencies.]

Figure 5.2 (a) The relational skeleton for the university domain. (b) The PRM
dependency structure for our university example.

each attribute of each object in the skeleton. A PRM then specifies a probability
distribution over completions I of the skeleton.
A PRM consists of two components: the qualitative dependency structure, S,
and the parameters associated with it, θ_S. The dependency structure is defined by
associating with each attribute X.A a set of parents Pa(X.A). These correspond
to formal parents; they will be instantiated in different ways for different objects
in X. Intuitively, the parents are attributes that are direct influences on X.A. In
figure 5.2(b), the arrows define the dependency structure.
We distinguish between two types of formal parents. The attribute X.A can
depend on another probabilistic attribute B of X. This formal dependence induces
a corresponding dependency for individual objects: for any object x in σ_r(X), x.A
will depend probabilistically on x.B. For example, in figure 5.2(b), a professor's
Popularity depends on her Teaching-Ability. The attribute X.A can also depend
on attributes of related objects X.K.B, where K is a slot chain. In figure 5.2(b),
the grade of a student depends on Registration.Student.Intelligence and
Registration.Course.Difficulty. Or we can have a longer slot chain, for example, the
dependence of student satisfaction on Registration.Course.Instructor.Teaching-Ability.
In addition, we can have a dependence of student ranking on
Student.Registered-In.Grade. To understand the semantics of this formal dependence
for an individual object x, recall that x.K represents the set of objects that are
K-relatives of x. Except in cases where the slot chain is guaranteed to be
single-valued, we must specify the probabilistic dependence of x.A on the multiset
{y.B : y ∈ x.K}. For example, a student's rank depends on the grades in the courses
in which he or she is registered. However, each student may be enrolled in a different
number of courses, and we will need a method of compactly representing these complex
dependencies.
The notion of aggregation from database theory gives us an appropriate tool to
address this issue: x.A will depend probabilistically on some aggregate property of
this multiset. There are many natural and useful notions of aggregation of a set: its
mode (most frequently occurring value); its mean value (if values are numerical);


Figure 5.3 (a) The CPD for Registration.Grade. (b) The CPD for an aggregate
dependency of Student.Ranking on Student.Registered-In.Grade.

(a) P(Registration.Grade | D, I), with D = Course.Difficulty and I = Student.Intelligence:

D, I      A     B     C
h, h     0.5   0.4   0.1
h, l     0.1   0.5   0.4
l, h     0.8   0.1   0.1
l, l     0.3   0.6   0.1

(b) P(Student.Ranking | AVG of Student.Registered-In.Grade):

avg       l     m     h
A        0.1   0.2   0.7
B        0.2   0.4   0.4
C        0.6   0.3   0.1

its median, maximum, or minimum (if values are ordered); its cardinality; etc.
In the preceding example, we can have a student's ranking depend on her grade
point average (GPA), or the average grade in her courses (or in the case where the
grades are represented as letters, we may use median; in our example we blur the
distinction and assume that average is defined appropriately).
More formally, our language allows a notion of an aggregate γ; γ takes a multiset
of values of some ground type, and returns a summary of it. The type of the
aggregate can be the same as that of its arguments. However, we allow other types
as well, e.g., an aggregate that reports the size of the set. We allow X.A to have
as a parent γ(X.K.B); the semantics is that for any x ∈ X, x.A will depend on
the value of γ(x.K.B). In our example PRM, there are two aggregate dependencies
defined, one that specifies that the ranking of a student depends on the average of
her grades and one that specifies that the rating of a course depends on the average
satisfaction of students in the course.
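As a rough illustration of aggregates, the following Python sketch implements three of the summaries mentioned above over a multiset of letter grades. The function names and the ordering C < B < A in `GRADE_ORDER` are assumptions made for this example; the median is taken as the lower median so that the result is always one of the observed grade values.

```python
from collections import Counter
from statistics import median_low

# Assumed ordering of letter grades, from lowest to highest.
GRADE_ORDER = ["C", "B", "A"]


def agg_mode(values):
    """Most frequently occurring value of the multiset."""
    return Counter(values).most_common(1)[0][0]


def agg_median(values, order=GRADE_ORDER):
    """Lower median of an ordered (not necessarily numeric) ground type."""
    ranks = [order.index(v) for v in values]
    return order[median_low(ranks)]


def agg_cardinality(values):
    """Size of the multiset; note that the result type differs from the input type."""
    return len(values)


# x.K.B for one student: the multiset of grades over her registrations.
grades = ["A", "C", "A"]
summary = (agg_mode(grades), agg_median(grades), agg_cardinality(grades))
```

Whatever the number of registrations, each aggregate maps the multiset to a single value, so the CPD of the child attribute needs only one column per aggregate value.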
Given a set of parents Pa(X.A) for X.A, we can define a local probability model
for X.A. We associate X.A with a conditional probability distribution (CPD) that
specifies P(X.A | Pa(X.A)). We require that the CPDs are legal. Figure 5.3 shows
two CPDs. Let U be the set of parents of X.A, U = Pa(X.A). Each of these parents
U_i (whether a simple attribute in the same relation or an aggregate of a set of
K-relatives) has a set of values V(U_i) in some ground type. For each tuple of values
u ∈ V(U), we specify a distribution P(X.A | u) over V(X.A). This entire set of
parameters comprises θ_S.
Definition 5.2
A probabilistic relational model (PRM) Π for a relational schema R is defined as
follows. For each class X ∈ 𝒳 and each descriptive attribute A ∈ 𝒜(X), we have:
a set of parents Pa(X.A) = {U_1, ..., U_l}, where each U_i has the form X.B or
γ(X.K.B), where K is a slot chain and γ is an aggregate of X.K.B;
a legal conditional probability distribution (CPD), P(X.A | Pa(X.A)).
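One simple way to picture the two components of the definition is a table keyed by parent values. The sketch below stores the CPD of Registration.Grade with the values read from figure 5.3(a); the reading of the row labels as (Difficulty, Intelligence) and the interpretation of "legal" as "every row is a probability distribution" are assumptions made for this example.

```python
# Formal parents of Registration.Grade, written as slot chains from Registration.
GRADE_PARENTS = ("Course.Difficulty", "Student.Intelligence")

# P(Grade | Difficulty, Intelligence): one row per tuple of parent values.
GRADE_CPD = {
    ("h", "h"): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("h", "l"): {"A": 0.1, "B": 0.5, "C": 0.4},
    ("l", "h"): {"A": 0.8, "B": 0.1, "C": 0.1},
    ("l", "l"): {"A": 0.3, "B": 0.6, "C": 0.1},
}


def is_legal(cpd, tol=1e-9):
    """Check that every row is non-negative and sums to 1 (assumed meaning of 'legal')."""
    return all(
        all(p >= 0.0 for p in row.values()) and abs(sum(row.values()) - 1.0) <= tol
        for row in cpd.values()
    )


def prob(cpd, parent_values, value):
    """Look up P(X.A = value | Pa(X.A) = parent_values)."""
    return cpd[tuple(parent_values)][value]
```

The same table is used for every Registration object; only the parent values plugged into `prob` change from object to object.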


5.2.4 PRM Semantics

As mentioned in the introduction, PRMs define a distribution over possible worlds.
The possible worlds are instantiations of the database that are consistent with the
relational skeleton. Given any skeleton, we have a set of random variables of interest:
the attributes x.A of the objects in the skeleton. Formally, let σ_r(X) denote the
set of objects in skeleton σ_r whose class is X. The set of random variables for
σ_r is the set of attributes of the form x.A where x ∈ σ_r(X_i) and A ∈ 𝒜(X_i) for
some class X_i. The PRM specifies a probability distribution over the possible joint
assignments of values to all of these random variables.
For a given skeleton σ_r, the PRM structure induces a ground Bayesian network
over the random variables x.A.
Definition 5.3
A PRM Π together with a skeleton σ_r defines the following ground Bayesian
network:
There is a node for every attribute x.A of every object x ∈ σ_r(X).
Each x.A depends probabilistically on parents of the form x.B or x.K.B. If K
is not single-valued, then the parent is the aggregate γ(x.K.B), computed from the
set of random variables {y.B | y ∈ x.K}.
The CPD for x.A is P(X.A | Pa(X.A)).
As with Bayesian networks, the joint distribution over these assignments is
factored. That is, we take the product, over all x.A, of the probability in the CPD of
the specific value assigned by the instance to the attribute given the values assigned
to its parents. Formally, this is written as follows:

\[
P(I \mid \sigma_r, S, \theta_S)
  = \prod_{x \in \sigma_r} \prod_{A \in \mathcal{A}(x)} P(I_{x.A} \mid I_{\mathrm{Pa}(x.A)})
  = \prod_{X_i} \prod_{A \in \mathcal{A}(X_i)} \prod_{x \in \sigma_r(X_i)} P(I_{x.A} \mid I_{\mathrm{Pa}(x.A)}).
\tag{5.1}
\]

This expression is very similar to the chain rule for Bayesian networks. There
are three primary differences. First, our random variables are the attributes of a
set of objects. Second, the set of parents of a random variable can vary according
to the relational context of the object, i.e., the set of objects to which it is related.
Third, the parameters are shared; the parameters of the local probability models
for attributes of objects in the same class are identical.
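The product in (5.1) can be evaluated directly for a small skeleton. The following Python sketch uses a reduced version of the university model with only Difficulty, Intelligence, and Grade. The Grade table is the one in figure 5.3(a); the two prior tables, the object names, and the function `joint_probability` are hypothetical and serve only to make the example runnable.

```python
from math import prod

# Class-level CPDs, shared by all objects of the same class.
P_DIFFICULTY = {"h": 0.4, "l": 0.6}      # hypothetical prior
P_INTELLIGENCE = {"h": 0.3, "l": 0.7}    # hypothetical prior
P_GRADE = {                              # keyed by (Difficulty, Intelligence)
    ("h", "h"): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("h", "l"): {"A": 0.1, "B": 0.5, "C": 0.4},
    ("l", "h"): {"A": 0.8, "B": 0.1, "C": 0.1},
    ("l", "l"): {"A": 0.3, "B": 0.6, "C": 0.1},
}

# Relational skeleton: the objects and their reference slots, no attribute values.
courses = ["phil101"]
students = ["jane", "john"]
registrations = {"r1": ("phil101", "jane"), "r2": ("phil101", "john")}

# One completion I of the skeleton: a value for every random variable x.A.
instance = {
    ("phil101", "Difficulty"): "h",
    ("jane", "Intelligence"): "h",
    ("john", "Intelligence"): "l",
    ("r1", "Grade"): "A",
    ("r2", "Grade"): "C",
}


def joint_probability(inst):
    """Product over all x.A of P(I_{x.A} | I_{Pa(x.A)}), as in equation (5.1)."""
    factors = []
    for c in courses:
        factors.append(P_DIFFICULTY[inst[(c, "Difficulty")]])
    for s in students:
        factors.append(P_INTELLIGENCE[inst[(s, "Intelligence")]])
    for r, (c, s) in registrations.items():
        # The parents of r.Grade are found by following r's reference slots.
        parent_values = (inst[(c, "Difficulty")], inst[(s, "Intelligence")])
        factors.append(P_GRADE[parent_values][inst[(r, "Grade")]])
    return prod(factors)


p = joint_probability(instance)  # 0.4 * 0.3 * 0.7 * 0.5 * 0.4
```

Both registrations use the same `P_GRADE` table (shared parameters), but each looks up a different row because its parents are determined by the skeleton.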
5.2.5 Coherence of Probabilistic Model

As in any definition of this type, we have to take care that the resulting function
from instances to numbers does indeed define a coherent probability distribution,
i.e., where the sum of the probability of all instances is 1. In Bayesian networks,
where the joint probability is also a product of CPDs, this requirement is satisfied
if the dependency graph is acyclic: a variable is not an ancestor of itself. A similar
condition is sufficient to ensure coherence in PRMs as well.
5.2.5.1 Instance Dependency Graph

We want to ensure that our probabilistic dependencies are acyclic, so that a random
variable does not depend, directly or indirectly, on its own value. To do so, we can
consider the graph of dependencies among attributes of objects in the skeleton,
which we will call the instance dependency graph, G_{σ_r}.
Definition 5.4
The instance dependency graph G_{σ_r} for a PRM Π and a relational skeleton σ_r has
a node for each descriptive attribute of each object x ∈ σ_r(X) in each class X ∈ 𝒳.
Each x.A has the following edges:
1. Type I edges: For each formal parent of x.A, X.B, we introduce an edge from
x.B to x.A.
2. Type II edges: For each formal parent X.K.B, and for each y ∈ x.K, we define
an edge from y.B to x.A.
Type I edges correspond to intra-object dependencies and type II edges correspond
to inter-object dependencies. We say that a dependency structure S is acyclic
relative to a relational skeleton σ_r if the instance dependency graph G_{σ_r} over the
variables x.A is acyclic. In this case, we are guaranteed that the PRM defines a
coherent probabilistic model over complete instantiations I consistent with σ_r:
Theorem 5.5
Let Π be a PRM whose dependency structure S is acyclic relative to a relational
skeleton σ_r. Then Π and σ_r define a coherent probability distribution over
instantiations I that extend σ_r via (5.1).
5.2.5.2 Class Dependency Graph

The instance dependency graph we just described allows us to check whether a
dependency structure S is acyclic relative to a fixed skeleton σ_r. However, we often
want stronger guarantees: we want to ensure that our dependency structure is
acyclic for any skeleton that we are likely to encounter. How do we guarantee
this property based only on the class-level PRM? To do so, we consider potential
dependencies at the class level. More precisely, we define a class dependency graph,
which reflects these dependencies.
Definition 5.6
The class dependency graph G_Π for a PRM Π has a node for each descriptive
attribute X.A, and the following edges:
1. Type I edges: For any attribute X.A and any of its parents X.B, we introduce an
edge from X.B to X.A.
2. Type II edges: For any attribute X.A and any of its parents X.K.B we introduce
an edge from Y.B to X.A, where Y = Range[X.K].
Figure 5.4 shows the dependency graph for our school domain.

[Figure 5.4 graphic omitted. Nodes: Teaching-Ability, Popularity, Rating,
Intelligence, Ranking, Difficulty, Satisfaction, Grade.]

Figure 5.4 The class dependency graph for the school PRM.
The most obvious approach for using the class dependency graph is simply to
require that it be acyclic. This requirement is equivalent to assuming a stratification
among the attributes of the different classes, and requiring that the parents of an
attribute precede it in the stratification ordering. As theorem 5.7 shows, if the
class dependency graph is acyclic, we can never have that x.A depends (directly or
indirectly) on itself.
Theorem 5.7
If the class dependency graph G_Π is acyclic for a PRM Π, then for any skeleton σ_r,
the instance dependency graph is acyclic.
The following corollary follows immediately:
Corollary 5.8
Let Π be a PRM whose class dependency structure S is acyclic. For any relational
skeleton σ_r, Π and σ_r define a coherent probability distribution over instantiations
I that extend σ_r via (5.1).
For example, if we examine the PRM of figure 5.2(b), we can easily convince
ourselves that we cannot create a cycle in any instance. Indeed, as we saw in figure 5.4,
the class dependency graph is acyclic. Note, however, that if we introduce additional
dependencies we can create cycles. For example, if we make Professor.Teaching-Ability
depend on the rating of courses she teaches (e.g., if high teaching ratings
increase her motivation), then the resulting class dependency graph is cyclic, and
there is no stratification order that is consistent with the PRM structure. An
inability to stratify the class dependency graph implies that there are skeletons for
which the PRM will induce a distribution with cyclic dependencies.
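The class-level check can be sketched with a topological sort. The Python example below uses the standard-library `graphlib` module (Python 3.9 or later) and includes only the dependencies named explicitly in the text; the actual figure may contain further edges, and the dictionary layout and the function `stratification` are assumptions made for this example.

```python
from graphlib import CycleError, TopologicalSorter

# Class dependency graph: node -> set of class-level parents
# (Type I and Type II edges are stored together).
CLASS_GRAPH = {
    "Professor.Teaching-Ability": set(),
    "Professor.Popularity": {"Professor.Teaching-Ability"},
    "Course.Difficulty": set(),
    "Student.Intelligence": set(),
    "Registration.Grade": {"Student.Intelligence", "Course.Difficulty"},
    "Registration.Satisfaction": {"Professor.Teaching-Ability"},
    "Student.Ranking": {"Registration.Grade"},
    "Course.Rating": {"Registration.Satisfaction"},
}


def stratification(graph):
    """Return an ordering in which parents precede children, or None if cyclic."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None


order = stratification(CLASS_GRAPH)

# Adding the dependency of Teaching-Ability on Course.Rating closes the cycle
# Teaching-Ability -> Satisfaction -> Rating -> Teaching-Ability.
cyclic_graph = {node: set(parents) for node, parents in CLASS_GRAPH.items()}
cyclic_graph["Professor.Teaching-Ability"].add("Course.Rating")
no_order = stratification(cyclic_graph)
```

For the original structure `order` is a valid stratification; for the modified structure `no_order` is `None`, matching the observation that no stratification order exists.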


[Figure 5.5 graphic omitted. Panel (a) shows the class Person (attributes
P-chromosome, M-chromosome, Blood Type) for a person, the person's Father, and the
person's Mother, together with the class Blood Test (attributes Contaminated,
Result). Panel (b) has the nodes Person.M-chromosome, Person.P-chromosome,
Person.BloodType, BloodTest.Contaminated, and BloodTest.Result.]

Figure 5.5 (a) A simple PRM for the genetics domain. (b) The corresponding
dependency graph. Dashed edges correspond to green dependencies, dotted edges
correspond to yellow dependencies, and solid edges correspond to red dependencies.

5.2.5.3 Guaranteed Acyclic Relationships

In some important cases, a cycle in the class dependency graph is not problematic:
it will not result in a cyclic instance dependency graph. This can be the case when
we have additional domain constraints on the form of skeletons we may encounter.
Consider, for example, a simple genetic model of the inheritance of a single gene
that determines a person's blood type, shown in figure 5.5(a). Each person has
two copies of the chromosome containing this gene, one inherited from her mother,
and one inherited from her father. There is also a possibly contaminated test that
attempts to recognize the person's blood type. Our schema contains two classes:
Person and BloodTest. Class Person has reference slots Mother and Father and
descriptive attributes Gender, P-Chromosome (the chromosome inherited from the
father), and M-Chromosome (inherited from the mother). BloodTest has a reference
slot Test-Of (not shown explicitly in the figure) that points to the owner of the test,
and descriptive attributes Contaminated and Result.
In our genetic model, the genotype of a person depends on the genotype of
her parents; thus, at the class level, we have Person.P-Chromosome depending
directly on Person.P-Chromosome. As we can see in figure 5.5(b), this dependency
results in a cycle that clearly violates the acyclicity requirements of our simple class
dependency graph. However, the dependencies in this model are not actually cyclic
for any skeleton that we will encounter in this domain.
The reason is that, in legitimate skeletons for this schema, a person cannot be
his own ancestor, which disallows the situation of the person's genotype depending
(directly or indirectly) on itself. In other words, although the model appears to be
cyclic at the class level, we know that this cyclicity is always resolved at the level
of individual objects.
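The way the class-level cycle is resolved at the object level can be checked by building the instance dependency graph of definition 5.4 for a concrete skeleton. In the Python sketch below, the assumption that each chromosome depends on both chromosomes of the corresponding parent, the dictionary-based skeleton encoding, and the function names are all illustrative choices; the acyclicity test uses the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# Formal parents of the form X.K.B, written as (reference slot, attribute).
# Assumed structure: a chromosome is copied from one of the two chromosomes
# of the corresponding parent.
FORMAL_PARENTS = {
    "P-Chromosome": [("Father", "P-Chromosome"), ("Father", "M-Chromosome")],
    "M-Chromosome": [("Mother", "P-Chromosome"), ("Mother", "M-Chromosome")],
}


def instance_dependency_graph(skeleton):
    """Map each node (person, attribute) to the set of its parent nodes.

    `skeleton` maps each person to a dict giving the values of the reference
    slots Father and Mother (None when the parent is not in the skeleton).
    """
    graph = {}
    for person, slots in skeleton.items():
        for attr, formal in FORMAL_PARENTS.items():
            graph[(person, attr)] = {
                (slots[slot], b) for slot, b in formal if slots.get(slot) is not None
            }
    return graph


def is_acyclic(graph):
    """True if the graph admits a topological order."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


# A legitimate family: nobody is his or her own ancestor.
family = {
    "child": {"Father": "dad", "Mother": "mom"},
    "dad": {"Father": None, "Mother": None},
    "mom": {"Father": None, "Mother": None},
}

# An illegitimate skeleton in which two people are each other's father.
paradox = {
    "a": {"Father": "b", "Mother": None},
    "b": {"Father": "a", "Mother": None},
}
```

With these definitions, `is_acyclic(instance_dependency_graph(family))` holds while the same check fails for `paradox`, even though both skeletons are instantiated from the same class-level structure.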


Our ability to guarantee that the cyclicity is resolved relies on some prior
knowledge that we have about the domain. We want to allow the user to give us
information such as this, so that we can make stronger guarantees about acyclicity
and allow richer dependency structures in the PRM. In particular, the user can
specify that certain reference slots are guaranteed acyclic. In our genetics example,
Father and Mother are guaranteed acyclic; cycles involving these attributes may in
fact be legal. Moreover, they are mutually guaranteed acyclic, so that compositions
of the slots are also guaranteed acyclic. Figure 5.5(b) shows the class dependency
graph for the genetics domain, with guaranteed acyclic edges shown as dashed
edges.
We allow the user to assert that certain reference slots R_ga = {ρ_1, ..., ρ_k} are
guaranteed acyclic; i.e., we are guaranteed that there is a partial ordering ≺_ga such
that if y is a ρ-relative of x for some ρ ∈ R_ga, then y ≺_ga x. We say that a slot
chain K is guaranteed acyclic if each of its component ρ's is guaranteed acyclic.
This prior knowledge allows us to guarantee the legality of certain dependency
models. We start by building a colored class dependency graph that describes the
direct dependencies between the attributes.
Definition 5.9
The colored class dependency graph G_Π for a PRM Π has the following edges:
1. Yellow edges: If X.B is a parent of X.A, we have a yellow edge X.B → X.A.
2. Green edges: If γ(X.K.B) is a parent of X.A, Y = Range[X.K], and K is
guaranteed acyclic, we have a green edge Y.B → X.A.
3. Red edges: If γ(X.K.B) is a parent of X.A, Y = Range[X.K], and K is not
guaranteed acyclic, we have a red edge Y.B → X.A.
Note that there might be several edges, perhaps of different colors, between two
attributes.
The intuition is that dependency along green edges relates objects that are
ordered by an acyclic order. Thus, these edges by themselves or combined with
intra-object dependencies (yellow edges) cannot cause a cyclic dependency. We
must, however, take care with other dependencies, for which we do not have
prior knowledge, as these might form a cycle. This intuition suggests the following
definition:
Definition 5.10
A (colored) dependency graph is stratified if every cycle in the graph contains at
least one green edge and no red edges.
Theorem 5.11
If the colored class dependency graph is stratified for a PRM Π, then for any
skeleton σ_r, the instance dependency graph is acyclic.


In other words, if the colored dependency graph of S and R_ga is stratified, then
for any skeleton σ_r for which the slots in R_ga are jointly acyclic, S defines a coherent
probability distribution over assignments to σ_r.
This notion of stratification generalizes the two special cases we considered
above. When we do not have any guaranteed acyclic relations, all the edges in
the dependency graph are colored either yellow or red. Then the graph is stratified
if and only if it is acyclic. In the genetics example, all the parent relations would
be in R_ga. The only edges involved in cycles are green edges.
We can also support multiple guaranteed acyclic relations by using different
shades of green for each set of guaranteed acyclic relations. Then a cycle is safe
as long as it contains at most one shade of green edge.
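The stratification condition of definition 5.10 (with a single shade of green) can be tested without enumerating cycles. The Python sketch below rests on two observations: a red edge u → v lies on some cycle exactly when u is reachable from v, and every cycle contains a green edge exactly when the graph with the green edges removed is acyclic. The edge-list encoding, the function names, and the particular edge set for the genetics model are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def _reachable(adjacency, start):
    """Nodes reachable from `start` by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_stratified(edges):
    """edges: iterable of (parent, child, color), color in {yellow, green, red}."""
    edges = list(edges)
    adjacency = {}
    for u, v, _ in edges:
        adjacency.setdefault(u, set()).add(v)
    # No red edge may lie on a cycle.
    for u, v, color in edges:
        if color == "red" and u in _reachable(adjacency, v):
            return False
    # Every cycle must contain a green edge: dropping green edges leaves no cycle.
    predecessors = {}
    for u, v, color in edges:
        if color != "green":
            predecessors.setdefault(v, set()).add(u)
    try:
        list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True


P, M = "Person.P-Chromosome", "Person.M-Chromosome"
BT, RES, CON = "Person.BloodType", "BloodTest.Result", "BloodTest.Contaminated"


def genetics_edges(parent_color):
    """Edges of the genetics model; Father/Mother edges get `parent_color`."""
    return [
        (P, P, parent_color), (M, P, parent_color),  # via Father
        (P, M, parent_color), (M, M, parent_color),  # via Mother
        (P, BT, "yellow"), (M, BT, "yellow"),        # same Person object
        (BT, RES, "red"),                            # via Test-Of
        (CON, RES, "yellow"),                        # same BloodTest object
    ]
```

When Father and Mother are declared guaranteed acyclic, `is_stratified(genetics_edges("green"))` holds; without that declaration the same edges are red and the check fails.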

5.3 The Difference between PRMs and Bayesian Networks


The PRM specifies a probability distribution using the same underlying principles
used in specifying Bayesian networks. The assumption is that each of the random
variables in the PRM (in this case, the attributes x.A of the individual objects x)
is directly influenced by only a few others. The PRM therefore defines for each
x.A a set of parents, which are the direct influences on it, and a local probabilistic
model that specifies the dependence on these parents. In this way, the PRM is like
a Bayesian network.
However, there are two primary differences between PRMs and Bayesian
networks. First, a PRM defines the dependency model at the class level, allowing it to
be used for any object in the class. In some sense, it is analogous to a universally
quantified statement. Second, the PRM explicitly uses the relational structure of
the skeleton, in that it allows the probabilistic model of an attribute of an object to
depend also on attributes of related objects. The specific set of related objects can
vary with the skeleton σ_r; the PRM specifies the dependency in a generic enough
way that it can apply to an arbitrary relational structure.
One can understand the semantics of a PRM together with a particular relational
skeleton σ_r by examining the ground Bayesian network defined earlier. The network
has a node for each attribute of the objects in the skeleton. The local probability
models for attributes of objects in the same class are identical (we can view the
parameters as being shared); however, the distribution for a node will depend on the
values of its parents, and the parents of each node are determined by the skeleton.
It is important to note that the construction of the ground Bayesian network is just a
thought experiment; in many cases there is no need to actually construct this large
underlying Bayesian network.

5.4 PRMs with Structural Uncertainty


The previous section gave the syntax and semantics for the most basic type of
PRM, a PRM in which there is uncertainty over the attributes of the objects in
the relational skeleton. As discussed in the last section, this is already a significant
generalization beyond propositional Bayesian networks. In this section, we propose
probabilistic models for the attributes of the objects in a relational model and also
for the relational or link structure itself. In other words, we model the probability
that certain relationships hold between objects. We propose two mechanisms for
modeling link uncertainty: reference uncertainty and existence uncertainty.
The PRM framework presented so far focuses on modeling the distribution over
the attributes of the objects in the model. It takes the relational structure itself
(the objects and the relational links between entities) to be background knowledge,
determined outside the probabilistic model. This assumption implies that the model
cannot be used to predict the relational structure itself. A more subtle yet very
important point is that the relational structure is informative in and of itself. For
example, the links from and to a webpage are very informative about the type of
webpage [6], and the citation links between papers are very informative about the
paper topics [5].
By making objects and links first-class citizens in the model, our language easily
allows us to place a probabilistic model directly over them. In other words, we
can extend our framework to define probability distributions over the presence of
relational links between objects in our model. By introducing these aspects of the
world into the model, and correlating them with other attributes, we can both
predict the link structure and use the presence of links to reach conclusions about
attribute values.

5.5 Probabilistic Model of Link Structure


In our discussion so far, all relations between attributes are determined by the
relational skeleton σ_r; only the descriptive attributes are uncertain. The relational
skeleton specifies the set of objects in all classes, as well as all the relationships
that hold between them (in other words, it specifies the values for all of the
reference slots). Consider the simple university domain of section 5.2 describing
professors, courses, students, and registrations. The relational skeleton specifies the
complete relational structure in the model: it specifies which professor teaches each
course, and it specifies all of the registrations of students in courses. In our simple
university example, the relational skeleton (shown in figure 5.2(a)) contains all of
the information except for the values for the descriptive attributes.
There is one distinction we will add to our relational schema. It is useful to
distinguish between an entity and a relationship, as in entity-relationship diagrams.
In our language, classes are used to represent both entities and relationships. We
introduce 𝒳_E to denote the set of classes that represent entities, and 𝒳_R to denote
those that represent relationships. We note that the distinctions are prior knowledge
about the domain, and are therefore part of the domain specification. We use the
generic term object to refer both to entities and to relationships.

[Figure 5.6 graphic omitted: a scientific paper whose bibliography entries (1, 2, 3)
are unknown references into a document collection.]

Figure 5.6 Reference uncertainty in a simple citation domain.
5.5.1 Reference Uncertainty

Consider a simple citation domain illustrated in figure 5.6. Here we have a document
collection. Each document has a bibliography that references some of the other
documents in the collection. We may know the number of citations made by each
document (i.e., it is outside the probabilistic model). By observing the citations
that are made, we can use the links to reach conclusions about other attributes in
the model. For example, by observing the number of citations to papers of various
topics, we may be able to infer something about the topic of the citing paper.
Figure 5.7(a) shows a simple schema for this domain. We have two classes, Paper
and Cites. The Paper class has information about the topic of the paper and the
words contained in the paper. For now, we simply have an attribute for each word
that is true if the word occurs in the paper and false otherwise. The Cites class
represents the citation of one paper, the Cited paper, by another paper, the Citing
paper. (In the figure, for readability, we show the Paper class twice.) In this model,
we assume that the set of objects is prespecified, but relations among them, i.e.,
reference slots, are subject to probabilistic choices. Thus, rather than being given
a full relational skeleton σ_r, we assume that we are given an object skeleton σ_o.
The object skeleton specifies only the objects σ_o(X) in each class X ∈ 𝒳, but
not the values of the reference slots. In our example, the object skeleton specifies
the objects in class Paper and the objects in class Cites, but the reference slots of
the Cites relation, Cites.Cited and Cites.Citing, are unspecified. In other words, the
probabilistic model does not provide a model of the total number of citation links,
but only a distribution over their endpoints. Figure 5.7(b) shows an object skeleton
for the citation domain.

[Figure 5.7 graphic omitted. Panel (a) shows the class Paper (attributes Topic,
Words), drawn twice, and the class Cites (reference slots Cited, Citing). Panel (b)
shows the Paper objects P1 through P5 and the Cites objects.]

Figure 5.7 (a) A relational schema for the citation domain. (b) An object skeleton
for the citation domain.

5.5.1.1 Probabilistic Model

In the case of reference uncertainty, we specify a probabilistic model for the value
of the reference slots X.ρ. The domain of a reference slot X.ρ is the set of keys
(unique identifiers) of the objects in the class Y to which X.ρ refers. Thus, we need
to specify a probability distribution over the set of all objects in Y. For example,
for Cites.Cited, we must specify a distribution over the objects in class Paper.
A naive approach is to simply have the PRM specify a probability distribution
directly over the objects σ_o(Y) in Y. For example, for Cites.Cited, we would have
to specify a distribution over the primary keys of Paper. This approach has two
major flaws. Most obviously, this distribution would require a parameter for each
object in Y, leading to a very large number of parameters. This is a problem both
from a computational perspective (the model becomes very large) and from
a statistical perspective (we often would not have enough data to make robust
estimates for the parameters). More importantly, we want our dependency model
to be general enough to apply over all possible object skeletons σ_o; a distribution
defined in terms of the objects within a specific object skeleton would not apply to
others.
In order to achieve a general and compact representation, we use the attributes
of Y to define the probability distribution. In this model, we partition the class Y
into subsets labeled ψ_1, ..., ψ_m according to the values of some of its attributes,
and specify a probability for choosing each partition, i.e., a distribution over the
partitions. We then select an object within that partition uniformly.
For example, consider a description of movie theater showings as in figure 5.8(a).
For the foreign key Shows.Movie, we can partition the class Movie by Genre,
indicating that a movie theater first selects the genre of movie it wants to show,
and then selects uniformly among the movies with the selected genre. For example,
a movie theater may be much more likely to show a movie which is a thriller
than a foreign movie. Having selected, for example, to show a thriller, the theater
then selects the actual movie to show uniformly from within the set of thrillers.
In addition, just as in the case of descriptive attributes, the partition choice can

[Figure 5.8 graphic omitted. Panel (a) shows the classes Theater (attributes Type,
Location, Profit) and Shows (reference slots Theater, Movie), the partitions M1
(Movie.Genre = foreign) and M2 (Movie.Genre = thriller), and a CPD over M1 and M2
given Theater.Type (art theater, megaplex). Panel (b) shows the classes Paper
(attributes Topic, Words) and Cites (reference slots Citing, Cited), the partitions
P1 (Paper.Topic = Theory) and P2 (Paper.Topic = AI), and a CPD over P1 and P2 given
Topic (Theory, AI).]

Figure 5.8 (a) An example of reference uncertainty for a movie theater's showings.
(b) A simple example of reference uncertainty in the citation domain.

depend on other attributes in our model. Thus, the selector attribute can have
parents. As illustrated in the figure, the choice of movie genre might depend on
the type of theater. Consider another example in our citation domain. As shown in
figure 5.8(b), we can partition the class Paper by Topic, indicating that the topic
of a citing paper determines the topics of the papers it cites; and then the cited
paper is chosen uniformly among the papers with the selected topic.
We make this intuition precise by defining, for each slot ρ, a partition function
Ψ_ρ. We place several restrictions on the partition function which are captured in
the following definition:
Definition 5.12
Let X.ρ be a reference slot with domain Y. Let Ψ_ρ : Y → Dom[Ψ_ρ] be a function
where Dom[Ψ_ρ] is a finite set of labels. We say that Ψ_ρ is a partition function for
ρ if there is a subset of the attributes of Y, P[ρ] ⊆ 𝒜(Y), such that for any y ∈ Y
and any y′ ∈ Y, if the values of the attributes P[ρ] of y and y′ are the same, i.e., for
each A ∈ P[ρ], y.A = y′.A, then Ψ_ρ(y) = Ψ_ρ(y′). We refer to P[ρ] as the partition
attributes for ρ.
Thus, the values of the partition attributes are all that is required to determine the
partition to which an object belongs.
In our first example, Ψ_Shows.Movie : Movie → {foreign, thriller} and the partition
attributes are P[Shows.Movie] = {Genre}. In the second example, Ψ_Cites.Cited :
Paper → {AI, Theory} and the partition attributes are P[Cites.Cited] = {Topic}.
There are a number of natural methods for specifying the partition function.
It can be dened simply by having one partition for each possible combination
of values of the partition attributes, i.e., one partition for each value in the cross
product of the partition attribute values. Our examples above take this approach.
In both cases, there is only a single partition attribute, so specifying the partition
function in this manner is not too unwieldy, but for larger collections of partition
attributes or for partition attributes with large domains, this method for dening
the partitioning function may be problematic. A more exible and scalable approach

5.5

Probabilistic Model of Link Structure

145

is to dene the partition function using a decision tree built over the partition
attributes. In this case, there is one partition for each of the leaves in the decision
tree.
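The cross-product construction can be illustrated with a minimal Python sketch. The helper name make_partition_function, the dictionary encoding of objects, and the movie data are assumptions made for illustration only; they are not part of the PRM formalism.

```python
def make_partition_function(partition_attrs):
    """Build a partition function with one partition per combination of
    values of the partition attributes (the cross-product approach)."""
    def psi(obj):
        # The label is the tuple of partition-attribute values, so two
        # objects that agree on these attributes get the same label.
        return tuple(obj[a] for a in partition_attrs)
    return psi

# Hypothetical movie objects, encoded as attribute dictionaries.
movies = {
    "m1": {"Genre": "foreign", "Popularity": "low"},
    "m2": {"Genre": "thriller", "Popularity": "high"},
    "m3": {"Genre": "thriller", "Popularity": "low"},
}

psi_movie = make_partition_function(["Genre"])
labels = {name: psi_movie(attrs) for name, attrs in movies.items()}
```

Here m2 and m3 land in the same partition because they agree on Genre, even though their other attributes differ. A decision-tree partition function would replace the tuple with the identifier of the leaf that the object reaches.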
Each possible value ψ determines a subset of Y from which the value of ρ (the
referent) will be selected. For a particular instantiation I of the database, we use
I(Y_ψ) to represent the set of objects in I(Y) that fall into the partition ψ.
We now represent a probabilistic model over the values of ρ by specifying a
distribution over possible partitions, which encodes how likely the reference value
of ρ is to fall into one partition versus another. We formalize our intuition above
by introducing a selector attribute S_ρ, whose domain is Dom[Ψ_ρ]. The specification
of the probabilistic model for the selector attribute S_ρ is the same as that of any
other attribute: it has a set of parents and a CPD. In our earlier example, the CPD
of Shows.S_Movie might have as a parent Theater.Type. For each instantiation of the
parents, we have a distribution over Dom[S_ρ]. The choice of value for S_ρ determines
the partition Y_ψ from which the reference value of ρ is chosen; the choice of reference
value for ρ is uniformly distributed within this set.
Definition 5.13
A probabilistic relational model Π with reference uncertainty over a relational
schema R has the same components as in definition 5.2. In addition, for each
reference slot ρ ∈ R(X) with Range[ρ] = Y, we have:
• a partition function Ψ_ρ with a set of partition attributes P[ρ] ⊆ A(Y);
• a new selector attribute S_ρ within X which takes on values in the range of Ψ_ρ;
• a set of parents and a CPD for S_ρ.
To define the semantics of this extension, we must define the probability of
reference slots as well as descriptive attributes:

\[
P(I \mid \sigma_o, \Pi) = \prod_{X \in \mathcal{X}} \prod_{x \in \sigma_o(X)} \Biggl[ \prod_{A \in \mathcal{A}(X)} P(x.A \mid \mathrm{Pa}(x.A)) \prod_{\rho \in \mathcal{R}(X),\; y = x.\rho} \frac{P(x.S_\rho = \psi[y] \mid \mathrm{Pa}(x.S_\rho))}{|I(Y_{\psi[y]})|} \Biggr],
\tag{5.2}
\]

where ψ[y] refers to Ψ_ρ(y), the partition that the partition function assigns y.
Note that the last term in (5.2) depends on I in three ways: the interpretation of
x.ρ = y, the values of the attributes P[ρ] within the object y, and the size of Y_ψ[y].
The above probability is not well-defined if there are no objects in a partition, so
in that case we define it to be zero.
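The last factor of (5.2) is easy to misread, so here is a small Python sketch of it for a single reference slot. The function name reference_term, the data layout, and all probabilities below are illustrative assumptions, not values taken from the figures.

```python
from collections import Counter

def reference_term(selector_cpd, parent_value, partition_of, candidates, chosen):
    """Selector probability of the chosen object's partition, divided by the
    size of that partition; defined as zero for an empty partition."""
    sizes = Counter(partition_of(y) for y in candidates)
    psi = partition_of(chosen)
    if sizes[psi] == 0:
        return 0.0
    return selector_cpd[parent_value][psi] / sizes[psi]

# Hypothetical data: one foreign movie and two thrillers.
genre = {"m1": "foreign", "m2": "thriller", "m3": "thriller"}
# Hypothetical selector CPD P(S_Movie | Theater.Type).
cpd = {
    "art theater": {"foreign": 0.7, "thriller": 0.3},
    "megaplex": {"foreign": 0.1, "thriller": 0.9},
}

term_thriller = reference_term(cpd, "megaplex", genre.get, list(genre), "m2")
term_foreign = reference_term(cpd, "art theater", genre.get, list(genre), "m1")
```

Under these assumed numbers, a megaplex picks the thriller partition with probability 0.9 and then one of the two thrillers uniformly, so the term for m2 is 0.9 / 2.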
5.5.2 Coherence of the Probabilistic Model

As in the case of PRMs with attribute uncertainty, we must be careful to guarantee
that our probability distribution is in fact coherent. In this case, the object
skeleton does not specify which objects are related to which, and therefore the
mapping of formal to actual parents depends on probabilistic choices made in the
model. The associated ground Bayesian network will therefore be cumbersome and
not particularly intuitive. We define our coherence constraints using an instance
dependency graph, relative to our PRM and object skeleton.
Definition 5.14
The instance dependency graph for a PRM Π and an object skeleton σ_o is a
graph G_σo with the nodes and edges described below. For each class X and each
x ∈ σ_o(X), we have the following nodes:
• a node x.A for every descriptive attribute X.A;
• a node x.ρ and a node x.S_ρ, for every reference slot X.ρ.
The dependency graph contains five types of edges:
• Type I edges: Consider any attribute (descriptive or selector) X.A and formal
parent X.B. We define an edge x.B → x.A, for every x ∈ σ_o(X).
• Type II edges: Consider any attribute (descriptive or selector) X.A and formal
parent X.K.B where Dom[X.K] = Y. We define an edge y.B → x.A, for every
x ∈ σ_o(X) and y ∈ σ_o(Y).
• Type III edges: Consider any attribute X.A and formal parent X.K.B, where
K = ρ_1, . . . , ρ_k, and Dom[ρ_i] = X_i. We define an edge x.ρ_1 → x.A, for every
x ∈ σ_o(X). In addition, for each i > 1, we add an edge x_i.ρ_i → x.A for every
x_i ∈ σ_o(X_i) and for every x ∈ σ_o(X).
• Type IV edges: Consider any slot X.ρ and partition attribute Y.B ∈ P[ρ] for
Y = Range[ρ]. We define an edge y.B → x.S_ρ for every x ∈ σ_o(X) and y ∈ σ_o(Y).
• Type V edges: Consider any slot X.ρ. We define an edge x.S_ρ → x.ρ for every
x ∈ σ_o(X).
We say that a dependency structure S is acyclic relative to an object skeleton σ_o
if the directed graph G_σo is acyclic.
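As a rough illustration of the acyclicity test, the following Python sketch builds only the type IV and type V edges for one slot and checks the resulting graph with the standard-library graphlib module (Python 3.9 or later). The node naming scheme, the helper names, and the extra "bad" edge are assumptions made for the example.

```python
from graphlib import TopologicalSorter, CycleError

def is_acyclic(edges):
    """Return True if the directed graph given as (source, target) pairs has no cycle."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(src, set())
        predecessors.setdefault(dst, set()).add(src)
    try:
        list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True

def selector_edges(xs, ys, slot, partition_attrs):
    """Type IV edges (y.B -> x.S_slot) and type V edges (x.S_slot -> x.slot)."""
    edges = []
    for x in xs:
        selector = f"{x}.S_{slot}"
        edges += [(f"{y}.{b}", selector) for y in ys for b in partition_attrs]
        edges.append((selector, f"{x}.{slot}"))
    return edges

shows, movies = ["s1", "s2"], ["m1", "m2"]
edges = selector_edges(shows, movies, "Movie", ["Genre"])
# Hypothetical extra dependency: a movie's genre depending on which movie a
# showing refers to. This closes the loop Genre -> S_Movie -> Movie -> Genre.
bad_edges = edges + [("s1.Movie", "m1.Genre")]
```

The first graph is acyclic, while the second is rejected. A full implementation would also add the type I, II, and III edges of the definition before running the same check.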
Intuitively, type I edges correspond to intra-object dependencies and type II edges
to inter-object dependencies. These are the same edges that we had in the
dependency graph for regular PRMs, except that they also apply to selector attributes.
Moreover, there is an important difference in our treatment of type II edges. In this
case, the skeleton does not specify the value of x.ρ, and hence we cannot determine
from the skeleton on which object y the attribute x.A actually depends. Therefore,
our instance dependency graph must include an edge from every attribute y.B.
Type III edges represent the fact that the actual choice of parent for x.A depends
on the value of the slots used to define it. When the parent is defined via a slot
chain, the actual choice depends on the values of all the slots along the chain. Since
we cannot determine the particular object from the skeleton, we must include an
edge from every slot x_i.ρ_i potentially included in the chain.
Type IV edges represent the dependency of a slot on the attributes defining the
associated partition. To see why this dependence is required, we observe that our
choice of reference value for x.ρ depends on the values of the partition attributes
P[x.ρ] of all of the different objects y in Y. Thus, these attributes must be
determined before x.ρ is determined. Finally, type V edges represent the fact that
the actual choice of parent for x.A depends on the value of the selector attributes
for the slots used to define it. In our example, as P[Shows.Movie] = {Movie.Genre},
the genres of all movies must be determined before we can select the value of the
reference slot Shows.Movie.
Based on this definition, we can specify conditions under which (5.2) specifies a
coherent probability distribution.
Theorem 5.15
Let Π be a PRM with reference uncertainty whose dependency structure S is acyclic
relative to an object skeleton σ_o. Then Π and σ_o define a coherent probability
distribution over instantiations I that extend σ_o via (5.2).
This theorem is limited in that it is very specific to the constraints of a given
object skeleton. As in the case of PRMs without relational uncertainty, we want to
learn a model in one setting, and be assured that it will be acyclic for any skeleton
we might encounter. We accomplish this goal by extending the class dependency
graph to contain edges that correspond to the edges we defined in the instance
dependency graph.
Definition 5.16
The class dependency graph G_Π for a PRM Π with reference uncertainty has a node
for each descriptive or selector attribute X.A and each reference slot X.ρ, and the
following edges:
• Type I edges: For any attribute X.A and formal parent X.B, we have an edge
X.B → X.A.
• Type II edges: For any attribute X.A and formal parent X.ρ.B where
Range[ρ] = Y, we have an edge Y.B → X.A.
• Type III edges: For any attribute X.A and formal parent X.K.B, where
K = ρ_1, . . . , ρ_k, and Dom[ρ_i] = X_i, we define an edge X.ρ_1 → X.A. In addition,
for each i > 1, we add an edge X_i.ρ_i → X.A.
• Type IV edges: For any slot X.ρ and partition attribute Y.B for Y = Range[ρ],
we have an edge Y.B → X.S_ρ.
• Type V edges: For any slot X.ρ, we have an edge X.S_ρ → X.ρ.
Figure 5.9 shows the class dependency graph for our extended movie example.
While the proof is a bit more complex than in the attribute uncertainty case, the
following analogous theorem holds:
Theorem 5.17
Let Π be a PRM with reference uncertainty whose class dependency structure
S is acyclic. For any object skeleton σ_o, Π and σ_o define a coherent probability
distribution over instantiations I that extend σ_o via (5.2).

[Figure 5.9 graphic. Panel (a): classes Theater (Type, Location), Movie (Genre, Popularity), and Shows (S_Theater, Theater, S_Movie, Movie, Profit). Panel (b): the corresponding dependency graph over these attributes, with edges labeled Type I through Type V.]

Figure 5.9 (a) A PRM for the movie theater example. The partition attributes
are indicated using dashed lines. (b) The dependency graph for the movie theater
example. The different edge types are labeled.

[Figure 5.10 graphic: two document collections with unknown ("?") links between documents.]

Figure 5.10 Existence uncertainty in a simple citation domain.

5.5.3 Existence Uncertainty

The second form of structural uncertainty we introduce is called existence
uncertainty. In this case, we make no assumptions about the number of links that exist.
The number of links that exist and the identity of the links are all part of the
probabilistic model and can be used to make inferences about other attributes in
our model. In our citation example above, we might assume that the set of papers
is part of our background knowledge, but we want to provide an explicit model for
the presence or absence of citations. Unlike the reference uncertainty model of the
previous section, we do not assume that the total number of citations is fixed, but
rather that each potential citation can be present or absent.

[Figure 5.11 graphic. Panel (a): papers P1 through P5, each with a Topic (Theory or AI) and Words, with unknown ("???") citation links. Panel (b): the Paper class (Topic, Words), the Cites class (Exists), and the following CPD for Cites.Exists.]

Citer.Topic   Cited.Topic   False   True
Theory        Theory        0.995   0.005
Theory        AI            0.999   0.001
AI            Theory        0.997   0.003
AI            AI            0.993   0.008

Figure 5.11 (a) An entity skeleton for the citation domain. (b) A CPD for the
Exists attribute of Cites.

5.5.3.1 Semantics of Relational Model

The object skeleton used for reference uncertainty assumes that the number of
objects in each relation is known. Thus, if we consider a division of objects into
entities and relations, the number of objects in classes of both types is fixed.
Existence uncertainty assumes even less background information than specified by
the object skeleton. Specifically, we assume that the number of relationship objects
is not fixed in advance. This situation is illustrated in figure 5.10.
We assume that we are given only an entity skeleton σ_e, which specifies the set
of objects in our domain only for the entity classes. Figure 5.11(a) shows an entity
skeleton for the citation example. Our basic approach is to allow other objects
within the model (those in the relationship classes) to be undetermined, i.e.,
their existence can be uncertain. In other words, we introduce into the model all
of the objects that can potentially exist in it; with each of them, we associate a
special binary variable that tells us whether the object actually exists or not. We
call entity classes determined and relationship classes undetermined.
To specify the set of potential objects, we note that relationship classes typically
represent many-many relationships; they have at least two reference slots, which
refer to determined classes. For example, our Cites class has the two reference
slots, Citing and Cited. Thus the potential domain of the Cites class in a given
instantiation I is I(Paper) × I(Paper). Each potential object x in this class has
the form Cites[y_1, y_2]. Each such object is associated with a binary attribute x.E
that specifies whether paper y_1 did or did not cite paper y_2.
Definition 5.18
Consider a schema with determined and undetermined classes, and let σ_e be an
entity skeleton over this schema. We define the induced relational skeleton, σ_r[σ_e],
to be the relational skeleton that contains the following objects:
• If X is a determined class, then σ_r[σ_e](X) = σ_e(X).
• Let X be an undetermined class with reference slots ρ_1, . . . , ρ_k whose range types
are Y_1, . . . , Y_k respectively. Then σ_r[σ_e](X) contains an object X[y_1, . . . , y_k] for
all tuples ⟨y_1, . . . , y_k⟩ ∈ σ_r[σ_e](Y_1) × · · · × σ_r[σ_e](Y_k).
The relations in σ_r[σ_e] are defined in the obvious way: Slots of objects of determined
classes are taken from the entity skeleton. Slots of objects of undetermined classes
are induced from the object definition: X[y_1, . . . , y_k].ρ_i is y_i.
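The construction of the induced relational skeleton amounts to a Cartesian product over the range classes of each undetermined class. The Python sketch below shows this under a few assumptions: objects are plain strings, each undetermined class is described by the list of range classes of its slots, and the undetermined classes are supplied in an order in which every range class is already populated. The function name induced_skeleton is hypothetical.

```python
from itertools import product

def induced_skeleton(entity_skeleton, undetermined):
    """Add, for each undetermined class, one potential object per tuple of
    objects drawn from the range classes of its reference slots."""
    skeleton = {cls: list(objs) for cls, objs in entity_skeleton.items()}
    for cls, slot_ranges in undetermined.items():
        # Assumes every class in slot_ranges is already in the skeleton.
        pools = [skeleton[r] for r in slot_ranges]
        skeleton[cls] = [(cls, ys) for ys in product(*pools)]
    return skeleton

entity_skeleton = {"Paper": ["p1", "p2", "p3"]}
undetermined = {"Cites": ["Paper", "Paper"]}  # slots Citing and Cited

sigma_r = induced_skeleton(entity_skeleton, undetermined)
```

With three papers, Cites receives nine potential objects, including pairs such as Cites[p1, p1], since the definition ranges over all tuples. Whether each potential citation actually exists is left to its existence attribute.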
To ensure that the semantics of schemata with undetermined classes is
well-defined, we need a few tools. Specifically, we need to ensure that the set of potential
objects is well-defined and finite. It is clear that if we allow cyclic references (e.g.,
an undetermined class with a reference to itself), then the set of potential objects
is not finite. To avoid such situations, we need to put some requirements on the
schema.
Definition 5.19
A set of classes 𝒳 is stratified if there exists a partial ordering ≺ over the classes
such that for any reference slot X.ρ with range type Y, Y ≺ X.
Lemma 5.20
If the set of undetermined classes in a schema is stratified, then given any entity
skeleton σ_e the number of potential objects in any undetermined class is finite.
As discussed, each undetermined class X has a special existence attribute X.E whose
values are V(E) = {true, false}. For uniformity of notation, we introduce an E
attribute for all classes; for classes that are determined, the E value is defined to
be always true. We require that all of the reference slots of a determined class X
have a range type which is also a determined class.
For a PRM with stratified undetermined classes, we define an instantiation to
be an assignment of values to the attributes, including the Exists attribute, of all
potential objects.
5.5.3.2 Probabilistic Model

We now specify the probabilistic model defined by the PRM. By treating the Exists
attributes as standard descriptive attributes, we can essentially build our definition
directly on top of the definition of standard PRMs.
Specifically, the existence attribute for an undetermined class is treated in the
same way as a descriptive attribute in our dependency model, in that it can have
parents and children, and has an associated CPD. Figure 5.11(b) illustrates a CPD
for the Cites.Exists attribute. In this example, the existence of a citation depends
on the topic of the citing paper and the topic of the cited paper; e.g., it is more
likely that citations will exist between papers with the same topic.
Using the induced relational skeleton and treating the existence events as
descriptive attributes, we have set things up so that (5.1) applies with minor changes.
There are two important changes to the definition of the distribution:
• We want to enforce that x.E = false if x.ρ.E = false for one of the slots ρ of X.
Suppose that X has the slots ρ_1, . . . , ρ_k; we define the effective CPD for X.E as
follows. Let Pa*(X.E) = Pa(X.E) ∪ {X.ρ_1.E, . . . , X.ρ_k.E}, and define

\[
P^*(X.E \mid \mathrm{Pa}^*(X.E)) =
\begin{cases}
P(X.E \mid \mathrm{Pa}(X.E)) & \text{if } X.\rho_i.E = \mathit{true},\ \forall i = 1, \ldots, k, \\
0 & \text{otherwise.}
\end{cases}
\]

• We want to decouple the attributes of nonexistent objects from the rest
of the PRM. Thus, if X.A is a descriptive attribute, we define Pa*(X.A) =
Pa(X.A) ∪ {X.E}, and

\[
P^*(X.A \mid \mathrm{Pa}^*(X.A)) =
\begin{cases}
P(X.A \mid \mathrm{Pa}(X.A)) & \text{if } X.E = \mathit{true}, \\
\dfrac{1}{|\mathcal{V}(X.A)|} & \text{otherwise.}
\end{cases}
\]

It is easy to verify that in both cases P*(X.A | Pa*(X.A)) is a legal conditional
distribution.
In effect, these constraints specify a new PRM Π*, in which we treat X.E as a
standard descriptive attribute. For each attribute (including the Exists attribute),
we define the parents of X.A in Π* to be Pa*(X.A) and the associated CPD to be
P*(X.A | Pa*(X.A)).
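The two effective CPDs can be written as two small functions. This is only a sketch: the function names are hypothetical, and it reads the zero case of the first rule as applying to the event X.E = true, which is an interpretation rather than something the text states explicitly.

```python
def effective_exists_prob(p_exists_true, slot_objects_exist):
    """Effective probability that X.E is true: the original CPD value if every
    object referenced by a slot of X exists, and zero otherwise.
    (Assumption: the zero case refers to the event X.E = true.)"""
    return p_exists_true if all(slot_objects_exist) else 0.0

def effective_attr_prob(p_value, exists, num_values):
    """Effective probability of a descriptive attribute value: the original CPD
    value if the object exists, and uniform over the value set otherwise."""
    return p_value if exists else 1.0 / num_values

# Hypothetical numbers for one potential citation.
p_both_exist = effective_exists_prob(0.005, [True, True])
p_one_missing = effective_exists_prob(0.005, [True, False])
p_attr_nonexistent = effective_attr_prob(0.3, exists=False, num_values=4)
```

Making the attributes of nonexistent objects uniform means they contribute a constant factor to the joint distribution and carry no information about the rest of the model.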
Given an entity skeleton σ_e, a PRM with existence uncertainty Π specifies a
distribution over a set of instantiations I consistent with σ_r[σ_e]:

\[
P(I \mid \sigma_e, \Pi) = P(I \mid \sigma_r[\sigma_e], \Pi^*) = \prod_{X \in \mathcal{X}} \prod_{x \in \sigma_r[\sigma_e](X)} \prod_{A \in \mathcal{A}(x)} P^*(x.A \mid \mathrm{Pa}^*(x.A)).
\tag{5.3}
\]

We can similarly define the class dependency graph for a PRM Π with existence
uncertainty using the corresponding notions for the standard PRM Π*. As there, we
require that the class dependency graph G_Π* is acyclic. One immediate consequence
of this requirement is that the schema is stratified.
Lemma 5.21
If the class dependency graph G_Π* is acyclic, then there is a stratification of the
undetermined classes.
Based on this definition, we can prove the following result:
Theorem 5.22
Let Π be a PRM with existence uncertainty and an acyclic class dependency graph.
Let σ_e be an entity skeleton. Then (5.3) defines a coherent distribution on all
instantiations I of the induced relational skeleton σ_r[σ_e].

5.6 PRMs with Class Hierarchies

Next we propose methods for discovering useful refinements of a PRM's dependency
model. We begin by introducing probabilistic relational models with class hierarchies
(PRMs-CH). PRMs-CH extend PRMs by including class hierarchies over the
objects. Subclasses allow us to specialize the probabilistic model for some instances
of a class. For example, if we have a class Movie in our relational schema, we
might consider subclasses of movies, such as documentaries, action movies, British
comedies, etc. The popularity of an action movie (a subclass of movies) may
depend on its budget, whereas the popularity of a documentary (another subclass
of movies) may depend on the reputation of the director. Subclassing allows us to
model probabilistic dependencies at the appropriate level of detail. For example,
we can have the parents of the popularity attribute in the action movie subclass
be different from the parents of the same attribute in the documentary subclass. In
addition, subclassing allows additional dependency paths to be represented in the
model that would not be allowed in a PRM that does not support subclasses. For
example, whether a person enjoys action movies may depend on whether she enjoys
documentaries. PRMs-CH provide a general mechanism that allows us to define a
rich set of dependencies.
To motivate our extensions, consider a simple PRM for the movie domain. Let
us restrict attention to the three classes, Person, Movie, and Vote. We can have
the attributes of Vote depending on attributes of the person voting (via the slot
Vote.Voter) and on attributes of the movie (via the slot Vote.Movie). However,
given the attributes of all the people and the movies in the model, the different
votes are (conditionally) i.i.d.
5.6.1 Class Hierarchies

Our aim is to refine the notion of a class, such as Movie, into finer subclasses,
such as action movies, comedies, documentaries, etc. Moreover, we want to allow
recursive refinements of this structure, so that we might refine action movies into
the subclasses spy movies, car chase movies, and kung-fu movies.
A class hierarchy for a class X defines an IS-A hierarchy for objects from class
X. The root of the class hierarchy is simply class X itself. The subclasses of X are
organized into an inheritance hierarchy. The leaves of the class hierarchy describe
basic classes; these are the most specific characterization of objects that occur
in the database. The interior nodes describe abstractions of the base-level classes.
The intent is that the class hierarchy is designed to capture useful and meaningful
abstractions in a particular domain.
More formally, a hierarchy H[X] for a class X is a rooted directed acyclic graph
defined by a subclass relation ≺ over a finite set of subclasses C[X]. For c, d ∈ C[X],
if c ≺ d, we say that X_c is a direct subclass of X_d, and X_d is a direct superclass of
X_c. The root of the hierarchy corresponds to the original class X. We define ≺* to
be the transitive closure of ≺; if c ≺* d, we say that X_c is a
subclass of X_d. For example, figure 5.12 shows the simple class hierarchy for the
Movie class.

[Figure 5.12 graphic: Movie at the root, with subclasses Comedy, Action-Movie, and Documentary; Action-Movie has subclasses Spy-Movie, Car-Chase-Movie, and Kung-Fu-Movie.]

Figure 5.12 A simple class hierarchy for Movie.

We denote the subclasses of the hierarchy by C[H[X]]. We achieve subclassing
for a class X by requiring that there be an additional subclass indicator attribute
X.Class that determines the subclass to which an object belongs. Thus, if c is a
subclass, then I(X_c) contains all objects x ∈ X for which x.Class ≺* c, i.e., all
objects that are in some class which is a subclass of c. In our example, Movie has
a subclass indicator variable Movie.Class with possible values
{Comedy, Action-Movie, Documentary, Spy-Movie, Car-Chase-Movie, Kung-Fu-Movie}.
Subclasses allow us to make finer distinctions when constructing a probabilistic
model. In particular, they allow us to specialize CPDs for different subclasses in
the hierarchy.
Definition 5.23
A probabilistic relational model with subclass hierarchy is defined as follows. For
each class X ∈ 𝒳, we have
• a class hierarchy H[X] = (C[X], ≺);
• a subclass indicator attribute X.Class such that V(X.Class) = C[H[X]];
• a CPD for X.Class;
• for each subclass c ∈ C[X] and attribute A ∈ A(X) we have either
  - a set of parents Pa^c(X.A) and a CPD that describes P(X.A | Pa^c(X.A)); or
  - an inherited indicator that specifies that the CPD for X.A in c is inherited
  from its direct superclass. The root of the hierarchy cannot have the inherited
  indicator.
With the introduction of subclass hierarchies, we can refine our probabilistic
dependencies. Before, each attribute X.A had an associated CPD. Now, if we like,
we can specialize the CPD for an attribute within a particular subclass. We can
associate a different CPD with the attributes of different subclasses. For example,
the attribute Action-Movie.Popularity may have a different conditional distribution
from the attribute Documentary.Popularity. Further, the distribution for each of
the attributes may depend on a completely different set of parents. Continuing
our earlier example, if the popularity of an action movie depends on its budget,
then Action-Movie.Popularity would have as parent Action-Movie.Budget. However,
for documentaries, the popularity depends on the reputation of the director; then
Documentary.Popularity would have the parent Documentary.Director.Reputation.
We define P(X.A | Pa^c(X.A)) to be the CPD associated with A in X_d, where d
is the most specialized superclass of c (which may be c itself) such that the CPD
of X.A in d is not marked with the inherited indicator.
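The inheritance rule can be sketched as a walk up the hierarchy. The Python below assumes, for simplicity, that each subclass has a single direct superclass (a tree rather than a general DAG), and it represents the inherited indicator by the absence of an entry. The function name resolve_cpd, the dictionaries, and the placeholder CPD strings are illustrative assumptions.

```python
def resolve_cpd(subclass, attribute, parent_of, cpds):
    """Return (d, cpd), where d is the most specialized superclass of
    `subclass` (possibly itself) whose CPD for `attribute` is not inherited."""
    c = subclass
    while attribute not in cpds.get(c, {}):
        # Inherited indicator: move to the direct superclass. A KeyError here
        # would mean the root carries the inherited indicator, which the
        # definition forbids.
        c = parent_of[c]
    return c, cpds[c][attribute]

# Hypothetical tree-shaped hierarchy for Movie.
parent_of = {
    "Comedy": "Movie", "Action-Movie": "Movie", "Documentary": "Movie",
    "Spy-Movie": "Action-Movie", "Car-Chase-Movie": "Action-Movie",
    "Kung-Fu-Movie": "Action-Movie",
}
# Placeholder CPDs; subclasses without an entry inherit.
cpds = {
    "Movie": {"Popularity": "cpd_movie"},
    "Action-Movie": {"Popularity": "cpd_action"},
}

spy = resolve_cpd("Spy-Movie", "Popularity", parent_of, cpds)
comedy = resolve_cpd("Comedy", "Popularity", parent_of, cpds)
```

Spy-Movie resolves to the CPD specialized at Action-Movie, while Comedy falls back to the CPD at the root.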
5.6.2 Refined Slot References

At first glance, the increase in representational power provided by supporting
subclasses is deceptively small. It seems that little more than an extra constructed
type variable has been added, and that the structure that is exploited by the new
subclassed CPDs could just as easily have been provided using structured CPDs,
such as the tree-structured CPDs or decision graphs [1, 4]. For example, the root
node in the tree-structured CPD for attribute X.A can split on the class attribute,
X.Class, and then the subtrees can define the appropriate specializations of the
CPD. In reality, it is not quite so simple; now X.A would need to have as parents
the union of all of the parents of its subclasses. Nevertheless, the representational
power is quite similar.
However, the representational power has been extended in a very important way.
Certain dependency structures that would have been disallowed in the original
framework are now allowed. These dependencies appear circular when examined
only at the class level; however, when refined and modeled at the subclass level,
they are no longer cyclic. One way of understanding this phenomenon is that, once
we have refined the class, the subclass information allows us to disentangle and
order the dependencies.
Returning to our earlier example, suppose that we have the classes Person, Movie,
and Vote. Vote has reference slots Person and Movie and an attribute Rank
that gives the score that a person has given for a movie. Suppose we want to model
a correlation between a person's votes for documentaries and her votes for action
movies. (This correlation might be a negative one.) In the unrefined model, we do
not have a way of referring to a person's votes for some particular subset of movies;
we can only consider aggregates over a person's entire set of votes. Furthermore,
even if we could introduce such a dependence, the dependency graph would show
a dependence of Vote.Rank on itself.
When we create subclasses of Movie, we can also create specializations of any
classes that make reference to movies. For example, Vote has a reference slot
Vote.Movie. Suppose we create subclasses of Movie: Comedy, Action-Movie, and
Documentary. Then we can create corresponding specializations of Vote:
Comedy-Vote, Action-Vote, and Documentary-Vote. Each of these subclasses refers only to a
particular category of votes.
The introduction of subclasses of votes provides us with a way of isolating a
person's votes on some subset of movies. In particular, we can try to introduce
a dependence of Documentary-Vote.Rank on Action-Vote.Rank. In order to allow
this dependency, we need a mechanism for constructing slot chains that restrict
the types of objects along the path to belong to specific subclasses. Recall that a
reference slot ρ is a function from Dom[ρ] to Range[ρ], i.e., from X to Y. We can
introduce refinements of a slot reference by restricting the types of the objects in
the range.
Definition 5.24
Let ρ be a slot (reference or inverse) of X with range Y. Let d be a subclass of Y.
A refined slot reference ρ_d for ρ to d is a relation between X and Y:
for x ∈ X and y ∈ Y, y ∈ x.ρ_d if and only if y ∈ x.ρ and y ∈ Y_d.
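Read as a filter on the range of a slot, a refined slot reference can be sketched in a few lines of Python. The encoding below (string identifiers, a parent_of map for the hierarchy, and treating a class as a subclass of itself) is an assumption made for illustration.

```python
def is_subclass(c, d, parent_of):
    """True if class c equals d or lies below d in the hierarchy."""
    while c is not None:
        if c == d:
            return True
        c = parent_of.get(c)
    return False

def refined_slot(related, class_of, d, parent_of):
    """Keep only the objects reached through the slot that belong to subclass d."""
    return [y for y in related if is_subclass(class_of[y], d, parent_of)]

# Hypothetical votes reached from one person through the slot Person.Votes.
parent_of = {"Comedy-Vote": "Vote", "Action-Vote": "Vote", "Documentary-Vote": "Vote"}
class_of = {"v1": "Comedy-Vote", "v2": "Action-Vote",
            "v3": "Documentary-Vote", "v4": "Action-Vote"}
votes = ["v1", "v2", "v3", "v4"]

action_votes = refined_slot(votes, class_of, "Action-Vote", parent_of)
all_votes = refined_slot(votes, class_of, "Vote", parent_of)
```

Refining to Action-Vote isolates v2 and v4, while refining to the root class Vote leaves the slot unchanged.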
Returning to our earlier example, suppose that we have subclasses of Movie:
Comedy, Action-Movie, and Documentary. In addition, suppose we also have
subclasses of Vote: Comedy-Vote, Action-Vote, and Documentary-Vote. To get from
a person to her votes, we use the inverse slot reference Person.Votes. Now we
can construct refinements of Person.Votes: Votes_Comedy-Vote, Votes_Action-Vote, and
Votes_Documentary-Vote.
Let us name these slots Comedy-Votes, Action-Votes, and Documentary-Votes.
To specify the dependency of a person's rankings for documentaries on her
rankings for action movies, we can say that Documentary-Vote.Rank has a parent which is
the person's action movie rankings: (Documentary-Vote.Person.Action-Votes.Rank).
5.6.3 Support for Instance-Level Dependencies

The introduction of subclasses brings the benefit that we can now provide a smooth
transition from the PRM, a class-based probabilistic model, to models that are
more similar to Bayesian networks. To see this, suppose our subclass hierarchy
for movies is very deep and starts with the general class and ends in the most
refined levels with particular movie instances. Thus, at the most refined version
of the model we can define the preferences of a person by either class-based
dependency (the probability of enjoying documentary movies depends on whether
the individual enjoys action movies) or instance-based dependency (the probability
of enjoying Terminator II depends on whether the individual enjoys The Hunt for
Red October). The latter model is essentially the same as the Bayesian network
models learned by Breese et al. [2] in the context of collaborative filtering for TV
programs.
In addition, the new flexibility in defining refined slot references allows us to
make interesting combinations of these types of dependencies. For example, whether
an individual enjoys a particular movie (e.g., True Lies) can be enough to predict
whether she watches a whole other category of movies (e.g., James Bond movies).

5.6.4 Semantics

Using this definition, the semantics for PRM-CH are given by the following equation:

\[
P(I \mid \sigma_r, \Pi) = \prod_{X} \prod_{x \in \sigma_r(X)} P(x.\mathit{Class}) \prod_{A \in \mathcal{A}(X)} P(x.A \mid \mathrm{Pa}^{x.c}(x.A)).
\tag{5.4}
\]

As before, the probability of an instantiation of the database is the product of
CPDs of the instance attributes; the key difference is that here, in addition to the
skeleton determining the parents of an attribute, the subclass to which the object
belongs determines which local probability model is used.
5.6.5 Coherence of Probabilistic Model

As in the case of PRMs with attribute uncertainty, we must be careful to guarantee
that our probability distribution is in fact coherent. In this case, while the relational
skeleton specifies which objects are related to which, it does not specify the subclass
indicator for each object, so the mapping of formal to actual parents depends on
the probabilistic choice for the subclass for the object. In addition, for refined slot
references, the existence of the edge will depend on the subclass of the object.
We will indicate edge existence by the coloring of an edge: a black edge exists in
the graph, a gray edge may exist in the graph, and a white edge is invisible in
the graph. As in previous sections, we define our coherence constraints using an
instance dependency graph, relative to our PRM and relational skeleton.
Definition 5.25
The colored instance dependency graph for a PRM-CH Π_CH and a relational
skeleton σ_r is a graph G_σr. The graph has the following nodes, for each class X
and for each x ∈ σ_r(X):
• a descriptive attribute node x.A, for every descriptive attribute X.A ∈ A(X);
• a subclass indicator node x.Class.
Let Pa*(X.A) = ∪_{c ∈ C[X]} Pa^c(X.A). The dependency graph contains two types
of edges. For each attribute X.A (both descriptive attributes and the subclass
indicator), we add the following edges:
• Type I edges: For every x ∈ σ_r(X) and for each formal parent X.B ∈ Pa*(X.A),
we define an edge x.B → x.A. This edge is black if the parents have not been
specialized (which will be the case for the subclass indicator, x.Class, and possibly
other attributes as well). All the other edges are colored gray.
• Type II edges: For every x ∈ σ_r(X) and for each formal parent X.K.B ∈
Pa*(X.A), if y ∈ x.K in σ_r, we define an edge y.B → x.A. If the CPD has been
specialized, or if K contains any refined slot references, this edge is colored gray;
otherwise it is colored black.

As before, type I edges correspond to intra-object dependencies and type II edges
to inter-object dependencies. But since an object may be from any subclass, even
though the relational skeleton specifies the objects to which it is related, until we
know the subclass of an object, we do not know which of the local probability
models applies. In addition, in the case where a parent of an object is defined via a
refined slot reference, we also do not know the set of related objects until we know
their subclasses. Thus, we add edges for every possible parent and color the edges
used in defining parents gray. Type I and type II edges are gray when they are
parents in a specialized CPD. In addition, type II edges may be gray if a refined
slot reference is used in the definition of a parent.
At this point, the problem with our instance dependency graph is that there are
some edges which are known to occur (the black edges) and some edges that may
or may not exist (depending on the subclass of an object). How do we ensure our
instance dependency graph is acyclic? In this case, we must ensure that the instance
dependency graph is acyclic for any setting of the subclass indicators. Note that
this is a probabilistic event. First, we extend our notion of acyclicity for our colored
instance dependency graph.
Definition 5.26
A colored instance dependency graph is acyclic if, for any instantiation of the
subclass indicators, there is an acyclic ordering of the nodes relative to the black
edges in the graph. Given a particular assignment of subclass indicators, we
determine the black edges as follows:
• Given a subclass assignment y.Class, all of the edges involving this object are
colored either black or white. Let y.Class = d. The edges for any parent nodes
are colored black if they are defined by the CPD Pa^d(X.A), and white otherwise.
In addition, the edges corresponding to any refined slot references, ρ_d(x, y), are
set: If y.Class = d, the edge is colored black; otherwise it is colored white.
Based on this definition, we can specify conditions under which (5.4) specifies a
coherent probability distribution.
Theorem 5.27
Let $\Pi_{CH}$ be a PRM with class hierarchies whose colored dependency structure S
is acyclic relative to a relational skeleton $\sigma_r$. Then $\Pi_{CH}$ and $\sigma_r$ define a coherent
probability distribution over instantiations I that extend $\sigma_r$ via (5.4).
As in the previous case of PRMs with attribute uncertainty and PRMs with
structural uncertainty, we want to learn a model in one setting, and be assured
that it will be acyclic for any skeleton we might encounter. Again we achieve this
goal through our definition of the class dependency graph. We do so by extending the
class dependency graph to contain edges that correspond to the edges we defined
in the instance dependency graph.


[Figure 5.13 image. Panel (a) shows the classes Person, Action-Movie, Documentary,
Action-Vote, and Documentary-Vote with the attributes Age, Budget, Popularity, and Rank.
Panel (b) shows the nodes Movie.Class, MovieAction.Budget, MovieAction.Popularity,
MovieDoc.Budget, MovieDoc.Popularity, Person.Age, VoteAction.Rank, VoteDoc.Rank, and
Vote.Class, connected by type I and type II edges.]

Figure 5.13 (a) A simple PRM with class hierarchies for the movie domain. (b)
The class dependency graph for this PRM.

Definition 5.28
The class dependency graph for a PRM with class hierarchy $\Pi_{CH}$ has the following
set of nodes for each $X \in \mathcal{X}$:
for each subclass $c \in C[X]$ and attribute $A \in \mathcal{A}(X)$, a node $X_c.A$;
a node for the subclass indicator X.Class;
and the following edges:
Type I edges: For any node $X_c.A$ and formal parent $X_c.B \in \mathrm{Pa}^c(X_c.A)$ we have
an edge $X_c.B \to X_c.A$.
Type II edges: For any attribute $X_c.A$ and formal parent $X_c.\rho.B \in \mathrm{Pa}^c(X_c.A)$,
where Range[$\rho$] = Y, we have an edge $Y.B \to X_c.A$.
Type III edges: For any attribute $X_c.A$, and for any direct superclass d, $c \prec d$,
we add an edge $X_c.A \to X_d.A$.
Figure 5.13 shows a simple class dependency graph for our movie example. The
PRM-CH is given in figure 5.13(a) and the class dependency graph is shown in
figure 5.13(b).
It is now easy to show that if this class dependency graph is acyclic, then the
instance dependency graph is acyclic.
Lemma 5.29
If the class dependency graph is acyclic for a PRM with class hierarchies $\Pi_{CH}$, then
for any relational skeleton $\sigma_r$, the colored instance dependency graph is acyclic.
And again we have the following corollary:
Corollary 5.30
Let $\Pi_{CH}$ be a PRM with class hierarchies whose class dependency structure S is
acyclic. For any relational skeleton $\sigma_r$, $\Pi_{CH}$ and $\sigma_r$ define a coherent probability
distribution over instantiations I that extend $\sigma_r$ via (5.4).

5.7

Inference in PRMs
An important aspect of any probabilistic representation is the support for making
inferences: having made some observations, how do we condition on these observations
and update our probabilistic model? Inference in PRMs supports many
interesting patterns of reasoning. Oftentimes we can view the inference as influence
flowing between the interrelated objects. Consider a simple example of inference
about a particular student in our school PRM. A priori we may believe a student
is likely to be smart. We may observe his grades in several courses and see that
for the most part he received A's, but in one class he received a C. This may cause
us to slightly reduce our belief that the student is smart, but it will not change
it significantly. However, if we find that most of the other students that took the
course received high grades, we then may believe that the course is an easy course.
Since it is unlikely that a smart student got a low grade in an easy course, our
probability for the student being smart now goes down substantially.
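The reasoning pattern in this example can be checked numerically on a toy model. The sketch below is illustrative only: the prior on intelligence, the prior on course difficulty, the grade CPD, and the number of peers are all assumed values, not numbers from the chapter. It computes the posterior that the student is smart by direct enumeration in three situations.

```python
# Assumed (hypothetical) parameters for a tiny school model.
P_SMART = 0.7   # prior that any student is smart
P_EASY = 0.5    # prior that any course is easy
# P(grade = A | smart, easy); the only other grade in this toy model is C.
P_A = {(True, True): 0.95, (True, False): 0.6,
       (False, True): 0.7, (False, False): 0.2}


def prior(p, flag):
    """Probability of a binary flag whose 'true' probability is p."""
    return p if flag else 1.0 - p


def p_grade(grade, smart, easy):
    p = P_A[(smart, easy)]
    return p if grade == "A" else 1.0 - p


def peers_all_a(easy, n_peers):
    """Probability that n_peers students of unknown intelligence all got an A."""
    one_peer = sum(prior(P_SMART, s) * p_grade("A", s, easy) for s in (True, False))
    return one_peer ** n_peers


def posterior_smart(n_a_grades, saw_c_grade, n_peers_with_a=0):
    """P(smart | own grades, peers' grades in the course where the C was received)."""
    weights = {}
    for smart in (True, False):
        w = prior(P_SMART, smart)
        # Each course with an A: sum out that course's unknown difficulty.
        a_course = sum(prior(P_EASY, e) * p_grade("A", smart, e) for e in (True, False))
        w *= a_course ** n_a_grades
        if saw_c_grade:
            # The course with the C: its difficulty is shared with the peers.
            w *= sum(prior(P_EASY, e) * p_grade("C", smart, e) * peers_all_a(e, n_peers_with_a)
                     for e in (True, False))
        weights[smart] = w
    return weights[True] / (weights[True] + weights[False])


p_only_as = posterior_smart(3, saw_c_grade=False)
p_with_c = posterior_smart(3, saw_c_grade=True)
p_with_peers = posterior_smart(3, saw_c_grade=True, n_peers_with_a=10)
print(p_only_as, p_with_c, p_with_peers)
```

Under these assumed numbers the single C lowers the belief moderately, and the peers' high grades (which make the course look easy) lower it much further, matching the qualitative argument in the text.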
There are several potential approaches for performing inference effectively in
PRMs. In a few cases, particularly when the skeleton is small, or it results in a
network with low tree width, we can do exact inference in the ground Bayesian
network. In other cases, when there are certain types of regularities in the ground
Bayesian network, we can still perform exact inference by carefully exploiting and
reusing computations. And in cases where the ground Bayesian network is very
large and we cannot exploit regularities in its structure, we resort to approximate
inference.
5.7.1

Exact Inference

We can always resort to exact inference on the ground Bayesian Network, but
the ground Bayesian Network may be very large and thus this inference may
prove intractable. Under certain circumstances, inference algorithms can exploit
the model structure to make inference tractable. Previous work on inference in
structured probabilistic models [14, 19, 18] shows how effective inference can be
done for a number of different structured probabilistic models. The algorithms make
use of the structure imposed by the class hierarchy to decompose the distribution
and effectively reuse computation.
There are two ways in which aspects of the structure can be used to make
inference more efficient. The first structural aspect is the natural encapsulation
of objects that occurs in a well-designed class hierarchy. Ideally, the interactions
between objects will occur via a small number of object attributes, and the majority
of interactions between attributes will be encapsulated within the class. This can
provide a natural decomposition of the model suitable for inference. The complexity
of the inference will depend on the width of the connections between objects; if
the width is small, we are guaranteed an efficient procedure.


The second structural aspect that is used to make inference efficient is the fact
that similar objects occur many times in the model. Pfeffer et al. [19] describe
a recursive inference algorithm that caches the computations that are done for
fragments of the model; these computations then need only be performed once; we
can reuse them for another object occurring in the same context. We can think of
this object as a generic object, which occurs repeatedly in the model. Exploiting
these structural aspects of the model allows Pfeffer et al. [19] to achieve impressive
speedups; in a military battlespace domain the structured inference was orders of
magnitude faster than the standard Bayesian network exact inference algorithm.
5.7.2

Approximate Inference

Unfortunately the methods used in the inference algorithm above often are not
applicable for the PRMs we study. In the majority of cases, there are no generic
objects that can be exploited. Unlike standard Bayesian Network inference, we
cannot decompose this task into separate inference tasks over the objects in the
model, as they are typically all correlated. Thus, inference in the PRM requires
inference over the ground network defined by instantiating a PRM for a particular
skeleton.
In general, the ground network can be fairly complex, involving many objects that
are linked in various ways. (For example, in some of our experiments, the networks
involve hundreds of thousands of nodes.) Exact inference over these networks is
clearly impractical, so we must resort to approximate inference. We use belief
propagation (BP), a local message-passing algorithm, introduced by Pearl [17]. The
algorithm is guaranteed to converge to the correct marginal probabilities for each
node only for singly connected Bayesian networks. However, empirical results [16]
show that it often converges in general networks, and when it does the marginals
are a good approximation to the correct posterior.
We provide a brief outline of one variant of BP, and refer the reader to [20, 16, 15]
for more details. Consider a Bayesian network over some set of nodes (which in our
case would be the variables x.A). We first convert the graph into a family graph,
with a node $F_i$ for each variable $V_i$ in the Bayesian network, containing $V_i$ and its
parents. Two nodes are connected if they have some variable in common. The CPD
of $V_i$ is associated with $F_i$. Let $\phi_i$ represent the factor defined by the CPD; i.e., if $F_i$
contains the variables $V, Y_1, \ldots, Y_k$, then $\phi_i$ is a function from the domains of these
variables to [0, 1]. We also define $\psi_i$ to be a factor over $V_i$ that encompasses our
evidence about $V_i$: $\psi_i(V_i) \equiv 1$ if $V_i$ is not observed. If we observe $V_i = v$, we have
that $\psi_i(v) = 1$ and 0 elsewhere. Our posterior distribution is then $\alpha \prod_i \phi_i \times \prod_i \psi_i$,
where $\alpha$ is a normalizing constant.
The BP algorithm is now very simple. At each iteration, all the family nodes
simultaneously send messages to all others, as follows:
$$m_{ij}(F_i \cap F_j) \leftarrow \alpha \sum_{F_i - F_j} \phi_i \cdot \psi_i \cdot \prod_{k \in N(i) - \{j\}} m_{ki},$$
where $\alpha$ is a (different) normalizing constant and $N(i)$ is the set of families that
are neighbors of $F_i$ in the family graph. This process is repeated until the beliefs
converge. At any point in the algorithm, our marginal distribution about any family
$F_i$ is $b_i = \alpha \cdot \phi_i \cdot \psi_i \cdot \prod_{k \in N(i)} m_{ki}$. Each iteration is linear in the number of edges in the
Bayesian network. While the algorithm is not guaranteed to converge, it typically
converges after just a few iterations. After convergence, the $b_i$ give us approximate
marginal distributions over each of the families in the ground network.
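The message-passing scheme outlined above can be sketched for binary variables as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the dictionary-based message representation, and the three-variable chain A -> B -> C with its CPD values are all assumptions made for the example. On this chain the family graph is a tree, so the BP belief can be compared with exact enumeration.

```python
from itertools import product


def assignments(variables):
    """Yield every joint assignment of binary values to the given variables."""
    for values in product((0, 1), repeat=len(variables)):
        yield dict(zip(variables, values))


def normalize(table):
    total = sum(table.values())
    return {key: value / total for key, value in table.items()}


def family_bp(families, cpds, evidence, iterations=10):
    """families[i] = (V_i, parents...); cpds[i](assignment) = P(V_i | parents)."""
    def local(i, a):
        # Product of the CPD factor and the evidence factor for family i.
        child = families[i][0]
        if child in evidence and a[child] != evidence[child]:
            return 0.0
        return cpds[i](a)

    # One directed edge (i, j) for every ordered pair of families sharing a variable.
    shared = {}
    for i in families:
        for j in families:
            common = tuple(sorted(set(families[i]) & set(families[j])))
            if i != j and common:
                shared[(i, j)] = common

    messages = {edge: {tuple(a[v] for v in common): 1.0 for a in assignments(common)}
                for edge, common in shared.items()}

    def incoming(i, a, skip=None):
        # Product of the messages sent into family i, optionally excluding one sender.
        result = 1.0
        for (k, target), common in shared.items():
            if target == i and k != skip:
                result *= messages[(k, i)][tuple(a[v] for v in common)]
        return result

    for _ in range(iterations):
        updated = {}
        for (i, j), common in shared.items():
            out = {key: 0.0 for key in messages[(i, j)]}
            # Summing over all assignments of F_i marginalizes out F_i - F_j.
            for a in assignments(families[i]):
                out[tuple(a[v] for v in common)] += local(i, a) * incoming(i, a, skip=j)
            updated[(i, j)] = normalize(out)
        messages = updated  # all families send simultaneously

    beliefs = {}
    for i in families:
        beliefs[i] = normalize({tuple(a[v] for v in families[i]): local(i, a) * incoming(i, a)
                                for a in assignments(families[i])})
    return beliefs


def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


# Hypothetical chain A -> B -> C with assumed CPD values; we observe C = 1.
families = {"FA": ("A",), "FB": ("B", "A"), "FC": ("C", "B")}
cpds = {
    "FA": lambda a: bern(0.3, a["A"]),
    "FB": lambda a: bern(0.9 if a["A"] else 0.2, a["B"]),
    "FC": lambda a: bern(0.8 if a["B"] else 0.1, a["C"]),
}
beliefs = family_bp(families, cpds, evidence={"C": 1})
p_a1_bp = beliefs["FA"][(1,)]

# Exact posterior P(A = 1 | C = 1) by enumeration, for comparison.
joint = {0: 0.0, 1: 0.0}
for a in assignments(("A", "B", "C")):
    if a["C"] == 1:
        joint[a["A"]] += cpds["FA"](a) * cpds["FB"](a) * cpds["FC"](a)
p_a1_exact = joint[1] / (joint[0] + joint[1])
print(p_a1_bp, p_a1_exact)
```

In a family graph with loops the same code still runs, but the beliefs are then only approximations and convergence is not guaranteed, as the text notes.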

5.8

Learning
Next, we turn our attention to learning a PRM. In the learning problem, our input
contains a relational schema that describes the basic vocabulary in the domain:
the set of classes, the attributes associated with the different classes, and the
possible types of relations between objects in the different classes. For simplicity,
in the description that follows, we assume the training data consists of a fully
specified instance of that schema; if there are missing values, then an expectation
maximization (EM) algorithm is needed as well. We begin by describing learning
PRMs with attribute uncertainty, next describe the extensions to support learning
PRMs with structural uncertainty, and then describe support for learning PRMs
with class hierarchies.
We assume that the training instance is given in the form of a relational database.
Although our approach would also work with other representations (e.g., a set of
ground facts completed using the closed-world assumption), the efficient querying
ability of relational databases is particularly helpful in our framework, and makes
it possible to apply our algorithms to large data sets.
There are two components of the learning task: parameter estimation and structure learning. In the parameter estimation task, we assume that the qualitative
dependency structure of the PRM is known; i.e., the input consists of the schema
and training database (as above), as well as a qualitative dependency structure
S. The learning task is only to fill in the parameters that define the CPDs of the
attributes. In the structure learning task, the dependency structure is not provided
(although the user can, if available, provide prior knowledge about the structure,
e.g., in the form of constraints) and the goal is to extract an entire PRM, structure
as well as parameters, from the training database alone. We discuss each of these
problems in turn.
5.8.1

Parameter Estimation

We begin with learning the parameters for a PRM where the dependency structure
is known. In other words, we are given the structure S that determines the set of
parents for each attribute, and our task is to learn the parameters $\theta_S$ that define the
CPDs for this structure. Our learning is based on a particular training set, which we
will take to be a complete instance I. While this task is relatively straightforward, it
is of interest in and of itself. In addition, it is a crucial component in the
structure-learning algorithm described in the next section.
The key ingredient in parameter estimation is the likelihood function: the probability
of the data given the model. This function captures the response of the
probability distribution to changes in the parameters. For a PRM,
the likelihood of a parameter set $\theta_S$ is $L(\theta_S \mid I, \sigma, S) = P(I \mid \sigma, S, \theta_S)$. As usual,
we typically work with the log of this function:
$$l(\theta_S \mid I, \sigma, S) = \log P(I \mid \sigma, S, \theta_S) = \sum_{X_i} \sum_{A \in \mathcal{A}(X_i)} \left[ \sum_{x \in \sigma(X_i)} \log P(I_{x.A} \mid I_{\mathrm{Pa}(x.A)}) \right]. \tag{5.5}$$

The key insight is that this equation is very similar to the log-likelihood of data
given a Bayesian network [11]. In fact, it is the likelihood function of the Bayesian
network induced by the PRM given the skeleton. The main difference from standard
Bayesian network parameter learning is that parameters for different nodes in the
network are forced to be identical: the parameters are shared, or tied.
5.8.1.1

Maximum Likelihood Parameter Estimation

We can still use the well-understood theory of learning from Bayesian networks.
Consider the task of performing maximum likelihood parameter estimation. Here,
our goal is to find the parameter setting $\theta_S$ that maximizes the likelihood
$L(\theta_S \mid I, \sigma, S)$ for a given I, $\sigma$, and S. This estimation is simplified by the decomposition
of the log-likelihood function into a summation of terms corresponding to the various
attributes of the different classes:
$$l(\theta_S \mid I, \sigma, S) = \sum_{X_i} \sum_{A \in \mathcal{A}(X_i)} \left[ \sum_{x \in \sigma(X_i)} \log P(I_{x.A} \mid I_{\mathrm{Pa}(x.A)}) \right] = \sum_{X_i} \sum_{A \in \mathcal{A}(X_i)} \sum_{v \in \mathcal{V}(X_i.A)} \sum_{u \in \mathcal{V}(\mathrm{Pa}(X_i.A))} C_{X_i.A}[v, u] \cdot \log \theta_{v|u} \tag{5.6}$$
where $C_{X.A}[v, u]$ is the number of times we observe $I_{x.A} = v$ and $I_{\mathrm{Pa}(x.A)} = u$. Each
of the terms in the above sum can be maximized independently of the rest. Hence,
maximum likelihood estimation reduces to independent maximization problems,
one for each CPD.
For many parametric models, such as the exponential family, maximum likelihood
estimation can be done via sufficient statistics that summarize the data. In the case
of multinomial CPDs, these are just the counts we described above, $C_{X.A}[v, u]$, the
number of times we observe each of the different values v, u that the attribute X.A
and its parents can jointly take.
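For a multinomial CPD, maximizing each term of the decomposed log-likelihood gives the familiar normalized counts: the estimate for value v given parent values u is C[v, u] divided by the sum of C[v', u] over all v'. The sketch below illustrates this on a handful of made-up (grade, (intelligence, difficulty)) observations; the function name and the data are assumptions for the example.

```python
from collections import Counter, defaultdict


def mle_cpd(observations):
    """Maximum likelihood multinomial CPD from (v, u) pairs.

    v is the child value and u is a tuple of parent values. Returns a
    dictionary mapping (v, u) to the estimate C[v, u] / sum_v' C[v', u].
    """
    counts = Counter(observations)            # sufficient statistics C[v, u]
    totals = defaultdict(int)                 # sum over v of C[v, u]
    for (v, u), c in counts.items():
        totals[u] += c
    return {(v, u): c / totals[u] for (v, u), c in counts.items()}


# Hypothetical observations of Registration.Grade with its parent values.
observations = (
    [("A", ("high", "easy"))] * 3
    + [("B", ("high", "easy"))] * 1
    + [("C", ("low", "hard"))] * 2
    + [("B", ("low", "hard"))] * 2
)
theta = mle_cpd(observations)
print(theta[("A", ("high", "easy"))])
```

Because the parameters are tied, every registration object contributes to the same table of counts regardless of which student or course it involves.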
An important property of the database setting is that we can easily compute
sufficient statistics. To compute $C_{X.A}[v, v_1, \ldots, v_k]$, we simply query over the class
X and its parents' classes, and project onto the appropriate set of attributes. For
example, to learn the parameters for the grade CPD from our school example, we
can compute the sufficient statistics with the following SQL query:
SELECT grade, intelligence, difficulty, COUNT(*)
FROM registration r, student s, course c
WHERE r.student = s.student_id   -- join key names are assumed
  AND r.course = c.course_id
GROUP BY grade, intelligence, difficulty
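To make the counting query concrete, the sketch below runs an equivalent query against an in-memory SQLite database using Python's standard sqlite3 module. The table layout, key column names (student_id, course_id, and the foreign keys student and course), and the sample rows are assumptions made for illustration; the chapter does not specify the schema at this level of detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, intelligence TEXT);
CREATE TABLE course (course_id INTEGER PRIMARY KEY, difficulty TEXT);
CREATE TABLE registration (
    reg_id INTEGER PRIMARY KEY,
    student INTEGER REFERENCES student(student_id),
    course INTEGER REFERENCES course(course_id),
    grade TEXT
);
INSERT INTO student VALUES (1, 'high'), (2, 'low'), (3, 'high');
INSERT INTO course VALUES (10, 'easy'), (11, 'hard');
INSERT INTO registration VALUES
    (100, 1, 10, 'A'), (101, 1, 11, 'B'),
    (102, 2, 10, 'A'), (103, 2, 11, 'C'),
    (104, 3, 10, 'A');
""")

# Join each registration with its student and course, then count each
# joint value of (grade, intelligence, difficulty).
rows = conn.execute("""
SELECT r.grade, s.intelligence, c.difficulty, COUNT(*)
FROM registration r
JOIN student s ON r.student = s.student_id
JOIN course c ON r.course = c.course_id
GROUP BY r.grade, s.intelligence, c.difficulty
""").fetchall()

counts = {(grade, intel, diff): n for grade, intel, diff, n in rows}
for key, n in sorted(counts.items()):
    print(key, n)
```

The resulting dictionary plays the role of the sufficient statistics for the grade CPD: its size is bounded by the number of distinct value combinations, not by the number of registrations.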
In some cases, it is useful to materialize a view that can be used to compute
the sufficient statistics. This is beneficial when the relationship between the child
attribute and the parent attribute is many-one rather than one-one or one-many.
For example, consider the dependence of attributes of Student on attributes of
Registration. In our example PRM, a student's ranking depends on the student's
grades. In this case we would construct a view using the following SQL query:
CREATE VIEW v1 AS
SELECT s.*, AVG(r.grade) AS ave_grade,
       AVG(r.satisfaction) AS ave_satisfaction
FROM student s, registration r
WHERE s.student_id = r.student
GROUP BY s.student_id
To compute the statistics we would then project on the appropriate attributes
from view v1:
SELECT ranking, ave_grade, COUNT(*)
FROM v1
GROUP BY ranking, ave_grade
Thus both the creation of the view and the process of counting occurrences can
be computed using simple database queries, and can be executed efficiently. The
view creation for each combination of classes is done once during the full learning
algorithm (we will see exactly at which point this is done in the next section when
we describe the search). If the tables being joined are indexed on the appropriate
set of foreign keys, the construction of this view is efficient: the number of rows
in the resulting table is the size of the child attribute's table; in our example this
is |Student|. Computing the sufficient statistics can be done in one pass over the
resulting table. The size of the resulting table is simply the number of unique
combinations of attribute values. We are careful to cache sufficient statistics so they
are only computed once. In some cases, we can compute new sufficient statistics
from a previously cached set of sufficient statistics; we make use of this in our
algorithm as well.


5.8.1.2

Bayesian Parameter Estimation

In many cases, maximum likelihood parameter estimation is not robust: it overfits
the training data. The Bayesian approach uses a prior distribution over the parameters
to smooth the irregularities in the training data, and is therefore significantly
more robust. As we will see in Section 5.8.2, the Bayesian framework also gives us
a good metric for evaluating the quality of different candidate structures.
Roughly speaking, the Bayesian approach introduces a prior over the unknown
parameters, and performs Bayesian conditioning, using the data as evidence, to
compute a posterior distribution over these parameters. To apply this idea in our
setting, recall that the PRM parameters $\theta_S$ are composed of a set of individual
probability distributions $\theta_{X.A|u}$ for each conditional distribution of the form
P(X.A | Pa(X.A) = u). Following the work on Bayesian approaches for learning
Bayesian networks [11], we make two assumptions. First, we assume parameter
independence: the priors over the parameters $\theta_{X.A|u}$ for the different X.A and u are
independent. Second, we assume that the prior over $\theta_{X.A|u}$ is a Dirichlet distribution.
Briefly, a Dirichlet prior for a multinomial distribution of a variable V is
specified by a set of hyperparameters $\{\alpha[v] : v \in \mathcal{V}(V)\}$. A distribution on the
parameters of P(V) is Dirichlet if
$$P(\theta_V) \propto \prod_v \theta_v^{\alpha[v] - 1}.$$
(For more details see [7].) If X.A can take on k values, then the prior is
$$P(\theta_{X.A|u}) = \mathrm{Dir}(\theta_{X.A|u} \mid \alpha_1, \ldots, \alpha_k).$$
For a parameter prior satisfying these two assumptions, the posterior also has
this form. That is, it is a product of independent Dirichlet distributions over the
parameters $\theta_{X.A|u}$. In other words,
$$P(\theta_{X.A|u} \mid I, \sigma, S) = \mathrm{Dir}(\theta_{X.A|u} \mid \alpha_{X.A}[v_1, u] + C_{X.A}[v_1, u], \ldots, \alpha_{X.A}[v_k, u] + C_{X.A}[v_k, u]).$$
Now that we have the posterior, we can compute the probability of new data. In
the case where the new instance is conditionally independent of the old instances
given the parameter values (which is always the case in Bayesian network models,
but may not be true here), then the probability of the new data case can be
conveniently rewritten using the expected parameters:
Proposition 5.31
Assuming multinomial CPDs, prior independence, and Dirichlet priors, with
hyperparameters $\alpha_{X.A}[v, u]$, we have that
$$E_\theta[P(X.A = v \mid \mathrm{Pa}(X.A) = u) \mid I] = \frac{C_{X.A}[v, u] + \alpha_{X.A}[v, u]}{\sum_{i=1}^{k} \left( C_{X.A}[v_i, u] + \alpha_{X.A}[v_i, u] \right)}.$$
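The expected-parameter formula of proposition 5.31 amounts to adding the hyperparameters to the counts as pseudo-counts before normalizing. The sketch below computes it for a single parent assignment u; the function name, the grade values, the counts, and the uniform hyperparameters are assumptions chosen for illustration.

```python
def expected_parameters(counts, alpha, values):
    """Posterior expected CPD entries for one fixed parent assignment u.

    counts[v] is C[v, u] (missing values count as zero) and alpha[v] is the
    Dirichlet hyperparameter for value v.
    """
    total = sum(counts.get(v, 0) + alpha[v] for v in values)
    return {v: (counts.get(v, 0) + alpha[v]) / total for v in values}


values = ["A", "B", "C"]
counts = {"A": 3, "B": 1}              # the value C was never observed for this u
alpha = {v: 1.0 for v in values}       # uniform Dirichlet prior (assumed)
estimate = expected_parameters(counts, alpha, values)
print(estimate)
```

Unlike the maximum likelihood estimate, the unobserved value keeps a nonzero probability, which is the smoothing effect described above.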


This suggests that the Bayesian estimate for $\theta_S$ should be estimated using this
formula as well. Unfortunately, the expected parameter is not the proper Bayesian
solution for computing the probability of new data in the case where the new data
instance is not independent of previous data given the parameters. Suppose that
we want to use the posterior to evaluate the probability of an instance $I'$ of
another skeleton $\sigma'$. If there are two instances $x$ and $x'$ of the class X such that
$v_{I'}(\mathrm{Pa}(x.A)) = v_{I'}(\mathrm{Pa}(x'.A))$, then we will be relying on the same multinomial
parameter vector twice. Using the chain rule, we see that the second probability
depends on the posterior of the parameters after seeing the training data and
the first instance. In other words, the probability of a relational database given a
distribution over parameter values is not identical to the probability of the data set
when we have a point estimate of the parameters (i.e., when we act as though we
know their values). However, if the posterior is sharply peaked (i.e., we have a strong
prior, or we have seen many training instances), we can approximate the solution
using the expected parameters of proposition 5.31. We use this approximation in
our computation of the estimates for the parameters.
5.8.1.3

Structure Learning

We now move to the more challenging problem of learning a dependency structure
automatically, as opposed to having it given by the user. There are three important
issues that need to be addressed. We must determine which dependency structures
are legal; we need to evaluate the goodness of different candidate structures; and
we need to define an effective search procedure that finds a good structure.
5.8.1.4

Legal Structures

We saw in section 5.2.5.2 that we could construct a class dependency graph for
a PRM, and the PRM defined a coherent probability distribution if the class
dependency graph was stratified. During learning it is straightforward to maintain
this structure, and consider only models whose dependency structure passes this
test.

Maintaining a stratified class dependency graph. Given a stratified class
dependency graph G(V, E), we can check whether local changes to the structure
destroy the stratification. The operations we are concerned with are ones which add
an edge (u, v) into the structure (clearly deleting an edge cannot introduce a cycle).
We can check whether a new edge will introduce a cycle in time O(|V| + |E|).
Let G(V, E) be our stratified class dependency graph and let $G'(V, E \cup \{(u, v)\})$
be the class dependency graph with edge (u, v) added. Clearly if there is a cycle in
$G'$, it must contain (u, v).
We can check whether the new edge introduces a cycle by checking to see if, using
this edge, there is a path u, v, . . . , u. This reduces to checking to see if there is a
path in the graph from v to u. We can do a simple depth-first search to explore
the graph to check for a path in O(|V| + |E|).
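The check described above can be sketched as an iterative depth-first search from v that looks for u. The adjacency-dictionary representation and the attribute names in the example graph are assumptions made for illustration.

```python
def creates_cycle(adjacency, u, v):
    """Return True if adding the edge u -> v to an acyclic graph creates a cycle.

    adjacency maps each node to an iterable of its successors. A cycle appears
    exactly when u is already reachable from v, so we search from v for u.
    Each node is expanded at most once, giving O(|V| + |E|) time.
    """
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency.get(node, ()))
    return False


# Hypothetical class dependency graph (edges point from parent to child).
graph = {
    "Student.Intelligence": ["Registration.Grade"],
    "Course.Difficulty": ["Registration.Grade"],
    "Registration.Grade": ["Student.Ranking"],
}
print(creates_cycle(graph, "Student.Ranking", "Student.Intelligence"))
print(creates_cycle(graph, "Course.Difficulty", "Student.Ranking"))
```

A search procedure would call this test before scoring any candidate add-edge operation and discard the operation when it returns True.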
5.8.2

Evaluating Different Structures

Now that we know which structures are legal, we need to decide how to evaluate
different structures in order to pick one that fits the data well. We adapt Bayesian
model selection methods to our framework. We would like to find the MAP
(maximum a posteriori) structure. Formally, we want to compute the posterior
probability of a structure S given an instantiation I. Using Bayes' rule we have that
$$P(S \mid I, \sigma) \propto P(I \mid S, \sigma) P(S \mid \sigma).$$
This score is composed of two parts: the prior probability of the structure, and the
probability of the data assuming that structure.
The first component is $P(S \mid \sigma)$, which defines a prior over structures. We
assume that the choice of structure is independent of the skeleton, and thus
$P(S \mid \sigma) = P(S)$. In the context of Bayesian networks, we often use a simple
uniform prior over possible dependency structures. Unfortunately, this assumption
does not work in our setting. The problem is that there may be infinitely many
possible structures (see footnote 2). In our genetics example, a person's genotype can depend on
the genotype of his parents, or of his grandparents, or of his great-grandparents,
etc. A simple and natural solution penalizes long indirect slot chains, by having
log P(S) proportional to the negative of the sum of the lengths of the chains K appearing in S.
The second component is the marginal likelihood:
$$P(I \mid S, \sigma) = \int P(I \mid S, \theta_S, \sigma) P(\theta_S \mid S) \, d\theta_S.$$
If we use a parameter-independent Dirichlet prior (as above), this integral decomposes
into a product of integrals, each of which has a simple closed-form solution.
This is a simple generalization of the ideas used in the Bayesian score for Bayesian
networks [12].
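Proposition 5.32
If I is a complete assignment, and $P(\theta_S \mid S)$ satisfies parameter independence and
is a Dirichlet with hyperparameters $\alpha_{X.A}[v, u]$, then the marginal likelihood of I
given S is
$$P(I \mid S, \sigma) = \prod_i \prod_{A \in \mathcal{A}(X_i)} \left[ \prod_{u \in \mathcal{V}(\mathrm{Pa}(X_i.A))} \mathrm{DM}(\{C_{X_i.A}[v, u]\}, \{\alpha_{X_i.A}[v, u]\}) \right], \tag{5.7}$$
where
$$\mathrm{DM}(\{C[v]\}, \{\alpha[v]\}) = \frac{\Gamma\left(\sum_v \alpha[v]\right)}{\Gamma\left(\sum_v (\alpha[v] + C[v])\right)} \prod_v \frac{\Gamma(\alpha[v] + C[v])}{\Gamma(\alpha[v])},$$
and $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \, dt$ is the Gamma function.

Footnote 2: Although there are only a finite number that are reasonable to consider for a given
skeleton.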
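In practice the DM terms of (5.7) are evaluated in log space, since the Gamma function overflows quickly for realistic counts. The sketch below uses math.lgamma from the Python standard library; the function names and the list-based representation of counts and hyperparameters are assumptions made for illustration.

```python
from math import exp, lgamma


def log_dm(counts, alphas):
    """log DM({C[v]}, {alpha[v]}) for one attribute and one parent assignment u.

    counts and alphas are parallel sequences indexed by the child value v.
    """
    alpha_sum = sum(alphas)
    count_sum = sum(counts)
    result = lgamma(alpha_sum) - lgamma(alpha_sum + count_sum)
    for c, a in zip(counts, alphas):
        result += lgamma(a + c) - lgamma(a)
    return result


def log_marginal_likelihood(tables):
    """Sum of log DM terms over every (attribute, parent assignment) pair.

    tables is an iterable of (counts, alphas) pairs, one per factor in (5.7).
    """
    return sum(log_dm(counts, alphas) for counts, alphas in tables)


# With a uniform Dirichlet(1, 1) prior on a binary attribute, one observation
# has marginal probability 1/2, and one observation of each value has 1/6.
one_obs = exp(log_dm([1, 0], [1.0, 1.0]))
two_obs = exp(log_dm([1, 1], [1.0, 1.0]))
total = log_marginal_likelihood([([1, 0], [1.0, 1.0]), ([1, 1], [1.0, 1.0])])
print(one_obs, two_obs, total)
```

Because the log score is a sum of independent terms, changing the parents of one attribute requires recomputing only that attribute's terms.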

Hence, the marginal likelihood is a product of simple terms, each of which
corresponds to a distribution P(X.A | u) where $u \in \mathcal{V}(\mathrm{Pa}(X.A))$. Moreover, the
term for P(X.A | u) depends only on the hyperparameters $\alpha_{X.A}[v, u]$ and the
sufficient statistics $C_{X.A}[v, u]$ for $v \in \mathcal{V}(X.A)$.
The marginal likelihood term is the dominant term in the probability of a
structure. It balances the complexity of the structure with its fit to the data. This
balance can be seen explicitly in the asymptotic relation of the marginal likelihood
to explicit penalization, such as the minimum description length (MDL) score (see,
e.g., [11]).
Finally, we note that the Bayesian score requires that we assign a prior over
parameter values for each possible structure. Since there are many (perhaps infinitely
many) alternative structures, this is a formidable task. In the case of Bayesian
networks, there is a class of priors that can be described by a single network [12].
These priors have the additional property of being structure equivalent, that is, they
guarantee that the marginal likelihood is the same for structures that are, in some
strong sense, equivalent. These notions have not yet been defined for our richer
structures, so we defer the issue to future work. Instead, we simply assume that
some simple Dirichlet prior (e.g., a uniform one) has been defined for each attribute
and parent set.
5.8.3

Structure Search

Now that we have both a test for determining whether a structure is legal, and
a scoring function that allows us to evaluate different structures, we need only
provide a procedure for finding legal high-scoring structures. For Bayesian networks,
we know that this task is NP-hard [3]. As PRM learning is at least as hard as
Bayesian network learning (a Bayesian network is simply a PRM with one class
and no relations), we cannot hope to find an efficient procedure that always finds
the highest-scoring structure. Thus, we must resort to heuristic search.
As is standard in Bayesian network learning [11], we use a greedy local search
procedure that maintains a current candidate structure and iteratively modifies
it to increase the score. At each iteration, we consider a set of simple local
transformations to the current structure, score all of them, and pick the one with
the highest score. In the case where we are learning multinomial CPDs, the three
operators we use are: add edge, delete edge, and reverse edge. In the case where we
are learning tree CPDs, following [4], our operators consider only transformations to
the CPD trees. The tree structure induces the dependency structure, as the parents
of X.A are simply those attributes that appear in its CPD tree. In this case, the
two operators we use are: split, which replaces a leaf in a CPD tree by an internal node
with two leaves; and trim, which replaces the subtree at an internal node by a single leaf.
The simplest heuristic search algorithm is a greedy hill-climbing search, using our
score as a metric. We maintain our current candidate structure and iteratively improve it. At each iteration, we consider the appropriate set of local transformations
to that structure, score all of them, and pick the one with highest score.
We refer to this simple algorithm as the greedy algorithm. There are several
common variants to improve the robustness of hill-climbing methods. One is to
make use of random restarts to deal with local maxima. In this algorithm, when we
reach a local maximum, we take some fixed number of random steps, and then we
restart our search process. Another common approach is to make use of a tabu list,
which keeps track of the most recent states visited, and allows only steps which do
not return to a recently visited state. A more sophisticated approach is to make use
of a simulated annealing style of algorithm which uses the following procedure: in
the early phases of the search we are likely to take random steps (rather than the
best step), but as the search proceeds (i.e., the temperature cools) we are less likely
to take random steps and more likely to take the best greedy step. The algorithms we
have used are either the simple greedy algorithm or a simple randomized algorithm.
Regardless of the specific heuristic search algorithm used, an important component
of the search is the scoring of candidate structures. As in Bayesian networks,
the decomposability property of the score has significant impact on the computational
efficiency of the search algorithm. First, we decompose the score into a sum
of local scores corresponding to individual attributes and their parents. (This local
score of an individual attribute is exactly the logarithm of the term in square
brackets in (5.7).) Now, if our search algorithm considers a modification to our
current structure where the parent set of a single attribute X.A is different, only
the component of the score associated with X.A will change. Thus, we need only
reevaluate this particular component, leaving the others unchanged; this results in
major computational savings.
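The interaction between greedy search and a decomposable score can be sketched as follows. This is a simplified illustration, not the chapter's algorithm: it uses only the add-edge and delete-edge operators (reverse edge is omitted), it searches over plain attribute names rather than slot chains, and the local score is a hypothetical stand-in for the logarithm of the bracketed term in (5.7). Local scores are cached, so each candidate move costs at most one new local evaluation.

```python
def is_ancestor(parents, candidate, node):
    """True if candidate equals node or is reachable from node by following parents."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        if current == candidate:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def greedy_search(variables, local_score, max_iters=100):
    """Greedy hill-climbing over parent sets with a decomposable, cached score."""
    parents = {x: frozenset() for x in variables}
    cache = {}

    def score(x, parent_set):
        if (x, parent_set) not in cache:
            cache[(x, parent_set)] = local_score(x, parent_set)
        return cache[(x, parent_set)]

    for _ in range(max_iters):
        best_gain, best_move = 0.0, None
        for child in variables:
            current = score(child, parents[child])
            for other in variables:
                if other == child:
                    continue
                if other in parents[child]:
                    candidate = parents[child] - {other}      # delete edge
                elif not is_ancestor(parents, child, other):
                    candidate = parents[child] | {other}      # add edge, stays acyclic
                else:
                    continue                                  # would create a cycle
                # Only the local score of `child` changes under this move.
                gain = score(child, candidate) - current
                if gain > best_gain:
                    best_gain, best_move = gain, (child, candidate)
        if best_move is None:
            break                                             # local maximum reached
        parents[best_move[0]] = best_move[1]
    return parents


# Hypothetical local score: rewards the "true" parents and penalizes extra ones.
TRUE_PARENTS = {"Grade": {"Intelligence", "Difficulty"}}


def toy_local_score(x, parent_set):
    wanted = TRUE_PARENTS.get(x, set())
    return 2.0 * len(parent_set & wanted) - 1.0 * len(parent_set - wanted)


learned = greedy_search(["Intelligence", "Difficulty", "Grade"], toy_local_score)
print(learned)
```

With a real Bayesian score, `local_score` would compute the sufficient statistics for the attribute and its candidate parents and evaluate the corresponding DM terms.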
However, there are still a very large number of possible structures to consider.
We propose a heuristic search algorithm that addresses this issue. At a high level,
the algorithm proceeds in phases. At each phase k, we have a set of potential
parents $\mathrm{Pot}_k(X.A)$ for each attribute X.A. We then do a standard structure search
restricted to the space of structures in which the parents of each X.A are in
$\mathrm{Pot}_k(X.A)$. The advantage of this approach is that we can precompute the view
corresponding to X.A, $\mathrm{Pot}_k(X.A)$; most of the expensive computations (the joins
and the aggregation required in the definition of the parents) are precomputed in
these views. The sufficient statistics for any subset of potential parents can easily be
derived from this view. The above construction, together with the decomposability
of the score, allows the steps of the search (say, greedy hill-climbing) to be done
very efficiently.


The success of this approach depends on the choice of the potential parents.
Clearly, a bad initial choice can result in poor structures. Following [8], which
examines a similar approach in the context of learning Bayesian networks, we
propose an iterative approach that starts with some structure (possibly one where
each attribute does not have any parents), and selects the sets $\mathrm{Pot}_k(X.A)$ based
on this structure. We then apply the search procedure and get a new, higher-scoring
structure. We choose new potential parents based on this new structure
and reiterate, stopping when no further improvement is made.
It remains only to discuss the choice of $\mathrm{Pot}_k(X.A)$ at the different phases. Perhaps
the simplest approach is to begin by setting $\mathrm{Pot}_1(X.A)$ to be the set of attributes
in X. In successive phases, $\mathrm{Pot}_{k+1}(X.A)$ would consist of all of $\mathrm{Pa}_k(X.A)$, as well
as all attributes that are related to X via slot chains of length < k. Of course, these
new attributes may require aggregation; we may either specify the appropriate
aggregator or search over the space of possible aggregators.
This scheme expands the set of potential parents at each iteration. In some cases,
however, it may result in a large set of potential parents. In such cases we may want to
use a more refined algorithm that only adds parents to $\mathrm{Pot}_{k+1}(X.A)$ if they seem to
add value beyond $\mathrm{Pa}_k(X.A)$. There are several reasonable ways of evaluating the
additional value provided by new parents. Some of these are discussed by Friedman
et al. [8] in the context of learning Bayesian networks. These results suggest that
we should evaluate a new potential parent by measuring the change of score for
the family of X.A if we add (X.K.B) to its current parents. We can then choose
the highest scoring of these, as well as the current parents, to be the new set of
potential parents. This approach would allow us to significantly reduce the size of
the potential parent set, and thereby of the resulting view, while typically causing
insignificant degradation in the quality of the learned model.
5.8.4

Learning PRMs with Structural Uncertainty

Next, we describe how to extend the basic PRM learning algorithm to deal with
structural uncertainty. For PRMs with reference uncertainty, we also
attempt to learn the rules that govern the link models. For PRMs with existence
uncertainty, we learn the probability of existence of relationship objects.
5.8.4.1

Learning with Reference Uncertainty

The extension to scoring required to deal with reference uncertainty is not a difficult
one. Once we fix the partitions defined by the attributes P[ρ], a CPD for S_ρ
compactly defines a distribution over values of ρ. Thus, scoring the success in
predicting the value of ρ can be done efficiently using standard Bayesian methods
used for attribute uncertainty (e.g., using a standard Dirichlet prior over values of
ρ).
The extension to search the model space for incorporating reference uncertainty
involves expanding our search operators to allow the addition (and deletion) of
attributes to the partition definition for each reference slot. Initially, the partition of
the range class for a slot X.ρ is not given in the model. Therefore, we must also
search for the appropriate set of attributes P[ρ]. We introduce two new operators,
refine and abstract, which modify the partition by adding and deleting attributes
from P[ρ]. Initially, P[ρ] is empty for each ρ. The refine operator adds an attribute
into P[ρ]; the abstract operator deletes one. As mentioned earlier, we can define
the partition simply by looking at the cross product of the values for each of
the partition attributes, or using a decision tree. In the case of a decision tree,
refine adds a split to one of the leaves and abstract removes a split. These newly
introduced operators are treated by the search algorithm in exactly the same way
as the standard edge-manipulation operators: the change in the score is evaluated
for each possible operator, and the algorithm selects the best one to execute.
We note that, as usual, the decomposition of the score can be exploited to
substantially speed up the search. In general, the score change resulting from an
operator ω is reevaluated only after applying an operator ω′ that modifies the parent
or partition set of an attribute that ω modifies. This is also true when we consider
operators that modify the parent of selector attributes.
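To make the cross-product form of the partition concrete, the following sketch represents the partition attribute set for a slot as a tuple of attribute names and counts how many range objects fall into each cell. The function names, the dictionary representation of objects, and the movie data are assumptions made for illustration; the decision-tree variant is not shown.

```python
from collections import Counter


def partition_counts(objects, partition_attrs):
    """Count objects per partition cell.

    A cell is identified by the tuple of values an object takes on the
    partition attributes (the cross-product construction).
    """
    return Counter(tuple(obj[a] for a in partition_attrs) for obj in objects)


def refine(partition_attrs, attr):
    """Add one attribute to the partition attribute set."""
    if attr in partition_attrs:
        raise ValueError(f"{attr} is already a partition attribute")
    return partition_attrs + (attr,)


def abstract(partition_attrs, attr):
    """Delete one attribute from the partition attribute set."""
    if attr not in partition_attrs:
        raise ValueError(f"{attr} is not a partition attribute")
    return tuple(a for a in partition_attrs if a != attr)


movies = [
    {"genre": "action", "decade": 1990},
    {"genre": "drama", "decade": 1990},
    {"genre": "action", "decade": 2000},
]

empty = ()                          # initially the attribute set is empty
by_genre = refine(empty, "genre")   # one cell per genre value
by_both = refine(by_genre, "decade")
back = abstract(by_both, "genre")   # only the decade remains

initial_cells = partition_counts(movies, empty)
genre_cells = partition_counts(movies, by_genre)
```

With an empty attribute set every object lands in the single cell `()`, which matches the starting point of the search; each refine step can only split existing cells.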
5.8.4.2

Learning with Existence Uncertainty

The extension of the Bayesian score to PRMs with existence uncertainty is straightforward; the exists attribute is simply a new descriptive attribute. The only new
issue is how to compute sufficient statistics that include existence attributes x.E
without explicitly enumerating all the nonexistent entities. We perform this computation by counting, for each possible instantiation of Pa(X.E), the number of
potential objects with that instantiation, and subtracting the actual number of
objects x with that parent instantiation.
Let u be a particular instantiation of Pa(X.E). To compute C_{X.E}[true, u], we
can use a standard database query to compute how many objects x of class X have
Pa(x.E) = u. To compute C_{X.E}[false, u], we need to compute the number of
potential entities. We can do this without explicitly considering each (x_1, ..., x_k) ∈
I(Y_1) × ... × I(Y_k) by decomposing the computation as follows: Let ρ be a reference
slot of X with Range[ρ] = Y. Let Pa_ρ(X.E) be the subset of parents of X.E along
slot ρ and let u_ρ be the corresponding instantiation. We count the number of y
consistent with u_ρ. If Pa_ρ(X.E) is empty, this count is simply |I(Y)|. The product
of these counts is the number of potential entities. To compute C_{X.E}[false, u], we
simply subtract C_{X.E}[true, u] from this number.
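The counting argument above can be written out directly. The sketch below is a minimal in-memory version, assuming objects are plain dictionaries and that the parents of X.E are attributes of the objects reached through each reference slot; the function name, the argument layout, and the citation data are hypothetical, and a real implementation would issue database queries instead.

```python
from math import prod


def existence_counts(slot_ranges, parent_values, existing_links):
    """Return (C[true, u], C[false, u]) for the exists attribute.

    slot_ranges: dict mapping each reference slot to the list of objects in
        its range class.
    parent_values: dict mapping a slot to the required attribute values along
        that slot (u restricted to the slot); a missing slot means X.E has no
        parents along it.
    existing_links: list of dicts mapping each slot to the object it refers
        to, one dict per link that actually exists.
    """
    def consistent(obj, required):
        return all(obj[attr] == value for attr, value in required.items())

    # Number of potential entities: product over slots of the number of
    # range objects consistent with u along that slot.
    potential = prod(
        sum(consistent(y, parent_values.get(slot, {})) for y in objs)
        for slot, objs in slot_ranges.items()
    )
    # Number of existing entities whose parent instantiation equals u.
    n_true = sum(
        all(consistent(link[slot], parent_values.get(slot, {}))
            for slot in slot_ranges)
        for link in existing_links
    )
    return n_true, potential - n_true


papers = {
    "p1": {"topic": "AI"},
    "p2": {"topic": "AI"},
    "p3": {"topic": "DB"},
}
slot_ranges = {"citing": list(papers.values()), "cited": list(papers.values())}
existing = [
    {"citing": papers["p1"], "cited": papers["p2"]},
    {"citing": papers["p1"], "cited": papers["p3"]},
    {"citing": papers["p2"], "cited": papers["p1"]},
]

# u: both the citing and the cited paper are about AI.
u = {"citing": {"topic": "AI"}, "cited": {"topic": "AI"}}
counts_both = existence_counts(slot_ranges, u, existing)

# u: only the citing paper's topic is a parent, so the cited slot
# contributes |I(Paper)| = 3 to the product.
counts_one = existence_counts(slot_ranges, {"citing": {"topic": "AI"}}, existing)
```

Note that the product counts every tuple in the cross product, including a paper paired with itself, exactly as in the text; no nonexistent link is ever materialized.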
No extensions to the search algorithm are required to handle existence uncertainty. We simply introduce the new attributes X.E, and integrate them into the
search space. Our search algorithm now considers operators that add, delete, or
reverse edges involving the exists attributes. As usual, we enforce coherence using
the class dependency graph. In addition to having an edge from Y.E to X.E for
every slot ρ ∈ R(X) whose range type is Y, when we add an edge from Y.B to
X.A, we add an edge from Y.E to X.E and an edge from Y.E to X.A.

5.8.5

Learning PRM-CHs

We now turn to learning PRMs with class hierarchies. We examine two scenarios:
in one case the class hierarchies are given as part of the input and in the other, in
addition to learning the PRM, we also must learn the class hierarchy. The learning
algorithms use the same criteria for scoring the models; however, the search space
is significantly different.
5.8.6

Class Hierarchies Provided in the Input

We begin with the simpler learning with class hierarchies scenario, where we assume
that the class hierarchy is given as part of the input. As in section 5.8, we restrict
attention to fully observable data sets. Hence, we assume that, in our training set,
the class of each object is given. Without this assumption, the subclass indicator
attribute would play the role of a hidden variable, greatly complicating the learning
algorithm.
As discussed above, we need a scoring function that allows us to evaluate different
candidate structures, and a search procedure that searches over the space of possible
structures. The scoring function remains largely unchanged. For each object x in
each class X, we have the basic subclass c to which it belongs. For each attribute A
of this object, the probabilistic model then specifies the subclass d of X from which
c inherits the CPD of X.A. Then x.A contributes only to the sufficient statistics for
the CPD of X_d.A. With that recomputation of the sufficient statistics, the Bayesian
score can now be computed unchanged.
Next we extend our search algorithm to make use of the subclass hierarchy. First,
we extend our phased search to allow the introduction of new subclasses. Then, we
introduce a new set of operators. The new operators allow us to refine and abstract
the CPDs of attributes in our model, using our class hierarchy to guide us.
5.8.6.1

Introducing New Subclasses

New subclasses can be introduced at any point in the search. We may construct
all the subclasses at the start of our search, or we may consider introducing them
more gradually, perhaps at each phase of the search. Regardless of when the new
subclasses are introduced, the search space is greatly expanded, and care must be
taken to avoid the construction of an intractable search problem. Here we describe
the mechanics of the introduction of the new subclasses.
For each new subclass introduced, each attribute for the subclass is associated
with a CPD. A CPD can be marked as either inherited or specialized. Initially,
only the CPDs for attributes of X are marked as specialized; all the other CPDs
are inherited. Our original search operators (those that add and delete parents)
can be applied to attributes at all levels of the class hierarchy. However, we
only allow parents to be added and deleted from attributes whose CPDs have been
specialized. Note that any change to the parents of an attribute is propagated to
any descendants of the attribute whose CPDs are marked as inherited from this
attribute.
Next, we introduce the operators Specialize and Inherit. If X_c.A currently has
an inherited CPD, we can apply Specialize(X_c.A). This has two effects. First, it
recomputes the parameters of that CPD to utilize only the sufficient statistics of
the subclass c. To understand this point, assume that X_c.A was being inherited
from X_d prior to the specialization. The CPD of X_d.A was being computed using all
objects in I(X_d). After the change, the CPD will be computed using just the objects
in I(X_c). The second effect of the operator is that it makes the CPD modifiable,
in that we can now add new parents or delete them. The Inherit operator has the
opposite effect.
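The effect of Specialize and Inherit on the sufficient statistics can be illustrated with a small sketch. It assumes a two-level hierarchy, attributes without parents (so the statistics are plain value counts), and a hypothetical `cpd_source` map from each class to the class whose CPD it currently uses; none of these names come from the original algorithm, and propagation to deeper descendants is omitted.

```python
from collections import Counter, defaultdict


def value_counts(objects, cpd_source, attr):
    """Route each object's value of `attr` to the class that owns its CPD."""
    stats = defaultdict(Counter)
    for obj in objects:
        owner = cpd_source[obj["subclass"]]
        stats[owner][obj[attr]] += 1
    return {owner: dict(counts) for owner, counts in stats.items()}


def specialize(cpd_source, c):
    """Subclass c gets its own CPD, estimated from its own objects only."""
    updated = dict(cpd_source)
    updated[c] = c
    return updated


def inherit(cpd_source, c, parent_of):
    """Subclass c goes back to using whatever CPD its parent class uses."""
    updated = dict(cpd_source)
    updated[c] = updated[parent_of[c]]
    return updated


parent_of = {"Action": "Movie", "Documentary": "Movie"}
cpd_source = {"Movie": "Movie", "Action": "Movie", "Documentary": "Movie"}
movies = [
    {"subclass": "Action", "rating": "high"},
    {"subclass": "Action", "rating": "high"},
    {"subclass": "Documentary", "rating": "low"},
    {"subclass": "Documentary", "rating": "high"},
]

before = value_counts(movies, cpd_source, "rating")
specialized = specialize(cpd_source, "Documentary")
after = value_counts(movies, specialized, "rating")
restored = inherit(specialized, "Documentary", parent_of)
```

Before specialization all four movies feed one shared CPD; afterwards the documentaries feed their own CPD and no longer contribute to the shared one.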
In addition, when a new subclass is introduced, we construct new refined slot
references that make use of the subclass. Let D be a newly introduced subclass
of Y. For each reference slot ρ of some class X with range Y, we introduce a
new refined slot reference ρ_D. In addition, we add each reference slot of Y to D;
however, we refine the domain from Y to D. In other words, for each reference
slot ρ of Y with Range[ρ] = X, we have a new reference slot ρ′, where
Dom[ρ′] = D and Range[ρ′] = X.
5.8.6.2

Learning Subclass Hierarchies

We next examine the case where the subclass hierarchies are not given as part of
the input. In this case, we will learn them at the same time we are learning the
PRM.
As above, we wish to avoid the problem of learning from partially observable data.
Hence, we need to assume that the basic subclasses are observed in the training set.
At first glance, this requirement seems incompatible with our task definition: if the
class hierarchy is not known, how can we observe subclasses in the training data?
We resolve this problem by defining our class hierarchy based on the standard class
attributes. For example, movies might be associated with an attribute specifying the
genre: action, drama, or documentary. If our search algorithm decides that this
attribute is a useful basis for forming subclasses, we would define subclasses based in
a deterministic way on its values. Another attribute might be the reputation of the
director. The algorithm might choose to refine the class hierarchy by partitioning
sitcoms according to the values of this attribute. Note that, in this case, the class
hierarchy depends on an attribute of a related class, not the class itself.
We implement this approach by requiring that the subclass indicator attribute
be a deterministic function of its parents. These parents are the attributes used to
define the subclass hierarchy. In our example, Movie.Class would have as parents
Movie.Genre and Movie.Director.Reputation. Note that, as the function defining the
subclass indicator variable is required to be deterministic, the subclass is effectively
observed in the training data (due to the assumption that all other attributes are
observed).
We restrict attention to decision-tree CPDs. The leaves in the decision tree
represent the basic subclasses, and the attributes used for splitting the decision
tree are the parents of the subclass indicator variable. We can allow binary splits
that test whether an attribute has a particular value, or, if we find it necessary, we
can allow a split on all possible values of an attribute.
The decision tree gives a simple algorithm for determining the subclass of an
object. In order to build the decision tree during our search, we introduce a new
operator Split(X, c, X.K.B), where c is a leaf in the current decision tree for X.Class
and X.K.B is the attribute on which we will split that subclass.
Note that this step expands the space of models that can be considered, but in
isolation does not change the score of the model. Thus, if we continue to use a purely
greedy search, we would never take these steps. There are several approaches for
addressing this problem. One is to use some lookahead for evaluating the quality of
such a step. Another is to use various heuristics for guiding us toward worthwhile
splits. For example, if an attribute is the common parent of many other attributes
within X_c, it may be a good candidate on which to split.
The other operators, Specialize and Inherit, remain the same; they simply use the
subclasses defined by the decision tree.
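A deterministic decision tree for the subclass indicator, together with a Split operator, might look as follows. This is a sketch under several assumptions: a leaf is a string naming a basic subclass, an internal node is a pair of an attribute and a dictionary of subtrees, splits are on all values of the attribute, and the attribute of the related class (Movie.Director.Reputation) is assumed to have been precomputed into the object under the hypothetical key `director_reputation`.

```python
def subclass_of(tree, obj):
    """Walk the tree from the root to a leaf and return the subclass name."""
    node = tree
    while not isinstance(node, str):
        attr, children = node
        node = children[obj[attr]]
    return node


def split(tree, leaf, attr, values):
    """Return a new tree in which `leaf` is split on all values of `attr`."""
    if isinstance(tree, str):
        if tree == leaf:
            # One new basic subclass per value of the split attribute.
            return (attr, {v: f"{leaf}[{attr}={v}]" for v in values})
        return tree
    node_attr, children = tree
    return (node_attr,
            {v: split(sub, leaf, attr, values) for v, sub in children.items()})


tree = "Movie"  # a single leaf: no subclasses yet
tree = split(tree, "Movie", "genre", ["action", "drama", "documentary"])
tree = split(tree, "Movie[genre=drama]", "director_reputation", ["high", "low"])

drama = subclass_of(tree, {"genre": "drama", "director_reputation": "low"})
action = subclass_of(tree, {"genre": "action", "director_reputation": "high"})
```

Because the tree is a deterministic function of observed attributes, every training object can be assigned to a basic subclass without treating the subclass indicator as a hidden variable.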

5.9

Conclusion
In this chapter we have described a comprehensive framework for learning a statistical model from relational data. We have presented a method for the automatic
construction of a PRM from an existing database. Our method learns a structured
statistical model directly from the relational database, without requiring the data
to be flattened into a fixed attribute-value format. We have shown how to perform
parameter estimation, developed a scoring criterion for use in structure selection,
and defined the model search space. We have also provided algorithms for guaranteeing the coherence of the learned model.

References
[1] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific
independence in Bayesian networks. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1996.
[2] J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive
algorithms for collaborative filtering. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1998.
[3] D. Chickering. Learning Bayesian networks is NP-complete. In Artificial
Intelligence and Statistics, 1996.
[4] D. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning
Bayesian networks with local structure. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1997.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document
content and hypertext connectivity. In Proceedings of Neural Information
Processing Systems, 2001.
[6] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam,
and S. Slattery. Learning to extract symbolic knowledge from the World Wide
Web. In Proceedings of the National Conference on Artificial Intelligence, 1998.
[7] M. H. DeGroot. Optimal Statistical Decisions. McGraw-Hill, New York, 1970.
[8] N. Friedman, I. Nachman, and D. Pe'er. Learning of Bayesian network structure
from massive datasets: The "sparse candidate" algorithm. In Proceedings of
the Conference on Uncertainty in Artificial Intelligence, 1999.
[9] L. Getoor. Learning Statistical Models from Relational Data. PhD thesis,
Stanford University, Stanford, CA, 2001.
[10] L. Getoor and J. Grant. PRL: A probabilistic relational language. Machine
Learning Journal, 62(1-2):7-31, 2006.
[11] D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan,
editor, Learning in Graphical Models, pages 301-354. MIT Press, Cambridge,
MA, 1998.
[12] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks:
The combination of knowledge and statistical data. Machine Learning, 20:
197-243, 1995.
[13] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings
of the Conference on Uncertainty in Artificial Intelligence, 1998.
[14] D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In Proceedings
of the Conference on Uncertainty in Artificial Intelligence, 1997.
[15] D. MacKay, R. McEliece, and J. Cheng. Turbo decoding as an instance
of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in
Communication, 16(2):140-152, 1997.
[16] K. Murphy and Y. Weiss. Loopy belief propagation for approximate inference:
An empirical study. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, 1999.
[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann,
San Francisco, 1988.
[18] A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford
University, Stanford, CA, 2000.
[19] A. Pfeffer, D. Koller, B. Milch, and K. Takusagawa. SPOOK: A system for
probabilistic object-oriented knowledge representation. In Proceedings of the
Conference on Uncertainty in Artificial Intelligence, 1999.
[20] Y. Weiss. Correctness of local probability propagation in graphical models
with loops. Neural Computation, 12(1):1-41, 2000.

6 Relational Markov Networks

Ben Taskar, Pieter Abbeel, Ming-Fai Wong, and Daphne Koller

One of the key challenges for statistical relational learning is the design of a representation language that allows flexible modeling of complex relational interactions.
Many of the formalisms presented in this book are based on directed graphical models (probabilistic relational models, probabilistic entity-relationship models, Bayesian logic programs). In this chapter, we present a probabilistic modeling
framework that builds on undirected graphical models (also known as Markov random fields or Markov networks). Undirected models address two limitations of the
previous approach. First, undirected models do not impose the acyclicity constraint
that hinders representation of many important relational dependencies in directed
models. Second, undirected models are well suited for discriminative training, where
we optimize the conditional likelihood of the labels given the features, which generally improves classification accuracy. We show how to train these models effectively, and how to use approximate probabilistic inference over the learned model
for collective classification and link prediction. We provide experimental results on
hypertext and social network domains, showing that accuracy can be significantly
improved by modeling relational dependencies.1

6.1

Introduction
We focus on supervised learning as a motivation for our framework. The vast
majority of work in statistical classification methods has focused on "flat" data:
data consisting of identically structured entities, typically assumed to be i.i.d.
However, many real-world data sets are innately relational: hyperlinked webpages,
cross-citations in patents and scientific papers, social networks, medical records,
and more. Such data consists of entities of different types, where each entity type is
characterized by a different set of attributes. Entities are related to each other via
different types of links, and the link structure is an important source of information.

1. This chapter is based on work in [21, 22].

Consider a collection of hypertext documents that we want to classify using
some set of labels. Most naively, we can use a bag-of-words model, classifying each
webpage solely using the words that appear on the page. However, hypertext has a
very rich structure that this approach loses entirely. One document has hyperlinks
to others, typically indicating that their topics are related. Each document also
has internal structure, such as a partition into sections; hyperlinks that emanate
from the same section of the document are even more likely to point to similar
documents. When classifying a collection of documents, these are important cues
that can potentially help us achieve better classification accuracy. Therefore, rather
than classifying each document separately, we want to provide a form of collective
classification, where we simultaneously decide on the class labels of all of the entities
together, and thereby can explicitly take advantage of the correlations between the
labels of related entities.
Another challenge arises from the task of predicting which entities are related to
which others and what the types of these relationships are. For example, in a data
set consisting of a set of hyperlinked university webpages, we might want to predict
not just which page belongs to a professor and which to a student, but also which
professor is which student's advisor. In some cases, the existence of a relationship
will be predicted by the presence of a hyperlink between the pages, and we will have
only to decide whether the link reflects an advisor-advisee relationship. In other
cases, we might have to infer the very existence of a link from indirect evidence,
such as a large number of coauthored papers.
We propose the use of a joint probabilistic model for an entire collection of
related entities. Following the work of Lafferty et al. [13], we base our approach on
discriminatively trained undirected graphical models, or Markov networks [17]. We
introduce the framework of relational Markov networks (RMNs), which compactly
defines a Markov network over a relational data set. The graphical structure of
an RMN is based on the relational structure of the domain, and can easily model
complex patterns over related entities. For example, we can represent a pattern
where two linked documents are likely to have the same topic. We can also capture
patterns that involve groups of links: for example, consecutive links in a document
tend to refer to documents with the same label. As we show, the use of an undirected
graphical model avoids the difficulties of defining a coherent generative model for
graph structures in directed models. It thereby allows us tremendous flexibility in
representing complex patterns.
Undirected models lend themselves well to discriminative training, where we optimize the conditional likelihood of the labels given the features. Discriminative
training, given sufficient data, generally provides significant improvements in classification accuracy over generative training (see [23]). We provide an effective parameter estimation algorithm for RMNs which uses conjugate gradient combined
with approximate probabilistic inference (belief propagation [17, 14, 12]) for estimating the gradient. We also show how to use approximate probabilistic inference
over the learned model for collective classification and link prediction. We provide
experimental results on a webpage classification and social network task, showing
significant gains in accuracy arising both from the modeling of relational dependencies and the use of discriminative training.

6.2

Relational Classification and Link Prediction


Consider hypertext as a simple example of a relational domain. A relational domain
is defined by a schema, which describes entities, their attributes, and the relations
between them. In our domain, there are two entity types: Doc and Link. If a webpage
is represented as a bag of words, Doc would have a set of Boolean attributes
Doc.HasWord_k indicating whether the word k occurs on the page. It would also
have the label attribute Doc.Label, indicating the topic of the page, which takes on
a set of categorical values. The Link entity type has two attributes: Link.From and
Link.To, both of which refer to Doc entities.
In general, a schema specifies a set of entity types ℰ = {E_1, ..., E_n}. Each
type E is associated with three sets of attributes: content attributes E.X (e.g.,
Doc.HasWord_k), label attributes E.Y (e.g., Doc.Label), and reference attributes
E.R (e.g., Link.To). For simplicity, we restrict label and content attributes to take
on categorical values. Reference attributes include a special unique key attribute
E.K that identifies each entity. Other reference attributes E.R refer to entities of
a single type E′ = Range(E.R) and take values in Domain(E′.K).
An instantiation I of a schema ℰ specifies the set of entities I(E) of each entity
type E ∈ ℰ and the values of all attributes for all of the entities. For example, an
instantiation of the hypertext schema is a collection of webpages, specifying their
labels, the words they contain, and the links between them. We will use I.X, I.Y,
and I.R to denote the content, label, and reference attributes in the instantiation
I; I.x, I.y, and I.r to denote the values of those attributes. The component I.r,
which we call an instantiation skeleton or instantiation graph, specifies the set of
entities (nodes) and their reference attributes (edges). A hypertext instantiation
graph specifies a set of webpages and links between them, but not their words or
labels.
To address the link prediction problem, we need to make links first-class citizens
in our model. Following Getoor et al. [7], we introduce into our schema object
types that correspond to links between entities. Each link object ℓ is associated
with a tuple of entity objects (o_1, ..., o_k) that participate in the link. For example,
a Hyperlink link object would be associated with a pair of entities, the linking
page and the linked-to page, which are part of the link definition. We note that
link objects may also have other attributes; e.g., a hyperlink object might have
attributes for the anchor words on the link.
As our goal is to predict link existence, we must consider links that exist and
links that do not. We therefore consider a set of potential links between entities.
Each potential link is associated with a tuple of entity objects, but it may or may
not actually exist. We denote this event using a binary existence attribute Exists,
which is true if the link between the associated entities exists and false otherwise.
In our example, our model may contain a potential link ℓ for each pair of webpages,
and the value of the variable ℓ.Exists determines whether the link actually exists or
not. The link prediction task now reduces to the problem of predicting the existence
attributes of these link objects.
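The hypertext schema and its potential links can be sketched with a pair of small record types. The class names, field names, and the decision to enumerate ordered pairs of distinct pages are assumptions made for illustration; the text itself only requires one potential link per pair of webpages.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Doc:
    key: str                             # Doc.K, the unique key
    words: FrozenSet[str] = frozenset()  # words k with Doc.HasWord_k = true
    label: Optional[str] = None          # Doc.Label; None when unobserved


@dataclass(frozen=True)
class PotentialLink:
    source: str   # key of the linking page
    target: str   # key of the linked-to page
    exists: bool  # the Exists attribute


def potential_links(docs, observed_links):
    """One potential link per ordered pair of distinct documents."""
    return [
        PotentialLink(a.key, b.key, (a.key, b.key) in observed_links)
        for a, b in permutations(docs, 2)
    ]


docs = [
    Doc("d1", frozenset({"course", "exam"}), "course"),
    Doc("d2", frozenset({"research"}), "faculty"),
    Doc("d3", frozenset({"thesis"})),
]
observed = {("d1", "d2"), ("d3", "d2")}
links = potential_links(docs, observed)
```

Three pages already yield six potential links, of which only two exist; link prediction amounts to inferring the `exists` field when it is not observed.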

6.3

Graph Structure and Subgraph Templates


The structure of the instantiation graph has been used extensively to infer its
importance in scientific publications [5] and hypertext [10]. Several recent papers
have proposed algorithms that use the link graph to aid classification. Chakrabarti
et al. [2] use system-predicted labels of linked documents to iteratively relabel
each document in the test set, achieving a significant improvement compared to a
baseline of using the text in each document alone. A similar approach was used
by Neville and Jensen [16] in a different domain. Slattery and Mitchell [19] tried to
identify directory (or hub) pages that commonly list pages of the same topic, and
used these pages to improve classification of university webpages. However, none
of these approaches provide a coherent model for the correlations between linked
webpages. Thus, they apply combinations of classifiers in a procedural way, with
no formal justification.
Taskar et al. [20] suggest the use of probabilistic relational models (PRMs) for the
collective classification task. PRMs [11, 6] are a relational extension to Bayesian networks [17]. A PRM specifies a probability distribution over instantiations consistent
with a given instantiation graph by specifying a Bayesian network-like template-level probabilistic model for each entity type. Given a particular instantiation graph,
the PRM induces a large Bayesian network over that instantiation that specifies
a joint probability distribution over all attributes of all of the entities. This network reflects the interactions between related instances by allowing us to represent
correlations between their attributes.
In our hypertext example, a PRM might use a naive Bayes model for words,
with a directed edge between Doc.Label and each attribute Doc.HasWord_k; each of
these attributes would have a conditional probability distribution P(Doc.HasWord_k |
Doc.Label) associated with it, indicating the probability that word k appears in the
document given each of the possible topic labels. More importantly, a PRM can
represent the interdependencies between topics of linked documents by introducing
an edge from Doc.Label to Doc.Label of two documents if there is a link between
them. Given a particular instantiation graph containing some set of documents
and links, the PRM specifies a Bayesian network over all of the documents in the
collection. We would have a probabilistic dependency from each document's label
to the words on the document, and a dependency from each document's label to
the labels of all of the documents to which it points. Taskar et al. [20] show that
this approach works well for classifying scientific documents, using both the words
in the title and abstract and the citation-link structure.
However, the application of this idea to other domains, such as webpages, is
problematic since there are many cycles in the link graph, leading to cycles in the
induced "Bayesian network", which is therefore not a coherent probabilistic model.
Getoor et al. [8] suggest an approach where we do not include direct dependencies
between the labels of linked webpages, but rather treat links themselves as random
variables. Every pair of pages has a "potential link", which may or may not exist
in the data. The model defines the probability of the link existence as a function
of the labels of the two endpoints. In this link existence model, labels have no
incoming edges from other labels, and the cyclicity problem disappears. This model,
however, has other fundamental limitations. In particular, the resulting Bayesian
network has a random variable for each potential link: N^2 variables for collections
containing N pages. This quadratic blowup occurs even when the actual link graph
is very sparse. When N is large (e.g., the set of all webpages), a quadratic growth is
intractable. Even more problematic are the inherent limitations on the expressive
power imposed by the constraint that the directed graph must represent a coherent
generative model over graph structures. The link existence model assumes that the
presence of different edges are conditionally independent events. Representing more
complex patterns involving correlations between multiple edges is very difficult. For
example, if two pages point to the same page, it is more likely that they point to
each other as well. Such interactions between many overlapping triples of links do
not fit well into the generative framework.
Furthermore, directed models such as Bayesian networks and PRMs are usually
trained to optimize the joint probability of the labels and other attributes, while the
goal of classification is a discriminative model of labels given the other attributes.
The advantage of training a model only to discriminate between labels is that
it does not have to trade off between classification accuracy and modeling the
joint distribution over nonlabel attributes. In many cases, discriminatively trained
models are more robust to violations of independence assumptions and achieve
higher classification accuracy than their generative counterparts.
In our experiments, we found that the combination of a relational language with
a probabilistic graphical model provides a very flexible framework for modeling
complex patterns common in relational graphs. First, as observed by Getoor et al.
[7], there are often correlations between the attributes of entities and the relations
in which they participate. For example, in a social network, people with the same
hobby are more likely to be friends.
We can also exploit correlations between the labels of entities and the relation
type. For example, only students can be teaching assistants in a course. We can
easily capture such correlations by introducing cliques that involve these attributes.
Importantly, these cliques are informative even when attributes are not observed
in the test data. For example, if we have evidence indicating an advisor-advisee
relationship, our probability that X is a faculty member increases, and thereby our
belief that X participates in a teaching assistant link with some entity Z decreases.

We also found it useful to consider richer subgraph templates over the link graph.
One useful type of template is a similarity template, where objects that share a
certain graph-based property are more likely to have the same label. Consider, for
example, a professor X and two other entities Y and Z. If X's webpage mentions Y
and Z in the same context, it is likely that the X-Y relation and the X-Z relation are
of the same type; for example, if Y is Professor X's advisee, then probably so is Z.
Our framework accommodates these patterns easily, by introducing pairwise cliques
between the appropriate relation variables.
Another useful type of subgraph template involves transitivity patterns, where
the presence of an A-B link and of a B-C link increases (or decreases) the likelihood
of an A-C link. For example, students often assist in courses taught by their advisor.
Note that this type of interaction cannot be accounted for by just using pairwise
cliques. By introducing cliques over triples of relations, we can capture such patterns
as well. We can incorporate even more complicated patterns, but of course we are
limited by the ability of belief propagation to scale up as we introduce larger cliques
and tighter loops in the Markov network.
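The transitivity template can be pictured as one clique per triple of entities, tying together the three link variables among them. The sketch below enumerates these cliques for a small set of entities; it assumes undirected links and unordered triples, and the function name and data are hypothetical.

```python
from itertools import combinations


def transitivity_cliques(entities):
    """Yield one clique per unordered triple {a, b, c}.

    Each clique holds the three potential-link variables (a, b), (b, c),
    and (a, c); links are treated as undirected here for brevity.
    """
    for a, b, c in combinations(sorted(entities), 3):
        yield ((a, b), (b, c), (a, c))


people = ["ann", "bob", "cat", "dan"]
cliques = list(transitivity_cliques(people))
link_vars = {pair for clique in cliques for pair in clique}
```

The number of such cliques grows cubically with the number of entities, and every link variable takes part in many overlapping triangles, which is the source of the tight loops that make approximate inference harder.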
We note that our ability to model these more complex graph patterns relies on
our use of an undirected Markov network as our probabilistic model. In contrast,
the approach of Getoor et al. [8] uses directed graphical models (Bayesian networks
and PRMs [11]) to represent a probabilistic model of both relations and attributes.
Their approach easily captures the dependence of link existence on attributes of
entities. But the constraint that the probabilistic dependency graph be a directed
acyclic graph makes it hard to see how we would represent the subgraph patterns
described above. For example, for the transitivity pattern, we might consider simply
directing the correlation edges between link existence variables arbitrarily. However,
it is not clear how we would then parameterize a link existence variable for a link
that is involved in multiple triangles. See [20] for further discussion.

6.4 Undirected Models for Classification

As discussed, our approach to the collective classification task is based on the use
of undirected graphical models. We begin by reviewing Markov networks, a flat
undirected model. We then discuss how Markov networks can be extended to the
relational setting.
6.4.1 Markov Networks

We use $V$ to denote a set of discrete random variables and $v$ an assignment of
values to $V$. A Markov network for $V$ defines a joint distribution over $V$. It consists
of a qualitative component, an undirected dependency graph, and a quantitative
component, a set of parameters associated with the graph. For a graph $G$, a clique
is a set of nodes $V_c$ in $G$, not necessarily maximal, such that each pair $V_i, V_j \in V_c$ is
connected by an edge in $G$. Note that a single node is also considered a clique.

Definition 6.1
Let $G = (V, E)$ be an undirected graph with a set of cliques $C(G)$. Each $c \in C(G)$
is associated with a set of nodes $V_c$ and a clique potential $\phi_c(V_c)$, which is a
nonnegative function defined on the joint domain of $V_c$. Let $\Phi = \{\phi_c(V_c)\}_{c \in C(G)}$.
The Markov net $(G, \Phi)$ defines the distribution

$$P(v) = \frac{1}{Z} \prod_{c \in C(G)} \phi_c(v_c),$$

where $Z$ is the partition function, a normalization constant given by

$$Z = \sum_{v'} \prod_{c \in C(G)} \phi_c(v'_c).$$

Each potential $\phi_c$ is simply a table of values for each assignment $v_c$ that defines
a compatibility between values of variables in the clique. The potential is often
represented by a log-linear combination of a small set of features:

$$\phi_c(v_c) = \exp\Big\{\sum_i w_i f_i(v_c)\Big\} = \exp\{w_c^\top f_c(v_c)\}.$$

The simplest and most common form of a feature is the indicator function
$f(V_c) \equiv \delta(V_c = v_c)$. However, features can be arbitrary logical predicates of the
variables of the clique, $V_c$. For example, if the variables are binary, a feature might
signify the parity or whether the variables are all the same value. More generally,
the features can be real-valued functions, not just binary predicates. See further
discussion of features at the end of section 6.4.
We will abbreviate the log-linear representation as follows:

$$\log P(v) = \sum_c w_c^\top f_c(v_c) - \log Z = w^\top f(v) - \log Z,$$

where $w$ and $f$ are the vectors of all weights and features.
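To make the definition concrete, the following is a minimal sketch that builds a tiny log-linear Markov network and normalizes it by brute-force enumeration. The three-variable chain, the `agree` feature, and the weight values are all hypothetical choices made for illustration; they do not come from the experiments in this chapter.

```python
import itertools
import math

# Hypothetical toy network: three binary variables on a chain A - B - C.
variables = ["A", "B", "C"]
cliques = [("A", "B"), ("B", "C")]

def agree(values):
    """Indicator feature: 1.0 when all variables in the clique share a value."""
    return 1.0 if len(set(values)) == 1 else 0.0

# One weight per clique; the values are made up for illustration.
weights = {("A", "B"): 1.0, ("B", "C"): 0.5}

def log_score(assignment):
    """Unnormalized log-probability: sum over cliques of w_c * f_c(v_c)."""
    return sum(
        weights[c] * agree([assignment[name] for name in c]) for c in cliques
    )

def all_assignments():
    """Enumerate every joint assignment v to the variables."""
    for values in itertools.product([0, 1], repeat=len(variables)):
        yield dict(zip(variables, values))

# Partition function Z: sum of exp(score) over every joint assignment.
Z = sum(math.exp(log_score(v)) for v in all_assignments())

def prob(assignment):
    """P(v) = exp(w . f(v)) / Z."""
    return math.exp(log_score(assignment)) / Z
```

Enumeration is only feasible for a handful of variables, which is why the later sections turn to approximate inference.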


For classification, we are interested in constructing discriminative models using
conditional Markov nets, which are simply Markov networks renormalized to model
a conditional distribution.

Definition 6.2
Let $X$ be a set of random variables on which we condition and $Y$ be a set of target
(or label) random variables. A conditional Markov network is a Markov network
$(G, \Phi)$ which defines the distribution

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C(G)} \phi_c(x_c, y_c),$$

where $Z(x)$ is the partition function, now dependent on $x$:

$$Z(x) = \sum_{y'} \prod_{c \in C(G)} \phi_c(x_c, y'_c).$$
Logistic regression, a well-studied statistical model for classification, can be
viewed as the simplest example of a conditional Markov network. In standard form,
for $Y = \pm 1$ and $X \in \{0, 1\}^n$ (or $X \in \mathbb{R}^n$),

$$P(y \mid x) = \frac{1}{Z(x)} \exp\{y\, w^\top x\}.$$

Viewing the model as a Markov network, the cliques are simply the edges
$c_k = \{X_k, Y\}$ with potentials $\phi_k(x_k, y) = \exp\{y w_k x_k\}$. In this example,
each feature is of the form $f_k(x_k, y) = y x_k$.
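The correspondence can be checked numerically. The sketch below multiplies the edge potentials $\exp\{y w_k x_k\}$, normalizes over $y \in \{-1, +1\}$, and compares the result with the familiar sigmoid form $1 / (1 + \exp\{-2 w^\top x\})$. The weights and input are arbitrary illustrative values.

```python
import math

def conditional_markov_net(w, x):
    """P(y | x) for y in {-1, +1} from edge potentials exp(y * w_k * x_k)."""
    def unnormalized(y):
        p = 1.0
        for w_k, x_k in zip(w, x):
            p *= math.exp(y * w_k * x_k)   # potential on clique {X_k, Y}
        return p
    z = unnormalized(-1) + unnormalized(+1)  # Z(x): sum over both labels
    return {y: unnormalized(y) / z for y in (-1, +1)}

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

w = [0.8, -1.2, 0.3]   # illustrative weights
x = [1, 0, 1]          # illustrative binary input
p = conditional_markov_net(w, x)

# With labels in {-1, +1}, P(y = +1 | x) equals sigmoid(2 * w . x).
w_dot_x = sum(w_k * x_k for w_k, x_k in zip(w, x))
assert abs(p[+1] - sigmoid(2 * w_dot_x)) < 1e-12
```

The factor of 2 appears only because the labels are coded as $\pm 1$ rather than $\{0, 1\}$.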
6.4.2 Relational Markov Networks

We now extend the framework of Markov networks to the relational setting. A
relational Markov network specifies a conditional distribution over all of the labels
of all of the entities in an instantiation given the relational structure and the content
attributes. (We provide the definitions directly for the conditional case, as the
unconditional case is a special case where the set of content attributes is empty.)
Roughly speaking, it specifies the cliques and potentials between attributes of
related entities at a template level, so a single model provides a coherent distribution
for any collection of instances from the schema.

Figure 6.1 An unrolled Markov net over linked documents. The links follow a
common pattern: documents with the same label tend to link to each other more
often.

For example, suppose that pages with the same label tend to link to each other,
as in figure 6.1. We can capture this correlation between labels by introducing,
for each link, a clique between the labels of the source and the target page. The
potential on the clique will have higher values for assignments that give a common
label to the linked pages.
To specify what cliques should be constructed in an instantiation, we will define
a notion of a relational clique template. A relational clique template specifies tuples
of variables in the instantiation by using a relational query language. For our link
example, we can write the template as a kind of SQL query:
SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link
WHERE link.From = doc1.Key and link.To = doc2.Key
Note the three clauses that define a query: the FROM clause specifies the cross
product of entities to be filtered by the WHERE clause and the SELECT clause
picks out the attributes of interest. Our definition of clique templates contains the
corresponding three parts.

Definition 6.3
A relational clique template $C = (F, W, S)$ consists of three components:
$F = \{F_i\}$: a set of entity variables, where an entity variable $F_i$ is of type $E(F_i)$.
$W(F.R)$: a Boolean formula using conditions of the form $F_i.R_j = F_k.R_l$.
$F.S \subseteq F.X \cup F.Y$: a selected subset of content and label attributes in $F$.

For the clique template corresponding to the SQL query above, $F$ consists
of doc1, doc2, and link of types Doc, Doc, and Link, respectively. $W(F.R)$ is
link.From = doc1.Key $\wedge$ link.To = doc2.Key, and $F.S$ is doc1.Category and
doc2.Category.
A clique template specifies a set of cliques in an instantiation $I$:

$$C(I) \equiv \{c = f.S : f \in I(F) \wedge W(f.r)\},$$

where $f$ is a tuple of entities $\{f_i\}$ in which each $f_i$ is of type $E(F_i)$;
$I(F) = I(E(F_1)) \times \cdots \times I(E(F_n))$ denotes the cross product of entities in the instantiation;
the clause $W(f.r)$ ensures that the entities are related to each other in specified
ways; and finally, $f.S$ selects the appropriate attributes of the entities. Note that
the clique template does not specify the nature of the interaction between the
attributes; that is determined by the clique potentials, which will be associated
with the template.
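The set $C(I)$ can be computed directly from its definition. The sketch below is one naive way to do so, assuming entities are stored as lists of dictionaries; the `unroll` helper, the toy pages, and the toy links are hypothetical and are not part of the original system.

```python
from itertools import product

# Hypothetical toy instantiation: entity tables as lists of dicts.
docs = [
    {"Key": "d1", "Category": "faculty"},
    {"Key": "d2", "Category": "student"},
    {"Key": "d3", "Category": "course"},
]
links = [
    {"From": "d1", "To": "d2"},
    {"From": "d2", "To": "d3"},
]

def unroll(entity_tables, where, select):
    """Return one clique per entity tuple f in I(F) that satisfies W(f.r)."""
    cliques = []
    for f in product(*entity_tables):   # cross product I(E(F1)) x ... x I(E(Fn))
        if where(*f):                   # W(f.r): the relational condition
            cliques.append(select(*f))  # f.S: the selected attributes
    return cliques

# The link template: FROM Doc doc1, Doc doc2, Link link WHERE ... SELECT ...
# Each clique is a pair of label variables, identified by (attribute, key).
link_cliques = unroll(
    [docs, docs, links],
    where=lambda doc1, doc2, link: (
        link["From"] == doc1["Key"] and link["To"] == doc2["Key"]
    ),
    select=lambda doc1, doc2, link: (
        ("Category", doc1["Key"]),
        ("Category", doc2["Key"]),
    ),
)
```

Enumerating the full cross product is quadratic or worse in the number of entities; a practical implementation would presumably evaluate the WHERE clause as a join instead.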
This definition of a clique template is very flexible, as the WHERE clause of
a template can be an arbitrary predicate. It allows modeling complex relational
patterns on the instantiation graphs. To continue our webpage example, consider
another common pattern in hypertext: links in a webpage tend to point to pages of
the same category. This pattern can be expressed by the following template:
SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link1, Link link2
WHERE link1.From = link2.From and link1.To = doc1.Key
and link2.To = doc2.Key and not doc1.Key = doc2.Key
Depending on the expressive power of our template definition language, we
may be able to construct very complex templates that select entire subgraph
structures of an instantiation. We can easily represent patterns involving three (or
more) interconnected documents without worrying about the acyclicity constraint
imposed by directed models. Since the clique templates do not explicitly depend on
the identities of entities, the same template can select subgraphs whose structure
is fairly different. The RMN allows us to associate the same clique potential
parameters with all of the subgraphs satisfying the template, thereby allowing
generalization over a wide range of different structures.
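Definition 6.4
A relational Markov network $M = (\mathcal{C}, \Phi)$ specifies a set of clique templates $\mathcal{C}$ and
corresponding potentials $\Phi = \{\phi_C\}_{C \in \mathcal{C}}$ to define a conditional distribution:

$$P(I.y \mid I.x, I.r) = \frac{1}{Z(I.x, I.r)} \prod_{C \in \mathcal{C}} \prod_{c \in C(I)} \phi_C(I.x_c, I.y_c),$$

where $Z(I.x, I.r)$ is the normalizing partition function:

$$Z(I.x, I.r) = \sum_{I.y'} \prod_{C \in \mathcal{C}} \prod_{c \in C(I)} \phi_C(I.x_c, I.y'_c).$$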

Using the log-linear representation of potentials, $\phi_C(V_C) = \exp\{w_C^\top f_C(V_C)\}$,
we can write

$$\begin{aligned}
\log P(I.y \mid I.x, I.r) &= \sum_{C \in \mathcal{C}} \sum_{c \in C(I)} w_C^\top f_C(I.x_c, I.y_c) - \log Z(I.x, I.r) \\
&= \sum_{C \in \mathcal{C}} w_C^\top f_C(I.x, I.y, I.r) - \log Z(I.x, I.r) \\
&= w^\top f(I.x, I.y, I.r) - \log Z(I.x, I.r),
\end{aligned}$$

where

$$f_C(I.x, I.y, I.r) = \sum_{c \in C(I)} f_C(I.x_c, I.y_c)$$

is the sum over all appearances of the template $C(I)$ in the instantiation, and $f$ is
the vector of all $f_C$.
Given a particular instantiation $I$ of the schema, the RMN $M$ produces an
unrolled Markov network over the attributes of entities in $I$. The cliques in the
unrolled network are determined by the clique templates $\mathcal{C}$. We have one clique for
each $c \in C(I)$, and all of these cliques are associated with the same clique potential
$\phi_C$. In our webpage example, an RMN with the link feature described above would
define a Markov net in which, for every link between two pages, there is an edge
between the labels of these pages. Figure 6.1 illustrates a simple instance of this
unrolled Markov network.
Note that we leave the clique potentials to be specified using arbitrary sets of
feature functions. A common set is the complete table of indicator functions, one
for each instantiation of the discrete-valued variables in the clique. However, this
results in a large number of parameters (exponential in the number of variables).
Often, as we encounter in our experiments, only a subset of the instantiations is
of interest or many instantiations are essentially equivalent because of symmetries.
For example, in an edge potential between labels of two webpages linked from a
given page, we might want to have a single feature tracking whether the two labels
are the same. In the case of triad cliques enforcing transitivity, we might constrain
features to be symmetric functions with respect to the variables. In the presence of
continuous-valued variables, features are often a predicate on the discrete variables
multiplied by a continuous value. We do not prescribe a language for specifying
features (as does Markov logic; see chapter 11), although in our implementation,
we use a combination of logical formulae and custom-designed functions.

6.5 Learning the Models

We focus here on the case where the clique templates are given; our task is to
estimate the clique potentials, or feature weights. Thus, assume that we are given a
set of clique templates $\mathcal{C}$ which partially specify our (relational) Markov network,
and our task is to compute the weights $w$ for the potentials $\Phi$. In the learning task,
we are given some training set $D$ where both the content attributes and the labels
are observed. Any particular setting for $w$ fully specifies a probability distribution
$P_w$ over $D$, so we can use the likelihood as our objective function, and attempt to
find the weight setting that maximizes the likelihood (ML) of the labels given other
attributes. However, to help avoid overfitting, we assume a prior over the weights
(a zero-mean Gaussian), and use maximum a posteriori (MAP) estimation. More
precisely, we assume that different parameters are a priori independent and define

$$p(w_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-w_i^2 / 2\sigma^2\}.$$

Both the ML and MAP objective functions are
concave and there are many methods available for maximizing them. Our experience
is that conjugate gradient performs fairly well for logistic regression and relational
Markov nets. However, recent experience with conditional random fields (CRFs)
suggests the L-BFGS method might be somewhat faster [18].
6.5.1 Learning Markov Networks

We first consider discriminative MAP training in the flat setting. In this case $D$
is simply a set of i.i.d. instances; let $d$ index over all labeled training data $D$. The
discriminative likelihood of the data is $\prod_d P_w(y_d \mid x_d)$. We introduce the parameter
prior, and maximize the log of the resulting MAP objective function:

$$L(w, D) = \sum_{d \in D} \big(w^\top f(x_d, y_d) - \log Z(x_d)\big) - \frac{\|w\|_2^2}{2\sigma^2} + C.$$

The gradient of the objective function is computed as

$$\nabla L(w, D) = \sum_{d \in D} \big(f(x_d, y_d) - \mathbb{E}_{P_w}[f(x_d, Y_d)]\big) - \frac{w}{\sigma^2}.$$

The last term is the shrinking effect of the prior and the other two terms are the
difference between the expected feature counts and the empirical feature counts,
where the expectation is taken relative to $P_w$:

$$\mathbb{E}_{P_w}[f(x_d, Y_d)] = \sum_{y'} f(x_d, y'_d)\, P_w(y'_d \mid x_d).$$

Thus, ignoring the effect of the prior, the gradient is zero when empirical and
expected feature counts are equal.² The prior term gives the smoothing we expect
from the prior: small weights are preferred in order to reduce overfitting. Note that
the sum over $y'$ is just over the possible categorizations for one data sample every
time.
2. The solution of ML estimation with log-linear models is also the solution to the dual
problem of maximum entropy estimation with constraints that empirical and expected
feature counts must be equal [4].
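The flat MAP objective and its gradient can be sketched in a few lines. The code below assumes a hypothetical multinomial feature map (one copy of the input per label) and a made-up three-instance data set; it drops the constant $C$ and is meant only to illustrate the "empirical minus expected counts, minus $w / \sigma^2$" structure of the gradient.

```python
import math

LABELS = [0, 1, 2]

def features(x, y):
    """Hypothetical feature map: one copy of x per label (multinomial logistic)."""
    f = [0.0] * (len(LABELS) * len(x))
    for k, x_k in enumerate(x):
        f[y * len(x) + k] = x_k
    return f

def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def posterior(w, x):
    """P_w(y | x) for every label y, normalized by Z(x)."""
    scores = [dot(w, features(x, y)) for y in LABELS]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def objective(w, data, sigma):
    """Sum of log P_w(y_d | x_d) minus ||w||^2 / (2 sigma^2); constant C dropped."""
    ll = sum(math.log(posterior(w, x)[y]) for x, y in data)
    return ll - dot(w, w) / (2 * sigma ** 2)

def gradient(w, data, sigma):
    """Empirical minus expected feature counts, minus w / sigma^2."""
    g = [-w_i / sigma ** 2 for w_i in w]
    for x, y in data:
        p = posterior(w, x)
        empirical = features(x, y)
        for i in range(len(g)):
            g[i] += empirical[i]
        for y_alt in LABELS:             # sum over the labels of ONE instance
            f_alt = features(x, y_alt)
            for i in range(len(g)):
                g[i] -= p[y_alt] * f_alt[i]
    return g

# Made-up training data: (feature vector, label) pairs.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
w = [0.1 * i for i in range(6)]
g = gradient(w, data, sigma=0.5)
```

A gradient of this form can be handed to any off-the-shelf optimizer; the text mentions conjugate gradient and L-BFGS.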

6.5.2 Learning RMNs

The analysis for the relational setting is very similar. Now, our data set $D$ is actually
a single instantiation $I$, where the same parameters are used multiple times, once
for each different entity that uses a feature. A particular choice of parameters $w$
specifies a particular RMN, which induces a probability distribution $P_w$ over the
unrolled Markov network. The product of the likelihood of $I$ and the parameter
prior defines our objective function, whose gradient $\nabla L(w, I)$ again consists of the
empirical feature counts minus the expected feature counts and a smoothing term
due to the prior:

$$f(I.y, I.x, I.r) - \mathbb{E}_{P_w}[f(I.Y, I.x, I.r)] - \frac{w}{\sigma^2},$$

where the expectation $\mathbb{E}_{P_w}[f(I.Y, I.x, I.r)]$ is

$$\sum_{I.y'} f(I.y', I.x, I.r)\, P_w(I.y' \mid I.x, I.r).$$

This last formula reveals a key difference between the relational and the flat
case: the sum over $I.y'$ involves the exponential number of assignments to all the
label attributes in the instantiation. In the flat case, the probability decomposes
as a product of probabilities for individual data instances, so we can compute the
expected feature count for each instance separately. In the relational case, these
labels are correlated; indeed, this correlation was our main goal in defining this
model. Hence, we need to compute the expectation over the joint assignments to all
the entities together. Computing these expectations over an exponentially large set
is the expensive step in calculating the gradient. It requires that we run inference
on the unrolled Markov network.
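The cost of the joint sum is easy to see on a toy example. The sketch below computes the expected count of a single "linked pages share a label" feature by enumerating every joint labeling of a hypothetical three-page link graph. For simplicity it has no content attributes and only one shared weight `w`; both simplifications are assumptions of this illustration.

```python
import itertools
import math

LABELS = ["faculty", "student"]
pages = ["p1", "p2", "p3"]
links = [("p1", "p2"), ("p2", "p3"), ("p1", "p3")]   # hypothetical link graph

def link_feature_count(labeling):
    """f_C: number of links whose two endpoint pages share a label."""
    return sum(1.0 for a, b in links if labeling[a] == labeling[b])

def expected_count(w):
    """E_w[f_C] by enumerating every joint labeling (exponential in #pages)."""
    labelings = [
        dict(zip(pages, ys))
        for ys in itertools.product(LABELS, repeat=len(pages))
    ]
    scores = [math.exp(w * link_feature_count(y)) for y in labelings]
    z = sum(scores)                                   # partition function
    expectation = sum(
        s * link_feature_count(y) for s, y in zip(scores, labelings)
    ) / z
    return expectation, len(labelings)

expectation, n_labelings = expected_count(w=1.0)
```

With $n$ pages and $k$ labels the enumeration visits $k^n$ labelings, so this approach is only usable for tiny instantiations.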
6.5.3 Inference in Markov Networks

The inference task in our conditional Markov networks is to compute the posterior
distribution over the label variables in the instantiation given the content
variables. Exact algorithms for inference in graphical models can execute this process
efficiently for specific graph topologies such as sequences, trees, and other low
treewidth graphs. However, the networks resulting from domains such as our hypertext
classification task are very large (in our experiments, they contain tens
of thousands of nodes) and densely connected. Exact inference is completely intractable
in these cases.
We therefore resort to approximate inference. There is a wide variety of approximation
schemes for Markov networks, including sampling and variational methods.
We chose to use belief propagation (BP) for its simplicity and relative efficiency and
accuracy. BP is a local message passing algorithm introduced by Pearl [17] and
later related to turbo-coding by McEliece et al. [14]. It is guaranteed to converge to
the correct marginal probabilities for each node only for singly connected Markov
networks. Empirical results [15] show that it often converges in general networks,
and when it does, the marginals are a good approximation to the correct posteriors.
As our results in section 6.6 show, this approach works well in our domain. We refer
the reader to chapter 2 in this book for a detailed description of the BP algorithm.
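For readers who want a concrete picture before turning to chapter 2, here is a minimal sketch of sum-product message passing on a pairwise Markov network. It uses synchronous updates and a fixed iteration count, which are choices made for this illustration rather than details taken from the experiments; the three-node chain and its potentials are made up.

```python
def belief_propagation(nodes, edges, node_pot, edge_pot, n_states, iters=50):
    """Sum-product message passing on a pairwise Markov network.

    node_pot[i][s] and edge_pot[(i, j)][s_i][s_j] are positive potentials.
    Returns approximate marginals (exact when the graph is a tree).
    """
    neighbors = {i: [] for i in nodes}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def pot(i, j, s_i, s_j):
        # Each undirected edge is stored once; look it up in either direction.
        if (i, j) in edge_pot:
            return edge_pot[(i, j)][s_i][s_j]
        return edge_pot[(j, i)][s_j][s_i]

    # msg[(i, j)][s_j]: message from node i to node j, initialized uniform.
    msg = {(i, j): [1.0 / n_states] * n_states
           for i in nodes for j in neighbors[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            out = []
            for s_j in range(n_states):
                total = 0.0
                for s_i in range(n_states):
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:       # exclude the message coming from j
                            incoming *= msg[(k, i)][s_i]
                    total += node_pot[i][s_i] * pot(i, j, s_i, s_j) * incoming
                out.append(total)
            norm = sum(out)
            new[(i, j)] = [o / norm for o in out]
        msg = new                        # synchronous update

    beliefs = {}
    for i in nodes:
        b = [node_pot[i][s] for s in range(n_states)]
        for k in neighbors[i]:
            b = [b[s] * msg[(k, i)][s] for s in range(n_states)]
        norm = sum(b)
        beliefs[i] = [v / norm for v in b]
    return beliefs

# Made-up chain 0 - 1 - 2 with a potential that favors equal neighbors.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]
node_pot = {0: [1.0, 2.0], 1: [1.0, 1.0], 2: [3.0, 1.0]}
same = [[2.0, 1.0], [1.0, 2.0]]
edge_pot = {(0, 1): same, (1, 2): same}
beliefs = belief_propagation(nodes, edges, node_pot, edge_pot, n_states=2)
```

On this chain the beliefs match the exact marginals; on a graph with loops the same code still runs, but convergence and accuracy are no longer guaranteed.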

6.6 Experimental Results

We present experiments with collective classification and link prediction, in both
hypertext and social network data.

6.6.1 Experiments on WebKB

We experimented with our framework on the WebKB data set [3], which is an
instance of our hypertext example. The data set contains webpages from four different
computer science departments: Cornell, Texas, Washington, and Wisconsin.
Each page has a label attribute, representing the type of webpage, which is one of
course, faculty, student, project, or other. The data set is problematic in that the
category other is a grab bag of pages of many different types. The number of pages
classified as other is quite large, so that a baseline algorithm that simply always
selected other as the label would get an average accuracy of 75%. We could restrict
attention to just the pages with the four other labels, but in a relational classification
setting, the deleted webpages might be useful in terms of their interactions
with other webpages. Hence, we compromised by eliminating all other pages with
fewer than three outlinks, making the number of other pages commensurate with
the other categories.³ For each page, we have access to the entire HTML of the
page and the links to other pages. Our goal is to collectively classify webpages into
one of these five categories. In all of our experiments, we learn a model from three
schools and test the performance of the learned model on the remaining school,
thus evaluating the generalization performance of the different models.
Unfortunately, we cannot directly compare our accuracy results with previous
work because different papers use different subsets of the data and different
training/test splits. However, we compare to standard text classifiers such as naive Bayes,
logistic regression, and support vector machines, which have been demonstrated to
be successful on this data set [9].

3. The resulting category distribution is: course (237), faculty (148), other (332), research
project (82), and student (542). The number of remaining pages for each school is: Cornell
(280), Texas (292), Washington (315), and Wisconsin (454). The number of links for each
school is: Cornell (574), Texas (574), Washington (728), and Wisconsin (1614).

Figure 6.2 (a) Comparison of Naive Bayes, Svm, and Logistic on WebKB, with
and without metadata features (series: Words, Words+Meta; vertical axis: test error).
(Only averages over the four schools are shown here.) (b) Flat versus collective
classification on WebKB: flat logistic regression with metadata, and three different
relational models: Link, Section, and a combined Section+Link (vertical axis: test
error; horizontal axis: Cor, Tex, Wash, Wisc, AVG). Collectively classifying page labels
(Link, Section, Section+Link) consistently reduces the error over the flat model
(logistic regression) on all schools, for all three relational models.

6.6.1.1 Flat Models

The simplest approach we tried predicts the categories based on just the text content
on the webpage. The text of the webpage is represented using a set of binary
attributes that indicate the presence of different words on the page. We found that
stemming and feature selection did not provide much benefit and simply pruned
words that appeared in fewer than three documents in each of the three schools
in the training data. We also experimented with incorporating metadata: words
appearing in the title of the page, in anchors of links to the page, and in the
last header before a link to the page [24]. Note that metadata, although mostly
originating from pages linking into the considered page, are easily incorporated as
features, i.e., the resulting classification task is still flat feature-based classification.
Our first experimental setup compares three well-known text classifiers, Naive
Bayes, linear support vector machines⁴ (Svm), and logistic regression (Logistic),
using words and metawords. The results, shown in figure 6.2(a), show that the
two discriminative approaches outperform Naive Bayes. Logistic and Svm give very
similar results. The average error over the four schools was reduced by around 4%
by introducing the metadata attributes.

4. We trained one-against-others SVM for each category and during testing, picked the
category with the largest margin.

6.6.1.2 Relational Models

Incorporating metadata gives a significant improvement, but we can take additional
advantage of the correlation in labels of related pages by classifying them collectively.
We want to capture these correlations in our model and use them for transmitting
information between linked pages to provide more accurate classification.
We experimented with several relational models. Recall that logistic regression is
simply a flat conditional Markov network. All of our relational Markov networks
use a logistic regression model locally for each page.
Our first model captures direct correlations between labels of linked pages. These
correlations are very common in our data: courses and research projects almost
never link to each other; faculty rarely link to each other; students have links to
all categories but mostly to courses. The Link model, shown in figure 6.1, captures
this correlation through links: in addition to the local bag of words and metadata
attributes, we introduce a relational clique template over the labels of two pages
that are linked.
A second relational model uses the insight that a webpage often has internal
structure that allows it to be broken up into sections. For example, a faculty
webpage might have one section that discusses research, with a list of links to
all of the projects of the faculty member; a second section might contain links to
the courses taught by the faculty member, and a third to his advisees. This pattern
is illustrated in figure 6.3. We can view a section of a webpage as a fine-grained
version of Kleinberg's hub [10] (a page that contains a lot of links to pages of a
particular category). Intuitively, if we have links to two pages in the same section,
they are likely to be on similar topics. To take advantage of this trend, we need
to enrich our schema with a new relation Section, with attributes Key, Doc (the
document in which it appears), and Category. We also need to add the attribute
Section to Link to refer to the section it appears in. In the RMN, we have two new
relational clique templates. The first contains the label of a section and the label
of the page it is on:
SELECT doc.Category, sec.Category
FROM Doc doc, Section sec
WHERE sec.Doc = doc.Key
The second clique template involves the label of the section containing the link and
the label of the target page.
SELECT sec.Category, doc.Category
FROM Section sec, Link link, Doc doc
WHERE link.Section = sec.Key and link.To = doc.Key
The original data set did not contain section labels, so we introduced them using
the following simple procedure. We defined a section as a sequence of three or more
links that have the same path to the root in the HTML parse tree. In the training
set, a section is labeled with the most frequent category of its links. There is a sixth
category, none, assigned when the two most frequent categories of the links are less
than a factor of 2 apart. In the entire data set, the breakdown of labels for the
sections we found is: course (40), faculty (24), other (187), research project (11),
student (71), and none (17). Note that these labels are hidden in the test data, so
the learning algorithm now also has to learn to predict section labels. Although not
our final aim, correct prediction of section labels is very helpful. Words appearing
in the last header before the section are used to better predict the section label by
introducing a clique over these words and section labels.

Figure 6.3 An illustration of the Section model.
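The section-labeling rule described above can be written down directly. The sketch below is one reading of it: the function name is hypothetical, and treating a count ratio of exactly 2 as "not less than a factor of 2 apart" is an assumption, since the text does not say how the boundary case was handled.

```python
from collections import Counter

def label_section(link_categories):
    """Label a section from the categories of the pages its links point to.

    Returns the most frequent category, or "none" when the two most frequent
    categories are less than a factor of 2 apart. Assumes a non-empty list
    (sections are defined as three or more links).
    """
    top = Counter(link_categories).most_common(2)
    if len(top) == 2 and top[0][1] < 2 * top[1][1]:
        return "none"
    return top[0][0]

# Illustrative inputs (not taken from the data set).
clear_majority = label_section(["course", "course", "course", "student"])
too_close = label_section(["course", "course", "course", "student", "student"])
```

Here `clear_majority` is "course" (3 versus 1) and `too_close` is "none" (3 versus 2).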
We compared the performance of Link, Section, and Section+Link (a combined
model which uses both types of cliques) on the task of predicting webpage labels,
relative to the baseline of flat logistic regression with metadata. Our experiments
used MAP estimation with a Gaussian prior on the feature weights with standard
deviation of 0.3. Figure 6.2(b) compares the average error achieved by the different
models on the four schools, training on three and testing on the fourth. We see
that incorporating any type of relational information consistently gives significant
improvement over the baseline model. The Link model incorporates more relational
interactions, but each is a weaker indicator. The Section model ignores links outside
of coherent sections, but each of the links it includes is a very strong indicator. In
general, we see that the Section model performs slightly better. The joint model
is able to combine benefits from both and generally outperforms all of the other
models. The only exception is for the task of classifying the Wisconsin data. In
this case, the joint Section+Link model contains many links, as well as some large
tightly connected loops, so belief propagation did not converge for a subset of nodes.
Hence, the results of the inference, which was stopped at a fixed arbitrary number
of iterations, were highly variable and resulted in lower accuracy.
6.6.1.3 Discriminative vs. Generative

Our last experiment illustrates the benefits of discriminative training in relational
classification. We compared three models. The Exists+Naive Bayes model is a completely
generative model proposed by Getoor et al. [8]. At each page, a naive Bayes
model generates the words on a page given the page label. A separate generative
model specifies a probability over the existence of links between pages conditioned
on both pages' labels. We can also consider an alternative Exists+Logistic model that
uses a discriminative model for the connection between page label and words,
i.e., uses logistic regression for the conditional probability distribution of page label
given words. This model has equivalent expressive power to the naive Bayes model
but is discriminatively rather than generatively trained. Finally, the Link model is
a fully discriminative (undirected) variant we have presented earlier, which uses a
discriminative model for the label given both words and link existence. The results,
shown in figure 6.4, show that discriminative training provides a significant improvement
in accuracy: the Link model outperforms Exists+Logistic, which in turn
outperforms Exists+Naive Bayes.

Figure 6.4 Comparison of generative and discriminative relational models (vertical
axis: test error; horizontal axis: Cor, Tex, Wash, Wisc, AVG). Exists+Naive Bayes is
completely generative. Exists+Logistic is generative in the links, but locally
discriminative in the page labels given the local features (words, metawords). The
Link model is completely discriminative.

As illustrated in table 6.1, the gain in accuracy comes at some cost in training
time: for the generative models, parameter estimation is closed form while the
discriminative models are trained using conjugate gradient, where each iteration
requires inference over the unrolled RMN. On the other hand, both types of
models require inference when the model is used on new data; the generative
model constructs a much larger, fully connected network, resulting in significantly
longer testing times. We also note that the situation changes if some of the data
is unobserved in the training set. In this case, generative training also requires an
iterative procedure (such as the expectation maximization (EM) algorithm) where
each iteration uses the significantly more expensive inference.
6.6.2 Experiments on extended WebKB

We collected and manually labeled a new relational data set inspired by WebKB [3].
Our data set consists of computer science department webpages from three schools:
Stanford, Berkeley, and MIT. A total of 2954 pages are labeled into one of eight
categories: faculty, student, research scientist, staff, research group, research project,
course, and organization (organization refers to any large entity that is not a
research group). Owned pages, which are owned by an entity but are not the main
page for that entity, were manually assigned to that entity. The average distribution
of classes across schools is: organization (9%), student (40%), research group (8%),
faculty (11%), course (16%), research project (7%), research scientist (5%), and
staff (3%).

Table 6.1 Average train/test running times (seconds). All runs were done on a
700 MHz Pentium III. Training times are averaged over four runs on three schools
each. Testing times are averaged over four runs on one school each.

            Links   Links+Section   Exists+NB
Training    1530    6060            1
Testing     7       10              100

We established a set of candidate links between entities based on evidence of a
relation between them. One type of evidence for a relation is a hyperlink from an
entity page or one of its owned pages to the page of another entity. A second type
of evidence is a virtual link: we assigned a number of aliases to each page using
the page title, the anchor text of incoming links, and email addresses of the entity
involved. Mentioning an alias of a page on another page constitutes a virtual link.
The resulting set of 7161 candidate links was labeled as corresponding to one of
five relation types, advisor (faculty, student), member (research group/project,
student/faculty/research scientist), teach (faculty/research scientist/staff, course),
TA (student, course), and part-of (research group, research project), or none,
denoting that the link does not correspond to any of these relations.
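As a rough illustration of how virtual links might be detected, the sketch below flags a candidate link whenever the text of one page contains an alias of another. The function name, the data layout, the example names, and the use of case-insensitive substring matching are all assumptions of this sketch; the text does not describe the actual matching procedure.

```python
def virtual_links(page_text, aliases):
    """Return (source, target) pairs where source's text mentions an alias of target."""
    found = []
    for source, text in page_text.items():
        lowered = text.lower()
        for target, names in aliases.items():
            if target == source:
                continue                 # a page mentioning itself is not a link
            if any(name.lower() in lowered for name in names):
                found.append((source, target))
    return found

# Made-up pages and aliases (title, anchor text, email address).
page_text = {
    "alice": "I work with Prof. Bob Smith on relational models.",
    "bob": "Home page of Bob Smith.",
}
aliases = {
    "alice": ["Alice Jones"],
    "bob": ["Bob Smith", "bsmith@cs.example.edu"],
}
candidates = virtual_links(page_text, aliases)
```

On this input the only candidate is the pair ("alice", "bob").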
The observed attributes for each page are the words on the page itself and the
metawords on the page: the words in the title, section headings, and anchors to the
page from other pages. For links, the observed attributes are the anchor text, text
just before the link (hyperlink or virtual link), and the heading of the section in
which the link appears.
Our task is to predict the relation type, if any, for all the candidate links. We
tried two settings for our experiments: with page categories observed (in the test
data) and page categories unobserved. For all our experiments, we trained on two
schools and tested on the remaining school.
Observed entity labels. We first present results for the setting with observed
page categories. Given the page labels, we can rule out many impossible relations;
the resulting label breakdown among the candidate links is: none (38%), member
(34%), part-of (4%), advisor (11%), teach (9%), TA (5%).
There is a huge range of possible models that one can apply to this task. We
selected a set of models that we felt represented some range of patterns that
manifested in the data.
Link-Flat is our baseline model, predicting links one at a time using multinomial
logistic regression. This is a strong classifier, and its performance is competitive
with other classifiers (e.g., support vector machines). The features used by this
model are the labels of the two linked pages and the words on the links going from
one page and its owned pages to the other page. The number of features is around
1000.

Figure 6.5 (a) Relation prediction with entity labels given (series: Flat, Triad,
Section, Section & Triad; vertical axis: accuracy; horizontal axis: ber, mit, sta, ave).
Relational models on average performed better than the baseline Flat model.
(b) Entity label prediction (series: Flat, Neigh; same axes). Relational model Neigh
performed significantly better.

The relational models try to improve upon the baseline model by modeling the
interactions between relations and predicting relations jointly. The Section model
introduces cliques over relations whose links appear consecutively in a section on a
page. This model tries to capture the pattern that similarly related entities (e.g.,
advisees, members of projects) are often listed together on a webpage. This pattern
is a type of similarity template, as described in section 6.3. The Triad model is a
type of transitivity template, as discussed in section 6.3. Specifically, we introduce
cliques over sets of three candidate links that form a triangle in the link graph. The
Section & Triad model includes the cliques of the two models above.
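
The triangles used by the Triad model can be enumerated directly from the candidate links. The following is a minimal sketch, assuming candidate links are given as undirected pairs of page identifiers; the function name `find_triads` and the toy page names are illustrative and do not come from the original experiments.

```python
def find_triads(links):
    """Return every set of three candidate links that forms a triangle.

    `links` is an iterable of (page_a, page_b) pairs, treated as undirected.
    Each triad is a frozenset of three links, each link a frozenset of two pages.
    """
    edges = {frozenset(pair) for pair in links if pair[0] != pair[1]}
    neighbors = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    triads = set()
    for edge in edges:
        a, b = tuple(edge)
        # Any page linked to both endpoints closes a triangle.
        for c in neighbors[a] & neighbors[b]:
            triads.add(frozenset({frozenset({a, b}),
                                  frozenset({a, c}),
                                  frozenset({b, c})}))
    return triads


# Toy link graph (invented for illustration).
candidate_links = [("student", "advisor"), ("student", "course"),
                   ("advisor", "course"), ("student", "group")]
triads = find_triads(candidate_links)
```

Each returned triad would correspond to one clique over three link variables; the link to `group` takes part in no triangle and so receives no Triad clique.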
As shown in figure 6.5(a), both the Section and Triad models outperform the flat
model, and the combined model has an average accuracy gain of 2.26%, or 10.5%
relative reduction in error. As we only have three runs (one for each school), we
cannot meaningfully analyze the statistical significance of this improvement.
As an example of the interesting inferences made by the models, we found a
student-professor pair that was misclassified by the Flat model as none (there is only
a single hyperlink from the student's page to the advisor's) but correctly identified
by both the Section and Triad models. The Section model utilizes a paragraph on the
student's webpage describing his or her research, with a section of links to research
groups and the link to his or her advisor. Examining the parameters of the Section
model clique, we found that the model learned that it is likely for people to mention
their research groups and advisors in the same section. By capturing this trend, the
Section model is able to increase the confidence of the student-advisor relation. The
Triad model corrects the same misclassification in a different way. Using the same
example, the Triad model makes use of the information that both the student and
the teacher belong to the same research group, and the student TAed a class taught
by his advisor. It is important to note that none of the other relations are observed
in the test data, but rather the model bootstraps its inferences.

Figure 6.6 Relation prediction without entity labels: precision-recall breakeven point of the Phased (Flat/Flat), Phased (Neigh/Flat), Phased (Neigh/Sec), Joint+Neigh, and Joint+Neigh+Sec models on each school (ber, mit, sta) and on average (ave). Relational models performed better most of the time, even though there are schools in which some models performed worse.

Unobserved entity labels When the labels of pages are not known during
relation prediction, we cannot rule out possible relations for candidate links based
on the labels of participating entities. Thus, we have many more candidate links that
do not correspond to any of our relation types (e.g., links between an organization
and a student). This makes the existence of relations a very low-probability event,
with the following breakdown among the potential relations: none (71%), member
(16%), part-of (2%), advisor (5%), teach (4%), TA (2%). In addition, when we
construct a Markov network in which page labels are not observed, the network
is much larger and denser, making the (approximate) inference task much harder.
Thus, in addition to models that try to predict page entity and relation labels
simultaneously, we also tried a two-phase approach, where we first predict page
categories, and then use the predicted labels as features for the model that predicts
relations.
For predicting page categories, we compared two models. The Entity-Flat model
is a multinomial logistic regression that uses words and metawords from the page
and its owned pages in separate bags of words. The number of features is roughly
10,000. The Neighbors model is a relational model that exploits another type of
similarity template: pages with similar URLs often belong to the same category or
tightly linked categories (research group/project, professor/course). For each page,
two pages with URLs closest in edit distance are selected as neighbors, and we
introduced pairwise cliques between neighboring pages. Figure 6.5(b) shows that
the Neighbors model clearly outperforms the Flat model across all schools, by an
average of 4.9% accuracy gain.
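
The neighbor selection used by the Neighbors model can be sketched as follows. This is only an illustration: the text says "edit distance" without naming a variant, so Levenshtein distance is assumed here, ties are broken alphabetically, and the function names and URLs are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete a character of s
                            curr[j - 1] + 1,      # insert a character of t
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def url_neighbors(urls, k=2):
    """Map each URL to its k closest other URLs by edit distance."""
    result = {}
    for u in urls:
        others = sorted((v for v in urls if v != u),
                        key=lambda v: (edit_distance(u, v), v))
        result[u] = others[:k]
    return result


# Hypothetical URLs, invented for illustration.
pages = ["/cs/people/ann", "/cs/people/bob", "/cs/courses/cs101", "/math/seminar"]
neighbors = url_neighbors(pages, k=2)
```

A pairwise clique would then be introduced between each page and each of its selected neighbors.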

Figure 6.7 (a) Average precision-recall breakeven point of the flat and compatibility models for 10%, 25%, 50% observed links. (b) Average precision-recall breakeven point of the flat and compatibility models for each fold of school residences (DD, JL, TX, 67, FG, LM, BC, SS) at 25% observed links.

Given the page categories, we can now apply the different models for link
classification. Thus, the Phased (Flat/Flat) model uses the Entity-Flat model to
classify the page labels, and then the Link-Flat model to classify the candidate
links using the resulting entity labels. The Phased (Neighbors/Flat) model uses the
Neighbors model to classify the entity labels, and then the Link-Flat model to classify
the links. The Phased (Neighbors/Section) model uses the Neighbors model to classify the
entity labels and then the Section model to classify the links.
We also tried two models that predict page and relation labels simultaneously.
The Joint + Neighbors model is simply the union of the Neighbors model for page
categories and the Flat model for relation labels given the page categories. The Joint
+ Neighbors + Section model additionally introduces the cliques that appeared in
the Section model between links that appear consecutively in a section on a page.
We train the joint models to predict both page and relation labels simultaneously.
As the proportion of the none relation is so large, we use the probability
of none to define a precision-recall curve. If this probability is less than some
threshold, we predict the most likely label (other than none); otherwise we predict
the most likely label (including none). As usual, we report results at the
precision-recall breakeven point on the test data. Figure 6.6 shows the breakeven points
achieved by the different models on the three schools. Relational models, both
phased and joint, did better than flat models on the average. However, performance
varies from school to school and for both joint and phased models, performance on
one of the schools is worse than that of the flat model.
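
The thresholding rule on the probability of none can be sketched as below. The prediction rule follows the text; how precision and recall are counted is an assumption here (a prediction counts as positive when it is not none, and as correct when it matches the true relation), and the toy probabilities are invented.

```python
def predict_label(probs, threshold):
    """Apply the rule from the text; `probs` maps label -> probability, including 'none'."""
    if probs["none"] < threshold:
        candidates = {k: v for k, v in probs.items() if k != "none"}
    else:
        candidates = probs
    return max(candidates, key=candidates.get)


def precision_recall(examples, threshold):
    """Precision and recall over non-none relations (assumed definition)."""
    predicted = correct = actual = 0
    for probs, truth in examples:
        guess = predict_label(probs, threshold)
        if truth != "none":
            actual += 1
        if guess != "none":
            predicted += 1
            if guess == truth:
                correct += 1
    precision = correct / predicted if predicted else 1.0
    recall = correct / actual if actual else 1.0
    return precision, recall


def breakeven(examples, thresholds):
    """Return (threshold, precision, recall) minimizing |precision - recall|."""
    scored = [(t,) + precision_recall(examples, t) for t in thresholds]
    return min(scored, key=lambda row: abs(row[1] - row[2]))


# Toy predictions (invented): (label probabilities, true label).
examples = [
    ({"none": 0.20, "advisor": 0.70, "member": 0.10}, "advisor"),
    ({"none": 0.60, "advisor": 0.10, "member": 0.30}, "member"),
    ({"none": 0.90, "advisor": 0.05, "member": 0.05}, "none"),
    ({"none": 0.55, "advisor": 0.40, "member": 0.05}, "none"),
]
point = breakeven(examples, [0.1, 0.5, 0.58, 0.7, 0.95])
```

Raising the threshold makes the predictor commit to a relation more often, which trades precision for recall; the breakeven point is the setting at which the two coincide (or come closest on the grid of thresholds tried).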
6.6.3 Social Network Data

The data set we used has been collected by a portal website at a large university that
hosts an online community for students [1]. Among other services, it allows students
to enter information about themselves, create lists of their friends, and browse the
social network. Personal information includes residence, gender, major, and year, as
well as favorite sports, music, books, social activities, etc. We focused on the task of
predicting the friendship links between students from their personal information
and a subset of their links. We selected students living in sixteen different residences
or dorms and restricted the data to the friendship links only within each residence,
eliminating interresidence links from the data to generate independent training/test
splits. Each residence has about fifteen to twenty-five students and an average
student lists about 25% of his or her housemates as friends.
We used an eight-fold train-test split, where we trained on fourteen residences and
tested on two. Predicting links between two students from just personal information
alone is a very difficult task, so we tried a more realistic setting, where some
proportion of the links is observed in the test data, and can be used as evidence for
predicting the remaining links. We used the following proportions of observed links
in the test data: 10%, 25%, and 50%. The observed links were selected at random,
and the results we report are averaged over five folds of these random selection
trials.
Using just the observed portion of links, we constructed the following flat features:
for each student, the proportion of students in the residence that list him/her and
the proportion of students he/she lists; for each pair of students, the proportion of
other students they have as common friends. The values of the proportions were
discretized into four bins. These features capture some of the relational structure
and dependencies between links: Students who list (or are listed by) many friends
in the observed portion of the links tend to have links in the unobserved portion as
well. More importantly, having friends in common increases the likelihood of a link
between a pair of students.
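
The flat features described above can be computed from the observed links alone. The sketch below makes several assumptions that the text does not specify: proportions are taken over the other students in the residence, a "common friend" is a third student whom both members of the pair list, and the four bins are of equal width. All names and the toy residence are hypothetical.

```python
def discretize(p, bins=4):
    """Map a proportion in [0, 1] to one of `bins` equal-width bins (assumed binning)."""
    return min(int(p * bins), bins - 1)


def flat_features(students, observed):
    """Compute binned link features for one residence with at least three students.

    `observed` is a set of directed (lister, listed) pairs among `students`.
    Returns per-student (listed-by bin, lists bin) and per-pair common-friend bin.
    """
    n_others = len(students) - 1
    lists = {s: {b for (a, b) in observed if a == s} for s in students}
    listed_by = {s: {a for (a, b) in observed if b == s} for s in students}

    student_feats = {
        s: (discretize(len(listed_by[s]) / n_others),
            discretize(len(lists[s]) / n_others))
        for s in students
    }
    pair_feats = {}
    for i, s in enumerate(students):
        for t in students[i + 1:]:
            common = (lists[s] & lists[t]) - {s, t}
            pair_feats[(s, t)] = discretize(len(common) / (len(students) - 2))
    return student_feats, pair_feats


# Toy residence with four students (invented for illustration).
students = ["a", "b", "c", "d"]
observed = {("a", "c"), ("b", "c"), ("a", "d"), ("c", "a")}
student_feats, pair_feats = flat_features(students, observed)
```

In the Flat model these binned values would be fed to logistic regression together with the personal attributes and the attribute-match indicators.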
The Flat model uses logistic regression with the above features as well as personal
information about each user. In addition to the individual characteristics of the two
people, we also introduced a feature for each match of a characteristic; for example,
both people are computer science majors or both are freshmen.
The Compatibility model uses a type of similarity template, introducing cliques
between each pair of links emanating from each person. Similarly to the Flat model,
these cliques include a feature for each match of the characteristics of the two
potential friends. This model captures the tendency of a person to have friends
who share many characteristics (even though the person might not possess them).
For example, a student may be friends with several computer science majors, even
though he is not a CS major himself. We also tried models that used transitivity
templates, but the approximate inference with 3-cliques often failed to converge or
produced erratic results.
Figure 6.7(a) compares the average precision-recall breakeven point achieved by the
different models at the three different settings of observed links. Figure 6.7(b) shows
the performance on each of the eight folds containing two residences each. Using
a paired t-test, the Compatibility model outperforms Flat with p-values 0.0036,
0.00064, and 0.054 respectively.

6.7 Discussion and Conclusions


We propose an approach for collective classification and link prediction in relational
domains. Our approach provides a coherent probabilistic foundation for the process
of collective prediction, where we want to classify multiple entities and links,
exploiting the interactions between the variables. We have shown that we can
exploit a very rich set of relational patterns in classification, significantly improving
the classification accuracy over standard flat classification.
We show that the use of a probabilistic model over link graphs allows us to
represent and exploit interesting subgraph patterns in the link graph. Specifically,
we have found two types of patterns that seem to be beneficial in several places.
Similarity templates relate the classification of links or objects that share a certain
graph-based property (e.g., links that share a common endpoint). Transitivity
templates relate triples of objects and links organized in a triangle.
Our results use a set of relational patterns that we have discovered to be useful
in the domains that we have considered. However, many other rich and interesting
patterns are possible. Thus, in the relational setting, even more so than in simpler
tasks, the issue of feature construction is critical. It is therefore important to explore
the problem of automatic feature induction, as in [4].
Finally, we believe that the problem of modeling link graphs has numerous
other applications, including analyzing communities of people and the hierarchical
structure of organizations, identifying people or objects that play certain key roles,
predicting current and future interactions, and more.

References
[1] L. Adamic, O. Buyukkokten, and E. Adar. A social network caught in the web.
http://www.hpl.hp.com/shl/papers/social/, 2002.
[2] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization
using hyperlinks. In Proceedings of ACM International Conference on Management of Data, 1998.
[3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam,
and S. Slattery. Learning to extract symbolic knowledge from the World Wide
Web. In Proceedings of the National Conference on Artificial Intelligence, 1998.
[4] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random
fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
19(4):380-393, 1997.
[5] L. Egghe and R. Rousseau. Introduction to Informetrics. Elsevier, Amsterdam,
1990.
[6] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[7] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic
models of relational structure. In Proceedings of the International Conference
on Machine Learning, 2001.
[8] L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic models of text
and link structure for hypertext classification. In Proceedings of the IJCAI'01
Workshop on Text Learning: Beyond Supervision, 2001.
[9] T. Joachims. Transductive inference for text classification using support vector
machines. In Proceedings of the International Conference on Machine Learning,
1999.
[10] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal
of the ACM, 46(5):604-632, 1999.
[11] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings
of the National Conference on Artificial Intelligence, 1998.
[12] F. Kschischang and B. Frey. Iterative decoding of compound codes by
probability propagation in graphical models. IEEE Journal on Selected Areas
in Communications, 16(2):219-230, 1998.
[13] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In Proceedings of
the International Conference on Machine Learning, 2001.
[14] R. McEliece, D. MacKay, and J. Cheng. Turbo decoding as an instance of
Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in
Communications, 16(2):140-152, 1998.
[15] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for
approximate inference: an empirical study. In Proceedings of the Conference
on Uncertainty in Artificial Intelligence, 1999.
[16] J. Neville and D. Jensen. Iterative classification in relational data. In
Proceedings of the AAAI-2000 Workshop on Learning Statistical Models from
Relational Data, 2000.
[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann,
San Francisco, 1988.
[18] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In
Proceedings of Human Language Technology Conference and North American
Chapter of the Association for Computational Linguistics, 2003.
[19] S. Slattery and T. Mitchell. Discovering test set regularities in relational
domains. In Proceedings of the International Conference on Machine Learning,
2000.
[20] B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering
in relational data. In Proceedings of the International Joint Conference on
Artificial Intelligence, 2001.
[21] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2002.
[22] B. Taskar, M. Wong, P. Abbeel, and D. Koller. Link prediction in relational
data. In Proceedings of Neural Information Processing Systems, 2003.
[23] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New
York, 1995.
[24] Y. Yang, S. Slattery, and R. Ghani. A study of approaches to hypertext
categorization. Journal of Intelligent Information Systems, 18(2):219-241,
2002.

7 Probabilistic Entity-Relationship Models, PRMs, and Plate Models

David Heckerman, Chris Meek, Daphne Koller

In this chapter, we introduce a graphical language for relational data called the
probabilistic entity-relationship (PER) model. The model is an extension of the
entity-relationship model, a common model for the abstract representation of
database structure. We concentrate on the directed version of this model: the
directed acyclic probabilistic entity-relationship (DAPER) model. The DAPER
model is closely related to the plate model and the probabilistic relational model
(PRM), existing models for relational data. The DAPER model is more expressive
than either existing model, and also helps to demonstrate their similarity. In
addition to describing the new language, we discuss important facets of modeling
relational data, including the use of restricted relationships, self relationships, and
probabilistic relationships. Many examples are provided.

7.1 Introduction

For over a century, statistical modeling has focused primarily on flat data: data
that can be encoded naturally in a single two-dimensional table having rows and
columns. The disciplines of pattern recognition, machine learning, and data mining
have had a similar focus. Notable exceptions include hierarchical models (e.g., [11])
and spatial statistics (e.g., [1]). Over the last decade, however, perhaps due to the
ever-increasing volumes of data being stored in databases, the modeling of nonflat
or relational data has increased significantly. During this time, several graphical
languages for relational data have emerged including plate models (e.g., [3, 9]) and
probabilistic relational models (PRMs) (e.g., [5]). These models are to relational
data what ordinary graphical models (e.g., directed acyclic graphs and undirected
graphs) are to flat data.
In this chapter, we introduce a new graphical model for relational data: the
probabilistic entity-relationship (PER) model. This model class is more expressive
than either PRMs or plate models. We concentrate on a particular type of PER
model, the directed acyclic probabilistic entity-relationship (DAPER) model, in
which all probabilistic arcs are directed. It is this version of the PER model that is
most similar to the plate model and the PRM. We define new versions of the plate
model and the PRM such that their expressiveness is equivalent to the DAPER
model, and then compare the new and old definitions. Consequently, we both
demonstrate the similarity among the original languages and enhance their
abilities to express conditional independence in relational data. Our hope is that
this demonstration of similarity will foster greater communication and collaboration
among statisticians who mostly use plate models and computer scientists who
mostly use PRMs.
We in fact began this work with an effort to unify traditional PRMs and plate
models. In the process, we discovered that it was important to distinguish between
the concepts of entity and relationship (discussed in detail in the next section).
We in turn discovered an existing language that does so: the entity-relationship
(ER) model, a commonly used model for the abstract representation of database
structure. We then extended this language to handle probabilistic relationships,
creating the PER model.
We should emphasize that the languages we discuss are neither meant to serve
as a database schema nor meant to be built on top of one. In practice, database
schemata are built up over a long period of time as the needs of the database
consumers change. Consequently, schemata for real databases are often not optimal
or are completely unusable as the basis for statistical modeling. The languages we
describe here are meant to be used as statistical modeling tools, independent of the
schema of the database being modeled.
This work borrows heavily from concepts surrounding PRMs described in, e.g.,
Friedman et al. [5] and Getoor et al. [8]. Where possible, we use similar nomenclature, notation, and examples.

7.2 Background: Graphical Models

As mentioned, we shall concentrate on directed models in this chapter. Accordingly,
we first review (ordinary) directed acyclic models.
A directed acyclic graphical (DAG) model for a finite set of attributes X =
(X_1, ..., X_n) with joint distribution p(x) has two components: (1) a directed
acyclic graph (sometimes referred to as the structure of the model) that encodes
a set of conditional independencies among the attributes, and (2) a collection
of local distributions. The nodes in the directed acyclic graph are in one-to-one
correspondence with the attributes in X. To keep notation simple, we use X_i to
refer to the node corresponding to attribute X_i. Whether X_i refers to an attribute
or node will be clear from the context. The absence of arcs in the directed acyclic
graph encodes probabilistic independencies that allow the joint distribution for X
to be written as

$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_i), \tag{7.1}$$

where pa_i are the attributes corresponding to the parents of node X_i. The local
distributions of the DAG model are the set of conditional probability distributions
p(x_i | pa_i), i = 1, ..., n. Thus, a DAG model for X specifies the joint distribution
for X.
An example DAG model structure for attributes (X, Y, Z, W) is shown in figure 7.1(a).
The structure (i.e., the missing arcs) encodes the independencies: (1) X
and Z are independent given Y, and (2) (Y, Z) and W are independent given X.
We note that DAG models can be interpreted as a generative model for the data. In
our example, we can generate a sample for (X, Y, Z, W) by first sampling X, then
Y and W given X, and finally Z given Y.
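
As a concrete illustration of the factorization in equation (7.1) and of the generative reading, here is a minimal sketch for the example structure (X is a parent of Y and W, and Y is a parent of Z). The attributes are taken to be binary and the numerical local distributions are invented for illustration; the chapter does not give any.

```python
import random

# Hypothetical local distributions: probability that each attribute equals 1.
P_X = 0.6
P_Y_GIVEN_X = {0: 0.2, 1: 0.7}
P_Z_GIVEN_Y = {0: 0.1, 1: 0.8}
P_W_GIVEN_X = {0: 0.5, 1: 0.3}


def bernoulli(p, value):
    """Probability of `value` (0 or 1) when the probability of 1 is p."""
    return p if value == 1 else 1.0 - p


def joint(x, y, z, w):
    """p(x, y, z, w) = p(x) p(y|x) p(z|y) p(w|x): one local distribution per node."""
    return (bernoulli(P_X, x) * bernoulli(P_Y_GIVEN_X[x], y)
            * bernoulli(P_Z_GIVEN_Y[y], z) * bernoulli(P_W_GIVEN_X[x], w))


def sample(rng):
    """Ancestral sampling: X first, then Y and W given X, then Z given Y."""
    x = int(rng.random() < P_X)
    y = int(rng.random() < P_Y_GIVEN_X[x])
    w = int(rng.random() < P_W_GIVEN_X[x])
    z = int(rng.random() < P_Z_GIVEN_Y[y])
    return x, y, z, w


draw = sample(random.Random(0))
```

Because every attribute is sampled only after its parents, any ordering consistent with the arcs would serve equally well.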
As we shall see, when working with relational data, it is often necessary to express
constraints or restrictions among attributes. Such restrictions can be encoded in a
DAG model, which we review here.
As a simple example, suppose we have a generative story for binary (0/1)
attributes X, Y, Z, and W that can be described by the DAG model structure
shown in figure 7.1(a). In addition, suppose we know that at most two of these
attributes take on the value 1. We can add this restriction to the model as shown
in figure 7.1(b). Here, we have added a binary node named R. Associated with this
node (not shown in the figure) is a local distribution wherein R = 1 with probability
1 when at most two of its parents take on value 1, and with probability zero
otherwise. To encode the restriction, we set R = 1. Note that R is a deterministic
attribute. That is, given the parents of R, R is known with certainty. As is commonly
done in the graphical modeling literature, we indicate deterministic nodes with
double ovals (see footnote 1).
Assuming that the restriction always holds (that is, R is always equal to 1), it
is not meaningful to work with the joint distribution p(x, y, z, w, r). Instead, the
appropriate distribution to make inferences with is

$$p(x, y, z, w \mid r = 1) \propto p(x)\, p(y \mid x)\, p(z \mid y)\, p(w \mid x)\, p(r = 1 \mid x, y, z, w). \tag{7.2}$$
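
The effect of conditioning on R = 1 can be checked by brute-force enumeration. The sketch below reuses the example structure with invented local distributions (the numbers are assumptions, not from the chapter) and normalizes the product of the unrestricted joint and the deterministic distribution for R.

```python
from itertools import product

# Hypothetical local distributions: probability that each attribute equals 1.
P_X = 0.6
P_Y_GIVEN_X = {0: 0.2, 1: 0.7}
P_Z_GIVEN_Y = {0: 0.1, 1: 0.8}
P_W_GIVEN_X = {0: 0.5, 1: 0.3}


def bernoulli(p, value):
    return p if value == 1 else 1.0 - p


def joint(x, y, z, w):
    """Unrestricted p(x, y, z, w) = p(x) p(y|x) p(z|y) p(w|x)."""
    return (bernoulli(P_X, x) * bernoulli(P_Y_GIVEN_X[x], y)
            * bernoulli(P_Z_GIVEN_Y[y], z) * bernoulli(P_W_GIVEN_X[x], w))


def restriction(x, y, z, w):
    """p(r = 1 | x, y, z, w): 1 when at most two attributes equal 1, else 0."""
    return 1.0 if x + y + z + w <= 2 else 0.0


def conditional_given_restriction():
    """Return p(x, y, z, w | r = 1) as a dict over all 16 assignments."""
    weights = {v: joint(*v) * restriction(*v) for v in product((0, 1), repeat=4)}
    total = sum(weights.values())  # this normalizer equals p(r = 1)
    return {v: weight / total for v, weight in weights.items()}


posterior = conditional_given_restriction()
```

Assignments that violate the restriction receive probability zero, and the remaining ones keep their relative weights under the generative model.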

Readers familiar with directed factor-graph models [4] will recognize that this
distribution for (X, Y, Z, W) can be encoded by a directed factor-graph model in
which node R is replaced by the factor f(x, y, z, w) = p(r = 1|x, y, z, w). More
generally, the factor-graph model is perhaps a more natural model for situations
having both a generative component and restrictions. In this chapter, however, we
use the DAG representation of restrictions so that we remain within the class of
DAG models and thereby simplify the presentation.

1. DAG models can also be used to encode soft restrictions. For example, if we know that
zero, one, two, three, and four of the attributes X take on the value 1 with probabilities
p0, p1, p2, p3, and p4, respectively, we can encode this soft restriction using the DAG model
structure in figure 7.1(b) where R is no longer deterministic and has the appropriate local
probability distribution.

Figure 7.1 (a) A DAG model. (b) A similar DAG model with an added restriction among the attributes.

7.3 The Basic Ideas

Before we describe languages for the statistical modeling of relational data, we
begin with a description of a language for modeling the data itself. The language we
discuss is the entity-relationship (ER) model, a commonly used abstract
representation of database structure (e.g., [19]). The creation of an ER model is often the
first step in the process of building a relational database. Features of anticipated
data and how they interrelate are encoded in an ER model. The ER model is then
used to create a relational schema for the database, which in turn is used to build
the database itself.
It is important to note that an ER model is a representation of a database
structure, not of a particular database that contains data. That is, an ER model
can be developed prior to the collection of any data, and is meant to anticipate the
data and the relationships therein.
When building ER models, we distinguish between entities, relationships, and
attributes. An entity corresponds to a thing or object that is or may be stored in
a database or data set (see footnote 2); a relationship corresponds to a specific interaction among
entities; and an attribute corresponds to a variable describing some property of
an entity or relationship. Throughout the chapter, we use examples to illustrate
concepts.
Example 7.1
A university database maintains records on students and their IQs, courses and
their difficulty, and the courses taken by students and the grades they receive.

2. In what follows, we make no distinction between a database and a data set.

In this example, we can think of individual students (e.g., john, mary) and
individual courses (e.g., cs107, stat10) as entities (see footnote 3). Naturally, there will be many
students and courses in the database. We refer to the set of students (e.g.,
{john, mary, ...}) as an entity set. The set of courses (e.g., {cs107, stat10, ...}) is
another entity set. Most important, because an ER model can be built before any
data is collected, we need the concept of an entity class: a reference to a set of
entities without a specification of the entities in the set. In our example, the entity
classes are Student and Course.
A relationship is a list of entities. In our example, a possible relationship is the
pair (john, cs107), meaning that john took the course cs107. Using nomenclature
similar to that for entities, we talk about relationship sets and relationship classes.
A relationship set is a collection of like relationships, that is, a collection of
relationships each relating entities from a fixed list of entity classes. In our example,
we have the relationship set of student-course pairs. A relationship class refers to
an unspecified set of like relationships. In our example, we have the relationship
class Takes.
The IQ of john and the difficulty of cs107 are examples of attributes. We use the
term attribute class to refer to an unspecified collection of like attributes. In our
example, Student has the single attribute class Student.IQ and Course has the single
attribute class Course.Diff. Relationships also can have attributes; and relationship
classes can have attribute classes. In our example, Takes has the attribute class
Takes.Grade.
An ER model for the structure of a database graphically depicts entity classes,
relationship classes, attribute classes, and their interconnections. An ER model for
Example 7.1 is shown in figure 7.2(a). The entity classes (Student and Course) are
shown as rectangular nodes; the relationship class (Takes) is shown as a
diamond-shaped node; and the attribute classes (Student.IQ, Course.Diff, and Takes.Grade)
are shown as oval nodes. Attribute classes are connected to their corresponding
entity or relationship class, and the relationship class is connected to its associated
entity classes. (Solid edges are customary in ER models. Here, we use dashed edges
so that we can later use solid edges to denote probabilistic dependencies.)
An ER model describes the potential attributes and relationships in a database. It
says little about actual data. A skeleton for a set of entity and relationship classes is
a specification of the entities and relationships associated with a particular database.
That is, a skeleton for a set of entity and relationship classes is a collection of
corresponding entity and relationship sets. An example skeleton for our university
database example is shown in figure 7.2(b).

Figure 7.2 (a) An ER model depicting the structure of a university database. (b) An example skeleton for the entity and relationship classes in the ER model: students john and mary; courses cs107 and stat10; Takes pairs (john, cs107), (mary, cs107), and (mary, stat10). (c) The attributes defined by the application of the ER model to the skeleton. The attribute names are abbreviated.

3. In a real database, longer names would be needed to define unique students and courses.
We keep the names short in our example to make reading easier.

An ER model applied to a skeleton defines a specific set of attributes. In particular,
for every entity class and every attribute class of that entity class, an attribute
is defined for every entity in the class; and for every relationship class and every
attribute class of that relationship class, an attribute is defined for every relationship
in the class. The attributes defined by the ER model in figure 7.2(a) applied to the
skeleton in figure 7.2(b) are shown in figure 7.2(c). In what follows, we use ER model
to mean both the ER diagram (the graph in figure 7.2(a)) and the mechanism by
which attributes are generated from skeletons.
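
The mechanism by which an ER model and a skeleton generate attributes can be sketched in a few lines. The dictionaries below encode the classes of Example 7.1 and the skeleton of figure 7.2(b); the data layout, the function name `ground_attributes`, and the attribute naming scheme are illustrative choices rather than part of the formalism.

```python
# Attribute classes attached to each entity or relationship class (Example 7.1).
ATTRIBUTE_CLASSES = {"Student": ["IQ"], "Course": ["Diff"], "Takes": ["Grade"]}

# Skeleton: the entity sets and the relationship set, as in figure 7.2(b).
SKELETON = {
    "Student": ["john", "mary"],
    "Course": ["cs107", "stat10"],
    "Takes": [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
}


def ground_attributes(attribute_classes, skeleton):
    """Create one attribute per (entity or relationship, attribute class) pair."""
    attributes = []
    for cls, members in skeleton.items():
        for member in members:
            for attr in attribute_classes.get(cls, []):
                if isinstance(member, tuple):   # a relationship: a list of entities
                    name = f"{cls}({', '.join(member)}).{attr}"
                else:                           # a single entity
                    name = f"{member}.{attr}"
                attributes.append(name)
    return attributes


attributes = ground_attributes(ATTRIBUTE_CLASSES, SKELETON)
```

For this skeleton the function yields seven attributes: two IQs, two difficulties, and three grades, matching the attributes listed for figure 7.2(c).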
A skeleton still says nothing about the values of attributes. An instance for an
ER model consists of (1) a skeleton for the entity and relationship classes in that
model, and (2) an assignment of a value to every attribute generated by the ER
model and the skeleton. That is, an instance of an ER model is an actual database.
Let us now turn to the probabilistic modeling of relational data. To do so, we
introduce a specific type of probabilistic ER model: the DAPER model. Roughly
speaking, a DAPER model is an ER model with directed (solid) arcs among the
attribute classes that represent probabilistic dependencies among corresponding
attributes, and local distribution classes that define local distributions for attributes.
Recall that an ER model applied to a skeleton defines a set of attributes. Similarly,
a DAPER model applied to a skeleton defines a set of attributes as well as
a DAG model for these attributes. Thus, a DAPER model can be thought of as a
language for expressing conditional independence among unrealized attributes that
eventually become realized given a skeleton.
As with the ER diagram and model, we sometimes distinguish between a DAPER
diagram, which consists of the graph only, and the DAPER model, which consists of
the diagram, the local distribution classes, and the mechanism by which a DAPER
model defines a DAG model given a skeleton.
Example 7.2
In the university database (Example 7.1), a student's grade in a course depends
both on the student's IQ and on the difficulty of the course.
The DAPER model (or diagram) for this example is shown in figure 7.3(a). The
model extends the ER model in figure 7.2 with the addition of arc classes and
local distribution classes. In particular, there is an arc class from Student.IQ to
Takes.Grade and an arc class from Course.Diff to Takes.Grade. These arc classes
are denoted as a solid directed arc. A local distribution class for Takes.Grade (not
shown) represents the probabilistic dependence of grade on IQ and difficulty.
Just as we expand attribute classes in a DAPER model to attributes in a
DAG model given a skeleton, we expand arc classes to arcs. In doing so, we
sometimes want to limit the arcs that are added to a DAG model. In the current
problem, for example, we want to draw an arc from attribute c.Diff for course c to
attribute Takes(s, c').Grade for course c' and any student s, only when c = c'. This
limitation is achieved by adding a constraint to the arc class: namely, the constraint
course[Diff] = course[Grade] (see figure 7.3(a)). Here, the terms course[Diff] and
course[Grade] refer to the entities c and c', respectively, that is, the entities associated
with the attributes at the ends of the arc.
The arc class from Student.IQ to Takes.Grade has a similar constraint:
student[IQ] = student[Grade]. This constraint says that we draw an arc from attribute
s.IQ for student s = student[IQ] to Takes(s', c).Grade for student s' = student[Grade]
and any course c only when s = s'. As we shall see, constraints in DAPER models
can be quite expressive; for example, they may include first-order expressions on
entities and relationships.
Figure 7.3(c) shows the DAG (structure) generated by the application of
the DAPER model in gure 7.3(a) to the skeleton in gure 7.3(b). (The attribute names in the DAG model are abbreviated.) The arc from stat10.Di to
Takes(mary,cs107).Grade, e.g., is disallowed by the constraint on the arc class from
Course.Di to Takes.Grade.
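The expansion just described can be sketched in a few lines of Python. This is only an illustrative sketch, not an implementation from the chapter: the function name ground_graph and the string encoding of attributes are our own assumptions, and the skeleton is the one in figure 7.3(b).

```python
# Skeleton from figure 7.3(b): entities plus the Takes relationships.
students = ["john", "mary"]
courses = ["cs107", "stat10"]
takes = [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")]


def ground_graph(students, courses, takes):
    """Expand both arc classes into (tail, head) arcs, applying each constraint.

    Hypothetical helper: attribute names are encoded as plain strings.
    """
    arcs = []
    for s, c in takes:
        grade = f"Takes({s},{c}).Grade"
        # Arc class Course.Diff -> Takes.Grade,
        # constraint course[Diff] = course[Grade].
        for c_tail in courses:
            if c_tail == c:
                arcs.append((f"{c_tail}.Diff", grade))
        # Arc class Student.IQ -> Takes.Grade,
        # constraint student[IQ] = student[Grade].
        for s_tail in students:
            if s_tail == s:
                arcs.append((f"{s_tail}.IQ", grade))
    return arcs


arcs = ground_graph(students, courses, takes)
for tail, head in arcs:
    print(tail, "->", head)
```

Looping over every candidate tail entity and then testing the constraint mirrors the definition directly; an implementation concerned with efficiency would index the tail entity from the relationship tuple instead.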
Regardless of what skeleton we use, the DAG model generated by the DAPER
model in figure 7.3(a) will be acyclic. In general, as we show in section 7.7, if the
attribute classes and arc classes in the DAPER diagram form an acyclic graph,
then the DAG model generated from any skeleton for the DAPER model will be
acyclic. Weaker conditions are also sufficient to guarantee acyclicity. We describe
one in section 7.7.
In general, a local distribution class for an attribute class is a specification from
which local distributions for attributes corresponding to the attribute class can be
constructed, when a DAPER model is expanded to a DAG model. In our example,
the local distribution class for Takes.Grade, written p(Takes.Grade|Student.IQ,
Course.Diff), is a specification from which the local distributions for Takes(s, c).Grade,
for all students s and courses c, can be constructed. In our example, each attribute
Takes(s, c).Grade will have two parents: s.IQ and c.Diff. Consequently, the local
distribution class need only be a single local probability distribution. We discuss
more complex situations in section 7.4.
Whereas most of this chapter concentrates on issues of representation, the
problems of probabilistic inference, learning local distributions, and learning model
structure are also of interest. For all of these problems, it is natural to extend
the concept of an instance to that of a partial instance: an instance in which
some of the attributes do not have values. A simple approach for performing
probabilistic inference about attributes in a DAPER model given a partial instance
is to (1) explicitly construct a ground graph, (2) instantiate known attributes from
the partial instance, and (3) apply standard probabilistic inference techniques to
the ground graph to compute the quantities of interest. One can improve upon
this simple approach by utilizing the additional structure provided by a relational
model, for example, by caching inferences in subnetworks. Koller and Pfeffer [15],
for example, have done preliminary work in this direction. With regard to learning,
note that from a Bayesian perspective, learning about both the local distributions
and model structure can be viewed as probabilistic inference about (missing)
attributes (e.g., parameters) from a partial instance. In addition, there has been
substantial research on learning PRMs (e.g., [8]) and much of this work is applicable
to DAPER models.
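The simple three-step approach to inference can be illustrated with a minimal sketch on the ground graph of figure 7.3(c). Everything numeric here is an assumption made for illustration: all attributes are taken to be binary, the local distributions are made up, and inference is done by brute-force enumeration rather than by any particular algorithm from the literature.

```python
from itertools import product

# Step 1: the ground graph of figure 7.3(c); each node maps to its parents.
parents = {
    "john.IQ": [], "mary.IQ": [], "cs107.Diff": [], "stat10.Diff": [],
    "T(john,cs107).G": ["john.IQ", "cs107.Diff"],
    "T(mary,cs107).G": ["mary.IQ", "cs107.Diff"],
    "T(mary,stat10).G": ["mary.IQ", "stat10.Diff"],
}


def local_prob(node, value, assignment):
    """Made-up local distributions (1 = high IQ, hard course, good grade)."""
    if not parents[node]:
        p_one = 0.5
    else:
        iq, diff = (assignment[p] for p in parents[node])
        p_one = {(1, 0): 0.9, (1, 1): 0.6, (0, 0): 0.5, (0, 1): 0.1}[(iq, diff)]
    return p_one if value == 1 else 1.0 - p_one


def posterior(query, evidence):
    """p(query = 1 | evidence), summing the joint over unobserved attributes."""
    # Step 2: attributes in the partial instance are fixed to their values.
    hidden = [n for n in parents if n not in evidence]
    totals = {0: 0.0, 1: 0.0}
    # Step 3: standard inference, here by exhaustive enumeration.
    for values in product([0, 1], repeat=len(hidden)):
        assignment = {**evidence, **dict(zip(hidden, values))}
        joint = 1.0
        for node in parents:
            joint *= local_prob(node, assignment[node], assignment)
        totals[assignment[query]] += joint
    return totals[1] / (totals[0] + totals[1])


# Partial instance: both of Mary's grades are observed to be good.
evidence = {"T(mary,cs107).G": 1, "T(mary,stat10).G": 1}
print(posterior("mary.IQ", evidence))
```

Enumeration is exponential in the number of unobserved attributes, so it is suitable only for tiny ground graphs such as this one.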
We shall explore PER models in much more detail in subsequent sections. Here,
let us examine two alternate languages for relational data: plate models and PRMs.
Plate models were developed independently by Buntine [3] and the BUGS team
(e.g., [9]) as a language for compactly representing graphical models in which there
are repeated measurements. We know of no formal definition of a plate model, and
so we provide one here. This definition deviates slightly from published examples of
plate models, but it enhances the expressivity of such models while retaining their
essence (see section 7.5).
According to our definition, plate and DAPER models are equivalent. The
invertible mapping from a DAPER to a plate model is as follows. Each entity
class in a DAPER model is drawn as a large rectangle, called a plate. The plate
is labeled with the entity-class name. Plates are allowed to intersect or overlap. A
relationship class for a set of entity classes is drawn at the named intersection of
the plates corresponding to those entities. If there is more than one relationship
class among the same set of entity classes, the plates are drawn such that there
is a distinct intersection for each of the relationship classes. Attribute classes of
an entity class are drawn as ovals inside the rectangle corresponding to the entity
but outside any intersection. Attribute classes associated with a relationship class
are drawn in the intersection corresponding to the relationship class. Arc classes
and constraints are drawn just as they are in DAPER models. In addition, local
distribution classes are specified just as they are in DAPER models.

Figure 7.3 (a) A DAPER model showing that a student's grade in a course
depends on both the student's IQ and the difficulty of the course. The solid
directed arcs correspond to probabilistic dependencies. These arcs are annotated
with constraints. (b) An example skeleton for the entity and relationship classes in
the ER model (the same one shown in figure 7.2). (c) The DAG model (structure)
defined by the application of the DAPER model to the ER skeleton.

The plate model corresponding to the DAPER model in figure 7.3(a) is shown in
figure 7.4(a). The two rectangles are the plates corresponding to the Student and
Course entity classes. The single relationship class between Student and Course,
Takes, is represented as the named intersection of the two plates. The attribute
class Student.IQ is drawn inside the Student plate and outside the Course plate;
the attribute class Course.Diff is drawn inside the Course plate and outside the
Student plate; and the attribute class Takes.Grade is drawn in the intersection of
the Student and Course plates. The arc classes and their constraints are identical to
those in the DAPER model.
PRMs were developed in [5] explicitly for the purpose of representing relational
data. The PRM extends the relational model (another commonly used representation
for the structure of a database) in much the same way as the PER model
extends the ER model. In this chapter, we shall define directed PRMs such that
they are equivalent to DAPER models and, hence, plate models. This definition
deviates from the one given by, e.g., [5], but enhances the expressivity of the language
as previously defined (see section 7.6).
The invertible mapping from a DAPER model to a directed PRM (by our
definition) takes place in two stages. First, the ER model component of the DAPER
model is mapped to a relational model in a standard way (e.g., see [19]). In
particular, both entity and relationship classes are represented as tables. Foreign
keys, or what Getoor et al. [8] call reference slots, are used in the relationship-class
tables to encode the ER connections in the ER model. Attribute classes
for entity and relationship classes are represented as attributes or columns in the
corresponding tables of the relational model. Second, the probabilistic components
of the DAPER model are mapped to those of the directed PRM. In particular, arc
classes and constraints are drawn just as they are in the DAPER model.
The directed PRM corresponding to the DAPER model in figure 7.3(a) is shown
in figure 7.4(b). (The local distribution for Takes.Grade is not shown.) The Student
entity class and its attribute class Student.IQ appear in a table, as do the Course
entity class and its attribute class Course.Diff. The Takes relationship class and its
attribute class Takes.Grade are shown as a table containing the foreign keys Student
and Course. The arc classes and their constraints are drawn just as they are in the
DAPER model.
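The first stage of this mapping, from the ER component to tables with foreign keys, can be sketched with an in-memory SQLite database. The schema below is only one plausible encoding and is our own assumption: the column name id, the TEXT types, and the composite primary key are illustrative choices, not part of the definition. Leaving the attribute columns empty corresponds to storing a skeleton without attribute values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (id TEXT PRIMARY KEY, IQ TEXT);
CREATE TABLE Course  (id TEXT PRIMARY KEY, Diff TEXT);
-- The relationship class becomes a table whose foreign keys
-- (reference slots) point at the participating entity tables.
CREATE TABLE Takes (
    Student TEXT REFERENCES Student(id),
    Course  TEXT REFERENCES Course(id),
    Grade   TEXT,
    PRIMARY KEY (Student, Course)
);
""")

# Skeleton of figure 7.3(b); attribute columns stay NULL.
conn.executemany("INSERT INTO Student (id) VALUES (?)", [("john",), ("mary",)])
conn.executemany("INSERT INTO Course (id) VALUES (?)", [("cs107",), ("stat10",)])
conn.executemany(
    "INSERT INTO Takes (Student, Course) VALUES (?, ?)",
    [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
)

# Following the foreign keys recovers the (student, course) relationships.
rows = conn.execute("""
    SELECT s.id, c.id
    FROM Takes t
    JOIN Student s ON t.Student = s.id
    JOIN Course c ON t.Course = c.id
    ORDER BY s.id, c.id
""").fetchall()
print(rows)
```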

7.4 Probabilistic Entity-Relationship Models


We now examine DAPER models in detail. After reviewing the fundamentals,
we discuss the representation of restricted relationships, self relationships, and
probabilistic relationships.
In what follows, we use the following conventions in our notation. We use either
capitalized friendly names (e.g., Student, Course) or tokens (e.g., E) for
entity classes. We use noncapitalized friendly names or abbreviations (e.g.,
student[Grade], s) for corresponding entities. Similarly, we use capitalized friendly
names (e.g., Takes) or tokens (e.g., R) for relationship classes. We use, e.g., R(s, c)
to say that entities s and c are in a relationship associated with the relationship class
R. We use X to refer to an arbitrary class when the distinction between an entity
and relationship class is unimportant. We use expressions such as X.A to represent
an attribute class of class X, and x.A to represent an (ordinary) attribute of entity
x.

Figure 7.4 A plate model (a) and probabilistic relational model (b) corresponding
to the DAPER model in figure 7.3(a).

7.4.1 Fundamentals

A DAPER model can be viewed as a macro language: a language that, given a
skeleton, expands to a DAG model. We use the term ground graph to refer to
the structure of the DAG model created by the expansion of a DAPER model
given a skeleton. An important part of this expansion is the drawing of arcs in the
ground graph. Because the DAPER model is so compact, a mechanism is needed to
constrain the drawing of arcs. Without such a mechanism, important conditional
independence relations could not be expressed. As we have seen, this mechanism in
a DAPER model takes the form of constraints on arc classes. To better understand
how these constraints work, consider the following four related examples.
Example 7.3
A database contains diseases and symptoms for a given patient. Every disease is a
potential cause of every symptom.
The DAPER model for this example is shown in figure 7.5(a). The entity classes
Disease and Symptom have attribute classes Disease.Present and Symptom.Present,
respectively, and there are no relationship classes. In the diagram, the arc class
from Disease.Present to Symptom.Present has no constraint. Because there is no
constraint, the ground graph generated by the application of this DAPER model to
any given skeleton is a full bipartite graph. The bipartite graph generated by the
DAPER model applied to a skeleton in which there are three diseases and three
symptoms is shown in figure 7.5(b).

Figure 7.5 (a) A DAPER model for a complete bipartite graph between symptoms
and diseases. (b) A ground graph (a DAG model structure) generated from the
DAPER model given a skeleton with three diseases and three symptoms.

We give this example first to emphasize that arc classes need not have constraints.
Now, let us see what happens when we include such constraints.
Example 7.4
Extending example 7.3, suppose a physician has identified the possible causes of
each symptom.
The DAPER model for example 7.4 is shown in figure 7.6(a). With respect to the
model in figure 7.5(a), there is now the relationship class Causes, where Causes(d, s)
is true if the physician has identified disease d as a possible cause of symptom s.
Also new is the constraint Causes(d, s) on the arc class. This constraint says that,
when we expand the DAPER model to a DAG model given a skeleton, we draw
an arc from d.Present to s.Present only when Causes(d, s) holds. Note that, in the
diagram, we use d and s to refer to the entities associated with Disease.Present
and Symptom.Present, respectively. In what follows, we will continue to make strong
abbreviations as in this example, although such abbreviations are not required and
may be undesirable for computer implementations of the PER language.
In the next two examples, we consider more complex constraints.
Example 7.5
Extending example 7.3 in a different way, suppose the physician has identified both
primary (major) and secondary (minor) causes of disease.
The DAPER model for example 7.5 is shown in figure 7.7(a). There are now two
relationship classes, Primary (1°) Causes and Secondary (2°) Causes, between
the two entity classes, and the constraint is a disjunctive one: 1° Causes(d, s) ∨
2° Causes(d, s). This constraint says that, when the DAPER model is expanded to
a DAG model given a skeleton, an arc is drawn from d.Present to s.Present only
when d is a primary and/or secondary cause of s.

Figure 7.6 (a) A DAPER model for an incomplete bipartite graph of diseases and
symptoms. (b) A possible skeleton identifying diseases, symptoms, and potential
causes of symptoms. (c) A DAG model resulting from the expansion of the DAPER
model to the skeleton.

Example 7.6
Extending example 7.3 in a different way, suppose that both diseases and symptoms
have category labels: labels drawn from the same set of categories. The possible
causes of a symptom are diseases that have at least one category in common with
that symptom.
The DAPER model for this example is shown in figure 7.7(b). Here, we have
introduced a third entity class, Category, whose entities have relationships with
Disease and Symptom. In particular, R1(d, c) holds when disease d is in category
c; and R2(s, c) holds when symptom s is in category c. In this model, the arc class
has the constraint ∃c R1(d, c) ∧ R2(s, c), where c is an arbitrary entity in Category.
Thus, when the DAPER model is expanded to a DAG given a skeleton, an arc will
be drawn from d.Present to s.Present only when d and s share at least one category.
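The four constraints of examples 7.3 through 7.6 can be compared side by side in a short sketch. The helper expand and the encoding of relationships as sets of pairs are our own assumptions. The Causes skeleton is the one in figure 7.6(b); the primary, secondary, and category relationships are hypothetical data invented for illustration.

```python
diseases = ["d1", "d2", "d3"]
symptoms = ["s1", "s2", "s3"]


def expand(constraint):
    """Draw an arc d.Present -> s.Present wherever the constraint holds."""
    return [(f"{d}.Present", f"{s}.Present")
            for d in diseases for s in symptoms if constraint(d, s)]


# Example 7.3: no constraint, so the ground graph is fully bipartite.
full = expand(lambda d, s: True)

# Example 7.4: constraint Causes(d, s), skeleton of figure 7.6(b).
causes = {("d1", "s1"), ("d1", "s2"), ("d1", "s3"), ("d2", "s2"), ("d3", "s3")}
by_cause = expand(lambda d, s: (d, s) in causes)

# Example 7.5: disjunction of two relationship classes (hypothetical data).
primary = {("d1", "s1")}
secondary = {("d1", "s1"), ("d2", "s3")}
either = expand(lambda d, s: (d, s) in primary or (d, s) in secondary)

# Example 7.6: existential quantifier over categories (hypothetical data).
categories = ["c1", "c2"]
r1 = {("d1", "c1"), ("d2", "c2")}                # R1(d, c)
r2 = {("s1", "c1"), ("s2", "c1"), ("s3", "c2")}  # R2(s, c)
shared = expand(
    lambda d, s: any((d, c) in r1 and (s, c) in r2 for c in categories)
)

print(len(full), len(by_cause), either, shared)
```

In each case the constraint is an ordinary Boolean function of the tail entity d and the head entity s, which is all the expansion needs.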
To understand how constraints are written and used in general, consider a
DAPER model with an arc class from X.A to Y.B. When this model is expanded
to a ground graph given a skeleton, depending on the constraint, we might draw
an arc from x.A to y.B for any x and y in the skeleton. To determine whether we
do so, we look at the tail and head entities associated with this putative arc. The
tail entities of the putative arc from x.A to y.B are the set of entities associated
with x. If X is an entity class, then the tail entity is just the entity x. If X is
a relationship class, then the tail entities are those entities in the relationship
tuple x. Similarly, the head entities of this arc are the set of entities associated
with y. For example, given the DAPER model and skeleton in figure 7.3 for the
university database, the tail and head entities of the putative arc from john.IQ to
Takes(john,cs107).Grade are (john) and (john,cs107), respectively. A constraint on
the arc class from X.A to Y.B in a DAPER model is any first-order expression
involving entities and relationship classes in the DAPER model such that the
expression is bound when the tail and head entities are taken to be constants.
To determine whether we draw an arc from x.A to y.B, we evaluate the first-order
expression using the tail and head entities of the putative arc. It must evaluate
to true or false. We draw the arc from x.A to y.B only if the expression is true.
Continuing with the same university database example, let us determine whether
to draw an arc from john.IQ to Takes(john,cs107).Grade. The relevant constraint,
student[IQ] = student[Grade], references the tail entity student[IQ] = john and
the head entity student[Grade] = john. Thus, the expression evaluates to true and
we draw the arc.

Figure 7.7 (a) A disjunctive constraint. (b) A constraint containing the existence
quantifier.

Next, let us consider the local distribution class. A local distribution class for
attribute class X.A is any specification from which the local distributions for
attribute x.A, for any entity or relationship x in class X, may be constructed. In
figure 7.3(c), each attribute for a student's grade in a course has two parents: one
attribute corresponding to the difficulty of the course and another corresponding to
the IQ of the student. Consequently, the local distribution class for Takes.Grade in
the DAPER model can be a single (ordinary) local distribution. In general, however,
a more complicated specification is needed. For example, in the ground graph
of figure 7.6(c), the attribute s1.Present has one parent, whereas the attributes
s2.Present and s3.Present have two parents. Consequently, the local distribution
class for Symptom.Present must be something more than a single local distribution.
In general, a local distribution class for X.A may take the form of an enumeration
of local distributions. In our example, we could specify a local distribution for every
possible parent set of s.Present for every symptom s in every possible skeleton. Of
course, such enumerations are cumbersome. Instead, a local distribution class is
typically expressed as a canonical distribution such as noisy OR, logistic, or linear
regression. Friedman et al. [5] refer to such specifications as aggregators.
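To see why a canonical distribution suits a varying number of parents, consider a minimal noisy-OR sketch for Symptom.Present. The single shared link probability and the leak term are simplifying assumptions of ours, and the numeric defaults are made up; a realistic model would typically use one link probability per disease.

```python
def noisy_or(parent_values, link_prob=0.8, leak=0.05):
    """p(s.Present = True | parents) under a noisy-OR.

    parent_values: one Boolean per parent d.Present, in any number.
    link_prob: assumed probability that a present disease triggers the symptom.
    leak: assumed probability that the symptom appears with no present cause.
    """
    p_absent = 1.0 - leak
    for present in parent_values:
        if present:
            p_absent *= 1.0 - link_prob
    return 1.0 - p_absent


# The same specification serves s1.Present (one parent) and
# s2.Present or s3.Present (two parents) in figure 7.6(c).
print(noisy_or([True]))
print(noisy_or([True, True]))
print(noisy_or([False, False]))
```

Because the function accepts a parent list of any length, one specification yields a local distribution for every attribute of the class, whatever the skeleton.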
So far, we have considered only DAPER models in which all attributes derive
from attribute classes. In practice, however, it is often convenient to include
(ordinary) attributes in a DAPER model. For example, in a Bayesian approach to
learning the conditional probability distribution of Takes.Grade given Student.IQ
and Course.Diff in example 7.2, we may add to the DAPER model an ordinary
attribute θ corresponding to this uncertain distribution, as shown in figure 7.8(a).
(If Grade is binary, e.g., θ would correspond to the parameter of a Bernoulli
distribution.) The ground graph obtained from this DAPER model applied to the
skeleton in figure 7.8(b) is shown in figure 7.8(c). Note that the attribute θ appears
only once in the ground graph and that, because there is no annotation on the arc
class from θ to Takes.Grade, there is an arc from θ to each grade attribute.
Although this view makes DAPER models easy to understand, formally, we do
not allow such models to contain (ordinary) attributes. Instead, we specify that,
for any DAPER model, (1) there is an entity class, Global, that is not drawn; (2)
for any skeleton, this entity class has precisely one entity; and (3) every attribute
class not connected explicitly to some visible entity class is connected to Global.
This view is equivalent to the informal one just presented, but leads to simpler
definitions and notation in our formal treatment of DAPER models in section 7.7.
7.4.2 Restricted Relationships

We now consider restricted relationships or, more precisely, restricted relationship
classes. A relationship class R in an ER (or PER) model is restricted when some
skeletons for the entity and relationship classes of the ER model are prohibited. In
practice, many ER models contain restricted relationship classes; and graphical
notation has been developed for common restrictions (e.g., [20]). Similarly, restricted
relationship classes are an extremely useful tool for modeling with PER models. In
this section, we consider several examples.
Example 7.7
A binary outcome O is measured on patients in multiple hospitals. Each patient is
treated in exactly one hospital. It is believed that outcomes in any given hospital h
are i.i.d. given Bernoulli parameter h.θ; and that these Bernoulli parameters are
themselves i.i.d. across hospitals given hyperparameters α.
A DAPER model for this example is shown in figure 7.9(a). Here, entity classes
Patient and Hospital are related by the relationship class In. The ground graph
for a skeleton containing m hospitals and n_i patients in hospital i is shown in
figure 7.9(b). This ground graph is the DAG model (structure) of what is often
called a hierarchical model in the Bayesian literature (e.g., [7]).
In this example, the relationship class In is restricted in the sense that
(patient, hospital) pairs are many to one: each patient is in exactly one hospital. This
restriction is represented graphically by a curved arrowhead on the edge from In
to Hospital in figure 7.9(a). The curved arrowhead is a standard notation in the
language of ER models [20]; and we adopt this same notation for PER models.
In general, given an ER or PER model with relationship class R connecting entity
classes E1, ..., En, if knowing entities in classes E1, ..., Ei−1, Ei+1, ..., En
uniquely determines the entity in class Ei for any allowed skeleton, then a curved
arrowhead is attached to the edge from R to Ei.
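A restriction is a property of skeletons, so it can be checked before a skeleton is used. The sketch below tests the many-to-one restriction of example 7.7 and then builds the arcs of the hierarchical ground graph. The function names, the pair encoding (patient, hospital), and the node labels alpha and theta are our own illustrative assumptions.

```python
from collections import Counter


def satisfies_many_to_one(patients, in_rel):
    """True if every patient appears in exactly one (patient, hospital) pair."""
    counts = Counter(p for p, _ in in_rel)
    return all(counts[p] == 1 for p in patients)


def ground_arcs(hospitals, in_rel):
    """Arcs of the hierarchical model: alpha -> h.theta -> p.O when In(h, p)."""
    arcs = [("alpha", f"{h}.theta") for h in hospitals]
    arcs += [(f"{h}.theta", f"{p}.O") for p, h in in_rel]
    return arcs


hospitals = ["h1", "h2"]
patients = ["p1", "p2", "p3"]
in_rel = [("p1", "h1"), ("p2", "h1"), ("p3", "h2")]

if satisfies_many_to_one(patients, in_rel):
    for tail, head in ground_arcs(hospitals, in_rel):
        print(tail, "->", head)
```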

Figure 7.8 A modification to figure 7.3 in which the local distribution for
Takes.Grade given Student.IQ and Course.Diff is uncertain. (a) The DAPER model.
(b) A skeleton (identical to the one in figure 7.3). (c) The ground graph.

Note that, due to the many-to-one restriction in this problem, we could equivalently
attach the attribute class O to In rather than to Patient. A DAPER model
equivalent to the one in figure 7.9(a) is shown in figure 7.9(c).
Example 7.8
The occurrence of words in a document is used to infer its topic. The occurrence
of words is mutually independent given document topic. Document topics are i.i.d.
given multinomial parameters θ_t. The occurrence of word w in a document with
topic t is i.i.d. given t and Bernoulli parameters θ_w|t.
This example is commonly referred to as binary naive Bayes classification [18]. A
DAPER model for this problem is shown in figure 7.10. The entity classes Document
and Word are related by the single relationship class F. The attribute classes are
Document.Topic representing the topic of a document, Word.θ_w|t representing the
set of Bernoulli parameters θ_w|t for a word, and F(d, w).In representing whether
word w is in document d. The relationship class F is restricted to be a Full
relationship class. That is, in any allowed skeleton, all pairs (document, word) must
be represented (see footnote 4). We indicate this restriction on the DAPER diagram
by placing the annotation Full next to the relationship class. As we shall see in what
follows, the Full restriction is useful in many situations.

Figure 7.9 (a) A DAPER model for patient outcomes across multiple hospitals
(example 7.7). (b) The ground graph (a hierarchical model structure) for a skeleton
containing m hospitals and n_i patients in hospital i applied to the DAPER model
in (a). (c) A DAPER model equivalent to the one in (a).

4. In a practical database implementation, this relationship would be encoded sparsely,
despite the Full restriction. That is, relationship (d, w) would be stored in the database
only when word w appears in document d.

Figure 7.10 A DAPER model for binary naive Bayes document classification.

7.4.3 Self Relationships

Self relationships are relationships that relate like entities (and perhaps other
entities as well). A self-relationship class is one that contains self relationships.
Examples of self-relationship classes are common in databases: people are managers
of other people, cities are near other cities, timestamps follow timestamps, and so
on. ER models can represent self relationships in a natural manner. The extension
to PER models is also straightforward, as we illustrate with the following three
examples.
Example 7.9
In the university database example (example 7.2), a student's grade in a course
depends on whether an advisor of the student is a friend of a teacher of the course.
The ER model for the data in this example is shown in figure 7.11(a). With
respect to the ER model in figure 7.2(a), Professor is a new entity class and Advises,
Teaches, and F are new relationship classes. Advises(p, s) means that professor p
is an advisor of student s. Teaches(p, c) means that professor p teaches course c.
(Students may have more than one advisor and courses may have more than one
teacher.)
The relationship class F is introduced to model whether one professor is a friend of
another. F is our first example of a self-relationship class: it contains relationships
between professor pairs. The two dashed lines connecting F and the Professor entity
class in the diagram indicate that F is a self-relationship class. F has one attribute
class F.Friend, where the attribute F(p, pf).Friend is true if professor pf is a friend
of professor p. Note that F has the Full constraint so that we can model whether
any one professor is a friend of another. Also note that F(p1, p2).Friend may be true
while F(p2, p1).Friend may be false.
The DAPER model for this example, including the new probabilistic relationship
between F.Friend and Takes.Grade, is shown in figure 7.11(b). The constraint on
the arc class from F.Friend to Takes.Grade is Teaches(p, c) ∧ Advises(pf, s). Thus,
in any ground graph generated from this model, there is an arc from attribute
F(p, pf).Friend to attribute Takes(s, c).Grade whenever a teacher of the course is p
and an advisor of the student is pf, which is precisely the additional dependence
described in the example.
In the diagram, note that the relationship class F has the label F(p, pf). The
ordered pair (p, pf) following F is introduced to unambiguously identify the different
roles of the entity class in the self relationship. In this case, p and pf refer to
the roles of professor and professor's friend, respectively. This added notation in
DAPER models is needed for the unambiguous specification of constraints. For
example, suppose we had written the constraint on the arc class from F.Friend to
Takes.Grade as Teaches(pf, c) ∧ Advises(p, s). This constraint means something
different from the previous one: namely, that the student's grade depends on
whether the course's teacher is a friend of the student's advisor.
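The effect of the role labels can be checked on a tiny skeleton. In the sketch below, the professors ann and bob and all relationships are hypothetical, as is the helper friend_arcs; the two constraints are the ones discussed in the text, with the roles of p and pf exchanged.

```python
def friend_arcs(professors, takes, constraint):
    """Arcs F(p,pf).Friend -> Takes(s,c).Grade wherever the constraint holds."""
    arcs = []
    for p in professors:            # role p: professor
        for pf in professors:       # role pf: professor's friend
            for s, c in takes:
                if constraint(p, pf, s, c):
                    arcs.append((f"F({p},{pf}).Friend", f"Takes({s},{c}).Grade"))
    return arcs


# Hypothetical skeleton: ann teaches cs107, bob advises john.
professors = ["ann", "bob"]
takes = [("john", "cs107")]
teaches = {("ann", "cs107")}   # Teaches(p, c)
advises = {("bob", "john")}    # Advises(p, s)

# Teaches(p, c) AND Advises(pf, s): is the advisor a friend of the teacher?
original = friend_arcs(
    professors, takes,
    lambda p, pf, s, c: (p, c) in teaches and (pf, s) in advises,
)

# Teaches(pf, c) AND Advises(p, s): is the teacher a friend of the advisor?
swapped = friend_arcs(
    professors, takes,
    lambda p, pf, s, c: (pf, c) in teaches and (p, s) in advises,
)

print(original)
print(swapped)
```

Because friendship need not be symmetric, F(ann,bob).Friend and F(bob,ann).Friend are different attributes, so the two constraints give the grade different parents.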
Although not a standard convention for ER models, we allow an alternative
representation for self relationships. Namely, we allow entity classes participating
in a self-relationship class to be copied. The DAPER model in figure 7.11(b) drawn
with this alternative convention is shown in figure 7.11(c). Here, there are two
instances of the Professor entity class named Professor (Teacher) and Professor
(Advisor). Note that copying allows us to annotate the role that each copy of
the entity class plays in the self-relationship class. Models drawn with this copy
convention are sometimes (but not always) more transparent. A similar convention
is used in PRMs [5].
Example 7.10
A hidden Markov model (HMM) has hidden attributes slice.H, observed attributes
slice.X, and uncertain parameters θ_h and θ_x|h.
A DAPER model for such an HMM is shown in figure 7.12(a). The only entity
class in the model is Slice. Its entities correspond to the time slices in the HMM. The
only relationship class in the model, Next, is a restricted self-relationship class.
Next(s, s+1) holds precisely when time slice s+1 immediately follows time slice s.
Thus, Next is an example of a relationship class whose constraint induces a total
order on its entities. We use Order to annotate this restriction. The attributes H
and X correspond to the hidden and observed attributes in the HMM, respectively.
The attribute classes θ_h and θ_x|h (connected to the Global entity class, which is
not shown) represent the uncertain distributions.
Because arc classes can have constraints, DAPER models may contain arc classes
that are self arcs: arcs whose head and tail nodes are the same (see footnote 5). In
this example, the self arc is used to represent the Markov chain of hidden attributes H.
Another graphical model, the Markov transition diagram, uses self arcs in much the
same way. When a self arc appears in a DAPER model, it is not clear which way to draw
arcs when expanding the model to a DAG model. In our example, do we draw arcs
from s.H to s+1.H, or in the opposite direction? To remove the ambiguity, we use
bar-hat notation. In this example, the constraint is written Next(s, s+1), indicating
that the arc is drawn from s.H to s+1.H. In general, we use a bar and hat to denote
head and tail entities, respectively.
5. We use the term self arc to refer both to arc classes and to arcs. The use will be clear
from the context.

Figure 7.11 (a) An ER model showing Student, Course, and Professor entities and
relationships among them. (b) A DAPER model showing that a student's grade in
a course depends on whether the course's teacher likes the student's advisor. (c)
The same model in (b) in which the Professor entity class has been copied.

When this DAPER model is expanded to a ground graph, the attribute s0.H,
where s0 corresponds to the first time slice, has no parents. In contrast, the
attribute s.H where s corresponds to any other slice has one parent. Consequently,
the local distribution class for Slice.H may be specified by two (ordinary) local
distributions: p(s0.H) and p(si+1.H|si.H) for i ≥ 0.
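The expansion of the HMM can be sketched as follows. The function names and the string labels are our own assumptions, and the parameter attributes θ_h and θ_x|h are omitted to keep the sketch short. The second function only reports which of the two components of the local distribution class applies to a given hidden attribute.

```python
def hmm_ground_graph(n_slices):
    """Map each attribute of the HMM ground graph to its list of parents."""
    slices = [f"s{i}" for i in range(n_slices)]
    # Next(s, s+1): consecutive slices, which orders the entities totally.
    next_rel = list(zip(slices, slices[1:]))
    parents = {f"{s}.H": [] for s in slices}
    for s in slices:
        parents[f"{s}.X"] = [f"{s}.H"]               # arc class H -> X
    for s, s_next in next_rel:
        parents[f"{s_next}.H"].append(f"{s}.H")      # self arc, drawn forward
    return parents


def hidden_component(node, parents):
    """Pick the component of the local distribution class for Slice.H."""
    return "p(s0.H)" if not parents[node] else "p(s_{i+1}.H | s_i.H)"


graph = hmm_ground_graph(3)
for node in ("s0.H", "s1.H", "s2.H"):
    print(node, graph[node], hidden_component(node, graph))
```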
A DAPER model using the copy convention for the HMM is shown in figure 7.12(b).
Note that the attribute class Slice.X need be represented in only one
copy of the entity class. The probabilistic dependencies between s.H and s.X, for
all slices s, are captured by the inclusion of X in one copy. Also note that, in
this example and in any diagram where the copy convention is used, the bar-hat
notation is not needed.
Example 7.11
A gene is transmitted through inheritance. The gene-allele frequencies are uncertain.
A DAPER model for this example is shown in figure 7.13(a). The model contains a single entity class Person and a single three-way, restricted, self relationship class Family. The relationship $\mathrm{Family}(p_c, p_m, p_f)$ holds when child $p_c$ has mother and father $p_m$ and $p_f$, respectively. The relationship class has the 2DAG constraint, meaning that each child has at most two parents and cannot be his or her own ancestor. The constraint on the single arc class indicates that only the gene of a child's mother and father influences the gene of the child. Note that the local distribution class for Gene has three components: (1) p(gene | no parents), which reflects the uncertain gene-allele frequencies, (2) p(gene | one parent), and (3) p(gene | two parents). Figure 7.13(b) shows the same model in which the entity class Person appears three times.
When a DAPER model contains self relationships, its expansion can produce an invalid DAG model, in particular one with a ground graph that contains directed cycles. For example, suppose we have a DAPER model where entity class E has a self-relationship class R, and E.A has a self arc with no constraint. Then when we expand this model given a skeleton containing R(e, e), the ground graph will contain the self arc from e.A to e.A. In general, we need to ensure the ground graph is acyclic given all skeletons under consideration. In section 7.7, we describe sufficient conditions (including the absence of self relationships) that guarantee the acyclicity of ground graphs. In general, to determine whether the DAPER model produces only acyclic ground graphs for a given set of skeletons, one can check each ground graph individually.
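The per-skeleton check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the self arc on E.A is taken to be drawn for every relationship R(tail, head) in the skeleton, and the function names `ground_self_arcs` and `has_directed_cycle` are hypothetical helpers, not part of any published implementation.

```python
from collections import defaultdict


def ground_self_arcs(r_set):
    """Expand the self arc on E.A: one ground arc tail.A -> head.A per R(tail, head)."""
    return [(f"{tail}.A", f"{head}.A") for tail, head in r_set]


def has_directed_cycle(arcs):
    """Depth-first search for a directed cycle in a list of (tail, head) arcs."""
    children = defaultdict(list)
    nodes = set()
    for tail, head in arcs:
        children[tail].append(head)
        nodes.update((tail, head))

    state = dict.fromkeys(nodes, "new")

    def visit(node):
        state[node] = "active"
        for child in children[node]:
            if state[child] == "active":  # reached a node on the current path
                return True
            if state[child] == "new" and visit(child):
                return True
        state[node] = "done"
        return False

    return any(state[n] == "new" and visit(n) for n in nodes)


# A skeleton containing R(e, e) yields the self arc e.A -> e.A, hence a cycle.
bad = has_directed_cycle(ground_self_arcs([("e", "e")]))
# A chain-shaped skeleton yields an acyclic ground graph.
good = has_directed_cycle(ground_self_arcs([("a", "b"), ("b", "c")]))
```

Running the check once per skeleton in the set under consideration mirrors the brute-force procedure in the text; the sufficient conditions of section 7.7 avoid this enumeration.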
7.4.4 Probabilistic Relationships

In many situations, relationships may be uncertain or random. In this section, we consider several examples and how they are represented with DAPER models.


Figure 7.12 (a) The DAPER model representation of a hidden Markov model. (b) The same model in which Slice is copied.

Example 7.12 Relationship Existence


A database contains academic papers and citations for a subset of those papers. Using the citations we have, we model how the topics of two papers influence whether one paper cites the other.⁶
If each paper in the database came with its citations, we could model this database with the ER model shown in figure 7.14(a). Here, the single (copied) entity class Paper has the self relationship Cites, where $\mathrm{Cites}(p_{cg}, p_{cd})$ holds when $p_{cg}$ is the citing paper and $p_{cd}$ is the cited paper. In our example, however, we are uncertain about the citations of papers whose citations have not been recorded. That is, we are uncertain about the relationships in the relationship class Cites. To model this

6. We assume that citation lists for papers are missing at random.


Figure 7.13 (a) The DAPER model for gene transmission through inheritance. (b) The same model in which Person is copied.

uncertainty, we use a DAPER model in which Cites is a Full relationship class with attribute class Cites.Exists, where $\mathrm{Cites}(p_{cg}, p_{cd}).\mathrm{Exists}$ is true when paper $p_{cg}$ cites paper $p_{cd}$. In addition, to model how the topics of two papers influence this existence, we add the attribute class Paper.Topic and the arc classes as shown in figure 7.14(b).
In general, if we have a relationship class R that is uncertain, we model it in a DAPER model by making that relationship class Full and adding the attribute class R.Exists. Getoor et al. [8] discuss this type of uncertainty under the name existence uncertainty and use a similar mechanism to represent it in PRMs.
In many situations, relationship classes can be both probabilistic and restricted. In the remainder of this section, we consider two examples.
Example 7.13
Modifying example 7.12, we now know that the database was constructed such that it contains at most ten citations from the bibliography of any paper.⁷

7. We assume that citations above ten in number were censored at random.


Figure 7.14 (a) An ER model for a citation database. (b) A DAPER model for the situation where citations are uncertain.

Figure 7.15 A DAPER model for the situation where citations are uncertain and limited to ten per paper.

Figure 7.15 shows the DAPER model for this example, where the Cites relationship class is both uncertain and restricted. As discussed in section 7.2, we encode the restrictions using instantiated deterministic nodes. With respect to figure 7.14(b), we have added a binary attribute class Paper.≤10. The double oval associated with this attribute class indicates that this attribute expands to deterministic attributes in a ground graph. In particular, a ground graph attribute p.≤10 will have parents $\mathrm{Cites}(p_{cg}, p_{cd}).\mathrm{Exists}$ with $p_{cg} = p$, for all $p_{cd}$, and will be true exactly when ten or fewer of these parents are true. To encode the restriction, we set p.≤10 to true for every p when performing inference in the ground graph.
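To make the effect of instantiating such a deterministic node concrete, the following sketch conditions a handful of Exists attributes on the event that at most `cap` of them are true. It is a brute-force illustration only: the function name `condition_on_cap` is hypothetical, and the prior over each Exists attribute is assumed to be an independent Bernoulli (as it would be once the Topic parents are fixed).

```python
from itertools import product


def condition_on_cap(exist_probs, cap):
    """Posterior over Exists configurations given that the deterministic
    'at most cap parents are true' node is observed to be true."""
    weights = {}
    for config in product([False, True], repeat=len(exist_probs)):
        if sum(config) > cap:
            continue  # the deterministic node would be false: zero probability
        weight = 1.0
        for exists, p in zip(config, exist_probs):
            weight *= p if exists else 1.0 - p
        weights[config] = weight
    total = sum(weights.values())
    return {config: w / total for config, w in weights.items()}


# Three candidate citations, each a priori present with probability 0.5,
# and a cap of one citation per paper (instead of ten, to keep the example small).
posterior = condition_on_cap([0.5, 0.5, 0.5], cap=1)
```

Configurations that violate the cap simply drop out, and the remaining ones are renormalized, which is what setting the node to true accomplishes during inference.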

Figure 7.16 A DAPER model for the situation where only the cited papers are uncertain.
Example 7.14 Partial Relationship Existence
Modifying example 7.12 once again, the citation database now has a complete set of citations, but some of the citations are so garbled that the identities of some of the cited papers are uncertain.
One way to think about this uncertainty is that the relationships $\mathrm{Cites}(p_{cg}, p_{cd})$ are uncertain only in their second argument. Getoor et al. [8] refer to this uncertainty as reference uncertainty and present a special mechanism for representing it in PRMs. We take an alternative approach that uses only concepts that we have already discussed.
A DAPER model for this example is shown in figure 7.16. With respect to the DAPER model in figure 7.14(b), we have added the entity class Cites, and the relationship classes $R_1$ and $R_2$ between Paper and Cites. An entity pair in Cites corresponds to a citation: a citing and a cited paper. $R_1(p_{cg}, c)$ holds when paper $p_{cg}$ is the citing paper in $c$, and $R_2(p_{cd}, c)$ holds when $p_{cd}$ is the cited paper in $c$. The relationship class $R_1$ is a restricted (many-to-one) relationship class. In contrast, the relationship class $R_2$ is a probabilistic relationship class, restricted to be Full. The uncertainty in this relationship class is encoded with the attribute class $R_2$.Exists, where $R_2(p_{cd}, c).\mathrm{Exists}$ is true precisely when citation $c$ cites paper $p_{cd}$. To model the restriction that the possible cited papers of $c$ are mutually exclusive, we first introduce the deterministic attribute class Cites.MutEx. In any ground graph obtained from this DAPER model, c.MutEx will be true precisely when exactly one of its parents $R_2(p_{cd}, c).\mathrm{Exists}$ is true. For any inference we perform with the ground graph, we set c.MutEx to true for every citation $c$.
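Setting c.MutEx to true turns the Exists attributes of one citation into a single choice among candidate papers. The sketch below shows this for one citation, assuming (hypothetically) that the Exists attributes are a priori independent with the given probabilities; `cited_paper_posterior` is an illustrative helper, not a function from any library.

```python
import math


def cited_paper_posterior(exist_probs):
    """Distribution over which paper a citation refers to, given that exactly
    one Exists attribute is true (c.MutEx = true).

    exist_probs maps each candidate paper to its prior probability of Exists.
    Assumes at least one configuration with exactly one true parent has
    nonzero probability.
    """
    weights = {}
    for paper, p in exist_probs.items():
        # Probability that this paper's Exists is true and all others are false.
        others_absent = math.prod(
            1.0 - q for other, q in exist_probs.items() if other != paper
        )
        weights[paper] = p * others_absent
    total = sum(weights.values())
    return {paper: w / total for paper, w in weights.items()}


posterior = cited_paper_posterior({"paper_a": 0.8, "paper_b": 0.2})
```

For the two candidates above, the unnormalized weights are 0.8 × 0.8 and 0.2 × 0.2, so the posterior concentrates on `paper_a`.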


Figure 7.17 A plate model for an HMM corresponding to the DAPER model in figure 7.12(b).

7.5 Plate Models
In this section, we revisit our definition of the plate model, give examples, and describe how our definition differs from previously published examples.
As discussed in section 7.3, we define the plate model by giving an invertible mapping from DAPER to plate model. Thus, the two model types are equivalent in the sense that they can represent the same conditional independence relationships for any given skeleton.
Summarizing the mapping from DAPER to plate model given in section 7.3, entity classes are drawn as large named rectangles called plates; a relationship class for a set of entity classes is drawn at the named intersection of the corresponding plates; attribute classes are drawn inside the rectangle corresponding to their entity or relationship class; and arc classes and constraints are drawn just as they are in DAPER models. For example, as we have discussed, the DAPER model in figure 7.3(a) has the corresponding plate model in figure 7.4(a). As another example, the DAPER model for the HMM shown in figure 7.12(b) has the corresponding plate model in figure 7.17. Note that, because plate models represent relationship classes as the intersection of plates, plates (corresponding to entity classes) must be copied when the model contains self-relationship classes.
The plate model corresponding to the DAPER model for the patient-hospital example in figure 7.3(a) is shown in figure 7.18(a). In this plate model, there are no attributes in the Patient plate outside the intersection. Thus, one can move the Patient plate fully inside the Hospital plate, yielding the diagram in figure 7.18(b). We allow this nesting in our framework. Furthermore, plates may be nested to an arbitrary depth. This convention corresponds to one found in published examples of plate models.
There are three differences between plate models as we have defined them and traditional plate models, that is, plate models as they have been described in the literature. In all three cases, our definition provides a more expressive language.


Figure 7.18 (a) A plate model corresponding to the DAPER model in figure 7.12(a). (b) An equivalent plate model illustrating the graphical convention of nesting. (c) A traditional plate model, equivalent to the one in (b), in which the constraint h[] = h[O] is implicit.

One, in traditional plate models, an arc class emanating from an attribute class in a plate cannot leave that plate. Given this constraint, any arc class from attribute class E.X must point either to attribute class E.Y or to attribute class R.Y, where R is nested inside E.
Two, when a traditional plate model is expanded to a ground graph, arcs are drawn only between attributes corresponding to the same entity. To be more precise, consider a plate model containing the arc class from E.X to E.Y. In a traditional plate model, the arc class implicitly has the constraint e[X] = e[Y]. Similarly, consider a plate model containing the arc class from E.X to R.Y where R is nested inside E, possibly many levels deep. Because R is nested inside E, for any relationship $r \in R$, the entities associated with $r$ must uniquely determine an $e \in E$. Let $r(e)$ be the set of the relationships $r$ that uniquely determine $e$. Now, when this traditional plate model is expanded to a ground graph, arcs are drawn from e.X to r.Y only when $r \in r(e)$. As an example, consider figure 7.18(c), which shows the traditional plate model for the patient-hospital example. Here, E = Hospital, R = In, and $r(h) = \bigcup_p \{(h, p)\}$ for all hospitals $h$. Thus, the arc class
from Hospital. to In(h, p).O has the constraint h[] = h[O]. This constraint is
implicit (see figure 7.18(c)).
Three, traditional plate models contain no arc-class constraints other than the implicit ones just described.
The DAPER and plate model (as we have defined them) are equivalent. Nonetheless, in some situations, a DAPER model may be easier to understand than an equivalent plate model, and vice versa. When there are many entity and relationship classes (plates and intersections), DAPER models are often easier to understand.


In particular, drawing intersections when there are many plates can be difficult (although not impossible; see [10]). In contrast, when there are few entities and the nesting convention can be used, plates are often easier to understand.

7.6 Probabilistic Relational Models


In this section, we examine directed PRMs.
Recall that, as in the case of the plate model, we have specified an invertible mapping from a DAPER model to a directed PRM. Thus, DAPER models, plate models, and directed PRMs are equivalent. As described earlier, the mapping from a DAPER to directed PRM takes place in two stages: the ER model component of the DAPER model is mapped to a relational model, and then the probabilistic component of the DAPER model is mapped to the directed PRM. In the first stage, entity classes are mapped to tables; relationship classes are mapped to tables with foreign keys making the connections to entities; and attribute classes are mapped to attributes (columns) in relational tables. In the second stage, arc classes and constraints are drawn just as they are in the DAPER model.
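The first stage of this mapping can be illustrated with an in-memory SQLite database. The sketch below maps the Student and Course entity classes and the Takes relationship class to tables; the table and attribute names follow the running example, while the `id` key columns and the sample rows are assumptions made only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Entity classes become tables; attribute classes become columns.
    CREATE TABLE Student (id TEXT PRIMARY KEY, IQ TEXT);
    CREATE TABLE Course  (id TEXT PRIMARY KEY, Diff TEXT);
    -- A relationship class becomes a table whose foreign keys point at entities.
    CREATE TABLE Takes (
        Student TEXT REFERENCES Student(id),
        Course  TEXT REFERENCES Course(id),
        Grade   TEXT,
        PRIMARY KEY (Student, Course)
    );
    """
)
conn.execute("INSERT INTO Student VALUES ('s1', 'high')")
conn.execute("INSERT INTO Course VALUES ('c1', 'hard')")
conn.execute("INSERT INTO Takes VALUES ('s1', 'c1', 'A')")

# Following the foreign keys from Takes reaches the attributes that the
# arc classes into Takes.Grade refer to.
rows = conn.execute(
    """
    SELECT Takes.Grade, Course.Diff, Student.IQ
    FROM Takes
    JOIN Course  ON Takes.Course  = Course.id
    JOIN Student ON Takes.Student = Student.id
    """
).fetchall()
```

The second stage adds no new tables: the arc classes and their constraints are drawn on top of this relational schema.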
There is one important difference between the directed PRM by our definition and the traditional PRMs as defined by Friedman et al. [5]. The difference is not in the relational-model component; the relational-model components of our PRM and a traditional PRM are identical. Rather, the difference lies in how the probabilistic component is specified. In our PRM, the probabilistic component is a graphical augmentation of the relational model. In a traditional PRM, the probabilistic component takes the form of a list of arc classes. To illustrate this difference, compare the PRM in figure 7.4(b) with the corresponding traditional PRM in figure 7.19. In the latter figure, the arc classes pointing to Takes.Grade are specified in a separate list consisting of Takes.Course.Diff → Takes.Grade and Takes.Student.IQ → Takes.Grade.
The terms Takes.Course.Diff and Takes.Student.IQ are examples of what Friedman et al. [5] call slot chains. In general, a slot chain is a sequence of foreign key (or inverse foreign key) references. The linear nature of slot chains makes them less expressive than the first-order constraints in (our) PRMs. For example, in example 7.9, where a student's grade in a course depends on whether the course's teacher likes the student's advisor, there are two relationship paths from F.Friend to Student.Grade: one through Advises and one through Takes. This double path cannot be represented by a slot chain.
Getoor et al. [8] extend PRMs, allowing slot chains to mention probabilistic
relationships. DAPER models are not so expressive. In the following section, we
introduce contingent DAPER models that remove this limitation.
In practice, we find both DAPER models and PRMs easy to understand.
Database designers who prefer ER models over relational models may prefer
DAPER models over PRMs, and vice versa. We note, however, that the purpose
of DAPER models and PRMs is not the implementation of mechanisms for data
storage, but rather the modeling of probabilistic dependencies. Consequently, even


Figure 7.19 A traditional PRM corresponding to the model in figure 7.4(b).

those who prefer to design databases with relational models may prefer the DAPER
model for probabilistic modeling, as DAPER models make explicit the distinction
between entities and relationships.

7.7 Technical Details
In this section, we formalize many of the concepts we have described. In addition, we state and prove a few relevant facts.
We use $\mathcal{E}$ and $\mathcal{R}$ to denote the sets of entity and relationship classes, respectively. We use $E$ and $R$ (sometimes with subscripts) to denote an entity and relationship class, respectively, and $X$ to denote an arbitrary class in $\mathcal{E} \cup \mathcal{R}$. We use $\sigma(E)$ and $\sigma(R)$ to denote an entity and relationship set, respectively, and $\sigma(X)$ to denote an arbitrary $\sigma(E)$ or $\sigma(R)$. We use $e$ and $r$ to denote a particular entity and relationship, respectively, and $x$ to denote an arbitrary entity or relationship. We use $X.A$ to denote the attribute class $A$ associated with class $X$, and $\mathcal{A}(X)$ to denote the set of attribute classes associated with class $X$. We use $x.A$ to denote an attribute associated with entity or relationship $x$, and $\mathcal{A}(x)$ to denote the set of attributes associated with $x$. Each attribute class and attribute is associated with a domain, that is, a set of possible values. The domain of $x.A$ is the same as the domain of $X.A$ for every $x \in X$.
First, we define the ER model in the following series of definitions.
Definition 7.1
An entity-relationship diagram for entity classes $\mathcal{E}$, relationship classes $\mathcal{R}$, and attribute classes $\mathcal{A}$ is a graph in which rectangular nodes correspond to entity classes, diamond nodes correspond to relationship classes, and oval nodes correspond to attribute classes of entity or relationship classes. The node corresponding to a relationship class among entities $E_1, \ldots, E_n \in \mathcal{E}$ is connected to the nodes corresponding to these entities with a dashed edge. Attribute classes corresponding to an entity or relationship class are connected to this class with dashed edges.
Definition 7.2
A skeleton for entity classes $\mathcal{E}$ and relationship classes $\mathcal{R}$, denoted $\sigma_{ER}$, consists of (1) an entity set $\sigma(E)$ for every $E \in \mathcal{E}$ and (2) a relationship set $\sigma(R)$ for every $R \in \mathcal{R}$ that is consistent with any constraints imposed by the relationship classes.
Definition 7.3
An entity-relationship model for entity classes $\mathcal{E}$, relationship classes $\mathcal{R}$, and attribute classes $\mathcal{A}$ is an ER diagram for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$ that defines a set of (ordinary) attributes $\mathcal{A}(\sigma_{ER})$ for any skeleton $\sigma_{ER}$. In particular, attribute $x.A$ is in $\mathcal{A}(\sigma_{ER})$ if and only if there is an $X$ in $\mathcal{E} \cup \mathcal{R}$ and an $x \in \sigma(X)$ such that $A$ is in $\mathcal{A}(X)$.
Definition 7.4
An entity-relationship instance for an ER model for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$, denoted $I_{ERA}$, consists of (1) a skeleton $\sigma_{ER}$ and (2) a value for every attribute in $\mathcal{A}(\sigma_{ER})$.
Now we consider domains wherein attributes may be probabilistic and define the DAPER model through the following series of definitions.
Definition 7.5
Given an entity or relationship class $X$ with entity or relationship $x$, the ordered set of entities $e(x)$ associated with $x$ is as follows. If $X$ is an entity class, then $e(x) = (x)$. If $X$ is a relationship class containing relationships $R(e_1, \ldots, e_n)$, then $e(x) = (e_1, \ldots, e_n)$.
Note that the set $e(x)$ is ordered to preserve roles associated with self-relationship classes.
Definition 7.6
Given an ER model with attribute classes $X.A$ and $Y.B$, the constraint $C_{AB}(e(x), e(y))$ for the ordered pair $(X.A, Y.B)$ is a first-order expression that is bound when the elements of $e(x)$ and $e(y)$ are taken as constants. The atoms of this expression have the form $R(e_1, \ldots, e_n)$ where $R$ is a relationship class connected to entity classes $E_1, \ldots, E_n$ or a predefined relationship class such as equality, less than, greater than, and first.
Definition 7.7
A directed probabilistic entity-relationship (DPER) diagram for entity classes $\mathcal{E}$, relationship classes $\mathcal{R}$, and attribute classes $\mathcal{A}$ consists of (1) an ER model for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$, and (2) a set of arc classes drawn as solid directed arcs corresponding to probabilistic dependencies. There can be at most one arc class from attribute class $X.A$ to attribute class $Y.B$, and any arc class may have a constraint $C_{AB}(e(x), e(y))$. The set of arc classes pointing to $X.A$ is the parent class of $X.A$, denoted $\mathrm{PA}(X.A)$.


Definition 7.8
A ground graph for a DPER diagram and skeleton $\sigma_{ER}$ for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$ is a directed graph constructed as follows. For every attribute in $\mathcal{A}(\sigma_{ER})$, there is a corresponding node in the graph. For any attribute $x.A \in \mathcal{A}(\sigma_{ER})$, its parent set $pa(x.A)$ consists of those attributes $y.B \in \mathcal{A}(y)$ such that there is an arc class from $Y.B$ to $X.A$ and the expression $C_{AB}(e(x), e(y))$ is true.
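The construction in definition 7.8 can be sketched directly: loop over arc classes, then over candidate tail and head entities or relationships, and add a parent whenever the constraint evaluates to true. The sketch below uses the student-course example; the data layout (tuples for e(x), Python functions for constraints) and the names `ATTRS`, `ARC_CLASSES`, and `ground_graph` are assumptions made for illustration.

```python
# Attribute classes attached to each entity or relationship class.
ATTRS = {"Student": ["IQ"], "Course": ["Diff"], "Takes": ["Grade"]}

# A skeleton: each member is stored as its ordered entity tuple e(x).
SKELETON = {
    "Student": [("s1",), ("s2",)],
    "Course": [("c1",)],
    "Takes": [("s1", "c1"), ("s2", "c1")],  # Takes(student, course)
}

# Arc classes: (tail class, tail attribute, head class, head attribute,
# constraint over (e(tail), e(head))).
ARC_CLASSES = [
    ("Course", "Diff", "Takes", "Grade", lambda e_c, e_t: e_c[0] == e_t[1]),
    ("Student", "IQ", "Takes", "Grade", lambda e_s, e_t: e_s[0] == e_t[0]),
]


def ground_graph(skeleton, attrs, arc_classes):
    """Return a dict mapping every ground attribute to its parent set."""
    parents = {
        (cls, x, a): set()
        for cls, members in skeleton.items()
        for x in members
        for a in attrs[cls]
    }
    for t_cls, t_attr, h_cls, h_attr, constraint in arc_classes:
        for y in skeleton[t_cls]:
            for x in skeleton[h_cls]:
                if constraint(y, x):
                    parents[(h_cls, x, h_attr)].add((t_cls, y, t_attr))
    return parents


parents = ground_graph(SKELETON, ATTRS, ARC_CLASSES)
```

Here the grade of `Takes(s1, c1)` receives exactly two parents, the difficulty of `c1` and the IQ of `s1`, while the IQ and Diff attributes have none.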
Definition 7.9
Given $\Sigma_{ER}$, a set of skeletons for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$, a DPER diagram for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$ is acyclic with respect to $\Sigma_{ER}$ if, for every $\sigma_{ER} \in \Sigma_{ER}$, the ground graph for the DPER diagram and $\sigma_{ER}$ is acyclic.
Theorem 7.10
If the probabilistic arcs of a DPER diagram for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$ form an acyclic graph, then the DPER diagram is acyclic with respect to $\Sigma_{ER}$ for any $\Sigma_{ER}$.
Proof Suppose the theorem is false. Consider a cyclic ground graph for some skeleton. Denote the attributes in the cycle by $(x_1.A_1 \to x_2.A_2 \to \cdots \to x_n.A_n)$ where $x_1.A_1 = x_n.A_n$. For each attribute $x_i.A_i$ there is an associated attribute class $X_i.A_i$. From definition 7.8, we know that there must be an edge from $X_i.A_i$ to $X_{i+1}.A_{i+1}$. Because $X_1.A_1 = X_n.A_n$, there must be a cycle in the DPER diagram, which is a contradiction. Q.E.D.
Friedman et al. [5] prove something equivalent.
Definition 7.11
A directed acyclic probabilistic entity-relationship (DAPER) model for entity classes $\mathcal{E}$, relationship classes $\mathcal{R}$, attribute classes $\mathcal{A}$, and skeletons $\Sigma_{ER}$ consists of (1) a DPER diagram for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$ that is acyclic with respect to every $\sigma_{ER} \in \Sigma_{ER}$, and (2) a local distribution class, denoted $P(X.A \mid \mathrm{PA}(X.A))$, for each attribute class $X.A$. Each local distribution class is a collection of information sufficient to determine a local distribution $p(x.A \mid pa(x.A))$ for any $x.A \in \mathcal{A}(\sigma_{ER})$. For every $\sigma_{ER} \in \Sigma_{ER}$, the DAPER model specifies a DAG model for $\mathcal{A}(\sigma_{ER})$. The structure of this DAG model is the ground graph of the DPER diagram for $\sigma_{ER}$. The local distributions of this DAG model are the local distributions $p(x.A \mid pa(x.A))$.
An immediate consequence of definition 7.11 is that, given $D$, a DAPER model for $\mathcal{E}$, $\mathcal{R}$, $\mathcal{A}$, and $\Sigma_{ER}$, and a skeleton $\sigma_{ER} \in \Sigma_{ER}$, we can write the joint distribution for $\mathcal{A}(\sigma_{ER})$ as follows:

$$p(I_{ERA} \mid \sigma_{ER}, D) = \prod_{X \in \mathcal{E} \cup \mathcal{R}} \; \prod_{x \in \sigma(X)} \; \prod_{A \in \mathcal{A}(X)} p(x.A \mid pa(x.A)). \tag{7.3}$$
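As a small numerical illustration of the product in (7.3), the sketch below evaluates the joint probability of an instance for a two-slice ground graph of the HMM example, restricted to the hidden attributes. The probabilities 0.5, 0.9, and 0.1, the node names, and the helper names `log_joint` and `hmm_local_dist` are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def log_joint(instance, parents, local_dist):
    """Sum log p(x.A | pa(x.A)) over every ground attribute, as in (7.3)."""
    total = 0.0
    for node, value in instance.items():
        parent_values = tuple(instance[p] for p in parents[node])
        total += math.log(local_dist(node, value, parent_values))
    return total


def hmm_local_dist(node, value, parent_values):
    """Local distribution class for Slice.H with two components."""
    if not parent_values:  # first slice: p(s0.H)
        return 0.5
    # Later slices: p(s_{i+1}.H | s_i.H), favoring staying in the same state.
    return 0.9 if value == parent_values[0] else 0.1


parents = {"s0.H": [], "s1.H": ["s0.H"]}
instance = {"s0.H": 0, "s1.H": 0}
joint = math.exp(log_joint(instance, parents, hmm_local_dist))
```

For this instance the product has two factors, 0.5 for the first slice and 0.9 for the transition, giving a joint probability of 0.45.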

In the remainder of this section, we describe a condition weaker than the one in theorem 7.10 that guarantees the creation of acyclic ground graphs from a DPER model. In this discussion, we use $R(e_1, \ldots, e_n)$ to denote a particular relationship in a relationship set $\sigma(R)$.


Definition 7.12
A relationship class $R$ is a self-relationship class with respect to entity class $E$ if a relationship in $R$ contains two or more references to entities in the entity class $E$.
Definition 7.13
A projected pairwise self-relationship class is obtained from a self-relationship class by projecting two of the entities in the relationships that are from the same entity class.
For example, the Family relationship class is a self-relationship class that can be projected into the Father-Child relationship class and the Mother-Child relationship class; and both are projected pairwise self-relationship classes.
Definition 7.14
Given skeleton $\sigma_{ER}$ for $\mathcal{E}$ and $\mathcal{R}$, a relationship set $\sigma(R)$ for a self-relationship class $R$ is cyclic if there exists a projected pairwise self-relationship class $R'$ for some entity set $E$ containing entities $e_1, \ldots, e_n$ such that $R'(e_1, e_2), \ldots, R'(e_{n-1}, e_n)$ and $R'(e_n, e_1)$. If a relationship set is not cyclic, it is acyclic.
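Definition 7.14 can be checked mechanically for a given relationship set: project every pair of same-class argument positions and look for a directed cycle in each projection. The sketch below does this for a Family-like relationship set; the function names and the tuple encoding of relationships are illustrative assumptions.

```python
from itertools import combinations


def _has_cycle(pairs):
    """Detect a directed cycle by repeatedly discarding arcs that end in a sink."""
    pairs = set(pairs)
    while pairs:
        tails = {a for a, _ in pairs}
        kept = {(a, b) for a, b in pairs if b in tails}
        if kept == pairs:
            return True  # every remaining arc can be extended, so a cycle exists
        pairs = kept
    return False


def is_cyclic(relationship_set, same_class_positions):
    """Definition 7.14: cyclic if some projected pairwise class contains a cycle."""
    for i, j in combinations(same_class_positions, 2):
        projected = {(r[i], r[j]) for r in relationship_set}
        if _has_cycle(projected):
            return True
    return False


# Family(child, mother, father): all three positions refer to Person.
family = [("child", "mom", "dad"), ("mom", "grandma", "grandpa")]
broken = [("a", "b", "x"), ("b", "a", "y")]  # a and b are each other's mother

family_is_cyclic = is_cyclic(family, (0, 1, 2))
broken_is_cyclic = is_cyclic(broken, (0, 1, 2))
```

Because reversing every arc preserves cycles, projecting each pair of positions in a single direction is enough.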
Definition 7.15
An arc class in a DPER model is called a self arc if both the head and tail of the arc are the same attribute class. A self-arc class is simple if there is exactly one entity class associated with the attribute class associated with the self arc.
Theorem 7.16
If (1) the arc classes excluding the self arcs of the DPER diagram for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$ form an acyclic graph, (2) every self-arc class is simple and has a constraint that has no disjunctions, has no negations, and contains a self-relationship class for the entity class associated with the self arc, and (3) for every self-relationship class $R$, $\sigma(R)$ is acyclic for every $\sigma_{ER} \in \Sigma_{ER}$, then the DPER diagram is acyclic with respect to $\Sigma_{ER}$.
Proof Suppose the theorem is false. Consider a ground graph $G$ for some skeleton $\sigma_{ER}$ containing a shortest cycle $(x_1.A_1, \ldots, x_n.A_n)$ where $x_1.A_1 = x_n.A_n$. Suppose that the cycle contains at least two distinct attribute classes, that is, $X_i.A_i \neq X_{i+1}.A_{i+1}$ for some $i$. This implies that there must be a cycle in the DAPER diagram with the self-arc classes removed; however, from condition (1) and theorem 7.10 this cannot be the case. Therefore, all of the attribute classes in the cycle must be the same and must be included due to a single self-arc class. Due to condition (2), the cycle in the self-arc class must imply a cyclic self-relationship class, but this contradicts condition (3). Q.E.D.


Figure 7.20 A contingent DAG model (structure) showing the context-specific independence: X and Z are independent given Y = 0, but dependent given Y = 1.

7.8 Extensions and Future Work


In this chapter, we have concentrated on the DAPER model, a model that expands into a DAG model given a skeleton. In this section, we examine classes of PER models that expand into graphical models other than traditional DAG models. Many of the ideas here are preliminary and provide opportunities for future work.
An important limitation of traditional graphical models is their inability to represent context-specific independence. An example of such independence is the pair of independencies: (1) X and Z are independent given Y = 0, and (2) X and Z are dependent given Y = 1. Many extensions to graphical models have been developed that can represent particular classes of context-specific independence, including decision-tree-DAG model hybrids (e.g., see [2]); contingent DAG models [6]; and similarity networks [12].
Let us consider a variation on contingent DAG models that uses notation slightly different from that in Fung and Shachter [6]. To understand this model class, consider the context-specific independence described in the previous paragraph: X and Z are independent given Y = 0, but dependent given Y = 1. Figure 7.20 shows a contingent DAG model (structure) for this independence. This contingent DAG model has a state constraint on the arc from X to Z that reads Y = 1. This constraint means that there is a dependence of Z on X only when Y = 1. In general, state constraints in contingent DAG models function much the way constraints do in DAPER models. In DAPER models, constraints are first-order expressions over entities that control the expansion to a DAG model. In contingent DAG models, state constraints are Boolean expressions over attribute-states that control the expression of conditional independence.
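The context-specific independence discussed here (X and Z independent given Y = 0, dependent given Y = 1) can be illustrated with a local distribution for Z whose dependence on X is switched on only in the context Y = 1. The sketch assumes binary attributes with X and Y as the only parents of Z; the numbers and the function names are arbitrary, hypothetical choices.

```python
def p_z_given_xy(z, x, y):
    """Local distribution for Z with parents X and Y (all binary)."""
    if y == 0:
        p_z1 = 0.3  # the same for both values of x: the dependence on X is off
    else:
        p_z1 = 0.9 if x == 1 else 0.2  # the dependence on X is on
    return p_z1 if z == 1 else 1.0 - p_z1


def z_depends_on_x(y):
    """True if, in context Y = y, the distribution of Z changes with X."""
    return any(p_z_given_xy(z, 0, y) != p_z_given_xy(z, 1, y) for z in (0, 1))


independent_in_context_0 = not z_depends_on_x(0)
dependent_in_context_1 = z_depends_on_x(1)
```

An ordinary DAG over X, Y, and Z would need an unconditional arc into Z from X and could not display the independence that holds when Y = 0.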
Now consider the contingent DAPER model, a model that expands to a contingent DAG model. The model is identical to an ordinary DAPER model except that arc classes are now annotated with an ordered pair. The first component of the ordered pair is a constraint just as is found in the ordinary DAPER model. The second component is a state constraint class that specifies the state constraints to be written during the expansion to a contingent DAG model. The state constraint class is a Boolean expression over attribute-states that may take head and tail entities as arguments.
Example 7.15 Identity Uncertainty
We have video images of multiple cars of different colors. We know how many cars there are and have zero or more observations of each car's color, but we are uncertain about what observations go with what cars.
Pasula and Russell [18] describe this example as having identity uncertainty. We can represent this example using the contingent DAPER model in figure 7.21(a). The two entity classes, Car and Observation, are related by the relationship class Of, where Of(o, c) holds when observation o corresponds to car c. The probabilistic relationship Of has the many-to-one restriction: an observation is associated with exactly one car. As in previous examples, the many-to-one restriction is represented by the Full relationship class Of, together with the attribute class Of.Exists and the deterministic node MutEx (which is set to true). The arc class from Car.Color to Observation.Color is annotated with the ordered pair (Of(o, c), Of(o, c).Exists = true). The first component says that we draw an arc from c.Color to o.Color only when Of(o, c) is true. (In this case, this constraint is vacuous because the relationship class Of is Full.) The second component says that, when we draw such an arc, we add to it the state constraint Of(o, c).Exists = true. Figure 7.21(b) shows the expansion of this contingent DAPER model to a contingent DAG model for a skeleton containing one car and two observations. Note that, because there is only one car, the MutEx nodes are redundant and can be omitted.
In this example, we know how many cars there are. If we do not, we can place a probability distribution on the number of cars and stipulate that the DAPER model in figure 7.21(a) should be applied to each possible number of cars.
Let us now discuss possibilities for relational modeling with undirected models. A commonly used (nonrelational) undirected model is the undirected graphical (UG) model. This model class has more than one definition, and the definitions coincide only for positive distributions [17]. Here, we define a UG for attributes $X$ with joint distribution $p(x)$ as a model having two components: (1) an undirected graph (the model structure) whose nodes are in one-to-one correspondence with $X$, and (2) a collection of non-negative clique functions $\phi_m(x_m)$, $m = 1, \ldots, M$, where $m$ indexes the maximal cliques of the graph and $X_m$ are the attributes in $X$ in the $m$th maximal clique, such that

$$p(x) = c \prod_{m=1}^{M} \phi_m(x_m). \tag{7.4}$$

The term $c$ is a normalization constant. As is the case for the DAG model, the UG model for $X$ defines the joint distribution for $X$. The clique functions are sometimes called potentials.
A UG model for (X, Y, Z) is shown in figure 7.22(a). The graph has a single maximal clique consisting of all three attributes, and hence represents an arbitrary distribution for these attributes.


Figure 7.21 (a) A contingent DAPER model for example 7.15, an example of identity uncertainty. (b) A contingent DAG model resulting from the expansion of the model in (a) given a skeleton containing one car and two observations.

A related but more general undirected model is the hierarchical log-linear graphical (HLLG) model. An HLLG model is a model having two components: (1) an undirected hypergraph (the model structure) whose nodes are in one-to-one correspondence with $X$, and (2) a collection of potentials $\phi_h(x_h)$, $h = 1, \ldots, H$, where $h$ indexes the hyperarcs of the graph and $x_h$ are the attributes in $X$ of the $h$th hyperarc, such that

$$p(x) = c \prod_{h=1}^{H} \phi_h(x_h). \tag{7.5}$$

Again, an HLLG model for $X$ defines the joint distribution for $X$. In this chapter, we represent a hyperarc as a triangle connecting multiple nodes with undirected edges. For example, figure 7.22(b) shows an HLLG model with a single hyperedge.
By virtue of (7.4) and (7.5), both UG and HLLG model structures define factorization constraints on distributions. In this sense, HLLG models are more general than UG models. That is, given any UG model structure, there exists an HLLG model structure that can encode the same factorization constraints, but not vice versa. For example, the UG structure in figure 7.22(a) has the equivalent HLLG model structure shown in figure 7.22(b). In contrast, the HLLG model structure shown in figure 7.22(c) encodes the factorization constraint

$$p(x, y, z) = c \, \phi_1(x, y) \, \phi_2(y, z) \, \phi_3(x, z),$$


(a)

(b)

(c)

(a) A UG model structure. (b) An equivalent HLLG model structure.


(c) An HLLG model that encodes pairwise interactions.

Figure 7.22

which cannot be represented by a UG model structure. Also, we note that the


factorization constraints of any HLLG model can be encoded with a factor-graph
model [16] in which all potentials are non-negative.
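The pairwise factorization constraint can be illustrated with a small numeric sketch. The three potential functions and their values below are hypothetical, chosen only to show how the constant c normalizes a product of one potential per hyperarc over three binary attributes.

```python
from itertools import product

# Hypothetical pairwise potentials over three binary attributes x, y, z.
# The numeric values are illustrative assumptions, not taken from the text.
def phi1(x, y):
    return 2.0 if x == y else 1.0

def phi2(y, z):
    return 3.0 if y == z else 1.0

def phi3(x, z):
    return 1.5 if x == z else 1.0

def unnormalized(x, y, z):
    # Product of one potential per hyperarc, as in equation (7.5).
    return phi1(x, y) * phi2(y, z) * phi3(x, z)

states = list(product([0, 1], repeat=3))
# The constant c makes the probabilities sum to one.
c = 1.0 / sum(unnormalized(*s) for s in states)
p = {s: c * unnormalized(*s) for s in states}

print(p[(0, 0, 0)])
```

Because every potential involves only two attributes, no single term couples all three, which is the constraint a single three-node clique in a UG structure cannot express.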
Turning to relational modeling, let us consider the hierarchical log-linear probabilistic entity-relationship (HELPER) model. A model in this class expands into
an HLLG model. Like the DAPER model, a HELPER model is an extension of
an ER model. In contrast to the DAPER model, the probabilistic component of a
HELPER model is expressed as hyperedge classes and potential classes on those
hyperedges. Hyperedge classes are expanded to hyperedges according to constraints. These constraints, in turn, may be any first-order expression that is bound
given the entities associated with the endpoints of the hyperedge.
Example 7.16
An arbitrary hierarchical log-linear graphical model with at most two-way interactions.
The HELPER diagram for this example is shown in figure 7.23(a). There is
a single entity class Variable corresponding to the attributes in the hierarchical
log-linear model, a single attribute class X, and a single self-relationship class
Neigh, where Neigh(v1, v2) holds if v1.X and v2.X have a pairwise interaction. The only
hyperedge class in the model is a self edge that connects Variable.X with itself. The
constraint on this hyperedge class is such that v1.X and v2.X will be neighbors in
the ground graph only when Neigh(v1, v2) holds. Note that the Neigh relationship
class is restricted to be upper triangular so that the expanded graph has no self
arcs and has at most one arc between any two attributes.
A sample skeleton for three attributes and the resulting hierarchical log-linear
model are shown in figure 7.23(b) and (c), respectively.
Whereas HLLG models have a natural relational counterpart, UG models do not.
To understand this point, imagine a PER model that expands to a UG model. Such
a model would need a mechanism for specifying potentials in the ground graph. Such
potentials, however, are not defined until the maximal cliques of the ground graph
are determined, and these cliques will depend on the skeleton used to expand the
PER model.

Figure 7.23 (a) A HELPER model for an arbitrary hierarchical log-linear model
with at most two-way interactions. (b) An example skeleton. (c) The hierarchical
log-linear model resulting from the model in (a) applied to the skeleton in (b).

Finally, there are numerous classes of graphical models that we have not yet
explored, including mixed directed and undirected models (e.g., see [17]); directed
factor-graph models [4]; influence diagrams [14]; and dependency networks [13]. The
development of PER models that expand to models in these classes also provides
opportunities for research.

Acknowledgments
We thank David Blei, Tom Dietterich, Brian Milch, and Ben Taskar for useful
comments.

References
[1] J. Besag. Spatial interaction and the statistical analysis of lattice systems.
Journal of the Royal Statistical Society, 36:192-236, 1974.
[2] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific
independence in Bayesian networks. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1996.
[3] W. Buntine. Operations for learning with graphical models. Journal of
Artificial Intelligence Research, 2:159-225, 1994.
[4] B. Frey. Extending factor graphs so as to unify directed and undirected
graphical models. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2003.
[5] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[6] R. Fung and R. Shachter. Contingent belief networks. 1990.
[7] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis.
Chapman and Hall, London, 1995.
[8] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic
models of link structure. Journal of Machine Learning Research, 3:679-707,
2002.
[9] W. Gilks, A. Thomas, and D. Spiegelhalter. A language and program for
complex Bayesian modeling. The Statistician, 43:169-177, 1994.
[10] J. Gill, J. Howse, S. Kent, and J. Taylor. Projections in Venn-Euler diagrams.
In Proceedings of the IEEE Symposium on Visual Languages, 2000.
[11] I. Good. The Estimation of Probabilities. MIT Press, Cambridge, MA, 1965.
[12] D. Heckerman. Probabilistic Similarity Networks. MIT Press, Cambridge,
MA, 1991.
[13] D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization.
Journal of Machine Learning Research, 1:49-75, 2000.
[14] R. Howard and J. Matheson. Influence diagrams. In Readings on the
Principles and Applications of Decision Analysis, volume 2, pages 721-762.
Strategic Decisions Group, Menlo Park, CA, 1981.
[15] D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In Proceedings
of the Conference on Uncertainty in Artificial Intelligence, 1997.
[16] F. Kschischang, B. Frey, and H. Loeliger. Factor graphs and the sum-product
algorithm. IEEE Transactions on Information Theory, 47:498-519, 2001.
[17] S. Lauritzen. Graphical Models. Clarendon Press, Oxford, UK, 1996.
[18] H. Pasula and S. Russell. Approximate inference in first-order probabilistic
languages. In Proceedings of the International Joint Conference on Artificial
Intelligence, 2001.
[19] J. Ullman and J. Widom. A First Course in Database Systems. Prentice Hall,
Upper Saddle River, NJ, 2002.

8 Relational Dependency Networks

Jennifer Neville and David Jensen

Recent work on graphical models for relational data has demonstrated significant
improvements in classification and inference when models represent the dependencies among instances. Despite its use in conventional statistical models, the assumption of instance independence is contradicted by most relational data sets.
For example, in citation data there are dependencies among the topics of a paper's references, and in genomic data there are dependencies among the functions
of interacting proteins. In this chapter we present relational dependency networks
(RDNs), a graphical model that is capable of expressing and reasoning with such
dependencies in a relational setting. We discuss RDNs in the context of relational
Bayes networks and relational Markov networks and outline the relative strengths
of RDNs: namely, the ability to represent cyclic dependencies, simple methods for
parameter estimation, and efficient structure learning techniques. The strengths of
RDNs are due to the use of pseudo-likelihood learning techniques, which estimate an
efficient approximation of the full joint distribution. We present learned RDNs for
a number of real-world data sets and evaluate the models in a prediction context,
showing that RDNs identify and exploit cyclic relational dependencies to achieve
significant performance gains over conventional conditional models.

8.1 Introduction

Many data sets routinely captured by businesses and organizations are relational in
nature, yet until recently most machine learning research has focused on flattened
propositional data. Instances in propositional data record the characteristics of
homogeneous and statistically independent objects; instances in relational data
record the characteristics of heterogeneous objects and the relations among those
objects. Examples of relational data include citation graphs, the World Wide
Web, genomic structures, fraud detection data, epidemiology data, and data on
interrelated people, places, and events extracted from text documents.

The presence of autocorrelation provides a strong motivation for using relational
techniques for learning and inference. Autocorrelation is a statistical dependency
between the values of the same variable on related entities and is a nearly ubiquitous
characteristic of relational data sets [14]. More formally, autocorrelation is defined
with respect to a set of related instance pairs $P_R = \{(o_i, o_j) : o_i, o_j \in O\}$; it is the
correlation between the values of a variable $X$ on the instance pairs $(o_i.x, o_j.x)$
such that $(o_i, o_j) \in P_R$. Recent analyses of relational data sets have reported
autocorrelation in the following variables:
Topics of hyperlinked webpages [4, 39]
Industry categorization of corporations that share board members [30]
Fraud status of cellular customers who call common numbers [5]
Topics of coreferent scientific papers [38, 29]
Functions of proteins located together in a cell [28]
Box office receipts of movies made by the same studio [14]
Industry categorization of corporations that co-occur in news stories [1]
Tuberculosis infection among people in close contact [10]
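The formal definition of autocorrelation above can be sketched in a few lines. The function below is a minimal illustration, assuming a numeric variable and Pearson correlation as the correlation measure; counting each related pair in both directions is also an assumption made here so that the result does not depend on pair ordering. The object names and values are hypothetical.

```python
from math import sqrt

def autocorrelation(values, related_pairs):
    """Pearson correlation of one variable over related instance pairs.

    Assumes the variable is not constant over the pairs (nonzero variance).
    """
    # Count each pair in both directions so the measure is symmetric.
    xs, ys = [], []
    for i, j in related_pairs:
        xs += [values[i], values[j]]
        ys += [values[j], values[i]]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / sqrt(vx * vy)

# Hypothetical data: four objects, related in two pairs with equal values.
values = {"o1": 1.0, "o2": 1.0, "o3": 0.0, "o4": 0.0}
pairs = [("o1", "o2"), ("o3", "o4")]
print(autocorrelation(values, pairs))
```

When related objects always share a value, as in this toy data, the measure is at its maximum; pairing objects with opposite values instead drives it negative.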
When relational data exhibit autocorrelation there is a unique opportunity
to improve model performance because inferences about one object can inform
inferences about related objects. Indeed, recent work in relational domains has
shown that collective inference over an entire data set results in more accurate
predictions than conditional inference for each instance independently [e.g., 4,
30, 21], and that the gains over conditional models increase as autocorrelation
increases [16].
Joint relational models are able to exploit autocorrelation by estimating a joint
probability distribution over an entire relational data set and collectively inferring
the labels of related instances. Recent research has produced several novel types of
graphical models for estimating joint probability distributions for relational data
that consist of nonindependent and heterogeneous instances [e.g., 10, 39]. We will
refer to these models as probabilistic relational models (PRMs).1 PRMs extend
traditional graphical models such as Bayesian networks to relational domains,
removing the assumption of i.i.d. instances that underlies conventional learning
techniques. PRMs have been successfully evaluated in several domains, including
the World Wide Web, genomic data, and scientific literature.
Directed PRMs, such as relational Bayes networks2 (RBNs) [10], can model
autocorrelation dependencies if they are structured in a manner that respects the
acyclicity constraint of the model. While domain knowledge can sometimes be used
to structure the autocorrelation in an acyclic manner, often an acyclic ordering
is unknown or does not exist. For example, in genetic pedigree analysis there
is autocorrelation among the genes of relatives [20]. In this domain, the causal
relationship is from ancestor to descendant, so we can use the temporal parent-child relationship to structure the dependencies in an acyclic manner (i.e., parents'
genes will never be influenced by the genes of their children). However, given a set
of hyperlinked webpages, there is little information to use to determine the causal
direction of the dependency between their topics. In this case, we can only represent
an (undirected) correlation between the topics of two pages, not a (directed) causal
relationship. The acyclicity constraint of directed PRMs precludes the learning of
arbitrary autocorrelation dependencies and thus severely limits the applicability of
these models in relational domains.

1. Several previous papers [e.g., 8, 10] use the term probabilistic relational model to refer
to a specific model that is now often called a relational Bayesian network [Koller, personal
communication]. In this paper, we use PRM in its more recent and general sense.
2. We use the term relational Bayesian network to refer to Bayesian networks that have
been upgraded to model relational databases. The term has also been used by Jaeger [13]
to refer to Bayesian networks where the nodes correspond to relations and their values
represent possible interpretations of those relations in a specific domain.

Undirected PRMs, such as relational Markov networks (RMNs) [39], can represent and reason with arbitrary forms of autocorrelation. However, research on these
models has focused primarily on parameter estimation and inference procedures.
The current RMN learning algorithm does not select features; model structure
must be prespecified by the user. While in principle it is possible for RMN techniques to learn cyclic autocorrelation dependencies, inefficient parameter estimation makes this difficult in practice. Because parameter estimation requires multiple
rounds of inference over the entire data set, it is impractical to incorporate it as
a subcomponent of feature selection. Recent work on conditional random fields for
sequence analysis includes a feature selection algorithm [24] that could be extended
for RMNs. However, the algorithm abandons estimation of the full joint distribution and uses pseudo-likelihood estimation, which makes the approach tractable
but removes some of the advantages of reasoning with the full joint distribution.
In this chapter, we outline relational dependency networks (RDNs), an extension
of dependency networks (DNs) [11] for relational data. RDNs can represent and
reason with the cyclic dependencies required to express and exploit autocorrelation
during collective inference. In this regard, they share certain advantages of RMNs
and other undirected models of relational data [4, 6]. Also, to our knowledge, RDNs
are the first PRM capable of learning cyclic autocorrelation dependencies. RDNs
offer a relatively simple method for structure learning and parameter estimation,
which results in models that are easier to understand and interpret. In this regard
they share certain advantages of RBNs and other directed models [37, 12]. The
primary distinction between RDNs and other existing PRMs is that RDNs are an
approximate model. RDN models approximate the full joint distribution and thus
are not guaranteed to specify a coherent probability distribution. However, the
quality of the approximation will be determined by the data available for learning:
if the models are learned from large data sets, and combined with Monte Carlo
inference techniques, the approximation should not be a disadvantage.
We start by reviewing the details of DNs for propositional data. Then we describe
the general characteristics of PRMs and outline the specifics of RDN learning
and inference procedures. We evaluate RDN learning and inference algorithms on
both synthetic and real-world data sets, presenting learned RDNs for subjective
evaluation and evaluating the models in a prediction context. Of particular note,
all the real-world data sets exhibit multiple autocorrelation dependencies that were
automatically discovered by the RDN learning algorithm. Finally, we review related
work and conclude with a discussion of future directions.

8.2 Dependency Networks

Graphical models represent a joint distribution over a set of variables. The primary
distinction between representations such as Bayesian networks and Markov networks and DNs is that DNs are an approximate representation. DNs approximate
the joint distribution with a set of conditional probability distributions (CPDs) that
are learned independently. This approach to learning results in significant efficiency
gains over exact models. However, because the CPDs are learned independently, DN
models are not guaranteed to specify a consistent joint distribution. This precludes
DNs from being used to infer causal relationships and limits the applicability of
exact inference techniques. Nevertheless, DNs can encode predictive relationships
(i.e., dependence and independence), and Gibbs sampling inference techniques [e.g.,
27] can be used to recover a full joint distribution, regardless of the consistency of
the local CPDs.
8.2.1 DN Representation

Dependency networks are an alternative form of graphical model that approximates
the full joint distribution with a set of conditional probability distributions that are
each learned independently. A DN encodes probabilistic relationships among a set
of variables $\mathbf{X}$ in a manner that combines characteristics of both undirected and
directed graphical models. Dependencies among variables are represented with a
bidirected graph $G = (V, E)$, where conditional independence is interpreted using
graph separation, as with undirected models. However, as with directed models,
dependencies are quantified with a set of conditional probability distributions $P$.
Each node $v_i \in V$ corresponds to an $X_i \in \mathbf{X}$ and is associated with a probability
distribution conditioned on the other variables, $P(v_i) = p(x_i \mid \mathbf{x} \setminus \{x_i\})$. The parents
of node $i$ are the set of variables that render $X_i$ conditionally independent of the
other variables ($p(x_i \mid pa_i) = p(x_i \mid \mathbf{x} \setminus \{x_i\})$), and $G$ contains a directed edge from
each parent node $v_j$ to each child node $v_i$ ($e(v_j, v_i) \in E$ iff $X_j \in pa_i$). The CPDs
in $P$ do not necessarily factor the joint distribution, so we cannot compute the
joint probability for a set of values $\mathbf{x}$ directly. However, given $G$ and $P$, a joint
distribution can be recovered through Gibbs sampling (see below for details). From
the joint distribution, we can extract any probabilities of interest.
8.2.2 DN Learning

Both the structure and parameters of DN models are determined through learning
the local CPDs. The DN learning algorithm learns a separate distribution for each
variable $X_i$, conditioned on the other variables in the data (i.e., $\mathbf{X} \setminus \{X_i\}$). Any
conditional learner can be used for this task (e.g., logistic regression, decision trees).
The CPD is included in the model as $P(v_i)$ and the variables selected by the
conditional learner form the parents of $X_i$ (e.g., if $p(x_i \mid \mathbf{x} \setminus \{x_i\}) = \alpha x_j + \beta x_k$, then
$pa_i = \{x_j, x_k\}$). The parents are then reflected in the edges of $G$ appropriately. If
the conditional learner is not selective (i.e., the algorithm does not select a subset
of the features), the DN model will be fully connected (i.e., $pa_i = \mathbf{x} \setminus \{x_i\}$). In
order to build understandable DNs, it is desirable to use a selective learner that
will learn CPDs that use a subset of the variables.
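The learning procedure described above can be sketched for binary data as follows. This is only an illustration under stated assumptions: the selective conditional learner is replaced by a simple stand-in that keeps variables whose empirical mutual information with the target exceeds a hypothetical `threshold`, and each CPD is a Laplace-smoothed frequency table. Any conditional learner (for example, logistic regression or a decision tree) could take its place.

```python
from collections import Counter
from itertools import product
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information between binary columns i and j."""
    n = len(data)
    joint = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum((c / n) * log(c * n / (pi[a] * pj[b]))
               for (a, b), c in joint.items())

def learn_dn(data, threshold=0.05):
    """Learn one CPD per variable; the selected inputs become its parents."""
    m = len(data[0])
    parents, cpds = {}, {}
    for i in range(m):
        # Stand-in selective learner: keep variables with enough mutual information.
        pa = [j for j in range(m) if j != i
              and mutual_information(data, i, j) > threshold]
        counts = Counter((tuple(row[j] for j in pa), row[i]) for row in data)
        # Laplace-smoothed table for p(x_i = 1 | pa_i).
        cpds[i] = {cfg: (counts[(cfg, 1)] + 1) /
                        (counts[(cfg, 0)] + counts[(cfg, 1)] + 2)
                   for cfg in product([0, 1], repeat=len(pa))}
        parents[i] = pa
    return parents, cpds

# Hypothetical data: columns 0 and 1 always agree, column 2 is unrelated.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 5
parents, cpds = learn_dn(data)
print(parents)
```

Each CPD is fit without reference to the others, so the parent sets directly give the edges of G; a non-selective learner would return every other variable as a parent and produce a fully connected graph.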
8.2.3 DN Inference

Although the DN approach to structure learning is simple and efficient, it can
result in an inconsistent network, both structurally and numerically. In other words,
there may be no joint distribution from which each of the CPDs can be obtained
using the rules of probability. Learning the CPDs independently with a selective
conditional learner can result in a network that contains a directed edge from $X_i$
to $X_j$, but not from $X_j$ to $X_i$. This is a structural inconsistency: $X_i$ and $X_j$ are
dependent but $X_j$ is not represented in the CPD for $X_i$. In addition, learning the
CPDs independently from finite samples may result in numerical inconsistencies
in parameter estimates, where the derived joint distribution does not sum to
one. In practice, Heckerman et al. [11] show that DNs are nearly consistent if
learned from large data sets because the data serves a coordinating function to
ensure some degree of consistency among the CPDs. However, even when a DN is
inconsistent, approximate inference techniques can still be used to estimate a full
joint distribution and extract probabilities of interest. Gibbs sampling can be used
to recover a full joint distribution, regardless of the consistency of the local CPDs,
provided that each $X_i$ is discrete and its CPD is positive [11].
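A minimal sketch of Gibbs sampling over a DN is given below, assuming binary variables and CPDs supplied as plain functions. The two CPDs are hypothetical and deliberately inconsistent with each other (no single joint distribution yields both), yet the sampler still produces a well-defined empirical joint because every conditional probability is strictly positive. The iteration counts and seed are arbitrary choices.

```python
import random

def gibbs_sample(cpds, n_vars, n_iter=5000, burn_in=500, seed=0):
    """Ordered Gibbs sampler over binary variables.

    cpds[i](state) returns p(x_i = 1 | the other values in state).
    """
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(n_vars)]
    samples = []
    for it in range(n_iter):
        for i in range(n_vars):
            # Resample x_i from its local CPD given the current values.
            state[i] = 1 if rng.random() < cpds[i](state) else 0
        if it >= burn_in:
            samples.append(tuple(state))
    return samples

# Hypothetical positive CPDs: each variable tends to copy the other.
cpds = [
    lambda s: 0.9 if s[1] == 1 else 0.1,
    lambda s: 0.8 if s[0] == 1 else 0.2,
]
samples = gibbs_sample(cpds, n_vars=2)
agree = sum(a == b for a, b in samples) / len(samples)
print(agree)
```

Probabilities of interest, such as the chance that the two variables agree, are then read off as frequencies in the retained samples.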

8.3 Relational Dependency Networks

Several characteristics of DNs are particularly desirable for modeling relational
data. First, learning a collection of conditional models offers significant efficiency
gains over learning a full joint model. This is generally true, but is even more
pertinent to relational settings where the feature space is very large. Second,
networks that are easy to interpret and understand aid analysts' assessment of
the utility of the relational information. Third, the ability to represent cycles
in a network facilitates reasoning with autocorrelation, a common characteristic
of relational data. In addition, whereas the need for approximate inference is a
disadvantage of DNs for propositional data, due to the complexity of relational
model graphs in practice, all PRMs use approximate inference.
RDNs extend DNs to work with relational data in much the same way that RBNs
extend Bayesian networks and RMNs extend Markov networks. These extensions
take a graphical model formalism and upgrade [17] it to a first-order logic representation with an entity-relationship model. We start by describing the general
characteristics of PRMs and then discuss the details of RDNs in this context.
8.3.1 Probabilistic Relational Models

PRMs represent a joint probability distribution over the attributes of a relational
data set. When modeling propositional data with a graphical model, there is a single
graph $G$ that comprises the model. In contrast, there are three graphs associated
with models of relational data: the data graph $G_D$, the model graph $G_M$, and the
inference graph $G_I$. These correspond to the skeleton, model, and ground graph as
outlined in Heckerman et al. [12].
First, the relational data set is represented as a typed, attributed data graph
$G_D = (V_D, E_D)$. For example, consider the data graph in figure 8.1(a). The nodes
$V_D$ represent objects in the data (e.g., authors, papers) and the edges $E_D$ represent
relations among the objects (e.g., author-of, cites).3 Each node $v_i \in V_D$ and edge
$e_j \in E_D$ is associated with a type $T(v_i) = t_{v_i}$ (e.g., paper, cited-by). Each item4
type $t \in T$ has a number of associated attributes, $\mathbf{X}^t = (X^t_1, \ldots, X^t_m)$ (e.g., topic,
year). Consequently, each object $v_i$ and link $e_j$ is associated with a set of attribute
values determined by their type, $\mathbf{X}^{t_{v_i}}_{v_i} = (X^{t_{v_i}}_{v_i 1}, \ldots, X^{t_{v_i}}_{v_i m})$, $\mathbf{X}^{t_{e_j}}_{e_j} = (X^{t_{e_j}}_{e_j 1}, \ldots, X^{t_{e_j}}_{e_j m'})$.
A PRM represents a joint distribution over the values of the attributes in
the data graph, $\mathbf{x} = \{\mathbf{x}^{t_{v_i}}_{v_i} : v_i \in V, t_{v_i} = T(v_i)\} \cup \{\mathbf{x}^{t_{e_j}}_{e_j} : e_j \in E, t_{e_j} = T(e_j)\}$.
Next, the dependencies among attributes are represented in the model graph
$G_M = (V_M, E_M)$. Attributes of an item can depend probabilistically on other
attributes of the same item, as well as on attributes of other related objects or
links in $G_D$. For example, the topic of a paper may be influenced by attributes of
the authors that wrote the paper. Instead of defining the dependency structure over
attributes of specific objects, PRMs define a generic dependency structure at the
level of item types. Each node $v \in V_M$ corresponds to an $X^t_k$, where $t \in T \wedge X^t_k \in \mathbf{X}^t$.
The set of attributes $\mathbf{X}^t_k = (X^t_{ik} : (v_i \in V \vee e_i \in E) \wedge T(i) = t)$ is tied together and
modeled as a single variable. This approach of typing items and tying parameters
across items of the same type is an essential component of PRM learning. It enables
generalization from a single instance (i.e., one data graph) by decomposing the data
graph into multiple examples of each item type (e.g., all paper objects), and building
a joint model of dependencies between and among attributes of each type.

3. We use rectangles to represent objects, circles to represent random variables, dashed
lines to represent relations, and solid lines to represent probabilistic dependencies.
4. We use the generic term item to refer to objects or links.

Figure 8.1 Example (a) data graph and (b) model graph.
As in conventional graphical models, each node is associated with a probability
distribution conditioned on the other variables. Parents of $X^t_k$ are either (1) other
attributes associated with type $t_k$ (e.g., paper topic depends on paper type), or (2)
attributes associated with items of type $t_j$ where items $t_j$ are related to items $t_k$ in
$G_D$ (e.g., paper topic depends on author rank). For the latter type of dependency, if
the relation between $t_k$ and $t_j$ is one-to-many, the parent consists of a set of attribute
values (e.g., author ranks). In this situation, current PRM models use aggregation
functions to generalize across heterogeneous items (e.g., one paper may have two
authors while another may have five). Aggregation functions are used to either map
sets of values into single values, or to combine a set of probability distributions into
a single distribution.
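As an illustration of the first kind of aggregation, the sketch below maps variable-size sets of values to single values. The choice of mean and mode as aggregators, and the paper and rank data, are assumptions made for the example rather than functions prescribed by the text.

```python
from collections import Counter
from statistics import mean

def aggregate_mean(values):
    """Map a variable-size set of numeric values to a single value."""
    return mean(values) if values else None

def aggregate_mode(values):
    """Map a variable-size set of discrete values to the most common one."""
    return Counter(values).most_common(1)[0][0] if values else None

# Hypothetical data: one paper with two authors, another with five.
author_ranks = {"P1": [3, 5], "P2": [1, 2, 2, 4, 6]}
features = {p: aggregate_mean(r) for p, r in author_ranks.items()}
print(features)
```

After aggregation every paper has exactly one value for the feature, so a single CPD can be shared across papers regardless of how many authors each has.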
Consider the RDN model graph $G_M$ in figure 8.1(b). It models the data in
figure 8.1(a), which has two object types: paper and author. In $G_M$, each item
type is represented by a plate, and each attribute of each item type is represented
as a node. Edges characterize the dependencies among the attributes at the type
level. The representation uses a modified plate notation: dependencies among
attributes of the same object are contained inside the rectangle and arcs that cross
the boundary of the rectangle represent dependencies among attributes of related
objects. For example, $month_i$ depends on $type_i$, while $avgrank_j$ depends on the
$type_k$ and $topic_k$ for all papers $k$ related to author $j$ in $G_D$.
There is a nearly limitless range of dependencies that could be considered by
algorithms learning PRM models. In propositional data, learners model a fixed
set of attributes intrinsic to each object. In contrast, in relational data, learners
must decide how much to model (i.e., how much of the relational neighborhood
around an item can influence the probability distribution of an item's attributes).
For example, a paper's topic may depend on the topics of other papers written by its
authors, but what about the topics of the references in those papers or the topics
of other papers written by coauthors of those papers? Two common approaches to
limiting search in the space of relational dependencies are (1) exhaustive search of
all dependencies within a fixed-distance neighborhood (e.g., attributes of items up
to k links away), or (2) greedy iterative-deepening search, expanding the search in
the neighborhood in directions where the dependencies improve the likelihood.
Finally, during inference, a PRM uses a model graph $G_M$ and a data graph
$G_D$ to instantiate an inference graph $G_I = (V_I, E_I)$ in a process sometimes called
"rollout." The rollout procedure used by PRMs to produce $G_I$ is nearly identical
to the process used to instantiate sequence models such as hidden Markov models.
$G_I$ represents the probabilistic dependencies among all the variables in a single test
set (here $G_D$ is usually different from the $G_D'$ used for training). The structure of $G_I$ is
determined by both $G_D$ and $G_M$: each item-attribute pair in $G_D$ gets a separate,
local copy of the appropriate CPD from $G_M$. The relations in $G_D$ constrain the way
that $G_M$ is rolled out to form $G_I$. PRMs can produce inference graphs with wide
variation in overall and local structure because the structure of $G_I$ is determined
by the specific data graph, which typically has nonuniform structure. For example,
figure 8.2 shows the RDN from figure 8.1b rolled out over a data set of three authors
and three papers, where $P_1$ is authored by $A_1$ and $A_2$, $P_2$ is authored by $A_2$ and
$A_3$, and $P_3$ is authored by $A_3$. Notice that there are a variable number of authors
per paper. This illustrates why current PRMs use aggregation in their CPDs: for
example, the CPD for paper-type must be able to deal with a variable number of
author ranks.
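The rollout step can be sketched as follows for the three-author, three-paper example. The dictionary encoding of the type-level model graph, the `"self"`/`"related"` scope labels, and the two dependencies included (month on type within a paper, and author avgrank on the type and topic of related papers) are illustrative assumptions; they cover only part of the model in figure 8.1b.

```python
# Hypothetical type-level model graph: child attribute -> list of
# (parent type, parent attribute, "self" or "related").
model_graph = {
    ("paper", "month"): [("paper", "type", "self")],
    ("author", "avgrank"): [("paper", "type", "related"),
                            ("paper", "topic", "related")],
}

# Data graph from the example in the text: three authors, three papers.
item_type = {"A1": "author", "A2": "author", "A3": "author",
             "P1": "paper", "P2": "paper", "P3": "paper"}
author_of = [("A1", "P1"), ("A2", "P1"), ("A2", "P2"),
             ("A3", "P2"), ("A3", "P3")]

def neighbors(item):
    """Items linked to the given item in the data graph."""
    return ([p for a, p in author_of if a == item] +
            [a for a, p in author_of if p == item])

def rollout(model_graph, item_type):
    """Create one edge per instantiated dependency in the inference graph."""
    edges = []
    for item, t in item_type.items():
        for (child_t, child_attr), pa in model_graph.items():
            if child_t != t:
                continue
            for pa_t, pa_attr, scope in pa:
                sources = [item] if scope == "self" else neighbors(item)
                edges += [((s, pa_attr), (item, child_attr))
                          for s in sources if item_type[s] == pa_t]
    return edges

inference_edges = rollout(model_graph, item_type)
print(len(inference_edges))
```

The number of incoming edges differs by author (A1 has one paper, A2 and A3 have two each), which is the nonuniform local structure that aggregation in the CPDs has to absorb.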

Figure 8.2 Example PRM inference graph.

8.3.2 RDN Representation

Relational dependency networks encode probabilistic relationships in a similar
manner to DNs, extending the representation to a relational setting. RDNs use
a bidirected model graph $G_M$ with a set of conditional probability distributions $P$.
Each node $v_i \in V_M$ corresponds to an $X^t_k \in \mathbf{X}^t$, $t \in T$, and is associated with a
conditional distribution $p(x^t_k \mid pa_{x^t_k})$. Figure 8.1b illustrates an example RDN model
graph for the data graph in figure 8.1a. The graphical representation illustrates
the qualitative component ($G_M$) of the RDN; it does not depict the quantitative
component ($P$) of the model, which consists of CPDs that use aggregation functions.
Although conditional independence is inferred using an undirected view of the
graph, bidirected edges are useful for representing the set of variables in each CPD.
For example, in figure 8.1b the CPD for Year contains Topic but the CPD for Topic
does not contain Type. This depicts any inconsistencies that result from the RDN
learning technique.
8.3.3 RDN Learning

Learning a PRM model consists of two tasks: learning the dependency structure
among the attributes of each object type, and estimating the parameters of the local
probability models for an attribute given its parents. Relatively efficient techniques
exist for learning both the structure and parameters of RBN models. However, these
techniques exploit the requirement that the CPDs factor the full distribution, a
requirement that imposes acyclicity constraints on the model and precludes the
learning of arbitrary autocorrelation dependencies. On the other hand, although in
principle it is possible for RMN techniques to learn cyclic autocorrelation dependencies, inefficiencies due to calculating the normalizing constant $Z$ in undirected
models make this difficult in practice. Calculation of $Z$ requires a summation over
all possible states X. When modeling the joint distribution of propositional data,
the number of states is exponential in the number of attributes (i.e., $O(2^m)$). When
modeling the joint distribution of relational data, the number of states is exponential in the number of attributes and the number of instances. If there are $N$
objects, each with $m$ attributes, then the total number of states is $O(2^{Nm})$. For
any reasonable-size data set, a single calculation of $Z$ is an enormous computational
burden. Feature selection generally requires repeated parameter estimation while
measuring the change in likelihood affected by each attribute, which would require
recalculation of $Z$ on each iteration.
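To make the growth concrete, here is a small worked instance. The sizes are hypothetical and the attributes are assumed to be binary: with $m = 5$ attributes per object and $N = 100$ objects,

```latex
\[
2^{m} = 2^{5} = 32,
\qquad
2^{Nm} = 2^{500} \approx 3.3 \times 10^{150}.
\]
```

Summing over the second quantity even once is infeasible, and feature selection would repeat that sum for every candidate attribute.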
The RDN learning algorithm uses a more efficient alternative: estimating the
set of conditional distributions independently rather than jointly. This approach is
based on pseudo-likelihood techniques [2], which were developed for modeling spatial
data sets with similar autocorrelation dependencies. Pseudo-likelihood estimation
avoids the complexities of estimating $Z$ and the requirement of acyclicity. In addition, this approach can utilize existing techniques for learning CPDs of relational
data such as first-order Bayesian classifiers [7], structural logistic regression [35], or
ACORA [34].
Instead of optimizing the log-likelihood of the full joint distribution, we optimize
the pseudo-loglikelihood for each variable independently, conditioned on all other
attribute values in the data:

$$PL(G_D; \theta) = \prod_{t \in T} \; \prod_{X^t_i \in \mathbf{X}^t} \; \prod_{v : T(v) = t} p(x^t_{vi} \mid pa_{x^t_{vi}}). \qquad (8.1)$$

Table 8.1 RDN learning algorithm

Learn RDN ($G_D$, $R$, $Q^t$, $\mathbf{X}^t$):
    $P \leftarrow \emptyset$
    For each $t \in T$:
        For each $X^t_k \in \mathbf{X}^t$:
            Use $R$ to learn a CPD for $X^t_k$ given the attributes $\{X^t_{k' \neq k}\} \cup \mathbf{X}^{t' \neq t}$
            in the relational neighborhood defined by $Q^t$.
            $P \leftarrow P \cup CPD_{X^t_k}$
    Use $P$ to form $G_M$.

With this approach we give up the asymptotic efficiency guarantees of maximum
likelihood estimators. However, under some general conditions the consistency of
maximum pseudo-likelihood estimators can be established [9], which implies that,
as the sample size $N \to \infty$, pseudo-likelihood estimators will produce unbiased estimates
of the true parameters.
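The objective in (8.1) is cheap to evaluate because it only ever calls the local CPDs. The sketch below computes its logarithm for propositional binary data, a simplification assumed here for brevity (no types or relational parents); the two CPDs and the data rows are hypothetical.

```python
from math import log

def log_pseudo_likelihood(data, cpds):
    """Sum of log p(x_i | other values) over all rows and variables.

    cpds[i](row) returns p(x_i = 1 | the other values in row).
    """
    total = 0.0
    for row in data:
        for i, cpd in enumerate(cpds):
            p1 = cpd(row)
            total += log(p1 if row[i] == 1 else 1.0 - p1)
    return total

# Hypothetical CPDs for two binary variables that tend to agree.
cpds = [
    lambda r: 0.9 if r[1] == 1 else 0.1,
    lambda r: 0.9 if r[0] == 1 else 0.1,
]
data = [(1, 1), (0, 0), (1, 0)]
result = log_pseudo_likelihood(data, cpds)
print(result)
```

No normalizing constant appears anywhere in the computation, and each variable's term can be maximized separately from the others.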
On the surface, (8.1) may appear similar to the joint distribution specified by an
RBN. However, the CPDs in the pseudo-likelihood are not required to factor the
joint distribution of $G_D$. More specifically, when we consider the variable $X_{vi}^t$, we
condition on the values of the parents $pa_{X_{vi}^t}$ regardless of whether the estimation of
$pa_{X_{vi}^t}$ was conditioned on $X_{vi}^t$. The parents of $X_{vi}^t$ may include the values of other
attributes (e.g., $X_{vi'}^{t'}$ such that $t' \neq t$ or $i' \neq i$) or the values of the same variable
on related items (e.g., $X_{v'i}^t$ such that $v' \neq v$).
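As a rough illustration of the pseudo-likelihood in (8.1), the sketch below computes a
pseudo-loglikelihood for a toy graph with one item type and one Boolean attribute. The
graph, the labels, and the hand-written conditional `cpd` are hypothetical and are not
taken from the chapter; the point is only that each term conditions on the observed
values of the neighbors, so no normalizing constant Z is needed.

```python
import math

# Hypothetical toy data graph: four items of a single type, one Boolean attribute.
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
labels = {0: 1, 1: 1, 2: 0, 3: 0}


def cpd(value, neighbor_values, strength=0.8):
    """Illustrative CPD p(x_v | pa_v): favors agreeing with the neighbor majority."""
    if not neighbor_values:
        return 0.5
    frac_ones = sum(neighbor_values) / len(neighbor_values)
    # p(x_v = 1) moves linearly from (1 - strength) to strength as frac_ones grows.
    p_one = (1 - strength) + (2 * strength - 1) * frac_ones
    return p_one if value == 1 else 1 - p_one


def pseudo_loglikelihood(labels, neighbors):
    """Sum of log p(x_v | pa_v), each term conditioned on the observed neighbors."""
    total = 0.0
    for v, x in labels.items():
        parent_values = [labels[u] for u in neighbors[v]]
        total += math.log(cpd(x, parent_values))
    return total


pll = pseudo_loglikelihood(labels, neighbors)
print(f"pseudo-loglikelihood = {pll:.4f}")
```

Because every term is a local conditional, the sum can be evaluated (and maximized) one
variable at a time, which is the property the learning algorithm relies on.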
The RDN learning algorithm is similar to the DN learning algorithm, except we
use a relational probability estimation algorithm to learn a set of conditional models,
maximizing the pseudo-likelihood for each variable separately. The algorithm input
consists of
$G_D$: a relational data graph
$R$: a conditional relational learner
$Q^t$: a set of queries that specify the types $T$ and limit the relational neighborhood
that is considered in $R$ for each $T$
$X^t$: a set of attributes for each item type
Table 8.1 outlines the learning algorithm in pseudocode. It cycles over each
attribute of each item type and learns a separate CPD, conditioned on the other
values in the training data. We discuss details of the subcomponents (querying and
relational learners) next.
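The loop structure of the learning algorithm can be sketched as follows. This is a
minimal illustration under stated assumptions, not the authors' implementation: the
dictionary-based graph format, `one_hop_query` (standing in for a query), and
`class_prior_learner` (standing in for the conditional learner R, which in the chapter
would be an RBC or RPT) are all hypothetical.

```python
from collections import Counter


def class_prior_learner(examples):
    """Hypothetical stand-in for R: ignores the features and returns the
    empirical distribution of the target attribute."""
    counts = Counter(target for _, target in examples)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def one_hop_query(graph, v):
    """Hypothetical stand-in for a query: all items one link away from v."""
    return [b if a == v else a for a, b in graph["links"] if v in (a, b)]


def learn_rdn(graph, learner, queries, attributes):
    """Learn one CPD per (item type, attribute), each estimated independently."""
    cpds = {}  # plays the role of P in the learning algorithm
    for item_type, attr_names in attributes.items():
        for target in attr_names:
            examples = []
            for v, item in graph["items"].items():
                if item["type"] != item_type:
                    continue
                # Other attributes of the same item.
                features = {k: x for k, x in item["attrs"].items() if k != target}
                # Attribute multisets from the relational neighborhood.
                for u in queries[item_type](graph, v):
                    other = graph["items"][u]
                    for k, x in other["attrs"].items():
                        features.setdefault((other["type"], k), []).append(x)
                examples.append((features, item["attrs"][target]))
            cpds[(item_type, target)] = learner(examples)
    return cpds


graph = {
    "items": {
        "p1": {"type": "paper", "attrs": {"topic": "NN", "year": 1995}},
        "p2": {"type": "paper", "attrs": {"topic": "NN", "year": 1996}},
        "p3": {"type": "paper", "attrs": {"topic": "RL", "year": 1995}},
        "a1": {"type": "author", "attrs": {"rank": "senior"}},
    },
    "links": [("p1", "p2"), ("p1", "a1"), ("p3", "a1")],
}
attributes = {"paper": ["topic", "year"], "author": ["rank"]}
queries = {"paper": one_hop_query, "author": one_hop_query}

model = learn_rdn(graph, class_prior_learner, queries, attributes)
print(sorted(model))
```

Swapping `class_prior_learner` for a learner that actually uses the feature multisets
changes the CPDs but not the outer loop, which is why any conditional relational
learner can be plugged in.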


Figure 8.3  (a) Example QGraph query: textual annotations specify match conditions
on attribute values; numerical annotations (e.g., [0..]) specify constraints on
the cardinality of matched objects (e.g., zero or more authors). (b) Matching
subgraph.

8.3.3.1  Queries

The queries specify the relational neighborhoods that will be considered by the
conditional learner $R$, and their structure defines a typing over instances in the
database. Subgraphs are extracted from a larger graph database using the visual
query language QGraph [3]. Queries allow for variation in the number and types
of objects and links that form the subgraphs and return collections of all matching
subgraphs from the database.
For example, consider the query in figure 8.3a.⁵ The query specifies match
criteria for a target item (paper) and its local relational neighborhood (authors and
references). The example query matches all research papers that were published in
1995 and returns for each paper a subgraph that includes all authors and references
associated with the paper. Figure 8.3b shows a hypothetical match to this query:
a paper with two authors and seven references.
The query defines a typing over the objects of the database (e.g., people that have
authored a paper are categorized as authors) and specifies the relevant relational
context for the target item type in the model. For example, given this query the
learner $R$ would model the distribution of a paper's attributes given the attributes
of the paper itself and the attributes of its related authors and references. The
queries are a means of restricting model search. Instead of setting a depth limit on
the extent of the search, the analyst has a more flexible means with which to limit
the search (e.g., we can consider other papers written by the paper's authors but
not other authors of the paper's references).

5. We have modified the QGraph representation to conform to our convention of using
rectangles to represent objects and dashed lines to represent relations.


8.3.3.2  Conditional Relational Learners

The conditional relational learner $R$ is used for both parameter estimation and
structure learning in RDNs. The variables selected by $R$ are reflected in the edges
of $G_M$ appropriately. If $R$ selects all of the available attributes, the RDN model will
be fully connected.
In principle, any conditional relational learner can be used as a subcomponent
to learn the individual CPDs. In this chapter, we discuss the use of two different
conditional models: relational Bayesian classifiers (RBCs) [32] and relational
probability trees (RPTs) [31].
Relational Bayesian classifiers. RBCs extend Bayesian classifiers to a relational
setting. RBC models treat heterogeneous relational subgraphs as a homogeneous
set of attribute multisets. For example, when considering the references of a single
paper the publication dates of those references form multisets of varying size (e.g.,
{1995, 1995, 1996}, {1975, 1986, 1998, 1998}). The RBC assumes each value of
a multiset is independently drawn from the same multinomial distribution.⁶ This
approach is designed to mirror the independence assumption of the naive Bayesian
classifier. In addition to the conventional assumption of attribute independence, the
RBC also assumes attribute value independence within each multiset.
For a given item type $T$, the query scope specifies the set of item types $T_R$
that form the relevant relational neighborhood for $T$. For example, in figure 8.3a,
$T$ = paper and $T_R$ = {paper, author, reference, authorof, cites}. To estimate the
CPD for attribute $X$ on items $T$ (e.g., paper topic), the model considers all the
attributes associated with the types in $T_R$. RBCs are non-selective models, so all
the attributes are included as parents:

$$p(x \mid pa_x) \propto \prod_{t' \in T_R} \prod_{X_i^{t'} \in X^{t'}} \prod_{v \in T_R(x)} p(x_{vi}^{t'} \mid x) \, p(x)$$
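The product form above can be sketched in a few lines. This is an illustrative
reading of the RBC, not the implementation from [32]: the function names, the
dictionary-of-multisets input format, and the Laplace smoothing (`smoothing`) are
assumptions introduced here so the example runs on tiny data.

```python
import math
from collections import Counter, defaultdict


def train_rbc(examples, smoothing=1.0):
    """examples: list of (multisets, label), where multisets maps a feature
    name to a list of values (one entry per related item)."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (feature, label) -> value counts
    vocab = defaultdict(set)  # feature -> values seen in training
    for multisets, label in examples:
        for feature, values in multisets.items():
            value_counts[(feature, label)].update(values)
            vocab[feature].update(values)
    return class_counts, value_counts, vocab, smoothing


def predict_rbc(model, multisets):
    """Return p(label | multisets), treating every multiset value as an
    independent draw from a per-class multinomial."""
    class_counts, value_counts, vocab, smoothing = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # log p(x)
        for feature, values in multisets.items():
            if feature not in vocab:
                continue  # feature never seen in training: no evidence
            counts = value_counts[(feature, label)]
            denom = sum(counts.values()) + smoothing * len(vocab[feature])
            for value in values:
                score += math.log((counts[value] + smoothing) / denom)
        log_scores[label] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnorm.values())
    return {label: u / z for label, u in unnorm.items()}


examples = [
    ({"ref_year": [1995, 1995, 1996]}, "NN"),
    ({"ref_year": [1995, 1996]}, "NN"),
    ({"ref_year": [1975, 1986, 1998, 1998]}, "RL"),
]
rbc = train_rbc(examples)
posterior = predict_rbc(rbc, {"ref_year": [1995, 1996, 1996]})
print(posterior)
```

Note that a paper with more references contributes more factors to the product, which
is the within-multiset independence assumption described above.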

Relational probability trees. RPTs are selective models that extend classification
trees to a relational setting. RPT models also treat heterogeneous relational
subgraphs as a set of attribute multisets, but instead of modeling the multisets as
independent values drawn from a multinomial, the RPT algorithm uses aggregation
functions to map a set of values into a single feature value. For example, when
considering the publication dates of references of a research paper the RPT could
construct a feature that tests whether the average publication date was after 1995.
Figure 8.4 provides an example RPT learned on citation data.

Figure 8.4  Example RPT to predict machine learning paper topic.

The RPT algorithm automatically constructs and searches over aggregated relational
features to model the distribution of the target variable $X$. The algorithm
constructs features from the attributes associated with the types specified in the
query. The algorithm considers four classes of aggregation functions to group multiset
values: Mode, Count, Proportion, Degree. For discrete attributes, the algorithm
constructs features for all unique values of an attribute. For continuous attributes,
the algorithm constructs features for a number of different discretizations, binning
the values by frequency (e.g., year > 1992). Count, proportion, and degree
features consider a number of different thresholds (e.g., proportion(A) > 10%).
Feature scores are calculated using chi-square to measure correlation between the
feature and the class. The algorithm uses prepruning in the form of a p-value cutoff
and a depth cutoff to limit tree size. All experiments reported herein used
$\alpha$ = 0.05/|attributes|, depth cutoff = 7, and considered ten thresholds and
discretizations per feature.

6. Alternative constructions are possible but prior work [32] has shown this approach
achieves superior performance over a wide range of conditions.
The RPT learning algorithm adjusts for biases toward particular features due to
degree disparity and autocorrelation in relational data [14, 15]. We have shown that
RPTs build significantly smaller trees than other conditional models and achieve
equivalent, or better, performance [31]. These characteristics of RPTs are crucial
for learning understandable RDN models and have a direct impact on inference
efficiency because smaller trees limit the size of the final inference graph.
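To make the four aggregation classes used by RPTs concrete, the sketch below applies
them to one multiset of reference years and scores a binary feature with the standard
chi-square statistic for a 2x2 table. The helper names, thresholds, and counts are
hypothetical illustrations; the actual RPT feature search and its bias adjustments
are described in [31] and are not reproduced here.

```python
from collections import Counter


def mode(values):
    """Most frequent value in the multiset (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]


def count(values, predicate):
    """Number of multiset values that satisfy the predicate."""
    return sum(1 for v in values if predicate(v))


def proportion(values, predicate):
    """Fraction of multiset values that satisfy the predicate (0.0 if empty)."""
    return count(values, predicate) / len(values) if values else 0.0


def degree(values):
    """Size of the multiset, i.e., the number of related items."""
    return len(values)


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]] of
    feature value (rows) against class (columns)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def after_1992(year):
    return year > 1992


ref_years = [1975, 1986, 1998, 1998]  # publication years of one paper's references

features = {
    "mode(year) = 1998": mode(ref_years) == 1998,
    "count(year > 1992) >= 2": count(ref_years, after_1992) >= 2,
    "proportion(year > 1992) > 10%": proportion(ref_years, after_1992) > 0.10,
    "degree >= 5": degree(ref_years) >= 5,
}
print(features)

# Hypothetical counts: the feature is true for 10 papers of class A and false
# for 10 papers of class B, so feature and class are perfectly associated.
print(chi_square_2x2(10, 0, 0, 10))
```

Each aggregated feature reduces a variable-size multiset to a single Boolean test,
which is what lets a tree split on relational data.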
8.3.4  RDN Inference

The RDN inference graph $G_I$ is potentially much larger than the original data
graph. To model the full joint distribution there must be a separate node (and CPD)
for each attribute value in $G_D$. To construct $G_I$, the set of template CPDs in $P$
is rolled out over the test-set data graph. Each item-attribute pair gets a separate,
local copy of the appropriate CPD. Consequently, the total number of nodes in
the inference graph will be
$\sum_{v \in V_D} |X^{T(v)}| + \sum_{e \in E_D} |X^{T(e)}|$. Rollout facilitates
generalization across data graphs of varying size: we can learn the CPD templates
from one data graph and apply the model to a second data graph with a different
number of objects by rolling out more CPD copies. This approach is analogous to
other graphical models that tie distributions across the network and roll out copies
of model templates (e.g., hidden Markov models).
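The node count above is a simple sum over objects and links, as the following sketch
shows. The type names and attribute lists are hypothetical; only the counting rule
comes from the text.

```python
def inference_graph_size(object_types, link_types, attributes):
    """Number of nodes in the rolled-out inference graph: one node per
    attribute of every object and every link in the data graph."""

    def n_attrs(item_type):
        return len(attributes.get(item_type, ()))

    return (sum(n_attrs(t) for t in object_types)
            + sum(n_attrs(t) for t in link_types))


# Hypothetical schema: attribute names per item type (objects and links).
attributes = {"paper": ["topic", "year"], "author": ["rank"], "cites": ["section"]}

# One type entry per object and per link in a small data graph.
small_objects = ["paper"] * 3 + ["author"] * 2
small_links = ["authorof"] * 4 + ["cites"] * 2
print(inference_graph_size(small_objects, small_links, attributes))

# The same templates roll out over a larger graph with more copies.
large_objects = ["paper"] * 30 + ["author"] * 20
large_links = ["authorof"] * 40 + ["cites"] * 20
print(inference_graph_size(large_objects, large_links, attributes))
```

The templates in `attributes` are fixed; only the number of copies grows with the data
graph, which mirrors the tying of distributions described above.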
We use Gibbs sampling for inference in RDN models. Gibbs sampling can be
used to extract a unique joint distribution, regardless of the consistency of the
model [11].
Table 8.2 outlines the inference algorithm. To estimate a joint distribution, we
start by rolling out the model $G_M$ onto the target data set $G_D$, forming the inference
graph $G_I$. The values of all unobserved variables are initialized to values drawn from
their prior distributions. Gibbs sampling then iteratively relabels each unobserved
variable by drawing from its local conditional distribution, given the current state
of the rest of the graph. After a sufficient number of iterations (burn-in), the values
will be drawn from a stationary distribution and we can use the samples to estimate
probabilities of interest.
For prediction tasks we are often interested in the marginal probabilities associated
with a single variable $X$ (e.g., paper topic). Although Gibbs sampling may
be a relatively inefficient approach to estimating the probability associated with a
joint assignment of values of $X$ (e.g., when $|X|$ is large), it is often reasonably fast
to estimate the marginal probabilities for each $X$.
Many implementation choices affect the quality of the estimates obtained
from a Gibbs sampling chain, such as length of burn-in and number of samples.
For the experiments reported in this chapter we used fixed-length chains of 2000
samples (each iteration relabels every value sequentially) with burn-in set at 100.
Empirical inspection indicated that the majority of chains had converged by 500
samples.
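The sampling loop described in this section can be sketched as follows for Boolean
variables. This is a minimal illustration, not the authors' code: the chain-shaped
graph, the local CPD `agree_with_neighbors`, and the fixed random seed are assumptions
made for the example, while the default chain length and burn-in follow the settings
reported above.

```python
import random


def gibbs_marginals(neighbors, observed, cpd, iterations=2000, burn_in=100, seed=0):
    """Estimate p(x_v = 1) for each unobserved Boolean variable by Gibbs sampling."""
    rng = random.Random(seed)
    state = dict(observed)
    hidden = [v for v in neighbors if v not in observed]
    for v in hidden:
        state[v] = rng.randint(0, 1)  # arbitrary initialization
    ones = {v: 0 for v in hidden}
    kept = 0
    for i in range(iterations):
        rng.shuffle(hidden)  # relabel every unobserved variable, in random order
        for v in hidden:
            p_one = cpd([state[u] for u in neighbors[v]])
            state[v] = 1 if rng.random() < p_one else 0
        if i >= burn_in:  # keep only post-burn-in samples
            kept += 1
            for v in hidden:
                ones[v] += state[v]
    return {v: ones[v] / kept for v in hidden}


def agree_with_neighbors(neighbor_values, strength=0.9):
    """Hypothetical local CPD: p(x_v = 1 | neighbors) grows with the share of 1s."""
    if not neighbor_values:
        return 0.5
    frac = sum(neighbor_values) / len(neighbor_values)
    return (1 - strength) + (2 * strength - 1) * frac


# Chain a - b - c - d with the two end labels observed.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
marginals = gibbs_marginals(neighbors, {"a": 1, "d": 0}, agree_with_neighbors)
print(marginals)
```

In this toy chain the observed end labels act as seed information: the estimate for
`b` is pulled toward its neighbor `a`, and the estimate for `c` toward `d`.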

8.4  Experiments
The experiments in this section demonstrate the utility of RDNs as a joint model of
relational data. First, we use synthetic data to assess the impact of training-set size
and autocorrelation on RDN learning and inference, showing that accurate models
can be learned at reasonable data set sizes and that the model is robust to varying
levels of autocorrelation. Next, we learn RDN models of three real-world data sets
to illustrate the types of domain knowledge that the models discover automatically.
In addition, we evaluate RDN models in a prediction context, where only a single
attribute is unobserved in the test set, and report significant performance gains
compared to two conditional models.

Table 8.2  RDN inference algorithm

Infer RDN ($G_D$, $G_M$, $P$, iter, burnin):
    $G_I(V_I, E_I) \leftarrow (\emptyset, \emptyset)$    \\ form $G_I$ from $G_D$ and $G_M$
    For each $t \in T$ in $G_M$:
        For each $X_k^t \in X^t$ in $G_M$:
            For each $v_i \in V_D$ s.t. $T(v_i) = t$:
                $V_I \leftarrow V_I \cup \{X_{v_i k}^t\}$
                For each $v_j \in V_D$ s.t. $X_{v_j} \in pa_{X_{v_i k}^t}$:
                    $E_I \leftarrow E_I \cup \{e_{ij}\}$
    For each $v \in V_I$:    \\ initialize Gibbs sampling
        Randomly initialize $x_v$ to an arbitrary value
    $S \leftarrow \emptyset$    \\ Gibbs sampling procedure
    For $i \in$ iter:
        For each $v \in V_I$, in random order:
            Resample $x'_v$ from $p(x_v \mid x - \{x_v\})$
            $x_v \leftarrow x'_v$
        If $i >$ burnin:
            $S \leftarrow S \cup \{x\}$
    Use samples $S$ to estimate probabilities of interest

8.4.1  Synthetic Data Experiments

To explore the effects of training-set size and autocorrelation on RDN learning
and inference, we generated homogeneous data graphs with autocorrelation due to
an underlying (hidden) group structure. Each object has four Boolean attributes:
$X_1$, $X_2$, $X_3$, and $X_4$. The data generation procedure uses a simple RDN where
$X_1$ is autocorrelated (through objects one link away), $X_2$ depends on $X_1$, and the
other two attributes have no dependencies. To generate data with autocorrelated
$X_1$ values, we used manually specified conditional models for $p(X_1 \mid X_{1R}, X_2)$.
We compare two different RDN models: RDN_RBC uses RBCs for the component
model $R$; RDN_RPT uses RPTs for $R$. The RPT performs feature selection, which
may result in structural inconsistencies in the learned RDN. The RBC does not
use feature selection, so any deviation from the true model is due to numerical
inconsistencies alone. Note that the two models do not consider identical feature
spaces, so we can only roughly assess the impact of feature selection by comparing
RDN_RBC and RDN_RPT results.
8.4.1.1  RDN Learning

The first set of synthetic experiments examines the effectiveness of the RDN learning
algorithm. Theoretical analysis indicates that, in the limit, the true parameters will
maximize the pseudo-likelihood function. This indicates that the pseudo-likelihood
function, evaluated at the learned parameters, will be no greater than the
pseudo-likelihood of the true model (on average). To evaluate the quality of the RDN
parameter estimates, we calculated the pseudo-likelihood of the test-set data using
both the true model (used to generate the data) and the learned models. If the
pseudo-likelihood given the learned parameters approaches the pseudo-likelihood
given the true parameters, then we can conclude that parameter estimation is
successful. We also measured the standard error of the pseudo-likelihood estimate
for a single test set using learned models from ten different training sets. This
illustrates the amount of variance due to parameter estimation.

Figure 8.5  Evaluation of RDN learning.
Figure 8.5 graphs the pseudo-loglikelihood of learned models as a function of
training-set size for three levels of autocorrelation. Training-set size was varied at
the levels {50, 100, 250, 500, 1000, 5000}. We varied $p(X_1 \mid X_{1R}, X_2)$ to generate data
with approximate levels of autocorrelation corresponding to {0.25, 0.50, 0.75}. At
each training-set size (and autocorrelation level), we generated ten test sets. For
each test set, we generated ten training sets and learned RDNs. Using each learned
model, we measured the pseudo-likelihood of the test set (size 250) and averaged
the results over the ten models.
Figure 8.5 plots the mean pseudo-likelihood of the test sets for both the learned
models and the RDN used for data generation, which we refer to as "True Model."
The top row reports experiments with data generated from an RDN_RPT, where we
learned RDN_RPT models. The bottom row reports experiments with data generated
from an RDN_RBC, where we learned RDN_RBC models.
These experiments show that the learned RDN_RPT models are a good approximation
to the true model by the time training-set size reaches 500, and that RDN
learning is robust with respect to varying levels of autocorrelation. As expected,
however, when training-set size is small, the RDNs are a better approximation for
data sets with low levels of autocorrelation (see figure 8.5a).
There appears to be little difference between the RDN_RPT and RDN_RBC when
autocorrelation is low, but otherwise the RDN_RBC needs significantly more data
to estimate the parameters accurately. This may be in part due to the model's lack
of selectivity, which necessitates the estimation of a greater number of parameters.
However, there is little improvement even when we increase the size of the training
sets to 10,000 objects. Furthermore, the discrepancy between the estimated model
and the true model is greatest when autocorrelation is moderate. This indicates
that the inaccuracies may be due to the naive Bayes independence assumption and
its tendency to produce biased probability estimates [40].
8.4.1.2  RDN Inference

The second set of synthetic experiments evaluates the RDN inference procedure in
a prediction context, where only a single attribute is unobserved in the test set. We
generated data in the manner described above and learned RDNs for $X_1$. At each
autocorrelation level, we generated ten training sets (size 500) and learned RDNs.
For each training set, we generated ten test sets (size 250) and used the learned
models to infer marginal probabilities for the class labels of the test-set instances.
To evaluate the predictions, we report area under the ROC curve (AUC).⁷ These
experiments used the same levels of autocorrelation outlined above.
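For readers unfamiliar with the metric, AUC can be computed directly from predicted
marginal probabilities as the fraction of (positive, negative) pairs that are ranked
correctly, counting ties as half. The sketch below uses this pairwise formulation; the
scores and labels are made-up values, not results from the chapter.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and true class labels.
scores = [0.9, 0.4, 0.35, 0.6, 0.2, 0.7]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))
```

The double loop is quadratic in the number of instances, which is fine for test sets
of a few hundred items; a rank-based computation would be preferable for larger sets.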
We compare the performance of three types of models. First, we measure the
performance of RPT and RBC models. These are conditional models that reason
about each instance independently and do not use the class labels of related
instances. Next, we measure the performance of the two RDN models described above:
RDN_RBC and RDN_RPT. These are collective models that reason about instances
jointly, using the inferences about related instances to improve overall performance.
Lastly, we measure performance of the two RDN models while allowing the true
labels of related instances to be used during inference. This demonstrates the level
of performance possible if the RDNs could infer the true labels of related instances
with perfect accuracy. We refer to these as ceiling models: RDN_RBC^ceil and
RDN_RPT^ceil.
Note that conditional models can reason about autocorrelation dependencies in
a limited manner by using the attributes of related instances. For example, if there
is a correlation between the words on a webpage and its topic, and the topics of
hyperlinked webpages are autocorrelated, then we can improve the inference about
a single page by modeling the contents of its neighboring pages. Recent work has
shown that collective models are a low-variance means of reducing bias that work
by modeling the autocorrelation dependencies directly [16]. Conditional models are
also able to exploit autocorrelation dependencies through modeling the attributes
of related instances, but variance increases dramatically as the number of attributes
increases.

Figure 8.6  Evaluation of RDN inference.

7. Squared-loss results are qualitatively similar to the AUC results reported in figure 8.6.
During inference we varied the number of known class labels in the test set,
measuring performance on the remaining unlabeled instances. This serves to illustrate
model performance as the amount of information seeding the inference process
increases. We expect performance to be similar when other information seeds the
inference process, for example, when some labels can be inferred from intrinsic
attributes, or when weak predictions about many related instances serve to constrain
the system. Figure 8.6 graphs AUC results for each of the models as the level of
known class labels is varied.
In all configurations, RDN_RPT performance is equivalent to, or better than, RPT
performance. This indicates that even modest levels of autocorrelation can be
exploited to improve predictions using RDN_RPT models. RDN_RPT performance is
indistinguishable from that of RDN_RPT^ceil except when autocorrelation is high and
there are no labels to seed inference. In this situation, there is little information to
constrain the system during inference, so the model cannot fully exploit the
autocorrelation dependencies. When there is no information to anchor the predictions,
there will be an identifiability problem: symmetric labelings that are highly
autocorrelated, but with opposite values, will be equally likely. In situations where
there is little seed information, identifiability problems can bias RDN performance
toward random.

Figure 8.7  Data schema for (a) IMDb, (b) Cora, (c) NASD.
In contrast, RDN_RBC performance is superior to RBC performance only when
there is moderate to high autocorrelation and sufficient seed information. When
autocorrelation is low, the RBC model is comparable to both the RDN_RBC^ceil
and RDN_RBC models. Even when autocorrelation is moderate or high, RBC
performance is still relatively high. Since the RBC model is low-variance and there
are only four attributes in our data sets, it is not surprising that the RBC model
is able to exploit autocorrelation to improve performance. What is more surprising
is that RDN_RBC requires substantially more seed information than RDN_RPT in
order to reach ceiling performance. This indicates that our choice of model should
take test-set characteristics (e.g., number of known labels) into consideration.
8.4.2  Empirical Data Experiments

We learned RDN models for three real-world relational data sets to illustrate the
types of domain knowledge that can be garnered, and evaluated the models in a
prediction context, where the values of a single attribute are unobserved. Figure 8.7
depicts the objects and relations in each data set.
The first data set is drawn from the Internet Movie Database (IMDb: www.imdb.com).
We collected a sample of 1382 movies released in the United States between 1996
and 2001, with their associated actors, directors, and studios. In total, this sample
contains approximately 42,000 objects and 61,000 links.
The second data set is drawn from Cora, a database of computer science research
papers extracted automatically from the web using machine learning techniques [25].
We selected the set of 4330 machine learning papers along with associated authors,
cited papers, and journals. The resulting collection contains approximately
13,000 objects and 26,000 links. For classification, we sampled the 1669
papers published between 1993 and 1998.


The third data set is from the National Association of Securities Dealers (NASD)
[33]. It is drawn from NASD's Central Registration Depository (CRD©) system,
which contains data on approximately 3.4 million securities brokers, 360,000
branches, 25,000 firms, and 550,000 disclosure events. Disclosures record disciplinary
information on brokers, including information on civil judicial actions, customer
complaints, and termination actions. Our analysis was restricted to small and
moderate-size firms with fewer than fifteen brokers, each of whom has an approved
NASD registration. We selected a set of 10,000 brokers who were active in the years
1997-2001, along with 12,000 associated branches, firms, and disclosures.
8.4.2.1  RDN Models

The RDN models in figures 8.8, 8.9, and 8.10 continue with the RDN representation
introduced in figure 8.1b. Each item type is represented by a separate plate. Arcs
inside a plate represent dependencies among the attributes of a single object, and
arcs crossing the boundaries of plates represent dependencies among attributes of
related objects. An arc from x to y indicates the presence of one or more features
of x in the conditional model learned for y. When the dependency is on attributes
of objects more than a single link away, the arc is labeled with a small rectangle to
indicate the intervening related-object type. For example, in figure 8.8 movie genre
is influenced by the genres of other movies made by the movie's director, so the arc
is labeled with a small D rectangle.
In addition to dependencies among attribute values, relational learners may also
learn dependencies between the structure of relations (edges in $G_D$) and attribute
values. Degree relationships are represented by a small black circle in the corner
of each plate; arcs from this circle indicate a dependency between the number of
related objects and an attribute value of an object. For example, in figure 8.8 movie
receipts are influenced by the number of actors in the movie.
For each data set, we learned RDNs using queries that include all neighbors up to
two links away in the data graph. For example, in the IMDb data, when learning a model
of movie attributes we considered the attributes of associated actors, directors,
producers, and studios, as well as movies related to those objects.
On the IMDb data, we learned an RDN model for ten discrete attributes including
actor gender and movie opening weekend receipts (> $2 million). Figure 8.8 shows
the resulting RDN model. Four of the attributes (movie receipts, movie genre,
actor birth year, and director first movie year) exhibit autocorrelation dependencies.
Exploiting this type of dependency has been shown to significantly improve
classification accuracy of RMNs compared to RBNs, which cannot model cyclic
dependencies [39]. However, to exploit autocorrelation, RMNs must be instantiated
with the appropriate clique templates; to date there is no RMN algorithm for
learning autocorrelation dependencies. RDNs are the first PRM capable of learning
cyclic autocorrelation dependencies.

Figure 8.8  Internet Movie Database RDN.

On the Cora data, we learned an RDN model for seven attributes including
paper topic (e.g., neural networks) and journal name prefix (e.g., IEEE). Figure 8.9
shows the resulting RDN model. Again we see that four of the attributes exhibit
autocorrelation. Note that when a dependency is on attributes of objects a single
link away, the arc is unlabeled. For example, the unlabeled self-loops from paper
variables indicate dependencies on the same variables in cited papers. In particular,
the topic of a paper depends not only on the topics of other papers that it cites
but also on the topics of other papers written by the authors. This model is a good
reflection of our domain knowledge about machine learning papers.

Figure 8.9  Cora machine learning papers RDN.

On the NASD data, we learned an RDN model for eleven attributes including
broker is-problem and disclosure type (e.g., customer complaint). Figure 8.10
shows the resulting RDN model. Again we see that four of the attributes exhibit
autocorrelation. Subjective inspection by NASD analysts indicates that the RDN
has automatically uncovered statistical relationships that confirm the intuition of
domain experts. These include temporal autocorrelation of risk (past problems are
indicators of future problems) and relational autocorrelation of risk among brokers
at the same branch. Indeed, fraud and malfeasance are usually social phenomena,
communicated and encouraged by the presence of other individuals who also wish to
commit fraud [5]. Importantly, this evaluation was facilitated by the interpretability
of the RDN model: experts are more likely to trust, and make regular use of,
models they can understand.

Figure 8.10  RDN for NASD data for 1999.

8.4.2.2  Prediction

We evaluated the learned models on prediction tasks in order to assess (1) whether
autocorrelation dependencies among instances can be used to improve model accuracy,
and (2) whether the RDN models, using Gibbs sampling, can effectively infer
labels for a network of instances. To do this, we compared the same three classes
of models used in section 8.4.1: RPTs and RBCs, RDNs, and ceiling RDNs.
Figure 8.11 shows AUC results for each of the models on the three prediction
tasks. Figure 8.11a graphs the results of the RDN_RPT models, compared to the
RPT conditional model. Figure 8.11b graphs the results of the RDN_RBC models,
compared to the RBC conditional model. We used the following prediction tasks:
movie receipts for IMDb, paper topic for Cora, and broker is-problem for NASD.
The graphs show the AUC for the most prevalent class, averaged over a number
of training/test splits. We used temporal samples where we learned models on one
year of data and applied the model to the subsequent year. We used two-tailed,
paired t-tests to assess the significance of the AUC results obtained from the trials.
The t-tests compare the RDN results to each of the other two models with a null
hypothesis of no difference in the AUC.
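As a reminder of what the paired test measures, the sketch below computes the paired
t statistic from per-split AUC values. The AUC numbers are invented for illustration
and are not results from the chapter; the helper name is likewise hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b, under the
    null hypothesis that the mean difference is zero."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical AUC values for the same five training/test splits.
rdn_auc = [0.80, 0.82, 0.79, 0.85, 0.81]
rpt_auc = [0.75, 0.78, 0.77, 0.80, 0.76]

t, df = paired_t_statistic(rdn_auc, rpt_auc)
print(f"t = {t:.3f}, df = {df}")
```

Pairing by split removes split-to-split variation from the comparison. Turning the
statistic into a two-tailed p-value requires the t distribution's CDF, which is
available, for example, through `scipy.stats.ttest_rel` if SciPy is installed.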
When using the RPT as the conditional learner (figure 8.11(a)), RDN performance
is superior to RPT performance on all tasks. The difference is statistically significant
for two of the three tasks. This indicates that autocorrelation is both present in the
data and identified by the RDN models. The RPT can sometimes use attributes of
related items to effectively represent and reason with autocorrelation dependencies.
However, in some cases the attributes other than the class label contain little
information about the class labels of related instances. This is the case for Cora:
RPT performance is close to random because no other attributes influence paper
topic (see figure 8.9). On all tasks, the RDN models achieve comparable performance
to the ceiling models. This indicates that the RDN model achieved the same level
of performance as if it had access to the true labels of related objects. On the
NASD data, the RDN performance is slightly higher than that of the ceiling model.
We note, however, that the ceiling model only represents a probabilistic ceiling;
the RDN may perform better if an incorrect prediction for one object improves
inferences about related objects.

Figure 8.11  AUC results for (a) RDN_RPT and RPT models, and (b) RDN_RBC
and RBC models. Asterisks denote model performance that is significantly different
(p < 0.10) from RDN_RPT and RDN_RBC.
Similarly, when using the RBC as the conditional learner (figure 8.11(b)), the
performance of RDN models is superior to the RBC models on all tasks, and the
difference is statistically significant for two of the tasks. However, the RDN models
achieve comparable performance to the ceiling models on only one of the tasks. This
may be another indication that RDN models combined with a non-selective conditional
learner (e.g., RBCs) will experience increased variance during the Gibbs sampling
process, and thus they may need more seed information during inference to achieve
near-ceiling performance. We should note that although the RDN_RBC models do not
significantly outperform the RDN_RPT models on any of the tasks, the RDN_RBC^ceil is
significantly higher than RDN_RPT^ceil for Cora and IMDb. This indicates that, when
there is enough seed information, RDN_RBC models may achieve significant performance
gains over RDN_RPT models.

8.5  Related Work

8.5.1  Probabilistic Relational Models
Probabilistic relational models are one class of models for density estimation in
relational data sets. Examples of PRMs include RBNs and RMNs.
As outlined in section 8.3.1, learning and inference in PRMs involve a data graph
$G_D$, a model graph $G_M$, and an inference graph $G_I$. All PRMs model data that can
be represented as a graph (i.e., $G_D$). PRMs use different approximation techniques
for inference in $G_I$ (e.g., Gibbs sampling, loopy belief propagation [26]), but they
all use a similar process for rolling out an inference graph $G_I$. Consequently, PRMs
differ primarily with respect to the representation of the model graph $G_M$ and how
that model is learned.
The RBN learning algorithm [10] for the most part uses standard Bayesian
network techniques for parameter estimation and structure learning. One notable
exception is that the learning algorithm must check for legal structures that are
guaranteed to be acyclic when rolled out for inference on arbitrary data graphs. In
addition, instead of exhaustive search of the space of relational dependencies, the
structure learning algorithm uses greedy iterative-deepening, expanding the search
in directions where the dependencies improve the likelihood.
The strengths of RBNs include understandable knowledge representations and
efficient learning techniques. For relational tasks, with a huge space of possible
dependencies, selective models are easier to interpret and understand than
nonselective models. Closed-form parameter estimation techniques allow for efficient
structure learning (i.e., feature selection). Also, because reasoning with relational
models requires more space and computational resources, efficient learning
techniques make relational modeling both practical and feasible.
The directed acyclic graph structure is the underlying reason for the efficiency
of RBN learning. As discussed in section 8.1, the acyclicity requirement precludes
the learning of arbitrary autocorrelation dependencies and limits the applicability
of these models in relational domains. RDN models enjoy the strengths of RBNs
(namely, understandable knowledge representation and efficient learning) without
being constrained by an acyclicity requirement.
The RMN learning algorithm [39] uses maximum a posteriori parameter estimation
with Gaussian priors, modifying Markov network learning techniques. The algorithm
assumes that the clique templates are prespecified and thus does not search
for the best structure. Because the user supplies a set of relational dependencies to
consider (i.e., clique templates), it simply optimizes the potential functions for the
specified templates.
RMNs are not hampered by an acyclicity constraint, so they can represent and
reason with arbitrary forms of autocorrelation. This is particularly important for
reasoning in relational data sets where autocorrelation dependencies are nearly
ubiquitous and often cannot be structured in an acyclic manner. However, the
tradeoff for this increased representational capability is a decrease in learning
efficiency. Instead of closed-form parameter estimation, RMNs are trained with
conjugate gradient methods, where each iteration requires a round of inference.
In large cyclic relational inference graphs, the cost of inference is prohibitive;
in particular, without approximations to increase efficiency, feature
selection is intractable.
Similar to the comparison with RBNs, RDN models enjoy the strengths of
RMNs but not their weaknesses. More specifically, RDNs are able to reason with
arbitrary forms of autocorrelation without being limited by efficiency concerns
during learning. In fact, the pseudo-likelihood estimation technique used by RDNs
has been used recently to make feature selection tractable for conditional random
field models [24].
8.5.2 Probabilistic Logic Models

A second class of models for density estimation consists of extensions to conventional logic programming that support probabilistic reasoning in first-order logic
environments. We will refer to this class of models as probabilistic logic models
(PLMs). Examples of PLMs include Bayesian logic programs [18] and Markov logic
networks (MLNs) [36].
PLMs represent a joint probability distribution over the groundings of a first-order knowledge base. The first-order knowledge base contains a set of first-order
formulae, and the PLM model associates a set of weights/probabilities with each of
the formulae. Combined with a set of constants representing objects in the domain,
PLM models specify a probability distribution over possible truth assignments
to groundings of the first-order formulae. Learning a PLM consists of two tasks:
generating the relevant first-order clauses, and estimating the weights/probabilities
associated with each clause.
Within this class of models, MLNs are most similar in nature to RDNs. In
MLNs, each node is a grounding of a predicate in a first-order knowledge base,
and features correspond to first-order formulae and their truth-values. Learning
an MLN consists of estimating the feature weights and selecting which features
to include in the final structure. The input knowledge base defines the relevant
relational neighborhood, and the algorithm restricts the search by limiting the
number of distinct variables in a clause, using a weighted pseudo-likelihood scoring
function for feature selection [19].
MLNs ground out to undirected Markov networks. In this sense, they are quite
similar to RMNs, sharing the same strengths and weaknesses: they are capable of
representing cyclic autocorrelation relationships but suffer from the complexity of
full joint inference during learning, which decreases efficiency. Kok and Domingos
[19] have recently demonstrated the promise of efficient pseudo-likelihood structure
learning techniques. Our future work will investigate the performance tradeoffs
between RDN and MLN approaches to pseudo-likelihood estimation for learning.

8.5.3 Collective Inference

Collective inference models exploit autocorrelation dependencies in a network of
objects to improve predictions. Joint relational models, such as those discussed
above, are able to exploit autocorrelation to improve predictions by estimating
joint probability distributions over the entire graph and collectively inferring the
labels of related instances.
An alternative approach to collective inference combines local individual classification models (e.g., RBCs) with a joint inference procedure (e.g., relaxation
labeling). Examples of this technique include iterative classification [30], link-based
classification [21], and probabilistic relational neighbor [22, 23]. These approaches
to collective inference were developed in an ad hoc procedural fashion, motivated
by the observation that they appear to work well in practice. RDN models formalize this approach in a principled framework: learning models locally (maximizing
pseudo-likelihood) and combining them with a global inference procedure (Gibbs
sampling) to recover a full joint distribution. In this work we have demonstrated
that autocorrelation is the reason behind improved performance in collective inference (see [16] for more detail) and explored the situations under which we can
expect this type of approximation to perform well.
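
The combination of local models with Gibbs sampling described above can be sketched in a few lines. This is a minimal illustration, not the RDN implementation: the logistic local model `local_cpd`, its `bias` and `weight` values, and the four-node cyclic graph are all assumptions made for the example. In an RDN the local conditional distributions would be learned (e.g., by RPTs or RBCs) rather than fixed by hand.

```python
import math
import random


def local_cpd(neighbor_labels, bias=-2.0, weight=4.0):
    """Hypothetical local model: P(label = 1 | labels of the neighbors)."""
    if neighbor_labels:
        frac = sum(neighbor_labels) / len(neighbor_labels)
    else:
        frac = 0.5
    return 1.0 / (1.0 + math.exp(-(bias + weight * frac)))


def gibbs_marginals(edges, observed, n_nodes, sweeps=2000, burn_in=200, seed=0):
    """Estimate marginals of unobserved labels by Gibbs sampling."""
    rng = random.Random(seed)
    neighbors = {i: [] for i in range(n_nodes)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    # Observed labels stay fixed; hidden labels start at random values.
    labels = {i: observed.get(i, rng.randint(0, 1)) for i in range(n_nodes)}
    hidden = [i for i in range(n_nodes) if i not in observed]
    counts = {i: 0 for i in hidden}
    for sweep in range(sweeps):
        for i in hidden:
            # Resample node i from its local conditional given current neighbors.
            p = local_cpd([labels[j] for j in neighbors[i]])
            labels[i] = 1 if rng.random() < p else 0
        if sweep >= burn_in:
            for i in hidden:
                counts[i] += labels[i]
    kept = sweeps - burn_in
    return {i: counts[i] / kept for i in hidden}


# A 4-cycle: the dependencies are cyclic, which a directed acyclic model forbids.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
marginals_seed_one = gibbs_marginals(edges, observed={0: 1}, n_nodes=4)
marginals_seed_zero = gibbs_marginals(edges, observed={0: 0}, n_nodes=4)
```

With these assumed parameters, observing a positive label at node 0 raises the estimated marginals of its neighbors relative to observing a negative label, which is the autocorrelation effect that collective inference exploits.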

8.6 Discussion and Future Work


In this chapter we presented relational dependency networks, a new form of probabilistic relational model. We showed the RDN learning algorithm to be a relatively
simple method for learning the structure and parameters of a probabilistic graphical model. In addition, RDNs allow us to exploit existing techniques for learning
CPDs of relational data sets. Here we have chosen to exploit our prior work on
RPTs, which construct parsimonious models of relational data, and RBCs, which
are simple and surprisingly effective non-selective models. We expect the general
properties of RDNs to be retained if other approaches to learning CPDs are used,
given that those approaches learn accurate local models.
The primary advantage of RDN models is the ability to efficiently learn and
reason with autocorrelation. Autocorrelation is a nearly ubiquitous phenomenon
in relational data sets and the dependencies are often cyclic in nature. If a data
set exhibits autocorrelation, and a model can learn the resulting dependencies,
then we can exploit those dependencies to improve overall inferences by collectively
inferring values for the entire set of instances simultaneously. The real and synthetic
data experiments in this chapter show that collective inference with RDNs can
offer significant improvement over conditional approaches when autocorrelation is
present in the data. Except in rare cases, the performance of RDNs approaches the
performance that would be possible if all the class labels of related instances were
known. Because our analysis indicates that the amount of seed information may
interact with the level of autocorrelation and local model characteristics to impact
performance, future work will attempt to quantify these effects more formally.
We also presented learned RDNs for a number of real-world relational domains, demonstrating another strength of RDNs: their understandable and intuitive knowledge representation. Comprehensible models are a cornerstone of the
knowledge discovery process, which seeks to identify novel and interesting patterns
in large data sets. Domain experts are more willing to trust, and make regular
use of, understandable models, particularly when the induced models are used
to support additional reasoning. Understandable models also aid analysts' assessment of the utility of the additional relational information, potentially reducing the
cost of information gathering and storage and the need for data transfer among
organizations, thereby increasing the practicality and feasibility of relational modeling.
Future work will compare RDN models to RMNs and MLNs in order to quantify
the performance tradeoffs for using pseudo-likelihood functions rather than full likelihood functions for both parameter estimation and structure learning, particularly
over data sets with varying levels of autocorrelation. Based on theoretical analysis
of pseudo-likelihood estimation (e.g., [9]), we expect there to be little difference
when autocorrelation is low and increased variance when autocorrelation is high. If
this is the case, there will need to be enough training data to withstand the increase
in variance. Alternatively, bagging techniques may be a means of reducing variance
with only a moderate increase in computational cost. In either case, the simplicity and relative efficiency of RDN methods are a clear win for learning models in
relational domains.

Acknowledgments
We acknowledge the invaluable assistance of A. Shapira, and helpful comments
from C. Loiselle. This effort is supported by DARPA and NSF under contract
numbers IIS0326249 and HR0011-04-1-0013. The U.S. Government is authorized to
reproduce and distribute reprints for governmental purposes notwithstanding any
copyright notation hereon. The views and conclusions contained herein are those
of the authors and should not be interpreted as necessarily representing the official
policies or endorsements either expressed or implied of DARPA, NSF, or the U.S.
Government.

References
[1] A. Bernstein, S. Clearwater, and F. Provost. The relational vector-space model
and industry classification. In Proceedings of the IJCAI-2003 Workshop on
Learning Statistical Models from Relational Data, 2003.
[2] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:3:179-195, 1975.
[3] H. Blau, N. Immerman, and D. Jensen. A visual query language for relational
knowledge discovery. Technical Report 01-28, University of Massachusetts
Amherst, Computer Science Department, 2001.
[4] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization
using hyperlinks. In Proceedings of ACM International Conference on Management of Data, 1998.
[5] C. Cortes, D. Pregibon, and C. Volinsky. Communities of interest. In Proceedings of the International Symposium of Intelligent Data Analysis, 2001.
[6] P. Domingos and M. Richardson. Mining the network value of customers. In
International Conference on Knowledge Discovery and Data Mining, 2001.
[7] P. Flach and N. Lachiche. 1BC: A first-order Bayesian classifier. In Proceedings
of the International Conference on Inductive Logic Programming, 1999.
[8] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[9] S. Geman and C. Graffigne. Markov random field image models and their
applications to computer vision. In Proceedings of the International Congress
of Mathematicians, 1987.
[10] L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Relational Data Mining, pages 307-335. Springer-Verlag,
2001.
[11] D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering and data visualization.
Journal of Machine Learning Research, 1:49-75, 2000.
[12] D. Heckerman, C. Meek, and D. Koller. Probabilistic models for relational
data. Technical Report MSR-TR-2004-30, Microsoft Research, 2004.
[13] M. Jaeger. Relational Bayesian networks. In Proceedings of the Conference
on Uncertainty in Artificial Intelligence, 1997.
[14] D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection
bias in relational learning. In Proceedings of the International Conference on
Machine Learning, 2002.
[15] D. Jensen and J. Neville. Avoiding bias when aggregating relational data with
degree disparity. In Proceedings of the International Conference on Machine
Learning, 2003.
[16] D. Jensen, J. Neville, and B. Gallagher. Why collective inference improves
relational classification. In International Conference on Knowledge Discovery
and Data Mining, 2004.
[17] K. Kersting. Representational power of probabilistic-logical models: From
upgrading to downgrading. In IJCAI-2003 Workshop on Learning Statistical
Models from Relational Data, 2003.
[18] K. Kersting and L. De Raedt. Basic principles of learning Bayesian logic
programs. Technical Report 174, Institute for Computer Science, University of
Freiburg, 2002.
[19] S. Kok and P. Domingos. Learning the structure of Markov logic networks.
In Proceedings of the International Conference on Machine Learning, 2005.
[20] S. Lauritzen and N. Sheehan. Graphical models for genetic analyses. Statistical Science, 18:4:489-514, 2003.
[21] Q. Lu and L. Getoor. Link-based classification. In Proceedings of the
International Conference on Machine Learning, 2003.
[22] S. Macskassy and F. Provost. A simple relational classifier. In Proceedings of
the 2nd Workshop on Multi-Relational Data Mining, KDD2003, 2003.
[23] S. Macskassy and F. Provost. Classification in networked data: A toolkit
and a univariate case study. Technical Report CeDER-04-08, Stern School of
Business, New York University, 2004.
[24] A. McCallum. Efficiently inducing features of conditional random fields. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2003.
[25] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. A machine learning
approach to building domain-specific search engines. In Proceedings of the
International Joint Conference on Artificial Intelligence, 1999.
[26] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
[27] R. Neal. Probabilistic inference using Markov chain Monte Carlo methods.
Technical Report CRG-TR-93-1, Dept of Computer Science, University of
Toronto, 1993.
[28] J. Neville and D. Jensen. Supporting relational knowledge discovery: Lessons
in architecture and algorithm design. In Proceedings of the Data Mining
Lessons Learned Workshop, ICML2002, 2002.
[29] J. Neville and D. Jensen. Collective classication with relational dependency
networks. In Proceedings of the Multi-Relational Data Mining Workshop,
KDD2003, 2003.
[30] J. Neville and D. Jensen. Iterative classification in relational data. In AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000.
[31] J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning relational probability trees. In International Conference on Knowledge Discovery and Data
Mining, 2003.

[32] J. Neville, D. Jensen, and B. Gallagher. Simple estimators for relational
Bayesian classifiers. In Proceedings of the IEEE International Conference on
Data Mining, 2003.
[33] J. Neville, Ö. Şimşek, D. Jensen, J. Komoroske, K. Palmer, and H. Goldberg.
Using relational knowledge discovery to prevent securities fraud. In International Conference on Knowledge Discovery and Data Mining, 2005.
[34] C. Perlich and F. Provost. Aggregation-based feature invention and relational
concept classes. In International Conference on Knowledge Discovery and Data
Mining, 2003.
[35] A. Popescul, L. Ungar, S. Lawrence, and D. Pennock. Statistical relational
learning for document mining. In Proceedings of the IEEE International
Conference on Data Mining, 2003.
[36] M. Richardson and P. Domingos. Markov logic networks. Machine Learning
Journal, 62:107-136, 2006.
[37] S. Sanghai, P. Domingos, and D. Weld. Dynamic probabilistic relational
models. In Proceedings of the International Joint Conference on Artificial
Intelligence, 2003.
[38] B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering
in relational data. In Proceedings of the International Joint Conference on
Artificial Intelligence, 2001.
[39] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2002.
[40] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from
decision trees and naive Bayesian classifiers. In Proceedings of the International
Conference on Machine Learning, 2001.

9 Logic-based Formalisms for Statistical Relational Learning

James Cussens

This chapter provides a selective overview of logic-based approaches to statistical
relational learning. Issues of representation, inference, and learning are addressed
with an emphasis on representation. A distinction is drawn between directed
representations with connections to Bayesian nets and undirected ones related
to Markov nets. Within directed representations a further distinction is made
between using conditional probabilities and using logical rules to define probability
distributions. Among the formalisms discussed are: the independent choice logic,
probabilistic logic programming, and stochastic logic programs. The PRISM system
is used to provide concrete examples of probabilistic inference and parameter
estimation. The use of possible worlds to provide semantics is described and
its role in connecting differing formalisms is analyzed.

9.1 Introduction

This chapter provides a high-level and selective overview of formalisms which
incorporate both logic and probability. Naturally, the focus is on those formalisms
which fall within the ambit of statistical relational learning (SRL) or which have
influenced formalisms used for SRL. Learning (in the AI sense) is the central topic
of this book, but in order to understand existing and potential learning algorithms
for the formalisms discussed, it is necessary to understand what is represented by
each formalism: we need to know what is to be learned before examining how to
do the learning. Consequently, in this chapter there is a strong focus on issues of
representation.
It is worth stating some important questions concerning logic and probability
which will not be addressed here. First, logic in the general nontechnical sense of
"a method of rational reasoning" includes probabilistic reasoning quite naturally since
humans are required to reason in uncertain situations. Thus two of the historically
most influential logic books include sections on probabilistic reasoning as a matter
of course. The books concerned are the Port-Royal Logic [2] and An Investigation
of the Laws of Thought, on which are founded the Mathematical Theories of Logic
and Probabilities [4]. Here, however, we are not concerned with general issues in
probabilistic reasoning (dealing with conflicting evidence, probability kinematics,
relations with other uncertainty calculi, etc.). Instead both probability and logic will
be treated as formalisms, i.e., in the narrow technical sense.
Secondly, there is a well-developed logical interpretation of probability found in
the philosophical literature. The best-known advocates of this interpretation are
Keynes [23] and Carnap [5]. The basic claim of this interpretation is that for any
two propositions a and b there is an objective, logical relation of partial entailment
between a and b which is measured by a unique conditional probability P(b|a).
See Howson and Urbach [18] for further details. This interpretation uses versions
of the principle of indifference to logically infer such probability values. This
approach is rejected in all the formalisms to be discussed: either the user defines
probabilities or the probabilities are estimated from data. Whether these are the
"right" probabilities is of no concern to the formalism.
9.1.1 Possible World Semantics

Although logic-based SRL formalisms reject Carnap's attempt to use logic to determine probabilities, his use of possible worlds to provide semantics for probabilistic
statements is widely followed. Recall that to interpret terms and formulae of a
(standard, nonprobabilistic) first-order language L it is necessary to consider L-structures, also known as L-interpretations, or more poetically, possible worlds. An
L-structure is a set (the domain) together with functions and relations. Each function (resp. predicate) symbol in the language has a corresponding function (resp.
relation) on the domain. Standard (Tarskian) first-order semantics defines when a
particular L-formula is true in a particular L-structure. For example, the formula
flies(tweety) is true in a given L-structure iff the individual which the constant
tweety denotes is an element of the set which the predicate symbol flies denotes.
To explain possible world semantics for probabilistic statements we will follow the account of Halpern [17]. Using Halpern's notation, the probability that
flies(tweety) is true is denoted by the term w(flies(tweety)).¹ The proposition
that this probability is 0.8 is represented by the formula w(flies(tweety)) = 0.8.
As Halpern notes, it is not useful to ask whether a probabilistic statement such
as w(flies(tweety)) = 0.8 is true in some particular L-structure. In any given L-structure, tweety either flies or does not. Instead we have to ask whether a probability
distribution over L-structures satisfies w(flies(tweety)) = 0.8 or not. A rigorous
account of how to answer this question is given by Halpern [17] but the basic idea
is simple: a probability distribution $\mu$ over L-structures (possible worlds) satisfies
w(flies(tweety)) = 0.8 iff the set of worlds in which flies(tweety) is true has
probability 0.8 according to $\mu$. Note that this means that w(flies(tweety)) is a
marginal probability since it can be computed by summing over possible-world
probabilities.

1. Note on terminology: Throughout the rest of this chapter the term "the probability of
F" (where F is a first-order formula) will be used as an abbreviation for "the probability
that F is true."
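
The marginal computation described above can be made concrete with a small sketch. The two-atom language and the particular distribution `mu` below are assumptions chosen for illustration; only the idea of summing world probabilities comes from the text.

```python
from fractions import Fraction

# Each possible world fixes the truth value of every ground atom.
# Here a world is the pair (flies(tweety), happy(tweety)).
# The probabilities are hypothetical and chosen only for illustration.
mu = {
    (True, True): Fraction(7, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(1, 20),
    (False, False): Fraction(3, 20),
}


def w(formula, distribution):
    """Probability of a formula: total mass of the worlds in which it is true."""
    return sum(
        (p for world, p in distribution.items() if formula(*world)),
        Fraction(0),
    )


def flies(flies_tweety, happy_tweety):
    """Truth value of flies(tweety) in a given world."""
    return flies_tweety


p_flies = w(flies, mu)  # sums the two worlds in which tweety flies
```

Under this assumed distribution, `mu` satisfies w(flies(tweety)) = 0.8, because the two worlds in which tweety flies carry mass 7/10 and 1/10.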

9.2 Representation

Probability-logic formalisms take one of two routes to defining probabilities. In the
directed approach there is a nonempty set of formulae all of whose probabilities
are explicitly stated: call these probabilistic facts, similarly to Sato [39]. Other
probabilities are defined recursively with the probabilistic facts acting as base cases.
A probability-logic model using the directed approach will be closely related to a
recursive graphical model (Bayesian net). Most probability-logic formalisms fall into
this category: for example, probabilistic logic programming (PLP) [30]; probabilistic
Horn abduction (PHA) [36] and its later expansion the independent choice logic
(ICL) [37]; probabilistic knowledge bases (PKBs) [31]; Bayesian logic programs
(BLPs) (see chapter 10); relational Bayesian networks (RBNs) [20]; stochastic logic
programs (SLPs) (see chapter 11) and the PRISM system [40].
The second, less common, approach is undirected, where no formula has its
probability explicitly stated. Relational Markov networks (RMNs) (see chapter 6)
and Markov logic networks (MLNs) (see chapter 12) are examples of this approach.
In the undirected approach, the probability of each possible world is defined in terms
of its features, where each feature has an associated real-valued parameter. For
example, in the case of MLNs each feature is associated with a first-order formula:
the value of the feature for a given world is simply the number of true ground
instances of the formula in that world. Such approaches have much in common with
undirected probabilistic models such as Markov networks. For example, to compute
(perhaps conditional) probabilities of individual formulae, inference techniques from
Markov networks can be used. See chapters 5 and 12 for further details.
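
As a rough illustration of the undirected approach, the sketch below scores each possible world by the number of true ground instances of a single formula and normalizes over all worlds. It assumes the usual log-linear form, in which a world's probability is proportional to the exponential of weight times count; the domain `constants`, the formula smokes(x) → cancer(x), and the `weight` value are hypothetical.

```python
import math
from itertools import product

constants = ["a", "b"]  # hypothetical domain of two individuals
weight = 1.5            # hypothetical real-valued parameter of the feature

# Ground atoms: smokes(c) and cancer(c) for each constant c.
ground_atoms = [(pred, c) for pred in ("smokes", "cancer") for c in constants]


def feature_count(world):
    """Number of true ground instances of smokes(x) -> cancer(x) in a world."""
    return sum(
        1
        for c in constants
        if (not world[("smokes", c)]) or world[("cancer", c)]
    )


def all_worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for values in product([False, True], repeat=len(ground_atoms)):
        yield dict(zip(ground_atoms, values))


# Normalizing constant: sum of unnormalized scores over all 16 worlds.
Z = sum(math.exp(weight * feature_count(wd)) for wd in all_worlds())


def world_probability(world):
    """Probability of a world under the assumed log-linear form."""
    return math.exp(weight * feature_count(world)) / Z
```

Enumerating all worlds is only feasible for toy domains like this one; it is shown here to make the definition explicit, not as a practical inference method.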
9.2.1 Defining Random Variables Using Logic

Here we will focus on formalisms using the more common directed approach.
The most basic requirement of such formalisms is to explicitly state that a given
ground atomic formula has some probability of being true: a statement such as
w(flies(tweety)) = 0.8 should be expressible. This is indeed the case for PLP,
PHA/ICL, PKB, and PRISM. In all these cases, possible worlds semantics are
explicitly invoked.
From a statistical point of view, asserting that w(flies(tweety)) = 0.8 amounts to
viewing flies(tweety) as a binary variable taking the values TRUE and FALSE. In
many applications a restriction to binary variables would be very inconvenient,
so a number of formalisms have machinery to allow a logical representation of
random variables with n values for arbitrary finite n. For each such random
variable this is done by stating that a set of n atomic formulae are mutually
exclusive and exhaustive. Call such sets alternatives and call the determination
of which formula is true in an alternative an atomic choice, following Poole [37].
In each possible world exactly one atomic choice from each alternative is true. This approach is taken in
PHA/ICL and PRISM, and a similar one is used in PKB. Indeed, in these
formalisms, alternatives are used even in the binary case, thus allowing a uniform
logical representation of binary and nonbinary random variables. Rather than
stating that w(flies(tweety)) = 0.8 we can write w(flies(tweety, yes)) = 0.8 and
w(flies(tweety, no)) = 0.2, and state that {flies(tweety, yes), flies(tweety, no)}
is an alternative.
The PRISM system restricts the syntactic form of alternatives quite drastically
as part of its distribution condition [40]. Each alternative is a set of atomic formulae
of the form
$$\{\mathit{msw}(x, i, v_1), \mathit{msw}(x, i, v_2), \ldots, \mathit{msw}(x, i, v_m)\},$$
where x and i are the switch name and trial-id respectively, and msw is short
for "multi-ary random switch".² Translating to the language of random variables,
msw(x, i, v_j) is true in a world iff in that world the variable X_i takes the value
v_j. For different trial-ids i, the random variables X_i are required to be i.i.d.
This restriction seems drastic, but since switch names can be structured logical
terms, in practice no representational power is lost. Returning to tweety we could
have the alternative
$$\{\mathit{msw}(\mathit{flies}(\mathit{tweety}), 1, \mathit{yes}), \mathit{msw}(\mathit{flies}(\mathit{tweety}), 1, \mathit{no})\},$$
where the trial-id is effectively redundant, and now flies is a function symbol rather
than a predicate symbol.
Since the distribution over values does not depend on the trial-id, this second
argument is not present in actual PRISM code; it is implicit. Figure 9.1 gives the
PRISM code for defining our tweety alternative. This code actually states that
$$\forall i \in \mathbb{N} : w(\mathit{msw}(\mathit{flies}(\mathit{tweety}), i, \mathit{yes})) = 0.8 \wedge w(\mathit{msw}(\mathit{flies}(\mathit{tweety}), i, \mathit{no})) = 0.2,$$
although for our example only i = 1 is needed.

values(flies(tweety),[yes,no]).
:- set_sw(flies(tweety),0.8+0.2).
Figure 9.1 PRISM code for tweety.

2. I have altered the notation slightly from that found in [40].
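
To illustrate the i.i.d. reading of trial-ids, here is a small Python stand-in for a switch. It is not PRISM code and does not use any PRISM API: the dictionaries `switch_values` and `switch_probs` and the function `msw` are hypothetical names that only mimic the declaration in figure 9.1.

```python
import random
from collections import Counter

# Hypothetical Python analogue of the switch declared in figure 9.1.
switch_values = {"flies(tweety)": ["yes", "no"]}
switch_probs = {"flies(tweety)": [0.8, 0.2]}


def msw(switch_name, rng):
    """Sample the outcome of one trial of the named switch.

    The trial-id is implicit: every call is a fresh, independent draw
    from the same distribution.
    """
    return rng.choices(
        switch_values[switch_name],
        weights=switch_probs[switch_name],
        k=1,
    )[0]


rng = random.Random(0)
trials = [msw("flies(tweety)", rng) for _ in range(10000)]
frequencies = {value: n / len(trials) for value, n in Counter(trials).items()}
```

Each call corresponds to a new trial-id, so over many trials the relative frequency of `yes` should be close to 0.8, and in every single trial exactly one value of the alternative is chosen.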

9.2.2 Using Logical Variables


Even at the rudimentary level of defining probabilities for atomic formulae some
of the power of first-order methods is apparent. The basic point is that by using
variables we can define probabilities for whole families of related atomic formulae.
To take an example from Ngo and Haddawy [31], the PKB formula
$$P(\mathit{nbrhd}(X, \mathit{bad})) = 0.3 \leftarrow \mathit{in\_CALI}(X)$$
makes up part of the definition of the distribution of random variables nbrhd(X)
for those X where in_CALI(X) is true. Informally, the formula says: "For those in
California, there is probability 0.3 of living in a bad neighborhood." The formula
in_CALI(X) is known as a context literal. Asserting that a context literal is true
amounts to stating that it is true in all possible worlds, or equivalently, restricting
the set of possible worlds under consideration to those where it is true. Thus the
preceding formula can be translated into Halpern's syntax as
$$\forall x : w(\mathit{nbrhd}(x, \mathit{bad})) = 0.3 \leftarrow w(\mathit{in\_CALI}(x)) = 1.$$
If, for example, we had that w(in_CALI(bob)) = 1, meaning it is certain that
Bob lives in California, then it immediately follows that w(nbrhd(bob, bad)) = 0.3,
meaning Bob lives in a bad neighborhood with probability 0.3.
Such a mixture of probabilistic literals (i.e., P(nbrhd(X, bad))) and nonprobabilistic literals (i.e., in_CALI(X)), where the latter states what is true in all worlds,
is common. In PLP [30], stating what is true in all worlds is made explicit. The
above formula would be written
$$\mathit{nbrhd}(X, \mathit{bad}) : [0.3, 0.3] \leftarrow \mathit{in\_CALI}(X) : [1, 1]$$
and would have the same informal intended meaning. As this example indicates,
in PLP probability intervals are represented. To show the sort of formulae that
can be expressed in PLP and what they mean, tables 9.1, 9.2 and 9.3 show
three example PLP formulae, their informal meaning, and their representation in
Halpern's notation, respectively.³
Table 9.1 PLP formulae
1. $\mathit{eastbound}(\mathit{train1}) : [0.7, 0.9] \leftarrow$
2. $\mathit{bark}(X) : [0.95, 1] \leftarrow \mathit{dog}(X) : [1, 1]$
3. $\mathit{not\_dog}(X) : [1 - V_2, 1 - V_1] \leftarrow \mathit{dog}(X) : [V_1, V_2]$

Table 9.2 Informal meanings for the PLP formulae in table 9.1
1. The probability that train1 is eastbound is between 0.7 and 0.9.
2. All dogs have a probability at least 0.95 of barking.
3. If the probability that something is a dog lies between V_1 and V_2, then
the probability that it is not a dog lies between 1 - V_2 and 1 - V_1.

Table 9.3 The PLP formulae in table 9.1 in Halpern's notation
1. $w(\mathit{eastbound}(\mathit{train1})) \in [0.7, 0.9]$
2. $\forall x : w(\mathit{bark}(x)) \in [0.95, 1] \leftarrow w(\mathit{dog}(x)) = 1$
3. $\forall x, v_1, v_2 : w(\mathit{not\_dog}(x)) \in [1 - v_2, 1 - v_1] \leftarrow w(\mathit{dog}(x)) \in [v_1, v_2]$

3. Some of this material is taken from Cussens [6].

9.2.3 Logical Implication versus Conditional Probabilities

So far we have been looking mainly at formulae which directly define probabilities
for atomic formulae. Directly stating all probabilities of interest is too restrictive
and so formalisms provide mechanisms for defining probabilities which must be
inferred rather than just looked up.
For directed approaches, there are two basic ways in which this is done: using
conditional probabilities and using logical rules. PKB [31], for example, focuses on
the former approach allowing formulae such as
$$P(\mathit{bglry}(X, \mathit{yes}) \mid \mathit{nbd}(X, \mathit{bad})) = 0.6 \leftarrow \mathit{in\_CALI}(X), \tag{9.1}$$
which corresponds to this statement in Halpern's notation:
$$\forall x : [\, w(\mathit{bglry}(x, \mathit{yes}) \wedge \mathit{nbd}(x, \mathit{bad})) = 0.6 \times w(\mathit{nbd}(x, \mathit{bad})) \leftarrow w(\mathit{in\_CALI}(x)) = 1 \,],$$
which can be written in disjunctive form:
$$\forall x : [\, w(\mathit{bglry}(x, \mathit{yes}) \wedge \mathit{nbd}(x, \mathit{bad})) = 0.6 \times w(\mathit{nbd}(x, \mathit{bad})) \vee w(\mathit{in\_CALI}(x)) \neq 1 \,]. \tag{9.2}$$
(All these formulae informally mean "For those living in California, the probability
of being burgled if they live in a bad neighborhood is 0.6.") This approach implicitly
defines a (possibly huge) Bayesian network (the ground network) where each node
corresponds to a ground atomic formula. The conditional probability tables (CPTs)
in the ground network are often defined with the help of combining rules such
as noisy-or. The full ground network is never actually constructed. Instead only
just enough of it is constructed to answer any given probabilistic query. This is a
technique known as knowledge-based model construction (KBMC).
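
The noisy-or combining rule mentioned above can be sketched as follows. The function name `noisy_or` and the second cause probability (0.5) are assumptions for illustration; only the value 0.6 comes from formula (9.1). Noisy-or treats each active cause as independently able to produce the effect, so the effect fails only if every cause fails.

```python
import math


def noisy_or(cause_probabilities):
    """Combine per-cause probabilities with the noisy-or rule.

    Each active cause independently triggers the effect with its own
    probability; the effect is absent only if no cause triggers it.
    """
    prob_no_cause_fires = math.prod(1.0 - p for p in cause_probabilities)
    return 1.0 - prob_no_cause_fires


# One ground rule gives burglary probability 0.6 (bad neighborhood);
# a second, hypothetical rule gives 0.5 for some other active cause.
p_burglary = noisy_or([0.6, 0.5])
```

With a single active cause the rule returns that cause's probability unchanged, and with no active causes it returns 0, which is why it is a convenient way to fill in a CPT entry when several ground rules share the same head.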
The alternative approach is to use logical rules: statements of what is true in all
worlds. But this raises a problem. From
$$w(\mathit{flies}(\mathit{tweety})) = 0.8 \tag{9.3}$$
and the statement that $\forall x : \mathit{happy}(x) \leftarrow \mathit{flies}(x)$ is true in all worlds:
$$w(\forall x : \mathit{happy}(x) \leftarrow \mathit{flies}(x)) = 1, \tag{9.4}$$
we get not w(happy(tweety)) = 0.8 but merely $w(\mathit{happy}(\mathit{tweety})) \geq 0.8$ (since
tweety may be happy for other reasons). No specific probability is determined
for happy(tweety). There are two attitudes to this problem. The first is just to live
with having mere bounds on probabilities. This is the approach taken by Boole [4]
and Ng and Subrahmanian [30] and, using a propositional approach, Nilsson [32].
The second method is to invoke a version of the closed-world assumption (CWA),
so that (9.4) is interpreted to mean
$$w(\forall x : \mathit{happy}(x) \leftrightarrow \mathit{flies}(x)) = 1$$
from which, together with (9.3), w(happy(tweety)) = 0.8 does follow. The CWA
approach is taken in PHA/ICL [36, 37] and PRISM [40]. In both these formalisms
there is a strict separation between the probabilistic facts (like flies(tweety))
whose probabilities are explicitly given, and formulae like happy(tweety) whose
probabilities have to be inferred from the probabilistic facts, the rules, and the
CWA. This separation is achieved by syntactic restrictions on the rules so that no
probabilistic fact can be inferred using the rules. The rules basically extend the
distribution defined over the facts to the other atomic formulae. As Sato puts it:
When a joint distribution $P_F$ is given to a set F of facts in a logic program
$DB = F \cup R$ where R is a set of rules, we can further extend it to a joint
distribution $P_{DB}$ over the set of least models of DB [39].
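
The difference between the two attitudes can be checked by enumerating worlds. In the sketch below, the function `build_distribution` and its parameter `extra_happy_mass` are hypothetical; the parameter is the probability of the world in which tweety is happy without flying, which the rule happy(x) ← flies(x) permits but its closed-world reading forbids.

```python
from fractions import Fraction

P_FLIES = Fraction(4, 5)  # w(flies(tweety)) = 0.8, as in (9.3)


def build_distribution(extra_happy_mass):
    """Distribution over worlds (flies, happy) that satisfies (9.3) and (9.4).

    The world (flies, not happy) gets probability 0 because the rule
    happy(x) <- flies(x) must hold in every world with positive probability.
    """
    return {
        (True, True): P_FLIES,
        (True, False): Fraction(0),
        (False, True): extra_happy_mass,
        (False, False): 1 - P_FLIES - extra_happy_mass,
    }


def w(distribution, atom_index):
    """Marginal probability that the atom at the given index is true."""
    return sum(
        (p for world, p in distribution.items() if world[atom_index]),
        Fraction(0),
    )


# Rule read as an implication: any extra mass in [0, 1/5] is admissible,
# so w(happy(tweety)) is only bounded below by 4/5.
open_world = [build_distribution(Fraction(k, 20)) for k in range(5)]
happy_bounds = [w(mu, 1) for mu in open_world]

# Rule read under the CWA (as a biconditional): no happy-without-flying world.
closed_world = build_distribution(Fraction(0))
happy_cwa = w(closed_world, 1)
```

Every distribution in `open_world` satisfies both (9.3) and (9.4), yet they disagree on w(happy(tweety)); only the closed-world reading pins the value down to 0.8.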
Further details on this point, and on the relation between PHA/ICL, PRISM, and
SLPs are given by Cussens [9]. Two further points are worth mentioning. First, it
is possible to combine conditional probability and rule-based methods, as shown by
(9.1). In such cases it is important to distinguish between formulae in the antecedent
of a rule and those in the conditional part of a conditional probability. For example,
if in formula (9.1) we move the antecedent literal into the conditional part of the
probability we get
P(bglry(X, yes) | nbd(X, bad), in_CALI(X)) = 0.6,    (9.5)

which in Halpern's notation is

∀x : [ w(bglry(x, yes) ∧ nbd(x, bad) ∧ in_CALI(x)) = 0.6 × w(nbd(x, bad) ∧ in_CALI(x)) ].    (9.6)

This results in a strictly stronger formula: (9.2) only impacts on those known to
be Californians, whereas (9.6) states a conditional probability that applies to all
individuals. Formally, (9.6) ⊨ (9.2) but (9.2) ⊭ (9.6). To see this, abbreviate (9.6)
to ∀x : [p(x)] and (9.2) to ∀x : [q(x) ← w(in_CALI(x)) = 1]; then

(9.6) ⇔ ∀x : [ p(x) ]
      ⇒ ∀x : [ p(x) ← (w(in_CALI(x)) = 1 ∨ ¬(w(in_CALI(x)) = 1)) ]
      ⇒ ∀x : [ (p(x) ← w(in_CALI(x)) = 1) ∧ (p(x) ← ¬(w(in_CALI(x)) = 1)) ]
      ⇒ ∀x : [ (q(x) ← w(in_CALI(x)) = 1) ∧ (p(x) ← ¬(w(in_CALI(x)) = 1)) ]
      ⇒ ∀x : [ q(x) ← w(in_CALI(x)) = 1 ]
      ⇔ (9.2)    (9.7)

The step that replaces p(x) by q(x) holds because, when w(in_CALI(x)) = 1, conjoining
in_CALI(x) with any formula leaves its probability unchanged, so p(x) reduces to q(x).
Secondly, although the connection to Bayesian networks is more direct when using conditional probabilities, it is also straightforward to encode Bayesian networks
using the rule-based approach [36], since both cases share an underlying directedness.
9.2.4

Defining Joint Distributions

So far we have considered probabilities on the truth values of atomic formulae only.
It is necessary to go further and have a mechanism for defining probabilities on (at
least) conjunctions of atomic formulae. This defines the joint distribution over the
truth values of atomic formulae. If each possible world has a conjunction that it
alone satisfies, this will give us a complete distribution over possible worlds.
One approach is to assume independence in all cases where this is possible, an
approach going back to Boole:
The events whose probabilities are given are to be regarded as independent of
any connexion but such as is either expressed, or necessarily implied, in the
data . . . ([4], pp. 256-7.)
Where alternatives (in the Poole sense) are used, it is clear that the atomic formulae
in any given alternative are highly dependent. Equally, those formulae on either
side of an implication must be dependent. However, we are at liberty to assume
that formulae from different alternatives are independent and this is what is done
in PHA/ICL and PRISM; indeed this is why the independent choice logic is so
called. Note that this is only possible because these formalisms disallow inferred
probabilities, like happy(tweety), from appearing in alternatives.
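
The independence assumption can be illustrated with a small sketch. The two alternatives and their probabilities below are hypothetical; the point is only that choices from different alternatives multiply, while two different choices from the same alternative exclude each other.

```python
# Each alternative is a distribution over mutually exclusive atomic choices.
# The alternatives and their probabilities are made up for illustration.
ALTERNATIVES = {
    "coin1": {"heads": 0.5, "tails": 0.5},
    "coin2": {"heads": 0.3, "tails": 0.7},
}

def prob_of_choices(choices):
    """Probability of a conjunction of atomic choices, each given as an
    (alternative, value) pair."""
    selected = {}
    for alt, value in choices:
        if selected.setdefault(alt, value) != value:
            return 0.0  # two different choices from one alternative
    p = 1.0
    for alt, value in selected.items():
        p *= ALTERNATIVES[alt][value]  # independence across alternatives
    return p

p_joint = prob_of_choices([("coin1", "heads"), ("coin2", "tails")])
p_clash = prob_of_choices([("coin1", "heads"), ("coin1", "tails")])
```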
When a formalism (implicitly) defines a Bayesian network whose nodes are
ground atomic formulae, then the probability of any conjunction is just the probability of the relevant joint instantiation of the Bayesian net in the normal way.
A quite different way of combining Bayesian networks with logic is provided by
relational Bayesian networks (RBNs) [20]. Each node in an RBN corresponds to a relation
(this includes monadic relations such as flies) instead of to an atomic formula. The
possible values for a relation r are the possible interpretations of r. An interpretation
for a relation r is a set of true ground atomic formulae with r as the predicate symbol:
something that varies across possible worlds. For example, in one possible world the relation mother might have
the interpretation
{mother(gill, rob), mother(gill, jane), mother(dot, james)},
whereas in another it might be
{mother(gill, rob), mother(gill, jane), mother(alison, gill)}.
Any joint instantiation of an RBN fixes an interpretation for all relations in the
RBN and thus corresponds to some possible world. So an RBN defines a distribution
over possible worlds. RBNs are, in fact, a special case of a more general class of
models called random relational structure models. See Jaeger [19] for the full story.
9.2.5

Avoiding Possible World Semantics

Not all probability-logic formalisms are framed in terms of possible worlds. The
hallmark of Bayesian logic programs [22] is a one-to-one mapping between ground
atomic formulae and random variables where there is no restriction on what these
random variables might be. In particular a random variable need not represent the
probability with which the ground atomic formula is true; indeed it need not be
binary. One advantage of this design decision is that continuous random variables
can be represented. There is, however, a logical aspect to BLPs which has associated
semantics. In BLPs, first-order clauses are used, together with combining rules, to
define the structure of a BLP in much the same way that parent-child edges define
the structure of a Bayesian network. Essentially, a ground instance of an atomic
formula in the head of a clause corresponds to a child node, whereas those in
the body are its parents. Using logical formulae to define the structure of a large
(possibly infinite) Bayesian network in this way means that logical methods can be
used to reason about the structure of the network. More on BLPs can be found in
chapter 9 in this book.
The example of BLPs shows that it can be fruitful to use first-order logic
as a convenient way of representing and manipulating data (and models) with
complex structure, without too much concern about what the resulting probability
distributions mean. Much, but not all, of the work on SLPs [28, 7] takes this
view. The easiest way to understand SLPs is by relating them to stochastic context-free grammars (SCFGs) as Muggleton [28] did in the original paper. In an SCFG
each grammar rule has an associated probability. This provides a mechanism for
probabilistically generating strings from the grammar: when there is a choice
of grammar rules for expanding a nonterminal, one is chosen according to the
probabilities. Any derivation in the grammar thus has a probability which is simply
the product of the probabilities of all rules used in the derivation. The probability
of any string is given by the sum of the probabilities of all derivations which
generate that string. SLPs lift this basic idea to logic programs: probabilities are
attached to first-order clauses, thus defining probabilities for proofs. SLPs are more
complex than SCFGs since they are not generally context-free: not all sequences of
clauses constitute a proof; some end in failure. There are different ways of dealing
with this (one option is to use backtracking), which define different probability
distributions [8]. More on SLPs can be found in chapter 10 of this book.
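
The SCFG computation described above can be sketched in a few lines. The grammar here is a hypothetical one chosen because it is ambiguous: S -> S S with probability 0.3 and S -> a with probability 0.7. The probability of a string is then the sum, over its derivation trees, of the product of the rule probabilities used.

```python
from functools import lru_cache

# Hypothetical ambiguous grammar:  S -> S S  (0.3)   |   S -> a  (0.7)
P_SPLIT, P_LEAF = 0.3, 0.7

@lru_cache(maxsize=None)
def string_prob(n):
    """Probability that S generates the string made of n copies of 'a'."""
    if n == 1:
        return P_LEAF
    # Sum over every way the rule S -> S S can split the string in two.
    return P_SPLIT * sum(string_prob(k) * string_prob(n - k)
                         for k in range(1, n))

# 'aaa' has two derivation trees, each with probability 0.3**2 * 0.7**3.
p_aaa = string_prob(3)
```

Because every sequence of rule applications in a context-free grammar can be completed, no probability mass is lost here; in an SLP some clause sequences end in failure, which is exactly the complication mentioned in the text.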
A semantics-independent approach appears pragmatic and flexible: is there not
the problem that a formalism with possible-worlds semantics cannot model probability distributions over spaces other than possible worlds? In fact, the distinction
between the two approaches is not so fundamental since, with a little imagination,
any probability distribution can be viewed as one over some set of possible worlds.
Conversely, having possible-worlds semantics certainly does not stop a formalism
being applicable to real-world problems.
Moreover, imposing a possible-world semantics on a formalism can provide a
useful bridge to related formalisms. For example, Cussens [9] provides a possible-worlds semantics to SLPs by translating SLPs into PRISM programs, the latter
already having possible-world semantics. This amounts to mapping each proof to
a possible world. A characterization of the sort of possible-world distributions thus
defined is given by Sato and Kameya [40]. The connection between PHA/ICL and
PRISM can then be used to connect SLPs with PHA/ICL.

9.3

Inference
Having defined a probability distribution in a logic-based formalism there remains
the problem of computing probabilities to answer specific queries, such as "What's
the probability that Tweety flies?" This problem is generally known as inference
and the term is particularly apposite for a logic-based formalism, since for such
formalisms it is possible to exploit nonprobabilistic logical inference to perform
complex probabilistic computations. Here we will only consider inference for those
formalisms (such as PHA/ICL and PRISM) which use logical implication to define
probability distributions, since in such cases normal first-order inference can be
used particularly directly to compute probabilities.
Consider, first, standard logical inference, using the first-order logical theory H
in (9.8) by way of example. H defines possible output sequences (via the hmm/2
predicate) for a hidden Markov model (HMM) whose parameters are yet to be
defined. The HMM has two states (s0 and s1) and two symbols in its output
alphabet (a and b). Both states can emit both symbols and all four state
transitions are possible. The only formula of any interest is the second, which uses
the cons function symbol to encode a nonempty sequence.

∀s : hmm(s, null)
∀s, x, y, t : hmm(s, cons(x, y)) ← emit(s, x), next(s, t), hmm(t, y)
emit(s0, a)   emit(s0, b)   emit(s1, a)   emit(s1, b)
next(s0, s0)  next(s0, s1)  next(s1, s0)  next(s1, s1)    (9.8)

Because the first-order language L used in (9.8) contains the function symbol
cons, the language includes an infinite number of terms, and so there are an infinite
number of groundings of the second universally quantified formula. However, to
prove, for example, that the ground formula hmm(s0, cons(a, cons(b, null))) follows
from H it is not necessary to ground the formulae in H. In practice a theorem
prover like Prolog [25] uses unification to partially instantiate formulae on the fly
to establish a proof.
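
The goal-directed flavor of such a proof can be imitated in Python. The sketch below is not a unification-based prover: it works on ground queries only and uses a tuple in place of the cons term. It does, however, show that the search instantiates the state variable t only as needed, and never enumerates the groundings of the second formula. The function and constant names are assumptions made for this example.

```python
# Ground facts of H from (9.8).
EMIT = {("s0", "a"), ("s0", "b"), ("s1", "a"), ("s1", "b")}
NEXT = {("s0", "s0"), ("s0", "s1"), ("s1", "s0"), ("s1", "s1")}

def proves_hmm(state, seq):
    """True if hmm(state, seq) follows from H, where seq is a tuple
    standing in for the cons-list term."""
    if not seq:                       # first formula: hmm(s, null)
        return True
    head, tail = seq[0], seq[1:]
    if (state, head) not in EMIT:     # body atom emit(s, x)
        return False
    # Body atoms next(s, t), hmm(t, y): try each successor state in turn.
    return any(proves_hmm(t, tail) for (s, t) in NEXT if s == state)

follows = proves_hmm("s0", ("a", "b"))   # hmm(s0, cons(a, cons(b, null)))
```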
Recall that for PHA/ICL and PRISM there is a strict separation between
the probabilistic facts for which probabilities are explicitly stated and all other
formulae. To compute the probability of a formula F which is not a probabilistic
fact, it suffices to find those conjunctions of facts which, together with the logical
rules, entail F. Thanks to the ever-convenient CWA, P(F) is then exactly the
sum of the probabilities of these conjunctions. The key point is that finding the
required conjunctions is a purely logical operation (abduction) and so algorithms
and implementations for first-order logical inference can be used for it. The key
importance of abduction is reflected in the name probabilistic Horn abduction [36].
To make this concrete, consider the PRISM program H′ in figure 9.2. This is just
a parameterized version of H, where, for simplicity, there is an implicit assumption
that the initial state of the HMM is given. H′ defines a distribution over a countably
infinite set of possible worlds (each of which encodes a particular realization of
the HMM). The facts are the (infinitely many) msw/3 atoms msw(out(s0), 1, a),
msw(out(s0), 2, a), . . . msw(tr(s1), 1, s0), msw(tr(s1), 2, s0), . . . where the second
argument is suppressed in the program text as a programming convenience.
To compute the probability that, say, hmm(s0, [a, b, a]) is true, it is necessary to
find conjunctions of facts the truth of which entails the truth of hmm(s0, [a, b, a]).
Fortunately, we can use the PRISM built-in probf/1 to explicitly show these, as
displayed in figure 9.3. Figure 9.3 differs from the actual PRISM output in that the
implicit second argument has been made explicit.
The first three lines of figure 9.3 state that hmm(s0, [a, b, a]) is true iff either
hmm(s0, [b, a]) ∧ msw(out(s0), 1, a) ∧ msw(tr(s0), 1, s0)
or
hmm(s1, [b, a]) ∧ msw(out(s0), 1, a) ∧ msw(tr(s0), 1, s1)

values(tr(S),[s0,s1,stop]).
:- set_sw(tr(s0),0.3+0.4+0.3).
:- set_sw(tr(s1),0.4+0.1+0.5).
values(out(S),[a,b]).
:- set_sw(out(s0),0.3+0.7).
:- set_sw(out(s1),0.6+0.4).
hmm(S,[X|Y]) :-
    msw(out(S),X),
    msw(tr(S),T),
    (
      T == stop, Y = []
    ;
      T \= stop, hmm(T,Y)
    ).
Figure 9.2  H′: a PRISM encoding of a parameterized version of the hidden
Markov model in (9.8).

| ?- probf(hmm(s0,[a,b,a])).
hmm(s0,[a,b,a])
<=> hmm(s0,[b,a]) & msw(out(s0),1,a) & msw(tr(s0),1,s0)
v hmm(s1,[b,a]) & msw(out(s0),1,a) & msw(tr(s0),1,s1)
hmm(s0,[b,a])
<=> hmm(s0,[a]) & msw(out(s0),2,b) & msw(tr(s0),2,s0)
v hmm(s1,[a]) & msw(out(s0),2,b) & msw(tr(s0),2,s1)
hmm(s1,[b,a])
<=> hmm(s0,[a]) & msw(out(s1),3,b) & msw(tr(s1),3,s0)
v hmm(s1,[a]) & msw(out(s1),3,b) & msw(tr(s1),3,s1)
hmm(s0,[a])
<=> msw(out(s0),a) & msw(tr(s0),4,stop)
hmm(s1,[a])
<=> msw(out(s1),a) & msw(tr(s1),4,stop)
yes
| ?- prob(hmm(s0,[a,b,a]),P).
P = 0.012429?
yes
Figure 9.3 Using abduction to compute a probability. The PRISM output has
been altered so that the second argument on msw/3 is explicit.


is true. Note that these formulae are guaranteed to be mutually exclusive since
msw(tr(s0), 1, s0) and msw(tr(s0), 1, s1) are defined to be alternatives. The next
three lines state when hmm(s0, [b, a]) is true, and so on. It is not difficult to
see that there are exactly four mutually exclusive conjunctions of msw/3 facts
(one for each choice of the second and third states) which, together with the rules,
entail hmm(s0, [a, b, a]). For example, one of
these conjunctions is msw(out(s0), 1, a), msw(tr(s0), 1, s0), msw(out(s0), 2, b),
msw(tr(s0), 2, s0), msw(out(s0), 3, a), msw(tr(s0), 4, stop). Since the msw/3 probabilistic facts are defined as independent, the probability of each conjunction
is simply a product of the probabilities of the conjuncts. The probability of
hmm(s0, [a, b, a]) is just the sum of these four products, which, as figure 9.3
shows, happens to be 0.012429. Naturally, the sum is computed by dynamic programming similar to the variable elimination algorithm used in Bayesian networks.
Sophisticated logic programming tabling technology can be exploited to do this
elegantly and efficiently.
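
The computation can be reproduced outside PRISM. The following sketch uses the switch probabilities of figure 9.2 and computes the probability of hmm(s0, [a, b, a]) in two ways: by enumerating the explanations (one per state path) and by a cached recursion on suffixes that mirrors the structure of figure 9.3. It is a plain Python illustration, not a description of how PRISM is implemented, and the function names are assumptions.

```python
from functools import lru_cache
from itertools import product

# Switch probabilities taken from figure 9.2.
TR = {"s0": {"s0": 0.3, "s1": 0.4, "stop": 0.3},
      "s1": {"s0": 0.4, "s1": 0.1, "stop": 0.5}}
OUT = {"s0": {"a": 0.3, "b": 0.7},
       "s1": {"a": 0.6, "b": 0.4}}
STATES = ("s0", "s1")

def explanation_probs(start, seq):
    """One product of switch probabilities per state path (explanation)."""
    probs = []
    for rest in product(STATES, repeat=len(seq) - 1):
        path = (start,) + rest
        p = 1.0
        for i, sym in enumerate(seq):
            nxt = path[i + 1] if i + 1 < len(seq) else "stop"
            p *= OUT[path[i]][sym] * TR[path[i]][nxt]
        probs.append(p)
    return probs

@lru_cache(maxsize=None)
def hmm_prob(state, seq):
    """Same quantity by recursion on suffixes; the cache plays the role
    that tabling plays in a logic programming system."""
    if len(seq) == 1:
        return OUT[state][seq[0]] * TR[state]["stop"]
    return OUT[state][seq[0]] * sum(TR[state][t] * hmm_prob(t, seq[1:])
                                    for t in STATES)

explanations = explanation_probs("s0", "aba")
total = sum(explanations)
```

Both routes give 0.012429, the value reported by prob/2 in figure 9.3. The enumeration grows exponentially with the length of the sequence, whereas the cached recursion visits each (state, suffix) pair once.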
It should be stressed that this example of probabilistic inference was able to
exploit the restrictions on PRISM programs that all the probabilistic ground atoms
in the body of each clause are probabilistically independent and the clauses defining
a probabilistic predicate are probabilistically exclusive [42], as well as the CWA. In
other cases inference is much harder. For example, inference in PLP requires linear
programming to deal with the inequalities involved and the linear program LP (P )
needed for a PLP program P contains exponentially many linear programming
variables w.r.t. the size of the Herbrand base of P [29]. (The Herbrand base is the
set of all ground atomic formulae expressible in the language used to define P.)
Naturally, one option for hard inference problems is to resort to approximate
methods. For example, Angelopoulos and Cussens [1] used an SLP to represent
a prior probability distribution over classification trees similarly to the way that
the PRISM program above defined a distribution over HMM outputs. Since they
adopt a Bayesian approach, learning reduces to probabilistic inference and so the
key problem is to compute posterior probabilities: probabilities conditional on the
observed data. Using exact inference to compute such probabilities (for example,
the posterior class distribution for a test example) seems a hopeless task, so instead,
the Metropolis-Hastings algorithm is used to sample from the posterior and thus
to produce approximations to the desired probabilities.

9.4

Learning
Having considered how probability-logic formalisms represent probability distributions and how inference can be used to compute probabilities of interest, we can
now turn to the issue of learning a model from data. As always, we consider the
observed data as a sample generated by some unknown true model whose identity we wish to learn (or rather estimate). In some cases only the parameters of
the model are unknown. In the general case both the structure and parameters of
the true model are unknown. These two cases are considered in section 9.4.1 and
section 9.4.2, respectively.
The paper by De Raedt and Kersting [10] provides an excellent overview of
learning in probability-logic formalisms. The current section is complementary to
De Raedt and Kersting's broad survey since it (1) examines, for the sake of concreteness,
parameter estimation in some detail for a particular formalism (PRISM)
and (2) discusses the use of probabilities in pre-SRL inductive logic programming
(ILP). An examination of pre-SRL ILP is useful since it seems likely that some of
the techniques found there may be useful for more recent formalisms.
Before focusing on these two areas it is worth mentioning two key points about
learning in probability-logic formalisms which are provided by De Raedt and
Kersting [10].
Much of the machinery for learning Bayesian networks can be used to learn
directed probability-logic models such as BLPs. When first-order clauses are used
for the structure of a directed model, then specialization and generalization of
clauses corresponds to using a macro-operator for adding and deleting arcs in a
Bayesian network. Parameter estimation for logical directed models corresponds
to parameter estimation with tied parameters in a normal Bayesian network.
This is because one first-order clause typically represents a collection of network
fragments in the underlying Bayesian network via its ground instances.
The probabilistic models associated with a PHA/ICL, PRISM, or SLP model
depend on parameters associated with many predicates in the underlying logic
program. This means that structure learning for such models is, in general, at least
as hard as multiple-predicate learning / theory revision in ILP, which is known
to be a hard problem. However, there exists work on learning SLPs in a restricted
setting [27], and also work on applying grammatical inference techniques to learn
SLPs [3].
As for other types of statistical inference, the key to learning in probability-logic
formalisms is the likelihood function: the probability of the observed data as a
function of the model. If the structure is fixed, then the likelihood is just a function
of the model parameters. If a probability-logic formalism defines a distribution
over possible worlds, then ideally we would like the data to be a collection of
independent observations of possible worlds, each viewed as a sample drawn from
the unknown true model. The probability of each world can then be computed
using inference (section 9.3) and the likelihood of the data is just a product of these
probabilities. In the case of alternative-based formalisms like PHA/ICL, PRISM,
and SLPs, each data point would then be associated with a unique conjunction of
atomic choices. Maximum likelihood estimation of the multinomial distribution over
each alternative is then possible by simple counting in the normal way. A Bayesian
approach using Dirichlet priors is equally simple.
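
When whole worlds are observed, estimation really is just counting, as the following sketch shows. The data layout (each world as a list of (alternative, value) pairs) and the function name are assumptions made for illustration. Setting alpha to 0 gives the maximum likelihood estimate; a positive alpha gives the posterior mean under a symmetric Dirichlet prior with that pseudo-count.

```python
from collections import Counter

def estimate(worlds, values, alpha=0.0):
    """Estimate one multinomial per alternative from fully observed worlds.

    worlds: list of lists of (alternative, chosen_value) pairs
    values: dict mapping each alternative to its possible values
    alpha:  symmetric Dirichlet pseudo-count (0 gives maximum likelihood)
    """
    counts = Counter(choice for world in worlds for choice in world)
    params = {}
    for alt, vals in values.items():
        total = sum(counts[(alt, v)] for v in vals) + alpha * len(vals)
        params[alt] = {v: (counts[(alt, v)] + alpha) / total for v in vals}
    return params

# Four observed worlds, each containing a single atomic choice.
values = {"out(s0)": ["a", "b"]}
worlds = [[("out(s0)", "a")], [("out(s0)", "a")],
          [("out(s0)", "b")], [("out(s0)", "a")]]
mle = estimate(worlds, values)
bayes = estimate(worlds, values, alpha=1.0)
```

The sketch assumes every alternative is observed at least once when alpha is 0; otherwise the normalizing total would be zero.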
However, in many cases each observation is a ground atomic formula, which is
true in many worlds. This means the data-generating process is best viewed in
terms of missing data: the true model generates a world, but we do not get to

hmm_out(X) :- hmm(s0,1,X).
hmm(S,N,[X|Y]) :-
    msw(out(S),X),
    msw(tr(S),T),
    (
      T == stop, Y = []
    ;
      T \= stop, NN is N+1, hmm(T,NN,Y)
    ).
Figure 9.4  PRISM program such that only one ground instance of hmm_out/1 is
true in each possible world.

see this world, only some ground atomic formula that is true in it. The rest of the
information required to determine the sampled world is missing. Unsurprisingly,
the expectation maximization (EM) algorithm is generally used in such situations.
An alternative approach is presented by Kok and Domingos [24]. Here the data
is contained in a (multitable) relational database. Each row in each table defines
a ground atomic formula in the usual way. The entire database is equivalent to a
conjunction of all these ground atomic formulae (so it is equivalent to a Datalog
Prolog program). Using the CWA, this defines a unique world: all formulae which
are not consequences of the conjunction are deemed false. (This unique world is the
minimal Herbrand model of the associated Prolog program.) So, on the one hand
we have only a single observation, but on the other it is an entire world that is
observed. See Kok and Domingos [24] and chapter 11 of this book for further details.
9.4.1

Parameter Estimation

To analyze parameter estimation, a modeling convention introduced by Sato and
Kameya [40] will be adopted. There will be a target predicate such that in each
possible world exactly one ground instance of this target predicate is true. This
turns out not to be much of a restriction and simplifies the analysis greatly.
For example, consider using PRISM to learn the parameters of the HMM of figure 9.2. The predicate hmm/2 is not a suitable target predicate, since any world in
which all the following formulae are true: msw(out(s0), 1, a), msw(tr(s0), 1, s0),
msw(out(s0), 2, b), msw(tr(s0), 2, s0), msw(out(s0), 3, a), msw(tr(s0), 4, stop),
will also have hmm(s0, [a, b, a]), hmm(s0, [b, a]), and hmm(s0, [a]) true. This is
illustrated by figure 9.3. Also, figure 9.2 as it stands is a conditional model
(conditional on the initial state) since no distribution over the initial state has been
defined. Changing the program to the one given in figure 9.4 and declaring that
hmm_out/1 is the target predicate, target(hmm_out,1), is enough to fix both
these problems. Now exactly one hmm_out/1 formula will be true in each world.
However, each hmm_out/1 formula will be true in many worlds, so that for maximum likelihood estimation of model parameters the EM algorithm is used. Figure 9.5 shows a run of PRISM using the EM algorithm to estimate the parameters
of the HMM using a data set of three examples. The algorithm was initialized with
the values shown in figure 9.2. Abduction is used once to produce the data structure
shown in figure 9.3. This can then be used in each iteration of the EM algorithm
to compute the expected values required in the E-step of the EM algorithm.

| ?- learn([hmm_out([a,b,a,a]),hmm_out([b,b]),hmm_out([a,a,b])]).
..
Finished learning
Number of iterations: 13.
Final likelihood:-9.440724
Total learning time: 0.01 seconds.
All solution search time: 0.01 seconds.
Total table space used: 6304 out of 240000000 bytes
Type show_sw to show the probability distributions.
yes
| ?- show_sw
Switch tr(s1): unfixed: s0 (0.235899) s1 (0.000014) stop (0.764086)
Switch out(s1): unfixed: a (0.254708) b (0.745291)
Switch tr(s0): unfixed: s0 (0.226177) s1 (0.773818) stop (0.000003)
Switch out(s0): unfixed: a (0.788360) b (0.211639)
Figure 9.5  EM learning with PRISM. (I have edited the output to reduce the
precision of parameters.)
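
To make the E-step and M-step concrete, here is a minimal Python sketch of EM for this particular model: an HMM with a fixed initial state s0 and an explicit stop transition, initialized with the values of figure 9.2 and run on the three examples of figure 9.5. It is a hand-written forward-backward implementation, not PRISM's algorithm, and no attempt is made to reproduce PRISM's stopping rule or its exact output; all function names are assumptions.

```python
import math

STATES = ("s0", "s1")

def backward(seq, tr, out):
    """beta[i][s]: probability of emitting seq[i:] and then stopping,
    given that the model is in state s at position i."""
    n = len(seq)
    beta = [None] * n
    for i in reversed(range(n)):
        beta[i] = {}
        for s in STATES:
            if i == n - 1:
                cont = tr[s]["stop"]
            else:
                cont = sum(tr[s][t] * beta[i + 1][t] for t in STATES)
            beta[i][s] = out[s][seq[i]] * cont
    return beta

def em_step(data, tr, out):
    """One EM iteration; returns (new_tr, new_out, log_likelihood)."""
    tr_n = {s: {t: 0.0 for t in tr[s]} for s in STATES}
    out_n = {s: {x: 0.0 for x in out[s]} for s in STATES}
    loglik = 0.0
    for seq in data:
        beta = backward(seq, tr, out)
        z = beta[0]["s0"]                  # probability of seq, starting in s0
        loglik += math.log(z)
        alpha = {"s0": 1.0, "s1": 0.0}     # forward mass at position i
        for i, x in enumerate(seq):
            nxt = {t: 0.0 for t in STATES}
            for s in STATES:
                # E-step: expected number of times s emits x.
                out_n[s][x] += alpha[s] * beta[i][s] / z
                emit = alpha[s] * out[s][x]
                if i == len(seq) - 1:
                    tr_n[s]["stop"] += emit * tr[s]["stop"] / z
                else:
                    for t in STATES:
                        tr_n[s][t] += emit * tr[s][t] * beta[i + 1][t] / z
                        nxt[t] += emit * tr[s][t]
            alpha = nxt
    # M-step: renormalize the expected counts of each switch.
    new_tr = {s: {t: c / sum(tr_n[s].values()) for t, c in tr_n[s].items()}
              for s in STATES}
    new_out = {s: {x: c / sum(out_n[s].values()) for x, c in out_n[s].items()}
               for s in STATES}
    return new_tr, new_out, loglik

tr = {"s0": {"s0": 0.3, "s1": 0.4, "stop": 0.3},
      "s1": {"s0": 0.4, "s1": 0.1, "stop": 0.5}}
out = {"s0": {"a": 0.3, "b": 0.7}, "s1": {"a": 0.6, "b": 0.4}}
data = ["abaa", "bb", "aab"]
logliks = []
for _ in range(15):
    tr, out, ll = em_step(data, tr, out)
    logliks.append(ll)   # log-likelihood under the parameters before the update
```

The recorded log-likelihoods never decrease from one iteration to the next, which is the defining property of EM and a useful sanity check on any implementation.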
So far a very simple and hopefully familiar example, HMM parameter estimation,
has been used to explain parameter estimation in PRISM. Of course, there is no
pressing reason to use SRL for this problem. The whole point of SRL is to address
problems outside the remit of more standard approaches. So now consider a simple
elaboration of the HMM learning problem which highlights some of the flexibility
of a logic-based approach. Suppose that the HMM is constrained so that not all
outputs from the HMM are permitted. This amounts to altering the definition of
hmm_out/1 to

hmm_out(X) :- hmm(s0,1,X), constraint(X),    (9.9)

where the predicate constraint/1 is any predicate which can be defined using
(clausal) first-order logic. Now there will be worlds in which no ground instance
of the target predicate is true: these worlds will not be associated with a possible
data point. To take a very simple example, if constraint/1 were defined thus:

constraint(X) :- X = [Y,Y|Z],

then the possible world illustrated by figure 9.3 would no longer entail hmm_out([a, b, a]),
since the first two elements differ. The distribution over ground instances of
hmm_out/1 is now a conditional one: conditional on the logically defined constraint being satisfied. This turns out to be an exponential-family distribution
where the partition function Z is the probability that the constraint is true in a
world sampled from the original, unconditional distribution.
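
For the simple constraint above, Z can be computed in closed form. The sketch below assumes the switch probabilities of figure 9.2 and uses the fact that the constraint only looks at the first two output symbols: the output must have length at least two (so the first transition is not stop) and the first two symbols must agree. The function names are assumptions made for illustration.

```python
# Switch probabilities taken from figure 9.2.
TR = {"s0": {"s0": 0.3, "s1": 0.4, "stop": 0.3},
      "s1": {"s0": 0.4, "s1": 0.1, "stop": 0.5}}
OUT = {"s0": {"a": 0.3, "b": 0.7},
       "s1": {"a": 0.6, "b": 0.4}}

def constraint_prob():
    """Z: probability that a sampled output starts with two equal symbols."""
    z = 0.0
    for t in ("s0", "s1"):   # second state; 'stop' would give length 1
        same = sum(OUT["s0"][y] * OUT[t][y] for y in ("a", "b"))
        z += TR["s0"][t] * same
    return z

def conditional_prob(p_unconditional):
    """Probability, under the constrained model, of an output that
    satisfies the constraint and has the given unconditional probability."""
    return p_unconditional / constraint_prob()

Z = constraint_prob()
```

With these parameters Z is 0.3 × 0.58 + 0.4 × 0.46 = 0.358, so roughly 64% of sampled worlds would be rejected. For a general constraint no such closed form is available, and Z changes whenever the parameters change.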
It is still possible to use the EM algorithm to search for maximum likelihood
estimates of the parameters of such a distribution; it is just that the generative
characterization of this conditional distribution is more complicated. We assume,
as always, that worlds are sampled from the true underlying distribution. If no
ground instance of the target predicate is true in a sampled world, then that world
is rejected and no data point is generated; otherwise the unique ground instance of
the target predicate which is true in that world is added to the data. Viewing the
observed data as being generated in this fashion we have an extra sort of missing
data: the worlds which were entirely rejected. So the data is now truncated data.
Fortunately, Dempster et al. [13] show that the EM algorithm is applicable even
when the data has been truncated like this. The method was applied to SLPs by
Cussens [7] under the name failure-adjusted maximization (FAM) and is used in
the most recent version of the PRISM system [41].
9.4.2

Structure Learning

The structure of a probability-logic model is by definition some sort of first-order
theory, frequently a set of first-order clauses, i.e., a logic program. ILP is the
branch of machine learning concerned with inducing logic programs from data,
so it is no surprise that ILP techniques are often used when learning the structure
of probability-logic models from data.
Since its inception ILP has had no option but to induce de facto probabilistic
models for the simple reason that deterministic, purely logical, rules rarely fit
the data. However, the probabilistic aspect of induced rules has not always been
properly formalized. In its simplest form, data for ILP is a set of true facts (positive
examples) and a set of untrue facts (negative examples). The ideal is to find a
set of clauses which, when added to an existing logic program (the background
knowledge), entail all of the positives and none of the negatives. There is generally
a bias for simple theories so that, for example, just returning the positive examples
as the induced theory is decidedly suboptimal. In most real applications, this logical
ideal is unreachable. So, instead of (hopelessly) searching for rules which fit the
data exactly, many ILP systems search for accurate rules: ones which entail many
more positives than negatives. Of course, training set accuracy can be an unreliable
guide to true accuracy, so, for example, Bayesian estimates of true accuracy can be
used (Džeroski [14]). Bayesian estimation is available in the ALEPH system by using
mestimate as the clause evaluation function, and setting the m parameter to define
the underlying prior distribution. Each induced rule can now be returned with an
associated parameter: its expected accuracy, or informally, its probability.
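
The idea behind such a Bayesian accuracy estimate can be sketched with the standard m-estimate formula. This is a generic illustration, not ALEPH's code; the function name and the example counts are assumptions.

```python
def m_estimate(pos, neg, m, prior):
    """Estimated accuracy of a rule that covers `pos` positive and `neg`
    negative examples. `prior` is the prior probability of the positive
    class and `m` is its weight, measured in pseudo-examples."""
    return (pos + m * prior) / (pos + neg + m)

# A rule covering 3 positives and no negatives looks perfect on the
# training set, but the estimate is pulled toward the prior.
raw_accuracy = 3 / (3 + 0)
smoothed = m_estimate(3, 0, m=2, prior=0.5)
```

Larger values of m express more trust in the prior, so a rule needs more coverage before its estimated accuracy approaches its training set accuracy.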
Using ILP to distinguish positives from negatives is most readily applicable to
binary classification. In many other cases (particularly those where probability
is explicitly represented) it is more appropriate to induce a logic program that
instantiates variables to perform classification, regression [21], or more complex
tasks akin to program synthesis. For example, figure 9.6 shows a clause induced
using the ALEPH ILP system (from example input that comes with that system
to demonstrate ILP learning of classification trees). The clause probabilistically
classifies days according to whether they are suitable for playing or not by the
simple expedient of putting probability in the background knowledge. No negative examples
are used to induce such a rule: the key is to declare that the class B is an output to
be computed from the day A which is an input. In the ALEPH and Progol systems
this is done with the declaration in figure 9.7.

class(A,B) :-
    not (outlook(A,rain),windy(A,true)), outlook(A,sunny),
    humidity(A,C), lteq(C,70),
    random(B,[0.75-play,0.25-dont_play]).
Figure 9.6  Probabilistic classification rule induced by ALEPH.

:- modeh(1,class(+day,-class)).
Figure 9.7  ALEPH declaration that the class/2 predicate takes an input (indicated
by the +) of type day and generates an output (indicated by the -) of type class.

It is possible to use an ILP algorithm to search for rules and then build some
probabilistic model from these rules afterward. One option is to use a combining
rule to compute probabilities for test examples entailed by more than one induced
rule. (See chapter 9 for further details on combining rules.) Pompe and Kononenko
[34] use a naive Bayes model to combine first-order classification rules with a later
approach splitting induced first-order rules to better approximate the naive Bayes
assumption [35]. This work is an example of the often used technique of viewing
first-order rules (or parts of rules) as features for a nonlogical probabilistic model.
If induced rules are going to be used eventually as the structural component of
a probabilistic model, then naturally it is better that the algorithm searching for
rules is designed to find rules suitable for this purpose.
A more thoroughly probabilistic approach is to use ILP techniques as subroutines
in an algorithm that directly learns a probabilistic model from data. This is the
approach taken by Dehaspe [11] with his MACCENT algorithm. The goal of
MACCENT is to learn a conditional distribution giving a distribution over classes
(C) for any given example (I). MACCENT uses the ILP framework of learning
from interpretations where each example I is a Prolog program. There are thus
connections to the approach of Kok and Domingos [24] mentioned earlier. The
distribution is always a conditional exponential-family distribution of the form

p_λ(C | I) = (1 / Z_λ(I)) exp( Σ_{m=1}^{M} λ_m f_{j_m,k_m}(I, C) ),

where each feature f_{j,k} is a Boolean clausal indicator function defined using a class
C_j and a Prolog query Q_k as follows:

f_{j,k}(I, C) = 1 if C = C_j and Prolog query Q_k succeeds in instance I,
f_{j,k}(I, C) = 0 otherwise.

Both the parameters λ_m and the features f_{j_m,k_m} are learned using an adaptation
of the algorithm of Della Pietra et al. [12]. MACCENT searches the lattice of
clausal indicator functions for good features using standard ILP search where these
functions are ordered by logical generality (or more precisely θ-subsumption [33]).
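
The form of this conditional distribution can be illustrated with a short sketch. Here a Python predicate stands in for each Prolog query, an instance is represented as a set of tuples, and the features and weights are made up; none of this is taken from MACCENT itself.

```python
import math

def conditional_dist(instance, classes, features, weights):
    """p(C | I) proportional to exp(sum_m weight_m * f_m(I, C)).

    features: list of (target_class, query) pairs, where query is a Python
    predicate standing in for a Prolog query run against the instance."""
    scores = {}
    for c in classes:
        total = sum(w for (target, query), w in zip(features, weights)
                    if c == target and query(instance))
        scores[c] = math.exp(total)
    z = sum(scores.values())        # Z(I) normalizes over the classes
    return {c: s / z for c, s in scores.items()}

# Hypothetical instance and clausal indicator features.
instance = {("outlook", "sunny"), ("windy", "false")}
features = [("play", lambda i: ("outlook", "sunny") in i),
            ("dont_play", lambda i: ("windy", "true") in i)]
weights = [math.log(3.0), 1.0]
dist = conditional_dist(instance, ["play", "dont_play"], features, weights)
```

Only the first feature fires on this instance, so the unnormalized scores are 3 for play and 1 for dont_play. Note that Z depends on the instance I, which is why the normalization is done per instance.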

9.5

Conclusion
In this chapter we have looked at the "big three" issues of representation, inference,
and learning for probability-logic models, with a focus on representation. What
is exciting about the current interest in SRL is that techniques for all three of
these (often originating from different communities) are coming together to produce
powerful techniques for learning from structured and relational data. (It is worth
noting that there are initiatives with similar goals originating from the statistical
community [16], although there logical approaches are not currently used.) The
applications which involve such data are many: almost any real-world
problem for which standard ILP is a reasonable choice (and many more besides) is
also a target for SRL. To take just three recent examples, Frasconi et al. [15] applied
a declarative kernel approach to (1) predicting mutagenicity, (2) information
extraction, and (3) prediction of mRNA signal structure; Lodhi and Muggleton [26]
applied failure-adjusted maximization to learn SLPs to model metabolic pathways;
and Riedel and Klein [38] learned MLNs based on discourse representation structures
of a sentence to extract gene-protein interactions from annotated Medline abstracts.

References
[1] N. Angelopoulos and J. Cussens. Exploiting informative priors for Bayesian
classification and regression trees. In Proceedings of the International Joint
Conference on Artificial Intelligence, 2005.
[2] A. Arnauld and P. Nicole. Port-Royal Logic. Translated by Bobbs-Merrill,
Indianapolis, IN, 1964.


[3] M. Bernard and A. Habrard. Learning stochastic logic programs. In Proceedings
of the Work-in-Progress Track at the International Conference on Inductive
Logic Programming, 2001.
[4] G. Boole. An Investigation of the Laws of Thought, on which are founded the
Mathematical Theories of Logic and Probabilities. Reprint, Dover Publications,
New York, NY, 1958.
[5] R. Carnap. Logical Foundations of Probability. University of Chicago Press,
Chicago, 1950.
[6] J. Cussens. Bayesian inductive logic programming with explicit probabilistic
bias. Technical Report PRG-TR-24-96, Oxford University Computing Laboratory,
Oxford, UK, 1996.
[7] J. Cussens. Parameter estimation in stochastic logic programs. Machine
Learning, 44(3):245-271, 2001.
[8] J. Cussens. Stochastic logic programs: Sampling, inference and applications. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2000.
[9] J. Cussens. Integrating by separating: Combining probability and logic with
ICL, PRISM and SLPs. APRIL project report, January 2005.
[10] L. De Raedt and K. Kersting. Probabilistic logic learning. SIGKDD
Explorations, 5(1):31-48, 2003.
[11] L. Dehaspe. Maximum entropy modeling with clausal constraints. In Inductive
Logic Programming, 1997.
[12] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random
fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
19(4):380-393, 1997.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society,
Series B, 39(1):1-38, 1977.
[14] S. Džeroski. Handling Noise in Inductive Logic Programming. Master's
thesis, Faculty of Electrical Engineering and Computer Science, University of
Ljubljana, Ljubljana, Slovenia, 1991.
[15] P. Frasconi, A. Passerini, S. Muggleton, and H. Lodhi. Declarative kernels.
In Inductive Logic Programming, 2005.
[16] P. Green, N. Hjort, and S. Richardson, editors. Highly Structured Stochastic
Systems. OUP, 2003.
[17] J. Halpern. An analysis of first-order logics of probability. Artificial
Intelligence, 46:311-350, 1990.
[18] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach.
Open Court, La Salle, Illinois, 1989.
[19] M. Jaeger. Relational Bayesian networks: A survey. Electronic Transactions
in Artificial Intelligence, 6, 2002.


[20] M. Jaeger. Relational Bayesian networks. In Proceedings of the Conference
on Uncertainty in Artificial Intelligence, 1997.
[21] A. Karalic and I. Bratko. First order regression. Machine Learning,
26(2-3):147-176, 1997. ISSN 0885-6125.
[22] K. Kersting and L. De Raedt. Bayesian logic programs. Technical Report
151, University of Freiburg, Freiburg, Germany, April 2001.
[23] J. Keynes. A Treatise on Probability. Macmillan, London, 1921.
[24] S. Kok and P. Domingos. Learning the structure of Markov logic networks.
In Proceedings of the International Conference on Machine Learning, 2005.
[25] J. Lloyd. Foundations of Logic Programming. Springer, Berlin, second edition,
1987.
[26] H. Lodhi and S. Muggleton. Modelling metabolic pathways using stochastic
logic programs-based ensemble methods. In Proceedings of the International
Conference on Computational Methods in System Biology, 2004.
[27] S. Muggleton. Learning the structure and parameters of stochastic logic
programs. In Inductive Logic Programming, 2002.
[28] S. Muggleton. Stochastic logic programs. In L. De Raedt, editor, Advances in
Inductive Logic Programming, volume 32 of Frontiers in Artificial Intelligence
and Applications, pages 254-264. IOS Press, Amsterdam, 1996.
[29] R. Ng and V. S. Subrahmanian. A semantical framework for supporting
subjective and conditional probabilities in deductive databases. Journal of
Automated Reasoning, 10(2):191-235, 1993.
[30] R. Ng and V. S. Subrahmanian. Probabilistic logic programming. Information
and Computation, 101(2):150-201, 1992.
[31] L. Ngo and P. Haddawy. Answering queries from context-sensitive
probabilistic knowledge bases. Theoretical Computer Science, 171:147-171, 1997.
[32] N. Nilsson. Probabilistic logic. Artificial Intelligence, 28:71-87, 1986.
[33] G. D. Plotkin. A note on inductive generalization. Machine Intelligence,
5:153-163, 1970.
[34] U. Pompe and I. Kononenko. Naive Bayesian classifier within ILP-R. In
Inductive Logic Programming, 1995.
[35] U. Pompe and I. Kononenko. Probabilistic first-order classification. In
Inductive Logic Programming, 1997.
[36] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial
Intelligence, 64(1):81-129, 1993.
[37] D. Poole. The independent choice logic for modelling multiple agents under
uncertainty. Artificial Intelligence, 94(1-2):5-56, 1997.
[38] S. Riedel and E. Klein. Genic interaction extraction with semantic and
syntactic chains. In Proceedings of the Learning Language in Logic Workshop,
2005.


[39] T. Sato. A statistical learning method for logic programs with distribution
semantics. In Inductive Logic Programming, 1995.
[40] T. Sato and Y. Kameya. Parameter learning of logic programs for
symbolic-statistical modeling. Journal of Artificial Intelligence Research,
15:391-454, 2001.
[41] T. Sato, Y. Kameya, and N. Zhou. Generative modeling with failure in
PRISM. In Proceedings of the International Joint Conference on Artificial
Intelligence, 2005.
[42] N. Zhou, T. Sato, and Y. Kameya. A Reference Guide to PRISM Version
1.7, March 2004.

10 Bayesian Logic Programming: Theory and Tool

Kristian Kersting and Luc De Raedt

Bayesian networks provide an elegant formalism for representing and reasoning
about uncertainty. They are a probabilistic extension of propositional logic and,
hence, inherit some of the limitations of propositional logic, such as the difficulties
with representing objects and relations. In this chapter, we introduce Bayesian
logic programs, which are an extension of Bayesian networks to overcome these
limitations. Bayesian logic programs tightly integrate definite logic programs with
Bayesian networks. The key idea underlying Bayesian logic programs is to establish
a one-to-one mapping between ground atoms and random variables, and between
the immediate consequence operator and the dependency relation. In doing so,
Bayesian logic programs combine the advantages of both definite clause logic and
Bayesian networks: notions of objects and relations, a separation of quantitative
and qualitative aspects of the world, and a graphical representation.

10.1

Introduction
In recent years, there has been significant interest in integrating probability
theory with first-order logic and relational representations (see De Raedt and
Kersting [5] for an overview). Muggleton [30] and Cussens [4] have upgraded
stochastic grammars toward stochastic logic programs, Sato and Kameya [42] have
introduced probabilistic distributional semantics for logic programs, and Domingos
and Richardson [9] have upgraded Markov networks toward Markov logic networks.
Another research stream, including Poole's independent choice logic [38], Ngo and
Haddawy's Probabilistic-Logic Programs [34], Jaeger's relational Bayesian networks
[17], and Pfeffer's probabilistic relational models [37], concentrates on first-order
logical and relational extensions of Bayesian networks.


Bayesian networks [36] are one of the most important, efficient, and elegant
frameworks for representing and reasoning with probabilistic models. They have
been applied to many real-world problems in diagnosis, forecasting, automated
vision, sensor fusion, and manufacturing control [16]. A Bayesian network specifies
a joint probability distribution over a finite set of random variables and consists of
two components:
1. a qualitative or logical one that encodes the local influences among the random
variables using a directed acyclic graph, and
2. a quantitative one that encodes the probability densities over these local
influences.
Despite these interesting properties, Bayesian networks also have a major limitation:
they are essentially propositional representations. Indeed, imagine modeling
the localization of genes/proteins, as was the task at the KDD Cup 2001 [3]. When
using a Bayesian network, every gene is a single random variable. There is no way of
formulating general probabilistic regularities among the localizations of the genes,
such as
the localization L of gene G is influenced by the localization L' of another gene
G' that interacts with G.
The propositional nature and limitations of Bayesian networks are similar to those
of traditional attribute-value learning techniques, which have motivated a lot of
work on upgrading these techniques within inductive logic programming. This in
turn also explains the interest in upgrading Bayesian networks toward using
first-order logical representations.
Bayesian logic programs unify Bayesian networks with logic programming, which
allows the propositional character of Bayesian networks and the purely logical
nature of logic programs to be overcome. From a knowledge representation point of
view, Bayesian logic programs can be distinguished from alternative frameworks by
having logic programs (i.e., definite clause programs, which are sometimes called
"pure" Prolog programs), as well as Bayesian networks, as an immediate special
case. This is realized through the use of a small but powerful set of primitives.
Indeed, the underlying idea of Bayesian logic programs is to establish a one-to-one
mapping between ground atoms and random variables, and between the immediate
consequence operator and the direct influence relation. Therefore, Bayesian logic
programs can also handle domains involving structured terms as well as continuous
random variables.
In addition to reviewing Bayesian logic programs, this chapter
contributes a graphical representation for Bayesian logic programs;
presents its implementation in the Bayesian logic programs tool Balios; and
shows how purely logical predicates as well as aggregate functions are employed
within Bayesian logic programs.


Figure 10.1 The graphical structure of a Bayesian network modeling the inheritance of blood types within a particular family.

The chapter is structured as follows. We begin by briefly reviewing Bayesian
networks and logic programs in section 10.2. In section 10.3, we define Bayesian logic
programs as well as their semantics. Afterward, in section 10.4, we discuss several
extensions of the basic Bayesian logic programming framework. More precisely, we
introduce a graphical representation for Bayesian logic programs and we discuss
the effective treatment of logic atoms and of aggregate functions. In section 10.5,
we sketch how to learn Bayesian logic programs from data. Before touching upon
related work and concluding, we briefly present Balios, the engine for Bayesian
logic programs.

10.2

On Bayesian Networks and Logic Programs


In this section, we first introduce the key concepts and assumptions underlying
Bayesian networks and logic programs. In the next section we then show how these
are combined in Bayesian logic programs. For a full and detailed treatment of each
of these topics, we refer to [28] for logic programming or Prolog and to [18] for
Bayesian networks.
We introduce Bayesian logic programs using an example from genetics which is
inspired by [10]:
It is a genetic model of the inheritance of a single gene that determines a
person X's blood type bt(X). Each person X has two copies of the chromosome
containing this gene, one, mc(Y), inherited from her mother m(Y, X), and one,
pc(Z), inherited from her father f(Z, X).
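We will use the following convention (the typographic distinctions are written here
in LaTeX): $\mathsf{x}$ denotes a (random) variable, $x$ a state, and $\mathsf{X}$
(resp. $\mathbf{x}$) a set of variables (resp. states). We will use $\mathbf{P}$ to
denote a probability distribution, e.g., $\mathbf{P}(\mathsf{x})$, and $P$ to denote a
probability value, e.g., $P(\mathsf{x} = x)$ and $P(\mathsf{X} = \mathbf{x})$.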
10.2.1

Bayesian Networks

A Bayesian network [36] is an augmented, directed acyclic graph, where each
node corresponds to a random variable xi and each edge indicates a direct
influence among the random variables. It represents the joint probability distribution
P(x1, ..., xn) over a fixed, finite set {x1, ..., xn} of random variables. Each random
variable xi possesses a finite set S(xi) of mutually exclusive states. Figure 10.1
shows the graph of a Bayesian network modeling our blood type example for a
particular family. The familial relationship, which is taken from Jensen's stud farm
example [19], forms the basis for the graph. The network encodes, e.g., that Dorothy's
blood type is influenced by the genetic information of her parents Ann and Brian.
The set of possible states of bt(dorothy) is S(bt(dorothy)) = {a, b, ab, 0}; the
sets of possible states of pc(dorothy) and mc(dorothy) are S(pc(dorothy)) =
S(mc(dorothy)) = {a, b, 0}. The same holds for ann and brian. The direct
predecessors of a node x, the parents of x, are denoted by Pa(x). For instance,
Pa(bt(ann)) = {pc(ann), mc(ann)}.
A Bayesian network stipulates the following conditional independence assumption.
Proposition 10.1 Independence Assumption of Bayesian Networks
Each node xi in the graph is conditionally independent of any subset A of nodes
that are not descendants of xi given a joint state of Pa(xi), i.e.,
P(xi | A, Pa(xi)) = P(xi | Pa(xi)).
For example, bt(dorothy) is conditionally independent of bt(ann) given a joint
state of its parents {pc(dorothy), mc(dorothy)}. Any pair (xi, Pa(xi)) is called the
family of xi, denoted as Fa(xi); e.g., bt(dorothy)'s family is
(bt(dorothy), {pc(dorothy), mc(dorothy)}).
Because of the conditional independence assumption, we can write down the joint
probability density as

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Pa}(x_i))$$

by applying the independence assumption 10.1 to the chain rule expression of the
joint probability distribution. Thereby, we associate with each node xi of the graph
the conditional probability distribution P(xi | Pa(xi)), denoted as cpd(xi). The
conditional probability distributions in our blood type domain are:
mc(dorothy)   pc(dorothy)   P(bt(dorothy))
a             a             (0.97, 0.01, 0.01, 0.01)
b             a             (0.01, 0.01, 0.97, 0.01)
...           ...           ...
0             0             (0.01, 0.01, 0.01, 0.97)
(similarly for ann and brian) and


mc(ann)   pc(ann)   P(mc(dorothy))
a         a         (0.98, 0.01, 0.01)
b         a         (0.01, 0.98, 0.01)
...       ...       ...
0         0         (0.01, 0.01, 0.98)

(similarly for pc(dorothy)). Further conditional probability tables are associated
with the a priori nodes, i.e., the nodes having no parents:

P(mc(ann))           P(pc(ann))           P(mc(brian))         P(pc(brian))
(0.38, 0.12, 0.50)   (0.38, 0.12, 0.50)   (0.38, 0.12, 0.50)   (0.38, 0.12, 0.50)
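
The factorization above can be checked numerically on a small ancestral fragment of the network. The following is a minimal sketch, not part of the original chapter: it assumes that the state order of each printed tuple is (a, b, 0), and that the first two listed rows of the table for P(mc(dorothy)) belong to the parent states (a, a) and (b, a). The names `table`, `parents`, `cpds`, and `joint_probability` are hypothetical.

```python
# Assumed state order of the printed probability tuples.
STATES = ("a", "b", "0")


def table(probs):
    """Attach state names to a tuple of probabilities."""
    return dict(zip(STATES, probs))


# Fragment of the blood type network: mc(dorothy) and its two parents.
parents = {
    "mc(ann)": (),
    "pc(ann)": (),
    "mc(dorothy)": ("mc(ann)", "pc(ann)"),
}

# cpds[node][joint state of parents][state of node]; only rows that are
# listed explicitly in the text are filled in (row labels are an assumption).
cpds = {
    "mc(ann)": {(): table((0.38, 0.12, 0.50))},
    "pc(ann)": {(): table((0.38, 0.12, 0.50))},
    "mc(dorothy)": {
        ("a", "a"): table((0.98, 0.01, 0.01)),
        ("b", "a"): table((0.01, 0.98, 0.01)),
    },
}


def joint_probability(assignment):
    """Product of P(x_i | Pa(x_i)) over all nodes in the assignment."""
    p = 1.0
    for node, state in assignment.items():
        parent_states = tuple(assignment[q] for q in parents[node])
        p *= cpds[node][parent_states][state]
    return p


p = joint_probability({"mc(ann)": "a", "pc(ann)": "a", "mc(dorothy)": "a"})
print(p)  # the product 0.38 * 0.38 * 0.98
```

Because the three nodes form a set that is closed under the parent relation, the product is the marginal joint probability of this fragment; no other node of the network has to be consulted.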

10.2.2

Logic Programs

To introduce logic programs, consider figure 10.2, containing two programs,
grandparent and nat. Formally speaking, we have that grandparent/2, parent/2, and
nat/1 are predicates (with their arity, i.e., number of arguments, listed explicitly).
Furthermore, jef, paul, and ann are constants and X, Y, and Z are variables. All
constants and variables are also terms. In addition, there exist structured terms,
such as s(X), which contains the functor s/1 of arity 1 and the term X. Constants
are often considered as functors of arity 0. Atoms are predicate symbols followed by
the necessary number of terms, e.g., parent(jef, paul), nat(s(X)), parent(X, Z), etc.
We are now able to define the key concept of a (definite) clause. Clauses are
formulae of the form A :- B1, ..., Bm, where A and the Bi are logical atoms and all
variables are understood to be universally quantified. For example, the clause
grandparent(X, Y) :- parent(X, Z), parent(Z, Y) can be read as "X is the grandparent
of Y if X is a parent of Z and Z is a parent of Y." Let us call this clause c. We call
grandparent(X, Y) the head(c) of this clause, and parent(X, Z), parent(Z, Y) the
body(c). Clauses with an empty body, such as parent(jef, paul), are called facts. A
(definite) clause program (or logic program for short) consists of a set of clauses.
In figure 10.2, there are thus two logic programs, one defining grandparent/2 and
one defining nat/1.
parent(jef,paul).
parent(paul,ann).
grandparent(X,Y) :- parent(X,Z), parent(Z,Y).

nat(0).
nat(s(X)) :- nat(X).

Figure 10.2 Two logic programs, grandparent and nat.

The set of variables in a term, atom, or clause E is denoted as Var(E), e.g.,
Var(c) = {X, Y, Z}. A term, atom, or clause E is called ground when there is no
variable occurring in E, i.e., Var(E) = ∅. A substitution θ = {V1/t1, ..., Vn/tn},
e.g., {X/ann}, is an assignment of terms ti to variables Vi. Applying a substitution
θ to a term, atom, or clause e yields the instantiated term, atom, or clause eθ where
all occurrences of the variables Vi are simultaneously replaced by the term ti, e.g.,
cθ with θ = {X/ann} is grandparent(ann, Y) :- parent(ann, Z), parent(Z, Y).
The Herbrand base of a logic program T , denoted as HB(T ), is the set of all
ground atoms constructed with the predicate, constant, and function symbols in
the alphabet of T . For example, HB(nat) = {nat(0), nat(s(0)), nat(s(s(0))), ...}
and
HB(grandparent) =
{parent(ann, ann), parent(jef, jef),
parent(paul, paul), parent(ann, jef), parent(jef, ann), ...,
grandparent(ann, ann), grandparent(jef, jef), ...}.
A Herbrand interpretation for a logic program T is a subset of HB(T). The least
Herbrand model LH(T) (which constitutes the semantics of the logic program)
consists of all facts f ∈ HB(T) such that T logically entails f, i.e., T |= f.
Various methods exist to compute the least Herbrand model. We merely sketch its
computation through the use of the well-known immediate consequence operator TB.
The operator TB is the function on the set of all Herbrand interpretations of B such
that for any such interpretation I we have
TB(I) = {Aθ | there is a substitution θ and a clause A :- A1, ..., An in B such
that Aθ :- A1θ, ..., Anθ is ground and for i = 1, ..., n: Aiθ ∈ I}.
Now, for range-restricted clauses, the least Herbrand model can be obtained using
the following procedure:
1: Initialize LH := ∅
2: repeat
3:    LH := TB(LH)
4: until LH does not change anymore
At this point the reader may want to verify that LH(nat) = HB(nat) and
LH(grandparent) =
{parent(jef, paul), parent(paul, ann), grandparent(jef, ann)}.
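
The fixpoint procedure above can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions, not the chapter's implementation: atoms are tuples, arguments are flat (no structured terms such as s(X), so the nat program, whose least Herbrand model is infinite, is out of scope), and an argument counts as a variable when it starts with an uppercase letter. All function names are hypothetical.

```python
def is_var(term):
    """Prolog convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def match(pattern, fact, theta):
    """Extend theta so that pattern under theta equals the ground fact, else None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    theta = dict(theta)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if theta.setdefault(p, f) != f:  # conflicting earlier binding
                return None
        elif p != f:
            return None
    return theta


def substitute(atom, theta):
    """Apply the substitution theta to an atom."""
    return (atom[0],) + tuple(theta.get(arg, arg) for arg in atom[1:])


def immediate_consequences(program, interpretation):
    """One application of the operator T_B to a Herbrand interpretation."""
    derived = set()
    for head, body in program:
        thetas = [{}]
        for atom in body:
            thetas = [t2 for t in thetas for fact in interpretation
                      if (t2 := match(atom, fact, t)) is not None]
        derived.update(substitute(head, t) for t in thetas)
    return derived


def least_herbrand_model(program):
    """Iterate LH := T_B(LH), starting from the empty set, until nothing changes."""
    lh = set()
    while True:
        new = immediate_consequences(program, lh)
        if new == lh:
            return lh
        lh = new


grandparent = [
    (("parent", "jef", "paul"), []),
    (("parent", "paul", "ann"), []),
    (("grandparent", "X", "Y"), [("parent", "X", "Z"), ("parent", "Z", "Y")]),
]
print(sorted(least_herbrand_model(grandparent)))
```

On the grandparent program the loop stabilizes after the facts and the single derived atom grandparent(jef, ann) have been added, which agrees with the least Herbrand model stated above. Range restriction matters here: it guarantees that every derived head is ground.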

10.3

Bayesian Logic Programs


The logical component of Bayesian networks essentially corresponds to a propositional logic program.¹ Consider, for example, the program in figure 10.3. It encodes

1. Haddawy [14] and Langley [27] have a similar view on Bayesian networks. For instance,
Langley does not represent Bayesian networks graphically but rather uses the notation of
propositional definite clause programs.


pc(ann).
pc(brian).
mc(ann).
mc(brian).
mc(dorothy) :- mc(ann), pc(ann).
pc(dorothy) :- mc(brian), pc(brian).
bt(ann) :- mc(ann), pc(ann).
bt(brian) :- mc(brian), pc(brian).
bt(dorothy) :- mc(dorothy), pc(dorothy).
Figure 10.3 A propositional clause program encoding the structure of the blood
type Bayesian network in figure 10.1.

the structure of the blood type Bayesian network in figure 10.1. Observe that the
random variables in the Bayesian network correspond to logical atoms. Furthermore,
the direct influence relation corresponds to the immediate consequence operator.
Now, imagine another totally separated family, which could be described by a
similar Bayesian network. The graphical structure and associated conditional
probability distribution for the two families are controlled by the same intensional
regularities. But these overall regularities cannot be captured by a traditional
Bayesian network. So we need a way to represent these overall regularities.
Because this problem is akin to that with propositional logic and the structure
of Bayesian networks can be represented using propositional clauses, the approach
taken in Bayesian logic programs is to upgrade these propositional clauses encoding
the structure of the Bayesian network to proper first-order clauses.
10.3.1

Representation Language

Applying the above-mentioned idea leads to the central notion of a Bayesian clause.
Definition 10.2 Bayesian Clause
A Bayesian (definite) clause c is an expression of the form A | A1, ..., An where
n ≥ 0, the A, A1, ..., An are Bayesian atoms (see below), and all Bayesian atoms
are (implicitly) universally quantified. When n = 0, c is called a Bayesian fact and
expressed as A.
So the differences between a Bayesian clause and a logical clause are:
1. the atoms p(t1, ..., tl) and predicates p/l arising are Bayesian, which means that
they have an associated (finite²) set S(p/l) of possible states, and
2. we use | instead of :- to highlight the conditional probability distribution.

2. For the sake of simplicity we consider finite random variables, i.e., random variables
having a finite set S of states. However, because the semantics rely on Bayesian networks,
the ideas easily generalize to discrete and continuous random variables (modulo the
restrictions well-known for Bayesian networks).


For instance, consider the Bayesian clause c, bt(X) | mc(X), pc(X), where S(bt/1) =
{a, b, ab, 0} and S(mc/1) = S(pc/1) = {a, b, 0}. Intuitively, a Bayesian predicate
p/l generically represents a set of random variables. More precisely, each Bayesian
ground atom g over p/l represents a random variable over the states S(g) :=
S(p/l). For example, bt(ann) represents the blood type of a person named Ann
as a random variable over the states {a, b, ab, 0}. Apart from that, most logical
notions carry over to Bayesian logic programs. So we will speak of Bayesian
predicates, terms, constants, substitutions, propositions, ground Bayesian clauses,
Bayesian Herbrand interpretations, etc. For the sake of simplicity we will sometimes
omit the term Bayesian as long as no ambiguities arise. We will assume that all
Bayesian clauses c are range-restricted, i.e., Var(head(c)) ⊆ Var(body(c)). Range
restriction is often imposed in the database literature; it allows one to avoid
the derivation of nonground true facts (cf. section 10.2.2). As already indicated
while discussing figure 10.3, a set of Bayesian clauses encodes the qualitative or
structural component of the Bayesian logic programs. More precisely, ground atoms
correspond to random variables, and the set of random variables encoded by a
particular Bayesian logic program corresponds to its least Herbrand model. In
addition, the direct influence relation corresponds to the immediate consequence
operator.
In order to represent a probabilistic model we also associate with each Bayesian
clause c a conditional probability distribution cpd(c) encoding P(head(c) |
body(c)); cf. figure 10.4. To keep the exposition simple, we will assume that cpd(c)
is represented as a table. More elaborate representations such as decision trees
or rules would be possible too. The distribution cpd(c) generically represents the
conditional probability distributions associated with each ground instance cθ of the
clause c.
In general, one may have many clauses. Consider clauses c1 and c2
bt(X) | mc(X).
bt(X) | pc(X).
and assume corresponding substitutions θi that ground the clauses ci such that
head(c1θ1) = head(c2θ2). In contrast to bt(X) | mc(X), pc(X), they specify cpd(c1θ1)
and cpd(c2θ2), but not the desired distribution
P(head(c1θ1) | body(c1θ1) ∪ body(c2θ2)).
The standard solution to obtain the required distribution is to use so-called
combining rules.
Definition 10.3 Combining Rule
A combining rule is a function that maps finite sets of conditional probability
distributions {P(A | Ai1, ..., Aini) | i = 1, ..., m} onto one (combined) conditional
probability distribution P(A | B1, ..., Bk) with

$$\{B_1, \ldots, B_k\} \subseteq \bigcup_{i=1}^{m} \{A_{i1}, \ldots, A_{in_i}\}.$$

We assume that for each Bayesian predicate p/l there is a corresponding combining
rule cr(p/l), such as noisy or (see, e.g., [18]) or average. The latter assumes
n1 = ... = nm and S(Aij) = S(Akj), and computes the average of the distributions
over S(A) for each joint state over $\times_j S(A_{ij})$; see also section 10.3.2.
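
To make the two combining rules concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the chapter's definition: `noisy_or` uses the common leak-free form for a Boolean head, in which each active cause i independently produces the effect with probability p_i, and `average` combines distributions that have already been selected for one joint state of the body atoms. The function names and the numbers passed to `noisy_or` are hypothetical.

```python
from math import prod


def noisy_or(cause_probs):
    """P(effect = true) when each active cause i triggers it with probability p_i.

    Assumes the leak-free noisy-or model: the effect is absent only if
    every active cause independently fails to produce it.
    """
    return 1.0 - prod(1.0 - p for p in cause_probs)


def average(distributions):
    """Pointwise average of distributions over the same state set S(A)."""
    n = len(distributions)
    return tuple(sum(column) / n for column in zip(*distributions))


# Hypothetical trigger probabilities for two active causes.
print(noisy_or([0.6, 0.7]))

# Averaging two of the bt distributions listed in the text.
print(average([(0.97, 0.01, 0.01, 0.01), (0.01, 0.01, 0.97, 0.01)]))
```

With no active cause, `noisy_or` returns 0, and averaging a single distribution returns it unchanged, which matches the role the identity combining rule plays when each ground atom is the head of only one ground clause.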
By now, we are able to formally define Bayesian logic programs.

m(ann, dorothy).
f(brian, dorothy).
pc(ann).
pc(brian).
mc(ann).
mc(brian).
mc(X) | m(Y, X), mc(Y), pc(Y).
pc(X) | f(Y, X), mc(Y), pc(Y).
bt(X) | mc(X), pc(X).

mc(X)   pc(X)   P(bt(X))
a       a       (0.97, 0.01, 0.01, 0.01)
b       a       (0.01, 0.01, 0.97, 0.01)
...     ...     ...
0       0       (0.01, 0.01, 0.01, 0.97)

m(Y, X)   mc(Y)   pc(Y)   P(mc(X))
true      a       a       (0.98, 0.01, 0.01)
true      b       a       (0.01, 0.98, 0.01)
...       ...     ...     ...
false     ...     ...     (0.33, 0.33, 0.33)

Figure 10.4 The Bayesian logic program blood type encoding our genetic domain.
For each Bayesian predicate, the identity is the combining rule. The conditional
probability distributions associated with the Bayesian clauses bt(X) | mc(X), pc(X)
and mc(X) | m(Y, X), mc(Y), pc(Y) are represented as tables. The other distributions
are correspondingly defined. The Bayesian predicates m/2 and f/2 have as possible
states {true, false}.

Definition 10.4 Bayesian Logic Program
A Bayesian logic program B consists of a (finite) set of Bayesian clauses. For each
Bayesian clause c there is exactly one conditional probability distribution cpd(c),
and for each Bayesian predicate p/l there is exactly one combining rule cr(p/l).
A Bayesian logic program encoding our blood type domain is shown in figure 10.4.
10.3.2

Declarative Semantics

Intuitively, each Bayesian logic program represents a (possibly infinite) Bayesian
network, where the nodes are the atoms in the least Herbrand model of the Bayesian
logic program. These declarative semantics can be formalized using the annotated
dependency graph. The dependency graph DG(B) is the directed graph whose nodes
correspond to the ground atoms in the least Herbrand model LH(B). It encodes the
direct influence relation over the random variables in LH(B): there is an edge from
a node x to a node y if and only if there exists a clause c ∈ B and a substitution
θ such that y = head(cθ), x ∈ body(cθ), and for all ground atoms z in cθ: z ∈ LH(B).
Figures 10.5 and 10.6 show the dependency graph for our blood type program. Here,
mc(dorothy) directly influences bt(dorothy). Furthermore, defining the influence
relation as the transitive closure of the direct influence relation, mc(ann) influences
bt(dorothy).
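
The edge definition can be illustrated by reading the parents of each node directly off a set of ground clauses. The sketch below is not the chapter's tool: it takes the ground clauses of the blood type program as given (as plain strings), collects the body atoms of every clause with the same head, and checks acyclicity with the standard-library `graphlib` module (Python 3.9 or later). The names `dependency_graph` and `is_acyclic` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Ground clauses (head, body) of the blood type program.
ground_clauses = [
    ("m(ann,dorothy)", []),
    ("f(brian,dorothy)", []),
    ("pc(ann)", []),
    ("pc(brian)", []),
    ("mc(ann)", []),
    ("mc(brian)", []),
    ("mc(dorothy)", ["m(ann,dorothy)", "mc(ann)", "pc(ann)"]),
    ("pc(dorothy)", ["f(brian,dorothy)", "mc(brian)", "pc(brian)"]),
    ("bt(ann)", ["mc(ann)", "pc(ann)"]),
    ("bt(brian)", ["mc(brian)", "pc(brian)"]),
    ("bt(dorothy)", ["mc(dorothy)", "pc(dorothy)"]),
]


def dependency_graph(clauses):
    """Map each node to the set of nodes that directly influence it."""
    parents = {}
    for head, body in clauses:
        parents.setdefault(head, set()).update(body)
    return parents


def is_acyclic(parents):
    """True if the direct influence relation contains no cycle."""
    try:
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


parents = dependency_graph(ground_clauses)
print(sorted(parents["bt(dorothy)"]))
print(is_acyclic(parents))
```

Storing the graph as a child-to-parents map keeps the later steps simple: the parents of a node are exactly the variables its combined conditional probability distribution is conditioned on.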
The Herbrand base HB(B) constitutes the set of all random variables we can talk
about. However, only those atoms that are in the least Herbrand model LH(B) ⊆
HB(B) will appear in the dependency graph. These are the atoms that are true in



m(ann,dorothy).
f(brian,dorothy).
pc(ann).
pc(brian).
mc(ann).
mc(brian).
mc(dorothy) | m(ann, dorothy),mc(ann),pc(ann).
pc(dorothy) | f(brian, dorothy),mc(brian),pc(brian).
bt(ann) | mc(ann), pc(ann).
bt(brian) | mc(brian), pc(brian).
bt(dorothy) | mc(dorothy),pc(dorothy).
Figure 10.5 The grounded version of the blood type Bayesian logic program of
figure 10.4 where only clauses cθ with head(cθ) ∈ LH(B) and body(cθ) ⊆ LH(B) are
retained. It (directly) encodes the Bayesian network as shown in figure 10.6. The
structure of the Bayesian network coincides with the dependency graph of the blood
type Bayesian logic program.

the logical sense, i.e., if the Bayesian logic program B is interpreted as a logical
program. They are the so-called relevant random variables, the random variables
over which a probability distribution is well-defined by B, as we will see. The atoms
not belonging to the least Herbrand model are irrelevant. Now, to each node x in
DG(B) we associate the combined conditional probability distribution which is
the result of applying the combining rule cr(p/n) of the corresponding Bayesian
predicate p/n to the set of cpd(cθ)'s where head(cθ) = x and {x} ∪ body(cθ) ⊆
LH(B). Consider
cold.
flu.
malaria.

fever | cold.
fever | flu.
fever | malaria.

where all Bayesian predicates have true, false as states, and noisy or as combining
rule. The dependency graph is

[Figure: dependency graph with edges from cold, flu, and malaria to fever]

and noisy or {P(fever|flu), P(fever|cold), P(fever|malaria)} is associated with
fever (see [40], p. 444). Thus, if DG(B) is acyclic and not empty, and every
node in DG(B) has a finite indegree, then DG(B) encodes a (possibly infinite)
Bayesian network, because the least Herbrand model always exists and is unique.
Consequently, the following independence assumption holds:
Proposition 10.5 Independence Assumption of Dependency Graph
Each node x is independent of its nondescendants given a joint state of its parents
Pa(x) in the dependency graph.


For instance, the dependency graph of the blood type program as shown in
figures 10.5 and 10.6 encodes that the random variable bt(dorothy) is independent
of pc(ann) given a joint state of pc(dorothy), mc(dorothy). Using this assumption
the following proposition (taken from [21]) holds:
Proposition 10.6 Semantics
Let B be a Bayesian logic program. If
1. LH(B) ≠ ∅,
2. DG(B) is acyclic, and
3. each node in DG(B) is influenced by a finite set of random variables,
then B specifies a unique probability distribution PB over LH(B).
To see this, note that the least Herbrand model LH(B) always exists, is unique, and
countable. Thus, DG(B) exists and is unique, and due to condition (3) the combined
probability distribution for each node of DG(B) is computable. Furthermore,
because of condition (1) a total order π on DG(B) exists, so that one can see
B together with π as a stochastic process over LH(B). An induction argument
over π together with condition (2) allows one to conclude that the family of
finite-dimensional distributions of the process is projective (cf. [2]), i.e., the joint
probability distribution over each finite subset S ⊆ LH(B) is uniquely defined and
$\sum_{y} P(S, x = y) = P(S)$. Thus, the preconditions of Kolmogorov's theorem
([2], p. 307) hold, and it follows that B given π specifies a probability distribution
P over LH(B). This proves the proposition because the total order π used for the
induction is arbitrary.
A program B satisfying the conditions (1), (2), and (3) of proposition 10.6 is called
well-defined. A well-defined Bayesian logic program B specifies a joint distribution
over the random variables in the least Herbrand model LH(B). As with Bayesian
networks, the joint distribution over these random variables can be factored to

$$P(LH(B)) = \prod_{x \in LH(B)} P(x \mid \mathrm{Pa}(x)),$$

where the parent relation Pa is according to the dependency graph.
The blood type Bayesian logic program in figure 10.4 is an example of a
well-defined Bayesian logic program. Its grounded version is shown in figure 10.5.
It essentially encodes the original blood type Bayesian network of figures 10.1
and 10.3. The only differences are the two predicates m/2 and f/2, which can be
in one of the logical set of states true and false. Using these predicates and an
appropriate set of Bayesian facts (the "extension") one can encode the Bayesian
network for any family. This situation is akin to that in deductive databases, where
the "intension" (the clauses) encodes the overall regularities and the "extension"
(the facts) the specific context of interest. By interchanging the extension, one can
swap contexts (in our case, families).


Figure 10.6 The structure of the Bayesian network represented by the grounded
blood type Bayesian logic program in figure 10.5. The structure of the Bayesian
network coincides with the dependency graph. Omitting the dashed nodes yields
the original Bayesian network of figure 10.1.
10.3.3

Procedural Semantics

Clearly, any (conditional) probability distribution over random variables of the
Bayesian network corresponding to the least Herbrand model can in principle
be computed. As the least Herbrand model (and therefore the corresponding
Bayesian network) can become (even infinitely) large, the question arises as to
whether one needs to construct the full least Herbrand model (and Bayesian
network) to be able to perform inferences. Here, inference means the process of
answering probabilistic queries.
Definition 10.7 Probabilistic Query
A probabilistic query to a Bayesian logic program B is an expression of the form
?- q1, ..., qn | e1 = e1, ..., em = em
where $n > 0$ and $m \geq 0$, and where the right-hand side of each equation denotes
the observed state of the corresponding evidence variable. It asks for the
conditional probability distribution
P(q1, ..., qn | e1 = e1, ..., em = em)
of the query variables q1, ..., qn, where $\{q_1, \ldots, q_n, e_1, \ldots, e_m\} \subseteq HB(B)$.
To answer a probabilistic query, one fortunately does not have to compute the
complete least Herbrand model. It suffices to consider the so-called support network.
Definition 10.8 Support Network
The support network N of a random variable $x \in LH(B)$ is defined as the induced
subnetwork of
$$\{x\} \cup \{y \mid y \in LH(B) \text{ and } y \text{ influences } x\}.$$
The support network of a finite set $\{x_1, \ldots, x_k\} \subseteq LH(B)$ is the union of the
networks of each single $x_i$.
For instance, the support network for bt(dorothy) is the Bayesian network shown
in figure 10.6. The support network for bt(brian) is the subnetwork with root
bt(brian), i.e., the nodes mc(brian), pc(brian), and bt(brian).

10.3

Bayesian Logic Programs

That the support network of a finite set $X \subseteq LH(B)$ is sufficient to compute P(X)
follows from the following theorem (taken from [21]):
Theorem 10.9 Support Network
Let N be a possibly infinite Bayesian network, let Q be nodes of N, and let E = e,
$E \subset N$, be some evidence. The computation of P(Q | E = e) does not depend on
any node x of N that is not a member of the support network $N(Q \cup E)$.
To compute the support network N({q}) of a single variable q efficiently, let us
look at logic programs from a proof-theoretic perspective. From this perspective,
a logic program can be used to prove that certain atoms or goals (see below) are
logically entailed by the program. Provable ground atoms are members of the least
Herbrand model.
Proofs are typically constructed using the SLD-resolution procedure, which we
now briefly introduce. Given a goal :- G1, G2, ..., Gn and a clause G :- L1, ..., Lm such that
G1 = G, applying SLD resolution yields the new goal :- L1, ..., Lm, G2, ..., Gn.
A successful refutation, i.e., a proof of a goal, is then a sequence of resolution steps
yielding the empty goal, i.e., :- . Failed proofs do not end in the empty goal. For
instance, in our running example, bt(dorothy) is true because of the following
refutation:
:- bt(dorothy)
:- mc(dorothy), pc(dorothy)
:- m(ann, dorothy), mc(ann), pc(ann), pc(dorothy)
:- mc(ann), pc(ann), pc(dorothy)
:- pc(ann), pc(dorothy)
:- pc(dorothy)
:- f(brian, dorothy), mc(brian), pc(brian)
:- mc(brian), pc(brian)
:- pc(brian)
:-
Resolution is employed by many theorem provers (such as Prolog). Indeed, when
given the goal bt(dorothy), Prolog would compute the above successful resolution
refutation and answer that the goal is true.
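
The refutation can be reproduced with a few lines of Python. The following is a minimal sketch restricted to ground clauses with one clause per head, so neither unification nor backtracking is needed; the dictionary `CLAUSES` and the function `sld_refutation` are illustrative names, not part of any Prolog system.

```python
from typing import Dict, List

# Ground clauses of the running example: head -> body (facts have an empty body).
# Assumption: every head is defined by exactly one ground clause.
CLAUSES: Dict[str, List[str]] = {
    "bt(dorothy)": ["mc(dorothy)", "pc(dorothy)"],
    "mc(dorothy)": ["m(ann,dorothy)", "mc(ann)", "pc(ann)"],
    "pc(dorothy)": ["f(brian,dorothy)", "mc(brian)", "pc(brian)"],
    "m(ann,dorothy)": [],
    "f(brian,dorothy)": [],
    "mc(ann)": [],
    "pc(ann)": [],
    "mc(brian)": [],
    "pc(brian)": [],
}


def sld_refutation(goal: List[str]) -> List[List[str]]:
    """Return the sequence of goals of a leftmost-first SLD refutation.

    Raises KeyError if the selected atom has no clause, i.e., the proof fails.
    """
    trace = [list(goal)]
    while goal:
        selected, rest = goal[0], goal[1:]
        # Resolution step: replace the selected atom by the body of its clause.
        goal = CLAUSES[selected] + rest
        trace.append(list(goal))
    return trace


if __name__ == "__main__":
    for step in sld_refutation(["bt(dorothy)"]):
        print(":- " + ", ".join(step))
```

The printed goals follow the refutation above step by step and end in the empty goal.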
The set of all proofs of :-bt(dorothy) captures all information needed to compute
N ({bt(dorothy)}). More exactly, the set of all ground clauses employed to prove
bt(dorothy) constitutes the families of the support network N ({bt(dorothy)}).
For :-bt(dorothy), they are the ground clauses shown in figure 10.5. To build the

Figure 10.7 The rule graph for the blood type Bayesian network. On the right-hand
side the local probability model associated with node R9 is shown, i.e.,
the Bayesian clause bt(dorothy) | mc(dorothy), pc(dorothy) with associated conditional
probability table.

support network, we only have to gather all ground clauses used to prove the query
variable and combine multiple copies of ground clauses with the same head
using the corresponding combining rules. To summarize, the support network N({q})
can be computed as follows:
1. Compute all proofs for :-q.
2. Extract the set S of ground clauses used to prove :-q.
3. Combine multiple copies of ground clauses $h \mid b \in S$ with the same head h
using combining rules.

Applying this to :-bt(dorothy) yields the support network as shown in figure 10.6.
Furthermore, the method can easily be extended to compute the support network
for P(Q | E = e). We simply compute all proofs of :-q, $q \in Q$, and :-e, $e \in E$. The
resulting support network can be fed into any (exact or approximate) Bayesian
network engine to compute the resulting (conditional) probability distribution of the
query. To minimize the size of the support network, one might also apply Shachter's
Bayes-Ball algorithm [43].
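
The procedure can be sketched in Python for the ground case. The sketch assumes the ground clauses are already available as a dictionary (`GROUND`, a hypothetical name) and stands in for the combining rule by merging the bodies of all clauses that share a head; it only builds the structure of the support network, not its conditional probability distributions.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical ground clauses of the running example: head -> list of bodies.
GROUND: Dict[str, List[List[str]]] = {
    "bt(ann)": [["mc(ann)", "pc(ann)"]],
    "bt(brian)": [["mc(brian)", "pc(brian)"]],
    "bt(dorothy)": [["mc(dorothy)", "pc(dorothy)"]],
    "mc(dorothy)": [["m(ann,dorothy)", "mc(ann)", "pc(ann)"]],
    "pc(dorothy)": [["f(brian,dorothy)", "mc(brian)", "pc(brian)"]],
    "m(ann,dorothy)": [[]],
    "f(brian,dorothy)": [[]],
    "mc(ann)": [[]],
    "pc(ann)": [[]],
    "mc(brian)": [[]],
    "pc(brian)": [[]],
}


def support_network(variables: Iterable[str]) -> Dict[str, Set[str]]:
    """Map every node of the support network of `variables` to its parent set."""
    parents: Dict[str, Set[str]] = {}
    stack = list(variables)
    while stack:
        node = stack.pop()
        if node in parents:
            continue  # already expanded
        # Steps 1 and 2: collect the ground clauses needed to prove `node`.
        # Step 3 (structure only): merge the bodies of clauses with the same head.
        parents[node] = {atom for body in GROUND[node] for atom in body}
        stack.extend(parents[node])
    return parents


if __name__ == "__main__":
    # Query variables and evidence variables are simply passed together.
    network = support_network(["bt(dorothy)", "bt(brian)"])
    for node in sorted(network):
        print(node, "<-", sorted(network[node]))
```

Passing the union of query and evidence variables mirrors the extension to P(Q | E = e) described above.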

10.4

Extensions of the Basic Framework


So far, we have described the basic Bayesian logic programming framework and defined
the semantics of Bayesian logic programs. Various useful extensions and modifications
are possible. In this section, we discuss a graphical representation, efficient
treatment of logical atoms, and aggregate functions. At the same time, we will also
present further examples of Bayesian logic programs such as hidden Markov models
(HMMs) [39] and probabilistic grammars [29].

Figure 10.8 The graphical representation of the blood type Bayesian logic
program. On the right-hand side, some local probability models associated with
Bayesian clause nodes are shown, e.g., the Bayesian clause
R7 pc(Person) | f(Father, Person), mc(Father), pc(Father) with associated conditional
probability distribution. For the sake of simplicity, not all Bayesian clauses are
shown.

10.4.1

Graphical Representation

Bayesian logic programs have so far been introduced using an adaptation of a
logic programming syntax. Bayesian networks are, however, also graphical models
and owe at least part of their popularity to their intuitively appealing graphical
notation [20]. Inspired by Bayesian networks, we develop in this section a graphical
notation for Bayesian logic programs.
In order to develop a graphical representation for Bayesian logic programs, let us
first consider a more redundant representation for Bayesian networks: augmented
bipartite (directed acyclic) graphs as shown in figure 10.7. In a bipartite graph, the
set of nodes is composed of two disjoint sets such that no two nodes within the
same set are adjacent. There are two types of nodes, namely
1. gradient gray ovals denoting random variables, and
2. black boxes denoting local probability models.
There is a box for each family Fa(xi) in the Bayesian network. The incoming edges
refer to the parents Pa(xi); the single outgoing edge points to xi.
Each box is augmented with a Bayesian network fragment specifying the conditional
probability distribution P(xi | Pa(xi)). For instance, in figure 10.7, the
fragment associated with R9 specifies the conditional probability distribution
P(bt(dorothy) | mc(dorothy), pc(dorothy)). Interpreting this as a propositional

Figure 10.9 A dynamic Bayesian logic program modeling a hidden Markov model.
The functor next/1 is used to encode the discrete time.

Bayesian logic program, the graph can be viewed as a rule graph as known from
database theory. Ovals represent Bayesian predicates, and boxes denote Bayesian
clauses. More precisely, given a (propositional) Bayesian logic program B with
Bayesian clauses $R_i$ of the form $h_i \mid b_{i1}, \ldots, b_{im}$, there are edges from $R_i$ to $h_i$ and from
$b_{ij}$ to $R_i$. Furthermore, with each Bayesian clause node, we associate the corresponding
Bayesian clause as a Bayesian network fragment. Indeed, the graphical model
in figure 10.7 represents the propositional Bayesian logic program of figure 10.5.
In order to represent first-order Bayesian logic programs graphically, we have
to encode Bayesian atoms and their variable bindings in the associated local
probability models. Indeed, logical terms can naturally be represented graphically.
They form trees. For instance, the term t(s(1, 2), X) corresponds to the tree with
root t whose children are s (with leaves 1 and 2) and X.

Logical variables such as X are encoded as white ovals. Constants and functors such
as 1, 2, s, and t are represented as white boxes. Bayesian atoms are represented
as gradient gray ovals containing the predicate name such as pc. Arguments of
atoms are treated as placeholders for terms. They are represented as white circles
on the boundary of the ovals (ordered from left to right). The term appearing in the
argument is represented by an undirected edge between the white circle representing
the argument and the root of the tree encoding the term (we start in the argument
and follow the tree until reaching variables).
As an example, consider the Bayesian logic program in figure 10.8. It models
the blood type domain. The graphical representation indeed conveys the meaning of
the Bayesian clause R7: the paternal genetic information pc(Person) of a person
is influenced by the maternal mc(Father) and the paternal pc(Father) genetic
information of the person's Father.

Figure 10.10 The blood type Bayesian logic program distinguishing between
Bayesian (gradient gray ovals) and logical atoms (solid gray ovals).

As another example, consider figure 10.9, which shows the use of functors to
represent dynamic probabilistic models. More precisely, it shows an HMM [39].
HMMs are extremely popular for analyzing sequential data. Application areas
include computational biology, user modeling, speech recognition, empirical natural
language processing, and robotics.
At each Time, the system is in a state hidden(Time). The time-independent
probability of being in some state at the next time next(Time), given that the
system was in a state at Time, is captured in the Bayesian clause R2. Here,
the next time point is represented using the functor next/1. In HMMs, however, we do not
have direct access to the states hidden(Time). Instead, we measure some properties
obs(Time) of the states. The measurement is quantified in Bayesian clause R3. The
dependency graph of the Bayesian logic program directly encodes the well-known
Bayesian network structure of HMMs: each hidden(Time) is a parent of
hidden(next(Time)) and of obs(Time).
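
To make the chain structure concrete, the following Python sketch unrolls the dependency graph for a fixed number of time points. The function name `hmm_dependency_edges` and the string encoding of atoms are illustrative choices; only the shape of the clauses R2 and R3 is taken from the text.

```python
from typing import List, Tuple


def hmm_dependency_edges(steps: int) -> List[Tuple[str, str]]:
    """Return the (parent, child) edges of the dependency graph for `steps` time points."""

    def time(t: int) -> str:
        # Encode time point t with the functor next/1: 0, next(0), next(next(0)), ...
        return "0" if t == 0 else f"next({time(t - 1)})"

    edges: List[Tuple[str, str]] = []
    for t in range(steps):
        # Measurement clause R3: obs(Time) depends on hidden(Time).
        edges.append((f"hidden({time(t)})", f"obs({time(t)})"))
        if t + 1 < steps:
            # Transition clause R2: hidden(next(Time)) depends on hidden(Time).
            edges.append((f"hidden({time(t)})", f"hidden({time(t + 1)})"))
    return edges


if __name__ == "__main__":
    for parent, child in hmm_dependency_edges(3):
        print(parent, "->", child)
```

Every hidden state has exactly two children, its observation and its successor, except the last one in the unrolled chain.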

10.4.2

Logical Atoms

Reconsider the blood type Bayesian logic program in figure 10.8. The mother/2
and father/2 relations are not really random variables but logical ones, because
they are always in the same state, namely true, with probability 1, and can depend
only on other logical atoms. These predicates form a kind of logical background
theory. Therefore, when predicates are declared to be logical, one need
not represent them in the conditional probability distributions. Consider the blood
type Bayesian logic program in figure 10.10. Here, mother/2 and father/2 are
declared to be logical. Consequently, the conditional probability distribution
associated with the definition of, e.g., pc/1 takes only pc(Father) and mc(Father) into
account but not f(Father, Person). It applies only to those substitutions for which
f(Father, Person) is true, i.e., in the least Herbrand model. This can be checked
efficiently using any Prolog engine. Furthermore, one may omit these logical atoms
from the induced support network. More importantly, logical predicates provide
the user with the full power of Prolog. In the blood type Bayesian logic program of
figure 10.10, the logical background knowledge defines the founder/1 relation as
founder(Person) :- \+(mother(_, Person); father(_, Person)).
Here, \+ denotes negation, the symbol _ represents an anonymous variable, which
is treated as a new, distinct variable each time it is encountered, and the semicolon
denotes a disjunction. The rest of the Bayesian logic program is essentially as
in figure 10.4. Instead of explicitly listing pc(ann), mc(ann), pc(brian), mc(brian)
in the extensional part, we have pc(P) | founder(P) and mc(P) | founder(P) in the
intensional part.
The full power of Prolog is also useful for elegantly encoding dynamic probabilistic
models. Figure 10.11 (a) shows the generic structure of an HMM where the discrete
time is now encoded as next/2 in the logical background theory using standard
Prolog predicates:
next(X, Y) :- integer(Y), Y > 0, X is Y - 1.
Prolog's predefined predicates (such as integer/1) avoid a cumbersome representation
of the dynamics via the successor functor 0, next(0), next(next(0)), ... Imagine
querying ?- obs(100) using the successor functor,
?- obs(next(next(...(next(0))...))) .
Whereas HMMs define probability distributions over regular languages, probabilistic
context-free grammars (PCFGs) [29] define probability distributions over context-free
languages. Application areas of PCFGs include, e.g., natural language processing
and computational biology. For instance, mRNA sequences constitute context-free
languages. Consider, e.g., the following PCFG
terminal([A|B], A, B).
0.3 : sentence(A, B) :- terminal(A, a, C), terminal(C, b, B).
0.7 : sentence(A, B) :- terminal(A, a, C), sentence(C, D), terminal(D, b, B).
defining a distribution over $\{a^n b^n\}$. The grammar is represented as a probabilistic
definite clause grammar where the terminal symbols are encoded in the logical
background theory via the first rule terminal([A|B], A, B).

Figure 10.11 Two dynamic Bayesian logic programs. (a) The generic structure of a
hidden Markov model, represented more elegantly than in figure 10.9
using next(X, Y) :- integer(Y), Y > 0, X is Y - 1. (b) A probabilistic context-free
grammar over $\{a^n b^n\}$. The logical background theory defines terminal/3 as
terminal([A|B], A, B).

A PCFG defines a stochastic process with leftmost rewriting, i.e., refutation steps,
as transitions. Words, say aabb, are parsed by querying ?- sentence([a, a, b, b], []).
The third rule yields ?- terminal([a, a, b, b], a, C), sentence(C, D), terminal(D, b, []).
Applying the first rule yields ?- sentence([a, b, b], D), terminal(D, b, []) and the second
rule ?- terminal([a, b, b], a, C), terminal(C, b, D), terminal(D, b, []). Applying
the first rule three times yields a successful refutation. The probability of a refutation
is the product of the probability values associated with the clauses used in the
refutation; in our case $0.7 \cdot 0.3 = 0.21$. The probability of aabb then is the sum of the
probabilities of all successful refutations. This is also the basic idea underlying
Muggleton's stochastic logic programs [30], which extend PCFGs to definite
clause logic.
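
The computation can be mirrored in Python. The sketch below hard-codes the two sentence clauses of the grammar and sums the probabilities of all successful derivations of a word; `sentence_prob` is an illustrative name, and the list-difference arguments of the Prolog clauses are replaced by plain list slicing.

```python
from typing import List

# Clause probabilities of the two sentence/2 clauses of the grammar.
P_BASE = 0.3  # sentence --> a b
P_REC = 0.7   # sentence --> a sentence b


def sentence_prob(word: List[str]) -> float:
    """Sum of the probabilities of all successful refutations of sentence(word, [])."""
    total = 0.0
    # First sentence clause: the word is exactly "a b".
    if word == ["a", "b"]:
        total += P_BASE
    # Second sentence clause: "a", then a shorter sentence, then "b".
    if len(word) >= 4 and word[0] == "a" and word[-1] == "b":
        total += P_REC * sentence_prob(word[1:-1])
    return total


if __name__ == "__main__":
    print(sentence_prob(list("aabb")))    # product 0.7 * 0.3
    print(sentence_prob(list("aaabbb")))  # product 0.7 * 0.7 * 0.3
    print(sentence_prob(list("abab")))    # no successful refutation
```

Because this grammar is unambiguous, every word in the language has a single successful refutation, so the sum has at most one non-zero term.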
Figure 10.11 (b) shows the $\{a^n b^n\}$ PCFG represented as a Bayesian logic program.
The Bayesian clauses are the clauses of the corresponding definite clause
grammar. In contrast to PCFGs, however, we associate a complete conditional
probability distribution, namely (0.3, 0.7) and (0.7, 0.3; 0.0, 1.0), with the Bayesian clauses.
For the query ?- sentence([a, a, b, b], []), the following Markov chain is induced
(omitting logical atoms):

Figure 10.12 The Bayesian logic program for the university domain. Octagonal
nodes denote aggregate predicates and atoms.

10.4.3

Aggregate Functions

Aggregate functions are an alternative to combining rules. Consider the university
domain due to [12]. The domain is that of a university, and contains professors,
students, courses, and course registrations. Objects in this domain have several
descriptive attributes such as intelligence/1 and rank/1 of a student/1. A
student will typically be registered in several courses; the student's rank depends
on the grades she receives in all of them. So we have to specify a probabilistic
dependence of the student's rank on a multiset of course grades of size 1, 2, and so
on.
In this situation, the notion of aggregation is more appropriate than that of
a combining rule. Using combining rules, the Bayesian clauses would describe the
dependence for a single course only. All information about how the rank probabilistically
depends on the multiset of course grades would be hidden in the combining rule.
In contrast, when using an aggregate function, the dependence is interpreted as
a probabilistic dependence of rank on some deterministically computed aggregate
property of the multiset of course grades. The probabilistic dependence is moved
out of the combining rule.
To model this, we introduce aggregate predicates. They represent deterministic
random variables, i.e., the state of an aggregate atom is a function of the joint
state of its parents. As an example, consider the university Bayesian logic program
as shown in figure 10.12. Here, avgGrade/1 is an aggregate predicate, denoted
as an octagonal node. As combining rule, the average of the parents' states is
deterministically computed; cf. Bayesian clause R5. In turn, the student's rank/1
probabilistically depends on her average grade; cf. R6.
The use of aggregate functions is inspired by probabilistic relational models [37].
As we will show in the related work section, using aggregates in Bayesian logic
programs, it is easy to model probabilistic relational models.
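
The split between a deterministic aggregate and a probabilistic dependence can be sketched as follows. The numeric grade encoding `GRADE_POINTS`, the rounding to the nearest grade, and all numbers in `RANK_CPD` are assumptions made up for illustration; they are not taken from the university program of figure 10.12.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical numeric encoding of the grade states.
GRADE_POINTS: Dict[str, float] = {"a": 4.0, "b": 3.0, "c": 2.0}

# Hypothetical conditional probability distribution P(rank | avgGrade).
RANK_CPD: Dict[str, Dict[str, float]] = {
    "a": {"high": 0.8, "low": 0.2},
    "b": {"high": 0.5, "low": 0.5},
    "c": {"high": 0.1, "low": 0.9},
}


def avg_grade(grades: List[str]) -> str:
    """Deterministic aggregate: map a multiset of grades to the grade nearest their mean."""
    average = mean(GRADE_POINTS[g] for g in grades)
    return min(GRADE_POINTS, key=lambda g: abs(GRADE_POINTS[g] - average))


def rank_distribution(grades: List[str]) -> Dict[str, float]:
    """Probabilistic step: the rank depends only on the aggregated grade."""
    return RANK_CPD[avg_grade(grades)]


if __name__ == "__main__":
    print(avg_grade(["a", "a", "c"]))
    print(rank_distribution(["a", "a", "c"]))
```

The same conditional distribution is used no matter how many courses a student has taken, which is exactly what a single combining rule over per-course clauses would otherwise have to encode.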

10.5

Learning Bayesian Logic Programs


When designing Bayesian logic programs, the expert has to determine the structure
of the Bayesian logic program by specifying the extensional and intensional predicates,
and by providing definitions for each of the intensional predicates. Given this
logical structure, the Bayesian logic program induces a Bayesian network whose
nodes are the relevant random variables. It is well-known that determining the
structure of a Bayesian network, and therefore also of a Bayesian logic program,
can be difficult and expensive. On the other hand, it is often easier to obtain a set
$D = \{D_1, \ldots, D_m\}$ of data cases, which can be used for learning.
10.5.1

The Learning Setting

For Bayesian logic programs, a data case $D_i \in D$ has two parts, a logical and a
probabilistic part. The logical part of a data case is a Herbrand interpretation. For
instance, the following set of atoms constitutes a Herbrand interpretation for the
blood type Bayesian logic program.
{m(ann, dorothy), f(brian, dorothy), pc(ann), mc(ann), bt(ann),
pc(brian), mc(brian), bt(brian), pc(dorothy), mc(dorothy), bt(dorothy)}
This (logical) interpretation can be seen as the least Herbrand model of an unknown
Bayesian logic program. In general, data cases specify different sets of relevant
random variables, depending on the given "extensional context." If we accept that
the genetic laws are the same for different families, then a learning algorithm should
transform such extensionally defined predicates into intensionally defined ones, thus
compressing the interpretations. This is precisely what inductive logic programming
techniques [31] do. The key assumption underlying any inductive technique is
that the rules that are valid in one interpretation are likely to hold for other
interpretations. It thus seems clear that techniques for learning from interpretations
can be adapted for learning the logical structure of Bayesian logic programs.
So far, we have specified the logical part of the learning problem: we are looking
for a set H of Bayesian clauses given a set D of data cases such that all data cases
are a model of H. The hypotheses $H$ in the space $\mathcal{H}$ of hypotheses are sets of
Bayesian clauses. However, we have to be more careful. A candidate set $H \in \mathcal{H}$
has to be acyclic on the data, which implies that for each data case the induced
Bayesian network has to be acyclic.
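
Checking this condition amounts to a cycle test on the dependency graph induced for each data case. The following is a minimal sketch using depth-first search; it assumes the induced network is already given as a mapping from each node to its parents, and the name `is_acyclic` is illustrative.

```python
from typing import Dict, Set


def is_acyclic(parents: Dict[str, Set[str]]) -> bool:
    """Return True if the directed graph given as node -> parents contains no cycle."""
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state: Dict[str, int] = {}

    def visit(node: str) -> bool:
        current = state.get(node, UNSEEN)
        if current == ACTIVE:
            return False  # the node is reached again on the current path: a cycle
        if current == DONE:
            return True
        state[node] = ACTIVE
        ok = all(visit(parent) for parent in parents.get(node, ()))
        state[node] = DONE
        return ok

    return all(visit(node) for node in parents)


if __name__ == "__main__":
    family = {
        "bt(dorothy)": {"mc(dorothy)", "pc(dorothy)"},
        "mc(dorothy)": {"mc(ann)", "pc(ann)"},
        "pc(dorothy)": {"mc(brian)", "pc(brian)"},
    }
    print(is_acyclic(family))
    print(is_acyclic({"p": {"q"}, "q": {"p"}}))
```

A candidate hypothesis would be rejected as soon as the check fails for one data case.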

Let us now focus on the quantitative components. The quantitative component
of a Bayesian logic program is given by the associated conditional probability
distributions and combining rules. For the sake of simplicity, we assume that the
combining rules are fixed. Each data case $D_i \in D$ has a probabilistic part that is a
partial assignment of states to the random variables in $D_i$. As an example consider
the following data case:
{m(ann, dorothy) = true, f(brian, dorothy) = true, pc(ann) = a, mc(ann) = a,
bt(ann) = a, pc(brian) = a, mc(brian) = b, bt(brian) = ab,
pc(dorothy) = b, mc(dorothy) = ?, bt(dorothy) = ab},
where ? denotes an unknown state of a random variable. The partial assignments
induce a joint distribution over the random variables. A candidate $H \in \mathcal{H}$ should
reflect this distribution. In Bayesian networks the conditional probability distributions
are typically learned using gradient descent or expectation maximization
(EM) for a fixed structure of the Bayesian network. A scoring function $\mathrm{score}_D(H)$
that evaluates how well a given structure $H \in \mathcal{H}$ matches the data is maximized.
To summarize, the learning problem is a probabilistic extension of the learning
from interpretations setting from inductive logic programming and can be formulated
as follows:
Given a set D of data cases, a set $\mathcal{H}$ of Bayesian logic programs, and a scoring
function $\mathrm{score}_D$.
Find a candidate $H \in \mathcal{H}$ that is acyclic on the data cases such that the data
cases $D_i \in D$ are models of H (in the logical sense) and H matches the data
D best according to $\mathrm{score}_D$.
Here, the best match refers to those parameters of the associated conditional
probability distributions which maximize the scoring function.
The learning setting provides an interesting link between inductive logic programming
and Bayesian network learning, as we will show in the next section.
10.5.2

Maximum Likelihood Learning

Consider the task of performing maximum likelihood learning, i.e., $\mathrm{score}_D(H) = P(D \mid H)$.
As in many cases, it is more convenient to work with the logarithm of
this function, i.e., $\mathrm{score}_D(H) = LL(D, H) := \log P(D \mid H)$. It can be shown (see [22]
for more details) that the likelihood of a Bayesian logic program coincides with the
likelihood of the support network induced over D. Thus, learning Bayesian logic
programs basically reduces to learning Bayesian networks. The main differences are
the ways to estimate the parameters and to traverse the hypotheses space.

Figure 10.13 Decomposable combining rules can be expressed within support
networks. The nodes hi have the domain of h and cpd(c) associated. The node
h becomes a deterministic node, i.e., its parameters are fixed. For example, for
noisy or, logical or is associated as function with h. Note that the hi's are never
observed; only h might be observed.

10.5.2.1

Parameter Estimation

The parameters of nonground Bayesian clauses have to be estimated. In order to
adapt techniques traditionally used for parameter estimation of Bayesian networks,
such as the EM algorithm [7], combining rules are assumed to be decomposable³ [15].
Decomposable combining rules can be completely expressed by adding extra nodes
to the induced support network; cf. figure 10.13. These extra nodes are copies of
the (ground) head atom, which becomes a deterministic node. Now, each node in
the support network is produced by exactly one Bayesian clause c, and each node
derived from c can be seen as a separate experiment for the conditional probability
distribution cpd(c). Therefore, EM estimates the improved parameters as the
following ratio:
$$\frac{\sum_{l=1}^{m} \sum_{\theta} P(\mathrm{head}(c\theta), \mathrm{body}(c\theta) \mid D_l)}{\sum_{l=1}^{m} \sum_{\theta} P(\mathrm{body}(c\theta) \mid D_l)},$$
where $\theta$ denotes substitutions such that $D_l$ is a model of $c\theta$.
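
The ratio can be sketched in Python once the posteriors are available. The sketch assumes that an inference engine has already produced, for every data case and every ground instance of the clause c, the posterior joint distribution over (head state, body states); the function `em_update` and its input format are hypothetical and only illustrate how expected counts are pooled across ground instances.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

# One posterior per (data case, substitution): (head state, body states) -> probability.
Joint = Dict[Tuple[Hashable, Tuple[Hashable, ...]], float]


def em_update(posteriors: List[Joint]) -> Dict[Tuple[Hashable, ...], Dict[Hashable, float]]:
    """Re-estimate cpd(c) as expected (head, body) counts divided by expected body counts."""
    numerator: Dict[Tuple[Hashable, Tuple[Hashable, ...]], float] = defaultdict(float)
    denominator: Dict[Tuple[Hashable, ...], float] = defaultdict(float)
    for joint in posteriors:
        for (head, body), probability in joint.items():
            numerator[(head, body)] += probability  # P(head, body | D_l)
            denominator[body] += probability        # P(body | D_l), summed over head states
    cpd: Dict[Tuple[Hashable, ...], Dict[Hashable, float]] = defaultdict(dict)
    for (head, body), count in numerator.items():
        if denominator[body] > 0:
            cpd[body][head] = count / denominator[body]
    return dict(cpd)


if __name__ == "__main__":
    posteriors = [
        {("a", ("a", "a")): 1.0},                            # fully observed instance
        {("a", ("a", "a")): 0.5, ("b", ("a", "a")): 0.5},    # head state unobserved
    ]
    print(em_update(posteriors))
```

With fully observed data every posterior puts probability 1 on a single entry, and the update reduces to relative frequency counting.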
10.5.2.2

Traversing the Hypotheses Space

Instead of adding, deleting, or flipping single edges in the support network, we
employ refinement operators traditionally used in inductive logic programming to
add, delete, or flip several edges in the support network at the same time. More
specifically, according to some language bias (say, we consider only functor-free
and constant-free clauses), we use the two refinement operators $\rho_s : 2^H \to H$
and $\rho_g : 2^H \to H$. The operator $\rho_s(H)$ adds constant-free atoms to the body of a
single clause $c \in H$, and $\rho_g(H)$ deletes constant-free atoms from the body of a single
clause $c \in H$. Other refinement operators such as deleting and adding logically valid
clauses, instantiating variables, and unifying variables are possible too; cf. [35].

3. Most combining rules commonly employed in Bayesian networks such as noisy or are
decomposable.

Figure 10.14 Balios, the engine for Bayesian logic programs. (a) Graphical representation
of the university Bayesian logic program. (b) Textual representation of
Bayesian clauses with associated conditional probability distributions. (c) Computed
support network and probabilities for a probabilistic query.

Combining these ideas, a basic greedy hill-climbing algorithm for learning
Bayesian logic programs can be sketched as follows. Assuming some data cases
D, we take some $H_0$ as a starting point (for example, computed using some standard
inductive logic programming system) and compute the parameters maximizing
LL(D, H). Then, we use $\rho_s(H)$ and $\rho_g(H)$ to compute the legal neighbors of H in
the hypothesis space and score them. If $LL(D, H) < LL(D, H')$ for a neighbor $H'$,
then we take $H'$ as the new hypothesis.
The process is continued until no further improvements in score are obtained.
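
The search loop itself is independent of the logical details and can be sketched generically. In the sketch below, `neighbors` stands in for the refinement operators (restricted to legal, acyclic hypotheses) and `score` for the log-likelihood after parameter estimation; both are supplied by the caller and are assumptions of this illustration. The sketch moves to the best-scoring neighbor in each round, which is one of several possible greedy strategies.

```python
from typing import Callable, Iterable, Optional, TypeVar

Hypothesis = TypeVar("Hypothesis")


def hill_climb(
    start: Hypothesis,
    neighbors: Callable[[Hypothesis], Iterable[Hypothesis]],
    score: Callable[[Hypothesis], float],
) -> Hypothesis:
    """Greedy search: move to the best strictly better neighbor until none exists."""
    current, current_score = start, score(start)
    while True:
        best: Optional[Hypothesis] = None
        best_score = current_score
        for candidate in neighbors(current):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        if best is None:
            return current  # no neighbor improves the score
        current, current_score = best, best_score


if __name__ == "__main__":
    # Toy stand-in: hypotheses are sets of body atoms, scored by closeness to a target.
    pool = {"p", "q", "r"}
    target = frozenset({"p", "r"})

    def toy_neighbors(h: frozenset) -> list:
        return [h | {a} for a in pool - h] + [h - {a} for a in h]

    def toy_score(h: frozenset) -> float:
        return -float(len(h ^ target))

    print(sorted(hill_climb(frozenset(), toy_neighbors, toy_score)))
```

Like any greedy procedure, the loop stops at a local optimum of the score, so the choice of the starting hypothesis matters.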

10.6

Balios The Engine for Basic Logic Programs


An engine for Bayesian logic programs featuring a graphical representation, logical
atoms, and aggregate functions has been implemented in the Balios system [23],
which is freely available for academic use at
http://www.informatik.uni-freiburg.de/~kersting/profile/. Balios is written in Java.
It calls SICStus Prolog to perform logical inference and a Bayesian
network inference engine to perform probabilistic inference. Balios features a GUI
that graphically represents Bayesian logic programs (figure 10.14), computation of the
most likely configuration, approximate inference methods (rejection, likelihood,
and Gibbs sampling), and parameter estimation methods (hard EM, EM, and
conjugate gradient).

10.7

Related Work
In the last ten years, there has been a lot of work done at the intersection of probability theory, logic programming, and machine learning [38, 14, 41, 30, 34, 17, 26,
1, 24, 9]; see [5] for an overview. Instead of giving a probabilistic characterization
of logic programming such as [32], this research highlights the machine learning
aspect and is known under the names of statistical relational learning (SRL) [11, 8],
probabilistic logic learning (PLL) [5], or probabilistic inductive logic programming
(PILP) [6]. Bayesian logic programs belong to the SRL line of research which
extends Bayesian networks. They are motivated and inspired by the formalisms
discussed in [38, 14, 34, 17, 10, 25]. We will now investigate these relationships in
more detail.
Probabilistic logic programs [33, 34] also adopt a logic program syntax, the
concept of the least Herbrand model to specify the relevant random variables, and
SLD resolution to develop a query-answering procedure. Whereas Bayesian logic
programs view atoms as random variables, probabilistic logic programs view them
as states of random variables. For instance,
P(burglary(Person, yes) | neighbourhood(Person, average)) = 0.4
states that the a posteriori probability of a burglary in Person's house given that
Person has an average neighborhood is 0.4. Thus, instead of conditional probability
distributions, conditional probability values are associated with clauses.
Treating atoms as states of random variables has several consequences: (1)
Exclusivity constraints such as
false ← neighbourhood(X, average), neighbourhood(X, bad)
have to be specified in order to guarantee that random variables are always in
exactly one state. (2) The inference procedure is exponentially slower in time for
building the support network than that for Bayesian logic programs because there
is a proof for each configuration of a random variable. (3) It is more difficult, if
not impossible, to represent continuous random variables. (4) Qualitative information,
i.e., the logical component, and quantitative information, i.e., the probability values, are
mixed. It is precisely the separation of these two kinds of information in Bayesian
logic programs that made their graphical representation possible.
Probabilistic and Bayesian logic programs are also related to Poole's framework
of probabilistic Horn abduction [38], which is "a pragmatically-motivated simple
logic formulation that includes definite clauses and probabilities over hypotheses" [38].
Poole's framework provides a link to abduction and assumption-based
reasoning. However, as Ngo and Haddawy point out, probabilistic and therefore
also Bayesian logic programs have fewer constraints on the representation
language, represent probabilistic dependencies directly rather than indirectly, have
a richer representational power, and their independence assumption reflects the
causality of the domain.
Koller et al. [10, 25] define probabilistic relational models, which are based
on the well-known entity/relationship model. In probabilistic relational models, the
random variables are the attributes. The relations between entities are deterministic,
i.e., they are only true or false. Probabilistic relational models can be described
as Bayesian logic programs.
Indeed, each attribute a of an entity type E is a Bayesian predicate a(E) and each
n-ary relation r is an n-ary logical Bayesian predicate r/n. Probabilistic relational
models consist of a qualitative dependency structure over the attributes and their
associated quantitative parameters (the conditional probability densities). Koller
et al. distinguish between two types of parents of an attribute. First, an attribute
a(X) can depend on another attribute b(X), e.g., the professor's popularity depends
on the professor's teaching ability in the university domain. This is equivalent to
the Bayesian clause a(X) | b(X). Second, an attribute a(X) possibly depends on an
attribute b(Y) of an entity Y related to X, e.g., a student's grade in a course depends
on the difficulty of the course. The relation between X and Y is described by a slot
or logical relation s(X, Y). Given these logical relations, the original dependency is
represented by a(X) | s(X, Y), b(Y). To deal with multiple ground instantiations of
a single clause (with the same head ground atom), probabilistic relational models
employ aggregate functions, as discussed earlier.
Clearly, probabilistic relational models employ a more restricted logical component
than Bayesian logic programs do: it is a version of the commonly used
entity/relationship model. Any entity/relationship model can be represented using
a (range-restricted) definite clause logic. Furthermore, several extensions to
treat existential uncertainty, referential uncertainty, and domain uncertainty exist.
Bayesian logic programs have the full expressivity of definite clause logic and,
therefore, of a universal Turing machine. Indeed, general definite clause logic (using
functors) is undecidable. The functor-free fragment of definite clause logic, however,
is decidable.
Jaeger [17] introduced relational Bayesian networks. They are Bayesian
networks where the nodes are predicate symbols. The states of these random
variables are possible interpretations of the symbols over an arbitrary, finite domain
(here we only consider Herbrand domains), i.e., the random variables are set
valued. The inference problem addressed by Jaeger asks for the probability that
an interpretation contains a ground atom. Thus, relational Bayesian networks
are viewed as Bayesian networks where the nodes are the ground atoms and
have the domain {true, false}.⁴ The key difference between relational Bayesian
networks and Bayesian logic programs is that the quantitative information is
specified by so-called probability formulae. These formulae employ the notion of
combination functions, functions that map every finite multiset with elements
from [0, 1] into [0, 1], as well as that of equality constraints.⁵ Let F(cancer)(x)
be noisy or{comb{exposed(x, y, z) | z; true} | y; true}. This formula states that
for any specific organ y, multiple exposures to radiation have a cumulative but
independent effect on the risk of developing cancer of y. Thus, a probability formula
not only specifies the distribution but also the dependency structure. Therefore, and
because of the computational power of combining rules, a probability formula is
easily expressed as a set of Bayesian clauses: the head of the Bayesian clauses is the
corresponding Bayesian atom and the bodies consist of all maximally generalized
Bayesian atoms occurring in the probability formula. Now the combining rule can
select the right ground atoms and simulate the probability formula. This is always
possible because the Herbrand base is finite. For example, the clause cancer(X) |
exposed(X,Y,Z) together with the right combining rule and associated conditional
probability distribution models the example formula.
In addition to extensions of Bayesian networks, several other probabilistic models
have been extended to the first-order or relational case: Sato [41] introduces
distributional semantics in which ground atoms are seen as random variables over
{true, false}. Probability distributions are defined over the ground facts of a program
and propagated over the Herbrand base of the program using the clauses. Stochastic
logic programs [30, 4], introduced by Muggleton, lift context-free probabilistic
grammars to the first-order case. Production rules are replaced by clauses labeled
with probability values. Recently, Domingos and Richardson [9] introduced Markov
logic networks which upgrade Markov networks to the first-order case. The features
of the Markov logic network are weights attached to first-order predicate logic
formulae. The weights specify a bias for ground instances to be true in a logical
model.
Finally, Bayesian logic programs are related to some extent to the BUGS
language [13], which aims at carrying out Bayesian inference using Gibbs sampling.
It uses concepts of imperative programming languages such as for-loops to model
regularities in probabilistic models. Therefore, the relation between Bayesian logic
programs and BUGS is akin to the general relation between logical and imperative
languages. This holds in particular for relational domains such as those used in this
chapter. Without the notion of objects and relations among objects, family trees
are hard to represent: BUGS uses traditional indexing to group together random
variables (e.g., X1, X2, X3, ... all having the same distribution), whereas Bayesian
logic programs use definite clause logic.

4. It is possible, but complicated, to model domains having more than two values.
5. To simplify the discussion, we will further ignore these equality constraints here.

10.8 Conclusions

We have described Bayesian logic programs, their representation language, their semantics, and a query-answering process, and briefly touched upon learning Bayesian
logic programs from data.
Bayesian logic programs combine Bayesian networks with definite clause logic.
The main idea of Bayesian logic programs is to establish a one-to-one mapping
between ground atoms in the least Herbrand model and random variables. The
least Herbrand model of a Bayesian logic program together with its direct influence
relation is viewed as a (possibly infinite) Bayesian network. Bayesian logic programs
inherit the advantages of both Bayesian networks and definite clause logic, including
the strict separation of qualitative and quantitative aspects. Moreover, the strict
separation facilitated the introduction of a graphical representation, which stays
close to the graphical representation of Bayesian networks.
Indeed, Bayesian logic programs can naturally model any type of Bayesian
network (including those involving continuous variables) as well as any type of
pure Prolog program (including those involving functors). We also demonstrated
that Bayesian logic programs can model HMMs and stochastic grammars, and
investigated their relationship to other first-order extensions of Bayesian networks.
We have also presented the Balios tool, which employs the graphical as well as
the logical notations for Bayesian logic programs. It is available at
http://www.informatik.uni-freiburg.de/~kersting/profile/,
and the authors invite the reader to employ it.

Acknowledgments
The authors thank Uwe Dick for implementing the Balios system. This research
was partly supported by the European Union IST programme under contract numbers IST-2001-33053 and FP6-508861, APRIL I & II (Application of Probabilistic
Inductive Logic Programming).

References

[1] C. R. Anderson, P. Domingos, and D. S. Weld. Relational Markov models and
their application to adaptive web navigation. In International Conference on
Knowledge Discovery and Data Mining, 2002.
[2] Heinz Bauer. Wahrscheinlichkeitstheorie, 4th edition. Walter de Gruyter,
Berlin, 1991.
[3] J. Cheng, C. Hatzis, M. A. Krogel, S. Morishita, D. Page, and J. Sese. KDD
Cup 2001 report. SIGKDD Explorations, 3(2):47–64, 2002.
[4] J. Cussens. Loglinear models for first-order probabilistic reasoning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
[5] L. De Raedt and K. Kersting. Probabilistic logic learning. ACM-SIGKDD
Explorations: Special Issue on Multi-Relational Data Mining, 5(1):31–48, 2003.
[6] L. De Raedt and K. Kersting. Probabilistic inductive logic programming. In
Proceedings of the International Conference on Algorithmic Learning Theory,
pages 19–36, 2004.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society,
B 39:1–39, 1977.
[8] T. Dietterich, L. Getoor, and K. Murphy, editors. Working Notes of the ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other
Fields (SRL-04), 2004.
[9] P. Domingos and M. Richardson. Markov Logic: A Unifying Framework for
Statistical Relational Learning. In Proceedings of the ICML-2004 Workshop
on Statistical Relational Learning and its Connections to Other Fields, pages
49–54, 2004.
[10] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[11] L. Getoor and D. Jensen, editors. Working Notes of the IJCAI-2003 Workshop
on Learning Statistical Models from Relational Data (SRL-03), 2003.
[12] L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In S. Džeroski and N. Lavrač, editors, Relational Data
Mining, pages 307–335. Kluwer, 2001.
[13] W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program
for complex Bayesian modelling. The Statistician, 43, 1994.
[14] P. Haddawy. Generating Bayesian networks from probabilistic logic knowledge bases. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1994.

[15] D. Heckerman and J. Breese. Causal independence for probability assessment
and inference using Bayesian networks. Technical Report MSR-TR-94-08,
Microsoft Research, Seattle, WA, 1994.
[16] D. Heckerman, A. Mamdani, and M. P. Wellman. Real-world applications of
Bayesian networks. Communications of the ACM, 38(3):24–26, March 1995.
[17] M. Jaeger. Relational Bayesian networks. In Proceedings of the Conference
on Uncertainty in Artificial Intelligence, 1997.
[18] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer-Verlag, 2001.
[19] F. V. Jensen. An Introduction to Bayesian Networks. UCL Press Limited,
1996. Reprinted 1998.
[20] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge,
MA, 1998.
[21] K. Kersting and L. De Raedt. Bayesian logic programs. Technical Report 151,
University of Freiburg, Institute for Computer Science, Freiburg, Germany,
April 2001.
[22] K. Kersting and L. De Raedt. Adaptive Bayesian Logic Programs. In
Proceedings of the International Conference on Inductive Logic Programming,
2001.
[23] K. Kersting and U. Dick. Balios - The engine for Bayesian logic programs.
In Proceedings of the European Conference on Principles and Practice of
Knowledge Discovery in Databases, 2004.
[24] K. Kersting, T. Raiko, S. Kramer, and L. De Raedt. Towards discovering
structural signatures of protein folds based on logical hidden Markov models.
In Proceedings of the Pacific Symposium on Biocomputing, 2003.
[25] D. Koller. Probabilistic relational models. In Proceedings of the International
Conference on Inductive Logic Programming, 1999.
[26] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings
of the National Conference on Artificial Intelligence, 1998.
[27] P. Langley. Elements of Machine Learning. Morgan Kaufmann, San Francisco,
1995.
[28] J. W. Lloyd. Foundations of Logic Programming, 2nd edition. Springer-Verlag,
Berlin, 1989.
[29] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, MA, 1999.
[30] S. Muggleton. Stochastic logic programs. In L. De Raedt, editor, Advances
in Inductive Logic Programming, Amsterdam, 1996. IOS Press.
[31] S. Muggleton and L. De Raedt. Inductive logic programming: Theory and
methods. Journal of Logic Programming, 19(20):629–679, 1994.
[32] R. Ng and V. S. Subrahmanian. Probabilistic logic programming. Information
and Computation, 101(2):150–201, 1992.
[33] L. Ngo and P. Haddawy. Probabilistic logic programming and Bayesian
networks. In Algorithms, Concurrency and Knowledge: Proceedings of the
Asian Computing Science Conference, 1995.
[34] L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic knowledge bases. Theoretical Computer Science, 171:147–177, 1997.
[35] S.-H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic
Programming. Springer-Verlag, 1997.
[36] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference,
2nd edition. Morgan Kaufmann, San Francisco, 1991.
[37] A. J. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis,
Stanford University, 2000.
[38] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial
Intelligence, 64:81–129, 1993.
[39] L. R. Rabiner. A tutorial on hidden Markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[40] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach.
Prentice-Hall, Upper Saddle River, NJ, 1995.
[41] T. Sato. A statistical learning method for logic programs with distribution
semantics. In Proceedings of the International Conference on Inductive Logic
Programming, 1995.
[42] T. Sato and Y. Kameya. Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research, 15:391–454,
2001.
[43] R. D. Shachter. Bayes-Ball: The rational pastime (for determining irrelevance and requisite information in belief networks and influence diagrams). In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1998.

11 Stochastic Logic Programs: A Tutorial

Stephen Muggleton and Niels Pahlavi

Stochastic logic programs (SLPs) provide a simple scheme for representing probability distributions over structured objects. Other papers have concentrated on
technical issues related to the semantics and machine learning of SLPs. By contrast, this chapter provides a tutorial on the use of SLPs as a means of representing probability distributions over structured objects such as sequences, graphs, and
plans.

11.1 Introduction

11.1.1 Logic Programs for Algorithms

Logic programs [5] provide a convenient way of describing computer algorithms in
a compact and declarative fashion. A good example of this is the following Prolog
[1] representation of the quick-sort algorithm.

quick_sort([],[]).
quick_sort([Head|Tail],Sorted) :-
    partition(Tail,Head,BeforeHead,AfterHead),
    quick_sort(BeforeHead,BeforeSorted),
    quick_sort(AfterHead,AfterSorted),
    append(BeforeSorted,[Head|AfterSorted],Sorted).

Here the key elements of the algorithm are captured in a few lines, showing the
relationship between the various parts of the solution.
11.1.2 Logic Programs and Non-Determinism

An algorithm generally describes an entirely deterministic series of actions. However, apart from their use in describing algorithms, logic programs can also describe
processes involving non-deterministic choice. For instance, consider the following
well-known logic program.

member(Element,[Element|_]).
member(Element,[_|List]) :-
    member(Element,List).

When provided with the goal :- member(X,[b,a,c]), a Prolog interpreter will give
the following solutions.
X = b
X = a
X = c
Each of these three solutions is associated with one of the derivations of the given
goal.
11.1.3 Probabilistic Non-Determinism

Consider the following non-deterministic logic program representation of the outcome of tossing a two-sided coin.
coin(head).
coin(tail).
This logic program can be interpreted as saying that when the coin is tossed it will
either come up as heads or tails. However, the logic program does not state the
frequency with which we can expect these two outcomes to occur. By associating
probability labels with the clauses we get the following stochastic logic program
(SLP) [9] representation of a fair coin (a coin with equal probability outcomes of
heads and tails).
0.5: coin(head).
0.5: coin(tail).
Given the goal :- coin(X) we would now expect the outcomes X = head and
X = tail to occur randomly with probability 0.5 in each case. Here we can view X
as a random variable in the statistical sense.
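
The labeled coin clauses can be mimicked outside Prolog to check the intended reading. The following Python sketch is only an illustration and is not part of the SLP formalism: the names `COIN`, `sample`, and `estimate` are hypothetical, and each labeled clause is modeled as a (probability, outcome) pair.

```python
import random
from collections import Counter

# Labeled clauses for coin/1, written as (probability, outcome) pairs.
# This encoding is an assumption made for illustration.
COIN = [(0.5, "head"), (0.5, "tail")]

def sample(clauses, rng):
    """Pick one outcome with probability proportional to its label."""
    labels = [p for p, _ in clauses]
    outcomes = [x for _, x in clauses]
    return rng.choices(outcomes, weights=labels, k=1)[0]

def estimate(clauses, n, seed=0):
    """Empirical frequency of each outcome over n independent samples."""
    rng = random.Random(seed)
    counts = Counter(sample(clauses, rng) for _ in range(n))
    return {x: counts[x] / n for _, x in clauses}
```

Calling `estimate(COIN, 10000)` should give frequencies close to 0.5 for both outcomes, which is the sense in which X behaves as a random variable.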

11.2 Mixing Deterministic and Probabilistic Choice

We now consider two more complex representational problems. The first involves a
simple game with probabilistic outcomes and the second a simplified version of the
famous casino blackjack game. The first game is described below.

11.2.1 Simple Game of Chance


The game involves a player and a banker. The player starts with a quantity of N
counters and the banker with M counters. Until the player chooses to stop he does
the following repeatedly.
1. The player pays an entrance fee (F counters) to the banker.
2. The player rolls a six-sided dice and gets the value D.
3. The banker rewards the player with D counters.
11.2.2 Representing the Game as an SLP

Below we show how this game can be represented as an SLP. The form of SLP used
below is known as an impure SLP. An impure SLP is one in which not every definite
clause has a probability label. Those without a probability label are treated as
normal logic program clauses. Let us start with the unlabeled part of the program.

play(State) :-
    act(stop(State,State)).
play(State) :-
    act(pay_entrance(State,State1)),
    act(dice_reward(State1,State2)),
    play(State2).
act(X) :- X.

Here we see the general playing strategy for the game. Every action such as
pay_entrance and dice_reward is conducted by the predicate act. Each such action
transforms one state into another. Play proceeds by recursing via the second clause
until the stop action is taken using the first clause.

11.2.3 Representing the Actions

Next we show the way in which actions are represented.

stop([Player,Banker],_) :-
    % evaluate the arithmetic expressions accumulated during play
    Player1 is Player,
    Banker1 is Banker,
    write('Player = '), write(Player1), nl,
    write('Banker = '), write(Banker1), nl.
pay_entrance([Player,Banker],[Player-4,Banker+4]).
dice_reward([Player,Banker],[Player+D,Banker-D]) :-
    roll_dice(D).

The stop action simply prints out the playing state, which is represented as a two-element list consisting of the number of counters held by the Player and Banker
respectively.
The pay_entrance action reduces the player's counters by 4 and increases the
banker's counters by 4.
The dice_reward action increases the player's counters by the value D of the rolled
dice and decreases the banker's counters by D.
11.2.4 Representing the Dice

The dice represent the only probabilistic element of the game. A fair dice is
represented as follows.

1/6: roll_dice(1).
1/6: roll_dice(2).
1/6: roll_dice(3).
1/6: roll_dice(4).
1/6: roll_dice(5).
1/6: roll_dice(6).
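
To see how the deterministic actions and the single probabilistic clause interact, the game can be simulated directly. The Python sketch below is a hedged illustration rather than a translation of the SLP: the function names mirror the predicates above, but the fixed number of rounds is an assumption standing in for the player's non-deterministic choice to stop.

```python
import random

ENTRANCE_FEE = 4  # the fee used by pay_entrance in the text

def pay_entrance(state):
    """Move the entrance fee from the player to the banker."""
    player, banker = state
    return (player - ENTRANCE_FEE, banker + ENTRANCE_FEE)

def dice_reward(state, rng):
    """Roll a fair six-sided dice and pay the rolled value to the player."""
    player, banker = state
    d = rng.randint(1, 6)  # each face has probability 1/6
    return (player + d, banker - d)

def play(state, rounds, rng):
    """Play a fixed number of rounds (an assumption), then return the final state."""
    for _ in range(rounds):
        state = dice_reward(pay_entrance(state), rng)
    return state
```

Since the expected roll of a fair dice is 3.5 and the fee is 4, the player loses 0.5 counters per round on average, while the total number of counters in the game stays constant.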

11.2.5 Simplified Blackjack Game

The rest of this section is dedicated to a more extended representational
problem, which involves a version of the blackjack game. After describing the game,
we show how it can be represented using the SLP framework and why this
framework is efficient and well suited to the problem. Finally, we consider how
this version could be modified and the effect of such modifications on the corresponding
SLP representation.
11.2.6 Description and Specifications

Our blackjack game model is very close to the real blackjack game, as described in
Wikipedia [14]. We consider this version a simplification of the real game in the
sense that it does not include bets and money. It involves only one player and the
player does not have any strategy.
Let us now describe the specifications of the game. Blackjack hands are scored
by their point total. The hand with the highest total wins as long as it does not go
over 21, which is called a "bust." Cards 2 through 10 are worth their face value,
and face cards (Jack, Queen, King) are also worth 10. An ace counts as 11 unless
it would bust a hand, in which case it counts as 1.
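
The scoring rule can be written down compactly. The following Python sketch is an illustration under a stated assumption: cards are encoded by a value from 1 to 13, with 1 standing for the ace and 11 to 13 for the face cards. The names `card_points`, `hand_score`, and `is_bust` are hypothetical and do not come from the chapter's program.

```python
def card_points(value):
    """Points for a card value 1..13, with the ace (1) provisionally counted as 1."""
    return min(value, 10)  # face cards (11..13) are worth 10

def hand_score(values):
    """Point total of a hand, counting one ace as 11 when that does not bust."""
    total = sum(card_points(v) for v in values)
    # At most one ace can count as 11, since two would already exceed 21.
    if 1 in values and total + 10 <= 21:
        total += 10
    return total

def is_bust(values):
    """A hand is bust when its point total exceeds 21."""
    return hand_score(values) > 21
```

For example, an ace with a king scores 21, while an ace with a nine and a five scores 15 because counting the ace as 11 would bust the hand.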
In our version there is only one player. His goal is to beat the dealer by having
the higher, unbusted hand. Note that if the player busts, he loses, even if the dealer
also busts. If the player's and the dealer's hands have the same point value, this is
known as a "push," and neither player nor dealer wins the hand.

The dealer deals the cards, in our version from one deck of cards. The dealer
gives two cards to the player and to himself.
A two-card hand of 21 (an ace plus a ten-value card) is called a "blackjack" or a
"natural," and is an automatic winner.
If the dealer has a blackjack and the player does not, the dealer wins automatically. If the player has a blackjack and the dealer does not, the player wins
automatically. If the player and dealer both have blackjack, it is a tie (push). If
neither side has a blackjack, the player in our version always
stands, and then the dealer plays his hand. He must hit until he has at least 17, regardless
of what the player has. The dealer may hit until he has a maximum of five cards
in his hand.
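
The dealer's rule (hit below 17, stop at five cards) can be sketched as a small loop. This Python fragment is a hypothetical illustration: it assumes cards are values 1 to 13 with 1 as the ace, takes the card source as a caller-supplied function `draw`, and ignores deck depletion. The names `d_stands` and `d_turn` echo the predicates used later in the chapter, but the bodies here are assumptions, not the authors' definitions.

```python
def hand_score(values):
    """Point total with face cards worth 10 and one ace promoted to 11 if safe."""
    total = sum(min(v, 10) for v in values)
    if 1 in values and total + 10 <= 21:
        total += 10
    return total

def d_stands(hand, max_cards=5):
    """The dealer stands at 17 or more, or once the hand holds max_cards cards."""
    return hand_score(hand) >= 17 or len(hand) >= max_cards

def d_turn(hand, draw):
    """Deal cards to the dealer, one at a time, until he stands."""
    hand = list(hand)
    while not d_stands(hand):
        hand.append(draw())
    return hand
```

With this reading, a dealer holding 10 and 7 draws nothing, while a dealer who keeps drawing low cards stops once he holds five cards.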
The parameters of the game that could be modified, defining another version of
the game, are:
the number of decks of cards;
the maximum number of cards in a hand;
the strategy of the player;
the number of players.
11.2.7 Prolog Implementation of the Game

We show here how we can represent the game in Prolog. Such an implementation
is a useful step toward the SLP representation for several reasons. First, since SLPs lift the
concept of logic programs, representing the game in Prolog allows us to translate
it in order to obtain an SLP representation of the game. Secondly, it is interesting
to see the difference in expressivity between logic programs and SLPs. Finally, the
Prolog implementation permits us to experimentally verify the correctness of our
representation.
Let us present the entry clause of the program.

game(Result,PScore,PHand,DScore,DHand) :-
    State0 = [[],[],[]],
    act(first_2_cards(State0,State1)),
    act(rest_of_game(State1,State2)),
    end_of_game(State2,Result,PScore,PHand,DScore,DHand).
act(X) :- X.

The general playing strategy has the same structure as for the simple game of chance
described above. Indeed, every action such as first_2_cards and rest_of_game is
conducted by the predicate act. Each such action transforms one state into another.
The predicate end_of_game does not represent an action but calculates, given the
final state of the game, the result, returning also the scores and the hands of the
player and the dealer as its last arguments.

A playing state is represented as a list of three lists. The first list represents the
player hand, the second the dealer hand, and the third all the cards already dealt.
The rest of the program defines each of the predicates which are mentioned in
the body of the clause. For instance, let us present the definition of the predicate
rest_of_game.

rest_of_game(State,State2) :-
    act(p_turn(State,State1)),
    act(d_turn(State1,State2)).

We need to introduce two other predicates: p_turn and d_turn. For instance, d_turn
represents the dealer's turn after he has received his first two cards. In this phase he
asks for extra cards until he stands. This corresponds to the following two clauses.

d_turn(State,State) :-
    d_stands(State).
d_turn(State,State2) :-
    \+ d_stands(State),
    act(d_deal_card(State,State1)),
    act(d_turn(State1,State2)).

The predicate d_deal_card represents the action of dealing a card to the dealer.
Therefore, it requires taking a card from the deck of cards. Taking a card is
represented by the following clause,

pick_card(Cards,Card) :-
    random_card(Card),
    non_member(Card,Cards).

where

random_card((C,V)) :-
    repeat,
    random(1,5,C),
    random(1,14,V).

random is a built-in Prolog predicate which simulates the choice of a number
between two bounds.
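
The combination of repeat, random, and non_member amounts to rejection sampling: a card is redrawn until one is found that has not yet been dealt. The Python sketch below illustrates this reading under assumptions: colors are taken to range over 1 to 4 and values over 1 to 13 (that is, the upper bound of random is treated as exclusive), and the guard against an exhausted deck is an addition that the Prolog clauses do not contain.

```python
import random

def random_card(rng):
    """Draw a (color, value) pair uniformly: 4 colors and 13 values."""
    return (rng.randint(1, 4), rng.randint(1, 13))

def pick_card(dealt, rng):
    """Redraw until the card is not among those already dealt."""
    if len(set(dealt)) >= 52:
        # Added guard (not in the original clauses): avoid looping forever.
        raise ValueError("no cards left in the deck")
    while True:
        card = random_card(rng)
        if card not in dealt:
            return card
```

Because rejected draws are simply discarded, each card still in the deck is equally likely to be picked, which matches drawing uniformly from a single deck.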
11.2.8 Representing the Blackjack Game as an SLP

SLP is the statistical relational learning (SRL) framework that is arguably the
closest to logic programs in terms of declarativeness. SLPs are therefore a natural
framework into which to translate the blackjack game. There are two
types of clauses in the Prolog program, and they require two different types of treatment.

The Prolog clauses without any random aspect are not modified in the SLP
representation. Obviously we do not restrict ourselves to the notion of pure SLPs
but instead we allow for impure SLP representations.
The random aspects of the program are transformed into labeled clauses. However,
taking a card from a deck of cards is the only probabilistic element of the game.
Therefore, the Prolog implementation and the SLP representation of the blackjack
game are virtually identical. The sole use of the predicate random is replaced by
several labeled clauses expressing that taking a card from a deck is a random
action. Let us show how the action of taking a card from the deck is translated.
Compared to the Prolog implementation of this action described above, the
random predicates have to be replaced in the SLP representation by labeled clauses.
The clause choose_color determines the color of the card and the clause choose_value
determines the value of the card.

pick_card(Cards,Card) :-
    random_card(Card),
    non_member(Card,Cards).
random_card((C,V)) :-
    repeat,
    choose_color(C),
    choose_value(V).

0.25: choose_color(1).
0.25: choose_color(2).
0.25: choose_color(3).
0.25: choose_color(4).

0.07692308: choose_value(1).
0.07692308: choose_value(2).
0.07692308: choose_value(3).
0.07692308: choose_value(4).
0.07692308: choose_value(5).
0.07692308: choose_value(6).
0.07692308: choose_value(7).
0.07692308: choose_value(8).
0.07692308: choose_value(9).
0.07692308: choose_value(10).
0.07692308: choose_value(11).
0.07692308: choose_value(12).
0.07692308: choose_value(13).

Thus the SLP representation is almost as expressive as the Prolog program. Since
SLPs are logically oriented, it is relatively easy to understand the rules of the game
given the SLP representation.
11.2.9 Effect of Several Game Modifications

Let us now present how a modification in the parameters of the game description
would affect this model.
If we added other decks of cards, we would have to add an argument to the description of a card. A card would be defined as C=(Value,Color,Number_of_Deck).
We would have to modify this in the relevant clauses but the number of these
clauses is limited. We would also have to add a predicate choose_deck defined
like choose_color and choose_value.
If we allowed for more cards in a hand, we would only have to replace 5 by the
new number in the definition of d_stands.
If we wanted to assign a cleverer game strategy for the player, we would only have
to modify the p_stands predicate.
We would have to make more important changes if we wanted to change the
number of players. Indeed, we would have to add equivalent predicates for all the
predicates that model the actions of the players.
Thanks to its great expressivity and compactness, the SLP representation would
not be modified much when changing the parameters of the game compared to
other frameworks.

11.3 Stochastic Grammars

The initial inspiration for SLPs in [9] was the idea of lifting stochastic grammars
to the expressive level of logic programs. In this section we show the relationship
between stochastic grammars and SLPs.

11.3.1 Stochastic Automata

Stochastic automata, otherwise called hidden Markov models [11], have found many
applications in speech recognition. An example is shown in figure 11.1. Stochastic
automata are defined by a 5-tuple $A = \langle Q, \Sigma, q_0, F, \delta \rangle$. $Q$ is a set of states. $\Sigma$ is an
alphabet of symbols. $q_0$ is the initial state and $F \subseteq Q$ ($F = \{q_2\}$ in figure 11.1) is
the set of final states. $\delta : (Q \setminus F) \times \Sigma \to Q \times [0, 1]$ is a stochastic transition function
which associates probabilities with labeled transitions between states. The sum of
probabilities associated with transitions from any state $q \in (Q \setminus F)$ is 1.
In the following $\lambda$ represents the empty string. The transition function
$\delta^* : (Q \setminus F) \times \Sigma^* \to Q \times [0, 1]$ is defined as follows. $\delta^*(q, \lambda) = \langle q, 1 \rangle$.
$\delta^*(q, au) = \langle q_{au}, p_a p_u \rangle$ if and only if $\delta(q, a) = \langle q_a, p_a \rangle$ and
$\delta^*(q_a, u) = \langle q_{au}, p_u \rangle$. The probability of $u$ being
accepted from state $q$ in $A$ is defined as follows. $\Pr(u|q, A) = p$ if $\delta^*(q, u) = \langle q', p \rangle$
and $q' \in F$. $\Pr(u|q, A) = 0$ otherwise.

[Figure 11.1 Stochastic automaton. States q0, q1, q2; transitions q0 to q0 on a (0.4), q0 to q1 on b (0.6), q1 to q1 on b (0.7), q1 to q2 on c (0.3).]

Theorem 11.1
Probability of a string being accepted from a particular state. Let
$A = \langle Q, \Sigma, q_0, F, \delta \rangle$ be a stochastic automaton. For any $q \in Q$ the following holds.

$\sum_{u \in \Sigma^*} \Pr(u|q, A) = 1$.

Proof Suppose the theorem is false. Either $q \in F$ or $q \notin F$. Suppose $q \in F$.
Then by the definition of stochastic automata $q$ has no outgoing transitions.
Therefore by definition $\Pr(u|q, A)$ is 1 for $u = \lambda$ and 0 otherwise, which is in
accordance with the theorem. Therefore suppose $q \notin F$. Suppose in state $q$ the
transitions are $\delta(q, a_1) = \langle q_1, p_1 \rangle, \ldots, \delta(q, a_n) = \langle q_n, p_n \rangle$. Then each string $u$ is
accepted in proportions $p_1, \ldots, p_n$ according to its first symbol. That is to say,
$\sum_{u \in \Sigma^*} \Pr(u|q, A) = p_1 + \cdots + p_n$. But according to the definition of $\delta$, $p_1 + \cdots + p_n = 1$,
which means $\sum_{u \in \Sigma^*} \Pr(u|q, A) = 1$. This contradicts the assumption and completes
the proof. □
If the probability of $u$ being accepted by $A$ is now defined as $\Pr(u|A) = \Pr(u|q_0, A)$,
then the following corollary shows that $A$ defines a probability distribution over $\Sigma^*$.
Corollary 11.2
Stochastic automata represent probability distributions. Given stochastic
automaton $A$,

$\sum_{u \in \Sigma^*} \Pr(u|A) = 1$.

Proof Special case of theorem 11.1 when $q = q_0$. □
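
As a worked check of the corollary on the automaton of figure 11.1, note that every accepted string has the form $a^i b\, b^j c$ with $i, j \ge 0$ (the transition probabilities are those listed in figure 11.2). Summing the two geometric series gives:

```latex
\sum_{u \in \Sigma^*} \Pr(u \mid A)
  = \sum_{i \ge 0} \sum_{j \ge 0} 0.4^{i} \cdot 0.6 \cdot 0.7^{j} \cdot 0.3
  = \frac{0.6}{1 - 0.4} \cdot \frac{0.3}{1 - 0.7}
  = 1 .
```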


The following example illustrates the calculation of probabilities of strings.
Example 11.1
Probabilities associated with strings. For the automaton $A$ in figure 11.1 we
have $\Pr(abbc|A) = 0.4 \times 0.6 \times 0.7 \times 0.3 = 0.0504$. $\Pr(abac|A) = 0$.
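
The calculation in this example can be reproduced mechanically by following the transitions one symbol at a time. The Python sketch below is an illustration only: the table `DELTA` encodes the transitions of figure 11.1 as read from the production rules of figure 11.2, and the names `DELTA`, `FINAL`, and `accept_prob` are hypothetical.

```python
# Transition table: (state, symbol) -> (next state, probability).
DELTA = {
    ("q0", "a"): ("q0", 0.4),
    ("q0", "b"): ("q1", 0.6),
    ("q1", "b"): ("q1", 0.7),
    ("q1", "c"): ("q2", 0.3),
}
FINAL = {"q2"}

def accept_prob(u, state="q0"):
    """Probability that string u is accepted when starting from the given state."""
    p = 1.0
    for symbol in u:
        if (state, symbol) not in DELTA:
            return 0.0  # no such transition: the string is rejected
        state, step = DELTA[(state, symbol)]
        p *= step
    # Only strings that end in a final state have nonzero probability.
    return p if state in FINAL else 0.0
```

Here `accept_prob("abbc")` multiplies 0.4, 0.6, 0.7, and 0.3, while `accept_prob("abac")` is zero because there is no transition labeled a out of q1.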

0.4 : q0 → a q0
0.6 : q0 → b q1
0.7 : q1 → b q1
0.3 : q1 → c q2
1.0 : q2 → λ
Figure 11.2 Labeled production rule representation of stochastic automaton.

$A$ can also be viewed as expressing a probability distribution over the language
$L(A) = \{u : \delta^*(q_0, u) = \langle q, p \rangle \text{ and } q \in F\}$. The following theorem places bounds on
the probability of individual strings in $L(A)$. The notation $|u|$ is used to express
the length of string $u$.
Theorem 11.3
Probability bounds. Let $A = \langle Q, \Sigma, q_0, F, \delta \rangle$ be a stochastic automaton and
let $p_{\min}$, $p_{\max}$ be respectively the minimum and maximum probabilities of any
transition in $A$. Let $u \in L(A)$ be a string. Then

$p_{\min}^{|u|} \le \Pr(u|A) \le p_{\max}^{|u|}$.

Proof $\Pr(u|A) = \prod_{i=1}^{|u|} p_i$, where $p_i$ is the probability associated with the $i$th
transition in $A$ accepting $u$. Clearly each $p_i$ is bounded below by $p_{\min}$ and above
by $p_{\max}$, and thus $p_{\min}^{|u|} \le \Pr(u|A) \le p_{\max}^{|u|}$. □
This theorem shows that (a) all strings in $L(A)$ have nonzero probability and (b)
stochastic automata express probability distributions that decrease exponentially
in the length of strings in $L(A)$.
11.3.2 Labeled Productions

Stochastic automata can be equivalently represented as a set of labeled production
rules. Each state in the automaton is represented by a nonterminal symbol and
each transition $\delta(q, a) = \langle q', p \rangle$ is represented by a production rule of the form
$p : q \to a q'$. Figure 11.2 is the set of labeled production rules corresponding
to the stochastic automaton of figure 11.1. Strings can now be generated from
this stochastic grammar by starting with the string q0 and progressively choosing
productions to rewrite the leftmost nonterminal randomly in proportion to their
probability labels. The process terminates once the string contains no nonterminals.
The probability of the generated string is the product of the labels of rewrite rules
used.
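
The generation process just described can be sketched as follows. This Python fragment is an illustration under assumptions: the dictionary `RULES` encodes the productions of figure 11.2 as (label, emitted terminal, next nonterminal) triples, with `None` marking the end of the derivation, and the names `RULES` and `generate` are hypothetical.

```python
import random

# Productions per nonterminal: (probability label, terminal, next nonterminal).
RULES = {
    "q0": [(0.4, "a", "q0"), (0.6, "b", "q1")],
    "q1": [(0.7, "b", "q1"), (0.3, "c", "q2")],
    "q2": [(1.0, "", None)],  # q2 rewrites to the empty string
}

def generate(rng, start="q0"):
    """Return (string, probability) for one randomly chosen derivation."""
    out, prob, nonterminal = [], 1.0, start
    while nonterminal is not None:
        options = RULES[nonterminal]
        # Choose a production in proportion to its probability label.
        weights = [label for label, _, _ in options]
        label, terminal, nxt = rng.choices(options, weights=weights, k=1)[0]
        out.append(terminal)
        prob *= label
        nonterminal = nxt
    return "".join(out), prob
```

Every generated string has the form of some a symbols, at least one b, and a final c, and the returned probability is the product of the labels of the rules used.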

0.5 : S → λ
0.5 : S → aSb
Figure 11.3 Stochastic context-free grammar.

11.3.3 Stochastic Context-free Grammars

Stochastic context-free grammars [4] can be treated in the same way as the labeled
productions of the last section. However, the following differences exist between the
regular and context-free cases.
To allow for the expression of context-free grammars the right-hand sides of the
production rules are allowed to consist of arbitrary strings of terminals and
nonterminals.
Since context-free grammars can have more than one derivation of a particular
string u, the probability of u is the sum of the probabilities of the individual
derivations of u.
The analogue of theorem 11.3 holds only in relation to the length of the derivation, not the length of the generated string.
Example 11.2
The language $a^n b^n$. Figure 11.3 shows a stochastic context-free grammar $G$
expressed over the language $a^n b^n$. The probabilities of generated strings are as
follows. $\Pr(\lambda|G) = 0.5$, $\Pr(ab|G) = 0.25$, $\Pr(aabb|G) = 0.125$.
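
For this grammar each string has exactly one derivation, which uses the recursive rule n times and the terminating rule once, so the probability of $a^n b^n$ is $0.5^{n+1}$. The Python sketch below simply encodes that observation; the function name `prob_anbn` is hypothetical.

```python
def prob_anbn(s):
    """Probability of string s under the grammar of figure 11.3."""
    n = len(s) // 2
    if s != "a" * n + "b" * n:
        return 0.0  # s is not in the language, so it has no derivation
    # n uses of the recursive rule and one use of the terminating rule.
    return 0.5 ** (n + 1)
```

The three values given in the example correspond to n = 0, 1, and 2.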

11.4 Stochastic Logic Programs

Every context-free grammar can be expressed as a definite clause grammar [2].
For this reason the generalization of stochastic context-free grammars to SLPs is
reasonably straightforward. First, a definite clause $C$ is defined in the standard way
as having the following form.
$A \leftarrow B_1, \ldots, B_n$,
where the atom $A$ is the head of the clause and $B_1, \ldots, B_n$ is the body of the
clause. $C$ is said to be range-restricted if and only if every variable in the head of
$C$ is found in the body of $C$. A stochastic clause is a pair $p : C$ where $p$ is in the
interval [0, 1] and $C$ is a range-restricted clause. A set of stochastic clauses $P$ is
called a stochastic logic program if and only if for each predicate symbol $q$ in $P$ the
probability labels for all clauses with $q$ in the head sum to 1.

0.5 : nate(0) ←
0.5 : nate(s(N)) ← nate(N)
Figure 11.4 Exponential distribution over natural numbers.

11.4.1 Stochastic SLD Refutations

For SLPs the stochastic refutation of a goal is analogous to the stochastic generation
of a string from a set of labeled production rules. Suppose that P is an SLP.
Then n(P) will be used to express the logic program formed by dropping all the
probability labels from clauses in P. A stochastic SLD procedure will be used to
define a probability distribution over the Herbrand base of n(P). The stochastic
SLD derivation of atom a is as follows. Suppose g is a unit goal with the same
predicate symbol as a, no function symbols, and distinct variables. Next suppose
that there exists an SLD refutation of g with answer substitution $\theta$ such that
$g\theta = a$. Since all clauses in n(P) are range-restricted, $\theta$ is necessarily a ground
substitution. The probability of each clause selection in the refutation is as follows.
Suppose the first atom in the subgoal $g'$ can unify with the heads of stochastic
clauses $p_1 : C_1, \ldots, p_n : C_n$, and stochastic clause $p_i : C_i$ is chosen in the refutation.
Then the probability of this choice is $\frac{p_i}{p_1 + \cdots + p_n}$. The probability of the derivation of
a is the product of the probability of the choices in the refutation. As with stochastic
context-free grammars, the probability of a is then the sum of the probabilities of
the derivations of a.
This stochastic SLD strategy corresponds to a distributional semantics [13] for
P. That is, each atom a in the success set of n(P) is assigned a nonzero probability
(due to the completeness of SLD derivation). For each predicate symbol q the
probabilities of atoms in the success set of n(P) corresponding to q sum to 1 (the
proof of this is analogous to theorem 11.1).
11.4.2 Polynomial Distributions


It is reasonable to ask whether theorem 11.3 extends in some form to SLPs. The
distributions described in [10] include both those that decay exponentially over the
length of formulae and those that decay polynomially. SLPs can easily be used to
describe an exponential decay distribution over the natural numbers as follows.
Example 11.3
Exponential distribution. Figure 11.4 shows a recursive SLP P which describes an exponential distribution over the natural numbers expressed in Peano
arithmetic form. The probabilities of atoms are as follows: $Pr(nate(0) \mid P) = 0.5$,
$Pr(nate(s(0)) \mid P) = 0.25$, and $Pr(nate(s(s(0))) \mid P) = 0.125$. In general,
$Pr(nate(N) \mid P) = 2^{-N-1}$.
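
The closed form can be checked with a few lines of Python. This is only a sketch: the function name pr_nate and the parameter p_stop (the label of the base clause) are ours, not part of the chapter. The only refutation of nate(s^n(0)) selects the recursive clause n times and the base clause once.

```python
def pr_nate(n: int, p_stop: float = 0.5) -> float:
    """Probability of the atom nate(s^n(0)) under the SLP of figure 11.4.

    p_stop is the label of the base clause nate(0); the recursive clause
    carries the remaining mass 1 - p_stop (hypothetical parameter name).
    """
    if n < 0:
        raise ValueError("n must be a natural number")
    # n selections of the recursive clause, then one of the base clause.
    return (1.0 - p_stop) ** n * p_stop


if __name__ == "__main__":
    for n in range(4):
        print(n, pr_nate(n))
```

With the labels of figure 11.4 (both 0.5) this reduces to $2^{-N-1}$, and the probabilities sum to 1 over the natural numbers.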


1.0 : natp(N) ← nate(U), bin(U, N)

0.5 : bin(0, [1])
0.5 : bin(s(U), [C|N]) ← coin(C), bin(U, N)
Figure 11.5 Polynomial distribution over natural numbers.

However, SLPs can also be used to define a polynomially decaying distribution over
the natural numbers as follows.
Example 11.4
Polynomial distribution. Figure 11.5 shows a recursive SLP P which describes a
polynomial distribution over the natural numbers expressed in reverse binary form.
Numbers are constructed by first choosing the length of the binary representation
and then filling out the binary expression by repeated tossing of a fair coin. Since
the probability of choosing a number N of length $\log_2(N)$ is roughly $2^{-\log_2(N)}$ and
there are $2^{\log_2(N)}$ such numbers, each with equal probability, $Pr(natp(N) \mid P) \approx
2^{-2\log_2(N)} = N^{-2}$.
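
The approximation can be made exact under two assumptions that are ours rather than the chapter's: that coin(0) and coin(1) each carry label 0.5 (the "fair coin" of the text), and that clause choices are normalized over the clauses whose heads unify, as in section 11.4.1, so that the bin clauses contribute a factor of 1 once U is ground. The function name pr_natp is hypothetical.

```python
def pr_natp(n: int) -> float:
    """Probability of natp(N) where N is the reverse-binary list encoding n >= 1.

    Assumes fair coin clauses and that U is already ground when bin(U, N) is
    called, so only one bin clause unifies at each step.
    """
    if n < 1:
        raise ValueError("the SLP of figure 11.5 only generates n >= 1")
    length = n.bit_length()        # number of bits, leading 1 included
    pr_length = 0.5 ** length      # nate(U) with U = length - 1
    pr_bits = 0.5 ** (length - 1)  # one coin toss per bit below the leading 1
    return pr_length * pr_bits


if __name__ == "__main__":
    for n in (1, 2, 3, 8, 100):
        print(n, pr_natp(n), n ** -2)
```

Under these assumptions the probability is $2^{-(2\ell - 1)}$ for a number with $\ell$ bits, which stays within a factor of 2 of $N^{-2}$.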

11.5 Learning Techniques

We will now briefly introduce the different existing learning techniques for SLPs.
We will begin with a description of the data used for learning. We will then focus
on the parameter estimation techniques and finally on structure learning,
after having defined these notions.
11.5.1 Data Used

For SLPs, as for stochastic context-free grammars, the evidence used for learning is
facts or even clauses.
11.5.2 Parameter Estimation

The aim of parameter estimation is, given a set of examples, to infer the values
of the parameters (which represent the quantitative part of the model) that best
justify the set of examples. We will focus on maximum likelihood estimation
(MLE), which tries to find $\lambda^* = \arg\max_{\lambda} P(E \mid L, \lambda)$. The MLE cannot be computed exactly
when data is missing, so the expectation maximization (EM) algorithm
is the most commonly used technique.
As described in [12], EM assumes that the parameters have been initialized (e.g.,
at random) and then iteratively performs the following two steps until convergence:


E-Step: on the basis of the observed data and the present parameters of the model,
compute a distribution over all possible completions of each partially observed
data case.
M-Step: Using each completion as a fully-observed data case weighted by its probability, compute the updated parameter using (weighted) frequency counting.
For SLPs, one uses the failure-adjusted maximization (FAM) algorithm introduced
by Cussens [3]. The logical part of the SLP is given, and only the parameters are learned
from the evidence. The examples consist of atoms for a predicate p and
are logically entailed by the SLP, since they are generated from the target SLP. In
order to estimate the parameters, SLD trees are computed for each example. Each
path from root to leaf is considered as one of the possible completions. Each
completion is then weighted with the product of the probabilities associated with the
clauses that are used in it. Finally, one obtains the improved
estimate for each clause by dividing the clause's expected count by the sum of
the expected counts of the clauses for the same predicate.
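
The counting-and-normalizing step can be sketched as follows. This is an illustration of the plain EM update only: it assumes the successful derivations of each example are already available as lists of clause identifiers, and it omits the failure adjustment that distinguishes FAM (expected counts contributed by failed derivations). The names em_update, head_of, and the clause identifiers are hypothetical.

```python
from collections import defaultdict
from math import prod


def em_update(params, head_of, derivations_per_example):
    """One EM iteration over clause labels (no failure adjustment).

    params: clause id -> current label (assumed positive)
    head_of: clause id -> predicate symbol in the clause head
    derivations_per_example: for each example, a list of derivations,
        each derivation being the list of clause ids it uses
    """
    expected = defaultdict(float)
    for derivations in derivations_per_example:
        # E-step: weight each completion by the product of its clause labels.
        weights = [prod(params[c] for c in d) for d in derivations]
        total = sum(weights)
        for d, w in zip(derivations, weights):
            for c in d:
                expected[c] += w / total
    # M-step: normalize expected counts among clauses for the same predicate.
    totals = defaultdict(float)
    for c, count in expected.items():
        totals[head_of[c]] += count
    return {
        c: expected[c] / totals[head_of[c]] if totals[head_of[c]] > 0 else params[c]
        for c in params
    }


if __name__ == "__main__":
    params = {"base": 0.5, "rec": 0.5}
    head_of = {"base": "nate", "rec": "nate"}
    # Two observations of nate(0) and one of nate(s(0)).
    data = [[["base"]], [["base"]], [["rec", "base"]]]
    print(em_update(params, head_of, data))
```

In the usage example every atom has a single derivation, so one iteration already yields the relative clause frequencies.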
11.5.3 Structure Learning

Given a set of examples E and a language bias B, which determines the set of
possible hypotheses, one searches for a hypothesis $H^* \in B$ such that
1. $H^*$ logically covers the examples E, i.e., $cover(H^*, E)$, and
2. the hypothesis $H^*$ is optimal w.r.t. some scoring function score, i.e., $H^* =
\arg\max_{H \in B} score(H, E)$.
The hypotheses are of the form $(L, \lambda)$, where L is the logical part and $\lambda$ the vector
of parameter values defined in section 11.5.2.
The existing approaches use a heuristic search through the space of hypotheses.
Hill-climbing and beam search are typical methods, applied until the candidate hypothesis satisfies the two conditions defined above. Refinement
operators are applied at each step of the search.
For SLPs, as described in [12], structure learning involves applying a refinement
operator at the theory level (i.e., considering multiple predicates) under entailment.
This is theory revision in inductive logic programming. Because this problem is known to be
very hard, existing approaches have been restricted to learning missing clauses for a
single predicate. Muggleton [7] introduced a two-phase approach that separates the
structure learning aspects from the parameter estimation phase. In a more recent
approach, Muggleton [8] presents an initial attempt to integrate both phases for
single predicate learning.

11.6 Conclusion

Stochastic logic programs provide a simple scheme for representing probability
distributions over structured objects. This chapter provides a tutorial for the use of
SLPs as a means of representing probability distributions over structured objects
such as sequences, graphs, and plans.
SLPs were initially applied to the problem of learning from positive examples
only [6]. This required the implementation of the following function, which defines
the generality of a hypothesis.

$g(H) = \sum_{x \in H} D_X(x).$

The generality is thus the sum of the probability of all instances of hypothesis H.
Clearly such a sum can be infinite. However, if a large enough sample is generated
from DX (implemented as an SLP), then the proportion of the sample entailed by
H gives a good approximation of g(H).
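
A minimal sketch of this sampling estimate, under assumptions of our own: $D_X$ is taken to be the exponential distribution of figure 11.4, and the hypothesis H is the hypothetical one covering the even numbers, for which $g(H) = 2/3$ exactly. The function names are illustrative.

```python
import random


def estimate_generality(sample_instance, covers, n_samples=10_000, seed=0):
    """Estimate g(H) as the proportion of sampled instances covered by H."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if covers(sample_instance(rng)))
    return hits / n_samples


def sample_nate(rng):
    """Draw n with probability 2^-(n+1), as in the SLP of figure 11.4."""
    n = 0
    while rng.random() >= 0.5:  # select the recursive clause with probability 0.5
        n += 1
    return n


def is_even(n):
    """Hypothetical hypothesis H: covers the even natural numbers."""
    return n % 2 == 0


if __name__ == "__main__":
    print(estimate_generality(sample_nate, is_even))
```

The estimate converges to the infinite sum without ever enumerating it, which is the point of the sampling approach.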

Acknowledgments
Our thanks for useful discussions on the topics in this chapter with James Cussens,
Kristian Kersting, Jianzhong Chen, and Hiroaki Watanabe. This work was supported by the Esprit IST project Application of Probabilistic Inductive Logic
Programming II (APRIL II) and the DTI Beacon project, Metalog - Integrated
Machine Learning of Metabolic Networks Applied to Predictive Toxicology.

References
[1] I. Bratko. Prolog for Artificial Intelligence. Addison-Wesley, London, 1986.
[2] W.F. Clocksin and C.S. Mellish. Programming in Prolog. Springer-Verlag, Berlin, 1981.
[3] J. Cussens. Parameter estimation in stochastic logic programs. Machine Learning, 44(3):245–271, 2001.
[4] K. Lari and S.J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56, 1990.
[5] J.W. Lloyd. Foundations of Logic Programming, 2nd edition. Springer-Verlag, Berlin, 1987.
[6] S.H. Muggleton. Learning from positive data. In Proceedings of the International Conference on Inductive Logic Programming, 1997.
[7] S.H. Muggleton. Learning stochastic logic programs. Electronic Transactions in Artificial Intelligence, 4(041), 2000.
[8] S.H. Muggleton. Learning structure and parameters of stochastic logic programs. Electronic Transactions in Artificial Intelligence, 6, 2002.
[9] S.H. Muggleton. Stochastic logic programs. In L. de Raedt, editor, Advances in Inductive Logic Programming, pages 254–264. IOS Press, Amsterdam, 1996. URL http://www.doc.ic.ac.uk/~shm/Papers/slp.pdf.
[10] S.H. Muggleton and C.D. Page. A learnability model for universal representations. Technical Report PRG-TR-3-94, Oxford University Computing Laboratory, Oxford, UK, 1994.
[11] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[12] L. De Raedt and K. Kersting. Probabilistic logic learning. ACM-SIGKDD Explorations, 5(1):31–48, 2003.
[13] T. Sato. A statistical learning method for logic programs with distributional semantics. In Proceedings of the Twelfth International Conference on Logic Programming, pages 715–729, 1995.
[14] Wikipedia, 2006. Wikipedia's page on the Blackjack game, http://en.wikipedia.org/wiki/Blackjack.

12 Markov Logic: A Unifying Framework for Statistical Relational Learning

Pedro Domingos and Matthew Richardson

Interest in statistical relational learning (SRL) has grown rapidly in recent years.
Several key SRL tasks have been identified, and a large number of approaches have
been proposed. Increasingly, a unifying framework is needed to facilitate transfer of
knowledge across tasks and approaches, to compare approaches, and to help bring
structure to the field. We propose Markov logic as such a framework. Syntactically,
Markov logic is indistinguishable from first-order logic, except that each formula
has a weight attached. Semantically, a set of Markov logic formulae represents a
probability distribution over possible worlds, in the form of a log-linear model with
one feature per grounding of a formula in the set, with the corresponding weight.
We show how approaches like probabilistic relational models, knowledge-based
model construction, and stochastic logic programs can be mapped into Markov
logic. We also show how tasks like collective classification, link prediction, link-based clustering, social network modeling, and object identification can be concisely
formulated in Markov logic. Finally, we develop learning and inference algorithms
for Markov logic, and report experimental results on a link prediction task.

12.1 The Need for a Unifying Framework

Many (if not most) real-world application domains are characterized by the presence
of both uncertainty and complex relational structure. Statistical learning focuses
on the former, and relational learning on the latter. Statistical relational learning
(SRL) seeks to combine the power of both. Research in SRL has expanded rapidly
in recent years, both because of the need for it in applications, and because statistical and relational learning have individually matured to the point where combining
them is a feasible research enterprise. A number of key SRL tasks have been identified, including collective classification, link prediction, link-based clustering, social
network modeling, object identification, and others. A large and growing number of
SRL approaches have been proposed, including knowledge-based model construction [55, 39, 29], stochastic logic programs [37, 9], PRISM [51], MACCENT [12],
probabilistic relational models [17], relational Markov models [1], relational Markov
networks [53], relational dependency networks [38], structural logistic regression
[44], relational generation functions [7], constraint logic programming for probabilistic knowledge (CLP(BN)) [50], and others.
While the variety of problems and approaches in the field is valuable, it makes
it difficult for researchers, students, and practitioners to identify, learn, and apply
the essentials. In particular, for the most part, the relationships between different
approaches and their relative strengths and weaknesses remain poorly understood,
and innovations in one task or application do not easily transfer to others, slowing
down progress. There is thus an increasingly pressing need for a unifying framework,
a common language for describing and relating the different tasks and approaches.
To be most useful, such a framework should satisfy the following desiderata:
1. The framework must incorporate both first-order logic and probabilistic graphical
models. Otherwise some current or future SRL approaches will fall outside its
scope.
2. SRL problems should be representable clearly and simply in the framework.
3. The framework must facilitate the use of domain knowledge in SRL. Because
the search space for SRL algorithms is very large even by AI standards, domain
knowledge is critical to success. Conversely, the ability to incorporate rich domain
knowledge is one of the most attractive features of SRL.
4. The framework should facilitate the extension to SRL of techniques from statistical learning, inductive logic programming, probabilistic inference, and logical
inference. This will speed progress in SRL by taking advantage of the large extant
literature in these areas.
In this chapter we propose Markov logic as a framework that we believe meets
all of these desiderata. We begin by briefly reviewing the necessary background
in Markov networks (section 12.2) and first-order logic (section 12.3). We then
introduce Markov logic (section 12.4) and describe how several SRL approaches
and tasks can be formulated in this framework (sections 12.5 and 12.6). Next,
we show how techniques from logic, probabilistic inference, statistics and inductive
logic programming can be used to obtain practical inference and learning algorithms
for Markov logic (sections 12.7 and 12.8). Finally, we illustrate the application of
these algorithms in a real-world link prediction task (section 12.9) and conclude
(section 12.10).

12.2 Markov Networks

A Markov network (also known as a Markov random field) is a model for the joint
distribution of a set of variables $X = (X_1, X_2, \ldots, X_n) \in \mathcal{X}$ [41]. It is composed of
an undirected graph G and a set of potential functions $\phi_k$. The graph has a node
for each variable, and the model has a potential function for each clique in the
graph. A potential function is a non-negative real-valued function of the state of
the corresponding clique. The joint distribution represented by a Markov network
is given by

$$P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}}), \qquad (12.1)$$

where $x_{\{k\}}$ is the state of the kth clique (i.e., the state of the variables that appear in
that clique). Z, known as the partition function, is given by $Z = \sum_{x \in \mathcal{X}} \prod_k \phi_k(x_{\{k\}})$.
Markov networks are often conveniently represented as log-linear models, with each
clique potential replaced by an exponentiated weighted sum of features of the state,
leading to

$$P(X = x) = \frac{1}{Z} \exp\left( \sum_j w_j f_j(x) \right). \qquad (12.2)$$

A feature may be any real-valued function of the state. This chapter will focus on
binary features, $f_j(x) \in \{0, 1\}$. In the most direct translation from the potential-function form (12.1), there is one feature corresponding to each possible state $x_{\{k\}}$
of each clique, with its weight being $\log \phi_k(x_{\{k\}})$. This representation is exponential
in the size of the cliques. However, we are free to specify a much smaller number
of features (e.g., logical functions of the state of the clique), allowing for a more
compact representation than the potential-function form, particularly when large
cliques are present. Markov logic networks (MLNs) will take advantage of this.
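
For very small networks the log-linear form (12.2) can be evaluated by brute-force enumeration. The sketch below is illustrative only: the function name, the two-variable network, and the single "agreement" feature with weight 1.5 are our own choices, and enumeration is feasible only for a handful of variables.

```python
import itertools
import math


def loglinear_distribution(n_vars, weighted_features):
    """Return {state: probability} for binary variables under (12.2).

    weighted_features: list of (weight, feature) pairs, where each feature
    maps a state tuple to 0 or 1.
    """
    states = list(itertools.product([0, 1], repeat=n_vars))
    scores = [
        math.exp(sum(w * f(x) for w, f in weighted_features)) for x in states
    ]
    z = sum(scores)  # partition function
    return {x: s / z for x, s in zip(states, scores)}


if __name__ == "__main__":
    # One logical feature ("the two variables agree") instead of one
    # feature per clique state.
    features = [(1.5, lambda x: int(x[0] == x[1]))]
    for state, p in loglinear_distribution(2, features).items():
        print(state, round(p, 4))
```

A single logical feature here replaces the four per-state features that the direct translation from (12.1) would require.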
Inference in Markov networks is #P-complete [49]. The most widely used method
for approximate inference in Markov networks is Markov chain Monte Carlo
(MCMC) [20], and in particular Gibbs sampling, which proceeds by sampling each
variable in turn given its Markov blanket. (The Markov blanket of a node is the
minimal set of nodes that renders it independent of the remaining network; in a
Markov network, this is simply the node's neighbors in the graph.) Marginal probabilities are computed by counting over these samples; conditional probabilities are
computed by running the Gibbs sampler with the conditioning variables clamped
to their given values. Another popular method for inference in Markov networks is
belief propagation [57].
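
A minimal Gibbs sampler for a binary log-linear model might look as follows. It is a sketch under simplifying assumptions: for brevity it rescores all features when resampling a variable, whereas only the features involving that variable (its Markov blanket) affect the conditional, the others cancelling in the ratio. The function name and default settings are hypothetical.

```python
import math
import random


def gibbs_marginals(n_vars, weighted_features, n_sweeps=5000, burn_in=500, seed=0):
    """Estimate P(X_i = 1) for each binary variable by Gibbs sampling."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_vars)]
    counts = [0] * n_vars

    def score(state):
        return sum(w * f(state) for w, f in weighted_features)

    for sweep in range(n_sweeps + burn_in):
        for i in range(n_vars):
            # Conditional of X_i given all other variables.
            x[i] = 1
            s1 = score(x)
            x[i] = 0
            s0 = score(x)
            p1 = 1.0 / (1.0 + math.exp(s0 - s1))
            x[i] = 1 if rng.random() < p1 else 0
        if sweep >= burn_in:
            for i in range(n_vars):
                counts[i] += x[i]
    return [c / n_sweeps for c in counts]


if __name__ == "__main__":
    features = [
        (1.0, lambda s: s[0]),                # bias on the first variable
        (2.0, lambda s: int(s[0] == s[1])),   # the two variables tend to agree
    ]
    print(gibbs_marginals(2, features))
```

To condition on evidence, one would hold the evidence variables fixed and skip them in the inner loop.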
Maximum likelihood or maximum a posteriori (MAP) estimates of Markov network weights cannot be computed in closed form, but, because the log-likelihood
is a concave function of the weights, they can be found efficiently using standard
gradient-based or quasi-Newton optimization methods [40]. Another alternative is
iterative scaling [13]. Features can also be learned from data, by, for example, greedily constructing conjunctions of atomic features [13].

12.3 First-Order Logic

A first-order knowledge base (KB) is a set of sentences or formulae in first-order logic
[18]. Formulae are constructed using four types of symbols: constants, variables,
functions, and predicates. Constant symbols represent objects in the domain of
interest (e.g., people: Anna, Bob, Chris, etc.). Variable symbols range over the
objects in the domain. Function symbols (e.g., MotherOf) represent mappings from
tuples of objects to objects. Predicate symbols represent relations among objects in
the domain (e.g., Friends) or attributes of objects (e.g., Smokes). An interpretation
specifies which objects, functions, and relations in the domain are represented by
which symbols. Variables and constants may be typed, in which case variables range
only over objects of the corresponding type, and constants can only represent
objects of the corresponding type. For example, the variable x might range over
people (e.g., Anna, Bob, etc.), and the constant C might represent a city (e.g.,
Seattle, Tokyo, etc.).
A term is any expression representing an object in the domain. It can be a
constant, a variable, or a function applied to a tuple of terms. For example, Anna,
x, and GreatestCommonDivisor(x, y) are terms. An atomic formula or atom is a
predicate symbol applied to a tuple of terms (e.g., Friends(x, MotherOf(Anna))).
Formulae are recursively constructed from atomic formulae using logical connectives
and quantifiers. If $F_1$ and $F_2$ are formulae, the following are also formulae: $\neg F_1$
(negation), which is true iff $F_1$ is false; $F_1 \wedge F_2$ (conjunction), which is true iff both
$F_1$ and $F_2$ are true; $F_1 \vee F_2$ (disjunction), which is true iff $F_1$ or $F_2$ is true; $F_1 \Rightarrow F_2$
(implication), which is true iff $F_1$ is false or $F_2$ is true; $F_1 \Leftrightarrow F_2$ (equivalence), which
is true iff $F_1$ and $F_2$ have the same truth-value; $\forall x\, F_1$ (universal quantification),
which is true iff $F_1$ is true for every object x in the domain; and $\exists x\, F_1$ (existential
quantification), which is true iff $F_1$ is true for at least one object x in the domain.
Parentheses may be used to enforce precedence. A positive literal is an atomic
formula; a negative literal is a negated atomic formula. The formulae in a KB are
implicitly conjoined, and thus a KB can be viewed as a single large formula. A
ground term is a term containing no variables. A ground atom or ground predicate
is an atomic formula all of whose arguments are ground terms. A possible world or
Herbrand interpretation assigns a truth value to each possible ground predicate.
A formula is satisfiable iff there exists at least one world in which it is true. The
basic inference problem in first-order logic is to determine whether a knowledge
base KB entails a formula F, i.e., if F is true in all worlds where KB is true
(denoted by $KB \models F$). This is often done by refutation: KB entails F iff $KB \cup \neg F$
is unsatisfiable. (Thus, if a KB contains a contradiction, all formulae trivially follow
from it, which makes painstaking knowledge engineering a necessity.) For automated
inference, it is often convenient to convert formulae to a more regular form, typically
clausal form (also known as conjunctive normal form (CNF)). A KB in clausal form
is a conjunction of clauses, a clause being a disjunction of literals. Every KB in
first-order logic can be converted to clausal form using a mechanical sequence of
steps.$^1$ Clausal form is used in resolution, a sound and refutation-complete inference
procedure for first-order logic [48].

Table 12.1 Example of a first-order knowledge base and MLN. Fr() is short for
Friends(), Sm() for Smokes(), and Ca() for Cancer().

English | First-order logic | Clausal form | Weight
Friends of friends are friends. | $\forall x \forall y \forall z\; Fr(x, y) \wedge Fr(y, z) \Rightarrow Fr(x, z)$ | $\neg Fr(x, y) \vee \neg Fr(y, z) \vee Fr(x, z)$ | 0.7
Friendless people smoke. | $\forall x\; (\neg(\exists y\; Fr(x, y)) \Rightarrow Sm(x))$ | $Fr(x, g(x)) \vee Sm(x)$ | 2.3
Smoking causes cancer. | $\forall x\; Sm(x) \Rightarrow Ca(x)$ | $\neg Sm(x) \vee Ca(x)$ | 1.5
If two people are friends, either both smoke or neither does. | $\forall x \forall y\; Fr(x, y) \Rightarrow (Sm(x) \Leftrightarrow Sm(y))$ | $\neg Fr(x, y) \vee Sm(x) \vee \neg Sm(y)$ | 1.1
 | | $\neg Fr(x, y) \vee \neg Sm(x) \vee Sm(y)$ | 1.1

Inference in first-order logic is only semidecidable. Because of this, knowledge
bases are often constructed using a restricted subset of first-order logic with more
desirable properties. The most widely used restriction is to Horn clauses, which are
clauses containing at most one positive literal. The Prolog programming language
is based on Horn clause logic [34]. Prolog programs can be learned from databases
by searching for Horn clauses that (approximately) hold in the data; this is studied
in the field of inductive logic programming (ILP) [32].
Table 12.1 shows a simple KB and its conversion to clausal form. Notice that,
while these formulae may be typically true in the real world, they are not always
true. In most domains it is very difficult to come up with non-trivial formulae that
are always true, and such formulae capture only a fraction of the relevant knowledge.
Thus, despite its expressiveness, pure first-order logic has limited applicability to
practical AI problems. Many ad hoc extensions to address this have been proposed.
In the more limited case of propositional logic, the problem is well solved by
probabilistic graphical models. The next section describes a way to generalize these
models to the first-order case.
1. This conversion includes the removal of existential quantifiers by Skolemization, which
is not sound in general. However, in finite domains an existentially quantified formula can
simply be replaced by a disjunction of its groundings.


12.4 Markov Logic

A first-order KB can be seen as a set of hard constraints on the set of possible
worlds: if a world violates even one formula, it has zero probability. The basic idea
in Markov logic is to soften these constraints: when a world violates one formula in
the KB it is less probable, but not impossible. The fewer formulae a world violates,
the more probable it is. Each formula has an associated weight that reflects how
strong a constraint it is: the higher the weight, the greater the difference in log
probability between a world that satisfies the formula and one that does not, other
things being equal. We call a set of formulae in Markov logic a Markov logic network.
MLNs define probability distributions over possible worlds [21] as follows.
Definition 12.1
An MLN L is a set of pairs $(F_i, w_i)$, where $F_i$ is a formula in first-order logic and
$w_i$ is a real number. Together with a finite set of constants $C = \{c_1, c_2, \ldots, c_{|C|}\}$,
it defines a Markov network $M_{L,C}$ ((12.1) and (12.2)) as follows:

1. $M_{L,C}$ contains one binary node for each possible grounding of each predicate
appearing in L. The value of the node is 1 if the ground atom is true, and 0
otherwise.
2. $M_{L,C}$ contains one feature for each possible grounding of each formula $F_i$ in L.
The value of this feature is 1 if the ground formula is true, and 0 otherwise. The
weight of the feature is the $w_i$ associated with $F_i$ in L.
The syntax of the formulae in an MLN is the standard syntax of first-order
logic [18]. Free (unquantified) variables are treated as universally quantified at the
outermost level of the formula.
An MLN can be viewed as a template for constructing Markov networks. Given
different sets of constants, it will produce different networks, and these may be of
widely varying size, but all will have certain regularities in structure and parameters,
given by the MLN (e.g., all groundings of the same formula will have the same
weight). We call each of these networks a ground Markov network to distinguish it
from the first-order MLN. From definition 12.1 and (12.1) and (12.2), the probability
distribution over possible worlds x specified by the ground Markov network $M_{L,C}$
is given by

$$P(X = x) = \frac{1}{Z} \exp\left( \sum_i w_i n_i(x) \right) = \frac{1}{Z} \prod_i \phi_i(x_{\{i\}})^{n_i(x)}, \qquad (12.3)$$

where $n_i(x)$ is the number of true groundings of $F_i$ in x, $x_{\{i\}}$ is the state (truth
values) of the atoms appearing in $F_i$, and $\phi_i(x_{\{i\}}) = e^{w_i}$. Notice that, although we
defined MLNs as log-linear models, they could equally well be defined as products
of potential functions, as the second equality above shows. This will be the most
convenient approach in domains with a mixture of hard and soft constraints (i.e.,
where some formulae hold with certainty, leading to zero probabilities for some
worlds).

[Figure 12.1: graph over the ground atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B).]
Figure 12.1 Ground Markov network obtained by applying the last two formulae
in table 12.1 to the constants Anna(A) and Bob(B).

The graphical structure of $M_{L,C}$ follows from definition 12.1: there is an edge
between two nodes of $M_{L,C}$ iff the corresponding ground atoms appear together
in at least one grounding of one formula in L. Thus, the atoms in each ground
formula form a (not necessarily maximal) clique in $M_{L,C}$. Figure 12.1 shows the
graph of the ground Markov network defined by the last two formulae in table 12.1
and the constants Anna and Bob. Each node in this graph is a ground atom (e.g.,
Friends(Anna, Bob)). The graph contains an arc between each pair of atoms that
appear together in some grounding of one of the formulae. $M_{L,C}$ can now be used
to infer the probability that Anna and Bob are friends given their smoking habits,
the probability that Bob has cancer given his friendship with Anna and whether
she has cancer, etc.
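
Because this ground network has only eight atoms, such queries can be answered by enumerating all 256 possible worlds and applying (12.3) directly. The sketch below does this for the three clauses obtained from the last two formulae of table 12.1, with the clause weights 1.5, 1.1, and 1.1 listed there. The function and variable names are ours, and exhaustive enumeration is used only for illustration; it does not scale beyond toy domains.

```python
import itertools
import math

CONSTANTS = ["A", "B"]  # Anna and Bob
ATOMS = (
    [("Sm", c) for c in CONSTANTS]
    + [("Ca", c) for c in CONSTANTS]
    + [("Fr", a, b) for a in CONSTANTS for b in CONSTANTS]
)
WEIGHTS = (1.5, 1.1, 1.1)  # clause weights taken from table 12.1


def clause_counts(w):
    """Number of true groundings of each of the three clauses in world w."""
    pairs = list(itertools.product(CONSTANTS, repeat=2))
    # not Sm(x) or Ca(x)
    n_cancer = sum((not w["Sm", x]) or w["Ca", x] for x in CONSTANTS)
    # not Fr(x, y) or Sm(x) or not Sm(y)
    n_peer_1 = sum(
        (not w["Fr", x, y]) or w["Sm", x] or (not w["Sm", y]) for x, y in pairs
    )
    # not Fr(x, y) or not Sm(x) or Sm(y)
    n_peer_2 = sum(
        (not w["Fr", x, y]) or (not w["Sm", x]) or w["Sm", y] for x, y in pairs
    )
    return n_cancer, n_peer_1, n_peer_2


def world_distribution(weights=WEIGHTS):
    """Enumerate all worlds with their probabilities under (12.3)."""
    worlds = [
        dict(zip(ATOMS, values))
        for values in itertools.product([False, True], repeat=len(ATOMS))
    ]
    scores = [
        math.exp(sum(wt * n for wt, n in zip(weights, clause_counts(w))))
        for w in worlds
    ]
    z = sum(scores)
    return [(w, s / z) for w, s in zip(worlds, scores)]


def conditional(query, evidence):
    """P(query | evidence), both given as predicates over worlds."""
    dist = world_distribution()
    joint = sum(p for w, p in dist if evidence(w) and query(w))
    marginal = sum(p for w, p in dist if evidence(w))
    return joint / marginal


if __name__ == "__main__":
    print(conditional(lambda w: w["Ca", "A"], lambda w: w["Sm", "A"]))
    print(conditional(lambda w: w["Fr", "A", "B"],
                      lambda w: w["Sm", "A"] and not w["Sm", "B"]))
```

The first query is the probability that Anna has cancer given that she smokes; the second is the probability that Anna and Bob are friends given that exactly one of them smokes, which the softened constraint makes unlikely but not impossible.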
Each state of $M_{L,C}$ represents a possible world. A possible world is a set of
objects, a set of functions (mappings from tuples of objects to objects), and a
set of relations that hold between those objects; together with an interpretation,
they determine the truth-value of each ground atom. The following assumptions
ensure that the set of possible worlds for (L, C) is finite, and that $M_{L,C}$ represents
a unique, well-defined probability distribution over those worlds, irrespective of
the interpretation and domain. These assumptions are quite reasonable in most
practical applications, and greatly simplify the use of MLNs. For the remaining
cases, we discuss below the extent to which each one can be relaxed.
Assumption 1
Unique names. Different constants refer to different objects [18].
Assumption 2
Domain closure. The only objects in the domain are those representable using the
constant and function symbols in (L, C) [18].
Assumption 3
Known functions. For each function appearing in L, the value of that function
applied to every possible tuple of arguments is known, and is an element of C.



Table 12.2 Construction of all groundings of a first-order formula under assumptions 1–3

function Ground(F, C)
  inputs: F, a formula in first-order logic
          C, a set of constants
  output: G_F, a set of ground formulae
  calls: CNF(F, C), which converts F to conjunctive normal form, replacing
         existentially quantified formulae by disjunctions of their groundings over C
  F ← CNF(F, C)
  G_F = ∅
  for each clause F_j ∈ F
    G_j = {F_j}
    for each variable x in F_j
      for each clause F_k(x) ∈ G_j
        G_j ← (G_j \ F_k(x)) ∪ {F_k(c_1), F_k(c_2), ..., F_k(c_|C|)},
          where F_k(c_i) is F_k(x) with x replaced by c_i ∈ C
    G_F ← G_F ∪ G_j
  for each ground clause F_j ∈ G_F
    repeat
      for each function f(a_1, a_2, ...) all of whose arguments are constants
        F_j ← F_j with f(a_1, a_2, ...) replaced by c, where c = f(a_1, a_2, ...)
    until F_j contains no functions
  return G_F
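
The variable-substitution loop of table 12.2 can be sketched in Python for a single clause. This is a simplified illustration, not the procedure of the table in full: it assumes the clause is already in clausal form and function-free (so the CNF conversion and the function-replacement loop are omitted), and the literal representation (sign, predicate, arguments) is our own.

```python
import itertools


def ground_clause(literals, variables, constants):
    """Return all groundings of one function-free clause.

    literals: list of (sign, predicate, args), sign True for a positive literal
    variables: names in args to be treated as variables
    constants: the constants to substitute for them
    """
    groundings = []
    for values in itertools.product(constants, repeat=len(variables)):
        theta = dict(zip(variables, values))  # one substitution per tuple
        groundings.append([
            (sign, pred, tuple(theta.get(a, a) for a in args))
            for sign, pred, args in literals
        ])
    return groundings


if __name__ == "__main__":
    # not Fr(x, y) or Sm(x) or not Sm(y), grounded over Anna (A) and Bob (B)
    clause = [(False, "Fr", ("x", "y")), (True, "Sm", ("x",)), (False, "Sm", ("y",))]
    for g in ground_clause(clause, ["x", "y"], ["A", "B"]):
        print(g)
```

A clause with v variables yields $|C|^v$ groundings, which is why the size of the ground network grows quickly with the number of constants.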

This last assumption allows us to replace functions by their values when grounding formulae. Thus the only ground atoms that need to be considered are those
having constants as arguments. The infinite number of terms constructible from all
functions and constants in (L, C) (the Herbrand universe of (L, C)) can be ignored,
because each of those terms corresponds to a known constant in C, and atoms
involving them are already represented as the atoms involving the corresponding
constants. The possible groundings of a predicate in definition 12.1 are thus obtained simply by replacing each variable in the predicate with each constant in C,
and replacing each function term in the predicate by the corresponding constant.
Table 12.2 shows how the groundings of a formula are obtained given assumptions 1–3. If a formula contains more than one clause, its weight is divided equally
among the clauses, and a clause's weight is assigned to each of its groundings.
Assumption 1 (unique names) can be removed by introducing the equality
predicate (Equals(x, y), or x = y for short) and adding the necessary axioms to the
MLN: equality is reflexive, symmetric, and transitive; for each unary predicate P,
$\forall x \forall y\; x = y \Rightarrow (P(x) \Leftrightarrow P(y))$; and similarly for higher-order predicates and functions
[18]. The resulting MLN will have a node for each pair of constants, whose value
is 1 if the constants represent the same object and 0 otherwise; these nodes will
be connected to each other and to the rest of the network by arcs representing the
axioms above. Notice that this allows us to make probabilistic inferences about the
equality of two constants. We have successfully used this as the basis of an approach
to object identification (see section 12.6.5).
If the number u of unknown objects is known, assumption 2 (domain closure) can
be removed simply by introducing u arbitrary new constants. If u is unknown but
finite, assumption 2 can be removed by introducing a distribution over u, grounding
the MLN with each number of unknown objects, and computing the probability of a
formula F as $P(F) = \sum_{u=0}^{u_{max}} P(u) P(F \mid M^u_{L,C})$, where $M^u_{L,C}$ is the ground MLN with
u unknown objects. An infinite u requires extending MLNs to the case $|C| = \infty$.
Let $H_{L,C}$ be the set of all ground terms constructible from the function symbols in
L and the constants in L and C (the Herbrand universe of (L, C)). Assumption 3
(known functions) can be removed by treating each element of $H_{L,C}$ as an additional
constant and applying the same procedure used to remove the unique names
assumption. For example, with a function G(x) and constants A and B, the MLN
will now contain nodes for G(A) = A, G(A) = B, etc. This leads to an infinite number
of new constants, requiring the corresponding extension of MLNs. However, if we
restrict the level of nesting to some maximum, the resulting MLN is still finite.
To summarize, assumptions 1–3 can be removed as long as the domain is finite.
We believe it is possible to extend MLNs to infinite domains (see Jaeger [27]), but
this is an issue of chiefly theoretical interest, and we leave it for future work. In the
remainder of this chapter we proceed under assumptions 1–3, except where noted.
A first-order KB can be transformed into an MLN simply by assigning a weight
to each formula. For example, the clauses and weights in the last two columns of
table 12.1 constitute an MLN. According to this MLN, other things being equal, a
world where n friendless people are nonsmokers is $e^{(2.3)n}$ times less probable than
a world where all friendless people smoke. Notice that all the formulae in table 12.1
are false in the real world as universally quantified logical statements, but capture
useful information on friendships and smoking habits, when viewed as features of
a Markov network. For example, it is well-known that teenage friends tend to have
similar smoking habits [35]. In fact, an MLN like the one in table 12.1 succinctly
represents a type of model that is a staple of social network analysis [54].
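
The factor $e^{(2.3)n}$ follows directly from (12.3). As a short worked step, let $x'$ be a world in which all friendless people smoke and $x$ a world that differs only in that n friendless people are nonsmokers, so that the formula with weight 2.3 has n fewer true groundings in $x$ and all other counts are equal. The index 2 for that formula is our own labeling (it is the second row of table 12.1).

```latex
\frac{P(X = x)}{P(X = x')}
  = \frac{\frac{1}{Z}\exp\left(\sum_i w_i n_i(x)\right)}
         {\frac{1}{Z}\exp\left(\sum_i w_i n_i(x')\right)}
  = \exp\Bigl(w_2 \bigl(n_2(x) - n_2(x')\bigr)\Bigr)
  = e^{-(2.3)\,n}
```

The partition function cancels, so the ratio depends only on the weight and on the difference in the number of true groundings.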
It is easy to see that MLNs subsume essentially all propositional probabilistic
models, as detailed below.
Proposition 12.2
Every probability distribution over discrete or finite-precision numeric variables can
be represented as a Markov logic network.
Proof Consider first the case of Boolean variables $(X_1, X_2, \ldots, X_n)$. Define a
predicate of zero arity $R_h$ for each variable $X_h$, and include in the MLN L a
formula for each possible state of $(X_1, X_2, \ldots, X_n)$. This formula is a conjunction
of n literals, with the hth literal being $R_h()$ if $X_h$ is true in the state, and $\neg R_h()$
otherwise. The formula's weight is $\log P(X_1, X_2, \ldots, X_n)$. (If some states have zero
probability, use instead the product form (see equation 12.3), with $\phi_i()$ equal to the
probability of the ith state.) Since all predicates in L have zero arity, L defines the
same Markov network $M_{L,C}$ irrespective of C, with one node for each variable $X_h$.
For any state, the corresponding formula is true and all others are false, and thus
(12.3) represents the original distribution (notice that Z = 1). The generalization
to arbitrary discrete variables is straightforward, by defining a zero-arity predicate
for each value of each variable. Similarly for finite-precision numeric variables, by
noting that they can be represented as Boolean vectors.
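The Boolean case of the proof can be checked numerically. The following is a minimal sketch, assuming a made-up joint distribution `dist` over two variables with no zero-probability states; the helper names `mln_from_distribution` and `mln_probability` are hypothetical and are not part of any MLN system.

```python
import itertools
import math

def mln_from_distribution(dist):
    """Return one (state, weight) pair per state, with weight log P(state).

    `dist` maps each tuple of Booleans to a strictly positive probability.
    Each pair stands for the conjunction of literals that is true only in
    that state.
    """
    return [(state, math.log(p)) for state, p in dist.items()]

def mln_probability(formulas, world):
    """Evaluate (12.3): exp(sum of weights of the true formulas) / Z."""
    def score(x):
        # The conjunction for `state` is true only in the world equal to `state`.
        return math.exp(sum(w for state, w in formulas if state == x))
    n = len(world)
    z = sum(score(x) for x in itertools.product([False, True], repeat=n))
    return score(world) / z

# Hypothetical joint distribution over two Boolean variables (X1, X2).
dist = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.3,
    (True, True): 0.4,
}
formulas = mln_from_distribution(dist)
recovered = {x: mln_probability(formulas, x) for x in dist}
```

Because exactly one formula is true in each world, the unnormalized score of a world equals its original probability, so the partition function sums to 1 and `recovered` matches `dist` up to rounding.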
Of course, compact factored models like Markov networks and Bayesian networks
can still be represented compactly by MLNs, by defining formulae for the
corresponding factors (arbitrary features in Markov networks, and states of a node
and its parents in Bayesian networks).2
First-order logic (with assumptions 1-3 above) is the special case of Markov logic
obtained when all weights are equal and tend to infinity, as described below.
Proposition 12.3
Let KB be a satisfiable knowledge base, L be the MLN obtained by assigning
weight w to every formula in KB, C be the set of constants appearing in KB,
$P_w(x)$ be the probability assigned to a (set of) possible world(s) x by $M_{L,C}$, $X_{KB}$
be the set of worlds that satisfy KB, and F be an arbitrary formula in first-order
logic. Then:
1. $\forall x \in X_{KB}\ \lim_{w \to \infty} P_w(x) = |X_{KB}|^{-1}$
   $\forall x \notin X_{KB}\ \lim_{w \to \infty} P_w(x) = 0$
2. For all F, $KB \models F$ iff $\lim_{w \to \infty} P_w(F) = 1$
Proof Let k be the number of ground formulae in $M_{L,C}$. By (12.3), if $x \in X_{KB}$,
then $P_w(x) = e^{kw}/Z$, and if $x \notin X_{KB}$ then $P_w(x) \le e^{(k-1)w}/Z$. Thus all
$x \in X_{KB}$ are equiprobable and $\lim_{w \to \infty} P(X \setminus X_{KB})/P(X_{KB}) \le \lim_{w \to \infty} (|X \setminus X_{KB}|/|X_{KB}|)\, e^{-w} = 0$,
proving part 1. By definition of entailment, $KB \models F$ iff
every world that satisfies KB also satisfies F. Therefore, letting $X_F$ be the set of
worlds that satisfy F, if $KB \models F$, then $X_{KB} \subseteq X_F$ and $P_w(F) = \sum_{x \in X_F} P_w(x) \ge P_w(X_{KB})$.
Since, from part 1, $\lim_{w \to \infty} P_w(X_{KB}) = 1$, this implies that if $KB \models F$,
then $\lim_{w \to \infty} P_w(F) = 1$. The inverse direction of part 2 is proved by noting that
if $\lim_{w \to \infty} P_w(F) = 1$, then every world with nonzero probability must satisfy F,
and this includes every world in $X_{KB}$.
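The limit in Proposition 12.3 can be observed on a small case. The sketch below assumes a hypothetical KB over three zero-arity atoms, R => S and S => T, so that worlds can be enumerated directly; the function names are illustrative only.

```python
import itertools
import math

# Hypothetical KB over three zero-arity atoms (R, S, T): R => S and S => T.
KB = [
    lambda r, s, t: (not r) or s,
    lambda r, s, t: (not s) or t,
]

def world_probabilities(w):
    """P_w(x) from (12.3) when every formula in KB has the same weight w."""
    worlds = list(itertools.product([False, True], repeat=3))
    scores = {x: math.exp(w * sum(f(*x) for f in KB)) for x in worlds}
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}

def formula_probability(w, formula):
    """P_w(F): total probability of the worlds in which F holds."""
    return sum(p for x, p in world_probabilities(w).items() if formula(*x))

# Worlds that satisfy every formula in KB (the set X_KB).
satisfying = [x for x in itertools.product([False, True], repeat=3)
              if all(f(*x) for f in KB)]
query = lambda r, s, t: (not r) or t   # R => T, which KB entails

for w in (1.0, 5.0, 20.0):
    probs = world_probabilities(w)
    print(w, probs[satisfying[0]], formula_probability(w, query))
```

As w grows, the probability of each satisfying world approaches 1/|X_KB| (here 1/4) and the probability of the entailed query approaches 1, as parts 1 and 2 of the proposition state.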
In other words, in the limit of all equal infinite weights, the MLN represents a
uniform distribution over the worlds that satisfy the KB, and all entailment queries
can be answered by computing the probability of the query formula and checking
whether it is 1. Even when weights are finite, first-order logic is embedded in
Markov logic in the following sense. Assume without loss of generality that all
weights are non-negative. (A formula with a negative weight w can be replaced
by its negation with weight $-w$.) If the KB composed of the formulae in an
MLN L (negated, if their weight is negative) is satisfiable, then, for any C, the
satisfying assignments are the modes of the distribution represented by $M_{L,C}$. This
is because the modes are the worlds x with maximum $\sum_i w_i n_i(x)$ (see equation 12.3), and
this expression is maximized when all groundings of all formulae are true (i.e., the
KB is satisfied). Unlike an ordinary first-order KB, however, an MLN can produce
useful results even when it contains contradictions. An MLN can also be obtained
by merging several KBs, even if they are partly incompatible. This is potentially
useful in areas like the Semantic Web [2] and mass collaboration [46].

2. While some conditional independence structures can be compactly represented with
directed graphs but not with undirected ones, they still lead to compact models in the
form of Equation 12.3 (i.e., as products of potential functions).

It is interesting to see a simple example of how Markov logic generalizes first-order
logic. Consider an MLN containing the single formula $\forall x\, R(x) \Rightarrow S(x)$ with weight
w, and C = {A}. This leads to four possible worlds: $\{\neg R(A), \neg S(A)\}$, $\{\neg R(A), S(A)\}$,
$\{R(A), \neg S(A)\}$, and $\{R(A), S(A)\}$. From (12.3) we obtain that $P(\{R(A), \neg S(A)\}) = 1/(3e^{w} + 1)$
and the probability of each of the other three worlds is $e^{w}/(3e^{w} + 1)$.
(The denominator is the partition function Z; see section 12.2.) Thus, if w > 0, the
effect of the MLN is to make the world that is inconsistent with $\forall x\, R(x) \Rightarrow S(x)$
less likely than the other three. From the probabilities above we obtain that
$P(S(A) \mid R(A)) = 1/(1 + e^{-w})$. When $w \to \infty$, $P(S(A) \mid R(A)) \to 1$, recovering the
logical entailment.
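The four-world example can be reproduced by direct enumeration. This is a small sketch under the assumptions of the example (one formula, one constant A); the function names are hypothetical.

```python
import itertools
import math

def world_distribution(w):
    """Distribution over (R(A), S(A)) for the formula R(x) => S(x) with weight w."""
    worlds = list(itertools.product([False, True], repeat=2))
    # The single grounding R(A) => S(A) contributes w when it is true, 0 otherwise.
    scores = {(r, s): math.exp(w * ((not r) or s)) for r, s in worlds}
    z = sum(scores.values())   # partition function, equal to 3 e^w + 1
    return {x: v / z for x, v in scores.items()}

def p_s_given_r(w):
    """P(S(A) | R(A)) obtained by conditioning the joint distribution."""
    p = world_distribution(w)
    return p[(True, True)] / (p[(True, True)] + p[(True, False)])

w = 1.5
joint = world_distribution(w)
conditional = p_s_given_r(w)
```

Here `joint[(True, False)]` agrees with 1/(3e^w + 1), `conditional` agrees with 1/(1 + e^{-w}), and calling `p_s_given_r` with a large weight returns a value close to 1.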
In practice, we have found it useful to add each predicate to the MLN as a unit
clause. In other words, for each predicate $R(x_1, x_2, \ldots)$ appearing in the MLN,
we add the formula $\forall x_1, x_2, \ldots\ R(x_1, x_2, \ldots)$ with some weight $w_R$. The weight
of a unit clause can (roughly speaking) capture the marginal distribution of the
corresponding predicate, leaving the weights of the non-unit clauses free to model
only dependencies between predicates.
When manually constructing an MLN or interpreting a learned one, it is useful
to have an intuitive understanding of the weights. The weight of a formula F is
simply the log odds between a world where F is true and a world where F is false,
other things being equal. However, if F shares variables with other formulae, as
will typically be the case, it may not be possible to keep the truth-values of those
formulae unchanged while reversing F's. In this case there is no longer a one-to-one
correspondence between weights and probabilities of formulae.3 Nevertheless, the
probabilities of all formulae collectively determine all weights, if we view them
as constraints on a maximum entropy distribution, or treat them as empirical
probabilities and learn the maximum likelihood weights (the two are equivalent)
[13]. Thus a good way to set the weights of an MLN is to write down the probability
with which each formula should hold, treat these as empirical frequencies, and learn
the weights from them using the algorithm in section 12.8. Conversely, the weights
in a learned MLN can be viewed as collectively encoding the empirical formula
probabilities.

3. This is an unavoidable side effect of the power and flexibility of Markov networks. In
Bayesian networks, parameters are probabilities, but at the cost of greatly restricting the
ways in which the distribution may be factored. In particular, potential functions must be
conditional probabilities, and the directed graph must have no cycles. The latter condition
is particularly troublesome to enforce in relational extensions [53].

The size of ground Markov networks can be vastly reduced by having typed
constants and variables, and only grounding variables to constants of the same
type. However, even in this case the size of the network may be extremely large.
Fortunately, many inferences do not require grounding the entire network, as we
will see in section 12.7.

12.5

SRL Approaches
Because of the simplicity and generality of Markov logic, many representations used
in SRL can be easily mapped into it. In this section, we informally do this for a
representative sample of these approaches. The goal is not to capture all of their
many details, but rather to help bring structure to the field. Further, converting
these representations to Markov logic brings a number of new capabilities and
advantages, and we also discuss these.
12.5.1

Knowledge-Based Model Construction

Knowledge-based model construction (KBMC) is a combination of logic programming
and Bayesian networks [55, 39, 29]. As in Markov logic, nodes in KBMC represent
ground predicates. Given a Horn KB, KBMC answers a query by finding all
possible backward-chaining proofs of the query and evidence predicates from each
other, constructing a Bayesian network over the ground predicates in the proofs,
and performing inference over this network. The parents of a predicate node in
the network are deterministic AND nodes representing the bodies of the clauses
that have that node as head. The conditional probability of the node given these
is specified by a combination function (e.g., noisy OR, logistic regression, arbitrary
conditional probability table (CPT)). Markov logic generalizes KBMC by allowing
arbitrary formulas (not just Horn clauses) and inference in any direction. It
also sidesteps the thorny problem of avoiding cycles in the Bayesian networks
constructed by KBMC, and obviates the need for ad hoc combination functions for
clauses with the same consequent.
A KBMC model can be translated into Markov logic by writing down a set of
formulae for each first-order predicate $P_k(\ldots)$ in the domain. Each formula is a
conjunction containing $P_k(\ldots)$ and one literal per parent of $P_k(\ldots)$ (i.e., per
first-order predicate appearing in a Horn clause having $P_k(\ldots)$ as the consequent).
A subset of these literals are negated; there is one formula for each possible
combination of positive and negative literals. The weight of the formula is $w =
\log[p/(1 - p)]$, where p is the conditional probability of the child predicate when the
corresponding conjunction of parent literals is true, according to the combination
function used. If the combination function is logistic regression, it can be represented
using only a linear number of formulae, taking advantage of the fact that a logistic
regression model is a (conditional) Markov network with a binary clique between
each predictor and the response. Noisy OR can similarly be represented with a
linear number of parents.
12.5.2

Other Logic Programming Approaches

Stochastic logic programs (SLPs) [37, 9] are a combination of logic programming
and log-linear models. Puech and Muggleton [45] showed that SLPs are a special
case of KBMC, and thus they can be converted into Markov logic in the same
way. Like Markov logic, SLPs have one coefficient per clause, but they represent
distributions over Prolog proof trees rather than over predicates; the latter have
to be obtained by marginalization. Similar remarks apply to a number of other
representations that are essentially equivalent to SLPs, like independent choice
logic [43] and PRISM [51].
MACCENT [12] is a system that learns log-linear models with first-order features;
each feature is a conjunction of a class and a Prolog query (clause with empty head).
A key difference between MACCENT and Markov logic is that MACCENT is a
classification system (i.e., it predicts the conditional distribution of an object's class
given its properties), while an MLN represents the full joint distribution of a set of
predicates. Like any probability estimation approach, Markov logic can be used for
classification simply by issuing the appropriate conditional queries.4 In particular,
a MACCENT model can be converted into Markov logic simply by defining a
class predicate (as in section 12.6.1), adding the corresponding features and their
weights to the MLN, and adding a formula with infinite weight stating that each
object must have exactly one class. (This fails to model the marginal distribution
of the nonclass predicates, which is not a problem if only classification queries
will be issued.) MACCENT can make use of deterministic background knowledge
in the form of Prolog clauses; these can be added to the MLN as formulae with
infinite weight. In addition, Markov logic allows uncertain background knowledge
(via formulae with finite weights). As described in section 12.6.1, MLNs can be
used for collective classification, where the classes of different objects can depend
on each other; MACCENT, which requires that each object be represented in a
separate Prolog KB, does not have this capability.
Constraint logic programming (CLP) is an extension of logic programming where
variables are constrained instead of being bound to specific values during inference [31].
Probabilistic CLP generalizes SLPs to CLP [47], and CLP(BN) combines CLP with
Bayesian networks [50]. Unlike in Markov logic, constraints in CLP(BN) are hard
(i.e., they cannot be violated; rather, they define the form of the probability distribution).

4. Conversely, joint distributions can be built up from classifiers (e.g., [23]), but this would
be a significant extension of MACCENT.


12.5.3

Probabilistic Relational Models

Probabilistic relational models (PRMs) [17] are a combination of frame-based
systems and Bayesian networks. PRMs can be converted into Markov logic by
defining a predicate S(x, v) for each (propositional or relational) attribute of each
class, where S(x, v) means "The value of attribute S in object x is v." A PRM is
then translated into an MLN by writing down a formula for each line of each
(class-level) CPT and value of the child attribute. The formula is a conjunction of literals
stating the parent values and a literal stating the child value, and its weight is the
logarithm of $P(x \mid \mathit{Parents}(x))$, the corresponding entry in the CPT. In addition,
the MLN contains formulae with infinite weight stating that each attribute must
take exactly one value. This approach handles all types of uncertainty in PRMs
(attribute, reference, and existence uncertainty).
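The CPT translation can be sketched for a single attribute with a single parent. The example below assumes a made-up CPT for a Grade attribute given a Difficulty attribute; both attributes and the helper names are hypothetical. The infinite-weight "exactly one value" formulae are modeled implicitly by enumerating only worlds in which Grade takes a single value.

```python
import math

# Hypothetical class-level CPT: P(Grade | Difficulty).
cpt = {
    "easy": {"A": 0.7, "B": 0.3},
    "hard": {"A": 0.2, "B": 0.8},
}

def cpt_to_formulas(cpt):
    """One weighted formula per CPT line and child value.

    Each entry ((u, v), w) stands for the conjunction
    Difficulty(x, u) ^ Grade(x, v) with weight w = log P(v | u).
    """
    return [((u, v), math.log(p))
            for u, row in cpt.items()
            for v, p in row.items()]

def conditional(formulas, parent_value):
    """P(Grade = v | Difficulty = parent_value) under the resulting MLN.

    Only worlds where Grade has exactly one value are considered, which is
    what the infinite-weight formulae would enforce.
    """
    scores = {v: math.exp(w) for (u, v), w in formulas if u == parent_value}
    z = sum(scores.values())
    return {v: s / z for v, s in scores.items()}

formulas = cpt_to_formulas(cpt)
grade_given_hard = conditional(formulas, "hard")
```

With the parent value fixed, exactly one formula is true for each child value, so the normalized scores reproduce the CPT row.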
As Taskar et al. [53] point out, the need to avoid cycles in PRMs causes significant
representational and computational difficulties. Inference in PRMs is done by
creating the complete ground network, which limits their scalability. PRMs require
specifying a complete conditional model for each attribute of each class, which in
large complex domains can be quite burdensome. In contrast, Markov logic creates
a complete joint distribution from whatever number of first-order features the user
chooses to specify.
12.5.4

Relational Markov Networks

Relational Markov networks (RMNs) use conjunctive database queries as clique
templates [53]. They do not provide a language for defining features. As a result,
by default RMNs require a feature for every possible state of a clique, making
them exponential in clique size and limiting the complexity of dependencies they
can model. Markov logic provides first-order logic as a powerful language for
specifying features. Specifying the features also indirectly specifies the cliques,
which can be very large as long as the number of relevant features (i.e., formulae)
is tractable. Additionally, Markov logic generalizes RMNs by allowing uncertainty
over arbitrary relations (not just attributes of individual objects). RMNs are trained
discriminatively, and do not specify a complete joint distribution for the variables
in the model. Discriminative training of MLNs is straightforward [52]. RMNs
use MAP estimation with belief propagation for inference, which makes learning
quite slow, despite the simplified discriminative setting; both pseudo-likelihood
optimization and the discriminative training described in Singla and Domingos [52]
are presumably much faster. To date, no structure-learning algorithms for RMNs
have been proposed. MLN structure can be learned using standard inductive logic
programming (ILP) techniques, as described later in this chapter, or by directly
optimizing pseudo-likelihood, as described in Kok and Domingos [30].

12.5.5

Structural Logistic Regression

In structural logistic regression (SLR) [44], the predictors are the output of SQL
queries over the input data. In the same way that a logistic regression model can be
viewed as a discriminatively trained Markov network, an SLR model can be viewed
as a discriminatively trained MLN.5
12.5.6

Relational Dependency Networks

In a relational dependency network (RDN), each node's probability conditioned on
its Markov blanket is given by a decision tree [38]. Every RDN has a corresponding
MLN in the same way that every dependency network has a corresponding Markov
network, given by the stationary distribution of a Gibbs sampler operating on it
[23].
12.5.7

Plates and Probabilistic Entity Relationship Models

Large graphical models with repeated structure are often compactly represented
using plates [4]. Markov logic allows plates to be specified using universal
quantification. In addition, it allows individuals and their relations to be explicitly
represented (see Cussens [8]), and context-specific independences to be compactly written
down, instead of left implicit in the node models. More recently, Heckerman et al.
[24] have proposed probabilistic entity relationship (ER) models, a language based
on ER models that combines the features of plates and PRMs; this language can
be mapped into Markov logic in the same way that ER models can be mapped into
first-order logic. Probabilistic ER models allow logical expressions as constraints
on how ground networks are constructed, but the truth-values of these expressions
have to be known in advance; Markov logic allows uncertainty over all logical expressions.
12.5.8

BLOG

Milch et al. [36] have proposed a language, called BLOG (Bayesian Logic), designed
to avoid making the unique names and domain closure assumptions. A BLOG
program specifies procedurally how to generate a possible world, and does not allow
arbitrary first-order knowledge to be easily incorporated. Also, it only specifies the
structure of the model, leaving the parameters to be specified by external calls.
BLOG models are directed graphs and need to avoid cycles, which substantially
complicates their design. We saw in the previous section how to remove the
unique names and domain closure assumptions in Markov logic. (When there are
unknown objects of multiple types, a random variable for the number of each
type is introduced.) Inference about an object's attributes, rather than those of
its observations, can be done simply by having variables for objects as well as for
their observations (e.g., for books as well as citations to them). To date, no learning
algorithms or practical inference algorithms for BLOG have been proposed.

5. Use of SQL aggregates requires that their definitions be imported into Markov logic.

12.6

SRL Tasks
Many SRL tasks can be concisely formulated in Markov logic, making it possible to
see how they relate to each other, and to develop algorithms that are simultaneously
applicable to all. In this section we exemplify this with five key tasks: collective
classification, link prediction, link-based clustering, social network modeling, and
object identification.
12.6.1

Collective Classification

The goal of ordinary classification is to predict the class of an object given its
attributes. Collective classification also takes into account the classes of related
objects (e.g., [6, 53, 38]). Attributes can be represented in Markov logic as predicates
of the form A(x, v), where A is an attribute, x is an object, and v is the value
of A in x. The class is a designated attribute C, representable by C(x, v), where
v is x's class. Classification is now simply the problem of inferring the truth-value
of C(x, v) for all x and v of interest given all known A(x, v). Ordinary
classification is the special case where $C(x_i, v)$ and $C(x_j, v)$ are independent for all
$x_i$ and $x_j$ given the known A(x, v). In collective classification, the Markov blanket
of $C(x_i, v)$ includes other $C(x_j, v)$, even after conditioning on the known A(x, v).
Relations between objects are represented by predicates of the form $R(x_i, x_j)$. A
number of interesting generalizations are readily apparent; for example, $C(x_i, v)$ and
$C(x_j, v)$ may be indirectly dependent via unknown predicates, possibly including the
$R(x_i, x_j)$ predicates themselves.
12.6.2

Link Prediction

The goal of link prediction is to determine whether a relation exists between two
objects of interest (e.g., whether Anna is Bob's Ph.D. advisor) from the properties of
those objects and possibly other known relations (e.g., see Popescul and Ungar [44]).
The formulation of this problem in Markov logic is identical to that of collective
classification, with the only difference that the goal is now to infer the value of
$R(x_i, x_j)$ for all object pairs of interest, instead of C(x, v). The task used in our
experiments is an example of link prediction (see section 12.9).

12.6.3

Link-Based Clustering

The goal of clustering is to group together objects with similar attributes. In
model-based clustering, we assume a generative model $P(X) = \sum_{C} P(C)\, P(X \mid C)$, where
X is an object, C ranges over clusters, and $P(C \mid X)$ is X's degree of membership
in cluster C. In link-based clustering, objects are clustered according to their links
(e.g., objects that are more closely related are more likely to belong to the same
cluster), and possibly according to their attributes as well (e.g., see Flake et al.
[16]). This problem can be formulated in Markov logic by postulating an unobserved
predicate C(x, v) with the meaning "x belongs to cluster v," and having formulas
in the MLN involving this predicate and the observed ones (e.g., $R(x_i, x_j)$ for links
and A(x, v) for attributes). Link-based clustering can now be performed by learning
the parameters of the MLN, and cluster memberships are given by the probabilities
of the C(x, v) predicates conditioned on the observed ones.
12.6.4

Social Network Modeling

Social networks are graphs where nodes represent social actors (e.g., people) and
arcs represent relations between them (e.g., friendship). Social network analysis
[54] is concerned with building models relating actors' properties and their links.
For example, the probability of two actors forming a link may depend on the
similarity of their attributes, and conversely two linked actors may be more likely
to have certain properties. These models are typically Markov networks, and can
be concisely represented by formulas like $\forall x \forall y \forall v\ R(x, y) \Rightarrow (A(x, v) \Leftrightarrow A(y, v))$,
where x and y are actors, R(x, y) is a relation between them, A(x, v) represents
an attribute of x, and the weight of the formula captures the strength of the
correlation between the relation and the attribute similarity. For example, a model
stating that friends tend to have similar smoking habits can be represented by the
formula $\forall x \forall y\ \mathit{Friends}(x, y) \Rightarrow (\mathit{Smokes}(x) \Leftrightarrow \mathit{Smokes}(y))$ (table 12.1). As well as
encompassing existing social network models, Markov logic allows richer ones to
be easily stated (e.g., by writing formulas involving multiple types of relations and
multiple attributes, as well as more complex dependencies between them).
12.6.5

Object Identification

Object identification (also known as record linkage, deduplication, and others) is
the problem of determining which records in a database refer to the same real-world
entity (e.g., which entries in a bibliographic database represent the same
publication) [56]. This problem is of crucial importance to many companies, government
agencies, and large-scale scientific projects. One way to represent it in Markov
logic is by removing the unique names assumption as described in section 12.4,
i.e., by defining a predicate Equals(x, y) (or x = y for short) with the meaning
"x represents the same real-world entity as y." This predicate is applied both to
records and their fields (e.g., "ICML" = "Intl. Conf. on Mach. Learn."). The
dependencies between record matches and field matches can then be represented by
formulas like $\forall x \forall y\ x = y \Leftrightarrow f_i(x) = f_i(y)$, where x and y are records and $f_i(x)$
is a function returning the value of the ith field of record x. We have successfully
applied this approach to deduplicating the Cora database of computer science
papers [52]. Because it allows information to propagate from one match decision (i.e.,
one grounding of x = y) to another via fields that appear in both pairs of records,
it effectively performs collective object identification, and in our experiments
outperformed the traditional method of making each match decision independently of
all others. For example, matching two references may allow us to determine that
"ICML" and "MLC" represent the same conference, which in turn may help us to
match another pair of references where one contains "ICML" and the other "MLC."
Markov logic also allows additional information to be incorporated into a
deduplication system easily, modularly, and uniformly. For example, transitive closure is
incorporated by adding the formula $\forall x \forall y \forall z\ x = y \wedge y = z \Rightarrow x = z$, with a weight
that can be learned from data.

12.7

Inference
We now show how inference in Markov logic can be carried out. Markov logic can
answer arbitrary queries of the form "What is the probability that formula $F_1$ holds
given that formula $F_2$ does?" If $F_1$ and $F_2$ are two formulae in first-order logic, C
is a finite set of constants including any constants that appear in $F_1$ or $F_2$, and L
is an MLN, then

$$
\begin{aligned}
P(F_1 \mid F_2, L, C) &= P(F_1 \mid F_2, M_{L,C}) \\
&= \frac{P(F_1 \wedge F_2 \mid M_{L,C})}{P(F_2 \mid M_{L,C})} \\
&= \frac{\sum_{x \in X_{F_1} \cap X_{F_2}} P(X = x \mid M_{L,C})}{\sum_{x \in X_{F_2}} P(X = x \mid M_{L,C})},
\end{aligned}
\tag{12.4}
$$

where $X_{F_i}$ is the set of worlds where $F_i$ holds, and $P(x \mid M_{L,C})$ is given by (12.3).
Ordinary conditional queries in graphical models are the special case of (12.4) where
all predicates in $F_1$, $F_2$, and L are zero-arity and the formulae are conjunctions.
The question of whether a knowledge base KB entails a formula F in first-order
logic is the question of whether $P(F \mid L_{KB}, C_{KB,F}) = 1$, where $L_{KB}$ is the MLN
obtained by assigning infinite weight to all the formulae in KB, and $C_{KB,F}$ is the
set of all constants appearing in KB or F. The question is answered by computing
$P(F \mid L_{KB}, C_{KB,F})$ by (12.4), with $F_2$ = True.
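For very small domains, (12.4) can be evaluated by enumerating every possible world. The sketch below is a brute-force illustration only, not an inference method for realistic networks; the function `conditional_probability`, the two-constant domain, and the weight 2.0 are all assumptions made for the example.

```python
import itertools
import math

def conditional_probability(atoms, weighted_formulas, f1, f2):
    """Brute-force evaluation of (12.4) over all truth assignments to `atoms`.

    weighted_formulas: list of (weight, n), where n(world) is the number of
    true groundings of the formula in `world` (a dict atom -> bool).
    f1, f2: functions world -> bool for the query and the evidence.
    Assumes at least one world satisfies f2.
    """
    numerator = denominator = 0.0
    for values in itertools.product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if not f2(world):
            continue
        # Unnormalized probability from (12.3); Z cancels in the ratio.
        score = math.exp(sum(w * n(world) for w, n in weighted_formulas))
        denominator += score
        if f1(world):
            numerator += score
    return numerator / denominator

# Hypothetical MLN: forall x R(x) => S(x) with weight 2.0, constants {A, B}.
constants = ["A", "B"]
atoms = [(pred, c) for pred in ("R", "S") for c in constants]
implies = lambda world: sum((not world[("R", c)]) or world[("S", c)]
                            for c in constants)
mln = [(2.0, implies)]

p = conditional_probability(
    atoms, mln,
    f1=lambda world: world[("S", "A")],
    f2=lambda world: world[("R", "A")],
)
```

In this example the atoms for B are independent of those for A, so `p` equals 1/(1 + e^{-2}), the same value the single-constant example gives for P(S(A) | R(A)). The loop visits 2^n worlds for n ground atoms, which is why this approach is limited to toy domains.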
Computing (12.4) directly will be intractable in all but the smallest domains.
Since Markov logic inference subsumes probabilistic inference, which is #P-complete,
and logical inference in finite domains, which is NP-complete, no better
results can be expected. However, many of the large number of techniques for
efficient inference in either case are applicable to Markov logic. Because Markov
logic allows fine-grained encoding of knowledge, including context-specific
independences, inference in it may in some cases be more efficient than inference in an
ordinary graphical model for the same domain. On the logic side, the probabilistic
semantics of Markov logic allows for approximate inference, with the corresponding
potential gains in efficiency.

Table 12.3 Network construction for inference in Markov logic

function ConstructNetwork(F1, F2, L, C)
  inputs: F1, a set of ground atoms with unknown truth-values (the "query")
          F2, a set of ground atoms with known truth-values (the "evidence")
          L, a Markov logic network
          C, a set of constants
  output: M, a ground Markov network
  calls:  MB(q), the Markov blanket of q in M_{L,C}
  G ← F1
  while F1 ≠ ∅
    for all q ∈ F1
      if q ∉ F2
        F1 ← F1 ∪ (MB(q) \ G)
        G ← G ∪ MB(q)
      F1 ← F1 \ {q}
  return M, the ground Markov network composed of all nodes in G, all arcs between
         them in M_{L,C}, and the features and weights on the corresponding cliques

In principle, $P(F_1 \mid F_2, L, C)$ can be approximated using an MCMC algorithm
that rejects all moves to states where $F_2$ does not hold, and counts the number of
samples in which $F_1$ holds. However, even this is likely to be too slow for arbitrary
formulae. Instead, we provide an inference algorithm for the case where $F_1$ and $F_2$
are conjunctions of ground literals. While less general than (12.4), this is the most
frequent type of query in practice, and the algorithm we provide answers it far more
efficiently than a direct application of (12.4). Investigating lifted inference (where
queries containing variables are answered without grounding them) is an important
direction for future work (see Jaeger [26] and Poole [42] for initial results). The
algorithm proceeds in two phases, analogous to knowledge-based model construction
[55]. The first phase returns the minimal subset M of the ground Markov network
required to compute $P(F_1 \mid F_2, L, C)$. The algorithm for this is shown in table 12.3.
The size of the network returned may be further reduced, and the algorithm sped
up, by noticing that any ground formula which is made true by the evidence can be
ignored, and the corresponding arcs removed from the network. In the worst case,
the network contains $O(|C|^a)$ nodes, where a is the largest predicate arity in the
domain, but in practice it may be much smaller.
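The node-collection loop of table 12.3 can be sketched as a worklist traversal. This is an illustrative version only: it returns the node set G rather than a full ground network, and it assumes the Markov blanket is supplied as a function; the chain-shaped `blankets` dictionary is made up for the example.

```python
def construct_network(query, evidence, markov_blanket):
    """Return the set of ground atoms G needed to answer the query.

    query: iterable of ground atoms with unknown truth values (F1).
    evidence: set of ground atoms with known truth values (F2).
    markov_blanket: function atom -> set of atoms, standing in for MB(q).
    """
    frontier = set(query)   # atoms still to be processed (F1 in table 12.3)
    nodes = set(query)      # atoms collected so far (G in table 12.3)
    while frontier:
        q = frontier.pop()              # F1 <- F1 \ {q}
        if q not in evidence:
            blanket = markov_blanket(q)
            frontier |= blanket - nodes  # F1 <- F1 U (MB(q) \ G)
            nodes |= blanket             # G <- G U MB(q)
    return nodes

# Hypothetical ground network shaped like a chain a - b - c - d - e.
blankets = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
nodes = construct_network({"a"}, {"c"}, blankets.__getitem__)
```

With evidence on `c`, the traversal stops there and never reaches `d` or `e`, so only the atoms `a`, `b`, and `c` are included in the network.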
The second phase performs inference on this network, with the nodes in $F_2$ set
to their values in $F_2$. Our implementation uses Gibbs sampling, but any inference
method may be employed. The basic Gibbs step consists of sampling one ground
atom given its Markov blanket. The Markov blanket of a ground atom is the set
of ground predicates that appear in some grounding of a formula with it. The
probability of a ground atom $X_l$ when its Markov blanket $B_l$ is in state $b_l$ is

$$
P(X_l = x_l \mid B_l = b_l) = \frac{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = x_l, B_l = b_l)\right)}{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 0, B_l = b_l)\right) + \exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 1, B_l = b_l)\right)},
\tag{12.5}
$$

where $F_l$ is the set of ground formulae that $X_l$ appears in, and $f_i(X_l = x_l, B_l = b_l)$
is the value (0 or 1) of the feature corresponding to the ith ground formula when
$X_l = x_l$ and $B_l = b_l$. For sets of atoms of which exactly one is true in any given
world (e.g., the possible values of an attribute), blocking can be used (i.e., one atom
is set to true and the others to false in one step, by sampling conditioned on their
collective Markov blanket). The estimated probability of a conjunction of ground
literals is simply the fraction of samples in which the ground literals are true, after
the Markov chain has converged. Because the distribution is likely to have many
modes, we run the Markov chain multiple times. When the MLN is in clausal
form, we minimize burn-in time by starting each run from a mode found using
MaxWalkSat, a local search algorithm for the weighted satisfiability problem (i.e.,
finding a truth assignment that maximizes the sum of weights of satisfied clauses)
[28]. When there are hard constraints (clauses with infinite weight), MaxWalkSat
finds regions that satisfy them, and the Gibbs sampler then samples from these
regions to obtain probability estimates.
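The basic Gibbs step in (12.5) can be sketched as follows. This is a simplified illustration, assuming finite weights, a single chain started from an arbitrary state, and no blocking or MaxWalkSat initialization; the representation of ground formulae as `(weight, atoms, truth_function)` triples is an assumption made for the example.

```python
import math
import random

def conditional_true_probability(atom, state, ground_formulas):
    """P(X_l = 1 | B_l = b_l) from (12.5).

    state: dict atom -> bool (the current world).
    ground_formulas: list of (weight, atoms, f), where `atoms` is the set of
    ground atoms in the formula and f(state) is its truth value.
    """
    def weighted_sum(value):
        trial = dict(state)
        trial[atom] = value
        # Only ground formulae containing `atom` (the set F_l) matter.
        return sum(w * f(trial) for w, atoms, f in ground_formulas
                   if atom in atoms)
    s0, s1 = weighted_sum(False), weighted_sum(True)
    # exp(s1) / (exp(s0) + exp(s1)), written to avoid two exponentials.
    return 1.0 / (1.0 + math.exp(s0 - s1))

def gibbs_estimate(query, state, evidence, ground_formulas, sweeps, burn_in, rng):
    """Fraction of post-burn-in samples in which the atom `query` is true."""
    free = [a for a in state if a not in evidence]
    hits = 0
    for sweep in range(sweeps):
        for atom in free:
            p = conditional_true_probability(atom, state, ground_formulas)
            state[atom] = rng.random() < p
        if sweep >= burn_in and state[query]:
            hits += 1
    return hits / (sweeps - burn_in)

# Hypothetical ground network: R(A) => S(A) with weight 2.0, evidence R(A) = true.
ground_formulas = [
    (2.0, {"R(A)", "S(A)"}, lambda s: (not s["R(A)"]) or s["S(A)"]),
]
state = {"R(A)": True, "S(A)": False}
estimate = gibbs_estimate("S(A)", state, {"R(A)"}, ground_formulas,
                          sweeps=5000, burn_in=500, rng=random.Random(0))
```

For this tiny network the exact answer is 1/(1 + e^{-2}), so `estimate` should land close to that value; in a realistic network the quality of the estimate depends on convergence of the chain, as the text above notes.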

12.8

Learning
We learn MLN weights from one or more relational databases. (For brevity, the
treatment below is for one database, but the generalization to many is trivial.) We
make a closed-world assumption [18]: if a ground atom is not in the database, it is
assumed to be false. If there are n possible ground atoms, a database is effectively
a vector $x = (x_1, \ldots, x_l, \ldots, x_n)$ where $x_l$ is the truth value of the lth ground
atom ($x_l = 1$ if the atom appears in the database, and $x_l = 0$ otherwise). Given
a database, MLN weights can in principle be learned using standard methods,
as follows. If the ith formula has $n_i(x)$ true groundings in the data x, then by
equation 12.3 the derivative of the log-likelihood with respect to its weight is

$$
\frac{\partial}{\partial w_i} \log P_w(X = x) = n_i(x) - \sum_{x'} P_w(X = x')\, n_i(x'),
\tag{12.6}
$$

where the sum is over all possible databases $x'$, and $P_w(X = x')$ is $P(X = x')$
computed using the current weight vector $w = (w_1, \ldots, w_i, \ldots)$. In other words,
the ith component of the gradient is simply the difference between the number of
true groundings of the ith formula in the data and its expectation according to the
current model. Unfortunately, counting the number of true groundings of a formula
in a database is intractable, even when the formula is a single clause, as stated in
the following proposition (due to Dan Suciu).
Proposition 12.4
Counting the number of true groundings of a first-order clause in a database is
#P-complete in the length of the clause.
Proof Counting satisfying assignments of propositional monotone 2-CNF is #P-complete [49]. This problem can be reduced to counting the number of true
groundings of a first-order clause in a database as follows. Consider a database
composed of the ground atoms R(0, 1), R(1, 0), and R(1, 1). Given a monotone
2-CNF formula, construct a formula Φ that is a conjunction of predicates of the
form R(x_i, x_j), one for each disjunct x_i ∨ x_j appearing in the CNF formula. (For
example, (x_1 ∨ x_2) ∧ (x_3 ∨ x_4) would yield R(x_1, x_2) ∧ R(x_3, x_4).) There is a one-to-one correspondence between the satisfying assignments of the 2-CNF and the
true groundings of Φ. The latter are the false groundings of the clause formed by
disjoining the negations of all the R(x_i, x_j), and thus can be counted by counting
the number of true groundings of this clause and subtracting it from the total
number of groundings.
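
The reduction in the proof can be checked by brute force on a small formula. The sketch below is only an illustration with hypothetical function names: it encodes a monotone 2-CNF as pairs of variable indices, counts its satisfying assignments, and compares that with the total number of groundings minus the true groundings of the clause ¬R(x_i, x_j) ∨ ... built from the same pairs.

```python
from itertools import product

# Database from the proof: R holds for every pair over {0, 1} except (0, 0).
R = {(0, 1), (1, 0), (1, 1)}

def count_sat_monotone_2cnf(pairs, n_vars):
    """Count assignments satisfying the monotone 2-CNF with one disjunct x_i v x_j per pair."""
    return sum(
        all(a[i] or a[j] for i, j in pairs)
        for a in product((0, 1), repeat=n_vars)
    )

def count_true_groundings_of_clause(pairs, n_vars):
    """Count groundings of the clause formed by disjoining the negations of all R(x_i, x_j)."""
    return sum(
        any((g[i], g[j]) not in R for i, j in pairs)
        for g in product((0, 1), repeat=n_vars)
    )

def count_via_reduction(pairs, n_vars):
    """Total groundings minus true groundings of the clause = true groundings of Phi."""
    return 2 ** n_vars - count_true_groundings_of_clause(pairs, n_vars)
```

For the example in the proof, (x_1 ∨ x_2) ∧ (x_3 ∨ x_4), the pairs are `[(0, 1), (2, 3)]` with four variables, and both `count_sat_monotone_2cnf` and `count_via_reduction` give 9.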

In large domains, the number of true groundings of a formula may be counted
approximately, by uniformly sampling groundings of the formula and checking
whether they are true in the data. In smaller domains, and in our experiments
below, we use an efficient recursive algorithm to find the exact count.
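
The sampling approach can be sketched as follows. The domain, the randomly generated database, and the clause are hypothetical and chosen only for illustration; the exact count here is plain enumeration, not the recursive algorithm mentioned above.

```python
import random
from itertools import product

# Hypothetical domain and database, generated at random for illustration.
PEOPLE = list(range(30))
_rng = random.Random(1)
SMOKES = {p for p in PEOPLE if _rng.random() < 0.3}
FRIENDS = {(a, b) for a in PEOPLE for b in PEOPLE if a != b and _rng.random() < 0.1}

def clause_true(x, y):
    """Friends(x, y) ^ Smokes(x) => Smokes(y), under the closed-world assumption."""
    return (x, y) not in FRIENDS or x not in SMOKES or y in SMOKES

def exact_count():
    """Enumerate every grounding (x, y) and count the true ones."""
    return sum(clause_true(x, y) for x, y in product(PEOPLE, repeat=2))

def sampled_count(n_samples, seed=0):
    """Estimate the count as (total groundings) x (fraction of sampled groundings that are true)."""
    rng = random.Random(seed)
    total = len(PEOPLE) ** 2
    hits = sum(
        clause_true(rng.choice(PEOPLE), rng.choice(PEOPLE))
        for _ in range(n_samples)
    )
    return total * hits / n_samples
```

With a few thousand samples the estimate is typically within a few percent of the exact count for a clause like this one, where most groundings are true; clauses with very few true groundings would need many more samples for the same relative accuracy.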
A second problem with (12.6) is that computing the expected number of true
groundings is also intractable, requiring inference over the model. Further, efficient
optimization methods also require computing the log-likelihood itself (12.3), and
thus the partition function Z. This can be done approximately using a Monte Carlo
maximum likelihood estimator (MC-MLE) [19]. However, in our experiments the
Gibbs sampling used to compute the MC-MLEs and gradients did not converge in
reasonable time, and using the samples from the unconverged chains yielded poor
results.
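
To make the cost concrete, the sketch below computes the gradient in (12.6) exactly for a hypothetical two-atom network by enumerating every possible database. The count functions and weights are assumptions for illustration; the enumeration has 2^n terms, which is why it cannot be used on realistic domains.

```python
import itertools
import math

# Hypothetical formulas, each represented by its count function n_i(x)
# over a world x = (A, B) of 0/1 truth values.
COUNTS = [
    lambda x: int((not x[0]) or x[1]),  # n_1: one ground clause A => B
    lambda x: int(x[0]),                # n_2: one ground atom A
]

def weighted_sum(w, x):
    """sum_i w_i n_i(x), the exponent of the unnormalized probability."""
    return sum(wi * n(x) for wi, n in zip(w, COUNTS))

def log_likelihood(w, x):
    """log P_w(X = x), with the partition function Z computed by enumeration."""
    worlds = itertools.product((0, 1), repeat=len(x))
    log_z = math.log(sum(math.exp(weighted_sum(w, v)) for v in worlds))
    return weighted_sum(w, x) - log_z

def gradient(w, x):
    """Equation 12.6: observed count minus expected count, for each formula."""
    worlds = list(itertools.product((0, 1), repeat=len(x)))
    unnorm = [math.exp(weighted_sum(w, v)) for v in worlds]
    z = sum(unnorm)
    return [
        n(x) - sum(u / z * n(v) for u, v in zip(unnorm, worlds))
        for n in COUNTS
    ]
```

A finite-difference check of `log_likelihood` against `gradient` is a simple way to confirm that the two agree on a toy problem like this.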
A more efficient alternative, widely used in areas like spatial statistics, social
network modeling, and language processing, is to optimize instead the pseudo-likelihood [3]

\[
P^*_w(X = x) = \prod_{l=1}^{n} P_w(X_l = x_l \mid MB_x(X_l)), \tag{12.7}
\]

where MB_x(X_l) is the state of the Markov blanket of X_l in the data. The gradient
of the pseudo-log-likelihood is

\[
\frac{\partial}{\partial w_i} \log P^*_w(X = x) = \sum_{l=1}^{n} \bigl[\, n_i(x) - P_w(X_l = 0 \mid MB_x(X_l))\, n_i(x_{[X_l=0]}) - P_w(X_l = 1 \mid MB_x(X_l))\, n_i(x_{[X_l=1]}) \,\bigr], \tag{12.8}
\]

where n_i(x_[X_l=0]) is the number of true groundings of the ith formula when we
force X_l = 0 and leave the remaining data unchanged, and similarly for n_i(x_[X_l=1]).
Computing this expression (or (12.7)) does not require inference over the model.
We optimize the pseudo-log-likelihood using the limited-memory BFGS algorithm
[33]. The computation can be made more efficient in several ways:
The sum in (12.8) can be greatly sped up by ignoring predicates that do not
appear in the ith formula.
The counts n_i(x), n_i(x_[X_l=0]), and n_i(x_[X_l=1]) do not change with the weights, and
need only be computed once (as opposed to in every iteration of BFGS).
Ground formulas whose truth-value is unaffected by changing the truth-value of
any single literal may be ignored, since then n_i(x) = n_i(x_[X_l=0]) = n_i(x_[X_l=1]). In
particular, this holds for any clause which contains at least two true literals. This
can often be the great majority of ground clauses.
To combat overfitting, we penalize the pseudo-likelihood with a Gaussian prior
on each weight.
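
A minimal sketch of (12.7) and (12.8) on a hypothetical three-atom network follows. The count functions are illustrative assumptions, the Gaussian penalty is omitted, and none of the speedups listed above are applied. Conditioning on all other atoms is used in place of the Markov blanket, which gives the same conditional probability.

```python
import math

# Hypothetical formulas as count functions over a world x = (A, B, C) of 0/1 values.
COUNTS = [
    lambda x: int((not x[0]) or x[1]),  # A => B
    lambda x: int((not x[1]) or x[2]),  # B => C
    lambda x: int(x[0]),                # A
]

def flip(x, l, v):
    """Copy of x with atom l forced to value v (the x_[X_l=v] of Equation 12.8)."""
    return x[:l] + (v,) + x[l + 1:]

def weighted_sum(w, x):
    return sum(wi * n(x) for wi, n in zip(w, COUNTS))

def cond_prob(w, x, l, v):
    """P_w(X_l = v | rest of x); only two worlds need to be scored, so no Z is required."""
    s0 = weighted_sum(w, flip(x, l, 0))
    s1 = weighted_sum(w, flip(x, l, 1))
    p1 = 1.0 / (1.0 + math.exp(s0 - s1))
    return p1 if v == 1 else 1.0 - p1

def pseudo_log_likelihood(w, x):
    """Log of Equation 12.7."""
    return sum(math.log(cond_prob(w, x, l, x[l])) for l in range(len(x)))

def pll_gradient(w, x):
    """Equation 12.8, one component per formula."""
    grad = []
    for n in COUNTS:
        g = 0.0
        for l in range(len(x)):
            g += (n(x)
                  - cond_prob(w, x, l, 0) * n(flip(x, l, 0))
                  - cond_prob(w, x, l, 1) * n(flip(x, l, 1)))
        grad.append(g)
    return grad
```

In a real implementation the counts `n(x)` and `n(flip(x, l, v))` would be cached once, since they do not depend on the weights; only `cond_prob` changes between optimizer iterations.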
When we know a priori which predicates will be evidence, MLN weights can also
be learned discriminatively [52].
ILP techniques can be used to learn additional clauses, refine the ones already in
the MLN, or learn an MLN from scratch. Here we use the CLAUDIEN system for
this purpose [10]. Unlike most other ILP systems, which learn only Horn clauses,
CLAUDIEN is able to learn arbitrary first-order clauses, making it well suited to
Markov logic. Also, by constructing a particular language bias, we are able to direct
CLAUDIEN to search for refinements of the MLN structure. Alternatively, MLN
structure can be learned by directly optimizing pseudo-likelihood [30].

12.9

Experiments
We have empirically tested the algorithms described in the previous sections using
a database describing the Department of Computer Science and Engineering at the
University of Washington (UW-CSE). The domain consists of 12 predicates and
2707 constants divided into 10 types. Types include: publication (342 constants),
person (442), course (176), project (153), academic quarter (20), etc. Predicates
include: Professor(person), Student(person), Area(x, area) (with x ranging over
publications, persons, courses, and projects), AuthorOf(publication, person),
AdvisedBy(person, person), YearsInProgram(person, years), CourseLevel(course, level), TaughtBy(course, person, quarter), TeachingAssistant(course, person, quarter), etc. Additionally, there are 10 equality predicates: SamePerson(person, person), SameCourse(course, course), etc., which always have known,
fixed values that are true iff the two arguments are the same constant.
Using typed variables, the total number of possible ground atoms (n in section 12.8) was 4,106,841. The database contained a total of 3380 tuples (i.e., there
were 3380 true ground atoms). We obtained this database by scraping pages in
the department's Web site (www.cs.washington.edu). Publications and AuthorOf relations were obtained by extracting from the BibServ database (www.bibserv.org)
all records with author fields containing the names of at least two department
members (in the form "last name, first name" or "last name, first initial").
We obtained a knowledge base by asking four volunteers to each provide a set of
formulas in first-order logic describing the domain. (The volunteers were not shown
the database of tuples, but were members of the department who thus had a general
understanding about it.) Merging these yielded a KB of 96 formulas. The complete
KB, volunteer instructions, database, and algorithm parameter settings are online
at http://www.cs.washington.edu/ai/mln. Formulas in the KB include statements
like: students are not professors; each student has at most one advisor; if a student
is an author of a paper, so is her advisor; advanced students only TA courses taught
by their advisors; at most one author of a given publication is a professor; students
in phase I of the Ph.D. program have no advisor; etc. Notice that these statements
are not always true, but are typically true.
For training and testing purposes, we divided the database into five subdatabases,
one for each area: AI, graphics, programming languages, systems, and theory.
Professors and courses were manually assigned to areas, and other constants were
iteratively assigned to the most frequent area among other constants they appeared
in some tuple with. Each tuple was then assigned to the area of the constants in it.
Tuples involving constants of more than one area were discarded, to avoid train-test
contamination. The subdatabases contained, on average, 521 true ground atoms out
of a possible 58,457.
We performed leave-one-out testing by area, testing on each area in turn using
the model trained from the remaining four. The test task was to predict the
AdvisedBy(x, y) predicate given (a) all others (All Info) and (b) all others except
Student(x) and Professor(x) (Partial Info). In both cases, we measured the
average conditional log-likelihood of all possible groundings of AdvisedBy(x, y) over
all areas, drew precision-recall curves, and computed the area under the curve. This
task is an instance of link prediction, a problem that has been the object of much
interest in statistical relational learning (see section 12.6). All KBs were converted
to clausal form. Timing results are on a 2.8 GHz Pentium 4 machine.
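
The two evaluation measures can be sketched as below. This is an assumed, simplified version: the chapter does not specify how ties are handled or how the curve is interpolated, so the trapezoidal area and the clipping constant `eps` are choices made for this illustration only.

```python
import math

def average_cll(probs, truths, eps=1e-9):
    """Mean log-probability assigned to the true value of each query atom."""
    total = 0.0
    for p, y in zip(probs, truths):
        p = min(max(p, eps), 1.0 - eps)     # avoid log(0)
        total += math.log(p if y else 1.0 - p)
    return total / len(probs)

def pr_curve(probs, truths):
    """(recall, precision) after each prediction, in order of decreasing probability.

    Assumes at least one positive example.
    """
    ranked = sorted(zip(probs, truths), key=lambda t: -t[0])
    n_pos = sum(truths)
    points, tp = [], 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += int(y)
        points.append((tp / n_pos, tp / k))
    return points

def auc_pr(points):
    """Trapezoidal area under the precision-recall points."""
    area, prev_r, prev_p = 0.0, 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return area
```

Because each term of the CLL is the logarithm of a probability, the average is never positive; values closer to zero indicate better-calibrated predictions.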
12.9.1

Systems

In order to evaluate Markov logic, which uses logic and probability for inference,
we wished to compare it with methods that use only logic or only probability. We
were also interested in automatic induction of clauses using ILP techniques. This
section gives details of the comparison systems used.
12.9.1.1

Logic

One important question we aimed to answer with the experiments is whether adding
probability to a logical KB improves its ability to model the domain. Doing this
requires observing the results of answering queries using only logical inference, but
this is complicated by the fact that computing log-likelihood and the area under
the precision-recall curve requires real-valued probabilities, or at least some measure
of confidence in the truth of each ground atom being tested. We thus used the
following approach. For a given knowledge base KB and set of evidence atoms E,
let X_{KB∪E} be the set of worlds that satisfy KB ∪ E. The probability of a query
atom q is then defined as P(q) = |X_{KB∪E∪q}| / |X_{KB∪E}|, the fraction of X_{KB∪E} in which q is
true.
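
For a domain small enough to enumerate, this definition can be computed directly. The atoms and the two-clause KB below are hypothetical and serve only to illustrate the ratio; the function names are likewise assumptions.

```python
from fractions import Fraction
from itertools import product

ATOMS = ("Student(Matt)", "Professor(Pedro)", "AdvisedBy(Matt,Pedro)")

# Hypothetical KB: an advisee is a student, and an advisor is a professor.
KB = [
    lambda w: (not w["AdvisedBy(Matt,Pedro)"]) or w["Student(Matt)"],
    lambda w: (not w["AdvisedBy(Matt,Pedro)"]) or w["Professor(Pedro)"],
]

def worlds_satisfying(kb, evidence):
    """All truth assignments that agree with the evidence and satisfy every KB formula."""
    result = []
    for values in product((False, True), repeat=len(ATOMS)):
        world = dict(zip(ATOMS, values))
        if all(world[a] == v for a, v in evidence.items()) and all(f(world) for f in kb):
            result.append(world)
    return result

def query_probability(kb, evidence, query_atom):
    """Fraction of the satisfying worlds in which the query atom is true."""
    sat = worlds_satisfying(kb, evidence)
    if not sat:
        raise ValueError("KB and evidence are inconsistent; P(q) is undefined")
    return Fraction(sum(w[query_atom] for w in sat), len(sat))
```

With the evidence `{"Student(Matt)": True}`, three worlds satisfy the KB and the query `AdvisedBy(Matt,Pedro)` holds in one of them, giving 1/3. The `ValueError` branch corresponds to the zero-denominator case that arises when the KB is inconsistent.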
A more serious problem arises if the KB is inconsistent (which was indeed the case
with the KB we collected from volunteers). In this case the denominator of P(q) is
zero. (Also, recall that an inconsistent KB trivially entails any arbitrary formula.)
To address this, we redefine X_{KB∪E} to be the set of worlds that satisfy the
maximum possible number of ground clauses. We use Gibbs sampling to sample
from this set, with each chain initialized to a mode using WalkSat. At each Gibbs
step, the step is taken with probability: 1 if the new state satisfies more clauses than
the current one (since that means the current state should have 0 probability), 0.5
if the new state satisfies the same number of clauses (since the new and old state
then have equal probability), and 0 if the new state satisfies fewer clauses. We
then use only the states with the maximum number of satisfied clauses to compute
probabilities. Notice that this is equivalent to using an MLN built from the KB and
with all equal infinite weights.
12.9.1.2

Probability

The other question we wanted to answer with these experiments is whether existing (propositional) probabilistic models are already powerful enough to be used
in relational domains without the need for the additional representational power
provided by MLNs. In order to use such models, the domain must first be propositionalized by defining features that capture useful information about it. Creating
good attributes for propositional learners in this highly relational domain is a difficult problem. Nevertheless, as a tradeoff between incorporating as much potentially
relevant information as possible and avoiding extremely long feature vectors, we defined two sets of propositional attributes: order-1 and order-2. The former involves
characteristics of individual constants in the query predicate, and the latter involves
characteristics of relations between the constants in the query predicate.
For the order-1 attributes, we defined one variable for each (a, b) pair, where a is
an argument of the query predicate and b is an argument of some predicate with the
same value as a. The variable is the fraction of true groundings of this predicate
in the data. Some examples of first-order attributes for AdvisedBy(Matt, Pedro)
are: whether Pedro is a student, the fraction of publications that are published by
Pedro, the fraction of courses for which Matt was a teaching assistant, etc.
The order-2 attributes were defined as follows: for a given (ground) query predicate Q(q_1, q_2, ..., q_k), consider all sets of k predicates and all assignments of constants q_1, q_2, ..., q_k as arguments to the k predicates, with exactly one constant per
predicate (in any order). For instance, if Q is AdvisedBy(Matt, Pedro) then one
such possible set would be {TeachingAssistant(_, Matt, _), TaughtBy(_, Pedro, _)}.
This forms 2^k attributes of the example, each corresponding to a particular truth assignment to the k predicates. The value of an attribute is the number of times, in the
training data, the set of predicates have that particular truth assignment, when their
unassigned arguments are all filled with the same constants. For example, consider
filling the above empty arguments with CSE546 and Autumn_0304. The resulting
set, {TeachingAssistant(CSE546, Matt, Autumn_0304), TaughtBy(CSE546, Pedro,
Autumn_0304)} has some truth assignment in the training data (e.g., {True,True},
{True,False}, ...). One attribute is the number of such sets of constants that create
the truth assignment {True,True}, another for {True,False}, and so on. Some examples of second-order attributes generated for the query AdvisedBy(Matt, Pedro)
are: how often Matt is a teaching assistant for a course that Pedro taught (as well
as how often he is not), how many publications Pedro and Matt have coauthored,
etc.
The resulting 28 order-1 attributes and 120 order-2 attributes (for the All Info
case) were discretized into five equal-frequency bins (based on the training set).
We used two propositional learners: naive Bayes [14] and Bayesian networks [22]
with structure and parameters learned using the VFBN2 algorithm [25] with a
maximum of four parents per node. The order-2 attributes helped the naive Bayes
classifier but hurt the performance of the Bayesian network classifier, so below we
report results using the order-1 and order-2 attributes for naive Bayes, and only
the order-1 attributes for Bayesian networks.
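
Equal-frequency discretization of the kind mentioned above can be sketched as follows. This is an assumed, generic version (the function names are hypothetical, and the chapter does not say how ties or out-of-range test values were handled): bin edges are taken from the sorted training values, and any value is then mapped to a bin index by binary search.

```python
import bisect

def equal_frequency_edges(train_values, n_bins=5):
    """Cut points chosen so that each bin holds roughly the same number of training values."""
    ordered = sorted(train_values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

def discretize(value, edges):
    """Index (0 .. len(edges)) of the bin that the value falls into."""
    return bisect.bisect_right(edges, value)
```

For training values 0 through 99 the edges are 20, 40, 60, and 80, so each of the five bins receives 20 training values. Heavily tied attributes (for example, counts that are mostly zero) can produce duplicate edges and therefore fewer distinct bins.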
12.9.1.3

Inductive Logic Programming

Our original KB was acquired from volunteers, but we were also interested in
whether it could have been developed automatically using ILP methods. As mentioned earlier, we used CLAUDIEN to induce a KB from data. CLAUDIEN was
run with: local scope; minimum accuracy of 0.1; minimum coverage of 1; maximum
complexity of 10; and breadth-first search. CLAUDIEN's search space is defined by
its language bias. We constructed a language bias which allowed: a maximum of
three variables in a clause; unlimited predicates in a clause; up to two non-negated
appearances of a predicate in a clause, and two negated ones; and use of knowledge of predicate argument types. To minimize search, the equality predicates (e.g.,
SamePerson) were not used in CLAUDIEN, and this improved its results.

Besides inducing clauses from the training data, we were also interested in using
data to automatically refine the KB provided by our volunteers. CLAUDIEN does
not support this feature directly, but it can be emulated by an appropriately
constructed language bias. We did this by, for each clause in the KB, allowing
CLAUDIEN to (1) remove any number of the literals, (2) add up to v new variables,
and (3) add up to l new literals. We ran CLAUDIEN for 24 hours on a Sun-Blade
1000 for each (v, l) in the set {(1, 2), (2, 3), (3, 4)}. All three gave nearly identical
results; we report the results with v = 3 and l = 4.
12.9.1.4

Markov Logic

Our results compare the above systems to Markov logic. The MLNs were trained
using a Gaussian weight prior with zero mean and unit variance, and with the
weights initialized at the mode of the prior (zero). For optimization, we used the
FORTRAN implementation of L-BFGS from Zhu et al. [58] and Byrd et al. [5],
leaving all parameters at their default values, and with a convergence criterion (ftol)
of 10^-5. Inference was performed using Gibbs sampling as described in section 12.7,
with ten parallel Markov chains, each initialized to a mode of the distribution using
MaxWalkSat. The number of Gibbs steps was determined using the criterion of
DeGroot and Schervish [11, pp. 707 and 740-741]. Sampling continued until we
reached a confidence of 95% that the probability estimate was within 1% of the true
value in at least 95% of the nodes (ignoring nodes which are always true or false). A
minimum of 1000 and maximum of 500,000 samples was used, with one sample per
complete Gibbs pass through the variables. Typically, inference converged within
5000 to 100,000 passes. The results were insensitive to variation in the convergence
thresholds.
12.9.2

Results

12.9.2.1

Training with MC-MLE

Our initial system used MC-MLE to train MLNs, with ten Gibbs chains, and each
ground atom being initialized to true with the corresponding first-order predicate's
probability of being true in the data. Gibbs steps may be taken quite quickly by
noting that few counts of satisfied clauses will change on any given step. On the
UW-CSE domain, our implementation took 4-5 ms per step. We used the maximum
across all predicates of the Gelman criterion R [20] to determine when the chains
had reached their stationary distribution. In order to speed convergence, our Gibbs
sampler preferentially samples atoms that were true in either the data or the initial
state of the chain. The intuition behind this is that most atoms are always false,
and sampling repeatedly from them is inefficient. This improved convergence by
approximately an order of magnitude over uniform selection of atoms. Despite
these optimizations, the Gibbs sampler took a prohibitively long time to reach
a reasonable convergence threshold (e.g., R = 1.01). After running for 24 hours
(approximately 2 million Gibbs steps per chain), the average R-value across training
sets was 3.04, with no one training set having reached an R-value less than 2 (other
than briefly dipping to 1.5 in the early stages of the process). Considering this must
be done iteratively as L-BFGS searches for the minimum, we estimate it would
take anywhere from 20 to 400 days to complete the training, even with a weak
convergence threshold such as R = 2.0. Experiments confirmed the poor quality
of the models that resulted if we ignored the convergence threshold and limited
the training process to less than ten hours. With a better choice of initial state,
approximate counting, and improved MCMC techniques such as the Swendsen-Wang algorithm [15], MC-MLE may become practical, but it is not a viable option
for training in the current version. (Notice that during learning MCMC is performed
over the full ground network, which is too large to apply MaxWalkSat to.)
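
The chapter does not spell out the formula for the Gelman criterion R. The sketch below assumes the commonly used potential scale reduction factor, which compares the variance between chain means with the average variance within chains; the function name and the treatment of each chain as a plain list of numbers are assumptions for illustration.

```python
import math
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor for m equal-length chains of scalar samples.

    Values near 1 suggest the chains agree; larger values suggest they have not mixed.
    Assumes at least two chains and nonzero within-chain variance.
    """
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])   # W
    between_over_n = statistics.variance(means)                           # B / n
    var_hat = (n - 1) / n * within + between_over_n
    return math.sqrt(var_hat / within)
```

Chains that sample the same distribution give a value close to 1, while chains stuck near different modes give values well above the thresholds quoted in the text (1.01 and 2.0).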
12.9.2.2

Training with Pseudo-likelihood

In contrast to MC-MLE, pseudo-likelihood training was quite fast. As discussed in
section 12.8, each iteration of training may be done quite quickly once the initial
clause and ground atom satisfiability counts are complete. On average (over the
five test sets), finding these counts took 2.5 minutes. From there, training took, on
average, 255 iterations of L-BFGS, for a total of 16 minutes.
12.9.2.3

Inference

Inference was also quite quick. Inferring the probability of all AdvisedBy(x, y) atoms
in the All Info case took 3.3 minutes in the AI test set (4624 atoms), 24.4 in
graphics (3721), 1.8 in programming languages (784), 10.4 in systems (5476), and
1.6 in theory (2704). The number of Gibbs passes ranged from 4270 to 500,000,
and averaged 124,000. This amounts to 18 ms per Gibbs pass and approximately
200,000-500,000 Gibbs steps per second. The average time to perform inference in
the Partial Info case was 14.8 minutes (vs. 8.3 in the All Info case).
12.9.2.4

Comparison of Systems

We compared twelve systems: the original KB (KB); CLAUDIEN (CL); CLAUDIEN with the original KB as language bias (CLB); the union of the original KB and
CLAUDIEN's output in both cases (KB+CL and KB+CLB); an MLN with each
of the above KBs (MLN(KB), MLN(CL), MLN(CLB), MLN(KB+CL), and MLN(KB+CLB));
naive Bayes (NB); and a Bayesian network learner (BN). Add-one smoothing of
probabilities was used in all cases.
Table 12.4 summarizes the results, and figure 12.2 shows precision-recall curves
for all areas (i.e., averaged over all AdvisedBy(x, y) predicates). MLNs are clearly
more accurate than the alternatives, showing the promise of this approach. The
purely logical and purely probabilistic methods often suffer when intermediate
predicates have to be inferred, while MLNs are largely unaffected. Naive Bayes
performs well in AUC in some test sets, but very poorly in others; its CLLs
are uniformly poor. CLAUDIEN performs poorly on its own, and produces no
improvement when added to the KB in the MLN. Using CLAUDIEN to refine the
KB typically performs worse in AUC but better in CLL than using CLAUDIEN
from scratch; overall, the best-performing logical method is KB+CLB, but its
results fall well short of the best MLNs. The general drop-off in precision at around
50% recall is attributable to the fact that the database is very incomplete, and only
allows identifying a minority of the AdvisedBy relations. Inspection reveals that the
occasional smaller drop-offs in precision at very low recalls are due to students who
graduated or changed advisors after coauthoring many publications with them.

Table 12.4 Experimental results for predicting AdvisedBy(x, y) when all other
predicates are known (All Info) and when Student(x) and Professor(x) are unknown
(Partial Info). CLL is the average conditional log-likelihood, and AUC is the area
under the precision-recall curve. The results are averages over all atoms in the five
test sets and their standard deviations. (See http://www.cs.washington.edu/ai/mln
for details on how the standard deviations of the AUCs were computed.)

System         All Info                           Partial Info
               AUC              CLL               AUC              CLL
MLN(KB)        0.215 ± 0.0172   -0.052 ± 0.004    0.224 ± 0.0185   -0.048 ± 0.004
MLN(KB+CL)     0.152 ± 0.0165   -0.058 ± 0.005    0.203 ± 0.0196   -0.045 ± 0.004
MLN(KB+CLB)    0.011 ± 0.0003   -3.905 ± 0.048    0.011 ± 0.0003   -3.958 ± 0.048
MLN(CL)        0.035 ± 0.0008   -2.315 ± 0.030    0.032 ± 0.0009   -2.478 ± 0.030
MLN(CLB)       0.003 ± 0.0000   -0.052 ± 0.005    0.023 ± 0.0003   -0.338 ± 0.002
KB             0.059 ± 0.0081   -0.135 ± 0.005    0.048 ± 0.0058   -0.063 ± 0.004
KB+CL          0.037 ± 0.0012   -0.202 ± 0.008    0.028 ± 0.0012   -0.122 ± 0.006
KB+CLB         0.084 ± 0.0100   -0.056 ± 0.004    0.044 ± 0.0064   -0.051 ± 0.005
CL             0.048 ± 0.0009   -0.434 ± 0.012    0.037 ± 0.0001   -0.836 ± 0.017
CLB            0.003 ± 0.0000   -0.052 ± 0.005    0.010 ± 0.0001   -0.598 ± 0.003
NB             0.054 ± 0.0006   -1.214 ± 0.036    0.044 ± 0.0009   -1.140 ± 0.031
BN             0.015 ± 0.0006   -0.072 ± 0.003    0.015 ± 0.0007   -0.215 ± 0.003

[Figure 12.2: two precision-recall plots (precision on the vertical axis, recall on the horizontal axis), each with curves for MLN(KB), MLN(KB+CL), KB, KB+CL, CL, NB, and BN.]

Figure 12.2 Precision and recall for all areas: All Info (upper graph) and Partial
Info (lower graph).

12.10

Conclusion

The rapid growth in the variety of SRL approaches and tasks has led to the need for
a unifying framework. In this chapter we propose Markov logic as a candidate for
such a framework. Markov logic combines first-order logic and Markov networks and
allows a wide variety of SRL tasks and approaches to be formulated in a common
language. Initial experiments with an implementation of Markov logic have yielded
good results. Software implementing Markov logic and learning and inference
algorithms for it is available at http://www.cs.washington.edu/ai/alchemy.

Acknowledgments
We are grateful to Julian Besag, Vitor Santos Costa, James Cussens, Nilesh
Dalvi, Alan Fern, Alon Halevy, Mark Handcock, Henry Kautz, Kristian Kersting,
Tian Sang, Bart Selman, Dan Suciu, Jeremy Tantrum, and Wei Wei for helpful
discussions. This research was partly supported by ONR grant N00014-02-1-0408
and by a Sloan Fellowship awarded to P. D. We used the VFML library in our
experiments (http://www.cs.washington.edu/dm/vfml/).

References
[1] C. Anderson, P. Domingos, and D. Weld. Relational Markov models and their
application to adaptive Web navigation. In Proceedings of the Eighth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 143-152, Edmonton, Canada, 2002. ACM Press.
[2] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific
American, 284(5):34-43, 2001.
[3] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179-195,
1975.
[4] W. Buntine. Operations for learning with graphical models. Journal of
Artificial Intelligence Research, 2:159-225, 1994.
[5] R. H. Byrd, P. Lu, and J. Nocedal. A limited memory algorithm for bound
constrained optimization. SIAM Journal on Scientific and Statistical Computing,
16(5):1190-1208, 1995.
[6] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization
using hyperlinks. In Proceedings of ACM International Conference on Management of Data, 1998.
[7] C. Cumby and D. Roth. Feature extraction languages for propositionalized
relational learning. In Proceedings of the IJCAI-2003 Workshop on Learning
Statistical Models from Relational Data, 2003.
[8] J. Cussens. Individuals, relations and structures in probabilistic models. In
Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from
Relational Data, 2003.
[9] J. Cussens. Loglinear models for first-order probabilistic reasoning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
[10] L. De Raedt and L. Dehaspe. Clausal discovery. Machine Learning, 26:99-146,
1997.
[11] M. H. DeGroot and M. J. Schervish. Probability and Statistics, 3rd edition.
Addison Wesley, Boston, 2002.
[12] L. Dehaspe. Maximum entropy modeling with clausal constraints. In Proceedings of the International Conference on Inductive Logic Programming, 1997.
[13] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random
fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
19:380-392, 1997.
[14] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian
classifier under zero-one loss. Machine Learning, 29:103-130, 1997.
[15] R. G. Edwards and A. G. Sokal. Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm. Physics Review
D, 38:2009-2012, 1988.
[16] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of Web
communities. In International Conference on Knowledge Discovery and Data
Mining, 2000.
[17] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[18] M. R. Genesereth and N. J. Nilsson. Logical Foundations of Artificial Intelligence. Morgan Kaufmann, San Mateo, CA, 1987.
[19] C. J. Geyer and E. A. Thompson. Constrained Monte Carlo maximum
likelihood for dependent data. Journal of the Royal Statistical Society, Series
B, 54(3):657-699, 1992.
[20] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain
Monte Carlo in Practice. Chapman and Hall, London, 1996.
[21] J. Halpern. An analysis of first-order logics of probability. Artificial Intelligence, 46:311-350, 1990.
[22] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks:
The combination of knowledge and statistical data. Machine Learning,
20:197-243, 1995.
[23] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie.
Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49-75, 2000.
[24] D. Heckerman, C. Meek, and D. Koller. Probabilistic entity-relationship
models, PRMs, and plate models. In Proceedings of the ICML-2004 Workshop
on Statistical Relational Learning and Its Connections to Other Fields, 2004.
[25] G. Hulten and P. Domingos. Mining complex models from arbitrarily large
databases in constant time. In International Conference on Knowledge Discovery and Data Mining, 2002.
[26] M. Jaeger. On the complexity of inference about probabilistic relational
models. Artificial Intelligence, 117:297-308, 2000.
[27] M. Jaeger. Reasoning about infinite random structures with relational
Bayesian networks. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, 1998.
[28] H. Kautz, B. Selman, and Y. Jiang. A general stochastic approach to solving
problems with hard and soft constraints. In D. Gu, J. Du, and P. Pardalos,
editors, The Satisfiability Problem: Theory and Applications, pages 573-586.
American Mathematical Society, New York, 1997.
[29] K. Kersting and L. De Raedt. Towards combining inductive logic programming with Bayesian networks. In Proceedings of the International Conference
on Inductive Logic Programming, 2001.
[30] S. Kok and P. Domingos. Learning the structure of Markov logic networks.
In Proceedings of the International Conference on Machine Learning, 2005.
[31] J. Jaffar and J.-L. Lassez. Constraint logic programming. In Proceedings of
the ACM Conference on Principles of Programming Languages, 1987.
[32] N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and
Applications. Ellis Horwood, Chichester, UK, 1994.
[33] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large
scale optimization. Mathematical Programming, 45(3):503-528, 1989.
[34] J. W. Lloyd. Foundations of Logic Programming. Springer-Verlag, Berlin,
1987.
[35] E. Lloyd-Richardson, A. Kazura, C. Stanton, R. Niaura, and G. Papandonatos. Differentiating stages of smoking intensity among adolescents: Stage-specific psychological and social influences. Journal of Consulting and Clinical
Psychology, 70(4), 2002.
[36] B. Milch, B. Marthi, and S. Russell. BLOG: Relational modeling with
unknown objects. In Proceedings of the ICML-2004 Workshop on Statistical
Relational Learning and its Connections to Other Fields, 2004.
[37] S. Muggleton. Stochastic logic programs. In L. De Raedt, editor, Advances
in Inductive Logic Programming, pages 254-264. IOS Press, Amsterdam, 1996.
[38] J. Neville and D. Jensen. Collective classification with relational dependency
networks. In Proceedings of the Second International Workshop on Multi-Relational Data Mining, 2003.
[39] L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic knowledge bases. Theoretical Computer Science, 171:147-177, 1997.
[40] J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, New
York, NY, 1999.
[41] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, San Francisco, 1988.
[42] D. Poole. First-order probabilistic inference. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
[43] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial
Intelligence, 64:81-129, 1993.
[44] A. Popescul and L. H. Ungar. Structural logistic regression for link analysis.
In Proceedings of the Second International Workshop on Multi-Relational Data
Mining, 2003.
[45] A. Puech and S. Muggleton. A comparison of stochastic logic programs
and Bayesian logic programs. In Proceedings of the IJCAI-2003 Workshop
on Learning Statistical Models from Relational Data, 2003.
[46] M. Richardson and P. Domingos. Building large knowledge bases by mass
collaboration. In Proceedings of the International Conference on Knowledge
Capture, 2003.
[47] S. Riezler. Probabilistic Constraint Logic Programming. PhD thesis, University of Tübingen, Tübingen, Germany, 1998.
[48] J. A. Robinson. A machine-oriented logic based on the resolution principle.
Journal of the ACM, 12:23-41, 1965.
[49] D. Roth. On the hardness of approximate reasoning. Artificial Intelligence,
82:273-302, 1996.
[50] V. Santos Costa, D. Page, M. Qazi, and J. Cussens. CLP(BN): Constraint
logic programming for probabilistic knowledge. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2003.
[51] T. Sato and Y. Kameya. PRISM: A symbolic-statistical modeling language.
In Proceedings of the International Joint Conference on Artificial Intelligence,
1997.
[52] P. Singla and P. Domingos. Discriminative training of Markov logic networks.
In AAAI Press, 2005.
[53] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2002.
[54] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, UK, 1994.
[55] M. Wellman, J. S. Breese, and R. P. Goldman. From knowledge bases to
decision models. Knowledge Engineering Review, 7:35-53, 1992.
[56] W. Winkler. The state of record linkage and current research problems.
Technical report, Statistical Research Division, US Census Bureau, 1999.
[57] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation.
In Proceedings of Neural Information Processing Systems, 2001.
[58] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B,
FORTRAN routines for large scale bound constrained optimization. ACM
Transactions on Mathematical Software, 23(4):550-560, 1997.

13 BLOG: Probabilistic Models with Unknown Objects

Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov

Many AI problems, ranging from sensor data association to linguistic coreference
resolution, involve making inferences about real-world objects that underlie some
data. In many cases, we do not know the number of underlying objects or the
mapping between observations and objects. This chapter presents a probabilistic
modeling language, called Bayesian logic (BLOG), which allows such scenarios
to be represented in a natural way. A well-formed BLOG model fully defines a
distribution over model structures of a first-order logical language; these possible
worlds can contain varying numbers of objects with varying relations among them.
We show how to use a probabilistic form of Skolemization to express evidence about
objects that were not initially known to exist. We also present a sampling-based
approximate inference algorithm that does inference in finite time per sampling
step on a large class of BLOG models, even those involving infinitely many random
variables.

13.1 Introduction
Human beings and AI systems must convert sensory input into some understanding
of what is going on in the world around them. That is, they must make inferences
about the objects and events that underlie their observations. No prespecified list of
objects is given; the agent must infer the existence of objects that were not known
initially to exist.
In many AI systems, this problem of unknown objects is engineered away or
resolved in a preprocessing step. However, there are important applications where
the problem is unavoidable. Population estimation, for example, involves counting a
population by sampling from it randomly and measuring how often the same object
is resampled; this would be pointless if the set of objects were known in advance.
Record linkage, a task undertaken by an industry of more than 300 companies,
involves matching entries across multiple databases. These companies exist because
of uncertainty about the mapping from observations to underlying objects. Finally,
multitarget tracking systems perform data association, connecting, say, radar blips
to hypothesized aircraft.
Probability models for such tasks are not new: Bayesian models for data association
have been used since the 1960s [29]. The models are written in English and
mathematical notation and converted by hand into special-purpose code. This can
result in inflexible models of limited expressiveness: for example, tracking systems
assume independent trajectories with linear dynamics, and record linkage systems
assume a naive Bayes model for fields in records. It seems natural, therefore, to seek
a formal language in which to express probability models that allow for unknown
objects.
Recent achievements in the field of probabilistic graphical models [24] illustrate
the benefits that can be expected from adopting a formal language: general-purpose
inference algorithms, more sophisticated models, and techniques for automated
model selection (structure learning). However, graphical models only describe fixed
sets of random variables with fixed dependencies among them; they become awkward
in scenarios with unknown objects. There has also been significant work on
first-order probabilistic languages (FOPLs), which explicitly represent objects and
the relations between them. We review some of this work in section 13.7. However,
most FOPLs make the assumptions of unique names, requiring that the symbols
or terms of the language all refer to distinct objects, and domain closure, requiring
that no objects exist besides the ones referred to by terms in the language. These
assumptions are inappropriate for problems such as multitarget tracking, where we
may want to reason about objects that are observed multiple times or that are
not observed at all. Those FOPLs that do support unknown objects often do so in
limited and ad hoc ways. In this chapter, we describe Bayesian logic (Blog) [19], a
new language that compactly and intuitively defines probability distributions over
outcomes with varying sets of objects.
We begin in section 13.2 with three example problems, each of which involves
possible worlds with varying object sets and identity uncertainty. We show Blog
models for these problems and give initial, informal descriptions of the probability
distributions that they define. Section 13.3 observes that the possible worlds in
these scenarios are naturally viewed as model structures of first-order logic. It then
defines precisely the set of possible worlds corresponding to a Blog model. The
key idea is a generative process that constructs a world by adding objects whose
existence and properties depend on those of objects already created. In such a
process, the existence of objects may be governed by many random variables, not
just a single population size variable. Section 13.4 discusses exactly how a Blog
model specifies a probability distribution over possible worlds.
Section 13.5 solves a previously unnoticed "probabilistic Skolemization" problem:
how to specify evidence about objects (such as radar blips) that one didn't know
existed. Finally, section 13.6 briefly discusses inference in unbounded outcome
spaces, stating a sampling algorithm and a completeness theorem for a large class
of Blog models and giving experimental results on one particular model.

13.2 Examples
In this section we examine three typical scenarios with unknown objects: simplified
versions of the population estimation, record linkage, and multitarget tracking
problems mentioned above. In each case, we provide a short Blog model that,
when combined with a suitable inference engine, constitutes a working solution for
the problem in question.
Example 13.1
An urn contains an unknown number of balls (say, a number chosen from a Poisson
distribution). Balls are equally likely to be blue or green. We draw some balls from
the urn, observing the color of each and replacing it. We cannot tell two identically
colored balls apart; furthermore, observed colors are wrong with probability 0.2.
How many balls are in the urn? Was the same ball drawn twice?

 1  type Color; type Ball; type Draw;

 2  random Color TrueColor(Ball);
 3  random Ball BallDrawn(Draw);
 4  random Color ObsColor(Draw);

 5  guaranteed Color Blue, Green;
 6  guaranteed Draw Draw1, Draw2, Draw3, Draw4;

 7  #Ball ~ Poisson[6]();

 8  TrueColor(b) ~ TabularCPD[[0.5, 0.5]]();

 9  BallDrawn(d) ~ Uniform({Ball b});

10  ObsColor(d)
11    if (BallDrawn(d) != null) then
12      ~ TabularCPD[[0.8, 0.2], [0.2, 0.8]](TrueColor(BallDrawn(d)));

Figure 13.1 Blog model for balls in an urn (Example 13.1) with four draws.

The Blog model for this problem, shown in Figure 13.1, describes a stochastic
process for generating worlds. The first four lines introduce the types of objects in these
worlds (colors, balls, and draws) and the functions that can be applied to these
objects. For each function, the model specifies a type signature in a syntax similar to
that of C or Java. For instance, line 2 specifies that TrueColor is a random function
that takes a single argument of type Ball and returns a value of type Color. Lines
5–7 specify what objects may exist in each world. In every world, there are exactly
two distinct colors, blue and green, and there are exactly four draws. These are the
guaranteed objects. On the other hand, different worlds have different numbers of
balls, so the number of balls that exist is chosen from a prior: a Poisson with mean
6. Each ball is then given a color, as specified on line 8. Properties of the four draws
are filled in by choosing a ball (line 9) and an observed color for that ball (lines
10–12). The probability of the generated world is the product of the probabilities
of all the choices made.
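The generative reading of Figure 13.1 can be sketched in a few lines of Python. This is only an illustration and not part of the Blog implementation: the function names `sample_poisson` and `sample_urn_world`, the integer ball labels, and the choice to represent null as `None` (including for a draw from an empty urn) are assumptions made for the sketch.

```python
import math
import random

def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for small means such as 6."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_urn_world(num_draws=4, mean_balls=6, noise=0.2, seed=None):
    """Sample one possible world, following Figure 13.1 line by line."""
    rng = random.Random(seed)
    num_balls = sample_poisson(mean_balls, rng)             # line 7
    true_color = {b: rng.choice(["Blue", "Green"])          # line 8
                  for b in range(1, num_balls + 1)}
    ball_drawn, obs_color = {}, {}
    for d in range(1, num_draws + 1):
        # line 9: uniform over existing balls (assumed null if there are none)
        ball_drawn[d] = rng.randint(1, num_balls) if num_balls > 0 else None
        if ball_drawn[d] is None:
            obs_color[d] = None                             # no clause applies
        else:                                               # lines 10-12
            truth = true_color[ball_drawn[d]]
            flipped = "Green" if truth == "Blue" else "Blue"
            obs_color[d] = flipped if rng.random() < noise else truth
    return {"num_balls": num_balls, "true_color": true_color,
            "ball_drawn": ball_drawn, "obs_color": obs_color}
```

Calling `sample_urn_world(seed=0)` returns one world as a dictionary; repeated calls with different seeds give worlds with different numbers of balls.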
 1  type Researcher; type Publication; type Citation;

 2  random String Name(Researcher);
 3  random String Title(Publication);
 4  random Publication PubCited(Citation);
 5  random String Text(Citation);

 6  origin Researcher Author(Publication);

 7  guaranteed Citation Cite1, Cite2, Cite3, Cite4;

 8  #Researcher ~ NumResearchersPrior();
 9  #Publication(Author = r) ~ NumPubsPrior();

10  Name(r) ~ NamePrior();
11  Title(p) ~ TitlePrior();

12  PubCited(c) ~ Uniform({Publication p});

13  Text(c) ~ NoisyCitationGrammar(Title(PubCited(c)),
14                                 Name(Author(PubCited(c))));

Figure 13.2 Blog model for Example 13.2 with four observed citations.

Example 13.2
We have a collection of citations that refer to publications in a certain field. What
publications and researchers exist, with what titles and names? Who wrote which
publication, and to which publication does each citation refer? For simplicity, we
just consider the title and author-name strings in these citations, which are subject
to errors of various kinds, and we assume only single-author publications.
Figure 13.2 shows a Blog model for this example, based on the model in [23].
The Blog model defines the following generative process. First, sample the total
number of researchers from some distribution; then, for each researcher r, sample
the number of publications by that researcher. Sample the researchers' names and
publications' titles from appropriate prior distributions. Then, for each citation,
sample the publication cited by choosing uniformly at random from the set of
publications. Finally, generate the citation text with a "noisy" formatting distribution
that allows for errors and abbreviations in the title and author names.
 1  type Aircraft; type Blip;

 2  random R6Vector State(Aircraft, NaturalNum);
 3  random R3Vector ApparentPos(Blip);

 4  nonrandom NaturalNum Pred(NaturalNum) = Predecessor;

 5  origin Aircraft Source(Blip);
 6  origin NaturalNum Time(Blip);

 7  #Aircraft ~ NumAircraftPrior();

 8  State(a, t)
 9    if t = 0 then ~ InitState()
10    else ~ StateTransition(State(a, Pred(t)));

11  #Blip(Source = a, Time = t) ~ DetectionCPD(State(a, t));
12  #Blip(Time = t) ~ NumFalseAlarmsPrior();

13  ApparentPos(b)
14    if (Source(b) = null) then ~ FalseAlarmDistrib()
15    else ~ ObsCPD(State(Source(b), Time(b)));

Figure 13.3 Blog model for Example 13.3.

Example 13.3
An unknown number of aircraft exist in some volume of airspace. An aircraft's
state (position and velocity) at each time step depends on its state at the previous
time step. We observe the area with radar: aircraft may appear as identical blips
on a radar screen. Each blip gives the approximate position of the aircraft that
generated it. However, some blips may be false detections, and some aircraft may
not be detected. What aircraft exist, and what are their trajectories? Are there any
aircraft that are not observed?
The Blog model for this scenario (Figure 13.3) describes the following process:
first, sample the number of aircraft in the area. Then, for each time step t (starting
at t = 0), choose the state (position and velocity) of each aircraft given its state at
time t − 1. Also, for each aircraft a and time step t, possibly generate a radar blip
b with Source(b) = a and Time(b) = t. Whether a blip is generated or not depends
on the state of the aircraft; thus the number of objects in the world depends on
certain objects' attributes. Also, at each step t, generate some false-alarm blips
b′ with Time(b′) = t and Source(b′) = null. Finally, sample the position for each
blip given the true state of its source aircraft (or using a default distribution for a
false-alarm blip).

13.3 Syntax and Semantics: Possible Worlds

13.3.1 Outcomes as First-Order Model Structures

The possible outcomes for Examples 13.1 through 13.3 are structures containing
many related objects, with the set of objects and the relations among them varying
from outcome to outcome. We will treat these outcomes formally as model structures
of first-order logic. A model structure provides interpretations for the symbols of a
first-order language; each sentence of the first-order language can be evaluated to
yield a truth-value in each model structure.
In Example 13.1, the language has function symbols such as TrueColor(b) for the
true color of ball b; BallDrawn(d) for the ball drawn on draw d; and Draw1 for
the first draw. (Usually, first-order languages are described as having predicate,
function, and constant symbols. For conciseness, we view all symbols as function
symbols; predicates are just functions that return a Boolean value, and constants are
just zero-ary functions.) To eliminate meaningless random variables, we use typed
logical languages. Each Blog model uses a language with a particular set of types,
such as Ball and Draw. Blog also has some built-in types that are available in all
models, namely Boolean, NaturalNum, Integer, String, Real, and RkVector (for each
$k \geq 2$). Each function symbol $f$ has a type signature $(\tau_0, \ldots, \tau_k)$, where $\tau_0$ is the
return type of $f$ and $\tau_1, \ldots, \tau_k$ are the argument types. The type Boolean receives
special syntactic treatment: if the return type of a function $f$ is Boolean, then terms
of the form $f(t_1, \ldots, t_k)$ constitute atomic formulae, which can be combined using
logical operators and placed inside quantifiers.
The logical languages used in Blog are also free: a function is not required to
apply to all tuples of arguments, even if they are appropriately typed [16]. For
instance, in Example 13.3, the function Source usually maps blips to aircraft, but
it is not applicable if the blip is a false detection. We adopt the convention that
when a function is not applicable to some arguments, it returns the special value
null. Any function that receives null as an argument also returns null, and an atomic
formula that evaluates to null is treated as false.
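The null convention can be made concrete with a small Python sketch. It assumes, purely for illustration, that an interpretation is a dictionary keyed by argument tuples and that null is represented by `None`; the helper names `apply_function` and `atomic_formula_holds` are hypothetical.

```python
def apply_function(interpretation, *args):
    """Free-logic application: a null argument, or an argument tuple to which
    the function does not apply, yields null (None)."""
    if any(arg is None for arg in args):
        return None
    return interpretation.get(args)

def atomic_formula_holds(value):
    """An atomic formula that evaluates to null is treated as false."""
    return value is True

# Source maps blip 1 to aircraft 2; blip 9 is a false detection.
source = {(("Blip", 1),): ("Aircraft", 2)}
state = {(("Aircraft", 2), 8): "some state vector"}
```

Here `apply_function(state, apply_function(source, ("Blip", 9)), 8)` evaluates to `None`, because the inner application already returned null.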
The truth of any first-order sentence is determined by a model structure for the
corresponding language. A model structure specifies the extension of each type and
the interpretation for each function symbol:
Definition 13.1
A model structure $\omega$ of a typed, free, first-order language consists of an extension
$[\tau]^\omega$ for each type $\tau$, which may be an arbitrary set, and an interpretation $[f]^\omega$ for
each function symbol $f$. If $f$ has return type $\tau_0$ and argument types $\tau_1, \ldots, \tau_k$, then
$[f]^\omega$ is a function from $[\tau_1]^\omega \times \cdots \times [\tau_k]^\omega$ to $[\tau_0]^\omega \cup \{\mathrm{null}\}$.


Three model structures for the language used in Figure 13.1 are shown in
Figure 13.4. Identity uncertainty arises because $[\mathit{BallDrawn}]^\omega(\mathit{Draw1})$ might be
equal to $[\mathit{BallDrawn}]^\omega(\mathit{Draw2})$ in one structure (such as Figure 13.4(a)) but not
another (such as Figure 13.4(b)). The set of balls, $[\mathit{Ball}]^\omega$, can also vary between
structures, as Figure 13.4 illustrates. The purpose of a Blog model is to define a
probability distribution over such structures. Because any sentence can be evaluated
as true or false in each model structure, a distribution over model structures
implicitly defines the probability that $\varphi$ is true for each sentence $\varphi$ in the logical
language.

Figure 13.4 Three model structures, panels (a), (b), and (c), for the language of
Figure 13.1. Shaded circles represent balls that are blue; shaded squares represent
draws where the drawn ball appeared blue (unshaded means green). Arrows represent
the BallDrawn function from draws to balls.
13.3.2 Outcomes with Fixed Object Sets

We begin our formal discussion of Blog semantics by considering the relatively
simple case of models with fixed sets of objects. Blog models for fixed object sets
have five kinds of statements. A type declaration, such as the two statements on line
1 of Figure 13.3, introduces a type. A random function declaration, such as line 2 of
Figure 13.3, specifies the type signature for a function symbol whose values will be
chosen randomly in the generative process. A nonrandom function definition, such
as the one on line 4 of Figure 13.3, introduces a function whose interpretation is
fixed in all possible worlds. In our implementation, the interpretation is given by a
Java class (Predecessor in this example). A guaranteed object statement, such as
line 5 in Figure 13.1, introduces and names some distinct objects that exist in all
possible worlds. For the built-in types, the obvious sets of guaranteed objects and
constant symbols are predefined. The set of guaranteed objects of type $\tau$ in Blog
model $M$ is denoted $G_M(\tau)$. Finally, for each random function symbol, a Blog
model includes a dependency statement specifying how values are chosen for that
function. We postpone further discussion of dependency statements to section 13.4.
The first four kinds of statements listed above define a particular typed first-order
language $\mathcal{L}_M$ for a model $M$. The set of possible worlds of $M$, denoted $\Omega_M$, consists
of those model structures of $\mathcal{L}_M$ where the extension of each type $\tau$ is $G_M(\tau)$, and
all nonrandom function symbols (including guaranteed constants) have their given
interpretations.
For each random function $f$ and tuple of appropriately typed guaranteed objects
$o_1, \ldots, o_k$, we can define a random variable (RV) $f[o_1, \ldots, o_k](\omega) \triangleq
[f]^\omega(o_1, \ldots, o_k)$. For instance, in a simplified version of Example 13.1 where the urn
contains a known set of balls {Ball1, ..., Ball8} and we make four draws, the RVs are
TrueColor[Ball1], ..., TrueColor[Ball8], BallDrawn[Draw1], ..., BallDrawn[Draw4],
and ObsColor[Draw1], ..., ObsColor[Draw4]. The possible worlds are in one-to-one
correspondence with full instantiations of these basic RVs. Thus, a joint distribution
for the basic RVs defines a distribution over possible worlds.
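As a quick check on this correspondence, the basic RVs of the simplified urn model can be enumerated and their full instantiations counted. The Python sketch below is illustrative only; the dictionary layout and the variable names are assumptions, and it takes for granted that no RV is null because every ball and draw exists in every world.

```python
import math

balls = [f"Ball{i}" for i in range(1, 9)]
draws = [f"Draw{i}" for i in range(1, 5)]
colors = ["Blue", "Green"]

# Each basic RV is keyed by (function symbol, argument tuple);
# the stored value is the list of values the RV can take.
basic_rvs = {}
for b in balls:
    basic_rvs[("TrueColor", (b,))] = colors
for d in draws:
    basic_rvs[("BallDrawn", (d,))] = balls
    basic_rvs[("ObsColor", (d,))] = colors

# One possible world per full instantiation of the basic RVs.
num_worlds = math.prod(len(values) for values in basic_rvs.values())
```

With 8 TrueColor, 4 BallDrawn, and 4 ObsColor variables there are 16 basic RVs, and `num_worlds` is 2^8 · 8^4 · 2^4 = 16,777,216.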
13.3.3 Unknown Objects

In general, a Blog model defines a generative process in which objects are added
iteratively to a world. To describe such processes, we first introduce origin function
declarations,¹ such as lines 5–6 of Figure 13.3. Unlike other functions, origin
functions such as Source or Time have their values set when an object is added.
An origin function must take a single argument of some type $\tau$ (namely Blip in the
example); it is then called a $\tau$-origin function.
Generative steps that add objects to the world are described by number statements,
such as line 11 of Figure 13.3:
#Blip(Source = a, Time = t) ~ DetectionCPD(State(a, t));
This statement says that for each aircraft a and time step t, the process adds some
number of blips, and each of these added blips b has the property that Source(b) = a
and Time(b) = t. In general, the beginning of a number statement has the form
$\#\tau(g_1 = x_1, \ldots, g_k = x_k)$,
where $\tau$ is a type, $g_1, \ldots, g_k$ are $\tau$-origin functions, and $x_1, \ldots, x_k$ are logical
variables. (For types that are generated ab initio with no origin functions, the empty
parentheses are omitted, as in Figure 13.1.) The inclusion of a number statement
means that for each appropriately typed tuple of objects $o_1, \ldots, o_k$, the generative
process adds some random number (possibly zero) of objects $q$ of type $\tau$ such that
$[g_i]^\omega(q) = o_i$ for $i = 1, \ldots, k$. Note that the types of the generating objects $o_1, \ldots, o_k$
are the return types of $g_1, \ldots, g_k$.
Object generation can even be recursive: objects can generate other objects of
the same type. For instance, consider a model of sexual reproduction in which
every male–female pair of individuals produces some number of offspring. We could
represent this with the number statement:
#Individual(Mother = m, Father = f)
  if Female(m) & !Female(f) then ~ NumOffspringPrior();

1. In [19] we used the term "generating function," but we have now adopted the term
"origin function" because it seems clearer.
We can also view number statements more declaratively:
Definition 13.2
Let $\omega$ be a model structure of $\mathcal{L}_M$, and consider a number statement for type $\tau$
with origin functions $g_1, \ldots, g_k$. An object $q \in [\tau]^\omega$ satisfies this number statement
applied to $o_1, \ldots, o_k$ in $\omega$ if $[g_i]^\omega(q) = o_i$ for $i = 1, \ldots, k$, and $[g]^\omega(q) = \mathrm{null}$ for all
other $\tau$-origin functions $g$.
Note that if a number statement for type $\tau$ omits one of the $\tau$-origin functions,
then this function takes on the value null for all objects satisfying that number
statement. For instance, Source is null for objects satisfying the
false-detection number statement on line 12 of Figure 13.3:
#Blip(Time = t) ~ NumFalseAlarmsPrior();
Also, a Blog model cannot contain two number statements with the same set of
origin functions. This ensures that, in any given model structure, each object o
has exactly one generation history, which can be found by tracing back the origin
functions on o.
The set of possible worlds $\Omega_M$ is the set of model structures that can be
constructed by $M$'s generative process. To complete the picture, we must explain
not only how many objects are added on each step, but also what these objects are. It
turns out to be convenient to define the generated objects as follows: when a number
statement with type $\tau$ and origin functions $g_1, \ldots, g_k$ is applied to generating
objects $o_1, \ldots, o_k$, the generated objects are tuples $\{(\tau, (g_1, o_1), \ldots, (g_k, o_k), n) :
n = 1, \ldots, N\}$, where $N$ is the number of objects generated. Thus in Example 13.3,
the aircraft are pairs (Aircraft, 1), (Aircraft, 2), etc., and the blips generated by
aircraft are nested tuples such as (Blip, (Source, (Aircraft, 2)), (Time, 8), 1). The tuple
encodes the object's generation history; of course, it is purely internal to the
semantics and remains invisible to the user.
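The tuple encoding is easy to mimic with ordinary Python tuples. The sketch below is an illustration under assumed names (`generated_objects`, `origin_value`); it is not how the Blog engine stores objects, and it uses `None` for null.

```python
def generated_objects(type_name, origins, count):
    """Return the tuples (type, (g1, o1), ..., (gk, ok), n) for n = 1..count."""
    return [(type_name, *origins, n) for n in range(1, count + 1)]

def origin_value(obj, origin_name):
    """Read an origin function off the tuple; None (null) if it was omitted."""
    for part in obj[1:-1]:
        if part[0] == origin_name:
            return part[1]
    return None

# Three aircraft generated ab initio, then two blips from aircraft 2 at time 8.
aircraft = generated_objects("Aircraft", [], 3)
blips = generated_objects("Blip", [("Source", aircraft[1]), ("Time", 8)], 2)
# A false-alarm blip: the Source origin function is omitted, hence null.
false_alarm = generated_objects("Blip", [("Time", 8)], 1)[0]
```

Tracing the origin functions back through the nested tuple recovers the generation history: `origin_value(blips[0], "Source")` is `("Aircraft", 2)`, while `origin_value(false_alarm, "Source")` is `None`.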
Definition 13.3
The universe of a type $\tau$ in a Blog model $M$, denoted $U_M(\tau)$, consists of the
guaranteed objects of type $\tau$ as well as all nested tuples of type $\tau$ that can be
generated from the guaranteed objects through finitely many recursive applications
of number statements.
As the following definition stipulates, in each possible world the extension of $\tau$ is
some subset of $U_M(\tau)$.
Definition 13.4
For a Blog model $M$, the set of possible worlds $\Omega_M$ is the set of model structures
$\omega$ of $\mathcal{L}_M$ such that

1. for each type $\tau$, $G_M(\tau) \subseteq [\tau]^\omega \subseteq U_M(\tau)$;

2. nonrandom functions have the specified interpretations;

3. for each number statement in $M$ with type $\tau$ and origin functions $g_1, \ldots, g_k$, and
each appropriately typed tuple of generating objects $(o_1, \ldots, o_k)$ in $\omega$, the set of
objects in $[\tau]^\omega$ that satisfy this number statement applied to these generating
objects is $\{(\tau, (g_1, o_1), \ldots, (g_k, o_k), n) : n = 1, \ldots, N\}$ for some natural number
$N$;

4. for every type $\tau$, each nonguaranteed element of $[\tau]^\omega$ satisfies some number
statement applied to some objects in $\omega$.
Note that by part 3 of this definition, the number of objects generated by any
given application of a number statement in world $\omega$ is a finite number $N$. However,
a world can still contain infinitely many nonguaranteed objects if some number
statements are applied recursively: then the world may contain tuples that are
nested to depths 1, 2, 3, ..., with no upper bound. Infinitely many objects can
also result if number statements are triggered for every natural number, like the
statements that generate radar blips in Example 13.3.
With a fixed set of objects, it was easy to define a set of basic RVs such that a
full instantiation of the basic RVs uniquely identified a possible world. To achieve
the same effect with unknown objects, we need two kinds of basic RVs:
Definition 13.5
For a Blog model $M$, the set $V_M$ of basic random variables consists of:
• for each random function $f$ with type signature $(\tau_0, \ldots, \tau_k)$ and each tuple
of objects $(o_1, \ldots, o_k) \in U_M(\tau_1) \times \cdots \times U_M(\tau_k)$, a function application RV
$f[o_1, \ldots, o_k](\omega)$ that is equal to $[f]^\omega(o_1, \ldots, o_k)$ if $o_1, \ldots, o_k$ all exist in $\omega$, and
null otherwise;
• for each number statement with type $\tau$ and origin functions $g_1, \ldots, g_k$ that have
return types $\tau_1, \ldots, \tau_k$, and each tuple of objects $(o_1, \ldots, o_k) \in U_M(\tau_1) \times \cdots \times
U_M(\tau_k)$, a number RV $\#\tau[g_1 = o_1, \ldots, g_k = o_k](\omega)$ equal to the number of objects
that satisfy this number statement applied to $o_1, \ldots, o_k$ in $\omega$.
Intuitively, each step in the generative world-construction process determines the
value of a basic variable. The crucial result about basic RVs is the following:
Proposition 13.6
For any Blog model $M$ and any complete instantiation of $V_M$, there is at most
one model structure in $\Omega_M$ consistent with this instantiation.
Some instantiations of $V_M$ do not correspond to any possible world: for example,
an instantiation for the urn-and-balls example where #Ball[] = 2, but
TrueColor[(Ball, 7)] is not null. Instantiations of $V_M$ that correspond to a world
are called achievable. Thus, to define a probability distribution over $\Omega_M$, it suffices
to define a joint distribution over the achievable instantiations of $V_M$.
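For the urn model, the achievability condition on the #Ball and TrueColor variables can be written down directly. The following Python sketch is a simplification made for illustration: the function name `is_achievable` is hypothetical, only these two kinds of variables are checked, and null is represented by `None`.

```python
def is_achievable(num_balls, true_color):
    """Check the #Ball and TrueColor variables of the urn model.

    true_color maps ball tuples such as ("Ball", 7) to a color or None (null).
    A ball exists exactly when its index is between 1 and num_balls; since the
    TrueColor dependency statement has an unconditional clause, an existing
    ball always has a color, and a nonexistent ball always has null.
    """
    for (type_name, n), color in true_color.items():
        exists = type_name == "Ball" and 1 <= n <= num_balls
        if exists != (color is not None):
            return False
    return True
```

The instantiation from the text, `is_achievable(2, {("Ball", 7): "Blue"})`, is rejected, whereas assigning null to TrueColor[(Ball, 7)] is accepted.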
Now that we have seen this technical development, we can say more about
the need to represent objects as tuples that encode generation histories. Equating
objects with tuples might seem unnecessarily complicated, but it becomes
very helpful when we define a Bayes net over the basic RVs (which we do
in section 13.4.2). For instance, in the aircraft tracking example, the parent
of ApparentPos[(Blip, (Source, (Aircraft, 2)), (Time, 8), 1)] is State[(Aircraft, 2), 8]. It
might seem more elegant to assign numbers to objects as they are generated, so
that the extension of each type in each possible world would be simply a prefix
of the natural numbers. Specifically, we could number the aircraft arbitrarily, and
then number the radar blips lexicographically by aircraft and time step. Then we
would have basic RVs such as ApparentPos[23], representing the apparent aircraft
position for blip 23. But blip 23 could be generated by any aircraft at any time
step. In fact, the parents of ApparentPos[23] would have to include all the #Blip
and State variables in the model. So defining objects as tuples yields a much simpler
Bayes net.

13.4 Syntax and Semantics: Probabilities

13.4.1 Dependency Statements

Dependency and number statements specify exactly how the steps are carried out
in our generative process. Consider the dependency statement for State(a, t) from
Figure 13.3:
State(a, t)
  if t = 0 then ~ InitState()
  else ~ StateTransition(State(a, Pred(t)));
This statement is applied for every basic RV of the form State[a, t] where
$a \in U_M(\mathrm{Aircraft})$ and $t \in \mathbb{N}$. If t = 0, the conditional distribution for State[a, t]
is given by the elementary conditional probability distribution (CPD) InitState;
otherwise it is given by the elementary CPD StateTransition, which takes
State(a, Pred(t)) as an argument. These elementary CPDs define distributions over
objects of type R6Vector (the return type of State). In our implementation, elementary
CPDs are Java classes with a method getProb that returns the probability of
a particular value given a list of CPD arguments, and a method sampleVal that
samples a value given the CPD arguments.
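To make the two-method interface concrete, here is a Python analogue of a table-based elementary CPD. It is a sketch and not the chapter's Java code: the class layout, the snake-case method names `get_prob` and `sample_val` (chosen to echo getProb and sampleVal), and the argument order are all assumptions.

```python
import random

class TabularCPD:
    """Table-based CPD over a fixed list of values, with at most one parent."""

    def __init__(self, values, table):
        self.values = values   # e.g. ["Blue", "Green"]
        self.table = table     # table[i][j] = P(values[j] | parent = values[i])

    def _row(self, args):
        # With no CPD arguments the table has a single row (a prior).
        return self.table[self.values.index(args[0])] if args else self.table[0]

    def get_prob(self, args, value):
        """Probability of `value` given the list of CPD arguments `args`."""
        return self._row(args)[self.values.index(value)]

    def sample_val(self, args, rng=random):
        """Sample a value given the list of CPD arguments `args`."""
        return rng.choices(self.values, weights=self._row(args), k=1)[0]

colors = ["Blue", "Green"]
true_color_cpd = TabularCPD(colors, [[0.5, 0.5]])              # Figure 13.1, line 8
obs_color_cpd = TabularCPD(colors, [[0.8, 0.2], [0.2, 0.8]])   # Figure 13.1, line 12
```

For example, `obs_color_cpd.get_prob(["Blue"], "Green")` gives the probability 0.2 of misreading a blue ball as green.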
A dependency statement begins with a function symbol $f$ and a tuple of logical
variables $x_1, \ldots, x_k$ representing the arguments to this function. In a number
statement, the variables $x_1, \ldots, x_k$ represent the generating objects. In either case,
the rest of the statement consists of a sequence of clauses. When the statement is
not abbreviated, the syntax for the first clause is
if cond then ~ elem-cpd(arg1, ..., argN)
The cond portion is a formula of the first-order logical language $\mathcal{L}_M$ (containing no
free variables other than $x_1, \ldots, x_k$) specifying the condition under which this clause
should be used to sample a value for a basic RV. More precisely, if the possible world
constructed so far is $\omega$, then the applicable clause is the first one whose condition
is satisfied in $\omega$ (assuming for the moment that $\omega$ is complete enough to determine
the truth-values of the conditions). If no clause's condition is satisfied, or if the
basic RV refers to objects that do not exist in $\omega$, then the value is set by default to
false for Boolean functions, null for other functions, and zero for number variables.
If the condition in a clause is just true, then the whole string "if true then"
may be omitted.
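The first-applicable-clause rule and its defaults can be sketched as follows. This is an illustrative Python rendering, not the Blog engine: clauses are modeled as hypothetical triples (condition, CPD name, argument evaluator), the partial world is a plain dictionary, and the check that the RV's objects exist is left out for brevity.

```python
# Default values used when no clause applies, as described in the text.
DEFAULTS = {"boolean": False, "number": 0, "other": None}

def applicable_clause(clauses, world, args):
    """Return the first clause whose condition holds in world, else None."""
    for clause in clauses:
        condition, _cpd_name, _cpd_args = clause
        if condition(world, *args):
            return clause
    return None

def resolve(clauses, world, args, kind="other"):
    """Return ("cpd", name, evaluated_args) or ("default", value, None)."""
    clause = applicable_clause(clauses, world, args)
    if clause is None:
        return ("default", DEFAULTS[kind], None)
    _condition, cpd_name, cpd_args = clause
    return ("cpd", cpd_name, cpd_args(world, *args))

# State(a, t): if t = 0 then ~ InitState()
#              else ~ StateTransition(State(a, Pred(t)))
state_clauses = [
    (lambda w, a, t: t == 0, "InitState", lambda w, a, t: []),
    (lambda w, a, t: True, "StateTransition",
     lambda w, a, t: [w["State"][(a, t - 1)]]),
]

# ObsColor(d): a single conditional clause, so the default can apply.
obs_color_clauses = [
    (lambda w, d: w["BallDrawn"][d] is not None, "TabularCPD",
     lambda w, d: [w["TrueColor"][w["BallDrawn"][d]]]),
]
```

The `else` branch of the State statement is modeled as a clause whose condition is always true, which is how an unconditional clause behaves under the first-match rule.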
In the applicable clause, each CPD argument is evaluated in $\omega$. The resulting
values are then passed to the elementary CPD. In the simplest case, the arguments
are terms or formulae of $\mathcal{L}_M$, such as State(a, Pred(t)). An argument can also be a
set expression of the form $\{\tau\ y : \varphi\}$, where $\tau$ is a type, $y$ is a logical variable, and
$\varphi$ is a formula. The value of such an expression is the set of objects $o \in [\tau]^\omega$ such
that $\omega$ satisfies $\varphi$ with $y$ bound to $o$. If the formula $\varphi$ is just true it can be omitted:
this is the case on line 9 of Figure 13.1, where we just see the expression {Ball
b}. Blog also includes other kinds of arguments to allow counting the number of
elements in a set, aggregating a multiset of values, or passing in a set of pairs (o, w)
where the o's are objects and the w's are nonuniform sampling weights.
We require that the elementary CPDs obey two rules related to nonguaranteed
objects. First, if a CPD is defining a distribution over nonguaranteed objects
(e.g., the Uniform CPD on line 9 of Figure 13.1), it should never assign positive
probability to objects that do not exist in the partially completed world $\omega$. To ensure
this, we allow an elementary CPD to assign positive probability to a nonguaranteed
object only if the object was passed in as part of a CPD argument (in Figure 13.1,
{Ball b} is passed in). Second, an elementary CPD cannot "peek" at the tuple
representations of objects that are passed in: it must be invariant to permutations
of the nonguaranteed objects.
13.4.2 Declarative Semantics

So far we have explained Blog semantics procedurally, in terms of a generative
process. To facilitate both knowledge engineering and the development of learning
algorithms, we would like to have declarative semantics. The standard approach,
which is used in most existing first-order probabilistic languages (FOPLs), is to say
that a Blog model defines a certain Bayesian network (BN) over the basic RVs.
In this section we discuss how that approach needs to be modified for Blog.
Figure 13.5 Bayes net for the Blog model in Figure 13.1, with nodes #Ball[],
TrueColor[(Ball, 1)], TrueColor[(Ball, 2)], TrueColor[(Ball, 3)], ...,
BallDrawn[Draw1], ..., BallDrawn[Draw4], and ObsColor[Draw1], ..., ObsColor[Draw4].
The ellipses and dashed arrows indicate that there are infinitely many
TrueColor[b] nodes.

We will write $\sigma$ to denote an instantiation of a set of RVs $\mathrm{vars}(\sigma)$, and $\sigma_X$ to
denote the value that $\sigma$ assigns to $X$. If a BN is finite, then the probability it assigns
to each complete instantiation $\sigma$ is $P(\sigma) = \prod_{X \in \mathrm{vars}(\sigma)} p_X(\sigma_X \mid \sigma_{\mathrm{Pa}(X)})$, where $p_X$
is the CPD for $X$ and $\sigma_{\mathrm{Pa}(X)}$ is $\sigma$ restricted to the parents of $X$. In an infinite
BN, we can write a similar expression for each finite instantiation $\sigma$ that is closed
under the parent relation (that is, $X \in \mathrm{vars}(\sigma)$ implies $\mathrm{Pa}(X) \subseteq \mathrm{vars}(\sigma)$). If the
BN is acyclic and each variable has finitely many ancestors, then these probability
assignments define a unique distribution [14].
The difficulty is that in the BN corresponding to a Blog model, variables often
have infinite parent sets. For instance, the BN for Example 13.1 (shown partially
in Figure 13.5) has an infinite number of basic RVs of the form TrueColor[b]: if it
had only a finite number N of these RVs, it could not represent outcomes with
more than N balls. Furthermore, each of these TrueColor[b] RVs is a parent of each
ObsColor[d] RV, since if BallDrawn[d] happens to be b, then the observed color on
draw d depends directly on the color of ball b. So the ObsColor[d] nodes have
infinitely many parents. In such a model, assigning probabilities to finite
instantiations that are closed under the parent relation does not define a unique
distribution: in particular, it tells us nothing about the ObsColor[d] variables.
We required instantiations to be closed under the parent relation so that the
factors $p_X(\sigma_X \mid \sigma_{\mathrm{Pa}(X)})$ would be well-defined. But we may not need the values of
all of X's parents in order to determine the conditional distribution for X. For
instance, knowing BallDrawn[d] = (Ball, 13) and TrueColor[(Ball, 13)] = Blue is
sufficient to determine the distribution for ObsColor[d]: the colors of all the other
balls are irrelevant in this context. We can read off this context-specific independence
from the dependency statement for ObsColor in Figure 13.1 by noting that the
instantiation (BallDrawn[d] = (Ball, 13), TrueColor[(Ball, 13)] = Blue) determines the
value of the sole CPD argument TrueColor(BallDrawn(d)). We say this instantiation
supports the variable ObsColor[d] (see [20]).
Definition 13.7
An instantiation $\sigma$ supports a basic RV V of the form $f[o_1, \ldots, o_k]$ or
$\#\tau[g_1 = o_1, \ldots, g_k = o_k]$ if all possible worlds consistent with $\sigma$ agree on (1) whether
all the objects $o_1, \ldots, o_k$ exist, and, if so, on (2) the applicable clause in the
dependency or number statement for V and the values for the CPD arguments in that
clause.


Note that some RVs, such as #Ball [] in Example 13.1, are supported by the
empty instantiation. We can now generalize the notion of being closed under the
parent relation.
Definition 13.8
A finite instantiation $\sigma$ is self-supporting if its instantiated variables can be
numbered $X_1, \ldots, X_N$ such that for each $n \le N$, the restriction of $\sigma$ to
$\{X_1, \ldots, X_{n-1}\}$ supports $X_n$.
This definition lets us give semantics to Blog models in a way that is meaningful
even when the corresponding BNs contain infinite parent sets. We will write
$p_V(v \mid \sigma)$ for the probability that V's dependency or number statement assigns
to the value v, given an instantiation $\sigma$ that supports V.
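Definition 13.9
A distribution P over $\Omega_M$ satisfies a Blog model M if for every finite,
self-supporting instantiation $\sigma$ with $\mathrm{vars}(\sigma) \subseteq \mathcal{V}_M$:

$$P(\Omega_\sigma) = \prod_{n=1}^{N} p_{X_n}\bigl(\sigma_{X_n} \mid \sigma_{\{X_1, \ldots, X_{n-1}\}}\bigr) \tag{13.1}$$

where $\Omega_\sigma$ is the set of possible worlds consistent with $\sigma$ and $X_1, \ldots, X_N$ is a
numbering of $\sigma$ as in Definition 13.8.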
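As an illustrative sketch of Equation (13.1), consider the urn model of Example 13.1. The parameter values used here are assumptions made for the sake of the example, since Figure 13.1 is not reproduced in this section: a Poisson(6) prior on #Ball[], a 0.5/0.5 prior on each ball's true color, uniform selection of the drawn ball, and an observation CPD that reports the true color with probability 0.8. Take the instantiation $\sigma$ = (#Ball[] = 2, BallDrawn[Draw1] = (Ball, 1), TrueColor[(Ball, 1)] = Blue, ObsColor[Draw1] = Blue). Numbered in that order, each variable is supported by the restriction of $\sigma$ to its predecessors, so $\sigma$ is self-supporting and the product has one factor per variable:

```latex
\begin{align*}
P(\Omega_\sigma)
  &= p_{\#\mathrm{Ball}[]}(2)
     \cdot p_{\mathrm{BallDrawn}[\mathrm{Draw1}]}\bigl((\mathrm{Ball},1) \mid \#\mathrm{Ball}[] = 2\bigr)
     \cdot p_{\mathrm{TrueColor}[(\mathrm{Ball},1)]}\bigl(\mathrm{Blue} \mid \#\mathrm{Ball}[] = 2\bigr) \\
  &\quad \cdot p_{\mathrm{ObsColor}[\mathrm{Draw1}]}\bigl(\mathrm{Blue} \mid
     \mathrm{BallDrawn}[\mathrm{Draw1}] = (\mathrm{Ball},1),\,
     \mathrm{TrueColor}[(\mathrm{Ball},1)] = \mathrm{Blue}\bigr) \\
  &= \frac{e^{-6}\, 6^{2}}{2!} \cdot \frac{1}{2} \cdot 0.5 \cdot 0.8
   \;\approx\; 0.0089 .
\end{align*}
```

Note that the colors of the other balls never enter the product: the instantiation is not closed under the parent relation of the BN, yet it still receives a well-defined probability.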
A Blog model is well-defined if there is exactly one probability distribution that
satisfies it. Recall that a BN is well-defined if it is acyclic and each variable has
a finite set of ancestors. Another way of saying this is that each variable can be
reached by enumerating its ancestors in a finite, topologically ordered list. The
well-definedness criterion for Blog is similar, but deals with finite, self-supporting
instantiations rather than finite, topologically ordered lists of variables. Because we
are dealing with instantiations rather than variables, we need to make sure that
they cover all possible worlds in addition to covering all basic variables.
Theorem 13.10
Let M be a Blog model. Suppose that $\mathcal{V}_M$ is at most countably infinite,² and for
each $V \in \mathcal{V}_M$ and $\omega \in \Omega_M$, there is a self-supporting instantiation that agrees with
$\omega$ and includes V. Then M is well-defined.
Proof: We provide only a sketch of the proof here, deferring the full version
to a more technical paper. First, since $\mathcal{V}_M$ is at most countably infinite, we
can impose an arbitrary numbering (a bijection with some prefix of the natural
numbers) on $\mathcal{V}_M$. This numbering is global in the sense that it does not
depend on the instantiation of the random variables. Now, we define a sequence
of auxiliary random variables $\{Y_n : 0 \le n < |\mathcal{V}_M|\}$ on $\Omega_M$ as follows. Let
$Y_0(\omega) = X(\omega)$ where X is the first basic RV in the global ordering that is
supported by the empty instantiation. For $n \ge 1$, let $\sigma_n(\omega)$ be the instantiation
$(Y_0 = Y_0(\omega), \ldots, Y_{n-1} = Y_{n-1}(\omega))$. Then let $Y_n(\omega) = Z(\omega)$ where Z is the first basic
RV in the global ordering that is supported by $\sigma_n(\omega)$, but has not already been used
to define $Y_m(\omega)$ for any $m < n$. The important property of the sequence $\{Y_n\}$ is
that any instantiation of $Y_0, \ldots, Y_{n-1}$ determines the CPD for $Y_n$. In other words, if
we define our model in terms of $\{Y_n\}$, we get a standard BN in which each variable
has finitely many ancestors.

2. This is satisfied if the Real and RkVector types are not arguments to random functions
or return types of origin functions.

However, we must show that this sequence $\{Y_n\}$ is well-defined. Specifically, we
must show that for every $n < |\mathcal{V}_M|$ and every $\omega \in \Omega_M$, there exists a basic RV
Z that is supported by $\sigma_n(\omega)$ and has not already been used to define $Y_m(\omega)$ for
any $m < n$. This can be shown using the premise that for every $V \in \mathcal{V}_M$, there
is a self-supporting instantiation consistent with $\omega$ that contains V.
We can use standard results from probability theory to show that there is a
unique probability distribution over full instantiations of $\{Y_n\}$ such that each $Y_n$
has the specified conditional distribution given all its predecessors. It remains to
show that this distribution over instantiations corresponds to a unique distribution
on $\Omega_M$. First, we must show that each full instantiation of $\{Y_n\}$ corresponds to at
most one possible world: this follows from Proposition 13.6, plus the fact that a full
instantiation of $\{Y_n\}$ determines all the basic RVs. Second, we can show that the
probability distribution we have defined over $\{Y_n\}$ is concentrated on instantiations
that actually correspond to possible worlds, not instantiations that give RVs
values of the wrong type, or give RVs non-null values in contexts where they must
be null.
Finally, we need to check that this unique distribution on $\Omega_M$ indeed satisfies M.
For finite, self-supporting instantiations that correspond to the auxiliary
instantiations $\sigma_n(\omega)$ used in defining $\{Y_n\}$, the constraint is satisfied by construction. All
other finite, self-supporting instantiations can be expressed as disjunctions of those
core instantiations. From these observations, it is possible to show that (13.1) is
satisfied for all finite, self-supporting instantiations.
To check that the criterion of Theorem 13.10 holds for a particular example, we
need to consider each basic RV. In Example 13.1, the number RV for balls is
supported by the empty instantiation, so in every world it is part of a self-supporting
instantiation of size one. Each TrueColor[b] RV depends only on whether its argument
exists, so these variables participate in self-supporting instantiations of size two.
Similarly, each BallDrawn variable depends only on what balls exist. To sample an
ObsColor[d] variable, we need to know BallDrawn[d] and TrueColor[BallDrawn[d]],
so these variables are in self-supporting instantiations of size four. Similar
arguments can be made for Examples 13.2 and 13.3. Of course, we would like to have
an algorithm for checking whether a Blog model is well-defined; the criteria given
in Theorem 13.12 in section 13.6.2 are a first step in this direction.

13.5 Evidence and Queries

Because a well-defined Blog model M defines a distribution over model structures,
we can use arbitrary sentences of $\mathcal{L}_M$ as evidence and queries. But sometimes such
sentences are not enough. In Example 13.3, the user observes radar blips, which are
not referred to by any terms in the language. The user could assert evidence about
the blips using existential quantifiers, but then how could he make a query of the
form, "Did this blip come from the same aircraft as that blip?"
A natural solution is to allow the user to extend the language when evidence
arrives, adding constant symbols to refer to observed objects. In many cases, the
user observes some new objects, introduces some new symbols, and assigns the
symbols to the objects in an uninformative order. To handle such cases, Blog
includes a special macro. For instance, given four radar blips at time 8, one can
assert
{Blip r: Time(r) = 8} = {Blip1, Blip2, Blip3, Blip4};
This asserts that there are exactly four radar blips at time 8, and introduces new
constants Blip1, . . . , Blip4 in one-to-one correspondence with those blips.
Formally, the macro augments the model with dependency statements for the
new symbols. The statements implement sampling without replacement; for our
example, we have
Blip1 ~ Uniform({Blip r : (Time(r) = 8)});
Blip2 ~ Uniform({Blip r : (Time(r) = 8) & (Blip1 != r)});
and so on. Once the model has been extended this way, the user can make assertions
about the apparent positions of Blip1, Blip2, etc., and then use these symbols in
queries.
These new constants resemble Skolem constants, but conditioning on assertions
about the new constants is not the same as conditioning on an existential sentence.
For example, suppose you go into a new wine shop, pick up a bottle at random, and
observe that it costs $40. This scenario is correctly modeled by introducing a new
constant Bottle1 with a Uniform CPD. Then observing that Bottle1 costs at least
$40 suggests that this is a fancy wine shop. On the other hand, the mere existence
of a $40+ bottle in the shop does not suggest this, because almost every shop has
some bottle at over $40.
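The difference between the two kinds of evidence can be checked with a small calculation. The sketch below uses entirely hypothetical numbers (two shop types with equal prior probability, 100 bottles per shop, 50 expensive bottles in a fancy shop and 2 in an ordinary one); none of these figures come from the chapter, and the names `PRIOR`, `EXPENSIVE`, and `posterior` are introduced only for illustration.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only to illustrate the distinction.
PRIOR = {"fancy": Fraction(1, 2), "ordinary": Fraction(1, 2)}
BOTTLES = 100                                # bottles per shop
EXPENSIVE = {"fancy": 50, "ordinary": 2}     # bottles priced at $40 or more


def posterior(likelihood):
    """Apply Bayes' rule over shop types for a given likelihood function."""
    joint = {shop: PRIOR[shop] * likelihood(shop) for shop in PRIOR}
    total = sum(joint.values())
    return {shop: p / total for shop, p in joint.items()}


# Evidence 1: a new constant Bottle1 with a Uniform CPD costs at least $40.
random_bottle = posterior(lambda shop: Fraction(EXPENSIVE[shop], BOTTLES))

# Evidence 2: the existential sentence "some bottle costs at least $40".
exists_bottle = posterior(lambda shop: Fraction(int(EXPENSIVE[shop] > 0)))

print(random_bottle["fancy"])   # 25/26
print(exists_bottle["fancy"])   # 1/2
```

Under these assumed numbers, observing the price of a randomly chosen bottle moves the posterior strongly toward the fancy shop, while the existential sentence leaves the prior unchanged because both kinds of shop satisfy it.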

13.6 Inference

Because the set of basic RVs of a Blog model can be infinite, it is not obvious that
inference for well-defined Blog models is even decidable. However, the generative
process intuition suggests a rejection sampling algorithm. We present this algorithm
not because it is particularly efficient, but because it demonstrates the decidability
of inference for a large class of Blog models (see Theorem 13.12 below) and
illustrates several issues that any Blog inference algorithm must deal with. At
the end of this section, we present experimental results from a somewhat more
efficient likelihood weighting algorithm.
13.6.1 Rejection sampling

Suppose we are given a partial instantiation e as evidence, and a query variable Q.
To generate each sample, our rejection sampling algorithm starts with an empty
instantiation $\sigma$. Then it repeats the following steps: enumerate the basic RVs in a
fixed order³ until we reach the first RV V that is supported by $\sigma$ but not already
instantiated in $\sigma$; sample a value v for V according to V's dependency statement;
and augment $\sigma$ with the assignment V = v. The process continues until all the
query and evidence variables have been sampled. If the sample is consistent with
the evidence e, then the program increments a counter $N_q$, where q is the sampled
value of Q. Otherwise, it rejects this sample. After N accepted samples, the estimate
of $P(Q = q \mid e)$ is $N_q / N$.
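The loop described above can be sketched in Python for the urn model of Example 13.1. This is a minimal illustration under stated assumptions, not the Blog implementation: the prior on #Ball is taken to be uniform on {1, ..., 8}, each ball is blue or green with equal probability, and the observed color matches the true color with probability 0.8. The tuple encoding of RVs and the names `enumerate_rvs`, `try_sample`, `sample_world`, and `rejection_sample` are hypothetical.

```python
import itertools
import random
from collections import Counter

COLORS = ("Blue", "Green")
MAX_BALLS = 8            # assumed uniform prior on {1, ..., 8} for #Ball
P_CORRECT = 0.8          # assumed chance that the observed color is the true one
NUM_BALL = ("#Ball",)
UNSUPPORTED = object()   # marker: the instantiation does not support the RV


def enumerate_rvs(num_draws):
    """Fixed enumeration order over the (infinitely many) basic RVs."""
    yield NUM_BALL
    for d in range(1, num_draws + 1):
        yield ("BallDrawn", d)
        yield ("ObsColor", d)
    for i in itertools.count(1):
        yield ("TrueColor", i)


def try_sample(rv, sigma, rng):
    """Sample a value for rv given sigma, or return UNSUPPORTED."""
    if rv == NUM_BALL:
        return rng.randint(1, MAX_BALLS)
    if NUM_BALL not in sigma:
        return UNSUPPORTED
    n = sigma[NUM_BALL]
    kind, index = rv
    if kind == "TrueColor":
        # A ball that does not exist gets the default value None.
        return rng.choice(COLORS) if index <= n else None
    if kind == "BallDrawn":
        return rng.randint(1, n)
    # ObsColor[d] needs BallDrawn[d] and TrueColor[BallDrawn[d]].
    ball = sigma.get(("BallDrawn", index))
    if ball is None or ("TrueColor", ball) not in sigma:
        return UNSUPPORTED
    true_color = sigma[("TrueColor", ball)]
    if rng.random() < P_CORRECT:
        return true_color
    return COLORS[1 - COLORS.index(true_color)]


def sample_world(targets, num_draws, rng):
    """Instantiate the first supported, uninstantiated RV until targets are set."""
    sigma = {}
    while not all(t in sigma for t in targets):
        for rv in enumerate_rvs(num_draws):
            if rv in sigma:
                continue
            value = try_sample(rv, sigma, rng)
            if value is not UNSUPPORTED:
                sigma[rv] = value
                break
    return sigma


def rejection_sample(evidence, query, num_accepted, seed=0):
    """Estimate P(query | evidence) as N_q / N over accepted samples."""
    rng = random.Random(seed)
    num_draws = max(index for _, index in evidence)
    counts, accepted = Counter(), 0
    while accepted < num_accepted:
        sigma = sample_world([query, *evidence], num_draws, rng)
        if all(sigma[rv] == value for rv, value in evidence.items()):
            counts[sigma[query]] += 1
            accepted += 1
    return {q: c / num_accepted for q, c in counts.items()}


evidence = {("ObsColor", d): "Blue" for d in (1, 2, 3)}
estimate = rejection_sample(evidence, NUM_BALL, num_accepted=1000)
print(sorted(estimate.items()))
```

Because the enumeration restarts after every assignment, the sketch also instantiates TrueColor variables for balls that were never drawn. That waste is faithful to the naive algorithm and is one reason a guided method is preferable in practice.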
This algorithm requires a subroutine that determines whether a partial
instantiation $\sigma$ supports a basic RV V, and if so, returns a sample from V's conditional
distribution. For a basic RV V of the form $f[o_1, \ldots, o_k]$ or $\#\tau[g_1 = o_1, \ldots, g_k = o_k]$,
the subroutine begins by checking the values of the relevant number variables in
$\sigma$ to determine whether all of $o_1, \ldots, o_k$ exist. If some of these number variables
are not instantiated, then $\sigma$ does not support V. If some of $o_1, \ldots, o_k$ do not exist,
the subroutine returns the default value for V. If they do all exist, the subroutine
follows the semantics for dependency statements discussed in section 13.4.1. First,
it iterates over the clauses in the dependency (or number) statement until it reaches
a clause whose condition is either undetermined or determined to be true given $\sigma$
(if all the conditions are determined to be false, then it returns the default value for
V). If the condition is undetermined, then $\sigma$ does not support V. If it is determined
to be true, then the subroutine evaluates each of the CPD arguments in this clause.
If $\sigma$ determines the values of all the arguments, then the subroutine samples a value
for V by passing those values to the sampleVal method of this clause's elementary
CPD. Otherwise, $\sigma$ does not support V.
To evaluate terms and quantifier-free formulae, we use a straightforward recursive
algorithm. The base case looks up the value of a particular function application
RV in $\sigma$; if this RV is not instantiated, the algorithm returns "undetermined". To
evaluate a formula, we evaluate its subformulae in order from left to right. We stop
when we hit an undetermined subformula or when the value of the whole formula is
determined. For example, to evaluate $\phi \lor \psi$, we first evaluate $\phi$. If $\phi$ is undetermined,
we return "undetermined"; if $\phi$ is true, we return true, and if $\phi$ is false, we go on
to evaluate $\psi$.⁴

3. Each basic RV $f[o_1, \ldots, o_k]$ or $\#\tau[g_1 = o_1, \ldots, g_k = o_k]$ can be assigned a depth which
is the maximum of the depths of nested tuples and the magnitudes of integers among its
arguments $o_1, \ldots, o_k$. The number of RVs at each given depth is finite. Thus, we can
enumerate first the RVs at depth 0, then those at depth 1, depth 2, etc.

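The left-to-right scheme for a disjunction can be sketched as follows. This is a minimal illustration, assuming that subformulae are atomic Boolean RVs looked up in a dictionary; the names `UNDETERMINED`, `eval_atom`, `eval_or`, and the RV names "phi" and "psi" are hypothetical.

```python
UNDETERMINED = "undetermined"


def eval_atom(rv, sigma):
    """Base case: look up a Boolean RV in the partial instantiation sigma."""
    return sigma.get(rv, UNDETERMINED)


def eval_or(subformulae, sigma):
    """Evaluate a disjunction left to right, stopping as early as possible."""
    for rv in subformulae:
        value = eval_atom(rv, sigma)
        if value is UNDETERMINED:
            return UNDETERMINED      # stop at the first undetermined disjunct
        if value is True:
            return True              # the whole disjunction is determined
    return False                     # every disjunct was determined to be false


sigma = {"psi": True}                        # "phi" is not instantiated
first = eval_or(["phi", "psi"], sigma)       # stops at "phi"
second = eval_or(["psi", "phi"], sigma)      # reaches "psi" first
print(first, second)
```

The two calls differ only in the order of the disjuncts, which shows that the scheme is sound but incomplete: it can report "undetermined" for a formula whose value is in fact fixed by the instantiation.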
It is more complicated to evaluate set expressions such as {Blip r: Time(r) = 8},
which can be used as CPD arguments. A naive algorithm for evaluating this
expression would first enumerate all the objects of type Blip (which would require
certain number variables to be instantiated), then select the blips r that satisfy
Time(r) = 8. But Figure 13.3 specifies that there may exist some blips for each
aircraft a and each natural number t: since there are infinitely many natural numbers,
some worlds contain infinitely many blips. Fortunately, the number of blips r with
Time(r) = 8 is necessarily finite: in every world there are a finite number of aircraft,
and each one generates a finite number of blips at time 8. We have an algorithm
that scans the formula within a set expression for origin function restrictions such
as Time(r) = 8, and uses them to avoid enumerating infinite sets when possible.
These restrictions may be either equality constraints, or inequalities that define a
bounded set of natural numbers, such as Time(r) < 12. A similar method is used
for evaluating quantified formulas.
13.6.2 Termination Criteria

In order to generate each sample, the algorithm above repeatedly instantiates the
first variable that is supported but not yet instantiated, until it instantiates all
the query and evidence variables. When can we be sure that this will take a finite
amount of time? The first way this process could fail to terminate is if it goes into
an infinite loop while checking whether a particular variable is supported. This
happens if the program ends up enumerating an infinite set while evaluating a
set expression or quantified formula. We can avoid this by ensuring that all such
expressions in the Blog model are finite once origin function restrictions are taken
into account.
The sample generator also fails to terminate if it never constructs an instantiation
that supports a particular query or evidence variable. To see how this can happen,
consider calling the subroutine described above to sample a variable V. If V is not
supported, the subroutine will realize this when it encounters a variable U that is
relevant but not instantiated. Now consider a graph over basic variables where we
draw an edge from U to V when the evaluation process for V hits U in this way. If
a variable is never supported, then it must be part of a cycle in this graph, or part
of a receding chain of variables $V_1 \leftarrow V_2 \leftarrow \cdots$ that is extended infinitely.
The graph constructed in this way varies from sample to sample: for instance,
sometimes the evaluation process for ObsColor[d] will hit TrueColor[(Ball, 7)], and
sometimes it will hit TrueColor[(Ball, 13)]. However, we can rule out cycles and
infinite receding chains in all these graphs by considering a more abstract graph
over function symbols and types (along the same lines as the dependency graph of
[15, 4]).

4. This left-to-right evaluation scheme does not always detect that a formula is
determined: for instance, on $\phi \lor \psi$, it returns "undetermined" if $\phi$ is undetermined but $\psi$ is
true, even though $\phi \lor \psi$ must be true in this case.

Figure 13.6: Symbol graphs for (a) the urn-and-balls model in Figure 13.1; (b) the
bibliographic model in Figure 13.2; (c) the aircraft tracking model in Figure 13.3.
Node labels: (a) Color, Ball, Draw, BallDrawn, TrueColor, ObsColor;
(b) Researcher, Publication, Citation, Name, Title, PubCited, Text;
(c) Aircraft, State, Blip, ApparentPos, NaturalNum.

Definition 13.11
The symbol graph for a Blog model M is a directed graph whose nodes are the
types and random function symbols of M, where the parents of a type $\tau$ or function
symbol f are
• the random function symbols that occur on the right-hand side of the dependency
statement for f or some number statement for $\tau$;
• the types of variables that are quantified over in formulae or set expressions on
the right-hand side of such a statement;
• the types of the arguments for f or the return types of origin functions for $\tau$.
The symbol graphs for our three examples are shown in Figure 13.6. If the
sampling subroutine for a basic RV V hits a basic RV U, then there must be
an edge from U's function symbol (or type, if U is a number RV) to V's function
symbol (or type) in the symbol graph. This property, along with ideas from [20],
allows us to prove the following:


Theorem 13.12
Suppose M is a Blog model where
1. uncountable built-in types do not serve as function arguments or as the return
types of origin functions;
2. each quantified formula and set expression ranges over a finite set once origin
function restrictions are taken into account;
3. the symbol graph is acyclic.
Then M is well-defined. Also, for any evidence instantiation e and query variable
Q, the rejection sampling algorithm described in section 13.6.1 converges to the
posterior $P(Q \mid e)$ defined by the model, taking finite time per sampling step.
The criteria in Theorem 13.12 are very conservative: in particular, when we
construct the symbol graph, we ignore all structure in the dependency statements and
just check for the occurrence of function and type symbols. These criteria are
satisfied by the models in Figures 13.1 and 13.2. However, the aircraft tracking model in
Figure 13.3 does not satisfy the criteria because its symbol graph (Figure 13.6(c))
contains a self-loop from State to State. The criteria do not exploit the fact that
State(a, t) depends only on State(a, Pred(t)), and the nonrandom function Pred is
acyclic. Friedman et al. [4] have already dealt with this issue in the context of
probabilistic relational models; their algorithm can be adapted to obtain a stronger
version of Theorem 13.12 that covers the aircraft tracking model.
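Criterion 3 is a plain graph property, so it can be checked mechanically. The sketch below represents a symbol graph as a dictionary from each node to its set of parents and runs a depth-first search for cycles. The edge sets are an assumption: they are one reading of Definition 13.11 applied to the urn model, plus a fragment of the aircraft model containing only the State self-loop mentioned above, since the model statements themselves are not reproduced in this section. The name `has_cycle` is hypothetical.

```python
def has_cycle(parents):
    """Return True if the graph {node: set of parents} contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on the stack, finished
    color = {node: WHITE for node in parents}

    def visit(node):
        color[node] = GRAY
        for parent in parents.get(node, ()):
            state = color.get(parent, WHITE)
            if state == GRAY:             # back edge: a cycle (or self-loop)
                return True
            if state == WHITE and visit(parent):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in parents)


# Assumed symbol graph for the urn-and-balls model (node -> parents).
urn_graph = {
    "Color": set(),
    "Ball": set(),
    "Draw": set(),
    "TrueColor": {"Ball"},
    "BallDrawn": {"Draw", "Ball"},
    "ObsColor": {"Draw", "BallDrawn", "TrueColor"},
}

# Assumed fragment of the aircraft model: State is its own parent.
aircraft_fragment = {
    "Aircraft": set(),
    "NaturalNum": set(),
    "State": {"State", "Aircraft", "NaturalNum"},
}

print(has_cycle(urn_graph))           # False
print(has_cycle(aircraft_fragment))   # True
```

The check operates on symbols only, which is why it rejects the aircraft model even though State(a, t) depends on a strictly earlier time step.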
13.6.3 Experimental results

Milch et al. [20] describe a guided likelihood weighting algorithm that uses backward
chaining from the query and evidence nodes to avoid sampling irrelevant variables.
This algorithm can also be adapted to Blog models. We applied this algorithm
for Example 13.1, asserting that 10 balls were drawn and all appeared blue, and
querying the number of balls in the urn. Figure 13.7(a) shows that when the prior
for the number of balls is uniform over {1, . . . , 8}, the posterior puts more weight
on small numbers of balls; this makes sense because the more balls there are in the
urn, the less likely it is that they are all blue. Figure 13.7(b), using a Poisson(6)
prior, shows a similar but less pronounced effect.
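For the uniform prior, the exact posterior can be reproduced with a short calculation. The sketch below assumes that each ball is blue with probability 0.5, that balls are drawn uniformly with replacement, and that the observed color matches the true color with probability 0.8; these parameter values and the function names are assumptions made for illustration, since Figure 13.1 is not reproduced here. Given n balls of which k are blue, each draw is observed as blue with probability (k/n)(0.8) + (1 - k/n)(0.2), independently across draws.

```python
from math import comb

P_BLUE = 0.5      # assumed prior probability that a ball is blue
P_CORRECT = 0.8   # assumed probability that the observed color is the true one


def likelihood_all_blue(n_balls, n_draws):
    """P(all n_draws observations are blue | the urn holds n_balls balls)."""
    total = 0.0
    for k in range(n_balls + 1):                    # k = number of blue balls
        p_k = comb(n_balls, k) * P_BLUE**k * (1 - P_BLUE)**(n_balls - k)
        frac_blue = k / n_balls
        p_obs_blue = frac_blue * P_CORRECT + (1 - frac_blue) * (1 - P_CORRECT)
        total += p_k * p_obs_blue**n_draws
    return total


def posterior_num_balls(prior, n_draws=10):
    """Posterior over the number of balls given that every draw looked blue."""
    joint = {n: p * likelihood_all_blue(n, n_draws) for n, p in prior.items()}
    z = sum(joint.values())
    return {n: j / z for n, j in joint.items()}


uniform_prior = {n: 1 / 8 for n in range(1, 9)}
posterior = posterior_num_balls(uniform_prior)
for n, p in posterior.items():
    print(n, round(p, 3))
```

Under these assumptions the likelihood is dominated by worlds in which every ball is blue, and the probability of that event halves with each additional ball, so the posterior mass concentrates on small urns.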
Note that in Figure 13.7, the posterior probabilities computed by the likelihood
weighting algorithm are very close to the exact values (computed by exhaustive
enumeration of possible worlds with up to 170 balls). We were able to obtain
this level of accuracy using runs of 20,000 samples with the uniform prior, and
100,000 samples using the Poisson prior. On a Linux workstation with a 3.2GHz
Pentium 4 processor, the runs with the uniform prior took about 35 seconds (571
samples/second), and those with the Poisson prior took about 170 seconds (588
samples/second). Such results could not be obtained using any algorithm that
constructed a single fixed BN, since the number of potentially relevant TrueColor[b]
variables is infinite in the Poisson case.

Figure 13.7: Distribution for the number of balls in the urn (Example 13.1).
Panel (a) uses the uniform prior and panel (b) the Poisson prior; in both panels
the horizontal axis is the number of balls in the urn and the vertical axis is
probability. Dashed lines are the uniform prior (a) or Poisson prior (b); solid
lines are the exact posterior given that 10 balls were drawn and all appeared blue;
and plus signs are posterior probabilities computed by five independent runs of
20,000 samples (a) or 100,000 samples (b).

13.7 Related Work

Gaifman [5] was the first to suggest defining a probability distribution over
first-order model structures. Halpern [10] defines a language in which one can make
statements about such distributions: for instance, that the probability of the set of
worlds that satisfy Flies(Tweety) is 0.8. Probabilistic logic programming [22] can be
seen as an application of this approach to Horn-clause knowledge bases. Such an
approach only defines constraints on distributions, rather than defining a unique
distribution.
Most FOPLs that define unique distributions fix the set of objects and the
interpretations of (non-Boolean) function symbols. Examples include relational
Bayesian networks [12] and Markov logic models [3]. Prolog-based languages such
as probabilistic Horn abduction [26], PRISM [28], and Bayesian logic programs [14]
work with Herbrand models, where the objects are in one-to-one correspondence
with the ground terms of the language (a consequence of the unique names and
domain closure assumptions).
There are a few FOPLs that allow explicit reference uncertainty, i.e., uncertainty
about the interpretations of function symbols. Among these are two languages that
use indexed RVs rather than logical notation: BUGS [7] and indexed probability
diagrams (IPDs) [21]. Reference uncertainty can also be represented in probabilistic
relational models (PRMs) [15], where a "single-valued complex slot" corresponds
to an uncertain unary function. PRMs are unfortunately restricted to unary
functions (attributes) and binary predicates (relations). Probabilistic entity-relationship
models [11] lift this restriction, but represent reference uncertainty using relations
(such as Drawn(d, b)) and special mutual exclusivity constraints, rather than with
functions such as BallDrawn(d). Multientity Bayesian network logic (MEBN) [17]
is similar to Blog in allowing uncertainty about the values of functions with any
number of arguments.
The need to handle unknown objects has been appreciated since the early days
of FOPL research: Charniak and Goldman's plan recognition networks (PRNs)
[2] can contain unbounded numbers of objects representing hypothesized plans.
However, external rules are used to decide what objects and variables to include in
a PRN. While each possible PRN defines a distribution on its own, Charniak and
Goldman do not claim that the various PRNs are all approximations to some single
distribution over outcomes.
Some more recent FOPLs do define a single distribution over outcomes with
varying objects. IPDs allow uncertainty over the index range for an indexed family
of RVs. PRMs and their extensions allow a variety of forms of uncertainty about
the number (or existence) of objects satisfying certain relational constraints [15, 6]
or belonging to each type [23]. However, there is no unified syntax or semantics
for dealing with unknown objects in PRMs. MEBNs take yet another approach:
an MEBN model includes a set of unique identifiers, for each of which there is an
"identity" RV indicating whether the named object exists.
Our approach to unknown objects in Blog can be seen as unifying the PRM
and MEBN approaches. Number statements neatly generalize the various ways
of handling unknown objects in PRMs: number uncertainty [15] corresponds to a
number statement with a single origin function; existence uncertainty [6] can be
modeled with two or more origin functions (and a CPD whose support is {0, 1});
and domain uncertainty [23] corresponds to a number statement with no origin
functions. There is also a correspondence between Blog and MEBN logic: the
tuple representations in a Blog model can be thought of as unique identifiers in
an MEBN model. The difference is that Blog determines which objects actually
exist in a world using number variables rather than individual existence variables.
Finally, it is informative to compare Blog with the IBAL language [25], in which
a program defines a distribution over outputs that can be arbitrary nested data
structures. An IBAL program could implement a Blog-like generative process with
the outputs viewed as logical model structures. But the declarative semantics of
such a program would be less clear than the corresponding Blog model.

13.8 Conclusions and Future Work

Blog is a representation language for probabilistic models with unknown objects.
It contributes to the solution of a very general problem in AI: intelligent systems
must represent and reason about objects, but those objects may not be known a
priori and may not be directly and uniquely identified by perceptual processes. Our
approach defines generative models in which first-order model structures are created
by adding objects and setting function values; everything else follows naturally from
this design decision.

Much work remains to be done on Blog. The inference algorithms presented
in this chapter are not practical for any but the smallest examples. For real-world
problems, we expect to employ Markov chain Monte Carlo (MCMC) techniques (see,
e.g., Gilks et al. [8]), simulating a Markov chain over possible worlds. More precisely,
these algorithms must use partial descriptions of possible worlds: in a model with
infinitely many RVs, a world cannot be represented explicitly as a full instantiation.
We plan to implement a general Gibbs sampling algorithm for Blog models,
using some of the same techniques as the BUGS system [7]. However, for models
with unknown objects, we expect to obtain faster convergence with
Metropolis-Hastings algorithms [18] using proposal distributions that split and merge objects
[13]. For now, it appears that these proposal distributions will need to be designed
by hand to propose reasonable splits and merges (e.g., merging publications with
similar or identical titles), as was done in [23]. However, we have implemented a
general Metropolis-Hastings inference engine for Blog that maintains the state of
the Markov chain and computes acceptance probabilities for any given proposal
distribution. In the future, we plan to explore adaptive MCMC techniques (see,
e.g., [9] and references therein).
Another important question is how to design Blog models that will lead to
accurate inferences from real-world data. For the citation matching problem, Pasula
et al. [23] obtained state-of-the-art accuracy using reasonably simple prior
distributions for publication titles and author names, estimated from BibTeX files and
U.S. Census data (these results are competitive with the discriminative approach of
Wellner et al. [31]). It is not so clear how to estimate the prior distributions for the
numbers of objects of various types, such as researchers and publications. Pasula
et al. [23] simply used a log-normal distribution with a very large variance. As
an alternative to defining such a prior distribution, one could use the
nonparametric version of Blog proposed by Carbonetto et al. [1], which incorporates Dirichlet
process mixture models.
Finally, perhaps the most interesting questions about Blog have to do with
learning. Parameter estimation, even from partially observed data, is conceptually
straightforward: the sampling-based inference algorithms described above can serve
as the basis for Monte Carlo expectation-maximization (EM) algorithms [30]. But
learning the structure of Blog models is an exciting open problem. In other
statistical relational formalisms, techniques have been proposed for discovering
dependencies that hold between attributes of related objects [4, 27]. We believe that
extensions of these techniques can be applied to Blog. The ultimate goal, however,
is to develop algorithms that can hypothesize new attributes, new relations, and
even new types of objects. Blog provides a language in which such hypotheses can
be expressed.


References
[1] P. Carbonetto, J. Kisyński, N. de Freitas, and D. Poole. Nonparametric
Bayesian logic. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2005.
[2] E. Charniak and R. P. Goldman. A Bayesian model of plan recognition.
Artificial Intelligence, 64(1):53-79, 1993.
[3] P. Domingos and M. Richardson. Markov logic: A unifying framework for
statistical relational learning. In ICML Workshop on Statistical Relational
Learning and Its Connections to Other Fields, 2004.
[4] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[5] H. Gaifman. Concerning measures in first order calculi. Israel Journal of
Mathematics, 2:1-18, 1964.
[6] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic
models of relational structure. In Proceedings of the International Conference
on Machine Learning, 2001.
[7] W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for
complex Bayesian modelling. The Statistician, 43(1):169-177, 1994.
[8] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain
Monte Carlo in Practice. Chapman and Hall, London, 1996.
[9] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm.
Bernoulli, 7:223-242, 2001.
[10] J. Y. Halpern. An analysis of first-order logics of probability. Artificial
Intelligence, 46:311-350, 1990.
[11] D. Heckerman, C. Meek, and D. Koller. Probabilistic models for relational
data. Technical Report MSR-TR-2004-30, Microsoft Research, Seattle, WA,
2004.
[12] M. Jaeger. Complex probabilistic modeling with recursive relational Bayesian
networks. Annals of Math and Artificial Intelligence, 32:179-220, 2001.
[13] S. Jain and R. M. Neal. A split-merge Markov chain Monte Carlo procedure
for the Dirichlet process mixture model. Journal of Computational and
Graphical Statistics, 13:158-182, 2004.
[14] K. Kersting and L. De Raedt. Adaptive Bayesian logic programs. In
Proceedings of the International Conference on Inductive Logic Programming,
2001.
[15] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings
of the National Conference on Artificial Intelligence, 1998.


[16] K. Lambert. Free logics, philosophical issues in. In E. Craig, editor, Routledge
Encyclopedia of Philosophy. Routledge, London, 1998.
[17] K. B. Laskey and P. C. G. da Costa. Of starships and Klingons: Bayesian
logic for the 23rd century. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, 2005.
[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
Equations of state calculations by fast computing machines. Journal of
Chemical Physics, 21:1087-1092, 1953.
[19] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG:
Probabilistic models with unknown objects. In Proceedings of the International
Joint Conference on Artificial Intelligence, 2005.
[20] B. Milch, B. Marthi, D. Sontag, S. Russell, D. L. Ong, and A. Kolobov.
Approximate inference for infinite contingent Bayesian networks. In Tenth
International Workshop on Artificial Intelligence and Statistics, 2005.
[21] E. Mjolsness. Labeled graph notations for graphical models. Technical Report
04-03, School of Information and Computer Science, University of California,
Irvine, 2004.
[22] R. T. Ng and V. S. Subrahmanian. Probabilistic logic programming.
Information and Computation, 101(2):150-201, 1992.
[23] H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity
uncertainty and citation matching. In Proceedings of Neural Information Processing
Systems, 2003.
[24] J. Pearl. Probabilistic Reasoning in Intelligent Systems, revised edition.
Morgan Kaufmann, San Francisco, 1988.
[25] A. Pfeffer. IBAL: A probabilistic rational programming language. In
Proceedings of the International Joint Conference on Artificial Intelligence, 2001.
[26] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial
Intelligence, 64(1):81-129, 1993.
[27] A. Popescul, L. H. Ungar, S. Lawrence, and D. M. Pennock. Statistical
relational learning for document mining. In Proceedings of the IEEE International
Conference on Data Mining, 2003.
[28] T. Sato and Y. Kameya. Parameter learning of logic programs for
symbolic-statistical modeling. Journal of Artificial Intelligence Research, 15:391-454,
2001.
[29] R. W. Sittler. An optimal data association problem in surveillance theory.
IEEE Transactions on Military Electronics, MIL-8:125-139, 1964.
[30] G. C. G. Wei and M. A. Tanner. A Monte Carlo implementation of the EM
algorithm and the poor man's data augmentation algorithms. Journal of the
American Statistical Association, 85:699-704, 1990.


[31] B. Wellner, A. McCallum, F. Peng, and M. Hay. An integrated, conditional
model of information extraction and coreference with application to citation
matching. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2004.

14 The Design and Implementation of IBAL: A General-Purpose Probabilistic Language

Avi Pfeffer

This chapter describes IBAL, a high-level representation language for probabilistic
AI. IBAL integrates several aspects of probability-based rational behavior, including
probabilistic reasoning, Bayesian parameter estimation, and decision-theoretic
utility maximization. IBAL is based on the functional programming paradigm, and is
an ideal rapid prototyping language for probabilistic modeling. The chapter presents
the IBAL language, and presents a number of examples in the language. It then
discusses the semantics of IBAL, presenting the semantics in two different ways.
Finally, the inference algorithm of IBAL is presented. Seven desiderata are listed
for inference, and it is shown how the algorithm fulfills each of them.

14.1 Introduction

In a rational programming language, a program specifies a situation encountered
by an agent; evaluating the program amounts to computing what a rational agent
would believe or do in the situation. Rational programming combines the advantages
of declarative representations with features of programming languages such
as modularity, compositionality, and type systems. A system designer need not
reinvent the algorithms for deciding what the system should do in each possible
situation it encounters. It is sufficient to declaratively describe the situation, and
leave the sophisticated inference algorithms to the implementors of the language.
One can think of Prolog as a rational programming language, focused on computing
the beliefs of an agent that uses logical deduction. In the past few years there has
been a shift in AI toward specifications of rational behavior in terms of probability
and decision theory. There is therefore a need for a natural, expressive,
general-purpose, and easy-to-program language for probabilistic modeling. This chapter
presents IBAL, a probabilistic rational programming language. IBAL, pronounced
"eyeball," stands for Integrated Bayesian Agent Language. As its name suggests,
it integrates various aspects of probability-based rational behavior, including
probabilistic reasoning, Bayesian parameter estimation, and decision-theoretic utility
maximization. This chapter focuses on the probabilistic representation and reasoning
capabilities of IBAL, and does not discuss the learning and decision-making aspects.
High-level probabilistic languages have generally fallen into two categories. The
first category is rule-based [19, 13, 5]. In this approach the general idea is to associate
logic-programming-like rules with noise factors. A rule describes how one first-order
term depends on other terms. Given a specific query and a set of observations, a
Bayesian network (BN) can be constructed describing a joint distribution over all
the first-order variables in the domain.
The second category of language is object-based [7, 8, 10]. In this approach, the
world is described in terms of objects and the relationships between them. Objects
have attributes, and the probabilistic model describes how the attributes of an
object depend on other attributes of the same object and on attributes of related
objects. The model specifies a joint probability distribution over the attributes of
all objects in the domain.
This chapter explores a different approach to designing high-level probabilistic
languages. IBAL is a functional language for specifying probabilistic models. Models
in IBAL look like programs in a functional programming language. In the functional
approach, a model is a description of a computational process. The process
stochastically generates a value, and the meaning of the model is the distribution over the
value generated by the process.
The functional approach, as embodied in IBAL, has a number of attractive
features. First of all, it is an extremely natural way to describe a probabilistic
model. To construct a model, one simply has to provide a description of the way
the world works. Describing the generative process explicitly is the most direct way
to describe a generative model. Second, IBAL is highly expressive. It builds on top
of a Turing-complete programming language, so that every generative model that
can reasonably be described computationally can be described in IBAL. Third, by
basing probabilistic modeling languages on programming languages, we are able
to enjoy the benefits of a programming language, such as a type system and type
inference. Furthermore, by building on the technology of functional languages, we
are able to utilize all their features, such as lambda abstraction and higher-order
functions.
In addition, the use of a functional programming framework provides an elegant
and uniform language with which to describe all aspects of a model. All levels of a
model can be described in the language, including the low-level probabilistic
dependencies and the high-level structure. This is in contrast to rule-based approaches,
in which combination rules describe how the different rules fit together. It is also in
contrast to object-based languages, in which the low-level structure is represented
using conditional probability tables and a different language is used for high-level
structure. Furthermore, probabilistic relational models (PRMs) use special syntax
to handle uncertainty over the relational structure. This means that each such
feature must be treated as a special case, with special-purpose inference algorithms.
In IBAL, special features are encoded using the language syntax, and the
general-purpose inference algorithm is applied to handle them.
IBAL is an ideal rapid prototyping language for developing new probabilistic
models. Several examples are provided that show how easy it is to express models
in the language. These include well-known models as well as new models. IBAL has
been implemented, and made publicly available at
http://www.eecs.harvard.edu/~avi/IBAL.
The chapter begins by presenting the IBAL language. The initial focus is on the
features that allow description of generative probabilistic models. After presenting
examples, the chapter presents the declarative semantics of IBAL.
When implementing a highly expressive reasoning language, the question of
inference comes to the forefront. Because IBAL is capable of expressing many
different frameworks, its inference algorithm should generalize the algorithms of
those frameworks. If, for example, a BN is encoded in IBAL, the IBAL inference
algorithm should perform the same operations as a BN inference algorithm. This
chapter describes the IBAL inference algorithm and shows how it generalizes many
existing frameworks, including Bayesian networks, hidden Markov models (HMMs),
and stochastic context-free grammars (SCFGs). Seven desiderata for a
general-purpose inference algorithm are presented, and it is shown how IBAL's algorithm
satisfies all of them simultaneously.

14.2 The IBAL Language

IBAL is a rich language. We first describe the core of the language, which is used to
build generative probabilistic models. Then we discuss how to encode observations
in models. Finally we present some syntactic sugar that makes the language easier
and more natural to use.
14.2.1 Basic Expressions

The basic program unit in IBAL is the expression. An expression describes a
stochastic experiment that generates a value. Just as in a regular programming
language an expression describes a computation that produces a value, so in IBAL
an expression describes a computation that stochastically produces a value. IBAL
provides constructs for defining basic expressions, and for composing expressions
together to produce more complex expressions. In this section we provide an
intuitive meaning for IBAL expressions in terms of stochastic experiments. We will
provide precise semantics in section 14.4. The core of IBAL includes the following
kinds of expressions.


Constant expressions. A constant expression is a literal of one of the built-in
primitive types: Boolean, integer, and symbol. The symbol type contains symbolic
constants, which can be any string value. For example, true, 6, and hello are
all constant expressions. A constant expression represents the experiment that
always produces the given value.
Conditional expressions. The expression if e1 then e2 else e3 provides
conditional choice between two possible outcomes. It corresponds to the experiment in
which e1 is evaluated; then, if the value of e1 is true, e2 is evaluated; otherwise
e3 is evaluated.
Stochastic choice. The expression dist [ p1 : e1, ..., pn : en ] specifies a
stochastic choice among the different possibilities e1, ..., en. Each of the pi is the
probability of choosing the corresponding ei. The expression corresponds to the
experiment in which the ith branch is chosen with probability pi, and then the
expression ei is evaluated.
Variable binding. IBAL allows variables to be named and assigned a value, and
then referred to later. This can be accomplished using an expression of the form
let x = e1 in e2. Here x is the name of the variable being defined, e1 is its
definition, and e2 is the expression in which x can appear. The simplest way to
understand a let expression is that it corresponds to the experiment in which e1
is evaluated, and then e2 is evaluated, with the result of e1 being used wherever
x appears. The result of the entire let expression is the result of e2.
Lambda abstraction. IBAL provides lambda abstraction, allowing the definition
of functions. The expression lambda x1, ..., xn -> e represents the function that
takes arguments x1, ..., xn and whose body is e. Function definitions can also be
recursive, using the syntax fix f (x1, ..., xn) -> e. Here f is the name of the
function being defined, and the body e can refer to f. Both lambda and fix
expressions correspond to experiments that always produce the functional value
defined by the expression. The functional value is a closure consisting of argument
names, a function body, and an environment in which to evaluate free variables.
Function application. The expression e0 (e1, ..., en) represents function
application. It corresponds to the experiment in which e0 is evaluated, and its functional
result is applied to the results of e1, ..., en. Note that there may be uncertainty
in e0, the expression defining the function to be applied.
Tuple construction and access. The expression < x1 : e1, ..., xn : en >
constructs a tuple with components named x1, ..., xn. It corresponds to the
experiment in which each of the ei is evaluated and assigned to the component xi. Once
a tuple has been constructed, a component can be accessed using dot notation.
The expression e.x evaluates the expression e, and extracts component x from
the result.
Comparison. The expression e1 == e2 corresponds to the experiment in which e1
and e2 are evaluated. The result is true if the values of e1 and e2 are the same;
otherwise it is false.


Example 14.1
It is important to note that in an expression of the form let x = e1 in e2 the variable
x is assigned a specific value in the experiment; any stochastic choices made while
evaluating e1 are resolved, and the result is assigned to x. For example, consider
let z = dist [ 0.5 : true, 0.5 : false ] in
z & z
The value of z is resolved to be either true or false, and the same value is used in
the two places in which z appears in z & z. Thus the whole expression evaluates
to true with probability 0.5, not 0.25, which is what the result would be if z were
reevaluated each time it appears. Thus the let construct provides a way to make
different parts of an expression probabilistically dependent, by making them both
mention the same variable.
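The arithmetic behind example 14.1 can be checked by brute-force enumeration. The following is a minimal Python sketch, not IBAL code; the names COIN, prob_true_shared, and prob_true_resampled are hypothetical and introduced only for illustration.

```python
from fractions import Fraction
from itertools import product

# Outcomes of `dist [ 0.5 : true, 0.5 : false ]` as (probability, value) pairs.
COIN = [(Fraction(1, 2), True), (Fraction(1, 2), False)]

def prob_true_shared():
    """P(z & z) when z is drawn once and reused, as `let` prescribes."""
    return sum(p for p, z in COIN if z and z)

def prob_true_resampled():
    """P(z1 & z2) if each mention of z were a fresh, independent draw."""
    return sum(p1 * p2
               for (p1, z1), (p2, z2) in product(COIN, COIN)
               if z1 and z2)

print(prob_true_shared())     # 1/2
print(prob_true_resampled())  # 1/4
```

Exact fractions are used instead of floats so that the two probabilities can be compared without rounding concerns.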
Example 14.2
This example illustrates the use of a higher-order function. It begins by defining
two functions, one corresponding to the toss of a fair coin and one describing a toss
of a biased coin. It then defines a higher-order function, whose return value is one
of the first two functions. This corresponds to the act of deciding which kind of
coin to toss. The example then defines a variable named c whose value is either
the fair or biased function. It then defines two variables x and y to be different
applications of the function contained in c. The variables x and y are conditionally
independent of each other given the value of c. Note, by the way, that in this example
the functions take zero arguments.
let fair = lambda () -> dist [ 0.5 : heads, 0.5 : tails ] in
let biased = lambda () -> dist [ 0.9 : heads, 0.1 : tails ] in
let pick = lambda () -> dist [ 0.5 : fair, 0.5 : biased ] in
let c = pick () in
let x = c () in
let y = c () in
<x:x, y:y>
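To see what "conditionally independent given c" implies for the joint distribution of x and y, one can enumerate the model by hand. The sketch below is a Python analogue of example 14.2 under the stated coin probabilities; FAIR, BIASED, PICK, and joint are hypothetical names, not part of IBAL.

```python
from fractions import Fraction
from itertools import product

F = Fraction
# Each coin maps an outcome to its probability (the bodies of `fair` and `biased`).
FAIR = {"heads": F(1, 2), "tails": F(1, 2)}
BIASED = {"heads": F(9, 10), "tails": F(1, 10)}
# `pick` returns one of the two coins with equal probability.
PICK = [(F(1, 2), FAIR), (F(1, 2), BIASED)]

def joint():
    """Return P(x, y): c is chosen once, then tossed twice independently."""
    table = {}
    for p_c, coin in PICK:
        for x, y in product(coin, coin):
            table[(x, y)] = table.get((x, y), 0) + p_c * coin[x] * coin[y]
    return table

J = joint()
p_x_heads = J[("heads", "heads")] + J[("heads", "tails")]
print(J[("heads", "heads")])  # 53/100
print(p_x_heads * p_x_heads)  # 49/100
```

The two printed values differ, so x and y are dependent once c is marginalized out, even though they are independent for each fixed choice of c.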
14.2.2 Observations

The previous section presented the basic constructs for describing generative
probabilistic models. Using the constructs above, one can describe any stochastic
experiment that generatively produces values. The language presented so far can express
many common models, such as BNs, probabilistic relational models, HMMs,
dynamic Bayesian networks, and SCFGs. All these models are generative in nature.
The richness of the model is encoded in the way the values are generated.
IBAL also provides the ability to describe conditional models, in which the
generative probability distribution is conditioned on certain observations being
satisfied. IBAL achieves this by allowing observations to be encoded explicitly
in a model, at any point. An observation serves to condition the model on the
observation being true.
An observation has the general syntax obs x = v in e where x is a variable, v
is a value, and e is an expression. Its meaning is the same as expression e, except
that the value of variable x is conditioned to be equal to v. The variable x should
have been defined earlier, as part of a let expression.
Example 14.3
Consider
let y = dist [ 0.5 : true, 0.5 : false ] in
let z =
if y
then dist [ 0.9 : true, 0.1 : false ]
else dist [ 0.1 : true, 0.9 : false ] in
obs z = true in
y
Here, the distribution defined by the expression is the conditional distribution over
y, given that z takes on the value true.
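The conditional distribution in example 14.3 follows from Bayes' rule: weight each value of y by the probability that z comes out true, then renormalize. A minimal Python sketch of that calculation (the function name posterior_y_given_z_true is hypothetical):

```python
from fractions import Fraction as F

def posterior_y_given_z_true():
    """P(y | z = true) for the model of example 14.3, by enumeration."""
    prior_y = {True: F(1, 2), False: F(1, 2)}
    # P(z = true | y), read off the two branches of the if expression.
    p_z_true_given_y = {True: F(9, 10), False: F(1, 10)}
    # Unnormalized weight of each y value that is consistent with the observation.
    weights = {y: prior_y[y] * p_z_true_given_y[y] for y in prior_y}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

print(posterior_y_given_z_true()[True])  # 9/10
```

This is only the arithmetic for one small model; it says nothing about how IBAL's own inference algorithm arrives at the same answer.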
14.2.3 Syntactic Sugar

In addition to the basic constructs described above, IBAL provides a good deal of
syntactic sugar. The sugar does not increase the expressive power of the language,
but makes it considerably easier to work with. The syntactic sugar is presented
here, because it will be used in many of the later examples.
The let syntax is extended to make it easy to define functions. The syntax
let f (x1, ..., xn) = e is equivalent to let f = fix f (x1, ..., xn) = e.
Thus far, every IBAL construct has been an expression. Indeed, everything in
IBAL can be written as an expression, and presenting everything as expressions
simplifies the presentation. A real IBAL program, however, also contains definitions.
A block is a piece of IBAL code consisting of a sequence of variable definitions.
Example 14.4
For example, we can rewrite our coins example using definitions.
fair() = dist [ 0.5 : heads, 0.5 : tails ]
biased() = dist [ 0.9 : heads, 0.1 : tails ]
pick() = dist [ 0.5 : fair, 0.5 : biased ]
c = pick()
x = c()
y = c()
The value of this block is a tuple containing a component for every variable defined
in the block, i.e., fair, biased, pick, c, x, and y.


Bernoulli and uniform random variables are so common that a special notation
is created for them. The expression flip α is shorthand for dist [ α : true, 1 − α :
false ]. The expression uniform n is short for dist [ 1/n : 0, ..., 1/n : n − 1 ].
IBAL provides basic operators for working with values. These include logical
operators for working with Boolean values and arithmetic operators for integer
values. IBAL also provides an equality operator that tests any two values for
equality. Operator notation is equivalent to function application, where the relevant
functions are built in.
Dot notation can be used to reference nested components of variables. For
example, x.a.b means the component named b of the component named a of
the variable named x. This notation can appear anywhere a variable appears. For
example, in an observation one can say obs x.a = true in y. This is equivalent
to saying
let z = x.a in obs z = true in y.
Patterns can be used to match sets of values. A pattern may be
an atomic value (Boolean, integer, or string), which matches itself;
the special pattern *, which matches any value;
a variable, which matches any value, binding the variable to the matched value
in the process;
a tuple of patterns, which matches any tuple value such that each component
pattern matches the corresponding component value.
For example, the pattern <2, *, y> matches the value <2, true, h>, binding
y to h in the process. A pattern can appear in an observation. For example,
obs x = <2,*,y> in true conditions the experiment on the value of x matching
the pattern.
Patterns also appear in case expressions, which allow the computation to branch
depending on the value of a variable. The general syntax of case expressions is
case e0 of
# p1 : e1
...
# pn : en
where the pi are patterns and the ei are expressions. The meaning, in terms of a
stochastic experiment, is to begin by evaluating e0. Then its value is matched to
each of the patterns in turn. If the value matches p1, the result of the experiment
is the result of e1. If the value does not match any of p1 through p(i−1) but does
match pi, then ei is the result. It is an error for the value not to match any pattern. A
case expression can be rewritten as a series of nested if expressions.
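The pattern rules and the first-match behavior of case can be mimicked in a few lines. This is a Python sketch of the matching logic as described in the text, not IBAL's implementation; WILDCARD, Var, match, and case are hypothetical names, and Python tuples stand in for IBAL tuples.

```python
WILDCARD = object()  # stands for the special pattern *

class Var:
    """A pattern variable: matches any value and binds its name to it."""
    def __init__(self, name):
        self.name = name

def match(pattern, value, bindings=None):
    """Return a dict of variable bindings if value matches pattern, else None."""
    bindings = {} if bindings is None else bindings
    if pattern is WILDCARD:
        return bindings
    if isinstance(pattern, Var):
        bindings[pattern.name] = value
        return bindings
    if isinstance(pattern, tuple):
        if not isinstance(value, tuple) or len(pattern) != len(value):
            return None
        for p, v in zip(pattern, value):
            if match(p, v, bindings) is None:
                return None
        return bindings
    # Atomic pattern: must be the same kind of value and equal to it.
    same = type(pattern) is type(value) and pattern == value
    return bindings if same else None

def case(value, branches):
    """Try (pattern, body) pairs in order; the first matching pattern wins."""
    for pattern, body in branches:
        b = match(pattern, value)
        if b is not None:
            return body(b)
    raise ValueError("no pattern matches")  # an error, as in the text

print(match((2, WILDCARD, Var("y")), (2, True, "h")))  # {'y': 'h'}
```

Because patterns are tried in order, a later, more general pattern acts as a default for everything the earlier ones did not catch.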
The case expression is useful for describing conditional probability tables as are
used in BNs. In this case, the expression e0 is a tuple consisting of the parents of the
node, each of the patterns pi matches a specific set of values of the parents, and the
corresponding expression ei is the conditional distribution over the node given the
values of the parents. It is also possible to define a pattern that matches whenever
a subset of the variables takes on specified values, regardless of the values of other
variables. Such a pattern can be used to define conditional probability tables with
context-specific independence, where only some of the parents are relevant in certain
circumstances, depending on the values of other parents.
In addition to tuples, IBAL provides algebraic data types (ADTs) for creating
structured data. An ADT is a data type with several variants. Each variant has a
tag and a set of fields. ADTs are very useful in defining recursive data types such
as lists and trees. For example, the list type has two variants. The first is Nil and
has no fields. The second is Cons and has a field representing the head of the list
and a further field representing the remainder of the list.
Example 14.5
Using the list type, we can easily define a stochastic context-free grammar. First
we define the append function that appends two lists. Then, for each nonterminal
in the grammar we define a function corresponding to the act of generating a string
with that nonterminal. For example,
append(x,y) =
  case x of
  # Nil : y
  # Cons(a,z) : Cons(a, append(z,y))
term(x) = Cons(x,Nil)
s() = dist [ 0.6 : term(a),
             0.4 : append(s(),t()) ]
t() = dist [ 0.9 : term(b),
             0.1 : append(t(),s()) ]
We can then examine the beginning of a string generated by the grammar using
the take function:
take(n,x) =
  case <n,x> of
  # <0,*> : Nil
  # <*,Nil> : Nil
  # <*,Cons(y,z)> : Cons(y, take(n-1,z))

IBAL is a strongly typed language. The language includes type declarations that
declare new types, and data declarations that define algebraic data types. The type
system is based on that of ML. The type language will not be presented here, but
it will be used in the examples, where it will be explained.
In some cases, it is useful to define a condition as being erroneous. For example,
when one tries to take the head of an empty list, an error condition should result.
IBAL provides an expression error s, where s is a string, to signal an error
condition. This expression takes on the special value ERROR: s, which belongs to
every type and can only be used to indicate errors.
Finally, IBAL allows comments in programs. A comment is anything beginning
with a // through to the end of the line.

14.3 Examples

Example 14.6
Encoding a BN is easy and natural in IBAL. We include a definition for each variable
in the network. A case expression is used to encode the conditional probability table
for a variable. For example,
burglary = flip 0.01;
earthquake = flip 0.001;
alarm = case <burglary, earthquake> of
# <false, false> : flip 0.01
# <false, true> : flip 0.1
# <true, false> : flip 0.7
# <true, true> : flip 0.8
We can also easily encode conditional probability tables with structure. For
example, we may want the alarm variable to have a noisy-or structure:
alarm = flip 0.01 // leak probability
| earthquake & flip 0.1
| burglary & flip 0.7
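The conditional probability table implied by a noisy-or expression can be derived directly: the alarm stays off only if every active cause, and the leak, independently fails to trigger it. The Python sketch below assumes the three disjuncts are the leak term (0.01), earthquake (0.1), and burglary (0.7); reading the third disjunct as burglary is an assumption about the intended model, and all names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

LEAK, P_QUAKE, P_BURGLARY = F(1, 100), F(1, 10), F(7, 10)

def noisy_or_alarm(burglary, earthquake):
    """P(alarm = true | burglary, earthquake) under the noisy-or structure."""
    p_all_fail = 1 - LEAK              # the leak term is always present
    if earthquake:
        p_all_fail *= 1 - P_QUAKE      # earthquake fails to trigger the alarm
    if burglary:
        p_all_fail *= 1 - P_BURGLARY   # burglary fails to trigger the alarm
    return 1 - p_all_fail

# Print the full four-row table implied by the three parameters.
for b, e in product([False, True], repeat=2):
    print(b, e, noisy_or_alarm(b, e))
```

Three parameters determine all four rows of the table, which is the saving that structured tables offer over listing every row explicitly.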
We may also create variables with context-specific independence. Context-specific
independence is the case where a variable depends on a parent for some values of the
other parents but not others. For example, if we introduce variables representing
whether or not John is at home and John calls, John calling is dependent on
the alarm only in the case that John is at home. IBAL's pattern syntax is very
convenient for capturing context-specific independence. The symbol * is used as
the pattern that matches all values, when we don't care about the value of a specific
variable:
john_home = flip 0.5
john_calls = case <john_home, alarm> of
# <false,*> : false
# <true,false> : flip 0.001
# <true,true> : flip 0.7


Example 14.7
Markov chains can easily be encoded in IBAL. Here we present an example where
the states are integers. The sequence of states produced by the model is represented
as a List. The first line of the program defines the List data type:
data List [a] = Nil | Cons (a, List [a])
This declaration states that List is a parameterized type, taking on the type
parameter a. That is, for any type a, List [a] is also a type. It then goes on to
state that a List [a] can be one of two things: it can be Nil, or it can be the Cons
of two arguments, the first of type a and the second of type List [a].
Given a sequence of states represented as a List, it is useful to be able to examine
a particular state in the sequence. The standard function nth does this.
nth (n,l) : (Int, List [a]) -> a =
case l of
# Cons (x,xs) : if n==0 then x else nth (n-1,xs)
# Nil : error "Too short";
The first line of nth includes a typing rule. It states that nth is a function taking
two arguments, where the first is an integer and the second is a List [a], and
returning a value of type a.
Next, we define the types to build up a Markov model. A Markov model consists of
two functions, an initialization function and a transition function. The initialization
function takes zero arguments and produces a state. The transition function takes
a state argument and produces a state. Markov models are parameterized by the
type of the state, which is here called a.
type Init [a] = () -> a;
type Trans [a] = (a) -> a;
type Markov [a] = < init : Init [a], trans : Trans [a] >;
Given a Markov model, we can realize it to produce a sequence of states.
realize (m) : (Markov [a]) -> List [a] =
let f(x) = Cons (x, f(m.trans (x))) in
f(m.init ());
Thus far, the definitions have been abstract, applying to every Markov model.
Now we define a particular Markov model by supplying definitions for the
initialization and transition functions. Note that the state here is an integer, so the state
space is infinite. The state can be any type whatsoever, including algebraic data
types like lists or trees.
random_walk : Markov [Int] =
< init : lambda () -> 0,
trans : lambda (n) -> dist [ 0.5 : n++, 0.5 : n-- ] >;

It is easy to see how to generalize this example to HMMs by providing an
observation function, and then specifying the observations using obs expressions.
Then, combined with the previous example of BNs, we can generalize this to
dynamic Bayesian networks [2].
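The realize function of example 14.7 describes an unbounded sequence, so any concrete use has to look at a finite prefix only. A Python generator gives a rough analogue; this is a sampling sketch rather than IBAL's inference, the generator stands in for whatever evaluation strategy lets IBAL inspect only part of the list, and realize and random_walk_trans are hypothetical Python names.

```python
import itertools
import random

def realize(init, trans):
    """Lazily yield the (infinite) state sequence of a Markov model."""
    state = init()
    while True:
        yield state
        state = trans(state)

def random_walk_trans(n, rng):
    """Step up or down by one with equal probability."""
    return n + 1 if rng.random() < 0.5 else n - 1

rng = random.Random(0)  # fixed seed so that runs are repeatable
walk = realize(lambda: 0, lambda n: random_walk_trans(n, rng))
prefix = list(itertools.islice(walk, 10))  # the first ten states only
print(prefix)
```

As in the IBAL version, the model is just a pair of functions, so swapping in a different state type only requires different init and trans arguments.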
Example 14.8
One of the features of PRMs is structural uncertainty: uncertainty over the relational structure of the domain. One kind of structural uncertainty is number
uncertainty, where we do not know how many objects an object is related to by a
particular relation. In the development of the SPOOK system [17], a good deal of
code was devoted to handling number uncertainty. In this example, we show how
to encode number uncertainty in IBAL. By encoding it in IBAL, a lot of code is
saved, and all the inference mechanisms for dealing with number uncertainty are
essentially attained for free.
The main mechanism for representing number uncertainty in IBAL is a function
create that creates a set consisting of a given number of objects of a certain kind.
In addition to the number of objects, the function takes the function used to create
individual objects as an argument:
create(n,f) =
  if n == 0
  then Nil
  else Cons(f(), create(n-1, f))
In this function, the argument f is a function that takes zero arguments. However,
create can easily be used to create objects when the creating function takes
arguments, by passing an intermediate function as follows. In the following code
snippet, the field argument is the same for every course that is created, but the
prof argument is different. We see here that the functional framework provides a
great deal of flexibility in the way arguments are defined and passed to functions.
let f() =
  let p = prof(field) in
  course(p, field)
in
create(5, f)
Once we have defined how to create sets of a given size, we can easily introduce
uncertainty over the size. The number of objects to create is defined by its own
expression, which may include dist or uniform expressions.
After creating a set, we want to be able to talk about properties of the set.
PRMs use aggregate operators for this, and these can easily be encoded in IBAL.
The following count function counts how many members of a set satisfy a given
property. The first argument p is a predicate that takes an element of the set as
argument and returns a Boolean.
count(p, s) =
  case s of
  # Nil : 0
  # Cons(x,xs) :
      if p(x)
      then 1 + count(p, xs)
      else count(p, xs)
In addition to count, we can easily define universal and existential quantifiers
and other aggregates.
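To make number uncertainty concrete, the sketch below gives Python analogues of create and count and then computes, by exhaustive enumeration, the distribution of a count when the set size is itself random. The helper count_distribution and the toy distributions are assumptions made for illustration; they are not part of IBAL or of the SPOOK system.

```python
from fractions import Fraction as F
from itertools import product

def create(n, f):
    """Build a list of n objects by calling the zero-argument function f."""
    return [f() for _ in range(n)]

def count(p, s):
    """Number of elements of s that satisfy the predicate p."""
    return sum(1 for x in s if p(x))

def count_distribution(size_dist, obj_dist, p):
    """Exact distribution of count(p, s) when the size of s is drawn from
    size_dist and each element is drawn independently from obj_dist."""
    result = {}
    for n, p_n in size_dist.items():
        for objs in product(obj_dist, repeat=n):
            prob = p_n
            for o in objs:
                prob *= obj_dist[o]
            k = count(p, list(objs))
            result[k] = result.get(k, 0) + prob
    return result

# Size is uniform over {0, 1, 2}; each object is a fair Boolean flip.
size_dist = {0: F(1, 3), 1: F(1, 3), 2: F(1, 3)}
obj_dist = {True: F(1, 2), False: F(1, 2)}
dist = count_distribution(size_dist, obj_dist, lambda x: x)
print(dist[0], dist[1], dist[2])  # 7/12 1/3 1/12
```

Enumeration is exponential in the set size, so this only illustrates the meaning of the model, not a practical inference method.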
Example 14.9
IBAL is an ideal language in which to rapidly prototype new probabilistic
models. Here we illustrate using a recently developed kind of model, the repetition
model [15]. A repetition model is used to describe a sequence of elements in which
repetition of elements from earlier in the sequence is a common occurrence. It is
attached to an existing sequence model such as an n-gram or an HMM. Here we
describe the repetition HMM.
In a repetition HMM, there is a hidden state that evolves according to a
Markov process, just as in an ordinary HMM. An observation is generated at each
time point. With some probability ρ, the observation is generated from memory,
meaning that a previous observation is reused for the current observation. With
the remaining probability 1 − ρ, the observation is generated from the hidden state
according to the observation model of the HMM. This model captures the fact that
there is an underlying generative process as described by the HMM, but this process
is sometimes superseded by repeating elements that have previously appeared.
Repetition is a key element of music, and repetition models have successfully been
applied to modeling musical rhythm.
To describe a repetition HMM in IBAL, we first need a function to select a random
element from a sequence. The function nth takes an integer argument and selects
the given element of the sequence. We then let the argument range uniformly over
the length of the sequence, which is passed as an argument to the select function.
nth(n, seq) =
  case seq of
  # Cons(x,xs) :
      if n == 0
      then x
      else nth(n-1, xs)
  # Nil : error
select(length, seq) = nth(uniform length, seq)
Similarly to the way we defined Markov models earlier, a repetition HMM takes
init, trans, and obs functions as arguments. The parameter ρ must be supplied.
If we used all of IBAL's features it could be a learnable parameter. In our example
we set it to 0.1. The generation process is described exceedingly simply. A function
sequence generates the sequence of observations using a given memory of a given
length, under the given model, beginning in a given state. The first thing it does is
generate the new hidden state according to the transition model. Then it generates
the observation using a dist expression on the parameter ρ. With probability ρ
it selects the observation from memory; otherwise it uses the observation model.
Finally, the entire sequence is put together by consing the current observation with
the new sequence formed using the new memory from the new hidden state.
type Init [a] = () -> a;
type Trans [a] = (a) -> a;
type Obs [a,o] = (a) -> o;
type repetition_hmm [a,o] =
  < init : Init [a], trans : Trans [a], obs : Obs [a,o] >;
param rho = [ 0.1, 0.9 ];
sequence(memory, length, model, state) =
  let h = model.trans(state) in
  let o = pdist rho [ select(length, memory), model.obs(h) ] in
  Cons(o, sequence(Cons(o, memory), length + 1, model, h))
repetition_hmm(model) =
  sequence(Nil, 0, model, model.init())
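The generation process just described can also be sketched outside IBAL. The following Python version is an illustrative assumption rather than a translation of the IBAL semantics: it produces a finite prefix instead of a lazily evaluated infinite sequence, it passes an explicit random number generator, and it falls back to the observation model while the memory is still empty. The function name and argument order are hypothetical.

```python
import random

def sample_repetition_hmm(init, trans, obs, rho, steps, rng):
    """Generate `steps` observations from a repetition HMM.

    init, trans and obs mirror the fields of the IBAL model tuple;
    rho is the probability of repeating an earlier observation.
    """
    state = init(rng)
    memory = []                      # observations emitted so far
    for _ in range(steps):
        state = trans(state, rng)    # advance the hidden state first
        if memory and rng.random() < rho:
            o = rng.choice(memory)   # repeat a uniformly chosen earlier element
        else:
            o = obs(state, rng)      # otherwise use the observation model
        memory.append(o)
    return memory
```

With rho set to 1 every observation after the first repeats something already in memory, and with rho set to 0 the sketch reduces to an ordinary HMM sampler.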

In addition to these examples, IBAL can represent PRMs, and by extension
dynamic PRMs [21]. Meanwhile, the decision-making constructs of IBAL allow the
encoding of influence diagrams and Markov decision processes.

14.4 Semantics
In specifying the semantics of the language, it is sufficient to provide semantics
for the core expressions, since the syntactic sugar is naturally induced from them.
The semantics is distributional: the meaning of a program is specified in terms of
a probability distribution over values.
14.4.1 Distributional Semantics

We use the notation $M[e]$ to denote the meaning of expression $e$ under the
distributional semantics. The meaning function takes as argument a probability
distribution $\Delta$ over environments. The function returns a probability distribution over
values. We write $M[e]\,\Delta\,v$ to denote the probability of $v$ under the meaning of $e$


The Design and Implementation of IBAL: A General-Purpose Probabilistic Language

when the distribution over environments is $\Delta$. We also use the notation $M[e]\,\epsilon\,v$ to
denote the probability of $v$ under the meaning of $e$ when the probability distribution
over environments assigns positive probability only to the environment $\epsilon$.
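We now define the meaning function for different types of expressions. The
meaning of a constant expression $v'$, under a distribution $\Delta$ over environments, is given by

$$M[v']\,\Delta\,v = \begin{cases} 1 & \text{if } v = v', \\ 0 & \text{otherwise.} \end{cases}$$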
The probability that referring to a variable produces a value is obtained simply
by summing over the environments $\epsilon$ in which the variable has the given value:

$$M[x]\,\Delta\,v = \sum_{\epsilon : \epsilon(x) = v} \Delta(\epsilon).$$

The meaning of an if expression is defined as follows. We first take the sum over
all environments of the meaning of the expression in the particular environment.
We need to do this because the meanings of the if clause and of
the then and else clauses are correlated by the environment. Therefore we need
to specify the particular environment before we can break up the meaning into
the meanings of the subexpressions. Given the environment, however, the subexpressions become conditionally independent, so we can multiply their meanings
together.

$$M[\text{if } e_1 \text{ then } e_2 \text{ else } e_3]\,\Delta\,v = \sum_{\epsilon} \Delta(\epsilon) \Big[ (M[e_1]\,\epsilon\,\text{true})(M[e_2]\,\epsilon\,v) + (M[e_1]\,\epsilon\,\text{false})(M[e_3]\,\epsilon\,v) \Big]$$

The distributional semantics of a dist expression simply states that the probability of a value under a dist expression is the weighted sum of the probability of
the value under the different branches:

$$M[\text{dist}[p_1 : e_1, \ldots, p_n : e_n]]\,\Delta\,v = \sum_{i} p_i\,(M[e_i]\,\Delta\,v).$$

To define the meaning of an expression let x = e1 in e2, we first define
a probability distribution $\Delta'$ over extended environments that are produced by
binding $x$ to any possible value. The probability of an extended environment
is the probability of the original environment that is being extended times the
probability that $e_1$ will produce the given value in the original environment. We
then define the meaning of the entire expression to be the meaning of $e_2$ under $\Delta'$.
The notation $\epsilon[x/v']$ indicates the environment produced by extending $\epsilon$ by binding
$x$ to $v'$.

$$M[\text{let } x = e_1 \text{ in } e_2]\,\Delta\,v = M[e_2]\,\Delta'\,v$$

$$\text{where } \Delta'(\epsilon') = \begin{cases} \Delta(\epsilon)\,(M[e_1]\,\epsilon\,v') & \text{if } \epsilon' = \epsilon[x/v'], \\ 0 & \text{otherwise.} \end{cases}$$
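The equations for constants, variables, if, dist, and let can be read as a small recursive evaluator over distributions. The sketch below is only an illustration under stated assumptions: expressions are encoded as Python tuples, an environment is a sorted tuple of bindings, and a distribution over environments is a dictionary. None of these encodings come from the chapter, and the names env and meaning are hypothetical.

```python
from collections import defaultdict

def env(**bindings):
    """An environment as a hashable, order-independent tuple of bindings."""
    return tuple(sorted(bindings.items()))

def meaning(e, delta):
    """Return M[e] delta as a dict from values to probabilities.

    `delta` maps environments to probabilities. Expressions are tuples:
    ("const", v), ("var", x), ("if", e1, e2, e3),
    ("dist", [(p1, e1), ...]) and ("let", x, e1, e2).
    """
    out = defaultdict(float)
    kind = e[0]
    if kind == "const":
        out[e[1]] = 1.0
    elif kind == "var":
        for eps, p in delta.items():
            out[dict(eps)[e[1]]] += p
    elif kind == "if":
        # Fix the environment first; test and branches are then independent.
        for eps, p in delta.items():
            point = {eps: 1.0}
            test = meaning(e[1], point)
            for branch, flag in ((e[2], True), (e[3], False)):
                for v, q in meaning(branch, point).items():
                    out[v] += p * test.get(flag, 0.0) * q
    elif kind == "dist":
        for p_i, e_i in e[1]:
            for v, q in meaning(e_i, delta).items():
                out[v] += p_i * q
    elif kind == "let":
        _, x, e1, e2 = e
        extended = defaultdict(float)   # the distribution over extended environments
        for eps, p in delta.items():
            for v1, q in meaning(e1, {eps: 1.0}).items():
                extended[env(**{**dict(eps), x: v1})] += p * q
        return meaning(e2, extended)
    else:
        raise ValueError(f"unknown expression kind: {kind}")
    return dict(out)

# let x = dist[0.3 : true, 0.7 : false] in if x then "a" else "b"
flip = ("dist", [(0.3, ("const", True)), (0.7, ("const", False))])
program = ("let", "x", flip, ("if", ("var", "x"), ("const", "a"), ("const", "b")))
result = meaning(program, {env(): 1.0})
```

Summing over environments before multiplying matters whenever the test and a branch mention the same variable: the expression if x then x else true yields true with probability 1 under this evaluator, whereas multiplying the marginal meanings of the test and the branches would not.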


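lambda and fix expressions are treated as constants whose values are closures.
The only difference is that the closure specifies an environment, so we take the
probability that the current environment is the closure environment.

$$M[\text{lambda } x_1, \ldots, x_n \to e]\,\Delta\,v = \begin{cases} \Delta(\epsilon) & \text{if } v = \{\text{args} = x_1, \ldots, x_n;\ \text{body} = e;\ \text{env} = \epsilon\}, \\ 0 & \text{otherwise.} \end{cases}$$

$$M[\text{fix } f(x_1, \ldots, x_n) \to e]\,\Delta\,v = \begin{cases} \Delta(\epsilon) & \text{if } v = \{\text{args} = x_1, \ldots, x_n;\ \text{body} = e;\ \text{env} = \epsilon[f/v]\}, \\ 0 & \text{otherwise.} \end{cases}$$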

The distributional semantics for function application is logically constructed
as follows. We sum over all possible environments, and over all possible values
$v_0, v_1, \ldots, v_n$, of the expression $e_0$ defining the function to be applied, and of
the expressions $e_1, \ldots, e_n$ defining the arguments. We take the product of the
probabilities of obtaining each $v_i$ from $e_i$ in the environment, and multiply by
the probability that applying $v_0$ to $v_1, \ldots, v_n$ produces the value $v$. Here, applying
$v_0$ to $v_1, \ldots, v_n$ means taking the meaning of the body of $v_0$ in an environment
formed by extending the closure environment by binding each argument $x_i$ to $v_i$.

$$M[e_0(e_1, \ldots, e_n)]\,\Delta\,v = \sum_{\epsilon} \Delta(\epsilon) \sum_{v_0, v_1, \ldots, v_n} \Big( \prod_{i=0}^{n} M[e_i]\,\epsilon\,v_i \Big) \big( M[e]\,\epsilon'[x_1/v_1, \ldots, x_n/v_n]\,v \big)$$

where $\{\text{args} = x_1, \ldots, x_n;\ \text{body} = e;\ \text{env} = \epsilon'\} = v_0$.


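The meaning of a tuple expression is given by

$$M[\langle x_1 : e_1, \ldots, x_n : e_n \rangle]\,\Delta\,v = \begin{cases} \sum_{\epsilon} \Delta(\epsilon) \prod_{i=1}^{n} M[e_i]\,\epsilon\,v_i & \text{if } v = \langle x_1 : v_1, \ldots, x_n : v_n \rangle, \\ 0 & \text{otherwise.} \end{cases}$$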

The meaning of extracting a component from a tuple is

$$M[e.x]\,\Delta\,v = \sum_{v' : v'.x = v} M[e]\,\Delta\,v'.$$


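Finally, the probability of a comparison being true is derived by taking the sum,
over all possible values, of the probability that both expressions produce the value.

$$M[e_1 == e_2]\,\Delta\,v = \begin{cases} p & \text{if } v = \text{true}, \\ 1 - p & \text{if } v = \text{false}, \\ 0 & \text{otherwise,} \end{cases}$$

where $p = \sum_{\epsilon} \Delta(\epsilon) \sum_{v'} (M[e_1]\,\epsilon\,v')(M[e_2]\,\epsilon\,v')$.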
The distributional semantics captures observations quite simply. The effect of an
observation is to condition the distribution over environments on the observation
holding. When the probability that the observation holds is zero, the probability of
the expression is defined to be zero.

$$M[\text{obs } x = v' \text{ in } e]\,\Delta\,v = \begin{cases} \dfrac{\sum_{\epsilon : \epsilon(x) = v'} \Delta(\epsilon)\,(M[e]\,\epsilon\,v)}{P(x = v')} & \text{if } P(x = v') > 0, \\ 0 & \text{if } P(x = v') = 0, \end{cases}$$

where $P(x = v') = \sum_{\epsilon : \epsilon(x) = v'} \Delta(\epsilon)$.
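The conditioning step in the observation rule amounts to discarding the environments that disagree with the observation and renormalizing the rest. A minimal sketch, assuming environments are tuples of (name, value) pairs and the distribution is a dictionary (both encodings, and the function name, are illustrative assumptions):

```python
def condition(delta, x, value):
    """Condition an environment distribution on the observation x = value.

    Returns the renormalized distribution, or None when the observation
    has probability zero (the case the semantics maps to probability 0).
    """
    consistent = {eps: p for eps, p in delta.items()
                  if dict(eps).get(x) == value}
    mass = sum(consistent.values())      # this is P(x = value)
    if mass == 0:
        return None
    return {eps: p / mass for eps, p in consistent.items()}

delta = {
    (("x", 1), ("y", "a")): 0.2,
    (("x", 1), ("y", "b")): 0.3,
    (("x", 2), ("y", "a")): 0.5,
}
posterior = condition(delta, "x", 1)
```

Here the observation x = 1 has probability 0.5, so the two consistent environments end up with probabilities 0.4 and 0.6.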
14.4.2 Lazy Semantics

A very natural way to define a probabilistic model is to describe a generative model
that generates possibly infinite values, and then to ask queries that only consider
a finite portion of the values. For example, an SCFG may generate arbitrarily long
strings. We may query the probability that a grammar generates a particular string.
This requires looking at only a finite portion of the generated string, and a finite
portion of the generation process.
We use lazy semantics in IBAL to get at the idea that only those parts of an
expression that need to be evaluated in order to generate the result of the expression
are evaluated. There are two places in particular where this applies. In a
let x = e1 in e2
expression, the subexpression e1 is only evaluated if x is actually needed in evaluating e2. More precisely, if e1 defines a tuple, only those components of the tuple
that are needed in e2 are evaluated. In a function application
e0(e1, ..., en)
only those parts of the argument ei are evaluated that are needed in evaluating the
body of the function. The body of the function here could mean the body of any
possible value of e0: if any value of e0 requires a component of the argument, the
component is evaluated.
Example 14.10
Consider the program
f() = Cons(flip 0.5, f())


g(x) =
case x of
# Cons(y,z) -> y
g(f())
The function f() defines an infinite sequence of true and false elements. The
function g() then returns the first element in the sequence. When g is applied to
f(), the body of g specifies that only the first component of its argument is required.
Therefore, when evaluating f(), only its first component will be evaluated. That can
be done by examining a single flip.
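The same behavior can be imitated in Python with a generator, which produces elements only on demand. This is a sketch by analogy, not IBAL's evaluation strategy; the counter argument is a hypothetical device added only to make the number of flips visible.

```python
import random

def f(rng, counter):
    """Lazily yield an endless stream of fair coin flips.

    `counter` is a one-element list used only to record how many
    flips were actually performed.
    """
    while True:
        counter[0] += 1
        yield rng.random() < 0.5

def g(stream):
    """Return the first element of a (possibly infinite) stream."""
    return next(stream)

flips_used = [0]
first = g(f(random.Random(0), flips_used))
```

After the call, flips_used[0] is 1: although f describes an infinite sequence, g forces only its first element.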
The distributional semantics presented earlier is agnostic about whether it is
eager or lazy. It simply presents a set of equations, and says nothing about how
the equations are evaluated. Both eager and lazy interpretations are possible. The
meaning of an expression under either interpretation is only well-defined when the
process of evaluating it converges. The eager and lazy semantics do not necessarily
agree. The eager semantics may diverge in some cases where the lazy semantics
produces a result. However, if the eager semantics converges, the lazy semantics
will produce the same result.

14.5 Desiderata for Inference


IBAL is able to capture many traditional kinds of representations, such as BNs,
HMMs, and SCFGs. It can also express more recent models such as object-oriented
Bayesian networks (OOBNs) and relational probability models [16]. For IBAL to be
successful as a general-purpose language, the implementation should be designed
to capture effective strategies for as many models as possible. This leads to the
following desiderata. To be sure, this list is not complete. In particular, it does not
consider issues to do with the time-space tradeoff. Nevertheless, it is a good set of
goals, and no existing implementation is able to achieve all of them.
1. Exploit independence. Independence and conditional independence are traditionally exploited by BNs. The inference algorithm should have similar properties
to traditional BN algorithms when run on BN models.
2. Exploit low-level structure. In BNs, the conditional probability distribution
over a variable given its parents is traditionally represented as a table. Researchers
have studied more compact representations, such as noisy-or and context-specific
independence. Special-purpose inference algorithms have been designed for these
structures [4, 20]. Because of IBAL's programming language constructs, it is easy
to describe such structures (easier, in fact, than describing full conditional
probability tables). The inference algorithm should be able to take a representation
that elucidates the low-level structure and automatically provide benefits from
exploiting the structure.


3. Exploit high-level structure. Larger models can often be decomposed into
weakly interacting components. It was discovered for OOBNs [17] that exploiting
such high-level structure is a big win. In particular, the different components
tend to be largely decoupled from one another, and they can be separated by a
small interface that renders them conditionally independent of each other. IBAL
represents high-level structure using functions. The inference algorithm should take
advantage of the decoupling of the internals of functions from the remainder of the
model.
4. Exploit repetition. Many frameworks such as SCFGs and HMMs involve
many repeated computations. In IBAL, the same function can be applied many
times, and this should be exploited to avoid repeated computation.
5. Exploit the query. Often, one describes a very complex or infinite probabilistic model, but asks a query that only requires a small portion of the model.
This is the case, for example, for SCFGs: the grammar can generate arbitrarily
long sentences, but only a finite generation process is needed to generate a particular finite sentence. IBAL should use the query to consider only the parts of the
generation process that are necessary for producing the result.
6. Exploit support of variables. When a model contains variables, its behavior
depends on their values. The support of a variable is the set of values it can take
with positive probability. Taking the support into account can simplify inference,
by restricting the set of inputs that need to be considered. It can turn a potentially
infinite inference into a finite one.
7. Exploit evidence. If we have observations in the program, they can be used
to further limit the set of values variables can take, and to restrict the possible
computations. For example, suppose we have a model in which a string is generated
by a grammar that can generate arbitrarily long strings, and then observe that the
string has length at most four. We can use this observation to restrict the portion
of the grammar that needs to be examined.

14.6 Related Approaches
Previous approaches to inference in high-level probabilistic languages have generally
fallen into four categories. On one side are approaches that use approximate
inference, particularly Markov chain Monte Carlo methods. This is the approach
used in BUGS [23] and the approach taken by Pasula and Russell in their first-order
probabilistic logic [14]. While exact inference may be intractable for many models,
and approximate strategies are therefore needed, the goal of this chapter is to push
exact inference as far as possible.
The first generation of high-level probabilistic languages generally used the
knowledge-based model construction (KBMC) approach (e.g., [19, 13, 10, 5]). In
this approach, a knowledge base describes the general probabilistic mechanisms.
These are combined with ground facts to produce a BN for a specific situation. A
standard BN inference algorithm is then used to answer queries.


This approach generally satisfies only the first of the above desiderata. Since a
BN is constructed, any independence will be represented in that network, and can
be exploited by the BN algorithm. The second desideratum can also be satisfied, if
a BN algorithm that exploits low-level structure is used, and the BN construction
process is able to produce that structure. Since the construction process creates one
large BN, any structure resulting from weakly interacting components is lost, so the
third desideratum is not satisfied. Similarly, when there is repetition in the domain
the large BN contains many replicated components, and the fourth desideratum is
not satisfied. Satisfaction of the remaining desiderata depends on the details of the
BN construction process. The most common approach is to grow the network using
backward chaining, starting at the query and the evidence. If any of these lead to
an infinite regress, the process will fail.
Sato and Kameya [22] present a more advanced version of this approach that
achieves some of the aims of this chapter. They use a tabling procedure to avoid
performing redundant computations. In addition, their approach is query-directed.
However, they do not exploit low-level independence or weak interaction between
objects, nor do they utilize observations or support.
More recent approaches take one of two tacks. The first is to design a probabilistic representation language as a programming language, whether a functional
language [9, 18] or logic programming [12]. The inference algorithms presented for
these languages are similar to evaluation algorithms for ordinary programming languages, using recursive descent on the structure of programs. The programming language approach has a number of appealing properties. First, the evaluation strategy
is natural and familiar. Second, a programming language provides the fine-grained
representational control with which to describe low-level structure. Third, simple
solutions are suggested for many of the desiderata. For example, high-level structure
can be represented in the structure of a program, with different functions representing different components. As for exploiting repetition, this can be achieved by
the standard technique of memoization. When a function is applied to a given set
of arguments, the result is cached, and retrieved whenever the same function is
applied to the same arguments. Meanwhile, lazy evaluation can be used to exploit
the query to make a computation simpler.
However, approaches based on programming languages have a major drawback.
They do not do a good job of exploiting independence. Koller et al. [9] made an effort
to exploit independence by maintaining a list of variables shared by different parts of
the computation. The resulting algorithm is much more difficult to understand, and
the solution is only partial. Given a BN encoded in their language, the algorithm can
be viewed as performing variable elimination (VE) using a particular elimination
order: namely, from the last variable in the program upward. It is well-known that
the cost of VE is highly dependent on the elimination order, so the algorithm is
exponentially more expensive for some families of models than an algorithm that
can use any order.
In addition, while these approaches suggest solutions to many of the desiderata,
actually integrating them into a single implementation is difficult. For example,
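The sensitivity of VE to the elimination order can be seen without any probabilities at all, by tracking only which variables each intermediate factor mentions. The sketch below is an illustration on a hypothetical star-shaped model (one hub variable connected to five leaves); the function name and the example are assumptions, not taken from [9].

```python
def max_scope_size(scopes, order):
    """Simulate variable elimination on factor scopes only.

    `scopes` is a list of sets of variable names, one per factor.
    Returns the size of the largest intermediate factor that has to be
    built (counting the variable being eliminated) when variables are
    summed out in `order`.
    """
    scopes = [set(s) for s in scopes]
    largest = 0
    for var in order:
        touching = [s for s in scopes if var in s]
        if not touching:
            continue
        merged = set().union(*touching)      # scope of the product factor
        largest = max(largest, len(merged))
        scopes = [s for s in scopes if var not in s]
        scopes.append(merged - {var})        # scope after summing out var
    return largest

# A star-shaped model: a hub variable connected to five leaves.
leaves = ["l1", "l2", "l3", "l4", "l5"]
star = [{"hub", leaf} for leaf in leaves]
hub_first = max_scope_size(star, ["hub"] + leaves)
hub_last = max_scope_size(star, leaves + ["hub"])
```

Eliminating the hub first builds a factor over all six variables, while eliminating the leaves first never builds a factor over more than two. Since a table over k binary variables has 2^k rows, the gap grows exponentially with the number of leaves.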


Koller et al. [9] suggested using both memoization and lazy evaluation, believing
that since both were standard techniques their combination would be simple. In
fact, it turns out that implementing both simultaneously is considered extremely
difficult!¹ The final three desiderata are all variations on the idea that knowledge
can be used to simplify computation. The general approach was captured by the
term evidence-finite computation in [9]. However, this catchall term fails to capture
the distinctions between the different ways knowledge can be exploited. A careful
implementation of the algorithm in [9] showed that it achieved termination only in
a relatively small number of possible cases. In particular, it failed to exploit support
and observations.
The final approach to high-level probabilistic inference is to use a structured
inference algorithm. In this approach, used in object-oriented Bayesian networks
and relational probabilistic models [17, 16], a BN fragment is provided for each
model component, and the components are related to each other in various ways.
Rather than constructing a single BN to represent an entire domain, inference
works directly on the structured model, using a standard BN algorithm to work
within each component. The approach was designed explicitly to exploit high-level
structure and repetition. In addition, because a standard BN algorithm is used,
this approach exploits independence. However, it does not address the final three
desiderata. An anytime approximation algorithm [6] was provided for dealing with
infinitely recursive models, but it is not an exact inference algorithm.
In addition, this approach does not do as well as one might hope at exploiting
low-level structure. One might rely on the underlying BN inference algorithm to
exploit whatever structure it can. For example, if it is desired to exploit noisy-or
structure, the representation should explicitly encode such structure, and the BN
algorithm should take advantage of it. The problem with this approach is that
it requires a special-purpose solution for each possible structure, and high-level
languages make it easy to specify new structures. A case in point is the structure
arising from quantification over a set of objects. In the SPOOK system [17], an
object A can be related to a set of objects B, and the properties of A can depend
on an aggregate property of B. If implemented naively, A will depend on each of
the objects in B, so its conditional probability table will be exponential in the size
of B. As shown in [17], the relationship between A and B can be decomposed in
such a way that the representation and inference are linear in the size of B. Special-purpose
code had to be written in SPOOK to capture this structure, but it is easy
to specify in IBAL, as described in example 14.8, so it would be highly beneficial if
IBAL's inference algorithm could exploit it automatically.

14.7 Inference

14.7.1 Inference Overview

If we examine the desiderata of section 14.5, we see that they fall into two
categories. Exploiting repetition, queries, support, and evidence all require avoiding
unnecessary computation, while exploiting structure and independence require
performing the necessary computation as efficiently as possible. One of the main
insights gained during the development of IBAL's inference algorithm is that
simultaneously trying to satisfy all the desiderata can lead to quite complex code.
The inference process can be greatly simplified by recognizing the two different kinds
of desiderata, and dividing the inference process into two phases. The first phase
is responsible for determining exactly what computations need to be performed,
while the second phase is responsible for performing them efficiently.
This division of labor is reminiscent of the symbolic probabilistic inference (SPI)
algorithm for BN inference [11], in which the first phase finds a factoring of
the probability expression, and the second phase solves the expression using the
factoring. However, there is a marked difference between the two approaches. In
SPI, the goal of the first phase is to find the order in which terms should be
multiplied. In IBAL, the first phase determines which computations need to be
performed, but not their order. That is left for the variable elimination algorithm
in the second phase. Indeed, SPI could be used in the second phase of IBAL as the
algorithm that computes probabilities.
The first phase of IBAL operates directly on programs, and produces a data
structure called the computation graph. This rooted directed acyclic graph contains
a node for every distinct computation to be performed. A computation consists of
an expression to be evaluated, and the supports of free variables in the expression.
The computation graph contains an edge from one node to another if the second
node represents a computation for a subexpression that is required for the first
node.
The second phase of the algorithm traverses the computation graph, solving every
node. A solution for a node is a conditional probability distribution over the value of
the expression given the values of the free variables, assuming that the free variables
have values in the given supports. The solution is computed bottom-up. To solve a
node, the solutions of its children are combined to form the solution for the node.
On the surface, the design seems similar to that of the KBMC approaches.
They both create a data structure, and then proceed to solve it. The IBAL
approach shares with KBMC the idea of piggybacking on top of existing BN
technology. However, the two approaches are fundamentally different. In KBMC,
the constructed BN contains a node for every random variable occurring in the
solution. By contrast, IBAL's computation graph contains a node for every distinct

1. Simon Peyton-Jones, personal communication.


computation that is performed during the solution process. If different random
variables share the same computation, only one node is created. Secondly, the
computation graph is not a BN. Rather, it is an intermediate data structure that
guides the construction of many different BNs, and their combination to provide the
solution to the query. Thirdly, traditional KBMC approaches typically do not utilize
all the information available in the query, support, and evidence in constructing the
BN, whereas IBAL uses all of these in constructing the computation graph.
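The two properties of the computation graph stressed above, one node per distinct computation and bottom-up solution, can be sketched as follows. This is a toy illustration under stated assumptions: a computation is identified by an expression string plus the supports of its free variables, and "solving" a node is delegated to a caller-supplied function. The class and method names are hypothetical and do not describe IBAL's implementation.

```python
class ComputationGraph:
    """Nodes are keyed by (expression, supports of its free variables)."""

    def __init__(self):
        self.children = {}   # node key -> tuple of child keys

    def node(self, expr, supports, children=()):
        key = (expr, tuple(sorted((x, frozenset(s)) for x, s in supports.items())))
        # Reuse the node if this exact computation was already requested.
        if key not in self.children:
            self.children[key] = tuple(children)
        return key

    def solve(self, root, solve_node):
        """Solve every node reachable from root exactly once, children first."""
        solutions = {}

        def visit(key):
            if key not in solutions:
                kids = [visit(child) for child in self.children[key]]
                solutions[key] = solve_node(key, kids)
            return solutions[key]

        return visit(root), len(solutions)

def describe(key, child_solutions):
    """Stand-in for real inference: just render the computation as text."""
    expr, _supports = key
    if not child_solutions:
        return expr
    return f"{expr}({', '.join(child_solutions)})"

graph = ComputationGraph()
coin_a = graph.node("flip 0.5", {})
coin_b = graph.node("flip 0.5", {})          # same computation, same node
root = graph.node("pair", {}, [coin_a, coin_b])
answer, solved = graph.solve(root, describe)
```

Two random variables defined by the same expression with the same supports map to a single node, so the graph has two nodes and each is solved once.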
14.7.2 First Phase

It is the task of the first phase to construct the computation graph, containing a
node for every computation that has to be performed. At the end of the phase,
each node will contain an expression to be evaluated, annotated with the supports
of the free variables, and the support of the expression itself. The first phase begins
by propagating observations to all subexpressions that they affect. The result of
this operation is an annotated expression, where each expression is annotated with
the effective observation about its result. When the computation graph is later
constructed, the annotations will be used to restrict the supports of variables,
and possibly to restrict the set of computations that are required. Thus the
seventh desideratum of exploiting evidence is achieved. IBAL's observation
propagation process is sound but not complete. For an SCFG, it is able to infer,
when the output string is finite, that only a finite computation is needed to produce
it. The details are omitted here.
14.7.2.1 Lazy Memoization

After propagating observations, the process of constructing the computation graph
begins. In order not to construct any more than is necessary to answer the query, the
graph is constructed lazily. In particular, whenever a let x = e1 in e2 expression is
encountered, the graph for e2 is constructed to determine how much of x is required
for e2. Then only the required amount of the graph for e1 is constructed. (Recall
that a variable can have a complex value, so only part of its value may be required in
another expression.) Similarly, when a function is applied to arguments, the graph
for the arguments is constructed lazily. Since no node of the computation graph is
constructed unless it has been determined that it is required for solving the query,
the fifth desideratum of exploiting the query is achieved.
The fourth desideratum of exploiting repetition is achieved by avoiding
repeated nodes in the graph. In particular, when a function is applied to arguments,
the same node is used as long as (1) the supports of the required parts of the
arguments are the same; (2) the required components of the output are the same;
and (3) the observed evidence on the output is the same. This is quite a strong
property. It requires that the same node be used when the supports of the arguments
are the same, even if the arguments are defined by different expressions. It also


stipulates that the supports only need to be the same on the required parts of the
arguments.
Unfortunately, the standard technique of memoization does not interact well with
lazy evaluation. The problem is that in memoization, when we want to create a new
node in the computation graph, we have to check if there is an existing node for the
same expression that has the same supports for the required parts of the arguments.
But we don't know yet what the required parts of the arguments are, or what their
supports are. Worse yet, with lazy evaluation, we may not yet know these things
for expressions that already have nodes. This issue is the crux of the difficulty with
combining lazy evaluation and memoization. In fact, no functional programming
language appears to implement both, despite the obvious appeal of these features.
A new evaluation strategy was developed for IBAL to achieve both laziness and
memoization together. The key idea is that when the graph is constructed for a
function application, the algorithm speculatively assumes that an argument is not
required. If it turns out that part of it is required, enough of the computation graph
is created for the required part, and the graph for the application is reconstructed,
again speculatively assuming that enough of the argument has been constructed.
This process continues until the speculation turns out to be correct. At each point,
we can check to see if there is a previously created node for the same expression
that uses as much as we think is required of the argument. At no point will we
create a node or examine part of the argument that is not required.
An important detail is that whenever it is discovered that an argument to the
function is required, this fact is stored in the cache. This way, the speculative evaluation is avoided if it has already been performed for the same partial arguments. In
general, the cache consists of a mapping from partial argument supports to either
a node in the computation graph or to a note specifying that another argument is
required.
For example, suppose we have a function
f(x,y,z) = if x then y else z
where the support of x is {true}, the support of y is {5,6}, and z is defined by
a divergent function. We first try to evaluate f with no arguments evaluated. We
immediately discover that x is needed, and store this fact in the cache. We obtain
the support of x, and attempt to evaluate f again. Now, since x must be true, we
discover that y is needed, and store this in the cache. We now attempt again to
evaluate f with the supports of x and y, and since z is not needed, we return with
a computation node, storing the fact that when x and y have the given supports,
the result is the given node. The contents of the cache after the evaluation has
completed are
f(x,y,z)             Need x
f({true},y,z)        Need y
f({true},{5,6},z)    {5,6}


In subsequent evaluations of f, examining the cache will tell us to immediately
evaluate the support of x, and if the support of x is {true}, we will immediately
get the support of y, without any speculative computation required. If the support
of y then turns out to be {5,6}, the result will be retrieved from the cache without
any evaluation.
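The speculative loop and the cache of "need" notes can be sketched in a few lines. The version below is a simplification under explicit assumptions: the cache stores the support of the result rather than a node of a computation graph, whole arguments are demanded rather than parts of them, and the function's demand behavior is supplied as a separate hypothetical helper (if_demand) instead of being discovered from the function body. All names are illustrative.

```python
def lazy_memo_apply(name, params, demand, result_of, arg_supports, cache):
    """Speculatively evaluate a call, consulting and filling `cache`.

    arg_supports maps each parameter to a zero-argument function (a thunk)
    that computes the support of that argument; a thunk is called only
    when its argument turns out to be required.
    """
    known = {}
    while True:
        key = (name, tuple((p, known.get(p)) for p in params))
        if key not in cache:
            missing = demand(known)          # speculate with what we have
            if missing is not None:
                cache[key] = ("need", missing)
            else:
                cache[key] = ("node", result_of(known))
        tag, payload = cache[key]
        if tag == "node":
            return payload
        known[payload] = frozenset(arg_supports[payload]())

def if_demand(known):
    """Which argument of f(x,y,z) = if x then y else z is needed next?"""
    if "x" not in known:
        return "x"
    if True in known["x"] and "y" not in known:
        return "y"
    if False in known["x"] and "z" not in known:
        return "z"
    return None

def if_result(known):
    """Support of the result, using only the branches that can be taken."""
    out = set()
    if True in known["x"]:
        out |= known["y"]
    if False in known["x"]:
        out |= known["z"]
    return frozenset(out)

def diverge():
    raise RuntimeError("z is never required, so this must not run")

cache = {}
thunks = {"x": lambda: {True}, "y": lambda: {5, 6}, "z": diverge}
support = lazy_memo_apply("f", ("x", "y", "z"), if_demand, if_result, thunks, cache)
```

After the call the cache holds three entries mirroring the table above: need x, need y, and the final result {5,6}. A second call with the same supports follows the cached notes directly and never invokes the demand function or touches z.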
14.7.2.2 Support Computation

Aside from issues of laziness and memoization, the support computation is fairly
straightforward, with the support of an expression being computed from the support
of its subexpressions and its free variables. For example, to compute the support of
dist [e1, ..., en], simply take the union of the supports of each of the ei.
Some care is taken to use the supports of some subexpressions to simplify the
computation of other subexpressions, so as to achieve the sixth desideratum of
exploiting supports. The most basic manifestation of this idea is the application
expression e1 e2, where we have functional uncertainty, i.e., uncertainty over the
identity of the function to apply. For such an expression, IBAL first computes the
support of e1 to see which functions can be applied. Then, for each value f in the
support of e1, IBAL computes the support of applying f to e2. Finally, the union of
all these supports is returned as the support of e1 e2. For another example, consider
an expression e of the form if e1 then e2 else e3. A naive implementation would
set the support of e to be the union of the supports of e2 and e3. IBAL is smarter,
and performs a form of short-circuiting: if true is not in the support of e1, the
support of e2 is not included in the support of e, and similarly for false and e3.
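The union rule for dist and the short-circuiting rule for if can be written as a short recursive function. This is a minimal sketch assuming a hypothetical tuple encoding of expressions; it covers only constants, variables, dist, and if.

```python
def support(e, env_support):
    """Support of expression e given the supports of its free variables.

    Expressions: ("const", v), ("var", x), ("dist", [e1, ...]),
    ("if", e1, e2, e3).
    """
    kind = e[0]
    if kind == "const":
        return {e[1]}
    if kind == "var":
        return set(env_support[e[1]])
    if kind == "dist":
        return set().union(*(support(b, env_support) for b in e[1]))
    if kind == "if":
        test = support(e[1], env_support)
        out = set()
        if True in test:                 # the then-branch is reachable
            out |= support(e[2], env_support)
        if False in test:                # the else-branch is reachable
            out |= support(e[3], env_support)
        return out
    raise ValueError(f"unknown expression kind: {kind}")

expr = ("if", ("var", "x"), ("const", 5),
        ("dist", [("const", 6), ("const", 7)]))
narrow = support(expr, {"x": {True}})
wide = support(expr, {"x": {True, False}})
```

When x can only be true, the else-branch is never visited and the support is {5}. When x can take both values, the support is {5, 6, 7}.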
14.7.3 Second Phase

In the second phase, the computation graph is solved from the bottom up. The
solution for each node is generally not represented directly. Rather, it is represented
as a set of factors. A factor mentions a set of variables, and defines a function from
the values of those variables to real numbers. The variables mentioned by the factors
in a solution include a special variable $*$ (pronounced star) corresponding to the
value of the expression, the free variables X of the expression, and other variables Y.
The solution specified by a set of factors $f_1, \ldots, f_n$ is $P(* \mid x) = \frac{1}{Z} \sum_{y} \prod_{i} f_i(*, x, y)$,
where Z is a normalizing factor.² The set of factors at any node is a compact,
implicit representation of the solution at that node. It is up to the solution algorithm
to decide which Y variables to keep around, and which to eliminate.
At various points in the computation, the algorithm eliminates some of the
intermediate variables Y, using VE [3] to produce a new set of factors over the
remaining variables. The root of the computation graph corresponds to the user's

2. The $f_i$ do not need to mention the same variables. The notation $f_i(*, x, y)$ denotes the
value of $f_i$ when $*$, $x$, and $y$ are projected onto the variables mentioned by $f_i$.


query. At the root there are no free variables. To compute the final answer,
all variables other than $*$ are eliminated using VE, all remaining factors are
multiplied together, and the result is normalized. By using VE for the actual
process of computing probabilities, the algorithm achieves the first desideratum
of exploiting independence. The main point is that unlike other programming
language-based approaches, IBAL does not try to compute probabilities directly by
working with a program, but rather converts a program into the more manipulable
form of factors, and rests on tried and true technology for working with them.
In addition, this inference framework provides an easy method to satisfy the
third desideratum of exploiting the high-level structure of programs. As
discussed in section 14.5, high-level structure is represented in IBAL using functions.
In particular, the internals of a function are encapsulated inside the function, and
are conditionally independent of the external world given the function inputs and
outputs. From the point of view of VE, this means that we can safely eliminate all
variables internal to the function consecutively. This idea is implemented by using
VE to eliminate all variables internal to a function at the time the solution to the
function is computed.
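The eliminate, multiply, and normalize steps described above can be sketched on ordinary table factors. This is a minimal illustration, not IBAL's implementation: the factor layout (a list of variable names plus a dense table), the helper names, and the two-variable example are all assumptions made for the sketch.

```python
from itertools import product

# Assumed layout: a factor is (variables, table), where table maps a tuple of
# values (one per variable, in order) to a non-negative number.

def multiply(f, g, domains):
    """Pointwise product of two dense table factors."""
    fv, ft = f
    gv, gt = g
    variables = list(fv) + [v for v in gv if v not in fv]
    table = {}
    for vals in product(*(domains[v] for v in variables)):
        a = dict(zip(variables, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return variables, table

def sum_out(f, var):
    """Eliminate one variable from a factor by summing over its values."""
    fv, ft = f
    keep = [i for i, v in enumerate(fv) if v != var]
    table = {}
    for vals, p in ft.items():
        key = tuple(vals[i] for i in keep)
        table[key] = table.get(key, 0.0) + p
    return [fv[i] for i in keep], table

def query(factors, result_var, domains):
    """Eliminate everything except result_var, multiply what is left, normalize."""
    factors = list(factors)
    for var in (v for v in domains if v != result_var):
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f, domains)
        factors = [f for f in factors if var not in f[0]] + [sum_out(prod, var)]
    joint = factors[0]
    for f in factors[1:]:
        joint = multiply(joint, f, domains)
    z = sum(joint[1].values())
    return {vals[0]: p / z for vals, p in joint[1].items()}

# Hypothetical example: y is true with probability 0.3 and the result copies y.
domains = {"y": [True, False], "res": [True, False]}
prior = (["y"], {(True,): 0.3, (False,): 0.7})
copy = (["res", "y"], {(r, y): 1.0 if r == y else 0.0
                       for r in domains["res"] for y in domains["y"]})
answer = query([prior, copy], "res", domains)
```

The sketch uses dense tables only for brevity; the microfactor representation introduced next replaces them.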
14.7.3.1

Microfactors

Most implementations of VE in BNs represent a factor as a table. A table consists
of a sequence of rows, each row consisting of a complete assignment of values to the
factor variables and a real number. This representation is incapable of capturing
low-level structure, and it also does not closely match the form of IBAL programs.
Therefore, in order to achieve the second desideratum of exploiting low-level
structure, IBAL uses a more refined representation called microfactors.
Microfactors have similarities to other representations used for exploiting low-level
structure, such as partial functions [20] and algebraic decision diagrams [1], but they
were developed to match the structure of IBAL programs as closely as possible.
The design of microfactors is motivated by several observations about IBAL
programs. First, it is common for values of variables to map to zero. Consider a
comparison e1 == e2. A microfactor is created mentioning variables $Y_1$ and $Y_2$
for the outcomes of e1 and e2, and $\ast$, the outcome of the expression. The only
assignments that have positive probability are those where $Y_1$ and $Y_2$ are equal and
$\ast$ is true, or $Y_1$ and $Y_2$ are unequal and $\ast$ is false. All others are zero. To take
advantage of the common zeros, only positive cases are represented explicitly in a
microfactor.
The second observation is that we often don't care about the value of a variable,
as in the case of context-specific independence. For example, given the expression
if x then y else z, we will not care about z if x is true. Similarly, a factor
will often have the same value for all but a few values of a variable. Consider the
expression if x = a then y else z. When translated into a factor, we obtain a
function that is the same for all values of x except a. To take advantage of these
cases, a row in a microfactor allows a variable to take on one of a set of values.


The Design and Implementation of IBAL: A General-Purpose Probabilistic Language

Sets of values are represented as either $V$ or $\overline{V}$, where $V = \{v_1, \ldots, v_n\}$ is an
explicitly enumerated set of elements. The notation $\overline{V}$ denotes the complement of
$V$ with respect to the universe of possible values. These are called Zariski sets,
after the Zariski topology in algebraic geometry. We can use $\overline{\emptyset}$ to denote the situation
where something holds for all values of a variable, and $\overline{\{v\}}$ when something holds
for all but the one value $v$. Next, a row over a set of variables $X_1, \ldots, X_n$ associates
each variable $X_i$ with a Zariski set $Z_i$, notated $\langle X_1 : Z_1, \ldots, X_n : Z_n \rangle$. A row
represents the set of tuples $\langle X_1 = x_1, \ldots, X_n = x_n \rangle$ such that $x_i \in Z_i$ for every $i$. A row
is empty if any of the $Z_i$ is $\emptyset$. A microfactor is a sequence of disjoint, but not
necessarily covering, rows, where each row is associated with a real number.
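To make the definitions concrete, here is a small sketch of Zariski sets, rows, and a microfactor for the comparison e1 == e2 discussed earlier. The class and helper names (`ZSet`, `only`, `all_but`, `lookup`) and the key `"*"` for the result variable are hypothetical choices for this illustration, and the example assumes e1 takes only the values "a" and "b".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ZSet:
    """A Zariski set: an enumerated set V, or its complement when `complemented` is True."""
    values: frozenset
    complemented: bool = False

    def __contains__(self, v):
        return (v in self.values) != self.complemented

def only(*vs):
    """The enumerated set {v1, ..., vn}."""
    return ZSet(frozenset(vs))

def all_but(*vs):
    """The complement of {v1, ..., vn}; all_but() stands for 'any value'."""
    return ZSet(frozenset(vs), complemented=True)

def lookup(microfactor, assignment):
    """Number attached to the row covering `assignment`; 0 if no row covers it."""
    for row, number in microfactor:
        if all(assignment[var] in zset for var, zset in row.items()):
            return number
    return 0.0

# Microfactor for `e1 == e2`: only the positive cases are stored.
equality = [
    ({"Y1": only("a"), "Y2": only("a"), "*": only(True)}, 1.0),
    ({"Y1": only("b"), "Y2": only("b"), "*": only(True)}, 1.0),
    ({"Y1": only("a"), "Y2": all_but("a"), "*": only(False)}, 1.0),
    ({"Y1": only("b"), "Y2": all_but("b"), "*": only(False)}, 1.0),
]
```

Four rows cover every value that e2 might take, however large its universe is, because the unequal cases are expressed with complements rather than enumerated.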
In order to implement VE, we need to define sum and product operations on
microfactors. These in turn require intersection and difference operations on rows.
Difference in turn is defined in terms of intersection and complements. Intersection
is straightforward. Complement is more complex, and defined recursively. Because
rows are not closed under complement, the complement operation returns a set of
rows, whose union is the complement of the given row. These rows are guaranteed
to be disjoint. Details of the operation are omitted.
To implement VE, we need multiplication and summation operators on factors.
Multiplication is straightforward. For summation, an iterative process is used that
guarantees that the resulting microfactor correctly represents the sum over the
given variable in the original factor, and that its rows are disjoint. Again, details
are omitted.
14.7.3.2

Translating Programs into Microfactors

The next step in IBAL inference is to translate a program into a set of microfactors,
and then perform VE. The goal is to produce factors that capture all the structure in
the program, including both the independence structure and the low-level structure.
The translation is expressed through a set of rules, each of which takes an
expression of a certain form and returns a set of microfactors. The notation T[e] is
used to denote the translation rule for expression e. Thus, for a constant expression
v the rule is³

$$T[v] = \begin{array}{|c|c|} \hline \ast & \\ \hline v & 1 \\ \hline \end{array}$$

The Boolean constants and lambda and fix expressions are treated similarly.
For a variable expression, T[x], we need to make sure that the result has the same
value as x. If x is a simple variable, whose values are symbols, the rule is as follows.
Assuming $v_1, \ldots, v_n$ are the values in the support of x, this is achieved with the
3. For convenience, we omit the set brackets for singletons.


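rule

$$T[x] = \begin{array}{|c|c|c|} \hline \ast & x & \\ \hline v_1 & v_1 & 1 \\ \vdots & \vdots & \vdots \\ v_n & v_n & 1 \\ \hline \end{array}$$

Here, we exploit the fact that an assignment of values to variables not covered by
any row has value 0.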
If x is a complex variable with multiple fields, each of which is itself complex, we
could use the above rule, considering all values in the cross-product space of the
fields of x. However, that is unnecessarily inefficient. Rather, for each field a of x,
we ensure separately that $\ast.a$ is equal to x.a. If a itself is complex, we break that
equality up into fields. We end up with a factor like the one above for each simple
chain c defined on x. If we let the simple chains be $c_1, \ldots, c_m$, and the possible
values of $c_i$ be $v^i_1, \ldots, v^i_{n_i}$, we get the rule
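
$$T[x] = \bigcup_{i=1}^{m} \begin{array}{|c|c|c|} \hline \ast.c_i & x.c_i & \\ \hline v^i_1 & v^i_1 & 1 \\ \vdots & \vdots & \vdots \\ v^i_{n_i} & v^i_{n_i} & 1 \\ \hline \end{array}$$

The total number of rows according to this method is $\sum_{i=1}^{m} n_i$, rather than $\prod_{i=1}^{m} n_i$
for the product method.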
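A small worked instance shows the size of the gap between the two row counts. The numbers are chosen purely for illustration: suppose x has $m = 3$ simple chains with $n_1 = 2$, $n_2 = 3$, and $n_3 = 4$ possible values.

```latex
\sum_{i=1}^{3} n_i = 2 + 3 + 4 = 9
\qquad\text{versus}\qquad
\prod_{i=1}^{3} n_i = 2 \cdot 3 \cdot 4 = 24 .
```

The per-chain method needs 9 rows spread over three small factors, while the cross-product method needs 24 rows in a single factor, and the difference grows multiplicatively with every additional chain.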
Next we turn to variable definitions. Recall that those are specified in IBAL
through a let expression of the form let x = e1 in e2. We need some notation: if $F$
is a set of factors, $F^{c_1}_{c_2}$ denotes the same set as $F$, except that chain $c_1$ is substituted
for $c_2$ in all the factors in $F$. Now the rule for let is simple. We compute the factors
for e1, and replace $\ast$ with x. We then conjoin the factors for e2, with no additional
change. The full rule is⁴ $T[\text{let } x = e_1 \text{ in } e_2] = T[e_1]^{x}_{\ast} \cup T[e_2]$.
For if-then-else expressions, we proceed as follows. First we define a primitive
prim_if (x, y, z) that is the same as if but only operates on variables. Then we can
rewrite
if e1 then e2 else e3 =
let x = e1 in
let y = e2 in
let z = e3 in
prim_if (x, y, z)

4. A fresh variable name is provided for the bound variable to avoid name clashes.


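Now, all we need is a translation rule for prim_if and we can invoke the above let
rule to translate all if expressions.⁵ Let the simple chains on y and z be $c_1, \ldots, c_m$.
(They must have the same set of simple chains for the program to be well typed.)
Using the same notation as before for the possible values of these chains, a naive
rule for prim_if is as follows:

$$T[\text{prim\_if}(x, y, z)] = \bigcup_{i=1}^{m} \begin{array}{|c|c|c|c|c|} \hline \ast.c_i & x & y.c_i & z.c_i & \\ \hline v^i_1 & T & v^i_1 & \overline{\emptyset} & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ v^i_{n_i} & T & v^i_{n_i} & \overline{\emptyset} & 1 \\ v^i_1 & F & \overline{\emptyset} & v^i_1 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ v^i_{n_i} & F & \overline{\emptyset} & v^i_{n_i} & 1 \\ \hline \end{array}$$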

This rule exploits the context-specific independence (CSI) present in any if expression:
the outcome is independent of either the then clause or the else clause
given the value of the test. The CSI is captured in the $\overline{\emptyset}$ entries for the irrelevant
variables. However, we can do even better. This rule unites $y.c_i$ and $z.c_i$ in a single
factor, yet there is no row in which both are simultaneously relevant. We see
that if expressions satisfy a stronger property than CSI. To exploit this property,
the prim_if rule produces two factors for each $c_i$ whose product is equal to the
factor above.
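$$T[\text{prim\_if}(x, y, z)] = \bigcup_{i=1}^{m} \left\{ \begin{array}{|c|c|c|c|} \hline \ast.c_i & x & y.c_i & \\ \hline v^i_1 & T & v^i_1 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ v^i_{n_i} & T & v^i_{n_i} & 1 \\ \overline{\emptyset} & F & \overline{\emptyset} & 1 \\ \hline \end{array} \;,\; \begin{array}{|c|c|c|c|} \hline \ast.c_i & x & z.c_i & \\ \hline v^i_1 & F & v^i_1 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ v^i_{n_i} & F & v^i_{n_i} & 1 \\ \overline{\emptyset} & T & \overline{\emptyset} & 1 \\ \hline \end{array} \right\}$$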

Note the last row in each of these factors. It is a way of indicating that the factor is
only relevant if x has the appropriate value. For the first factor, if x has the value
F, the factor has value 1 whatever the values of the other variables, and similarly
for the other factor. The number of rows in the factors for $c_i$ is two more than for
the previous method, because of the irrelevance rows. However, we have gained in
that $y.c_i$ and $z.c_i$ are no longer in the same factor. Considering all the $c_i$, the moral
graph for the second approach contains m fewer edges than for the first approach.
Essentially, the variable x is playing the role of a separator for all the pairs y.ci and

5. In practice, if e1, e2, or e3 are already variable expressions, we can omit the let
expression defining x, y, or z and use them directly in the prim_if.


z.ci . If we can avoid eliminating x until as late as possible, we may never have to
connect many of the y.ci and z.ci .
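The claim that the two smaller factors multiply out to the single naive factor can be checked by brute force for one chain. In this sketch the factors are written as plain Python functions rather than microfactors, and the three-symbol value set is an arbitrary assumption.

```python
from itertools import product

def naive(res, x, y, z):
    """Single factor over (result, x, y, z): result follows y if x holds, else z."""
    return 1 if (x and res == y) or (not x and res == z) else 0

def then_factor(res, x, y):
    """Factor over (result, x, y); the 'x is false' case is the irrelevance row."""
    return 1 if (not x) or res == y else 0

def else_factor(res, x, z):
    """Factor over (result, x, z); the 'x is true' case is the irrelevance row."""
    return 1 if x or res == z else 0

values = ["a", "b", "c"]   # hypothetical values of one simple chain
agree = all(
    naive(r, x, y, z) == then_factor(r, x, y) * else_factor(r, x, z)
    for r, x, y, z in product(values, [True, False], values, values)
)
```

Because neither small factor mentions both y and z, the moral graph gets no edge between them unless eliminating x creates one.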
None of the expression forms introduced so far contained uncertainty. Therefore,
every factor represented a zero-one function, in other words, a constraint on the
values of variables. Intermediate probabilities are finally introduced by the dist
expression, which has the form dist [p1 : e1, ..., pn : en]. As in the case of if,
we introduce a primitive prim_dist (p1, ..., pn), which selects an integer from 1
to n with the corresponding probability. We also use prim_case, which generalizes
the prim_if above to take an integer test with n possible outcomes. We can then
rewrite
dist [p1 : e1 , . . . , pn : en ] =
let x1 = e1 in
...
let xn = en in
let z = prim_dist (p1 , . . . , pn ) in
prim_case (z, [x1 , . . . , xn ])
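To complete the specification, we only need to provide rules for prim_dist and
prim_case. The prim_dist rule is extremely simple:

$$T[\text{prim\_dist}(p_1, \ldots, p_n)] = \begin{array}{|c|c|} \hline \ast & \\ \hline 1 & p_1 \\ \vdots & \vdots \\ n & p_n \\ \hline \end{array}$$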

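The prim_case rule generalizes the rule for prim_if above. It exploits the property
that no two of the $x_j$ can be relevant, because the dist expression selects only one
of them. This technique really comes into its own here. If there are m different
chains defined on the result, as before, and n different possible outcomes of the
dist expression, the number of edges removed from the moral graph is m n. The
rule is

$$T[\text{prim\_case}(z, [x_1, \ldots, x_n])] = \bigcup_{i=1}^{m} \bigcup_{j=1}^{n} \begin{array}{|c|c|c|c|} \hline \ast.c_i & z & x_j.c_i & \\ \hline v^i_1 & j & v^i_1 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ v^i_{n_i} & j & v^i_{n_i} & 1 \\ \overline{\emptyset} & \overline{\{j\}} & \overline{\emptyset} & 1 \\ \hline \end{array}$$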
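The way prim_dist and prim_case combine can be seen in a tiny numeric sketch. It assumes a dist expression with three branches whose bodies are the constants "a", "b", and "a", so each $x_j$ has a known value; the probabilities and function names are made up for the illustration.

```python
probs = [0.2, 0.5, 0.3]        # p_1, ..., p_n (illustrative)
branches = ["a", "b", "a"]     # values of x_1, ..., x_n, constants in this sketch

def prim_dist(z):
    """Factor over the selector z: the probability of choosing branch z."""
    return probs[z - 1]

def case_factor(j, result, z, x_j):
    """Factor for branch j: constrains the result only when z selects j."""
    return 1.0 if z != j or result == x_j else 0.0

def marginal(result):
    """Sum out z from the product of the prim_dist factor and all branch factors."""
    total = 0.0
    for z in range(1, len(probs) + 1):
        weight = prim_dist(z)
        for j, x_j in enumerate(branches, start=1):
            weight *= case_factor(j, result, z, x_j)
        total += weight
    return total
```

Each branch factor mentions only its own $x_j$ together with z and the result, which is what keeps the different branches out of one another's factors.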

The rules for record construction and field access expressions are relatively simple,
and are omitted. Observations are also very simple.


Next, we turn to the mechanism for applying functions. It also needs to be able
to handle functional uncertainty: the fact that the function to be applied is itself
defined by an expression, over whose value we have uncertainty. To start with,
however, let us assume that we know which particular function we are applying to
a certain set of arguments. For a function f, let $f.x_1, \ldots, f.x_n$ denote its formal
arguments, and f.b denote its body. Let $A[f, e_1, \ldots, e_n]$ denote the application of
f to arguments defined by expressions $e_1, \ldots, e_n$. Then

$$A[f, e_1, \ldots, e_n] = T\left[\begin{array}{l} \text{let } f.x_1 = e_1 \text{ in} \\ \quad\vdots \\ \text{let } f.x_n = e_n \text{ in} \\ f.b \end{array}\right].$$
By the let rule presented earlier, this will convert f.b into a set of factors that
mention the result variable $\ast$, the arguments $f.x_i$, and variables internal to the
body of f. Meanwhile, each of the $e_i$ is converted into a set of factors defining the
distribution over $f.x_i$.

$$A[f, e_1, \ldots, e_n] = T[f.b] \cup \bigcup_{i} T[e_i]^{f.x_i}_{\ast}$$

To exploit encapsulation, we want to eliminate all the variables that are internal
to the function call before passing the set of factors out to the next level. This can be
achieved simply by eliminating all temporary variables except for those representing
the $f.x_i$ from T[f.b]. Thus, a VE process is performed for every function application.
The result of performing VE is a conditional distribution over $\ast$ given the $f.x_i$.⁶
Normally in VE, once all the designated variables have been eliminated, the
remaining factors are multiplied together to obtain a distribution over the uneliminated
variables. Here that is not necessary: performing VE returns a set of factors
over the uneliminated variables that is passed to the next level up in the computation.
Delaying the multiplication can remove some edges from the moral graph at
the next level up.
Now suppose we have an application expression $e_0(e_1, \ldots, e_n)$. The expression $e_0$
does not have to name a particular function, and there may be uncertainty as to
its value. We need to consider all possible values of the function, and apply each of
those to the arguments. Let $F$ denote the support of $e_0$. Then for each $f_i \in F$, we
need to compute $A_i = A[f_i, e_1, \ldots, e_n]$ as above.
Now, we cannot simply take the union of the $A_i$ as part of the application result,
since we do not want to multiply factors in different $A_i$ together. The different $A_i$
represent the conditional distribution over the result for different function bodies.
We therefore need to condition $A_i$ on $F$ being $f_i$. This effect is achieved as follows.
Let $A_i^1, \ldots, A_i^m$ be the factors in $A_i$, and let $(r^j_1, p^j_1), \ldots, (r^j_{\ell_j}, p^j_{\ell_j})$ be the rows in
6. There may also be variables that are free in the body of f and not bound by function
arguments. These should also not be eliminated.


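factor $A_i^j$. Then we can write

$$B_i = \bigcup_{j=1}^{m} \begin{array}{|c|c|c|} \hline F & \ast, f_i.x_1, \ldots, f_i.x_n & \\ \hline f_i & r^j_1 & p^j_1 \\ \vdots & \vdots & \vdots \\ f_i & r^j_{\ell_j} & p^j_{\ell_j} \\ \overline{\{f_i\}} & \overline{\emptyset} \text{ for all} & 1 \\ \hline \end{array}$$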

In words, each $B_i^j$ is formed from the corresponding $A_i^j$ in two steps. First, $A_i^j$
is extended by adding a column for $F$, and setting its value to be equal to $f_i$. The
effect is to say that when $F$ is equal to $f_i$, we want $A_i^j$ to hold. Then, a row is added
saying that when $F$ is unequal to $f_i$, the other variables can take on any value and
the result will be 1. The effect is to say that $A_i^j$ does not matter when $F \neq f_i$. We
can now take the union of all the $B_i$. To complete the translation rule for function
application, we just have to supply the distribution over $F$:
$$T[e_0(e_1, \ldots, e_n)] = \bigcup_i B_i \cup T[e_0]^{F}_{\ast}$$
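A numeric sketch may help show why the extra "does not matter" row is needed. It assumes two candidate functions with no arguments, each already reduced by VE to a distribution over a Boolean result; the names and probabilities are invented for the example.

```python
# P(F): which function e0 denotes (illustrative numbers).
p_function = {"f1": 0.4, "f2": 0.6}

# A_i: distribution over the result for each function body, after local VE.
a = {
    "f1": {True: 0.9, False: 0.1},
    "f2": {True: 0.2, False: 0.8},
}

def b(i, f, result):
    """B_i: agrees with A_i where F = f_i, and is 1 wherever F differs from f_i."""
    return a[i][result] if f == i else 1.0

def unnormalized(result):
    """Multiply P(F) with every B_i, then sum out F."""
    total = 0.0
    for f, pf in p_function.items():
        weight = pf
        for i in a:
            weight *= b(i, f, result)
        total += weight
    return total

z = unnormalized(True) + unnormalized(False)
p_true = unnormalized(True) / z
```

The result is the mixture 0.4 * 0.9 + 0.6 * 0.2. Multiplying the unconditioned $A_i$ together instead would weight the outcome True by 0.9 * 0.2, which is not the intended distribution.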

14.8

Lessons Learned and Conclusion


The IBAL implementation represents the culmination of several years of investigation,
which began with the original stochastic Lisp paper [9] and continued with the
SPOOK system [17]. A number of important lessons were learned from the process:
• Stochastic programming languages are surprisingly complex, and a sophisticated
algorithm such as the one in this chapter is needed to implement them.
• As a corollary, a single mechanism is unlikely to achieve all the goals of inference
in a complex system. The move to the two-phase approach greatly simplified the
implementation, but was also an admission that the implementation had entered
a new level of complexity.
• The design of the language and of the inference algorithm go hand in hand. The
set of language constructs in IBAL was chosen to support the specific inference
goals described in this chapter.
• Different approaches that are individually inadequate may each have something to
contribute to the overall solution. Programming language evaluation approaches
provide a natural way to work with programs, and were used in constructing the
computation graph. SPOOK's approach of using local VE processes for different
model components was used. Also, the KBMC approach of separating the model
analysis and probability computation components was used, albeit in a very
different way.


• Beware unexpected interactions between goals! Koller et al. [9] blithely declared
that lazy evaluation and memoization would be used. In retrospect, combining
the two mechanisms was the single most difficult thing in the implementation.
This chapter has presented the probabilistic inference mechanism for IBAL, a
highly expressive probabilistic representation language. A number of apparently
conflicting desiderata for inference were presented, and it was shown how IBAL's
inference algorithm satisfies all of them. It is hoped that the development of IBAL
provides a service to the community in two ways. First, it provides a blueprint
for anyone who wants to build a first-order probabilistic reasoning system. Second,
and more important, it is a general-purpose system that has been released for
public use. In the future it will hopefully be unnecessary for designers of expressive
models to build their own inference engines. IBAL has successfully been tried
on BNs, HMMs (including infinite state-space models), stochastic grammars, and
probabilistic relational models. IBAL has also been used successfully as a teaching
tool in a probabilistic reasoning course at Harvard. Its implementation consists of
approximately 10,000 lines of code. It includes over fifty test examples, all of which
the inference engine is able to handle. IBAL's tutorial and reference manuals are
both over twenty pages long.
Of course, there are many models for which the techniques presented in this
chapter will be insufficient, and for which approximate inference is needed. The next
step of IBAL development is to provide approximate inference algorithms. IBAL's
inference mechanism already provides one way to do this. One can simply plug in
any standard BN approximate inference algorithm in place of VE whenever a set of
factors has to be simplified. However, other methods such as Markov chain Monte
Carlo will change the way programs are evaluated, and will require a completely
different approach.

References
[1] R. I. Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, and
F. Somenzi. Algebraic decision diagrams and their applications. In IEEE/ACM
International Conference on Computer-Aided Design, 1993.
[2] T. Dean and K. Kanazawa. A model for reasoning about persistence and
causation. Computational Intelligence, 5:142-150, 1989.
[3] R. Dechter. Bucket elimination: A unifying framework for probabilistic inference.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence,
1996.
[4] D. Heckerman and J. S. Breese. A new look at causal independence. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1994.
[5] K. Kersting and L. de Raedt. Bayesian logic programs. In Proceedings of
the Work-In-Progress Track at the 10th International Conference on Inductive
Logic Programming, 2000.


[6] D. Koller and A. Pfeffer. Semantics and inference for recursive probability
models. In Proceedings of the National Conference on Artificial Intelligence,
2000.
[7] D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In Uncertainty
in Artificial Intelligence (UAI), 1997.
[8] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings of
the National Conference on Artificial Intelligence, 1998.
[9] D. Koller, D. McAllester, and A. Pfeffer. Effective Bayesian inference for
stochastic programs. In Proceedings of the National Conference on Artificial
Intelligence, 1997.
[10] K. B. Laskey and S. M. Mahoney. Network fragments: Representing knowledge
for constructing probabilistic models. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1997.
[11] Z. Li and B. D'Ambrosio. Efficient inference in Bayes networks as a combinatorial
optimization problem. International Journal of Approximate Inference,
11, 1994.
[12] S. Muggleton. Stochastic logic programs. Journal of Logic Programming,
2001. Accepted subject to revision.
[13] L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic
knowledge bases. Theoretical Computer Science, 1996.
[14] H. Pasula and S. Russell. Approximate inference for first-order probabilistic
languages. In Proceedings of the International Joint Conference on Artificial
Intelligence, 2001.
[15] A. Pfeffer. Repeated observation models. In Proceedings of the National
Conference on Artificial Intelligence, 2004.
[16] A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford
University, 2000.
[17] A. Pfeffer, D. Koller, B. Milch, and K. T. Takusagawa. SPOOK: A system
for probabilistic object-oriented knowledge representation. In Proceedings of
the Conference on Uncertainty in Artificial Intelligence, 1999.
[18] D. Pless and G. Luger. Toward general analysis of recursive probability models.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence,
2001.
[19] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial
Intelligence Journal, 64(1):81-129, 1993.
[20] D. Poole and N. L. Zhang. Exploiting contextual independence in probabilistic
inference. Journal of Artificial Intelligence Research (JAIR), 2003.
[21] S. Sanghai, P. Domingos, and D. Weld. Dynamic probabilistic relational
models. In Proceedings of the International Joint Conference on Artificial
Intelligence, 2003.


[22] T. Sato and Y. Kameya. Parameter learning of logic programs for symbolic
statistical modeling. Journal of Artificial Intelligence Research, 15:391-454,
2001.
[23] D. J. Spiegelhalter, A. Thomas, N. Best, and W. R. Gilks. BUGS 0.5:
Bayesian inference using Gibbs sampling manual. Technical report, Institute
of Public Health, Cambridge University, 1995.

15 Lifted First-Order Probabilistic Inference

Rodrigo de Salvo Braz, Eyal Amir and Dan Roth

Most probabilistic inference algorithms are specified and processed on a propositional
level, even though many domains are better represented by first-order specifications
that compactly stand for a class of propositional instantiations. In the last
fifteen years, many algorithms accepting first-order specifications have been proposed.
However, these algorithms still perform inference on a mostly propositional
model, generated by the instantiation of first-order constructs. When this is done,
the rich and useful first-order structure is not explicit anymore. This first-order
representation and structure allow us to perform lifted inference, that is, inference
on the first-order representation directly, manipulating not only individuals but
also groups of individuals. This has the potential of greatly speeding up inference.
We precisely define the problem and present an algorithm that generalizes variable
elimination and manipulates first-order representations in order to perform lifted
inference.

15.1

Introduction
Probabilistic inference algorithms are widely employed in artificial intelligence.
Graphical models such as Bayesian and Markov networks (BNs and
MNs, respectively) ([8]) are among the most popular frameworks for them. These models are specified by a
set of conditional probabilities (for BNs) or factors, also called potential functions
(for MNs). Both conditional probabilities and factors are defined over particular
subsets of the available random variables, and map assignments of those random
variables to positive real numbers (called potentials in MNs). For our purposes,
it will be helpful to think of graphical models in general and simply consider
conditional probabilities as a type of factor.
For example, in an application for document subject classification, one can specify
a dependence between the random variables subject_apple and word_mac (which


indicate that the subject of the document is "apple" and that the word "mac" is
present in it) by defining a factor on their assignments. The higher the potential
for a given assignment to these random variables, the more likely it will be in the
joint distribution defined by the model.
A limitation of graphical models arises when the same dependence holds between
different subsets of random variables. For example, we might declare the dependence
above to hold also between subject_microsoft and word_windows. In traditional
graphical models, we must use separate potential functions to do so, even though
the dependence is the same. This brings redundancy to the model and possibly
wasted computation. It is also an ad hoc mechanism since it does not cover other
sets of random variables exhibiting the same dependence (in this case, some other
company and product).
The root of this limitation is that graphical models are propositional (random
variables can be seen as analogous to propositions in logic), that is, they do not
allow quantifiers and parameterization of random variables by objects. A first-order
or relational language, on the other hand, does allow for these elements. With such
a language, we can specify a potential function that applies, for example, to all
tuples of random variables obtained by instantiating X and Y in the tuple

subject(X), company(X), product(X, Y), word(Y).    (15.1)

This way we not only cover both cases presented before, but also unforeseen ones,
with a single compact specification.
In the last fifteen years, many proposals for probabilistic inference algorithms
accepting first-order specifications have been presented ([7, 6, 1, 4, 10, 11], among
many others), most of which are based on the theoretical framework of Halpern [5].
However, these solutions still perform inference at a mostly propositional level;
they typically instantiate potential functions according to the objects relevant to
the present query, thus obtaining a regular graphical model on propositional random
variables, and then use a regular inference algorithm on this model. In domains
with a large number of objects this may be both costly and essentially unnecessary.
Suppose we have a medical application about the health of a large population,
with a random variable per person indicating whether they are sick with a certain
disease, and with a potential function representing the dependence between a person
being sick and that person getting hospitalized. To answer the query "what is the
probability that someone will be hospitalized?", an algorithm that depends on
propositionalization will instantiate a random variable per person. However, this
is not necessary, since one can calculate the same probability by reasoning about
individuals on a general level, using the population size only as a parameter, and so answer
that query in a much shorter time. In fact, the number of steps in the latter calculation
would not depend on the population size at all.
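The contrast can be sketched with a toy version of the hospital query. The sketch reads the query as "at least one person is hospitalized" and assumes that people are independent and that each is hospitalized with the same probability p; both the independence assumption and the numbers are illustrative and are not taken from the chapter.

```python
from itertools import product

def propositional(n, p):
    """Enumerate every joint assignment of n per-person variables (2**n worlds)."""
    total = 0.0
    for world in product([True, False], repeat=n):
        if any(world):
            weight = 1.0
            for hospitalized in world:
                weight *= p if hospitalized else 1 - p
            total += weight
    return total

def lifted(n, p):
    """Reason about one generic individual; n enters only as a parameter."""
    return 1 - (1 - p) ** n

small = (propositional(10, 0.05), lifted(10, 0.05))
large = lifted(1_000_000, 1e-7)   # out of reach for the enumeration above
```

The enumeration visits a number of worlds exponential in the population size, whereas the lifted expression is a single formula in n.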
Naturally, it is possible to reformulate the problem so that it is solved in a
more efficient manner. However, this would require manual devising of a process
specific to the model or query in question. It is desirable to have an algorithm that

can receive a general first-order model and automatically answer queries like these
without computational waste.
A first step in this direction was given by Poole [9], who proposes a generalized
version of the variable elimination algorithm [12] that is lifted, that is, deals
with groups of random variables at a first-order level. The algorithm receives
a specification in which parameterized random variables stand for all of their
instantiations and then eliminates them in a way that is equivalent to, but much
cheaper than, eliminating all their instantiations at once. For the parameterized
potential function (15.1), for example, one can eliminate product(X, Y) in a single
step that would be equivalent to eliminating all of its instantiations.
The algorithm in Poole [9], however, applies only to certain types of models
because it uses a single elimination operation that can only eliminate parameterized
random variables containing all parameters present in the potential function (the
method can eliminate product(X, Y) from (15.1) but not company(X) because the
latter does not contain the parameter Y). As we will see later, Poole's algorithm uses
the operation we call inversion elimination. In addition to inversion elimination, we
have developed further operations (the main ones called counting elimination and
partial inversion) that broaden the applicability of lifted inference to a greater
extent ([2, 3]). These operations are combined to form the first-order variable
elimination (FOVE) algorithm presented in this chapter. The cases to which lifted
inference applies can be roughly summarized as those containing dependencies
where the sets of parameters of the parameterized random variables are disjoint or,
when this is not the case, where there is a set of parameters whose instantiations
create independent solvable cases. We specify these conditions in more detail when
explaining the operations, and further discuss applicability in section 15.6. When no
lifted inference operation applies to a specific part of a model, FOVE can still apply
standard propositional methods to that part, assuring completeness and limiting
propositional inference to only some parts of the model.

15.2

Language, Semantics and Inference problem


Like Markov networks, first-order probabilistic models (FOPMs) are essentially
defined by a set of factors. However, unlike them, these factors are defined over
parameterized random variables, and for this reason we call them parfactors
(following [9]).
Given a universe of objects over which the parameters range, we can generate
regular propositional factors from a parfactor by replacing its parameters by
particular objects. A parfactor is therefore a compact representation of a set of
regular factors, and a FOPM is a compact representation of a Markov network
composed of all instantiations of all of its parfactors.
Based on the correspondence to logic concepts, we call a parameterized random
variable an atom, and a parameter a logical variable (as opposed to random
variables). We also call the functors of atoms predicates. Even though we informally
refer to atoms as parameterized random variables, they are not, technically
speaking, random variables, but stand for classes of them. A ground atom, however,
denotes a random variable. Sometimes we call random variables ground to
emphasize their correspondence to ground atoms.
Logical variables are typed, with each type being a finite set of objects. We
denote the domain, or type, of a logical variable X by $D_X$ and its cardinality by
$|X|$. In our examples, unless noted, all logical variables have the same type. Each
predicate p also has its domain, $D_p$, which is the set of values that each of the
random variables with that predicate can take.
Formally, a parfactor g is a tuple $(\phi_g, A_g, C_g)$, where $\phi_g$ is a potential function
defined over atoms $A_g$ to be instantiated by all substitutions of its logical
variables satisfying a constraint $C_g$. A constraint is a pair (F, V) where F is
an equational formula on logical variables and V is the set of logical variables
to be instantiated (some of them may not be in the formula). We sometimes denote
a constraint by its formula F alone, when the set of logical variables V is
clear from context. Tautological formulas are represented by $\top$. For example, the
parfactor $(\phi, (p(X), q(X, Y)), (X = a, \{X, Y\}))$ applies to all instantiations of
(p(X), q(X, Y)) by substitutions of X and Y satisfying X = a. We denote the set
of substitutions satisfying C by [C].
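The instantiation of a parfactor can be sketched as a loop over substitutions. The function name `ground`, the tuple encoding of atoms, the three-object domain, and the constraint X ≠ Y are all assumptions chosen for the illustration, and for brevity the sketch requires atom arguments to be logical variables only.

```python
from itertools import product

def ground(atoms, logical_vars, domain, constraint, potential):
    """Return one propositional factor per substitution that satisfies the constraint.

    atoms: list of (predicate, argument_names), arguments being logical variables.
    constraint: function from a substitution (dict) to bool.
    """
    factors = []
    for objects in product(domain, repeat=len(logical_vars)):
        theta = dict(zip(logical_vars, objects))
        if constraint(theta):
            ground_atoms = tuple(
                (pred, tuple(theta[v] for v in args)) for pred, args in atoms
            )
            factors.append((ground_atoms, potential))
    return factors

# Hypothetical parfactor over (p(X), q(X, Y)) with constraint X != Y.
atoms = [("p", ("X",)), ("q", ("X", "Y"))]
phi = {(True, True): 2.0, (True, False): 1.0, (False, True): 1.0, (False, False): 1.0}
factors = ground(atoms, ["X", "Y"], ["a", "b", "c"],
                 lambda theta: theta["X"] != theta["Y"], phi)
```

Every ground factor shares the same potential table, which is the redundancy that lifted inference tries to avoid materializing.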
While we are neutral as to how the potential functions are actually specified,
logical formulas seem to be a convenient choice. For example, a weighted
formula 0.7 : epidemic(D) sick(P, D) might represent a potential function
$\phi$(epidemic(D), sick(P, D)) with potential 0.7 for assignments in which the formula
is true. This allows us to specify FOPMs by sets of weighted logical formulas
that are intuitive and simple to read, and is the approach taken by Markov logic
networks ([11]).
The projection $C_{|L}$ of a constraint C = (F, V) onto a set of logical variables
L is a constraint equivalent to $(\exists L'\, F, L)$ for $L' = V \setminus L$. Intuitively, $C_{|L}$ describes
the conditions posed by C on L alone, that is, the possible substitutions on L
that are part of substitutions in [C]. For example, $(X \neq a \wedge X \neq Y \wedge Y \neq b, \{X, Y\})_{|\{X\}} = (X \neq a, \{X\})$.
FOVE uses a constraint solver which is able to
solve several constraint problems, such as determining the number of solutions of a
constraint and its projection onto sets of logical variables.
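The two solver tasks mentioned here, counting solutions and projecting, can be illustrated by brute-force enumeration over a small domain. This is only a sketch: a real constraint solver would work symbolically, and the function names, the four-object domain, and the reading of the example constraint as inequalities are assumptions.

```python
from itertools import product

def solutions(formula, variables, domain):
    """All substitutions of `variables` over `domain` that satisfy `formula`."""
    found = []
    for objects in product(domain, repeat=len(variables)):
        theta = dict(zip(variables, objects))
        if formula(theta):
            found.append(theta)
    return found

def project(formula, variables, keep, domain):
    """Substitutions on `keep` that extend to at least one full solution."""
    return {tuple(theta[v] for v in keep)
            for theta in solutions(formula, variables, domain)}

domain = ["a", "b", "c", "d"]

def formula(theta):
    # Constraint: X != a, X != Y, and Y != b.
    return theta["X"] != "a" and theta["X"] != theta["Y"] and theta["Y"] != "b"

count = len(solutions(formula, ["X", "Y"], domain))
projection = project(formula, ["X", "Y"], ["X"], domain)
```

On this domain the projection onto X keeps exactly the objects other than a, matching the constraint X ≠ a.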
In certain contexts we wish to describe the class of random variables instantiated
from an atom with constraints on its logical variables (for example, the set of
random variables instantiated from p(X, Y), with X = a). We call such pairs
of atoms and constraints constrained atoms, or c-atoms. The c-atoms of a
parfactor are the c-atoms formed by its atoms and its constraint.
Let be a parfactor, c-atom, constraint or a set of those. We dene RV () to
be the set of (ground) random variables specied by , and denotes the result
of applying a substitution to . [Cg ] is also denoted by g .
A FOPM is specified by a set of parfactors $G$ and the types of its logical variables. Its semantics is a joint distribution defined on $RV(G)$ by the Markov network formed by all the instantiations of parfactors. Thus it is proportional to the product of all instantiated parfactors:

$$P(RV(G)) \propto \prod_{g \in G} \prod_{\theta \in \Theta_g} \phi_g(A_g\theta).$$

For convenience, we denote $\prod_{\theta \in \Theta_g} \phi_g(A_g\theta)$ by $\Phi(g)$, and $\prod_{g \in G} \Phi(g)$ by $\Phi(G)$. Therefore we can write the above as $P(RV(G)) \propto \Phi(G)$.
The most important inference task in graphical models is marginalization. For FOPMs, it takes the following form: given a set of ground random variables $Q$, calculate

$$P(Q) \propto \sum_{RV(G) \setminus Q} \Phi(G), \qquad (15.2)$$

where the summation ranges over all assignments to $RV(G) \setminus Q$. Posterior probabilities can be calculated by representing evidence as additional parfactors on the evidence atoms.
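To make (15.2) concrete, here is a minimal brute-force sketch that grounds a tiny model and sums the product of all ground factors over the non-query variables. The three-person domain, the potential table, and the function name `marginal` are hypothetical choices for illustration; this is the propositional baseline that lifted inference tries to avoid, not part of FOVE.

```python
from itertools import product

def marginal(factors, variables, query):
    """P(query) by enumeration: sum the product of ground factors over all other variables."""
    weights = {}
    for values in product((0, 1), repeat=len(variables)):
        world = dict(zip(variables, values))
        weight = 1.0
        for scope, table in factors:
            weight *= table[tuple(world[v] for v in scope)]
        weights[world[query]] = weights.get(world[query], 0.0) + weight
    z = sum(weights.values())  # normalization constant
    return {value: w / z for value, w in weights.items()}

# Hypothetical grounding of a parfactor on (epidemic, sick(P)) over three people.
people = ["ann", "bob", "carl"]
table = {(0, 0): 1.0, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 1.0}  # illustrative potentials
factors = [(("epidemic", f"sick({p})"), table) for p in people]
variables = ["epidemic"] + [f"sick({p})" for p in people]

result = marginal(factors, variables, "epidemic")
print(result)
```

The loop visits every joint assignment, so its cost doubles with each additional person in the domain.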
The FOVE algorithm makes the simplifying assumption that the FOPM is shattered w.r.t. the query $Q$. A set of c-atoms is shattered if the instantiations of any pair of its elements are either identical or disjoint. A parfactor, or set of parfactors, is shattered if the set of their c-atoms is shattered. A FOPM is shattered w.r.t. a query $Q$ if the union of its c-atoms and those of the query is shattered. For example, we can have c-atoms $(p(X), X \neq a)$, $(p(Y), Y \neq a)$ and $p(a)$ in a model, but not $p(Y)$ and $p(a)$, because $RV(p(a)) \subseteq RV(p(Y))$ but $RV(p(a)) \neq RV(p(Y))$. When a FOPM and query are not shattered, we can replace them by an equivalent shattered FOPM and query through the process of shattering, detailed in section 15.5.2.
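The shattering condition can be checked directly on small domains by grounding each c-atom and comparing the resulting sets. The sketch below does this for the example above; the three-object domain, the convention that uppercase arguments are logical variables, and the helper names `rv` and `is_shattered` are assumptions of this example.

```python
from itertools import product

def rv(predicate, args, domain, constraint=lambda theta: True):
    """RV(c-atom): the ground random variables a constrained atom stands for."""
    # Assumption of this sketch: uppercase arguments are logical variables.
    variables = sorted({a for a in args if a[0].isupper()})
    ground = set()
    for values in product(domain, repeat=len(variables)):
        theta = dict(zip(variables, values))
        if constraint(theta):
            ground.add((predicate, tuple(theta.get(a, a) for a in args)))
    return frozenset(ground)

def is_shattered(rv_sets):
    """True if every pair of RV sets is either identical or disjoint."""
    return all(s == t or s.isdisjoint(t) for s in rv_sets for t in rv_sets)

domain = ["a", "b", "c"]
p_x_not_a = rv("p", ("X",), domain, lambda th: th["X"] != "a")
p_y_not_a = rv("p", ("Y",), domain, lambda th: th["Y"] != "a")
p_a = rv("p", ("a",), domain)
p_y = rv("p", ("Y",), domain)

ok = is_shattered([p_x_not_a, p_y_not_a, p_a])   # identical or disjoint pairs only
bad = is_shattered([p_y, p_a])                   # p(a) overlaps p(Y) without being equal
print(ok, bad)
```

A real implementation would decide this symbolically with the constraint solver rather than by enumeration.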

15.3

The First-Order Variable Elimination (FOVE) algorithm


Computing (15.2) directly is intractable since it would take exponential time in the number of random variables in $RV(G) \setminus Q$. This is the case even for the propositional case, which is the reason why algorithms have been developed that take advantage of independences represented in the model in order to compute marginals more efficiently. One of these algorithms is variable elimination (VE) [12]. First-order variable elimination (FOVE) is a first-order generalization of VE.
While VE eliminates a random variable at a time, FOVE eliminates a c-atom, or set of c-atoms, at each step. By eliminating a c-atom, we implicitly eliminate all of its instantiations at the same time. Let $E$ be a set of c-atoms to be eliminated from a FOPM with a set $G$ of parfactors. Let $G_E, G_{\neg E} \subseteq G$ be the sets of parfactors depending and not depending on $E$, respectively. Then

$$\sum_{RV(G) \setminus Q} \Phi(G) = \sum_{(RV(G) \setminus RV(E)) \setminus Q} \Phi(G_{\neg E}) \sum_{RV(E)} \Phi(G_E).$$

We later show operations computing a parfactor $g'$ such that $\sum_{RV(E)} \Phi(G_E) = \Phi(g')$. Once we have $g'$, the right-hand side of the above is equal to

$$\sum_{(RV(G) \setminus RV(E)) \setminus Q} \Phi(G_{\neg E})\, \Phi(g') = \sum_{(RV(G) \setminus RV(E)) \setminus Q} \Phi(G_{\neg E} \cup \{g'\}) = \sum_{RV(G') \setminus Q} \Phi(G'),$$

where $G' = G_{\neg E} \cup \{g'\}$. In other words, we have reduced the original marginalization to a smaller instance that does not include $RV(E)$. This is repeated until only $Q$ is left.
A crucial difference between VE and FOVE is elimination ordering. VE eliminates random variables according to an ordering given a priori. In FOVE, eliminating certain c-atoms may require eliminating some other c-atoms first, so it may be the case that some c-atoms are not eliminable at all times (these conditions will be clarified later). Because parfactors and c-atoms are sometimes changed and reorganized during the algorithm, it is not a simple matter to choose an ordering in advance. Instead, the elimination ordering is dynamically determined.
Before we move on to explaining the operations for calculating $\sum_{RV(E)} \Phi(G_E) = \Phi(g')$, we mention that, in fact, they only calculate $\sum_{RV(E)} \Phi(g)$ for a single parfactor $g$. This is not a problem because the operation of fusion, covered in section 15.5.1, calculates $g$ such that $\Phi(g) = \Phi(G)$ for any set of parfactors $G$.
15.3.1

Counting Elimination

We first show counting elimination on a specific example and later generalize it. Consider the summation

$$\sum_{RV(p(X))} \prod_{X,Y} \phi(p(X), p(Y)),$$

where $p$ is a boolean predicate. (Note that the $X$ used under the summation is not the same $X$ used by the product. $RV(p(X))$ is shorthand for all assignments over the set $\{p(X) : X \in D_X\}$, so $X$ is locally used. In fact, we could have written $RV(p(Y))$, or even $RV(p(Z))$, to the same effect. We choose to use $X$ or $Y$ to make the link with the atom in the parfactor more obvious.)
Counting elimination is based on the following insight: because a parfactor will typically only evaluate to a few different potentials, large groups of its instantiations will evaluate to the same potential. So the summation is rewritten

$$\sum_{RV(p(X))} \phi(0,0)^{|(0,0)|}\, \phi(0,1)^{|(0,1)|}\, \phi(1,0)^{|(1,0)|}\, \phi(1,1)^{|(1,1)|},$$

where $|(v_1, v_2)|$ indicates the number of possible choices for $X$ and $Y$ so that $p(X) = v_1$ and $p(Y) = v_2$ given the current assignment to $RV(p(X))$. These partition sizes can be calculated by a combinatorial, or counting, argument. Assume we know $\vec{N}_p$, a vector of integers that indicates how many random variables in $RV(p(X))$ are currently assigned a particular value, that is, $\vec{N}_{p,i} = |\{r \in RV(p(X)) : r = i\}|$ for each $i \in D_p$. Naturally, $\sum_i \vec{N}_{p,i} = |RV(p(X))|$. Then there are $\vec{N}_{p,v_1}$ possible values for $X$ (so that $p(X) = v_1$) and $\vec{N}_{p,v_2}$ distinct possible values for $Y$ (so that $p(Y) = v_2$), so $|(v_1, v_2)| = \vec{N}_{p,v_1} \vec{N}_{p,v_2}$.
We take advantage of the fact that the values $|(v_1, v_2)|$ do not depend on the particular assignments to $RV(p(X))$, but only on $\vec{N}_p$. This allows us to iterate over the groups of assignments with the same $\vec{N}_p$ and do the calculation for the entire group. We also take into account the group size, which is provided by the binomial coefficient $\binom{|RV(p(X))|}{\vec{N}_{p,0}}$ (or, equivalently, $\binom{|RV(p(X))|}{\vec{N}_{p,1}}$). We then have

$$\sum_{\vec{N}_p} \binom{|RV(p(X))|}{\vec{N}_{p,0}} \prod_{(v_1, v_2)} \phi(v_1, v_2)^{\vec{N}_{p,v_1} \vec{N}_{p,v_2}},$$

which has a number of terms linear in $|RV(p(X))|$, as opposed to the previous exponential number.
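The identity behind this example can be checked numerically. The following sketch compares the exponential sum over all assignments to $RV(p(X))$ with the grouped sum over counters; the potential values in `phi` and the function names are arbitrary choices for illustration.

```python
from itertools import product
from math import comb, isclose

def brute_force(phi, n):
    """Sum over all 2^n assignments to p(o_1), ..., p(o_n) of prod_{X,Y} phi(p(X), p(Y))."""
    total = 0.0
    for assignment in product((0, 1), repeat=n):
        term = 1.0
        for x in assignment:
            for y in assignment:
                term *= phi[(x, y)]
        total += term
    return total

def counting(phi, n):
    """Same sum, grouped by the counter (N_0, N_1) with N_0 + N_1 = n."""
    total = 0.0
    for n0 in range(n + 1):
        counts = (n0, n - n0)
        term = 1.0
        for (v1, v2), value in phi.items():
            term *= value ** (counts[v1] * counts[v2])  # |(v1, v2)| = N_v1 * N_v2
        total += comb(n, n0) * term                     # group size
    return total

phi = {(0, 0): 1.2, (0, 1): 0.7, (1, 0): 0.9, (1, 1): 1.1}  # illustrative potentials
print(isclose(brute_force(phi, 5), counting(phi, 5), rel_tol=1e-9))
```

`brute_force` visits $2^n$ assignments while `counting` visits only $n + 1$ counter values, which is the saving described above.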
Counting elimination is not a universal method. The counting argument presented above requires that there be little interaction between the logical variables of atoms. If a parfactor is on $p(X, Y), q(X, Z)$, for example, the counting argument does not work because the choices for $(X, Z)$ depend on the particular $X$ chosen for $p(X, Y)$; we can no longer compute the number of choices using counters alone but need to know the particular assignment to $RV(p(X, Y))$. Generally, under counting elimination, choices for one atom cannot constrain the choices for another atom (there are exceptions to this rule, as for example just-different atoms, presented in [3]).
We now give the formal account of counting elimination, starting with some preliminary definitions.
First, we define the notion of independent atoms given a constraint. Intuitively, this happens when choosing a substitution for the logical variables of one atom does not change the possible choices of substitutions for the other atom. Let $\bar{X}_1$ and $\bar{X}_2$ be two sets of logical variables such that $\bar{X}_1 \cup \bar{X}_2 \subseteq V$. $\bar{X}_1$ is independent from $\bar{X}_2$ given $C$ if, for any substitution $\theta_2 \in [C_{|\bar{X}_2}]$, $C_{|\bar{X}_1} \Leftrightarrow (C\theta_2)_{|\bar{X}_1}$. $\bar{X}_1$ and $\bar{X}_2$ are independent given $C$ if $\bar{X}_1$ is independent from $\bar{X}_2$ given $C$ and vice-versa. Two atoms $p_1(\bar{X}_1)$ and $p_2(\bar{X}_2)$ are independent given $C$ if $\bar{X}_1$ and $\bar{X}_2$ are independent given $C$.
Finally, we define multinomial counters. Let $a$ be a c-atom with domain $D_a$. Then the multinomial counter of $a$, $\vec{N}_a$, is a vector where $\vec{N}_{a,j}$ indicates how many instantiations of $a$ are assigned the $j$-th value in $D_a$. The multinomial coefficient

$$\vec{N}_a! = \frac{(\vec{N}_{a,1} + \cdots + \vec{N}_{a,|D_a|})!}{\vec{N}_{a,1}! \cdots \vec{N}_{a,|D_a|}!}$$

is a generalization of binomial coefficients and indicates how many assignments to $RV(a)$ exhibit the particular value distribution counted by $\vec{N}_a$.
Counters can be applied to sets of c-atoms with the same general meaning. The set of multinomial counters for a set of c-atoms $A$ is denoted $\vec{N}_A$, and the product $\prod_{a \in A} \vec{N}_a!$ of their multinomial coefficients is denoted $\vec{N}_A!$.


Theorem 15.1 Counting Elimination


Let $g$ be a shattered parfactor and $E = \{E_1, \ldots, E_k\}$ be a subset of $A_g$ such that $RV(E)$ is disjoint from $RV(A_g \setminus E)$, $A' = A_g \setminus E$ are all ground, and where each pair of atoms is independent given $C_g$. Then

$$\sum_{RV(E)} \prod_{\theta \in \Theta_g} \phi(A_g\theta) = \sum_{\vec{N}_E} \vec{N}_E! \prod_{v \in D_E} \phi(v, A')^{\prod_{i=1}^{k} \vec{N}_{E_i, v_i}}.$$

The theorem's proof reflects the argument given above. Counting elimination brings a significant computational advantage because iterating over assignments is exponential in $|RV(E)|$ while doing so over groups of assignments is only polynomial in it.
It is important to notice that $E$ must contain all non-ground c-atoms in $g$. Also, if all c-atoms in $g$ are ground, $E$ can be any subset of them and we will have a simple propositional summation, the same used in VE (counters over 1-random variable c-atoms reduce to ordinary assignments).
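The multinomial coefficient $\vec{N}_a!$ used in the theorem is the number of assignments that share a given counter. A small sketch, with hypothetical helper names and a three-valued domain chosen only for illustration, computes it and confirms the count by enumeration:

```python
from collections import Counter
from itertools import product
from math import factorial

def multinomial(counts):
    """(N_1 + ... + N_k)! / (N_1! ... N_k!)."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)  # each intermediate quotient is an integer
    return result

def assignments_with_counts(domain, counts):
    """Count by enumeration the assignments to sum(counts) variables that match the counter."""
    n = sum(counts)
    target = dict(zip(domain, counts))
    matching = 0
    for assignment in product(domain, repeat=n):
        tally = Counter(assignment)
        if all(tally.get(v, 0) == target[v] for v in domain):
            matching += 1
    return matching

domain = ("red", "green", "blue")   # hypothetical value domain D_a
counts = (2, 1, 1)                  # a counter over four random variables
print(multinomial(counts), assignments_with_counts(domain, counts))
```

For a two-valued domain the same function reduces to the binomial coefficient used in the boolean example.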
15.3.2

Inversion

Counting elimination requires a parfactor's atoms to be independent given its constraints. In particular, logical variables shared between atoms may render them dependent on each other. In some of these cases, the operation of inversion can be applied. In fact, even in cases in which counting elimination can be applied, it is advantageous to apply inversion first, if possible, for efficiency reasons.
Let us consider a couple of examples before we formalize inversion. Consider the following:

$$\begin{aligned}
&\sum_{RV(q(X,Y))} \prod_{XY} \phi(p(X), q(X, Y)) \\
&= \sum_{q(o_1,o_1)} \sum_{q(o_1,o_2)} \cdots \sum_{q(o_n,o_n)} \phi(p(o_1), q(o_1, o_1))\, \phi(p(o_1), q(o_1, o_2)) \cdots \phi(p(o_n), q(o_n, o_n)) \\
&\quad \text{(by observing that } \phi(p(o_i), q(o_i, o_j)) \text{ depends on } q(o_i, o_j) \text{ only)} \\
&= \sum_{q(o_1,o_1)} \phi(p(o_1), q(o_1, o_1)) \cdots \sum_{q(o_n,o_n)} \phi(p(o_n), q(o_n, o_n)) \\
&\quad \text{(by observing that only } \phi(p(o_i), q(o_i, o_j)) \text{ depends on } q(o_i, o_j)\text{)} \\
&= \Big(\sum_{q(o_1,o_1)} \phi(p(o_1), q(o_1, o_1))\Big) \cdots \Big(\sum_{q(o_n,o_n)} \phi(p(o_n), q(o_n, o_n))\Big) \\
&= \prod_{XY} \sum_{q(X,Y)} \phi(p(X), q(X, Y)) \\
&\quad \text{(by observing that the summation is the same for all } q(X, Y)\text{)} \\
&= \prod_{XY} \phi'(p(X)).
\end{aligned}$$

Inversion works by establishing a one-to-one correspondence between parfactor instantiations and summations. If the summation were on the instantiations of $p(X)$, such correspondence would not be possible because there would be fewer summations ($|X|$ of them) than parfactor instantiations ($|X| \times |Y|$ of them).
Another condition for inversion is that the c-atom being inverted not have different instances in the same instance of the parfactor. For example, we cannot use inversion on $p(X, Y)$ for a parfactor on $p(X, Y), p(Y, X)$ because for any pair of objects $o_i, o_j$, neither of the two parfactor instantiations involving $p(o_i, o_j)$ and $p(o_j, o_i)$ can be factored out of the innermost of $\sum_{p(o_i,o_j)}$ and $\sum_{p(o_j,o_i)}$. This breaks the one-to-one correspondence between summations and instantiated parfactors.
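The first inversion example amounts to exchanging a sum of products for a product of sums, which can be checked numerically for a fixed assignment to the $p(X)$ atoms. In the sketch below the potential table, the domain sizes and the function names are illustrative assumptions.

```python
from itertools import product
from math import isclose, prod

def sum_of_products(phi, p_values, n_y):
    """Sum over all assignments to the q(x, y) of prod_{x,y} phi(p(x), q(x, y))."""
    cells = [(x, y) for x in range(len(p_values)) for y in range(n_y)]
    total = 0.0
    for q_values in product((0, 1), repeat=len(cells)):
        total += prod(phi[(p_values[x], q)] for (x, _), q in zip(cells, q_values))
    return total

def product_of_sums(phi, p_values, n_y):
    """After inversion: prod_{x,y} sum_{q(x,y)} phi(p(x), q(x, y))."""
    return prod(sum(phi[(p, q)] for q in (0, 1))
                for p in p_values for _ in range(n_y))

phi = {(0, 0): 0.5, (0, 1): 1.5, (1, 0): 2.0, (1, 1): 0.25}  # illustrative potentials
p_values = (0, 1, 1)   # a fixed assignment to p(o_1), p(o_2), p(o_3)
lhs = sum_of_products(phi, p_values, n_y=2)
rhs = product_of_sums(phi, p_values, n_y=2)
print(isclose(lhs, rhs, rel_tol=1e-9))
```

The left-hand side enumerates $2^{|X||Y|}$ assignments, whereas the inverted form needs one two-term sum per distinct value of $p(X)$.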
In the case above, the resulting inner summation was propositional. Inversions resulting in propositional summations were called inversion elimination in our earlier work [2, 3]. In the next example, the inner summation is one computed by counting elimination.
Suppose we want to calculate

$$\begin{aligned}
&\sum_{RV(p(X,Y))} \prod_{X,Y,Z} \phi(p(X, Y), p(X, Z)) \\
&= \sum_{RV(p(X,Y))} \prod_{X} \prod_{Y,Z} \phi(p(X, Y), p(X, Z)) \\
&= \sum_{RV(p(o_1,Y))} \cdots \sum_{RV(p(o_n,Y))} \prod_{Y,Z} \phi(p(o_1, Y), p(o_1, Z)) \cdots \prod_{Y,Z} \phi(p(o_n, Y), p(o_n, Z)) \\
&= \Big(\sum_{RV(p(o_1,Y))} \prod_{Y,Z} \phi(p(o_1, Y), p(o_1, Z))\Big) \cdots \Big(\sum_{RV(p(o_n,Y))} \prod_{Y,Z} \phi(p(o_n, Y), p(o_n, Z))\Big) \\
&= \prod_{X} \sum_{RV(p(X,Y))} \prod_{Y,Z} \phi(p(X, Y), p(X, Z)).
\end{aligned}$$

Because $X$ is now bound before the summation, it works as a constant (whose exact identity is irrelevant), and so it is not included in the counting argument. The counting argument now involves only $Y$ and $Z$ and is actually very similar to our original counting argument example. For that reason, the above is equal to $\prod_X \phi'()$, for $\phi'$ the result of counting elimination.
Inversions resulting in counting elimination problems only invert on a subset of the parfactor's logical variables. For this reason, they have been called partial inversions in our previous work. Note, however, that since propositional sums are a trivial case of counting elimination, both inversion operations can be unified into one. This is what we do in the formalization below.


15.3.2.1

Uniform Solution Counting Partition (USCP)

Before we present the theorem formalizing inversion, we touch on one last issue. Consider the inversion of $X$ resulting in the expression

$$\prod_{X} \sum_{RV(p(X,Y))} \prod_{Y \neq X,\, Z \neq X,\, Y \neq a,\, Z \neq a} \phi(p(X, Y), p(X, Z)).$$

The summation can be done by counting elimination since $X$ is bound. However, it will depend on $|RV(p(X, Y))|$, but that depends on whether $X = a$ or not. One needs to split the expression according to cases $X = a$ and $X \neq a$:

$$\Big(\sum_{RV(p(a,Y))} \prod_{Y \neq a,\, Z \neq a} \phi(p(a, Y), p(a, Z))\Big) \Big(\prod_{X \neq a} \sum_{RV(p(X,Y))} \prod_{Y \neq X,\, Z \neq X,\, Y \neq a,\, Z \neq a} \phi(p(X, Y), p(X, Z))\Big)$$

and then proceed as usual.
In general, one needs to consider the uniform solution counting partition (USCP) of the inverted logical variables with respect to an original constraint system. The USCP $U_L(C)$ of a set of logical variables $L$ with respect to a constraint $C$ is a set of constraints $\{C_1, \ldots, C_k\}$ such that $\{[C_i]\}_i$ forms a partition of $[C_{|L}]$ and

$$\forall i \;\; \forall \theta', \theta'' \in [C_i] \quad |[C\theta']| = |[C\theta'']|,$$

that is, the number of solutions for the constraint conditioned on $L$ is the same within each of the components $C_i$.
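For small domains, a partition with the USCP property can be found by enumeration: group the substitutions of the inverted variables by how many solutions remain for the other variables. The sketch below does this for the example above with a hypothetical four-object domain. It represents each component extensionally, as a list of substitutions, rather than as a constraint, and grouping by count yields the coarsest such partition; both are simplifications of this example, not how FOVE's constraint solver works.

```python
from collections import defaultdict
from itertools import product

def uscp(inverted, others, domain, constraint):
    """Group substitutions of the inverted variables by the number of remaining solutions."""
    groups = defaultdict(list)
    for values in product(domain, repeat=len(inverted)):
        theta = dict(zip(inverted, values))
        count = sum(
            constraint({**theta, **dict(zip(others, rest))})
            for rest in product(domain, repeat=len(others))
        )
        if count > 0:  # theta belongs to the projection of C onto the inverted variables
            groups[count].append(theta)
    return dict(groups)

domain = ["a", "b", "c", "d"]
constraint = lambda th: (th["Y"] != th["X"] and th["Z"] != th["X"]
                         and th["Y"] != "a" and th["Z"] != "a")
parts = uscp(("X",), ("Y", "Z"), domain, constraint)
print(parts)
```

The two groups correspond to the cases $X = a$ and $X \neq a$ used in the split above.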
15.3.2.2

Inversion Formalization

Theorem 15.2 Inversion


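Let $g$ be a shattered parfactor, $L$ a set of logical variables and $E$ a set of c-atoms such that $RV(E)$ and $RV(A_g \setminus E)$ are disjoint. If
1. $\forall e_k, e_l \in E \;\; \forall \theta_i, \theta_j \in [C_{g|L}] \quad \theta_i \neq \theta_j \Rightarrow RV(e_k\theta_i) \cap RV(e_l\theta_j) = \emptyset$.
2. $\forall \theta_i, \theta_j \in [C_{g|L}] \quad \theta_i \neq \theta_j \Rightarrow RV(E\theta_i) \cap RV(E\theta_j) = \emptyset$.
then

$$\sum_{RV(E)} \Phi(g) = \prod_{C \in U_L(C_g)} \Phi(g_C),$$

where $g_C$ is the parfactor $(\phi_{g'}, A_{g'}, C \wedge C_{g'})$ and using $g'$ defined by the recursive computation $g' = \sum_{RV(E\theta)} \Phi(g\theta)$, for $\theta$ an arbitrary element of $[C]$ (by the definition of USCP, it does not matter which).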

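Proof Let $E = \{e_1, \ldots, e_n\}$ and $[C_{g|L}] = \{\theta_1, \ldots, \theta_m\}$. Below, we decompose $C_g$ into the part w.r.t. $L$ and the remaining logical variables:

$$\begin{aligned}
\sum_{RV(E)} \Phi(g) &= \sum_{RV(E)} \prod_{\theta \in \Theta_g} \phi(A_g\theta) \\
&= \sum_{RV(e_1)} \cdots \sum_{RV(e_n)} \prod_{\theta \in [C_{g|L}]} \prod_{\theta' \in [C_g\theta]} \phi(A_g\theta\theta') \\
&= \sum_{RV(e_1\theta_1)} \cdots \sum_{RV(e_n\theta_1)} \cdots \sum_{RV(e_1\theta_m)} \cdots \sum_{RV(e_n\theta_m)} \Big[\prod_{\theta' \in [C_g\theta_1]} \phi(A_g\theta_1\theta')\Big] \cdots \Big[\prod_{\theta' \in [C_g\theta_m]} \phi(A_g\theta_m\theta')\Big] \\
&= \Big(\sum_{RV(e_1\theta_1)} \cdots \sum_{RV(e_n\theta_1)} \prod_{\theta' \in [C_g\theta_1]} \phi(A_g\theta_1\theta')\Big) \cdots \Big(\sum_{RV(e_1\theta_m)} \cdots \sum_{RV(e_n\theta_m)} \prod_{\theta' \in [C_g\theta_m]} \phi(A_g\theta_m\theta')\Big) \\
&= \prod_{\theta \in [C_{g|L}]} \Big(\sum_{RV(e_1\theta)} \cdots \sum_{RV(e_n\theta)} \prod_{\theta' \in [C_g\theta]} \phi(A_g\theta\theta')\Big) \\
&= \prod_{\theta \in [C_{g|L}]} \Big(\sum_{RV(E\theta)} \prod_{\theta' \in [C_g\theta]} \phi(A_g\theta\theta')\Big) \\
&= \prod_{C \in U_L(C_g)} \prod_{\theta \in [C]} \sum_{RV(E\theta)} \Phi(g\theta) = \prod_{C \in U_L(C_g)} \prod_{\theta \in [C]} \Phi(g') = \prod_{C \in U_L(C_g)} \Phi(g_C).
\end{aligned}$$

Note that condition 1 is used to ensure the summations on $RV(e_1\theta_1), \ldots, RV(e_n\theta_m)$ are indeed distinct. Condition 2 ensures that the innermost products are on distinct sets of random variables and can therefore be factored out as shown.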
15.3.3

The Algorithm

Figure 15.1 shows the main pseudocode for FOVE. The algorithm consists of successively choosing eliminations $(E, \{L_1, \ldots, L_k\})$, consisting of a collection of atoms $E$ to eliminate after performing a series of inversions based on sets $L_1, \ldots, L_k$ of logical variables. A possible way of choosing eliminations is presented in figure 15.2. It is presented separately from the main algorithm for clarity, but because these two phases have many operations in common, actual implementations will typically integrate them more tightly.
There are potentially many ways to choose eliminations. The one we present starts by choosing an atom and checking if its inversion will produce a propositional summation, since this is the most efficient case. If not, we successively add atoms to $E$ until $G_E$ forms a parfactor where all atoms with logical variables are part of $E$ (because counting elimination requires it). Then, for efficiency and to avoid shared logical variables between atoms, we try to determine as many inversions as possible, coded in the sequence $L_1, \ldots, L_k$, to be done before counting elimination (or explicit summation when counting cannot be done).

FUNCTION FOVE($G$, $Q$)
$G$ a set of parfactors, $Q \subseteq RV(G)$, $G$ shattered against $Q$ (section 15.5.2).
1. If $RV(G) = Q$, return $G$.
2. $(E, \{L_1, \ldots, L_k\}) \leftarrow$ CHOOSE-ELIMINATION($G$, $Q$).
3. $g_E \leftarrow fs(G_E)$ (fusion, section 15.5.1).
4. $G' \leftarrow$ ELIMINATE($g_E$, $E$, $\{L_1, \ldots, L_k\}$).
5. Return FOVE($G' \cup G_{\neg E}$, $Q$).

FUNCTION ELIMINATE($g$, $E$, $\{L_1, \ldots, L_k\}$)
1. If $k = 0$ (no inversion)
   return SUMMATION-WITHOUT-INVERSION($g$, $E$).
2. $E_1 \leftarrow \{e \in E : LV(e) \cap L_1 \neq \emptyset\}$ (get inverted atoms).
3. Return $\bigcup_{C_1 \in U_{L_1}(C_g)}$ ELIMINATE-GIVEN-UNIFORMITY($g$, $E_1$, $C_1$, $\{L_2, \ldots, L_k\}$).

FUNCTION ELIMINATE-GIVEN-UNIFORMITY($g$, $E_1$, $C_1$, $\{L_2, \ldots, L_k\}$)
1. Choose $\theta_1 \in [C_1]$ (bind inverted logical variables arbitrarily).
2. $G' \leftarrow$ ELIMINATE($g\theta_1$, $E_1\theta_1$, $\{L_2, \ldots, L_k\}$).
3. $G'' \leftarrow \bigcup_{g' \in G'} (\phi_{g'}, A_{g'}, C_1 \wedge C_{g'})$.
4. Return $\bigcup_{g'' \in G''}$ SIMPLIFICATION($g''$) (simplification, section 15.5.3).

FUNCTION SUMMATION-WITHOUT-INVERSION($g$, $E$)
1. If $E = \{E_1, \ldots, E_k\}$ atoms are independent given $C_g$ and $A_g \setminus E$ is ground
   return $(\sum_{\vec{N}_E} \vec{N}_E! \prod_{v} \phi_g(v, A_g \setminus E)^{\prod_{i=1}^{k} \vec{N}_{E_i, v_i}},\; A_g \setminus E,\; \top)$ (counting, section 15.3.1).
2. Return $(\sum_{RV(E)} \prod_{\theta \in \Theta_g} \phi_g(A_g\theta),\; A_g \setminus E,\; C_g)$ (propositional elimination).

Notation:
$LV(\alpha)$: logical variables in object $\alpha$.
$g$: parfactor $(\phi_g, A_g, C_g)$.
$U_L(C_g)$: USCP of $L$ with respect to $C_g$ (section 15.3.2.1).
$C_{|L}$: constraints projected to a set of logical variables $L$.
$G_E$: subset of parfactors $G$ which depend on $RV(E)$.
$G_{\neg E}$: subset of parfactors $G$ which do not depend on $RV(E)$.
$\top$: tautology constraint.
Figure 15.1 The FOVE algorithm.

FUNCTION CHOOSE-ELIMINATION($G$, $Q$)
1. Choose $e$ from $A_G \setminus Q$.
2. $g \leftarrow fs(G_e)$ (fusion, section 15.5.1).
3. If $LV(e) = LV(g)$ and $\forall e' \in A_g \setminus \{e\} \;\; RV(e') \neq RV(e)$
   return $(\{e\}, LV(e))$ (inversion eliminable).
4. $E \leftarrow \{e\}$.
5. While $E \neq$ non-ground atoms of $G_E$
   $E \leftarrow E \,\cup$ non-ground atoms of $G_E$.
6. Return $(E,$ GET-SEQUENCE-OF-INVERSIONS($fs(G_E)$)$)$.

FUNCTION GET-SEQUENCE-OF-INVERSIONS($g$)
1. If there is no set $L_1$ of invertible logical variables in $g$ (inversion, section 15.3.2)
   return $\emptyset$.
2. Choose $\theta_1 \in [C_{g|L_1}]$.
3. $\{L_2, \ldots, L_k\} \leftarrow$ GET-SEQUENCE-OF-INVERSIONS($g\theta_1$).
4. Return $\{L_1, L_2, \ldots, L_k\}$.

Figure 15.2 One possible way of choosing an elimination.

15.4

An experiment
We use the implementation available at http://l2r.cs.uiuc.edu/~cogcomp to compare average run times between lifted and propositional inference (which produce the exact same results) for two different models while increasing the number of objects in the domain. The first one, (I) in figure 15.3, answers the query $P(p)$ from a parfactor on $(p, q(X))$ and uses inversion elimination only. The inference in (II) answers query $P(r)$ from a parfactor on $(p(X), p(Y), r)$ and uses counting elimination only. In both cases propositional inference starts taking very long before any noticeable variation in lifted inference run times.

[Figure 15.3: two plots, (I) Inversion elimination and (II) Counting elimination, each showing average run time (ms) against domain size for the Lifted and Ground methods.]

Figure 15.3 (I) Average run time for answering query $P(p)$ from a parfactor on $(p, q(X))$, using inversion elimination, with domain size $|X|$ being gradually increased. (II) Average run time for answering query $P(r)$ from a parfactor on $(p(X), p(Y), r)$, using counting elimination, with domain size $|X| = |Y|$ being gradually increased.

15.5

Auxiliary operations
15.5.1

Fusion


We have assumed in section 15.3 that we have operations to calculate $\sum_{RV(E)} \Phi(G_E)$, but elimination operations calculate $\sum_{RV(E)} \Phi(g)$, for $g$ a single parfactor. Fusion bridges this gap by computing, for any set of parfactors $G$, a single parfactor $fs(G)$ such that $\Phi(G) = \Phi(fs(G))$.
Fusion works by replacing the constraints of all parfactors in the set by a single, common constraint which is the conjunction of them all. This guarantees that all parfactors get instantiated by the same set of substitutions on a single set of logical variables, which allows their products (in the expression for $\Phi(G)$) to be unified under a single product. Note that not all parfactors contain all the logical variables, so some will be instantiated to the same ground factor by distinct substitutions (those agreeing on the logical variables present in the parfactor, but disagreeing on some of the others). In other words, some of the parfactors will have their number of instantiations increased by this unification. For this reason, we also exponentiate the potential function to the inverse of how many times the number of instantiations was increased, keeping the final result the same as before.
This is illustrated in the example below:
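
$$\Big(\prod_{D,P} \phi_1(e(D), s(D, P))\Big) \Big(\prod_{D'} \phi_2(e(D'))\Big) = \prod_{D,P,D'} \phi_1(e(D), s(D, P))^{\frac{1}{|D'|}}\, \phi_2(e(D'))^{\frac{1}{|D,P|}} = \prod_{D,P,D'} \phi_3(e(D), s(D, P), e(D')).$$

(Note that logical variables in different parfactors must be standardized apart.)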
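The exponents in this example can be verified numerically for any fixed assignment to the atoms. The sketch below compares the product of the two separate parfactors with the product of the fused potential; the domain sizes, potential tables and the particular assignment are arbitrary choices for illustration.

```python
from itertools import product
from math import isclose, prod

def unfused(phi1, phi2, e, s, n_d, n_p):
    """Product of the two parfactors taken separately."""
    first = prod(phi1[(e[d], s[(d, p)])] for d in range(n_d) for p in range(n_p))
    second = prod(phi2[e[d2]] for d2 in range(n_d))
    return first * second

def fused(phi1, phi2, e, s, n_d, n_p):
    """Single product over (D, P, D') of phi1^(1/|D'|) * phi2^(1/|D,P|)."""
    def phi3(d, p, d2):
        return (phi1[(e[d], s[(d, p)])] ** (1 / n_d)
                * phi2[e[d2]] ** (1 / (n_d * n_p)))
    return prod(phi3(d, p, d2)
                for d, p, d2 in product(range(n_d), range(n_p), range(n_d)))

n_d, n_p = 3, 2                                   # hypothetical domain sizes |D| and |P|
phi1 = {(0, 0): 1.3, (0, 1): 0.6, (1, 0): 0.8, (1, 1): 1.9}
phi2 = {0: 0.4, 1: 1.7}
e = [0, 1, 1]                                     # a fixed assignment to e(D)
s = {(d, p): (d + p) % 2 for d in range(n_d) for p in range(n_p)}
print(isclose(unfused(phi1, phi2, e, s, n_d, n_p),
              fused(phi1, phi2, e, s, n_d, n_p), rel_tol=1e-9))
```

Each $\phi_1$ instance appears $|D'|$ times in the fused product and each $\phi_2$ instance $|D| \times |P|$ times, so the fractional exponents cancel the repetition.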


Formally, we have the fusion theorem below.
Theorem 15.3 Fusion
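Let $G$ be a set of parfactors. Define $C_G = \bigwedge_{g \in G} C_g$, $\Theta_G = [C_G]$ and $A_G = \bigcup_{g \in G} A_g$. Let $fs(G)$ be the parfactor $(\prod_{g \in G} \phi_g^{|\Theta_g|/|\Theta_G|}, A_G, C_G)$. Then $\Phi(G) = \Phi(fs(G))$.
Proof

$$\begin{aligned}
\Phi(G) &= \prod_{g \in G} \prod_{\theta \in \Theta_g} \phi_g(A_g\theta) = \prod_{g \in G} \prod_{\theta \in \Theta_G} \phi_g(A_g\theta)^{|\Theta_g|/|\Theta_G|} \\
&= \prod_{\theta \in \Theta_G} \prod_{g \in G} \phi_g(A_g\theta)^{|\Theta_g|/|\Theta_G|} = \prod_{\theta \in \Theta_G} \phi_{fs(G)}(A_G\theta) = \Phi(fs(G)).
\end{aligned}$$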

While the above is correct, it is rather unnatural to have $e(D)$ and $e(D')$ be distinct atoms. If a set of logical variables has the same possible substitutions, like $D$ and $D'$ here, we can do something better:

$$\begin{aligned}
\Big(\prod_{D,P} \phi_1(e(D), s(D, P))\Big) \Big(\prod_{D'} \phi_2(e(D'))\Big) &= \Big(\prod_{D} \prod_{P} \phi_1(e(D), s(D, P))\Big) \Big(\prod_{D'} \phi_2(e(D'))\Big) \\
&= \prod_{D''} \Big[\Big(\prod_{P} \phi_1(e(D''), s(D'', P))\Big)\, \phi_2(e(D''))\Big] \\
&= \prod_{D''} \prod_{P} \phi_1(e(D''), s(D'', P))\, \phi_2(e(D''))^{\frac{1}{|P|}} \\
&= \prod_{D'',P} \phi_1(e(D''), s(D'', P))\, \phi_2^{\frac{1}{|P|}}(e(D'')) \\
&= \prod_{D'',P} \phi_3(e(D''), s(D'', P)).
\end{aligned}$$

Formally, this process is similar to inversion with respect to $D''$. However, it does require the additional previous step of unifying distinct logical variables (but with identical sets of possible substitutions) into a single one first (in the example, $D$ and $D'$ are replaced by $D''$). For lack of space we omit the details of this improvement.
15.5.2

Shattering

In section 15.3 we mentioned the need for shattering, which we now discuss in more detail. This need arises from c-atoms representing overlapping, but not identical, classes of random variables. Consider the following marginalization over parfactors $g_1$ and $g_2$ with potential functions $\phi_1$ and $\phi_2$ respectively:

$$\sum_{RV(p(X,Y))} \Big(\prod_{X,Y} \phi_1(p(X, Y), q)\Big) \Big(\prod_{Y} \phi_2(p(a, Y))\Big)$$

If we pick $E = p(a, Y)$, $G_E = \{g_1, g_2\}$. However, only some instantiations of $g_1$ depend on $p(a, Y)$ (the ones with $X = a$). Moreover, the operations we later talk about require any pair of c-atoms in $G_E$ to represent either identical classes of random variables, or those classes to be disjoint. This is violated by $RV(p(a, Y))$ being a subset of $RV(p(X, Y))$. Picking $E = p(X, Y)$ also violates this requirement.
The solution is to split parfactor $g_1$ into two different parfactors. The union of instantiations of $(\phi_1, (p(X, Y), q), X \neq a)$ and $(\phi_1, (p(a, Y), q), \top)$ is identical to the set of instantiations of $g_1$, so the summation can be simply rewritten as

$$\begin{aligned}
&\sum_{RV(p(X,Y))} \Big(\prod_{X,Y : X \neq a} \phi_1(p(X, Y), q)\Big) \Big(\prod_{Y} \phi_1(p(a, Y), q)\Big) \Big(\prod_{Y} \phi_2(p(a, Y))\Big) \\
&= \sum_{RV(p(X,Y) : X \neq a)} \Big(\prod_{X,Y : X \neq a} \phi_1(p(X, Y), q)\Big) \sum_{RV(p(a,Y))} \Big(\prod_{Y} \phi_1(p(a, Y), q)\Big) \Big(\prod_{Y} \phi_2(p(a, Y))\Big)
\end{aligned}$$

Now $E = p(a, Y)$ satisfies the operations' requirements. Picking $E = (p(X, Y), X \neq a)$ would work equally well.

Splitting parfactors is done by pairwise comparisons of atoms of the same predicate. We split parfactors $g_1 = (\phi_1, A_1, C_1)$ and $g_2 = (\phi_2, A_2, C_2)$ around atoms $a_1 \in A_1$ and $a_2 \in A_2$ by replacing them by parfactors $(\phi_1, A_1, C_1 \wedge a_1 = a_2)$, $(\phi_1, A_1, C_1 \wedge a_1 \neq a_2)$, $(\phi_2, A_2, C_2 \wedge a_1 = a_2)$ and $(\phi_2, A_2, C_2 \wedge a_1 \neq a_2)$, after standardizing apart their logical variables. In fact, we only need to keep those whose constraint is satisfiable. (This is why $g_2$ does not need to be broken in the example above, since that would only produce itself and another parfactor with zero instantiations.) In particular, if $RV(a_1) = RV(a_2)$, we end up obtaining the original parfactors.
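The effect of a split can be seen on a small grounded example. The sketch below instantiates the atom $p(X, Y)$ of $g_1$ under the two split constraints and checks that the pieces partition the original instantiations and line up with $p(a, Y)$ of $g_2$. The three-object domain, the uppercase-variable convention and the helper name `instantiations` are assumptions of this illustration; the unification $a_1 = a_2$ is hard-coded here as $X = a$.

```python
from itertools import product

def instantiations(atom_args, domain, constraint):
    """Ground argument tuples of an atom under a constraint on its logical variables."""
    # Assumption of this sketch: uppercase arguments are logical variables.
    variables = sorted({a for a in atom_args if a[0].isupper()})
    result = set()
    for values in product(domain, repeat=len(variables)):
        theta = dict(zip(variables, values))
        if constraint(theta):
            result.add(tuple(theta.get(a, a) for a in atom_args))
    return result

domain = ["a", "b", "c"]
g1_all = instantiations(("X", "Y"), domain, lambda th: True)               # p(X, Y), no constraint
g1_match = instantiations(("X", "Y"), domain, lambda th: th["X"] == "a")   # C1 and a1 = a2
g1_rest = instantiations(("X", "Y"), domain, lambda th: th["X"] != "a")    # C1 and a1 != a2
g2 = instantiations(("a", "Y"), domain, lambda th: True)                   # p(a, Y)

print(g1_match | g1_rest == g1_all, g1_match == g2, g1_rest.isdisjoint(g2))
```

After the split, every pair of c-atoms is either identical or disjoint, which is the condition the elimination operations rely on.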
The uniformity requirement is met by shattering the FOPM in advance, that is, by successively splitting the parfactors of each pair of c-atoms, including the query atoms, until no overlapping non-identically grounded pair remains. The query atoms need to be involved in shattering because if a c-atom includes query and non-query random variables, it needs to be split so that the non-query ones can be eliminated.
As pointed out by Poole [9], splitting parfactors resembles the role of unification in first-order resolution, which determines the conditions for two atoms to match. In probabilistic inference, however, we are interested not only in the overlapping of atoms but also in the residual parfactors that originate from the matching. The reason for this difference is that the number of instantiations of a parfactor matters for the final joint distribution. In regular resolution, the original clauses are kept because their redundancy with the clauses resulting from resolution makes no difference, while here we need to discount them and replace the originals with the non-matching cases.
15.5.3

Irrelevant Logical Variable Simplication

Inversion often produces parfactors whose constraints involve logical variables not present in their atoms. The first inversion example produces the expression below. We can simplify it by observing that the actual value of $Y$ is irrelevant inside the product. Only the number $|Y|$ of possible values for $Y$ will make a difference. Therefore we can write

$$\prod_{XY} \phi'(p(X)) = \prod_{X} \phi'(p(X))^{|Y|} = \prod_{X} \phi''(p(X)).$$

15.6

Applicability of lifted inference


As explained in the previous sections, the lifted operations of FOVE are not always applicable, each of them requiring certain conditions to be satisfied in advance. Therefore a natural question is to what kinds of FOPMs we can apply FOVE in an exclusively lifted manner.
It is not clear at this point whether it is possible to tell in advance if a FOPM can be solved with lifted operations alone. The main reason for this is that lifted operations will be applied to parfactors resulting from previous operations, so we do not know them in advance. It may be that two parfactors satisfying the lifted operations' conditions fuse to form one which does not. (This is similar to the fact that an elimination ordering is not computed in advance but only as the algorithm proceeds.)
In summary, the conditions for applying lifted operations to eliminate a set $RV(E)$ from a parfactor $g$ are the following: for counting elimination, the atoms in $g$ must be independent given $C_g$; for inversion on $L \subseteq LV(g)$,
1. $\forall e_k, e_l \in E \;\; \forall \theta_i, \theta_j \in [C_{g|L}] \quad \theta_i \neq \theta_j \Rightarrow RV(e_k\theta_i) \cap RV(e_l\theta_j) = \emptyset$.
2. $\forall \theta_i, \theta_j \in [C_{g|L}] \quad \theta_i \neq \theta_j \Rightarrow RV(E\theta_i) \cap RV(E\theta_j) = \emptyset$.
When lifted operations do not apply, FOVE uses non-lifted operations to calculate $\sum_{RV(E)} \Phi(G_E)$. These non-lifted methods could be propositionalization, sampling, etc., but with the advantage of being restricted to a subset of the model only.

15.7

Future Directions
There are several possible directions for further development of FOVE. One of the
main ones is the incorporation of function symbols, both random (the color of an
object, for example) and interpreted (summation over integers), which will greatly
increase its expressivity and applicability.
In applications involving evidence over many objects (for example, the facts about
all the words in an English document), shattering may take a long time because all
parfactors have to be checked against it. The large number of objects involved may
create the need for numerous parfactor splittings. This is unfortunate because often
only some objects are truly relevant to the query. For example, analyzing only some
words and phrases in a document will often be enough to determine its subject.
Therefore a variant of FOVE that does only the necessary shattering, guided by
the inference process, is of great interest.
Finally, lifted FOVE operations do not cover all possible cases and explicit
summation may be required at times, so increasing their coverage is an important
direction.

15.8

Conclusion
Intuitive descriptions of models very often include first-order elements. When these models are probabilistic, the dominant approach has been that of grounding the model to a propositional one and solving it with a regular propositional algorithm. This strategy loses the explicit representation of the model's first-order structure, which can be used to great computational advantage, and which is computationally hard to retrieve from the grounded model.
We presented FOVE, a first-order generalization of the popular VE propositional inference algorithm. Like VE, FOVE successively eliminates random variables from the model by summing them out while taking advantage of independences for efficiency. Unlike VE, FOVE directly manipulates first-order representations, eliminating c-atoms that stand for potentially large sets of random variables at once. This can in some cases speed up inference exponentially (in the domain size).
There are important directions in which FOVE needs to be extended, such as incorporating function symbols, avoiding unnecessary shattering, and extending operations to as yet uncovered cases. However, FOVE is already applicable, and especially useful, in domains with large numbers of objects about which we have identical knowledge. More than that, it is a general framework to be expanded and to help close the gap between logic and probabilistic reasoning.

Acknowledgments
This work was partly supported by Cycorp in relation to the Cyc technology, the Advanced Research and Development Activity (ARDA)'s Advanced Question Answering for Intelligence (AQUAINT) program, NSF grant ITR-IIS-0085980, and a Defense Advanced Research Projects Agency (DARPA) grant HR0011-05-1-0040.

References
[1] V. S. Costa, D. Page, M. Qazi, and J. Cussens. CLP(BN): Constraint logic programming for probabilistic knowledge. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2003.
[2] R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In Proceedings of the International Joint Conference on Artificial Intelligence, 2005.
[3] R. de Salvo Braz, E. Amir, and D. Roth. MPE and partial inversion in lifted probabilistic variable elimination. In National Conference on Artificial Intelligence, 2006.
[4] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proceedings of the International Joint Conference on Artificial Intelligence, 1999.
[5] J. Y. Halpern. An analysis of first-order logics of probability. In Proceedings of the International Joint Conference on Artificial Intelligence, 1990.
[6] K. Kersting and L. De Raedt. Bayesian logic programs. In Proceedings of the Work-in-Progress Track at the 10th International Conference on Inductive Logic Programming, 2000.
[7] L. Ngo and P. Haddawy. Probabilistic logic programming and Bayesian networks. In Asian Computing Science Conference, 1995.
[8] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
[9] D. Poole. First-order probabilistic inference. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
[10] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial Intelligence, 64(1):81-129, 1993.
[11] M. Richardson and P. Domingos. Markov logic networks. Technical report, Department of Computer Science, University of Washington, 2004.
[12] N. L. Zhang and D. Poole. A simple approach to Bayesian network computations. In Proceedings of the Tenth Biennial Canadian Artificial Intelligence Conference, 1994.

16 Feature Generation and Selection in Multi-Relational Statistical Learning

Alexandrin Popescul and Lyle H. Ungar

Using rich sets of features generated from relational data often improves the
predictive accuracy of regression models. The number of feature candidates, however,
rapidly grows prohibitively large as richer feature spaces are explored. We present
a framework, structural generalized linear regression (SGLR), which flexibly
integrates feature generation with model selection, allowing (1) augmentation of the
relational representation with cluster-derived concepts, and (2) dynamic control over
the search strategy used to generate features. Clustering increases the expressivity
of feature spaces by creating new concepts which contribute to the creation of new
features, and can lead to more accurate models. Dynamic feature generation, in
which decisions of which features to generate are based on the results of run-time
feature selection, can lead to the discovery of accurate models with significantly less
computation than generating all features in advance. We present experimental
results supporting these claims in two multirelational document mining applications:
document classification and link prediction.

16.1 Introduction
We present a statistical relational learning method, structural generalized linear
regression (SGLR), for building predictive regression models from relational databases
or domains with implicit relational structure such as collections of documents
linked by citations or hyperlinks. In SGLR, features are dynamically generated
by a refinement-graph style search over SQL queries, and tested for potential
inclusion into a generalized linear regression model, such as linear, logistic, or Poisson
regression. This approach has several advantages over more traditional logic-based
inductive logic programming (ILP) methods. The tables resulting from SQL queries
are easily aggregated in many ways, giving a rich space of quantitative, as well as
Boolean, features. The resulting regression models are typically more accurate than
logical models. We also show how to automatically augment the original relational
schema with additional derived features, facilitating the search for compound features.
SGLR, like several related methods [23, 18, 9, 27], searches a space of
feature-generating expressions to find those which generate new predictive features. In
SGLR, the relational database schema that describes the background data is used to
structure a search over database queries. Features are generated in two steps: a
refinement-graph-like search of the space of SQL queries generates tables, which are then
aggregated into real-valued features, which are tested for inclusion in a generalized
linear model; i.e., each query generates a table, which in turn is aggregated to
produce scalar feature candidates, from which statistically significant predictors
are selected.
The initial relational schema is dynamically augmented with new relations
containing concepts derived by clustering the data in the tables. For example, clustering
documents by the words they contain, or authors by the venues they have published in,
gives new concepts, "topics" (document clusters) or "communities" (author clusters),
and new relations between the original items and the clusters they occur in
(documents on a topic or authors in a community).
The main search is over the space of possible relational database queries,
augmented to include aggregate or statistical operators, groupings, richer join
conditions, and argmax-based queries. This search can be guided based on the types of
predictive features discovered so far. We show below that a very simple "intelligent"
search over the space of possible queries (and hence features) can result in
discovery of predictive features with far less computation than static (e.g., breadth-first)
search.
SGLR couples two elements helpful for successful learning: (1) a class of statistical
models which outperforms logic-based models and (2) principled means of choosing
what features to include in this model. Regression models are often more accurate
than recursive partitioning methods such as C4.5 or FOIL-style logic descriptions.
This difference is particularly apparent when there are vast numbers of potential
features, many of which contribute some signal, for example, when words are included
as features. Regression also allows us to use principled feature selection criteria
such as the Akaike information criterion (AIC), the Bayes information criterion (BIC), and
streaming feature selection (SFS) [4, 29, 33] to control against overfitting.
Figure 16.1 highlights the components of SGLR. Two main processes, relational
feature generation and statistical modeling, are dynamically coupled into a single
loop. Knowing the types of features selected so far by the statistical modeler allows
the query generation component to guide its search, focusing on promising
subspaces of the feature space. The search in the space of database queries involving
one or more relations produces feature candidates one at a time for consideration
by the statistical model selection component. The process results in a statistical
model where each selected feature is the evaluation of a database query encoding
a predictive data pattern in a given domain. We use logistic regression (or,
equivalently, maximum entropy modeling). Features are tested sequentially for inclusion
in the regression model, and accepted if they are statistically significant after using
a BIC [29] penalty to control against false discovery.

[Figure 16.1: Learning process diagram. Components shown: Statistical Model
Selection (y = f(X), where the x's are selected database queries, e.g.,
y = f(x21, x75, x1291, x723469, ...)); Learning Process Control Module; Search:
Relational Feature Generation; Relational Database Engine. Arrows are labeled
"model selection feedback," "search control information," "database query," and
"feature columns."]
SGLR has several key characteristics which distinguish it from either pure
probabilistic network modeling or ILP:
- The use of regression rather than logic allows the feature space to include
  statistical summaries or aggregates, and more expressive substitutions through
  nesting of intermediate aggregates (e.g., "How many times does this paper cite
  the most cited author in a conference to which it was submitted?").
- We use clustering to dynamically extend the set of relations generating new
  features. Clusters give better models of sparse data, improve scalability, and
  produce representations not possible with standard aggregates [12]. For example,
  one can cluster words based on co-occurrence in documents, giving "topics," or
  authors based on the number of papers they have published in the same venues,
  giving "communities." Once clusters are formed, they represent new relational
  concepts which are added to the relational database schema, and then used
  together with the original relations.
- We use relational database management systems and SQL rather than Prolog.
  Most real-world data lies in relational databases, with schemata and
  metainformation we can use. Relational database management systems incorporate decades
  of work on optimization, giving better scalability.
- Coupling generation and feature selection using discriminative modeling into a
  single loop gives a more flexible search than propositionalization. Since the total
  number and type of features is not known in advance, the search formulation
  does lazy feature evaluation, allowing it to focus on more promising feature
  subspaces, giving higher time efficiency. Space efficiency is achieved by not
  storing pregenerated features, but rather considering them one by one as they
  are generated, and keeping only the few selected features.


We present results on two sets of tasks which use the data from CiteSeer
(a.k.a. ResearchIndex), an online digital library of computer science papers [19].
CiteSeer contains a rich set of data, including paper titles, text of abstracts and
documents, citation information, author names and affiliations, and conference
or journal names. We represent CiteSeer as a relational database. For example,
citation information is represented as a binary relation between citing and cited
documents. Document authorship and publication venues of documents are also
binary relations, while word counts can be represented as a ternary relation.
16.1.1 Invention of Cluster Relations

SGLR uses clustering to derive new relations and adds them to the database
schema used in automatic generation of predictive features in statistical relational
learning. Entities and relationships derived from clusters increase the expressivity
of feature spaces by creating new first-class concepts. These concepts and relations
are added to the database schema, and thus are considered (potentially in multiple
combinations) during the search of the space of possible queries (figure 16.2).
For example, in CiteSeer, papers can be clustered based on words or citations,
giving "topics," and authors can be clustered based on documents they coauthor,
giving "communities." In addition to simpler grouping (e.g., "Is this document on a
given topic?"), such cluster-derived concepts become part of more complex feature
expressions (e.g., "Does the database contain another document on the same topic
and published in the same conference?"). The original database schema is implicitly
used to decide which entities to cluster and what sources of attributes to use,
possibly several per entity, creating alternative clusterings of the same objects. For
example, documents can be clustered using words and, separately, using citations.
Out of the large number of features generated, those which improve predictive
accuracy are kept in the model, as decided by statistical feature selection criteria.
Using clusters improves accuracy. Perhaps surprisingly, using cluster relations can
also lead to a more rapid discovery of predictive features.
Cluster-relation invention as described here differs importantly from aggregation,
which also creates new features from a relational representation [23, 26].
Aggregation allows one to summarize the information in a table returned from an SQL or
logic query into scalar values usable by a regression model, for example, computing
the average of a word count in all cited documents, or selecting a citing document
with the maximum number of incoming links. The clusters, on the other hand, create new
relations in the database schema. The cluster relations are then used repeatedly to
generate new queries and hence tables and features.
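The distinction between aggregation and cluster relations can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The toy data, the table name doc_topic, and the cluster assignments are invented for illustration; in SGLR the assignments would come from a clustering algorithm. Aggregation collapses a query result to a scalar, whereas the cluster relation is stored as a table that later queries can join against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE cites(from_doc TEXT, to_doc TEXT);
CREATE TABLE word_count(doc TEXT, word TEXT, cnt INTEGER);
-- Hypothetical cluster relation of the form <ClusteredEntity, ClusterID>.
CREATE TABLE doc_topic(doc TEXT, clust INTEGER);
""")
cur.executemany("INSERT INTO cites VALUES (?, ?)",
                [("d1", "d2"), ("d1", "d3"), ("d4", "d3")])
cur.executemany("INSERT INTO word_count VALUES (?, ?, ?)",
                [("d2", "learning", 4), ("d3", "learning", 2)])
cur.executemany("INSERT INTO doc_topic VALUES (?, ?)",
                [("d1", 0), ("d2", 0), ("d3", 1), ("d4", 1)])


def avg_learning_count_in_cited(doc):
    """Aggregation: one scalar summarizing the table returned by a query."""
    row = cur.execute(
        """SELECT AVG(w.cnt) FROM cites c
           JOIN word_count w ON w.doc = c.to_doc
           WHERE c.from_doc = ? AND w.word = 'learning'""", (doc,)).fetchone()
    return row[0]


def num_cited_on_same_topic(doc):
    """A new query that reuses the stored cluster relation in its joins."""
    row = cur.execute(
        """SELECT COUNT(*) FROM cites c
           JOIN doc_topic t1 ON t1.doc = c.from_doc
           JOIN doc_topic t2 ON t2.doc = c.to_doc
           WHERE c.from_doc = ? AND t1.clust = t2.clust""", (doc,)).fetchone()
    return row[0]


avg_feature = avg_learning_count_in_cited("d1")
topic_feature = num_cited_on_same_topic("d1")
```

Because doc_topic is an ordinary table, it can appear in any number of later queries, which is what lets a single clustering step contribute to many feature candidates.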
16.1.2 Dynamic Feature Generation

SGLR also supports dynamic feature generation, in which the order in which
features are generated and evaluated is determined at run-time. Generating features
is by far the most computationally demanding part of SGLR. In the example
presented below, generating 100,000 features can take several CPU days due to
the extensive SQL queries, particularly the joins. Dynamic feature generation can
lead to discovery of predictive features with far less computation than generating
all features in advance. When using the appropriate complexity penalties, one can
still guarantee no overfitting, even when the order in which we generate features
and test them for inclusion in the model is dynamically determined based on which
features have so far been found to be predictive. This "best-first" search often vastly
reduces the number of computationally expensive feature evaluations.

[Figure 16.2: Cluster relations augment the database schema used to produce feature
candidates. Panels: clustering, database schema, evaluated tables, features
(aggregated).]
Query expressions are assigned into multiple streams based on user-selected
properties of the feature expressions; for example, based on the aggregate operator
type. Since different sets of features are of different size (e.g., the number of different
words is much greater than the number of journals or the number of topics),
it is often easy to heuristically classify features into different streams. If feature
generation has a known cost, this can also be taken into account. At each iteration,
one of the streams is chosen to generate the next candidate feature, based on the
expected utility of the stream's features relative to those of other streams. For
example, a simple and effective rule is to select the next query to be evaluated
from the stream whose features have been included in the model in the highest
percentage.
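The stream-selection rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, an untested stream is given an optimistic acceptance rate of 1.0 so that every stream is tried at least once, and ties are broken by list order. None of these details is specified in the chapter.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Stream:
    name: str
    queries: Iterator[str]  # lazily produced query expressions
    tested: int = 0         # features from this stream tested so far
    accepted: int = 0       # features from this stream kept in the model

    def acceptance_rate(self) -> float:
        # Assumption: an untried stream is treated optimistically.
        return 1.0 if self.tested == 0 else self.accepted / self.tested


def next_query(streams: List[Stream]) -> Optional[Tuple[Stream, str]]:
    """Take the next query from the non-exhausted stream whose features
    have been included in the model in the highest percentage."""
    ranked = sorted(streams, key=lambda s: s.acceptance_rate(), reverse=True)
    for stream in ranked:
        query = next(stream.queries, None)
        if query is not None:
            return stream, query
    return None  # all streams are exhausted


def record(stream: Stream, was_accepted: bool) -> None:
    """Update a stream's statistics after its feature has been tested."""
    stream.tested += 1
    stream.accepted += int(was_accepted)
```

A driver loop would call next_query, evaluate the returned query, test the resulting feature for inclusion, and then call record with the outcome, so that later choices reflect what the model selection step has accepted so far.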
16.1.3 Chapter Overview

The following section describes the SGLR methodology in some detail, including
how we cast feature generation as a search in the space of relational database
queries, how cluster relations are created, and how the feature space is searched
dynamically. Section 16.3 then describes two tasks using CiteSeer data which we
use to test SGLR: classifying documents into their publication venues (conferences
or journals), and predicting the existence of a citation between two documents.
The tasks serve as a proxy for a more general problem of learning from inherently
relational and noisy network data, including social networks. Relations between
documents, cited documents, authors, publication venues, and text can all be
explored to discover predictive features. We show that adding new relations to
the database can improve accuracy, and that dynamic search can achieve the same
accuracy as its static alternative while generating far fewer features, and hence
reducing the required CPU time. The final two sections discuss SGLR in the context
of related work, and summarize some of its advantages.

16.2 Detailed Methodology
As described above, SGLR dynamically couples two main components: generation
of feature candidates via a search in the space of queries to a relational database,
and their selection for inclusion in a regression model using statistical model
selection criteria. First, we give the high-level SGLR algorithm. Lines marked with an
asterisk are the parts that do cluster-relation generation. We deliberately leave the
stopping criterion underspecified. Given the incremental nature of model building in
SGLR, deciding when to stop will often depend on the available CPU time and on the
accuracy achieved so far.

    while more features needed do
        generate next SQL query expression using a refinement graph search
        query the database and retrieve a table
      * generate new cluster relations from the table
      * augment the database with derived relations
        apply aggregate operators to the table to produce a set of features
        test the new features for inclusion in the model
    end while

16.2.1 Notation and Basic Concepts

The language of nonrecursive first-order logic formulae maps directly into SQL and
relational algebra (see, e.g., [8]). Our implementation uses SQL for efficiency and
connectivity with relational database engines.
Throughout this paper we use the following schema:
cites(FromDoc, ToDoc),
author(Doc, Auth),
published_in(Doc, Venue),
word_count(Doc, Word, Int).


Domains, or types, used here are different from the primitive SQL types. The
specification of these domains in addition to the primitive SQL types is necessary
to guide the search process more efficiently.
First-order expressions are treated as database queries resulting in a table of
all satisfying solutions, rather than a single Boolean value. The extended notation
supports aggregation over entire query results or over their individual columns.
Aggregate operators are subscripted with the corresponding variable name if applied
to an individual column, or are used without subscripts if applied to the entire
table. For example, an average count of the word "learning" in documents cited by
document d is denoted as

$$\mathit{class}(d) \sim \mathrm{ave}_C\,[\mathit{cites}(d, D),\ \mathit{word\_count}(D, \text{``learning''}, C)],$$

where $\sim$ denotes "modeled using," i.e., the right-hand side of the expression is a
feature to be tested for inclusion in the regression model of the response variable,
or target concept, given on the left-hand side of the expression. We use a prime with the
target relation to avoid confusion with recursive queries when, as in the features
below, a background relation instance and the target relation are of the same type.
An example of a feature useful for predicting a citation link between two
documents d1 and d2 is the number of documents they both cite. The target concept is
binary, and the feature (the right-hand side of the expression) is a database query
about both target documents d1 and d2:

$$\mathit{cites}'(d_1, d_2) \sim \mathrm{count}\,[\mathit{cites}(d_1, D),\ \mathit{cites}(d_2, D)]$$
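As a concrete illustration of this notation, the co-citation feature can be evaluated with a self-join of the cites relation. The following is a minimal sketch using Python's sqlite3 module; the column names from_doc and to_doc and the toy tuples are assumptions made for the example, not part of the chapter's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cites(from_doc TEXT, to_doc TEXT)")
conn.executemany("INSERT INTO cites VALUES (?, ?)", [
    ("d1", "a"), ("d1", "b"), ("d1", "c"),
    ("d2", "b"), ("d2", "c"),
    ("d3", "z"),
])


def shared_citation_count(conn, d1, d2):
    """count[cites(d1, D), cites(d2, D)]: documents D cited by both d1 and d2."""
    (n,) = conn.execute(
        """SELECT COUNT(*) FROM (
               SELECT DISTINCT r1.to_doc
               FROM cites r1, cites r2
               WHERE r1.from_doc = ? AND r2.from_doc = ?
                 AND r1.to_doc = r2.to_doc)""", (d1, d2)).fetchone()
    return n


# One feature value per target pair <d1, d2>; together they form a feature column.
pairs = [("d1", "d2"), ("d1", "d3")]
feature_column = [shared_citation_count(conn, a, b) for a, b in pairs]
```

The inner query is the table of satisfying solutions; the outer COUNT(*) is the aggregate applied to the entire table.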
16.2.2 Feature Generation: Search in the Space of Database Queries

16.2.2.1 Refinement Graphs

Relational feature generation is a search problem. We use top-down search of
refinement graphs [30, 11], supplemented with aggregate operators. Figure 16.3
shows a fragment of the search space in the domain of document classification.
Each node is a database query about a learning example d, and evaluates to a table
of all satisfying solutions. Aggregate operators are then applied to produce multiple
features.
SGLR uses the following variation of the refinement operator: it forms a join
of a given query with a relation from the database schema, expanding the query
into the nodes covering all possible configurations of equality conditions involving
attributes in the new relation, such that (1) each refinement contains at least one
equality condition between a new and an old attribute, (2) any attribute can be
set equal to a constant of its type, and (3) the types are used to avoid equalities
between attributes of different types. This refinement operator is complete. Not all
refinements produced are the most general refinements of a given query; however, we
find that this definition can simplify pruning of equivalent subspaces by accounting
only for the type and the number of relations in a query.

[Figure 16.3: Fragment of the refinement graph in the document classification domain.
Nodes shown, indented here under the query they refine (nesting reconstructed from
the query structure):
  author_of(d, Auth).
    author_of(d, Auth = "smith").
  word_count(d, Word, Int).
    word_count(d, Word = "statistical", Int).
  cites(d, Doc).
    cites(d, Doc = "doc20").
    cites(d, Doc1), cites(Doc1, Doc2).
    cites(d, Doc), word_count(Doc, Word, Int).
      cites(d, Doc), word_count(Doc, Word = "learning", Int).
    cites(d, Doc), published_in(Doc, Venue).
      cites(d, Doc), published_in(Doc, Venue = "www").]
Table 16.1 presents the pseudocode of the refinement operator. After introducing
notation with examples we walk through the pseudocode line by line. Evaluation
of queries in the refinement graph nodes produces intermediate tables. Their
aggregation is described in the next section.
Ri is a relation in a database schema R; for example, cites(doc1 : Document, doc2 :
Document) is a binary relation in the database. Its two attributes are doc1 and
doc2, both of type Document. Attributes in a relation are denoted by a subscripted
letter A. A query to be refined is q, for example:
    SELECT DISTINCT *
    FROM cites R1, cites R2
    WHERE R1.doc1=d1 AND R2.doc1=d2 AND R1.doc2=R2.doc2
Qref is a set of refinements of query q, the variable which accumulates the result
of the refinement function. Seq is a set of equality conditions which form the
conjunction in the WHERE clause of a refined query. attrib(q) is a set of all
attributes of relation instances in q:
    {R1.doc1, R1.doc2, R2.doc1, R2.doc2}
dom(A) is the set of allowed constants of type A, e.g., containing all document IDs.
dom(A) must contain at least the target example constants identifying observations (doc1
and doc2, if they identify a target observation in the example above). In situations
where generated features can include references to other constants, dom(A) can
include all values of A, or a subset, e.g., the entries with the highest correlation with
the response variable or those above a count cutoff value. The following example of
a query about the target pair <d1, d2> references other constants in the domain
of document IDs; the query is nonempty when both d1 and d2 cite a particular
document d2370:
    SELECT DISTINCT *
    FROM cites R1, cites R2
    WHERE R1.doc1=d1 AND R2.doc1=d2 AND R1.doc2=R2.doc2
    AND R2.doc2=d2370
norm(Ai = Aj) alphanumerically orders Ai and Aj to avoid storing in Seq
equivalent entries Ai = Aj and Aj = Ai. type(A) is the metatype of A, as is
Document in the examples above, rather than an SQL type String. The set of
equality conditions in query q is denoted by q.WHERE, e.g., a four-element set
corresponding to the latter query example:
    {R1.doc1=d1, R2.doc1=d2, R1.doc2=R2.doc2, R2.doc2=d2370}

Table 16.1 The refinement operator

    refine(Query: q)
      Qref ← {}
      for each Ri ∈ R (i ∈ [1, n])
        Seq ← {}
        for each Aj ∈ Ri
          for each A ∈ {Ak | Ak ∈ attrib(q)} ∪ {Al | (Al in Ri) ∧ Al ≠ Aj}
            if (type(Aj) = type(A))
              Seq ← Seq ∪ {norm(Aj = A)}
          for each a ∈ dom(Aj)
            Seq ← Seq ∪ {norm(Aj = a)}
        for each S ∈ 2^Seq
          if ∃(Ai = Aj) ∈ S such that Ai ∈ attrib(q) ∨ Aj ∈ attrib(q)
            Qref ← Qref ∪ {q′ | q′.WHERE = q.WHERE ∪ {S} ∧ q′.FROM = q.FROM ∪ {Ri}}
      return Qref
The refinement operator given in table 16.1 takes a query q as argument and
returns the set of its refinements, Qref. Refinement of a given query starts by
picking a relation instance in the database schema (loop starting at line 4). Adding
this relation results in its Cartesian product with the view of q (not included in the
refinement set), which forms a template to be filled by allowed configurations of
equality conditions. For each attribute Aj in the newly added relation Ri (starting
at line 6) we find other attributes in q or in Ri itself, such that their equality
with Aj can be included in the conjunction of equality conditions. This has to take
into account metatypes (line 8), e.g., an attribute of type Document cannot be
checked for equality with an attribute of type Author. Entries in Seq are
normalized (line 9) alphabetically to avoid equivalent entries. Equalities of Aj with
target example identifiers (constants) are added to the set of equality conditions
Seq, as well as possibly other constants of the type of Aj (lines 12-14). At this point
Seq contains all possible terms which can enter in the conjunction of refinements
of q when being joined with Ri. A refined query is formed for each subset of Seq
(starting at line 15) such that at least one of the equality conditions in the subset
(S) involves an attribute in q, i.e., an attribute of an old relation instance already
present in q. The process repeats for all relations Ri in the schema R (back to line
4). A node resulting in an empty table for each observation is not refined any further
since its refinements will be empty too.
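The walk-through above can be made concrete with a small Python sketch of the refinement operator. This is an illustrative reading of the pseudocode, not the authors' implementation: the query representation (a dict with a "from" list and a "where" frozenset), the alias scheme R1, R2, ..., and the toy SCHEMA are assumptions, and the sketch does no pruning of equivalent or contradictory refinements.

```python
from itertools import chain, combinations

# Hypothetical schema: relation -> [(attribute, metatype)]
SCHEMA = {
    "cites": [("doc1", "Document"), ("doc2", "Document")],
    "author": [("doc", "Document"), ("auth", "Person")],
}


def norm(a, b):
    """Order the two sides so that A = B and B = A are stored only once."""
    return tuple(sorted((a, b)))


def refine(query, schema, dom):
    """Return all refinements of `query` obtained by joining one more relation.

    query: {"from": [(alias, relation)], "where": frozenset of (lhs, rhs) pairs}
    dom:   metatype -> list of allowed constants
    """
    # attrib(q): attributes of relation instances already in q, with metatypes
    old = {f"{alias}.{attr}": typ
           for alias, rel in query["from"] for attr, typ in schema[rel]}
    refinements = []
    for rel, attrs in schema.items():
        alias = f"R{len(query['from']) + 1}"
        new = {f"{alias}.{attr}": typ for attr, typ in attrs}
        s_eq = set()
        for aj, tj in new.items():
            # equalities with old attributes or other attributes of the new relation
            for a, t in chain(old.items(), new.items()):
                if a != aj and t == tj:  # metatype check
                    s_eq.add(norm(aj, a))
            # equalities with allowed constants of the same metatype
            for const in dom.get(tj, []):
                s_eq.add(norm(aj, const))
        conds = sorted(s_eq)
        for size in range(1, len(conds) + 1):
            for subset in combinations(conds, size):
                # keep only subsets that tie the new relation to an old attribute
                if any(x in old or y in old for x, y in subset):
                    refinements.append({
                        "from": query["from"] + [(alias, rel)],
                        "where": query["where"] | frozenset(subset),
                    })
    return refinements
```

Enumerating every subset of Seq is exponential in the number of candidate equality conditions, which is one reason restricting dom(A) to a small set of constants matters in practice.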
16.2.2.2 Search Space Extension via Aggregate Operators

As in predicate calculus, aggregates are not part of the abstract relational languages.
Practical systems, however, implement them as additional features. SQL supports
the use of aggregates which produce real values, rather than the more limited Boolean
features produced by logic-based approaches. Regression modeling makes full use
of these real-valued features.
As we described above, a node in our refinement graph is a query evaluating into a
table. These tables are in turn aggregated by multiple operators to produce features.
We use the aggregate operators common in relational language extensions: count,
ave, max, and min; binary logic-style features are included through the empty
aggregate operator. Aggregate operators are applied to an entire table or to its
columns, as appropriate given type restrictions, e.g., ave cannot be applied to a
column of a categorical type. When aggregate operators are not defined, e.g., the
average of an empty set, we use an interaction with a 1/0 (defined/not-defined)
indicator variable. Table 16.2 presents pseudocode of the aggregation procedure at
each search node (called for each observation i).
The use of aggregate operators in feature generation complicates pruning of the
search space. We use a hash function of partially evaluated feature columns to
avoid fully recomputing equivalent features. In general, determining equivalence
among relational expressions is known to be NP-complete, although polynomial
algorithms exist for restricted classes of expressions; see, e.g., [3, 22]. Equivalence
determination based on the homomorphism theorem for the "tableau" query formalism,
essentially the class of conjunctive queries we consider before aggregation, is given
in [1]. Optimizations could be done by better avoiding generation of equivalent
queries. Children nodes in the refinement graph can, of course, reuse evaluations
performed in their parent nodes; this considerably reduces computational burden
at the expense of increased memory consumption.

Table 16.2 Aggregation of refinement graph views

    aggregate(View: v)
      // v is the evaluation of a search node query per observation i
      F ← {}
      // A is a set of aggregate operators
      for each Aggr_i ∈ A (i ∈ [1, n])
        // applicability of Aggr_i is determined by typing,
        // e.g., average cannot be applied to a categorical column
        if (defined(Aggr_i(v)))
          F ← F ∪ {Aggr_i(v)}
        for each column C ∈ v
          if (defined(Aggr_i(C)))
            F ← F ∪ {Aggr_i(C)}
      return F
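The aggregation step, including the 1/0 indicator for undefined aggregates, can be sketched as follows. The view representation (a list of row dictionaries), the explicit list of numeric columns, and the feature names are assumptions made for this example; the chapter does not prescribe a data layout.

```python
from statistics import mean

# Column-level operators; typing is handled by passing numeric columns explicitly.
COLUMN_AGGREGATES = {"ave": mean, "max": max, "min": min}


def aggregate(view, numeric_columns):
    """Turn one observation's view into {feature_name: (value, defined)}.

    `defined` is the 1/0 indicator: an undefined aggregate (e.g., the average
    of an empty set) is reported as value 0.0 with indicator 0.
    """
    features = {
        # table-level aggregates are always defined
        "count": (float(len(view)), 1),
        # the "empty" aggregate gives a Boolean, logic-style feature
        "exists": (1.0 if view else 0.0, 1),
    }
    for col in numeric_columns:
        values = [row[col] for row in view]
        for name, op in COLUMN_AGGREGATES.items():
            if values:
                features[f"{name}_{col}"] = (float(op(values)), 1)
            else:
                features[f"{name}_{col}"] = (0.0, 0)
    return features


example = aggregate([{"cnt": 4}, {"cnt": 2}], ["cnt"])
empty = aggregate([], ["cnt"])
```

In a regression model the value and its indicator would enter together, so that observations with an undefined aggregate are distinguished from those where the aggregate is genuinely zero.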

16.3 Experimental Evaluation

16.3.1 Tasks and Data

We evaluate SGLR using data from CiteSeer (a.k.a. ResearchIndex), an online
digital library of computer science papers [19]. CiteSeer includes text and titles of
papers, citation information, author names, and conference or journal names. We
represent CiteSeer as a relational database. For example, citation information is
represented as a binary relation between citing and cited documents.
There are 1560 unique conferences and journals, 26,740 unique last names of
authors, and 173,410 citations among our universe of 60,646 documents whose
publication venues (conferences or journals) could be extracted by matching with the
DBLP database (http://dblp.uni-trier.de/). We limit the vocabulary to the 1000 most
frequent words in the entire collection after Porter stemming and stop word removal to
keep the data size to a manageable 6,894,712 HasWord relations denoting which words
each document contains. These relations and the number of instances are listed in
table 16.3, along with the derived cluster relations which are later added to the database.
We explore two tasks using CiteSeer data: classifying documents into their
publication venues and predicting the existence of a citation between two documents.
The target concept pairs in the two tasks are <Document, Venue> and <Document,
Document>, respectively. In both tasks, the search space contains queries based on
several relations about documents, such as citation information, authorship, word
content and publication venues of the document, and the response to be predicted
is Boolean.

Table 16.3 Sizes of the original and cluster-based relations

Relation                                                        Size
PublishedIn(doc:Document, vn:Venue)                             60,646
Author(doc:Document, auth:Person)                               131,582
Citation(from:Document, to:Document)                            173,410
HasWord(doc:Document, word:Word)                                6,894,712
ClusterDocumentsByAuthors(doc:Document, clust:Clust0)           53,660
ClusterAuthorsByDocuments(auth:Person, clust:Clust1)            26,740
ClusterDocumentsByCitingDocuments(doc:Document, clust:Clust2)   31,603
ClusterDocumentsByCitedDocuments(doc:Document, clust:Clust3)    42,749
ClusterDocumentsByWords(doc:Document, clust:Clust4)             56,104
ClusterWordsByDocuments(word:Word, clust:Clust5)                1,000
16.3.2 Cluster Creation

We use k-means (e.g., see [15]) to derive cluster relations; any other hard clustering
algorithm could also be used. The results of clustering are represented by binary
relations of the form <ClusteredEntity, ClusterID>.
Each many-to-many relation in the original schema can produce two distinct
cluster relations (e.g., clusters of words by documents or of documents by words).
Three out of the four relations in the schema presented above are many-to-many
(PublishedIn is not); this results in six new cluster relations. Since the
PublishedIn relation does not produce new clusters, nothing needs to be done
to exclude the attributes of entities in the venue prediction training and test sets
from participating in clustering. In link prediction, on the other hand, the relation
corresponding to the target concept, Citation, does produce clusters, so in this
case clustering is run without the links sampled for training and test sets.
k-means clustering requires the selection of k, the number of groups into which
the entities are clustered. In the experiments presented here we fix k equal to 100 in
all cluster relations except in ClusterWordsByDocuments, where only ten clusters
were used because there are roughly an order of magnitude fewer clustered words
than authors or documents (the vocabulary was limited to 1000 words).
The accuracy of the resulting cluster-based models reported below could potentially be
improved if one is willing to incur the cost of generating clusters with different
values of k and testing the resulting features for model inclusion. One could also
generate clusters from the rest of the tables generated as the space of queries is
searched. For simplicity, we stuck to the first six such cluster relations. Table 16.3
summarizes the sizes of the four original and the six derived cluster relations.


For clustering, we use the tf-idf vector-space cosine similarity [28]. The measure
was originally designed for document similarity using word features, but we apply
it here to broader types of data. In the formulae below, d stands for any object we
want to cluster, and w are the attributes used to cluster d. For example, authors
d can be clustered using the documents w they write. Below we refer to the d's as
documents and the w's as words.
Each document d is viewed as a vector whose dimensions correspond to the words
w in the vocabulary; the vector elements are the tf-idf weights of the corresponding
words, where $\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)$. In the
original formulation, term frequency $\mathrm{tf}(w, d)$ is the number of times w occurs
in d. In the experiments reported here we use binary tf indicating whether or not w
occurs in d (see footnote 2). Inverse document frequency is
$\mathrm{idf}(w) = \log \frac{|D|}{\mathrm{df}(w)}$, where $|D|$ is the number of
documents in a collection and $\mathrm{df}(w)$ is the number of documents in which
word w occurs at least once.
The similarity between two documents is then

$$\mathrm{sim}(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|},$$

where $d_i$ and $d_j$ are vectors with tf-idf coordinates as described above.
The cost of clustering is negligible compared to the cost of query evaluation,
especially when one uses an efficient clustering algorithm. Linear-time algorithms
are available using streaming methods [13], database methods [2, 32], or by
exploiting regularities in the document citation structure [24].
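The similarity measure defined above can be written out directly. The sketch below uses binary tf, so each object is just a set of attributes; the function names and the toy data are assumptions made for illustration, and returning 0.0 for an all-zero vector is a convention chosen here rather than something the chapter specifies.

```python
import math
from collections import Counter


def tfidf_vectors(objects):
    """objects: {object_id: set of attributes}.

    With binary tf, the weight of attribute w in object d is idf(w) = log(|D| / df(w))
    when w occurs in d, and 0 otherwise (zero entries are simply not stored).
    """
    n = len(objects)
    df = Counter(w for attrs in objects.values() for w in attrs)
    return {d: {w: math.log(n / df[w]) for w in attrs}
            for d, attrs in objects.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    # Convention for this sketch: similarity with a zero vector is 0.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy example: authors "clustered" by the documents they wrote.
authors = {"a": {"doc1", "doc2"}, "b": {"doc1", "doc2"}, "c": {"doc3"}}
vectors = tfidf_vectors(authors)
sim_ab = cosine(vectors["a"], vectors["b"])
sim_ac = cosine(vectors["a"], vectors["c"])
```

Note that an attribute occurring in every object has idf equal to zero and therefore contributes nothing to any similarity.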
16.3.3 Effect of Adding Cluster Relations

We compare models learned from the feature space generated from the four original
noncluster relations with the models learned from the original four relations plus
six derived cluster relations (clustersNO and clustersYES models). Models are
learned with sequential feature selection using BIC [29], i.e., once each feature is
generated, it is added to the model permanently if the BIC-penalized error improves,
or is permanently dropped otherwise.
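The accept-or-drop rule just described can be sketched as follows. The chapter does not give the exact form of its BIC-penalized error, so the sketch assumes the standard BIC, -2 ln L + k ln n, and takes the model-fitting step as a caller-supplied, hypothetical function fit_log_likelihood that returns the maximized log-likelihood for a given feature list.

```python
import math


def bic(log_likelihood, num_params, num_obs):
    """Standard BIC: -2 ln L + k ln n (smaller is better)."""
    return -2.0 * log_likelihood + num_params * math.log(num_obs)


def sequential_select(candidates, fit_log_likelihood, num_obs):
    """Test candidate features one at a time, in the order they are generated.

    Each feature is kept permanently if it lowers the BIC of the current
    model and is dropped permanently otherwise.
    """
    selected = []
    # Baseline: intercept-only model, i.e., one parameter.
    best = bic(fit_log_likelihood(selected), 1, num_obs)
    for feature in candidates:
        trial = selected + [feature]
        score = bic(fit_log_likelihood(trial), len(trial) + 1, num_obs)
        if score < best:
            selected, best = trial, score
    return selected
```

Because `candidates` can be any iterable, including a generator, features never need to be stored in advance: each one is tested as it is produced and only the selected ones are retained.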
We use ten-fold reverse cross-validation to measure accuracy improvement from
using cluster relations. All observations are split equally into ten sets. Each of
the sets is used to train a model. Each of the models is tested on the remaining
90% of observations. This results in ten values per each tested level, which are
used to derive error bounds. In venue prediction, there are 10,000 observations:
5000 positive examples of <Document,Venue> target pairs uniformly sampled from
the relation PublishedIn, and 5000 negative examples where the document is
uniformly sampled from the remaining documents, and the venue is uniformly

2. We use binary tf for consistency with the relation HasWord; we do not use counts in
computing similarities since the original relation HasWord contains binary word occurrence
data. Other derived cluster relations use naturally binary attributes.

Figure 16.4 Learning curves: average test-set accuracy against the number of
features generated from the training sets in ten-fold cross validation. Left: venue
prediction, in each of ten runs Ntrain = 1000 and Ntest = 9000. Right: link prediction,
in each of ten runs Ntrain = 500 and Ntest = 4500. Balanced positive/negative priors.
[Both panels plot accuracy (%) against the number of features considered, with one
curve for clustersYES and one for clustersNO.]

sampled from the domain of venues other than the true venue of the document.
Positive example pairs are removed from the background relation PublishedIn, as
well as the tuples involving documents sampled for the negative set. The size of the
background relation PublishedIn decreases by 10,000 after removing training and
test set tuples. In link prediction, the total number of observations is 5000: 2500
positive examples of <Document,Document> target pairs uniformly sampled from
the Citation relation, and 2500 negative examples uniformly sampled from empty
links in the citation graph. Positive example pairs are removed from the background
relation Citation. The size of the background relation Citation reduces by 2500,
the number of sampled positive examples.
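
The reverse cross-validation protocol described above (train on one tenth, test on the remaining nine tenths) can be sketched as below. The function name `reverse_cv_splits`, the shuffling, and the fixed seed are illustrative assumptions; the chapter does not say how the ten sets were drawn.

```python
import random
from typing import Iterator, List, Tuple


def reverse_cv_splits(
    n_obs: int, n_folds: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists: each fold trains, the other folds test."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    # Striding gives folds of equal size whenever n_folds divides n_obs.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, train in enumerate(folds):
        test = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


# Venue prediction uses 10,000 observations, so each run trains on 1000
# observations and tests on the other 9000.
splits = list(reverse_cv_splits(10_000))
```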
A total of 3500 features are used in training each model. A numeric signature
of partially evaluated features is maintained to avoid fully generating numerically
equivalent features; note that this is different from avoiding syntactically equivalent
nodes of the search space: two different queries can produce numerically equivalent
feature columns, e.g., all zeros. Such repetition becomes common when feature
generation progresses deeper in the search space.
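
One way to picture the numeric signature is a hash over the values a feature takes on some rows. The sketch below is an assumption-laden illustration, not the chapter's implementation: the chapter does not specify how the signature is computed, and the names `column_signature` and `novel_features`, the rounding precision, and the use of SHA-1 are all hypothetical choices. Passing only a fixed subset of rows would correspond to the partial evaluation mentioned above.

```python
import hashlib
import struct
from typing import Iterable, Iterator, Sequence, Tuple


def column_signature(values: Sequence[float], ndigits: int = 9) -> str:
    """Hash a feature column after rounding, so equal columns share a signature."""
    digest = hashlib.sha1()
    for value in values:
        # Adding 0.0 maps -0.0 to 0.0 so both hash identically.
        digest.update(struct.pack("<d", round(float(value), ndigits) + 0.0))
    return digest.hexdigest()


def novel_features(
    columns: Iterable[Tuple[str, Sequence[float]]]
) -> Iterator[str]:
    """Yield the names of queries whose column was not produced before."""
    seen = set()
    for name, column in columns:
        signature = column_signature(column)
        if signature in seen:
            continue  # numerically equivalent to an earlier feature
        seen.add(signature)
        yield name


# Two syntactically different queries (q1, q3) both yield an all-zero column.
kept = list(novel_features([("q1", [0, 0, 0]), ("q2", [1, 0, 1]), ("q3", [0.0, 0.0, 0.0])]))
```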
Figure 16.4 presents test accuracy learning curves for models learned with and
without cluster relations in venue prediction and link prediction respectively. Curve
coordinates are averages over the runs in ten-fold cross validation. The learning
curves show test-set accuracy changing with the number of features, in intervals of
250, generated and sequentially considered for model selection from the training set.
The average test set accuracy of the cluster-based models after exploring the entire
feature stream is 87.2% in venue prediction and 93.1% in link prediction, which is,
respectively, 4.75 and 3.22 percentage points higher than the average accuracy of
the models not using cluster relations.

Figure 16.5 Mean accuracy difference: accuracy(clustersYES) - accuracy(clustersNO)
with 95% confidence intervals (bounds based on N = 10 points, t-test distribution).
Left: venue prediction. Right: link prediction. [Both panels plot the difference
against the number of features considered.]

Figure 16.5 presents 95% confidence intervals of the difference in mean test accuracies
of clustersYES and clustersNO models in venue prediction and link prediction
respectively. In venue prediction, after exploring approximately half of the
feature stream, the improvement in accuracy by the cluster-based models is statistically
significant at the 95% confidence level according to the t-test (confidence
intervals do not intersect with y=0). In the early feature generation, when considering
the streams of about 1000 features, cluster-based models perform significantly
worse: at this phase, additional cluster-based features, while not yet significantly
improving accuracy, are delaying the discovery of significant noncluster-based features.
In link prediction, while the significance of the improvement from cluster-based
features is reduced early in the stream, it continuously increases throughout
the rest of the stream. At the end of the stream the improvement in accuracy of the
cluster-based model is 3.22 percentage points, statistically significant at the 99.8%
confidence level. The highest accuracies (after seeing 750 features by clustersNO
and after seeing 3500 features by clustersYES) also statistically differ: the accuracy
improvement in cluster-based models is 1.49 percentage points, significant at
the 99.9% confidence level.
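
A confidence interval of the kind plotted in figure 16.5 could be computed as in the sketch below. It assumes a paired comparison over the ten runs (the chapter says only that the bounds use N = 10 points and the t distribution) and hard-codes the two-sided 95% critical value for 9 degrees of freedom, 2.262. The function name `mean_diff_ci` and the input numbers are hypothetical, not the chapter's data.

```python
import math
import statistics
from typing import Sequence, Tuple

T_CRIT_95_DF9 = 2.262  # two-sided 95% critical value, Student t, 9 degrees of freedom


def mean_diff_ci(
    acc_yes: Sequence[float],
    acc_no: Sequence[float],
    t_crit: float = T_CRIT_95_DF9,
) -> Tuple[float, float]:
    """Confidence interval for the mean of per-run accuracy differences."""
    diffs = [a - b for a, b in zip(acc_yes, acc_no)]
    mean = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean - t_crit * std_err, mean + t_crit * std_err


# Hypothetical per-run accuracies (percent) for ten runs.
acc_yes = [81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0]
acc_no = [80.0] * 10
low, high = mean_diff_ci(acc_yes, acc_no)
significant = low > 0.0 or high < 0.0  # interval does not intersect y = 0
```

The default critical value is valid only for ten runs; a different number of runs needs the critical value for the matching degrees of freedom.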
The average number of features selected in ten clustersYES models is 32.0 in
venue prediction and 32.3 in link prediction, respectively; 27.9 and 31.8 features on
average were selected into clustersNO models from equally many feature candidates
(3500). The BIC penalty used here allows a small amount of overfitting (see
figure 16.4); more recent penalty methods such as SFS [33] avoid this problem.
The improved accuracy of the cluster-based model in venue prediction comes
mostly from a single cluster-based feature. This feature was selected in all
cross-validation runs. It is a binary cluster-based feature which is on for a target
document/venue pair <D,V> if a document D1 exists in the cluster where D belongs
such that D1 is published in the same venue as D. Using a logic-based notation, the



Table 16.4 Selected features which improve test accuracy by at least 1.0 percentage point. Target pair: <D,V>

Feature                                                                  Model
size[publishedIn(_, V)]                                                  both
exists[cites(D, D1), publishedIn(D1, V)]                                 both
exists[cites(D1, D), publishedIn(D1, V)]                                 both
exists[cites(D, D2), cites(D1, D2), publishedIn(D1, V)]                  both
exists[author(D, A), author(D1, A), publishedIn(D1, V)]                  both
exists[publishedIn(D1, V), docsByWords(D, C), docsByWords(D1, C)]        clustersYES
exists[cites(D, D3), cites(D3, D2), cites(D1, D2), publishedIn(D1, V)]   clustersNO

feature is the following (clustDocsByWords is abbreviated here as topic; see footnote 3):

exists[publishedIn(D1, V), topic(D, C), topic(D1, C)].
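
To make the semantics of this feature concrete, the sketch below evaluates it over relations stored as Python sets of tuples. The relation contents and the function name `same_topic_published_in` are hypothetical; in the chapter such features are evaluated as database queries, not in application code.

```python
from typing import Set, Tuple

Relation = Set[Tuple[str, str]]


def same_topic_published_in(d: str, v: str, published_in: Relation, topic: Relation) -> int:
    """exists[publishedIn(D1, V), topic(D, C), topic(D1, C)] for the target pair <D,V>.

    Returns 1 if some document D1 published in V shares a cluster C with D,
    and 0 otherwise.
    """
    clusters_of_d = {c for (doc, c) in topic if doc == d}
    for d1, venue in published_in:
        if venue != v:
            continue
        if any((d1, c) in topic for c in clusters_of_d):
            return 1
    return 0


# Toy background relations (hypothetical). The target tuple for d1 is absent
# from published_in, mirroring the removal described in footnote 3.
published_in = {("d2", "KDD"), ("d3", "ICML")}
topic = {("d1", "c1"), ("d2", "c1"), ("d3", "c2")}

on_for_kdd = same_topic_published_in("d1", "KDD", published_in, topic)
on_for_icml = same_topic_published_in("d1", "ICML", published_in, topic)
```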
The following three cluster-based features were selected in more than five
cross-validation runs (nine, nine, and six times respectively) in the link prediction task
(target: <D1,D2>):
exists[docsByCitedDocs(D1, C), docsByCitedDocs(D2, C)],
exists[docsByWords(D1, C), docsByWords(D2, C)],
exists[docsByCitingDocs(D1, C), docsByCitingDocs(D2, C)].
Table 16.4 gives examples of features which improved test accuracy by at least
1 percentage point over the previous state of the venue prediction model in one
of the cross-validation runs. The first five features, in the generated order, are in
both clustersYES and clustersNO models. D and V are respectively document and
venue in the target pair <D,V>.
The features in table 16.4 can be summarized as follows: document D is more
likely to appear in a conference or journal V , if venue V publishes a lot of papers;
if document D cites or is cited by another document published in the same venue
V ; if, more generally, document D is close in the citation graph to other documents
published in V ; if the author of D published another paper in the same venue V ;
and finally, in the case of the cluster-based model, if another document on the latent
topic of D and published in V exists.
The cluster relations shown above led to higher classification accuracy. Another
potential advantage, not shown experimentally, is that cluster-based features are
generally cheaper to generate, since cluster relations contain fewer tuples than
the original relations from which they were derived. This can lead to reduced
computational costs per number of generated feature candidates.

3. Note that D1 is always distinct from D as the tuple with publication venue of document
D is removed from the background relation PublishedIn.

16.3.4 Dynamic Feature Generation

Up to this point, we presented models learned when doing the breadth-first search of
the feature space. In this section we explore an alternative search strategy in which
separate streams are used to generate queries (and hence features), and new queries
are preferentially selected from those streams which have been most productive of
useful features. The database query evaluation used in feature generation dominates
the computational cost of our statistical relational learning methodology; thus,
intelligently deciding which queries to evaluate can have a significant effect on total
cost.
Feature generation in the SGLR framework consists of two steps: query expression
generation and query evaluation. The former is cheap as it involves only syntactic
operations on query strings; the latter is computationally demanding. The experiment
is set up to test two strategies which differ in the order in which queries
are evaluated. In both strategies, query expressions are generated by breadth-first
search. The baseline, static strategy evaluates queries in the same order the expressions
appear in the search queue, while the alternative, dynamic strategy enqueues
each query into a separate stream at the time its expression is generated, but chooses
the next feature to be evaluated from the stream with the highest ratio:
(featuresAdded + 1)/(featuresTried + 1),
where featuresAdded is the number of features selected for addition to the model,
and featuresTried is the total number of features tried by feature selection in
this stream. Many other ranking methods could be used; this one has the virtue
of being simple and, for the realistic situation in which the density of predictive
features tends to decrease as one goes far into a stream, complete.
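
The dynamic strategy can be sketched in a few lines. The sketch assumes that ties between streams are broken in favor of the stream listed first and that the search stops as soon as one stream is exhausted; the class `Stream`, the function `dynamic_search`, and the callback `try_feature` (standing in for query evaluation plus feature selection) are hypothetical names.

```python
from collections import deque
from typing import Callable, Iterable, List


class Stream:
    """A queue of query expressions with per-stream selection statistics."""

    def __init__(self, name: str, queries: Iterable[str]) -> None:
        self.name = name
        self.queue = deque(queries)
        self.features_added = 0
        self.features_tried = 0

    def ratio(self) -> float:
        return (self.features_added + 1) / (self.features_tried + 1)


def dynamic_search(streams: List[Stream], try_feature: Callable[[str], bool]) -> List[str]:
    """Evaluate queries from the currently most productive stream.

    try_feature(query) evaluates the query and returns True if feature
    selection adds the resulting feature to the model.
    """
    evaluated: List[str] = []
    while all(stream.queue for stream in streams):  # stop when one stream is empty
        best = max(streams, key=Stream.ratio)
        query = best.queue.popleft()
        best.features_tried += 1
        if try_feature(query):
            best.features_added += 1
        evaluated.append(query)
    return evaluated


# Toy run: only queries from stream B ever produce selected features.
stream_a = Stream("A", ["a1", "a2", "a3"])
stream_b = Stream("B", ["b1", "b2", "b3"])
order = dynamic_search([stream_a, stream_b], lambda q: q.startswith("b"))
```

After the first unproductive query from stream A its ratio drops to 1/2, so the search switches to stream B and stays there while B keeps yielding selected features.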
In the tests below, we use two streams. The first stream contains queries with
aggregate operators exists and count over the entire table. The second stream
contains features which are the counts of unique elements in individual columns.
We stop the experiment when one of the streams is exhausted.
We report the difference in test-set accuracy between dynamic and static feature
generation. In each of four data sets, the difference in accuracy is plotted against the
number of features evaluated and considered by feature selection. We also kept track
of the CPU time required for each of these cases. The MySQL database engine was
used. Data sets 1, 2, and 3 took roughly 20,000 seconds, while data set 4 took 40,000
seconds. In all four cases, plots of accuracy vs. CPU time used were qualitatively
similar to the plots shown in figure 16.6.
In the experiments presented here, one of the two feature streams was a clear
winner, suggesting the heuristic splitting feature was effective. When the choice of
a good heuristic is difficult, dynamic feature generation, in the worst case, will
split features into equally good streams, and will asymptotically lead to the
same expected performance as the static feature generation by taking features from
different streams with equal likelihood.

Figure 16.6 Test accuracy of the dynamic search minus test accuracy of the
static search against the number of features considered. Errors: 95% confidence
interval (ten-fold cross validation, in each of ten runs). In row-first order: (a) set
1 (venue prediction, with cluster relations); (b) set 2 (venue prediction, without
cluster relations); (c) set 3 (link prediction, with cluster relations); (d) set 4 (link
prediction, without cluster relations). Ntrain = 1000, Ntest = 9000 in (a) and (b);
Ntrain = 500, Ntest = 4500 in (c) and (d). [Each panel plots
(accuracyDynamic - accuracyStatic) against the number of features considered.]


The two-stream approach can be generalized to a multistream approach. For
example, some streams can be formed based on the types of aggregate operators
in query expressions, as we did here, and other streams can be formed based on
the type of relations joined in a query, for example, split based on whether a query
contains a cluster-derived relation, or a relation of the same type as the target
concept. A query can be enqueued into one stream based on its aggregate operator,
and into a different stream based on type of relation instances. The split of features
into multiple streams need not be a disjoint partition. This method, if used with
a check to avoid evaluation of a query which has been evaluated previously in a
different stream, will not incur a significant increase in computational cost.
Another approach is to split features into multiple streams according to the sizes
of their relation instances, which would serve as an estimate of evaluation time.
This can lead to improvements for the following reasons: (1) out of two nearly
collinear features a cheaper one will likely correspond to a simpler query and will
be evaluated first, leading to approximately the same accuracy improvement as the
second more expensive feature; and (2) there is no obvious correlation between the
cost to evaluate a query and its predictive power. Therefore it can be expected that
cheaper queries result in good features as likely as more expensive ones.

16.4 Related Work and Discussion


A number of approaches for modeling from relational representation have been proposed
in the ILP community. Often, these approaches can be described as a propositionalization,
or as an upgrade, depending on whether feature generation and
modeling are integrated. Propositionalization [17] implies separation of modeling
from relational feature generation. A logic theory learned with an ILP method can
be used to produce binary features for a propositional learner. For example, Srinivasan
and King [31] use linear regression to model features constructed from the
clauses returned by Progol [21] to build predictive models in a chemical domain.
Bernstein et al. [5] introduce a relational vector-space model for classification in
domains with linked structure; a derived classifier based on known labels of linked
neighbors is proposed in [20]. Decoupling feature construction from modeling, as
done in propositionalization, retains the inductive bias of the technique used to
construct features; i.e., better models potentially can be built if one allows a propositional
learner itself to select its own features based on its own criteria. First-order
regression system (FORS) [14] more closely integrates feature construction into
regression modeling, but does so using a FOIL-like covering approach for feature
construction. Additive, or cumulative, models, such as linear or logistic regression,
have different criteria for feature usefulness; integrating feature construction and
model selection into a single process is advocated in this context in [6, 25]. Coupling
feature generation to model construction can also reduce computational costs by
only generating features which will be tested for the model selection.


In contrast to propositionalization, upgrading usually implies that generation of
relational features and their modeling are tightly coupled and driven by the propositional
learner's model selection criteria. SGLR shares these characteristics, and
its simpler form, when no cluster-derived relations are used, is, in this sense, an
upgrade of generalized linear models. TILDE [7] and WARMR [10] upgrade decision
trees and association rules, respectively. S-CART [16] upgrades CART, a
propositional algorithm for learning classification and regression trees. Dehaspe's
MACCENT system [9] uses expected entropy gain from adding binary features to
a maximum entropy classifier to direct a beam search over first-order clauses; it
determines when to stop adding variables by testing the classifier on a held-out
data set. Van Laer and De Raedt present an overview of upgrading in [18]. ILP
gives one way to structure the search space; others can be used [27].

16.5 Conclusion
We presented structural generalized linear regression and used its logistic regression
variant for analyzing document data from CiteSeer. SGLR combines the strengths
of generalized linear regression modeling (e.g., linear, logistic, and Poisson) with
the higher expressivity of features automatically generated from relational data
sources. New, potentially predictive features and relations in the database are
generated lazily, and selected with statistically rigorous criteria derived from the
regression model being built. SGLR is applicable to large domains with complex,
sparse and noisy data sources; these characteristics suggest focused, dynamic feature
generation from rich feature spaces, regression modeling, rigorous feature selection,
and the use of query and statistical optimizations, all of which contribute to the
expressivity, accuracy, and scalability of SGLR.
SGLR is attractive in offering a factored architecture which allows one to plug in
any additive statistical modeling tool and its corresponding feature selection criterion.
This contrasts with recursive subdivision methods in which one cannot easily
separate out search from modeling and feature selection. The factored architecture
offers many advantages, including support for dynamic feature selection.
We showed how clustering can be used to derive new concepts and relations
which augment database schema used in the automatic generation of predictive
features in statistical relational learning. Clustering improves scalability
through dimensionality reduction. More importantly, entities derived from clusters
increase the expressivity of feature spaces by creating new first-class concepts
which contribute to the creation of new features in more complex ways.
For example, in CiteSeer, papers can be clustered based on words, giving topics.
Associated with each cluster (or concept) is a cluster relation (e.g.,
on_topic) which then becomes part of more complex feature expressions such as
exists[publishedIn(D1, V), on_topic(D, C), on_topic(D1, C)]. Such richer features
result in more accurate models than those built only from the original relational
concepts.


We also showed that dynamically deciding which features to generate can lead to
the discovery of predictive features with substantially less computation than generating
all features in advance, as done, for example, in propositionalization. Native
statistical feature selection criteria can give run-time feedback for determining the
order in which features are generated. Coupling feature generation to model construction
can significantly reduce computational costs. Some ILP systems, such as
Progol, also perform dynamic feature generation, albeit with logic models. Many
problem domains should benefit from the SGLR or similar methods, including modeling
of social networks, bioinformatics, disclosure control in statistical databases,
and modeling of other hyperlinked domains, such as the web and databases of
patents and legal cases.

References
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, Boston, 1995.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. In Proceedings
of ACM International Conference on Management of Data, 1998.
[3] A. V. Aho, Y. Sagiv, and J. D. Ullman. Equivalences among relational
expressions. SIAM Journal on Computing, 8(2):218–246, 1979.
[4] H. Akaike. Information theory and an extension of the maximum likelihood
principle. In Second International Symposium on Information Theory, 1973.
[5] A. Bernstein, S. Clearwater, and F. Provost. The relational vector-space model
and industry classification. In IJCAI Workshop on Learning Statistical Models
from Relational Data, 2003.
[6] H. Blockeel and L. Dehaspe. Cumulativity as inductive bias. In Workshop on
Data Mining, Decision Support, Meta-Learning and ILP at PKDD, 2000.
[7] H. Blockeel and L. De Raedt. Top-down induction of logical decision trees.
Artificial Intelligence, 101(1-2):285–297, 1998.
[8] S. Ceri, G. Gottlob, and L. Tanca. Logic Programming and Databases. Springer-Verlag, Berlin, 1990.
[9] L. Dehaspe. Maximum entropy modeling with clausal constraints. In Proceedings
of the International Conference on Inductive Logic Programming, 1997.
[10] L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data
Mining and Knowledge Discovery, 3(1):7–36, 1999.
[11] S. Džeroski and N. Lavrač. An introduction to inductive logic programming.
In Sašo Džeroski and Nada Lavrač, editors, Relational Data Mining, pages
48–73. Springer-Verlag, Berlin, 2001.


[12] D. Foster and L. Ungar. A proposal for learning by ontological leaps. In
Proceedings of Snowbird Learning Conference, Snowbird, UT, 2002.
[13] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data
streams. In IEEE Symposium on Foundations of Computer Science, 2000.
[14] A. Karalič and I. Bratko. First order regression. Machine Learning, 26:147–176, 1997.
[15] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction
to Cluster Analysis. Wiley-Interscience, Hoboken, NJ, 1990.
[16] S. Kramer and G. Widmer. Inducing classification and regression trees in
first order logic. In Sašo Džeroski and Nada Lavrač, editors, Relational Data
Mining, pages 140–159. Springer-Verlag, Berlin, 2001.
[17] S. Kramer, N. Lavrač, and P. Flach. Propositionalization approaches to
relational data mining. In Sašo Džeroski and Nada Lavrač, editors, Relational
Data Mining, pages 262–291. Springer-Verlag, Berlin, 2001.
[18] W. Van Laer and L. De Raedt. How to upgrade propositional learners to
first order logic: A case study. In Sašo Džeroski and Nada Lavrač, editors,
Relational Data Mining, pages 235–261. Springer-Verlag, Berlin, 2001.
[19] S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries and
autonomous citation indexing. IEEE Computer, 32(6):67–71, 1999.
[20] S. A. Macskassy and F. Provost. A simple relational classifier. In KDD
Workshop on Multi-Relational Data Mining, 2003.
[21] S. Muggleton. Inverse entailment and Progol. New Generation Computing,
13:245–286, 1995.
[22] W. Nutt, Y. Sagiv, and S. Shurin. Deciding equivalences among aggregate
queries. In Proceedings of ACM International Conference on Principles of
Database Systems, 1998.
[23] C. Perlich and F. Provost. Aggregation-based feature invention and relational
concept classes. In International Conference on Knowledge Discovery and Data
Mining, 2003.
[24] A. Popescul, G. Flake, S. Lawrence, L. H. Ungar, and C. L. Giles. Clustering
and identifying temporal trends in document databases. In Proceedings of the
IEEE Advances in Digital Libraries, 2000.
[25] A. Popescul, L. H. Ungar, S. Lawrence, and D. Pennock. Towards structural
logistic regression: Combining relational and statistical learning. In Proceedings
of the Workshop on Multi-Relational Data Mining at KDD-2002, Edmonton,
Canada, 2002.
[26] A. Popescul, L. H. Ungar, S. Lawrence, and D. Pennock. Statistical relational
learning for document mining. In Proceedings of the IEEE International
Conference on Data Mining, 2003.


[27] D. Roth and W. Yih. Relational learning via propositional algorithms: An
information extraction case study. In Proceedings of the International Joint
Conference on Artificial Intelligence, 2001.
[28] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval.
McGraw-Hill, New York, 1983.
[29] G. Schwarz. Estimating the dimension of a model. Annals of Statistics,
6(2):461–464, 1978.
[30] E. Shapiro. Algorithmic Program Debugging. MIT Press, Cambridge, MA,
1983.
[31] A. Srinivasan and R. King. Feature construction with inductive logic programming:
A study of quantitative predictions of biological activity aided by
structural attributes. Data Mining and Knowledge Discovery, 3(1):37–57, 1999.
[32] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering
method for very large databases. In Proceedings of ACM International
Conference on Management of Data, 1996.
[33] J. Zhou, B. Stine, D. Foster, and L. Ungar. Streaming feature selection using
alpha investing. In International Conference on Knowledge Discovery and Data
Mining, 2005.

17 Learning a New View of a Database: With an Application in Mammography

Jesse Davis, Elizabeth Burnside, Inês Dutra, David Page,
Raghu Ramakrishnan, Jude Shavlik, and Vítor Santos Costa

Statistical relational learning (SRL) algorithms model joint probability distributions
over relational databases. However, current SRL techniques that operate on
databases are restricted to using only the fields and tables already in the database.
Yet, database users often define additional fields or tables, known as views, that can
be computed from the existing ones. We augment SRL algorithms by adding the
ability to learn new fields. We present two different approaches to view learning.
First, we develop a two-step approach where we search for all views of interest and
then build a statistical model incorporating the selected views. Second, we describe
SAYU-View, which integrates the view generation and model building steps. We
motivate view learning in the context of creating an expert system for mammography.
We show that view learning significantly improves the performance of the
expert system.

17.1 Introduction
Statistical relational learning (SRL) focuses on algorithms for learning statistical
models from relational databases. SRL advances beyond Bayesian network learning
and related techniques by handling domains with multiple tables, by representing
relationships between different rows of the same table, and by integrating data from
several distinct databases. Currently, SRL techniques can learn joint probability
distributions over the fields of a relational database with multiple tables. Nevertheless,
SRL techniques are constrained to use only the tables and fields already
in the database, without modification. In contrast, many human users of relational
databases find it beneficial to define alternative views of a database: further fields
or tables that can be computed from existing ones. This chapter shows that SRL
algorithms also can benefit from the ability to define new views. Namely, it shows


that view learning can be used for more accurate prediction of important fields in
the original database.
We augment SRL algorithms by adding the ability to learn new fields, intentionally
defined in terms of existing fields and intentional background knowledge. In
database terminology, these new fields constitute a learned view of the database.
We use inductive logic programming (ILP) to learn rules which intentionally define
the new fields. We present two different methods to accomplish this goal. The first
is a two-step approach where we search for all views of interest. This process is
expensive and does not necessarily guarantee selecting the most useful view. The
second framework, which we refer to as SAYU-View, has a tighter coupling between
view generation and view usage. Our results show that view learning can result in
significant benefits.
We present view learning in the specific application of creating an expert system
in mammography. We chose this application for a number of reasons. First, it is an
important practical application where there has been recent progress in collecting
sizable amounts of data. Second, we have access to an expert-developed system.
This provides a base reference against which we can evaluate our work [3]. Third, a
large proportion of examples are negative. This distribution skew is often found in
multi-relational applications. Last, our data consists of a single table. This allows
us to compare our techniques against standard propositional learning. In this case,
it is sufficient for view learning to extend an existing table with new fields, achieved
by using ILP to learn rules for unary predicates. For other applications, it may be
desirable to learn predicates of higher arity, which will correspond to learning a
view with new tables rather than new fields only.

17.2 View Learning for Mammography


Offering breast cancer screening to the ever-increasing number of women over age
40 represents a great challenge. Cost-effective delivery of mammography screening
depends on a consistent balance of high sensitivity and high specificity. It has been
demonstrated that subspecialist, expert mammographers achieve this balance and
perform significantly better than general radiologists [2, 34]. General radiologists
have higher false-positive rates and hence biopsy rates, diminishing the positive
predictive value for mammography [2, 34]. Unfortunately, despite the fact that
specially trained mammographers detect breast cancer more accurately, there is a
longstanding shortage of these individuals [10].
An expert system in mammography has the potential to help the general radiologist
approach the effectiveness of a subspecialty expert, thereby minimizing
both false-negative and false-positive results. Bayesian networks are probabilistic
graphical models that have been applied to the task of breast cancer diagnosis from
mammography data [17, 3, 5]. Bayesian networks produce diagnoses with probabilities
attached. Because of their graphical nature, they are comprehensible to
humans and useful for training. As an example, figure 17.1 shows the structure

Figure 17.1 Expert Bayes net. [Network diagram over the Disease node and the
following node labels: Age, HRT, Family hx, Breast Density, LN, Skin Lesion,
Architectural Distortion, Asymmetric Density, Tubular Density, Mass Stability,
Mass Margins, Mass Density, Mass Shape, Mass Size, Mass P/A/O, and the
calcification (Ca++) nodes Lucent Centered, Milk of Calcium, Dystrophic,
Pleomorphic, Eggshell, Punctate, Fine/Linear, Popcorn, Round, Dermal,
Amorphous, and Rod-like.]

Table 17.1 The National Mammography Database schema, omitting some of the features

Patient  Abnormality  Date  Mass Shape  ...  Mass Size  Location  Be/Mal
P1                    5/02  Spic        ...  0.03       RU4
P1                    5/04  Var         ...  0.04       RU4
P1                    5/04  Spic        ...  0.04       LL4
...      ...          ...   ...         ...  ...        ...       ...


of a Bayesian network developed by a subspecialist, expert mammographer. For
each variable (node) in the graph, the Bayes net has a conditional probability table
giving the probability distribution over the values that the variable can take for
each possible setting of its parents. The Bayesian network in figure 17.1 achieves
accuracies higher than those of other systems and of general radiologists who perform
mammograms, and commensurate with the performance of radiologists who
specialize in mammography [3].
Table 17.1 shows the main table (with some fields omitted for brevity) in a
large relational database of mammography abnormalities. The database schema is


specified in the National Mammography Database (NMD) standard established by
the American College of Radiology [1]. The NMD was designed to standardize data
collection for mammography practices in the United States and is widely used for
quality assurance. We omit a second, much smaller biopsy table, simply because we
are interested in predicting, before the biopsy, whether an abnormality is benign
or malignant. Note that the database contains one record per abnormality. By
putting the database into one of the standard database normal forms, it would
be possible to reduce some data duplication, but only a very small amount: the
patient's age, status of hormone replacement therapy, and family history could be
recorded once per patient and date in cases where multiple abnormalities are found
on a single mammogram date. Such normalization would have no effect on our
approach or results, so we choose to operate directly on the database in its defined
form.
Figure 17.2 presents a hierarchy of the four types of learning that might be
used for this task. Level 1 and level 2 are standard types of Bayesian network
learning. Level 1 is simply learning the parameters for the expert-defined network
structure. Level 2 involves learning the actual structure of the network in addition
to its parameters. Notice that to predict the probability of malignancy of an
abnormality, a Bayes net uses only the record for that abnormality. Nevertheless,
data in other rows of the table may also be relevant: radiologists may also consider
other abnormalities on the same mammogram or previous mammograms. For
example, it may be useful to know that the same mammogram also contains another
abnormality, with a particular size and shape; or that the same person had a
previous mammogram with certain characteristics. Incorporating data from other
rows in the table is not possible with existing Bayesian network learning algorithms
and requires SRL techniques, such as probabilistic relational models [12]. Level 3 in
figure 17.2 shows the state of the art in SRL techniques, illustrating how relevant
fields from other rows (or other tables) can be incorporated into the network, using
aggregation if necessary. Rather than using only the size of the abnormality under
consideration, the new aggregate field allows the Bayes net to also consider the
average size of all abnormalities found in the mammogram.
Presently, SRL is limited to using the original view of the database, that is,
the original tables and fields, possibly with aggregation. Despite the utility of
aggregation, considering only the existing fields may be insufficient for
accurate prediction of malignancies. Level 4 in figure 17.2 shows the key capability
that will be introduced and evaluated in this chapter: using techniques from rule
learning to learn a new view. In this figure, the new view includes two new features
utilized by the Bayes net that cannot be defined simply by aggregation of existing
features. The new features are defined by two learned rules that capture hidden
concepts potentially useful for accurately predicting malignancy, but that are not
explicit in the given database tables. One learned rule states that a change in
the shape of an abnormality at a location since an earlier mammogram may be
indicative of a malignancy. The other says that an increase in the average of the sizes
of the abnormalities may be indicative of malignancy. Note that both rules require
reference to other rows in the table for the given patient, as well as intensional
background knowledge to define concepts such as increases over time. Neither
rule can be captured by standard aggregation of existing fields.

Figure 17.2 Hierarchy of learning types. Levels 1 and 2 are available through
ordinary Bayesian network learning algorithms, level 3 is available only through
state-of-the-art SRL techniques, and level 4 is described in this chapter.

Note that level 3 and level 4 learning would not be necessary if the database
initially contained all the potentially useful fields capturing information from other
relevant rows or tables. For example, the database might be initially constructed
to contain fields such as "slope of change in abnormality size at this location over
time," "average abnormality size on this mammogram," and so on. If humans can
identify all such potentially useful fields beforehand and
define views containing these, then level 3 and level 4 learning are unnecessary.
Nevertheless, the space of such possibly useful fields is quite large, and perhaps more
easily searched by computer via level 3 and level 4 learning. Certainly in the case of
the National Mammography Database standard [1], such fields were not available
because they had not been defined and populated in the database by the domain
experts, thus making level 3 and level 4 learning potentially useful.

17.3

Naive View Learning Framework


One can imagine a variety of approaches to perform view learning. As a first step,
we apply existing technology to obtain a view learning capability. Any relational
database can be naturally and simply represented using a subset of first-order
logic [30]. ILP provides algorithms to learn rules, also expressed in logic, from
such relational data [22], possibly together with background knowledge expressed
as a logic program. ILP systems operate by searching a space of possible logical
rules, looking for rules that score well according to some measure of fit to the data.
Our first learning framework works in two steps. First, we learn rules to predict
whether an abnormality is malignant. We extend the original database by
introducing the new rules as additional features. More precisely, each rule will correspond
to a binary feature such that it takes the value true if the body, or condition, of the
rule is satisfied, and false otherwise. We then run the Bayesian network structure
learning algorithm, allowing it to use these new features in addition to the original
features. Section 17.7 notes the relationship of the approach to earlier work on ILP
for feature construction.
Below we show a simple rule learned by an ILP system. The rule covers 48 positive
examples and 123 negative examples. This rule can now be used as a field in a new
view of the database, and consequently as a new feature in the Bayesian network.
Abnormality A in mammogram M may be malignant if:
A's tissue is not asymmetric,
M contains another abnormality A2,
A2's margins are spiculated, and
A2 has no architectural distortion.
Note that the last two lines of the rule refer to other rows of the relational table for
abnormalities in the database. Hence this rule encodes information not available to
the current version of the Bayesian network [9].
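
To make the idea of a rule acting as a binary feature concrete, the following is a minimal sketch, not the authors' implementation. The record layout, the field names (asymmetric, margins, arch_distortion), and the three toy rows are hypothetical; the logic simply evaluates the body of the rule shown above against the other rows that share the same patient and mammogram date.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Abnormality:
    # Hypothetical field names, not the NMD column names.
    patient: str
    date: str              # with patient, identifies one mammogram
    asymmetric: bool
    margins: str
    arch_distortion: bool

def rule_fires(a, table):
    """Return True if the body of the example rule is satisfied for abnormality a."""
    if a.asymmetric:
        return False
    for a2 in table:
        same_mammogram = (a2.patient, a2.date) == (a.patient, a.date)
        if (a2 is not a and same_mammogram
                and a2.margins == "spiculated" and not a2.arch_distortion):
            return True
    return False

# Toy rows for illustration only.
table = [
    Abnormality("P1", "5/04", False, "circumscribed", False),
    Abnormality("P1", "5/04", False, "spiculated", False),
    Abnormality("P2", "3/03", False, "spiculated", False),
]

# One Boolean value per row: the new field added to the view.
new_feature = [rule_fires(a, table) for a in table]   # [True, False, False]
```

Only the first row receives the value true, because it is the only abnormality that shares a mammogram with a different, spiculated abnormality.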

17.4

Initial Experiments
The purposes of the experiments we conducted are twofold. First, we want to
determine if using SRL yields an improvement compared to propositional learning.
Second, we want to evaluate whether we see an improvement when moving up
a level in the hierarchy outlined in figure 17.2. First, we try to learn a structure
with just the original attributes (level 2) and see if that performs better than using
the expert structure with trained parameters (level 1). Next, we add aggregate
features to our network, representing summaries of abnormalities found either in a
particular mammogram or for a particular patient. This corresponds to level 3 and
we test whether this improves over levels 1 and 2. Finally, we investigate doing level
4 learning through the two-step algorithm and compare its performance to levels 1
through 3.
We experimented with a number of structure learning algorithms for Bayesian
networks, including naive Bayes, tree-augmented naive Bayes (TAN) [11], and the
sparse candidate algorithm [13]. However, we obtained the best results with the
TAN algorithm in all experiments, so we will focus our discussion on TAN. In a
TAN network, each attribute can have at most one other parent in addition to
the class variable. The TAN model can be constructed in polynomial time with
a guarantee that the model maximizes the log-likelihood of the network structure
given the data set [14, 11].
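
The TAN construction can be sketched as follows, assuming the standard recipe: weight every pair of attributes by their empirical conditional mutual information given the class, then take a maximum-weight spanning tree. This is an illustrative sketch rather than the code used in the experiments; the function names, the arbitrary choice of attribute 0 as the root, and the toy columns are assumptions.

```python
import math
from collections import Counter

def cond_mutual_info(xi, xj, c):
    """Empirical I(Xi; Xj | C) for three equal-length sequences of discrete values."""
    n = len(c)
    n_c = Counter(c)
    n_ic = Counter(zip(xi, c))
    n_jc = Counter(zip(xj, c))
    n_ijc = Counter(zip(xi, xj, c))
    total = 0.0
    for (a, b, k), count in n_ijc.items():
        # P(a,b,k) * log[ P(a,b|k) / (P(a|k) * P(b|k)) ]
        total += (count / n) * math.log(count * n_c[k] / (n_ic[(a, k)] * n_jc[(b, k)]))
    return total

def tan_parents(columns, c):
    """Return {attribute index: parent attribute index or None}.

    The class variable is an implicit additional parent of every attribute.
    """
    d = len(columns)
    weight = {(i, j): cond_mutual_info(columns[i], columns[j], c)
              for i in range(d) for j in range(i + 1, d)}
    parent = {0: None}   # attribute 0 is the root (an arbitrary choice)
    while len(parent) < d:
        # Prim's algorithm: add the heaviest edge leaving the current tree.
        i, j = max(((i, j) for i in parent for j in range(d) if j not in parent),
                   key=lambda e: weight[min(e), max(e)])
        parent[j] = i
    return parent

# Toy data: x1 duplicates x0, x2 is independent of both given the class.
c = [0, 0, 0, 0, 1, 1, 1, 1]
x0 = [0, 1, 0, 1, 0, 1, 0, 1]
x1 = [0, 1, 0, 1, 0, 1, 0, 1]
x2 = [0, 0, 1, 1, 0, 0, 1, 1]
parents = tan_parents([x0, x1, x2], c)
```

In the toy data the strongest dependency is between x0 and x1, so x1 receives x0 as its one extra parent, which matches the restriction that each attribute has at most one parent besides the class.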
17.4.1

Data and Methodology

We collected data for all screening and diagnostic mammography examinations that
were performed at the Froedtert and Medical College of Wisconsin Breast Imaging
Center between April 5, 1999 and February 9, 2004. It is important to note that
the data consists of a radiologist's interpretation of a mammogram and not the
raw image data. The radiologist reports conformed to the National Mammography
Database (NMD) standard established by the American College of Radiology. From
these reports, we followed the original network [3] to cull the 36 features deemed
to be relevant by coauthor Burnside, an expert mammographer.
To evaluate and compare these approaches, we used stratified ten-fold
cross-validation. We randomly divided the abnormalities into ten roughly equal-sized
sets, each with approximately one-tenth of the malignant abnormalities and
one-tenth of the benign abnormalities. When evaluating just the structure learning and
aggregation, nine folds were used for the training set. When performing aggregation,
we used binning to discretize the created features. We took care to only use the
examples from the training set to determine the bin widths. When performing view
learning, we had two steps in the learning process. In the first part, four folds of
data were used to learn the ILP rules. The remaining five folds were used to learn
the Bayes net structure and parameters.


When using cross-validation on a relational database, there exists one major


methodological pitfall. Some of the cases may be related. For example, we may
have multiple abnormalities for a single patient. Because these abnormalities are
related (same patient), having some of these in the training set and others in the
test set may cause us to perform better on those test cases than we would expect
to perform on cases for other patients. To avoid such leakage of information
into a training set, we ensured that all abnormalities associated with a particular
patient were placed into the same fold for cross-validation. Another potential pitfall
is that we may learn a rule that predicts an abnormality to be malignant based on
properties of abnormalities in later mammograms. We ensured that we never
predict the status of an abnormality at a given date based on findings recorded for
later dates.
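
One way to keep every abnormality of a patient in a single fold while roughly preserving the class balance is to assign patients, rather than abnormalities, to folds. The sketch below is an assumption about how this could be done, not a description of the procedure actually used: it stratifies patients by whether they have any malignant abnormality and deals them out round-robin. The function name and the toy records are hypothetical.

```python
import random
from collections import defaultdict

def patient_folds(records, k=10, seed=0):
    """records: list of (patient_id, is_malignant). Returns {patient_id: fold index}."""
    has_malignant = defaultdict(bool)
    for patient, malignant in records:
        has_malignant[patient] |= malignant
    rng = random.Random(seed)
    folds = {}
    next_fold = 0
    # Deal out patients with a malignancy first, then the rest, so that
    # each stratum is spread as evenly as possible over the k folds.
    for stratum in (True, False):
        patients = sorted(p for p, m in has_malignant.items() if m == stratum)
        rng.shuffle(patients)
        for p in patients:
            folds[p] = next_fold % k
            next_fold += 1
    return folds

# Toy data: 50 patients with two abnormalities each; every fifth patient is malignant.
records = [(f"P{i}", i % 5 == 0) for i in range(50) for _ in range(2)]
folds = patient_folds(records, k=10)

# Every abnormality inherits the fold of its patient, so no patient is split.
fold_of_row = [folds[patient] for patient, _ in records]
```

Because the fold is looked up by patient ID, two abnormalities of the same patient can never end up on opposite sides of a train/test split.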
17.4.2

Approach for Each Level of Learning

Level 1: Parameter learning. We estimated the parameters of the expert
structure from the data set using maximum likelihood estimates with Laplace
correction. It has been previously noted that learning the parameters of the network
improves performance over having expert-defined probabilities in each node [4].
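
As a small illustration of maximum likelihood estimation with Laplace correction, the sketch below estimates one conditional probability table by adding one to every count. It assumes a node with a single parent for brevity; the function name and the toy values are hypothetical.

```python
from collections import Counter

def laplace_cpt(child_values, parent_values, child_domain):
    """Estimate P(child | parent) with add-one (Laplace) smoothing."""
    joint = Counter(zip(parent_values, child_values))
    parent_count = Counter(parent_values)
    k = len(child_domain)
    return {
        (p, x): (joint[(p, x)] + 1) / (parent_count[p] + k)
        for p in parent_count for x in child_domain
    }

# Toy data: three malignant abnormalities and their mass shapes.
shapes = ["spiculated", "spiculated", "round"]
labels = ["M", "M", "M"]
cpt = laplace_cpt(shapes, labels, child_domain=["spiculated", "round", "oval"])
# cpt[("M", "spiculated")] = (2 + 1) / (3 + 3) = 0.5
# cpt[("M", "oval")]       = (0 + 1) / (3 + 3), nonzero although never observed
```

The correction keeps unseen value combinations from receiving probability zero, which matters when some parent settings are rare in the training folds.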
Level 2: Structure learning. The relational database for the mammography
data contains one row for each abnormality described on a mammogram. Fields in
this relational table include all those shown in the Bayesian network of figure 17.1.
Therefore it is straightforward to use existing Bayesian network structure learning
algorithms to learn a possibly improved structure for the Bayesian network.
Level 3: Aggregate learning. We selected the numeric (e.g., the size of a mass)
and ordered features (e.g., the density of a mass) in the database and computed
aggregates for each of these features. In all, we determined that 27 of the 36
attributes were suitable for aggregation. We computed aggregates on both the
patient and the mammogram level. On the patient level, we looked at all of the
abnormalities for a specific patient. On the mammogram level, we only considered
the abnormalities present on that specific mammogram. To discretize the averages,
we divided each range into three bins. For binary features we used predefined
bin sizes, while for the other features we attempted to get equal numbers of
abnormalities in each bin. For aggregation functions we used maximum and average.
With two aggregation functions at each of two levels, the aggregation introduced
27 × 4 = 108 new features. The following paragraph
presents further details of our aggregation process.
We used a three-step process to construct aggregate features. First, we chose a
field to aggregate. Second, we selected an aggregation function. Third, we needed
to decide over which rows to aggregate the feature, that is, which keys or links
to follow. This is known as a slot chain in probabilistic relational model (PRM)
terminology [12]. In our database, two such links exist. The patient ID field allows
access to all the abnormalities for a given patient, providing aggregation on the
patient level. The second key is the combination of patient ID and mammogram
date, which returns all abnormalities for a patient on a specific mammogram,
providing aggregation on the mammogram level.

Table 17.2 Database after aggregation on Mass Size field. Note the addition of
two new fields, Average Patient Mass Size and Average Mammogram Mass Size,
which represent aggregate features.

Patient  Abnormality  Date  Mass Shape  ...  Mass Size  Location  Average Patient Mass Size  Average Mammogram Mass Size  Be/Mal
P1       1            5/02  Spic        ...  0.03       RU4       0.0367                     0.03
P1       2            5/04  Var         ...  0.04       RU4       0.0367                     0.04
P1       3            5/04  Spic        ...  0.04       LL4       0.0367                     0.04
...      ...          ...   ...         ...  ...        ...       ...                        ...                          ...

To demonstrate this process, we
will work through an example of computing an aggregate feature for patient 1 in
the database given in table 17.1. We will aggregate on the Mass Size field and use
average as the aggregation function. Patient 1 has three abnormalities, one from a
mammogram in May 2002 and two from a mammogram in May 2004. To calculate
the aggregate on the patient level, we average the size for all three abnormalities,
which is 0.0367. To find the aggregate on the mammogram level for patient 1, we have
to perform two separate computations. First, we follow the link P1 and 5/02, which
yields abnormality 1. The average for this mammogram is simply 0.03. Second,
we follow the link P1 and 5/04, which yields abnormalities 2 and 3. The average
for these abnormalities is 0.04. Table 17.2 shows the database following construction
of these aggregate features.
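
The worked example for patient 1 can be reproduced with a few lines of code. This is only a sketch of the three-step process (field, aggregation function, key); the tuple layout and the helper name aggregate are assumptions, and the three rows are the Mass Size values listed for patient 1.

```python
from collections import defaultdict
from statistics import mean

# (patient, mammogram date, mass size) for the three abnormalities of patient P1.
rows = [
    ("P1", "5/02", 0.03),
    ("P1", "5/04", 0.04),
    ("P1", "5/04", 0.04),
]

def aggregate(rows, key, fn=mean):
    """Group mass sizes by the given key (the slot chain) and apply fn to each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: fn(v) for k, v in groups.items()}

# Patient level: follow the patient ID.
by_patient = aggregate(rows, key=lambda r: r[0])
# Mammogram level: follow patient ID plus mammogram date.
by_mammogram = aggregate(rows, key=lambda r: (r[0], r[1]))

# round(by_patient["P1"], 4)      -> 0.0367
# by_mammogram[("P1", "5/02")]    -> 0.03
# by_mammogram[("P1", "5/04")]    -> 0.04
```

Passing fn=max instead of the default gives the maximum aggregate over the same groups.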

Level 4: View learning. We used the ILP system Aleph [35] to implement level
4 learning. Aleph was asked to learn rules predictive of malignancy. We introduced
three new intensional tables into Aleph's background knowledge to take advantage
of relational information.
1. The prior_Mammogram relation connects information about any prior abnormality
that a given patient may have.
2. The same_Location relation is a specification of the previous predicate. It adds
the restriction that the prior abnormality must be in the same location as the
current abnormality. Radiology reports include information about the location of
abnormalities.
3. The in_Same_Mammogram relation incorporates information about other
abnormalities a patient may have on the current mammogram.
By default, Aleph is set up to generate rules that would fully explain the
examples. In contrast, our goal was to extract rules that would be beneficial as
new views. The major problem in implementing level 4 learning was how to select
rules that would best complement level 3 information. Clearly, Aleph's standard
coverage algorithm was not designed for this application. Instead, we chose to first
enumerate as many rules of interest as possible, and then choose interesting rules.
In order to obtain a varied set of rules, we ran Aleph under induce_max for
each fold. induce_max uses every positive example in each fold as a seed for the
search. Also note that induce_max does not discard previously covered examples
when scoring a new clause. Several thousand distinct rules were learned for each
fold, with each rule covering many more malignant cases than (incorrectly covering)
benign cases. We avoid the rule overfitting found by other authors [24] by doing
breadth-first search for rules and by having a minimal limit on coverage.
Each seed generated anywhere from zero to tens of thousands of rules. Adding
all rules would mean introducing thousands of often redundant features. We
implemented the following algorithm:
1. We scanned all rules looking for duplicates and for rules that performed worse
than a more general rule. This step significantly reduced the number of rules to
consider.
2. We sorted rules according to their m-estimate.
3. We used a greedy algorithm that picks the rule with the highest m-estimate such
that it covers an unexplained training example. Furthermore, each rule needs to
cover a significant number of malignant cases. This step is similar to the standard
ILP greedy covering algorithm, except that we do not follow the original order of
the seed examples.
4. Last, we scanned the remaining rules, selecting those that covered a significant
number of examples, and that were different from all previous rules, even though
these rules would not cover any new examples.
It is important to note that the rule selection was an automated process. We
picked the top fifty clauses in our experiments, obtained from practical
considerations on the size of the Bayesian networks we would need to learn. The resulting
views were added as new features to the database.
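
Steps 2 and 3 of the selection procedure can be sketched as follows. The m-estimate is taken in its usual form, (p + m * prior) / (p + n + m), where p and n are the numbers of positive and negative examples a rule covers; the value m = 2, the minimum coverage of 2, the rule names, and the toy coverage sets are all assumptions made for illustration.

```python
def m_estimate(pos, neg, prior, m=2.0):
    """m-estimate of rule precision; prior is the fraction of positive examples."""
    return (pos + m * prior) / (pos + neg + m)

def select_rules(rules, prior, min_pos=2, limit=50):
    """Greedy covering over rules ranked by m-estimate.

    rules: {name: (set of covered positive ids, set of covered negative ids)}.
    A rule is kept if it covers enough positives and at least one positive
    example that no previously chosen rule explains.
    """
    ranked = sorted(
        rules,
        key=lambda r: m_estimate(len(rules[r][0]), len(rules[r][1]), prior),
        reverse=True,
    )
    explained, chosen = set(), []
    for name in ranked:
        pos, _ = rules[name]
        if len(pos) >= min_pos and not pos <= explained:
            chosen.append(name)
            explained |= pos
        if len(chosen) == limit:
            break
    return chosen

# Toy rules with hypothetical coverage.
rules = {
    "r1": ({1, 2, 3}, {10}),
    "r2": ({1, 2}, set()),
    "r3": ({4, 5}, {10, 11}),
    "r4": ({6}, set()),
}
chosen = select_rules(rules, prior=0.2)   # ["r2", "r1", "r3"]
```

Here r4 is dropped for covering too few malignant cases, while r1 is kept after r2 because it explains one additional positive example.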
17.4.3

Results

Figure 17.3 ROC curves for parameter learning (level 1) compared to structure
learning (level 2).

We present the results of our first experiment, comparing levels 1 and 2, using
both ROC and precision-recall curves. Figure 17.3 shows the ROC curve for these
experiments, and figure 17.4 shows the precision-recall curves. Because of our
skewed class distribution, due to the large number of benign cases, we prefer
precision-recall curves over ROC curves because they better show the number of
"false alarms," or unnecessary biopsies. Therefore, we use precision-recall curves
for the remainder of the results. Here, precision is the percentage of abnormalities
that we classified as malignant that are truly cancerous. Recall is the percentage of
malignant abnormalities that were correctly classified. To generate the curves, we
pooled the results over all ten folds by treating each prediction as if it had been
generated from the same model. We sorted the estimates and used all possible split
points to create the graphs.
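
The pooling and sweeping of split points described above can be sketched as follows. The function name and the four toy predictions are hypothetical; scores stand for the predicted probability of malignancy collected from all folds, and one point is produced for each distinct score used as a threshold.

```python
def pr_curve(scores, labels):
    """Precision-recall points from pooled predictions.

    scores: predicted probability of malignancy; labels: True if malignant.
    Returns a list of (recall, precision), one per distinct threshold,
    ordered from the strictest threshold to the most permissive.
    """
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    points, tp, fp = [], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += not label
        # Emit a point only after the last example sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points

# Toy pooled predictions (illustration only).
curve = pr_curve([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
# [(0.5, 1.0), (0.5, 0.5), (1.0, 2/3), (1.0, 0.5)]
```

Each false positive lowers precision without changing recall, which is why the curve makes unnecessary biopsies visible in a way that an ROC curve on skewed data does not.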
Figure 17.5 compares performance for all levels of learning. We can observe very
significant improvements when adding multi-relational features. Aggregates provide
the most benefit for higher recalls whereas rules help in the medium and low ranges
of recall. We believe this is because ILP rules are more accurate than the other
features, but have limited coverage.
Figure 17.6 shows the average area under the precision-recall curve for each
level of learning that we defined in figure 17.2. We only consider recalls above
50%, as for this application radiologists would be required to perform at least at
this level. We further use the paired t-test to compare the areas under the curve
(recall ≥ 0.5) for every fold. We found improvement of level 2 over level 1 to be
statistically significant with a 99% level of confidence. According to the paired
t-test, level 3 improves over level 2 at the 97%
confidence level. Furthermore, level 4 over level 2 is significant, using the area
under the curve metric, at the 99% level. However, there is no significant difference
between level 3 and level 4.
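
For readers who want to see the comparison spelled out, the following sketch computes a paired t statistic over per-fold areas. The per-fold values are invented for illustration and are not the chapter's results; the constant 3.250 is the two-tailed critical value of Student's t distribution with 9 degrees of freedom at the 0.01 level, appropriate for ten folds.

```python
import math
from statistics import mean, stdev

# Two-tailed critical value of Student's t, 9 degrees of freedom, alpha = 0.01.
T_CRIT_99_DF9 = 3.250

def paired_t(a, b):
    """t statistic of the paired differences a[i] - b[i] (n - 1 degrees of freedom)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-fold areas for two levels of learning (not the chapter's data).
level_a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13]
level_b = [0.10, 0.12, 0.10, 0.11, 0.12, 0.13, 0.10, 0.12, 0.11, 0.12]

t = paired_t(level_a, level_b)
significant = abs(t) > T_CRIT_99_DF9
```

Pairing by fold removes the variation between folds from the comparison, so only the per-fold differences between the two methods are tested.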

Figure 17.4 Precision-recall curves for parameter learning (level 1) compared to
structure learning (level 2).


In this task, considering relational information is still crucial for improving
performance since the relational approaches outperform the propositional methods.
We mostly see significant improvement as we move up the learning hierarchy outlined
in figure 17.2. However, in this initial approach we see no significant difference
between level 3 and level 4.
The process of generating the views in level 4 can be useful to the radiologist,
as it identifies potentially interesting correlations between attributes. During our
experiments, we presented coauthor Burnside with a set of 130 rules to review. She
found several rules interesting, including the following:
Abnormality A in mammogram M for patient P is malignant if:
A has BI-RADS category 5,
A has a mass present,
A has a mass with high density,
P has a prior history of breast cancer,
P has an extra finding on same mammogram (B),
B has no pleomorphic microcalcifications,
B has no punctate calcifications.
This rule identified 42 malignant mammographic findings while only
misclassifying 11 benign findings as cancer. The radiologist was intrigued by this rule because
it suggests a hitherto unknown relationship between malignancy and high-density
masses. In general, mass density was not previously thought to be a highly
predictive feature, so this rule is valuable in its own right [6].

Figure 17.5 Precision-recall curves for each level of learning.

Figure 17.6 Area under the curve for recalls above 50%.

17.5

Integrated View Learning Framework


The initial methodology for level 4 follows a two-step process. In the first step, an
ILP algorithm learns a set of rules. In the second step, the learned rules are added
to the preexisting features to form a final model. This approach suffers from several
weaknesses. First, we follow a brute-force approach to search for all good rules, but
we have no way to evaluate which ones will actually improve the network. Second,
the metric used to score the rules differs from the one we will ultimately use to
evaluate the final model. Thus, we have no guarantee that the rule-learning process
will select the rules that best contribute to the final classifier.
We propose an alternative approach, based on the idea of constructing the
classifier as we learn the rules [8]. In the new approach, rules are scored by how
much they improve the classifier, providing a tight coupling between rule generation
and rule usage. We call this methodology "score as you use," or SAYU.
SAYU is closely related to nFOIL [20] and also to the work of Popescul et al. on
structural logistic regression [29]. The relationships to these important works are
discussed in section 17.7.
Our implementation of SAYU depends on both an ILP system and a propositional
learner. Following the original work, we used Aleph as a rule proposer and TAN as
our propositional learner.
Our algorithm works as follows. We randomly choose a seed example, and obtain
its most specific, or saturated, clause. We then perform a top-down breadth-first
search of the subsumption lattice. We evaluate each clause by converting it to a
binary feature, which is added to the current training set. We learn a new Bayes net
incorporating this new feature, and score the network. If the new feature improves
the score of the network, then we retain the feature in the network. If the feature
degrades the performance of the network, it is discarded, and we revert to the
old classifier and continue searching. One other central difference between our
algorithm and Aleph is that after the network accepts a rule, we randomly
select a new seed. Thus, we are not searching for the best rule, but only the first rule
that helps. However, nothing prevents the same seed from being selected multiple
times during the search.
Finally, we need to define a scoring function. The main goal is to use the same
scoring function for both learning and evaluation. Furthermore, we wish to be able
to handle data sets that have a highly skewed class distribution. In the presence of
skew, precision and recall are often used to evaluate classifier quality. In order to
characterize how the algorithm performs over the whole precision-recall space, we
follow Goadrich et al. [15], and adopt the area under the precision-recall curve as
our scoring metric. When calculating the area under the precision-recall curve, we
integrate from recall levels of 0.5 or greater. As we previously noted, a radiologist
would have to achieve levels of recall in this range.
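
A scoring function of this kind can be sketched as the area under a list of (recall, precision) points restricted to recall of at least 0.5. The sketch below uses plain trapezoids with linear clipping at the cutoff, which is a simplification chosen for brevity; the interpolation scheme actually used in the experiments is not specified here, and linear interpolation between precision-recall points is known to be only an approximation. The function name and toy points are assumptions.

```python
def aucpr_above(points, min_recall=0.5):
    """Trapezoidal area under (recall, precision) points for recall >= min_recall.

    points must trace the curve in order of non-decreasing recall; the stable
    sort below keeps the given order among points that share a recall value.
    """
    pts = sorted(points, key=lambda p: p[0])
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        if r1 <= min_recall or r1 == r0:
            continue   # segment lies left of the cutoff, or is vertical
        if r0 < min_recall:
            # Clip the segment at the recall cutoff by linear interpolation.
            p0 += (p1 - p0) * (min_recall - r0) / (r1 - r0)
            r0 = min_recall
        area += (r1 - r0) * (p0 + p1) / 2
    return area

# Toy curve (illustration only).
score = aucpr_above([(0.25, 1.0), (0.75, 0.5), (1.0, 0.5)])   # 0.28125
```

Because only half of the recall axis is integrated, the largest attainable value is 0.5 rather than 1.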
We have previously reported that SAYU performs on a par with level 3 and the
initial approach to level 4. However, in these experiments we implemented SAYU
as a rule combiner only, not as a tool for view learning that adds fields to the
existing set of fields (features) in the database [8]. We have modified SAYU to take
advantage of the predefined features, yielding a more integrated approach to view
learning. We also report on a more natural design where SAYU starts from the
level 3 network. We call this approach SAYU-View. Figure 17.7 gives pseudocode
for the SAYU-View algorithm.

Figure 17.7 The SAYU-View algorithm.

17.6

Further Experiments and Results


We use essentially the same methodology as described previously for the initial
approach to view learning. On each round of cross-validation, we use four folds
as a training set, five folds as a tuning set, and one fold as a test set. We only
saturate examples from the training set. For SAYU-View, we use only the training
set to learn the rules. The key difference between initial level 4 and SAYU-View
is the following: for SAYU-View we use the training set to learn the structure and
parameters of the Bayes net, and we use the tuning set to calculate the score of a
network structure. Previously, we used the tuning set to learn the network structure
and parameters. In order to retain a clause in the network, the area under the
precision-recall curve of the Bayes net incorporating the rule must achieve at least
a 2% improvement over the area of the precision-recall curve of the best Bayes net.
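
The control flow of this search can be sketched in a few lines. This is a reading of the prose description, not a transcription of the pseudocode in figure 17.7. The callables clauses_for (candidate clauses for a seed, in search order) and score_with (train the Bayes net on the training set with the given features and return its area on the tuning set) are hypothetical stand-ins, the 2% threshold is interpreted as a relative gain, and a fixed number of seed draws replaces the time-based stopping criterion.

```python
import random

def sayu_view(seeds, clauses_for, score_with, initial_features,
              min_gain=0.02, max_seeds=100, rng=None):
    """Score-as-you-use search starting from an initial feature set."""
    rng = rng or random.Random(0)
    features = list(initial_features)
    best = score_with(features)
    for _ in range(max_seeds):
        seed = rng.choice(seeds)           # the same seed may be drawn again
        for clause in clauses_for(seed):
            if clause in features:
                continue
            score = score_with(features + [clause])
            if score >= best * (1 + min_gain):
                features.append(clause)    # keep the first clause that helps
                best = score
                break                      # then move on to a new seed
    return features, best

# Toy stand-ins: each feature contributes a fixed amount to the score.
gain = {"agg": 0.10, "a": 0.0, "b": 0.05, "c": 0.001}
candidates = {"s1": ["a", "b"], "s2": ["c"]}

features, best = sayu_view(
    seeds=["s1", "s2"],
    clauses_for=lambda s: candidates[s],
    score_with=lambda fs: sum(gain[f] for f in fs),
    initial_features=["agg"],              # plays the role of the level 3 features
)
```

In the toy run, clause a is rejected because it does not raise the score, b is accepted, and c is rejected because its gain stays below the threshold.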

Figure 17.8 Precision-recall curves for each level of learning.

Within SAYU, the time to score a rule has increased. The Bayes net algorithm
has to learn a new network topology and new parameters each time we score a rule
(feature). Furthermore, inference must be performed to compute the score after
incorporating a new feature. The SAYU algorithm is strictly more expensive than
standard ILP as SAYU also has to prove whether a rule covers each example in
order to create the new feature. To reflect the added cost, we use a time-based stop
criterion for the new algorithm. This criterion is described in further detail in [8].
For each fold, we use the times from the baseline experiments in [8], so that our
new approach to view learning takes the same time as the old approach. In practice,
our settings resulted in evaluating around 20,000 clauses for each fold, requiring on
average around four hours per fold on a Xeon 3 GHz class machine.
Figure 17.8 includes a comparison of SAYU-View to level 3 and the initial
approach to level 4. Again, we perform a two-tailed paired t-test on the area under
the precision-recall curve for levels of recall ≥ 0.5. SAYU-View performs significantly
better than both these approaches at the 99% confidence level. Although we do not
include the graph, SAYU-View performs significantly better than SAYU-TAN
(no initial features), also with a p-value < 0.01. SAYU-View also performs better
than level 1 and level 2 with a p-value < 0.01. With the integrated framework for
level 4, we now see significant improvement over lower levels of learning when we
ascend the hierarchy defined in figure 17.2.
Figure 17.9 shows the average area under the precision-recall curve (AUCPR) for
levels of recall ≥ 0.5 for level 3, the initial approach to level 4, and SAYU-View.
The average AUCPR for SAYU-View yields a 30% increase in the average AUCPR
over the initial approach to level 4. Furthermore, we see an increase in the average
AUCPR of 53% over level 3. Another way to look at these results is the potential
reduction of benign biopsies: procedures done on women without cancer. When
detecting 90% of cancers (i.e., recall = 0.9), SAYU-View achieves a 35% reduction
in benign biopsies over level 3 and a 39% reduction over the initial level 4 method.

Figure 17.9 Area under the curve for recalls above 50%.

17.7

Related Work

Research in SRL has advanced along two main lines: methods that allow graphical
models to represent relations, and frameworks that extend logic to handle
probabilities. Along the first line, probabilistic relational models, or PRMs, introduced
by Friedman et al., represent one of the first attempts to learn the structure of
graphical models while incorporating relational information [12]. Recently
Heckerman et al. have discussed extensions to PRMs and compared them to other graphical
models [16]. A statistical learning algorithm for probabilistic logic representations
was first given by Sato [33], and later Cussens [7] proposed a more general algorithm
to handle log-linear models. Additionally, Muggleton [21] has provided learning
algorithms for stochastic logic programs. The structure of the logic program is learned
using ILP techniques, while the parameters are learned using an algorithm scaled
up from that used for stochastic context-free grammars.
Newer representations garnering arguably the most attention are Bayesian logic
programs (BLPs) [18], relational Markov networks (RMNs) [37], constraint logic
programming with Bayes net constraints, or CLP(BN) [32], and Markov logic
networks (MLNs) [31]. MLNs are most similar to our approach. Nodes of MLNs
are the ground instances of the literals in the rule, and the arcs correspond to
the rules. One major difference is that, in our approach, nodes are the rules
themselves. Although we cannot work at the same level of detail, our approach
makes it straightforward to combine logical rules with other features, and we now
can take full advantage of propositional learning algorithms.


The present work builds upon previous work on using ILP for feature construction. Such work treats ILP-constructed rules as Boolean features, re-represents each
example as a feature vector, and then uses a feature-vector learner to produce a
nal classier. To our knowledge, Pompe and Kononenko [25] were the rst to apply
naive Bayes to combine clauses. Other work in this category was by Srinivasan and
King [36], who used rules as extra features for the task of predicting biological activities of molecules from their atom and bond structures. Popescul and Unger [26]
use k means to derive cluster relations, which are then combined with the original features through structural regression. In a dierent vein, relational decision
trees [23] use aggregation to provide extra features on a multi-relational setting, and
are close to our level 3 setting. Knobbe et al. [19] proposed numeric aggregates in
combination with logic-based feature construction for single attributes. Perlich and
Provost discuss several approaches for attribute construction using aggregates over
multi-relational features [24]. They also propose a hierarchy of levels of learning:
feature vectors, independent attributes on a table, multidimensional aggregation on
a table, and aggregation across tables. Some of these techniques in their hierarchy
could be applied to perform view learning in SRL.
Another approach for a tight coupling between rule learning and rule usage is
the recent work (done in parallel with ours) by Landwehr et al. [20]. That work
presented a new system called nFOIL. The significant differences between the two
pieces of work appear to be the following. First,
nFOIL scores clauses by conditional log-likelihood rather than improvement in
classifier accuracy or classifier AUC (area under the ROC or precision-recall curve).
Second, nFOIL can handle multiple-class classification tasks, which SAYU cannot.
Third, the present chapter reports experiments on data sets with significant class
skew, to which probabilistic classifiers are often sensitive. Fourth, this work looks
at TAN as opposed to naive Bayes. Finally, this work extends both [20] and [8] by
giving the network an initial feature set.
Another related piece of work is that by Popescul et al. [28, 27, 29] on structural
logistic regression. They use an ILP-like (refinement graph) search over rules,
expressed as database queries, to define new features. Differences from the present
work include their use of the new features within a logistic regression model rather
than a graphical model, and the fact that they do not update the logistic regression
model after adding each rule. A notable strength of their approach is that the
rule-learning process itself can include aggregation.

17.8

Conclusions and Future Work


We presented a method for statistical relational learning which integrates learning
from attributes, aggregates, and rules. Our example application shows benefits
from the several levels of learning we proposed. Level 2, structure learning, clearly
outperforms the expert structure. We further show that multi-relational techniques
can achieve very significant improvements, even on a single table domain.


This chapter has shown that a simple form of view learning (treating rules
induced by a standard ILP system as the additional features of a new view)
yields improved performance over level 2 learning. Nevertheless, this improvement
is roughly equal to that obtained by level 3 learning, that is, by aggregation, as might
be performed, for example, by a PRM. We have noted how this approach to view
learning is quite similar to earlier work using ILP for feature construction.
A more interesting form of view learning, or level 4 learning, is SAYU-View, which
closely integrates the ILP system and Bayesian network learning. It significantly
improves performance over both level 3 learning and the simple form of view
learning.
We believe many further improvements in view learning are possible. It makes
sense to include aggregates in the background knowledge for rule generation.
Alternatively, one can extend rules with aggregation operators, as proposed in
recent work by Vens et al. [38]. We have found the rule selection problem to
be nontrivial. Our greedy algorithm often generates too similar rules, and is not
guaranteed to maximize coverage. We would like to approach this problem as an
optimization problem weighing coverage, diversity, and accuracy.
Our approach of using ILP to learn new features for an existing table merely
scratches the surface of the potential for view learning. A more ambitious approach
would be to more closely integrate structure learning and view learning. A search
could be performed in which each move in the search space is either to modify the
probabilistic model or to refine the intentional definition of some field in the new
view. Going further still, one might learn an intentional definition for an entirely
new table. As a concrete example, for mammography one could learn rules defining
a binary predicate that identifies similar abnormalities. Because such a predicate
would represent a many-to-many relationship among abnormalities, a new table
would be required.
SRL algorithms provide a substantial extension to existing statistical learning
algorithms, such as Bayesian networks, by permitting statistical learning to be
applied directly to relational databases with multiple tables. Nevertheless, the
schemata for relational databases often are defined based on criteria other than
effectiveness of learning. If a schema is not the most appropriate for a given learning
task, it may be necessary to change it (by defining a new view) before applying
other SRL techniques. View learning, as presented in this chapter, provides an
automated capability to make such schema changes. Our approaches so far to view
learning build on existing ILP technology. We believe ILP-based view learning
can be greatly improved and extended, as outlined in the preceding paragraphs,
for example to learn entirely new tables. Furthermore, many approaches to view
learning outside of ILP remain to be explored.


Learning a New View of a Database: With an Application in Mammography

Acknowledgments
Support for this research was partially provided by U.S. Air Force grant F3060201-2-0571. Elizabeth Burnside is supported by a General Electric Research in Radiology Academic Fellowship. Inês Dutra and Vítor Santos Costa did this work while
visiting the University of Wisconsin-Madison. Vítor Santos Costa was partially supported by the Fundação para a Ciência e Tecnologia. We thank Lisa Torrey, Mark
Goadrich, Rich Maclin, Jill Davis, and Allison Holloway for reading over drafts of
this chapter. We also thank the referees for their insightful comments.

References
[1] American College of Radiology. Breast Imaging Reporting and Data System
(BI-RADS), 2004. American College of Radiology.
[2] M. Brown, F. Houn, E. Sickles, and L. Kessler. Screening mammography in
community practice: Positive predictive value of abnormal findings and yield
of follow-up diagnostic procedures. American Journal of Roentgenology,
165:1373-1377, 1995.
[3] E. Burnside, D. Rubin, and R. Shachter. A Bayesian network for screening
mammography. In American Medical Informatics Association, pages 106-110,
2000.
[4] E. Burnside, Y. Pan, C. Kahn, K. Shaffer, and D. Page. Training a Probabilistic
Expert System to Predict the Likelihood of Breast Cancer Using a Large Dataset
of Mammograms (abstract). Radiological Society of North America, 2004.
[5] E. Burnside, D. Rubin, and R. Shachter. Using a Bayesian network to predict
the probability and type of breast cancer represented by microcalcifications on
mammography. Medinfo, 2004:13-17, 2004.
[6] E. Burnside, J. Davis, V. Santos Costa, I. Dutra, C. Kahn, J. Fine, and D. Page.
Knowledge discovery from structured mammography reports using inductive
logic programming. In American Medical Informatics Association Symposium,
pages 96-100, 2005.
[7] J. Cussens. Parameter estimation in stochastic logic programs. Machine
Learning, 44(3):245-271, 2001.
[8] J. Davis, E. Burnside, I. Dutra, D. Page, and V. Santos Costa. An integrated
approach to learning Bayesian networks of rules. In Proceedings of the European
Conference on Machine Learning, 2005.
[9] J. Davis, E. Burnside, I. Dutra, D. Page, R. Ramakrishnan, V. Santos Costa,
and J. Shavlik. View learning for statistical relational learning: With an application to mammography. In Proceedings of the International Joint Conference
on Artificial Intelligence, 2005.


[10] G. W. Eklund. Shortage of qualified breast imagers could lead to crisis.
Diagnostic Imaging, 22:31-33, 2000.
[11] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers.
Machine Learning, 29:131-163, 1997.
[12] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[13] N. Friedman, I. Nachman, and D. Pe'er. Learning Bayesian network structure
from massive datasets: The sparse candidate algorithm. In Proceedings of
the Conference on Uncertainty in Artificial Intelligence, 1999.
[14] D. Geiger. An entropy-based learning algorithm of Bayesian conditional trees.
In Proceedings of the National Conference on Artificial Intelligence, 1992.
[15] M. Goadrich, L. Oliphant, and J. Shavlik. Learning ensembles of first-order
clauses for recall-precision curves: A case study in biomedical information
extraction. In Proceedings of the International Conference on Inductive Logic
Programming, 2004.
[16] D. Heckerman, C. Meek, and D. Koller. Probabilistic entity-relationship
models, PRMs, and plate models. Technical Report MSR-TR-2004-30, Microsoft
Research, Seattle, WA, 2004.
[17] C. Kahn, L. Roberts, K. Shaffer, and P. Haddawy. Construction of a Bayesian
network for mammographic diagnosis of breast cancer. Computers in Biology
and Medicine, 27:19-29, 1997.
[18] K. Kersting and L. De Raedt. Basic principles of learning Bayesian logic
programs. Technical report, Institute for Computer Science, University of
Freiburg, Germany, 2002.
[19] A. J. Knobbe, M. de Haas, and A. Siebes. Propositionalisation and aggregates.
In Proceedings of the European Conference on Principles and Practice of
Knowledge Discovery in Databases, 2001.
[20] N. Landwehr, K. Kersting, and L. De Raedt. nFOIL: Integrating naive Bayes
and FOIL. In Proceedings of the National Conference on Artificial Intelligence,
2005.
[21] S. Muggleton. Learning stochastic logic programs. Electronic Transactions in
Artificial Intelligence, 4(041), 2000.
[22] S. Muggleton. Inductive logic programming. New Generation Computing,
8:295-318, 1991.
[23] J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning relational probability trees. In International Conference on Knowledge Discovery and Data
Mining, 2003.
[24] C. Perlich and F. Provost. Aggregation-based feature invention and relational
concept classes. In International Conference on Knowledge Discovery and Data
Mining, 2003.


[25] U. Pompe and I. Kononenko. Naive Bayesian classifier within ILP-R. In
L. De Raedt, editor, Proceedings of the International Conference on Inductive
Logic Programming, 1995.
[26] A. Popescul and L. H. Ungar. Cluster-based concept invention for statistical
relational learning. In International Conference on Knowledge Discovery and
Data Mining, 2004.
[27] A. Popescul and L. H. Ungar. Statistical relational learning for link prediction.
In Workshop on Learning Statistical Models from Relational Data at IJCAI
2003, 2003.
[28] A. Popescul, L. H. Ungar, S. Lawrence, and D. Pennock. Towards structural
logistic regression: Combining relational and statistical learning. In Workshop
on Multi-Relational Data Mining at KDD, 2002.
[29] A. Popescul, L. H. Ungar, S. Lawrence, and D. M. Pennock. Statistical relational learning for document mining. In Proceedings of the IEEE International
Conference on Data Mining, 2003.
[30] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, New York, 2000.
[31] M. Richardson and P. Domingos. Markov logic networks. Technical report,
Department of Computer Science, University of Washington, 2004.
[32] V. Santos Costa, D. Page, M. Qazi, and J. Cussens. CLP(BN): Constraint
logic programming for probabilistic knowledge. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2003.
[33] T. Sato. A statistical learning method for logic programs with distributional
semantics. In Proceedings of the International Conference on Inductive Logic
Programming, 1995.
[34] E. Sickles, D. Wolverton, and K. Dee. Performance parameters for screening
and diagnostic mammography: specialist and general radiologists. Radiology,
224:861-869, 2002.
[35] A. Srinivasan. The Aleph Manual, 2001.
[36] A. Srinivasan and R. King. Feature construction with inductive logic programming: A study of quantitative predictions of biological activity aided by
structural attributes. In Proceedings of the International Conference on Inductive Logic Programming, 1997.
[37] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2002.
[38] C. Vens, A. Van Assche, H. Blockeel, and S. Džeroski. First order random
forests with complex aggregates. In Proceedings of the International Conference
on Inductive Logic Programming, 2004.

18 Reinforcement Learning in Relational Domains: A Policy-Language Approach

Alan Fern, SungWook Yoon, and Robert Givan

We study reinforcement learning in large relational Markov decision processes
(MDPs). We introduce a new variant of approximate policy iteration (API) that
replaces the usual value-function learning step with a learning step in policy space.
This is advantageous in domains where good policies are easier to represent and
learn than the corresponding value functions, which is often the case for the
relational MDPs we are interested in. In order to apply API to such problems, we
introduce a relational policy language and corresponding learner. In addition, we
introduce a new bootstrapping routine for goal-based planning domains, based on
random walks. Such bootstrapping is necessary for many large relational MDPs,
where reward is extremely sparse, as API is ineffective in such domains when
initialized with an uninformed policy. Our experiments show that the resulting
system is able to find good policies for a number of classical planning domains and
their stochastic variants by solving them as extremely large relational MDPs.

18.1

Introduction
Many planning domains are most naturally represented in terms of objects and
relations among them. Accordingly, AI researchers have long studied algorithms for
planning and learning-to-plan in relational state and action spaces. These include,
for example, classical STRIPS domains such as the blocks world and logistics.
A common criticism of such domains and algorithms is the assumption of an
idealized, deterministic world model. This, in part, has led AI researchers to
study planning and learning within a decision-theoretic framework, which explicitly
handles stochastic environments and generalized reward-based objectives. However,
most of this work is based on explicit or propositional state-space models, and so far
has not demonstrated scalability to the large relational domains that are commonly
addressed in classical planning.
Intelligent agents must be able to simultaneously deal with both the complexity
arising from relational structure and the complexity arising from uncertainty. The
primary goal of this research is to move toward such agents by bridging the gap
between classical and decision-theoretic techniques.
In this chapter, we describe a straightforward and practical method for solving
very large, relational Markov decision processes (MDPs). Our work can be viewed
as a form of relational reinforcement learning (RRL) where we assume a strong
simulation model of the environment. That is, we assume access to a black-box
simulator, for which we can provide any (relationally represented) state/action pair
and receive a sample from the appropriate next-state and reward distributions. The
goal is to interact with the simulator in order to learn a policy for achieving high
expected reward. It is a separate challenge, not considered here, to combine our
work with methods for learning the environment simulator to avoid dependence on
being provided such a simulator.
Dynamic-programming approaches to finding optimal control policies in MDPs
[6, 25], using explicit (flat) state-space representations, break down when the state
space becomes extremely large. More recent work extends these algorithms to use
propositional [8, 11, 12, 9, 18, 21] as well as relational [10, 20] state-space representations. These extensions have significantly expanded the set of approachable
problems, but have not yet shown the capacity to solve large classical planning
problems such as the benchmark problems used in planning competitions [3], let
alone their stochastic variants (see section 18.6 for example benchmarks). One possible reason for this is that these methods are based on calculating and representing
value functions. For familiar STRIPS planning domains (among others), useful value
functions can be difficult to represent compactly, and their manipulation becomes
a bottleneck.
Most of the above techniques are purely deductive; that is, each value function
is guaranteed to have a certain level of accuracy. In this work, by contrast, we
focus on inductive techniques that make no such guarantees in practice. Existing
inductive forms of approximate policy iteration (API) utilize machine learning
to select compactly represented approximate value functions at each iteration of
dynamic programming [7]. As with any machine learning algorithm, the selection
of the hypothesis space, here a space of value functions, is critical to performance.
An example space used frequently is the space of linear combinations of a human-selected feature set.
To our knowledge, there has been no previous work that applies any form of
API to benchmark problems from classical planning, or their stochastic variants.1
1. Recent work in relational reinforcement learning has been applied to STRIPS problems
with much simpler goals than typical benchmark planning domains, and is discussed below
in section 18.7.


Again, one reason for this is the high complexity of typical value functions for these
large relational domains, making it difficult to specify good value-function spaces
that facilitate learning. By comparison, it is often much easier to compactly specify
good policies, and accordingly good policy spaces for learning. This observation
is the basis for recent work on inductive policy selection in relational planning
domains, both deterministic [29, 33] and probabilistic [51]. These techniques show
that useful policies can be learned using a policy-space bias described by a generic
(relational) knowledge representation language. Here we incorporate those ideas
into a novel variant of API that achieves significant success without representing or
learning approximate value functions. Of course, a natural direction for future work
is to combine policy-space techniques with value-function techniques, to leverage
the advantages of both.
Given an initial policy, our approach uses the simulation technique of policy
rollout [46] to generate trajectories of an improved policy. These trajectories are
then given to a classification learner, which searches for a classifier, or policy,
that matches the trajectory data, resulting in an approximately improved policy.
These two steps are iterated until no further improvement is observed. The resulting
algorithm can be viewed as a form of API where the iteration is carried out without
inducing approximate value functions.
By avoiding value-function learning, this algorithm addresses the representational
challenge of applying API to relational planning domains. However, another fundamental challenge is that, for nontrivial relational domains, API requires some form
of bootstrapping. In particular, for most STRIPS planning domains the reward,
which corresponds to achieving a goal condition, is sparsely distributed and unlikely to be reached by random exploration. Thus, initializing API with a random
or uninformed policy will likely result in no reward signal and hence no guidance
for policy improvement. One approach to bootstrapping is to rely on the user to
provide a good initial policy or heuristic that gives guidance toward achieving reward. Rather, in this work we develop a new automatic bootstrapping approach for
goal-based planning domains which does not require user intervention.
Our bootstrapping approach is based on the idea of random-walk problem distributions. For a given planning domain, such as the blocks world, this distribution
randomly generates a problem (i.e., an initial state and a goal) by selecting a random initial state and then executing a sequence of n random actions, taking the
goal condition to be a subset of properties from the resulting state. The problem
difficulty typically increases with n, and for small n (short random walks) even
random policies can uncover reward. Intuitively, a good policy for problems with
walk length n can be used to bootstrap API for problems with slightly longer walk
lengths. Our bootstrapping approach iterates this idea, by starting with a random
policy and very small n, and then gradually increasing the walk length until we
learn a policy for very long random walks. Such long-random-walk policies clearly
capture much domain knowledge, and can be used in various ways. Here, we show
empirically that such policies often perform well on problem distributions from
relational domains used in recent deterministic and probabilistic planning competitions.
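The random-walk problem distribution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy "lights" domain, the function names (`random_walk_problem`, `applicable`, `toggle`), and the `goal_size` parameter are hypothetical and do not come from the chapter, and the random initial state is assumed to be drawn by the caller.

```python
import random
from typing import Callable, FrozenSet, List, Tuple

State = FrozenSet[str]  # a state is a set of ground facts


def random_walk_problem(
    initial_state: State,
    applicable: Callable[[State], List[str]],
    step: Callable[[State, str], State],
    n: int,
    goal_size: int,
    rng: random.Random,
) -> Tuple[State, State]:
    """Return (initial state, goal) where the goal is a subset of the facts
    that hold after n random actions from the initial state."""
    state = initial_state
    for _ in range(n):
        actions = applicable(state)
        if not actions:
            break
        state = step(state, rng.choice(actions))
    facts = sorted(state)
    goal = frozenset(rng.sample(facts, min(goal_size, len(facts))))
    return initial_state, goal


# Hypothetical toy domain: three lights, each action toggles one light.
LIGHTS = ["a", "b", "c"]


def applicable(state: State) -> List[str]:
    return LIGHTS


def toggle(state: State, light: str) -> State:
    on, off = f"on({light})", f"off({light})"
    return (state - {on}) | {off} if on in state else (state - {off}) | {on}


start = frozenset(f"off({x})" for x in LIGHTS)
problem = random_walk_problem(start, applicable, toggle, n=5, goal_size=2,
                              rng=random.Random(0))
```

Increasing `n` in such a generator corresponds to the gradual increase in walk length used for bootstrapping.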


Here, we give an evaluation of our system on a number of probabilistic and
deterministic relational planning domains, including the AIPS-2000 competition
benchmarks, and benchmarks from the hand-tailored track of the 2004 Probabilistic Planning Competition. The results show that the system is often able to learn
policies in these domains that perform well for long-random-walk problems. In addition, these same policies often perform well on the planning-competition problem
distributions, comparing favorably with the state-of-the-art planner Fast-Forward
(FF) [24] in the deterministic domains. Our experiments also highlight a number
of limitations of our current system, which point to interesting directions for future
work.
The remainder of this chapter proceeds as follows. In section 18.2, we introduce
our problem setup and then, in section 18.3, present our new variant of API.
In sections 18.4 and 18.5, we describe an implemented instantiation of our API
approach for relational planning domains. This includes a description of a generic
policy language for relational domains, a classification learner for that language,
and a novel bootstrapping technique for goal-based domains. Section 18.6 presents
our empirical results, and finally sections 18.7 and 18.8 discuss related work and
future directions.

18.2

Problem Setup
We formulate our work in the framework of MDPs. While our primary motivation is
to develop algorithms for relational planning domains, we first describe our problem
setup and approach for a general, action-simulator-based MDP representation.
Later, in section 18.4, we describe a particular representation of planning domains
as relational MDPs and the corresponding relational instantiation of our approach.
Following and adapting Kearns et al. [27] and Bertsekas and Tsitsiklis [7], we
represent an MDP using a generative model $\langle S, A, T, R, I \rangle$, where $S$ is a finite set
of states, $A$ is a finite, ordered set of actions, and $T$ is a randomized action-simulation algorithm that, given state $s$ and action $a$, returns a next state $s'$
according to some unknown probability distribution $P_T(s' \mid s, a)$. The component $R$
is a reward function that maps $S \times A$ to real numbers, with $R(s, a)$ representing the
reward for taking action $a$ in state $s$, and $I$ is a randomized initial-state algorithm
with no inputs that returns a state $s$ according to some unknown distribution $P_0(s)$.
We sometimes treat $I$ and $T(s, a)$ as random variables with distributions $P_0(\cdot)$ and
$P_T(\cdot \mid s, a)$ respectively.
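One way to picture the generative model is as a small record of callables, with the state set left implicit in whatever the simulator returns. The sketch below is an assumption-laden illustration, not the chapter's implementation: the class name `GenerativeMDP`, the helper `make_chain_mdp`, and the noisy chain domain are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class GenerativeMDP:
    """Simulator-based MDP: the learner may only call T, R, and I."""
    actions: Sequence[Action]              # A, a finite ordered action set
    T: Callable[[State, Action], State]    # samples s' ~ P_T(. | s, a)
    R: Callable[[State, Action], float]    # reward R(s, a)
    I: Callable[[], State]                 # samples s ~ P_0(.)


def make_chain_mdp(length: int = 5, slip: float = 0.2, seed: int = 0) -> GenerativeMDP:
    """Hypothetical toy domain: walk along a chain; moves slip with prob. `slip`."""
    rng = random.Random(seed)

    def T(s: int, a: int) -> int:
        move = a if rng.random() >= slip else -a
        return max(0, min(length - 1, s + move))

    def R(s: int, a: int) -> float:
        return 1.0 if s == length - 1 else 0.0

    def I() -> int:
        return rng.randrange(length)

    return GenerativeMDP(actions=(-1, +1), T=T, R=R, I=I)


mdp = make_chain_mdp()
s0 = mdp.I()
s1 = mdp.T(s0, +1)
```

The transition probabilities are never exposed; the learner only observes samples, which matches the black-box simulator assumption made in section 18.1.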
For an MDP $M = \langle S, A, T, R, I \rangle$, a policy $\pi$ is a (possibly stochastic) mapping
from $S$ to $A$. The value function of $\pi$, denoted $V^\pi(s)$, represents the expected,
cumulative, discounted reward of following policy $\pi$ in $M$ starting from state $s$, and
is the unique solution to

$$V^\pi(s) = E[R(s, \pi(s)) + \gamma V^\pi(T(s, \pi(s)))],$$

where $0 \le \gamma < 1$ is the discount factor. The Q-value function $Q^\pi(s, a)$ represents
the expected, cumulative, discounted reward of taking action $a$ in state $s$ and then
following $\pi$, and is given by

$$Q^\pi(s, a) = R(s, a) + \gamma E[V^\pi(T(s, a))]. \qquad (18.1)$$

We will measure the quality of a policy by the objective function $V(\pi) = E[V^\pi(I)]$,
giving the expected value obtained by that policy when starting from a randomly
drawn initial state. A common objective in MDP planning and reinforcement
learning is to find an optimal policy $\pi^* = \arg\max_\pi V(\pi)$. However, no automated
technique, including the one we present here, has to date been able to guarantee
finding an optimal policy in the relational planning domains we consider, in
reasonable running time.
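Because only a simulator is available, the objective $V(\pi) = E[V^\pi(I)]$ can in practice only be estimated by sampling. The following sketch averages truncated discounted returns; the truncation horizon, the function names, and the deterministic toy chain are assumptions introduced here for illustration.

```python
from typing import Callable


def discounted_return(T: Callable, R: Callable, policy: Callable,
                      s, gamma: float, horizon: int) -> float:
    """Discounted reward of following `policy` from s for `horizon` steps."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * R(s, a)
        s = T(s, a)
        discount *= gamma
    return total


def estimate_policy_value(I: Callable, T: Callable, R: Callable, policy: Callable,
                          gamma: float, horizon: int, num_samples: int) -> float:
    """Monte Carlo estimate of V(pi) = E[V^pi(I)], truncated at `horizon`."""
    returns = [discounted_return(T, R, policy, I(), gamma, horizon)
               for _ in range(num_samples)]
    return sum(returns) / len(returns)


# Hypothetical deterministic chain 0..3: reward 1 for acting in state 3.
def T(s, a):
    return max(0, min(3, s + a))


def R(s, a):
    return 1.0 if s == 3 else 0.0


def I():
    return 0


def always_right(s):
    return +1


value = estimate_policy_value(I, T, R, always_right, gamma=0.5, horizon=5,
                              num_samples=10)
```

With `gamma = 0.5` and a horizon of 5, the only nonzero rewards occur at steps 3 and 4, so the estimate is $0.5^3 + 0.5^4$.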
It is a well-known fact that given a current policy $\pi$, we can define a new improved
policy

$$PI^\pi(s) = \arg\max_{a \in A} Q^\pi(s, a) \qquad (18.2)$$

such that the value function of $PI^\pi$ is guaranteed to (1) be no worse than that of
$\pi$ at each state $s$, and (2) strictly improve at some state when $\pi$ is not optimal.
Policy iteration is an algorithm for computing optimal policies by iterating policy
improvement (PI) from any initial policy to reach a fixed point, which is guaranteed
to be an optimal policy. Each iteration of policy improvement involves two steps:
(1) policy evaluation, where we compute the value function $V^\pi$ of the current policy
$\pi$, and (2) policy selection, where, given $V^\pi$ from step 1, we select the action that
maximizes $Q^\pi(s, a)$ at each state, defining a new improved policy.
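For a small MDP whose transition probabilities are known explicitly, the two steps of policy iteration can be written out directly. The sketch below is a tabular illustration only: it assumes an explicit model `P[s][a]` of `(probability, next_state)` pairs, which the simulator-based setting of this chapter does not provide, and it approximates policy evaluation by repeated sweeps rather than solving the linear system exactly. All names and the three-state chain are hypothetical.

```python
from typing import Dict, List, Tuple

Model = Dict[int, Dict[str, List[Tuple[float, int]]]]
Rewards = Dict[int, Dict[str, float]]


def evaluate(P: Model, R: Rewards, policy: Dict[int, str],
             gamma: float, sweeps: int = 500) -> Dict[int, float]:
    """Policy evaluation by repeatedly applying the fixed-point equation."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: R[s][policy[s]]
                + gamma * sum(p * V[s2] for p, s2 in P[s][policy[s]])
             for s in P}
    return V


def q_value(P: Model, R: Rewards, V: Dict[int, float],
            s: int, a: str, gamma: float) -> float:
    """Q(s, a) = R(s, a) + gamma * E[V(T(s, a))], as in (18.1)."""
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])


def policy_iteration(P: Model, R: Rewards, gamma: float):
    policy = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy
    while True:
        V = evaluate(P, R, policy, gamma)        # step 1: policy evaluation
        improved = {s: max(P[s], key=lambda a: q_value(P, R, V, s, a, gamma))
                    for s in P}                  # step 2: policy selection (18.2)
        if improved == policy:                   # fixed point reached
            return policy, V
        policy = improved


# Hypothetical three-state chain; reward 1 for "right" in the last state.
P: Model = {
    0: {"left": [(1.0, 0)], "right": [(1.0, 1)]},
    1: {"left": [(1.0, 0)], "right": [(1.0, 2)]},
    2: {"left": [(1.0, 1)], "right": [(1.0, 2)]},
}
R: Rewards = {
    0: {"left": 0.0, "right": 0.0},
    1: {"left": 0.0, "right": 0.0},
    2: {"left": 0.0, "right": 1.0},
}

best_policy, best_values = policy_iteration(P, R, gamma=0.9)
```

The explicit loop over all states in both steps is what becomes intractable in large relational domains.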

18.3

Approximate Policy Iteration with a Policy Language Bias


Exact solution techniques, such as policy iteration, are typically intractable for large
state-space MDPs, such as those arising from relational planning domains. In this
section, we introduce a new variant of API intended for such domains. First, we
review a generic form of API used in prior work, based on learning approximate
value functions. Next, motivated by the fact that value functions are often difficult
to learn in relational domains, we describe our API variant, which avoids learning
value functions and instead learns policies directly as state-action mappings.


18.3.1

API with Approximate Value Functions

API, as described in Bertsekas and Tsitsiklis [7], uses a combination of Monte
Carlo simulation and inductive machine learning to heuristically approximate policy
iteration in large state-space MDPs. Given a current policy $\pi$, each iteration of API
approximates policy evaluation and policy selection, resulting in an approximately
improved policy $\hat{\pi}$. First, the policy evaluation step constructs a training set of
samples of $V^\pi$ from a small but representative set of states. Each sample
is computed using simulation, estimating $V^\pi(s)$ for the policy $\pi$ at each state
$s$ by drawing some number of sample trajectories of $\pi$ starting at $s$ and then
averaging the cumulative, discounted reward along those trajectories. Next, the
policy selection step uses a function approximator (e.g., a neural network) to learn
an approximation $\hat{V}$ to $V^\pi$ based on the training data. $\hat{V}$ then serves as a
representation for $\hat{\pi}$, which selects actions using sampled one-step lookahead based
on $\hat{V}$, using

$$\hat{\pi}(s) = \arg\max_{a \in A} R(s, a) + \gamma E[\hat{V}(T(s, a))].$$

A common variant of this procedure learns an approximation of $Q^\pi$ rather than
$V^\pi$.
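The sampled one-step lookahead that turns an approximate value function into a policy can be sketched as follows. The function name `lookahead_policy`, the sample count, and the stand-in approximation `v_hat` are hypothetical; in the procedure described above, the approximation would come from a trained function approximator rather than a hand-written formula.

```python
from typing import Callable, Sequence


def lookahead_policy(actions: Sequence, T: Callable, R: Callable,
                     v_hat: Callable, gamma: float, num_samples: int) -> Callable:
    """Return a policy choosing argmax_a R(s, a) + gamma * E[v_hat(T(s, a))],
    with the expectation replaced by an average over sampled next states."""
    def policy(s):
        def score(a):
            expected = sum(v_hat(T(s, a)) for _ in range(num_samples)) / num_samples
            return R(s, a) + gamma * expected
        return max(actions, key=score)
    return policy


# Hypothetical deterministic chain 0..4 with zero immediate reward.
def T(s, a):
    return max(0, min(4, s + a))


def R(s, a):
    return 0.0


def v_hat(s):
    # Stand-in for a learned approximation: states further right look better.
    return float(s)


pi_hat = lookahead_policy(actions=(-1, +1), T=T, R=R, v_hat=v_hat,
                          gamma=0.9, num_samples=3)
```

Note that the policy is never stored explicitly; it is recomputed from `v_hat` at each state it is asked about, so its quality depends entirely on how well `v_hat` generalizes.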
API exploits the function approximator's generalization ability to avoid evaluating each state in the state space, instead only directly evaluating a small number
of training states. Thus, the use of API assumes that states and perhaps actions
are represented in a factored form (typically, a feature vector) that facilitates generalizing properties of the training data to the entire state and action spaces. Note
that in the case of perfect generalization (i.e., $\hat{V}(s) = V^\pi(s)$ for all states $s$), we
have that $\hat{\pi}$ is equal to the exact policy improvement $PI^\pi$, and thus API simulates exact policy iteration. However, in practice, generalization is not perfect, and
there are typically no guarantees for policy improvement.2 Nevertheless, API often
converges usefully [45, 47].
The success of the above API procedure depends critically on the ability to
represent and learn good value-function approximations. For some MDPs, such as
those arising from relational planning domains, it is often difficult to specify a space
of value functions and learning mechanism that facilitate good generalization. For
example, work in relational reinforcement learning [13] has shown that learning
approximate value functions for classical domains, such as the blocks world, can be
problematic.3 In spite of this, it is often relatively easy to compactly specify good
policies using a language for (relational) state-action mappings. This suggests that
such languages may provide useful policy-space biases for learning in API. However,
all prior API methods are based on approximating value functions and hence
cannot leverage these biases. With this motivation, we introduce a new form of API
that directly learns policies without directly representing or approximating value
functions.

2. Under very strong assumptions, API can be shown to converge in the infinite limit to
a near-optimal policy [7].
3. In particular, the RRL work has considered a variety of value-function representations,
including relational regression trees, instance-based methods, and graph kernels, but none
of them have generalized well over varying numbers of objects.

18.3.2

Using a Policy Language Bias

A policy is simply a classifier that maps states to actions. Our API approach is
based on this view, and is motivated by recent work that casts policy selection
as a standard classification learning problem. In particular, given the ability to
observe trajectories of a target policy, we can use machine learning to select a
policy, or classifier, that mimics the target as closely as possible. This idea has
been studied previously under the name behavioral cloning [44]. Khardon [30]
studied this learning setting and provided PAC-like learnability results, showing
that under certain assumptions, a small number of trajectories is sufficient to learn
a policy whose value is close to that of the target. In addition, recent empirical
work in relational planning domains [29, 33, 51] has shown that by using expressive
languages for specifying state-action mappings, good policies can be learned from
sample trajectories of good policies.
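The view of a policy as a classifier can be made concrete with a very small behavioral-cloning sketch. To keep it self-contained, a 1-nearest-neighbor rule over numeric state features is swapped in for the learner; this is not the relational policy language or the decision-list learner used in this chapter, and the function name, state encoding, and trajectories are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Trajectory = List[Tuple[State, int]]   # (state, action chosen by the target policy)


def learn_policy(trajectories: Sequence[Trajectory]) -> Callable[[State], int]:
    """Clone a target policy from its observed (state, action) pairs using
    a 1-nearest-neighbor classifier (an illustrative stand-in learner)."""
    pairs = [(s, a) for traj in trajectories for s, a in traj]

    def policy(state: State) -> int:
        _, action = min(
            pairs,
            key=lambda p: sum((x - y) ** 2 for x, y in zip(p[0], state)),
        )
        return action

    return policy


# Hypothetical trajectories of a target policy that walks toward position 5.
observed = [
    [((0.0,), +1), ((1.0,), +1), ((2.0,), +1)],
    [((8.0,), -1), ((7.0,), -1)],
]
cloned = learn_policy(observed)
```

Any classification learner could be substituted here; the choice of hypothesis space is what determines how well the cloned policy generalizes to unseen states.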
These results suggest that, given a policy $\pi$, if we can somehow generate trajectories of an improved policy, then we can learn an approximately improved policy
based on those trajectories. This idea is the basis of our approach. Figure 18.1 gives
pseudocode for our API variant, which starts with an initial policy $\pi_0$ and produces
a sequence of approximately improved policies. Each iteration involves two primary
steps: First, given the current policy $\pi$, the procedure Improved-Trajectories
(approximately) generates trajectories of the improved policy $\pi' = PI^\pi$. Second,
these trajectories are used as training data for the procedure Learn-Policy, which
returns an approximation of $\pi'$. We now describe each step in more detail.
Step 1: Generating Improved-Trajectories. Given a base policy $\pi$,
the simulation technique of policy rollout [46, 7] computes an approximation $\hat{\pi}$
to the improved policy $\pi' = PI^\pi$, where $\pi'$ is the result of applying one step of
policy iteration to $\pi$. Furthermore, for a given state $s$, policy rollout computes
$\hat{\pi}(s)$ without the need to solve for $\pi'$ at all other states, and thus provides a
tractable way to approximately simulate the improved policy $\pi'$ in large state-space
MDPs. Often $\pi'$ is significantly better than $\pi$, and hence so is $\hat{\pi}$, which can lead
to substantially improved performance at a small cost. Policy rollout has provided
significant benefits in a number of application domains, including, for example,
backgammon [46], instruction scheduling [37], network congestion control [49], and
solitaire [50].
Policy rollout computes $\hat{\pi}(s)$, the estimate of $\pi'(s)$, by estimating $Q^\pi(s, a)$ for
each action $a$ and then taking the maximizing action to be $\hat{\pi}(s)$ as suggested by
(18.2). Each $Q^\pi(s, a)$ is estimated by drawing $w$ trajectories of length $h$, where
each trajectory is the result of starting at $s$, taking action $a$, and then following the


API(n, w, h, M, π0, γ)
  π ← π0;
  loop
    T ← Improved-Trajectories(n, w, h, M, π);
    π ← Learn-Policy(T);
  until satisfied with π;  // e.g., until change is small
  Return π;

Improved-Trajectories(n, w, h, M, π)
  // training set size n, sampling width w,
  // horizon h, MDP M, current policy π
  T ← ∅;
  repeat n times  // generate n trajectories of improved policy
    t ← nil;
    s ← state drawn from I;  // draw random initial state
    for i = 1 to h
      ⟨Q̂(s, a1), . . . , Q̂(s, am)⟩ ← Policy-Rollout(π, s, w, h, M);
      t ← t · ⟨s, π(s), Q̂(s, a1), . . . , Q̂(s, am)⟩;  // concatenate sample to trajectory
      a ← action maximizing Q̂(s, a);  // action of the improved policy at state s
      s ← state sampled from T(s, a);  // simulate action of improved policy
    T ← T ∪ t;
  Return T;

Policy-Rollout(π, s, w, h, M)
  // policy π, state s, sampling width w, horizon h, MDP M
  for each action ai in A
    Q̂(s, ai) ← 0;
    repeat w times  // Q̂(s, ai) is an average over w trajectories
      R ← R(s, ai);
      s' ← a state sampled from T(s, ai);  // take action ai in s
      for j = 1 to h − 1  // take h − 1 steps of π, accumulating discounted reward in R
        R ← R + γ^j R(s', π(s'));
        s' ← a state sampled from T(s', π(s'));
      Q̂(s, ai) ← Q̂(s, ai) + R;  // include trajectory in average
    Q̂(s, ai) ← Q̂(s, ai) / w;
  Return ⟨Q̂(s, a1), . . . , Q̂(s, am)⟩

Figure 18.1 Pseudocode for our API algorithm. See section 18.4.3 for an instantiation of Learn-Policy called Learn-Decision-List.


actions selected by π for h − 1 steps. The estimate of Q^π(s, a) is then taken to be the
average of the cumulative discounted reward along each trajectory. The sampling
width w and horizon h are specified by the user, and control the tradeoff between
increased computation time for large values, and reduced accuracy for small values.
The procedure Improved-Trajectories uses rollout to generate n length-h
trajectories of the improved policy π̂, each trajectory beginning at a randomly
drawn initial state. Rather than just recording the sequence of states encountered
and actions selected by π̂ along each trajectory, we store additional information
that is used by our policy-learning algorithm. In particular, the ith element of a
trajectory has the form ⟨s_i, π(s_i), Q̂(s_i, a_1), ..., Q̂(s_i, a_m)⟩, giving the ith state s_i
along the trajectory, the action selected by the current (unimproved) policy at s_i,
and the Q-value estimates Q̂(s_i, a) for each action. Thus each trajectory generated
by Improved-Trajectories records for each state the action selected by π and the
Q-values for all actions. Note that given the Q-value information for s_i the learning
algorithm can determine the approximately improved action π̂(s_i), by maximizing
over actions, if desired.
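As an illustration of Step 1, the following is a minimal Python sketch of rollout-based Q-value estimation and of generating one improved trajectory. It is not the authors' implementation: the function names, the callable arguments (`sample_next`, `reward`, `policy`), and the toy chain MDP are assumptions introduced here for the example.

```python
import random

def policy_rollout(policy, s, w, h, actions, sample_next, reward, gamma, rng):
    """Estimate Q^pi(s, a) for each action a as an average over w sampled trajectories."""
    q_hat = {}
    for a in actions:
        total = 0.0
        for _ in range(w):
            ret = reward(s, a)                      # first step: take action a in s
            s_next = sample_next(s, a, rng)
            for i in range(1, h):                   # then h - 1 steps of the base policy
                a_pi = policy(s_next)
                ret += (gamma ** i) * reward(s_next, a_pi)
                s_next = sample_next(s_next, a_pi, rng)
            total += ret
        q_hat[a] = total / w
    return q_hat

def improved_trajectory(policy, s0, w, h, actions, sample_next, reward, gamma, rng):
    """One length-h trajectory of the rollout policy, stored as (s, pi(s), Q-hat) triples."""
    trajectory, s = [], s0
    for _ in range(h):
        q_hat = policy_rollout(policy, s, w, h, actions, sample_next, reward, gamma, rng)
        trajectory.append((s, policy(s), q_hat))
        best = max(actions, key=lambda a: q_hat[a])  # action of the improved policy
        s = sample_next(s, best, rng)
    return trajectory

# Hypothetical toy MDP: states 0..5 on a line, goal at 5, unit cost outside the goal.
ACTIONS = ("left", "right")

def sample_next(s, a, rng):
    if s == 5:                                       # goal states are terminal
        return 5
    return min(5, s + 1) if a == "right" else max(0, s - 1)

def reward(s, a):
    return 0.0 if s == 5 else -1.0

def base_policy(s):
    """A weak base policy that only heads for the goal when it is already close."""
    return "right" if s >= 4 else "left"

traj = improved_trajectory(base_policy, 3, w=2, h=4, actions=ACTIONS,
                           sample_next=sample_next, reward=reward,
                           gamma=1.0, rng=random.Random(0))
```

In this toy run the base policy moves left at state 3, while the rollout estimates favor moving right, so the recorded trajectory follows the improved action even though each element still stores the base policy's choice.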
Step 2: Learn-Policy. Intuitively, we want Learn-Policy to select a new
policy that closely matches the training trajectories. In our experiments, we
use relatively simple learning algorithms based on greedy search within a space of
policies specified by a policy-language bias. In sections 18.4.2 and 18.4.3 we detail
the policy-language learning bias used by our technique, and the associated learning
algorithm. In Fern et al. [16] we provide a technical analysis of an idealized version
of this algorithm, providing guidance regarding the number of training trajectories,
horizon, and sampling width required to guarantee policy improvement with high
probability. We note that by labeling each training state in the trajectories with
the associated Q-values for each action, rather than simply with the best action,
we enable the learner to make more informed tradeoffs, focusing on accuracy at
states where wrong decisions have high costs, which was empirically useful. Also,
the inclusion of π(s) in the training data enables the learner to adjust the data
relative to π, if desired; e.g., our learner uses a bias that focuses on states where
large improvement appears possible.
Finally, we note that for API to be effective, it is important that the initial
policy π_0 provide guidance toward improvement, i.e., π_0 must bootstrap the API
process. For example, in goal-based planning domains π_0 should reach a goal from
some of the sampled states. In section 18.5 we will discuss this important issue of
bootstrapping and introduce a new bootstrapping technique.

18.4 API for Relational Planning


Our work is motivated by the goal of solving relational MDPs. In particular, we are
interested in finding policies for relational MDPs that represent classical planning
domains and their stochastic variants. Such policies can then be applied to any
problem instance from a planning domain, and hence can be viewed as a form of
domain-specific control knowledge.
In this section, we first describe a straightforward way to view classical planning
domains (not just single problem instances) as relationally factored MDPs. Next,
we describe our relational policy space in which policies are compactly represented
as taxonomic decision lists. Finally, we present a heuristic learning algorithm for
this policy space.
18.4.1 Planning Domains as MDPs

We say that an MDP ⟨S, A, T, R, I⟩ is relational when S and A are defined by giving
a finite set of objects O, a finite set of predicates P, and a finite set of action types
Y. A fact is a predicate applied to the appropriate number of objects, e.g., on(a, b)
is a blocks-world fact. A state is a set of facts, interpreted as representing the true
facts in the state. The state space S contains all possible sets of facts. An action is
an action type applied to the appropriate number of objects, e.g., putdown(a) is
a blocks-world action, and the action space A is the set of all such actions.
A classical planning domain describes a set of problem instances with related
structure, where a problem instance gives an initial world state and goal. For
example, the blocks world is a classical planning domain, where each problem
instance specifies an initial block configuration and a set of goal conditions. Classical
planners attempt to find solutions to specific problem instances of a domain. In contrast,
our goal is to solve entire planning domains by finding a policy that can be
applied to all problem instances. As described below, it is straightforward to view a
classical planning domain as a relational MDP where each MDP state corresponds
to a problem instance.
State and Action Spaces. Each classical planning domain specifies a set of
action types Y, world predicates W, and possible world objects O. Together, Y
and O define the MDP action space. Each state of the MDP corresponds to a single
problem instance (i.e., a world state and a goal) from the planning domain by
specifying both the current world and the goal. We achieve this by letting the set
of relational MDP predicates be P = W ∪ G, where G is a set of goal predicates.
The set of goal predicates contains a predicate for each world predicate in W,
which is named by prepending a "g" onto the corresponding world predicate name
(e.g., the goal predicate gclear corresponds to the world predicate clear). With
this definition of P we see that the MDP states are sets of goal and world facts,
indicating the true world facts of a problem instance and the goal conditions. It
is important to note, as described below, that the MDP actions will only change
world facts and not goal facts. Thus, this large relational MDP can be viewed
as a collection of disconnected sub-MDPs, where each sub-MDP corresponds to a
distinct goal condition.
Reward Function. Given an MDP state, the objective is to reach another MDP
state where the goal facts are a subset of the corresponding world facts, i.e., reach
a world state that satisfies the goal. We will call such states goal states of the MDP.
For example, the MDP state
{on-table(a), on(a, b), clear(b), gclear(b)}
is a goal state in a blocks-world MDP, but would not be a goal state without the
world fact clear(b). We represent the objective of reaching a goal state quickly by
defining R to assign a reward of zero for actions taken in goal states and negative
rewards for actions in all other states, representing the cost of taking those actions.
Typically, for classical planning domains, the action costs are uniformly 1; however,
our framework allows the cost to vary across actions.
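The goal-state test and reward described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: facts are encoded as tuples `(predicate, arg, ...)`, the set `WORLD_PREDICATES` is a hypothetical blocks-world choice of W, and the uniform `cost` default mirrors the typical unit action cost.

```python
WORLD_PREDICATES = {"on", "on-table", "clear", "holding"}  # assumed world predicates W

def is_goal_fact(fact):
    """A goal fact uses a world predicate name prefixed with 'g' (e.g., gclear)."""
    name = fact[0]
    return name.startswith("g") and name[1:] in WORLD_PREDICATES

def is_goal_state(state):
    """True when every goal fact, stripped of its 'g' prefix, is a current world fact."""
    world = {f for f in state if f[0] in WORLD_PREDICATES}
    wanted = {(f[0][1:],) + f[1:] for f in state if is_goal_fact(f)}
    return wanted <= world

def reward(state, action, cost=1.0):
    """Zero reward in goal states, negative action cost everywhere else."""
    return 0.0 if is_goal_state(state) else -cost

# The example state from the text.
s = {("on-table", "a"), ("on", "a", "b"), ("clear", "b"), ("gclear", "b")}
without_clear_b = s - {("clear", "b")}
```

Passing the set of world predicates explicitly, rather than treating any name starting with "g" as a goal predicate, avoids misreading a world predicate whose name happens to begin with that letter.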
Transition Function. Each classical planning domain provides an action simulator
(e.g., as defined by STRIPS rules) that, given a world state and action, returns
a new world state. We define the MDP transition function T to be this simulator
modified to treat goal states as terminal and to preserve without change all goal
predicates in an MDP state. Since classical planning domains typically have a large
number of actions, the action definitions are usually accompanied by preconditions
that indicate the legal actions in a given state, where usually the legal actions are
a small subset of all possible actions. We assume that T treats actions that are not
legal as no-ops. For simplicity, our relational MDP definition does not explicitly
represent action preconditions; however, we assume that our algorithms do have
access to preconditions and thus only need to consider legal actions. For example,
we can restrict rollout to only the legal actions in a given state.
Initial State Distribution. Finally, the initial state distribution I can be any
program that generates legal problem instances (MDP states) of the planning
domain. For example, problem domains from planning competitions are commonly
distributed with problem generators.
With these definitions, a good policy is one that can reach goal states via low-cost
action sequences from initial states drawn from I. Note that here policies
are mappings from problem instances to actions and thus can be sensitive to goal
conditions. In this way, our learned policies are able to generalize across different
goals. We next describe a language for representing such generalized policies.
18.4.2 Taxonomic Decision List Policies

For single-argument action types, many useful rules for planning domains take
the form of "apply action type A to any object in class C" [33]. For example, in
the blocks world a useful planning rule might be, "Pick up any clear block that
belongs on the table but is not on the table," or in a logistics world, "Unload
any object that is at its destination." This motivates the idea of using a formal
class description language for representing such classes or sets of objects, and
then learning policies that are represented via rules expressed in that language. In
particular, if the selected class description language can compactly encode useful
classes of objects, then we can learn rules for the policy by simply searching over
short class descriptions.


This idea was first explored by Martin and Geffner [33], who introduced the
use of decision lists of such rules, using description logic as a class description
language. Their experiments in the deterministic blocks world showed promising
results, highlighting the potential benefits of using class description languages to
represent policies. With that motivation, we consider a policy space that is similar
to the one used originally by Martin and Geffner, but generalized to handle multiple
action arguments. Also, for historical reasons, rather than use description logic as
our class description language, we use taxonomic syntax [35, 36], as described below.
Comparison Predicates. For relational MDPs with world and goal predicates,
such as those corresponding to classical planning domains, it is often useful for
policies to compare the current state with the goal. To this end, we introduce a new
set of predicates, called comparison predicates, which are derived from the world
and goal predicates. For each world predicate p and corresponding goal predicate
gp, we introduce a new comparison predicate cp that is defined as the conjunction
of p and gp. That is, a comparison predicate fact is true if and only if both the
corresponding world and goal predicate facts are true. For example, in the blocks
world, the comparison predicate fact con(a, b) indicates that a is on b in both the
current state and the goal, i.e., on(a, b) and gon(a, b) are true.
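A short Python sketch of how comparison facts could be derived from a state follows. The tuple encoding of facts and the function name `with_comparison_facts` are assumptions made for this example, not part of the original system.

```python
def with_comparison_facts(state, world_predicates):
    """Add a fact c<p>(args) whenever both <p>(args) and g<p>(args) hold in the state."""
    comparison = {
        ("c" + f[0],) + f[1:]
        for f in state
        if f[0] in world_predicates and ("g" + f[0],) + f[1:] in state
    }
    return set(state) | comparison

state = {("on", "a", "b"), ("gon", "a", "b"), ("clear", "a"), ("gclear", "b")}
extended = with_comparison_facts(state, {"on", "clear"})
```

Here con(a, b) is added because a is on b in both the world and the goal, while no cclear fact is added because clear(a) and gclear(b) refer to different blocks.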
Taxonomic Syntax. Taxonomic syntax provides a language for writing class
expressions that represent sets of objects with properties of interest and serve
as the fundamental pieces with which we build policies. Class expressions are
built from the MDP predicates (including comparison predicates if applicable)
and variables. In our policy representation, the variables will be used to denote
action arguments, and at run-time will be instantiated by objects. For simplicity
we only consider predicates of arity one and two, which we call primitive classes and
relations, respectively. When a domain contains predicates of arity three or more,
we automatically convert them to multiple auxiliary binary predicates. Given a list
of variables X = (x_1, ..., x_k), the syntax of class expressions is given by
C[X] ::= C_0 | x_i | a-thing | ¬C[X] | (R C[X]) | (min R)
R ::= R_0 | R⁻¹ | R*,
where C[X] is a class expression, R is a relation expression, C_0 is a primitive class,
R_0 is a primitive relation, and x_i is a variable in X. Note that, for classical planning
domains, the primitive classes and relations can be world, goal, or comparison
predicates. We define the depth d(C[X]) of a class expression C[X] to be one if
C[X] is either a primitive class, a-thing, a variable, or (min R); otherwise we
define d(¬C[X]) and d(R C[X]) to be d(C[X]) + 1, where R is a relation expression
and C[X] is a class expression. For a given relational MDP we denote by C_d[X] the
set of all class expressions C[X] that have a depth of d or less.
The semantics of class expressions are given in terms of an MDP state s and a
variable assignment O = (o_1, ..., o_k), which assigns object o_i to variable x_i. The
interpretation of C[X] relative to s and O is a set of objects and is denoted by
C[X]^{s,O}. A primitive class C_0 is interpreted as the set of objects for which the
predicate symbol C_0 is true in s. For example, in the blocks world, the primitive
class expressions clear and gclear represent the sets of blocks that are clear in the
current world state and clear in the goal, respectively. Likewise, a primitive relation
R_0 is interpreted as the set of all object tuples for which the relation R_0 holds in s.
For example, the primitive relation expression on represents the set of all pairs of
blocks (o_1, o_2) such that o_1 is on o_2 in the current world state. The class expression
a-thing denotes the set of all objects in s. The class expression x_i, where x_i is a
variable, is interpreted to be the singleton set {o_i}.
The interpretation of compound expressions is given by
(¬C[X])^{s,O} = {o | o ∉ C[X]^{s,O}}
(R C[X])^{s,O} = {o | ∃o′ ∈ C[X]^{s,O} s.t. (o′, o) ∈ R^{s,O}}
(min R)^{s,O} = {o | ∃o′ s.t. (o, o′) ∈ R^{s,O}, ∄o′ s.t. (o′, o) ∈ R^{s,O}}
(R*)^{s,O} = ID ∪ {(o_1, o_v) | ∃o_2, ..., o_{v−1} s.t. (o_i, o_{i+1}) ∈ R^{s,O} for 1 ≤ i < v}
(R⁻¹)^{s,O} = {(o, o′) | (o′, o) ∈ R^{s,O}},
where C[X] is a class expression, R is a relation expression, and ID is the identity
relation. Intuitively the class expression (R C[X]) denotes the set of objects that
are related through relation R to some object in the set C[X]. For example, in
the blocks world, the expression (on on-table) denotes the set of blocks that are
currently on a block that is on the table. The expression (R* C[X]) denotes the
set of objects that are related through some R chain to an object in C[X];
this constructor is important for representing recursive concepts. For example, the
expression (on* a), where a is a block, represents the set of blocks that are currently
above a. The expression (min R) denotes the set of objects that are minimal under
the relation R. For example, the expression (min on) represents the set of blocks
that have no blocks above them, and are on some other block (i.e., the set of clear
blocks).
The following class expressions are some examples of useful blocks-world concepts,
given the primitive classes clear, gclear, holding, and con-table, along
with the primitive relations on, gon, and con.
(gon⁻¹ holding) has depth two, and denotes the block that we want under the
block being held.
(on* (on gclear)) has depth three, and denotes the blocks currently above blocks
that we want to make clear.
(con* con-table) has depth two, and denotes the set of blocks in well-constructed
towers.
(gon (con* con-table)) has depth three, and denotes the blocks that belong on
top of a currently well-constructed tower.
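To make the semantics of class expressions concrete, here is a small Python evaluator sketch. Everything about the encoding is an assumption of this example: expressions are nested tuples tagged with the hypothetical labels "not", "min", "apply", "inv", and "star", facts are tuples, and variables are looked up in a `binding` dictionary. The sketch reads (R C) as the objects o with (o, o′) in R for some o′ in C, which is the orientation that reproduces the blocks-world examples in the text; that choice of tuple order is likewise an assumption.

```python
def rel(r, facts, objects):
    """Interpret a relation expression as a set of (o1, o2) pairs."""
    if isinstance(r, str):                                # primitive relation R_0
        return {(f[1], f[2]) for f in facts if len(f) == 3 and f[0] == r}
    op, inner = r
    pairs = rel(inner, facts, objects)
    if op == "inv":                                       # inverse relation
        return {(b, a) for (a, b) in pairs}
    if op == "star":                                      # reflexive-transitive closure
        closure = {(o, o) for o in objects} | pairs
        while True:
            new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
            if new <= closure:
                return closure
            closure |= new
    raise ValueError(f"unknown relation operator: {op}")

def cls(c, facts, objects, binding):
    """Interpret a class expression as a set of objects."""
    if isinstance(c, str):
        if c == "a-thing":
            return set(objects)
        if c in binding:                                  # variable x_i -> {o_i}
            return {binding[c]}
        return {f[1] for f in facts if len(f) == 2 and f[0] == c}  # primitive class
    op = c[0]
    if op == "not":
        return set(objects) - cls(c[1], facts, objects, binding)
    if op == "min":                                       # related to something, nothing related to it
        pairs = rel(c[1], facts, objects)
        return {a for (a, _) in pairs} - {b for (_, b) in pairs}
    if op == "apply":                                     # (R C)
        pairs = rel(c[1], facts, objects)
        target = cls(c[2], facts, objects, binding)
        return {a for (a, b) in pairs if b in target}
    raise ValueError(f"unknown class operator: {op}")

# Toy state: tower a on b on c, with c and d on the table.
OBJECTS = {"a", "b", "c", "d"}
FACTS = {("on", "a", "b"), ("on", "b", "c"), ("on-table", "c"),
         ("on-table", "d"), ("clear", "a"), ("clear", "d")}

on_on_table = cls(("apply", "on", "on-table"), FACTS, OBJECTS, {})
above_c = cls(("apply", ("star", "on"), "x1"), FACTS, OBJECTS, {"x1": "c"})
min_on = cls(("min", "on"), FACTS, OBJECTS, {})
```

Because the closure is reflexive, `above_c` also contains c itself in this sketch.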
Decision-List Policies. We represent policies as decision lists of action-selection
rules. Each rule has the form a(x_1, ..., x_k) : L_1, L_2, ..., L_m, where a is a k-argument
action type, the L_i are literals, and the x_i are action-argument variables. We will
denote the list of action-argument variables as X = (x_1, ..., x_k). Each literal has
the form x ∈ C[X], where C[X] is a taxonomic syntax class expression and x is an
action-argument variable.
Given an MDP state s and a list of action-argument objects O = (o_1, ..., o_k), we
say that a literal x_i ∈ C[X] is true given s and O iff o_i ∈ C[X]^{s,O}. We say that a
rule R = a(x_1, ..., x_k) : L_1, L_2, ..., L_m allows action a(o_1, ..., o_k) in s iff each literal
in the rule is true given s and O. Note that if there are no literals in a rule for action
type a, then all possible actions of type a are allowed by the rule. A rule can be
viewed as placing mutual constraints on the tuples of objects that an action type
can be applied to. Note that a single rule may allow no actions or many actions
of one type. Given a decision list of such rules we say that an action is allowed by
the list if it is allowed by some rule in the list, and no previous rule allows any
actions. Again, a decision list may allow no actions or multiple actions of one type.
A decision list L for an MDP defines a policy π[L] for that MDP. If L allows no
actions in state s, then π[L](s) is the least legal action in s; otherwise, π[L](s) is
the least (according to the action ordering) legal action that is allowed by L. It is
important to note that since π[L] only considers legal actions, as specified by action
preconditions, the rules do not need to encode the preconditions, which allows for
simpler rules and learning. In other words, we can think of each rule as implicitly
containing the preconditions of its action type.
As an example of a taxonomic decision-list policy consider a simple blocks-world
domain where the goal condition is always to clear off all of the red blocks. The
primitive classes in this domain are red, clear, and holding, and the single relation
is on. The following policy will solve any problem in the domain.
putdown(x_1) : x_1 ∈ holding
pickup(x_1) : x_1 ∈ clear, x_1 ∈ (on* (on red))
The first rule will cause the agent to put down any block that is being held.
Otherwise, if no block is being held, then find a block x_1 that is clear and is above
a red block (expressed by (on* (on red))) and pick it up.
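The red-blocks policy can be sketched as executable Python under some simplifying assumptions: only one-argument action types are handled, each literal is a hand-written function returning the set of objects in its class (rather than a parsed taxonomic expression), the action ordering is Python's tuple ordering, and each rule is restricted to legal actions in line with the remark that rules implicitly contain preconditions. All names below are hypothetical.

```python
def holding(facts):
    return {f[1] for f in facts if f[0] == "holding"}

def clear(facts):
    return {f[1] for f in facts if f[0] == "clear"}

def above_red(facts):
    """(on* (on red)): blocks on a red block, plus everything stacked above them."""
    on = {(f[1], f[2]) for f in facts if f[0] == "on"}
    red = {f[1] for f in facts if f[0] == "red"}
    result = {a for (a, b) in on if b in red}
    while True:
        more = {a for (a, b) in on if b in result} - result
        if not more:
            return result
        result |= more

# Decision list: (action type, literals that the single argument x_1 must satisfy).
DECISION_LIST = [("putdown", [holding]), ("pickup", [clear, above_red])]

def decision_list_policy(decision_list, facts, legal_actions):
    """Least legal action allowed by the first rule that allows anything."""
    legal = sorted(legal_actions)
    for action_type, literals in decision_list:
        allowed = [a for a in legal
                   if a[0] == action_type and all(a[1] in lit(facts) for lit in literals)]
        if allowed:
            return allowed[0]
    return legal[0]                      # no rule fires: fall back to least legal action

tower = {("red", "r"), ("on-table", "r"), ("on", "b", "r"), ("on", "c", "b"), ("clear", "c")}
in_hand = {("red", "r"), ("on-table", "r"), ("on", "b", "r"), ("clear", "b"), ("holding", "c")}

first = decision_list_policy(DECISION_LIST, tower, {("pickup", "c")})
second = decision_list_policy(DECISION_LIST, in_hand, {("putdown", "c")})
```

In the first state no block is held, so the second rule selects the clear block c above the red block; in the second state the first rule fires and the held block is put down.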
18.4.3 Learning Taxonomic Decision Lists

For a given relational MDP, define R_{d,l} to be the set of action-selection rules that
have a length of at most l literals and whose class expressions have depth at most
d. Also, define H_{d,l} to be the policy space defined by decision lists whose rules are
from R_{d,l}. Since the number of depth-bounded class expressions is finite there are
a finite number of rules, and hence H_{d,l} is finite, though exponentially large. Our
implementation of Learn-Policy, as used in the main API loop, learns a policy in
H_{d,l} for user-specified values of d and l.
We use a Rivest-style decision-list learning approach [43], an approach also taken
by Martin and Geffner [33] for learning class-based policies. The primary difference
between Martin and Geffner [33] and our technique is the method for selecting
individual rules in the decision list. We use a greedy, heuristic search, while previous
work used an exhaustive enumeration approach. This difference allows us to find
rules that are more complex, at the potential cost of failing to find some good simple
rules that enumeration might discover.
Recall from section 18.3 that the training set given to Learn-Policy contains
trajectories of the rollout policy. Our learning algorithm, however, is not sensitive to
the trajectory structure (i.e., the order of trajectory elements) and thus, to simplify
our discussion, we will take the input to our learner to be a training set D that
contains the union of all the trajectory elements. This means that for a trajectory
set that contains n length-h trajectories, D will contain a total of n × h training
examples. As described in section 18.3, each training example in D has the form
⟨s, π(s), Q̂(s, a_1), ..., Q̂(s, a_m)⟩, where s is a state, π(s) is the action selected in s
by the previous policy, and Q̂(s, a_i) is the Q-value estimate of Q^π(s, a_i). Note that
in our experiments the training examples only contain values for the legal actions
in a state.
Given a training set D, a natural learning goal is to find a decision-list policy that
for each training example selects an action with the maximum estimated Q-value.
This learning goal, however, can be problematic in practice as often there are several
best (or close to best) actions as measured by the true Q-function. In such cases, due
to random sampling, the particular action that looks best according to the Q-value
estimates in the training set is arbitrary. Attempting to learn a concise policy that
matches these arbitrary actions will be difficult at best and likely impossible.
One approach [31] to avoiding this problem is to use statistical tests to determine
the actions that are clearly the best (positive examples) and the ones that are
clearly not the best (negative examples). The learner is then asked to find a policy
that is consistent with the positive and negative examples. While this approach has
shown some empirical success, it has the potential shortcoming of throwing away
most of the Q-value information. In particular, it may not always be possible to
find a policy that exactly matches the training data. In such cases, we would like
the learner to make informed tradeoffs regarding suboptimal actions, i.e., prefer
suboptimal actions that have larger Q-values. With this motivation, below we
describe a cost-sensitive decision-list learner that is sensitive to the full set of
Q-values in D. The learning goal is roughly to find a decision list that selects actions
with large cumulative Q-values over the training set.
18.4.3.1 Learning Lists of Rules

We say that a decision list L covers a training example ⟨s, π(s), Q̂(s, a_1), ..., Q̂(s, a_m)⟩
if L suggests an action in state s. Given a set of training examples D, we search for
a decision list that selects actions with high Q-value via an iterative set-covering
approach carried out by Learn-Decision-List. Decision-list rules are constructed
one at a time and in order until the list covers all of the training examples.
Pseudocode for our algorithm is given in figure 18.2. Initially, the decision list is the null
list and does not cover any training examples. During each iteration, we search for a


Learn-Decision-List (D, d, l, b)
    // training set D, concept depth d, rule length l, beam width b
    L ← nil;
    while (D is not empty)
        R ← Learn-Rule(D, d, l, b);
        D ← D − {d ∈ D | R covers d};
        L ← Extend-List(L, R);    // add R to end of list
    Return L;

Learn-Rule (D, d, l, b)
    // training set D, concept depth d, rule length l, beam width b
    for each action type a    // compute rule for each action type a
        R_a ← Beam-Search(D, d, l, b, a);
    Return argmax_a Hvalue(R_a, D);

Beam-Search (D, d, l, b, a)
    // training set D, concept depth d, rule length l, beam width b, action type a
    k ← arity of a;
    X ← (x_1, ..., x_k);    // X is a sequence of action-argument variables
    L ← {(x ∈ C) | x ∈ X, C ∈ C_d[X]};    // set of depth-bounded candidate literals
    B_0 ← { a(X) : nil }; i ← 1;    // initialize beam to a single rule with no literals
    loop
        G = B_{i−1} ∪ {R ∈ R_{d,l} | R = Add-Literal(R′, l), R′ ∈ B_{i−1}, l ∈ L};
        B_i ← Beam-Select(G, b, D);    // select best b heuristic values
        i ← i + 1;
    until B_{i−1} = B_i;    // loop until there is no more improvement in heuristic
    Return argmax_{R ∈ B_i} Hvalue(R, D)    // return best rule in final beam

Figure 18.2 Pseudocode for learning a decision list in H_{d,l} given training data D.
The procedure Extend-List(L, R) simply adds rule R to the end of the decision list
L. The procedure Add-Literal(R, l) simply returns a rule where literal l is added to
the end of rule R. The procedure Beam-Select(G, b, D) selects the best b rules in
G with different heuristic values. The procedure Hvalue(R, D) returns the heuristic
value of rule R relative to training data D and is described in the text.

high-quality rule R with quality measured relative to the set of currently uncovered
training examples. The selected rule is appended to the current decision list,
and the training examples newly covered by the selected rule are removed from the
training set. This process repeats until the list covers all of the training examples.
The success of this approach depends heavily on the function Learn-Rule, which
selects a good rule relative to the uncovered training examples; typically a good
rule is one that selects actions with the best (or close to best) Q-value and also
covers a significant number of examples.
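The set-covering loop can be sketched generically in Python. This is a toy illustration, not the learner used in the experiments: `learn_rule` and `covers` are passed in as callables, the integer "examples" and named rules are invented for the demonstration, and the early-exit guard is an addition of this sketch that does not appear in the pseudocode.

```python
def learn_decision_list(examples, learn_rule, covers):
    """Greedy set covering: add rules in order until every example is covered."""
    remaining = list(examples)
    decision_list = []
    while remaining:
        rule = learn_rule(remaining)                 # best rule for uncovered examples
        uncovered = [ex for ex in remaining if not covers(rule, ex)]
        if len(uncovered) == len(remaining):
            break                                    # guard (assumption): rule covers nothing new
        decision_list.append(rule)                   # append rule to the end of the list
        remaining = uncovered
    return decision_list

# Hypothetical toy setting: examples are integers, rules are named predicates.
RULES = [("even", lambda x: x % 2 == 0),
         ("small", lambda x: x < 3),
         ("any", lambda x: True)]

def covers(rule, example):
    return rule[1](example)

def learn_rule(remaining):
    """Stand-in for Learn-Rule: first rule in a fixed preference order that covers something."""
    for rule in RULES:
        if any(covers(rule, x) for x in remaining):
            return rule

learned = learn_decision_list([1, 2, 3, 4, 5], learn_rule, covers)
rule_names = [name for name, _ in learned]
```

Each later rule is chosen only with respect to the examples that the earlier rules left uncovered, which is what gives the ordered list its decision-list semantics.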

18.4.3.2 Learning Individual Rules

The input to the rule learner Learn-Rule is a set of training examples, along with
depth and length parameters d and l, and a beam width b. For each action type
a, the rule learner calls the routine Beam-Search to find a good rule R_a in R_{d,l}
for action type a. Learn-Rule then returns the rule R_a with the highest value as
measured by our heuristic, which is described later in this section.
For a given action type a, the procedure Beam-Search generates a beam
B_0, B_1, ..., where each B_i is a set of rules in R_{d,l} for action type a. The sets evolve by
specializing rules in previous sets by adding literals to them, guided by our heuristic
function. Search begins with the most general rule a(X) : nil, which allows any
action of type a in any state. Search iteration i produces a set B_i that contains b
rules with the highest different heuristic values among those in the following set (see footnote 4):
G = B_{i−1} ∪ {R ∈ R_{d,l} | R = Add-Literal(R′, l), R′ ∈ B_{i−1}, l ∈ L},
where L is the set of all possible literals with a depth of d or less. This set includes
the current best rules (those in B_{i−1}) and also any rule in R_{d,l} that can be formed
by adding a new literal to a rule in B_{i−1}. The search ends when no improvement
in heuristic value occurs, that is, when B_i = B_{i−1}. Beam-Search then returns the
best rule in B_i according to the heuristic.
Heuristic Function. For a training instance ⟨s, π(s), Q̂(s, a_1), ..., Q̂(s, a_m)⟩,
following Harmon and Baird [22], we define the Q-advantage of taking action a_i
instead of π(s) in state s by Δ(s, a_i) = Q̂(s, a_i) − Q̂(s, π(s)). Likewise, the
Q-advantage of a rule R is the sum of the Q-advantages of actions allowed by R
in s. Given a rule R and a set of training examples D, our heuristic function
Hvalue(R, D) is equal to the number of training examples that the rule covers plus
the sum of all the Q-advantages of the rule over those training examples (see footnote 5).
Using Q-advantage rather than Q-value focuses the learner toward instances where a large
improvement over the previous policy is possible. Naturally, one could consider
using different weights for the coverage and Q-advantage terms, possibly tuning
the weight automatically using validation data.
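The heuristic can be written down directly from this description. The Python sketch below assumes each training example is a triple `(state, previous_action, q_hat)` with `q_hat` a dictionary over the legal actions, and represents a rule by a hypothetical callable `allows(state, action)`; both encodings are choices made for this example.

```python
def covers(allows, example):
    """A rule covers an example if it allows at least one legal action in that state."""
    state, _, q_hat = example
    return any(allows(state, a) for a in q_hat)

def rule_q_advantage(allows, example):
    """Sum of Q-hat(s, a) - Q-hat(s, pi(s)) over the actions a allowed by the rule in s."""
    state, prev_action, q_hat = example
    return sum(q_hat[a] - q_hat[prev_action] for a in q_hat if allows(state, a))

def hvalue(allows, examples):
    """Coverage count plus total Q-advantage over the covered examples."""
    covered = [ex for ex in examples if covers(allows, ex)]
    return len(covered) + sum(rule_q_advantage(allows, ex) for ex in covered)

# Toy training set: in s1 action b improves on the previous choice a; in s2 it is worse.
D = [("s1", "a", {"a": -4.0, "b": -2.0}),
     ("s2", "a", {"a": -1.0, "b": -3.0})]

keep_previous = lambda s, a: a == "a"
always_b = lambda s, a: a == "b"
b_only_in_s1 = lambda s, a: s == "s1" and a == "b"

scores = {
    "keep_previous": hvalue(keep_previous, D),
    "always_b": hvalue(always_b, D),
    "b_only_in_s1": hvalue(b_only_in_s1, D),
}
```

On this toy data the specialized rule scores highest: it covers only one example but earns the full Q-advantage there, whereas the two unconditional rules cover both examples with zero net advantage.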
4. Since many rules in R_{d,l} are equivalent, we must prevent the beam from filling up
with semantically equivalent rules. Rather than deal with this problem via expensive
equivalence testing we take an ad hoc, but practically effective approach. We assume that
rules do not coincidentally have the same heuristic value, so that ones that do must be
equivalent. Thus, we construct beams whose members all have different heuristic values.
We choose between rules with the same value by preferring shorter rules, then choose
arbitrarily.
5. If the coverage term is not included, then covering a zero Q-advantage example is the
same as not covering it. But zero Q-advantage can be good (e.g., the previous policy is
optimal in that state).

18.5 Bootstrapping
There are two issues that are critical to the success of our API technique. First,
API is fundamentally limited by the expressiveness of the policy language and
the strength of the learner, which dictates its ability to capture the improved
policy described by the training data at each iteration. Second, API can only
yield improvement if Improved-Trajectories successfully generates training data
that describes an improved policy. For large classical planning domains, initializing
API with an uninformed random policy will typically result in essentially random
training data, which is not helpful for policy improvement. For example, consider
the MDP corresponding to the 20-block blocks world with an initial problem
distribution that generates random initial and goal states. In this case, a random
policy is unlikely to reach a goal state within any practical horizon time. Hence,
the rollout trajectories are unlikely to reach the goal, providing no guidance toward
learning an improved policy (i.e., a policy that can more reliably reach the goal).
Because we are interested in solving large domains such as this, providing guiding
inputs to API is critical. In Fern et al. [15], we showed that by bootstrapping
API with the domain-independent heuristic of the planner FF [24], API was able
to uncover good policies for the blocks world, simplified logistics world (no planes),
and stochastic variants. This approach, however, is limited by the heuristic's ability
to provide useful guidance, which can vary widely across domains.
Here we describe a new bootstrapping procedure for goal-based planning domains,
based on random walks, for guiding API toward good policies. Our planning
system, which is evaluated in section 18.6, is based on integrating this procedure
with API in order to find policies for goal-based planning domains. For
non-goal-based MDPs, this bootstrapping procedure cannot be directly applied, and other
bootstrapping mechanisms must be used if necessary. This might include providing
an initial nontrivial policy, providing a heuristic function, or some form of reward
shaping [34]. Below, we first describe the idea of random-walk distributions. Next,
we describe how to use these distributions in the context of bootstrapping API,
giving a new algorithm LRW-API.
18.5.1 Random-Walk Distributions

Throughout we consider an MDP M = ⟨S, A, T, R, I⟩ that corresponds to a goal-based
planning domain, as described in section 18.4.1. Recall that each state
s ∈ S corresponds to a planning problem, specifying a world state (via world
facts) and a set of goal conditions (via goal facts). We will use the terms "MDP
state" and "planning problem" interchangeably. Note that, in this context, I is a
distribution over planning problems. For convenience we will denote MDP states
as tuples s = (w, g), where w and g are the sets of world facts and goal facts in s,
respectively.


Given an MDP state s = (w, g) and set of goal predicates G, we dene s|G to be
the MDP state (w, g  ) where g  contains those goal facts in g that are applications
of a predicate in G. Given M and a set of goal predicates G, we dene the nstep random walk problem distribution RW n (M, G) by the following stochastic
algorithm:
1. Draw a random state s0 = (w0, g0) from the initial state distribution I.
2. Starting at s0, take n uniformly random actions (see footnote 6), giving a state sequence
(s0, ..., sn), where sn = (wn, g0) (recall that actions do not change goal facts).
At each uniformly random action selection, we assume that an extra no-op
action (that does not change the state) is selected with some fixed probability,
for reasons explained below.
3. Let g be the set of goal facts corresponding to the world facts in wn, so, e.g., if
wn = {on(a, b), clear(a)}, then g = {gon(a, b), gclear(a)}. Return the planning
problem (MDP state) (s0, g)|G as the output.
We will sometimes abbreviate RW_n(M, G) by RW_n when M and G are clear in
context.
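The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: facts are represented as tuples such as ("on", "a", "b"), the callbacks sample_initial, applicable_actions, and apply_action are hypothetical stand-ins for a domain simulator, the no-op probability of 0.1 is an arbitrary choice, and a state with no applicable actions is simply left unchanged.

```python
import random

def restrict_goals(state, goal_predicates):
    """Return s|G: keep only the goal facts whose predicate is in G."""
    world, goals = state
    kept = frozenset(g for g in goals if g[0] in goal_predicates)
    return (world, kept)

def sample_rw_problem(sample_initial, applicable_actions, apply_action,
                      n, goal_predicates, noop_prob=0.1, rng=random):
    """Draw one planning problem from the n-step random-walk distribution."""
    # Step 1: draw s0 = (w0, g0) from the initial state distribution.
    start_world, _ = sample_initial(rng)
    world = start_world
    # Step 2: take n random actions; the no-op leaves the state unchanged.
    for _ in range(n):
        if rng.random() < noop_prob:
            continue
        actions = applicable_actions(world)
        if not actions:          # assumption: dead states behave like a no-op
            continue
        world = apply_action(world, rng.choice(actions))
    # Step 3: the world facts reached by the walk become the goal facts.
    goals = frozenset(world)
    return restrict_goals((start_world, goals), goal_predicates)
```

Restricting G to a single predicate, as in the blocks-world example with on, is then just a matter of passing {"on"} as goal_predicates.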
Intuitively, to perform well on this distribution a policy must be able to achieve
facts involving the goal predicates that typically result after an n-step random walk
from an initial state. By restricting the set of goal predicates G we can specify the
types of facts that we are interested in achieving; e.g., in the blocks world we may
only be interested in achieving facts involving the on predicate.
The random-walk distributions provide a natural way to span a range of problem
difficulties. Since longer random walks tend to take us further from an initial
state, for small n we typically expect that the planning problems generated by
RW_n will become more difficult as n grows. However, as n becomes large, the
problems generated will require far fewer than n steps to solve, i.e., there will be
more direct paths from an initial state to the end state of a long random walk.
Eventually, since S is finite, the problem difficulty will stop increasing with n.
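The saturation effect can be illustrated on a toy state space. The sketch below is an illustration only, not one of the chapter's domains: it assumes a cycle of m states, a walk that moves one step left or right with equal probability, and a no-op probability of 0.1, and it computes the exact expected shortest-path distance between the start state and the state reached after n steps.

```python
def expected_distance(m, n, noop=0.1):
    """Expected shortest-path distance from the start after an n-step lazy walk on an m-cycle."""
    dist = [0.0] * m
    dist[0] = 1.0                      # the walk starts at state 0
    for _ in range(n):
        new = [0.0] * m
        for i, p in enumerate(dist):
            new[i] += noop * p                          # no-op
            new[(i + 1) % m] += (1 - noop) / 2 * p      # step right
            new[(i - 1) % m] += (1 - noop) / 2 * p      # step left
        dist = new
    # The shortest path from 0 to i on a cycle has length min(i, m - i).
    return sum(p * min(i, m - i) for i, p in enumerate(dist))

if __name__ == "__main__":
    for n in (0, 1, 5, 20, 100, 1000):
        print(n, round(expected_distance(10, n), 3))
```

On this toy chain the expected distance grows with n for short walks and then levels off well below n, which mirrors the qualitative argument in the text.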
A question raised by this idea is whether, for large n, good performance on
RW_n ensures good performance on other problem distributions of interest in the
domain. In some domains, such as the simple blocks world (see footnote 7), good random-walk
performance does seem to yield good performance on other distributions of interest.
In other domains, such as the grid world (with keys and locked doors), intuitively,
a random walk is very unlikely to uncover a problem that requires unlocking a
sequence of doors.

6. In practice, we only select random actions from the set of applicable actions in a state
si, provided our simulator makes it possible to identify this set.
7. In the blocks world with large n, RW_n generates various pairs of random block
configurations, typically pairing states that are far apart; clearly, a policy that performs
well on this distribution has captured significant information about the blocks world.

Reinforcement Learning in Relational Domains: A Policy-Language Approach

We believe that good performance on long random walks is often useful, but
it addresses only one component of the difficulty of many planning benchmarks.
To successfully address problems with other components of difficulty, a planner
will need to deploy orthogonal technology such as landmark extraction for setting
subgoals [23]. For example, in the grid world, if we could automatically set the
subgoal of possessing a key for the first door, a long random-walk policy could
provide a useful macro for getting that key.
For the purpose of developing a bootstrapping technique for API, we limit our
focus to finding good policies for long random walks. In our experiments, we define
"long" by specifying a large walk length N. Theoretically, the inclusion of the
no-op action in the definition of RW ensures that the induced random-walk
Markov chain is aperiodic, and thus that the distribution over states reached
by increasingly long random walks converges to a stationary distribution (see footnote 8). Thus
RW* = lim_{n→∞} RW_n is well-defined, and we take good performance on RW* to
be our goal.
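The role of the no-op can be seen on the smallest possible example. The sketch below is an illustration with an assumed two-state chain whose only action flips the state; it is not taken from the chapter. Without a no-op the state distribution oscillates forever, while mixing in a no-op with probability p (here 0.1, an arbitrary choice) makes the chain aperiodic and the distribution settles.

```python
def step(dist, P):
    """One step of a Markov chain: multiply the row vector dist by the matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def lazy(P, p):
    """Transition matrix after adding a no-op that is chosen with probability p."""
    n = len(P)
    return [[p * (i == j) + (1 - p) * P[i][j] for j in range(n)] for i in range(n)]

def walk(dist, P, n):
    """State distribution after n steps starting from dist."""
    for _ in range(n):
        dist = step(dist, P)
    return dist

flip = [[0.0, 1.0], [1.0, 0.0]]   # deterministic flip: a chain with period 2
start = [1.0, 0.0]

if __name__ == "__main__":
    print(walk(start, flip, 10), walk(start, flip, 11))   # alternates, no limit
    print(walk(start, lazy(flip, 0.1), 200))              # settles near uniform
```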
18.5.2 Random-Walk Bootstrapping

For an MDP M, we define M[I′] to be an MDP identical to M except that the initial
state distribution is replaced by I′. We also define the success ratio SR(π, M[I]) of π
on M[I] as the probability that π solves a problem drawn from I. Also treating I as
a random variable, the average length AL(π, M[I]) of π on M[I] is the conditional
expectation of the solution length of π on problems drawn from I given that π
solves I. Typically the solution length of a problem is taken to be the number of
actions; however, when action costs are not uniform, the length is taken to be the
sum of the action costs. Note that for the MDP formulation of classical planning
domains, given in section 18.4.1, if a policy π achieves a high V(π), then it will also
have a high success ratio and low average cost.
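Both quantities are straightforward to estimate by sampling. The sketch below assumes a hypothetical helper solve_with(policy, problem, max_steps) that runs the policy in a simulator and returns the solution length, or None if the problem is not solved within the step limit; neither the helper nor the step limit comes from the text.

```python
def evaluate_policy(policy, problems, solve_with, max_steps):
    """Estimate success ratio and average length of a policy on sampled problems.

    The average length is conditional on success: failed problems count
    against the success ratio but are excluded from the length average.
    """
    lengths = []
    for problem in problems:
        length = solve_with(policy, problem, max_steps)   # None on failure
        if length is not None:
            lengths.append(length)
    success_ratio = len(lengths) / len(problems)
    average_length = sum(lengths) / len(lengths) if lengths else None
    return success_ratio, average_length
```

Returning None for the average length when nothing is solved keeps the conditional expectation undefined rather than reporting a misleading zero.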
Given an MDP M and set of goal predicates G, our system attempts to find a
good policy for M[RW_N], where N is selected to be large enough to adequately
approximate RW*, while still allowing tractable completion of the learning. Naively,
given an initial random policy π_0, we could try to apply API directly. However, as
already discussed, this will not work in general, since we are interested in planning
domains where RW* produces extremely large and difficult problems where random
policies provide an ineffective starting point.
However, for very small n (e.g., n = 1), RW_n typically generates easy problems,
and it is likely that API, starting with even a random initial policy, can reliably
find a good policy for RW_n. Furthermore, we expect that if a policy π_n performs
well on RW_n, then it will also provide reasonably good, but perhaps not perfect,
guidance on problems drawn from RW_m when m is only moderately larger than
8. The Markov chain may not be irreducible, so different initial states may give different
stationary distributions; however, we only consider one initial state, described by I.

LRW-API (N, G, n, w, h, M, π_0, γ)
    // max random-walk length N, goal predicates G
    // training set size n, sampling width w, horizon h,
    // MDP M, initial policy π_0, discount factor γ.
    π ← π_0; n ← 1;
    loop
        if ŜR_π(n) > τ
            // Find harder n-step distribution for π.
            n ← least i ∈ [n, N] s.t. ŜR_π(i) < τ − δ, or N if none;
        M′ = M[RW_n(M, G)];
        T ← Improved-Trajectories(n, w, h, M′, π);
        π ← Learn-Policy(T);
    until satisfied with π
    Return π;

Figure 18.3 Pseudocode for LRW-API. ŜR_π(n) estimates the success ratio of π
in planning domain D on problems drawn from RW_n(M, G) by drawing a set of
problems and returning the fraction solved by π. Constants τ and δ are described
in the text.

n. Thus, we expect to be able to find a good policy for RW_m by bootstrapping API
with initial policy π_n. This suggests a natural iterative bootstrapping technique to
find a good policy for large n (in particular, for n = N).
Figure 18.3 gives pseudocode for the procedure LRW-API, which integrates
API and random-walk bootstrapping to find a policy for the long-random-walk
problem distribution. Intuitively, this algorithm can be viewed as iterating through
two stages: first, finding a hard enough distribution for the current policy (by
increasing n); and then finding a good policy for the hard distribution using API.
The algorithm maintains a current policy π and current walk length n (initially
n = 1). As long as the success ratio of π on RW_n is below the success threshold
τ, which is a constant close to one, we simply iterate steps of approximate policy
improvement. Once we achieve a success ratio of τ with some policy π, the if-statement increases n until the success ratio of π on RW_n falls below τ − δ. That
is, when π performs well enough on the current n-step distribution we move on
to a distribution that is slightly harder. The constant δ determines how much
harder and is set small enough so that π can likely be used to bootstrap policy
improvement on the harder distribution. (The simpler method of just increasing n
by 1 whenever success ratio τ is achieved will also find good policies whenever this
method does; however, this can take much longer, as it may run API repeatedly on
a training set for which we already have a good policy.)
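The control flow of this procedure can be sketched in Python. The sketch makes several assumptions that are not in the text: estimate_sr(policy, n) stands for the sampled success-ratio estimate, improve(policy, n) stands for one round of Improved-Trajectories followed by Learn-Policy on the n-step distribution, satisfied is a user-supplied stopping test, max_iters is a safety bound, and the search for the next walk length is a plain linear scan, which may differ from how the authors search.

```python
def lrw_api(N, estimate_sr, improve, policy, tau=0.9, delta=0.1,
            max_iters=50, satisfied=lambda policy, n: False):
    """Random-walk bootstrapping loop; returns the final policy and the walk lengths used."""
    n = 1
    history = []
    for _ in range(max_iters):
        if estimate_sr(policy, n) > tau:
            # Find a harder distribution: the least i in [n, N] on which the
            # current policy falls below tau - delta, or N if there is none.
            n = next((i for i in range(n, N + 1)
                      if estimate_sr(policy, i) < tau - delta), N)
        policy = improve(policy, n)    # one step of approximate policy improvement
        history.append(n)
        if satisfied(policy, n):
            break
    return policy, history
```

With N = 10,000 a linear scan costs many success-ratio estimates per increase, so a practical version would probably probe a sparser set of walk lengths; that choice is left open here.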
Once n becomes equal to the maximum walk length N, we will have n = N for all
future iterations. It is important to note that even after we find a policy with a good

success ratio on RW_N, it may still be possible to improve on the average length of
the policy. Thus, we continue API on this distribution until we are satisfied with
both the success ratio and average length of the current policy.

18.6 Relational Planning Experiments


In this section, we evaluate the LRW-API technique on relational MDPs corresponding to deterministic and stochastic classical planning domains. We first give
results for a number of deterministic benchmark domains, showing promising results in comparison with the state-of-the-art planner FF [24], while also highlighting
the limitations of our approach. Next, we give results for several stochastic planning domains, including those in the domain-specific track of the 2004 International
Probabilistic Planning Competition (IPPC).
In all of our experiments, we use the policy learner described in section 18.4.3
to learn taxonomic decision-list policies. In all cases, the number of training
trajectories is 100, and policies are restricted to rules with a depth bound d and
length bound l. The discount factor γ was always one, and LRW-API was always
initialized with a policy that selects random actions. We utilize a maximum-walk-length parameter N = 10,000 and set τ and δ equal to 0.9 and 0.1, respectively.
18.6.1 Deterministic Planning Experiments

We perform experiments in seven familiar STRIPS planning domains, including
those used in the AIPS-2000 Planning Competition, those used to evaluate TLPlan in Bacchus and Kabanza [4], and the Gripper domain. Each domain has a
standard problem generator that accepts parameters, which control the size and
difficulty of the randomly generated problems. Below we list each domain and the
parameters associated with it. A detailed description of these domains can be
found in Hoffmann and Nebel [24].
Blocks world (n): the standard blocks world with n blocks
Freecell (s, c, f, l): a version of solitaire with s suits, c cards per suit, f freecells,
and l columns
Logistics (a, c, l, p): the logistics transportation domain with p packages, l locations, c cities, and a airplanes
Schedule (p): a job shop scheduling domain with p parts
Elevator (f, p): elevator scheduling with f floors and p people
Gripper (b): a robotic gripper domain with b balls
Briefcase (i): a transportation domain with i items

18.6.1.1 LRW Experiments

Our first set of experiments evaluates the ability of LRW-API to find good policies
for RW*. Here we utilize a sampling width of one for rollout, since these are
deterministic domains. Recall that in each iteration of LRW-API we compute
an (approximately) improved policy and may also increase the walk length n to
find a harder problem distribution. We continued iterating LRW-API until we
observed no further improvement. The training time per iteration is approximately
five hours. Though the initial training period is significant, once a policy is learned
it can be used to solve new problems very quickly, terminating in seconds with a
solution when one is found, even for very large problems.
Figure 18.4 provides data for each iteration of LRW-API in each of the seven
domains with the indicated parameter settings. The first column, for each domain,
indicates the iteration number (e.g., the Blocks World was run for eight iterations).
The second column records the walk length n used for learning in the corresponding
iteration. The third and fourth columns record the success ratio (SR) and average
length (AL) of the policy learned at the corresponding iteration as measured on 100
problems drawn from RW_n for the corresponding value of n (i.e., the distribution
used for learning). When this SR exceeds τ, the next iteration seeks an increased
walk length n. The fifth and sixth columns record the SR and AL of the same
policy, but measured on 100 problems drawn from the LRW target distribution
RW*, which in these experiments is approximated by RW_N for N = 10,000.
So, for example, we see that in the Blocks World there are a total of eight
iterations, where we learn at first for one iteration with n = 4, one more iteration
with n = 14, four iterations with n = 54, and then two iterations with n = 334.
At this point we see that the resulting policy performs well on RW*. Further
iterations with n = N, not shown, showed no improvement over the policy found
after iteration 8. In other domains, we also observed no improvement after iterating
with n = N, and thus do not show those iterations. We note that all domains except
Logistics (see below) achieve policies with good performance on RW_N by learning
on much shorter RW_n distributions, indicating that we have indeed selected a large
enough value of N to capture RW*, as desired.
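The walk-length schedule described here can be checked against the rule that n increases only after the success ratio on the learning distribution exceeds τ. The sketch below uses the Blocks World (20) iteration number, walk length, and SR on the learning distribution as reported in figure 18.4, with τ = 0.9; the function name is an illustrative choice.

```python
TAU = 0.9

# (iteration, walk length n, SR measured on the n-step learning distribution)
BLOCKS_WORLD = [
    (1, 4, 0.92), (2, 14, 0.94), (3, 54, 0.56), (4, 54, 0.78),
    (5, 54, 0.88), (6, 54, 0.98), (7, 334, 0.84), (8, 334, 0.99),
]

def walk_length_changes(rows, tau):
    """For each iteration but the last: (iteration, SR exceeded tau, n grew at the next iteration)."""
    changes = []
    for (it, n, sr), (_, n_next, _) in zip(rows, rows[1:]):
        changes.append((it, sr > tau, n_next > n))
    return changes

if __name__ == "__main__":
    for it, exceeded, grew in walk_length_changes(BLOCKS_WORLD, TAU):
        print(f"iteration {it}: SR > tau is {exceeded}, n increased is {grew}")
```

In every row the two flags agree: the walk length grows after iterations 1, 2, and 6, and stays fixed while the policy is still below the threshold.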
18.6.1.2 General Observations

For several domains, our learner bootstraps very quickly from short random-walk
problems, finding a policy that works well even for much longer random-walk
problems. These include Schedule, Briefcase, Gripper, and Elevator. Typically,
large problems in these domains have many somewhat independent subproblems
with short solutions, so that short random walks can generate instances of all the
different typical subproblems. In each of these domains, our best LRW policy is
found in a small number of iterations and performs comparably to FF on RW*.
We note that FF is considered a very good domain-independent planner for these
domains, so we consider this a successful result.

                          RW_n            RW*
iter. #    n          SR      AL      SR      AL

Blocks World (20)
1          4          0.92    2.0     0       0
2          14         0.94    5.6     0.10    41.4
3          54         0.56    15.0    0.17    42.8
4          54         0.78    15.0    0.32    40.2
5          54         0.88    33.7    0.65    47.0
6          54         0.98    25.1    0.90    43.9
7          334        0.84    45.6    0.87    50.1
8          334        0.99    37.8    1       43.3
FF                                    0.96    49.0

Freecell (4,2,2,4)
1          5          0.97    1.4     0.08    3.6
2          8          0.97    2.7     0.26    6.3
3          30         0.65    7.0     0.78    7.0
4          30         0.72    7.1     0.85    7.0
5          30         0.90    6.7     0.85    6.3
6          30         0.81    6.7     0.89    6.6
7          30         0.78    6.8     0.87    6.8
8          30         0.90    6.9     0.89    6.6
9          30         0.93    7.7     0.93    7.9
FF                                    1       5.4

Logistics (1,2,2,6)
1          5          0.86    3.1     0.25    11.3
2          45         0.86    6.5     0.28    7.2
3          45         0.81    6.9     0.31    8.4
4          45         0.86    6.8     0.28    8.9
5          45         0.76    6.1     0.28    7.8
6          45         0.76    5.9     0.32    8.4
7          45         0.86    6.2     0.39    9.1
8          45         0.76    6.9     0.31    11.0
9          45         0.70    6.1     0.19    7.8
10         45         0.81    6.1     0.25    7.6
...
43         45         0.74    6.4     0.25    9.0
44         45         0.90    6.9     0.39    9.3
45         45         0.92    6.6     0.38    9.4
FF                                    1       13

Schedule (20)
1          1          0.79    1       0.48    27
2          4          1       3.45    1       34
FF                                    ?       36

Elevator (20,10)
1          20         1       4.0     1       26
FF                                    ?       23

Briefcase (10)
1          5          0.91    1.4     0       0
2          15         0.89    4.2     0.2     38
3          15         1       3.0     1       30
FF                                    ?       28

Gripper (10)
1          10         1       3.8     1       13
FF                                    ?       13

Figure 18.4 Results for each iteration of LRW-API in seven deterministic planning domains. For each iteration, we show the walk length n used for learning, along
with the success ratio (SR) and average length (AL) of the learned policy on both
RW_n and RW*. Note that larger SR and smaller AL are better. The final policy
shown in each domain performs above τ = 0.9 SR on walks of length N = 10,000
(with the exception of Logistics), and further iteration does not improve the performance. For each benchmark we also show the SR and AL of the planner FF on
problems drawn from RW*.


For two domains, Logistics (see footnote 9) and Freecell, our planner is unable to find a policy
with success ratio one on RW*. We believe that this is a result of the limited
knowledge representation we allowed for policies, for the following reasons. First,
we ourselves cannot write good policies for these domains within our current
policy language (see footnote 10). Second, the final learned decision lists for Logistics and Freecell
contain a much larger number of more specific rules than the lists learned in the
other domains. This indicates that the learner has difficulty finding general rules
within the language restrictions that are applicable to large portions of training
data, resulting in poor generalization. Third, the success ratio (not shown) for the
sampling-based rollout policy, i.e., the improved policy simulated by Improved-Trajectories, is substantially higher than that for the resulting learned policy that
becomes the policy of the next iteration. This indicates that Learn-Decision-List
is learning a much weaker policy than the sampling-based policy generating its
training data, indicating a weakness in either the policy language or the learning
algorithm. For example, in the Logistics domain, at iteration 8, the training data
for learning the iteration 9 policy is generated by a sampling rollout policy that
achieves success ratio 0.97 on 100 training problems drawn from the same RW_45
distribution, but the learned iteration 9 policy only achieves success ratio 0.70, as
shown in the figure at iteration 9. Extending our policy language to incorporate
the expressiveness that appears to be required in these domains will require a more
sophisticated learning algorithm, which is a point of future work.
In the remaining domain, the Blocks World, the bootstrapping provided by
increasingly long random walks appears particularly useful. The policies learned
at each of the walk lengths 4, 14, 54, and 334 are increasingly effective on the
target LRW distribution RW*. For walks of length 54 and 334, it takes multiple
iterations to master the provided level of difficulty beyond the previous walk length.
Finally, upon mastering walk length 334, the resulting policy appears to perform
well for any walk length. The learned policy is modestly superior to FF on RW*
in success ratio and average length.
18.6.1.3 Evaluation on the Original Problem Distributions

In each domain we denote by π* the best learned LRW policy, i.e., the policy, from
each domain, with the highest performance on RW*, as shown in figure 18.4. Figure
18.5 shows the performance of π*, in comparison to FF, on the original intended
problem distributions for each of our domains. We measured the success ratio of
both systems by giving a time limit of 100 seconds to solve a problem. Here we

9. In Logistics, the planner generates a long sequence of policies with similar, oscillating
success ratios that are elided from the figure with ellipses for space reasons.
10. For example, in Logistics, one of the important concepts is the set containing all
packages on trucks such that the truck is in the package's goal city. However, the domain
is defined in such a way that this concept cannot be expressed within the language used
in our experiments.


                              π*              FF
Domain      Size          SR      AL      SR      AL

Blocks      (20)          1       54      0.81    60
            (50)          1       151     0.28    158
Freecell    (4,2,2,4)     0.36    15      1       10
            (4,13,4,8)    0       -       0.47    112
Logistics   (1,2,2,6)     0.87    6       1       6
            (3,10,2,30)   0       -       1       158
Elevator    (60,30)       1       112     1       98
Schedule    (50)          1       175     1       212
Briefcase   (10)          1       30      1       29
            (50)          1       162     0       -
Gripper     (50)          1       149     1       149

Figure 18.5 Results on standard problem distributions for seven benchmarks.
Success ratio (SR) and average length (AL) are provided for both FF and our policy
learned for the LRW problem distribution. For a given domain, the same learned
LRW policy is used for each problem size shown.

have attempted to select the largest problem sizes previously used in evaluation of
domain-specific planners (either in AIPS-2000 or in Bacchus and Kabanza [4]), as
well as show a smaller problem size for those cases where one of the planners we
show performed poorly on the large size. In each case, we use the problem generators
provided with the domains, and evaluate on 100 problems of each size.
Overall, these results indicate that our learned, reactive policies are competitive
with the domain-independent planner FF. It is important to remember that these
policies are learned in a domain-independent fashion, and thus LRW-API can
be viewed as a general approach to generating domain-specific reactive planners.
On two domains, Blocks World and Briefcase, our learned policies substantially
outperform FF on success ratio, especially on large domain sizes. On three domains,
Elevator, Schedule, and Gripper, the two approaches perform quite similarly on
success ratio, with our approach superior in average length on Schedule but FF
superior in average length on Elevator.
On two domains, Logistics and Freecell, FF substantially outperforms our learned
policies on success ratio. We believe that this is partly due to an inadequate policy
language, as discussed above. We also believe, however, that another reason for
the poor performance is that the long-random-walk distribution RW* does not
correspond well to the standard problem distributions. This seems to be particularly
true for Freecell. The policy learned for Freecell (4,2,2,4) achieved a success ratio
of 93% on RW*; however, for the standard distribution it only achieved 36%.
This suggests that RW* generates problems that are significantly easier than the


standard distribution. This is supported by the fact that the solutions produced
by FF on the standard distribution are on average twice as long as those produced
on RW*. One likely reason for this is that it is easy for random walks to end up
in dead states in Freecell, where no actions are applicable. Thus the random-walk
distribution will typically produce many problems where the goals correspond to
such dead states. The standard distribution, on the other hand, will not treat such
dead states as goals.
18.6.2 Probabilistic Planning Experiments

Here we present experiments in three probabilistic domains that are described in
the probabilistic planning domain language PPDDL [52].
Ground Logistics (c, p): a probabilistic version of logistics with no airplanes, with
c cities and p packages. The driving action has a probability of failure in this
domain.
Colored Blocks World (n): a probabilistic blocks world with n colored blocks,
where goals involve constructing towers with certain color patterns. There is a
probability that moved blocks fall to the floor.
Boxworld (c, p): a probabilistic version of full logistics with c cities and p packages.
Transportation actions have a probability of going in the wrong direction.
The Ground Logistics domain is originally from Boutilier et al. [10], and was also
used for evaluation in Yoon et al. [51]. The Colored Blocks World and Boxworld
domains are the domains used in the hand-tailored track of the International
Planning Competition in which our LRW-API technique was entered. In the hand-tailored track, participants were provided with problem generators for each domain
before the competition and were allowed to incorporate domain knowledge into
the planner for use at competition time. We provided the problem generators to
LRW-API and learned policies for these domains, which were then entered into the
competition.
We have also conducted experiments in the other probabilistic domains from
Yoon et al. [51], including variants of the blocks world and a variant of Ground
Logistics, some of which appeared in Fern et al. [15]. However, we do not show
those results here since they are qualitatively identical to the deterministic blocks
world results described above and the Ground Logistics results we show below.
For our three probabilistic domains, we conducted LRW experiments using the
same procedure as above. All parameters given to LRW-API were the same as
above except that the sampling width used for rollout was set to w = 10, and τ was
set to 0.85 in order to account for the stochasticity in these domains. The results
of these experiments are shown in figure 18.6. These tables have the same form as
figure 18.4, except that the last row given for each domain now gives the performance of π*
on the standard distribution, i.e., problems drawn from the domain's problem generator.

                                          RW_n            RW*
iter. #    n                          SR      AL      SR      AL

Boxworld (10,5)
1          10                         0.73    4.3     0.03    61.5
2          10                         0.93    2.3     0.13    58.4
3          20                         0.91    4.4     0.17    55.9
4          40                         0.96    6.1     0.31    50.4
5          170                        0.62    30.8    0.25    52.2
6          170                        0.49    37.9    0.17    55.7
7          170                        0.63    29.3    0.21    55
8          170                        0.63    29.1    0.18    55.3
9          170                        0.48    36.4    0.17    55.3
Standard Distribution (15,15)                         ?       ?

Ground Logistics (3,4,4,3)
1          5                          0.95    2.71    0.17    168.9
2          10                         0.97    2.06    0.84    17.5
3          160                        1       6.41    1       7.2
Standard Distribution (5,7,7,20)                      1       20

Colored Blocks World (10)
1          2                          0.86    1.7     0.19    93.6
2          5                          0.89    8.4     0.81    40.8
3          40                         0.92    11.7    0.85    32.7
4          100                        0.76    37.5    0.77    38.5
5          100                        0.94    20.0    0.95    21.9
Standard Distribution (50)                            0.95    123

Figure 18.6 Results for each iteration of LRW-API in three probabilistic planning
domains. For each iteration, we show the walk length n used for learning, along with
the success ratio (SR) and average length (AL) of the learned policy on both RW_n
and RW*. For each benchmark, we show performance on the standard problem
distribution of the policy whose performance is best on RW*.


For Boxworld, LRW-API is not able to find a good policy for RW* or the
standard distribution. Again, as for deterministic Logistics and Freecell, we believe
that this is primarily because of the restricted policy language that is currently used
by our learner. Here, as for those domains, we see that the decision list learned for
Boxworld contains many very specific rules, indicating that the learner was not
able to generalize well beyond the training trajectories. For Ground Logistics, we
see that LRW-API quickly finds a good policy for both RW* and the standard
distribution.
For Colored Blocks World, we also see that LRW-API is able to quickly find
a good policy for both RW* and the standard distribution. However, unlike the
deterministic (uncolored) blocks world, here the success ratio is observed to be less
than one, solving 95% of the problems. It is unclear why LRW-API is not able to
find a perfect policy. It is relatively easy to hand-code a policy for Colored Blocks
World using the language of the learner, hence inadequate knowledge representation
is not the answer. The predicates and action types for this domain are not the same
as those in its deterministic counterpart and other stochastic variants that we have
previously considered. This difference apparently interacts badly with our learner's
search bias, causing it to fail to find a perfect policy. Nevertheless, these two
results, along with the probabilistic planning results not shown here, indicate that
when a good policy is expressible in our language, LRW-API can find good policies
in complex relational MDPs. This makes LRW-API one of the few techniques that
can simultaneously cope with the complexity resulting from stochasticity and from
relational structure in domains such as these.

18.7 Related Work

Boutilier et al. [10] presented the first exact solution technique for relational MDPs
based on structured dynamic programming. However, a practical implementation
of the approach was not provided, primarily due to the need for the simplification
of first-order logic formulae. These ideas, however, served as the basis for a logic
programming-based system [28] that was successfully applied to blocks world
problems involving simple goals and a simplified logistics world. This style of
approach is inherently limited to domains where the exact value functions or
policies can be compactly represented in the chosen knowledge representation.
Unfortunately, this is not generally the case for the types of domains that we
consider here, particularly as the planning horizon grows. Nevertheless, providing
techniques such as these that directly reason about the MDP model is an important
direction. Note that our API approach essentially ignores the underlying MDP
model, and simply interacts with the MDP simulator as a black box.
An interesting research direction is to consider principled approximations of these
techniques that can discover good policies in more difficult domains. This has been
considered by Guestrin et al. [20], where a class-based MDP and value function
representation was used to compute an approximate value function that could


generalize across different sets of objects. Promising empirical results were shown
in a multiagent tactical battle domain. Presently the class-based representation
does not support some of the representation features that are commonly found in
classical planning domains (e.g., relational facts such as on(a, b) that change over
time), and thus is not directly applicable in these contexts. However, extending
this work to richer representations is an interesting direction. Its ability to reason
globally about a domain may give it some advantages compared to API.
Our approach is closely related to work in RRL [13], a form of online API that
learns relational value-function approximations. Q-value functions are learned in
the form of relational decision trees (Q-trees) and are used to learn corresponding
policies (P-trees). The RRL results clearly demonstrate the difficulty of learning
value-function approximations in relational domains. Compared to P-trees, Q-trees
tend to generalize poorly and be much larger. RRL has not yet demonstrated
scalability to problems as complex as those considered here; previous RRL blocks
world experiments include relatively simple goals (see footnote 11), which lead to value functions
that are much less complex than the ones here. For this reason, we suspect that
RRL would have difficulty in the domains we consider precisely because of the value-function approximation step that we avoid; however, this needs to be experimentally
tested.
We note, however, that our API approach has the advantage of using an unconstrained simulator, whereas RRL learns from irreversible world experience
(pure reinforcement learning). By using a simulator, we are able to estimate the
Q-values for all actions at each training state, providing us with rich training data.
Without such a simulator, RRL is not able to directly estimate the Q-value for
each action in each training state; thus, RRL learns a Q-tree to provide estimates
of the Q-value information needed to learn the P-tree. In this way, value-function
learning serves a more critical role when a simulator is unavailable. We believe that
in many relational planning problems, it is possible to learn a model or simulator
from world experience; in this case, our API approach can be incorporated as the
planning component of RRL. Otherwise, finding ways to either avoid learning or to
more effectively learn relational value functions in RRL is an interesting research
direction.
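The way a simulator yields Q-value estimates for every action at a training state can be sketched as follows. This is a generic rollout estimator under stated assumptions, not the Improved-Trajectories procedure itself: simulate(state, action, rng) is a hypothetical simulator call returning a next state and a reward, policy maps a state to an action, and w and h play the roles of the sampling width and horizon used in the experiments.

```python
import random

def rollout_q(simulate, policy, state, action, w, h, gamma=1.0, rng=random):
    """Average discounted return over w rollouts of horizon h: take `action`, then follow `policy`."""
    total = 0.0
    for _ in range(w):
        s, r = simulate(state, action, rng)        # first step uses the action being evaluated
        ret, discount = r, gamma
        for _ in range(h - 1):                     # remaining steps follow the current policy
            s, r = simulate(s, policy(s), rng)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / w

def q_values(simulate, policy, state, actions, w, h, gamma=1.0, rng=random):
    """Estimated Q-value for every candidate action at one training state."""
    return {a: rollout_q(simulate, policy, state, a, w, h, gamma, rng)
            for a in actions}
```

In a deterministic domain a single rollout per action (w = 1) already gives the exact return of the policy, which is consistent with the sampling width of one used in the deterministic experiments.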
Researchers in classical planning have long studied techniques for learning to
improve planning performance. For a collection and survey of work on learning
for planning domains, see [39, 53]. Two primary approaches are to learn domain-specific control rules for guiding search-based planners (e.g., see [40, 48, 14, 26,
2, 1]), and, more closely related, to learn domain-specific reactive control policies
[29, 33, 51].
Regarding the latter, our work is novel in using API to iteratively improve stand-alone control policies. Regarding the former, in theory, search-based planners can
11. The most complex blocks world goal for RRL was to achieve on(A, B) in an n-block
environment. We consider blocks world goals that involve all n blocks.


be iteratively improved by continually adding newly learned control knowledge;
however, it can be difficult to avoid the utility problem [38], i.e., being swamped
by low-utility rules. Critically, our policy-language bias confronts this issue by
preferring simpler policies. Our learning approach is also not tied to having a base
planner (let alone tied to a single particular base planner), unlike most previous
work. Rather, we only require a domain simulator.
The ultimate goal of such systems is to allow for planning in large, difficult
problems that are beyond the reach of domain-independent planning technology.
Clearly, learning to achieve this goal requires some form of bootstrapping and almost
all previous systems have relied on the human for this purpose. By far, the most
common human bootstrapping approach is learning from small problems. Here,
the human provides a small problem distribution to the learner, by limiting the
number of objects (e.g., using two to five blocks in the blocks world), and control
knowledge is learned for the small problems. For this approach to work, the human
must ensure that the small distribution is such that good control knowledge for the
small problems is also good for the large target distribution. In contrast, our long-random-walk bootstrapping approach can be applied without human assistance
directly to large planning domains. However, as already pointed out, our goal of
performing well on the LRW distribution may not always correspond well with a
particular target problem distribution.
Our bootstrapping approach is similar in spirit to the bootstrapping framework
of learning from exercises [41, 42]. Here, the learner is provided with planning
problems, or exercises, in order of increasing difficulty. After learning on easier
problems, the learner is able to use its new knowledge, or skills, in order to
bootstrap learning on the harder problems. This work, however, has previously
relied on a human to provide the exercises, which typically requires insight into
the planning domain and the underlying form of control knowledge and planner.
Our work can be viewed as an automatic instantiation of learning from exercises,
specifically designed for learning LRW policies.
Our random-walk bootstrapping is most similar to the approach used in Micro-Hillary [17], a macro-learning system for problem solving. In that work, instead
of generating problems via random walks starting at an initial state, random
walks were generated backward from goal states. This approach assumes that
actions are invertible or that we are given a set of backward actions. When such
assumptions hold, the backward random-walk approach may be preferable when
we are provided with a goal distribution that does not match well with the goals
generated by forward random walks. Of course, in other cases forward random
walks may be preferable. Micro-Hillary was empirically tested in the N × N
sliding-puzzle domain; however, as discussed in that work, there remain challenges
for applying the system to more complex domains with parameterized actions and
recursive structure, such as familiar STRIPS domains. To the best of our knowledge,
the idea of learning from random walks has not been previously explored in the
context of STRIPS planning domains.
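The forward random-walk idea discussed above can be sketched as follows. This is only an illustration under stated assumptions: the simulator interface (`applicable_actions`, `step`) and the toy one-dimensional domain are hypothetical, and the entire end state of the walk is used as the goal, which is a simplification.

```python
import random

def random_walk_problem(initial_state, applicable_actions, step, n, rng):
    """Generate a planning problem by walking n random steps from initial_state.

    Returns (initial_state, goal_state); the goal is simply the state where
    the walk ends (a simplifying assumption for this sketch).
    """
    state = initial_state
    for _ in range(n):
        actions = applicable_actions(state)
        if not actions:              # dead end: stop the walk early
            break
        state = step(state, rng.choice(actions))
    return initial_state, state

# Hypothetical toy simulator: a token moving on a line of cells 0..9.
def applicable_actions(pos):
    return [a for a in (-1, +1) if 0 <= pos + a <= 9]

def step(pos, action):
    return pos + action

rng = random.Random(0)
start, goal = random_walk_problem(0, applicable_actions, step, n=5, rng=rng)
```

A backward variant in the style described for Micro-Hillary would start from a goal state and apply inverse actions instead, which requires that such inverse actions be available.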


Our API approach can be viewed as a type of reduction from planning or
reinforcement learning to classification learning. That is, we solve an MDP by
generating and solving a series of cost-sensitive classification problems. Recently,
there have been several other proposals for reducing reinforcement learning to
classification. The most closely related approach is by Lagoudakis and Parr [31],
who also proposed a form of classification-based API. The primary difference is
the form of the classification problem produced on each iteration. They generate
standard multi-class classification problems, where the training data consists of
states paired with either the best action (a positive example) or a nonbest action
(a negative example). In contrast, we generate cost-sensitive classification problems where
the training set consists of states paired with a cost vector that specifies the cost
of selecting each action. The use of cost-sensitive classification allows a learner to
make more informed tradeoffs when it is unable to find a rule that correctly selects
the best action for all of the training data. Bagnell et al. [5] introduced a closely
related algorithm for learning non-stationary policies in reinforcement learning. For
a specified horizon time h, their approach learns a sequence of h policies. At each
iteration, all policies are held fixed except for one, which is optimized by forming
a classification problem via policy rollout (see footnote 12). Finally, Langford and Zadrozny [32]
provide a formal reduction from reinforcement learning to classification, showing
that ε-accurate classification learning implies near-optimal reinforcement learning.
This approach uses an optimistic variant of sparse sampling to generate h
classification problems, one for each horizon time step.
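To make the contrast between the two kinds of training example concrete, the following minimal sketch builds both from the same per-action value estimates for a single state. The action names and numbers are hypothetical, and defining cost as the shortfall relative to the best estimate is an assumption made for illustration, not necessarily the exact cost used in either system.

```python
def best_action_label(q_values):
    """Multi-class style example: keep only the identity of the best action."""
    return max(q_values, key=q_values.get)

def cost_vector(q_values):
    """Cost-sensitive style example: one cost per action.

    Assumption for this sketch: cost = best estimated value - value of the action.
    """
    best = max(q_values.values())
    return {action: best - q for action, q in q_values.items()}

# Hypothetical rollout estimates for one state.
q = {"pickup": 4.0, "putdown": 3.9, "stack": 1.0}
label = best_action_label(q)
costs = cost_vector(q)
```

Here the label alone says only that "pickup" is best, whereas the cost vector also records that "putdown" is nearly as good and "stack" is much worse, which is the information a learner needs to choose between imperfect rules.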

18.8

Summary and Future Work


We introduced a new variant of API that learns policies directly, without representing approximate value functions. This allowed us to utilize a relational policy
language for learning compact policy representations. We also introduced a new
API bootstrapping technique for goal-based planning domains. Our experiments
show that the LRW-API algorithm, which combines these techniques, is able to
find good policies for a variety of relational MDPs corresponding to classical planning domains and their stochastic variants. We know of no previous MDP technique
that has been successfully applied to problems such as these.
Our experiments also pointed to a number of weaknesses of our current approach.
First, our bootstrapping technique, based on long random walks, does not always
correspond well to the problem distribution of interest. Investigating other automatic bootstrapping techniques is an interesting direction, related to the general
problems of exploration and reward shaping in reinforcement learning. Second, we

12. Here the initial state distribution is dictated by the policies at previous time steps,
which are held fixed. Likewise the actions selected along the rollout trajectories are dictated
by policies at future time steps, which are also held fixed.


have seen that limitations of our current policy language and learner are partly
responsible for some of the failures of our system. In such cases, we must either (1)
depend on the human to provide useful features to the system, or (2) extend the
policy language and develop more advanced learning techniques. Policy-language
extensions that we are considering include various extensions to the knowledge representation used to represent sets of objects in the domain (in particular, for route
finding in maps/grids), as well as non-reactive policies that incorporate search into
decision making.
As we consider ever more complex planning domains, it is inevitable that our
brute-force enumeration approach to learning policies from trajectories will not
scale. Presently our policy learner, as well as the entire API technique, makes no
attempt to use the definition of a domain when one is available. We believe that
developing a learner that can exploit this information to bias its search for good
policies is an important direction of future work. Recently, Gretton and Thiebaux
[19] have taken a step in this direction by using logical regression (based on a
domain model) to generate candidate rules for the learner. Developing tractable
variations of this approach is a promising research direction. In addition, exploring
other ways of incorporating a domain model into our approach and other model-blind approaches is critical. Ultimately, scalable AI planning systems will need
to combine experience with stronger forms of explicit reasoning.

Acknowledgments
We thank Lin Zhu for originally suggesting the idea of using random walks for
bootstrapping. This work was supported in part by NSF grants 9977981-IIS and
0093100-IIS.

References
[1] R. Aler, D. Borrajo, and P. Isasi. Using genetic programming to learn and
improve control knowledge. Artificial Intelligence, 141(1-2):29-56, 2002.
[2] J. Ambite, C. Knoblock, and S. Minton. Learning plan rewriting rules. In
Proceedings of the International Conference on Artificial Intelligence Planning
and Scheduling Systems, 2000.
[3] F. Bacchus. The AIPS '00 planning competition. AI Magazine, 22(3):57-62,
2001.
[4] F. Bacchus and F. Kabanza. Using temporal logics to express search control
knowledge for planning. Artificial Intelligence, 16:123-191, 2000.
[5] J. Bagnell, S. Kakade, A. Ng, and J. Schneider. Policy search by dynamic
programming. In Proceedings of Neural Information Processing Systems, 2003.


[6] R. Bellman. Dynamic Programming. Princeton University Press, Princeton,
NJ, 1957.
[7] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena
Scientific, Nashua, NH, 1996.
[8] C. Boutilier and R. Dearden. Approximating value trees in structured dynamic
programming. In Proceedings of the International Conference on Machine
Learning, 1996.
[9] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming
with factored representations. Artificial Intelligence, 121(1-2):49-107,
2000.
[10] C. Boutilier, R. Reiter, and B. Price. Symbolic dynamic programming for
first-order MDPs. In Proceedings of the International Joint Conference on
Artificial Intelligence, 2001.
[11] T. Dean and R. Givan. Model minimization in Markov decision processes. In
Proceedings of the National Conference on Artificial Intelligence, 1997.
[12] T. Dean, R. Givan, and S. Leach. Model reduction techniques for computing
approximately optimal solutions for Markov decision processes. In Proceedings
of the National Conference on Artificial Intelligence, 1997.
[13] S. Dzeroski, L. DeRaedt, and K. Driessens. Relational reinforcement learning.
Machine Learning, 43:7-52, 2001.
[14] T. Estlin and R. Mooney. Multi-strategy learning of search control for
partial-order planning. In Proceedings of the National Conference on Artificial
Intelligence, 1996.
[15] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy
language bias. In Proceedings of Neural Information Processing Systems, 2003.
[16] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a
policy language bias: Solving relational Markov decision processes. Journal
of Artificial Intelligence Research, to appear.
[17] L. Finkelstein and S. Markovitch. A selective macro-learning algorithm and
its application to the NxN sliding-tile puzzle. Journal of Artificial Intelligence
Research, 8:223-263, 1998.
[18] R. Givan, T. Dean, and M. Greig. Equivalence notions and model minimization
in Markov decision processes. Artificial Intelligence, 147(1-2):163-223,
2003.
[19] C. Gretton and S. Thiebaux. Exploiting first-order regression in inductive
policy selection. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2004.
[20] C. Guestrin, D. Koller, C. Gearhart, and N. Kanodia. Generalizing plans
to new environments in relational MDPs. In Proceedings of the International
Joint Conference on Artificial Intelligence, 2003.


[21] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient solution
algorithms for factored MDPs. Journal of Artificial Intelligence Research,
19:399-468, 2003.
[22] M. Harmon and L. Baird. Residual advantage learning applied to a differential
game. In Proceedings of Neural Information Processing Systems, 1995.
[23] J. Hoffmann, J. Porteous, and L. Sebastia. Ordered landmarks in planning.
Journal of Artificial Intelligence Research, 22:215-278, 2004.
[24] J. Hoffmann and B. Nebel. The FF planning system: Fast plan generation
through heuristic search. Journal of Artificial Intelligence Research,
14:263-302, 2001.
[25] R. Howard. Dynamic Programming and Markov Decision Processes. MIT
Press, Cambridge, MA, 1960.
[26] Y. Huang, B. Selman, and H. Kautz. Learning declarative control rules for
constraint-based planning. In International Conference on Machine Learning,
2000.
[27] M. Kearns, Y. Mansour, and A. Ng. A sparse sampling algorithm for near-optimal
planning in large Markov decision processes. Machine Learning,
49(2-3):193-208, 2002.
[28] K. Kersting, M. Van Otterlo, and L. DeRaedt. Bellman goes relational. In
Proceedings of the International Conference on Machine Learning, 2004.
[29] R. Khardon. Learning action strategies for planning domains. Artificial
Intelligence, 113(1-2):125-148, 1999.
[30] R. Khardon. Learning to take actions. Machine Learning, 35(1):57-90, 1999.
[31] M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging
modern classifiers. In Proceedings of the International Conference on
Machine Learning, 2003.
[32] J. Langford and B. Zadrozny. Reducing t-step reinforcement learning to
classification. hunch.net/jl/projects/reductions/RL to class/colt submission.ps,
2004.
[33] M. Martin and H. Geffner. Learning generalized policies in planning domains
using concept languages. In Proceedings of the International Conference on
Principles of Knowledge Representation and Reasoning, 2000.
[34] M. Mataric. Reward functions for accelerated learning. In Proceedings of the
International Conference on Machine Learning, 1994.
[35] D. McAllester. Observations on cognitive judgements. In Proceedings of the
National Conference on Artificial Intelligence, 1991.
[36] D. McAllester and R. Givan. Taxonomic syntax for first order inference.
Journal of the ACM, 40(2):246-283, 1993.
[37] A. McGovern, E. Moss, and A. Barto. Building a basic block instruction
scheduler using reinforcement learning and rollouts. Machine Learning,
49(2/3):141-160, 2002.
[38] S. Minton. Quantitative results concerning the utility of explanation-based
learning. In National Conference on Artificial Intelligence, 1988.
[39] S. Minton, editor. Machine Learning Methods for Planning. Morgan Kaufmann,
San Francisco, CA, 1993.
[40] S. Minton, J. Carbonell, C. A. Knoblock, D. R. Kuokka, O. Etzioni, and
Y. Gil. Explanation-based learning: A problem solving perspective. Artificial
Intelligence, 40:63-118, 1989.
[41] B. K. Natarajan. On learning from exercises. In Annual Workshop on
Computational Learning Theory, 1989.
[42] C. Reddy and P. Tadepalli. Learning goal-decomposition rules using exercises.
In Proceedings of the International Conference on Machine Learning, 1997.
[43] R. Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987.
[44] C. Sammut, S. Hurst, D. Kedzier, and D. Michie. Learning to fly. In
Proceedings of the International Conference on Machine Learning, 1992.
[45] G. Tesauro. Practical issues in temporal difference learning. Machine Learning,
8:257-277, 1992.
[46] G. Tesauro and G. Galperin. On-line policy improvement using monte-carlo
search. In Conference on Advances in Neural Information Processing, 1996.
[47] J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale DP.
Machine Learning, 22:59-94, 1996.
[48] M. Veloso, J. Carbonell, A. Perez, D. Borrajo, E. Fink, and J. Blythe.
Integrating planning and learning: The PRODIGY architecture. Journal of
Experimental and Theoretical AI, 7(1):81-120, 1995.
[49] G. Wu, E. Chong, and R. Givan. Congestion control via online sampling. In
Infocom, 2001.
[50] X. Yan, P. Diaconis, P. Rusmevichientong, and B. Van Roy. Solitaire: Man
versus machine. In Proceedings of Neural Information Processing Systems,
2004.
[51] S. Yoon, A. Fern, and R. Givan. Inductive policy selection for first-order
MDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2002.
[52] H. Younes. Extending PDDL to model stochastic decision processes. In Proceedings
of the International Conference on Automated Planning and Scheduling
Workshop on PDDL, 2003.
[53] T. Zimmerman and S. Kambhampati. Learning-assisted automated planning:
Looking back, taking stock, going forward. AI Magazine, 24(2):73-96, 2003.

19 Statistical Relational Learning for Natural Language Information Extraction

Razvan C. Bunescu and Raymond J. Mooney

Traditionally, information extraction (IE) systems treat separate potential extractions as independent. There are, however, cases when modeling the influences between different potential extractions could improve overall accuracy. In this chapter, we use the framework of relational Markov networks (RMNs) in order to model
several specific relationships between candidate extractions. Inference and learning using this graphical model allow for collective information extraction in a
way that exploits the mutual influence between possible extractions. Experiments
on learning to extract protein names from biomedical abstracts demonstrate the
advantage of this approach over existing IE methods.

19.1

Introduction
Understanding natural language presents many challenging problems that lend
themselves to statistical relational learning (SRL). Historically, both logical and
probabilistic methods have found wide application in natural language processing
(NLP). NLP inevitably involves reasoning about an arbitrary number of entities
(people, places, and things) that have an unbounded set of complex relationships
between them. Representing and reasoning about unbounded sets of entities and
relations has generally been considered a strength of predicate logic. However, NLP
also requires integrating uncertain evidence from a variety of sources in order to resolve numerous syntactic and semantic ambiguities. Effectively integrating multiple
sources of uncertain evidence has generally been considered a strength of Bayesian
probabilistic methods and graphical models. Consequently, NLP problems are particularly suited for SRL methods that combine the strengths of first-order predicate
logic and probabilistic graphical models. In this chapter, we review our recent work
[4] on using relational Markov networks (RMNs) [30] for information extraction,


the problem of identifying phrases in natural language text that refer to specific
types of entities [7]. We use the expressive power of RMNs to represent and reason
about several specific relationships between candidate entities and thereby collectively identify the appropriate set of phrases to extract. We present experiments
on learning to extract protein names from biomedical text that demonstrate the
advantage of this approach over existing information extraction methods.
The remainder of the chapter is organized as follows. In section 19.2, we review
the history of logical and probabilistic approaches to NLP, and discuss the unique
suitability of SRL for NLP. Section 19.3 introduces the problem of information
extraction, followed by section 19.4, where we summarize our work on collective information extraction using RMNs. In section 19.5, we examine challenging problems
for future research on SRL for NLP. In section 19.6, we present our conclusions.

19.2

Background on Natural Language Processing


Early research in NLP focused on symbolic techniques in which the knowledge required for understanding and generating language consisted of manually written
production rules, semantic networks, or axioms in predicate logic [1]. The semantic analysis of language was a particular focus of NLP research in the 1970s, with
researchers exploring tasks ranging from responding to commands and answering
questions in a microworld [33] to answering database queries [34] and understanding
short stories [26]. These early systems could perform impressive semantic interpretation and inference when understanding particular sentences or stories; however,
they tended to require tedious amounts of application-specific knowledge engineering and were therefore quite brittle and not easily extended to new texts or new
applications.
Disenchantment with the knowledge-engineering requirements and brittleness of
symbolic, manually developed NLP systems grew. Meanwhile, researchers in speech
recognition started to obtain promising results using statistical methods trained on
large annotated corpora [16]. Eventually, statistical methods came to dominate
speech recognition [15], and this development began to motivate the application of
similar methods to other aspects of NLP, such as part-of-speech (POS) tagging [8].
During the early 1990s, research in computational linguistics underwent a dramatic paradigm shift. Statistical learning methods that automatically acquire
knowledge for language processing from empirical data largely supplanted systems
based on human knowledge engineering [14, 19]. However, in order to avoid the difficult problems of detailed semantic interpretation, NLP research focused on building
robust systems for simpler tasks, such as POS tagging, syntactic parsing, word-sense
disambiguation, and information extraction of specific types of entities.
Many of the methods used in statistical NLP are fundamentally SRL techniques
since they perform some form of collective classification on unbounded length
strings. Strings can be seen as simple instances of relational data where the
individual items are characters, words, or tokens and the single relation "after"


holds between adjacent items. Many NLP tasks, such as POS tagging, phrase
chunking [24], and information extraction (e.g., named entity tagging), can be
viewed as sequence labeling problems in which each word is assigned one of a
small number of class labels. The label of each word typically depends on the labels
of adjacent words in the sentence and collective inference must be performed to
assign the overall most probable combination of labels to all of the words in the
sentence. Statistical sequence models such as hidden Markov models (HMMs) [22]
or conditional random fields (CRFs) [18] are used to model the data and some form
of the Viterbi dynamic programming algorithm [31] is used to efficiently perform the
collective classification. However, in order to develop systems that accurately and
robustly perform natural language analysis, we believe that more advanced SRL
methods are needed. In this chapter, we explore the application of an alternative
SRL method to the natural language task of information extraction. We introduce
the task in the following section and then present our recent SRL approach.
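The collective labeling step performed by Viterbi decoding can be illustrated with a small sketch for an HMM-style model. The two labels ("P" for protein, "O" for other), the three-word sentence, and all probabilities below are invented for illustration; the function assumes a non-empty sentence and that every word has an emission score under every label.

```python
import math

def viterbi(obs, labels, log_start, log_trans, log_emit):
    """Return the most probable label sequence for obs under an HMM-style model."""
    # best[t][y]: log-score of the best label path that ends in label y at position t
    best = [{y: log_start[y] + log_emit[y][obs[0]] for y in labels}]
    back = []  # back[t-1][y]: best predecessor of label y at position t
    for t in range(1, len(obs)):
        scores, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + log_trans[p][y])
            scores[y] = best[-1][prev] + log_trans[prev][y] + log_emit[y][obs[t]]
            ptr[y] = prev
        best.append(scores)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def logs(table):
    """Convert a dict (or dict of dicts) of probabilities to log-probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v) for k, v in table.items()}

labels = ["P", "O"]
log_start = logs({"P": 0.2, "O": 0.8})
log_trans = logs({"P": {"P": 0.5, "O": 0.5}, "O": {"P": 0.2, "O": 0.8}})
log_emit = logs({"P": {"the": 0.05, "eNOS": 0.9, "binds": 0.05},
                 "O": {"the": 0.5, "eNOS": 0.05, "binds": 0.45}})
path = viterbi(["the", "eNOS", "binds"], labels, log_start, log_trans, log_emit)
```

The running time is linear in sentence length and quadratic in the number of labels, which is what makes exact collective inference practical for chain-structured models.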

19.3

Information Extraction
Information extraction, locating references to specific types of items in natural
language documents, is an important task with many practical applications. Typical examples include identifying various named entities such as names of people, companies, and locations. In this chapter, we consider the task of identifying names of human proteins in abstracts of biomedical journal articles. Figure 19.1 shows part of a sample abstract highlighting the protein names to be
identified. This task is an important part of mining the scientific literature in
order to build structured databases of existing biological knowledge. In particular, by mining 753,459 abstracts on the human organism from the Medline
repository (http://www.ncbi.nlm.nih.gov/entrez/) we have extracted a database
of 6580 interactions among 3737 human proteins. The details of this database have
been published in the biological literature [23] and it is available on the web at
http://bioinformatics.icmb.utexas.edu/idserve.

Production of nitric oxide ( NO ) in endothelial cells is regulated by direct
interactions of endothelial nitric oxide synthase ( eNOS ) with effector proteins such
as Ca2+ calmodulin . Here we have ... identified a novel 34 kDa protein , termed
NOSIP ( eNOS interaction protein ) , which avidly binds to the carboxyl terminal
region of the eNOS oxygenase domain .

Figure 19.1 Medline abstract with all protein names emphasized.

In the simplest case, protein name identification can be treated as a sequence
labeling problem in which each word (token) in the text is classified as either part
of a protein name or not part of a protein name. As long as protein names are


not immediately contiguous (a constraint consistently satisfied in the more than
1000 human-annotated abstracts we have examined), this labeling allows immediate
recovery of all substrings constituting protein names. However, in practice, a larger
set of word labels can result in more accurate extraction. In particular, we found
that five word labels gave the best empirical results by creating word classes with the most
easily captured regularities: Begin (the first word in a multiword name), End (the last
word in a multiword name), Inside (an internal word in a multiword name), Single
(a word corresponding to a single-word name), and Other (a word that is not part
of a name).
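The five-label scheme can be made concrete with a short sketch that converts annotated name spans into word labels and back. The function names and the example spans are hypothetical; spans are given as inclusive token index pairs.

```python
def encode(n_tokens, spans):
    """Map name spans (start, end), end inclusive, to one label per token."""
    labels = ["Other"] * n_tokens
    for start, end in spans:
        if start == end:
            labels[start] = "Single"
        else:
            labels[start] = "Begin"
            labels[end] = "End"
            for i in range(start + 1, end):
                labels[i] = "Inside"
    return labels

def decode(labels):
    """Recover name spans from a label sequence."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "Single":
            spans.append((i, i))
        elif label == "Begin":
            start = i
        elif label == "End" and start is not None:
            spans.append((start, i))
            start = None
    return spans

tokens = "endothelial nitric oxide synthase ( eNOS )".split()
spans = [(0, 3), (5, 5)]          # illustrative annotation
labels = encode(len(tokens), spans)
```

Decoding the labels returns the original spans, so nothing is lost relative to the two-label scheme, while the learner gets separate classes for name boundaries and interiors.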
In a recent follow-up to previously published experiments comparing a wide variety of information extraction learning methods (including HMM, support vector
machines (SVMs), MaxEnt, and rule-based methods) on the task of tagging references to human proteins in Medline abstracts [6], CRFs were found to outperform
competing techniques [23]. However, although CRFs capture the dependence between the labels of adjacent words, they do not adequately capture long-distance
dependencies between potential extractions in different parts of a document. For
example, in our protein-tagging task, repeated references to the same protein are
common. If the context surrounding one occurrence of a phrase is very indicative of
it being a protein, then this should also influence the tagging of another occurrence
of the same phrase in a different context which is not typical of protein references.
Consequently, more complex SRL methods that can capture such dependencies
may result in more accurate information extraction. In the following section we
show how RMNs can be used to model long-distance dependencies in the context
of information extraction (for two recent alternative approaches, see the skip-chain
CRFs introduced in [28] and the Gibbs sampling method from [11]).

19.4

Collective Information Extraction with RMNs


In this section, we present our research on using RMNs to collectively extract all of
the entities in a particular document. In particular, we have tested our approach on
the difficult problem of identifying names of human proteins in biomedical journal
abstracts. Unlike proteins in some other organisms (e.g., yeast), human proteins
have no standardized nomenclature, making them particularly difficult to recognize
among the variety of entity types referenced in biomedical text. One important
source of potential evidence is the correlations between the labels of repeated
phrases inside a document, as well as between acronyms and their corresponding
long form. In both cases, the mentioned phrases tend to have the same entity
label. For example, figure 19.2 shows part of an abstract from Medline, an online
database of biomedical articles. In this abstract, the protein referenced by "rpL22"
is first introduced by its long name, "ribosomal protein L22," followed by the short
name, "rpL22," within parentheses. The presence of the word "protein" is a very good
indicator that the entire phrase "ribosomal protein L22" is a protein name. Also,
"rpL22" is an acronym of "ribosomal protein L22," which increases the likelihood that


it too is a protein name. The same name rpL22 occurs later in the abstract in
contexts which do not indicate so clearly the entity type; however, we can use the
fact that repetitions of the same name tend to have the same type inside the same
document.

The control of human ribosomal protein L22 (rpL22 ) to enter into the nucleolus
and its ability to be assembled into the ribosome is regulated by its sequence .
The nuclear import of rpL22 depends on a classical nuclear localization signal
of four lysines at positions 13 - 16 . RpL22 normally enters the nucleolus via
a compulsory sequence of KKYLKK ( I - domain , positions 88 - 93 ) ... Once it
reaches the nucleolus , the question of whether rpL22 is assembled into the
ribosome depends upon the presence of the N - domain .

Figure 19.2 Medline abstract with all protein names emphasized.

The capitalization pattern of the name itself is another useful indicator; nevertheless, it is not sufficient by itself, as similar patterns are also used for other types
of biological entities such as cell types or amino acids. Therefore, correlations between the labels of repeated phrases, or between acronyms and their long form, can
provide additional useful information. Our intuition is that a method that could use
this kind of information would show an increase in performance, especially when
doing extraction from biomedical literature, where phenomena like repetitions and
acronyms are pervasive. This type of document-level knowledge can be captured
using relational Markov networks (RMNs), a version of undirected graphical models
which have already been successfully used to improve the classification of hyperlinked webpages [30].
The rest of this section is organized as follows. In sections 19.4.1 and 19.4.2 we
describe the input to our named entity extractor in terms of a set of candidate
entities and their features. Subsequent sections introduce the RMN framework for
entity recognition (representation, inference, and learning), ending with experimental results in section 19.4.8.
19.4.1

Candidate Entities

Typically, as described in section 19.3, entity recognition has been approached
by classifying individual tokens. We [4] considered a different approach, where
candidate phrases in a document are classified according to the desired set of entity
types. An advantage of using phrase classification is that it allows for phrase-based
features such as the text of the candidate phrase, or its similarity to dictionary
entries. However, phrase classification requires an initial set of candidate entity
phrases. Considering as candidate entities all contiguous word sequences from a
document would lead to a quadratic number of phrases, which would adversely
affect the time complexity of the extraction algorithm. For our task, there are


various heuristics that can significantly reduce the size of the candidate set; two of
these are listed below:
H1: In general, named entities have limited length. Therefore, one simple way of
creating the set of candidate phrases is to compute the maximum length of all
annotated entities in the training set, and then consider as candidates all word
sequences whose length is up to this maximum length. This is also the approach
followed in SRV [12].
H2: In the task of extracting protein names from Medline abstracts, we noticed
that, like most entity names, almost all proteins in our data are base noun
phrases (NPs) or parts of them. Therefore, such substrings are used to determine
candidate entities. To avoid missing options, we adopt a very broad definition of
base NP: a maximal contiguous sequence of tokens with their POS restricted to
nouns, gerund verbs, past participle verbs, adjectives, numbers, and dashes. The
complete set of POS tags is {JJ, VBN, VBG, POS, NN, NNS, NNP, NNPS, CD,
-} (using the treebank notation [20]). Also, the last word (the head) of a base
NP is constrained to be either a noun or a number. Candidate extractions then
consist of base NPs, together with all their contiguous subsequences headed by a
noun or number.
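The two heuristics can be sketched together as a candidate generator over a POS-tagged sentence. Several details are assumptions made for this sketch rather than taken from the text: the "-" entry stands for the dash tag, the head tags are taken to be NN, NNS, NNP, NNPS, and CD, a maximal run is trimmed on the right until it ends in a head tag, and the optional `max_len` argument plays the role of the length bound from H1. The example tags are hand-assigned.

```python
BASE_NP_TAGS = {"JJ", "VBN", "VBG", "POS", "NN", "NNS", "NNP", "NNPS", "CD", "-"}
HEAD_TAGS = {"NN", "NNS", "NNP", "NNPS", "CD"}   # assumption: "a noun or a number"

def base_np_runs(tags):
    """Maximal runs of allowed tags, trimmed so that the last token is a valid head."""
    runs, start = [], None
    for i, tag in enumerate(tags + [None]):       # the sentinel closes a final run
        if tag in BASE_NP_TAGS:
            if start is None:
                start = i
        elif start is not None:
            end = i - 1
            while end >= start and tags[end] not in HEAD_TAGS:
                end -= 1
            if end >= start:
                runs.append((start, end))
            start = None
    return runs

def candidates(tags, max_len=None):
    """Base NPs plus all contiguous subsequences headed by a noun or number (H2),
    optionally limited to max_len tokens (H1). Spans are inclusive index pairs."""
    out = []
    for start, end in base_np_runs(tags):
        for i in range(start, end + 1):
            for j in range(i, end + 1):
                short_enough = max_len is None or j - i + 1 <= max_len
                if tags[j] in HEAD_TAGS and short_enough:
                    out.append((i, j))
    return out

# "a novel 34 kDa protein ," with illustrative tags
tags = ["DT", "JJ", "CD", "NN", "NN", ","]
cands = candidates(tags)
```

For this fragment the single base NP covers tokens 1 to 4 and yields nine candidates, far fewer than enumerating every contiguous word sequence of a long abstract.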
19.4.2

Entity Features

The set of features associated with each candidate is based on the feature templates
introduced in [9], used there for training a reranking algorithm on the extractions
returned by a maximum-entropy tagger. Many of these features use the concept
of word type, which allows a different form of token generalization than POS tags.
The short type of a word is created by replacing any maximal contiguous sequences
of capital letters with "A", of lowercase letters with "a", and of digits with "0". For
example, the word TGF-1 would be mapped to type A-0.
Consequently, each token position i in a candidate extraction provides three types
of information: the word itself w_i, its POS tag t_i, and its short type s_i. The full
set of feature types is listed in table 19.1, where we consider a generic candidate
extraction as a sequence of n + 1 words w_0 w_1 ... w_n.
Each feature template instantiates numerous features. For example, the candidate
extraction "HDAC1 enzyme" has the headword HD=enzyme, the short type ST=A0_a,
the prefixes PF=A0 and PF=A0_a, and the suffixes SF=a and SF=A0_a. All other
features depend on the left or right context of the entity. Feature values that occur
fewer than three times in the training data are filtered out.
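The short-type mapping and the context-independent features of the example above can be reproduced with a few lines of code. This is a sketch: the function names are hypothetical, only ASCII letters and digits are handled, and joining the short types of consecutive words with an underscore is an assumption about the notation.

```python
import re

def short_type(word):
    """Collapse runs of capitals to 'A', lowercase letters to 'a', digits to '0'."""
    def collapse(match):
        ch = match.group(0)[0]
        return "A" if ch.isupper() else "a" if ch.islower() else "0"
    return re.sub(r"[A-Z]+|[a-z]+|[0-9]+", collapse, word)

def entity_features(words):
    """Headword, short type, and short-type prefixes/suffixes of a candidate."""
    types = [short_type(w) for w in words]
    feats = {"HD=" + words[-1], "ST=" + "_".join(types)}
    for k in range(1, len(types) + 1):
        feats.add("PF=" + "_".join(types[:k]))    # first k short types
        feats.add("SF=" + "_".join(types[-k:]))   # last k short types
    return feats

feats = entity_features(["HDAC1", "enzyme"])
```

Features that depend on the left and right context (bigrams, trigrams, neighboring POS tags) would be built in the same way from the tokens surrounding the candidate.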
19.4.3

The RMN Framework for Entity Recognition

Given a collection of documents D, we associate with each document d ∈ D a set
of candidate entities d.E, in our case a restricted set of token sequences from the
document as given by H2 (section 19.4.1). Each entity e ∈ d.E is characterized by a

19.4

Collective Information Extraction with RMNs


Table 19.1

541

Feature templates

Description

Feature Template

Description

Feature Template

Text / head

w0 w1 ...wn / wn

Short type

s0 s1 ...sn

Bigram left
(4 bigrams)

z1 z0
where z {w, s}

Bigram right
(4 bigrams)

zn zn+1
where z {w, s}

Trigram left
(8 trigrams)

z2 z1 z0
where z {w, s}

Trigram right
(8 trigrams)

zn zn+1 zn+2
where z {w, s}

POS left

t1

POS right

tn+1

Prex
(n+1 prexes)

s0 s0 s1 ...
s0 s1 ...sn+1

Sux
(n+1 suxes)

sn sn1 sn
s0 s1 ...sn+1

...

predened set of Boolean attributes e.F section 19.4.2, the same for all candidate
entities. One particular attribute is e.label which is set to 1 if e is considered a valid
extraction, and 0 otherwise. In this document model, labels are the only hidden
variables, and the inference procedure will try to nd a most probable assignment
of values to labels, given the current model parameters and the values of all other
variables.
Each document is associated with a factor graph [17], which is a bipartite graph
containing two types of nodes:
Variable nodes correspond directly to the labels of all candidate entities in the
document.
Potential nodes model the correlations between two or more entity attributes.
For each such correlation, a potential node is created that is linked to all variable
nodes involved. This is equivalent to creating a clique in the corresponding Markov
random field.
The types of correlations captured by factor graphs (see figure 19.4 for some
examples) are specified by matching clique templates against the entire set of
candidate entities d.E. A clique template is a procedure that finds all subsets of
entities satisfying a given constraint, after which, for each entity subset, it connects
through a potential node all the variable nodes corresponding to a selected set of
attributes. Formally, there is a set of clique templates C, with each template $c \in C$
specified by:
1. A matching operator $M_c$ for selecting subsets of entities, $M_c(E) \subseteq 2^E$.
2. A selected set of features $S_c = \langle X_c, Y_c \rangle$, the same for all subsets of entities
returned by the matching operator. $X_c$ denotes the observed features, while $Y_c$
refers to the hidden labels.
3. A clique potential $\phi_c$ which gives the compatibility of each possible configuration
of values for the features in $S_c$, s.t. $\phi_c(s) \geq 0$, $\forall s \in S_c$.

Given a set E of nodes, $M_c(E)$ consists of subsets of entities whose attribute
nodes $S_c$ are to be connected in a clique. In previous applications of RMNs, the
selected subsets of entities for a given template have the same size; however, some
of our clique templates may match a variable number of entities. The set $S_c$ may
contain the same attribute from different entities. Usually, for each entity in a
matching set, its label is included in $S_c$. All these will be illustrated with examples
in sections 19.4.4 and 19.4.5, where the clique templates used in our model are
described in detail.
Depending on the number of hidden labels $Y_c$ selected by a clique c, we define
two categories of clique templates:
Local templates are all templates $c \in C$ for which $|Y_c| = 1$. They model the
correlations between an entity's observed features and its label.
Global templates are all templates $c \in C$ for which $|Y_c| > 1$. They capture
influences between multiple entities from the same document.
After the factor graph model for a document d has been completed with potential
nodes from all templates, the probability distribution over the random field of
hidden entity labels d.Y given the observed features d.X is given by the Gibbs
distribution:

$$P(d.Y \mid d.X) = \frac{1}{Z(d.X)} \prod_{c \in C} \prod_{G \in M_c(d.E)} \phi_c(G.X_c, G.Y_c), \tag{19.1}$$

where Z(d.X) is the normalizing partition function:

$$Z(d.X) = \sum_{Y} \prod_{c \in C} \prod_{G \in M_c(d.E)} \phi_c(G.X_c, G.Y_c). \tag{19.2}$$
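To make the Gibbs distribution concrete, the following sketch enumerates every binary labeling of a tiny factor graph, multiplies the potentials, and normalizes by the partition function. It is a brute-force illustration only, feasible for a handful of labels; the representation of a factor as a pair (label indices, potential function) and all numeric values are assumptions of this example, with the observed features folded into each potential.

```python
from itertools import product


def gibbs_distribution(n_labels, factors):
    """Return P(Y | X) for all binary labelings by explicit enumeration.

    factors: list of (indices, phi), where phi maps the tuple of labels at
    those indices to a nonnegative potential value.
    """
    scores = {}
    for y in product((0, 1), repeat=n_labels):
        score = 1.0
        for indices, phi in factors:
            score *= phi(tuple(y[i] for i in indices))
        scores[y] = score
    z = sum(scores.values())  # partition function Z(d.X)
    return {y: s / z for y, s in scores.items()}


# Two overlapping candidates: one local potential each (made-up values),
# plus a hard potential that forbids labeling both as valid extractions.
factors = [
    ((0,), lambda y: (1.0, 3.0)[y[0]]),
    ((1,), lambda y: (1.0, 2.0)[y[0]]),
    ((0, 1), lambda y: 0.0 if y == (1, 1) else 1.0),
]
dist = gibbs_distribution(2, factors)
best = max(dist, key=dist.get)
print(dist, best)
```

The same enumeration also yields the most probable assignment, which is why exact inference by brute force is only practical for very small graphs.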

There are two problems that need to be addressed when working with RMNs:
1. Inference. Usually, two types of quantities are needed from an RMN model:
The marginal distribution for a hidden variable, or for a subset of hidden
variables in the graphical model.
The most probable assignment of values to all hidden variables in the model.
2. Learning. As the structure of the RMN model is already defined by its clique
templates, learning refers to finding the clique potentials that maximize the
likelihood over the training data. Inference is usually performed multiple times
during the learning algorithm, which means that an accurate, fast inference
procedure is doubly important.
The actual algorithms used for inference and learning will be described in
sections 19.4.6 and 19.4.7 respectively.

19.4.4

Local Clique Templates

As described in the previous section, the role of local clique templates is to model
correlations between an entity's observed features (see table 19.1) and its label. For
each binary feature f we introduce a local template $LT_f$. Given a candidate entity
e, with the observed feature e.f = 1, the template $LT_f$ creates a potential node
linked to the variable node e.label. As an example, figure 19.3 shows that part of the
factor graph which is generated around the entity label for HDAC1 enzyme, with
potential nodes for the head feature (HD), prefix features (PF), and suffix features
(SF). Variable nodes are shown as empty circles and potential nodes are shown
as black squares. The potential $\phi_f$ associated with all potential nodes created by
template $LT_f$ consists of a 1 × 2 table, as e.f is known to be 1, and e.label
has cardinality 2 (assuming only one entity type is to be extracted, we need only
two values for the label attribute).

Figure 19.3  Factor graph for local templates. The variable node e.label is linked to
potential nodes for HD=enzyme, PF=A0, PF=A0_a, SF=a, SF=A0_a, and so on.

19.4.5

Global Clique Templates

Global clique templates enable us to model hypothesized influences between entities
from the same document. They create potential nodes connected to the label nodes
of two or more entities. In our experiments we use three global templates:
Overlap template (OT): No two entity names overlap in the text; i.e., if the
span of one entity is $[s_1, e_1]$ and the span of another entity is $[s_2, e_2]$, and
$s_1 \leq s_2$, then $e_1 < s_2$.
Repeat template (RT): If multiple entities in the same document are repetitions
of the same name, their labels tend to have the same value (i.e., most of them
are protein names, or most of them are not protein names). In section 19.4.5.2 we
discuss situations in which repetitions of the same protein name are not tagged
as proteins, and design an approach to handle this.
Acronym template (AT): It is common convention that a protein is first
introduced by its long name, immediately followed by its short form (acronym)
in parentheses.

In figure 19.4 we show the factor graphs created by these global templates, each of
which is explained in the following sections.

Figure 19.4  Factor graphs for global templates: (a) overlap factor graph, (b) repeat
factor graph, (c) acronym factor graph.

19.4.5.1

The Overlap Template

The definition of a candidate extraction from section 19.4.1 leads to many overlapping
entities. For example, glutathione S - transferase is a base NP, and it generates five
candidate extractions: glutathione, glutathione S, glutathione S - transferase, S - transferase,
and transferase. If glutathione S - transferase has label-value 1, the other four entities
should all have label-value 0, because they overlap with it.
This type of constraint is enforced by the overlap template by creating a potential
node for each pair of overlapping entities and connecting it to their label nodes,
as shown in figure 19.4(a). To avoid clutter, all entities in this and subsequent
factor graphs stand for their corresponding labels. The potential function $\phi_{OT}$ is
manually set so that at most one of the overlapping entities can have label-value 1,
as illustrated in table 19.2.
Table 19.2  Overlap potential

$\phi_{OT}$         e1.label = 0    e1.label = 1
e2.label = 0    1               1
e2.label = 1    1               0

Continuing with the previous example, because glutathione S and S - transferase are
two overlapping entities, the factor graph model will contain an overlap potential
node connected to the label nodes of these two entities.
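The overlap template's matching step amounts to enumerating all pairs of candidates whose spans intersect. The sketch below assumes inclusive token-index spans and uses hypothetical function names; the value 1.0 for allowed label configurations is an assumption of this example, the essential point being that the configuration (1, 1) receives potential 0.

```python
from itertools import combinations


def overlaps(a, b):
    """Inclusive spans [s, e] overlap unless one ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_pairs(spans):
    """Index pairs of candidates that would each receive an overlap potential node."""
    return [(i, j) for i, j in combinations(range(len(spans)), 2)
            if overlaps(spans[i], spans[j])]


def overlap_potential(label1, label2):
    # Hard constraint: two overlapping candidates cannot both be extractions.
    return 0.0 if label1 == 1 and label2 == 1 else 1.0


# Tokens: glutathione(0) S(1) -(2) transferase(3); the five candidates from the text.
spans = [(0, 0), (0, 1), (0, 3), (1, 3), (3, 3)]
pairs = overlap_pairs(spans)
print(pairs)
```

Here the pair (1, 3) corresponds to glutathione S and S - transferase, the two entities singled out in the example above.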

19.4.5.2

The Repeat Template

We could specify the potential for the repeat template in a 2 × 2 table, this time
leaving the table entries to be learned, given that assigning the same label to
repetitions is not a hard constraint. However, we can do better by noting that
the vast majority of cases where a repeated protein name is not also tagged as a
protein happens when it is part of a larger phrase that is tagged. For example,
HDAC1 enzyme is a protein name, therefore HDAC1 is not tagged in this phrase,
even though it may have been tagged previously in the abstract where it was not
followed by enzyme. We need a potential that allows two entities with the same
text to have different labels if the entity with label-value 0 is inside another entity
with label-value 1. But a candidate entity may be inside more than one including
entity, and the number of including entities may vary from one candidate extraction
to another. Using the example from section 19.4.5.1, the candidate entity glutathione
is included in two other entities: glutathione S and glutathione S - transferase.
In order to instantiate potentials over a variable number of label nodes, we
introduce a logical OR clique template that matches a variable number of entities.
When this template matches a subset of entities $e_1, e_2, \ldots, e_n$, it will create an
auxiliary OR entity $e_{OR}$, with a single attribute $e_{OR}.label$. The potential function
$\phi_{OR}$ is manually set so that it assigns a nonzero potential only when
$e_{OR}.label = e_1.label \lor e_2.label \lor \ldots \lor e_n.label$. The potential nodes are only
created as needed, e.g., when the auxiliary OR entity is required by repeat and
acronym clique templates.
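A minimal sketch of the OR potential follows. The function name and the choice of 1.0 as the nonzero value are assumptions for illustration; the text only requires that the potential be nonzero exactly when the auxiliary label equals the disjunction of the matched labels.

```python
def or_potential(or_label, labels):
    """Nonzero only when or_label equals the logical OR of the entity labels."""
    return 1.0 if or_label == int(any(labels)) else 0.0


# glutathione is included in two entities; if either is labeled 1, the OR label must be 1.
print(or_potential(1, [0, 1]), or_potential(0, [0, 1]), or_potential(0, [0, 0]))
```

Because the potential accepts any number of labels, one definition serves every instantiation of the template, whatever the number of including entities.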
Figure 19.4(b) shows the factor graph for a sample instantiation of the repeat
template using the OR template. Here, u and v represent two same-text entities,
$u_1, u_2, \ldots, u_n$ are all entities that include u, and $v_1, v_2, \ldots, v_m$ are entities that
include v. The potential function $\phi_{RT}$ can either be manually preset to prohibit
unlikely label configurations, or it can be learned to represent an appropriate soft
constraint. In our experiments, it was learned since this gave slightly better performance.
Following the previous example, suppose that the word glutathione occurs inside
two base NPs in the same document, glutathione S - transferase and glutathione
antioxidant system. Then the first occurrence of glutathione will be associated with
the entity u, and correspondingly its including entities will be $u_1$ = glutathione S and
$u_2$ = glutathione S - transferase. Similarly, the second occurrence of glutathione will be
associated with the entity v, with the corresponding including entities $v_1$ = glutathione
antioxidant and $v_2$ = glutathione antioxidant system.
19.4.5.3

The Acronym Template

One approach to the acronym template would be to use an extant algorithm for
identifying acronyms and their long forms in a document, and then define a potential
function that would favor label configurations in which both the acronym and its
definition have the same label. One such algorithm is described by Schwartz and
Hearst [27], achieving a precision of 96% at a recall rate of 82%. However, because
this algorithm would miss a significant number of acronyms, we have decided to
implement a softer version as follows: detect all situations in which a single word is
enclosed between parentheses, such that the word length is at least 2 and it begins
with a letter. Let v denote the corresponding entity. Let $u_1, u_2, \ldots, u_n$ be all entities
that end exactly before the open parenthesis. If this is a situation in which v is an
acronym, then one of the entities $u_i$ is its corresponding long form. Consequently,
we use a logical OR template to introduce the auxiliary entity $u_{OR}$, and connect it
to v's label node through an acronym potential $\phi_{AT}$, as illustrated in figure 19.4(c).
For example, consider the phrase the antioxidant superoxide dismutase - 1 ( SOD1 ).
SOD1 satisfies our criteria for acronyms, thus it will be associated with the entity v
in figure 19.4(c). The candidate long forms are $u_1$ = antioxidant superoxide dismutase - 1,
$u_2$ = superoxide dismutase - 1, and $u_3$ = dismutase - 1.
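The matching step of the softer acronym detection can be sketched as follows. Whitespace tokenization, inclusive token-index spans, and the function names are assumptions of this example; the set of candidate entity spans would come from the candidate generation step and is supplied by hand here.

```python
def acronym_candidates(tokens):
    """Indices of single words in parentheses with length >= 2 that start with a letter."""
    hits = []
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        if (tokens[i - 1] == "(" and tokens[i + 1] == ")"
                and len(word) >= 2 and word[0].isalpha()):
            hits.append(i)
    return hits


def long_form_candidates(entity_spans, acronym_index):
    """Candidate entities (inclusive spans) ending exactly before the open parenthesis."""
    end = acronym_index - 2  # token just before "("
    return [span for span in entity_spans if span[1] == end]


tokens = "the antioxidant superoxide dismutase - 1 ( SOD1 )".split()
acronyms = acronym_candidates(tokens)
# Hypothetical candidate spans; the first three match u1, u2, u3 from the text.
spans = [(1, 5), (2, 5), (3, 5), (2, 3)]
print(acronyms, long_form_candidates(spans, acronyms[0]))
```

The labels of the returned long-form candidates are the inputs of the OR template whose auxiliary label is then tied to the acronym's label.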
19.4.6

Inference in Factor Graphs

In our setting, given the clique potentials, the inference step for the factor graph
associated with a document involves computing the most probable assignment of
values to the hidden labels of all candidate entities:

$$\widehat{d.Y} = \arg\max_{d.Y} P(d.Y \mid d.X), \tag{19.3}$$

where P(d.Y | d.X) is defined as in (19.1). A brute-force approach is excluded,
since the number of possible label configurations is exponential in the number of
candidate entities. The sum-product algorithm [17] is a message-passing algorithm
that can be used for computing the marginal distribution over the label variables in
factor graphs without cycles, and with a minor change (replacing the sum operator
used for marginalization with a max operator) it can also be used for deriving the
most probable label assignment. In our case, in order to get an acyclic graph, we
would have to use local templates only. However, it has been observed that the
algorithm often converges in general factor graphs, and when it converges, it gives
a good approximation to the correct marginals. The algorithm works by altering
the belief at each label node by repeatedly passing messages between the node and
all potential nodes connected to it [17].
19.4.7

Learning Potentials in Factor Graphs

Following a maximum likelihood estimation, we shall use the log-linear representation
of potentials:

$$\phi_c(G.X_c, G.Y_c) = \exp\{w_c f_c(G.X_c, G.Y_c)\}. \tag{19.4}$$

Let w be the concatenated vector of all potential parameters $w_c$. One approach to
finding the maximum likelihood solution for w is to use a gradient-based method,
which requires computing the gradient of the log-likelihood with respect to potential
parameters $w_c$. It can be shown that this gradient is equal to the difference
between the empirical counts of $f_c$ and their expectation under the current set of
parameters w.

$$\nabla L(w, D) = \sum_{d \in D} f_c(d.X, d.Y) - \sum_{d \in D} \sum_{d.Y'} f_c(d.X, d.Y') P_w(d.Y' \mid d.X) \tag{19.5}$$
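The step from the log-linear potentials to this gradient can be spelled out as follows. This is a sketch under one notational assumption that is ours: $f_c(d.X, d.Y)$ is shorthand for the sum of $f_c(G.X_c, G.Y_c)$ over all matches $G \in M_c(d.E)$, and $Z_w$ is the partition function of (19.2) under parameters w.

```latex
\begin{align*}
\log P_w(d.Y \mid d.X)
  &= \sum_{c \in C} w_c\, f_c(d.X, d.Y) - \log Z_w(d.X), \\
\frac{\partial}{\partial w_c} \log Z_w(d.X)
  &= \frac{1}{Z_w(d.X)} \sum_{d.Y'} f_c(d.X, d.Y')
     \exp\Big\{ \sum_{c' \in C} w_{c'}\, f_{c'}(d.X, d.Y') \Big\} \\
  &= \sum_{d.Y'} f_c(d.X, d.Y')\, P_w(d.Y' \mid d.X).
\end{align*}
```

Differentiating the first line with respect to $w_c$ and summing over the documents gives the empirical counts minus the expected counts.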

The expectation in the second term is expensive to compute, since it requires
summing over all possible configurations of candidate entity labels from a given
document. To circumvent this complexity, we used the voted perceptron approach
[13], which can be seen as approximating the full expectation of $f_c$ with the $f_c$
counts for the most likely labeling under the current parameters w.

$$\nabla L(w, D) \approx \sum_{d \in D} f_c(d.X, d.Y) - \sum_{d \in D} f_c(d.X, \widehat{d.Y}_w) \tag{19.6}$$


The voted perceptron algorithm is detailed in table 19.3. At each step i in the
algorithm, inference is performed using the current parameters $w_i$, which results
in the most likely labeling $\widehat{d.Y}_i$. The parameters are then updated based on the
difference between the feature counts computed on the ideal labeling d.Y and those
computed on the current most likely labeling $\widehat{d.Y}_i$. The final set of parameters is
the average taken over the parameters at all steps i in the algorithm. In all our
experiments, the perceptron was run for fifty epochs, with a learning rate set at
0.01.

Table 19.3  The voted perceptron algorithm

Input: a set of documents D, number of epochs T, learning rate $\epsilon$.
set parameters $w_0 = 0$, counter i = 0
for t = 1...T
    for every document $d \in D$
        $\widehat{d.Y}_i = \arg\max_{d.Y'} P_{w_i}(d.Y' \mid d.X)$
        $w_{i+1} = w_i + \epsilon\, [f(d.X, d.Y) - f(d.X, \widehat{d.Y}_i)]$
        i = i + 1
Output: $w = \frac{1}{T|D|} \sum_i w_i$
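The training loop can be sketched as below, with inference and feature extraction passed in as functions. This is an illustration, not the authors' implementation: the function signatures, the toy feature map, and the toy inference routine are hypothetical, and the average is taken over the parameter vectors after each update, which is one reading of the averaging step.

```python
def averaged_perceptron(documents, features, infer, dim, epochs=50, rate=0.01):
    """documents: list of (x, y) pairs; features(x, y) -> list of dim floats;
    infer(w, x) -> most likely labeling under parameters w."""
    w = [0.0] * dim
    total = [0.0] * dim
    steps = 0
    for _ in range(epochs):
        for x, y in documents:
            y_hat = infer(w, x)
            f_gold, f_hat = features(x, y), features(x, y_hat)
            # Move toward the gold feature counts, away from the predicted ones.
            w = [wi + rate * (g - h) for wi, g, h in zip(w, f_gold, f_hat)]
            total = [t + wi for t, wi in zip(total, w)]
            steps += 1
    return [t / steps for t in total]


# Toy problem (hypothetical): one binary label per "document", one observed value x.
def toy_features(x, y):
    return [x * y, float(y)]


def toy_infer(w, x):
    # argmax over y in {0, 1} of w . f(x, y); f(x, 0) is the zero vector.
    return 1 if w[0] * x + w[1] > 0 else 0


docs = [(2.0, 1), (-1.0, 0)]
w_avg = averaged_perceptron(docs, toy_features, toy_infer, dim=2)
print(w_avg)
```

In the RMN setting, `infer` would be the approximate max-product inference of section 19.4.6 and `features` would sum the clique feature counts over a whole document.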
19.4.8

Experimental Results

We have tested the RMN approach on two data sets that have been hand-tagged for
human protein names. The first data set is Yapex,¹ which consists of 200 Medline
abstracts. The second data set is Aimed,² which consists of 225 Medline abstracts
we previously annotated for evaluating systems that extract both human proteins
and their interactions [6].

1. URL: www.sics.se/humle/projects/prothalt/
2. URL: ftp.cs.utexas.edu/mooney/bio-data/

We compared the performance of three systems:
LT-RMN is the RMN approach using local templates and the overlap template.
GLT-RMN is the full RMN approach, using all local and global templates.
CRF, which uses a CRF for labeling token sequences. We used the CRF implementation
from [21] with the set of tags and features employed by the maximum-entropy
tagger described in [6].
All Medline abstracts were tokenized and then POS-tagged using the Brill tagger [2].
Each extracted protein name in the test data was compared to the human-tagged
data, with the positions taken into account. Two extractions are considered a match
if they consist of the same character sequence in the same position in the text.
Results are shown in table 19.4, which presents the standard information extraction
metrics of average precision (percentage of extracted names that are correct),
recall (percentage of correct names that are extracted), and F-measure (harmonic
mean of precision and recall) using ten-fold cross-validation.
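The matching criterion and the three metrics can be sketched as follows. Representing an extraction as a (start, end, text) tuple is an assumption made for this example, and the sample data are invented.

```python
def evaluate(predicted, gold):
    """Precision, recall, and F-measure with exact position-and-text matching.

    predicted, gold: sets of (start, end, text) tuples.
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


gold = {(0, 5, "HDAC1"), (10, 14, "SOD1"), (20, 25, "TGF-1")}
predicted = {(0, 5, "HDAC1"), (30, 34, "SOD1")}  # right text, wrong position: no match
p, r, f = evaluate(predicted, gold)
print(p, r, f)
```

The second predicted name has the correct text but a different position, so it counts as an error under the matching rule above.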
Table 19.4  Information extraction performance on two human protein corpora

            Yapex                             Aimed
Method      Precision   Recall   F-measure    Precision   Recall   F-measure
LT-RMN      70.79       53.81    61.14        81.33       72.79    76.82
GLT-RMN     69.71       65.76    67.68        82.79       80.04    81.39
CRF         72.45       58.64    64.81        85.37       75.90    80.36

In terms of F-measure, the use of global templates for modeling influences
between possible entities from the same document significantly improves extraction
performance over the local approach (a one-tailed paired t-test for statistical
significance results in a p-value less than 0.01 on both data sets). There is also
a small improvement over CRFs, with the results being statistically significant only
for the Yapex data set, corresponding to a p-value of 0.02. As expected, GLT-RMN
gave a consistently higher recall: additional protein names were extracted as a
result of linking them to repetitions with more informative contexts.
We hypothesize that further improvements to the LT-RMN approach and a
better inference algorithm would push the GLT-RMN performance even higher.
In [3], based on a version of the junction tree algorithm that exploits the sparsity
of the overlap potential, we show that exact inference for the LT-RMN case can
be performed efficiently, with time complexity linear in terms of the number of
candidate entities. In the same work, it is shown that if the candidate entities are
given by the weak (but complete) heuristic H1, the new LT-RMN approach can be
used for returning all text positions that are unlikely to belong to a named entity.
This provides a general method for reducing the number of candidate extractions,
which can replace the domain-dependent heuristic H2. The main drawback of this
heuristic is that sometimes it may miss true entity names; its coverage is 95.6% on
Yapex and 97.1% on Aimed. As an example, H2 assumes that a candidate entity
cannot contain parentheses; however, the Yapex corpus contains a few entity names
like V (1a) receptor, or interleukin 10 (IL-10) receptor, which violate this assumption.
Instead, the local phrase model can be used to learn patterns like "allow a close
parenthesis in an entity name if it is followed by the word receptor."
For the global model GLT-RMN, the inference procedure can be improved by
using a tree-based message propagation schedule, also known as tree reparameterization
(TRP) [32]. TRP has the advantage that it often converges in cases where the
sum-product algorithm fails, requiring a considerably shorter time for convergence.

19.5

Future Research on SRL for NLP


There are a variety of promising directions for future research in applying SRL to
NLP. With respect to information extraction, in addition to identifying entities,
an important problem is extracting specific types of relations between entities. For
example, in newspaper text, one can identify that an organization is located in a
particular city or that a person is affiliated with a specific organization [35]; in
biomedical text, one can identify that a protein interacts with another protein or
that a protein is located in a particular part of the cell [5, 10]. SRL methods may be
usefully applied to such problems since they require identifying relations between
phrases that occur in different parts of a sentence or paragraph.
The complete task of natural language understanding incorporates a wide variety
of interacting subtasks such as speech recognition, morphology, POS tagging, phrase
chunking, syntactic parsing, word-sense disambiguation, semantic interpretation,
anaphora (e.g., pronoun) resolution, and discourse processing. Each of these tasks
requires disambiguating between numerous possibilities, and resolving each of these
ambiguities interacts in complex ways with many of the others. For example, when
understanding the passage, "At the zoo, several men were showing a group of
students various types of flying animals. Suddenly, one of the students hit the
man with a bat," one must first use the context in the previous sentence to resolve
the meaning of the word "bat" before being able to properly attach the misleading
prepositional phrase "with a bat" to "the man" (NP) rather than to the hitting
(verb phrase). SRL methods hold the promise of being able to integrate decisions
at all levels of syntactic, semantic, and pragmatic processing in order to correctly
interpret natural language. Several recent projects have taken the first steps in this
direction. For example, Sutton et al. [29] present a dynamic version of a CRF that
integrates POS tagging and NP chunking into one coherent process. Roth and Yih
[25] present an information-extraction approach based on linear programming that
integrates recognition of entities with the identification of relations between these
entities. The ability of SRL techniques to integrate uncertain evidence from many
interacting problems in order to collectively determine a globally coherent solution
to all of them could help develop a complete, robust NLP system. However, such
a system would create massive collective inference problems and would require
efficient SRL methods that could scale to very large networks.

19.6

Conclusions
The area of natural language processing includes many problems that lend themselves to SRL methods. Most existing statistical methods in NLP, such as HMMs,
sequence CRFs, and probabilistic context-free grammars are actually restrictive
forms of SRL. More general SRL techniques have advantages over these existing
methods and hold the promise of improving results on a number of difficult NLP
problems. In this chapter, we have reviewed our research on applying SRL techniques to information extraction. By using RMNs to capture dependencies between
distinct candidate extractions in a document, we achieved improved results on identifying names of proteins in biomedical abstracts compared to a traditional CRF.
By using the ability of SRL to integrate disparate sources of evidence to perform
collective inference over complex relational data, robust NLP systems that accurately resolve many interacting ambiguities can hopefully be developed.

Acknowledgments
This research was partially supported by the National Science Foundation under
grants IIS-0325116 and IRI-9704943.

References
[1] J. Allen. Natural Language Understanding. Benjamin/Cummings, Menlo Park,
CA, 1987.
[2] E. Brill. Transformation-based error-driven learning and natural language
processing: A case study in part-of-speech tagging. Computational Linguistics,
21(4):543-565, 1995.
[3] R. Bunescu. Learning for collective information extraction. Technical Report
TR-05-02, Department of Computer Sciences, University of Texas at Austin,
2004.
[4] R. Bunescu and R. J. Mooney. Collective information extraction with relational
Markov networks. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 2004.
[5] R. Bunescu and R. J. Mooney. Subsequence kernels for relation extraction.
In Proceedings of the Conference on Neural Information Processing Systems,
2005.
[6] R. Bunescu, R. Ge, R. Kate, E. Marcotte, R. J. Mooney, A. Kumar Ramani,
and Y. Wah Wong. Comparative experiments on learning information extractors
for proteins and their interactions. Artificial Intelligence in Medicine (Special
Issue on Summarization and Information Extraction from Medical Documents),
33(2):139-155, 2005.
[7] C. Cardie. Empirical methods in information extraction. AI Magazine, 18(4):
65-79, 1997.
[8] Kenneth W. Church. A stochastic parts program and noun phrase parser
for unrestricted text. In Proceedings of the Conference on Applied Natural
Language Processing, 1988.
[9] M. Collins. Ranking algorithms for named-entity extraction: Boosting and the
voted perceptron. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 2002.
[10] M. Craven and J. Kumlien. Using multiple levels of learning and diverse
evidence sources to uncover coordinately controlled genes. In Proceedings of the
International Conference on Intelligent Systems for Molecular Biology, 1999.
[11] J. Finkel, T. Grenager, and C. Manning. Incorporating non-local information
into information extraction systems by Gibbs sampling. In Proceedings of the
Annual Meeting of the Association for Computational Linguistics, 2005.
[12] D. Freitag. Information extraction from HTML: Application of a general
learning approach. In Proceedings of the National Conference on Artificial
Intelligence, 1998.
[13] Y. Freund and R. Schapire. Large margin classification using the perceptron
algorithm. Machine Learning, 37:277-296, 1999.
[14] J. Hirschberg. "Every time I fire a linguist, my performance goes up," and other
myths of the statistical natural language processing revolution. Presented at
the National Conference on Artificial Intelligence, 1998.
[15] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge,
MA, 1998.
[16] F. Jelinek. Continuous speech recognition by statistical methods. Proceedings
of the IEEE, 64(4):532-556, 1976.
[17] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product
algorithm. IEEE Transactions on Information Theory, 47(2):498-519,
2001.
[18] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In Proceedings of
the International Conference on Machine Learning, 2001.
[19] C. Manning and H. Schütze. Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, MA, 1999.
[20] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated
corpus of English: The Penn treebank. Computational Linguistics, 19(2):313-330,
1993.

[21] A. McCallum. Mallet: A machine learning for language toolkit.
http://mallet.cs.umass.edu, 2002.
[22] L. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[23] A. Ramani, R. Bunescu, R. J. Mooney, and E. Marcotte. Consolidating the
set of known human protein-protein interactions in preparation for large-scale
mapping of the human interactome. Genome Biology, 6(5):r40, 2005.
[24] L. Ramshaw and M. Marcus. Text chunking using transformation-based
learning. In Proceedings of the Third Workshop on Very Large Corpora, 1995.
[25] D. Roth and W. Yih. A linear programming formulation for global inference in
natural language tasks. In Proceedings of the Conference on Natural Language
Learning, 2004.
[26] R. Schank and C. Riesbeck. Inside Computer Understanding: Five Programs
plus Miniatures. Lawrence Erlbaum and Associates, Hillsdale, NJ, 1981.
[27] A. Schwartz and M. Hearst. A simple algorithm for identifying abbreviation
definitions in biomedical text. In Proceedings of the Eighth Pacific Symposium
on Biocomputing, 2003.
[28] C. Sutton and A. McCallum. Collective segmentation and labeling of distant
entities in information extraction. In ICML Workshop on Statistical Relational
Learning and Its Connections to Other Fields, 2004.
[29] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic conditional random
fields: Factorized probabilistic models for labeling and segmenting sequence
data. In Proceedings of the International Conference on Machine Learning,
2004.
[30] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2002.
[31] A. Viterbi. Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm. IEEE Transactions on Information Theory, 13
(2):260-269, 1967.
[32] M. Wainwright, T. Jaakkola, and A. Willsky. Tree-based reparameterization
framework for approximate estimation on graphs with cycles. In Proceedings
of the Conference on Neural Information Processing Systems, 2001.
[33] T. Winograd. Understanding Natural Language. Academic Press, Orlando,
FL, 1972.
[34] W. A. Woods. Lunar rocks in natural English: Explorations in natural
language question answering. In Antonio Zampoli, editor, Linguistic Structures
Processing. Elsevier North-Holland, New York, 1977.
[35] D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation
extraction. Journal of Machine Learning Research, 3:1083-1106, 2003.

20 Global Inference for Entity and Relation Identification via a
Linear Programming Formulation

Dan Roth and Wen-tau Yih

Natural language decisions often involve assigning values to sets of variables, representing low-level decisions and context-dependent disambiguation. In most cases
there are complex relationships among these variables representing dependencies
that range from simple statistical correlations to those that are constrained by
deeper structural, relational, and semantic properties of the text.
In this chapter we study a specific instantiation of this problem in the context
of identifying named entities and relations between them in free-form text. Given
a collection of discrete random variables representing outcomes of learned local
predictors for entities and relations, we seek an optimal global assignment to the
variables that respects multiple constraints, including constraints on the type of
arguments a relation can take, and the mutual activity of different relations.
We develop a linear programming formulation to address this global inference
problem and evaluate it in the context of simultaneously learning named entities and
relations. We show that global inference improves stand-alone learning; in addition,
our approach allows us to efficiently incorporate expressive domain and task-specific
constraints at decision time, resulting, beyond significant improvements in the
accuracy, in coherent quality of the inference.

20.1

Introduction
In a variety of AI problems there is a need to learn, represent, and reason with
respect to definitions over structured and relational data. Examples include learning
to identify properties of text fragments such as functional phrases and named
entities, identifying relations such as "A is the assassin of B" in text, learning to
classify molecules for mutagenicity from atom-bond data in drug design, learning
to identify 3D objects in their natural surrounding, and learning a policy to map
goals to actions in planning domains.
Learning to make decisions with respect to natural language input is a prime
source of examples for the need to represent, learn, and reason with structured and
relational data [12, 13, 16, 18]. Natural language tasks present several challenges
to statistical relational learning (SRL). It is necessary (1) to represent structured
domain elements in the sense that their internal (hierarchical) structure can be
encoded, and learning functions in these terms can be supported, and (2) to
represent concepts and functions relationally, in the sense that different
data instantiations may be abstracted to yield the same representation, so that
evaluation of functions over different instantiations will produce the same output.
Moreover, beyond having to deal with structured input, in many natural language
understanding tasks there is a rich relational structure also on the output of
predictors. Natural language decisions often depend on the outcomes of several
different but mutually dependent predictions. These predictions must respect some
constraints that could arise from the nature of the data or from domain-specific
conditions. For example, in part-of-speech (POS) tagging, a sentence must have
at least one verb, and cannot have three consecutive verbs. These facts can be
used as constraints. In named entity recognition, "no entities overlap" is a common
constraint used in various works [37]. When predicting whether phrases in sentences
represent entities and determining their type, the relations between the candidate
entities provide constraints on their allowed (or plausible) types, via selectional
restrictions.
While the classifiers involved in these global decisions need to exploit the relational
structure in the input [30], we will not discuss these issues here, and will
focus instead on the task of inference with the classifiers' outcomes. Namely,
this work is concerned with the relational structure over the outcomes of predictors,
and studies natural language inferences which exploit the global structure of
the problem, aiming at making decisions which depend on the outcomes of several
different but mutually dependent classifiers.
Efficient solutions to problems of this sort have been given when the constraints
on the predictors are sequential [25, 15]. These solutions can be categorized into
the following two frameworks. The first, which we call learning global models, trains
a probabilistic model under the constraints imposed by the domain. Examples include
variations of hidden Markov models (HMMs), conditional models, and sequential
variations of Markov random fields (MRFs) [21]. The other framework,
inference with classifiers [28], views maintaining constraints and learning component
classifiers as separate processes. Various local classifiers are trained without
the knowledge of the global output constraints. The predictions are taken as input
to an inference procedure which is given these constraints and then finds the
best global prediction. In addition to the conceptual simplicity and modularity of
this approach, it is more efficient than the global training approach, and seems to
perform better experimentally in some tasks [37, 26, 32].


Typically, efficient inference procedures in both frameworks rely on dynamic programming
(e.g., Viterbi), which works well for sequential data. However, in many
important problems, the structure is more general, resulting in computationally
intractable inference. Problems of these sorts have been studied in computer vision,
where inference is generally performed over low-level measurements rather than
over higher-level predictors [22, 3].
This work develops a novel inference with classifiers approach. Rather than being
restricted to sequential data, we study a fairly general setting. The problem is
defined in terms of a collection of discrete random variables representing binary
relations and their arguments; we seek an optimal assignment to the variables
in the presence of the constraints on the binary relations between variables and
the relation types. Following ideas that were developed recently in the context
of approximation algorithms [8], we model inference as an optimization problem,
and show how to cast it in a linear programming (LP) formulation. Using existing
numerical packages, which are able to solve very large LP problems in a very short
time,¹ inference can be done very quickly.
Our approach could be contrasted with other approaches to sequential inference
or to general MRF approaches [21, 35]. The key difference is that in these
approaches, the model is learned globally, under the constraints imposed by the
domain. Our approach is designed to address also cases in which some of the local
classifiers are learned (or acquired otherwise) in other contexts and at other times,
or incorporated as background knowledge. That is, some components of the global
decision need not, or cannot, be trained in the context of the decision problem.
This way, our approach allows the incorporation of constraints into decisions in a
dynamic fashion and can therefore support task-specific inference. The significance
of this is clearly shown in our experimental results.
We develop our model in the context of natural language inference and evaluate
it here on the problem of simultaneously recognizing named entities and relations
between them.
For instance, in the sentence "J. V. Oswald was murdered at JFK after his
assassin, R. U. KFJ shot...", we want to identify the kill (KFJ, Oswald) relation.
This task requires making several local decisions, such as identifying named
entities in the sentence, in order to support the relation identification. For example,
it may be useful to identify that Oswald and KFJ are people, and JFK is a location.
This, in turn, may help to identify that a kill action is described in the sentence. At
the same time, the relation kill constrains its arguments to be people (or at least,
not to be locations) and helps to enforce that Oswald and KFJ are likely to be
people, while JFK may not.

1. For example, CPLEX [11] is able to solve a linear programming problem of 13 million
variables within 5 minutes.

In our model, we first learn a collection of local predictors, e.g., entity and
relation identifiers. At decision time, given a sentence, we produce a global decision
that optimizes over the suggestions of the classifiers that are active in the sentence,
known constraints among them and, potentially, domain-specific or task-specific
constraints relevant to the current decision. Although a brute-force algorithm may
seem feasible for short sentences, as the number of entity variables grows, the
computation becomes intractable very quickly. Given $n$ entities in a sentence, there
are $O(n^2)$ possible binary relations between them. Assume that each variable (entity
or relation) can take $l$ labels (none is one of these labels). Thus, there are $l^{n^2}$
possible assignments, which is too large to explicitly enumerate even for a small $n$.
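
To make the growth concrete, the short Python sketch below counts the variables and joint labelings. It assumes directed relations (both R_ij and R_ji, no self-relations), so n entities give n(n - 1) relation variables and n + n(n - 1) = n^2 variables overall; the function names and the choice l = 5 are illustrative only.

```python
def num_variables(n: int) -> int:
    """n entity variables plus n*(n-1) directed relation variables."""
    return n + n * (n - 1)  # algebraically equal to n**2


def num_assignments(n: int, l: int) -> int:
    """Number of joint labelings when every variable can take l labels."""
    return l ** num_variables(n)


if __name__ == "__main__":
    # l = 5 is an arbitrary illustrative label count, not a value from the chapter.
    for n in (2, 3, 5, 8):
        print(n, num_variables(n), num_assignments(n, l=5))
```

Even with a handful of entities, the count of joint labelings is far beyond what explicit enumeration can handle.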
When evaluated on simultaneous learning of named entities and relations, our
approach not only provides a significant improvement in the predictors' accuracy;
more importantly, it provides coherent solutions. While many statistical methods
make incoherent mistakes (i.e., inconsistency among predictions) that no human
ever makes, as we show, our approach also improves the quality of the inference
significantly.
The rest of the chapter is organized as follows. Section 20.2 formally defines
our problem and section 20.3 describes the computational approach we propose.
Experimental results are given in section 20.5, including a case study that illustrates
how our inference procedure improves the performance. We introduce some common
inference methods used in various text problems as comparison in section 20.6,
followed by some discussions and conclusions in section 20.7.

20.2

The Relational Inference Problem


We consider the relational inference problem within the reasoning with classifiers
paradigm, and study a specific but fairly general instantiation of this problem,
motivated by the problem of recognizing named entities (e.g., persons, locations,
organization names) and relations between them (e.g., work_for, located_in, live_in).
Conceptually, the entities and relations can be viewed, taking into account the
mutual dependencies, as shown in figure 20.1, where the nodes represent entities
(e.g., phrases) and the links denote the binary relations between the entities. Each
entity and relation has several properties. Some of the properties, such as words
inside the entities and POS tags of words in the context of the sentence, are easy
to acquire. However, other properties like the semantic types (i.e., class labels, such
as "people" or "locations") of phrases are difficult. Identifying the labels of entities
and relations is treated here as a learning problem. In particular, we learn these
target properties as functions of all other properties of the sentence.
To describe the problem in a formal way, we first define sentences and entities as
follows.
Definition 20.1 Sentence and Entities
A sentence $S$ is a linked list which consists of words $w$ and entities $E$. An entity can
be a single word or a set of consecutive words with a predefined boundary. Entities
in a sentence are labeled as $\mathcal{E} = \{E_1, E_2, \ldots, E_n\}$ according to their order, and
they take values (i.e., labels) that range over a set of entity types $\mathcal{L}_E$. The value
assigned to $E_i \in \mathcal{E}$ is denoted $f_{E_i} \in \mathcal{L}_E$.

[Figure 20.1 A conceptual view of entities and relations. Entity nodes E1, E2, E3
carry properties such as spelling, POS, and label; relation nodes Rij link them and
carry candidate labels Label-1, ..., Label-n.]

Notice that determining the entity boundaries is also a difficult problem: the
segmentation (or phrase detection) problem [1, 25]. Here we assume it is solved and
given to us as input; thus we only concentrate on classification.

Dole's wife, Elizabeth, is a native of Salisbury, N.C.
(E1 = Dole, E2 = Elizabeth, E3 = Salisbury, N.C.)

Figure 20.2 A sentence that has three entities.

Example 20.1
The sentence in figure 20.2 has three entities: $E_1$ = "Dole", $E_2$ = "Elizabeth", and
$E_3$ = "Salisbury, N.C."
A relation is defined by the entities that are involved in it (its arguments). Note
that we only discuss binary relations.
Definition 20.2 Relations
A (binary) relation $R_{ij} = (E_i, E_j)$ represents the relation between $E_i$ and $E_j$, where
$E_i$ is the first argument and $E_j$ is the second. In addition, $R_{ij}$ can range over a set of
relation types $\mathcal{L}_R$. We use $\mathcal{R} = \{R_{ij}\}_{\{1 \le i,j \le n;\ i \ne j\}}$ as the set of binary relations on the
entities $\mathcal{E}$ in a sentence. Two special functions $\mathcal{N}^1$ and $\mathcal{N}^2$ are used to indicate the
argument entities of a relation $R_{ij}$. Specifically, $E_i = \mathcal{N}^1(R_{ij})$ and $E_j = \mathcal{N}^2(R_{ij})$.
Note that in this definition, the relations are directed (e.g., there are both $R_{ij}$ and
$R_{ji}$ variables). This is because the arguments in a relation often take different roles
and have to be distinguished. Examples of this sort include work_for, located_in, and
live_in. If a relation variable $R_{ij}$ is predicted as a mutual relation (e.g., spouse_of),
then the corresponding relation $R_{ji}$ should also be assigned the same label. This
additional constraint can be easily incorporated in our inference framework. Also
notice that we simplify the definition slightly by not considering self-relations (e.g.,
$R_{ii}$). This can be relaxed if this type of relation appears in the data.
Example 20.2
In the sentence given in figure 20.2, there are six relations between the entities: $R_{12}$
= ("Dole", "Elizabeth"), $R_{21}$ = ("Elizabeth", "Dole"), $R_{13}$ = ("Dole", "Salisbury,
N.C."), $R_{31}$ = ("Salisbury, N.C.", "Dole"), $R_{23}$ = ("Elizabeth", "Salisbury, N.C."),
and $R_{32}$ = ("Salisbury, N.C.", "Elizabeth").
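
As a small illustration of definition 20.2, the following Python sketch enumerates the directed relation variables for the three entities of example 20.1. The dictionary layout (1-based index pairs mapped to argument tuples) is only an assumed encoding for this sketch.

```python
from itertools import permutations

entities = ["Dole", "Elizabeth", "Salisbury, N.C."]

# One relation variable R_ij for every ordered pair (i, j) with i != j,
# using 1-based indices as in the text. Self-relations R_ii are excluded.
relations = {
    (i + 1, j + 1): (entities[i], entities[j])
    for i, j in permutations(range(len(entities)), 2)
}

if __name__ == "__main__":
    for (i, j), args in sorted(relations.items()):
        print(f"R{i}{j} = {args}")
```

For n entities this yields n(n - 1) relation variables, which is where the six relations of the example come from.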
We define the types (i.e., classes) of relations and entities as follows.
Definition 20.3 Classes
We denote the set of predefined entity classes and relation classes as $\mathcal{L}_E$ and $\mathcal{L}_R$
respectively. $\mathcal{L}_E$ has one special element, other_ent, which represents any unlisted
entity class. Similarly, $\mathcal{L}_R$ also has one special element, other_rel, which means the
involved entities are irrelevant or the relation class is undefined.
When it is clear from the context, we use $E_i$ and $R_{ij}$ to refer to the entity and
relation, as well as their types (class labels). Note that each relation and entity
variable can take only one class according to definition 20.3. Although there may
be different relations between two entities, this seldom occurs in the data. Therefore,
we ignore this issue for now.
Example 20.3
Suppose $\mathcal{L}_E$ = { other_ent, person, location } and $\mathcal{L}_R$ = { other_rel, born_in,
spouse_of }. For the entities in figure 20.2, $E_1$ and $E_2$ belong to person and $E_3$
belongs to location. In addition, relation $R_{23}$ is born_in, $R_{12}$ and $R_{21}$ are spouse_of.
Other relations are other_rel.
Given a sentence, we want to predict the labels of a set $\mathcal{V}$ which consists of two
types of variables: entities $\mathcal{E}$ and relations $\mathcal{R}$. That is, $\mathcal{V} = \mathcal{E} \cup \mathcal{R}$. However, the
class label of a single entity or relation depends not only on its local properties
but also on the properties of other entities and relations. The classification task
is somewhat difficult since the predictions of entity labels and relation labels are
mutually dependent. For instance, the class label of $E_1$ depends on the class label of
$R_{12}$ and the class label of $R_{12}$ also depends on the class label of $E_1$ and $E_2$. While
we can assume that all the data is annotated for training purposes, this cannot
be assumed at evaluation time. We may presume that some local properties, such
as the words or POS tags, are given, but none of the class labels for entities or
relations are.
To simplify the complexity of the interaction within the graph but still preserve
the characteristic of mutual dependency, we abstract this classification problem
in the following probabilistic framework. First, the classifiers are trained independently
and used to estimate the probabilities of assigning different labels given the
observation (that is, the easily classified properties in it). Then, the output of the
classifiers is used as a conditional distribution for each entity and relation, given
the observation. This information, along with the constraints among the relations
and entities, is used to make global inference.
In the task of entity and relation recognition, there exist some constraints on the
labels of corresponding relation and entity variables. For instance, if the relation
is live_in, then the first entity should be a person, and the second entity should
be a location. The correspondence between the relation and entity variables can be
represented by a bipartite graph. Each relation variable $R_{ij}$ is connected to its first
entity $E_i$, and second entity $E_j$. We define a set of constraints on the outcomes of
the variables in $\mathcal{V}$ as follows.
Definition 20.4 Constraints
A constraint is a function that maps a relation label and an entity label to either
0 or 1 (contradict or satisfy the constraint). Specifically, $\mathcal{C}^1 : \mathcal{L}_R \times \mathcal{L}_E \to \{0, 1\}$
constrains values of the first argument of a relation. $\mathcal{C}^2$ is defined similarly and
constrains values of the second argument.
Note that while we define the constraints here as Boolean functions, our formalism
allows us to associate weights with constraints and to include statistical
constraints [32]. Also note that we can define a large number of constraints, such
as $\mathcal{C}^R : \mathcal{L}_R \times \mathcal{L}_R \to \{0, 1\}$, which constrains the labels of two relation variables.
For example, we can define a set of constraints on a mutual relation spouse_of
as {(spouse_of, spouse_of) = 1, (spouse_of, $l_r$) = 0, and ($l_r$, spouse_of) = 0 for any
$l_r \in \mathcal{L}_R$, where $l_r \ne$ spouse_of}. By enforcing these constraints on a pair of symmetric
relation variables $R_{ij}$ and $R_{ji}$, the relation class spouse_of will be assigned
to either both $R_{ij}$ and $R_{ji}$ or neither of them. [In fact, as will be clear in section 20.3,
the language used to describe constraints is very rich: linear (in)equalities over $\mathcal{V}$.]
We seek an inference algorithm that can produce a coherent labeling of entities
and relations in a given sentence. Furthermore, it optimizes an objective function
based on the conditional probabilities or other confidence scores estimated by the
entity and relation classifiers, subject to some natural constraints. Examples of
these constraints include whether specific entities can be the argument of specific
relations, whether two relations can occur together among a subset of entity
variables in a sentence, and any other information that might be available at the
inference time. For instance, suppose it is known that entities A and B represent the
same location; one may like to incorporate an additional constraint that prevents
an inference of the type: "C lives in A; C does not live in B."
We note that a large number of problems can be modeled this way. Examples
include problems such as chunking sentences [25], coreference resolution and
sequencing problems in computational biology, and the recently popular problem of
semantic role labeling [5, 6]. In fact, each of the components of our problem here,
namely the separate task of recognizing named entities in sentences and the task of
recognizing semantic relations between phrases, can be modeled this way. However,
our goal is specifically to consider interacting problems at different levels, resulting
in more complex constraints among them, and exhibit the power of our method.

20.3

Integer Linear Programming Inference


The most direct way to formalize our inference problem is using MRFs [9]. Rather
than doing that, for computational reasons, we first use a fairly standard transformation
of MRFs to a discrete optimization problem (see [19] for details). Specifically,
under weak assumptions we can view the inference problem as the following
optimization problem, which aims at minimizing the objective function that is the
sum of the following two cost functions.
Assignment cost This is the cost of deviating from the assignment of the
variables $\mathcal{V}$ given by the classifiers. The specific cost function we use is defined
as follows: Let $l$ be the label assigned to variable $u \in \mathcal{V}$. If the posterior probability
estimation is $p = P(f_u = l \mid \bar{x})$, where $\bar{x}$ represents the input feature vector, then
the assignment cost $c_u(l)$ is $-\log p$.
Constraint cost This is the cost imposed by breaking constraints between
neighboring nodes. The specific cost function we use is defined as follows: Consider
two entity nodes $E_i$, $E_j$ and their corresponding relation node $R_{ij}$; that is, $E_i =
\mathcal{N}^1(R_{ij})$ and $E_j = \mathcal{N}^2(R_{ij})$. The constraint cost indicates whether the labels
are consistent with the constraints. In particular, we use: $d^1(f_{R_{ij}}, f_{E_i})$ is 0 if
$(f_{R_{ij}}, f_{E_i}) \in \mathcal{C}^1$ (that is, $\mathcal{C}^1(f_{R_{ij}}, f_{E_i}) = 1$); otherwise, $d^1(f_{R_{ij}}, f_{E_i})$ is $\infty$.² Similarly,
we use $d^2$ to force the consistency of the second argument of a relation.
Since we are looking for the most probable global assignment that satisfies the
constraints, the overall cost function we optimize for a global labeling $f$ of all
variables is

$$C(f) = \sum_{u \in \mathcal{V}} c_u(f_u) + \sum_{R_{ij} \in \mathcal{R}} \left[ d^1(f_{R_{ij}}, f_{E_i}) + d^2(f_{R_{ij}}, f_{E_j}) \right] \tag{20.1}$$
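
To illustrate the cost function on a toy case, the Python sketch below minimizes C(f) by brute force for a two-entity sentence. The posterior values, the label sets, and the single live_in constraint are made-up inputs chosen for illustration, and exhaustive search is used only because the instance is tiny; it is not the inference procedure developed in this chapter.

```python
import math
from itertools import product

ENTITY_LABELS = ["other_ent", "person", "location"]
RELATION_LABELS = ["other_rel", "live_in"]

# Made-up classifier posteriors (assumptions for illustration only).
P_ENTITY = {
    "E1": {"other_ent": 0.1, "person": 0.5, "location": 0.4},
    "E2": {"other_ent": 0.1, "person": 0.6, "location": 0.3},
}
P_RELATION = {
    ("E1", "E2"): {"other_rel": 0.3, "live_in": 0.7},
    ("E2", "E1"): {"other_rel": 0.9, "live_in": 0.1},
}
# live_in requires (person, location); other labels are unconstrained here.
ALLOWED = {"live_in": ("person", "location")}


def d(relation_label, entity_label, position):
    """Constraint cost d^1 (position 0) or d^2 (position 1): 0 or infinity."""
    if relation_label not in ALLOWED:
        return 0.0
    return 0.0 if ALLOWED[relation_label][position] == entity_label else math.inf


def cost(f_ent, f_rel):
    """C(f): assignment costs (-log p) plus constraint costs."""
    total = sum(-math.log(P_ENTITY[e][lab]) for e, lab in f_ent.items())
    for (first, second), r in f_rel.items():
        total += -math.log(P_RELATION[(first, second)][r])
        total += d(r, f_ent[first], 0) + d(r, f_ent[second], 1)
    return total


def best_labeling():
    """Exhaustively search all labelings and return (cost, f_ent, f_rel)."""
    ents, rels = list(P_ENTITY), list(P_RELATION)
    best = None
    for e_labels in product(ENTITY_LABELS, repeat=len(ents)):
        for r_labels in product(RELATION_LABELS, repeat=len(rels)):
            f_ent, f_rel = dict(zip(ents, e_labels)), dict(zip(rels, r_labels))
            c = cost(f_ent, f_rel)
            if best is None or c < best[0]:
                best = (c, f_ent, f_rel)
    return best
```

With these numbers the locally most probable labels (E2 = person, R12 = live_in) contradict the constraint, and the minimum-cost labeling relabels E2 as location so that the live_in prediction can be kept.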

Unfortunately, this combinatorial problem (20.1) is computationally intractable
even when placing assumptions on the cost function [19]. The computational
approach we adopt is to develop a linear programming formulation of the problem,
and then solve the corresponding integer linear programming (ILP) problem.³ Our
LP formulation is based on the method proposed by Chekuri et al. [8]. Since the
objective function (20.1) is not a linear function in terms of the labels, we introduce
new binary variables to represent different possible assignments to each original
variable; we then represent the objective function as a linear function of these
binary variables.

2. In practice, we use a very large number (e.g., $9^{15}$).
3. In this chapter, ILP only means integer linear programming, not inductive logic programming.

Let $x_{\{u,i\}}$ be an indicator variable, defined to be 1 if and only if variable $u$ is
labeled $i$ and 0 otherwise, where $u \in \mathcal{E}, i \in \mathcal{L}_E$ or $u \in \mathcal{R}, i \in \mathcal{L}_R$. For example,
$x_{\{E_1,\text{person}\}} = 1$ when the label of entity $E_1$ is person; $x_{\{R_{23},\text{spouse\_of}\}} = 0$ when
the label of relation $R_{23}$ is not spouse_of. Let $x_{\{R_{ij},r,E_i,e_1\}}$ be an indicator variable
representing whether relation $R_{ij}$ is assigned label $r$ and its first argument, $E_i$,
is assigned label $e_1$. For instance, $x_{\{R_{12},\text{spouse\_of},E_1,\text{person}\}} = 1$ means the label of
relation $R_{12}$ is spouse_of and the label of its first argument, $E_1$, is person. Similarly,
$x_{\{R_{ij},r,E_j,e_2\}} = 1$ indicates that $R_{ij}$ is assigned label $r$ and its second argument,
$E_j$, is assigned label $e_2$. With these definitions, the optimization problem can be
represented as the following integer linear program.

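$$
\begin{aligned}
\min \quad & \sum_{E \in \mathcal{E}} \sum_{e \in \mathcal{L}_E} c_E(e) \cdot x_{\{E,e\}}
 + \sum_{R \in \mathcal{R}} \sum_{r \in \mathcal{L}_R} c_R(r) \cdot x_{\{R,r\}} \\
 & + \sum_{\substack{E_i, E_j \in \mathcal{E} \\ E_i \ne E_j}} \Bigg[
 \sum_{r \in \mathcal{L}_R} \sum_{e_1 \in \mathcal{L}_E} d^1(r, e_1) \cdot x_{\{R_{ij},r,E_i,e_1\}}
 + \sum_{r \in \mathcal{L}_R} \sum_{e_2 \in \mathcal{L}_E} d^2(r, e_2) \cdot x_{\{R_{ij},r,E_j,e_2\}} \Bigg],
\end{aligned}
$$

subject to:

$$
\begin{aligned}
\sum_{e \in \mathcal{L}_E} x_{\{E,e\}} &= 1 && \forall E \in \mathcal{E} && (20.2) \\
\sum_{r \in \mathcal{L}_R} x_{\{R,r\}} &= 1 && \forall R \in \mathcal{R} && (20.3) \\
x_{\{E,e\}} &= \sum_{r \in \mathcal{L}_R} x_{\{R,r,E,e\}} && \forall E \in \mathcal{E},\ \forall e \in \mathcal{L}_E,\ \forall R \in \{R : E = \mathcal{N}^1(R) \text{ or } E = \mathcal{N}^2(R)\} && (20.4) \\
x_{\{R,r\}} &= \sum_{e \in \mathcal{L}_E} x_{\{R,r,E,e\}} && \forall R \in \mathcal{R},\ \forall r \in \mathcal{L}_R,\ E = \mathcal{N}^1(R) && (20.5) \\
x_{\{R,r\}} &= \sum_{e \in \mathcal{L}_E} x_{\{R,r,E,e\}} && \forall R \in \mathcal{R},\ \forall r \in \mathcal{L}_R,\ E = \mathcal{N}^2(R) && (20.6) \\
x_{\{E,e\}} &\in \{0, 1\} && \forall E \in \mathcal{E},\ \forall e \in \mathcal{L}_E && (20.7) \\
x_{\{R,r\}} &\in \{0, 1\} && \forall R \in \mathcal{R},\ \forall r \in \mathcal{L}_R && (20.8) \\
x_{\{R,r,E,e\}} &\in \{0, 1\} && \forall R \in \mathcal{R},\ \forall r \in \mathcal{L}_R,\ \forall E \in \mathcal{E},\ \forall e \in \mathcal{L}_E && (20.9)
\end{aligned}
$$
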

Equations (20.2) and (20.3) require that each entity or relation variable can
only be assigned one label. Equations (20.4), (20.5), and (20.6) assure that the
assignment to each entity or relation variable is consistent with the assignment
to its neighboring variables. Equations (20.7), (20.8), and (20.9) are the integral
constraints on these binary variables.
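
The role of the indicator variables can be sketched in Python by expanding a labeling into its 0/1 variables and checking constraints (20.2) through (20.6) directly. The label sets, the tuple-based keys, and the function names are assumptions made for this sketch; the objective function is omitted, and no solver is involved.

```python
ENTITY_LABELS = ["other_ent", "person", "location"]
RELATION_LABELS = ["other_rel", "live_in"]


def indicators(f_ent, f_rel):
    """Expand a labeling into the 0/1 indicator variables of the program.

    Keys: (E, e) for x_{E,e}; (R, r) for x_{R,r}; (R, r, E, e) for x_{R,r,E,e},
    where a relation R is the tuple (first argument, second argument).
    """
    x = {}
    for E, label in f_ent.items():
        for e in ENTITY_LABELS:
            x[(E, e)] = int(label == e)
    for R, label in f_rel.items():
        for r in RELATION_LABELS:
            x[(R, r)] = int(label == r)
            for E in R:
                for e in ENTITY_LABELS:
                    x[(R, r, E, e)] = int(label == r and f_ent[E] == e)
    return x


def satisfies_constraints(x, entities, relations):
    """Check the equality constraints (20.2)-(20.6) for given indicator values."""
    ok = all(sum(x[(E, e)] for e in ENTITY_LABELS) == 1 for E in entities)      # (20.2)
    ok &= all(sum(x[(R, r)] for r in RELATION_LABELS) == 1 for R in relations)  # (20.3)
    for R in relations:
        for E in R:  # E = N^1(R), then E = N^2(R)
            for e in ENTITY_LABELS:   # (20.4)
                ok &= x[(E, e)] == sum(x[(R, r, E, e)] for r in RELATION_LABELS)
            for r in RELATION_LABELS:  # (20.5) and (20.6)
                ok &= x[(R, r)] == sum(x[(R, r, E, e)] for e in ENTITY_LABELS)
    return bool(ok)
```

Any labeling expanded by `indicators` passes the check, while flipping a single indicator by hand (for example setting a second entity label to 1) breaks (20.2) or one of the consistency constraints.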

There are several advantages of representing the problem in an LP formulation.
First of all, linear (in)equalities are fairly general and are able to represent many
types of constraints (e.g., the decision time constraint in the experiment in
section 20.5). More importantly, an ILP problem at this scale can be solved very
quickly using current numerical packages, such as Xpress-MP [42] or CPLEX [11].
We next introduce the general strategies for solving an ILP problem.

20.4

Solving Integer Linear Programming


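To solve an ILP problem, a straightforward idea is to relax the integral constraints.
That is, replacing (20.7), (20.8), and (20.9) with

$$
\begin{aligned}
x_{\{E,e\}} &\ge 0 && \forall E \in \mathcal{E},\ \forall e \in \mathcal{L}_E && (20.10) \\
x_{\{R,r\}} &\ge 0 && \forall R \in \mathcal{R},\ \forall r \in \mathcal{L}_R && (20.11) \\
x_{\{R,r,E,e\}} &\ge 0 && \forall R \in \mathcal{R},\ \forall r \in \mathcal{L}_R,\ \forall E \in \mathcal{E},\ \forall e \in \mathcal{L}_E && (20.12)
\end{aligned}
$$
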

If linear programming relaxation (LPR) returns an integer solution, then it is
also the optimal solution to the ILP problem. In fact, it can be shown that the
optimal solution of a linear program is always integral if the coefficient matrix of
its standard form is unimodular [34].
Definition 20.5
A matrix $A$ of rank $m$ is called unimodular if all the entries of $A$ are integers, and
the determinant of every square submatrix of $A$ of order $m$ is in $\{0, +1, -1\}$.
Theorem 20.6 Veinott and Dantzig
Let $A$ be an $(m, n)$-integral matrix with full row rank $m$. Then the polyhedron
$\{x \mid x \ge 0;\ Ax = b\}$ is integral for each integral vector $b$, if and only if $A$ is
unimodular.
Theorem 20.6 indicates that if a linear program is in its standard form, then
regardless of the cost function and the integral vector $b$, the optimal solution is
an integer point if and only if the coefficient matrix $A$ is unimodular.
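
Definition 20.5 can be checked mechanically on small matrices. The Python sketch below assumes an integer matrix with full row rank m (as in theorem 20.6) and simply enumerates every m-by-m column submatrix; the function names are illustrative, and the enumeration is exponential in general, so this is only suitable for toy examples.

```python
from fractions import Fraction
from itertools import combinations


def det(matrix):
    """Determinant of a square matrix by Gaussian elimination over exact fractions."""
    a = [[Fraction(v) for v in row] for row in matrix]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def is_unimodular(A):
    """Check definition 20.5, assuming A is an integer matrix of full row rank m."""
    m, n = len(A), len(A[0])
    for cols in combinations(range(n), m):
        sub = [[row[c] for c in cols] for row in A]
        if det(sub) not in (0, 1, -1):
            return False
    return True
```

For example, `[[1, 1, 0], [0, 1, 1]]` passes the check, while `[[1, 1, 0], [1, -1, 1]]` fails because its first two columns form a submatrix with determinant -2.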
When LPR returns a noninteger solution, the ILP problem is usually handled by
one of two strategies: rounding and search.
The goal of rounding is to find an integer point that is close to the noninteger
solution. Under some conditions on the cost function, which do not hold in our
problem, the solution produced by a well-designed rounding algorithm can be shown
to be a good approximation to the optimal solution [19, 8]. Nevertheless, in
general, the outcome of the rounding procedure may not even be a legitimate
solution to the problem.
To find the optimal solution of an ILP problem, a search approach based on the
idea of branch and bound divides an ILP problem into several LP subproblems, and
uses the noninteger solutions returned by an LP solver to reduce the search space.
When LPR finds a noninteger solution, it splits the problem on the noninteger
variable. For example, suppose variable $x_i$ is fractional in a noninteger solution to
the ILP problem $\min\{cx : x \in S, x \in \{0, 1\}^n\}$, where $S$ is the set of linear constraints. The
ILP problem can be split into two sub-LPR problems, $\min\{cx : x \in S \cap \{x_i = 0\}\}$
and $\min\{cx : x \in S \cap \{x_i = 1\}\}$. Since any feasible solution provides an upper bound
and any LPR solution generates a lower bound, the search tree can be effectively
cut.
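
The branching and pruning logic can be sketched in pure Python on a different, much simpler binary program: a 0/1 knapsack, chosen here only because its LP relaxation can be solved greedily without an LP solver. This is an illustrative stand-in, not the entity-relation ILP of this chapter. Because the knapsack is a maximization problem, the roles of the bounds are mirrored: the relaxation gives an upper bound and the best feasible solution found so far gives a lower bound. All names are hypothetical, and weights are assumed to be positive integers.

```python
from fractions import Fraction


def lp_relaxation(values, weights, capacity, fixed):
    """Solve the LP relaxation of a 0/1 knapsack with some variables fixed.

    Returns (bound, x) with 0 <= x_i <= 1, or None if the fixed part is infeasible.
    Filling greedily by value/weight ratio is optimal for this particular LP.
    """
    x = [Fraction(0)] * len(values)
    remaining = Fraction(capacity)
    for i, v in fixed.items():
        x[i] = Fraction(v)
        remaining -= weights[i] * v
    if remaining < 0:
        return None
    free = sorted(
        (i for i in range(len(values)) if i not in fixed),
        key=lambda i: Fraction(values[i], weights[i]),
        reverse=True,
    )
    for i in free:
        if remaining <= 0:
            break
        x[i] = min(Fraction(1), remaining / weights[i])
        remaining -= x[i] * weights[i]
    bound = sum(v * xi for v, xi in zip(values, x))
    return bound, x


def branch_and_bound(values, weights, capacity):
    """Return (best value, best 0/1 vector) for a small knapsack instance."""
    best_value, best_x = Fraction(0), [0] * len(values)  # all-zero is feasible
    stack = [{}]  # each subproblem is a dict {variable index: fixed value}
    while stack:
        fixed = stack.pop()
        relaxed = lp_relaxation(values, weights, capacity, fixed)
        if relaxed is None:
            continue  # infeasible subproblem
        bound, x = relaxed
        if bound <= best_value:
            continue  # prune: the relaxation cannot beat the incumbent
        fractional = [i for i, xi in enumerate(x) if 0 < xi < 1]
        if not fractional:
            best_value, best_x = bound, [int(xi) for xi in x]  # integral solution
            continue
        i = fractional[0]
        stack.append({**fixed, i: 0})  # branch x_i = 0
        stack.append({**fixed, i: 1})  # branch x_i = 1
    return best_value, best_x
```

On the instance `values=[60, 100, 120]`, `weights=[10, 20, 30]`, `capacity=50`, the root relaxation is fractional in the third variable; branching on it and pruning by bound leaves the integral solution that takes the second and third items.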
One technique that is often combined with branch and bound
is the cutting plane method. When a noninteger solution is given by LPR, it adds a new
linear constraint that makes the noninteger point infeasible, while still keeping the
optimal integer solution in the feasible region. As a result, the feasible region is
closer to the ideal polyhedron, which is the convex hull of feasible integer solutions.
The most well-known cutting plane algorithm is Gomory's fractional cutting plane
method [41], for which it can be shown that only a finite number of additional
constraints are needed. Moreover, researchers developed different cutting plane
algorithms for different types of ILP problems. One example is [40], which only
focuses on binary ILP problems.
In theory, a search-based strategy may need several steps to find the optimal
solution. However, LPR generated integer solutions in all the (thousands
of) cases we have experimented with, even though the coefficient matrix in our
problem is not unimodular.

20.5

Experiments

We describe below two sets of experiments on the problem of simultaneously
recognizing entities and relations. In the first, we view the task as a knowledge
acquisition task: we let the system read sentences and identify entities and relations
among them. Given that this is a difficult task which may quite often require
information beyond the sentence, we also consider a "forced decision" task, in which
we simulate a question-answering situation: we ask the system, say, "Who killed
whom?" and evaluate it on identifying correctly the relation and its arguments,
given that it is known that somewhere in this sentence this relation is active. In
addition, this evaluation exhibits the ability of our approach to incorporate
task-specific constraints at decision time. At the end of this section, we will also provide
a case study to illustrate how the inference procedure corrects mistakes both in
entity and relation predictions.
20.5.1

Data Preparation

We annotated the named entities and relations in some sentences from the TREC
documents. In order to effectively observe the interaction between relations and
entities, we chose 1437 sentences⁴ that have at least one active relation. Among
those sentences, there are 5336 entities, and 19,048 pairs of entities (binary relations).
Entity labels include 1685 persons, 1968 locations, 978 organizations, and
705 other_ent. Relation labels include 406 located_in, 394 work_for, 451 orgBased_in,
521 live_in, 268 kill, and 17,007 other_rel. Note that most pairs of entities have no
active relations at all. Therefore, relation other_rel significantly outnumbers others.
Examples of each relation label and the constraints between a relation variable and
its two entity arguments are shown in table 20.1.
Table 20.1 Relations of interest and the corresponding constraints

Relation      Entity1   Entity2   Example
located_in    loc       loc       (New York, US)
work_for      per       org       (Bill Gates, Microsoft)
orgBased_in   org       loc       (HP, Palo Alto)
live_in       per       loc       (Bush, US)
kill          per       per       (Oswald, JFK)

20.5.2

Tested Approaches

In order to focus on the evaluation of our inference procedure, we assume the
problem of segmentation (or phrase detection) [1, 25] is solved, and the entity
boundaries are given to us as input; thus we only concentrate on their classifications.
We evaluate our LP-based inference procedure by observing its effect in different
approaches of combining the entity and relation classifiers. The first approach is to
train entity and relation classifiers separately. In particular, the relation classifier
does not know the labels of its entity arguments, and the entity classifier does not
know the labels of relations in the sentence, either. For the entity classifier, one
set of features is extracted from words within a size 4 window around the target
phrase. They are (1) words, POS tags, and conjunctions of them; and (2) bigrams
and trigrams of the mixture of words and tags. In addition, some other features are
extracted from the target phrase, which are listed in table 20.2.
For the relation classifier, there are three sets of features:
1. features similar to those used in the entity classification are extracted from the
two argument entities of the relation;
2. conjunctions of the features from the two arguments;
3. some patterns extracted from the sentence or between the two arguments.

4. The data used here is available by following the data link from
http://L2R.cs.uiuc.edu/cogcomp/
5. We collected names of famous places, people, and popular titles from other data sources
in advance.

Table 20.2 Some features extracted from the target phrase

Symbol    Explanation
icap      the first character of a word is capitalized
acap      all characters of a word are capitalized
incap     some characters of a word are capitalized
suffix    the suffix of a word is "ing", "ment", etc.
bigram    bigram of words in the target phrase
len       number of words in the target phrase
place⁵    the phrase is/has a known place's name
prof⁵     the phrase is/has a professional title (e.g., Lt.)
name⁵     the phrase is/has a known person's name

Table 20.3 Some patterns used in relation classification

Pattern                     Example
arg1 , arg2                 San Jose, CA
arg1 , a arg2 prof          John Smith, a Starbucks manager
in/at arg1 in/at/, arg2     Officials in Perugia in Umbria province said
arg2 prof arg1              CNN reporter David McKinley
arg1 native of arg2         Elizabeth Dole is a native of Salisbury, N.C.
arg1 based in/at arg2       a manager for Kmart based in Troy, Mich. said

Some features in category 3 are "the number of words between arg1 and arg2",
"whether arg1 and arg2 are the same word", or "arg1 is the beginning of the
sentence and has words that consist of all capitalized characters", where arg1 and
arg2 represent the first and second argument entities respectively. Table 20.3
presents some patterns we use.
The learning algorithm used is a regularized variation of the Winnow update rule
incorporated in SNoW [29, 31, 4], a multiclass classifier that is specifically tailored
for large-scale learning tasks. SNoW learns a sparse network of linear functions, in
which the targets (entity classes or relation classes, in this case) are represented
as linear functions over a common feature space. While SNoW can be used as
a classifier and predicts using a winner-take-all mechanism over the activation
value of the target classes, we can also rely directly on the raw activation value
it outputs, which is the weighted linear sum of the active features, to estimate
the posteriors. It can be verified that the resulting values provide a good source
of probability estimation. We use softmax [2] over the raw activation values as
conditional probabilities. Specifically, suppose the number of classes is $n$, and the
raw activation value of class $i$ is $act_i$. The posterior estimation for class $i$ is derived
by the following equation:

$$p_i = \frac{e^{act_i}}{\sum_{1 \le j \le n} e^{act_j}}$$
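
The softmax step can be written in a few lines of Python. The sketch below takes a plain list of raw activation values (it does not call SNoW or any other library) and subtracts the maximum before exponentiating, a standard numerical safeguard that leaves the result unchanged; the function name is illustrative.

```python
import math


def softmax(activations):
    """Convert raw activation values act_1..act_n into posteriors p_1..p_n."""
    m = max(activations)  # shifting by the max avoids overflow in exp
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned values are positive, sum to one, and preserve the ordering of the activations, so the winner-take-all prediction is unchanged.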

In addition to the separate approach, we also test several pipeline models, which
we denote E → R, R → E, and E ↔ R. The E → R approach first trains the
basic entity classifier (E), which is identical to the entity classifier in the separate
approach. Its predictions on the two entity arguments of a relation are then used
conjunctively as additional features (e.g., person-person or person-location) in
learning the relation classifier (R). Similarly, R → E first trains the relation classifier
(R); its output is then used as additional features in the entity classifier (E).
For example, the additional feature could be "this entity is predicted as the first
argument of a work_for relation." The E ↔ R model is the combination of the above
two. It uses the entity classifier in the R → E model and the relation classifier in
the E → R model as its final classifiers.
Although the true labels of entities and relations are known during training, only
the predicted labels are available during evaluation on new data (and in testing).
Therefore, rather than training the second-stage pipeline classifiers on the available
true labels, we train them on the predictions of the previous-stage classifiers. This
way, at test time the classifiers are evaluated on data of the same type they
were trained on, making the second-stage classifier more tolerant of the mistakes
(see footnote 6). The need to train pipeline classifiers this way has been observed
multiple times in natural language processing (NLP) research, and we have also
validated it in our experiments. For example, when the relation classifier is trained
using the true entity labels, the performance is usually worse than when it is trained
using the predicted entity labels.
The last approach, omniscient, tests the conceptual upper bound of this
entity-relation classification problem. It also trains the two classifiers separately.
However, it assumes that the entity classifier knows the correct relation labels, and
similarly the relation classifier knows the right entity labels as well. This additional
information is then used as features in training and testing. Note that this assumption
is unrealistic. Nevertheless, it may give us a hint of how accurate the classifiers with
global inference can be. Finally, we apply the LP-based inference procedure to
the above five models, and observe how it improves the performance.
20.5.3

Results

We test the aforementioned approaches using five-fold cross-validation. For each
approach, we also perform a paired t-test on its F1 scores before and after inference.
Tables 20.4 and 20.5 show the performance of each approach in recall, precision,
and F1.
The results show that the inference procedure consistently improves the performance of the five models, both in entities and relations. One interesting observation

6. In order to derive similar performance in testing, ideally the previous stage classifier
should be trained using a different corpus. We didn't take this approach because of data
scarcity.

20.5

Experiments


Table 20.4 Results of the entity classification in different approaches. Experiments
are conducted using five-fold cross-validation. Numbers in boldface indicate that
the p-values are smaller than 0.1. Symbols indicate significance at the 95% and
99% levels, respectively. Significance tests were computed with a two-tailed paired
t-test.

                          person              location            organization
Approach              Rec   Prec  F1      Rec   Prec  F1      Rec   Prec  F1
Separate              89.5  89.8  89.4    87.0  91.5  89.0    67.6  91.3  77.0
Separate w/ Inf       90.5  90.6  90.4    88.6  91.8  90.1    71.0  91.2  79.4
E → R                 89.5  89.8  89.4    87.0  91.5  89.0    67.6  91.3  77.0
E → R w/ Inf          89.7  90.1  89.7    87.0  91.7  89.1    69.0  91.2  78.0
R → E                 89.1  88.7  88.6    88.1  89.8  88.9    71.4  89.3  78.7
R → E w/ Inf          88.6  88.6  88.3    88.2  89.4  88.7    72.1  88.5  79.0
E ↔ R                 89.1  88.7  88.6    88.1  89.8  88.9    71.4  89.3  78.7
E ↔ R w/ Inf          89.5  89.1  89.0    88.7  89.7  89.1    72.0  89.5  79.2
Omniscient            94.9  93.7  94.2    92.4  96.6  94.4    88.1  93.5  90.7
Omniscient w/ Inf     96.1  94.2  95.1    94.0  97.0  95.4    88.7  94.9  91.7

is that the omniscient classifiers, which know the correct entity or relation labels,
can still be improved by the inference procedure. This demonstrates the effectiveness
of incorporating constraints, even when the learning algorithm may be able to
learn them from the data.
One of the more significant results in our experiments, we believe, is the improvement
in the quality of the decisions. As mentioned in section 20.1, incorporating
constraints helps to avoid inconsistency in classification. It is interesting to investigate
how often such mistakes happen without global inference, and see how effective
the global inference is.
For this purpose, we define the quality of the decision as follows. For a relation
variable and its two corresponding entity variables, if the labels of these variables are
predicted correctly and the relation is active (i.e., not other rel), then we count it
as a coherent prediction. Quality is then the number of coherent predictions divided
by the sum of coherent and incoherent predictions. When the inference procedure
is not applied, 5% to 25% of the predictions are incoherent. Therefore, the quality
is not always good. On the other hand, our global inference procedure takes the
natural constraints into account, so it never generates incoherent predictions. If the
relation classifier has the correct entity labels as features, a good learner should
learn the constraints as well. As a result, the quality of omniscient is almost as
good as omniscient with inference.
Another experiment we performed is the forced decision test, which boosts the F1
score of the kill relation to 86.2%. In this experiment, we assume that the system
knows which sentences have the kill relation at the decision time, but it does not
know which pair of entities have this relation. We force the system to determine

Table 20.5 Results of the relation classification in different approaches. Experiments
are conducted using five-fold cross-validation. Numbers in boldface indicate
that the p-values are smaller than 0.1. Symbols indicate significance at the
95% and 99% levels, respectively. Significance tests were computed with a two-tailed
paired t-test.

                          located in          work for            orgBased in
Approach              Rec   Prec  F1      Rec   Prec  F1      Rec   Prec  F1
Separate              53.0  43.3  45.2    41.9  55.1  46.3    35.6  85.4  50.0
Separate w/ Inf       51.6  56.3  50.5    40.1  74.1  51.2    35.7  90.8  50.8
E → R                 56.4  52.5  50.7    44.4  60.8  51.2    42.1  77.8  54.3
E → R w/ Inf          55.7  53.2  50.9    42.9  72.1  53.5    42.3  78.0  54.5
R → E                 53.0  43.3  45.2    41.9  55.1  46.3    35.6  85.4  50.0
R → E w/ Inf          53.0  49.8  49.1    41.6  67.5  50.4    36.6  87.1  51.2
E ↔ R                 56.4  52.5  50.7    44.4  60.8  51.2    42.1  77.8  54.3
E ↔ R w/ Inf          55.7  53.9  51.3    42.3  72.0  53.1    41.6  79.8  54.3
Omniscient            62.9  59.5  57.5    50.3  69.4  58.2    50.3  77.9  60.9
Omniscient w/ Inf     62.9  61.9  59.1    50.3  79.2  61.4    50.9  81.7  62.5

                          live in             kill
Approach              Rec   Prec  F1      Rec   Prec  F1
Separate              39.7  61.7  48.0    81.5  75.3  77.6
Separate w/ Inf       41.7  68.2  51.4    80.8  82.7  81.4
E → R                 50.0  58.9  53.5    81.5  73.0  76.5
E → R w/ Inf          50.0  57.7  53.0    80.6  77.2  78.3
R → E                 39.7  61.7  48.0    81.5  75.3  77.6
R → E w/ Inf          40.6  64.1  49.4    81.5  79.7  80.1
E ↔ R                 50.0  58.9  53.5    81.5  73.0  76.5
E ↔ R w/ Inf          49.0  59.1  53.0    81.5  77.5  79.0
Omniscient            56.1  61.7  58.2    81.4  76.4  77.9
Omniscient w/ Inf     57.3  63.9  59.9    81.4  79.9  79.9

which of the possible relations in a sentence (i.e., which pair of entities) has this
kill relation by adding the following linear inequality:
\[ \sum_{R \in \mathcal{R}} x_{\{R, kill\}} \ge 1 \]

This is equivalent to saying that at least one of the relation variables in the sentence
should be labeled as kill. Since this additional constraint applies only to
the sentences in which the kill relation is active, the inference results of other
sentences are not changed. Note that this is a realistic situation (e.g., in the context
of question answering) in that it adds an external constraint, not present at the time
of learning the classifiers, and it evaluates the ability of our inference algorithm to
cope with it. The results confirm our expectations.
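The effect of such a decision-time constraint can be illustrated with a minimal sketch. The assumptions are ours: the probabilities for the two relation variables are hypothetical, the label set is cut down to three labels, the entity constraints are ignored, and exhaustive enumeration stands in for the ILP solver.

```python
import math
from itertools import product

LABELS = ["other_rel", "work_for", "kill"]


def infer(rel_probs, require_kill=False):
    """Return the most probable joint labeling of the relation variables.

    With require_kill=True, labelings without any kill relation are
    discarded, which mirrors the "at least one kill" inequality.
    """
    best, best_score = None, -math.inf
    for assignment in product(LABELS, repeat=len(rel_probs)):
        if require_kill and "kill" not in assignment:
            continue
        score = sum(math.log(p[label]) for p, label in zip(rel_probs, assignment))
        if score > best_score:
            best, best_score = assignment, score
    return best


# Hypothetical classifier outputs for two relation variables in one sentence.
rel_probs = [
    {"other_rel": 0.6, "work_for": 0.1, "kill": 0.3},
    {"other_rel": 0.5, "work_for": 0.1, "kill": 0.4},
]
free = infer(rel_probs)
forced = infer(rel_probs, require_kill=True)
```

Without the constraint both variables are labeled other_rel; with it, the variable whose relabeling costs the least probability mass is switched to kill, and the classifiers themselves are left untouched.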
20.5.4

Case Study

Although tables 20.4 and 20.5 clearly demonstrate that the inference procedure
improves the performance, it is interesting to see how it corrects the mistakes by
examining a specific case. The following sentence is taken from a news article in
our corpus. The eight entities are in boldface, labeled E1 to E8.
At the proposal of the Serb Radical Party|E1 , the Assembly elected political
Branko Vojnic|E2 from Beli Manastir|E3 as its speaker, while Marko
Atlagic|E4 and Dr. Milan Ernjakovic|E5 , Krajina|E6 Serb Democratic
Party|E7 (SDS|E8 ) candidates, were elected as deputy speakers.
Table 20.6 shows the probability distribution estimated by the basic classifiers,
the predictions before and after the inference, along with the true labels. Table 20.7
provides this information for the relation variables. Because the values of most of
them are other rel, we only show a small set of them here.
Table 20.6 Example: Inference effect on entity predictions: the true labels, the
predictions before and after inference, and the probabilities estimated by the basic
classifiers.

      Label   before Inf.   after Inf.   other   person   loc.   org
E1    Org     Org           Org          0.21    0.13     0.06   0.60
E2    Per     Other         Other        0.46    0.16     0.33   0.05
E3    Loc     Loc           Loc          0.29    0.25     0.31   0.15
E4    Per     Other         Other        0.37    0.20     0.33   0.10
E5    Per     Loc           Per          0.10    0.31     0.36   0.23
E6    Loc     Loc           Loc          0.24    0.05     0.61   0.10
E7    Org     Per           Org          0.15    0.41     0.03   0.40
E8    Org     Org           Org          0.35    0.17     0.11   0.37

In this example, the inference procedure corrects two variables: E5 (Milan Ernjakovic)
and E7 (Serb Democratic Party). If we examine the probability distribution
of these two entity variables in table 20.6, it is easy to see that the classifier has
difficulty deciding whether E5 is a person's name or location, and whether E7 is a
person or organization. The strong belief that there is a work for relation between
these two entities (see the row R57 in table 20.7) enables the inference procedure
to correct this mistake. In addition, several relation predictions are also corrected
from work for to other rel because they lack the support of the entity classifier.
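The correction of E5 and E7 can be reproduced on a small scale. The sketch below is an illustration under several assumptions of ours: it considers only the three variables E5, E7, and R57 rather than the whole sentence, it assigns the probability columns of table 20.6 to labels in the one way that agrees with the "before Inf." predictions, it takes the argument-type constraints to be the natural ones (e.g., work for relates a person to an organization), and it uses exhaustive enumeration in place of the LP solver.

```python
from itertools import product

# Classifier estimates for E5, E7 (table 20.6) and R57 (table 20.7).
E5 = {"other": 0.10, "person": 0.31, "loc": 0.36, "org": 0.23}
E7 = {"other": 0.15, "person": 0.41, "loc": 0.03, "org": 0.40}
R57 = {"other_rel": 0.26, "located_in": 0.07, "work_for": 0.44,
       "org_in": 0.01, "live_in": 0.21, "kill": 0.02}

# Assumed constraints: relation -> (type of first argument, type of second).
ALLOWED = {
    "located_in": ("loc", "loc"),
    "work_for": ("person", "org"),
    "org_in": ("org", "loc"),
    "live_in": ("person", "loc"),
    "kill": ("person", "person"),
}


def consistent(e1, e2, rel):
    """other_rel accepts any argument types; other relations fix both types."""
    return rel == "other_rel" or ALLOWED[rel] == (e1, e2)


def argmax(dist):
    return max(dist, key=dist.get)


# Each classifier decides on its own: the result violates the constraints.
independent = (argmax(E5), argmax(E7), argmax(R57))

# Joint decision: the most probable assignment among the consistent ones.
joint = max(
    (a for a in product(E5, E7, R57) if consistent(*a)),
    key=lambda a: E5[a[0]] * E7[a[1]] * R57[a[2]],
)
```

Under these assumptions the independent decisions are (loc, person, work_for), which is incoherent, while the joint decision is (person, org, work_for), matching the "after Inf." column for E5, E7, and R57.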
Note that not every mistake can be rectified, as several work for relations are
misidentified as other rel. This may be due to the fact that the relation other rel
can take any type of entity as its arguments. In some rare cases, the inference


Table 20.7 Example: Inference effect on relation predictions: the true labels, the
predictions before and after inference, and the probabilities estimated by the basic
classifiers.

      Label       before Inf.  after Inf.   other rel  located in  work for  org in  live in  kill
R23   kill        other rel    other rel    0.66       0.10        0.03      0.03    0.11     0.08
R37   other rel   work for     other rel    0.38       0.07        0.41      0.02    0.10     0.02
R47   work for    other rel    other rel    0.65       0.05        0.19      0.02    0.06     0.03
R48   work for    other rel    other rel    0.83       0.06        0.03      0.02    0.04     0.03
R51   other rel   work for     work for     0.36       0.06        0.42      0.01    0.13     0.02
R52   other rel   work for     other rel    0.24       0.15        0.28      0.04    0.22     0.07
R56   other rel   work for     other rel    0.23       0.16        0.35      0.01    0.22     0.02
R57   work for    work for     work for     0.26       0.07        0.44      0.01    0.21     0.02
R58   work for    other rel    other rel    0.58       0.06        0.14      0.02    0.17     0.02
R67   work for    other rel    other rel    0.67       0.06        0.19      0.02    0.05     0.01
R68   work for    other rel    other rel    0.76       0.09        0.04      0.04    0.05     0.02

procedure may change a correct prediction to a wrong label. However, since this
seldom happens, the overall performance is still improved after inference.
One interesting thing to notice is the efficiency of this ILP inference in practice.
Using a Pentium III 800MHz machine, it takes less than 30 seconds to process all
the 1437 sentences (5336 entity variables and 19,048 relation variables in total).

20.6

Comparison with Other Inference Methods


In this section, we provide a broader view of inference methods and place the ILP
approach described in this chapter in this context. Our approach to the problem
of learning with structured output decouples learning and inference stages. As
mentioned earlier, this is not the only approach. In other approaches (e.g., [39, 36]),
training can be done globally, coupled with the inference. Coupling training and
inference has multiple effects on performance and time complexity, which we do not
discuss here (but see [32, 26] for some comparative discussion) as we concentrate
on the inference component. Inference is the problem of determining the best global
output $\hat{y} \in F(Y)$ given model parameters $\lambda$, according to some cost function $f$,
where $Y$ is the output space and $F(Y) \subseteq Y$ is the subset of $Y$ that satisfies some
constraints. Formally, if $x$ represents the input data, then $\hat{y}$ is decided as follows:
\[ \hat{y} = \operatorname{argmax}_{y \in F(Y)} f(x, y; \lambda). \]
The efficiency and tractability of the inference procedure dictate the feasibility of
the whole framework. However, whether there exists an efficient and exact inference
algorithm highly depends on the problem's structure. Polynomial-time algorithms
usually do not exist when there are complex constraints among the output variables
(just like the entity/relation problem described in this chapter). In this section, we
briefly introduce several common inference algorithms in various text-processing
problems, and contrast them with our ILP approach.
20.6.1

Exact Polynomial-time Methods

Most polynomial-time inference methods are based on dynamic programming. For
linear chain structures, the Viterbi algorithm and its variations are the most popular.
For tree structures, different cubic-time algorithms have been proposed. Although
replacing these algorithms with the ILP approach does not necessarily make
the inference more efficient in practice, as we show below, the ILP framework does
provide these polynomial-time algorithms an easy way to incorporate additional
declarative constraints, which may not be possible to express within the original
inference algorithm. We describe these methods here and sketch how they can be
formulated as an integer linear programming problem.
20.6.1.1

The Viterbi Algorithm

Linear-chain structures are often used for sequence labeling problems, where the
task is to decide the label of each token. For this problem, HMMs [27], conditional
sequential models and other extensions [25], and conditional random fields [21] are
commonly used. While the first two methods learn the state transition between
a pair of consecutive tokens, conditional random fields relax the directionality
assumption and train the potential functions for the size-1 (i.e., a single token) and
size-2 (a pair of consecutive tokens) cliques. In both cases, the Viterbi algorithm is
usually used to find the most probable sequence assignment.
We describe the Viterbi algorithm in the linear-chain conditional random fields
setting as follows. Suppose we need to predict the labels of a sequence of tokens,
$t_0, t_1, \ldots, t_{n-1}$. Let $Y$ be the set of possible labels for each token, where $|Y| = m$.
A set of $m \times m$ matrices $\{M_i(x) \mid i = 0, \ldots, n-1\}$ is defined over each pair of labels
$y', y \in Y$:
\[ M_i(y', y \mid x) = \exp\Big(\sum_j \lambda_j f_j(y', y, x, i)\Big), \]
where $\lambda_j$ are the model parameters and $f_j$ are the features. By augmenting two
special nodes $y_{-1}$ and $y_n$ before and after the sequence with labels start and end
respectively, the sequence probability is
\[ p(\mathbf{y} \mid x, \lambda) = \frac{1}{Z(x)} \prod_{i=0}^{n} M_i(y_{i-1}, y_i \mid x). \]
$Z(x)$ is a normalization factor that can be computed from the $M_i$'s but is not
needed in evaluation. We only need to find the label sequence $\mathbf{y}$ that maximizes
the product of the corresponding elements of these $n + 1$ matrices. The Viterbi
algorithm is the standard method that computes the most likely label sequence
given the observation. It grows the optimal label sequence incrementally by scanning
the matrices from position 0 to $n$. At step $i$, it records all the optimal sequences
ending at a label $y$, $\forall y \in Y$ (denoted by $\mathbf{y}_i(y)$), and also the corresponding product
$P_i(y)$. The recursive function of this dynamic programming algorithm is
1. $P_0(y) = M_0(\mathit{start}, y \mid x)$ and $\mathbf{y}_0(y) = y$.
2. for $1 \le i \le n$, $\mathbf{y}_i(y) = \mathbf{y}_{i-1}(\hat{y}).(y)$ and $P_i(y) = \max_{y' \in Y} P_{i-1}(y') M_i(y', y \mid x)$,
where $\hat{y} = \operatorname{argmax}_{y' \in Y} P_{i-1}(y') M_i(y', y \mid x)$ and "." is the concatenation operator.
The optimal sequence is therefore $\mathbf{y}^{*}_{n-1} = [\mathbf{y}_n]_{0..n-1}$, which is the best path to the
end symbol but taking only position 0 to position $n - 1$.
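The recursion above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in any particular system: labels are represented by integer indices, the transitions from start and to end are passed as separate score vectors, and all names and numbers are hypothetical. Back-pointers are stored instead of whole partial sequences, which is equivalent to the concatenation in step 2.

```python
def viterbi(start_scores, transitions, end_scores):
    """Return the highest-scoring label sequence and its score.

    start_scores[y]      plays the role of M_0(start, y | x)
    transitions[k][p][y] is the score of label p at position k followed by
                         label y at position k + 1
    end_scores[y]        plays the role of M_n(y, end | x)
    """
    m = len(start_scores)
    best = list(start_scores)          # P_0(y)
    back = []                          # back-pointers, one list per step
    for matrix in transitions:
        new_best, pointers = [], []
        for y in range(m):
            # Best previous label for ending at y in this position.
            prev = max(range(m), key=lambda p: best[p] * matrix[p][y])
            pointers.append(prev)
            new_best.append(best[prev] * matrix[prev][y])
        best = new_best
        back.append(pointers)
    # Transition into the end node, then follow the back-pointers.
    last = max(range(m), key=lambda y: best[y] * end_scores[y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last] * end_scores[last]


# Hypothetical scores for three tokens and two labels.
start = [0.6, 0.4]
trans = [
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.2, 0.8], [0.5, 0.5]],
]
end = [1.0, 1.0]
path, score = viterbi(start, trans, end)
```

For long sequences the products underflow, so practical implementations usually add logarithms of the matrix entries instead of multiplying the entries themselves.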

The solution that Viterbi outputs is in fact the shortest path in the graph
constructed as follows. Let $n$ be the number of tokens in the sequence, and $m$ be
the number of labels each token can take. The graph consists of $nm + 2$ nodes and
$(n-1)m^2 + 2m$ edges. In addition to two special nodes start and end that denote
the start and end positions of the path, the label of each token is represented by a
node $v_{ij}$, where $0 \le i \le n-1$, and $0 \le j \le m-1$. If the path passes node $v_{ij}$, then
label $j$ is assigned to token $i$. For nodes that represent two adjacent tokens $v_{(i-1)j}$
and $v_{ij'}$, where $0 < i < n$, and $0 \le j, j' \le m-1$, there is a directed edge $x_{i,jj'}$ from
$v_{(i-1)j}$ to $v_{ij'}$, with the cost $-\log(M_i(jj' \mid x))$.
Obviously, the path from start to end will pass exactly one node on position $i$. That
is, exactly one of the nodes $v_{i,j}$, $0 \le j \le m-1$, will be picked. Figure 20.3 illustrates
the graph. Suppose that $\mathbf{y} = y_0 y_1 \cdots y_{n-1}$ is the label sequence determined by the
path. Then
\[ \operatorname{argmin}_{\mathbf{y}} \sum_{i=0}^{n-1} -\log(M_i(y_{i-1} y_i \mid x)) = \operatorname{argmax}_{\mathbf{y}} \prod_{i=0}^{n-1} M_i(y_{i-1} y_i \mid x). \]
Namely, the nodes in the shortest path are exactly the labels returned by the Viterbi
algorithm.

Figure 20.3 The graph that represents the labels of the tokens and the state
transition (also known as the trellis in hidden Markov models).


The Viterbi algorithm can still be used when the matrix is slightly modified to
incorporate simple constraints. For example, in the task of information extraction,
if the label of a word is the beginning of an entity (B), inside an entity (I), or
outside any entity (O), a token label O immediately followed by a label I is not a
valid labeling. The constraint can be incorporated by changing the corresponding
transitional probability or matrix entries to 0 [10, 20]. However, more general,
non-Markovian constraints cannot be resolved using the same trick.
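The matrix modification can be illustrated with a small sketch. The scores below are hypothetical, a single transition matrix is shared by all positions for brevity, and an exhaustive search over label sequences stands in for the Viterbi algorithm so that the example stays short.

```python
from itertools import product

LABELS = ["B", "I", "O"]


def forbid(matrix, prev_label, label):
    """Return a copy of the transition matrix with one transition set to 0."""
    copy = {p: dict(row) for p, row in matrix.items()}
    copy[prev_label][label] = 0.0
    return copy


def decode(start, matrix, length):
    """Exhaustive argmax over label sequences (a stand-in for Viterbi)."""
    def score(seq):
        s = start[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            s *= matrix[prev][cur]
        return s
    return max(product(LABELS, repeat=length), key=score)


# Hypothetical scores; a sequence may not start inside an entity.
start = {"B": 0.2, "I": 0.0, "O": 0.8}
matrix = {
    "B": {"B": 0.1, "I": 0.6, "O": 0.3},
    "I": {"B": 0.1, "I": 0.5, "O": 0.4},
    "O": {"B": 0.3, "I": 0.4, "O": 0.3},
}
unconstrained = decode(start, matrix, 3)
constrained = decode(start, forbid(matrix, "O", "I"), 3)
```

With these numbers the unconstrained decoder returns the invalid sequence O I I, while zeroing the O-to-I entry yields O B I. The constraint is expressible this way only because it involves two adjacent labels.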
Recently, Roth and Yih [32] proposed a different inference approach based on
ILP to replace the Viterbi algorithm. The basic idea there is to use integer linear
programming to find the shortest path in the trellis (e.g., figure 20.3). Each edge
of the graph is represented by an indicator variable to represent whether this edge
is in the shortest path or not. The cost function can be written in terms of a linear
function of these indicator variables. In this ILP, linear (in)equalities are added to
enforce that the values of these indicator variables represent a legitimate path. This
ILP can be solved simply by LP relaxation because the coefficient matrix is totally
unimodular. However, the main advantage of this new setting is its ability to allow
more general constraints that can be encoded either in linear (in)equalities or in
the cost function. Interested readers may see [32] for more details.
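One standard way to write such a program is the flow formulation of the shortest path problem, sketched below. This is our illustration and the exact constraints used in [32] may differ; $E$ denotes the edge set of the trellis, $c_{uv}$ the cost of edge $(u, v)$, and $z_{uv}$ the indicator variable of that edge (the symbol $z$ is chosen here only to avoid a clash with the input $x$).

```latex
\begin{aligned}
\min_{z}\quad & \sum_{(u,v) \in E} c_{uv}\, z_{uv} \\
\text{subject to}\quad
& \sum_{v : (\mathit{start}, v) \in E} z_{\mathit{start}, v} = 1,
  \qquad \sum_{u : (u, \mathit{end}) \in E} z_{u, \mathit{end}} = 1, \\
& \sum_{u : (u,v) \in E} z_{uv} = \sum_{w : (v,w) \in E} z_{vw}
  \quad \text{for every node } v \notin \{\mathit{start}, \mathit{end}\}, \\
& z_{uv} \in \{0, 1\} \quad \text{for every edge } (u,v) \in E .
\end{aligned}
```

The equality constraints are rows of the node-arc incidence matrix of a directed graph, which is totally unimodular; this is why relaxing $z_{uv} \in \{0, 1\}$ to $0 \le z_{uv} \le 1$ still yields an integral optimum. An additional declarative constraint is simply one more linear (in)equality over the same variables, although integrality of the relaxation is then no longer guaranteed.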
20.6.1.2

Constraint Satisfaction with Classifiers

A second efficient inference algorithm for linear sequence tasks that has been used
successfully for natural language and information extraction problems is constraint
satisfaction with classifiers (CSCL) [25]. This method was first proposed for shallow
parsing, that is, identifying atomic phrases (e.g., base noun phrases) in a given sentence.
In that case, two classifiers are first trained to predict whether a word opens
(O) a phrase or closes (C) a phrase. Since these two classifiers may generate
inconsistent predictions, the inference task has to decide which OC pairs are indeed
the boundaries of a phrase.
We illustrate their approach by the following example. Suppose a sentence has
six tokens, $t_1, \ldots, t_6$, as indicated in figure 20.4. The classifiers have identified three
opens (O) and three closes (C) in this sentence (i.e., the open and close brackets).
Among the OC pairs $(t_1, t_3), (t_1, t_5), (t_1, t_6), (t_2, t_3), (t_2, t_5), (t_2, t_6), (t_4, t_5), (t_4, t_6)$,
the inference procedure needs to decide which of them are the predicted phrases,
based on the cost function. In addition, the chosen phrases should not overlap or
embed with each other. Let the predicate "this pair is selected as a phrase" be
represented by an indicator variable $x_i \in X$, where $|X| = 8$ in this case. They
associate a cost function $c : X \to \mathbb{R}$ with each variable (where the value $c(x_i)$ is
determined as a function of the corresponding OC classifiers), and try to find a
solution that minimizes the overall cost, $\sum_{i=1}^{n} c(x_i) x_i$.
Figure 20.4 Identifying phrases in a sentence using the constraint satisfaction
with classifiers (CSCL) approach, courtesy of Vasin Punyakanok.

This problem can be reduced elegantly to a shortest path problem by the following
graph construction. Each open and close word is represented by an O node and a C
node. For each possible OC pair, there is a direct link from the corresponding open
node O to the close node C. Finally, one source (s) node and one target (t) node
are added. Links are added from s to each O and from each C to t. The cost of
an OC link is $-p_i$, where $p_i$ is the probability that this OC pair represents a phrase,
estimated by the O and C classifiers.
Because the inference process is also done by finding the shortest path in the
graph, the ILP framework described in [32] is applicable here as well.
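The selection problem in this example is small enough to solve by enumeration, which makes the objective concrete. In the sketch below the eight OC pairs are those listed above, while the probabilities are hypothetical and the cost of a pair is taken to be the negated probability. Exhaustive enumeration is used only for illustration, in place of the shortest path or ILP solution.

```python
from itertools import combinations

# The eight candidate (open, close) token pairs; probabilities are made up.
pairs = {
    (1, 3): 0.5, (1, 5): 0.2, (1, 6): 0.1, (2, 3): 0.6,
    (2, 5): 0.3, (2, 6): 0.1, (4, 5): 0.7, (4, 6): 0.4,
}


def disjoint(a, b):
    """True if phrases a and b neither overlap nor embed one another."""
    return a[1] < b[0] or b[1] < a[0]


def best_phrases(pairs):
    """Minimize the sum of c(x_i) * x_i with c = -p over valid selections."""
    candidates = list(pairs)
    best, best_cost = (), 0.0          # selecting nothing costs 0
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if all(disjoint(a, b) for a, b in combinations(subset, 2)):
                cost = sum(-pairs[p] for p in subset)
                if cost < best_cost:
                    best, best_cost = subset, cost
    return set(best), best_cost


chosen, cost = best_phrases(pairs)
```

With these numbers the selected phrases are (t2, t3) and (t4, t5). Enumeration grows exponentially with the number of candidate pairs, which is the reason for the graph construction above.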
20.6.1.3

Clause Identification

The two efficient approaches mentioned above can be generalized beyond the
sequential structure, to tree structures. Cubic-time dynamic algorithms are often
used for inference in various tree-structure problems, such as parsing [17] or clause
identification [38]. As an example, we discuss the inference approach proposed
by Carreras et al. [7], in the context of clause identification. Clause identification is a
partial parsing problem. Given a sentence, a clause is defined as a sequence of words
that contains a subject and a predicate [38]. In the following example sentence taken
from the Penn Treebank [23], each pair of corresponding parentheses represents a
clause. The task is thus to identify all the clauses in the sentence.
(The deregulation of railroads and trucking companies (that (began in 1980))
enabled (shippers to bargain for transportation).)
Although the problem looks similar to shallow parsing, the constraints between
the clauses are weaker: clauses may not overlap, but a clause can be embedded in
another. Formally speaking, let $w_i$ be the $i$th word in a sentence of $n$ words. A clause
can be defined as a pair of numbers $(s, t)$, where $1 \le s \le t \le n$, which represents the
word sequence $w_s, w_{s+1}, \ldots, w_t$. Given two clauses $c_1 = (s_1, t_1)$ and $c_2 = (s_2, t_2)$,
we say that these two clauses overlap iff $s_1 < s_2 \le t_1 < t_2$ or $s_2 < s_1 \le t_2 < t_1$.
Similarly to the approach presented throughout this chapter, in [7, 6], this
problem is solved by combining learning and inference. Briefly speaking, each
candidate clause $c = (s, t)$ in the targeted sentence is associated with a score,
score($c$), estimated by the classifiers. Let $C$ be the set of all possible clauses in
the given sentence, and $F(C)$ all possible subsets of $C$ that satisfy the nonoverlapping
constraint. Then the best clause prediction is defined as
\[ \mathbf{c}^{*} = \operatorname{argmax}_{\mathbf{c} \in F(C)} \sum_{c \in \mathbf{c}} \mathrm{score}(c). \]

Carreras et al. [7] proposed a dynamic programming algorithm to solve this inference
problem. In this algorithm, two 2D matrices are maintained: best-split[s,t]
stores the optimal clause predictions in $w_s, w_{s+1}, \ldots, w_t$; score[s,t] is the score
of the clause $(s, t)$. By filling the table recursively, the optimal clause prediction
can be found in $O(n^3)$ time.
As in the previous cases discussed in this section, it is clear that this problem
can be represented as an ILP. Each candidate clause $(s, t)$ can be represented
by an indicator variable $x_{s,t}$. The cost function is the sum of the score times
the corresponding indicator variable, namely $\sum (\mathrm{score}(s, t) \cdot x_{s,t})$. Suppose clause
candidates $(s_1, t_1)$ and $(s_2, t_2)$ overlap. The nonoverlapping constraint can be
enforced by adding a linear inequality, $x_{s_1,t_1} + x_{s_2,t_2} \le 1$.
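The overlap definition and the pairwise inequalities can be made concrete with a short sketch. The candidate clauses and their scores below are hypothetical, and the search over subsets is exhaustive; it is meant to show what the ILP encodes, not to replace the cubic-time dynamic program of [7].

```python
from itertools import combinations


def overlap(c1, c2):
    """Clauses overlap iff s1 < s2 <= t1 < t2 or s2 < s1 <= t2 < t1."""
    (s1, t1), (s2, t2) = c1, c2
    return s1 < s2 <= t1 < t2 or s2 < s1 <= t2 < t1


def best_clauses(scores):
    """Maximize the total score subject to x_a + x_b <= 1 for overlapping a, b."""
    candidates = list(scores)
    conflicts = [(a, b) for a, b in combinations(candidates, 2) if overlap(a, b)]
    best, best_score = frozenset(), 0.0
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)
            # One inequality per overlapping pair: both cannot be selected.
            if any(a in chosen and b in chosen for a, b in conflicts):
                continue
            total = sum(scores[c] for c in subset)
            if total > best_score:
                best, best_score = frozenset(subset), total
    return best, best_score


# Hypothetical candidate clauses (s, t) with classifier scores.
scores = {(1, 10): 0.9, (3, 6): 0.4, (5, 12): 0.5, (7, 10): 0.3}
clauses, total = best_clauses(scores)
```

Here (3, 6) and (7, 10) are embedded in (1, 10), which is allowed, so all three are selected, while (5, 12) is excluded because it overlaps both (1, 10) and (3, 6).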
20.6.2

Generic Methods: Search

As discussed above, exact polynomial time algorithms exist for specific constraint
structures; however, the inference problem typically becomes computationally
intractable when additional constraints are introduced, or more complex structures
are needed. A common computational approach to the inference problem in this
case is search. Following the definition in [33], search is used to find a legitimate
state transition path from the initial state to a goal state while trying to minimize
the cost. The problem can be treated as consisting of four components: state
space, operators (the legitimate state transitions), goal-test (a function that examines
whether a goal state is reached), and path-cost-function (the cost function of
the whole path). Figure 20.5 depicts a generic search algorithm.
To solve the entity-relation problem described in this chapter, we can define the
state space as the set of all possible labels of the entities and relations (namely,
$L_E$ and $L_R$), plus undecided. In the initial state, the values of all the variables
are undecided. A legitimate operator changes an entity or relation variable from
undecided to one of the possible labels, subject to the constraints. The goal-test
evaluates whether every variable has been assigned a label, and the path-cost is the
sum of the assignment cost of each variable.
The main advantage of inference using search is its generality. The cost function
need not be linear. The constraints can also be fairly general: as long as the decision
on whether a state violates constraints can be evaluated efficiently, they can be used
to define the operators.

Algorithm 1
generic-search(problem, enqueue-func)
  nodes ← MakeQueue(MakeNode(init-state(problem)))
  while (nodes is not empty)
    node ← RemoveFront(nodes)
    if (goal-test(node)) then return node
    next ← Operators(node)
    nodes ← enqueue-func(problem, nodes, next)
  end
  return failure
end

Figure 20.5 The generic search algorithm, adapted from [33].
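The generic search loop can be sketched in Python as follows. Everything beyond the loop itself is an assumption made for illustration: the function signature differs slightly from figure 20.5, the toy problem has two entity variables and one relation variable with hypothetical probabilities, the assignment cost is the negative log probability, and the only constraint is that work_for needs a person and an organization. Two enqueue functions are shown, a best-first ordering and a beam of width 1.

```python
import math


def generic_search(init_state, goal_test, operators, enqueue):
    """Generic search loop in the spirit of figure 20.5."""
    nodes = [init_state]
    while nodes:
        node = nodes.pop(0)                      # RemoveFront
        if goal_test(node):
            return node
        nodes = enqueue(nodes, operators(node))
    return None                                  # failure


# Hypothetical classifier outputs, in the order the variables are assigned.
DISTS = [
    {"person": 0.4, "loc": 0.6},                 # first entity
    {"org": 0.7, "person": 0.3},                 # second entity
    {"work_for": 0.8, "other_rel": 0.2},         # relation between them
]


def operators(node):
    """Label the next undecided variable, skipping constraint violations."""
    cost, labels = node
    successors = []
    for label, p in DISTS[len(labels)].items():
        # Assumed constraint: work_for needs a person and an organization.
        if label == "work_for" and labels != ("person", "org"):
            continue
        successors.append((cost - math.log(p), labels + (label,)))
    return successors


def goal_test(node):
    return len(node[1]) == len(DISTS)


def best_first(nodes, successors):
    """Keep every node, cheapest first."""
    return sorted(nodes + successors, key=lambda n: n[0])


def greedy_beam(nodes, successors):
    """Keep only the single cheapest node (beam of width 1)."""
    return best_first(nodes, successors)[:1]


start = (0.0, ())
exact = generic_search(start, goal_test, operators, best_first)
greedy = generic_search(start, goal_test, operators, greedy_beam)
```

In this toy problem the best-first ordering returns (person, org, work_for), the most probable consistent labeling, whereas the width-1 beam commits early to loc for the first entity and ends with (loc, org, other_rel), a consistent but less probable labeling.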
The main disadvantage, however, is that there is no guarantee of optimality.
Despite this weakness, it has been shown empirically that search is a successful
approach in some tasks. For instance, Moore [24] applied beam search to find the
best word alignment given a linear model learned using voted perceptron. Recently,
Daumé and Marcu [14] demonstrated an approximate large margin method for
learning structured output, where the key inference component is search.
In contrast, our ILP approach may or may not be able to replace this search
mechanism, depending on the specific cost function. Nevertheless, in several
real-world problems, we observed that our ILP method may not be slower than search,
but is guaranteed to find the optimal solution.

20.7

Conclusion
We presented a linear-programming based approach for global inference in cases
where decisions depend on the outcomes of several different but mutually dependent
classifiers. Even in the presence of a fairly general constraint structure, deviating
from the sequential nature typically studied, this approach can find the optimal
solution efficiently.
Contrary to general search schemes (e.g., beam search), which do not guarantee
optimality, the LP approach provides an efficient way of finding the optimal
solution. The key advantage of the LP formulation is its generality and flexibility; in
particular, it supports the ability to incorporate classifiers learned in other contexts,
hints supplied, and decision-time constraints, and reason with all these
for the best global prediction. In sharp contrast with the typically used pipeline
framework, our formulation does not blindly trust the results of some classifiers,
and therefore is able to overcome mistakes made by classifiers with the help of
constraints.


Our experiments have demonstrated these advantages by considering the interaction
between entity and relation classifiers. In fact, more classifiers can be added
and used within the same framework. For example, if coreference resolution is available,
it is possible to incorporate it in the form of constraints that force the labels of
the coreferred entities to be the same (but, of course, allowing the global solution
to reject the suggestion of these classifiers). Consequently, this may enhance the
performance of entity-relation recognition and, at the same time, correct possible
coreference resolution errors. Another example is to use chunking information for
better relation identification; suppose, for example, that we have available chunking
information that identifies Subj+Verb and Verb+Object phrases. Given a sentence
that has the verb "murder," we may conclude that the subject and object of this
verb are in a kill relation. Since the chunking information is used in the global inference
procedure, this information will contribute to enhancing its performance and
robustness, relying on having more constraints and overcoming possible mistakes
by some of the classifiers. Moreover, in an interactive environment where a user can
supply new constraints (e.g., a question-answering situation) this framework is able
to make use of the new information and enhance the performance at decision time,
without retraining the classifiers. As we have shown, our formulation improves not
only the accuracy but also the coherence of the decisions.
We believe that it has the potential to be a powerful way of supporting natural
language inference.

Acknowledgments
Most of this research was done when Wen-tau Yih was at the University of Illinois
at Urbana-Champaign. This research was supported by NSF grants CAREER IIS-9984168
and ITR IIS-0085836, an ONR MURI award, and by the Advanced Research
and Development Activity (ARDA)'s Advanced Question Answering for Intelligence
(AQUAINT) program.

References
[1] S. Abney. Parsing by chunks. In R. Berwick, S. Abney, and C. Tenny, editors,
Principle-Based Parsing: Computation and Psycholinguistics, pages 257-278.
Kluwer, Dordrecht, Netherlands, 1991.
[2] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,
Oxford, UK, 1995.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization
via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence,
23(11):1222-1239, 2001.


[4] A. Carlson, C. Cumby, J. Rosen, and D. Roth. The SNoW learning architecture.
Technical Report UIUCDCS-R-99-2101, University of Illinois at Urbana-Champaign
Computer Science Department, May 1999.
[5] X. Carreras and L. Màrquez. Introduction to the CoNLL-2004 shared tasks:
Semantic role labeling. In Proceedings of the Conference on Natural Language
Learning, 2004.
[6] X. Carreras and L. Màrquez. Introduction to the CoNLL-2005 shared task:
Semantic role labeling. In Proceedings of the Conference on Natural Language
Learning, 2005.
[7] X. Carreras, L. Màrquez, V. Punyakanok, and D. Roth. Learning and inference
for clause identification. In Proceedings of the European Conference on Machine
Learning, 2002.
[8] C. Chekuri, S. Khanna, J. Naor, and L. Zosin. Approximation algorithms for
the metric labeling problem via a new linear programming formulation. In
Symposium on Discrete Algorithms, 2001.
[9] R. Chellappa and A. Jain. Markov Random Fields: Theory and Application.
Academic Press, April 1993.
[10] H. Chieu and H. Ng. A maximum entropy approach to information extraction
from semi-structured and free text. In Proceedings of the National Conference
on Artificial Intelligence, 2002.
[11] CPLEX. ILOG, Inc. http://www.ilog.com/products/cplex/, 2003.
[12] C. Cumby and D. Roth. Relational representations that facilitate learning.
In Proceedings of the International Conference on Principles of Knowledge
Representation and Reasoning, 2000.
[13] C. Cumby and D. Roth. On kernel methods for relational learning. In
Proceedings of the International Conference on Machine Learning, 2003.
[14] H. Daumé III and D. Marcu. Learning as search optimization: Approximate
large margin methods for structured prediction. In Proceedings of the International
Conference on Machine Learning, 2005.
[15] T. Dietterich. Machine learning for sequential data: A review. In Structural,
Syntactic, and Statistical Pattern Recognition, pages 15-30. Springer-Verlag,
2002.
[16] Y. Even-Zohar and D. Roth. A classification approach to word prediction.
In Proceedings of the Annual Meeting of the North American Association of
Computational Linguistics, 2000.
[17] M. Johnson. PCFG models of linguistic tree representations. Computational
Linguistics, 24(4):613-632, 1998.
[18] R. Khardon, D. Roth, and L. G. Valiant. Relational learning for NLP using
linear threshold elements. In Proceedings of the International Joint Conference
on Artificial Intelligence, 1999.

References

579

[19] J. Kleinberg and E. Tardos. Approximation algorithms for classication


problems with pairwise relationships: Metric labeling and markov random
elds. In IEEE Symposium on Foundations of Computer Science, 1999.
[20] T. Kristjannson, A. Culotta, P. Viola, and A. McCallum. Interactive information extraction with constrained conditional random elds. In Proceedings
of the National Conference on Articial Intelligence, 2004.
[21] J. Laerty, A. McCallum, and F. Pereira. Conditional random elds: Probabilistic models for segmenting and labeling sequence data. In Proceedings of
the International Conference on Machine Learning, 2001.
[22] A. Levin, A. Zomet, and Y. Weiss. Learning to perceive transparency from the
statistics of natural scenes. In Proceedings of Neural Information Processing
Systems, 2002.
[23] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large
annotated corpus of English: the Penn treebank. Computational Linguistics,
19(2):313330, 1993.
[24] R. C. Moore. A discriminative framework for bilingual word alignment.
In Proceedings of the Conference on Empirical Methods in Natural Language
Processing, 2005.
[25] V. Punyakanok and D. Roth. The use of classiers in sequential inference. In
Proceedings of Neural Information Processing Systems, 2001.
[26] V. Punyakanok, D. Roth, W. Yih, and D. Zimak. Learning and inference over
constrained output. In Proceedings of the International Joint Conference on
Articial Intelligence, 2005.
[27] L.R. Rabiner. A tutorial on hidden Markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2), February 1989.
[28] D. Roth. Reasoning with classiers. In Proceedings of the European Conference on Machine Learning, 2002.
[29] D. Roth. Learning to resolve natural language ambiguities: A unied approach. In Proceedings of the National Conference on Articial Intelligence,
1998.
[30] D. Roth and W. Yih. Relational learning via propositional algorithms: An
information extraction case study. In Proceedings of the International Joint
Conference on Articial Intelligence, 2001.
[31] D. Roth and W. Yih. Probabilistic reasoning for entity and relation recognition. In Proceedings of the International Conference on Computational Linguistics, 2002.
[32] D. Roth and W. Yih. Integer linear programming inference for conditional
random elds. In Proceedings of the International Conference on Machine
Learning, 2005.
[33] S. Russell and P. Norvig. Articial Intelligence: A Modern Approach. Prentice
Hall, Upper Saddle River, NJ, 1995.

580

Global Inference for Entity and Relation Identication via a Linear Programming Formulation

[34] A. Schrijver. Theory of Linear and Integer Programming. Wiley Interscience


Series in Discrete Mathmatics. John Wiley & Sons, Hoboken, NJ, 1986.
[35] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of Uncertainty in Articial Intelligence, 2002.
[36] B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin
parsing. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, 2004.
[37] E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003
shared task: Language-independent named entity recognition. In Proceedings
of the Conference on Natural Language Learning, 2003.
[38] E. F. Tjong Kim Sang and H. Dejean. Introduction to the CoNLL-2001 shared
task: Clause identication. In Walter Daelemans and Remi Zajac, editors,
Proceedings of the Conference on Natural Language Learning, pages 5357,
2001.
[39] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector
machine learning for interdependent and structured output spaces. In Proceedings of the International Conference on Machine Learning, 2004.
[40] X. Wang and A. Regan. A cutting plane method for integer programming
problems with binary variables. Technical Report UCI-ITS-WP-00-12, University of California, Irvine, 2000.
[41] L. Wolsey. Integer Programming. John Wiley & Sons, Hoboken, NJ, 1998.
[42] Xpress-MP. Dash Optimization. http://www.dashoptimization.com/products.html,
2003.

Contributors

Pieter Abbeel
Computer Science Department
Stanford University
[email protected]
Eyal Amir
Department of Computer Science
University of Illinois, Urbana-Champaign
[email protected]
Rodrigo de Salvo Braz
Department of Computer Science
University of Illinois, Urbana-Champaign
[email protected]
Razvan C. Bunescu
Department of Computer Sciences
University of Texas, Austin
[email protected]
Elizabeth Burnside
Department of Radiology
Department of Biostatistics and Medical Informatics
University of Wisconsin, Madison
[email protected]
Vítor Santos Costa
COPPE/Sistemas
Universidade Federal do Rio de Janeiro, Brazil
[email protected]

James Cussens
Department of Computer Science &
York Centre for Complex Systems Analysis
University of York, UK
[email protected]
Jesse Davis
Department of Computer Science
University of Wisconsin, Madison
[email protected]
Luc De Raedt
Institute for Computer Science, Machine Learning Lab
Albert-Ludwigs-Universität Freiburg, Germany
[email protected]
Pedro Domingos
Department of Computer Science and Engineering
University of Washington
[email protected]
Inês Dutra
COPPE/Sistemas
Universidade Federal do Rio de Janeiro, Brazil
[email protected]
Sašo Džeroski
Department of Knowledge Technologies
Jožef Stefan Institute, Slovenia
[email protected]
Alan Fern
School of Electrical Engineering and Computer Science
Oregon State University
[email protected]

Nir Friedman
School of Computer Science and Engineering
Hebrew University, Israel
[email protected]

Lise Getoor
Computer Science Department
University of Maryland, College Park
[email protected]
Robert Givan
School of Electrical and Computer Engineering
Purdue University
[email protected]
David Heckerman
Microsoft Research Redmond
[email protected]
David Jensen
Computer Science Department
University of Massachusetts, Amherst
[email protected]
Kristian Kersting
Institute for Computer Science, Machine Learning Lab
Albert-Ludwigs-Universität Freiburg, Germany
[email protected]
Daphne Koller
Computer Science Department
Stanford University
[email protected]
Andrey Kolobov
Computer Science Division
University of California, Berkeley
[email protected]
Bhaskara Marthi
Computer Science Division
University of California, Berkeley
[email protected]

Andrew McCallum
Department of Computer Science
University of Massachusetts, Amherst
[email protected]
Chris Meek
Microsoft Research Redmond
[email protected]
Brian Milch
Computer Science Division
University of California, Berkeley
[email protected]
Raymond J. Mooney
Department of Computer Sciences
University of Texas, Austin
[email protected]
Stephen Muggleton
Department of Computing,
Imperial College London, UK
[email protected]
Jennifer Neville
Computer Science Department
University of Massachusetts, Amherst
[email protected]
Daniel L. Ong
Computer Science Division
University of California, Berkeley
[email protected]
David Page
Department of Biostatistics and Medical Informatics
University of Wisconsin, Madison
[email protected]
Niels Pahlavi
Department of Computing,
Imperial College London, UK

Avi Pfeffer
Division of Engineering and Applied Sciences
Harvard University
[email protected]
Alexandrin Popescul
Department of Computer and Information Science
University of Pennsylvania
[email protected]
Raghu Ramakrishnan
Department of Computer Science
University of Wisconsin, Madison
[email protected]
Matthew Richardson
Microsoft Research Redmond
[email protected]
Dan Roth
Department of Computer Science
University of Illinois, Urbana-Champaign
[email protected]
Stuart Russell
Computer Science Division
University of California, Berkeley
[email protected]
Jude Shavlik
Department of Computer Science
University of Wisconsin, Madison
[email protected]
David Sontag
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
[email protected]
Charles Sutton
Department of Computer Science
University of Massachusetts, Amherst
[email protected]

Ben Taskar
Department of Computer and Information Science
University of Pennsylvania
[email protected]
Lyle H. Ungar
Department of Computer and Information Science
University of Pennsylvania
[email protected]
Ming-Fai Wong
Computer Science Department
Stanford University
[email protected]
Wen-tau Yih
Machine Learning and Applied Statistics Group
Microsoft Research Redmond
[email protected]
SungWook Yoon
School of Electrical and Computer Engineering
Purdue University
[email protected]

Index

An online index is available on the book web page at

http://www.cs.umd.edu/srl-book/index.htm
