Peter McCullagh - Ten Projects in Applied Statistics-2023
Peter McCullagh
Ten Projects
in Applied Statistics
Springer Series in Statistics
Series Editors
Peter Bühlmann, Seminar für Statistik, ETH Zürich, Zürich, Switzerland
Peter Diggle, Dept. Mathematics, University Lancaster, Lancaster, UK
Ursula Gather, Dortmund, Germany
Scott Zeger, Baltimore, MD, USA
Springer Series in Statistics (SSS) is a series of monographs of general interest that
discuss statistical theory and applications.
The series editors are currently Peter Bühlmann, Peter Diggle, Ursula Gather,
and Scott Zeger. Peter Bickel, Ingram Olkin, and Stephen Fienberg were editors of
the series for many years.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Rosa
Preface
Goals
The book begins with ten chapters, each devoted to a detailed discussion of a specific
project. Some of these are projects that arose as part of the statistical consulting
program at the University of Chicago; others are taken from recent publications in
the scientific literature. The discussion specifically covers analyses that might seem
superficially plausible, but are in fact misleading.
The areas of application range from medical and biological sciences to animal
behavior studies, growth curves, time series, and environmental work. Statistical
techniques are kept as simple as is reasonable to do justice to the scientific
goals. They range from summary tables and graphs to linear models, generalized
linear models, variance-component models, time series, spatial processes, and so
on. Recognition of relationships among the observational units and the need to
accommodate correlations among responses is a recurring theme.
The second half of the book begins by discussing a range of fundamental
considerations that shape my attitude to applied statistics. Once they are pointed out,
these matters appear so simple and obvious that it is hard to explain why a detailed
discussion should be needed in an advanced-level text. But the simple fact that they
are so frequently overlooked shows that this attitude is unhelpful. Most such matters
are related to statistical design, the placement of the baseline, the identification of
observational units and experimental units, the role of covariates and relationships,
initial values, randomization and treatment assignment, and so on. Others are related
to the interplay between design and modelling: Is the proposed model compatible
with the design? More technical matters related to stochastic processes, including
stationarity and isotropy, techniques for constructing spatio-temporal processes,
likelihood functions, and so on are covered in later chapters. Parametric inference
is important, but it is not a major or primary focus. More important by far is to
estimate the right thing, however inefficiently, than to estimate the wrong thing with
maximum efficiency.
The book is aimed at professional statisticians and at students who are already
familiar with linear models but who wish to gain experience in the application of
statistical ideas to scientific research. It is also aimed at scientific researchers in
ecology, biology, or medicine who wish to use appropriate statistical methods in
their work.
The primary emphasis is on the translation of experimental concepts and
scientific ideas into stochastic models, and ultimately into statistical analyses and
substantive conclusions. Although numerical computation plays a major role, it is
not a driving force.
The aim of the book is not so much to illustrate algorithms or techniques of
computation, but to illustrate the role of statistical thinking and stochastic modelling
in scientific research. Typically, that cannot be accomplished without a detailed
discussion of the scientific objectives, the experimental design, randomization,
treatment assignment, baseline factors, response measurement, and so on. Before
settling on a standard family of stochastic processes, the statistician must first
ask whether the model is adequate as a formulation of the scientific goals. Is it
self-consistent? Is it in conflict with randomization? Does it address adequately
the sexual asymmetry in Drosophila courtship rituals? Is the sampling scheme
adequate for the stated objective? A glance at Chaps. 1–5 shows the extent to which
the discussion must be tailored to the specific needs of the application, and the
unfortunate consequences of adopting an off-the-shelf model in a routine way. As
D.R. Cox once remarked at an ISI meeting in 1979, “There are no routine statistical
questions—only questionable statistical routines.”
Every analysis and every stochastic model is open to criticism on the grounds that
it is not a good match for the application. A perfect match is a rarity, so compromise
is needed in every setting, and a balance must be struck in order to proceed. At
various points, I have included plausible analyses that are deficient, inappropriate,
misleading, or simply incorrect. The hope is that students might learn not only from
their own mistakes but from the mistakes of others.
Computation
The R package (R Core Team, 2015) is used throughout the book for all plots
and computations. Initial data summaries may use built-in functions such as
apply(...) or tapply(...) for one- and two-way tables of averages, or
plot(...) for plots of seasonal trends or residuals. Apart from standard functions
such as lm(...) for fitting linear models, glm(...) for generalized linear models, and fft(...) for the fast Fourier transform, two additional packages are
used for specialized purposes:
1. regress(...) (Clifford and McCullagh, 2006) for fitting linear Gaussian
models having non-trivial spatial or temporal correlations;
2. lmer(...) (Bates et al., 2015) for fitting linear and generalized linear random-effects models.
Either package may be used for fitting the models in Chaps. 1 and 2; glm() is
adequate for most computations in Chaps. 3 and 7; regress() is better suited to
the needs of Chaps. 4 and 5; and lmer() is better suited for Chap. 9.
Organization
The book is not organized linearly by topic. From a logical standpoint, it might
have been intellectually more satisfying to begin at the beginning with Chap. 11 and
to illustrate the various statistical design concepts with examples drawn from the
literature. That option was considered and quickly abandoned. A deliberate choice
has been made to put as much emphasis on the projects as on the statistical methods
and to draw upon whatever statistical techniques are required as and when they are
required. For that reason, the projects come first. Thus, a reader who is unsure of
the distinction between covariates and relationships or between observational and
experimental units or the implications of those distinctions may consult the relevant
portions of Chap. 11.
Several of the projects are taken from experiments reported in the recent scientific
literature. In most cases, the experimental design is fairly easy to understand when
the structure of the units is properly laid out. The importance of accommodating
correlations arising from temporal or other relationships among the units is a
recurring theme. And if that is the only lesson learned, the book will have served a
useful purpose.
Acknowledgments
My attitude toward applied statistics has been shaped over many years by discus-
sions with colleagues, particularly David Wallace, Mike Stein, Steve Stigler, Mei
Wang, and Colm O’Muircheartaigh. The books on experimental design by Cox
(1958), Mead (1988) and Bailey (2008), and the pair of papers Nelder (1965a,b)
on randomized field experiments have been particularly influential. Cox and Snell
(1981) is a trove of statistical wisdom and unteachable common sense.
For the past 25 years, I have worked closely with my colleague Mei Wang,
encouraging graduate students and advising research workers on projects brought
to the statistical consulting program at the University of Chicago. Over that period,
we have given advice on perhaps 500 projects from a wide range of researchers,
from all branches of the Physical and Biological Sciences to the Social Sciences,
Public Policy, Law, Humanities, and even the Divinity School. All of the attitudes
expressed in these notes are the product of direct experiences with researchers,
plus discussions with students and colleagues. Parts of three consulting projects
are included in Chaps. 1, 3 and 10.
Other themes and whole chapters have emerged from an advanced graduate
course taught at the University of Chicago, in which students are encouraged to
work on a substantial applied project taken from the recent scientific literature.
Chapter 5 is based on a course project selected by Wei Kuang in 2020. I have
included a detailed analysis of this project because it raises a number of technical
issues concerning factorial models that are well understood but seldom adequately
emphasized in the applied statistical literature. Chapter 9 is based on course projects
by Dongyue Xie, Irina Cristali, Lin Gui, and Y. Wei in 2018–2020. Chapter 16 was
motivated by a 2020 masters project by Ben Harris on spatio-temporal variation
in summer solar irradiance and its effect on solar power generation in downstate
Illinois. All of the attitudes and opinions expressed in these analyses are mine alone.
Over the past few years, several students have tackled various aspects of the Out-
of-Africa project. Some have gone beyond the call of duty, including Shane Miao
for her Masters project at the University of Oxford in 2016, and Josephine Santoso
for her Masters project at the University of Chicago in 2022.
I am indebted to students and colleagues, particularly Heather Battey, Laurie
Butler, Emma McCullagh, Mike Stein, Steve Stigler, and Mei Wang, for reading
and commenting on an earlier draft of the manuscript.
Contents

1 Rat Surgery  1
  1.1 Healing of Surgical Wounds  1
  1.2 An Elementary Analysis  3
  1.3 Two Incorrect Analyses  4
  1.4 Model Formulae  5
  1.5 A More Appropriate Formal Analysis  6
  1.6 Further Issues  7
    1.6.1 Exclusions  7
    1.6.2 Missing Components  8
    1.6.3 Back-Transformation  9
  1.7 Summary of Statistical Concepts  10
  1.8 Exercises  10
2 Chain Saws  15
  2.1 Efficiency of Chain Saws  15
  2.2 Covariate and Treatment Factors  16
  2.3 Goals of Statistical Analysis  17
  2.4 Formal Models  19
  2.5 REML and Likelihood Ratios  20
  2.6 Summary of Conclusions  21
  2.7 Exercises  22
3 Fruit Flies  25
  3.1 Diet and Mating Preferences  25
  3.2 Initial Analyses  26
    3.2.1 Assortative Mating  26
    3.2.2 Initial Questions and Exercises  27
  3.3 Refractory Effects  28
    3.3.1 More Specific Mating Counts  28
    3.3.2 Follow-Up Analyses  30
    3.3.3 Lexis Dispersion  31
    3.3.4 Is Under-Dispersion Possible?  32
    3.3.5 Independence  33
    3.3.6 Acknowledgement  36
  3.4 Technical Points  36
    3.4.1 Hypergeometric Simulation by Random Matching  36
    3.4.2 Pearson's Statistic  37
  3.5 Further Drosophila Project  38
  3.6 Exercises  40
4 Growth Curves  43
  4.1 Plant Growth: Data Description  43
  4.2 Growth Curve Models  45
  4.3 Technical Points  47
    4.3.1 Non-linear Model with Variance Components  47
    4.3.2 Fitted Versus Predicted Values  48
  4.4 Modelling Strategies  51
  4.5 Miscellaneous R Functions  52
  4.6 Exercises  52
5 Louse Evolution  55
  5.1 Evolution of Lice on Captive Pigeons  55
    5.1.1 Background  55
    5.1.2 Experimental Design  56
    5.1.3 Deconstruction of the Experimental Design  56
  5.2 Data Analysis  58
    5.2.1 Role of Tables and Graphs  58
    5.2.2 Trends in Mean Squares  59
    5.2.3 Initial Values and Factorial Subspaces  62
    5.2.4 A Simple Variance-Components Model  63
    5.2.5 Conformity with Randomization  64
  5.3 Critique of Published Claims  66
  5.4 Further Remarks  68
    5.4.1 Role of Louse Sex  68
    5.4.2 Persistence of Initial Patterns  69
    5.4.3 Observational Units  70
  5.5 Follow-Up  71
    5.5.1 New Design Information  71
    5.5.2 Modifications to Analyses  73
    5.5.3 Further Remarks  75
  5.6 Exercises  75
6 Time Series I  81
  6.1 A Meteorological Temperature Series  81
  6.2 Seasonal Cycles  82
    6.2.1 Means and Variances  82
    6.2.2 Skewness and Kurtosis  84
17 Likelihood  327
  17.1 Introduction  327
    17.1.1 Non-Bayesian Model  327
    17.1.2 Bayesian Resolution  328
  17.2 Likelihood Function  329
    17.2.1 Definition  329
    17.2.2 Bartlett Identities  330
    17.2.3 Implications for Estimation  332
    17.2.4 Likelihood-Ratio Statistic I  333
    17.2.5 Profile Likelihood  334
    17.2.6 Two Worked Examples  335
  17.3 Generalized Linear Models  337
  17.4 Variance-Components Models  339
  17.5 Mixture Models  339
    17.5.1 Two-Component Mixtures  339
    17.5.2 Likelihood-Ratio Statistic  341
    17.5.3 Sparse Signal Detection  342
  17.6 Inferential Compromises  343
  17.7 Exercises  345
18 Residual Likelihood  349
  18.1 Background  349
  18.2 Simple Linear Regression  351
  18.3 The REML Likelihood  351
    18.3.1 Projections  351
    18.3.2 Determinants  352
    18.3.3 Marginal Likelihood with Arbitrary Kernel  352
    18.3.4 Likelihood Ratios  353
  18.4 Computation  354
    18.4.1 Software Options  354
    18.4.2 Likelihood-Ratios  355
    18.4.3 Testing for Interaction  356
    18.4.4 Singular Models  358
  18.5 Exercises  358
19 Response Transformation  363
  19.1 Likelihood for Gaussian Models  363
  19.2 Box-Cox Transformation  364
    19.2.1 Power Transformation  364
    19.2.2 Re-scaled Power Transformation  365
    19.2.3 Worked Example  366
    19.2.4 Transformation and Residual Likelihood  368
  19.3 Quantile-Matching Transformation  370
  19.4 Exercises  371
References  401
Index  407
Chapter 1
Rat Surgery
The data shown in Table 1.1 were obtained in an experiment by Dr. George Huang of
the Department of Surgery at the University of Chicago, the purpose of which was to
investigate the effect of hyperbaric O2 treatment on the healing of surgical wounds
in diabetic rats. (Diabetics, both human and animal, tend to have more complications
following surgery than non-diabetics, and these rats made the ultimate murine
sacrifice by serving as the surgical model for diabetic effects in humans.) Thirty
rats were first given a drug that has the effect of destroying the pancreas, with the
goal of making the rats diabetic. All the rats underwent surgery, during which an
incision was made along the entire length of the back. This was immediately sewn
up with surgical staples, and the rats were returned to their cages.
The treatment group of fifteen rats was subjected to hyperbaric O2 treatment,
i.e., a 100% oxygen environment at two atmospheres of pressure, for 90 minutes
per day following surgery. The control group also received a similar treatment for
90 minutes daily, but at standard oxygen concentration and normal atmospheric
pressure. Six rats had glucose levels that were deemed too low to be considered
diabetic, and were excluded from the analysis. (You may assume initially that these
exclusions are unrelated to the O2 treatment.) After a 24-day recuperation period, the
24 rats still participating in the experiment were sacrificed, i.e., killed. Strips of skin
were taken from five sites labelled A–E on each rat, each site crossing the surgical
scar at a right angle. The strips were put on a tensiometer, stretched to the breaking
point, and the energy required to break the specimen was recorded. Unfortunately
some prepared specimens were deemed sub-par for procedural reasons unconnected
with skin strength, and in such cases no observation could be made: the unmeasured
specimens are indicated by – in the table. Rats 1–14 received the hyperbaric
treatment; rats 15–24 were the controls.
Handling by humans is known to be stressful for rats, and stress is associated with
poor health and shorter lifetimes. The experiment was designed to ensure that treated
and control rats were handled in a similar manner throughout the experiment, so
that any systematic differences between the groups could confidently be attributed
to treatment rather than to differences in the way that the rats were handled. Hence,
the control rats were inserted daily into the hyperbaric chamber so that they might
experience the same stress levels as the treated rats.
The main objective is to determine whether or not hyperbaric O2 treatment has
an effect on the healing of surgical wounds, and if so, whether the effect depends on
the position of the wound on the back. It was anticipated that the increased oxygen
effect would be beneficial, or at least not detrimental, for healing, and that the site
1.2 An Elementary Analysis
It is unclear whether in fact the treatment was assigned to the rats by a scheme
that could be described by a statistician as objective randomization. But we can be
assured that no effort was made to select rats differentially for the two groups so,
after tut-tutting, it is reasonable to proceed as if the assignment were randomized. In
principle, this is a completely randomized design with no blocking. Each rat-site pair
is one observational unit, and each rat is one experimental unit, so each experimental
unit consists of five observational units. The distinction between observational
units and experimental units is crucial in the design, in the analysis and in the
interpretation of results: see Sect. 11.4.2.
If there were no missing values, the analysis would be relatively straightforward,
so we first illustrate the reasoning behind the simpler analysis. First, the observations
are strictly positive strength measurements having a moderately large dynamic range
from 3.8 to 82.5, so a log transformation is more or less automatic. Normality of
residuals is important, but it is not nearly so important as additivity assumptions that
are made in a typical linear model—in this case additivity of rat effects, site effects
and treatment effects. Although it is easy to check for symmetry in the distribution
of residuals or in marginal histograms, it is arguably misleading to point to skewness
or non-normality as the principal reason for transformation.
To each experimental unit there corresponds an average response, giving 14
values for O2-treated rats, and ten values for control rats. In the absence of missing
components, the site effects contribute equally to rat averages, so the averages are
not contaminated by additive differences that may be present among sites. The
sample means for control and treated rats are 3.372 and 3.113 on the log scale,
the sample variances are 0.174 and 0.200 respectively, and the pooled variance is
(9 × 0.174 + 13 × 0.200)/22 = 0.189
on 22 degrees of freedom. This analysis, which is based on the rat averages, leads
to an estimated treatment effect
comparison may be unfair if site effects are appreciable and the pattern of missing
treated units is substantially different from the pattern for controls. For example,
site D is missing for 40% of the control rats, but only for 7% of treated rats. If the
response at site D were appreciably higher than that at other sites, the pattern of
missing units would create a bias in the treatment comparison. However, the site
averages on the log scale
One way to adjust for site effects is to fit a simple linear Gaussian model in which
site and treatment effects are additive on the log scale:

E(Y_is) = β_s + τ t(i),    (1.1)

where s is the site, and t(i) is the treatment indicator for rat i. The least-squares
treatment-effect estimate is −0.298 with standard error 0.119, which is computed
from the residual sum of squares of 34.30 on 98 degrees of freedom. According to
this analysis, the treatment estimate is 2.5 standard errors away from its null value,
a magnitude that is sufficient to make a case for publication in certain scientific
journals, even if its direction is opposite to that anticipated.
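Both of the figures quoted above are easy to reproduce; a quick arithmetic check in Python rather than the book's R:

```python
# Pooled variance of the per-rat averages: 10 control rats (9 df,
# sample variance 0.174) and 14 treated rats (13 df, variance 0.200).
pooled = (9 * 0.174 + 13 * 0.200) / 22
print(round(pooled, 3))  # 0.189

# Least-squares treatment estimate -0.298 with standard error 0.119
# is indeed about 2.5 standard errors from its null value:
print(round(abs(-0.298) / 0.119, 1))  # 2.5
```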
Although the error in this analysis may seem obvious, the glib partial description
in (1.1) is extremely common in the scientific literature. Very often, the model is
stated additively in the form

Y_is = β_s + τ t(i) + ε_is.

Implicitly or explicitly, the errors ε_is are assumed to be independent Gaussian with
constant variance. Failure to account for correlations between different observations
on the same experimental unit has little effect on the point estimate of the treatment
effect, but it has a more substantial effect on the variance estimate.
It is good to bear in mind that there cannot be more degrees of freedom for the
estimation of treatment contrasts than there are experimental units in the design.
This design has 24 experimental units split into two subsets, so there cannot be
more than 22 degrees of freedom for the estimation of inter-unit experimental
variability. Thus, failure to mention covariances in the linear model specification,
and the claim of 98 degrees of freedom are two red-flag indicators of gross statistical
transgressions.
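The consequence of ignoring within-rat correlation can be illustrated by simulation. The following sketch, in Python rather than the book's R and using hypothetical variance components rather than the fitted values, compares the true variability of a group-mean difference with the naive standard error that treats every observation as independent:

```python
import random
import statistics

random.seed(1)

def group_obs(n_rats=12, n_sites=5, sigma_rat=0.4, sigma_res=0.45):
    """Observations for one treatment group: a shared rat effect
    plus independent site-level noise (hypothetical components)."""
    obs = []
    for _ in range(n_rats):
        rat = random.gauss(0.0, sigma_rat)
        obs.extend(rat + random.gauss(0.0, sigma_res) for _ in range(n_sites))
    return obs

diffs, naive_se = [], []
for _ in range(500):
    a, b = group_obs(), group_obs()
    diffs.append(statistics.mean(a) - statistics.mean(b))
    # Naive analysis: treat all 120 observations as independent.
    s2 = (statistics.variance(a) + statistics.variance(b)) / 2
    naive_se.append((s2 * (1 / len(a) + 1 / len(b))) ** 0.5)

# The naive standard error badly understates the true variability
# of the treatment-mean difference.
print(statistics.stdev(diffs) > 1.3 * statistics.mean(naive_se))  # True
```

The point estimate is essentially unaffected, but the naive standard error is too small by a substantial factor, exactly as the paragraph above describes.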
One way to adjust for rat-to-rat variability is to include an additive rat effect:
The ANOVA function reports a very substantial F -ratio of 10.45 for treatment, with
a p-value of 0.2%. However, the treatment effect estimate is only −0.600 ± 0.33,
and the p-value for the hypothesis of no effect is a more modest 7%. We will
not attempt here to explain this apparent contradiction because the displayed code
points to a serious lack of understanding of linear algebra, geometry, orthogonal
projections, and their connection with statistical models. Neither the code
nor the computation is appropriate, and the fitted model is not suited for its intended
purpose.
The default Gaussian model for the log-transformed measurements Y_is incorporates
site and treatment effects additively as follows:

Y_is = α_s + β t(i) + (η_i) + (ε_is),    (1.2)

in which all effects contributing only to variances and covariances are shown in
parentheses. Independence of components is not to be taken for granted, so it is
necessary to state explicitly that the rat effects η_1, . . . , η_24 are independent and
identically distributed with zero mean, and are independent of the 120 standard
Gaussian residual effects ε_is, which are also mutually independent.
All told, there are five site parameters, one treatment parameter, and two variance
components whose estimates are σ̂₀² = 0.211 and σ̂₁² = 0.148. Observations on
distinct rats are independent, but the covariance between observations at different
sites on the same rat is σ₁², and the correlation is σ₁²/(σ₀² + σ₁²), which is estimated
as 0.41.
In standard software, the site and treatment parameters are estimated by weighted
least squares using the inverse of the fitted covariance matrix as weights. The
treatment effect estimate, which is automatically adjusted for additive site effects, is
−0.294 with standard error 0.184. If the design were complete with no missing
values, the null distribution of the ratio would be t₂₂, so the observed effect
corresponds to a two-sided p-value of about 12%. The likelihood-ratio statistic of
2.51 on one degree of freedom gives an essentially identical conclusion. To be clear,
this is the version recommended in Sects. 18.3–18.4: see Exercise 1.14. The less
recommended version using ordinary maximum likelihood is typically somewhat
larger—2.62 in this instance.
The code shown in Exercise 1.4 reports fitted site effects
in head-to-tail order with the anterior site (nearest to the head) as reference. The
standard errors of pairwise site contrasts are in the range 0.14–0.15, so it appears that
skin from the caudal site is appreciably weaker than that from other sites. The REML
log likelihood ratio statistic (Chap. 18) for testing equality of site effects is 13.92,
which is beyond the 99th percentile of the limiting null distribution, χ₄².
Although they appear to be non-zero, the site effects are not sufficiently large to
change appreciably the conclusions about treatment reached by the more elementary
analysis based on rat averages.
It is mathematically and biologically possible that the treatment could have
an effect on either the mean or on the variance or on both, but the standard
default formulation assumes that treatment affects only the mean of the distribution.
Equality of the two variances is an entirely reasonable assumption in practice, but it
is also an assumption that can easily be checked by including two different variance
components, one for treated rats and one for the controls. If the variance for treated
rats were appreciably different from the variance for controls, it would not be
possible to encode the treatment effect in a single number. For these data, there
is absolutely no evidence of an effect of treatment on variances: see Exercise 1.14.
1.6.1 Exclusions
Six rats that were deemed non-diabetic on the basis of post-baseline glucose
measurements were excluded from the main analysis. For the main goal of this
study, this exclusion was judged to be scientifically reasonable on the basis of an
argument that implies that the probability of exclusion is unrelated to treatment.
8 1 Rat Surgery
However, the excluded rats consisted of five controls and only one treated rat. How
extreme is that allocation relative to expectation? Does it suggest that treated rats
are less likely to be excluded than the controls? If so, treatment may have an effect
of an entirely different nature.
Given that six rats were excluded, the number Y of excluded controls is a random
variable whose distribution is central hypergeometric:

P(Y = y) = C(6, y) C(24, 15 − y) / C(30, 15);
the numerical values are 0.8, 7.6, 24.1, 34.9, 24.1, 7.6, 0.8 in percentages for y =
0, . . . , 6. The probability of an allocation at least as extreme as that observed is
8.4% in each tail. Exclusions and other departures from protocol must always be
described and included as a part of the discussion. The imbalance in this study is
greater than we might have wished for, but it is not sufficiently extreme to imply a
systematic bias.
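These percentages can be checked directly in R with `dhyper()`; a quick sketch:

```r
# Number of excluded controls among the 15 controls chosen from 30 rats,
# of which 6 were excluded: central hypergeometric.
y <- 0:6
p <- dhyper(y, m = 6, n = 24, k = 15)  # C(6,y) C(24,15-y) / C(30,15)
round(100 * p, 1)                      # 0.8 7.6 24.1 34.9 24.1 7.6 0.8
sum(p[y >= 5])                         # upper tail: about 0.084
```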
strong correlation with weight or weight loss might lead to a different understanding
of the biology.
The fourth explanation implies that each component reported as missing is
associated with a large number, at least Y > 82.5. Given the pattern of missing
components observed in Table 1.1, this explanation is implausible. Site C with
the highest mean value has zero missing components; site E with the lowest mean
has the highest fraction. If the description were partly true, perhaps in the reverse
direction, the implications for analysis would be very substantial.
The third explanation is unlikely but not entirely implausible. In principle, it
would have been better to record the observation as a pair (yu , su ) in censored-
data format. In other words, yu is the value at which the breakage occurred for
specimen u, su = 1 if the breakage occurred at the scar, and su = 0 otherwise. For
example, the value (20.3, 0) implies a skin-strength measurement strictly greater
than 20.3, which is informative. If censoring is admitted, the state space must be
extended to the Cartesian product R×{0, 1} rather than R or R∪{-}: see Sects. 11.1
and 11.4.9.
1.6.3 Back-Transformation
The analysis for this experiment used additive effects on the log scale. How should
the conclusions be reported? The great majority of physical measurements are
made on a scale that is strictly positive. In such cases, it is always better to report
effects as multiplicative factors or as multiplicative percentages rather than additive
increments. Examples 2, 4 and 5 are typical.
The moment generating function M(t) = e^{μt + σ²t²/2} of a Gaussian variable
Z ∼ N(μ, σ²) implies that the log-Gaussian variable Y = exp(Z) has median e^μ and moments

E(Y) = M(1) = e^{μ+σ²/2};
var(Y) = M(2) − M²(1) = M²(1)(e^{σ²} − 1);
cv²(Y) = var(Y)/E(Y)² = e^{σ²} − 1,
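These moment formulas are easy to verify numerically; a sketch with illustrative parameter values:

```r
# Log-Gaussian moments: if Z ~ N(mu, sigma^2) and Y = exp(Z), then
# E(Y) = exp(mu + sigma^2/2) and cv^2(Y) = exp(sigma^2) - 1.
set.seed(2)
mu <- 2; sigma <- 0.5                       # illustrative values, not from the data
y <- exp(rnorm(1e6, mu, sigma))
c(simulated = mean(y), exact = exp(mu + sigma^2 / 2))
c(simulated = var(y) / mean(y)^2, exact = exp(sigma^2) - 1)
```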
1.8 Exercises
returns the pair of averages 3.360, 3.085, which is not the pair reported in the text.
The alternative log-scale computation
returns the numbers 3.372, 3.113 for means, and 0.174, 0.200 for variances. Under
what circumstances do the two mean calculations return the same pair of averages?
Explain the difference between the factors trt and treat.
1.2 In the balanced case with no missing cells, the standard analysis first reduces
the data to 24 rat averages Ȳi. , the treatment and control averages ȲT , ȲC , and the
overall average Ȳ.. . The sum of squares for treatment effects is

SST = 70 Ȳ_T² + 50 Ȳ_C² − 120 Ȳ..² = (5 × 10 × 14/24)(Ȳ_T − Ȳ_C)².
The total sum of squares for rats splits into two orthogonal parts

5 Σ_i (Ȳ_i. − Ȳ..)² = SST + SSR,

and the ratio

F = (SST/1) / (SSR/22)

is distributed as F₁,₂₂. Simulate a complete design with additive effects, and check
that the two terms shown above agree with parts of the decomposition reported by
anova(lm(y~site+treat+rat)).
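A sketch of such a simulation, with ten control and fourteen treated rats as in the incomplete-data analysis, and arbitrary assumed effect sizes:

```r
# Simulate a complete design: 24 rats (10 control, 14 treated), 5 sites each,
# with additive rat, site and treatment effects (all values illustrative).
set.seed(3)
rat   <- gl(24, 5)
site  <- gl(5, 1, 120)
treat <- factor(rep(c("C", "T"), c(10, 14)))[as.integer(rat)]
y <- rnorm(24, sd = 0.4)[as.integer(rat)] +        # rat effects
     (0.2 * (1:5))[as.integer(site)] +             # site effects
     0.3 * (treat == "T") + rnorm(120, sd = 0.5)   # treatment effect + noise
ybar.rat <- tapply(y, rat, mean)
SST <- (5 * 10 * 14 / 24) * (mean(y[treat == "T"]) - mean(y[treat == "C"]))^2
SSR <- 5 * sum((ybar.rat - mean(y))^2) - SST
a <- anova(lm(y ~ site + treat + rat))
c(SST, a["treat", "Sum Sq"])   # the two treatment sums of squares agree
c(SSR, a["rat", "Sum Sq"])     # the rat term carries the remaining 22 df
```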
1.3 The F -ratio reported by anova(...) for treatment effects is not the ratio
shown above. At least one is misleading for this design. Which one? Explain your
reasoning.
return the same parameter estimates and the same standard errors in a slightly
different format.
1.5 The sub-model with zero rat variance can be fitted by omitting the relevant
term from the model formula in either syntax. The conventional log likelihood ratio
statistic is twice the increase in log likelihood for the larger model relative to the sub-
model. Check that these functions report different numbers for the log likelihood,
but they return the same log likelihood ratio. Report the value. (Recall that the log
likelihood is defined only up to an additive constant.)
1.6 REML, or residual maximum likelihood, is the standard method for the
estimation of variance components: see Chap. 18 for details. Both regress()
and lmer() allow other options, but both use REML as the default. However,
lmer() constrains the coefficients to be positive, whereas regress() allows
negative coefficients unless otherwise requested. If σ̂₁² > 0, both functions should
report the same values for all coefficients; otherwise, if the unconstrained maximum
occurs at a negative value, differences are to be expected, both in the fitted variance
components and in the regression coefficients.
For regular problems in which the null model is not a boundary subset, the
conventional log likelihood ratio statistic is distributed
asymptotically as χ₁². Assuming that the unconstrained version is regular with fitted
coefficients approximately unbiased, what is the asymptotic distribution of the log
likelihood ratio statistic for the constrained problem? Using this null distribution,
report the tail p-value for the hypothesis of zero rat variance.
1.7 In the balanced case with no missing cells, show that the REML likelihood-ratio
statistic for treatment effects is
LLR = (n − 1) log(1 + F/(n − 2)),
1.9 Use polynomials up to degree four to re-parameterize the site effects, and
repeat the fitting procedure for (1.2) using the unnormalized orthogonal polynomial
basis. Check that the treatment effect estimate and its standard error are unaffected
by site re-parameterization. What effect does the change of basis have on the log
likelihood?
1.10 How can we be assured that the log transformation is really needed or
substantially beneficial? (Chap. 19).
1.11 Extend the model (1.2) so that it contains one variance component for treated
rats and another for untreated rats. Show your code for fitting the extended model,
report the two fitted variance components, and the REML likelihood ratio statistic
for comparison with the simpler model.
1.12 Parameter estimates reported in Sect. 1.5 were computed using the code in
Exercise 1.4. Following recommendations in Sect. 18.5, the likelihood ratio statistic
for treatment effects was computed using the code
K <- model.matrix(~site)
fit0 <- regress(log(y)~site, ~rat)
fit1 <- regress(log(y)~treat+site, ~rat, kernel=K)
llr <- 2*(fit1$llik - fit0$llik)
Modify this code to obtain the likelihood-ratio statistic for site effects.
1.14 It is possible that treatment could have an effect on variances in addition to its
effect on the mean. Investigate this possibility by replacing the identity matrix with
two diagonal matrices D0 and D1 such that D0 + D1 = I_n, and using some version
of the code
D1 <- diag(tind)      # tind: 0/1 treated-rat indicator, one entry per observation (name assumed)
D0 <- diag(1 - tind)  # complementary indicator, so that D0 + D1 = I_n
fit0 <- regress(log(y)~treat+site, ~rat)
fit1 <- regress(log(y)~treat+site, ~rat+D1)
fit2 <- regress(log(y)~treat+site, ~rat+D0)
Report the two estimated variances. Use the log likelihood ratio statistic in support
of your conclusion.
1.15 Let Y be an n × m array of real-valued random variables with zero mean and
covariance matrix
for some non-negative coefficients σ₀², . . . , σ₃². Show that the covariance matrix is
invariant with respect to the product group consisting of n! permutations applied to
rows and m! permutations applied to columns. In other words, show that the n × m
matrix whose (i, s)-component is Y_{σ(i),τ(s)} has the same covariance matrix as Y.
1.16 For an n × m array of real numbers, show that the four quadratic forms, mnȲ..²,

Row SS: m Σ_i (Ȳ_i. − Ȳ..)²;
Col SS: n Σ_r (Ȳ._r − Ȳ..)²;
Resid SS: Σ_{ir} (Y_ir − Ȳ_i. − Ȳ._r + Ȳ..)²,

are invariant with respect to row and column permutations. Here, Y_i. is the ith row
total, and Ȳ_i. is the row average.
1.17 Each of these quadratic forms is non-negative definite. In each case, the
expected value is a non-negative linear combination of the four variance components,
in which the coefficient of σ₀² is the rank of the quadratic form. Find the
expected value of each quadratic form as a linear combination of the four variance
components.
1.18 Let 1_n be the vector in Rⁿ whose components are all one. Show that J_n =
1_n 1_n′/n is a projection matrix, i.e., that J_n² = J_n, and that it has rank one: tr(J_n) = 1.
Show also that I_n − J_n is the complementary projection of rank n − 1.
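These properties are easy to confirm numerically; a sketch in R with n = 5:

```r
n <- 5
Jn <- matrix(1, n, n) / n                    # J_n = 1_n 1_n' / n
stopifnot(isTRUE(all.equal(Jn %*% Jn, Jn)))  # idempotent: J_n^2 = J_n
sum(diag(Jn))                                # tr(J_n) = 1, so rank one
Qn <- diag(n) - Jn                           # complementary projection I_n - J_n
stopifnot(isTRUE(all.equal(Qn %*% Qn, Qn)))
sum(diag(Qn))                                # rank n - 1 = 4
```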
Each of the quadratic forms in Exercise 1.16 can be expressed in the form
Y′M_r Y, where each M_r is a projection matrix of order mn × mn. Show that each
matrix is a Kronecker product
1.19 The set of linear functionals R^{nm} → R is called the dual vector space; it has
dimension mn. Show that the column and row totals Y ↦ Y._r and Y ↦ Y_i. are
linear functionals, and that they are linearly independent. Show that the subspace
spanned by {Y._1 , . . . , Y._m } is closed with respect to row and column permutations.
What is its dimension? Show that the subspace spanned by Ȳ.. , and the subspace
spanned by {Ȳ.1 − Ȳ.. , . . . , Ȳ.m − Ȳ.. } are both closed with respect to row and column
permutations. What are their dimensions?
team gets to use each saw exactly once, and each saw is used exactly once for each
species/bark combination.
Each observational unit is an ordered pair consisting of one log and one saw, so the
units available at the outset may be arranged in a 36 × 6 array of (log, saw)-pairs.
However, each measurement is destructive of the log, so it is necessary to choose a
subset or subsample of 36 observational units, one from each row. The Latin square
design also calls for six units from each column or saw.
Despite the restrictions, the observational units are log-saw pairs arranged in a
36 × 6 array. Recall that a covariate is an intrinsic property of the observational
units, as opposed to a treatment which is, or may be, assigned to the units. By
definition, each marginal component log and saw is a covariate. In addition, brand
is a covariate or classification factor, which is a property of the saws, and species is
also a covariate, which is a property of the logs; the design consists of two saws of
each brand, and twelve logs of each species.
By contrast, team is a treatment that is assigned by the investigator to each of the
36 selected units only. In the description as given, debarking is also a treatment that
is assigned to the logs. However, if the logs were initially segregated by bark status,
it could plausibly be argued that no random assignment has occurred, in which case
bark is a covariate or classification factor.
Regardless of its status, bark is a Boolean function [36] → {0, 1} on logs such
that six logs of each species are debarked, and six are left intact. There are 924³
functions of this type. If bark is a treatment factor assigned by randomization, it is a
random variable selected according to some specified distribution from the indicated
set of 924³ functions. In most instances, the randomization distribution is uniform
on functions having the desired balance.
Algebraically, team is a function from the 36 selected units into the set of
teams; statistically, it is a random function chosen uniformly from a subset of such
functions.
2.3 Goals of Statistical Analysis 17
The chief purpose of the study is to compare the relative efficiencies of the three
brands of saw, i.e., to compare one brand with another. There are 12 observations
for each brand, and the sample averages are 8.78, 8.55 and 7.42 in minutes, or 2.10,
2.11 and 1.95 in log minutes, so brand 3 appears to be the most efficient. The main
statistical challenge is to come up with a reasonable assessment of the standard error
for brand contrasts. Is it better to do the analysis additively on the time scale or on
the logarithmic scale? If we do the analysis on the log scale, how do we report
effects on the time scale? Regardless of which scale is used, how do we calculate a
standard error for brand effects?
In addition to brand effects, we can also investigate the effect of de-barking. The
sample averages with and without bark are 8.80 and 7.70 minutes, or 2.13 and 1.97
on the log scale, so de-barking appears to reduce the cutting time by about 12–15%.
How do we compute a standard error? Is the reduction approximately the same for
each saw brand?
In addition to brand and de-barking effects, we can also investigate differences
between the three species. There are 12 observations for each species, and the
average cutting times for spruce, pine and larch are 8.03, 6.42 and 10.30 minutes
respectively, or 2.04, 1.82 and 2.29 on the log scale. Larch, one of the few deciduous
conifers, is evidently substantially harder or tougher than the other two. Regardless
of whether we use the log scale or the time scale for averages, how do we calculate
an honest standard error for species contrasts? Do we compute the standard error for
each species contrast in the same way that we compute the standard error for brand
contrasts?
Detailed answers to all of these questions are given in subsequent sections. At
this stage, we provide brief answers to some of the questions without offering a
detailed rationale.
First, the response is a time in minutes as measured by a stopwatch; ordinarily,
the appropriate scale for analysis of temporal measurements by linear methods is
the log scale. For some, this is obvious and needs no support; others may demand a
formal justification (Sect. 19.2). Research workers from an engineering background
are accustomed to using logs to the base 10 without comment, but natural logs are
used throughout these notes. On the log scale, the effects of bark, species, brand and
team are additive, which implies that they are multiplicative on the time scale. An
analysis on the log scale does not imply that the conclusions must be reported on
the same scale. Thus, in reporting point estimates of effects, we say that de-barking
reduces the cutting time by 12–15%, not that de-barking reduces the cutting time by
1.1 minutes. For the particular task in the experiment, both statements are equally
true, but, as an isolated statement, one is more sensible than the other. A more careful
statement might emphasize that the de-barking reduction applies to the mean of the
distribution. Likewise for species contrasts and brand contrasts.
Second, the estimated spruce versus larch contrast is a difference of average
cutting times for two disjoint subsets of 12 observations each, the variance is
σ²(1/12 + 1/12), and the standard error has the form

s√(1/12 + 1/12)

for some suitable estimator s² of σ². The estimated brand 3 versus brand 1 contrast
is also a difference of averages of two disjoint subsets of 12 observations each, but
the variance formula is entirely different and the estimated standard error is about
30% larger than that of the spruce/larch contrast. Why so? The reasons for this
difference are subtle, but they are also fundamental and easily overlooked.
The difference is a consequence of the experimental design as described in the
third paragraph of this section, rather than a consequence of any parametric or
nonparametric model. The crux of the matter is that 36 logs are used in the design,
but only six saws. It is one thing to make a statement about the relative efficiencies
of two specific saws, C versus A; it is a different matter to make a statement about the
relative efficiencies of two brands, brand 3 versus brand 1. For a statement of the
latter type, or a statement about spruce versus larch, the observed specimens must
be typical for the brand or species. But the design includes 12 specimens of each
species, and only two specimens of each brand.
The use of the two-sample variance formula σ 2 (1/12 + 1/12) for the spruce
versus larch contrast does not imply that the set of cutting times for spruce and the
set of cutting times for larch are independent. They are not independent, and they are
not assumed to be so, even conditionally on the design. Nonetheless, the two-sample
formula makes good use of orthogonality, additivity and balance associated with the
2.4 Formal Models 19
Apart from the indicator for distinct logs, which is in 1–1 correspondence with
sample units, the factors available pre-baseline in this design are as follows:
E(Y ) ∈ X = species+bark+brand+team .
By default, the variances are constant and the covariances are zero. By assumption,
two observations on the same saw are independent, and they are identically
distributed if the two logs are of the same species and bark status. This is not the
standard Latin-square model because it does not contain either the full row factor or
the full letter factor.
The preceding model is contrasted with one in which saw.id occurs as a block
factor in the variance:

cov(Y_i, Y_j) = σ₀² δ_ij + σ₁² 1[s(i) = s(j)],

where s(i) denotes the saw. Once again, duplicates of the same brand have the same
one-dimensional marginal distribution, all observations have the same variance σ₀² +
σ₁², observations on different saws are independent, but observations on the same
saw are positively correlated.
The least-squares estimates for both models can be computed from the code
fit0 <- regress(log(time)~species+bark+brand+team)
fit1 <- regress(log(time)~species+bark+brand+team, ~saw_id)
Ordinarily, the regression parameter estimates for these two models should be
similar but not identical. Because of the balanced design, they are identical, but the
standard errors are different, some a little smaller, others appreciably larger. Despite
the fact that the mean square for brand replicates is not significantly larger than
the mean square for residuals, the argument for a zero between-replicate variance
cannot be regarded as compelling. Accordingly, the second version is preferred. On
the other hand, additivity for species and bark effects is plausible on the log scale.
Both models assume additivity for species and bark effects, which can be tested in
the usual way.
If team effects were not a primary focus, they could reasonably be regarded as
independent and identically distributed, in which case, the fitted model is obtained
by using team as a block factor rather than a treatment factor
regress(log(time)~species+bark+brand, ~saw_id+team)
Because of orthogonality, the fitted values and standard errors for species and bark
contrasts are exactly the same, whether team effects are fixed constants contributing
to the mean or independent and identically distributed random variables contributing
to the covariances.
All of the models described above assume that the effects of species and debarking
are additive on the log scale. How do we compute a likelihood ratio statistic for
testing additivity in a situation where the model contains more than one variance
component? For various reasons, this is a technically complicated question and
there is at least one technically incorrect answer. But there is one answer that is
both mathematically natural and technically correct, which is the one given by
Welham and Thompson (1997): see Chap. 18 for a detailed analysis. The answer
that is recommended in the lmer() literature, which is to abandon REML and use
ordinary maximum likelihood, may be technically defensible, but it is not the most
natural for this setting.
The Welham-Thompson likelihood-ratio statistic on two degrees of freedom for
testing the null hypothesis of additivity can be computed as follows:
K <- model.matrix(~species+bark+team+brand)
fit0 <- regress(log(time)~species+bark+team+brand, ~saw_id, kernel=K)
fit1 <- regress(log(time)~species*bark+team+brand, ~saw_id, kernel=K)
2*(fit1$llik - fit0$llik) # 1.591
The kernel is a subspace of the observation space, which determines the likelihood
criterion that is used for estimation purposes. For a valid likelihood-ratio statistic,
it is essential that the kernel subspaces be the same for both fits. The kernel shown
above is the REML default for the first fit, but it is not the default for the second.
If we choose to follow the advice in the lmer() literature, we must adjust the
argument to kernel=0 in both regress() expressions, giving a likelihood-
ratio statistic of 2.17 in place of 1.59. Although the difference is numerically not
negligible, the asymptotic null distribution is χ₂², for which the 95th percentile is 6.0,
so neither statistic indicates a departure from additivity.
2.6 Summary of Conclusions 21
If team is removed from the mean model but included as a block factor
in the variance, the two likelihood-ratio statistics are 1.59 and 1.81 respec-
tively. In that case, the kernel subspace for the Welham-Thompson statistic is
species+bark+brand, which is the mean-value subspace under the null model.
The principal effects of interest are those related to species and saw brands. Relative
to spruce as the reference level, the estimated additive effects on the log scale are
−0.213 for pine and 0.256 for larch.
Since exp(−0.213) = 0.808 and exp(0.256) = 1.292, cutting times for pine logs
are about 20% less than those for spruce, and cutting times for larch logs exceed those for
spruce by approximately 29%. These differences are large enough to be of practical
or commercial interest. Qualitatively speaking, they are in accord with on-the-job
impressions of anyone who has ever wielded a chain saw on softwood. Standard
errors for pairwise contrasts are 0.038, so the observed differences are roughly 5.5
and 6.6 standard deviations respectively.
The estimated bark effect is 0.152 with standard error 0.031. Thus, the effect of
bark is to increase average cutting times by an estimated 16%. Or, to put it the other
way round, the effect of bark removal is to decrease average cutting times by 14%
regardless of species. In both of these comparisons, the standard error is based on
the residual mean square on a respectable 25 degrees of freedom.
Relative to brand I as the reference, the estimated brand effects, or mean
differences, are
In percentage terms, these amount to 100%, 101% and 86% respectively. The
estimated standard errors for pairwise contrasts are 0.05, or about five percentage
points. But this figure is based on the between-saw mean square on only three
degrees of freedom. In essence, each duplicate pair of saws furnishes one degree
of freedom for between-saw contrasts. The Welham-Thompson log likelihood-ratio
statistic for brand effects comes in at 8.29 on two degrees of freedom, which is
near the 98.4th percentile of the nominal χ₂² distribution. It appears that Brand III
gives faster cuts than the other two, but the paucity of degrees of freedom for saw
replicates makes this comparison less clear-cut than it might otherwise be.
22 2 Chain Saws
2.7 Exercises
2.1 Suppose that intact logs are numbered 1:36, and that species is the species
factor. Write code in R that picks uniformly at random a subset of six logs of each
species for debarking, and stores the information as a Boolean treatment factor.
Explain where the number 924 = 3 × 4 × 7 × 11 comes from.
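One way to carry out such a randomization in base R; a sketch (the variable names are assumed, not taken from the book's code):

```r
# Pick six logs of each species uniformly at random for debarking.
set.seed(7)
species <- gl(3, 12, labels = c("spruce", "pine", "larch"))  # logs 1:36
debark <- logical(36)
for (sp in levels(species)) {
  idx <- which(species == sp)
  debark[sample(idx, 6)] <- TRUE          # debark six logs of this species
}
bark <- factor(ifelse(debark, "debarked", "intact"))
table(species, bark)                      # six of each species in each column
choose(12, 6)                             # 924 subsets of six logs per species
```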
are designed to decompose the total sum of squares additively into components
associated with certain subspaces, which are mutually orthogonal for this design.
Explain how to compute the row sum of squares on five degrees of freedom directly
from the six row averages
Arrange these six numbers in a 3 × 2 table, and explain the computation of the sums
of squares for species, bark, and species:bark from this table of numbers.
to compute the brand sum of squares on two degrees of freedom, the saw replicate
sum of squares on three degrees of freedom, and the F -ratio (ratio of mean
squares). Why is this two-part decomposition structurally different from the three-
part decomposition in the preceding exercise?
2.4 Use the method described by Welham and Thompson (1997) to compute the
REML likelihood-ratio statistic for comparing the two linear models
X0 = species+bark+team , X1 = species+bark+team+brand
in the setting where saw.id occurs as a variance component. You may use R code as
follows:
fit0 <- regress(log(y)~species+bark+team, ~saw_id)
K <- model.matrix(~species+bark+team)
fit1 <- regress(log(y)~species+bark+team+brand, ~saw_id, kernel=K)
c(fit1$llik, fit0$llik, fit1$llik-fit0$llik)
2.5 In the simple linear model setting with μ ∈ X and Σ ∝ I_n, show that the
maximum value of the log likelihood is const − n log ‖QY‖, where Q = I − P is
the orthogonal projection with kernel X, and the constant is independent of X.
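A sketch of the calculation behind this claim: with μ̂ = PY, the profile log likelihood in σ² is maximized at σ̂² = ‖QY‖²/n, giving

```latex
\ell(\hat\mu,\sigma^2) = -\tfrac{n}{2}\log(2\pi\sigma^2) - \frac{\|QY\|^2}{2\sigma^2},
\qquad
\ell(\hat\mu,\hat\sigma^2)
  = -\tfrac{n}{2}\log\bigl(2\pi\|QY\|^2/n\bigr) - \tfrac{n}{2}
  = \text{const} - n\log\|QY\|,
```

with the constant depending only on n.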
2.6 In the simple linear model setting, the F-ratio for testing the hypothesis μ ∈ X₀
versus μ ∈ X₁ is the ratio of mean squares

F = [(‖Q₀Y‖² − ‖Q₁Y‖²)/(p₁ − p₀)] / [‖Q₁Y‖²/(n − p₁)],
2.7 Check that the F -ratio for brand differences is in approximate agreement with
the Welham-Thompson REML statistic computed in Exercise 2.4. Explain why you
need m = 6 − 1 rather than m = 36 − 9 in this comparison.
2.8 Express the random-effects models from the previous section in lmer()
syntax, and check that the parameter estimates agree with regress() output.
‖y‖² = y′y = y′A₁y + · · · + y′A_k y
2.10 For an m × n array, i.e., for y ∈ R^{mn}, show that the Kronecker-product matrices

J_m ⊗ J_n,   (I_m − J_m) ⊗ J_n,   J_m ⊗ (I_n − J_n),   (I_m − J_m) ⊗ (I_n − J_n)

are complementary and mutually orthogonal projections R^{mn} → R^{mn} with ranks
one, m−1, n−1, and (m−1)(n−1) respectively. Show also that this decomposition
is invariant with respect to row and column permutation.
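The four projections and their ranks can be checked numerically with `kronecker()`; a sketch with m = 3, n = 4 (the particular matrices below are inferred from the ranks stated above):

```r
m <- 3; n <- 4
J <- function(k) matrix(1, k, k) / k                 # averaging projection J_k
P1 <- kronecker(J(m), J(n))                          # rank 1
P2 <- kronecker(diag(m) - J(m), J(n))                # rank m - 1
P3 <- kronecker(J(m), diag(n) - J(n))                # rank n - 1
P4 <- kronecker(diag(m) - J(m), diag(n) - J(n))      # rank (m-1)(n-1)
stopifnot(isTRUE(all.equal(P1 + P2 + P3 + P4, diag(m * n))))  # complementary
stopifnot(max(abs(P1 %*% P2)) < 1e-12,               # mutually orthogonal
          isTRUE(all.equal(P4 %*% P4, P4)))          # idempotent
c(sum(diag(P1)), sum(diag(P2)), sum(diag(P3)), sum(diag(P4)))  # ranks: 1 2 3 6
```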
2.11 For a Latin-square design of order m, show that the last term in the preceding
decomposition can be split into two parts associated with letters. Show also that the
five-part decomposition is invariant with respect to permutation of rows, columns
and letters.
Chapter 3
Fruit Flies
This project concerns the experimental design and the data analysis in the
paper titled Commensal bacteria play a role in mating preference of Drosophila
melanogaster, published in 2010 by Sharon et al. The experimental design and the
initial goals are straightforward in principle: do female flies have a preference
for male flies that have been reared on the same diet over genetically
indistinguishable males that have been fed a different diet? If so, what is the cause?
Some of the finer experimental details are crucial for model formulation, analysis
and interpretation, but are easy to miss in a superficial reading. Partial information
on the design and analysis is given below, so you are encouraged to read the paper
for yourself for additional background.
Two breeding populations of genetically identical fruit flies were raised sep-
arately for roughly forty generations on one of two diets, here denoted by C
(corn-molasses-yeast) and S (starch). At certain stages, flies destined for experi-
mentation (test matings) were removed from the breeding populations and raised for
one intermediate generation on the standard C diet before testing was done. Thus
the testing for generation six was done on the virgin offspring, so generation six
is really 6+1: see Fig. 1 in Sharon et al. (2010). Mating tests were done on
selected generations from two to 37. The mating counts in Table 3.1 are implicit
in the authors' Fig. 2. The table is not given explicitly in the published paper or in the
supplementary online materials, but was provided by the authors on request. It
contains five columns of data, generation number, followed by the mating counts
for the four types, CxC, CxS, SxC, SxS. Here SxC denotes matings of male flies
whose parents were raised on diet S with females whose parents were raised on
diet C. Matings of types CxC and SxS are called homogamic; the other types are
heterogamic. The experimental set-up for mating tests consisted of a number of
mating wells, from 20 to 70, with four flies in each well, one male and one female of
each dietary type. Over a one-hour period, each mating was noted, and the totals for
each type were recorded. The number of mating wells was not reported for the first
three generations, but the values reported for subsequent generations were 24, 39,
20, 24, 36, 23, 70, 46, 24, 45, 23, 48, 48, 48, 48. The last three rows of data are taken
from a parallel experiment run under a similar protocol, so the mating probabilities
are expected to be similar, but the generation numbers should be disregarded.
The main summary of the experimental data is given in the authors’ Fig. 2, which
is a barplot of the estimated sexual isolation index (SII) for each of 15 generations.
It is similar in style to Fig. 3.1, which also includes the additional three generations.
The sexual isolation index is defined as the difference phom − phet = 2phom − 1
between the probability that a mating is homogamic and the probability that it is
heterogamic. The reported values are the empirical relative frequencies observed
in each generation. Random mating, or absence of assortative mating, implies
phom = 1/2 or SII = 0. Under the assumption of binomial sampling, the estimate
p̂hom has variance phom(1 − phom)/n, so the observed isolation index has variance
4 phom phet/n, or (1 − SII²)/n, which reduces to 1/n in the absence of assortative
mating.

[Fig. 3.1 appears here: a barplot of the sexual isolation index ± one standard error, by generation (2, 6, 7, 9, 10, 13, 16, 20, 26, 37 and the parallel-experiment rows); vertical axis: sexual isolation index ± se; horizontal axis: generation.]
The height of each bar in Fig. 3.1 is the empirical sexual isolation index for that
generation, and the whisker length is one binomial standard error, i.e.,
±√((1 − SII²)/n). The horizontal line at SII = 0.27 is the overall average estimated from
the pooled data. Superficially, at least, all of these calculations seem quite standard
statistically. However, there are both statistical and non-statistical reasons to have a
closer look at the design and the analysis.
Table 3.2, which was subsequently provided by the authors, contains a more detailed
description of the mating events in each generation. For each mating well, either
zero, one or two matings may occur during the observation period. In single-
mating wells, the mating is one of four types cc, cs, sc or ss, of which two are
homogamic and two heterogamic; in double-mating wells, the female refractory
period constrains the set of mating combinations to four, cc.ss, cs.sc, cc.cs and sc.ss,
of which one is double homogamic, one is double heterogamic, and two are mixed.
The order in which the matings occur is not considered here. The combination
cc.cs implies that the c-male mated with both females; the s-male may or may not
have courted, but did not mate with either female. The other combinations do not
occur because of the refractory constraint: a female that has already mated does
not mate a second time within about 24 hours. Each courtship ritual and mating
takes approximately 10–12 minutes, so the observation period of 40–60 minutes is
sufficient for one male to copulate with both females if they are receptive.
It is important to observe the fundamental difference between the two versions
of the Drosophila data. The objects that are counted in Table 3.1 are matings, which
are of four types; the objects that are counted in Table 3.2 are wells of various types,
one type for each column. In the first case, each observational unit is a mating, and
the response is the mating type; in the second case, each observational unit is a well,
and the response is one of nine types.
From a statistical standpoint, it is natural to regard the activity in one well
as a multinomial event with nine activity classes that are both disjoint and also
exhaustive in the biological sense if not in the mathematical sense. It is also natural
to regard flies as exchangeable modulo their sex and diet type, so that events in
distinct wells may be taken as independent with identical distributions for all wells
in the same generation. Those assumptions justify the reduction of the data to the
counts in Table 3.2 as the sufficient statistic. Provided that the activity in one well is
independent of that in other wells, each row is an independent multinomial random
variable.
Biologically speaking, the multinomial parameters need not be constant from one
generation to the next. Apart from the possibility of a monotone increasing sexual
isolation index, there are more mundane reasons for distributional heterogeneity that
may be related to experimental procedure. One possibility is that the inclination to
mate may depend on temperature and other environmental factors that vary with
the season, and hence the rates are not constant from one generation to the next.
Another very real possibility is that the period set aside for observation is not quite
constant from generation to generation, in which case the fraction of null-mating
wells is expected to be greater for shorter observation periods. Likewise, for
purely bio-mechanical reasons, the fraction of double-mating wells is likely to be
low for shorter observation periods.
In principle, each column in Table 3.1 is derivable as a specific linear com-
bination of the columns in Table 3.2. Each linear combination has three unit
coefficients and six zeros. For example, the CxS column is the sum of columns
cs, cs.sc and cc.cs, while the SxC column is the sum of sc, cs.sc and sc.ss. Both
combinations include cs.sc. This linear projection structure implies that the counts
in one row of Table 3.1 are correlated in a non-multinomial way, which invalidates
the distributional assumptions on which the paper is based. In practice, there are
a few numerical discrepancies between the two tables, which is not uncommon in
laboratory work. Unless otherwise specified, all subsequent analyses in this chapter
use the version in Table 3.2.
Given that the main focus is on the excess of homogamic over heterogamic matings,
how should we analyze the new version of the data for evidence bearing on the issue
of commensally-related assortative mating? Assuming for the moment that there is
sufficient homogeneity across generations, it is natural first to examine the aggregate
counts or column totals, which are as follows:
Among all events in 163 single-mating wells, 102 are homogamic and 61 het-
erogamic, so the homogamic sample fraction is 0.626 and the standard error is
0.038. If mating events occurred non-preferentially according to the Bernoulli-
1/2 model, we should expect about 81.5 ± 6.4 homogamic and the same number
of heterogamic matings, so the observed value is a little more than 3.2 standard
deviations away from the non-preferential null. Equivalently, the SII index is 0.251
with standard error 1/√163 computed under the Bernoulli-1/2 model, and the ratio
is 0.251 × √163 = 3.2. Three or more standard deviations is usually regarded as
moderately strong evidence against the null, so even if we restrict attention to single-
mating wells, the evidence for assortative mating is clearly established.
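The arithmetic in this paragraph is easy to reproduce; a minimal sketch in R:

```r
# Single-mating wells: 102 homogamic matings out of 163 (column totals above)
n <- 163; y <- 102
phat <- y/n                      # homogamic sample fraction: 0.626
se   <- sqrt(phat*(1 - phat)/n)  # binomial standard error: 0.038
z    <- (y - n/2)/sqrt(n/4)      # deviation from the Bernoulli-1/2 null: 3.2
SII  <- 2*phat - 1               # isolation index: 0.251, with se 1/sqrt(n)
round(c(phat = phat, se = se, z = z, SII = SII), 3)
```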
In the double-mating wells, 448 matings out of 708 are homogamic, so the
homogamic fraction is 0.633. To obtain a standard error for the sample fraction,
the four totals (Y1, Y2, Y3, Y4) are regarded as multinomial with index Y. = 354,
parameter vector π, and covariance matrix Y.(diag(π) − ππ′). The number of
homogamic matings is the linear combination 2Y1 + Y3 + Y4, the number of
heterogamic matings is 2Y2 + Y3 + Y4, and the total number of matings is 2Y. = 708.
The variance of the linear combination is a quadratic form in the multinomial
covariances, whose estimate is 235.04, so the standard error of the homogamic
fraction in the sample is √235.04/708 = 0.022. The observed value is six standard
errors away from the null, so once again the evidence strongly supports assortative
mating.
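The multinomial variance calculation can be written as a short function. The individual well counts (Y1, Y2, Y3, Y4) are not reproduced in the text, so the counts below are hypothetical values chosen only to be consistent with the published totals (354 wells, 448 homogamic matings); with the authors' actual split, the quadratic form evaluates to 235.04.

```r
# Variance of a linear combination a'Y for Y ~ multinomial(n, pi),
# with pi estimated by Y/n: var(a'Y) = n*(sum(a^2*p) - sum(a*p)^2)
mult.lincomb.var <- function(Y, a){
  n <- sum(Y); p <- Y/n
  n*(sum(a^2*p) - sum(a*p)^2)
}
# Hypothetical well counts for (cc.ss, cs.sc, cc.cs, sc.ss):
Y <- c(171, 77, 53, 53)                  # 354 wells, 448 homogamic matings
v <- mult.lincomb.var(Y, c(2, 0, 1, 1))  # homogamic matings: 2*Y1 + Y3 + Y4
sqrt(v)/708                              # standard error of the homogamic fraction
```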
For a slightly cleaner version of the preceding argument, the difference between
the number of homogamic and heterogamic matings is 2Y1 − 2Y2 , which does
not involve the mixed-well counts Y3 or Y4 . Arguably, the mixed-event wells are
uninformative for testing. The null hypothesis of no assortative mating implies that
Y1 has the same distribution as Y2 , so it is possible in this setting to construct an
exact binomial test by conditioning on the total Y1 + Y2 . However, the estimate of
the isolation index is not independent of the mixed double-mating well counts.
The probability estimates obtained from single- and double-mating wells,
0.626 ± 0.038 and 0.633 ± 0.022, are in unusually good agreement with one
another. The standard error of the difference is the square root of 0.038² + 0.022²,
which is 0.044, whereas the observed difference is only 0.007. A similar analysis
on the SII scale gives an equivalent answer. If necessary, the estimates may be
pooled or combined in the standard manner with weights inversely proportional to
variances.
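For the two estimates above, the standard inverse-variance pooling goes as follows:

```r
# Inverse-variance pooling of the single- and double-well fractions
est <- c(0.626, 0.633); se <- c(0.038, 0.022)
w <- 1/se^2                      # weights inversely proportional to variance
pooled    <- sum(w*est)/sum(w)   # about 0.631
pooled.se <- 1/sqrt(sum(w))      # about 0.019
diff.se   <- sqrt(sum(se^2))     # se of the difference: 0.044
```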
The Lexis dispersion statistic, which is the ratio of Pearson’s chi-squared statistic
to its degrees of freedom, is a natural gauge of variation in a contingency table
in which the reference value of unity is the expected value under homogeneous
multinomial sampling. For the four single-mating columns in Table 3.2 the value
is 48.4/51, and for the four double-mating columns the value is 50.95/51. As
we had hoped, both are satisfactorily close to unity, so there is no evidence of
inter-generational inhomogeneity in mating behaviour for either the single-mating
wells or the double-mating wells. Inter-generational homogeneity of Drosophila
behaviour is reassuring.
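The dispersion index is simple to compute for any matrix of counts; a minimal sketch, assuming independent multinomial rows with a common probability vector:

```r
# Lexis dispersion index: Pearson's X^2 for the row-by-column table,
# divided by its degrees of freedom (reference value 1 under homogeneity)
lexis <- function(tab){
  mu <- outer(rowSums(tab), colSums(tab))/sum(tab)  # fitted counts
  sum((tab - mu)^2/mu)/((nrow(tab) - 1)*(ncol(tab) - 1))
}
lexis(matrix(c(10, 20, 20, 10), 2, 2))  # 2x2 example: X2/df = 20/3
```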
For the 18 × 2 matrix whose columns are the tallies for single- and double-
mating wells in each generation, the Lexis dispersion statistic is 48.2/17 = 2.83.
We conclude that there is substantial heterogeneity in the fraction of single- versus
double-mating wells in successive generations. This type of inter-generational
inhomogeneity is not a major concern. It does not invalidate the analyses proposed
in the preceding section or those in subsequent sections. As mentioned earlier, it
may simply reflect variation in the observation period or in environmental conditions
from one generation to the next.
The dispersion index for Table 3.1 is 19.14/51 = 0.37, which shows clearly
that the counts in that table are substantially under-dispersed. Over-dispersion is
common in experimental and observational work, while under-dispersion is rare,
so statisticians are naturally on the lookout for phenomena that give rise to under-
dispersion. The main explanation for under-dispersion in this instance appears to lie
in the experimental design with four flies per mating well and its interaction with
the female refractory effect.
This section offers an analysis of whether the under-dispersion that is observed
in Table 3.1 should be expected on the basis of its derivation from Table 3.2.
The analysis is done under the following ‘multinomial assumption’, which seems
mathematically natural for this setting.
1. Given the vector m1 of single-mating well counts in each generation, the 18 × 4
table T1 consisting of the first four columns of Table 3.2 has independent
multinomial rows, and the probability vector π1 is constant across generations.
2. Given the vector m2 of double-mating well counts in each generation, the 18 × 4
table T2 consisting of the columns 5–8 of Table 3.2 has independent multinomial
rows, and the probability vector π2 is constant across generations.
3. The tables T1 and T2 are conditionally independent given m1 , m2 .
Although homogeneity across generations is an important component, we refer to
these collectively as ‘the multinomial assumption’.
Let L be the matrix that converts double-mating well counts into mating counts
of four types:
             cc cs sc ss
     cc.ss    1  0  0  1
L =  cs.sc    0  1  1  0
     cc.cs    1  1  0  0
     sc.ss    0  0  1  1
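In R, the matrix and the reduction from well counts to mating counts may be sketched as follows; T2 here is a hypothetical one-generation row of double-mating well counts, used only for illustration:

```r
# Map from double-mating well types (rows) to mating types (columns)
L <- rbind(cc.ss = c(1, 0, 0, 1),
           cs.sc = c(0, 1, 1, 0),
           cc.cs = c(1, 1, 0, 0),
           sc.ss = c(0, 0, 1, 1))
colnames(L) <- c("cc", "cs", "sc", "ss")
T2 <- matrix(c(1, 2, 3, 4), 1)  # hypothetical double-well counts
T2 %*% L                        # mating counts contributed by double wells
# the cs column is cs.sc + cc.cs, as described in the text
```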
The Pearson statistic for the combined table is

    X² = Σ_{i,j} (Y_ij − μ̂_ij)² / μ̂_ij,

and, under the multinomial assumption, its expected value is a sum of contributions
from the 18 generations,

    E(X²) = Σ_{i=1}^{18} tr( Σ̂_i diag(μ̂_i⁻¹) ).
Using the natural moment estimates for the vectors π1 , π2 and π, the estimated
mean of X² is 39.39. For a more accurate approximation, we should multiply
by 17/18 to account for parameter estimation. This gives E(X²) ≈ 37.2, so the
appropriate null reference level for the Lexis dispersion index is 39.39 × 17/(18 ×
51) = 0.73. The conclusion from this analysis is that under-dispersion is not only
possible but also expected in this particular situation.
A more accurate estimate of the null distribution of Pearson’s statistic can be
obtained by simulation along the lines of Sect. 3.4. First generate two conditionally
independent hypergeometric tables having the same marginal totals as T1 and T2 ,
combine them into a single table as in T1 + T2 L, and then compute the Pearson
statistic. The conclusion from 5000 simulations is that the null mean is 37.2 and
the variance is approximately 76.8. Relative to this distribution, the observed value
X²(T) = 24.1 falls just below the 5% point; the observed value for Table 3.1 is 19.1,
which is below the 1% point. The conclusion is that under-dispersion is expected,
though not quite to the extent observed in T or in Table 3.1.
3.3.5 Independence
Independence of activity in distinct wells seems an entirely harmless assumption.
After all, distinct wells contain distinct flies whose activities cannot possibly
be coordinated. But science rightly demands that assumptions be checked where
possible, and the design of this experiment with the data in Table 3.2 provide a rare
opportunity to check the independence assumption—at least in part.
Flies in single-mating wells are necessarily distinct from flies in double-mating
wells, so in the absence of inter-well communication, we must expect all activity
in single-mating wells to be independent of all activity in double-mating wells.
This is part of the third component of the multinomial assumption in the preceding
section. In particular, we must expect the homogamic fraction in single-mating wells
to be statistically independent of the homogamic fraction in double-mating wells.
Any failure of independence in this form must have far-reaching consequences for
Drosophila experimentation. To paraphrase Lord Denning’s notorious judgement
from 1980, . . . the possibility of coordinated mating activities in distinct wells is
such an appalling vista that every sensible Drosophila experimentalist would say ‘It
cannot be right. . . ’.
Each generation furnishes a pair of homogamic fractions, one for single-mating
wells and one for double-mating wells. Figure 3.2 is a scatterplot of the 18 pairs, one
pair for each generation. Contrary to expectation, it shows not only that homogamic
fractions in the same generation are correlated, but also that the correlation is
negative (r = −0.64). The null distribution of sample correlations is symmetric
about zero with a standard deviation of about 0.23, so the observed correlation is
far removed from the bulk of the null distribution. Hypergeometric simulation by
random matching points to a left tail p-value of approximately one in 850, which
is equivalent to three standard deviations from expectation on the standard normal
scale.
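The quoted null standard deviation of the sample correlation is close to the value 1/√17 ≈ 0.24 for 18 independent Gaussian pairs, which is easy to confirm by simulation; a quick sketch (the hypergeometric simulation in the text gives the slightly smaller value 0.23):

```r
# Null distribution of the sample correlation for 18 independent pairs
set.seed(42)
r <- replicate(2e4, cor(rnorm(18), rnorm(18)))
c(mean = mean(r), sd = sd(r))  # mean near 0, sd near 1/sqrt(17) = 0.243
```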
To the extent that the entire edifice of experimental work on Drosophila rests
on the assumption of independent behaviours for unrelated observational units,
lack of independence is an extraordinary claim, and negative dependence even
more so. But, as Lord Denning rightly points out, extraordinary claims require
extraordinarily strong evidence. By prevailing standards in the literature on animal
behaviour, a three-σ deviation is regarded as good evidence supporting a non-zero
effect, so the null hypothesis would be firmly rejected. In this instance, however,
a three-σ deviation may be very surprising or strongly suggestive, but it cannot
suffice to overturn a fundamental tenet or a long-assumed law of behaviour without
independent confirmation from other laboratories.
The correlations indicated by Fig. 3.2 are synchronous in time; there is no sugges-
tion that the homogamic fractions in one generation are correlated with homogamic
fractions in previous or subsequent generations. Inter-well communication is one
potential explanation for synchronous correlations. Drosophila courtship rituals
are not silent, so sound leakage may be possible. Pheromonal leakage may be
more likely. However, in order to achieve the observed negative correlation, the
communication must be anti-symmetric or conspiratorial, so it is unlikely that
pheromonal or sound leakage alone could suffice. On balance, therefore, it seems
safe to rule out inter-well communication as a likely explanation.

[Fig. 3.2 appears here: a scatterplot of the homogamic fraction in double-mating wells against the corresponding fraction in single-mating wells, one point per generation, labelled by generation number.]
3.3.6 Acknowledgement
Table 3.2 and much of the analysis in this section are based on an unpublished report
by Daniel Yekutieli, which was provided by the author.
The hypergeometric distribution on tables y with given row totals y_i. and column
totals y_.j has probability mass function

    pr(Y = y) = ( ∏_i y_i.! ∏_j y_.j! ) / ( y_..! ∏_{i,j} y_ij! ).
The row and column totals are fixed positive integers, so the probability mass
function is inversely proportional to ∏_{i,j} y_ij! on the space of non-negative
integer-valued arrays having the given row and column totals.
If Y has the hypergeometric distribution, so also does the transposed array.
If Y is a random matrix whose rows Yi are independent multinomial vectors,
Yi ∼ M(mi , π), which are homogeneous in the sense that they have the same
multinomial probability vector, then the conditional distribution given the column
totals is hypergeometric.
One way to simulate a hypergeometric random table having given row and
column totals is by random matching of the components of two n-component
vectors. Suppose that row has n components of which m_r are equal to r, and col
has n components of which s_j are equal to j, with Σ m_r = Σ s_j = n. Random
matching permutes the components of row uniformly at random, does the same
independently for col, and then tabulates or counts the ordered pairs (r, j ) thus
generated. Distributionally speaking, it is necessary only to permute one of the
vectors as follows:
RHG <- function(rowsum, colsum){
# rowsum and colsum are integer vectors having the same sum
row <- rep(1:length(rowsum), rowsum)  # row label of each unit
col <- rep(1:length(colsum), colsum)[order(runif(sum(colsum)))]  # permuted column labels
table(row, col)  # tally the randomly matched (row, col) pairs
}
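Base R's r2dtable, which implements Patefield's algorithm, provides an alternative generator of fixed-margin tables; either can be used to simulate the null distribution of Pearson's statistic. A sketch with illustrative margins:

```r
# Simulated null distribution of X^2 for tables with fixed margins
set.seed(7)
rs <- c(30, 25, 20); cs <- c(40, 20, 15)
mu <- outer(rs, cs)/sum(rs)  # fitted counts under independence
sim <- sapply(r2dtable(5000, rs, cs),
              function(tab) sum((tab - mu)^2/mu))
mean(sim)  # close to the Haldane-Dawson mean (r-1)(c-1)*n/(n-1)
```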
To simulate the null distribution of Pearson’s statistic or any other statistic such
as the deviance, we compute the statistic for each table thus generated, and report
the histogram. The analysis near the end of Sect. 3.3.4 calls for two independent
hypergeometric tables T1 , T2 , followed by Pearson’s statistic computed on the linear
combination T1 + T2 L. The analysis in Sect. 3.3.5 also calls for the same pair of
independent hypergeometric tables followed by a symmetric correlation statistic
R(·, ·) computed as a function of the pair (T1 , T2 L).
Pearson's statistic is

    X² = Σ_i (Y_i − μ̂_i)² / μ̂_i,
where μ̂i is the fitted mean value. In the binomial case, the sum extends over both
response classes, failure and success, so that the net contribution from a binomial
pair (Y_i0, Y_i1) ∼ B(m_i, π_i), for which μ̂_i0 = m_i π̂_i and μ̂_i1 = m_i(1 − π̂_i), is
(Y_i0 − m_i π̂_i)² / (m_i π̂_i (1 − π̂_i)).
The Poisson form of Pearson's statistic differs from the binomial form only in the
variance function: Σ = diag(μ) for the Poisson covariance, and Σ = diag(m_i π(1 −
π)) for the binomial. But the Poisson form covers both provided that we sum over
both successes and failures.
The sampling distribution of the statistic depends on the distribution of Y and on
the degrees of freedom used up in the estimation of μ. Exact moments are available
in a few special cases, all of them null in a suitable sense. For a single multinomial
Y ∼ M(m, π) with k classes and given probability vector, we have

    E(X²) = k − 1,
    var(X²) = 2(k − 1)(m − 1)/m + ( Σ_j π_j⁻¹ − k² )/m.
The third cumulant is given in McCullagh and Nelder (1989, p. 169). The asymptotic
distribution for large m is χ²_{k−1}.
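These exact moments are easy to check by simulation; a quick sketch for one multinomial sample with illustrative m and π:

```r
# Monte Carlo check of the exact mean and variance of Pearson's X^2
set.seed(3)
m <- 50; p <- c(0.2, 0.3, 0.5); k <- length(p)
x2 <- replicate(4e4, {y <- rmultinom(1, m, p); sum((y - m*p)^2/(m*p))})
c(mean(x2), k - 1)                                    # both near 2
c(var(x2), 2*(k - 1)*(m - 1)/m + (sum(1/p) - k^2)/m)  # both near 3.95
```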
For an r × c contingency table that is distributed according to the hypergeometric
distribution with strictly positive row and column totals, the Haldane-Dawson
formulae give the exact mean and variance. The mean does not depend on the row
or column totals, but only on the overall total n:

    E(X²) = (r − 1)(c − 1) n/(n − 1).

The variance of X² depends on the sum of reciprocals of the row and column totals:
see McCullagh and Nelder (1989, p. 244).
Despite warnings given freely by over-cautious computer software, the nominal
χ²_{(r−1)(c−1)} approximation is quite accurate even for a large sparse table such as
the 12 × 12 birth-death table where the average cell count is only 2.4. Even for
Bortkewitsch's horsekick fatality data for 14 Prussian army corps over 20 years
(Andrews and Herzberg, 1985, 17–18), where the mean is only 0.7 fatalities per
corps per year, the χ²_{247} approximation is reasonably good in the upper tail. The left
tail is not so good. In that instance, the Haldane-Dawson values for the mean and
variance are 248.3 and 419.8, so the variance-to-mean ratio is only 1.69 as opposed
to 2.0 for the χ² approximation. Exercise 3.7 shows that the moment-matching
approximation 0.85 χ²_{294} is quite accurate in both tails.
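The moment-matching constants follow from the two equations ab = mean and 2a²b = variance, using the Haldane-Dawson values quoted above:

```r
# Match a*chisq_b to the Haldane-Dawson mean and variance:
# a*b = 248.3 and 2*a^2*b = 419.8 give a = var/(2*mean), b = mean/a
a <- 419.8/(2*248.3)
b <- 248.3/a
c(a = a, b = b)  # approximately 0.85 and 294
```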
Pearson’s statistic has a role to play in the analysis of counted data, mainly
as a metric for relative dispersion. Over the past 70 years, various authors have
pointed out its inferential limitations, and have sought to modify and strengthen it
in various ways (Yates, 1948; Cochran, 1954; and Armitage, 1955). Its deficiencies
for significance testing are entirely unrelated to the adequacy of any distributional
approximation. The discussion in this section focuses on its use as a dispersion
index; it is not intended as an endorsement of its widespread use in applications as
a test for independence or lack of association.
The neologism refer[s] to an old joke about Calvin Coolidge when he was President.
You will remember that he was an exceedingly laconic individual and was nicknamed
Silent Cal. . . The President and Mrs. Coolidge were being shown [separately] around an
experimental government farm. When [Mrs. Coolidge] came to the chicken yard she noticed
that a rooster was mating very frequently. She asked the attendant how often that happened
and was told, ‘Dozens of times each day.’ Mrs. Coolidge said, ‘Tell that to the President
when he comes by.’ Upon being told, the President asked, ‘Same hen every time?’ The
reply was, ‘Oh, no, Mr. President, a different hen every time.’ President: ‘Tell that to Mrs.
Coolidge.’
As it turns out, the statistical analysis in the original paper is seriously deficient
in a number of ways. In a 2014 correction note, the authors remark
. . . the statistical models we used for analysing male courtship behaviour did not take into
account temporal correlations in courtship events within males. Consequently, the variance
in courtship events was higher than predicted by the model, and the excess dispersion could
potentially result in errors in conclusions. This highlights the general potential for high-
frequency sampling of behaviours to give rise to high temporal correlations of event counts
within a dataset, and the importance of correcting dispersion factors when analysing this
type of data.
In other words, the courtship activity for one male was recorded on multiple
occasions over a short period, and the sequence of records was analyzed as if the
activities on successive occasions were independent events measured on unrelated
flies. To say that high-frequency sampling has the ‘potential’ to give rise to high
temporal correlations is a gross understatement.
Despite the authors’ remarks about the potential for and the effect of temporal
correlations, it is worth remarking that the revised analysis accounts for over-
dispersion, but it makes no attempt to account for serial correlation. If the quoted
remark suggests that an excess dispersion factor is adequate to accommodate serial
correlation, it is certainly misleading. Indeed, the serial order of events was not
reported, so the data provided make it impossible to accommodate such effects in
anything like a principled way. A simple over-dispersion factor is better than none,
but it does not adequately address statistical issues arising from high-frequency
sampling of behaviours.
There is nothing intrinsically wrong with high-frequency sampling provided
that the statistical analysis accommodates the inevitable serial correlation in a
satisfactory way. If the activity of the focal male were recorded at 24 frames per
second, we may mark each frame in which courtship is directed at the novel female
by the label ‘N’ and those in which it is directed at the familiar female by the
label ‘F’. While it may be reasonable to treat a single marked frame as a Bernoulli
random variable, it is obviously unreasonable to treat the sequence of frames as
a Bernoulli sequence with independent components. For the same reason, it is
unreasonable to treat the number of ‘N’ frames as a Poisson or binomial variable.
This statement may be obvious at a sampling rate of 24 frames per second, but
it applies equally at a sampling rate of one per minute or one per hour. Doubling
the frame rate doubles the computational burden, but has a negligible effect on
information pertaining to sexual preferences.
One possibility for analysis is to reduce the frame sequence to the fraction of
time spent in each activity, and to regard these temporal fractions as a compositional
response in the sense of Aitchison (1986).
The data for three of these experiments are available in the files
eyedat <- read.table("CoolEyeColorArchive.dat", header=TRUE)
paintdat <- read.table("CoolPaintArchive.dat", header=TRUE)
decapdat <- read.table("PhenoMaleDecapArchive.dat", header=TRUE)
Additional information is available in the file Coolidge.R. Other data files are
available online.
3.6 Exercises
3.1 Use the normal approximation to the binomial to compute the probability that
the horizontal line in Fig. 3.1 intersects all 18 whiskers at ±1 standard deviations.
Devise a better approximation by simulation that takes account of the fact that the
SII index has been computed from the same data.
3.2 Is the total number of matings in Table 3.1 related to the number of mating
wells? Is the pattern of variation different for the experiments reported in the last
three rows? Explain how you address such questions.
3.3 For the experiment giving rise to the data in Table 3.2, an algebraically natural
assumption is that the allowable double matings occur as a Poisson process at a rate
proportional to the product of the single-mating rates. It is also natural—physically
if not mathematically—to allow separate factors for single and double wells, and a
reduced rate for wells in which one male does double duty. Formulate this statement
as a Poisson log-linear model or four-class multinomial model, and check whether
the data are in compliance with the product assumption. For this exercise, the
multinomial assumption in section 3.3.4 may be used. (The computation for this
question may involve the entire table, but parameter estimates and other conclusions
must be a function of the column totals only. Why so?)
3.5 The file ...birth-death.R contains the data compiled by Phillips and
Feldman (1973) on the month of birth and the month of death of 348 ‘famous
Americans’. Investigate whether the month of death is or is not independent of the
month of birth. The data are given as a 12 × 12 table of event counts. (This is not a
generic contingency table because the row labels and the column labels are not only
the same, but also cyclically ordered. Both aspects of the structure are relevant to
the question posed, and both should be exploited in your analysis.)
3.6 The advice sometimes given for the validity of the χ 2 approximation to the
null distribution of Pearson’s statistic is that the minimum expected value should
exceed a suitable threshold, usually in the range 3–5. However, the mean count for
the birth-death table is 2.42, so the expected count in every cell falls below the
threshold. Compute the null distribution of Pearson’s statistic by hypergeometric
simulation. Plot the density histogram of simulated values, and superimpose on it
the χ²_{121} density function. (This is intended as a computational exercise only. It is
not a suggestion for data analysis aimed to address the question posed by Phillips
and Feldman.)
3.7 Check the calculations reported in the penultimate paragraph of Sect. 3.4.2 for
Bortkewitsch’s horsekick data. Compute the row and column totals, and simulate
the null distribution of X² by random matching. Superimpose the χ²_{247} density on a
histogram of the simulated values. Find two positive numbers a, b such that the first
two moments of aχb2 coincide with the Haldane-Dawson moments. Superimpose
this scaled chi-squared density on your histogram. (The intent of this exercise is
3.8 What was the matter that Lord Denning refused to accept in his 1980 appeals-
court judgement when he referred so melodramatically to the ‘appalling vista that
every sensible person would reject’? Why was this phenomenon so abhorrent to
him?
3.9 Explain where the factor 1 − r comes from in the penultimate paragraph of
Sect. 3.3.5.
Chapter 4
Growth Curves
[Fig. 4.1 appears here: four panels of plant height data for the ‘cis’ and ‘108’ strains, with the annotation h(t) = t²/(13² + t²); panel titles include ‘Plant height versus age’ and ‘fitted mean and predicted value curves’. See the caption below.]
Fig. 4.1 Heights in mm of 70 Arabidopsis plants of two strains, plotted against calendar time in
panel 1, and against age in panel 3 (lower left). Lower right panel shows the fitted mean functions
(dashed lines) together with the best linear predictor (solid lines) of plant height for each strain
the ultimate height of the ‘108’ strain is about 40% greater than the ‘cis’ strain. The
age-specific ratio of sample mean heights ‘108’/‘cis’ for plants aged 4–32 days is
Age in days 4 8 12 16 20 24 28 32
108/cis ratio 1.06 1.39 1.37 1.36 1.42 1.44 1.43 1.42.
The fact that these ratios are remarkably constant from day 8 onwards suggests that
a simple multiplicative factor suffices for strain effects.
The growth curve for plant i is modelled as a random function ηi (t) whose value
at age zero is, in principle at least, exactly zero, and whose temporal trajectory
is continuous. In the analyses that follow, s(i) is the strain of plant i, the mean
trajectory is βs(i) h(t) with h(0) = 0 and h(∞) = 1, so that the plateau levels β0 , β1 ,
or βcis , β108 , depend on the strain, and the ratio of means is constant over time. The
observation process Y(t) = η(t) + ε(t) is also a function of time, but it is not
continuous in t; the additive measurement-error ε(·) is assumed to have mean zero
with constant variance σ₀², and to be independent for all times t > 0 and all plants.
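Brownian motion, introduced next as the model for the smooth deviations, has covariance min(s, t); a quick simulation check on an integer grid:

```r
# Brownian motion has cov(B(s), B(t)) = min(s, t); check by simulating
# standard BM as cumulative sums of independent Gaussian increments
set.seed(9)
B <- replicate(2e4, cumsum(rnorm(20)))  # each column is one path at t = 1..20
c(cov(B[5, ], B[12, ]), var(B[12, ]))   # near min(5, 12) = 5 and 12
```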
Brownian motion (BM) starting from zero at time t = 0 is a continuous random
function with covariance function cov(B(t), B(t′)) = min(t, t′) for t, t′ ≥ 0. We
are thus led initially to consider the additive Gaussian model with moments
has three parameters to be estimated, the two asymptote heights β0, β1 and the
semi-max temporal parameter τ such that μ(τ) = μ(∞)/2. Two options for the estimation of
parameters are available as follows.
Although all ages used in the computation are strictly positive, the model formula is
such that the mean height at age zero is exactly zero. This constraint is enforced by
exclusion of the intercept in the model formula h:strain-1. We find that the log
likelihood is maximized at τ̂ ≈ 12.782. A plot of the profile log likelihood values
against τ can be used to generate an approximate confidence interval if needed: the
95% limits are approximately (11.7, 14.2) days.
A follow-up step is needed in order for the standard errors of the β-coefficients
to be computed correctly from the Fisher information. To compensate for the
estimation of τ , the derivative of the mean vector with respect to τ at τ̂ must be
included as an additional covariate, as described by Box and Tidwell (1962)
# derivative of the fitted mean with respect to tau at tau-hat:
# since h(t) = t^2/(tau^2 + t^2), dh/dtau = -2*tau*h^2/t^2
deriv <- -2*tau * fit$fitted * h / age^2
fit0a <- regress(y~deriv+h:strain-1, ~BM+BMP,
start=fit0$sigma, kernel=0)
The start option takes the initial variance components from the previous fit.
It is a property of maximum likelihood estimators for exponential-family models
that the residual vector y − μ̂ is orthogonal to the tangent space of the mean model
(with respect to the natural inner product determined by Σ̂⁻¹). Consequently, the
coefficient of deriv is exactly zero by construction, and all other coefficients β̂, σ̂²
are unaffected. The ordinary maximum-likelihood estimates of the variance components are
(1.0467, 0.0496, 0.4283), the plateau coefficients are (28.293, 40.042) mm, and the
standard error of the difference is 0.975. In this instance, the unadjusted standard
error is 0.941, so the effect of the adjustment is not great.
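The orthogonality property is easy to verify numerically in an ordinary least-squares analogue (a simplification: simulated data, the assumed inverse-quadratic curve, and independent errors rather than the fitted Brownian covariance). When τ̂ minimizes the residual sum of squares, the residual is orthogonal to the derivative direction, so the coefficient of the added deriv column is zero up to the grid resolution, and the other coefficients are unchanged:

```python
import numpy as np

rng = np.random.default_rng(3)
age = np.tile(np.arange(2.0, 40.0, 2.0), 2)
strain = np.repeat([0, 1], age.size // 2)
y = np.where(strain == 0, 28.0, 40.0) * age**2 / (age**2 + 12.0**2) \
    + rng.normal(0.0, 1.0, age.size)

def design(tau):
    h = age**2 / (age**2 + tau**2)        # assumed inverse-quadratic growth curve
    return h, np.column_stack([h * (strain == 0), h * (strain == 1)])

# Profile out tau by a fine grid search on the residual sum of squares.
taus = np.linspace(6.0, 20.0, 2801)
rss = [np.linalg.lstsq(design(t)[1], y, rcond=None)[1][0] for t in taus]
tau_hat = taus[int(np.argmin(rss))]

h, X = design(tau_hat)
bhat = np.linalg.lstsq(X, y, rcond=None)[0]
# Box-Tidwell covariate: derivative of the fitted mean with respect to tau.
deriv = -2.0 * tau_hat * (X @ bhat) * h / age**2
bhat_a = np.linalg.lstsq(np.column_stack([deriv, X]), y, rcond=None)[0]
```

The first component of bhat_a is essentially zero, while the standard errors from the augmented design correctly charge the fit for the estimation of τ.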
In more complicated settings where partially linear models are employed, the
mean-value space for fixed τ is a linear subspace Xτ ⊂ Rn . The intersection of
these is another subspace K = ∩τ Xτ . In the growth-curve example, K = 0 is the
zero subspace, but, in general, K may be non-zero. In that setting, it is better to use
K as the kernel subspace for model comparisons.
The mean functions for the two strains are β0 h(t) and β1 h(t), and the fitted curves
with βs replaced by β̂s are shown as dashed lines in the lower right panel of Fig. 4.1.
4.3 Technical Points
The fitted mean is not to be confused with the predicted growth curve for an
extra-sample plant i* of strain s, which is deemed to have a response

Y_{i*}(t) = β_s h(t) + σ₁ η₀(t) + σ₂ η_{i*}(t) + ε_{i*}(t),

whose covariances with the observed responses are not all zero. The conditional
distribution given the data is Gaussian with conditional mean (4.3).
Fig. 4.2 Fitted mean growth curves (dashed lines) and best linear predictors (solid lines) of plant
height for two strains, using an inverse linear or inverse quadratic model for the mean and Brownian
motions for the deviations. Sample average heights at each age are indicated by dots
improves the inverse-linear fit, but even so, it is less satisfactory than the inverse
quadratic.
In certain areas of application such as animal breeding, the chief goal is to
make predictions about the meat or milk production of the future progeny of a
specific individual bull. This bull is not an extra-sample individual, but one of those
experimental animals whose previous progeny have been observed and measured.
Such predictions are seldom called for in plant studies, but may be required in
animal growth studies. From a probabilistic viewpoint, the procedure for in-sample
units is no different. If i* is one of the sampled plants and t is an arbitrary time
point, the covariance of Y_{i*}(t) and Y_i(t′) is

cov(Y_{i*}(t), Y_i(t′)) = σ₁² min(t, t′) + σ₂² δ_{i*,i} min(t, t′),

which involves two of the three variance components. The conditional expected
value (4.3) yields a continuous temporal curve specific to each plant. The observed
values for plant i do not lie on this curve, so (4.3) is not an interpolation.
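The conditional mean (4.3) is the standard Gaussian conditioning formula, E(Y* | Y) = E Y* + C₁₂ C₂₂⁻¹ (Y − E Y). A generic sketch with invented numbers (a zero prior mean and a single Brownian component, not the fitted plant model):

```python
import numpy as np

t_obs = np.array([2.0, 6.0, 10.0])   # times at which one plant was measured
t_new = 8.0                          # unmeasured prediction time
s0, s1 = 0.25, 1.0                   # noise and Brownian variances (illustrative)

# Brownian covariance min(s, t) among observed times, plus measurement noise.
C22 = s1 * np.minimum.outer(t_obs, t_obs) + s0 * np.eye(t_obs.size)
C12 = s1 * np.minimum.outer(np.array([t_new]), t_obs)
y = np.array([5.0, 9.0, 12.0])       # observed heights (made up)

pred = C12 @ np.linalg.solve(C22, y)  # conditional mean at t_new, zero prior mean
```

Because the noise variance s0 enters C22 but not C12, the predicted curve does not pass through the observations, which is the sense in which (4.3) is not an interpolation.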
The third variance component does not occur in the preceding calculation
because the phrase ‘t is an arbitrary time point’ is interpreted to mean that plant i*
was not measured at time t. Thus the measurement-error contribution ε_{i*}(t) is
independent of all observations, and the conditional expectation is

E(Y_{i*}(t) | data) = E Y_{i*}(t) + E(σ₁ η₀(t) + σ₂ η_{i*}(t) | data).
1. Choice of temporal origin. The distinction between calendar time and plant age is
fundamental. The decision to measure plant age relative to the time of brearding
is crucial, and has a greater effect on conclusions than any subsequent choice.
2. Selection of a characteristic mean curve. The mean curve must pass through the
origin at age zero, so a logistic function e^t/(1 + e^t) cannot be used. The graphs in
Fig. 4.1 suggest an inverse quadratic curve, which may or may not be appropriate
for other plants.
3. Use of a non-stationary covariance model. Plant growth curves are intrinsically
non-stationary because they are tied to the origin at age zero. For obvious
biological reasons, animal growth curves using weight in place of height are not
similarly constrained.
4. Brownian motion. It seems reasonable that every growth curve should be
continuous in time. It seems reasonable also to model the response process as
a sum of the actual height plus independent measurement error, thereby making
a distinction between plant height and the measurements made at a finite set of
selected times. The particular choice (BM) is not crucial, and can be improved
using fractional Brownian motion. It is also possible to mix these by using FBM
for the plant-specific deviation, and BM for the common deviation, or vice-versa.
5. Positivity. Plant heights are necessarily positive—or at least non-negative—at
all positive ages, whereas every Gaussian model puts positive probability on
negative heights. This is one of those compromises, some major, some minor, that
are frequently needed in applied work. Provided that the probability of a negative
value is sufficiently small, this compromise is a good bargain: see point 9 below
and Sect. 12.2.
6. Response transformation, usually y → log(y), is an option that must always
be considered: see Chap. 19. The log transformation might be reasonable for
animal growth curves, but it was rejected here because of the role of zero height
in determining the age origin.
7. Limiting behaviour. Plants do not grow indefinitely or live for ever, so the
capacity of the growth model for prediction is limited to the life span of a typical
plant.
8. Other issues. The emphasis on growth curves overlooks the possibility that the
two strains may differ in other ways. In fact, the average brearding time for
strain ‘108’ is two days less than the time for strain ‘cis’, with a standard
deviation of 0.43 days. No single summary tells the whole story.
4.6 Exercises
4.1 In the inverse quadratic model, the height of plant i at age t is Gaussian with
mean βs(i) h(t) whose limit as t → ∞ is βs(i). What is the variance of the ultimate
height of plant i?
4.2 For the inverse linear model in which brearding is deemed to have occurred two
days prior to the first positive measurement, estimate τ together with the plateau
coefficients. Obtain the standard error for the estimated limiting difference of mean
heights for the two strains.
4.3 The Brownian motion component of the model can be replaced with fractional
Brownian motion with parameter 0 < ν < 1, whose covariance function is

cov(B(s), B(t)) = (s^{2ν} + t^{2ν} − |s − t|^{2ν})/2,

where s, t ≥ 0. The index ν is called the Hurst coefficient, and ν = 1/2 is ordinary
Brownian motion. Show that the fit of the plant growth model can be appreciably
improved by taking ν ≃ 1/4.
4.4 Bearing in mind that the heights are measured to the nearest millimetre,
comment briefly on the magnitude of the estimated variance components for the
FBM model.
4.5 In the fractional Brownian model with ν < 1/2, the temporal increments for
non-overlapping intervals are negatively correlated. Suggest a plausible mechanism
that could lead to negative correlation.
4.6 For 1000 equally spaced t-values in (0, 10] compute the FBM covariance
matrix K and its Choleski factorization K = L′L. (If t = 0 is included, K is
rank deficient, and the factorization may fail.) Thence compute Y = L′Z, where the
components of Z are independent and identically distributed standard Gaussian, and
plot the FBM sample path, Yt against t. Repeat this exercise for various values of ν
in (0, 1) and comment on the nature of FBM as a function of the Hurst coefficient.
4.7 Several plants reach their plateau well before the end of the observation period.
How is the analysis affected if repeated values are removed from the end of each
series?
4.8 Explain the purpose and the implementation of the Box-Tidwell method. Why
must the unmodified REML criterion be avoided?
4.9 Investigate the relation between brearding date and ultimate plant height. Is it
the case that early-sprouting plants tend to be taller than late-sprouting plants?
Chapter 5
Louse Evolution
5.1.1 Background
The following synopsis of experimental procedure is taken directly from Villa et al.
(2019). Before the start of the experiment, resident lice on all experimental pigeons
were eradicated by housing the birds in low-humidity conditions for at least ten
weeks. According to the authors, this procedure kills both lice and eggs, while
avoiding residues from insecticides. To begin the experiment, 800 lice taken from
wild-caught feral pigeons were transferred to 32 lice-free experimental pigeons, 25
lice per host. Pigeons were housed in eight aviaries, each aviary containing four
birds of the same breed. Every six months, a random sample of lice from each
bird was removed, photographed, and returned to the host. The sex, body length,
metathorax width, and head width of each louse were recorded.
One aspect of this design is different from that in Chap. 3. After measurements
were made, the lice were returned to their host. This was done in order to minimize
the effect of measurement on the host-parasite system. Otherwise, the act of
measurement would reduce the resident population, and introduce instability in
the lineage, which is not desirable. In the design in Chap. 3, the flies removed
for experimental purposes were reared separately for one generation on a standard
diet, so it was not possible to return them to the main breeding line. However, the
Drosophila breeding lines were more easily controlled, so plans could be made in
advance to accommodate the numbers needed in any particular generation.
As always in situations of this sort, the phrase ‘random sample of lice from each
bird’ must be treated with caution, particularly with regard to size measurements.
Larger lice are more visible than smaller specimens, so it would be naive to expect
the random sample to behave like a simple random sample of the resident lice on a
given bird. Nonetheless, size-biased sampling need not be a serious concern for this
experiment provided that it affects all birds equally.
Since each measurement is made on one louse, it is evident that each observational
unit is either one louse or one louse on one occasion, while the response Yu is a
point in the state space, which is {M, F } × R3 for three size measurements. Since
one louse generation is approximately 24–25 days, and measurement occasions are
six months apart, we can be sure that no louse was measured on more than one occasion.
5.1 Evolution of Lice on Captive Pigeons
Apart from the founders, louse sex is a post-baseline variable, and thus one of
four components in the response. Genetic theory leads one to expect that the sex ratio
should remain steady at 50:50 for most species, and post-baseline counts in Table 5.3
confirm this. But the same table also shows that the baseline F:M ratio is 464:336,
which is significantly in excess of 50:50.
Each lineage was associated with a particular pigeon at baseline, which means
that lineage and pigeon are equivalent as block factors. A subsequent remark in the
paper shows that this statement is not quite correct. When a bird died during the
experiment, all lice from the dead bird were transferred to a new parasite-free bird
of the same type. Thus, one lineage could span two or more birds. Unfortunately the
data file does not indicate when deaths might have occurred, so we have no way to
check the effect on lineages of host transfers.
Table 5.1 Average log body length (in µm) of lice on two pigeon hosts
Time in months
Sex Host 0 6 12 18 24 30 36 42 48
F Feral 7.883 7.883 7.883 7.874 7.866 7.886 7.880 7.872 7.864
F G.R. 7.885 7.894 7.882 7.882 7.882 7.895 7.894 7.899 7.886
M Feral 7.720 7.716 7.705 7.700 7.702 7.712 7.709 7.713 7.700
M G.R. 7.720 7.718 7.717 7.716 7.713 7.723 7.726 7.731 7.720
Differences ×100: Giant runt − Feral
F G−F 0.2 0.1 −0.1 0.8 1.7 0.9 1.4 2.6 2.2
M G−F 0.0 0.2 1.2 1.7 1.1 1.1 1.7 1.8 2.0
The sex difference 7.88 − 7.72 = 0.16 on the log scale means that female lice
are about 17% longer than males (e^{0.16} ≃ 1.17). The last two rows show that the
mean difference for hosts tends to increase over time, reaching around 2% for both
sexes after 48 months. It is remarkable that such a small size difference could have
a detectable effect on sexual coupling.
The first panel of Fig. 5.1 shows a plot of the same data with sexes combined.
Automatic centering and re-scaling of the y-axis has the effect of exaggerating the
variation and the magnitude of the divergence between the two groups. In other
words, that which is emphasized by the table of averages is eliminated by the plot.
The remaining panels show similar plots for the head width, the metathorax
width, and the first principal component, which is a roughly equally-weighted
positive linear combination of the three standardized size variables. For all size
variables, the temporal trajectory for louse size on giant runts is surprisingly similar
to that for feral pigeons, and lice on giant runts are larger on average than those
on feral pigeons. Apart from the uniform decrease in all size measurements in the
initial and final intervals, no clear temporal trend is visible.
Ideally, it would be good to show error bars for every point. But size mea-
surements for different lice on one pigeon are not independent, so honest error
assessment is not straightforward. On balance, it is better to show no error bars
than to show the naive default based on independence, which is misleading in this
setting: contrast Fig. 5.2 with Table 5.4 in Sect. 5.5.2.
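The reason naive error bars mislead can be seen in a toy calculation: for n equicorrelated measurements with correlation ρ, the variance of the mean is σ²(1 + (n − 1)ρ)/n, not σ²/n. A quick simulation, with invented n and ρ rather than estimates from the louse data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, reps = 25, 0.3, 4000
# Shared bird effect plus individual noise gives equicorrelated unit-variance data.
bird = rng.normal(0.0, np.sqrt(rho), size=reps)
noise = rng.normal(0.0, np.sqrt(1.0 - rho), size=(reps, n))
means = bird + noise.mean(axis=1)

naive_se = np.sqrt(1.0 / n)        # iid formula for the SE of the mean
true_se = means.std()              # empirical SE over replicate birds
ratio = true_se / naive_se         # close to sqrt(1 + (n - 1) * rho)
```

With these invented values the naive error bar understates the true standard error by a factor near three, which is the sense in which independence-based bars would be misleading here.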
Table 5.1 and Fig. 5.1 illustrate temporal trends in average body size. To get a
comparable impression of trends in variance, it is helpful to compute mean-squares
associated with louse sex, host size, aviary, lineage and residuals at each of the nine
time points.
Fig. 5.1 Average body sizes of lice for two hosts over time (four panels against
time in months: log body length, log head width, log metathorax width, and the first
principal component, for feral pigeons and giant runts)

The dominant mean square is that for louse sex, which starts off at 5.25 at
baseline, drops to half that value at six months and decreases slowly to 1.64 at
48 months. For the other factors, the mean squares are shown in the top half of
Table 5.2, together with the REML variance components for aviary, lineage and
residual in the second half. For this fit, host and sex were eliminated as fixed effects,
so the mean-squared residual does not coincide exactly with σ̂₀².
Some of the following points are accommodated in subsequent analyses, but
others are merely noted.
1. The residual variability at baseline is twice that on all subsequent occasions. One
plausible explanation is that founder lice collected from wild pigeons are more
variable in size than those resident on captive pigeons.
2. The lineage mean square is remarkably constant from baseline onwards. Relative
to the residual mean square, it is below expectation at baseline, but not signifi-
cantly so. After baseline, it is uniformly larger than the residual mean square, but
not by a large factor.
3. The host mean square at baseline seems artificially low. There is strong evidence
in the data, for example in the sex ratios, that the randomization scheme was
more complicated than that depicted in the preceding section, so this may be a
consequence of an effort to balance the randomization.
4. The between-aviary mean square at baseline is a little larger than expected from
uniform random assignment: the F -ratio is 2.6, which is at the upper 1.6% point
of the reference null distribution.
5. Variance-component estimates on few degrees of freedom, such as those for
aviary and lineage, have notoriously high variances.
The main issue to be addressed at this point is the size of the aviary mean square at
baseline, and whether the mean square provides sufficient probative evidence to cast
doubt on the randomization or to declare it inadequate or biased. The question is not
whether the initial lice were labelled 1–800 and lots drawn to determine which lice
would be assigned to which birds, but whether the laboratory procedures actually
employed are a reasonable facsimile of objective randomization. The only evidence
before the court is shown in Table 5.2.
One traditional view is that the aviary mean square is selected for attention as the
largest of three or four, so the p-value, or measure of extremity, is closer to 5%. That
calculation tells us something, but it does not answer directly the question of interest
to the court: ‘Given the data, what is the probability that the allocation to aviaries
was biased?’ From another viewpoint in which sparseness prevails at odds level ρ,
the odds on aviary bias given the mean-square ratio F = 2.6 on 6, 367 degrees
of freedom are approximately ρζ6 (2.6), where ζ6 (2.6) = 3.81. This calculation
uses a modification for F -ratios of the sparsity argument in McCullagh and Polson
(2018). The strength of the evidence is such that the initial presumption of innocence
with probability 1/(1 + ρ) is changed to 1/(1 + ρζ₆(2.6)). For ρ = 0.1, which is
not a strong prior presumption for this setting, the probability of no aviary bias is
changed by the evidence from 0.91 to 1/1.38 = 0.72. So we take note and proceed
with caution, giving the randomization a provisional pass. This point is revisited in
Sect. 5.4.2.
in which h(u) is a code for the host size, and s(u) is the louse sex. At baseline, the
additive model implies
The following linear models address directly the question that is of principal interest
to an evolutionary biologist. Without straying from linearity in time, the null and
alternative may be formulated as linear subspaces.
The model formulae time+sex and host:time+sex generate basis vectors for
the two subspaces whose dimensions are three and four respectively. The alternative
model has two linear trends in time, one for captive feral hosts h(u) = 0, and one
for giant runts h(u) = 1.
For covariances, we start out following the authors’ suggestion with three
variance components

cov(Y_u, Y_{u′}) = σ₀² δ_{u,u′} + σ₁² δ_{l,l′} + σ₂² δ_{a,a′},   (5.3)

where l, l′ and a, a′ are the lineages and aviaries of units u, u′ respectively. This
is a linear combination of three identity matrices, one on the lice, one on the lineages or pigeons
with 32 blocks, and one on the aviaries with eight blocks. It is usually justified either
by appeal to exchangeability based on recorded similarities of observational units,
or, if that argument fails to convince, by appeal to randomization. Although neither
argument carries weight in this instance, computation is cheap so we proceed.
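The structure of (5.3) is easy to assemble directly from factor indicators. The sketch below uses a small hypothetical layout (16 lice in 4 lineages and 2 aviaries, with invented variance components), not the experiment's dimensions:

```python
import numpy as np

lineage = np.repeat(np.arange(4), 4)   # 16 lice, 4 per lineage (hypothetical)
aviary = lineage // 2                  # 2 aviaries, 2 lineages each

def block(labels):
    # Indicator (block) matrix: entry 1 where the two units share a label.
    return (labels[:, None] == labels[None, :]).astype(float)

s0, s1, s2 = 1.0, 0.3, 0.1             # illustrative variance components
Sigma = s0 * np.eye(lineage.size) + s1 * block(lineage) + s2 * block(aviary)
```

The identity term guarantees positive definiteness, and the two block terms induce the within-lineage and within-aviary correlations that the exchangeability argument would justify.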
For the log body length, the REML variance components in (5.3) paired
with (5.2) are
Both the lineage and aviary variance components are small relative to the between-
lice variance. Despite that, there is no compelling reason to declare them null simply
because they are small. The fitted slope coefficients (×10⁴) for the two pigeon
breeds are
This analysis appears to provide reasonably strong evidence that lice transferred
to captive feral pigeons decrease in size over time, and moderately strong evidence
that lice transferred to giant runts increase in size over time. However, the analysis is
based on linearity in time, which seems implausible given Fig. 5.1, and a covariance
structure (5.3) that is both inadequate for the data and in conflict with randomization.
cov(Y_u, Y_{u′}) = σ₀² δ_{u,u′} + σ₁² K(t, t′) δ_{l,l′} + σ₂² K(t, t′) δ_{a,a′} + σ₃² K(t, t′),   (5.4)

where K(t, t′) = min(t, t′) is the Brownian-motion covariance.
The conclusion from this analysis is the essence of simplicity: the data are entirely
consistent with neutral evolution of louse size on both hosts. Not only is there no
evidence of a differential drift in louse size for the two hosts, there is no evidence of
a drift for either host.
Apart from the Brownian contribution, Table 5.2 shows that the baseline variance
is substantially larger than the residual variance on subsequent occasions. This
observation suggests that (5.4) is not adequate on its own, and must be supplemented
by an additional diagonal matrix for baseline observations. This differential baseline
variance leads to a further 64.7-unit increase in the REML criterion. However its
effect on conclusions is almost negligible; for comparison, the fitted coefficients
(×10⁴) are as follows:
Villa et al. base their conclusions on the first principal component as a combined
measure of overall louse size. Since the first principal component is essentially the
standardized sum or average of the three size variables, this much is fine. The sample
averages for each host are plotted in the fourth panel of Fig. 5.1, which shows that
the divergence between the two mean trajectories is not appreciably greater than the
temporal variability of any single trajectory. This is a disappointing conclusion for
a four-year experiment, and not appealing as a headline story.
However, Villa et al. choose to emphasize the divergence over the variability by
plotting the PC1 mean difference (giant runts minus controls) as a function of time
in their Fig. 1C. A version of their plot is shown in Fig. 5.2, and is to be contrasted
with the fourth panel of Fig. 5.1.
The plot symbol on the horizontal line in Fig. 1C or Fig. 5.2 is explicitly
associated with controls. Error bars attached to zero are not mentioned in captions or
in text. The visual impression of remarkable temporal stability of louse size on feral
pigeons contrasts starkly with the rapid increase for lineages on giant runts. The
plot title and the scale on the y-axis confirm those impressions, which are in line
with the authors’ conclusion ‘Lineages of lice transferred to different sized pigeons
rapidly evolved differences in size’. In my opinion, Fig. 1C or Fig. 5.2 gives a grossly
misleading impression of stability for feral pigeons contrasted with a substantial
trend for giant runts. In fact, Table 5.1 shows that louse body-size changes are no
more than 2% over the entire period.
Taking correlations into account, the error bars for the non-zero line in Fig. 1C
or Fig. 5.2 are too small by a factor increasing from about 1.0 to 7.0, and roughly
proportional to time.
Tables S2–S5 in the Appendix to their paper report regression coefficients
and their standard errors for the full factorial model with (5.3) as the covariance
structure. These tables are cited in the Results and Discussion section to support the
chief claim: ‘Over the course of 4 y, lice on giant runts increased in size, relative to
lice on feral pigeon controls (Fig. 1C and SI Appendix, Tables S2–S5)’. It is unclear
which coefficients are meant to justify this claim, but the coefficient of host:time in
the PC1 analysis is reported with a t-ratio of 3.15. Overlooked in this computational
blizzard is the fact that both the fitted mean and the fitted covariance contradict
the randomization. In addition, the covariance assumption is non-standard for an
evolutionary process, and is demonstrably inadequate for the task.
5.3 Critique of Published Claims
Fig. 5.2 PC1 mean difference (giant runts minus controls) plotted against time in months
The formal analysis of the first principal component by linear Gaussian models
follows the lines of Sect. 5.2.5. Although the scale of the PC1-response is very
different from that of the body length, the need for the Brownian-motion component
is abundantly clear, as is the additional baseline variance. When these covariances
are accommodated, the slope estimates and their standard errors are
Nothing in this PC1 analysis points to a departure from neutral evolution of lice on
either host. In conclusion, the evolutionary divergence described by Villa et al. may
well exist on some time scale, but the evidence for it is not to be found in their data.
The variables host and lineage are treatment factors generated immediately post-
baseline by randomization, and having a known distribution. For the 800 lineage
founders, louse sex is a pre-baseline variable; for the remaining lice, sex is a
random variable not generated by randomization, and not recorded immediately
post-baseline. One can speculate on the joint distribution, but in principle, the
sex ratio for giant runts might not be the same as the sex ratio for controls.
Thus, (5.1) and (5.2) are models for the conditional mean while (5.3) and (5.4)
are models for the conditional covariance—given host and lineage plus the entire
sex-configuration for all sampled lice.
Regardless of covariance assumptions, the interpretation in (5.2) of βh as ‘the
effect of treatment’ must be considered in the light of the fact that any additive effect
possibly attributable to an effect of treatment on sex has been eliminated. Although
not intermediate in the temporal sense, sex is not dissimilar mathematically to an
intermediate response. It is possible that treatment could have an effect on the
intermediate response, in which case the coefficients βh in the conditional mean
describe only one part of the treatment effect.
In the context of this experiment, no effect of treatment on sex is anticipated.
Any effect that might be present is most likely to be a sampling artifact of little or no
evolutionary interest. Nonetheless, it is not difficult to examine the sex distribution
at baseline and post-baseline for both treatment groups. Table 5.3 shows the louse
counts by time, host and sex.
The post-baseline total count is quite constant at 200 for giant runts, but is much
more variable for captive feral pigeons. The first is presumably a design target.
We are left to wonder why the control group does not have a similar target.
Nevertheless, this is not a serious criticism. In both treatment groups, females
account for 58% of lice at baseline, but close to 50% thereafter. As anticipated,
there is little evidence of a difference in sex ratio between groups. If anything, the
difference between the ratios is below expectation at nearly every time point.
The Poisson log-linear model time:(host+sex) is equivalent to the statement that
host and sex are independent at each time point, or equivalently, that the sex ratio
is the same for both pigeon breeds, but not necessarily 50:50. The residual deviance
of 2.8 on nine degrees of freedom falls at the lower third percentile (0.03) of the
null distribution, which shows that sample log odds ratios are uniformly closer to
constant than the Poisson model predicts. Certainly, there is no suggestion of a
treatment effect on sex ratios. Apart from the imbalance at baseline, the subsequent
ratios are close to 50:50, so we can regard the sex indicator post-baseline as a
Bernoulli process independent of treatment.
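The independence model can be checked without special software: within each time point, the expected counts under host × sex independence come from the row and column margins, and the deviance contributions add over the nine 2 × 2 tables, one degree of freedom each. The counts below are simulated stand-ins for Table 5.3, which is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical host-by-sex counts at nine time points (the real counts are in Table 5.3).
tables = rng.poisson(100, size=(9, 2, 2)).astype(float)

def independence_deviance(t):
    # Deviance of the independence model for one 2x2 table (1 df):
    # expected counts are outer(row margins, column margins) / total.
    e = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
    return 2.0 * np.sum(np.where(t > 0, t * np.log(t / e), 0.0))

dev = sum(independence_deviance(t) for t in tables)   # ~ chi-squared on 9 df
```

Under the independence model the total deviance is approximately chi-squared on nine degrees of freedom, which is the reference distribution against which the observed value 2.8 falls unusually low.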
The aviary term in (5.4) is replaced with σ₂² K(t − τ, t′ − τ) δ_{a,a′},
with a single temporal shift τ ≤ 0 to be estimated from the data using the REML
criterion. One boundary point τ = 0 coincides with (5.4), and the other limit
τ → −∞ implies a constant aviary effect as in (5.3). For τ < 0, this modification
implies positive correlations within aviaries at baseline, which is a size pattern
that contradicts our understanding of randomization. The interpretation is that, by
accident or by design, some aviaries start out with larger lice than others, and the
initial pattern leaves an imprint on the subsequent evolution.
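A two-line check of the baseline implication, assuming the shifted Brownian covariance K(t − τ, t′ − τ) = min(t − τ, t′ − τ) for the aviary term (the functional form is inferred from the two limits described above, not quoted from the fit):

```python
import numpy as np

def K_shift(s, t, tau):
    # Shifted Brownian covariance; tau <= 0. At tau = 0 this is plain min(s, t).
    return np.minimum(s - tau, t - tau)

# At baseline (s = t = 0) the within-aviary covariance is -tau > 0 for tau < 0,
# so aviaries can differ systematically in louse size before the experiment starts.
baseline_cov = K_shift(0.0, 0.0, -2.6)
```

With τ̂ = −2.6 the baseline within-aviary covariance is 2.6σ₂², rather than the zero that τ = 0, and hence objective randomization, would imply.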
For the PC1 variable, the profile REML log likelihood values for τ at zero, τ̂ =
−2.6 and −∞ are 0.0, 11.5 and −18.5, showing that the constant aviary effect is
decisively rejected by the data. It appears from this analysis that the initial aviary
pattern for PC1 is non-zero and that it persists in the subsequent evolution. The
particular temporal offset may be pure coincidence, but τ̂ = −2.6 months is a very
close approximation to the de-lousing quarantine period during which the pigeons
had to be housed somewhere.
Consider the statement near the beginning of Sect. 5.1.3: ‘since each measurement
is made on one louse, it is evident that each observational unit is one louse. . . ’. The
premiss—that each measurement is made on one louse—is indisputable. Nevertheless,
a conclusion that is obvious literally is not necessarily true mathematically in
the sense of the definition.
According to the definition, the observational units are the objects, or points
in the domain, on which the response is defined as a random function or a
stochastic process. Thus, each observational unit exists at baseline, not necessarily
as a physical object, but as a non-random mathematical entity. For the models in
Sect. 5.2, with louse-time pairs as observational units, there is no birth or death,
and no evolving finite population—only a fixed, arbitrarily large, set of lice in each
lineage. In this mathematical framework, the lice are in 1–1 correspondence with the
natural numbers, they live indefinitely in the product space, and their vital statistics
are random variables recorded in the state space. To each louse there corresponds a
stochastic process, so the value for each louse evolves over time, but the population
itself is fixed and arbitrarily large in every lineage.
It would be wrong to say that the Gaussian model is incorrect or that its flaws are
fatal, but its shortcomings for this application are clear enough. If the application
calls for a finite randomly-evolving lineage, a more complicated mathematical
structure is required. The remarkable thing is not that this Gaussian model is
exquisitely tailored to this evolutionary process, but that a generic model that is
missing the defining aspects of life, namely birth and death, should have anything
useful to contribute at all.
Certainly, the lice do not exist in the physical sense at baseline. But lineages
are established at baseline, and it is the lineages that evolve. They evolve randomly
in two senses—in their composition as a finite set of lice, and in their values or
features. If both aspects are important for a given application, a more complicated
model is needed in which the observational units are lineage-time pairs. The state
space for one measurement on one louse is S = {M, F } × R3 ; the state space for
one observational unit is the set of finite subsets of S. One finite subset of S is a
complete description of the population size and the vital statistics of the residents at
time t. The transitions from one finite subset to another are limited by birth, death
and continuity in time.
A general process of the type described in the preceding paragraph is a
complicated mathematical structure, and we make no effort to develop a general
theory here. But there are simple versions that are essentially equivalent to imposing
a pure birth-death process independently as a cohort restriction on the domain of a
Gaussian process. The distribution of the values thus generated coincides with the
Gaussian model in Sect. 5.2, and none of the subsequent analyses are affected. For
that setting, birth and death are immaterial.
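One such simple version can be sketched as follows: a lineage whose membership follows a birth-death process, with values given by a common Brownian motion plus an individual one. Everything here is hypothetical (the rates, the lifetimes, and the time origin of the individual deviations, which for simplicity start at zero rather than at each louse's birth); the point is only to show the cohort restriction acting on the domain of a Gaussian process:

```python
import numpy as np

rng = np.random.default_rng(11)

# One hypothetical lineage observed over 48 months: births by a Poisson
# process, exponential lifetimes, values from a common BM plus an individual BM.
T = 48.0
births = np.cumsum(rng.exponential(2.0, size=60))
births = births[births < T]
deaths = births + rng.exponential(18.0, size=births.size)

obs = np.arange(0.0, T + 1e-9, 6.0)     # nine measurement occasions, in months

def bm(times):
    # Brownian motion evaluated on a grid: cumulative independent increments.
    steps = np.diff(np.concatenate([[0.0], times]))
    return np.cumsum(rng.normal(0.0, np.sqrt(steps)))

common = bm(obs)
indiv = np.array([bm(obs) for _ in range(births.size)])

# The sample at occasion i is the set of values of lice alive at that time;
# some occasions may yield an empty sample, as happens in the data.
samples = [common[i] + indiv[(births <= t) & (deaths > t), i]
           for i, t in enumerate(obs)]
```

The values carried by the surviving lice have exactly the Gaussian distribution of the underlying process; the birth-death mechanism only selects which coordinates are observed, which is why birth and death are immaterial for the analyses above.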
The possibility that individual louse values or body-sizes might be related to
the sample size or lineage size from which they come has not been considered up
to this point, in part because such a dependence is not possible under the models
in Sect. 5.2. The notion that a sample can be extended indefinitely from a sub-
sample such that the sub-sample values remain unchanged is usually understood
in applied work as an obvious fact. Failure strikes at the heart of the most
cherished notion in probability and applied statistics, which is the ‘obvious fact’
of distributional consistency for sub-samples as formulated by Kolmogorov (1933).
If lice are the observational units for this process, consistency implies that the
distribution for individuals is unrelated to the size of the sample from which they
are taken. Fortunately, variability of sample sizes provides a weak check to test that
implication.
Each of the 32×9 lineage-time pairs provides one sample, of which 15 are empty.
The louse counts range from zero to 44, they are highly variable, and they tend to
decrease over time. One lineage appears to go extinct at 30 months. The safest and
the simplest way to test for a dependence on sample size is to include sample size as
an additional ‘covariate’ in (5.2), retaining (5.4) for covariances. For both log body
length and PC1, the fitted coefficient is negative and approximately one half of the
standard error. This analysis offers no evidence of a sample-size dependence, which
provides a little reassurance that the earlier analysis with louse as the observational
unit is reasonably sound.
If some birds preened more vigorously or more thoroughly than others, and
larger or older lice were preferentially removed by preening, the more assiduous
preeners would then host fewer and smaller lice. Differential preening could lead to
a dependence of mean louse size on lineage size or on sample size, in which case
the test in the preceding paragraph is a reasonable check.
5.5 Follow-Up
Given the severity of the discrepancy between the conclusions presented above
and those published by Villa et al. (2019), it seemed only appropriate to send a
copy of Sects. 5.1–5.4 to the authors for comment. I contacted the lead author, Scott Villa, in early December 2020. He responded immediately, and again at the beginning of February 2021, offering further details about the experimental design and challenging the conclusions on several points.
By Villa’s account, the randomization was carried out according to an elaborate
protocol, which involved dislodging the CO2 -anesthetized lice over a custom-made
10 × 14 glass grid, generating a random grid number as the starting point for
collection of specimens, and placing lice sequentially and cyclically in vials labelled
1–32 until each vial contained 25+ lice. It was designed to avoid unintentional
biases, and it appeared to be adequate for the task.
The following summary of key design points that had previously been partially
or totally misunderstood is taken from Villa’s reply.
1. At time zero, 1600+ lice were collected from wild feral pigeons. No size
measurements were made on the sub-sample of 800 founder lice that were
transferred to captive birds. A second sample of 800 lice was photographed,
measured, and frozen for subsequent genetic analysis.
2. The 800 founder lice were assigned to hosts at random, 25 per bird. Each
founding population consisted of 13–14 females and 11–12 males with a
deliberate female bias to ensure that a lineage would be established on every
host.
3. The 800 lice measured at time zero did not contribute to the breeding population;
their assignment to lineages was randomized, but purely virtual. The virtual
sample had the same sex-ratio as the founders.
4. After baseline, the lice that were measured at 6-month intervals were frozen
thereafter to use for genomic analyses of the populations over time. Throughout
the experiment (months 6–48), the adult and immature lice that were removed but
not photographed were immediately placed back on birds, thus ensuring stability
of the lineages over time.
In light of the revised information, certain statements in the ‘Materials and
Methods’ section of the published paper seem ambiguous or oddly phrased, for
example,
We transferred 800 lice from wild caught feral pigeons to 16 giant runt pigeons and 16 feral
pigeon controls (25 lice per bird). At this time (Time 0), we also randomly sampled 800 lice
from the source population on wild caught feral pigeons and measured their body size.
This remark suggests, correctly as it turns out, that the measured lice and the founder
lice might be disjoint subsets. But that thought was dispelled by an earlier remark
Once photographed, the live lice were returned to their respective host,
Interpretation of such statements is a question for subject-matter experts, not a matter on which statistical expertise carries weight. As always,
the over-riding concern is that the experiment be reported as it was conducted.
At this point we accept the new design information, and ask what effect it has on
the appropriateness of the analyses already performed, and what modifications are
required.
Consider first the information that the association of time-zero measurements
with lineages is virtual. This fact implies that the information content is unchanged
if time-zero values are permuted in any manner that preserves sexes, while non-
baseline values stay put. A baseline permutation that preserves sexes is one in which
males are permuted with other males, females with other females, and non-baseline
individuals are fixed. This set of permutations is a sub-group of size 464! × 336! in
the larger group of size 3105!.
Any credible analysis that accommodates the virtual randomization must be
invariant with respect to this group of permutations; similar remarks apply to
numerical conclusions regarding temporal trends, variance components or other
effects. The authors’ block-factor assumption (5.3) applies to baseline and non-
baseline values, so it contradicts baseline exchangeability, virtual or otherwise. The
numerical values reported in their supplementary tables S1–S5 are also not invariant.
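The invariance requirement can be checked mechanically. The following Python sketch (the chapter's own computations are in R; the data sizes and sex coding here are synthetic, for illustration only) constructs a sex-preserving baseline permutation and verifies its defining properties:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the louse data: 20 baseline lice (t = 0) and 30 later ones.
t = np.concatenate([np.zeros(20), np.ones(30)])   # measurement time
sex = rng.integers(0, 2, size=50)                 # 0 = female, 1 = male (hypothetical coding)
y = rng.normal(size=50)                           # measured values

# A sex-preserving baseline permutation: baseline females are permuted among
# themselves, baseline males among themselves, and every other index is fixed.
perm = np.arange(50)
for s in (0, 1):
    idx = np.where((t == 0) & (sex == s))[0]
    perm[idx] = rng.permutation(idx)

y_perm = y[perm]
```

Non-baseline values stay put, and the baseline values within each sex are merely rearranged; any statistic that respects the virtual randomization must be unaffected by such a rearrangement.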
Non-virtual baseline exchangeability as discussed in Sect. 5.2.5 implies that the
marginal distribution of the initial 800 measurements is invariant with respect to
sex-preserving permutation. Virtual exchangeability is a much stronger condition
because it implies also that the joint distribution of all 3105 measurements is
invariant. Neither condition implies independence of initial and subsequent values,
but virtual exchangeability implies that the dependence must be of a trivial type,
which is ignorable in practice. The Brownian model (5.4) implies cov(Yu, Yu′) = 0 for any pair u ≠ u′ such that t(u) = 0 or t(u′) = 0. Together with (5.2), it also
satisfies the virtual exchangeability condition. By contrast, the standard random-
effects model (5.3) having independent and identically distributed lineage effects
that are constant in time, does not satisfy even the weaker exchangeability condition.
It is also incompatible with the discussion in Sect. 5.4.2.
The Brownian-motion model is in line with the standard genetic theory for trait
evolution, and is compatible with virtual randomization as described above. Thus the
conclusions as stated at the end of Sect. 5.3 are confirmed. Average size differences
between the two hosts shown in Table 5.1 are less than 2% and are compatible with
neutral evolution in both hosts. The sex-adjusted PC1 mean differences GR − F at
each non-zero time point are very similar to the unadjusted differences displayed
in Fig. 5.2, but the correctly-computed standard errors in Table 5.4 tell a different
story.
Both the differences and the standard errors in this table are computed from a
fitted Gaussian model, in which the temporal trend, previously modelled as a zero-
mean random effect with covariance σ₃²(t ∧ t′) in (5.4), is replaced with a non-random term in the mean. The moments are
The mean subspace includes an additive constant for sex, and a host-dependent
temporal trend γh (t). The factorial model formula
According to the reply by Scott Villa, the sex ratio of lice at baseline was
intentionally biased towards females, with 13–14 females and 11–12 males as
founders for each lineage. Following the initial seeding, male and female lice were
sampled in approximately equal numbers, so information on the evolution of the sex
ratio over time is not available. In light of this information, much of the speculation
in Sect. 5.4.1 is no longer relevant.
Villa also takes issue with a remark in Sect. 5.2.1 that the overall change in body
size is surprisingly small, which suggests that changes of this magnitude (< 2%)
cannot be biologically significant. His counter-claim is that
. . . body size changes on this scale are biologically relevant for this species, as the effect on
mating behavior shows (Villa et al., 2019, Figs. 2–5).
The coefficient of variation of body length for female lice within aviaries is very
stable at 2.4–2.6% from six months onwards; the value for males is equally stable at
2.2–2.4%. These numbers represent natural variability of body length within freely breeding populations, which is approximately 2.4% (or σ̂²resid ≈ 60 × 10⁻⁵ in Table 5.2). The mean differences between hosts are shown in Table 5.1; they are
almost uniformly less than 2%.
What are the implications for mating? The root mean square size discrepancy between a random pair from the same aviary is approximately √(2 × 2.4²), or 3.4%, so the distribution of F − M size differences is approximately N(411, 83²). A 2% increase in mean size for females implies that the distribution of size differences for mixed hosts is N(411 + 50, 83²). If size discrepancy is the chief determinant
of sexual compatibility, and incompatibility is rare in each population, a mean
difference of 0.6 standard deviations is not sufficient to make the incompatible
fraction large in the mixed population.
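The arithmetic behind the 0.6-standard-deviation argument is easy to reproduce. A minimal Python sketch, using a hypothetical incompatibility cutoff of three standard deviations (the text specifies no threshold):

```python
from math import erf, sqrt

def norm_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal variable."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

mu, sigma, shift = 411.0, 83.0, 50.0   # F - M differences and 2% shift, from the text
thresh = 3.0                           # hypothetical incompatibility cutoff, in SDs

# Fraction of pairs whose size difference deviates from the same-host mean by
# more than `thresh` SDs, before and after the 0.6-sigma mean shift.
base = 2.0 * norm_sf(thresh)                                     # symmetric two-sided
mixed = norm_sf(thresh - shift / sigma) + norm_sf(thresh + shift / sigma)
```

Both tail fractions remain small, consistent with the claim that a 0.6σ shift alone cannot make the incompatible fraction large in the mixed population.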
The two movies provided by Villa et al. (2019) illustrate size discrepancies
of 1.8 and −2.6 standard deviations, so their relevance at the 0.6σ -scale is not
immediately apparent. In the absence of a detailed morphological explanation, it
is difficult to accept the authors’ claim that body size changes on this scale (∼0.6σ )
are biologically important for any species.
5.6 Exercises
5.1 According to the standard definition in Sect. 11.4.2, two observational units u, u′ belong to the same experimental unit if the treatment assignment probabilities given the baseline configuration satisfy P(Tu = Tu′) = 1. Section 5.1.3 makes
the argument that each louse is one observational unit, and that each lineage is one
experimental unit. But the author subsequently pivots to aviary as the experimental
unit, hedging his bets by stating that ‘both seem to be relevant’. Discuss the
arguments pro and con of louse-lineage versus louse-aviary versus lineage-aviary as choices of observational and experimental unit.
5.3 Download the data, compute the averages at each time point for the two pigeon
breeds, and reconstruct the plots in Figs. 5.1 and 5.2.
5.5 Use anova(...) to re-compute the mean squares in Table 5.2. Use Bartlett’s
statistic (Exercise 18.9) to test the hypothesis that the residual mean squares have
the same expected value at all time points. What assumptions are needed to justify
the null distribution?
5.6 For the model (5.3), what is the expected value of the within-lineage mean
square at time t? For the Brownian-motion model (5.4), show that the variance of
Yu increases linearly with time. What is the expected value of the within-lineage
mean square?
5.7 Use lmer(...) to fit the variance-components model (5.3) to the log body
length with (5.2) as the mean-value subspace. Report the two slopes, the slope
difference, and the three standard errors.
5.9 Compute the four covariance matrices V0 , . . . , V3 that occur in (5.4). Let Q be
the ordinary least-squares projection with kernel (5.2). Compute the four quadratic
forms Y′QVrQY and their expected values as a linear function of the four variance
components. Hence or otherwise, obtain initial estimates.
5.10 Use regress(...) to compute the REML estimate of the variance compo-
nents in (5.4). Hence obtain the estimated slopes, their difference, and the standard
errors for all three.
5.11 For n = 100 points t1 , . . . , tn equally spaced in the interval (0, 48), compute
the matrix
5.12 Regress the 32 × 9 lineage-time averages (for PC1) against sample size using
sample size as weights. You should find a statistically significant positive coefficient
a little larger than 0.01. Explain why the conclusions from this exercise are so
different from those at the end of Sect. 5.4.2.
5.13 In Table S2 of their Appendix, Villa et al. fit the eight-dimensional factorial
model host:sex:time to the first principal component values on 3096 lice. Show that
this is equivalent to fitting four separate linear regressions E(Yu ) = α + βtu , with
one intercept and one slope for each of the disjoint subgroups, Fer.F, Fer.M, Gr.F,
Gr.M. Feral and female are the reference levels, so sex_u = 1 is the indicator vector for males. Deduce that the host:time coefficient is equal to the slope difference βGr.F − βFer.F restricted to female lice. The fitted value is 0.009. What is the fitted
slope difference for male lice?
5.14 The sex coefficient in Table S2 is −2.437. Which combination of the four
α-values in the previous exercise does this correspond to?
5.15 The host coefficient in Table S2 is 0.449 with standard error 0.159. What does
this imply about the average or expected baseline values for the four subgroups?
5.16 For the model with persistent aviary patterns described at the end of
Sect. 5.4.2, compute and plot the REML profile log likelihood for τ in the range
0.5 ≤ τ ≤ 24. Use PC1 as the response, and (5.2) for the mean-value subspace.
The covariance should be a linear combination of five matrices, one each for the
identity matrix and the identity restricted to baseline, two Brownian-motion product
matrices as in (5.4), and one τ -shifted B-M product matrix. Ten to twelve points
equally spaced on the log scale should suffice for plotting.
5.17 Use the profile log likelihood plot in the previous exercise to obtain a nominal
95% confidence interval for τ .
A baseline permutation is a 1–1 mapping u → τ(u) such that t(u) > 0 implies τ(u) = u. Distributional invariance means that the permuted vector Y^τ with components Y^τ_u = Y_{τ(u)} has the same distribution as Y. Show that the joint distribution is invariant with respect to baseline permutations. Note that h(τ(u)) is not necessarily equal to h(u).
5.20 Consider the following statement taken from Sect. 5.5. Any credible analysis
that accommodates the virtual randomization must be invariant with respect to
the same group, and similar remarks apply to numerical conclusions regarding
temporal trends, variance components or other effects. Invariance in this setting
means that each distribution in the model is exchangeable, or invariant with respect
to sex-preserving baseline permutations. This is a demanding standard, and it is
possible that subsequent statements in that same section may not live up to it. Show
that the model-formula Host:as.factor(Time), which is related to Table 5.4,
corresponds to a set of vectors, some of which are not group-invariant. Investigate
the implications, particularly for time zero.
5.21 According to the text in Sect. 5.5, Virtual randomization requires the time-
zero average for feral hosts to be the same as that for giant runts, but the temporal
trends are otherwise unconstrained. It appears that the model matrix spanning
this subspace is not constructible using factorial model formulae. Explain how to
construct the desired matrix including a constant additive sex effect. What is its
rank? Fit the model as described in the text following Table 5.4. Include independent
Brownian motions for aviaries and lineages, plus an additional baseline error term
with independent and identically distributed components.
5.22 Use the fitted model from the previous exercise to compute the linear trend
coefficient
and its standard error. You should find both numbers in the range 0.013–0.015 per
month, similar to, but not exactly the same as those reported in the text.
5.23 The model in the previous two exercises has a baseline variance that is larger
than the non-baseline residual variance. What is the ratio of fitted variances?
5.24 The fact that measured lice were not returned to their hosts is an interference in
the system that may reduce or eliminate temporal correlations that would otherwise
be expected. One mathematically viable covariance model that is in line with virtual
randomization, replaces each occurrence of t ∧ t′ in (5.6) with the rank-one Boolean product matrix (t > 0)(t′ > 0), so that the only non-zero temporal correlations
are those associated with lineage and aviary as strictly post-baseline block factors.
Fit this modified block-factor model to the PC1 response with (5.5) for the mean
subspace. Which model fits better? Is the log likelihood difference small or large?
An informal comparison suffices at this point.
5.25 Construct two versions of Table 5.4, one based on the modified block-factor
model, and one based on the combined variance model that includes both. Comment
on any major discrepancy or difference in conclusions based on the various models.
Chapter 6
Time Series I
We examine here the Central England daily temperature series, from January 1,
1772 to Dec 31, 2019. The series length is 90 580 days over 248 years.
The data in tenths of a degree Celsius can be downloaded from the address
https://www.metoffice.gov.uk/hadobs/hadcet/cetdl1772on.dat
For each year the values are arranged in a 31 × 12 array, one column for each
month and one row for each day in standard Gregorian format. Non-existent days are
padded with the placeholder ‘value’ −999. For computational purposes, we assume
that the data have been rearranged in standard data-frame format with one row for
each of n = 90 580 days. Each column is one variable. Apart from temp and day,
it may be convenient to include the first and second-order annual harmonics

c1 = cos(2πt/τ), s1 = sin(2πt/τ), c2 = cos(4πt/τ), s2 = sin(4πt/τ),

where t is time measured in days counted from Jan 1, 1772, and τ = 365.2425 is the mean number of days in one Gregorian year.
As is often the case with very extensive data, much can be learned from simple
graphs and other summaries without resorting to formal stochastic models. We first
examine the nature of the annual seasonal cycle.
The average temperature for each date in the year is computed by associating with
each day a calendar date, either the Gregorian calendar date or some version thereof.
In the conventional Gregorian system, each date is an integer in the range 0–365,
beginning with Jan. 1 coded as zero. February 28 and March 1 are coded as 58 and
60 respectively, whether these are consecutive days or not. For present purposes, it
suffices to code day as sequential integers 0:(n − 1), where n = 90 580, and to use
the mathematical calendar date
tau <- 365.2425; date <- trunc(day %% tau)
which is an integer in the range 0–365. Whatever version of the calendar date is
used, the average for each date is computed as follows:
dailymeantemp <- tapply(temp, date, "mean")
Our mathematical dates do not correspond exactly with the Gregorian calendar date,
mostly because the leap day is intercalated at the end of December rather than at the
end of February. Thus, each calendar date 0–364 occurs 248 times, and these dates
are always consecutive days, whereas the leap date occurs only 60 times, so date 0
follows 365 in leap years and 364 in non-leap years. Similar remarks apply to date
number 59 (Feb. 29) in the Gregorian system. Wherever the leap date is intercalated,
a minor discontinuity may be introduced, as can be seen in the volatility series in
Fig. 6.1. If the Gregorian date is used, the discontinuity at Dec. 31/Jan 1 disappears,
but does not reappear at Feb. 29.
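The occupancy counts asserted above can be verified directly; a Python transcription of the R fragment (numpy assumed; the chapter's own code is R):

```python
import numpy as np

# Transcription of: tau <- 365.2425; date <- trunc(day %% tau)
n = 90580                                  # days from Jan 1, 1772 to Dec 31, 2019
tau = 365.2425
day = np.arange(n)                         # day 0 is Jan 1, 1772
date = np.floor(day % tau).astype(int)     # mathematical calendar date, 0-365

counts = np.bincount(date, minlength=366)  # occurrences of each date
```

As stated in the text, each mathematical date 0–364 occurs 248 times, while the intercalated date 365 occurs only 60 times.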
Neither the mean series nor the volatility series is adequately described by a first-
order harmonic function, which is a linear combination of the three basis vectors
1, cos(t), sin(t), but both are reasonably well described by second-order harmonic
functions with two further basis elements cos(2t), sin(2t). The fitted harmonics
shown in Fig. 6.1 were computed by ordinary least squares,
olsfit <- lm(dailymeantemp~c1+s1+c2+s2)
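The corresponding least-squares computation can be sketched in Python as well (the text uses R's lm); here synthetic daily means with known harmonic coefficients, chosen for illustration, are fitted and recovered:

```python
import numpy as np

tau = 365.2425
t = np.arange(366)                          # one value per mathematical date
w = 2 * np.pi * t / tau
c1, s1, c2, s2 = np.cos(w), np.sin(w), np.cos(2 * w), np.sin(2 * w)

# Synthetic 'dailymeantemp' with known coefficients (illustrative values)
beta_true = np.array([9.5, -3.0, -1.2, 0.4, 0.1])
X = np.column_stack([np.ones(len(t)), c1, s1, c2, s2])
y = X @ beta_true

# Ordinary least squares, the analogue of lm(dailymeantemp ~ c1+s1+c2+s2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the response lies exactly in the span of the five basis vectors, the fit recovers the generating coefficients.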
Figure 6.2 is the same as Fig. 6.1 except that the period has been split into four
non-overlapping blocks of 62 years in order that long-term trends and variations
in the annual cycle might be revealed. To simplify cross-block comparisons, the
plotting scales are fixed for each block, and the second-order harmonic is also fixed
to serve as a historical reference.
It is evident that there has been no major shift in the seasonal cycle over this
period. However, winter temperatures, particularly in January, have risen by several
degrees throughout this period, and that increase began even in the nineteenth
century. The low summer and autumn temperatures in the late nineteenth century
are well known and are often attributed to volcanic effects such as the Krakatoa eruption in 1883. However, the lowest annual mean in this series occurs in 1879,
four years before the eruption, and the expected volcanic effects are not readily
apparent in the annual averages for the decade that follows: see Fig. 6.4. Other than the winter increase, the annual pattern in the early 20th century is remarkably close to that in the early 19th century. The phenomenon that stands out in Fig. 6.2 is the uniformly high temperature throughout the year in the most recent period. Only on 35 dates do the daily averages for 1958–2019 fall below the historical reference curve.

Fig. 6.2 Mean temperature and volatility by day of the year in consecutive 62-year blocks (panels: 1772–1833, 1834–1895, 1896–1957, 1958–2019; the fitted second-order harmonic for the full 248 years is shown in each panel as a historical reference)
The first k-statistic is the sample mean. Subsequent k-statistics of order 2–4 are

(n − 1) k2,n(x) = Σ (xi − x̄n)²,
(n − 1)↓2 k3,n(x) = n Σ (xi − x̄n)³,
(n − 1)↓3 k4,n(x) = n(n + 1) Σ (xi − x̄n)⁴ − 3(n − 1)³ k2,n(x)²,

where (n − 1)↓r = (n − 1)(n − 2) · · · (n − r) is the descending factorial. The standardized skewness and kurtosis, k3,n/k2,n^{3/2} and k4,n/k2,n², are invariant under affine transformations x → a + bx with b > 0. Thus, the fact that the temperature is recorded in tenths of a degree °C rather than °F has no effect on the standardized values. These statistics are frequently used
to gauge departures from normality. Here we are looking at cumulant variations as
a periodic annual time series.
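The formulas translate directly into code. A Python sketch of the k-statistics (the text's computations are in R; the data vector is illustrative), together with a check that the standardized values are unchanged by an affine change of scale:

```python
import numpy as np

def k_stats(x):
    """k-statistics k2, k3, k4: unbiased estimators of cumulants of order 2-4."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    s2, s3, s4 = (d ** 2).sum(), (d ** 3).sum(), (d ** 4).sum()
    k2 = s2 / (n - 1)
    k3 = n * s3 / ((n - 1) * (n - 2))
    k4 = (n * (n + 1) * s4 - 3 * (n - 1) ** 3 * k2 ** 2) / ((n - 1) * (n - 2) * (n - 3))
    return k2, k3, k4

x = np.array([1.2, 3.4, 2.2, 5.1, 4.0, 2.8, 3.3, 0.7, 4.4, 3.9])
k2, k3, k4 = k_stats(x)

# Affine change of scale, e.g. degrees C -> degrees F
g2, g3, g4 = k_stats(32.0 + 1.8 * x)
```

The ratios k3/k2^{3/2} and k4/k2² agree before and after the transformation, which is the invariance invoked in the text.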
The standardized values are plotted by calendar date in Fig. 6.3, so each skewness
and kurtosis coefficient is computed using 248 replicate temperature values for
every non-leap date, or 60 for the leap date. The average skewness is close to
zero, but there is a distinct sinusoidal cycle with a summer maximum, which is in
phase with the mean temperature cycle. Winter temperatures are skewed negatively,
summer values positively. The kurtosis values are more widely scattered with no
clear pattern, but summer values are slightly larger on average than those in other
months. Two thirds of the k4 -values are negative, indicating that tails are shorter
than Gaussian. The sinusoidal trend in the skewness plot is clear evidence of non-
normality, but that is not an adequate reason to abandon methods of analysis based
on linear decompositions.
It is worthwhile recalling the inheritance property of sample statistics kr,n , and
more general U -statistics, computed for sub-samples of various sizes. Let [N] be
the population and S ⊂ [N] a sample of size n ≤ N; let Y [S] be the sample
temperatures and kr,n (Y [S]) the sample statistic. Given the population statistic
kr,N ≡ kr,N(Y[N]), the average over samples of size n satisfies ave_S kr,n(Y[S]) = kr,N.
Thus, given that the variance for April 7 is low relative to April 1 or April 12 in the
population of 248 years, we should expect the same to hold on average for simple
random samples or simple random partitions. Although a sequential block of 62
years is not a simple random sample, it may behave as such in the absence of serial
correlation, in which case the depression seen for April 5–9 variances in successive
62-year blocks in Fig. 6.2 is expected and not a surprise.
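For k2 the inheritance property can be verified exactly by enumeration in a toy population: the average of the sample variance over all subsets of size n equals the population value. A Python check with illustrative numbers:

```python
import numpy as np
from itertools import combinations

pop = np.array([3.1, 2.4, 5.0, 4.2, 1.9, 3.8, 2.7, 4.6])   # population, N = 8
N, n = len(pop), 4

k2_pop = np.var(pop, ddof=1)                 # population k2 (divisor N - 1)
k2_samples = [np.var(np.array(s), ddof=1) for s in combinations(pop, n)]
avg_k2 = np.mean(k2_samples)                 # average over all C(8, 4) = 70 subsets
```

The equality is exact, not merely approximate, which is what makes the April 5–9 depression in successive blocks unsurprising.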
6.3 Annual Statistics

The top panel of Fig. 6.4 shows the annual average temperature for each year over the 248-year period. Post-1790 record lows and record highs are indicated: year t is a record high if Yt = max{Y1, . . . , Yt}, and a record low if Yt = min{Y1, . . . , Yt}.
Fig. 6.4 Mean temperature and volatility over 248 consecutive years, with record highs and lows indicated
The record lows occur in 1814 and 1879, the record highs in 1834, 1921, 1949,
1990, 1999, 2006 and 2014. By visual inspection, the mean trend is constant up
to about 1900, increasing slowly to about 1950, and more rapidly thereafter. The
maximum-likelihood cubic-spline fit has been superimposed as a summary of the
mean trend. Computational details are given in the following section.
The second panel of Fig. 6.4 is similar to the first, except that it shows the within-
year standard deviation measured as the deviation from the second-order harmonic
fit. The harmonic term is removed so that the effect of seasonal variation is kept to a
minimum. Post 1790 record lows and highs are highlighted; the lows occur in 1790,
1832 and 1951, the highs in 1795 and 1947. The trend in volatility is downwards as
indicated by the cubic spline fit, but it is not significantly non-linear over this period.
Changes in meteorological technology over the centuries must have an effect on
variability of measurements, but this effect seems unlikely to be large for temper-
ature measurements. Temperatures are well calibrated relative to the freezing and
boiling points of water, so the effects of technological innovation on measurements
of annual average temperatures are likely to be small, if not entirely negligible.
A block of length l is an interval of the form (t +1, t +l), beginning on day t +1 and
ending on day t + l. The focus in this section is on the behaviour of block averages
for contiguous blocks of fixed length. To eliminate seasonal variation, we restrict
attention to blocks whose length is an integer number of years. In the following
table, the sample averages for 5000 blocks of length b years or 365.25b days
were obtained, and the sample variance of these block averages was computed.
Blocks were sampled uniformly at random, not necessarily starting on Jan 1, so
the average fractional overlap of two blocks is b/248. Standard theory for simple
random samples tells us that, in the absence of correlation, the sample variance
of the averages is proportional to (1 − f )/b, where f = b/248 is the sampling
fraction and 1 − f is the finite-population correction factor. Accordingly, the second
line reports the corrected variance of the block averages, with Cb = 1/(1 − f ).
The standard theory for uncorrelated values also applies asymptotically to block
averages from a stationary processes provided that the correlations decay at a
sufficiently fast rate. For a short-range dependent process the product bCb var(Ȳb )
shown in the middle line should be approximately constant in b, at least for
large b. However, this product is clearly increasing as a function of the block
size. The third line suggests that b1/2Cb var(Ȳb ) is approximately constant, and
hence that the variance of block averages behaves inversely as the square root of
the block size rather than O(b−1 ). This phenomenon is a characteristic of long-
range dependence. For a stationary process, the behaviour observed here for block
averages is consistent with the assertion that the covariance function does not have
a finite integral. It is incompatible with short-range dependence such as e^{−|s|} or P(s)e^{−|s|} for any polynomial P, or any of the finite-range Matérn models.
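The benchmark behaviour for an uncorrelated series is easily reproduced. The sketch below (synthetic white noise, not the temperature data; the series length, block lengths and block count are illustrative) draws 5000 random contiguous blocks and shows that the variance of block averages scales as 1/b, the short-range pattern against which the table is judged:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(size=200_000)      # white noise: short-range by construction
n = len(y)

def block_average_variance(b):
    """Sample variance of 5000 averages of random contiguous blocks of length b."""
    starts = rng.integers(0, n - b + 1, size=5000)
    means = np.array([y[s:s + b].mean() for s in starts])
    return means.var(ddof=1)

v1, v16 = block_average_variance(1), block_average_variance(16)
ratio = v1 / v16                  # close to 16 for an uncorrelated series
```

For the temperature series, by contrast, the corresponding ratio grows far more slowly with b, which is the long-range-dependence signature described in the text.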
It is important to be aware that the variances shown above are finite-population
sample variances. Since the totals for a block B of size b and its complement B̄ of
size n − b satisfy
the sample variance of fixed-length block totals is exactly equal to the sample
variance of the totals on the block complements. In general, the complement of
a contiguous block is not a contiguous block. However, if the sampling were done
cyclically, i.e., modulo n = 248 years, the complement of a contiguous block is also
a contiguous block.
The variogram of a stationary process at lag h is the expected value of the squared difference |Yt − Yt+h|², which is non-negative and symmetric in h. If the process has a covariance function K(|t − t′|), the variogram is

γh = E|Yt − Yt+h|² = 2K(0) − 2K(h) = 2σ²(1 − ρ(h)),

where ρ(h) is the autocorrelation at lag h, and σ² is the variance. The semi-variogram is one half the variogram.
For a sequence of length n, the empirical variogram is the average squared difference of sample values

γ̃h = (1/(n − h)) Σ_{t=1}^{n−h} (Yt − Yt+h)².
If the process has a non-constant mean, but is otherwise stationary, the residuals are used instead. The empirical variogram provides a decomposition of the total sum of squares by lags:

(1/n) Σ_{h=1}^{n−1} (n − h) γ̃h = (1/n) Σ_{s>t} (Yt − Ys)² = Σ_i (Yi − Ȳ)².
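The lag decomposition can be confirmed numerically; a Python sketch on a synthetic series (the chapter's own computations are in R):

```python
import numpy as np

def empirical_variogram(y):
    """gamma_h for h = 1, ..., n-1: mean squared difference at lag h."""
    n = len(y)
    return np.array([np.mean((y[h:] - y[:n - h]) ** 2) for h in range(1, n)])

rng = np.random.default_rng(3)
y = rng.normal(size=300)
n = len(y)
gam = empirical_variogram(y)

# Decomposition of the total sum of squares by lags
lhs = sum((n - h) * gam[h - 1] for h in range(1, n)) / n
rhs = ((y - y.mean()) ** 2).sum()
```

The identity is algebraic, so it holds exactly for any sequence, Gaussian or not.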
Fig. 6.5 Empirical log variogram of temperatures split into short, medium and long lags (panels: lags up to 30 days; lags 30–365 days; lags at an integer number of years; lags one year and above at an integer number of months). A least-squares fitted curve for short and medium lags taken together is shown in the top two panels. For the longer lags, the least-squares straight line with slope 0.20–0.25 per millennium is shown
about 12–18 months. The standardized variogram and the autocorrelations implied
by the fitted curve for lags up to one week are as follows:
h 1 2 3 4 5 6 7
γ̃ (h)/(2s 2 ) 0.220 0.425 0.559 0.648 0.711 0.755 0.788
γ̂ (h)/(2σ̂ 2 ) 0.224 0.441 0.572 0.658 0.718 0.761 0.794
ρ̂h 0.776 0.559 0.428 0.342 0.282 0.239 0.206
For lags 2 ≤ h ≤ 4, the SD1/2 autocorrelations satisfy ρ̂h < ρ̂1^h, the inequality being reversed for h > 4.
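The comparison in the last sentence can be reproduced from the tabulated autocorrelations:

```python
# Autocorrelations implied by the fitted SD1/2 variogram, copied from the table above
rho = {1: 0.776, 2: 0.559, 3: 0.428, 4: 0.342, 5: 0.282, 6: 0.239, 7: 0.206}

# Compare rho_h with rho_1 ** h, the value an AR(1)-type decay would give
below = [h for h in range(2, 8) if rho[h] < rho[1] ** h]
above = [h for h in range(2, 8) if rho[h] > rho[1] ** h]
```

The comparison with the geometric sequence ρ̂1^h is what distinguishes the fitted decay from a simple first-order autoregression.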
The third and fourth panels show the variogram behaviour for very long lags
in the range 1–120 years. The third panel is restricted to lags that are an integer
multiple of one year, so that the sequence of values is not affected by the elimination
of seasonal cycles. This graph indicates that the log variogram increases at the rate
0.25 units per millennium
over the range 1 ≤ h ≤ 120 years. The fourth panel shows lags that are integer
multiples of one month over the same range. The least-squares fitted line in this case
is a little flatter with slope 0.20 units per millennium. Neither scatterplot suggests a
substantial deviation from linearity over the range 1 ≤ h ≤ 120 years. The absence
of an upper bound for the variogram is one of the hallmarks of non-stationarity.
It is striking that the SD1/2 variogram curve γ(h) = σ₀² − σ₁²K*(h/λ), which fits the empirical variogram reasonably well for lags up to and well beyond one year, has a finite limit γ(∞) = σ₀², and thus fails completely to capture the non-
constant behaviour of the variogram at very long lags. Although the long-range
trend is difficult to deny, the implied annual increase is almost imperceptible and is
comparable to the width of one plotting symbol in the second panel.
The observational units in a time series are the time points at which measurements
are made. Usually, there are no replicate measurements at the same time. In this
instance, each day is one observational unit, the observational units are completely
ordered and are associated with the integers, i.e., equally spaced points on the
real line. In the absence of further structure, we have at our disposal only one
fundamental covariate, which is time measured in days beginning at an arbitrary
point, which is taken to be zero for Jan 1, 1772. There is, however, one crucial piece
of additional information, which is the length of the Gregorian year, τ = 365.2425
days. From this we arrive at the first-order harmonics c(t) = cos(2πt/τ) and s(t) = sin(2πt/τ), whose period is one calendar year. The kth-order harmonics c(kt), s(kt) have a period of 1/k years.
One crucial property of harmonics is that the subspace spanned by each pair c(t), s(t) is closed with respect to temporal translation:

c(t + h) = c(h)c(t) − s(h)s(t),  s(t + h) = s(h)c(t) + c(h)s(t)

for each displacement h, and the same holds for the pair c(kt), s(kt). The space Hk
of harmonics of degree ≤ k is a vector space of dimension 2k + 1, in which H0 = 1
is the space of constant functions. These are the only plausible functions that are
available for use as covariates in the model for the mean temperature.
The Fourier basis vectors c(kt), s(kt) are exactly orthogonal in the continuous
setting as functions on (0, 2π), and they are exactly orthogonal in certain uniformly-
spaced discrete-time settings. In the present discrete setting, they are not quite
orthogonal because of the leap-year complication, but this effect is very small and
can be neglected.
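The near-orthogonality claim is easy to check numerically. The following Python sketch (the chapter's own computations use R) builds the first-order harmonics on a daily grid; the 248-year span and τ = 365.2425 are taken from the text, and the variable names are ours.

```python
import numpy as np

tau = 365.2425                 # length of the Gregorian year in days
t = np.arange(90580)           # day index: 0 corresponds to Jan 1, 1772

# First-order harmonics c(t), s(t) with period one calendar year
c1 = np.cos(2 * np.pi * t / tau)
s1 = np.sin(2 * np.pi * t / tau)

# With exactly orthogonal columns these normalized inner products would be
# 0 and 1/2; the leap-year complication leaves only a tiny residue.
n = t.size
print(abs(np.dot(c1, s1)) / n)          # close to 0
print(np.dot(c1, c1) / n)               # close to 1/2
```

The residue is of order 10⁻⁴, which is the "very small" leap-year effect referred to above.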
Three metrics arise naturally on the annual circle:

χ(t, t′) = (τ/π) sin(π|t − t′|/τ),
ℓ(t, t′) = min{t − t′, t′ − t},
d(t, t′) = (t − t′)(τ − (t − t′))/τ.

The first two are respectively the chordal distance and the arc length on the annual
circle whose perimeter is τ. In all three expressions, t, t′ are understood as points
in the space R (mod τ), with addition modulo τ, so that 0 ≤ t − t′ < τ and
t′ − t = τ − (t − t′) are complementary arc lengths. With this understanding, it can
be verified that d is a metric. For each metric, the maximum values τ/π, τ/2 and
τ/4 occur at diametrically opposite points t − t′ = τ/2.
For most statistical work, d and χ are essentially equivalent: see Exercises 6.11–
6.14.
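A quick numerical check of the three metrics and their maxima (a Python sketch; the function names chi, arc and quad are ours, with τ = 365.2425 as in the text):

```python
import numpy as np

tau = 365.2425

def chi(t, u):   # chordal distance on the annual circle
    return (tau / np.pi) * np.sin(np.pi * np.abs(t - u) / tau)

def arc(t, u):   # arc length: the shorter way around the circle
    d = np.abs(t - u) % tau
    return min(d, tau - d)

def quad(t, u):  # quadratic metric
    d = np.abs(t - u) % tau
    return d * (tau - d) / tau

# Maxima occur at diametrically opposite points t - t' = tau/2
print(chi(0.0, tau / 2), tau / np.pi)
print(arc(0.0, tau / 2), tau / 2)
print(quad(0.0, tau / 2), tau / 4)
```

Each printed pair agrees, confirming the maximum values τ/π, τ/2 and τ/4 quoted above.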
6.5 Estimation of Secular Trend 93
For each λ > 0, the function K(t, t′) = exp(−χ(t, t′)/λ) is positive definite on
[0, τ), and also stationary and positive semi-definite on the real line with period τ.
The Gaussian random function η ∼ GP(0, K) is periodic and continuous, and is
reasonably well suited as a statistical description of the temperature deviations from
the seasonal harmonic in Fig. 6.1. For fixed λ, the Gaussian model in which μ =
E(Y) belongs to the space of harmonics of degree two, and cov(Y) = σ₀²Iₙ + σ₁²K,
is linear in the parameters. The model can be fitted using the R command
fit <- regress(dailymeantemp~c1+s1+c2+s2, ~K)
In this discrete computational setting, τ = 366, or 365 if the leap day is dropped, and
K is a symmetric matrix of the same order. The identity matrix, or nugget effect,
is included by default, so there are two variance components and five regression
coefficients to be estimated. As it happens, the nugget variance estimate is zero,
or even slightly negative if not constrained. The maximized log likelihood plotted
against λ has a maximum at λ̂ ≈ 4.3 days, and the fitted variance coefficients are
σ̂₀² = 0, σ̂₁² = 3.62. The log likelihood is distinctly non-quadratic in λ, but it is
approximately quadratic in λ⁻¹ with a finite long-range limit as λ → ∞.
The positivity constraint is enforced either through the optional argument
pos=c(1,1) or, in this instance, by omitting the nugget with identity=FALSE. The
residual log likelihood for the covariance model σ₀²Iₙ + σ₁²K exceeds that for the
independent and identically distributed sub-model with σ₁² = 0 by 167 units, leaving
no doubt about the strength of the residual serial correlation.
If the arc length is substituted for chordal distance, the resulting process is
essentially an autoregressive process of order one, but with a periodic constraint.
The dependence is local and confined to a few days, so there is little difference
between the chordal and arc-length models.
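Positive definiteness of the periodic exponential kernel can be checked directly by eigendecomposition. A Python sketch on a 366-point daily grid with λ = 4.3, the fitted value quoted above (grid size and variable names are ours):

```python
import numpy as np

tau, lam = 366.0, 4.3
t = np.arange(366)
d = np.abs(t[:, None] - t[None, :]).astype(float)
chi = (tau / np.pi) * np.sin(np.pi * d / tau)    # chordal distance matrix
K = np.exp(-chi / lam)                           # periodic exponential kernel

# All eigenvalues should be non-negative up to rounding error
eigmin = np.linalg.eigvalsh(K).min()
print(eigmin > -1e-8)                            # True
```

The same check with the arc-length metric in place of the chordal distance gives an essentially indistinguishable kernel for ranges of a few days, as the text remarks.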
The quadratic metric is closely related to the Brownian-bridge process: see
Exercises 6.15–6.16.
Suppose that Y = (Y₀, Y₁) is a pair of random vectors that are jointly Gaussian with
moments

E(Y) = (μ₀, μ₁),    cov(Y) = ( Σ₀₀  Σ₀₁ ; Σ₁₀  Σ₁₁ ),

so that, by the standard Gaussian conditioning formula, the conditional distribution
of Y₁ given Y₀ is Gaussian with mean μ₁ + Σ₁₀Σ₀₀⁻¹(Y₀ − μ₀) and covariance
Σ₁₁ − Σ₁₀Σ₀₀⁻¹Σ₀₁.
The series of annual averages is modelled as the sum Y (t) = η(t) + ε(t) of two
independent Gaussian processes in which the components of ε are independent
and identically distributed with mean zero. The secular trend is a smooth random
function whose covariance function exhibits long-range dependence. Intuitively, a
long-range secular trend conjures up an image of a smooth function in time, so the
choice for K ∝ cov(η) must force a specific degree of smoothness on the function η.
Typically, the mean of η is constant or linear, μ(t) = β₀ + β₁t, but more general
expressions are also possible if the circumstances require it. The statistical goal is to
estimate the secular trend by computing the conditional expectation E(η(·) | data),
which is called the Bayes estimator. When maximum-likelihood estimates of the
fitted parameters are inserted, this is known as the empirical Bayes estimator. In
other types of application, the conditional expected value is sometimes called the
best linear predictor or the Kriging estimate.
It is worth emphasizing that continuity requires η(·) to be defined at all
points in R, even though the process Y is observed or recorded only at a finite
set of points. The conditional expected value E(η(s) | data) is linear in the
observations; its behaviour as a function of s is a linear combination of covariances
K(s, tᵢ) = cov(η(s), Y(tᵢ)) for observation times t₁, …, tₙ. In practice, the degree
of smoothness of K on the diagonal is crucial.
For the present illustration, we choose the Matérn covariance function with index ν,

K(t, t′) = x^ν Kν(x),

where Kν is the Bessel function of order ν > 0, and x = |t − t′|/λ is the standardized
temporal difference. The exponential covariance function corresponds to ν = 1/2,
and in this case λ log 2 is the range at which the serial correlation in η is reduced
by half. The index range ν > 0 guarantees continuity of η(·) as a random function,
ν > 1 guarantees continuity of first derivatives, and ν > r guarantees continuity of
rth derivatives. For computational illustration, we set ν = 3/2 and λ = 1000 (in
units of years). The large value of λ means not only that serial correlation persists
well beyond the observation period but also that the behaviour of η is governed by
the behaviour of K near the diagonal.
nu <- 3/2; lambda <- 1000; x <- abs(outer(yr, yr, "-"))/lambda;
K <- x^nu * besselK(x, nu); diag(K) <- 2^(nu-1)*gamma(nu)
fit <- regress(annualmean~yr, ~K)
blp <- fit$fitted + fit$sigma[2]*K %*% fit$W %*% (annualmean-fit$fitted)
lines(yr, blp)
In the formula for the conditional expectation, fit$fitted is the fitted mean
vector β̂0 + β̂1 t with 248 components, fit$W is the fitted inverse covariance matrix
for the observations, fit$sigma is the vector of fitted variance components,
and fit$sigma[2]*K is the matrix of covariances cov(η(t), Y(t′)) for t, t′
among the observation points. If we wish to make predictions beyond the range
of observation times, say to 2020 or 2021, it is necessary to extend the vector of
fitted means and the matrix of covariances in the obvious way.
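The extension step can be sketched as follows in Python. The chapter's fit comes from R's regress, so the data series, the least-squares mean and the variance components below are illustrative stand-ins, not the fitted quantities; the ν = 3/2 Matérn kernel x^ν Kν(x) is written in its closed form √(π/2) e⁻ˣ(1 + x).

```python
import numpy as np

def matern32(x):
    # x^nu * K_nu(x) for nu = 3/2 in closed form; equals sqrt(pi/2) at x = 0
    return np.sqrt(np.pi / 2) * np.exp(-x) * (1 + x)

rng = np.random.default_rng(0)
yr = np.arange(1772.0, 2020.0)                 # observation years, 248 values
y = 9.0 + 0.002 * (yr - 1772) + rng.normal(0, 0.6, yr.size)   # stand-in series

lam, s0sq, s1sq = 1000.0, 0.316, 151.6         # range and variance components
K = matern32(np.abs(yr[:, None] - yr[None, :]) / lam)
Sigma = s0sq * np.eye(yr.size) + s1sq * K      # cov(Y)

X = np.column_stack([np.ones(yr.size), yr])    # constant + linear mean
beta = np.linalg.lstsq(X, y, rcond=None)[0]    # stand-in for the fitted mean
resid = y - X @ beta

# Extend the fitted means and covariances to new times, e.g. 2020 and 2021
new = np.array([2020.0, 2021.0])
mu_new = np.column_stack([np.ones(new.size), new]) @ beta
K_new = matern32(np.abs(new[:, None] - yr[None, :]) / lam)
eta_hat = mu_new + s1sq * K_new @ np.linalg.solve(Sigma, resid)
print(eta_hat.shape)    # (2,)
```

The last two lines are the "obvious way": the mean vector and the cross-covariance matrix are simply evaluated at the new time points.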
The fitted variance components are σ̂² = (0.316, 151.6), and the log likelihood
ratio statistic for testing the linear sub-model σ_η² = 0 against this alternative is
2(16.14 − 8.46) = 15.36 on one degree of freedom. Note that the null hypothesis
sub-model is not stationary, but the mean trend is constrained to be linear in time.
Using the standard asymptotic approximation for the distribution of the likelihood-
ratio statistic, the tail probability is less than 10⁻⁴, so the evidence for non-linearity
and/or long-range correlation is fairly strong.
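The quoted tail bound is easy to verify: for one degree of freedom the chi-squared survival probability is P(χ₁² > x) = erfc(√(x/2)), so (Python sketch):

```python
import math

x = 15.36                              # likelihood-ratio statistic
tail = math.erfc(math.sqrt(x / 2))     # P(chi-squared_1 > x)
print(tail)                            # about 8.8e-5, i.e. less than 1e-4
```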
If we wish to test the hypothesis of no trend versus a continuous non-linear trend,
we could proceed computationally as follows:
nu <- 1/2; lambda <- 1000; x <- abs(outer(yr, yr, "-"))/lambda;
K <- x^nu * besselK(x, nu); diag(K) <- 2^(nu-1)*gamma(nu)
fit0 <- regress(annualmean~1)
fit1 <- regress(annualmean~1, ~K)
2*(fit1$llik - fit0$llik) # LLR=71.68
The likelihood-ratio statistic is very large, but serial correlation between
consecutive annual averages seems inevitable, and no trend is not the same as no
correlation.
The difficulty here is that it is unclear statistically what is implied by the null-
hypothesis phrase ‘no long-term trend’. After all, the Gaussian process with constant
mean and covariance σ₀²Iₙ + σ₁²K is temporally stationary. The last snippet of code
generates a test statistic that is sensitive to long-range correlation, which is arguably
indistinguishable from long-term trend.
To appreciate the effect of the Bessel index, it is worthwhile computing and plotting
the Bayes estimate for the fitted model shown above with ν = 1/2. For ν = 3/2
and large λ, the conditional mean is piecewise cubic—a cubic spline which has at
least two continuous derivatives at all points. For ν = 1/2 the conditional mean is
piecewise linear—a linear spline with a knot at each observation. The linear spline
is constrained only by continuity; the cubic spline is constrained by continuity of
two derivatives, so it is less flexible and much smoother in appearance. Intermediate
values such as ν = 1 have Bayes estimates that are intermediate in appearance.
In the majority of applications of this sort, the likelihood function is close to
constant in ν, but also slowly decreasing. In other words, the data are relatively
uninformative about smoothness of η, but there is a slight preference for rougher
trajectories. For visual extrapolation, however, a smooth curve provides a more
compelling image and tells a more convincing story than a rough curve. Continuity
of second derivatives seems about right for visual displays, and ν = 3/2 is a
reasonable compromise for a graphical summary.
In principle, the range parameter λ can also be estimated by maximum likelihood,
but for most practical work the value is effectively infinite, in which case the limit
process may be used directly. In both examples, we have used λ = 1000 for illustration,
but the maximum is achieved in the long-range limit. Details of the limit process,
also called the cubic spline model, are given in section 16.9. The first snippet of code
shown above is satisfactory for ν ≤ 3/2, but it is not recommended for ν > 3/2 if
λ is large. The second snippet with constant mean is satisfactory for ν ≤ 1/2, but is
not recommended for ν > 1/2 if λ is large.
The Matérn covariance function is convenient in many ways, but it is not essential
to the discussion regarding smoothness. An alternative strategy for accommodating
intermediate-range dependence is to use an inverse-polynomial covariance function
of the form 1/(1 + x²), where x = |t − t′|/λ. Correlations decay slowly with
distance. Each realization of this process is an infinitely differentiable function, so
the conditional expected value E(η | data) is also a C ∞ -function. Unlike the Matérn
process, the long-range limit of the inverse-quadratic process is not well-behaved.
Consequently, it is necessary to fix a finite range or to estimate the range, and λ̂ ≈
99 years is the value suggested by the sequence of annual averages. With this choice,
the conditional expected-value curve is not appreciably different in appearance from
the cubic spline shown in Fig. 6.4.
The ‘Gaussian’ covariance function exp(−|t − t′|²/λ) also gives rise to C∞
trajectories. Usually this choice is not recommended for applied work because the
ultra-smooth trajectories give rise to ultra-smooth, non-local, predictions whose
apparent accuracy may be misleading. In addition, the long-range limit is not well-
behaved as a process, so a finite range is needed for computation.
In general, the behaviour of the covariance function at the origin governs the
smoothness of trajectories of the process η ∼ GP(0, K). If K is continuous at the
origin, η is everywhere continuous with probability one, i.e., η ∈ C⁰. If K has 2r
continuous derivatives at the origin, η is r times differentiable everywhere, i.e., η ∈
C^r. For r = ∞, the situation is a little more delicate.
A covariance function that is infinitely differentiable at the origin has a series
expansion whose Taylor coefficients are finite. If the radius of convergence of this
series is strictly positive, we say that K is analytic at the origin. Otherwise, if the
radius of convergence is zero, K is infinitely differentiable but not analytic. Both the
inverse quadratic and the Gaussian covariance function are analytic. The trajectories
are also analytic, i.e., ultra-smooth, so predictions are non-local.
The SDα process has a covariance function Kα proportional to the α-stable
density on the real line. The characteristic function, or spectral density, is exp(−|ω|^α),
so α = 1 corresponds to the Cauchy covariance 1/(1 + x²), and α = 2 corresponds
to the Gaussian covariance. Both are analytic. The variogram shown in the top two
panels of Fig. 6.5 corresponds to α = 1/2. The SD-covariance function K1/2 is
infinitely differentiable at the origin, but it is not analytic. The trajectories are also
infinitely differentiable but nowhere analytic. For an illustration, see the fourth panel
of Fig. 7.5.
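The α = 1 case can be confirmed numerically: inverting the spectral density exp(−|ω|) by a cosine integral recovers the Cauchy density 1/(π(1 + x²)). A Python sketch, using simple trapezoidal quadrature on a truncated grid (grid limits are ours):

```python
import numpy as np

omega = np.linspace(0.0, 60.0, 600001)   # truncation error ~ exp(-60)
f = np.exp(-omega)
w = np.diff(omega)

def invert(x):
    # (1/pi) * integral of exp(-omega) * cos(omega x) over omega > 0
    g = f * np.cos(omega * x)
    return np.sum((g[1:] + g[:-1]) / 2 * w) / np.pi

for x in (0.0, 0.5, 1.0, 2.0):
    print(invert(x), 1 / (np.pi * (1 + x**2)))   # the two values agree
```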
The point of this distinction is that a C ∞ trajectory is much more flexible than an
analytic trajectory. The distinction is important both for computation and prediction.
Analytic covariance functions give rise to near-singular covariance matrices. In
practice, it is best to avoid ultra-smooth processes entirely.
6.6 Exercises
6.1 Let 0 < n ≤ N be positive integers, let ϕ: [n] → [N] be a 1–1 mapping from
the set [n] into [N], and let N↓n be the set of such functions. Show that #N↓n =
N(N − 1) · · · (N − n + 1), so that #N↓N = N!.
6.2 Show that the k-statistics satisfy the averaging identity

(1/#N↓n) Σ_{ϕ∈N↓n} k_{r,n}(xϕ) = k_{r,N}(x)

for each x ∈ R^N. Interpret this identity as a reverse martingale with n playing the
role of time.
6.3 Given the variance components, the Bayes estimate of the secular trend is a
linear combination of the fitted mean vector and the fitted residual,

μ̃ = PY + σ₁²KΣ⁻¹QY,

where Q = I − P and Σ = σ₀²Iₙ + σ₁²K is the covariance matrix.
6.4 The U.K. Met Office maintains a longer record of monthly average and annual
average temperatures for Central England from 1659 onwards in the file
https://www.metoffice.gov.uk/hadobs/hadcet/cetml1659on.dat
Check the format, download the data, and plot the annual average temperature as a
time series. For the annual mean data up to Dec. 31 of the past year, fit the Matérn
model with ν = 3/2 and range λ = 1000 as described in section 6.5.3, and plot
the Bayes estimate of the secular trend. Repeat the calculation for ν = 1 and range
λ = 1000, and superimpose the two Bayes estimates. Comment briefly on the shape
of the fitted curves prior to 1772.
6.5 For the cubic and quadratic models described in the preceding exercise,
compute the predicted temperature for next year, i.e., the conditional distribution
of the mean temperature for next year given the series of annual averages up
to December 31 of the past year. The two models should give slightly different
predictive distributions.
6.7 Compute the annual average temperatures for the years 1772–2019, and
duplicate the first plot in Fig. 2.4. Include the Bayes estimate of the long-term
secular trend up to 2025 using the Matérn covariance function with ν = 3/2 and
range λ = 1000 years. Compute the pointwise standard deviation of the Bayes
estimate, and include the 95% prediction interval on your plot.
Check the format, download the data, and plot the monthly average rainfall as a
seasonal series. Take note of the units of measurement, and include this information
on the graph.
6.9 This exercise is concerned with two versions of the Bayes estimate of the
seasonal rainfall component, where it is required to compute E(ηm | data) for each
of 12 months. As usual, χ is the chordal distance as measured on the clock whose
perimeter is 12 units, and month is a factor having 12 levels. In computer notation,
the code for fitting the two models is as follows, where K = const − χ is positive
semi-definite of order n × n and rank 12:
fit0 <- regress(rain~1, ~month)
fit1 <- regress(rain~1, ~K)
6.10 For the Oxford rainfall data up to Dec 2019, the first Bayes estimate in the
preceding exercise is a flat 10% shrinkage of monthly averages towards the annual
average; the second Bayes estimate is different. For example, the average rainfall for
September is 55.6 mm, which is slightly above the overall average of 54.7, so the
first Bayes estimate is 55.5 mm. The second Bayes estimate is 57.5 mm. Explain this
phenomenon—why the September component, which is already above the annual
average, is shifted even further from the overall average.
6.11 Let (εₖ, ε′ₖ), k ≥ 0, be independent and identically distributed standard Gaussian
variables. For real coefficients σₖ, show that the random function

η(t) = Σ_{k=0}^{∞} σₖ(εₖ cos(kt) + ε′ₖ sin(kt))

is stationary, and find its covariance function.
6.12 Show that, for integer k ≥ 1,

∫₀^{2π} sin(x/2) cos(kx) dx = −4/(4k² − 1).

Hence compute the Fourier cosine coefficients of the function 4/π − 2 sin(x/2)
for 0 ≤ x < 2π, and show that they are all positive.
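A numerical spot-check of the displayed integral (Python sketch, trapezoidal rule):

```python
import numpy as np

x = np.linspace(0.0, 2 * np.pi, 200001)
h = x[1] - x[0]
for k in (1, 2, 5):
    g = np.sin(x / 2) * np.cos(k * x)
    val = (g[1:] + g[:-1]).sum() / 2 * h       # trapezoidal approximation
    print(val, -4 / (4 * k**2 - 1))            # pairs agree to ~1e-6
```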
6.13 In this exercise, χ is the chordal metric on the unit circle. From the results
of the preceding exercise, show that 4/π − χ(t, t ) is positive definite on [0, 2π).
Use the Choleski decomposition to simulate and plot a random function having this
covariance function as follows:
n <- 1000; t <- (1:n)*2*pi/n; chi <- 2*abs(sin(outer(t, t, "-")/2))
eta <- t(chol(4/pi - chi)) %*% rnorm(n)
plot(t, eta, type="l")
6.14 Show that the function

2π²/3 − x(2π − x)

on [0, 2π) has Fourier cosine coefficients 4π/k² for k ≥ 1. Hence or otherwise,
investigate the function

K(t, t′) = 2π²/3 − |t − t′|(2π − |t − t′|)
as a candidate covariance function for a process on [0, 2π), and by extension to
a stationary periodic process on the real line. Plot a simulation of the process on
[0, 4π), and verify continuity.
6.15 Suppose that η ∼ GP(0, K), with K as defined in the preceding exercise.
The tied-down process ζ(t) = η(t) − η(0) is periodic and zero at integer multiples
of 2π. Find its covariance function, and investigate its connection with the classical
Brownian bridge.
6.16 Simulate and plot a random function η(·) on (0, 2π) whose covariance is
π/2 − ℓ(t, t′), where ℓ is the arc-length metric. This function is less well behaved
than the chordal function because half of its Fourier coefficients are zero, so the
simulation code must be modified to accommodate singularities.
6.17 A real-valued process Y (·) is called a Gaussian random affine function if the
differences Y(t) − Y(t′) are Gaussian with covariances satisfying

cov(Y(s) − Y(s′), Y(t) − Y(t′)) = σ²(s − s′)(t − t′)
for some σ ≥ 0. Show that Z(t) = β₀ + β₁t is a random affine function if and only
if β₁ is Gaussian.
6.18 Let K be the covariance function of a stationary process on the real line such
that K is twice differentiable on the diagonal, i.e.,

K(t, t′) = 1 − (t − t′)² + o(|t − t′|²).
6.19 Let K(t, t′) = σ² exp(−|t − t′|/λ) be the scaled exponential covariance, and
let Y be a zero-mean Gaussian process with covariance K. Show that the long-
range limit with σ² ∝ λ is such that the deviation Y(t) − Y(0) is finite and equal in
distribution to Brownian motion.
Chapter 7
Time Series II
A time series is, in the first instance, a function t → Y (t) on time points, either
t ∈ R for a continuous-time process, or t ∈ Z for a discrete-time process. Most
meteorological processes exist in continuous time, but are recorded in discrete time,
either as noontime values, daily totals, daily averages or daily maxima. Similar
remarks apply to plant and animal growth curves, personal health as a time series,
and most business series and economic series. For statistical purposes, either in
modelling or analysis, it is helpful to proceed as if the process exists in continuous
time but is observed discretely at a finite collection of time points. Growth curves
and personal health series are typically recorded at a small collection of irregularly-
spaced time points. The methods of analysis described in this section are most
suitable for a long series that is observed at a large collection of equally-spaced
time points.
Let Y (t) be the value recorded at time t = 1, . . . , n, so that the recording period
is an interval of length n in suitable time units. Frequency is measured in cycles
per recording interval, and the focus is on Fourier frequencies, which correspond
to an integer number of cycles. The discrete Fourier transformation ω → Ŷ(ω) at
frequency ω is a complex number

Ŷ(ω) = Σ_{t=1}^{n} e^{2πiωt/n} Yₜ,
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_7
so that Ŷ(0) = Ŷ(n) = Y. is the total, which is real in most applications. For integer
frequencies 0 ≤ ω ≤ n, the real and imaginary parts are the linear combinations

Ŷ(1) = Σₜ cos(2πt/n)Yₜ + i Σₜ sin(2πt/n)Yₜ,
Ŷ(ω) = Σₜ cos(2πωt/n)Yₜ + i Σₜ sin(2πωt/n)Yₜ,

and the components at complementary frequencies are complex conjugates:

Ŷ(n − ω) = Σ_{t=1}^{n} e^{2πit(n−ω)/n} Yₜ = Σ_{t=1}^{n} e^{−2πitω/n} Yₜ,

which is the complex conjugate of Ŷ(ω).
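The conjugate symmetry can be seen directly with a fast Fourier transform. Note that numpy's fft uses e^{−2πiωt/n} rather than the sign convention above, but the symmetry for a real series holds either way (Python sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=240)          # a real-valued series
Yhat = np.fft.fft(y)
n = y.size

# For real input, component n - omega is the conjugate of component omega
print(np.allclose(Yhat[n - 7], np.conj(Yhat[7])))   # True
```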
If we set aside the zero-frequency component, and split the non-redundant
components into real and imaginary parts, the Fourier transformation Ŷ = FY is a linear
transformation R^n → R^{n−1}. The cosine and sine components of the Fourier matrix
for frequency ω are the real and imaginary parts of roots of unity:
(c) (s)
Fω,t = cos(2πωt/n); Fω,t = sin(2πωt/n).
The identity F 1 = 0 defines the kernel subspace, the rows are mutually orthogonal
n-vectors, F F = (n/2)In−1 is the identity of order n−1, and 2F F /n = In −Jn /n
is the orthogonal projection in Rn with kernel 1. The projection matrix 2F F /n can
be expressed as a sum of (n − 1)/2 rank-2 projection matrices, Pω , one for each
frequency, plus an additional rank-1 matrix for frequency n/2 if n is even. The net
result is that the total sum of squares has an analysis-of-variance decomposition by
frequencies,

Σₜ (Yₜ − Ȳ.)² = Σ_{ω=1}^{n/2} ‖Pω Y‖² = (1/n) Σ_{ω=1}^{n−1} |Ŷω|².

The last expression includes both conjugates Ŷω, Ŷ_{n−ω}, so the sum of squares for
frequency 1 ≤ ω < n/2 is

‖Pω Y‖² = 2|Ŷω|²/n.
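The decomposition is a form of Parseval's identity and is easy to verify numerically (Python sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 365
y = rng.normal(size=n)
Yhat = np.fft.fft(y)

lhs = np.sum((y - y.mean()) ** 2)              # corrected sum of squares
rhs = np.sum(np.abs(Yhat[1:]) ** 2) / n        # sum over non-zero frequencies
print(np.isclose(lhs, rhs))                    # True
```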
7.2 Temperature Spectrum 105
The Central England daily temperature series for 248 years has a length of 90580
days. The analysis and interpretation are easier if the observation period is an integer
number of years, so the partial year at the end is not included in the analysis.
Harmonics associated with the annual cycle are expected to have large amplitudes,
so it is helpful for plotting purposes to separate out the seasonal frequencies (integer
multiples of 248) from the rest.
The first panel of Fig. 7.1 is a scatterplot of log |Ŷω |2 against frequency, which
has been re-coded in units of cycles per year rather than cycles per observation
period of 248 years. Seasonal frequencies have been excluded, partly because the
annual and biannual coefficients are so large. The general trend is quite clear for
the mean, but the high variability and density of points tends to obscure matters.
Ordinarily, we should expect ‖Pω Y‖² to be approximately exponentially distributed,
in which case log ‖Pω Y‖² should have constant standard deviation, π/√6 ≈ 1.282,
and the distribution should be skewed to the left. The plot is reasonably consistent with
those expectations.
In the middle panel, the squared Fourier components Pω Y 2 have been averaged
in consecutive non-overlapping frequency blocks, and the log averages are plotted
against average frequency, again coded in cycles per year. In this manner, the
variability is much reduced, so the trend in mean becomes clearly delineated. Note
that the goal here is to estimate the spectrum, which is E(|Ŷω |2 ) as a function of ω,
so all averaging takes place on that scale, not on the log scale.
Finally, the log spectrum for seasonal frequencies is shown in the third panel
with the non-seasonal cubic spline superimposed for comparison. Apart from the
first and second harmonics, the variation or energy at other seasonal frequencies
decreases with frequency in conformity with the decrease observed in the second
panel for non-seasonal frequencies. Certainly the variation at seasonal frequencies
above three per year is not greater on average than the variation at neighbouring non-
seasonal frequencies. The distinction between seasonal and non-seasonal seems to
matter only for the first two annual harmonics.
These spectrum plots tend to emphasize the variation at higher frequencies, on
the order of 20–150 cycles per year. However, it is the behaviour of the spectrum
at low frequencies and the limiting behaviour as ω → 0 that is crucial for
understanding long-range behaviour of temperatures. To clarify the picture, and to
give greater emphasis to lower frequencies, Fig. 7.2 consists of the same points as
the middle panel in Fig. 7.1, but the values are plotted against the square root of the
Fig. 7.1 Log power spectrum for the Central England temperature series separated by seasonal
and non-seasonal frequencies (panels: log non-seasonal spectrum; log averaged non-seasonal
spectrum; log spectrum at seasonal frequencies)
frequency. To a first order of approximation, the log spectrum is linear in ω^{1/2} over
the bulk of the frequency range.
Fig. 7.2 Log power spectrum plotted against ω^{1/2}. The solid line is the additive GLM spectral
fit E|Ŷ(ω)|² ∝ 1 + 391 exp(−|0.219ω|^{1/2})

According to the theory discussed in the next section, the transformed coefficients
|Ŷ(ω)|² for non-seasonal frequencies are approximately independent exponential
variables, with mean spectrum

E|Ŷ(ω)|² ∝ 1 + 391 exp(−|0.219ω|^{1/2}),
where λ̂ = 0.0347 years, or 12.67 days, σ̂₁/σ̂₀ = 19.8 for the volatility ratio,
and σ̂₀ = 0.62 for the nugget standard deviation in degrees Celsius. Note that the
second component is formally the characteristic function of the α-stable distribution
for α = 1/2, so the associated covariance function is the density of that distribution.
The additive gamma model fits the non-seasonal power spectrum reasonably
well, but it is not perfect. Small systematic deviations are apparent in Fig. 7.2 at
low frequencies, and there is approximately 3% excess dispersion relative to the
exponential distribution. In other words, the variance of the standardized spectral
coefficients |Ŷω |2 /K̂ω is 1.032, while the exponential model predicts unit variance.
This is a very small deviation in absolute terms, but, with 45,108 non-seasonal
Fourier frequencies, a 3% deviation in variance is moderately unlikely.
In the residual plots shown in Fig. 7.3, the 3% deviation is too small to be noticed.
Overall, the residual distribution seems to match the extreme-value distribution very
closely. The mean of the log residuals is −0.584, and the variance is 1.661 versus
the theoretical values −γ = −0.577 (Euler’s constant), and σ 2 = π 2 /6 = 1.645.
On the negative side, the ratios of the squared Fourier coefficients to the fitted
values for the ten lowest frequencies are
19.67, 4.44, 9.19, 4.40, 2.15, 0.73, 4.28, 0.55, 0.36, 2.99,
and the next largest ratio is 11.3, which occurs at one of the highest frequencies.
Despite the apparent success of this parametric model for the bulk of the frequency
range, these low-frequency values are not consistent with the fitted model, which
predicts independent standard exponential values. The expected value of the largest
of n standard exponentials is approximately log(n) ≈ 10.7, and the standard
deviation is approximately π/√6 ≈ 1.3. Given that it was so selected, the second-
largest ratio is entirely consistent with the fitted model, but the first 4–5 Fourier
coefficients are not.
The behaviour of the low-frequency Fourier coefficients is strongly tied to the
behaviour of the covariance function or variogram at the longest lags. Bearing in
mind the variogram phenomenon observed in the third and fourth panels of Fig. 6.5,
which is compatible with a slow random walk or an autoregressive process with
semi-range λ on the order of one millennium, it is natural to look for a corresponding
phenomenon in the Fourier domain. The corresponding phenomenon is an additive
spectral component proportional to 1/(1 + λ²ω²), which is essentially a multiple
of ω⁻². Inclusion of the inverse-square frequency as a further covariate in the
spectral model reduces the deviance by 52+ units, which is a substantial improvement
to the fit, showing conclusively that the slow linear trend seen in the variogram plots
is a real phenomenon and not a statistical artifact.

Fig. 7.3 (a) Log Fourier residuals log(|Ŷω|²/fittedω) plotted against frequency; (b) histogram of
log residuals with the theoretical extreme-value density exp(z − exp(z)) superimposed

7.3 Stationary Temporal Processes
7.3.1 Stationarity
This section is concerned with real-valued processes that are defined pointwise on
the domain, and specifically with stationary Gaussian processes on the real line. The
first part of the statement means that to each point t in the domain there corresponds
a value Y (t), which is a real number. First-order stationarity implies that for each
pair of points t, t + h in the domain the values Y (t) and Y (t + h) have the same
distribution. For a time series, the domain is either the integers or the real line, and
translation implies that the domain is a group acting on itself by addition. More
generally, stationarity implies that for each ordered n-configuration t = (t₁, . . . , tₙ)
and each h-translate t + h = (t₁ + h, . . . , tₙ + h), the values

Y[t] = (Y(t₁), . . . , Y(tₙ))   and   Y[t + h] = (Y(t₁ + h), . . . , Y(tₙ + h))

have the same joint distribution. More generally, the value Y(A) taken on an object A has the same
distribution as the value Y(A + h) taken on the h-translated object. Strict stationarity
is defined in the same way for joint distributions. According to this definition, it is
possible to make sense of the statement that −|t − t′| is the covariance function for a
Gaussian process or time series, sometimes called a generalized process because the
index set does not coincide with the domain. Likewise for the functions −|t − t′|^{1/2}
and −log|t − t′|. Moreover, these processes are strictly stationary. This definition
paves the way to consider other group actions such as rigid motions or Euclidean
congruences or similarity transformations, which are associated with isotropy and
self-similarity.
To understand what the SD1/2-process with spectral density exp(−|ω|^{1/2}) looks
like, i.e., how a typical trajectory behaves as a function, it is helpful to compute,
simulate and plot. The first step is to compute the covariance function by inversion
of the spectral density. In general, this is a non-trivial computational exercise.
Fortunately, this spectral density is a special case of the characteristic function of
the α-stable class. The series expansion for the density (Feller, 1971, vol II, p. 582)
can be simplified for α = 1/2; for general α, see Matsui and Takemura (2006).
We remark only that K is strictly positive, and monotone as a function of temporal
separation. It is infinitely differentiable at all points on the real line, but it is not
complex-analytic in any neighbourhood of the origin: the Taylor expansion does not
have a positive radius of convergence. Accuracy to two or three significant decimal
digits suffices for graphical representation of the covariance function, but at least
eight-digit accuracy is needed to simulate trajectories.
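A crude version of the inversion (Python sketch: a truncated trapezoidal cosine integral, enough for two or three digits but far short of the eight needed for simulation) confirms the positivity and monotonicity asserted above:

```python
import numpy as np

omega = np.linspace(0.0, 400.0, 400001)   # exp(-sqrt(400)) ~ 2e-9 at the cut
f = np.exp(-np.sqrt(omega))
w = np.diff(omega)

def K(x):
    # (1/pi) * integral of exp(-sqrt(omega)) * cos(omega x) over omega > 0
    g = f * np.cos(omega * x)
    return np.sum((g[1:] + g[:-1]) / 2 * w) / np.pi

vals = np.array([K(x) for x in (0.0, 0.5, 1.0, 2.0, 4.0)])
print(np.all(vals > 0), np.all(np.diff(vals) < 0))   # strictly positive, decreasing
```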
Four covariance functions are shown in Fig. 7.4. At first glance, the differences
among them appear to be slight: all four are continuous, symmetric and are equal at
the origin and at ±1. The behaviour in a neighbourhood of the origin is an important
characteristic, which is shown in 5x-magnified form on the right of each panel.
The Matérn functions have zero, one and two derivatives at the origin, whereas the
fourth has infinitely many. The first, 1 − |x| + o(x), is easy to see by inspection,
but the others are not, even in magnified form: for ν = 1, the behaviour near the
origin is 1 + x² log|x|/2 − 0.31x² + o(x²), so the first derivative is zero and the
second does not exist. The behaviour in the tail is the characteristic that distinguishes
intermediate- and long-range dependent processes from short-range. Once again,
this is easier to see in hindsight than in foresight—especially if zero has not been
included for visual reference in the graph.
All four curves in Fig. 7.4 are non-negative and have finite integrals, so each
is proportional to a symmetric probability distribution on the real line. The integrals
are 2.0, 1.89, 1.86 and 4.58 respectively, or more generally, 2λ, πλ, 4λ and πλ/2 for
the scaled versions. The first Matérn covariance is a multiple of the Laplace density;
the SD1/2 covariance is a multiple of the α-stable density for α = 1/2.
Fig. 7.4 Four covariance functions standardized to have unit variance and lag-one autocorrelation e^{−1}: Matérn ν = 0.5 (λ = 1), Matérn ν = 1 (λ = 0.603), Matérn ν = 1.5 (λ = 0.466) and stable α = 0.5 (λ = 2.91); each panel also shows the behaviour near the origin magnified fivefold. The fourth is a multiple of the α-stable density function for α = 1/2.
Given computer code for the covariance function, the covariance matrix for
the process at 1000 points may be computed, followed by simulation of a
1000-component Gaussian variable Y ∼ N(0, Σ). These are the values of the process at
the selected points in the domain. Special cases can be simulated more efficiently,
but this straightforward recipe suffices for present purposes. Each of the curves in
Fig. 7.5 is plotted using the values (x, Y (x)) at 1000 equally-spaced points in the
interval (0, 10).
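This recipe can be sketched in base R for the Matérn ν = 1/2 case, whose covariance e^{−|x−x′|/λ} is available in closed form; the small jitter added before the Cholesky factorization is a numerical safeguard of this sketch, not part of the recipe.

```r
# Sketch of the covariance-matrix recipe: build Sigma at 1000 points,
# then simulate one trajectory Y ~ N(0, Sigma).
x <- seq(0, 10, length.out = 1000)            # equally-spaced points in (0, 10)
lambda <- 1
Sigma <- exp(-abs(outer(x, x, "-")) / lambda) # Matern nu = 1/2: exp(-|x - x'|/lambda)
set.seed(1)
R <- chol(Sigma + 1e-10 * diag(1000))         # jitter for numerical stability
Y <- as.vector(t(R) %*% rnorm(1000))          # Y ~ N(0, Sigma)
# plot(x, Y, type = "l")                      # one simulated trajectory
```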
To establish a ‘normal range’ of patterns, Fig. 7.5 shows the trajectories of three
Matérn processes in the ‘typical’ index range, plus the SD1/2 process. Each family
has a variance parameter and a range parameter, both of which are strictly positive
real numbers. For purposes of comparison, each covariance function is scaled to
have unit variance, and the same lag-one autocorrelation e−1 = 0.368, which
matches the standard order-one autoregressive process shown in the top panel. Each
of the Matérn processes has a distinct character, with continuous derivatives of order
zero, one and two for ν = 0.5, ν = 1.0 and ν = 1.5 respectively. Visually speaking,
the differences among these three are concerned with the degree of smoothness,
which is a local property. The SD1/2 process has its own distinct character. It has
continuous derivatives of all orders so it is smoother than any Matérn process, even
those with ν > 3/2.

Fig. 7.5 A comparison of trajectories of four stationary continuous-time processes, three in the Matérn class, and one specified by its spectral density e^{−|ω|^{1/2}}. The standard Matérn covariance function is K_ν(x) = |x|^ν K_ν(|x|); special cases include K_{1/2}(x) = e^{−|x|} and K_{3/2}(x) = (1 + |x|)e^{−|x|}. The spectral density is 1/(1 + ω²)^{ν+1/2}. For visual comparison over the interval (0, 10), all four processes are standardized to have unit variance and the same lag-one autocorrelation.

However, its medium-range oscillations are distinctly more
pronounced, and somewhat similar to those of the AR1 process (ν = 1/2).
In addition, although it is hard to point to visual consequences in the trajectories,
the long-range autocorrelation of the SD1/2 process is algebraic of order |x − x′|^{−3/2},
whereas the long-range Matérn correlations decay faster than any polynomial. As a
result, the lag 5–10 autocorrelations are sizable (0.073–0.031) for the SD1/2 process,
but negligible for all Matérn processes and decreasing as a function of ν.
Long-range dependence appears to be universal for processes in nature, both for time
series and for spatial processes. For such applications, we should bear in mind
that each finite-range Matérn covariance has exponentially-decreasing tails whereas
the SD1/2 covariance has regularly-varying tails of order |x − x′|^{−3/2}. All four
covariance functions have finite integrals, so all four processes are short-range
dependent. On the other hand, each Matérn process has a well-behaved infinite-
range limit, whereas the SD1/2 process does not.
Algebraic, or inverse-polynomial, decay of autocorrelations is a characteristic of
intermediate-range and long-range dependence. One consequence is that the sample
    Ȳ_(0,t) = t^{−1} ∫_0^t Y(s) ds   or   Ȳ_{1:t} = t^{−1} Σ_{s=1}^t Y_s,

has a variance that tends to zero as t → ∞ at a slower rate than O(t^{−1}). The
rate is O(t^{−1}) for short-range dependent series, including every Matérn process,
but only O(t^{−1/2}) for the SD1/2 process. Empirically, we find that the variance of
the temperature average over randomly-sampled blocks of h successive years is as
follows:
When the circumstances permit, i.e., when a series is recorded over a large number
of equally-spaced points, the advantages of working in the frequency domain are
considerable. For a stationary series with covariance function K, Whittle (1953)
shows that the Fourier coefficients are approximately Gaussian and approximately
independent for large n, with moments
    E{Ŷ(ω) Ŷ(ω′)} = n K̂(2πω/n) + O(1)  for ω = ω′,  and  O(1)  for ω ≠ ω′.
The Whittle approximation overlooks the fact that, in many applications, the
series is defined in continuous time, but observed in discrete time. For that reason,
the expected value of the Fourier coefficient Ŷ (ω) is not exactly equal to the Fourier
transformation of the covariance function. Sykulski et al. (2019) show how to
compute the exact expectation of Ŷ (ω) efficiently, and to use the bias-corrected
version in the Whittle likelihood.
7.4 Exercises
7.2 Use fft() to compute the Fourier coefficients for the temperature series on
a whole number of years, identify and remove the frequencies that are seasonal,
average the power-spectrum values in successive non-overlapping frequency blocks
of a suitable size, and plot the log averages against the square root of the frequency
in cycles per year.
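The steps of this exercise can be sketched on a simulated stand-in for the temperature series (the real data are not reproduced here); the eight-year length, the seasonal signal, and the block size of 25 are illustrative assumptions.

```r
# Sketch of the Exercise 7.2 workflow on a simulated daily series.
set.seed(7)
nyears <- 8
n <- 365 * nyears                              # a whole number of years
t <- 1:n
y <- 10 * cos(2 * pi * t / 365) + rnorm(n)     # seasonal signal plus noise
pow  <- Mod(fft(y))^2 / n                      # power spectrum
freq <- (0:(n - 1)) / nyears                   # frequency in cycles per year
keep <- 2:(n / 2)                              # positive frequencies up to Nyquist
keep <- keep[abs(freq[keep] - round(freq[keep])) > 1e-8]  # drop seasonal freqs
nb  <- floor(length(keep) / 25)                # non-overlapping blocks of 25
idx <- keep[1:(25 * nb)]
avg  <- colMeans(matrix(pow[idx],  nrow = 25)) # block-averaged power
favg <- colMeans(matrix(freq[idx], nrow = 25))
# plot(sqrt(favg), log(avg))                   # log averages vs sqrt(frequency)
```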
7.3 For the non-seasonal frequencies, use glm() to fit the additive exponential
model
for various values of λ in the range 7–14 days or 0.02–0.04 years. In this setting, the
distributional family is Gamma, the link is identity, and the dispersion parameter
is one. Plot the residual deviance against λ to find the maximum-likelihood estimate.
Check that the fitted coefficients are non-negative. Superimpose the fitted curve
on the graph of log-block-averages in the previous exercise. Plot the standardized
residuals (log-ratio of observed to fitted) against frequency, and comment on any
departures that are evident.
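The profiling mechanic of the previous exercise can be sketched as follows. The additive exponential model itself is not written out here, so E(I) = β₁ + β₂ e^{−λω} is a hypothetical stand-in with the relevant structure: linear in (β₁, β₂) once λ is fixed, so glm() fits it, and λ is profiled over a grid.

```r
# Sketch: profile the Gamma deviance over lambda (HYPOTHETICAL model form;
# the data below are simulated, not the temperature periodogram).
set.seed(1)
omega <- 1:180                                     # cycles per year (illustrative)
mu <- 2 + 5 * exp(-0.03 * omega)
I <- rgamma(length(omega), shape = 1, scale = mu)  # Gamma, dispersion one
lams <- seq(0.02, 0.04, by = 0.001)
dev <- sapply(lams, function(lam) {
  z <- exp(-lam * omega)                           # known regressor for fixed lambda
  fit <- glm(I ~ z, family = Gamma(link = "identity"),
             start = c(mean(I), 0))                # positive starting values
  fit$deviance
})
lam.hat <- lams[which.min(dev)]                    # profile ML estimate of lambda
# plot(lams, dev, type = "b")
```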
as n → ∞, and also that the null distribution is χ²_1 for both. In this setting the sample
size is the number of non-seasonal frequencies. Comment on any discrepancy
between theory and practice in this instance, and provide an explanation.
7.5 Ordinarily, Wald’s likelihood ratio statistic is essentially the same as Wilks’s
statistic, which in one-parameter problems, is the squared ratio of the estimate to
its standard error. But there are exceptional cases where a substantial discrepancy
may occur, and variance-components models provide good examples. In order to
understand the source of the discrepancy, simulate data with simple structure as
follows:
set.seed(3142)
n <- 1000; x <- 1:n
rx2 <- 1/x^2                    # decaying covariate
beta <- c(1, 20)
X <- cbind(1, rx2)
mu <- as.vector(X %*% beta)     # mean vector
y <- -log(runif(n)) * mu        # exponential responses with mean mu
The null hypothesis is that μ ∝ 1 is constant, and the alternative is that μ = Xβ for
some β with non-negative components. Test this hypothesis using Wilks’s likelihood
ratio statistic, and also using the Wald statistic. Recall the exponential assumption,
which implies that the dispersion parameter is one.
7.7 For 0 < α ≤ 2, the α-stable distribution on the real line is symmetric with
characteristic function e^{−|ω|^α}. For the sub-range 0 < α < 1, Feller (1971,
eqn. 6.5) gives a series expansion for the density,
which is convergent for t > 0. The goal of this exercise is to simplify the density
for α = 1/2 by splitting the sum into four parts according to k (mod 4). Show that
one of the four parts is zero, that the odd parts may be combined into a multiple of
t^{−3/2} sin(1/(4t) + π/4), and that the remaining part is O(t^{−2}) as t → ∞.
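A pointer for the mod-4 split: writing the series from memory of Feller's form (to be checked against his eqn. 6.5), each term carries a factor sin(kπα/2), and for α = 1/2,

```latex
\sin\frac{k\pi}{4} =
  \begin{cases}
    0,              & k \equiv 0 \pmod 4,\\
    \pm 1,          & k \equiv 2 \pmod 4,\\
    \pm 1/\sqrt{2}, & k \text{ odd},
  \end{cases}
```

so the k ≡ 0 (mod 4) part vanishes, and the two odd parts carry the 1/√2 factors that combine into the sin(1/(4t) + π/4) term.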
7.8 From the cosine integral ∫ cos(ωt) e^{−|ω|^α} dω, deduce a series expansion for
the α-stable density. Find the general term in this expansion and deduce the radius
of convergence.
Chapter 8
Out of Africa
This chapter is concerned with the linguistic hypothesis and the data analysis in
the paper Phonemic diversity supports a serial founder effect model of language
expansion from Africa, published by Q.D. Atkinson in Science (15 April 2011). It is
recommended that you read the paper and the supplementary material. The data and
supplementary files can be found at .../ch08/
Like the genetic thesis for human migration and evolution, the ‘Out-of-Africa’
thesis for linguistic diversity holds that language evolved somewhere in Africa, and
diffused from there to Asia, Europe and elsewhere as populations split and migrated.
Since the genetic and linguistic diversity of a population is intrinsically related to its
size, a small migrating subset carries less diversity than the population from which
it originated. Accordingly, a subpopulation that splits and migrates carries less
diversity than the descendants of the ancestral population that remains. Although
tones and sounds are continuously gained and lost in all languages, the loss is
supposedly higher for small migrating founder populations than for the ancestral
population. In this way, the diversity of sounds becomes progressively reduced as
the distance from the origin increases.
Atkinson’s paper is concerned with the hypothesis that human language devel-
oped in a single location and spread from there by migration. He aims to test that
hypothesis by examining the relationship between the diversity of sounds in 504
extant languages and their geographic distance from a putative origin in Africa or
elsewhere, taking account of speaker population size. It is not our business here to
discuss the plausibility of Atkinson’s thesis. As we might expect, a wide range of
mostly skeptical views on that point has already been expressed in the literature. It
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 117
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_8
is our business solely to examine the data carefully to assess whether the thesis is
supported by the data or not.
The data on which Atkinson bases his analysis is a list of 504 languages from
various parts of the world. The list of languages is not exhaustive, nor is it close
to geographically uniform with respect to current population density. The diversity
measure is not a measure of variability of sounds in the ordinary statistical sense,
nor is it an inventory or list of sounds, but simply the number of distinct phonemes
that the language employs. There are three distinct values for vowel inventory, three
for tone inventory, five for consonant inventory, and 40 for total phoneme inventory.
In principle, inventories should all be non-negative integers, but the values have
been standardized or normalized in an unspecified way. The sample means are close
to zero, and the sample variances for the three phoneme constituents are close to
one.
Data on phoneme inventory size were taken from the World Atlas of Language
Structures, (WALS). The file ch08/S1.dat contains the main part of Atkinson’s
data, which is the list of 504 languages together with the following twelve
variables
1. Lname Language name: text, e.g. Abkhaz, Aikanã, Bété, . . ., Zuni
2. WALS three character code, e.g. abk, aik, bet,. . .
3. Fam Language Family: text, e.g. Arawakan, Indo-European, Sino-Tibetan,. . .
4. Lat Latitude as a decimal number, e.g. −12.67
5. Long Longitude as a decimal number, e.g. −60.67 (meaning 60◦ 40′ W)
6. Nvd Normalized vowel diversity based on WALS feature No. 2
7. Ncd Normalized consonant diversity based on WALS feature No. 1
8. Ntd Normalized tone diversity based on WALS feature No. 13
9. Tnpd Total normalized phoneme diversity
10. Iso ISO codes (one or more three-character codes)
11. Popn Estimated speaker population: integer 1–873 014 298
12. Dbo Distance in km. from Atkinson’s best-fit origin
Regardless of its geographical range, each language is associated with a single
point on the sphere, which is not necessarily the geographic centroid of the speaker
domain. International languages such as English, Spanish and French are associated
with their ancestral capitals. For example, English is Indo-European and is located
at latitude 52.0, longitude 0.0; the speaker population 309M is dominated by parts of
the former empire. Spanish is located at 40.0N, 4.0W with a population size 322M
most of whom are in Latin America; Mandarin is located at 34.0N, 110.0E, with
a population of 873M. The guiding principle for inclusion is not evident. Among
European languages, Albanian, Basque, Catalan, Breton, Romansch and Saami are
included, but not Portuguese (178M), Italian (60M), Dutch (22M), Ukrainian (37M),
Belarusian, Slovak, Slovene, Serbian or Croatian.
For whatever reason, phoneme values are rounded and normalized. In addition,
the number of distinct values for each variable is very limited. For example,
English, French, German and Korean have exactly the same diversity profile
(1.39, 0.12, −0.77), which is shared by Turkish and 21 other languages. The values
in the file are reported to seven or eight decimal digits. Donegal Irish shares its
consonant-dominant diversity profile (−0.48, 1.80, 0.18) with 13 other languages
including Kwakw’ala, Lezgian and Saami.
8.3 Distances
The languages were partitioned into six continental groups, Africa, Europe, Asia,
Oceania, N.Amer, and S.Amer. These coincide closely with the geographic conti-
nents, but not exactly so: Malagasy, the national language of Madagascar, belongs
to Oceania, not Africa.
For his analyses, Atkinson used great-circle distances between points x, x′
for pairs of languages belonging to the same continental group. Otherwise, for
languages in different continental groups, distances were measured for the shortest
path passing through certain choke points (supplementary material, Fig. S8). For
example, the shortest linguistic path from Europe to N. America consists of
three great-circle arcs passing through Istanbul and the Bering Strait. The great-
circle distance from Aghem = (10.0,6.67) in the Congo to Malagasy =
(47.0,-20.0) in Madagascar is 5014 km, but since the latter belongs to region 4,
the linguistic distance through Cairo and Phnom Penh is 18475 km.
The R-executable file ch08/OOA.R contains the following commands:
S1 <- read.csv(file="ch08/OOA.dat")
S4 <- read.csv(file="ch08/S4.dat", header=FALSE)
The same file also contains the coded list lregion of 504 linguistic groups, the
geocoordinates of a small number of major cities including the choke points
chokes, some code for geographic plotting, and functions for computing distances,
as follows.
1. gcdist(x1, x2) great circle distance in km:
gcdist(Aghem, Malagasy) = 5013.585
gcdist(Paris, Chicago) = 6651.991
The format used here for geocoordinates is x=(long, lat) in decimal
degrees.
2. chokedist(x1, x2, r1, r2) linguistic distance:
chokedist(Aghem, Malagasy, 1, 4) = 18474.65
chokedist(Paris, Chicago, 2, 5) = 15761.26
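The book's gcdist() lives in ch08/OOA.R; as a stand-in, a great-circle distance can be sketched with the haversine formula. The Earth radius 6371 km is an assumption of this sketch (the book's convention is not stated), which is why it only approximates the quoted values.

```r
# Sketch of a great-circle distance in km; NOT the book's gcdist().
gc.sketch <- function(x1, x2, R = 6371) {        # R = 6371 km is assumed
  p <- pi / 180                                  # degrees to radians
  lon1 <- x1[1] * p; lat1 <- x1[2] * p           # x = (long, lat) convention
  lon2 <- x2[1] * p; lat2 <- x2[2] * p
  a <- sin((lat2 - lat1) / 2)^2 +
       cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)^2  # haversine formula
  2 * R * asin(sqrt(a))
}
Aghem    <- c(10.0, 6.67)
Malagasy <- c(47.0, -20.0)
gc.sketch(Aghem, Malagasy)   # approx 5017; the book's gcdist gives 5013.585
```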
8.4 Maps and Scatterplots
The file ch08/OOA.R also contains standard R code for plotting world maps
and subsets thereof, using the ggplot2 and rgeos packages downloaded from
the CRAN website. For illustration, Fig. 8.1 shows the location of the African
and European languages used in the analysis. The African languages are heavily
concentrated in equatorial Africa, roughly 10◦ S to 15◦N. Among the eleven
African countries south of 10◦ S, only four—Angola, Botswana, Namibia and
South Africa—are represented, and Botswana has four or five out of the eight
languages. Madagascar is represented by Malagasy, whose roots are non-African.
Similar anomalies are evident in the European sample.
The putative linguistic origin is a geo-coordinate point x0 for which total
phoneme diversity for language i satisfies

    E(Tnpd_i) = β0 + β1 ‖xi − x0‖,

where ‖xi − x0‖ is the linguistic distance between the origin and the geo-coordinate
of language i. The least-squares criterion is thus

    Σ_i (Tnpd_i − β0 − β1 ‖xi − x0‖)²,
which is to be minimized with respect to the four parameters β0 , β1 plus the two
components of x0 . Atkinson reports the best-fitting origin at 1.25◦S, 9.30◦E in West
Africa: see Fig. 8.4.
Support for Atkinson's thesis, that phoneme diversity—or more correctly
phoneme inventory—decreases with distance from the origin, is best illustrated
by the aggregate scatterplot in Fig. 8.2 of total phoneme diversity against distance
from the best-fitting origin. The least-squares fitted line has a definite negative
slope.
What the scatterplot Fig. 8.2 fails to show is that all of the distances up to
4.5 units are in Africa, many of those in the interval 5.5–8.5 are in Europe,
most of those in the interval 5.5–13.5 are in Asia, and so on. Consequently, it
is natural to plot each continental group separately, which is done in Fig. 8.3.
This exercise reveals that the relation between distance and phoneme inventory is
negative chiefly in Africa. The negative least-squares slope for Oceania is almost
entirely due to the single point (Malagasy), which has high leverage on account
of its remoteness, and a low phoneme inventory. For each of the other linguistic
groups, the relation between inventory and distance is either negligible or positive.

Fig. 8.1 Geographic distribution of African and European languages in Atkinson's sample

Fig. 8.2 Total phoneme diversity plotted against the distance from the best-fitting origin as reported by Atkinson at 9.5◦E, 1.25◦S (horizontal axis: Dbo in 1000 km)

Fig. 8.3 Total phoneme diversity plotted against the distance from the best-fitting origin for each of the six linguistic groups (Africa, Europe, Asia, Oceania, N.Amer, S.Amer) separately
While the aggregate scatterplot in Fig. 8.2 seems to support Atkinson's thesis,
the disaggregated continent-by-continent plots paint a different picture. This is an
instance of Simpson’s paradox for continuous data, with positive slopes on most
continental subsets, but a negative slope overall.

8.5 Point Estimates and Confidence Regions
    E(Tnpd_i) = β0 + β1 ‖xi − x0‖ + β2 log(Popn_i),

where x0 is the origin and ‖xi − x0‖ is the linguistic distance. Given the range
of speaker populations, linearity in log population size seems more reasonable
than linearity in population size, which is in agreement with Atkinson’s principal
analysis. For the moment, we assume also that the components are independent
with constant variance σ 2 .
For arbitrary fixed origin, the model is linear in the remaining three regression
parameters, so the least-squares estimates can be obtained in the standard way.
Denote by RSS(x) the residual sum of squares on n − 3 = 501 degrees of freedom
for fixed x. The point x̂ that minimizes the residual sum of squares is the non-linear
least-squares estimate. For these data, the minimum over the rectangular grid that
covers Africa occurs at the south west corner, near 17◦W, 35◦ S in the south Atlantic.
However, the RSS function varies little throughout the bight of Africa, and is almost
constant along the coast from Liberia to Cape Town. For this exercise, we restrict
the parameter space to terra firma. The minimum over continental Africa occurs on
the coast near the border between Angola and Namibia, roughly at 12◦ E, 17◦ S as
indicated in Fig. 8.4, with RSS = 143.02. By the narrowest of margins, Atkinson’s
fitted point on the Congo coast appears to be a local minimum with RSS = 143.07.
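The grid search can be sketched on simulated data, with Euclidean distance and synthetic diversity values standing in for the linguistic distances and Tnpd, and with the population term omitted for brevity; for each candidate origin, lm() profiles out β0 and β1.

```r
# Sketch of the profile-RSS search for the origin (simulated stand-in data).
set.seed(2)
n <- 200
xy <- cbind(runif(n, -10, 10), runif(n, -10, 10))  # 'language' locations
x0.true <- c(2, -3)                                # illustrative true origin
d.true <- sqrt(colSums((t(xy) - x0.true)^2))
tnpd <- 1 - 0.05 * d.true + rnorm(n, sd = 0.2)     # synthetic diversity values
rss <- function(x0) {                              # RSS(x): beta0, beta1 profiled out
  d0 <- sqrt(colSums((t(xy) - x0)^2))
  sum(resid(lm(tnpd ~ d0))^2)
}
grid <- expand.grid(gx = seq(-10, 10, by = 1), gy = seq(-10, 10, by = 1))
vals <- apply(grid, 1, function(g) rss(c(g[["gx"]], g[["gy"]])))
grid[which.min(vals), ]                            # lands near the true origin
```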
The standard recipe for the formation of a confidence set in non-linear least-
squares problems uses a selected contour of the restricted residual sum of squares
function RSS(x) as the boundary. The residual mean square s² = RSS(x̂)/(n − 5)
serves as the variance estimate, which is distributed approximately as σ²χ²_{n−5}
under the stated assumptions. Moreover, s² is distributed approximately
Fig. 8.4 Point estimate and confidence regions (80% and 95%) for the language origin, assuming
independent observations. The RSS values for the three coastal marked points are 143.4, 143.1 and
143.0 in west-to-east order
unrestricted maximum had been used, confidence regions at low confidence levels
would cover water only, which is a difficult case for a linguist to make. However, the
unrestricted 95% confidence region matches reasonably closely the restricted 80%
region.
Shortcomings
The preceding analysis overlooks the fact that one of the regularity conditions fails.
The restricted maximum occurs at a 1D-boundary point on the coast, so the local 2D
manifold argument fails. If the linguistic hypothesis also stated that the origin must
be a coastal point, the 1D argument would naturally prevail. But that is not a part
of the thesis, so it seems preferable to use the more conservative 2D allowance. An
alternative option is to resort to simulation—but that is neither an easy answer nor a
satisfactory answer. In any event, there are more consequential effects that have so
far been ignored.
The likelihood-based nominal 95% confidence region is the set of all candidate
source points whose log likelihood is sufficiently high compared with the maximum,
i.e.,

    {x : 2l(x̂) − 2l(x) ≤ χ²_{2, 0.95}}.
Here l(x) denotes the profile log likelihood maximized over all other parameters.
For the model suggested here, the 95% confidence region includes all of Africa
except for a portion of lower Egypt; the 99% region also includes the Levant (Israel,
Syria, Turkey, Jordan, Iraq) and all of Europe except for Russia and the Caucasus.
Generally speaking, failure to accommodate correlations has the effect of making
the data seem more informative than they are, so the resulting confidence intervals
are unrealistically narrow. Thus, it is no surprise that Fig. 8.4 is misleading in its
portrayal of the strength of information in the data.
The code used for computing the log likelihood for one candidate point x0
belonging to linguistic region lregn has two parts:
ldist <- vdist(x0, lregn) # vector of distances from x0
fit <- regress(Tnpd~ldist+log(Popn), ~Fam+V, data=S1)
Here S1$Fam is the linguistic family coded as a factor, and V is a matrix with
components Vij = exp(−‖xi − xj‖/λ), which do not depend on x0. As it stands,
this code is both computationally inefficient and technically incorrect on two counts.
The efficiency can be improved substantially by including the optional argument
start=fit$sigma, which makes the previously-computed variance components
available as the starting point for iteration.
The first technical issue is that the default likelihood function that is maximized
by regress() is the REML likelihood for the observation Y in the space R^n/X
of residuals modulo the subspace X of mean values. Ultimately, our goal is to
compute a likelihood ratio for one candidate center versus another, and the problem
with the code as shown is that the mean-value subspace for one candidate point
is not the same as the mean-value subspace for another candidate. The REML
log likelihoods are not comparable as log likelihoods. In order for this to be
done correctly, it is necessary to use the optional argument kernel=K to override
the default kernel. While K=0 and K=1 are valid zero- and one-dimensional
options, the more natural choice is the two-dimensional intersection subspace
K <- model.matrix(~log(Popn), data=S1).
The second technical issue is that the log likelihood for a given x0 should also
be maximized over λ, which is a substantial computational overhead. For simplicity
in the analysis described above, λ = 820 km has been treated as a known constant.
8.6 Matters for Further Consideration
Shortcomings
Suppose that it were possible to extract from the WALS database, the actual
phoneme inventory of each language rather than the phoneme count. This statement
implies a finite master list of m phonemes together with a Boolean variable
Y : [m] → {0, 1} for each language indicating the subset of the master list that
occurs in the given language. The phonemes may be labelled by type (vowel,
consonant or tone), but the problem is already difficult enough without this added
complication.
As it happens, the data are now available in phoneme-inventory form in the file
.../ch08/santoso.dat. Details are given in Sect. 8.7.
Without altering notation, we may regard the phoneme inventory Yi for lan-
guage i either as a Boolean vector or as a subset Yi ⊂ [m] of the master list. Thus
Yi is the inventory for language i, the usual component-wise product Yi Yj is the
inventory common to a pair of languages, and the k-fold product Yi1 · · · Yik is the
inventory common to a specific subset of k languages.
and so on. The analyses presented in this chapter use only the n-vector of first-order
counts. But inventory data also provide symmetric n × n matrices of second-order
counts, symmetric tensors of third-order counts, and so on.
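In matrix form, with languages as the rows of a Boolean matrix Y over the master list, the first-order counts are row sums and the second-order counts form the Gram matrix; a toy sketch with random data:

```r
# Toy sketch: first- and second-order phoneme counts from a Boolean matrix.
set.seed(3)
n <- 5; m <- 12                       # 5 toy languages, master list of 12 phonemes
Y <- matrix(rbinom(n * m, 1, 0.5), n, m)
first  <- rowSums(Y)                  # first-order counts: inventory sizes |Y_i|
second <- Y %*% t(Y)                  # second-order counts: entry (i,j) = |Y_i Y_j|
all(diag(second) == first)            # TRUE: diagonal recovers first-order counts
```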
Geography and distance are essential components in the Out-of-Africa hypothe-
sis. What bearing does the hypothesis have on data collected in inventory format?
Each language i is associated with a geographical location xi ; each pair may be
associated with a pair of points {xi , xj }, with a line segment (xi , xj ) or with a
weighted centroid; each triple may be associated with a set of points, the convex hull
of those points, or with their centroid, and so on. How is distance to be measured
for singletons, pairs, triples and so on? How are the questions to be formulated
statistically? Given answers to these questions, how might the analysis proceed to
estimate relevant parameters and to check whether the data are consistent with the
hypothesis?
8.6.3 Granularity
In our analysis, we have ignored the fact that phoneme inventory variables are
discrete, with only a few distinct values. Atkinson used three equally-spaced
numbers to represent normalized tone inventory, three for vowels and five values
equally spaced for normalized consonant diversity. Total normalized phoneme
diversity is the sum of these three. What effect does granularity have on conclusions
derived from a Gaussian model?
8.7 Follow-Up Project
For her Masters project at the University of Chicago, Josephine Santoso compiled
from phoible.org and other sources a more extensive phoneme inventory of 1277
world languages, of which 1028 have speaker populations of at least 50. This is a
little more than twice as many languages as Atkinson used. The Santoso file contains
all of the variables listed in Sect. 8.2 plus several others. She has kindly agreed to
make her compilation available in the file .../ch08/santoso.dat.
Although Santoso’s file lists the actual set of phonemes of each type, her
analysis focuses solely on phoneme counts for languages having more than 50
speakers. Two phonemes that are distinct in one dialect are not necessarily distinct
in another, so phoneme counts for any language are subjective to a certain extent,
particularly those for vowels and consonants. Where Atkinson’s main analysis uses
total normalized phoneme inventory as described earlier, Santoso’s main analysis
uses the total phoneme count without normalization. If Y1, Y2, Y3 are the phoneme
counts for vowels, consonants and tones respectively, Santoso's response variable is
Y1 + Y2 + Y3, whereas Atkinson's normalized combination is closer to

    Y1/s1 + Y2/s2 + Y3/s3,

where s1, s2, s3 are the three standard deviations. For the new compilation, the
means and standard deviations are as follows:
taken into account, however, almost any point in Europe or Africa fits equally well
as an origin.
Given the genetic and paleoanthropological evidence of migrations out of Africa
as early as 200k years ago, with later migrations 70–50k years ago, the argument
that traces of those migrations ought to remain in extant languages has a certain
degree of plausibility. Arguably, the data contain some broad-scale evidence for it.
But the linguistic evidence on its own is not at a level that enables us to point to a
language origin specifically in Africa.
Let l be a language that employs c consonants, v vowels and t tones. From the
perspective of a mathematician who is familiar only with European languages, it
is natural to imagine that every syllable or every word is a sequence of vowels
and consonants that can be enunciated in one of t tones. Each tonal enunciation is
a different word and a different concept. This Cartesian-product framework does
not imply that the entire set of (v + c)t tonal variations occurs in the language.
However, the function table(...santoso$num_tone) applied to Santoso’s
spreadsheet produces the following counts for languages having more than 50
speakers:
# Tones 0 1 2 3 4 5 6 7 8 9 10
# Languages 870 1 48 79 57 25 8 5 1 2 2
In other words, the number of tones ranges from zero to ten. Most languages have
a zero tone count, and only one has t = 1. If the preceding interpretation were
accurate, the set of phonemic sounds available would be empty for nearly 80% of
all languages. Clearly, this cannot be a correct interpretation.
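The 'nearly 80%' figure can be checked directly from the tone-count table above:

```r
# Check of the zero-tone proportion from the table above.
counts <- c(870, 1, 48, 79, 57, 25, 8, 5, 1, 2, 2)  # languages with 0, 1, ..., 10 tones
sum(counts)                # 1098 languages in the table
counts[1] / sum(counts)    # 0.7923..., i.e. nearly 80% have zero tone count
```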
It is evident that non-tonal languages are coded as t = 0. The singular single-
tone language (Dutch) could be a coding error, but possibly not; some Franconian
dialects are described as pitch-accented, which is not the same as tonal. Where
a mathematician would naturally opt for the adjective ‘monotonal’, meaning one
tone, the more common adjective ‘atonal’ is interpreted literally and coded as zero
tones. As a mathematician, my instinctive preference is the code t = 1 for atonal
languages, and I confess that I had taken for granted that everyone else would see
it the same way. Regrettably, misconceptions of this sort are not uncommon in
statistical consulting! Ultimately, the choice of one code over the other is a matter to
be settled by subject-matter specialists. Since re-coding is not an additive constant,
it does have a bearing on the analysis for total phonemes, normalized or otherwise.
8.8 Exercises
8.1 Use the table() function to extract the distinct values for vowel diversity,
consonant diversity, tone diversity and total phoneme diversity. What does this tell
you?
8.4 The function vdist(x0, 1) returns a list of linguistic distances from the
designated point x0 in linguistic region 1 to each of the 504 language locations.
Show that Atkinson’s distance variable S1$Dbo implies that his best-fitting origin
lies somewhere in the box 9–10◦E, 1–2◦ S. Find the point and locate it on a map.
You should not assume that the best-fitting origin lies in or near the box 9–10◦E,
1–2◦S.
8.6 Which language has the greatest vowel inventory in the Santoso compilation,
and which has the least? Which language has the greatest consonant inventory, and
which has the least?
Chapter 9
Environmental Projects
This project concerns an experiment conducted at two sites in Minnesota over the
period 2009–2011 to determine the effects of climate warming on photosynthesis in
juvenile trees of 11 different species. The following excerpt is taken from the data
archive:
To test how climate warming and variation in soil moisture supply will jointly influence
photosynthesis of southern boreal forest tree species we measured gas exchange rates of 11
species in an open-air warming experiment at two sites in northern Minnesota, USA. The
experiment ran for three years and used juveniles of 11 temperate and boreal tree species
under ambient and seasonally warmed (+3.4 °C above- and below-ground) conditions. We
measured in situ light-saturated net photosynthesis (Anet) and leaf diffusive conductance
(gs) on numerous days across the three growing seasons. Soil and plant temperatures and
soil moisture were continuously measured from sensor arrays.
Details of the experimental design, the site preparation, the species selected, the
variables measured, and so on, are provided in the paper (Reich et al., 2018).
The two sites are roughly 100 miles apart, each site consisting of 12 plots
arranged in three blocks of four plots. Each plot is a circular area, roughly three
metres in diameter, which is adequate for 30–40 juvenile specimens. Plots in the
same block are sufficiently far apart to avoid treatment interference. In other words,
the separation is sufficient to ensure that the treatment applied to one plot has
negligible effect on neighbouring plots.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 133
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_9
Several specimens of each species were planted in each plot. The heat treatment
was applied to plots during the growing season only. On 50 days, roughly 15–18
days per year from mid-June till late September, measurements were made on trees
in several plots. For administrative reasons, all measurements on one day occurred at
the same site. On these days, soil water content was recorded for each plot together
with several measurements (photosynthesis, conductance, vapour pressure gradient,
temperature,. . . ) on selected trees in each plot. Each photosynthesis and vapour-
pressure measurement was made on a single leaf.
The main thrust of the paper is that atmospheric warming has two principal
effects that are relevant to tree growth. One is the direct effect on leaf transpiration
and photosynthesis. The other is the effect on soil moisture content. Higher
temperatures at these latitudes generally mean that soils are drier than they would
otherwise be. The moisture deficit has an important effect on how trees respond to
warmer temperatures.
The data reported in the paper are available from the Environmental Data Initiative
at
https://doi.org/10.6073/pasta/258239f68244c959de0f97c922ac313f
For present purposes, the local file
.../ch09/borealwarming.csv
consists of a 2049 × 18 spreadsheet containing the data to be used for the exercises
listed below. Each row consists of several measurements on one leaf on one day. A
few minor discrepancies were noted and corrected, so this file differs slightly from
the archived data.
Photosynthesis, conductance and vapour-pressure are measured on leaves from
selected trees. Each leaf or each leaf-plot-occasion is one observational unit for
all such variables, and the sample contains 2049 units of this type, one for each
row in the spreadsheet. A few values are missing for some units. Responses for
distinct units in the same plot may be correlated positively. Likewise for distinct
units in the same block or the same site. Responses may also be correlated spatially
and temporally, but spatial information is meagre, so this avenue of investigation is
limited.
Soil moisture content is not measured on leaves: it is measured on plots, with at
most one measurement for each plot on each visit to the site. Each plot-occasion pair
is one observational unit, and the sample contains approximately 503 observational
units for this variable. Two soil moisture values are missing. Responses for distinct
units may be correlated spatially or temporally or in blocks.
The questions that follow are intended as a guide for analysis in an examination
setting. You should first read the Nature paper and then answer the questions as
asked. But you are not restricted to the points mentioned below; you are free to
examine the data in any way you please, so your approach might not follow the path
suggested.
9.1.3 Exercises
1. Before any trees were planted, a grid of underground electrical cables was laid
down at a depth of 15 cm in each plot, with sufficiently small separation that the
heating effect could be deemed uniform. The cables in the treated plots were used
as heating elements. What was the purpose of the cables in the control plots?
2. Ten of the variables in the file borealwarming.csv are as follows:
site, block, warming, plot_id, species,
plant_id, year, date, Asat, soil_moisture.
9.2.1 Introduction
Bees serve an essential function as pollinators for a very wide range of flowering
plants, from fruit trees to soy beans to rapeseed plants. Over the past several decades,
numerous articles have appeared in the press pointing to an alarming decline in bee
survival and the dire consequences for agricultural production. In some of the more
alarmist articles, the risk to honey bees is attributed to African bees or killer bees;
in others, the risk is put down to habitat loss, climate change, pesticides, pathogens
and parasites. All articles paint a bleak outlook, not only for bees, but for humanity
as well.
Adler et al. (2020) designed and analyzed a number of experiments whose
overall goal was to study the effect of a particular pathogen, Crithidia bombi, on the
behaviour and survival of the bumblebee Bombus impatiens. For reasons unknown,
it appears that some plant species are more effective than others at pathogen
transmission. The following quote is taken from an introductory passage.
The role of plant species in shaping infection intensity could be influenced by bee behavior.
If infected bees increase visitation to antimicrobial plant species as a form of self-
medication, such plant species could play a larger role than predicted in disease dynamics.
Alternatively, antimicrobial plant species may be less effective than expected if pathogens
manipulate host behavior to avoid such plants. Sunflower has pollen that dramatically
reduces [the pathogen] C. bombi in B. impatiens and several plant species produce nectar
with secondary compounds that can reduce pathogens, although such effects are not always
consistent. Only a few studies have assessed whether infection alters bee preference. In the
field, infected B. impatiens and Bombus vagans had greater preference than uninfected bees
for inflorescences with high nectar iridoid glycosides that can reduce pathogen infection.
However, a laboratory study with Bombus terrestris found only weak evidence that infected
bees had increased preference for nectar nicotine compared to uninfected bees. Thus, there
are conflicting results across species and compounds, and very few data overall to assess
whether infection changes foraging preferences.
The first experiment was designed to study the risk of infection by the pathogen C.
bombi, and how the risk varies with flower type. Each experimental unit was one
microcolony or one tent, so the flower type was assigned to tents. The control level
consisted of canola (rapeseed) alone; for the other two levels, some of the canola
pots were replaced either with flowering plants that had previously been deemed to
9.2 The Plight of the Bumblebee 137
It is unclear whether the same nine tents were used in each round, but it is reasonable
to presume so, and this has subsequently been confirmed.
To estimate the treatment effect, the authors used a generalized linear mixed
model of binomial type as follows. Given the round and block effects, ηr,b , treated
implies that the non-binomial random component ηr,b is a sum of two independent
processes, one having independent and identically distributed components for each
block:round level, the other having independent and identically distributed
components for each round. Equally important, η is constant within each block in
each round.
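In symbols, the two-stage structure just described can be sketched as follows (the notation is mine, not the authors’): with Y_{r,b} the number of infected bees out of m_{r,b} in block b of round r,

```latex
Y_{r,b} \mid \eta_{r,b} \sim \mathrm{Binomial}\bigl(m_{r,b},\, \pi_{r,b}\bigr),
\qquad
\operatorname{logit} \pi_{r,b} = x_{r,b}^{\top}\beta + \eta_{r,b},
\qquad
\eta_{r,b} = a_r + c_{r,b},
```

with a_r ~ N(0, σ²_round) and c_{r,b} ~ N(0, σ²_block:round) all independent. Since η_{r,b} does not vary within a block–round combination, η is constant within each block in each round.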
Part of the glmer() output shows the two fitted variance components plus the
estimated treatment effects with canola as the reference level.
Random effects:
Groups Name Variance Std.Dev.
block:round (Intercept) 0.0288 0.1698
round (Intercept) 1.3883 1.1783
Number of obs: 44, groups: block:round, 15; round, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7102 0.5771 1.231 0.22
treatmentHigh 0.5135 0.3144 1.634 0.10
treatmentLow 0.2518 0.3016 0.835 0.40
The generalized linear mixed model described above is a development of recent
vintage. The two-stage construction is mathematically straightforward, but maximum-
likelihood estimation is computationally non-trivial. General-purpose computational
packages have become available only in the past 25 years. The function
glmer() in the lme4 package uses the Laplace approximation for Gaussian integrals;
a few others use Monte Carlo methods.
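To illustrate the idea behind the Laplace approximation in the simplest case, the sketch below (a toy Python example of my own, not the lme4 implementation) approximates a one-dimensional binomial-Gaussian integral by a Gaussian expansion about the mode of the log integrand, and compares it with direct quadrature:

```python
import math

def log_integrand(eta, y, m, beta, sigma):
    """Log of binomial likelihood times Gaussian density for the random effect."""
    p = 1.0 / (1.0 + math.exp(-(beta + eta)))
    return (y * math.log(p) + (m - y) * math.log(1.0 - p)
            - 0.5 * eta * eta / sigma**2 - 0.5 * math.log(2 * math.pi * sigma**2))

def laplace(y, m, beta, sigma):
    """Laplace approximation to the integral of the integrand over eta."""
    # Locate the mode by Newton's method (the log integrand is concave).
    eta = 0.0
    for _ in range(50):
        p = 1.0 / (1.0 + math.exp(-(beta + eta)))
        grad = y - m * p - eta / sigma**2
        hess = -m * p * (1.0 - p) - 1.0 / sigma**2
        eta -= grad / hess
    p = 1.0 / (1.0 + math.exp(-(beta + eta)))
    curv = m * p * (1.0 - p) + 1.0 / sigma**2   # minus the second derivative
    return math.exp(log_integrand(eta, y, m, beta, sigma)) * math.sqrt(2 * math.pi / curv)

def quadrature(y, m, beta, sigma, lo=-10.0, hi=10.0, n=20000):
    """Trapezoidal quadrature over a wide grid, for comparison."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        eta = lo + i * step
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.exp(log_integrand(eta, y, m, beta, sigma))
    return total * step

approx = laplace(y=7, m=10, beta=0.5, sigma=1.0)
exact = quadrature(y=7, m=10, beta=0.5, sigma=1.0)
print(approx, exact)  # the two agree closely
```

The model fitted by glmer() requires one such integral per block-round combination, with the variance components and regression coefficients estimated by maximizing the resulting approximate likelihood.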
This book does not concern itself with computational tactics, but with broader
strategic matters. In that respect, there is a small non-obvious error or misunderstanding
in the form of the model used here. To be clear, I do not mean that the
computations are in error or that the conclusion as stated is biologically incorrect. I
mean only that there is a non-trivial methodological error that could, under different
circumstances or with different data, give rise to misleading conclusions. It is worth
revisiting the analysis in order to understand the motivation for the model and the
source of the error.
The fact that the response is binary rather than a quantitative measurement is
immaterial as a matter of principle. The fact that the expected value is not linear in
treatment effects is also immaterial. How would the analysis have proceeded had
the response Y (u) on each of 389 bees been quantitative?
Standard practice recommended in every textbook on experimental design calls
for a distinction between observational units and experimental units. This design has
389 observational units (bees), and 44 experimental units (colonies). Typically, the
variation between observational units within the same experimental unit is less than
the variation between observational units in different experimental units. In order for
the variance of treatment contrasts to be estimated honestly from variation between
experimental units rather than variation within, the experimental-unit factor must
be included as a term in the covariance model: see chapter 1 for a simple instance.
As it happens, the original round:block factor is not needed, so it is best to
replace (1|round/block) with (1|round)+(1|colony). In effect, the 15-
level block-factor block:round is replaced by the 44-level factor colony.
The output from glmer() after this substitution is as follows:
Random effects:
Groups Name Variance Std.Dev.
colony (Intercept) 0.2119 0.4603
round (Intercept) 1.4924 1.2216
Number of obs: 44, groups: colony, 44; round, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6925 0.6100 1.135 0.26
treatmentHigh 0.6006 0.3788 1.586 0.11
treatmentLow 0.2923 0.3629 0.806 0.42
9.2.4 Exchangeability
The argument in the preceding section for including the experimental-unit factor
colony is a consequence of exchangeability associated with a group that is
implicit in the design. Other factors such as round may be included as needed,
either as additive fixed effects contributing to expected values, or as block factors
contributing to variances and covariances. But the inclusion of colony as a block
factor is a matter of principle; it is not a matter to be decided on the basis of
goodness of fit or model adequacy.
To understand the reasoning behind this attitude, it is helpful to rearrange the
spreadsheet by observational units, i.e., with one row for each bee. The first 38 rows
of the expanded file ExpSS are as follows:
It is understood that the labelling of bees within a colony is entirely arbitrary, and
carries no substantive information. The labelling shown above is such that the lower-
numbered bees in each colony are disease-free. Thus, the record for bee 1 is repeated
for bees 1–6, the record for bee 7 is repeated for bees 7–15, and so on.
Despite the fact that the expanded spreadsheet has 389 rows while the original
has only 44, it is apparent that the spreadsheets are entirely equivalent in their
information content. Thus, the output of the glmer() expressions
glmer(Y~treat+(1|round/block), family=binomial, data=ExpSS)
glmer(Y~treat+(1|round)+(1|colony), family=binomial, data=ExpSS)
is identical in all important respects to the outputs shown in Sects. 9.2.2 and 9.2.3.
In particular, all parameter estimates and standard errors are identical.
Consider now a pair of bees (u, u′) with response (Yu, Yu′). In both of the
models implied by the glmer() expressions shown above, the bivariate random
variable (Y1, Y2) has the same distribution as (Yu, Yu′) for all pairs of distinct bees
in colony 27. Likewise, (Y16, Y17) has the same distribution as (Yu, Yu′) for distinct
pairs in colony 21. The responses are exchangeable within colonies. Because
different treatments are applied to different colonies, the responses are not
exchangeable across colonies, whether in the same round or in different rounds.
In the null versions with the treatment effect omitted, both glmer() expressions
imply that (Y1, Y2) also has the same distribution as the pairs (Y16, Y17) and
(Y23, Y24) in colonies 21 and 25. Two pairs of units have the same joint distribution
whenever the first pair belongs to one colony and the second pair to another colony.
However, the first glmer() expression without treatment also implies that all
observations Y1, . . . , Y38 in round 1 are exchangeable. Thus (Yu, Yu′) has the same
distribution as (Y1, Y2) for all distinct pairs in the same round, whether they belong
to the same colony or not.
Since we are dealing with an infectious disease, it is entirely possible that a part
of the observed or anticipated correlation is associated with bee-to-bee transmission.
In that case, there is every reason to expect that an infected bee is more likely to
transmit the disease to a sister bee in the same colony than to a sister bee in a
different colony with which it has no direct contact. Or, to put it the other way
round, there is little reason to expect intra-colony transmissions to occur at the same
pairwise rate as inter-colony transmissions in the same physical block, which is what
the first version of glmer() implies.
It is not always easy to understand the implications of a complicated stochastic
model. When it is spelled out this way, it is evident that forced block exchangeability
must have been unintentional on the authors’ part. Fortunately in this instance, it
does not substantively alter the conclusions.
In either case, the model for the mean must include round effects in addition to
treatment effects. The code used to produce the output shown below is
fit <- glm(Ytot ~ treat + round, family=binomial(), ...)
summary(fit, dispersion=X2/resdf)
Project I: Phenology refers to the study of cyclic and seasonal natural phenomena,
especially in relation to climate and plant and animal life. The paper by Montgomery
et al. (2020) is a continuation of the project discussed in Sect. 9.1. It aims to
study plant phenology in response to atmospheric warming. The bud-burst dates
for ten tree species over five years are available online. Is this an experiment or an
observational study? If it is an experiment, what is the treatment? What do the data
have to say about the change in budburst dates for each species over the five years?
Project II: Li et al. (2019) studied the effect of atmospheric warming on the amount
and the blend of volatile organic compounds released by the shrub Betula nana in
response to herbivory by moth larvae at a site at latitude 68° N in the Swedish tundra.
The data are available online.
What are the observational units? What are the experimental units? What role
does methyl jasmonate play in the design? How do your conclusions align with
those in the paper?
9.4 Exercises
9.1 Check that the authors’ glmer() code for the bee infection experiment
produces the output shown in Sect. 9.2.2. Check that the revised glmer() code
produces the output shown in Sect. 9.2.3.
9.2 Expand the spreadsheet so that it is indexed by bees rather than by colonies.
Check that the two versions of the glmer() code produce the same output for the
expanded spreadsheet as they do for the compact format.
9.3 The analysis of bee infection rates in Sect. 9.2 is only one of many similar
analyses reported by Adler et al. (2020). A subsequent analysis examines how
the mean infection intensity per colony is related to treatment. The authors use a
Gaussian random-effects model with round/block as the additive random effect.
Since mean infection is a positive number with an appreciable dynamic range,
check first whether a transformation might be helpful to improve additivity. On the
transformed scale, check whether there is a treatment effect. Which treatment level
has the lowest mean infection, and which has the highest?
9.4 McCullagh and Nelder (1989, Sect. 14.5) describe an experiment by S. Arnold
and P. Verrell on the interbreeding of southern Appalachian mountain dusky
salamanders. Each male has six breeding opportunities and each female also has
six opportunities. The incomplete crossed design is shown in Table 14.3, and the
results for three experiments in Tables 14.4–14.6 of McCullagh and Nelder (1989).
Use lmer(...) to estimate the male and female variance components for the
first salamander-mating experiment. Is there any evidence of correlation for repeat
observations on the same male? Is there any evidence of correlation for repeat
observations on the same female?
9.5 All of the analyses of infection rates in Sect. 9.2 are for infections among
surviving bumblebees, with the implicit assumption that the survival distribution
does not depend on infection status. Suppose that the initial colony size were
available, so that the survival fraction could be obtained by subtraction. How might
you use the additional information to determine the effect of treatment on survival?
Chapter 10
Fulmar Fitness
10.1.1 Background
The northern fulmar Fulmarus glacialis is a seabird found mainly off the coasts of
Iceland, the British Isles and parts of Norway. Fulmars have a life expectancy of
30–40 years, and some may live up to 60 years. Adults are monogamous: they form
long-term pair bonds, and breeding pairs return to the same nest site year after year.
Breeding begins in May; a single egg is laid and incubated by both parents for about
50 days. The chick is brooded for about two weeks and fully fledged after about
three weeks.
A survey of the Eynhallow fulmar colony in the Orkney Islands was started in
1951 by Robert Carrick and George Dunnet; the data for this exercise concern the
breeding record of 428 adult female birds for the period from 1958 to 1996. They
were provided by Steven Orzack in 2006. A few of the birds were active breeders
when first observed, but most were observed annually from their first attempts at
breeding until 1996 or the bird’s presumed death. No single bird was alive for the
entire period of observation. Further details concerning the Eynhallow population
can be found in Orzack et al. (2011).
According to the Wikipedia article, fulmars have a long adolescence and
commence breeding at around 6–12 years of age. The records available for this
project include 62 females that were born at Eynhallow and subsequently nested
there at age two or later. These individuals were marked as fledglings, so their ages
were known. Their nesting commenced at a wide range of ages from two to ten,
with a mean of 4.65 and standard deviation 2.46.
Every pair of birds breeding at Eynhallow during the period 1958–1996 was
recorded. The record for each female in each year shows whether the bird laid
an egg, and if so, whether the egg hatched, and, if hatched, whether the chick
fledged, which is the event of primary interest. The data are presented in the
file ...Fulmar.dat in mark-recapture format in which the first column is the
bird identifier and the next 38 columns describe the reproductive event observed
for that bird during each of the 38 years of the study. Each record includes the
letter ‘A’, in the year that the bird was first observed, either directly if the bird
fledged at Eynhallow, or indirectly if it fledged elsewhere and subsequently nested
at Eynhallow. All birds born at Eynhallow are marked as fledglings. If the bird
fledged at Eynhallow, ‘A’ indicates the year of fledging. However, the great majority
of birds nesting at Eynhallow are blow-ins that were born elsewhere. If the bird was
first observed nesting at Eynhallow, ‘A’ is the year before the first nesting. For such
blow-ins, the year of birth is not known.
The code for reproductive events is: 0, no nesting event; 2, an egg was laid but
failed to hatch; 3, the egg hatched but the chick failed to fledge; 4, the chick
fledged; the letter ‘A’ marks the year of first observation.
The records for birds 116, 209, 280 and 284 are as follows:
116 A00444224004040000000000000000000000000
209 A20043344444000000000000000000000000000
280 00000A000000324000000000000000000000000
284 00000A400000024040434444444440440020002
Bird 116 was first recorded and marked in 1958 as a fledgling at Eynhallow. It was
next observed as a breeding adult in 1961, when it produced a fledgling (code 4).
The last sighting occurred at Eynhallow in 1971, when it produced another fledgling.
Bird 209 was first observed as an adult in 1959, when it laid an egg that did not
hatch. Although it was first marked in 1959, ‘A’ is inserted into the record for 1958;
the year of birth is pre-1958. Likewise, bird 280 fledged at Eynhallow in 1963 and
was next recorded nesting there in 1970, ’71 and ’72. Bird 284 was first recorded in
1964 as a nesting adult, so ‘A’ is inserted into the record for the previous year. This
bird was not present at Eynhallow for the following six years. Given that fulmars
are faithful to their nesting place, it is presumed by ornithologists who study these
matters that no breeding occurred in those years either at Eynhallow or elsewhere.
10.1 The Eynhallow Colony 147
A leading zero in the sequence for one bird means that no nesting event occurred
in that year; the bird is classified as a juvenile. Trailing zeros mean that the bird
was not observed to be nesting, either because it had died or because it was no
longer sexually active. An internal zero means that the bird was alive and capable
of breeding, but did not nest in that year.
The breeding sequence for one bird is the subsequence beginning at the first non-
zero value and ending at the last non-zero value. The ornithological rationale for
dropping the leading zeros is that the bird is either not yet born or a juvenile. Since it
ultimately nests at Eynhallow, and it is presumed to be faithful to its nest, the leading
zeros imply that it is non-breeding in those years. In the case of trailing zeros, we
know only that the bird was not seen at Eynhallow. If it is still alive, it is presumed
not to be an active breeder in those years; at some point it must be presumed dead.
The breeding sequence is restricted by design to the years 1958–1996.
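The extraction of a breeding sequence from a mark-recapture record can be sketched as follows (in Python; the record format is as shown above, a string of digits with the letter ‘A’ marking the year of first observation):

```python
def breeding_sequence(record):
    """Extract the breeding sequence from a mark-recapture string.

    The record is a string over {'A', '0', '2', '3', '4'}; the breeding
    sequence runs from the first non-zero digit to the last non-zero digit.
    """
    digits = [int(ch) for ch in record if ch != "A"]
    nonzero = [i for i, d in enumerate(digits) if d != 0]
    if not nonzero:          # bird never returned to breed: empty sequence
        return []
    return digits[nonzero[0]: nonzero[-1] + 1]

# Record for bird 116 as shown in the text.
seq = breeding_sequence("A00444224004040000000000000000000000000")
print(seq)  # prints [4, 4, 4, 2, 2, 4, 0, 0, 4, 0, 4]
```

The 13 birds that fledged at Eynhallow but never returned have records with no non-zero digit, and the function returns an empty sequence for them.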
The breeding sequences for the four birds shown above are
Y (116) = (4, 4, 4, 2, 2, 4, 0, 0, 4, 0, 4)
Y (209) = (2, 0, 0, 4, 3, 3, 4, 4, 4, 4, 4)
Y (280) = (3, 2, 4)
Y (284) = (4, 0, 0, 0, 0, 0, 0, 2, 4, 0, 4, 0, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 4, 4, 0, 0, 2, 0, 0, 0, 2)
Each sequence is associated with a bird, and each sequence has a length in years.
Each component in a sequence is associated with a calendar year; 1961–1971 for
bird 116, 1959–1969 for bird 209, and so on. Each component in the sequence is
also associated with a breeding age beginning at one for the first component. For
present purposes, b_age is the age relative to the first egg-laying event; 1–11 for
birds 116 and 209; 1–3 for bird 280, and so on.
Thirteen of the birds in this sample were born at Eynhallow but did not return
as adults; their sequences are empty and do not contribute to any subsequent plots
or analyses. The first and last components of each non-empty sequence are strictly
positive; internal values may be zero. The non-empty sequences range in length
from one to 33 years. Their combined length is 3785 bird-years.
Unless one plans to use a purpose-built computer package that is capable of
digesting the file in raw mark-recapture format, it is helpful to rearrange the
concatenated sequences in spreadsheet format with 3785 rows. The first few lines
are as follows:
bird_id yr age b_age rev_age outcome
116 61 3 1 11 4
116 62 4 2 10 4
116 63 5 3 9 4
116 64 6 4 8 2
116 65 7 5 7 2
116 66 8 6 6 4
116 67 9 7 5 0
116 68 10 8 4 0
116 69 11 9 3 4
116 70 12 10 2 0
116 71 13 11 1 4
193 59 1 1 2 4
193 60 2 2 1 4
194 59 1 1 9 3
194 60 2 2 8 0
194 61 3 3 7 0
The variable age is measured from the first recorded sighting, whereas b_age is
breeding age measured from the first record of an egg being laid. These variables
are equal for all blow-in breeders. Reverse age rev_age is breeding age measured
backwards from the last non-zero value, so that the series length is
b_age + rev_age - 1.
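The identity b_age + rev_age - 1 = series length can be checked directly; a minimal sketch (in Python, with variable names following the spreadsheet):

```python
def age_columns(seq_len):
    """b_age counts forward from the first egg-laying; rev_age counts
    backward from the last non-zero record.  On every row of a series,
    b_age + rev_age - 1 equals the series length."""
    b_age = list(range(1, seq_len + 1))
    rev_age = list(range(seq_len, 0, -1))
    return b_age, rev_age

# Bird 116 has a breeding sequence of length 11 (1961-1971).
b_age, rev_age = age_columns(11)
print(b_age[:3], rev_age[:3])  # prints [1, 2, 3] [11, 10, 9]
assert all(b + r - 1 == 11 for b, r in zip(b_age, rev_age))
```

These values match the first rows of the spreadsheet shown above for bird 116.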
The cohort sizes #Sa, the number of birds whose breeding lifetime is at least a
years, are as follows for a ≤ 16:
a      1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
#Sa  415  335  313  292  266  246  223  212  197  171  152  132  120  108   86   80
Fig. 10.1 Average reproductive score versus age in the top panel, and against reverse age coded
backwards in time in the lower panel. Points are scaled to reflect cohort size
same as the original cohort, the averages are different because bird i with sequence
length ni ≥ a contributes Yi,a to the average at a in the first panel, and Yi,ni−a+1 to
the average at −a in the second.
The plot of averages against age counted forward from ‘A’ is not shown, but the
pattern is essentially the same as the plot against breeding age.
Figure 10.1 shows clearly that the first and last values in each sequence are
substantially larger on average than intermediate values. This fact is fairly obvious
from the context because the value Yi,a for bird i at age 1 ≤ a ≤ ni is a
random variable taking values in {0, 2, 3, 4}. However, the event Yi,a = 0, which
has probability zero for a = 1 and a = ni , has strictly positive probability for
1 < a < ni . Figure 10.1 demonstrates clearly that the magnitude of these terminal
anomalies is a dominant feature in this process.
Apart from the initial value, subsequent averages in the first panel exhibit a slow
but definite increase in reproductive score as a function of breeding age, at least up
to age 25. This pattern should come as a surprise because this is a breeding cohort
whose average fertility, agility, ability and dexterity might be expected to decrease
as a function of age. It suggests either that fertility increases with age, which is
counterintuitive, or that frequent practice and parental experience are sufficient
to offset any decline in fertility or foraging ability. It would be heart-warming to
report this phenomenon as a triumph of avian wisdom and maternal experience over
senescence and declining fertility. But any conclusion along these lines would also
be a statistical misinterpretation of the facts.
In the second panel ǎ = ni −a+1 is the breeding age in years counted backwards,
while the averages are plotted against −ǎ = a − ni − 1, preserving left-to-right
temporal order for all sequences of a given length. Apart from the final point, the
averages in the second panel appear to decrease as a function of −ǎ, i.e., to decrease
as a function of the bird’s age over the last 10–15 years of observation. At first sight,
therefore, the two panels appear to tell contradictory stories about breeding success
of females as a function of age—a slight but steady increase with age in the first
panel, a moderately strong decrease with age in the second. This is a cohort paradox
which is resolved in Sect. 10.1.6.
To help understand the apparent anomaly in Fig. 10.1, the birds are first partitioned
into disjoint subsets according to series length, i.e., by breeding lifetime. For
example, 19 birds had reproductive lifetimes of 10 years; 20 birds had reproductive
lifetimes of 11 years, and so on. Figure 10.2 shows the pattern of average
reproductive scores for six subsets of birds whose series lengths were 7–12 years.
In all cases, the terminal values are the largest, and the first tends to be a little larger
than the last. For the intermediate values, no pronounced positive or negative trend
is evident. The intermediate values have the appearance of a stationary time series.
The average of the intermediate scores for series of lengths 3–15 are
1.24, 1.56, 1.57, 1.90, 1.55, 1.57, 1.38, 1.74, 1.48, 1.45, 1.47, 2.01, 1.72
Without claiming that these averages are independent or identically distributed, one
can make an informal check for a trend by computing the correlation with series
length. Whether we use the actual averages or their ranks, no evidence of a non-zero
correlation emerges. In this respect at least, the distribution of non-terminal values
appears to be unrelated to series length. Not only do the intermediate values have
the appearance of a stationary series, but the distributions for different lengths seem
[Fig. 10.2 Average reproductive score versus age for series of length 7–12 years.
Panels: seq_len=7 (nbirds=11), seq_len=8 (nbirds=15), seq_len=9 (nbirds=26),
seq_len=10 (nbirds=19), and two further panels for series lengths 11 and 12.]
to have similar first moments. Visual inspection of the panels in Fig. 10.2 suggests
that the second moments are also similar.
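The informal trend check described above can be reproduced from the thirteen quoted averages; the sketch below (in Python, computing the rank correlation by hand to avoid any package dependence) finds no evidence of a trend with series length:

```python
def ranks(xs):
    """Rank each value (1 = smallest); ties broken by position, adequate here."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Averages of intermediate scores for series of lengths 3-15, from the text.
avgs = [1.24, 1.56, 1.57, 1.90, 1.55, 1.57, 1.38, 1.74,
        1.48, 1.45, 1.47, 2.01, 1.72]
lengths = list(range(3, 16))
rho = pearson(ranks(avgs), ranks(lengths))
print(round(rho, 2))  # prints 0.23
```

With n = 13 pairs, a rank correlation of this size is well below the conventional 5% critical value (roughly 0.55), consistent with the conclusion in the text.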
The first panel of Fig. 10.1 shows that the mean reproductive score increases with
the bird’s reproductive age for ages a ≥ 2, suggesting that older birds achieve
greater success on average than younger birds. Success improves with experience!
However, the second panel exhibits exactly the same feature in reverse time. For the
same birds, the mean reproductive score increases with reverse breeding age
(ǎ ≥ 2), indicating that average success also improves as experience decreases. This
conclusion seems paradoxical, if not self-contradictory. It calls for a resolution.
Table 10.1 shows the forward- and backward averages in cohorts Sa , together
with the cohort size for a ≤ 12. Each average Ȳa or Z̄a is a mean of certain
reproductive scores, one score for each bird in the subset Sa ⊂ S1 . Some of these
scores are initial, some are final, and the remainder intermediate; Ȳ1 is the average
of initial values only, while Z̄1 is the average of final values only. In a sequence of
length one, the value is both initial and final, and contributes to both Ȳ1 and Z̄1 . In
the average over Sa for a ≥ 2, the fraction of values that are terminal is
(#Sa − #Sa+1)/#Sa, which is shown as a percentage in Table 10.1. The general
trend in the terminal
fractions is increasing as a function of a ≥ 2. Since terminal values are appreciably
larger on average than intermediate values, we should expect both Ȳa and Z̄a to
increase with age for a > 2, precisely as observed in Fig. 10.1.
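Using the cohort sizes #Sa listed in the preceding section, the terminal fractions can be computed directly. The sketch below (in Python) takes the fraction at age a ≥ 2 to be (#Sa − #Sa+1)/#Sa, my reading of the definition: a value in the cohort-a average is terminal exactly when the bird's sequence ends at age a.

```python
# Cohort sizes #S_a (number of birds with breeding lifetime >= a), a = 1..16.
S = [415, 335, 313, 292, 266, 246, 223, 212, 197, 171, 152, 132, 120, 108, 86, 80]

# Assumed reading: for a >= 2, the value Y_{i,a} in the cohort-a average is
# terminal exactly when bird i's sequence ends at age a, so the terminal
# fraction is (#S_a - #S_{a+1}) / #S_a.
terminal_frac = {a: (S[a - 1] - S[a]) / S[a - 1] for a in range(2, 16)}
print({a: round(100 * f, 1) for a, f in list(terminal_frac.items())[:4]})
# prints {2: 6.6, 3: 6.7, 4: 8.9, 5: 7.5}
```

The fractions computed this way drift upward with a, in line with the general trend described in the text.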
The exploratory analyses described in the preceding section suggest fitting a formal
linear model to the data in concatenated sequence form. Specifically, the response
for bird i is a sequence Yi,1 , . . . , Yi,ni of length ni ≥ 1. We regard breeding age as a
covariate and sequence length as a non-random constant, so that there is a fixed set
of ni observational units for bird i.
If the baseline is set at the initial egg-laying, then breeding age is a bona fide
covariate according to the definition in Sect. 11.3.1. Calendar year is also a baseline
variable, which is treated here as a relationship. However, it is difficult to offer any
comparable justification for treating sequence length as a covariate. Nonetheless,
the analysis in Sect. 10.1.5 suggests that this choice is not unreasonable.
Apart from the dominant terminal effects, any formal model must aim to take
account of possible trends in the mean plus non-trivial correlations associated with
various baseline relationships. Some possibilities are as follows:
1. mean score as a linear function of age E(Yi,a ) = β0 + β1 a with constant
coefficient β1 independent of sequence length;
10.2 Formal Models 153
The two covariates in the mean model are the normalized breeding age and the
indicator function for initial or final values. It is possible to include two separate
indicator functions, but this has not been done because a sequence of length one
deserves only one terminal increment, not two. The covariance model includes
calendar year as a block factor accounting for annual variations associated with
weather, with no temporal correlation for successive calendar years. Apart from
the block factor for birds, this model has no non-trivial temporal correlation for
successive observations on the same bird.
The fitted variance components are
all of which are significantly positive. The between-bird variance is nearly four
times the between-year variance, showing that maternal wisdom and avian skills
have greater impact on breeding success than annual weather fluctuations. The fitted
standard deviations are in the ratio 5:1:2.
The fitted terminal contribution to the mean is β̂2 = 1.362 with standard error
0.064, which is in line with Fig. 10.2. The fitted temporal trend in normalized age
is negative, β̂1 = −0.163 ± 0.107, but not significantly different from zero. This
analysis shows no evidence of a temporal trend in normalized age or in normalized
reverse age.
The block matrix for birds, δi,i′ in (10.1), implies that the covariance between
distinct observations on one bird is σ₂², which is constant as a function of temporal
separation |a − a′|. As it happens, the fit can be improved appreciably by allowing the
correlations to decrease with time. The block matrix δi,i′ is replaced with a suitable
Hadamard-product matrix such as

δi,i′ × e^{−|a−a′|/λ}
with range λ to be estimated. The data suggest λ ≈ 4.2 years, meaning that
the temporal correlation is reduced by half every three years. This covariance
modification improves the fit by approximately 75.0 log likelihood units, which is
very large, but it does not appreciably alter the principal conclusions.
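The half-life claim can be checked directly: the correlation e^{−d/λ} falls to one half at the separation d solving e^{−d/λ} = 1/2, i.e. d = λ log 2, about 2.9 years when λ = 4.2. A one-line check, using only the value quoted in the text:

```python
import math

# Correlation between observations d years apart under the fitted model,
# cor(d) = exp(-d / lam); the half-life solves cor(d) = 1/2, so d = lam * ln 2.
def correlation(d, lam=4.2):
    return math.exp(-d / lam)

half_life = 4.2 * math.log(2.0)   # about 2.91 years, i.e. roughly three
```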
10.2.2 Prediction
Whether or not the application calls for prediction in the literal sense, it is highly
desirable that the fitted model have that capability. As it stands, the linear Gaussian
model described in the preceding section is a family of distributions for sequences
of arbitrary fixed length—padded on the right with zeros if necessary. Such a model
has no capability for prediction because prediction implies that the ultimate series
length is unknown. The remedy, however, is relatively straightforward.
The sequence length is regarded as a random variable N taking values in the
natural numbers with probabilities P (N = n) = pn , independently for each bird.
The value N = ∞ is seldom important for prediction, but it is best included with
probability p∞ ≥ 0. The marginal distribution of series lengths can be estimated
either using the Kaplan-Meier estimator which takes account of right censoring, or
it can be estimated subject to a monotone hazard constraint on the expected hazards
in the last line of Table 10.1. The standard Kaplan-Meier estimator ordinarily
permits immortality p̂∞ > 0, but smoothed versions guaranteeing finiteness are
available. Conceptually, the sequence length N is generated first, and the values
Y1 , . . . , YN are generated according to the Gaussian model (10.1) using estimated
parameters as needed. The sequence may then be padded with zeros on the right.
With probability 1 − p∞ , the sequence thus generated contains finitely many non-
zero values. Given the initial sequence Y [k] = (Y1 , . . . , Yk ), it is straightforward
in principle to compute the conditional distribution or predictive distribution for
subsequent values.
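The length-distribution step can be sketched with a minimal product-limit (Kaplan-Meier) estimator, treating birds still active at the end of recording as right-censored. The counts below are made up for illustration, not taken from the study.

```python
def kaplan_meier(lengths, censored):
    """Product-limit estimate S(n) = P(N > n) for integer sequence lengths.

    lengths[i]  -- observed (or censoring) length for bird i
    censored[i] -- True if the bird was still breeding when recording stopped
    """
    surv, s = {}, 1.0
    for n in sorted(set(lengths)):
        at_risk = sum(L >= n for L in lengths)
        events = sum(L == n and not c for L, c in zip(lengths, censored))
        s *= 1.0 - events / at_risk
        surv[n] = s
    return surv

# made-up data: ten birds, three right-censored
lengths  = [1, 1, 2, 3, 3, 4, 5, 5, 6, 8]
censored = [False, False, False, True, False, False, True, False, False, True]
S = kaplan_meier(lengths, censored)
```

Because the longest record here is censored, the estimate never reaches zero, so p̂∞ > 0: the 'immortality' mentioned above. A smoothed version or a monotone-hazard constraint would be needed to force finiteness.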
The model described above with either fixed or random sequence lengths is open to
criticism on many fronts. The following list is by no means comprehensive.
1. Mis-match of state spaces: Each observation is a score in {0, 2, 3, 4} while all
model distributions are continuous on the entire real line.
2. Left censoring: Some of the birds were active breeders when the study com-
menced in 1958. In such cases, we cannot be sure that the first non-zero value for
that bird is the one recorded in 1958 or later.
3. Right censoring: Some of the birds were active breeders at the end of the
recording period in 1996. From the available data, we cannot be sure that the
bird did not nest at Eynhallow in 1996 or later.
4. Baseline or breeding age zero for bird i is defined as the calendar year min{t −
1 : Yi,t > 0}, which is a deterministic function of the breeding process. It is
difficult to regard this version of breeding age as a covariate in the spirit of the
definition in Sect. 11.3.1.
5. Faith and faithfulness: The record for some birds shows remarkable faithfulness
to the nesting site. Ornithologists assure us that all fulmars are equally faithful,
which implies that a bird nesting once at Eynhallow never nests elsewhere. But an
ounce of skepticism cures a ton of faith, and the most credulous statistician must
notice that nearly 20% of birds in this study are one-time breeders. Is it plausible
that such a large fraction of breeding females retire after one year? Without
further evidence, it is entirely possible that some of these one-time Eynhallow
breeders may be occasional or regular breeders elsewhere.
The first criticism is the most obvious, but it is also the least fundamental: all
estimates and most inferences are based on first and second moments. The fact that
the values are limited to the first few integers eliminates the possibility of heavy-
tailed distributions, so the worst consequences of non-normality are avoided.
The second and third criticisms are intrinsic to the design of ornithological
studies, which have many of the characteristics of mark-recapture designs. See
Sect. 10.3.
The last point is different in a fundamental way. If it prevails, skepticism leads to
a re-interpretation of all zeros in the sequence for each bird.
For this discussion, an observational unit is a slot (i, t) corresponding to bird i in
calendar year t. At that time, the bird may be (i) unborn, (ii) a fledgling or juvenile,
(iii) a breeder, or (iv) retired or dead. For slots in class (iii), the breeding score is
a number Yi,t in {0, 2, 3, 4}; for all other slots, Yi,t = 0. In the recorded sequence,
each slot contains a value Xi,t , not necessarily equal to Yi,t . Each non-zero slot
(i, t) for which Xi,t ≠ 0 is a breeder containing the recorded value Xi,t = Yi,t.
Each zero slot corresponds to a missing bird, either a non-breeder, or an unrecorded
breeder. If the zero slot corresponds to a non-breeder, the breeding score is also
zero Yi,t = Xi,t = 0; if it corresponds to a breeder, the breeding score Yi,t is not
necessarily zero. Although no record is available for this slot, faithfulness comes
to the rescue and allows us to infer that no egg was laid either at Eynhallow or
elsewhere: Yi,t = Xi,t = 0. In particular, each internal zero slot in the recorded
sequence is an unrecorded breeder.
In the absence of faithfulness, each breeding slot for which Xi,t = 0 corresponds
to a missing bird whose score Yi,t is not observed. Since this bird was recorded
in a previous or subsequent year, there are only two possibilities: either breeding
did not occur in which case Yi,t = 0, or breeding occurred elsewhere in which
case Yi,t > 0 is indeterminate. More importantly, however, the sampling scheme is
biased because observational units for which Yi,t = 0 tend not to occur in a breeding
colony. Without the faithfulness assumption, it is difficult to say much about the
fraction of zero values among adult birds.
For a bird with an encounter sequence such as

Ei = (0, 0, 1, 1, 0, 0, 1, 0),

the measurement Yi,t is available only at slots for which Ei,t = 1. Generally
speaking, Ei,t = 0 implies that Yi,t is unmeasured and unknown. However, if the
goal is to determine the survival distribution, the values for slots t = 5, 6 can be
inferred by continuity from the encounter sequence: Yi,3 = · · · = Yi,6 = 1. If
age is available for one slot, it can also be inferred for missing slots. Ordinarily,
measurements cannot be deduced from the encounter sequence alone.
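The continuity argument is easy to mechanize. A sketch (a hypothetical helper, not part of any mark-recapture package): the bird is certainly alive at every slot between its first and last encounter.

```python
# Infer the slots at which a bird was certainly alive from its encounter
# sequence (1 = encountered): everything from the first to the last
# encounter inclusive, as with slots t = 5, 6 in the example above.
def alive_slots(encounters):
    hits = [t for t, e in enumerate(encounters, start=1) if e == 1]
    return list(range(hits[0], hits[-1] + 1)) if hits else []

E = (0, 0, 1, 1, 0, 0, 1, 0)
```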
The statistical computer package MARK (Cooch & White, 2020) is designed
to analyze data from mark-recapture studies. The input data are coded in mark-
recapture format, one line per animal. Its goal is to accommodate statistical
complications such as variable capture and survival rates, right censoring, cohort
effects, and serial correlation of measurements.
The Eynhallow study has many of the features of a mark-recapture design. All
birds nesting at Eynhallow are captured, measured and released. Each nesting bird
generates a capture history and a measurement history. This is an observational
study with no treatment. Ideally, we would like to know each bird’s age, but this is
available only in relation to the first encounter. However, the scheme for capturing
birds favours breeding pairs, and seems to guarantee Yi,t > 0. Without heroic
assumptions such as those in Sects. 10.1 and 10.2, it seems difficult to say much
about the frequency of slots for which Yi,t = 0.
Dunnett (1991) gives a concise historical account of the background, initiation and
development of the study of fulmars at Eynhallow.
10.5 Exercises
10.1 Plot the average reproductive score against calendar year. Is the range of
annual averages large or small in relation to the reproductive scale 0–4? Does this
plot suggest serial correlation?
10.2 Each bird in this study has a sequence length in the range 0–xx. Compute the
histogram of sequence lengths. How many sequences are empty? What are the
average and the maximum lengths? What fraction of the birds are one-time breeders?
10.3 Let a be the vector of bird ages, and n the vector of sequence lengths so that
ǎ = n+1−a is reverse age. Show that span{1, a/(n+1)} is equal to span{1, ǎ/(n+
1)}. Hence justify the claim made about age reversal in the third of the list of options
in Sect. 10.2.
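A numerical illustration of Exercise 10.3 with made-up lengths and ages: since ǎ = n + 1 − a, the normalized reverse age satisfies ǎ/(n + 1) = 1 − a/(n + 1), an affine relation that forces the two spans to coincide.

```python
# Made-up data: two birds with sequence lengths 3 and 5; each entry is one
# (bird, age) observational unit, with its sequence length alongside.
n = [3, 3, 3, 5, 5, 5, 5, 5]
a = [1, 2, 3, 1, 2, 3, 4, 5]
a_rev = [ni + 1 - ai for ni, ai in zip(n, a)]     # reverse age

x  = [ai / (ni + 1) for ni, ai in zip(n, a)]      # normalized age
xr = [ri / (ni + 1) for ni, ri in zip(n, a_rev)]  # normalized reverse age

# xr = 1 - x componentwise, so span{1, x} = span{1, xr}
gap = max(abs(xri - (1 - xi)) for xi, xri in zip(x, xr))
```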
10.4 Show that the model (10.1) implies exchangeability of initial values Yi,1 ∼
Yj,1 for every pair of birds, whether recorded in the same year or in different years.
10.5 Show that the model (10.1) implies exchangeability of terminal values Yi,ni ∼
Yj,nj for every pair of birds, regardless of whether ni = nj and regardless of the
years in which these occurred.
10.6 In the light of the preceding exercises, discuss the pros and cons of using
normalized versus unnormalized age in the mean model (10.1).
10.7 Show that the extended model suggested in the last paragraph of Sect. 10.2
also has exchangeable initial values and exchangeable terminal values. Under what
conditions on the parameter do initial values have the same distribution as terminal
values?
10.8 What evidence is there in the data suggesting serial correlation in the year
effects? Can the fit be improved using a model containing non-trivial serial
correlation? Extend the model and report a likelihood-ratio statistic.
10.9 A breeding population is a set of birds Nt consisting of adults aged one year
or more. The annual mortality rate is a constant q = 1 − p from year one onwards.
Each surviving adult produces one offspring per year; the survival rate for offspring
is such that n individuals survive to age one. Show that the approximate size of the
breeding population is #Nt ≈ n/q.
What fraction of the breeding population Nt in year t are one-time breeders? The
set ∪1≤s≤t Ns consists of all birds that were adults in some year during the interval
1 ≤ s ≤ t. For large t, what fraction of this set are one-time breeders?
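The first part of Exercise 10.9 can be checked by simulation (illustrative parameters, not a substitute for the algebra): with survival p = 0.9 and n = 100 recruits per year, the breeding population settles near n/q = 1000.

```python
import random

# Each year: every adult survives with probability p, then n new
# one-year-olds are recruited.  The stationary size is
# n(1 + p + p^2 + ...) = n/q with q = 1 - p.
random.seed(2)
p, n, years = 0.9, 100, 400
ages = []
for _ in range(years):
    ages = [x + 1 for x in ages if random.random() < p]   # survivors age
    ages += [1] * n                                       # new recruits

q = 1 - p
expected = n / q          # 1000
```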
Chapter 11
Basic Concepts
11.1.1 Process
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 159
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_11
11.1.2 Probability
11.1.3 Self-consistency
The dismissive phrase ‘nothing more than a probabilistic description of the
function...’, which occurs at the beginning of the previous section, grossly underrates
the difficulty of the assigned task. To understand the difficulty, consider a longitudinal
design in which a given subject may be observed at an arbitrary finite collection of
time points t = {t1, . . . , tk} ⊂ R with t1 < t2 < · · · < tk. With all covariates fixed, it
is necessary to specify, for each k ≥ 1 and each t, the k-dimensional joint distribution
Pt(·) on S^k. Since the event (Y1, Y4) ∈ A × A′ is the same as the event
(Y1, Y2, Y4) ∈ A × S × A′, these distributional specifications are subject to logical
consistency conditions such as

P1,4(A × A′) = P1,2,4(A × S × A′)
             = P1,3,4(A × S × A′) = P1,2,3,4(A × S² × A′).
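Such consistency conditions hold automatically whenever the whole family of margins is derived from one joint distribution. A toy numerical check with S = {0, 1} and k = 4, using arbitrary illustrative probabilities:

```python
import itertools, random

# An arbitrary joint distribution for (Y1, Y2, Y3, Y4) on S^4, S = {0, 1}.
random.seed(0)
S = (0, 1)
w = {y: random.random() for y in itertools.product(S, repeat=4)}
tot = sum(w.values())
p = {y: v / tot for y, v in w.items()}

def prob(coords, values):
    """P(Y_c = v for each (c, v) pair) -- a margin of the joint p."""
    return sum(pv for y, pv in p.items()
               if all(y[c - 1] == v for c, v in zip(coords, values)))

# P_{1,4}(A x A') with A = {0}, A' = {1}, recovered three different ways:
lhs  = prob((1, 4), (0, 1))
mid  = sum(prob((1, 2, 4), (0, s, 1)) for s in S)
full = sum(prob((1, 2, 3, 4), (0, s, u, 1)) for s in S for u in S)
```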
11.2 Samples
11.2.1 Baseline
Every experiment and every observational study has a temporal component. The
baseline is the temporal origin or reference point marking the commencement
of the study. Mathematically speaking, the baseline is a point at which the
observational units have been assembled, together with all of the information about
them that is needed to specify the probability of arbitrary outcomes. Protocols for
experimentation and treatment assignment are registered at baseline. All statistical
inferences are based on probabilities, and the probability model is also registered at
baseline.
Generally speaking, the units available for study are not homogeneous. The
baseline information records sex, age and, in principle, anything else that is
available at baseline that can reasonably be deemed to have a bearing on outcome
probabilities. In practice, a certain restraint or professional judgement is needed to
decide what is likely to be relevant and what is not. Generally speaking, there is
little point in recording distinctive information about specific units unless we have
a plan for how it is to be used in the analysis, either in the immediate future with
current technology or in the distant future with more advanced technology. In a
field experiment, the geometric layout of the plots is ordinarily part of the baseline
information, and is almost always relevant in that it affects outcome probabilities.
Information about crop, treatment and yield in the previous season is sometimes
available and might be judged relevant if the new plots were well-aligned with the
previous plots. In a clinical trial with human patients, ethnic background might be
relevant as a block factor, but the number of letters in the patient’s name or the
primality of his or her ID code is unlikely to be considered relevant for clinical
outcomes.
For a randomized study, randomization occurs at or immediately after baseline.
The randomization protocol is registered at baseline, but the randomization outcome
is not. Model specification begins with randomization probabilities p(t) = pr(T =
t) for each treatment assignment vector t = (ti )i∈U , also called the treatment factor.
Even if one assignment list is a permutation of the other, two assignment vectors
t, t′ may have, and usually do have, different probabilities depending on baseline
information such as covariate or block structure. Most commonly, the randomization
is balanced with each treatment level occurring with equal frequency in each block.
Since the probability model is registered at baseline, i.e., pre-randomization,
the model specifies the joint distribution for treatment T and response Y . The
joint distribution implies a marginal distribution for treatment assignments, and a
conditional distribution t → P (· | T = t), which associates with each assignment
vector t a conditional distribution for the response. Randomization subsequently
produces a particular treatment configuration, and nearly every subsequent probabil-
ity computation uses that value. In general, the conditional probability P (A | T = t)
of the event Y ∈ A may depend on any and all registered baseline information.
The observational units are the objects u ∈ U on which variables are defined and
measurements may be made. Usually measurements are made only on a small subset
of observational units (the sample), so the phrase measurements may be made does
not imply that measurements have been made or that plans are afoot to make such
measurements.
The statistical universe almost always includes infinitely many extra-sample
units, notional or otherwise, for which probabilistic prediction may be required.
Sometimes each unit is a physical object such as a plot, a patient, a rat, a tree,
or an M-F pair of fruit flies. Sometimes the units are less tangible, such as time
points or time intervals for an economic series, or spatio-temporal points or intervals
for a meteorological variable such as temperature or rainfall. Very often, the set of
observational units is a Cartesian product set such as
which contains 12 observational units for each mouse. As an index set, time is
structured cyclically in a similar way:
The index set may be structured in other ways such as pupils within classrooms
within schools, which is a nested or hierarchical structure defined by one or more
relationships R(u, u′) on the units.
11.2.3 Population
The population U is the set of observational units; the sample is a finite subset.
Where necessary, the sample may be extended to include units for which obser-
vations are unavailable but response predictions are requested. In a meteorological
context, the observational units are all points in the plane or sphere, or points in
In a clinical trial for a Covid-19 vaccine, the units available for recruitment are
individuals who are alive and of a suitable age at the crucial time. It appears that
the Covid-relevant population is finite. However, there are at least two reasons to
reject the finiteness argument. The first is that the current population is very large.
It is difficult to put a precise figure on it, say 7.5–8.0 billion, and it is even more
difficult to explain why this number is biologically or mathematically relevant for
the assessment of drug safety or efficacy. The second argument is that the Covid-
19 relevant population is not restricted to the present, but also includes at least one
future generation. Given that some units are inaccessible, it is sufficient to take U
to be infinite, so that the mathematical set is large enough to accommodate every
conceivable demand, even beyond what is epidemiologically plausible.
The sample U ⊂ U is the finite subset of observational units on which the response
and other variables are recorded. Technically, U is a finite ordered list of units,
ordinarily distinct, and the recorded response Y [U ] is the list of Y -values for u ∈ U
in the same order.
To be clear, the word ‘sample’ in these notes denotes a finite ordered subset of
units. It does not imply a random sample, let alone a simple random sample. In most
research settings, such as a field trial or a laboratory experiment, a random sample of
any stripe is out of the question. In the case of biological populations, the sample is
a subset of units that are accessible today, so the inclusion probability is necessarily
zero for a great many units. Two samples consisting of the same units listed in a
different order are different; their distributions are different but they are statistically
equivalent for all inferential purposes.
In settings where prediction or interpolation is involved, it is necessary to
consider an extended sample U′, which includes U as a sub-sample. Each u ∈ U′ \ U
is called an extra-sample unit. Only the restriction Y[U] is actually observed.
Prediction refers to the conditional distribution of Y[U′] given the sub-sample
values Y[U]; point prediction refers to the conditional expected value.
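For a Gaussian process the predictive distribution is available in closed form: the conditional mean of the extra-sample values given Y[U] = y is μ′ + C V⁻¹(y − μ). A hand-computable sketch with two observed units and one extra-sample unit, using toy numbers rather than the models of earlier chapters:

```python
# Conditional (predictive) mean for one extra-sample unit u3 given
# observations on U = {u1, u2}: mu3 + c^T V^{-1} (y - mu), where V is the
# 2x2 covariance of Y[U] and c holds the cross-covariances with Y_{u3}.
def predictive_mean(mu, V, c, mu3, y):
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    Vinv = [[ V[1][1] / det, -V[0][1] / det],
            [-V[1][0] / det,  V[0][0] / det]]
    r = [y[0] - mu[0], y[1] - mu[1]]
    w = [Vinv[0][0] * r[0] + Vinv[0][1] * r[1],
         Vinv[1][0] * r[0] + Vinv[1][1] * r[1]]
    return mu3 + c[0] * w[0] + c[1] * w[1]

# exchangeable toy covariance: unit variances, correlation 0.5 throughout
yhat = predictive_mean(mu=[0.0, 0.0], V=[[1.0, 0.5], [0.5, 1.0]],
                       c=[0.5, 0.5], mu3=0.0, y=[1.0, 1.0])
```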
In a classical controlled design with a treatment factor having k ≥ 2 levels, each
unit consists of a subject i together with an assignment i → t(i), or u = (i, ti ), so
(i, 0) and (i, 1) are distinct units having the same subject. Each sample is a finite set
I = {i1, . . . , in} together with one of k^n assignments t : I → [k]. The sample

U = ((i1, t(i1)), . . . , (in, t(in))),
is a finite list in which {i1 , . . . , in } are distinct. Ordinarily, I is a fixed set of subjects,
and t is determined by randomization.
The counterfactual setting differs from the classical setting in that each sam-
ple is an unrestricted finite collection of individual assignments, so i1 , . . . , in
need not be distinct. Every classical sample is also a counterfactual sample,
11.2.6 Illustrations
In the discussion of Example 1, it was asserted that each observational unit is a site
on a rat, i.e., a (rat, site) pair, and the response is a real number, i.e., the state space is
the real numbers. However, one could argue that each rat is one observational unit,
and the state space is R5 . At first glance, these appear to be equivalent.
What makes one choice more appropriate than the other is the nature of the
five measurements on each rat. If these were five otherwise unrelated variables
such as pulse rate, temperature, weight and blood pressure, each rat would be one
observational unit, and the state space would be R5 . However, the observation
consists of one biological variable measured at five sites. Although we do not
necessarily expect the five measurements on one rat to be exchangeable or even
to have the same expectation, the nature of the observation process—using the same
instrument for each site—confers additional symmetry that would not otherwise be
present.
For one rat, either choice leads to a response distribution on R5 . The difference is
that the second version with five units per rat has more natural symmetries than the
first. These symmetries arise from notionally permuting the units in various ways.
For example, the model used in Example 1 has equal variances for all sites, and
equal covariances for each pair of sites, which comes implicitly from assumptions
about permuting sites. If we choose the rat as the observational unit, there is no
possibility to permute sites, so these symmetries do not emerge as a consequence of
permutation of units.
In Example 3, each observational unit was taken initially to be a mating event.
But this was subsequently shown to be inappropriate for the design, and misleading
for the analysis. Instead, it was deemed preferable to take one mating well as the
observational unit.
For the daily temperature series, each observational unit for the analysis in
Chap. 6 is a point in calendar time, consisting of a year and a date within the year.
Date is a number in the range 1–365 having cyclic structure, i.e., a real number with
addition modulo 365.
For the frequency analysis in Chap. 7, each observational unit is a Fourier
frequency. These also come with harmonic structure such that frequency ω is
associated with its harmonics {ω, 2ω, . . .}.
In Examples 1–9, one observational unit is (i) a site on a rat; (ii) a (log, saw)
or (log, team) pair; (iii) a mating well; (iv) a (plant, date) pair; (v) a louse; (vi) a
point in calendar time; (vii) a frequency; (viii) a language; (ix) a plot or a leaf. The
population is some set of observational units, and there is no compelling reason
in any example to restrict the population to a finite set. Many of these choices are
relatively straightforward from the definition given, but it is clear in several instances
that other choices are possible. Example 10 is conceptually more complicated
because the observational units are bird-year pairs, and the sampling scheme is
restricted to pairs that occur in a breeding colony. For such pairs the breeding activity
response is predominantly non-zero, which means that the inclusion probability is
not independent of the response.
11.3 Variables
Quantitative Variable
Qualitative Variable
Response
The response, usually denoted by Y , is the variable of primary interest, the variable
that is measured or recorded on the sample units, e.g., yield in kg. per unit area,
or time to failure in a reliability study, or stage of disease, or severity of pain, or
death in a 5-year period following surgery. There may be secondary or intermediate
response variables such as compliance with protocol in a pharmaceutical trial, which
are also part of the response. Synonyms and euphemisms include yield, outcome and
end point.
In statistical work, the response is regarded as the realized value of a random
variable, or process u → Yu, taking values Yu ∈ S in the state space. For an
Covariate
second part associated with treatment, the remainder being ‘unexplained’ or residual
variation. The part associated with covariates and block factors, the between-blocks
variation, is said to be ‘eliminated’, and the more variation eliminated the less
remains to contaminate the estimates of treatment contrasts. A covariate or block
factor is said to be effective for this purpose if the associated mean square is
substantially larger than the mean squared residual. This means that the response
variation within blocks, the intra-block mean square, should be appreciably smaller
than the response variation between blocks, the inter-block mean square.
In practice, it may be acceptable to fudge matters by using as a covariate
a variable measured post-baseline before the effect of treatment has had time
to develop, or an external variable whose temporal evolution is known to be
independent of treatment assignment for the system under study. Louse sex in
Chap. 5 is a simple, uncontroversial, example of a post-baseline variable, which
is not statistically independent of the response (louse size), but whose evolution is
‘known to be’ independent of both treatment assignment and louse size.
At a minimum, it is necessary first to check that the variable in question is
indeed unrelated to treatment assignment; otherwise its use as a covariate could
be counterproductive. It is well to remember that while measurement pre-baseline
is strong positive evidence that no statistical dependence on treatment assignment
exists, the most that can be expected of a post-baseline measurement is absence of
evidence. For a variable of dubious status, absence of evidence is considerably better
than its complement, but it does not provide the same positive assurance as evidence
of absence. A concomitant variable of this sort is not counted as a covariate in these
notes. It is formally regarded as a component of the response whose dependence
on treatment assignment is to be specified as a part of the statistical model. The
dependence may be null, but that alone does not give it the status of a covariate.
As always, a probability model P allows us to compute whatever conditional
distribution might be needed for inferential purposes. That includes the conditional
distribution given any concomitant or intermediate outcome or the conditional
distribution of health values given that the patient is alive, or the conditional
distribution of the cholesterol level given that the patient has complied with the
protocol, or even the probability of compliance given the cholesterol level. Whether
these are the relevant distributions for the purpose at hand is an entirely different
matter to be determined by the user in the given setting.
Treatment
even for distinct experimental units, are usually identically distributed, but seldom
independent.
In computational work, the observed treatment configuration t = (Tu )u∈U is
called the treatment factor. Although T is defined only for sample units, we must
bear in mind that the sample can always be extended indefinitely, at least in
principle, so the restriction to U is not a major part of the distinction between a
classification factor and a treatment factor. The important distinction is that a pre-
baseline variable is a property of the units, whereas treatment is assigned to units at
baseline.
External Variable
11.3.2 Relationship
say whether this block factor is pre-baseline or post-baseline. Similar remarks might
be made about the colony factor in Sect. 9.3.
Block Factor
i.e., the subset of sample units having the same x-value as unit 1, has the label
x(1). Since the blocks of B are unlabelled, a block factor has no reference level or
reference block.
At the risk of over-simplification, covariates typically occur in the model for
the mean response; block factors and other relationships occur in the model for
covariances.
In principle, there may exist relationships among triples or k-tuples of units. For
example, the cross-ratio

χ(z1, z2, z3, z4) = (z1 − z2)(z3 − z4) / ((z1 − z3)(z2 − z4))
Zt ⊥⊥ Y^(t−1) | Z^(t−1)    (11.1)
Yt ⊥⊥ Z^(t−1) | Zt, Y^(t−1)    (11.2)

Under (11.1) and (11.2), the joint density factorizes as

p(z^(T)) × ∏_{t=1}^{T} p(yt | Zt, Y^(t−1)).

The stronger condition

Yt ⊥⊥ (Z^(t−1), Y^(t−1)) | Zt    (11.3)
severely limits the nature of the temporal dependence in Y . In this case the second
factor in the joint density simplifies further to
p(y | Z) = ∏t p(yt | Zt).
The health of an asthmatic patient may depend on recent local weather, but the
evolution of weather patterns is, to an adequate approximation, independent of the
health of patients. It is obvious in this setting that only local weather patterns matter,
and recent is more important than not-so-recent, but it is less obvious that only
synchronous weather matters, so (11.2) is dubious. Certainly, one would not expect
(11.3) to hold for values measured at moderate to high frequency. Similar remarks
could be made regarding investors in the stock market.
In a statistical model with parameter θ ∈ Θ, a distributional factorization such as
that following from (11.1) or (11.3) may hold for every θ. In that circumstance, it is
usually possible to express Θ = Θ1 × Θ2 as a Cartesian product, and to express
the likelihood function as a product of two factors, only one of which involves
the parameter of interest. Such a likelihood factorization can lead to substantial
simplification for parameter estimation.
11.4.1 Randomization
Pθ(T = t; x) = Pθ′(T = t; x)    (11.4)

for all values t and all θ, θ′ in the parameter space. In this setting, the covariate
configuration includes both relationships and initial values: see Sect. 13.2.
Full specification allows for considerable latitude. However, some designs are
more efficient than others for treatment comparisons. For a wide range of reasons, it
is best to keep the randomization protocol as simple as possible subject to efficiency
considerations.
In agricultural field trials, including horticultural trials, the randomization proba-
bilities usually depend on the block structure and covariate configuration occurring
in the sample units. For a typical randomized-blocks design, the joint probability
that the pair (u, u′) is assigned treatment levels (t, t′) depends on whether the
units belong to the same block or different blocks. More generally, the probability
pr(u → t; U) that treatment level t is assigned to unit u may depend not only on
xu but also on xu′ for all other units u′ ∈ U. Unless otherwise specified, we assume
in these notes that the assignment probabilities pr(u → t; U ) > 0 are strictly
positive for every unit and every treatment level. Although these probabilities may
be positive for every unit, they are not usually positive for pairs of units and pairs of
treatments: in a crossover trial, the probability of one individual being assigned the
same treatment on both occasions is usually zero.
In cases where the components of T are independent, the randomization distri-
bution may depend on baseline covariates or classification variables such as sex.
For example, a two-level treatment may be assigned in the ratio 1:2 for males and
2:1 for females. Ordinarily, a deliberately unbalanced design of this sort causes no
problems in the analysis, except perhaps for a reduction in efficiency.
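The covariate-dependent assignment just described can be sketched in a few lines; the 1:2 and 2:1 ratios come from the text, while the treatment labels and sample sizes are invented for illustration:

```python
import random

# Hypothetical sketch: treatment assigned independently per subject,
# with probabilities depending on the baseline covariate sex
# (ratio 1:2 for males, 2:1 for females, as in the text).
ASSIGN_PROB = {"M": {"A": 1/3, "B": 2/3},
               "F": {"A": 2/3, "B": 1/3}}

def assign(sex, rng):
    """Return a treatment level for one subject of the given sex."""
    return "A" if rng.random() < ASSIGN_PROB[sex]["A"] else "B"

rng = random.Random(0)
subjects = ["M"] * 3000 + ["F"] * 3000
treatment = [assign(s, rng) for s in subjects]

# Every (sex, level) probability is strictly positive, and the
# empirical male fraction on level A should be close to 1/3.
frac_A_male = sum(t == "A" for s, t in zip(subjects, treatment) if s == "M") / 3000
print(round(frac_A_male, 2))
```

The assignment probabilities are strictly positive for every unit, as required by the preceding paragraph, but deliberately non-uniform across the sexes.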
Randomization conventions vary greatly from one area of application to another.
Ordinarily, the expectation in clinical trials with human subjects is that treatment
be assigned independently of covariates and independently of initial values: see
Sect. 13.2. Although the mathematics requires neither, uniformity and independence
of initial values can be justified on grounds of efficiency of estimation. But the
more compelling reasons for uniformity and independence are more psychological
than mathematical. The essence of the matter is the need for a clear and credible
account that is sufficiently compelling to convince a skeptical reader, and for that
purpose, simplicity is at least as important as efficiency. A treatment assignment that
is exactly or approximately 50:50 for both sexes needs no explanation; a treatment
assignment that is 60:40 for men and 40:60 for women will certainly invite scrutiny
and skeptical commentary from reviewers. Unless otherwise stated, randomization
probabilities in clinical work are invariably assumed to be independent of covariates
and initial values.
For a discussion of why and how to randomize, see Sect. 5.8 of Cox (1958),
Chap. 2 of Cox and Reid (2000), or Sect. 2.2 and Chap. 14 of Bailey (2008).
The experimental units are the objects to which treatment is assigned, i.e., two
distinct experimental units may be assigned different treatment levels. Or, to say the
same thing in a different way, two distinct experimental units are assigned different
treatment levels with strictly positive probability. Each experimental unit consists
of one or more observational units, e.g., one mouse consisting of four legs, or one
classroom consisting of 20–40 students in the preceding example.
Two observational units u, u′ belong to the same experimental unit if the
randomization scheme necessarily assigns them to the same treatment level. In
mathematical terms, R(u, u′) = 1 if and only if T (u) = T (u′) with probability one.
11.4 Comparative Studies 177
so the treatment effect is the same for every pair such that x(u) = x(u′), whether
u′ = u or not.
11.4.4 Additivity
For non-Gaussian responses, and even for Gaussian models, it may be necessary
first to apply a transformation to achieve additivity. For example, if Yu ∼ Ber(πu )
is a Bernoulli variable, the logistic model
exhibits additivity on the logistic scale. The coefficient vectors α, β are called
effects.
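A numerical sketch of additivity on the logistic scale, with invented treatment effects α and block effects β, shows that the log-odds difference between treatment levels is the same in every block even though the probabilities themselves differ:

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical effects: alpha for treatment levels, beta for blocks.
alpha = {"control": 0.0, "active": 1.2}
beta = {1: -0.5, 2: 0.3, 3: 0.9}

# Additive model on the logistic scale: logit pi_u = alpha[a] + beta[b].
pi = {(a, b): expit(alpha[a] + beta[b]) for a in alpha for b in beta}

# Additivity: the treatment effect on the log-odds scale is the
# same (1.2) in every block.
for b in beta:
    diff = logit(pi[("active", b)]) - logit(pi[("control", b)])
    print(b, round(diff, 6))
```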
Additivity usually refers to the mean model, but it can also refer to random-effects
models. For example, if A is a treatment factor and B is a block factor, the
Gaussian model
11.4.5 Design
The word design refers to the arrangement of the sample units by blocks, by covari-
ates, and by restrictions on treatment assignment. Nelder (1965a,b) distinguishes
two aspects of the design, the structure of the units, meaning relationships among
them, and the treatment structure, which is imposed on them. In a crossover design,
where the same physical object occurs as a distinct experimental unit on several
successive occasions, the structure of the units includes not only the temporal
sequence, but also a block factor whose blocks are the distinct physical objects.
In a field experiment, the structure of the units includes the geometric shape of each
plot, their physical arrangement in space, and the width of access paths or guard
strips separating neighbouring plots.
11.4.6 Replication
11.4.7 Independence
11.4.8 Interference
If the response Yu for one unit is statistically independent of the treatment applied
to other units, we say there is no interference, or no pairwise interference. Lack of
Note that x is not a random variable, so we have not written this as a conditional
probability statement.
If the measurements were weights at birth rather than later at six weeks, the
baseline would necessarily have to be pre-natal, implying that family size X is
a part of the response, not a covariate recorded at baseline. In that setting the
response Y is a random variable taking values in S, and the response distribution
F determines the distribution of X by pr(X = k) = F (R^k) (including k = 0). The
conditional distribution given X is a function that associates with each integer k ≥ 0
a probability distribution F (· | X = k) such that F (R^k | X = k) = 1.
In a study of survival times following surgery, each patient is one unit, and the
response is a survival time Yu > 0, which is, prima facie at least, a point in R+ ,
the positive real line. Only the most persnickety mathematician would bother to add
a point at infinity to cover the remote possibility of immortality, which cannot be
ruled out solely on mathematical grounds. However, the response Yu^(t) as it exists
today or at the time of analysis, say t = 1273 days post-recruitment, is either a
failure time in the interval t⁻ = (0, t], or a not-yet-failure corresponding to the
‘point’ t⁺, which is required to exist as a point in the state space for today. In other
words, S^(t) = t⁻ ∪ {t⁺}, the union of a bounded interval and a topologically isolated
‘point’ exceeding each number in the interval. The limit space S^(∞) = R⁺ ∪ {∞}
differs from R⁺ by one isolated point that exceeds every real number.
To say the same thing in another way, the state space is a filtration that evolves
as an increasing σ -field in calendar time.
To every non-negative measure Λ on the positive real numbers there corresponds
a distribution F on S^(∞) given by F (t⁺) = exp(−Λ(t⁻)), where t⁺ is the
complement of t⁻ in S^(∞). Usually, F is called the survival distribution, and Λ is
the hazard measure. If the total hazard Λ(R⁺) is finite, the atom of immortality
F ({∞}) = exp(−Λ(R⁺)) is strictly positive; otherwise the atom is zero. With
respect to the state of information at time t, the probability density at y ∈ S^(t) is
Λ(dy) exp(−Λ(y⁻)) for 0 < y ≤ t, and exp(−Λ(t⁻)) for y = t⁺. In particular, if Λ
is proportional to Lebesgue measure on R⁺, the density is λe^{−λs} ds for 0 < s ≤ t
with an atom e^{−λt} at t⁺.
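The exponential special case can be checked by simulation; λ and t below are arbitrary, and the fraction of observations censored at t⁺ should approximate the atom e^{−λt}:

```python
import math
import random

lam, t = 0.8, 1.5          # hazard rate and time of analysis (arbitrary)
rng = random.Random(1)

n = 200_000
censored = 0
for _ in range(n):
    y = rng.expovariate(lam)   # survival time Y ~ Exp(lam)
    if y > t:
        censored += 1          # observed only as the 'point' t+

# The mass at the isolated point t+ should match exp(-lam*t); the
# density on (0, t] carries the remaining mass 1 - exp(-lam*t).
atom = math.exp(-lam * t)
print(round(censored / n, 3), round(atom, 3))
```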
Being alive at the time of analysis is one unavoidable form of censoring. In
practice, some patients disappear off the radar screen at a certain point t > 0, and
their subsequent survival beyond that time cannot be ascertained. These also are
typically regarded as censored at the last time they were known to be alive.
In a longitudinal study, also called a panel study, each physical unit is measured at a
sequence of time points. Growth studies, of plants or of animals, are of this type, the
response Y (i, t) being height or weight of unit i at time t. Usually the design calls
for measurements to be made at regular intervals, but in practice the intervals tend
to be irregular to some degree, particularly for studies involving human subjects.
A typical longitudinal design has a large number of subjects measured on a
relatively small number of occasions. The first of these measurements is made at
baseline or pre-baseline. If the experiment has a randomized treatment assignment, the first
measurement is ordinarily pre-randomization before the treatment is decided, and
certainly before it can have had an effect. In the modelling and analysis, it may be
necessary to include a null treatment level to denote pre-randomization status; this
level is in addition to the control and active post-baseline levels.
11.5.1 Examples
A function x : U → [k] taking values in the finite set [k] = {1, . . . , k} determines a
partition of the units into k disjoint subsets U1, . . . , Uk, called strata or blocks:
Ur = {u : x(u) = r}.
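In computational terms, the stratification is a one-pass grouping; the unit labels and the classification variable x below are invented for illustration:

```python
# Minimal sketch: a classification variable x on a finite set of units
# determines strata U_r = {u : x(u) = r}.
units = ["u1", "u2", "u3", "u4", "u5", "u6"]
x = {"u1": 1, "u2": 3, "u3": 1, "u4": 2, "u5": 3, "u6": 1}

strata = {}
for u in units:
    strata.setdefault(x[u], []).append(u)

# The strata are disjoint and their union is the whole set of units.
print(strata)  # {1: ['u1', 'u3', 'u6'], 3: ['u2', 'u5'], 2: ['u4']}
```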
In general, U may be finite or infinite; if U is not finite, at least one stratum is also
not finite. In practice, if U is infinite, all of the strata are also infinite.
For current-population sampling applications, U is finite; to a close approxima-
tion x is known from the preceding census, so the strata sizes are also known in the
same sense. Every classification variable such as sex determines a stratification;
every pair of variables such as (sex, location) determines a finer stratification,
and so on. For example, location might have levels rural, suburban, urban. The
classification variables that are available for survey-sampling are mostly restricted
to those recorded in the census.
11.5.3 Heterogeneity
Heterogeneity means that the distribution of response values in one stratum is not
the same as the distribution in another stratum, or at least similarity is not to be
assumed. The implication, ironically, is that the values within each stratum can
[n] --ϕ--> [N] --Y--> R,
11.5.6 Accessibility
It is possible to select a finite random sample from an infinite population. But simple
random sampling and stratified sampling are possible only for finite populations. In
11.5 Non-comparative Studies 185
practice, any form of random sampling is feasible only for the sub-population that is
currently accessible. For example, a population consisting of a lineage of breeding
flies that evolves in time is only partly accessible in any bounded temporal window.
μπ = π1μ1 + · · · + πkμk
In a stratified population, the target of estimation is usually the stratum mean vector
μ = (μ1 , . . . , μk ). However, there are various applications, particularly related
to marketing, opinion polling and voting, where the democratic average plays an
outsize role. In the run-up to a crucial plebiscite such as the Brexit referendum,
the democratic average of voter preferences looms so large that between-stratum
variation is of little consequence.
Note that the stratum relative proportions 5m:4m:3m are close to the sample
fractions 4:3:3, but not exactly the same. The stratum averages for this sample
are ȳ = (0.275, 0.467, 0.833), and the population-weighted linear combination of
stratum averages is
where the weights are inversely proportional to the first-order sample inclusion
probabilities (Horvitz & Thompson, 1952). Each urban voter has a sample inclusion
probability 400/5m, so wi = 5/400; each suburban voter has inclusion probability
300/4m, so wi = 4/300; and each rural voter has inclusion probability 300/3m, so
wi = 3/300. The sum of these weights is 12, and the linear combination is displayed
in the preceding paragraph.
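The arithmetic of this passage can be reproduced directly from the stratum sizes, sample sizes, and stratum averages given in the text:

```python
# Reproducing the Horvitz-Thompson arithmetic from the text:
# stratum population sizes (5m, 4m, 3m), sample sizes, and averages.
N = {"urban": 5_000_000, "suburban": 4_000_000, "rural": 3_000_000}
n = {"urban": 400, "suburban": 300, "rural": 300}
ybar = {"urban": 0.275, "suburban": 0.467, "rural": 0.833}

# Each sampled unit carries weight 1/(inclusion probability), scaled
# here by 1/1_000_000 as in the text (5/400, 4/300, 3/300).
w = {s: (N[s] / n[s]) / 1_000_000 for s in N}
total_weight = sum(w[s] * n[s] for s in N)          # 5 + 4 + 3 = 12
estimate = sum(w[s] * n[s] * ybar[s] for s in N) / total_weight

print(total_weight, round(estimate, 4))
```

The weighted combination of stratum averages is (5 × 0.275 + 4 × 0.467 + 3 × 0.833)/12 ≈ 0.4785.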
that Stanford study of coronavirus prevalence. For present purposes, we set that
matter aside and suppose optimistically that the false-positive rate is zero.
Suppose that a similar set of numbers—suitably scaled to represent plausible
prevalences—had arisen in the COVID-19 antibody prevalence study.
and report only the county-wide antibody prevalence at 2.4%? I should hope not!
The crucial difference is not the numbers but the setting. For the political poll, the
current-population average is the natural target mandated by democratic principles
and supported by the force of law. In the epidemiological setting, the democratic
average or prevalence is a natural summary, but it does not carry an equivalent
epidemiological or legal mandate. Nor is it necessarily the most interesting summary
or the most striking feature to emerge from such a study. In the table shown above,
the observed prevalence in the rural community is more than four times that in
the urban community. Admittedly, the case numbers are small, so the ratio in the
population might not be so extreme. But a risk ratio or prevalence ratio as large
as 3–4 calls out for an explanation, and that finding could be more interesting
epidemiologically than the particular value of the county-wide prevalence.
The main point is that the exclusive focus on county-wide prevalence is a
distraction that has the potential to divert attention away from features that are
epidemiologically more interesting. Any epidemiologist who reported only the
prevalence of 2.4% would be derelict in his or her duty to draw attention to the
extreme variation in rates for urban versus rural communities. To conclude, inverse-
probability weighting is satisfactory as a summary statistic for a stratified population
in two circumstances only: either the democratic average is mandated by law; or the
degree of heterogeneity is moderate. In the latter case, the choice of stratum weights
matters little.
The Ewens sampling formula (Ewens, 1972) is the static probabilistic description
of an exchangeable process, which can be viewed as a sequence Y1 , Y2 , . . . of
species or types. In its original genetic form, Pn,α is the probability distribution
of the number N of distinct alleles and the multiplicity of each type occurring in a
sample of n individuals (technically haplotypes) taken from an infinite population
that evolves neutrally with mutation rate α. This combinatorial stochastic process is
a thing of uncommon mathematical beauty; it occurs in a surprisingly wide range
of mathematical and scientific applications from linguistic studies to genetics to
ecology and probabilistic number theory (Crane, 2016; Pitman, 2006; Tavaré, 2021).
Only two distributional facts are relevant to the present story.
The first fact is that the number of distinct types in a sample of n objects is equal
in distribution to the sum of n independent Bernoulli variables
N ∼ X1 + X2 + · · · + Xn ,
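where, in the standard form of this representation, Xi is Bernoulli with success probability α/(α + i − 1), independently for each i. The identity can be verified exactly by comparing the Bernoulli-sum distribution with the classical formula P(N = k) = |s(n, k)| α^k / α↑n involving unsigned Stirling numbers of the first kind; the parameters below are arbitrary:

```python
# Check: the number of distinct types in an ESF(alpha) sample of size n
# is distributed as a sum of independent Bernoulli variables with
# success probabilities alpha/(alpha + i - 1), i = 1..n.
alpha, n = 1.7, 8   # arbitrary parameter and sample size

# Distribution of the Bernoulli sum, by convolution.
dist = [1.0]
for i in range(1, n + 1):
    p = alpha / (alpha + i - 1)
    new = [0.0] * (len(dist) + 1)
    for k, q in enumerate(dist):
        new[k] += q * (1 - p)
        new[k + 1] += q * p
    dist = new

# Unsigned Stirling numbers of the first kind, via the recurrence
# s(m, k) = s(m-1, k-1) + (m-1) * s(m-1, k).
s = [[0] * (n + 1) for _ in range(n + 1)]
s[0][0] = 1
for m in range(1, n + 1):
    for k in range(1, m + 1):
        s[m][k] = s[m - 1][k - 1] + (m - 1) * s[m - 1][k]

rising = 1.0
for i in range(n):
    rising *= alpha + i          # ascending factorial alpha^{(n)}
esf = [s[n][k] * alpha**k / rising for k in range(n + 1)]

print(max(abs(a - b) for a, b in zip(dist, esf)))   # ~0 up to rounding
```

Both computations expand the same probability generating function ∏ᵢ (αx + i − 1)/(α + i − 1), so the agreement is exact up to floating-point rounding.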
Thirty years earlier, R.A. Fisher (1943), together with A.S. Corbet and
C.B. Williams, had considered the problem of determining the distribution of
the number of species to be found in a sample of n specimens. This very brief paper
is now considered to be a landmark contribution to mathematical ecology. It is not
exactly a joint paper in the modern sense, but rather a tripartite paper in which each
author makes a separate contribution. Fisher contributed the theory; Corbet and
Williams supplied the moths and butterflies.
Although Fisher’s derivation is totally different from Ewens (1972), and the
crucial concept of a process is absent, his gamma mixture model coincides in all
essential respects with the Ewens sampling model for fixed n. In particular, Fisher’s
formulae contain a species-diversity parameter α which coincides with the mutation
rate in genetic applications. His expression for the maximum-likelihood estimate
looks nothing like (11.5), but it is numerically equivalent, at least to the present order
of approximation. However, Fisher’s expression for the variance is very different
from (11.6). In essence, Fisher’s variance formulae are equivalent to the statements
var(N) = α log 2;   α̂ = N/ log n;   var(α̂) = α log 2/(log n)²,   (11.7)
for i ≠ j. Thus, the expected value of the sample variance of the species counts
from successive samples is

(m − 1)⁻¹ Σ_{j=1}^m E(Nj − N̄)² = (m(m − 1))⁻¹ Σ_{i<j} E(Ni − Nj)²
    = ½ E(N1 − N2)²
    = α log n − α log(n/2) = α log 2,

and, likewise, the expected sample variance of the estimates is

(m − 1)⁻¹ Σ_{j=1}^m E(α̂j − α̂¯)² = α log 2/(log n)²
in agreement with (11.7). The covariances come from the species that are common
to pairs of samples, each of which is distributed as Po(α log(n/2)), explaining the
origin of the mysterious log 2 factor. It seems safe now to conclude that Fisher
did not make a technical error; it looks as if his variance formula was meant to be
interpreted in this unorthodox way to ensure that it would be relevant to the specific
population in which the sample had been collected. Fisher gave no indication that he
had any formal concept of a stochastic process in mind, so the fact that he computed
and interpreted his variance in this way can only be regarded as remarkably far-
sighted.
11.6 Interpretations of Variability 191
The existence of a deterministic limit is known in the sense α̂∞ = α with probability
one, but the limit value is not revealed. Both forms of prediction, parametric
and non-parametric, refer specifically to extensions of the single trajectory that is
partially observed. In other words, both inferential tasks refer to the extension that
Fisher appears to have had in mind when he derived the formulae (11.7). Neither
task calls for independent replication for fixed α, which is the scenario that leads
to the conventional Anscombe formula (11.6), which is also the Fisher-information
formula or parametric bootstrap formula.
Fisher’s focus on the single-trajectory extension is entirely apposite. Moreover, he
implied that the mean squared difference between estimates from non-overlapping
samples of the same size satisfies
(m − 1)⁻¹ Σ_j E(Nj − N̄)² = α log 2,    E((α̂1 − α̂¯)² | N1) = α log 2/(log n)².
This line of argument makes it appear that Fisher’s unorthodox variance formula is
the correct variance for purposes of parametric inference in applications where there
is a strong serial correlation between successive samples. However, there is a catch.
The combined statistic α̂¯ is not the maximum-likelihood estimate based on the
combined sequence of length mn. It is the average of m estimates based on non-
overlapping samples of size n for which the variances and covariances are
var(α̂i) = α/ log n;    cov(α̂i, α̂j) = α/ log n − α log 2/(log n)²,

so that

var(α̂¯) = α/ log n − ((m − 1)/m) α log 2/(log n)²,
which, for fixed n, does not tend to zero as m → ∞. As an estimator, the average
of α̂1 , . . . , α̂m is not appreciably better than α̂1 . Although α̂¯ may have a limit as
m → ∞, it is not a deterministic limit and it is not equal to α.
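The variance algebra for the average of m equicorrelated estimates can be checked numerically; the values of α, n, m below are arbitrary:

```python
import math

# Numerical check of the variance algebra for the average of m
# equicorrelated estimates, using the var/cov formulas in this passage.
alpha, n, m = 2.3, 1000, 7

v = alpha / math.log(n)                                            # var(alpha_hat_i)
c = alpha / math.log(n) - alpha * math.log(2) / math.log(n) ** 2   # cov(alpha_hat_i, alpha_hat_j)

# Variance of the average of m identically distributed, equicorrelated
# estimates: var/m + (m-1)*cov/m.
var_mean = v / m + (m - 1) * c / m

# The closed form quoted in the text.
closed = alpha / math.log(n) - ((m - 1) / m) * alpha * math.log(2) / math.log(n) ** 2

print(abs(var_mean - closed) < 1e-12)
```

For fixed n the second term tends to zero as m → ∞, so var(α̂¯) → α/ log n − α log 2/(log n)², which does not vanish: the average is not appreciably better than a single estimate.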
The species counts N1 , . . . , Nm for successive samples do not determine the total
number Tmn of distinct species for the combined sample of length mn. Many species
are expected to occur in several samples, but the overlaps cannot be determined
from the marginal counts alone. Thus (α̂1 , . . . , α̂m ) is not a sufficient statistic for
the combined sample. In fact, the information in the marginal counts jointly is
negligible compared with that in Tmn : the Fisher-information numbers are O(log n)
and O(log n + log m) respectively.
The relevant inferential calculation focuses on the difference α̂1 − α̂∞ , for which
the mean square is

E(α̂1 − α̂∞)² = α/ log n.
11.7 Exercises
eight pairings. A given pair of ferrets may occur in two or more pairings in the same
season or in different seasons.
The following variables are recorded at birth for each ferret:
ferret_id, sex, birth_year, zoo_id;
There are six participating zoos, and zoo_id is the zoo of birth. The following
variables are recorded for each of the 1700 M-F pairings:
male_id, female_id, zoo_id, year, kinship, whelped;
11.3 For integer n ≥ 1, a partition B of the set [n] = {1, . . . , n} is a set of disjoint
non-empty subsets called blocks whose union is [n]. A partition into k blocks can
be written as B = {b1 , . . . , bk }, with the understanding that B is a set of subsets, not
a list of subsets. For example, 12|34, 13|24, 14|23 are distinct partitions of [4] into
two blocks of size two, and these are the only partitions of type 2 + 2. Equivalently,
B is an equivalence relation [n] × [n] → {0, 1}, reflexive, symmetric and transitive.
Let Pn be the set of partitions of [n]. List the elements in Pn for each n ≤ 5, and
show that #P1 = 1, #P2 = 2, #P3 = 5, #P4 = 15, #P5 = 52. These are called the
Bell numbers.
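A brute-force enumeration confirms the Bell numbers: the recursion places element n either in an existing block of a partition of [n − 1] or in a new block of its own.

```python
# Enumerate all partitions of [n] and verify the Bell numbers
# #P_1..#P_5 = 1, 2, 5, 15, 52 stated in the exercise.
def partitions(n):
    """Yield every partition of {1, ..., n} as a list of blocks."""
    if n == 0:
        yield []
        return
    for rest in partitions(n - 1):
        # element n either joins an existing block ...
        for i in range(len(rest)):
            yield rest[:i] + [rest[i] + [n]] + rest[i + 1:]
        # ... or starts a new block of its own.
        yield rest + [[n]]

bell = [sum(1 for _ in partitions(n)) for n in range(1, 6)]
print(bell)   # [1, 2, 5, 15, 52]
```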
11.4 One of the simplest static versions of the Ewens sampling formula is stated as
a probability distribution on the set of partitions of the finite set [n] as follows:
11.5 By direct calculation, show that the Ewens distributions satisfy the following
conditions:
Show that P4,α is the marginal distribution of P5,α when the element 5 is removed
from the set [5]. Hence calculate the conditional distribution P5,α (x | B) for x ∈ P5
and B = 1|3|24 and B = 13|24.
11.6 The simplest sequential description of the Ewens sampling formula is called
the Chinese restaurant process. The first customer arrives and is seated at a table.
After n customers have been seated, the next customer is seated alone with
probability α/(n + α); otherwise, the newcomer selects one of the seated customers
uniformly at random and sits at that table. Show that the configuration after n
customers are seated is given by Pn,α . Hence deduce that customers seven and nine
are seated together with probability 1/(1 + α).
11.7 To the order of approximation used in Sect. 11.6, show that the maximum-
likelihood estimate, α̂(Tn ), of the species-diversity parameter as a function of the
cumulative species count Tn , defines a martingale.
11.8 Let B ∼ Pn,α be the partition after n customers in the Chinese restaurant
process with parameter α, and let α̂(B) be the maximum-likelihood estimate.
One way to approximate the variance of α̂(B) is to generate bootstrap samples
B1∗, . . . , Bm∗ by simulation, and to report the sample variance of the bootstrap
The chief guiding principle in this book is the need for a mathematical framework
for thinking about measurements and how they are related to physical, chemical,
biological or environmental processes. Invariably, a measurement on a sample is
regarded as a random variable, so the mathematical framework is a stochastic
process or family of stochastic processes.
A Gaussian model expressed in the form
Consistency for covariances implies that the diagonal values σ1²(xi), σ1²(xj) are
determined by component-wise extension of the scalar function σ1²(·). The
off-diagonal value Σn(xi, xj) = Σ2(xi, xj) is a function only of covariates and
pairwise relationships; it is independent of the sample size and remaining sample
configuration.
These constraints are strong and restrictive, but they are also minimal and
mathematically natural. Without this form of consistency, the entire mathematical
edifice underpinning all of sampling theory crumbles, and statistical inference as
we have come to know it is not possible. Prediction in the sense of the conditional
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 197
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_12
198 12 Principles
distribution given the observation does not exist. Parameter estimation in the
sense of probabilistic statements about infinite averages or other tail events is also
impossible.
Kolmogorov consistency is not a matter of mathematical fact, nor is it a statement
of physical reality; it is an assumption that can be viewed as a statement of
mathematical sanity. It provides a framework for thinking about samples and
sub-samples, measurements on samples, and how they are related to physical,
chemical, biological or environmental processes. Without consistency, there is only
mathematical chaos.
Sampling Inconsistency
To all probabilists and statisticians, the need for consistent specification of probabil-
ities is self-evident and requires no emphasis and little explanation. However, this
sentiment is not universal in applied work. A few instances taken from the literature
on the spatial distribution of economic and business activities suffice as illustration.
Dong et al. (2015) discuss the problem of modelling the emerging property
market in Beijing, China. For real-estate purposes, each observational unit i is an
administrative region called a land parcel. There are n = 1117 such parcels, all
disjoint subsets, which appear to cover the entirety of greater Beijing. They are
partitioned into d = 111 districts of various sizes. For the most part, the districts
and the parcels are relatively compact and simply connected. Inter-parcel distances
dij are measured in km. from a representative point in each parcel; districts are
related by adjacency or contiguity.
Dong et al. begin with a simultaneous autoregressive spatial formulation, which
. . . has been extensively studied in the spatial economics literature and is widely
used in geographical research. A key characteristic . . . is that it allows the observed
value at a particular location to be directly dependent on the values observed at
surrounding locations (or lagged y). . . . Five references are cited in support of the
wide usage in geographical research. So far, so good.
The following components are simplified slightly in the interests of clarity. First,
W is a trace-free stochastic matrix, a non-negative function of distances whose row
sums are all one; roughly speaking, Wij = exp(−dij²)/ci for i ≠ j. Second, M is
a similar n × d contiguity matrix for districts. The simultaneous-equation rationale
begins with a literal interpretation of the quoted remark in the form of a stochastic
vector equation
Y = ρW Y + Xβ + Mη + ε, (12.2)
It follows that μi (x) depends not only on xi but also on xj for j ≠ i, in violation of
the most elementary consistency condition in the preceding section.
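A toy computation (all numbers invented) makes the dependence explicit: dropping the district term and the error, the implied mean is μ = (I − ρW)⁻¹Xβ, and perturbing the covariate at a single location changes every component of μ:

```python
import numpy as np

# Toy illustration (invented numbers) of the dependence claimed above:
# under Y = rho*W*Y + X*beta + eps, the implied mean is
# mu = (I - rho*W)^{-1} X beta, so mu_i depends on x_j for j != i.
rng = np.random.default_rng(0)
n, rho, beta = 5, 0.4, 1.0

W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)      # trace-free stochastic matrix

x = rng.random(n)
A = np.linalg.inv(np.eye(n) - rho * W)

mu = A @ (x * beta)
x2 = x.copy()
x2[0] += 1.0                           # change the covariate at unit 0 only
mu2 = A @ (x2 * beta)

print(np.abs(mu2 - mu) > 1e-9)         # every component changes
```

Since (I − ρW)⁻¹ = Σₖ ρᵏWᵏ has strictly positive entries here, the perturbation at one parcel propagates to the mean at every other parcel.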
A quick perusal shows that the use of simultaneous spatial autoregressive
formulations is not at all uncommon in parts of the economic and business literature.
An identical formulation was used by Cellmer et al. (2019) for land price evaluation
in the city of Olsztyn in northeastern Poland. Baltagi et al. (2014) use a more general
formulation of the same type with spatio-temporal weights to study the spatio-
temporal pattern of house prices in England.
Why is it that the sampling inconsistencies in (12.3), which are so repugnant
to many authors, are shrugged off so casually by others? It is not as if these
matters have gone unnoticed. Dong et al. remark that . . . changing the value of the
covariates at one location . . . will affect not only its own outcomes, but also the
outcomes at other locations. . . . They appear to regard this property as nothing out
of the ordinary—neither distasteful nor inappropriate. However, the later version
used by Fingleton et al. (2018) uses the residual Y − Xβ in place of Y in (12.2).
This amendment has the effect of removing the inconsistency from the mean,
which suggests a disapproval of (12.3). However, the inconsistency remains in the
covariances. The oddities of simultaneous autoregressive covariance matrices were
pointed out by Sen and Bera (2014). Those oddities are real enough, but they are
not to be confused with sampling inconsistencies.
One counter-argument to consistent sampling proceeds as follows. The formu-
lation (12.2) is not a sampling model like (12.1). It is not meant to be applied to
arbitrary samples, but only to the entire sample, i.e., to the population. In other
words, the population is finite, and the joint distribution is Gaussian with joint
moments as specified in (12.3) and (12.4). After all, there is only one Beijing. If
you want the joint distribution for any subset of units, you need only extract the
relevant means and covariances.
This interpretation is mathematically honest and logically unassailable. The
problem lies in the finiteness of the population. In effect, Beijing is declared to
be sui generis—a population of one. Admittedly, the Beijing real-estate valuation
space, R^1117, leaves plenty of room for computation. Plenty of room for an
accountant, that is, but none for a probabilist or statistician.
It is natural to extend the set of units geographically to include parcels beyond
the city proper, but that is not in the spirit of the authors’ formulation. It is also
possible to extend the domain to parcel-time pairs. The distribution (12.3), (12.4),
can then be interpreted as a real-valued process in one of two ways: either Yit
is constant in time, or it has independent and identically distributed values for
t ≠ t′. Neither of these interpretations is appealing for real-estate valuations. But
the second provides a mathematical framework within which parameters can be
understood, and predictions can be interpreted.
Every statistical model begins with a set of observational units on which are
defined covariates and relationships of various sorts. These mathematical objects
must be matched carefully and appropriately with objects and relationships on the
workbench or in the field. The correspondence need not be one-to-one. All parts of
this activity are invariably tentative, and always require reconsideration in the light
of new information. Careful compromise demands a good knowledge of what is
important in the mathematics and what is important in the process under study. It is
usually necessary to entertain a range of stochastic models that exhibit progressively
more complicated effects.
Criticism of linear Gaussian models or generalized linear models or other stock
families is a popular pastime. Glib sweeping slogans such as ‘All models are
wrong!’, or ‘They don’t work!’, or ‘Data are not normal!’, or ‘Relationships are
not linear!’, are certainly provocative. But they are also unhelpful and irritating—
particularly so in situations where the criticism is justified. Despite the rhetoric,
none of these remarks is meant as a criticism of the model per se as a mathematical
object. For example, few would venture to claim that Brownian motion is wrong on
the grounds that atoms have mass, moving atoms have momentum, and momentum
implies differentiability of paths. Such a statement would expose the obvious fact
that the error lies not in the behaviour of atoms nor in Brownian motion, but in the
nexus that associates one with the other.
To be fair, one oft-cited sloganeer has commented as follows: Since all models
are wrong, the scientist must be alert to what is importantly wrong. It is inap-
propriate to be concerned about mice when there are tigers abroad (Box, 1976).
Apart from the lead-in phrase, Box’s sentiment is very much in line with themes
in this book. In that sense, I agree with him, and I agree enthusiastically. But, of
his menagerie, we must be clear that tigers are to be found neither in the model
nor in the specific components of the application. Instead, both mice and tigers are
interstitial species occupying exclusively the gaps in the nexus between one and the
other. As a result, we cannot learn to recognize either mice or tigers by studying
exclusively one domain or the other. We need to understand both.
For reasons outlined above, it is helpful to defuse the rhetoric and focus the
discussion by decomposing the model and its application into four distinct aspects
as follows:
• Probabilistic matters:
(i) observational units as the domain for a process;
(ii) properties such as sampling consistency, exchangeability, independence,
stationarity; short, medium and long-range dependence;
(iii) asymptotic properties of sample means, variances,. . .
12.3 Likelihood Principle 201
• Statistical matters:
(i) protocol, baseline, randomization,. . .
(ii) parameterization: compatibility with linear transformation,. . .
(iii) accommodation of effects;
(iv) compatibility with randomization;
• Match with needs of the application:
(i) identification of observational and experimental units;
(ii) compatibility with existing physical/genetic/biological theories;
(iii) need for response transformation or covariate transformation;
(iv) are baseline relationships adequately accommodated?
• Procedural aspects:
(i) computation for parameter estimation;
(ii) computation for prediction, Kriging;
(iii) algorithms and software.
Now that the target of criticism is more clearly identified, we can all agree to
put more effort into finding a model that matches well with the demands of the
application. For example, much of the effort and discussion in Examples 1–10 is
directed at the critical junction, and hopefully in a constructive way. Mice are not
exactly welcome, but tigers must be squeezed out.
In this scheme, procedural and computational considerations play an important,
but entirely subsidiary, role. The expectation is that whatever computations are
needed for estimation or prediction can be done in reasonable time. If it turns out
that the required computations are more formidable than anticipated, we may need
to make further compromises. For most of examples 1–10, such compromises are
not needed. Thus, strategies for computational approximation do not feature in this
book.
without violating the LP. Neither of these tasks lies within the realm of parametric
inference.
A predictive target event such as Ȳ∞ ∈ A is covered by LP if and only if the
event has probability either zero or one for each θ. In that case, the sample-space
event Ȳ∞ ∈ A can be identified with the parameter subset

Θ_A = {θ ∈ Θ : Pθ(Ȳ∞ ∈ A) = 1}.
Some of the controversies about the application of the principle revolve around
the initial clause. If the circumstances surrounding the two experiments were
sufficiently different, the prior distributions might not be the same, and the
posteriors would certainly be different. That much is accepted. However, if the two
experiments were investigating the same phenomenon in different ways, there is
only one prior (per statistician), and the conclusions must be the same. Generally
speaking, Bayesian inferences obey the likelihood principle; procedures based on
sample-space computations such as p-values and confidence intervals, do not.
First Illustration
Let n ≥ 1 be a positive integer. A partition of the set [n] = {1, . . . , n} is a set of
disjoint non-empty subsets B whose union is [n]. The Ewens sampling formula is a
family of probability distributions on set partitions
pθ(B) = θ^{#B} ∏_{b∈B} Γ(#b) / θ^{↑n},    (12.5)

where #B is the number of blocks, #b is the block size, θ > 0 is the parameter, and
θ^{↑n} = θ(θ + 1) · · · (θ + n − 1) is the ascending factorial. The Radon-Nikodym
derivative with respect to p1 is

pθ(B)/p1(B) = θ^{#B} n! / θ^{↑n},
which is also the likelihood function. Although the Ewens density depends on both
the number of blocks and on their sizes, the number of blocks is a sufficient statistic.
In other words, the likelihood function ignores block sizes. Any inferential statement
about the parameter that depends either on the block sizes or on the size of the block
that contains element 1 is a violation of the likelihood principle.
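The sufficiency claim can be checked numerically. The following Python sketch (the function name and the two example partitions are mine, not the book's) evaluates the Ewens log density of (12.5) and confirms that the likelihood ratio pθ/p1 is the same function of θ for any two partitions with the same number of blocks, whatever the block sizes:

```python
import math

def ewens_log_density(blocks, theta):
    """Log of the Ewens density (12.5): theta^{#B} * prod Gamma(#b) / theta^{up n}.

    `blocks` is a list of block sizes; theta > 0.  Illustrative sketch only.
    """
    n = sum(blocks)
    log_ascending = sum(math.log(theta + i) for i in range(n))  # log theta^{up n}
    return (len(blocks) * math.log(theta)
            + sum(math.lgamma(b) for b in blocks)   # lgamma(b) = log (b-1)!
            - log_ascending)

# Two partitions of [20], each with four blocks but very different block sizes:
B1 = [5, 5, 5, 5]
B2 = [17, 1, 1, 1]

# The log likelihood ratio relative to theta = 1 is identical for both,
# so inference from the likelihood ignores the block sizes entirely.
for theta in (0.5, 1.0, 2.0, 7.5):
    r1 = ewens_log_density(B1, theta) - ewens_log_density(B1, 1.0)
    r2 = ewens_log_density(B2, theta) - ewens_log_density(B2, 1.0)
    assert abs(r1 - r2) < 1e-9
```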
Given that the observed configuration has 20 blocks, sufficiency implies that the
Ewens distribution on block sizes is fixed and independent of the parameter: see
Exercise 12.6. An observation consisting of 20 blocks of approximately equal size
is certainly possible but it may be judged so unlikely as to cast doubt on the entire
family. Questions of model adequacy and goodness-of-fit are traditionally
formulated as significance tests; such matters are not concerned with the parameter
space and are not covered by the LP.
Second Illustration
The principle can be illustrated by two observations on a Bernoulli sequence Yi ∼
Ber(θ ). For a sample of size n = 20, the density function and the likelihood function
are
f(y; θ) = θ^s (1 − θ)^{20−s},
where the number of successes s = y1 + · · · + y20 is the sufficient statistic. Suppose that the two
sequences are
y (1) = (0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
y (2) = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1).
Both sequences have s = 10, so the likelihood functions are equal. According to the
weak version of the likelihood principle, both observations must lead to the same
conclusion about θ .
From the present viewpoint, it is immaterial whether we adopt a Bayesian-type
beta-binomial model or we attempt to construct a confidence interval for θ .
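The equality of the two likelihoods is immediate from the factorization, and can be confirmed directly. A minimal Python sketch, assuming the Bernoulli model exactly as stated:

```python
import math

def bernoulli_log_lik(y, theta):
    """Log likelihood theta^s (1 - theta)^{n-s} for a 0/1 sequence y."""
    s = sum(y)
    return s * math.log(theta) + (len(y) - s) * math.log(1 - theta)

# The two sequences from the text: alternating, and ten 0s followed by ten 1s.
y1 = [0, 1] * 10
y2 = [0] * 10 + [1] * 10

# Both have s = 10, so the likelihood functions coincide at every theta.
for theta in (0.1, 0.3, 0.5, 0.9):
    assert abs(bernoulli_log_lik(y1, theta) - bernoulli_log_lik(y2, theta)) < 1e-12
```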
Third Illustration
The defining condition (11.4) for a randomization protocol means that the joint
distribution of the treatment vector and response vector is
Pθ (T ; x) × Pθ (Y | x, T ) = P (T ; x) × Pθ (Y | x, T ).
Thus, the randomization protocol does not feature in the likelihood. Or, to put it
more correctly, two experiments using different assignment protocols P, P′, giving
rise to the same treatment assignment and the same response must lead to identical
inferences about the parameter.
12.4 Attitudes
Principles are important both in statistical theory and in its application. While
one might make a plausible argument against the likelihood principle, it is difficult
to make a comparable argument against mathematical consistency. For that reason,
this book puts stochastic models and mathematical consistency at the top of
the hierarchy. Without consistency, there can be no guarantees of any sort, so
consistency is the sine qua non of science in general.
It is difficult to give a complete definition that covers every aspect of consistency,
but it is usually easy to spot or to confirm inconsistent formulations once the flaw is pointed out.
Relative Importance
The second attitude addresses the relative importance of parametric inference as a
component of the professional activity of the applied statistician. While one may
debate the precise fractions, I venture to say that questions of parameter estimation
and parametric inference represent about 20% of the effort for an applied statistician.
Yet this activity accounts for possibly 70–80% of the attention in the methodological
literature. Computation and parameter estimation are important, but they are not the
most important activities for the applied statistician.
Lessons
The premiss of the likelihood principle is that the statistician buys into the model
exactly as stated, with no probabilistic reserve in the form of an opt-out clause to
cover buyer’s remorse. The lesson of experience is simply to avoid being sandbagged.
The likelihood principle is not rejected, but a cautious applied statistician
invariably adopts the stated model provisionally, with adequate reserves to cover
mistakes, misunderstandings or unanticipated events. To do otherwise would be a
serious error of professional judgement.
In effect, a consulting statistician using the Bernoulli model proceeds as follows.
With probability 0.65 the sequence is Bernoulli with constant parameter θ ; with
probability 0.10 the sequence is Bernoulli with non-constant parameter; with
probability 0.10, the sequence has some temporal dependence, possibly Markov;
with probability 0.10, the design has some other feature that might lead in a
different direction. The weights shown here may be varied to match the incidental
information relevant to the context, but their sum is strictly less than one. After
observing either sequence y (1) or y (2) the first weight component is drastically
reduced. The Drosophila mating experiment in Chap. 3 is an instance of this sort.
12.5 Exercises
12.1 According to the discussion of Dong et al. (2015) in Sect. 12.1, the real-estate
market in Beijing is divided up into 1117 land parcels, which are partitioned into 111
districts. These administrative regions remain fixed over the time period of interest.
Each land parcel has a geographic position, an area, and a population density. To
each district there corresponds a subset of one or more neighbouring districts. Every
boundary parcel has at least one ‘foreign’ neighbour, a parcel outside of Beijing;
likewise for districts. Two Beijing districts are isolated and have no neighbours
within the city.
Discuss the nature of these variables in the setting of a study of Chinese quarterly
real-estate market valuations. What are the observational units? How many are
there? Which variables would you classify as covariates? Which variables would
you classify as relationships? At least one of these relationships is an equivalence
relation. Which one? Is a district a neighbour of itself? At least one relationship is
Boolean but not transitive. Which one? At least two relationships are metric. Which
two? What other types of relationships might be relevant?
12.2 Each land parcel belongs to the first or inner ring, the second ring, the third
ring, or beyond. To be clear, the rings are disjoint, so the phrase ‘second ring’
excludes the first. A district may straddle two or more rings. What sort of variable
is ring?
12.3 For non-commercial sales, average sale price per square metre is recorded
quarterly for each parcel. Discuss briefly how you might go about constructing a
sampling-consistent Gaussian model that incorporates spatial and temporal correla-
tions.
logit pr(Yi,j ≤ r) = γr + αi − αj ,
12.5 A different version of the preceding model uses the complementary log-log
link function:
log(−log(1 − pr(Yi,j ≤ r))) = γr + αi − αj ,
12.6 For the Ewens distribution (12.5), show that the conditional distribution given
#B = k is
pθ(B | #B = k) = ∏_{b∈B} Γ(#b) / s_{n,k},
where sn,k is Stirling’s number of the first kind, i.e., the number of permutations
[n] → [n] that have exactly k cycles.
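The combinatorial identity behind Exercise 12.6 — that the products ∏ Γ(#b) over partitions with exactly k blocks sum to Stirling's number, so the conditional probabilities are free of θ and sum to one — can be checked by brute-force enumeration for small n. A Python sketch (illustrative code, not from the book):

```python
import math

def set_partitions(n):
    """Yield all partitions of {0, ..., n-1} as lists of blocks (lists)."""
    if n == 0:
        yield []
        return
    for p in set_partitions(n - 1):
        # place element n-1 in each existing block, or in a new singleton block
        for i in range(len(p)):
            yield p[:i] + [p[i] + [n - 1]] + p[i + 1:]
        yield p + [[n - 1]]

def stirling1(n, k):
    """Unsigned Stirling number of the first kind: permutations of [n] with k cycles."""
    if n == 0:
        return 1 if k == 0 else 0
    return stirling1(n - 1, k - 1) + (n - 1) * stirling1(n - 1, k)

n = 6
for k in range(1, n + 1):
    # sum over partitions with k blocks of prod Gamma(#b) = prod (#b - 1)!
    total = sum(math.prod(math.factorial(len(b) - 1) for b in p)
                for p in set_partitions(n) if len(p) == k)
    assert total == stirling1(n, k)
```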
12.7 Let n = 12, and let B have the Ewens distribution with parameter θ > 0.
Suppose B has six blocks. Which is more likely: (a) that B has all blocks of size
two; (b) that B has five blocks of size one and one of size seven?
12.8 Consider independent and identically distributed observations Y1, . . . , Yn with density

(1 + ψ cos y)/(2π)
on the interval −π < y < π. The parameter space is the interval −1 ≤ ψ ≤ 1.
Show that the log likelihood is concave. What does this imply about maximum-
likelihood estimation?
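Concavity can be seen directly: the second derivative of l(ψ) = Σ log(1 + ψ cos yi) is −Σ cos²yi/(1 + ψ cos yi)² < 0. A quick numerical check in Python (simulated data; illustrative code, not from the book):

```python
import math
import random

random.seed(1)
# simulated sample; any points in (-pi, pi) would do for this check
y = [random.uniform(-math.pi, math.pi) for _ in range(50)]

def loglik(psi):
    """Log likelihood for the density (1 + psi cos y)/(2 pi), constant dropped."""
    return sum(math.log(1 + psi * math.cos(v)) for v in y)

# finite-difference second derivative at interior points of (-1, 1)
h = 1e-4
for psi in (-0.9, -0.5, 0.0, 0.5, 0.9):
    d2 = (loglik(psi + h) - 2 * loglik(psi) + loglik(psi - h)) / h**2
    assert d2 < 0   # concave, so any interior stationary point is the global maximum
```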
12.9 Consider instead the density

(1 + ψ cos y)(1 + (sin y)/2)/(2π)
on the interval −π < y < π.
Show that the vector statistic R = cos Y is sufficient and that S = sin Y
is ancillary, i.e., that S is distributed independently of the parameter. Show that
the conditional distribution of R given S is discrete, a Bernoulli multiple with
independent components. Find the conditional likelihood, and compare it with the
unconditional likelihood in Exercise 12.8.
12.10 Consider the two-parameter density

(1 + ψ cos y)(1 + λ sin y)/(2π)

on the interval −π < y < π. Show that the likelihood for (ψ, λ) factors as
L(ψ, λ; y) = L1(ψ; y) L2(λ; y).
12.11 Show that the likelihood factorization is also a density factorization, i.e., for fixed ψ,
that L1 (ψ; y) is a probability density on (−π, π)n , and likewise for L2 . Does it
follow that R and S are independent?
12.12 Let X0, X1 be given matrices of order 100 × 5 and 110 × 5 such that X0′X0 =
X1′X1 = F, and let Pβ be the Gaussian mixture model
indexed by β ∈ R⁵. For any sample point, either y ∈ R¹⁰⁰ or y ∈ R¹¹⁰; show that
the likelihood ratio is

pβ(y)/p0(y) = exp(y′Xβ − β′Fβ/2).

Deduce that the vector X′y is minimal sufficient, and that the sample size n(y) is
not a component of the sufficient statistic.
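Exercise 12.12 can be explored numerically. The construction below is my own (the exercise specifies only the Gram condition X0′X0 = X1′X1 = F): it builds two design matrices with a common Gram matrix and checks that for y ~ N(Xβ, I) versus y ~ N(0, I) the log likelihood ratio −½‖y − Xβ‖² + ½‖y‖² reduces to y′Xβ − β′Fβ/2, so it depends on y only through X′y, not through the sample size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build X0 (100 x 5) and X1 (110 x 5) with the same Gram matrix F = A'A:
A = rng.standard_normal((5, 5))                        # full-rank 5 x 5 factor
Q0, _ = np.linalg.qr(rng.standard_normal((100, 5)))    # orthonormal columns
Q1, _ = np.linalg.qr(rng.standard_normal((110, 5)))
X0, X1 = Q0 @ A, Q1 @ A
F = A.T @ A
assert np.allclose(X0.T @ X0, F) and np.allclose(X1.T @ X1, F)

def log_lr(y, X, beta):
    """log p_beta(y)/p_0(y) for y ~ N(X beta, I) versus y ~ N(0, I)."""
    return -0.5 * np.sum((y - X @ beta) ** 2) + 0.5 * np.sum(y ** 2)

beta = rng.standard_normal(5)
for X in (X0, X1):
    y = rng.standard_normal(X.shape[0])
    direct = log_lr(y, X, beta)
    formula = y @ X @ beta - beta @ F @ beta / 2   # depends on y only via X'y
    assert np.isclose(direct, formula)
```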
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 211
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_13
13 Initial Values
randomization scheme of this sort may not be recommended for a clinical setting,
but it is entirely legitimate on mathematical grounds if the scheme is fully specified
in the protocol.
In many settings such as agricultural field trials, treatment assignment is delib-
erately balanced to take account of block factors and other baseline variables such
as the geographical arrangement of plots in the field. It might well be the case that
Ti ∼ Tj for every pair of plots, but it is rarely the case that (Ti, Tj) ∼ (Ti′, Tj′)
for every two distinct pairs. For example, the bivariate assignment probabilities
for adjacent and non-adjacent pairs of plots are usually different. Thus, complicated
randomization schemes taking account of baseline covariates and relationships are
established practice in certain areas. Even though simpler randomization schemes
are strongly favoured for clinical trials, the possibility of dependence of T on either
x or on Y0 is left open in the discussion that follows.
The motivation for this section comes from a series of papers in the biostatistical
literature (Samuels, 1986; Liang & Zeger, 2000; Senn, 2006), in which there appears
to be disagreement on the choice of statistical techniques that are appropriate for
designs that focus on the change from baseline. These are sometimes called pre-
post designs.
13.2 Four Gaussian Models
Four closely-related Gaussian models are described. For the most part, the
formulations are entirely standard. Although the emphasis is on the treatment effect, other
aspects of the joint distribution must be considered. In the third and fourth versions,
it is explicitly assumed that T and Y0 are independent. This assumption does not
occur in versions I or II.
The reader should be warned that not all statements should be taken at face
value. Some are debatable; others may be misleading, or inappropriate for clinical
applications, or simply incorrect. See the subsequent discussion in Sect. 13.5.3.
Version I
In the simplest version of the problem, there are no baseline covariates other than
the initial value. In the absence of covariates, the baseline values are exchangeable,
here interpreted as independent Gaussian
This is the treatment effect, which is a constant independent of the initial value.
For two units whose initial values are not equal, the conditional expected
response difference is a linear combination of the treatment effect plus the initial
difference:
which is, in general, not the same as the treatment effect. However, if treatment
is assigned independently of initial values, exchangeability implies that the second
term is zero.
The parameters in this model are the two variances plus four regression coeffi-
cients μ0 , μ1 , γ , τ . Regardless of the specific model employed, the joint density has
two Gaussian and one non-Gaussian factor:
In the Gaussian model, the first factor depends on (μ0 , σ0 ); the second factor is fully
specified by protocol, i.e., constant on the parameter space; the third factor depends
on (μ1 , γ , τ, σ1 ). Thus, provided that the parameters are variation independent,
the density factorization is also a likelihood factorization. So far as maximum-likelihood
estimation of the treatment effect is concerned, the first two factors
can be ignored, and usually are ignored. Whether T is independent of Y0 or not, the
treatment effect is estimated by ordinary least squares using (13.2) with the initial
value as a covariate.
Version II
The second version admits baseline covariates x whose effect on the response
distribution is additive. In standard linear-model notation, μ0 in (13.1) is replaced
by xi β0 and μ1 in (13.2) by xi β1 . Provided that the treatment effect is a constant
independent of x, the conclusions are not appreciably different from those in
version I. Whether T ⊥⊥Y0 or not, it is evident that the maximum-likelihood estimate
of the treatment effect and its standard error are obtained by least squares based on
the extended version of (13.2) using both x and the initial value as covariates on an
equal footing. It appears, therefore, that the distinction between the initial value and
other baseline covariates is more cosmetic than substantive.
Version III
The third version is a minor variation on the first, but it sets the scene for the natural
extension to longitudinal designs in version IV. For individuals such that Ti = 0,
the pairs (Y0,i , Y1,i ) are independent and bivariate normal
(Y0,i, Y1,i) ∼ N2((μ0, μ1), Σ).
For individuals such that Ti = 1, the pairs are again independent and bivariate
normal
(Y0,i, Y1,i) ∼ N2((μ0, μ1 + τ), Σ).
Version IV
In a longitudinal study of health, the response on each physical unit is measured
at baseline and at multiple points thereafter as specified by the protocol. Apart
from baseline, the observation times need not be the same for every individual.
In the fourth version, the response is assumed to be a Gaussian process with
covariance function δi,j K(t, t′) for t, t′ ≥ 0. In other words, treatment assignment
is independent of initial values, and the processes for distinct individuals i ≠ j are
independent with the same conditional covariance function

cov(Yi(t), Yi(t′) | T) = K(t, t′).    (13.4)
Apart from the covariance function, the model is determined by the conditional
mean function
E(Yi(t) | T) = μ0(t) if Ti = 0, and μ0(t) + τ(t) if Ti = 1.    (13.5)
If T ⊥⊥ Y0, each of the four versions is a variation on IV. The processes for distinct
individuals are independent Gaussian processes with the same covariance function
K(t, t′) for all individuals regardless of treatment status. Treatment can only have an
effect post-baseline; in (13.5), the effect is additive on the mean trajectory given T,
but it is not constant in time.
A treatment-assignment protocol pn (T | Y0 = y) is called fixed if the
distribution for each y is degenerate at t(y) and constant, i.e., t(y) = t(0) for all y.
Every fixed assignment is a random assignment, automatically independent of Y0 ,
to which the remarks in the preceding paragraph apply. For each fixed assignment,
the second factor in (13.3) is degenerate; the product of the first and third factors is
the density of the Gaussian process whose mean is (13.5).
If T is not independent of Y0 , the product of the first two factors in (13.3) is
the joint density of (T , Y0 ), and the full product is the joint density of (T , Y (·)).
In general, the conditional distribution of Y0 given T is not Gaussian, nor is
the conditional distribution of Yi (t), so the conditional mean given T does not
satisfy (13.5). However, the second factor in (13.3) is parameter-free, and plays no
part in likelihood calculations. According to the likelihood principle, all inferential
statements concerning the parameters alone are to be made as if the treatment
assignment were fixed and independent of the initial values. In other words, if T is
not independent of Y0, the joint distribution given T is not Gaussian. Nonetheless,
the likelihood function is the same as if T were fixed, so the Gaussian likelihood
computation is correct in that limited sense.
The likelihood principle is concerned solely with parametric inference. It has
little to say about inferences that are external to the parameter space. For example,
the task of predicting future values Yi (s) for patient i is a state-space task, not
covered by the likelihood principle.
y0 4.33 5.13 5.05 4.62 3.90 4.08 4.99 4.39 5.58 4.99 4.16 5.22
y1 5.13 3.86 4.81 4.88 4.57 3.13 6.97 5.86 5.62 5.67 4.53 5.56
t 0 0 0 0 0 0 1 1 1 1 1 1
Maximum likelihood for the three Gaussian models described above produces the
following estimates for the treatment effect. In each case, the standard REML
procedure was used for the estimation of variances and covariances, followed by
weighted least squares for regression coefficients.
                τ̂       s.e.(τ̂)    s²
(13.2): OLS     1.1596   0.4843   0.6097
III: σ0 = σ1    1.2147   0.3660   0.4019
III: σ0 ≠ σ1    1.1596   0.4277   0.5487
Differences     0.9350   0.4644   0.6469
The final column is the estimate of the conditional variance var(Y1 | Y0), either
computed directly from the residual mean square in (13.2), or computed indirectly
as a function of the fitted 2×2 covariance matrix Σ̂ in version III. The fitted matrices
for the two variations of III are

( 0.427  0.104 )        ( 0.280  0.110 )
( 0.104  0.427 )   and  ( 0.110  0.592 ).
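The OLS and Differences rows of the table can be reproduced directly from the data listed above. The book's fits use the R function regress; the Python sketch below (numpy, my own code) checks only the two least-squares rows:

```python
import numpy as np

# the 12 pre/post observations and treatment indicator from the data table
y0 = np.array([4.33, 5.13, 5.05, 4.62, 3.90, 4.08, 4.99, 4.39, 5.58, 4.99, 4.16, 5.22])
y1 = np.array([5.13, 3.86, 4.81, 4.88, 4.57, 3.13, 6.97, 5.86, 5.62, 5.67, 4.53, 5.56])
t  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)

# OLS fit of (13.2): y1 on intercept, initial value and treatment
X = np.column_stack([np.ones(12), y0, t])
coef, *_ = np.linalg.lstsq(X, y1, rcond=None)
tau_ols = coef[2]

# 'Differences' row: regression of y1 - y0 on the treatment factor alone
tau_diff = (y1 - y0)[t == 1].mean() - (y1 - y0)[t == 0].mean()

assert abs(tau_ols - 1.1596) < 5e-4   # OLS row of the table
assert abs(coef[1] - 0.393) < 5e-3    # coefficient of Y0 quoted in the text
assert abs(tau_diff - 0.9350) < 5e-4  # Differences row of the table
```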
The Y1 -values are more variable than baseline values, but not significantly so. The
expression fit2 displayed below shows why it is sometimes necessary to allow
variance components to be negative.
For computational purposes, the data were coded in the response vector Y =
(Y0 , Y1 ), the patient ID factor with 12 levels, and the treatment factor T = (T0 , T1 ),
which has one baseline level and two post-baseline levels. The ordinary least squares
code uses the sub-vectors Y1 , Y0 and the two-level factor T1 . In all cases, the
covariance model includes the identity I12 or I24 by default. The third part uses the
additional matrix I0 of order 24, which is the identity restricted to baseline values
only.
fit0 <- regress(Y1~Y0+T1)
fit1 <- regress(Y~T, ~patient_id)
fit2 <- regress(Y~T, ~patient_id+I0)
fit3a <- regress(Y~T+patient_id); fit3b <- regress(diff~T1)
The code for the fourth version differs from the second only in one respect:
patient ID is used as a classification factor in the mean rather than as a block
factor in the covariance. This is equivalent to assuming that the between-patient
variance is infinitely large, so the same numerical value is obtained by working
with the individual differences Y1 − Y0 in the code lm(diff~T1). As it happens,
the between-patient variance is about one third of the residual variance, and the
regression coefficient of Y0 in the ordinary least-squares fit is only 0.393, so
differencing is not especially effective in this instance. Statistically speaking, the fourth method is
strictly inferior to the first three.
We have seen that no fundamental distinction can be drawn between initial values
and other baseline covariates. Nevertheless, the distinction is relevant and important
if only as a strategy for model construction.
If the initial value is regarded as non-random, and observations are to be made
at arbitrary post-baseline points, it is necessary to specify the joint distribution
of Y (t1 ), . . . , Y (tk ) given Y (0) for arbitrary k ≥ 1 and arbitrary configurations
t1 , . . . , tk . In other words, we need to associate with each initial value y0 = Y (t0 )
a stochastic process that is devoid of symmetries such as stationarity, in such a
way that the one-dimensional distributional specification for each time t ≥ 0
is consistent with the two-dimensional specifications for times t, t , and so on.
Direct construction of conditional distributions is a very difficult exercise, and the
dependence on the initial value only compounds the difficulty. Generally speaking, it
is much easier and more natural to begin with a single process, which is a consistent
specification of the joint distribution of Y (t0 ), . . . , Y (tk ) for arbitrary temporal
configurations. Depending on the setting, this process might well be stationary. If it
is needed—and usually it is not needed—the process itself defines the conditional
distribution given Y (t0 ). These derived conditional distributions are automatically
self-consistent. Each one is non-stationary, fixed at the temporal origin.
In conclusion, the distinction between covariates and initial values may not be
fundamental, but it is strategically important. It is a strategic mistake to insist that
the initial value cannot be regarded as random.
All of the preceding remarks concerning initial values are made in the context of
a randomized controlled experiment. However, initial values also occur naturally
in every longitudinal study, even if there are only two observation times. Suppose,
therefore, that Yi (0) is the initial value, xi is a baseline classification factor such
as age or sex, and that Yi (1) is the subsequent or terminal value. All patients are
closely monitored, and the recommendations may be individually tailored, but the
program has no randomized assignment or declaration of subsets for comparison
other than subsets determined by levels of x. What sorts of analyses are possible,
and how should they be conducted?
In many clinical situations, the goal is to improve symptoms, for example by
alleviating pain, reducing weight or reducing blood pressure, so it is natural to
focus on the difference, Zi = Yi (1) − Yi (0), and to examine its association with
the classification factor x. Since there is no treatment and no control level for
comparison, we can only look at the reduction and ask whether the program of
medication appears to be effective at alleviating symptoms. In the absence of a
control group, such a question can be addressed only under an assumption of
stationarity, namely that any systematic difference is more naturally associated with
the program than with the passage of time. Therein lies the essential weakness of an
observational study.
Stationarity is not an assumption to be taken lightly, particularly in studies of
chronic diseases such as Lyme disease, where the persistence of symptoms is known
only anecdotally and primarily from sources exhibiting the greatest pain and longest
persistence. It is well to remember that volunteers are primarily or exclusively those
exhibiting extreme symptoms at baseline. Even if the process is stationary and the
program is entirely neutral in its effect, the regression phenomenon guarantees that
average symptoms after one unit of time will be less extreme than those at baseline.
The temptation to attribute such a reduction to the effectiveness of the program is
undoubtedly strong for all participants, but it is also potentially misleading. For an
insider’s view, see Douthat (2021) and the sympathetic but more balanced review
by Austin (NYT, Oct. 2021).
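The regression phenomenon is easy to demonstrate by simulation. In the Python sketch below (my own illustration; the correlation 0.6 and the 10% selection threshold are arbitrary), the process is stationary with the same marginal distribution at baseline and follow-up, the 'program' does nothing at all, and volunteers are selected for extreme baseline symptoms; their mean follow-up value is nevertheless closer to the population mean:

```python
import random

random.seed(0)
rho = 0.6          # assumed baseline/follow-up correlation, illustration only
n = 100_000

# stationary pairs: N(0,1) marginals at both times, correlation rho
pairs = []
for _ in range(n):
    y0 = random.gauss(0, 1)
    y1 = rho * y0 + random.gauss(0, (1 - rho**2) ** 0.5)
    pairs.append((y0, y1))

# 'volunteers': only those with extreme baseline symptoms (top 10%)
cut = sorted(y0 for y0, _ in pairs)[int(0.9 * n)]
sel = [(y0, y1) for y0, y1 in pairs if y0 >= cut]
m0 = sum(y0 for y0, _ in sel) / len(sel)
m1 = sum(y1 for _, y1 in sel) / len(sel)

# neutral program, stationary process: follow-up mean is less extreme anyway
assert 0 < m1 < m0
```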
Given stationarity, we can investigate whether the program looks promising by
examining the magnitude of the mean symptom reduction and asking whether this
is appreciably better than what would be expected under a neutral program. In other
words, some estimate of a neutral effect is needed for a definitive assessment. In the
absence of data from a neutral group, we can ask whether the program is equally
effective or ineffective for patients of all ages and both sexes. On the assumption of
independence and exchangeability for distinct patients, the simplest Gaussian model
takes the form
(Y0,i, Y1,i) ∼ N2((μ0(xi), μ1(xi)), Σ), where μ0(xi) = β00 + β01 xi and μ1(xi) = β10 + β11 xi.    (13.6)
for any pair of units i ≠ j such that xi = 1 and xj = 0. One way to
estimate the sex effect is to code Y as a matrix of order n × 2 and to fit (13.6)
as a bivariate regression model, Y ∼ N(Xβ, Σ ⊗ In). Maximum likelihood gives
β̂ = (X′X)⁻¹X′Y as a 2 × 2 matrix, and Σ̂ is the 2 × 2 matrix of residual mean
squares and products. We can then report the difference β̂11 − β̂01 , together with its
standard error. Numerically, this is exactly equivalent to a simple linear regression
of the differences Zi = Yi (1) − Yi (0) on xi .
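The numerical equivalence between the bivariate fit and the simple regression of differences can be checked on simulated data. A Python sketch with numpy (the coefficients and covariance matrix are arbitrary illustrative values, not estimates from the book):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 40
x = rng.integers(0, 2, n).astype(float)        # classification factor coded 0/1
X = np.column_stack([np.ones(n), x])

# simulate correlated (Y0, Y1) pairs with x entering both means
E = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.5]], size=n)
Y = np.column_stack([1.0 + 0.5 * x, 2.0 + 1.2 * x]) + E     # n x 2 response

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]   # 2 x 2 coefficient matrix
target = beta_hat[1, 1] - beta_hat[1, 0]          # beta11 - beta01

# simple regression of the differences Z = Y1 - Y0 on x gives the same number
z = Y[:, 1] - Y[:, 0]
gamma_hat = np.linalg.lstsq(X, z, rcond=None)[0][1]
assert np.isclose(target, gamma_hat)
```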
Given the target parameter β11 − β01 , it is crucial that the baseline value Y0 not
be included as a covariate in either regression because
where γ = 10 /00 . Thus, E(Zi − Zj | Y0 ) = β11 − β01 . Apart from notation and
interpretation, β̂11 − β̂01 is what is reported in fit3b in Sect. 13.2.2.
The topic of this section—the conflation of a treatment factor in a randomized
trial with a classification factor in an observational study—is the core of Lord’s
paradox: Lord (1967). See also Bock (1975, Sect. 7.3.1) for a more detailed
discussion along the lines of Exercise 13.2. As Senn has pointed out, Lord’s paradox
is a feast of red herrings, the first of which is an observational study presented and
analyzed as if it were a randomized experiment.
13.3 Exercises
var(T̄n) = (n + 2) / (6n(n + 1))
E(Yi | T) = β + σγn xi,
cov(Yi, Yj | T) = σ²(1 − γn²)δij + σ²γn²/n,
13.3 Use the Gaussian model with second moments given in the previous exercise
to compute a pseudo-log likelihood l0 (β, σ ) for the parameter (β, σ ). Show that the
pseudo log-likelihood differs from the correct log likelihood
l(β, σ) = −n log σ − Σi (yi − β)²/(2σ²)
by terms that are relatively small for large n, so that β̂0 = β̂ and σ̂0 − σ̂ = Op (n−1 ).
What precisely does ‘relatively small for large n’ imply about the magnitude of the
difference l0 (β, σ ) − l(β, σ )?
13.4 For the pseudo log likelihood in the preceding exercise, show that the Fisher
information matrix is diagonal and that it coincides with the Fisher information from
the correct log likelihood.
13.5 Suppose that terminal values are conditionally independent given (Y0 , T ) with
conditional distribution
Recall that a process with state space S associates with each sample U =
(u1 , . . . , un ) consisting of finitely many distinct observational units taken in a
specified order, a probability distribution PU on the observation space S U . Thus
PU (A) is the probability of the event
(Yu1 , . . . , Yun ) ∈ A.
14 Probability Distributions
14.1.4 Stationarity
Let U = R be the index set. No covariates are registered, and the temporal difference
R(t, t′) = t − t′ is the only registered relationship. The restriction of R to a sample
is a square matrix R[U] of signed temporal differences; two samples are called
congruent or structurally equivalent if R[U] = R[U′]. Congruence is an equivalence
relation on samples, denoted by U ≅ U′; in this setting, it implies U′ = U + h for
some real number h.
A process with distributions PU is said to be stationary, or invariant with respect
to temporal translation, if U ≅ U′ implies PU = PU′. In particular, stationarity
implies that all singletons have the same distribution Yt ∼ Yt′.
Any transformation of R, such as the absolute distance R⁺, is also a relationship
on the units; R⁺ is said to be a coarser relationship than R because the partition
defined by R is a sub-partition, or a finer partition, of that defined by R⁺. In
particular, R⁺(t, t′) = R⁺(t′, t) is symmetric whereas R is not. If R⁺[U] =
R⁺[U′] implies PU = PU′, the process is not only stationary but also reversible.
14.1.5 Exchangeability
equal, x(ui) = x(u′i); it implies that all pairwise relationships are equal, R(ui, uj) =
R(u′i, u′j), and so on. In particular, ui = uj if and only if u′i = u′j.
Exchangeability is nothing more than the statement that congruent samples are
required to have the same response distribution, i.e., U ≅ U′ implies
PU = PU′.
For singletons, x(u) = x(u′) implies Pu = Pu′; for pairs, (x(u1), x(u2)) =
(x(u′1), x(u′2)) and R(u1, u2) = R(u′1, u′2) together imply Pu1,u2 = Pu′1,u′2.
Exchangeability is not a statement of biological, medical or scientific fact. It is a
mathematical statement of equity or equality, corresponding roughly to fairness or
even-handedness, which implies that probabilistic statements are based only on facts
that are registered at baseline. By supposition, all relevant facts are encoded in x.
Without a symmetry condition of this sort, conveniently selected alternative facts
are no less compelling than recorded facts. Such a world view may be acceptable in
politics and in the theatre, but it is an impediment to science.
Let B be a given partition of the finite set [n] into blocks, and let G be the group of
permutations that preserves the partition. In other words, G is the set of permutations
σ : [n] → [n] such that Bσ (i),σ (j ) = Bi,j .
To any vector y = (y1 , . . . , yn ) in S n there corresponds a randomized vector
Y = yσ whose components Y = (yσ (1), . . . , yσ (n) ) are obtained by composing the
given vector with a random permutation σ uniformly distributed over the group.
Randomization defines a process with state space S and finite index set [n]. By
definition, for each τ ∈ G, the group product σ τ is also uniformly distributed, so the
permuted random vector Y τ = yσ τ has the same distribution as Y . The distribution
is invariant with respect to the natural sub-group of permutations associated with
the block factor.
If all blocks of B are of equal size, the distribution of Y is block-exchangeable
in the sense of Sect. 14.1.3. Otherwise, if there are blocks of different sizes, Y is
not block-exchangeable. For example, if n = 7, and B = 1|2|34|567 is a partition
into four blocks, the group contains 2 × 2 × 6 = 24 elements. Block randomization
implies Y1 ∼ Y2 because the transposition 1↔2 is a group element; it also implies
Y3 ∼ Y4 and Y5 ∼ Y6 ∼ Y7 for similar reasons. But it does not imply Y1 ∼ Y3 or
Y3 ∼ Y5 because there is no group element such that σ (1) = 3 or σ (3) = 5.
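The n = 7 example can be verified by enumerating the group. A Python sketch with 0-based labels (illustrative code, not from the book):

```python
from itertools import permutations

n = 7
B = [{0}, {1}, {2, 3}, {4, 5, 6}]          # the partition 1|2|34|567, 0-based
block_of = {i: k for k, blk in enumerate(B) for i in blk}

def same_block(i, j):
    return block_of[i] == block_of[j]

# the group G: permutations sigma with B_{sigma(i),sigma(j)} = B_{i,j}
G = [s for s in permutations(range(n))
     if all(same_block(s[i], s[j]) == same_block(i, j)
            for i in range(n) for j in range(n))]
assert len(G) == 24                         # 2 x 2 x 6 elements

y = [10, 20, 30, 40, 50, 60, 70]

def marginal(i):
    """Distribution of Y_i = y[sigma(i)] under a uniform random sigma in G."""
    return sorted(y[s[i]] for s in G)

assert marginal(0) == marginal(1)                  # Y1 ~ Y2
assert marginal(2) == marginal(3)                  # Y3 ~ Y4
assert marginal(4) == marginal(5) == marginal(6)   # Y5 ~ Y6 ~ Y7
assert marginal(0) != marginal(2)                  # but not Y1 ~ Y3
```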
The parameter space is the set of probability distributions defined on the state space,
say S = R with Borel subsets. In other words, the sequence Yu for u ∈ U has
independent and identically distributed components Yu ∼ θ .
Properties of a statistical model are often gauged by its behaviour under the
action of a suitable group or semi-group of measurable transformations g : S → S.
If Y1, Y2, . . . are independent and identically distributed with distribution θ ∈ Θ,
then the transformed variables g(Y1), g(Y2), . . . are independent and identically
distributed with parameter gθ ∈ Θ, where

    gθ(A) = Pθ(gY ∈ A) = Pθ(Y ∈ g⁻¹A) = θ(g⁻¹A);

in symbols,

    Y1, Y2, . . . ∼ iid θ,   and   gY1, gY2, . . . ∼ iid gθ.
    θ̂(A; y) = n⁻¹ Σ_{i=1}^{n} δ_{y_i}(A) = n⁻¹ #{i ∈ [n] : y_i ∈ A}.
The function y → θ̂ (y) is equi-variant in the sense that θ̂ (gy) = g θ̂ (y). The
transformation y → gy acts component-wise S n → S n , while θ → gθ is
the induced transformation on distributions on S. For invertible transformations,
this means θ̂ (y) = g −1 θ̂ (gy). The empirical distribution is sometimes called the
nonparametric maximum-likelihood estimate, or the bootstrap estimate.
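Equi-variance of the empirical distribution is easy to check numerically; a sketch in which the sample, the interval A and the affine map g are all invented for illustration:

```python
import numpy as np

def theta_hat(y, A):
    """Empirical measure of the interval A = (a, b]: the fraction of points in A."""
    a, b = A
    y = np.asarray(y)
    return np.mean((a < y) & (y <= b))

y = np.array([0.2, 1.5, -0.3, 2.0, 0.7])
A = (1.0, 4.0)

# g(y) = 2y + 1, so g^{-1}(A) = ((a-1)/2, (b-1)/2]
gy = 2.0 * y + 1.0
ginv_A = ((A[0] - 1.0) / 2.0, (A[1] - 1.0) / 2.0)

# (g theta_hat(y))(A) = theta_hat(y)(g^{-1}A), which equals theta_hat computed from gy
print(theta_hat(gy, A), theta_hat(y, ginv_A))  # 0.6 0.6
```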
The parameter space is the set of Gaussian distributions on the real line. Since
the Gaussian distribution is determined by its mean and variance, this statement
typically means one of the following:

    Θ = R²;           Pθ = N(θ1, θ2²);
    Θ = R × (0, ∞);   Pθ = N(θ1, θ2²);
    Θ = R²;           Pθ = N(θ1, e^{θ2}).
Given a parameter point θ , the components are independent and identically dis-
tributed Yu ∼ Pθ on the real line.
The three versions are not mathematically equivalent. In version one, the two
distinct points (θ1 , ±θ2 ) define the same distribution, so the parameter is not iden-
tifiable. In addition, the boundary subset of Dirac distributions N(θ1 , 0) is included
in the first version, but not in the other two. Versions two and three are equivalent in
the sense that they contain the same set of non-degenerate distributions. Differences
of this sort are sometimes important in theoretical work, for example in questions
concerning the existence of a parameter point that maximizes the likelihood. But,
for the most part, minor differences in parameterization are not of great importance
and are usually overlooked in applied work.
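The non-identifiability in version one is simply the statement that the density depends on θ2 only through θ2²; a one-line check (numerical values invented for illustration):

```python
import math

def npdf(y, t1, t2):
    # version one: N(theta_1, theta_2^2); the density involves t2 only through t2**2
    var = t2 ** 2
    return math.exp(-(y - t1) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(npdf(0.3, 1.0, 0.8) == npdf(0.3, 1.0, -0.8))  # True: (theta1, +-theta2) coincide
```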
All three versions are affine equi-variant in the sense that Yu ∼ Pθ implies gYu ∼
Pgθ for affine transformations y → gy = g0 + g1 y with g1 > 0. The induced
transformation on the parameter space is group composition
    (θ1, θ2) → (g0 + g1 θ1, g1 θ2).
This estimator is affine equivariant in the sense that θ̂ (gy) = g θ̂ (y) for affine
transformations y → gy acting component-wise. For this purpose, the divisor n − 1
could be replaced by n. Similar remarks with minor modifications apply to version 3.
The Cauchy distribution C(θ) with median θ1 and probable error |θ2| has density

    |θ2| dy / (π |y − θ|²),
variance, while both parameters are sex-dependent in the fourth model. In (vii), the
distributions are arbitrary; Yu ∼ θ0 for males and Yu ∼ θ1 for females.
Most readers whose experience lies in applied work would blanch at the
penultimate suggestion in which male values are Gaussian while female values are
distributed as Cauchy. The reasons for this have nothing to do with Cauchy versus
Gauss as individuals, or with male variability versus female variability, or with the
suitability of this model for any specific application. Instead, they are anchored in
the well-established legal principle of ‘equality under the law’, a desire to avoid
overt bias related to visible factors such as race, sex and religion that are, by common
agreement, incidental under law.
One mathematical statement of those principles is equi-variance under label-
switching. In the present setting, the permutation σ that transposes M with F
also switches P with Q. Equi-variance means that to each transposition of factor
labels there corresponds a permutation of parameter components such that Pθ (A) =
Qσ θ (A) for every event A. All of the models listed above are equi-variant except
for (v) and (vi).
Equi-variance does not imply that the distribution for males is the same as
the distribution for females, but it does imply that the set of distributions under
consideration is the same for both. Each sex gets to pick one distribution from the
same set, so there is equality of opportunity in that sense. However, the Gaussian
model in the fifth row of Table 14.1 shows that equality of the sets {Pθ : θ ∈ Θ} and
{Qθ : θ ∈ Θ} is not sufficient for equi-variance.
Equi-variance is not a fundamental principle on a par with Kolmogorov con-
sistency for a stochastic process. It is not even on a par with the principle of
exchangeability for individuals having the same covariate value. Equi-variance is
reasonably compelling in many circumstances and is a natural default for any factor
whose levels are unordered or otherwise unstructured. For example, occupation is a
classification factor, but the set of levels is not devoid of structure. In a survey with
limited options, one level might be ‘employed but none of the above’. Equi-variance
is a mathematical solution to the problem of accommodating distinct classes of units
on an equal footing in the stochastic model.
14.3.2 Treatment
P (Yu ∈ A | T = 0; θ ) = Pθ (A).
Treatment has an effect, possibly null, so the second step focuses on the set of
possible treatment effects g ∈ G, and on how each reference-level distribution is
modulated by g. Each treatment modulation is an action on the parameter space
θ → gθ which sends Pθ to Pgθ . The interpretation of the action by g is as follows:
if the conditional distribution given T = 0 is Pθ , and g is the treatment effect,
the conditional distribution given T = 1 shall be Pgθ . To make sense of this, it is
necessary that the set G be a group acting on Θ0; the group identity corresponds to
the null treatment effect.
The overall parameter space is the product set Θ0 × G. By definition of group
action, each transformation g : Θ0 → Θ0 is invertible, so gΘ0 = Θ0. Thus,
whatever the treatment effect may be, the set of conditional distributions given
T = 1 is the same as the set of distributions given T = 0. Since the action is a
group homomorphism, it is immaterial which level of T is used as the reference
level. This condition immediately excludes the fifth and sixth models in Table 14.1
as possibilities for modelling a treatment effect.
For this setting, where there is a single treatment factor and no covariate, the
Bernoulli logistic and probit models in Table 14.2 are equivalent, and both are
equivalent to the Bernoulli model in Table 14.1. The distributions are in 1–1
correspondence, and the only differences are in the parameterizations.
In the first three Gaussian models, the group acts additively on the parameter,
sending (μ, log σ ) to (μ + g, log σ ) in example (iii), to (μ, g + log σ ) in
is the conditional log odds of success for unit u, and for every unit in class x(u). The
treatment effect is a group action on the space Θ0 = R^k, which sends θ to gθ. In the
the absence of additional structure (such as an inner product) there are two principal
options for the group and its action; either G = R and gθ = (θ1 + g, . . . , θk + g) or
G = Rk and gθ = (θ1 + g1 , . . . , θk + gk ).
By this odds-ratio yardstick, the effect of treatment is the same number g for units in
every class. ‘No interaction’ between treatment and class means that the treatment
effect on some specified scale is the same for every class, so this group action
implies no interaction on the logistic scale.
The second option means that g ∈ R^k acts additively, component by component.
Consider the simple linear regression model with independent components and a
quantitative covariate x. The response distribution given the parameter (η, σ ) is Y ∼
Nn (η(x), σ 2 In ), where the mean function η(x) = η0 +η1 x is linear in x. In a random
coefficients model, some or all of the regression coefficients are regarded as random
variables. Suppose, therefore, that η(·) is a random linear function in which the
coefficient vector (η0 , η1 ) is bivariate normal with mean (β0 , β1 ), variances σ02 , σ12
and correlation ρ. Then the marginal distribution of Y is Gaussian with moments
    E(Yu) = β0 + β1 xu
    cov(Yu, Yu′) = σ² δu,u′ + σ0² + σ1² xu xu′ + ρσ0σ1 (xu + xu′).
The six-parameter model is identifiable in the standard sense that distinct parameter
points give rise to distinct distributions, i.e., Pθ = Pθ′ implies θ = θ′, at least for
non-trivial designs and interior parameter points with σ0σ1 > 0.
This version of the random-coefficients model trades a three-parameter Gaussian
model for a six-parameter model. This transaction might be favourable if the larger
model were capable of accommodating effects that the simpler model cannot handle.
Sadly, that is not the case. The triple (Σ Yu, Σ xu Yu, Σ Yu²) is minimal sufficient
for both models, and the likelihood is maximized at the boundary point σ0 = σ1 =
0 with ρ indeterminate. Regardless of the number of observations, the variance-
components σ02 , σ12 , ρσ0 σ1 are not estimable.
The transaction is more favourable if B is a block factor, and the conditional dis-
tribution is Y ∼ Nn (ηb,0 + ηb,1 x, σ 2 In ) with bivariate coefficients ηb independent
and identically distributed for each block. The marginal distribution is then Gaussian
with moments
    E(Yu) = β0 + β1 xu                                                  (14.1)
    cov(Yu, Yu′) = σ² δu,u′ + Bu,u′ (σ0² + σ1² xu xu′ + ρσ0σ1 (xu + xu′)).
    E(Y) = Xβ                                                           (14.2)
    cov(Y) = σ² In + σ0² B + σ1² K + σ2² B·K,
where the matrix Ku,u′ = K(xu, xu′) has full rank, and B·K is the Hadamard
component-wise product. Once again, the model has four variance components and
two regression coefficients, so the choice of one over the other is not made on the
basis of parameter counting.
One crucial difference between the two versions is that (14.1) implies independence
of observations in different blocks, while σ1² > 0 in (14.2) implies otherwise.
Note also that the sum of the last three terms in (14.1) is a Hadamard-product matrix
of the form B·K, where K has rank two with eigenvectors confined to the linear
subspace X = span(1, x).
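The rank claim is easy to verify numerically; a sketch with invented values of x and the variance components:

```python
import numpy as np

x = np.array([0.3, 1.2, -0.7, 2.5, 0.9])
s0, s1, rho = 1.3, 0.8, 0.4
ones = np.ones_like(x)

# K(x_u, x_u') = s0^2 + s1^2 x_u x_u' + rho s0 s1 (x_u + x_u')
K = (s0**2 * np.outer(ones, ones)
     + s1**2 * np.outer(x, x)
     + rho * s0 * s1 * (np.outer(x, ones) + np.outer(ones, x)))

print(np.linalg.matrix_rank(K))  # 2: the columns of K lie in span(1, x)
```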
indexed by β ∈ R^p, with two variance components σ0², σ1² > 0. Given T = t, the
group action generates an orbit
treatment effect need not be additive on the mean; moreover, if it is additive it need
not be the same constant for every unit.
Loosely speaking, interaction means that the effect of treatment for one unit is not
the same as the effect for another unit. In order for this to be the case, we must have
x(u) = x(u ), so the treatment action depends on x. In the simplest setting, x is
binary, G = R², and the group action (14.3) becomes

    Nn(μ, Σ) → Nn(μ + t g0 + t·x g1, Σ).                                (14.4)
For units at the reference level such that xu = 0, the treatment effect is an additive
increase in the mean by g0 ; for units such that xu = 1, the treatment effect is additive
by g0 + g1 . The difference g1 is called the interaction.
In (5.2), the treatment effect is a differential drift, whose magnitude is directly
proportional to time-since-baseline. That means that the action of the group element
g ∈ R is an additive function of the product g × time . The effect on the mean is not
the same for every unit. Nonetheless, G = R, so it is not entirely clear whether this
should be counted as interaction.
A similar effect can be generated artificially by restriction of (14.4) to the one-
dimensional sub-group g0 = g1 .
One further example may help to illustrate the options available for group action
on distributions. Let Θ0 be the set of non-negative measures on R⁺ = (0, ∞) that
are locally finite near the origin, i.e., there exists t > 0 such that the interval (0, t)
has finite measure. To each θ ∈ Θ0 there corresponds a probability distribution on
S = R+ ∪ {∞} defined by the survivor function
    Pθ(Y > t) = exp(−θ((0, t])),

which implies Pθ(Y > 0) = 1 and Pθ(Y = ∞) = e^{−θ(R⁺)}. A distribution on S is
called a survival distribution; every probability distribution P on R+ is regarded as
a survival distribution such that P ({∞}) = 0. In this setting, θ is called the hazard
measure.
To each survival distribution P there corresponds a hazard measure θ such that
    θ(0, t] = −log P(t, ∞]   if P(t, ∞] > 0;
    θ(0, t] = ∞              if P(t, ∞] = 0;
for 0 < t < ∞. Although not especially important in practice, it is worth noting
that two distinct hazard measures may give rise to the same survival distribution:
see Exercise 14.14.
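The correspondence between survivor function and cumulative hazard is easy to code; a minimal sketch, using a unit-rate exponential survivor function purely for illustration:

```python
import math

def cum_hazard(S, t):
    """theta((0, t]) = -log P(t, infinity], with +infinity when the survivor function is 0."""
    s = S(t)
    return math.inf if s == 0.0 else -math.log(s)

S = lambda t: math.exp(-t)      # unit-rate exponential survivor function

# Round trip: exp(-theta((0, t])) recovers P(Y > t)
t = 2.0
print(cum_hazard(S, t))                                 # ~2.0
print(abs(math.exp(-cum_hazard(S, t)) - S(t)) < 1e-12)  # True
```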
Hazard Multiplication
Temporal Dilation
Consider now the group G = R+ of positive scalars acting on the space of hazard
measures by the usual rules for the transformation of distributions by temporal
dilation. For present purposes, dilation means that (gθ )(A) = θ (gA) for A ⊂ R+ .
Each group element is a treatment effect, which is an invertible transformation
g : Θ0 → Θ0, or equivalently Pθ → Pgθ, by scalar dilation, either of the hazard
measure or the distribution itself.
The accelerated-failure model states that each individual has a conditional hazard
given treatment, one for T = 0 and one for T = 1; if g > 0 is the treatment
effect, the two hazard densities are θu (t) and gθu (gt). As always, these are subject
to exchangeability: x(u) = x(u′) implies θu = θu′.
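Dilation of the hazard measure amounts to running time g times faster; a numerical sketch with an invented cumulative hazard Θ(t) = t², chosen only for illustration:

```python
import math

Theta = lambda t: t ** 2                 # cumulative hazard theta((0, t]), illustrative
S = lambda t: math.exp(-Theta(t))        # survivor function

g = 1.7                                  # dilation: (g theta)((0, t]) = theta((0, g t])
S_g = lambda t: math.exp(-Theta(g * t))  # survivor function under the dilated hazard

t = 0.9
print(S_g(t) == S(g * t))   # True: accelerated failure, time scaled by g
```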
The group actions illustrated above are the ones most commonly encountered in
survival analysis. It is evident that there are many other possibilities for group action,
most of which have limited potential for applied work, either because they are
implausible in one way or another, or because they lead to intractable computations.
Nonetheless, it may be helpful to describe a few. In the first two examples, the group
is R² with addition, and the action on hazards is multiplicative but not constant:

    θ(dt) → e^{g1 + g2 t} θ(dt)
    θ(dt) → e^{g1 + g2 log(t)} θ(dt).
where Ȳn, sn² are the sample mean and variance of the first n values, and αn² =
1 + 1/n. The Gosset process with parameter (μ, σ) differs only in the initialization:
Y1 = μ; Y2 = μ + σε1.
As a process on Borel subsets, Gosset's process is certainly not exchangeable:
Y1 does not have the same distribution as Y2 or Y3. Nor is it Gaussian, because
Y2 has a two-point distribution and Y3 has infinite moments. As a process
restricted to K it is exchangeable: the restricted process states not only that each
standardized ratio
    εn = (Yn+1 − Ȳn) / (sn √(1 + 1/n))
is distributed as tn−1 for n ≥ 2, but also that the ratio is independent of (Y [n], Kn ).
In this form, it is not difficult to see that every exchangeable Gaussian process
restricted to translation-scale invariant events coincides with Gosset’s process on K.
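The sequential construction can be sketched directly: the recursion Yn+1 = Ȳn + sn √(1 + 1/n) εn with εn ∼ tn−1 simply inverts the standardized ratio. The initial values μ = 0, σ = 1, and the choice ε1 = +1 for the two-point first step, are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0
y = [mu, mu + sigma]        # Y1 = mu; Y2 = mu + sigma*eps1, taking eps1 = +1 (assumed)
eps_used = []
for n in range(2, 12):
    ybar = float(np.mean(y))
    s = float(np.std(y, ddof=1))
    eps = float(rng.standard_t(df=n - 1))   # eps_n ~ t_{n-1}
    eps_used.append(eps)
    y.append(ybar + s * np.sqrt(1 + 1 / n) * eps)

print(len(y))   # 12 values; the standardized ratios reproduce eps_used
```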
The lesson from the Gosset example is that a process defined on translation-
scale invariant events has multiple extensions to a Borel process on Rn . Every
exchangeable Gaussian process is an extension of the restricted Gosset process.
The sequential description (14.5) is a non-exchangeable non-Gaussian extension,
an extension that is well-suited for fiduciary purposes that require parameter-free
prediction. The fiducial extension implies, for example, that Ȳ∞ = limn→∞ Ȳn
exists and that its conditional distribution given Y[n] is Student's tn−1, centered
at ȳn with scale parameter sn/√n:

    Ȳ∞ ∼ tn−1(ȳn, sn/√n).
In studies of causality and treatment effects, each unit in the population has one of k
possibilities for treatment. A non-randomized design consists of a finite sample U ⊂ 𝒰
together with a treatment assignment t : U → [k], and Pt is the associated finite-
dimensional response distribution. These distributional specifications—for different
samples and alternative assignments—are assumed to be mutually consistent in the
Kolmogorov sense, so they determine a stochastic process indexed by assignments t.
Mutual consistency does not imply independence for distinct units, but it does imply
lack of interference, or the stable unit-treatment distribution assumption in Dawid
(2021, Sect. 6.2). In most applications, the sample U ⊂ 𝒰 is—or is regarded
as—a fixed subset of the population, and treatment assignment is generated by
randomization subject to design constraints. For example, all sites on one rat in
Example 1 necessarily receive the same treatment.
The mathematical set-up for a counterfactual process is less complicated but
more elaborate. First, the index set is extended to the Cartesian product 𝒰 × [k],
consisting of all unit-treatment pairs, and the counterfactual process is envisaged as
a function on this index set. Thus, Y (u, r) is the response, or potential outcome, that
would be observed if treatment r were assigned to unit u or patient u. To specify the
counterfactual stochastic process, it is necessary to specify the joint distributions PS†
for each finite sample S ⊂ 𝒰 × [k] in a consistent manner. It suffices to specify the
finite-dimensional distributions PS† for finite product sets S = U × [k].
To each counterfactual sample S there corresponds a finite subset of units U ⊂ 𝒰,
which is obtained from S by ignoring the treatment component and eliminating
duplicate units. For example,
S = {(u1 , 0), (u1 , 2), (u2 , 1), (u4 , 0), (u4 , 1)}
P † (Y (u1 , 1) ∈ A | data)
based on data from any sample, physical or metaphysical. For example the data
might come from a physical design that includes (u1 , 0) or a metaphysical design
such as S that includes (u1 , 0) and (u1 , 2), in which case the conditioning event
includes the value Y (u1 , 0). In other words, although patient u was physically
assigned at baseline to the control level, the counterfactual process allows us to
compute the conditional distribution of the response for the same patient had he or
she been assigned to the active treatment level at baseline. Mathematical duplication
of patients enables us to evade the apparent contradiction.
To many authors, the flexibility afforded by the introduction of counterfactuals is
embraced as a liberating experience and a cause for celebration. For example, Pearl
and Mackenzie (2021) write on pages 269–270,
It is impossible to overstate the importance of this development. It provided researchers
with a flexible language to express almost every causal question they might wish to ask. . .
Indeed it does! However, that attitude is not universally embraced, and it is not
regarded by this author as a conceptual advance or a liberating experience or a cause
for celebration.
The situation in a nutshell is as follows. Let P † be the distribution of a
counterfactual process with domain 𝒰 × [k], and let P be its restriction to
assignments or physical designs. Equivalently, P† is a counterfactual extension
of P. For any real design S corresponding to an assignment t : U → [k], and
any event A ⊂ RS , the probabilities are equal: P † (A) = Pt (A). In other words,
the two processes are in agreement regarding the probability to be assigned to
every observable event. Given that there is no disagreement about observables, their
means, their variances, and so on, what is the explanation for the exuberance of
the quote in the preceding paragraph? The answer, as Dawid (2000) argues, is that
the counterfactual framework adds much to the vocabulary but brings nothing of
substance to the conversation regarding observables.
    Yn+1 = Ȳn + sn αn εn
    u       1     2     3    · · ·   k
    u1      ?    3.1    ?            ?
    u2     2.7    ?     ?            ?
    u3      ?     ?    4.5           ?
    ...
    un      ?     ?     ?           6.4
    un+1    ?     ?     ?            ?
that is initially empty. The statement that t′ is an extension of t means that t′(u) =
t(u) for all in-sample units. For an extension such that t′(un+1) = r, we can
compute the conditional distribution of Yn+1 using the conventional distribution Pt′
associated with that extension. By considering various extensions, it is possible to
complete all k entries in the additional row, one at a time, either by imputation or
by conditional expectation. But, by definition, the so-called missing entries for in-
sample patients are invisible to the conventional process.
14.6 Exercises
14.1 Let t be the treatment assignment vector, and let Bt be the associated block
factor, i.e., Bt(i, j) = 1 if ti = tj and zero otherwise. For g ∈ R, consider the
transformations

    Σ → Σ + g² Bt

for Σ in the space of positive definite matrices. Discuss whether these transformations
determine a group action or group homomorphism (preserving identity and
composition). If not, is it a semi-group homomorphism in a suitable sense? Maybe
after changing g² to e^g or |g| to maintain positivity?
14.2 This exercise is concerned with a possible action of the additive group of
real numbers on the space of positive definite matrices of order n. Let X ⊂ R^n
be a given subspace. To each Σ and W = Σ⁻¹ there corresponds a W-orthogonal
projection PW whose image is X, and a complementary projection QW = I − PW.
In matrix notation, PW = X(X′WX)⁻¹X′W depends on Σ. For g ∈ R, show that
the transformations

    Σ → QW Σ + e^g PW Σ = Σ + (e^g − 1) X(X′Σ⁻¹X)⁻¹X′
14.3 Let t be the treatment assignment vector, and let PW be the W-orthogonal
projection onto the subspace span(1, t). Show that the transformation

    Nn(μ, Σ) → Nn(μ + g0 t, QW Σ + e^{g1} PW Σ)
14.4 Under what conditions does the treatment model in the preceding exercise
satisfy the lack of interference condition?
14.5 Show that the log likelihood function for the simple linear regression model is

    −n log σ − ½ Σ (Yu − β0 − β1 xu)² / σ².

Deduce that the triple (Σ Yu, Σ xu Yu, Σ Yu²) is sufficient for the parameter. Under
what conditions is this triple also minimal sufficient?
14.6 Show that the same triple is sufficient for the six-parameter random coefficient
model (14.1) with one block. Deduce that the likelihood is maximized at the
boundary point σ0 = σ1 = 0. Discuss the situation for two or more blocks.
14.7 Let Θ be the extended complex plane. For each θ = θ0 + iθ1 let Pθ be the
distribution on the extended real line with density

    Pθ(dy) = |θ1| dy / (π |y − θ|²)

    gθ = (aθ + b)/(cθ + d).
14.8 Let Θ = R², and let Pθ be the von Mises–Fisher distribution on the unit circle
with density

    Pθ(dφ) = e^{θ′y} dφ / (2π I0(|θ|)),
where y = (cos φ, sin φ) and dφ is arc length, and I0 (·) is the Bessel I function
of order zero. Discuss the following groups as possible treatment effects acting
on distributions: (i) the group of strictly positive numbers acting on by scalar
multiplication; (ii) the group of planar rotations; (iii) the group of similarity
transformations generated by (i) and (ii).
14.9 Show that the random-coefficient model (14.2) is equi-variant under affine
covariate transformation x → g0 + g1 x with g1 = 0. Show that the induced
transformation on (β0 , β1 ) is linear R2 → R2 . Show that the induced transformation
on variance components is also linear
    ⎛ σ0²   ⎞      ⎛ 1   2g0   g0²  ⎞ ⎛ σ0²   ⎞
    ⎜ ρσ0σ1 ⎟  →   ⎜ 0   g1    g0g1 ⎟ ⎜ ρσ0σ1 ⎟
    ⎝ σ1²   ⎠      ⎝ 0   0     g1²  ⎠ ⎝ σ1²   ⎠
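The matrix in Exercise 14.9 can be checked against the direct computation A Σ A′, where A is the induced coefficient map (η0, η1) → (η0 + g0 η1, g1 η1); all numerical values below are invented for illustration:

```python
import numpy as np

g0, g1 = 0.7, 2.0
s0, s1, rho = 1.1, 0.6, -0.3

Sigma = np.array([[s0**2, rho*s0*s1],
                  [rho*s0*s1, s1**2]])
A = np.array([[1.0, g0],
              [0.0, g1]])            # induced map on the coefficients (eta0, eta1)

direct = A @ Sigma @ A.T             # transformed covariance of the coefficients

M = np.array([[1, 2*g0, g0**2],
              [0, g1, g0*g1],
              [0, 0, g1**2]])
v = M @ np.array([s0**2, rho*s0*s1, s1**2])

print(np.allclose([direct[0, 0], direct[0, 1], direct[1, 1]], v))  # True
```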
for units in the active treatment arm. The covariance in both cases is σ 2 I2 ; the
parameter μ ∈ R2 is unrestricted, while σ > 0 and the treatment effect lies in
0 ≤ τ < 2π. By expressing the sample averages for each treatment arm as complex
numbers, show that τ̂ = arg(Ȳ0 ) − arg(Ȳ1 ) is the maximum-likelihood estimate of
the treatment effect. Find the maximum-likelihood estimate of σ 2 .
14.11 A modification of the preceding model retains the mean vectors, while the
covariance matrix is unrestricted but constant over units. Find an expression for the
maximum-likelihood estimate of the treatment effect.
14.12 Show that the treatment effect in both preceding exercises is a group action
on Gaussian distributions. What is the group, and how does it act? In only one case
is the action on distributions induced by an action on the sample space. Explain.
14.14 Find the survival distribution P associated with the hazard measure
    θ(dt) = dt/(1 − t)   0 < t < 1;
    θ(dt) = dt/t         t ≥ 1.
Hence find a second hazard measure that has the same survival distribution.
14.15 Let Θ0 = (0, 1) and let Pθ be the iid Bernoulli model Ber(θ). Let G be the
additive group of addition modulo one, so that Pgθ = Ber(θ + g) is the Bernoulli
model with parameter θ + g modulo one. Explain why Ber(θ ) → Ber(θ + g) is not
a group action on distributions in the sense of Sect. 14.3.2.
Chapter 15
Gaussian Distributions
    Φ(dy) = φ(y) dy = (2π)^{−1/2} e^{−y²/2} dy
with respect to Lebesgue measure on the real line. It is symmetric with finite
moments of all orders. The moment generating function is
    M0(t) = ∫ e^{ty} φ(y) dy = e^{t²/2} = Σ_{r≥0} μr t^r / r!,
from which the odd moments are zero, and the even moments are
    μ2r = (2r)! / (2^r r!) = 1 · 3 · · · (2r − 1).
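The two expressions for the even moments agree, as a quick check confirms:

```python
from math import factorial

def even_moment(r):
    # mu_{2r} = (2r)! / (2^r r!)
    return factorial(2 * r) // (2 ** r * factorial(r))

def double_factorial(r):
    # 1 * 3 * ... * (2r - 1)
    out = 1
    for k in range(1, 2 * r, 2):
        out *= k
    return out

print([even_moment(r) for r in range(1, 5)])   # [1, 3, 15, 105]
print(all(even_moment(r) == double_factorial(r) for r in range(1, 10)))  # True
```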
The cumulant generating function is
    K0(t) = log M0(t) = Σ_r κr t^r / r! = t²/2,
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 251
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_15
distribution with mean μ and variance σ². The density function of the transformed
variable at y is

    σ⁻¹ φ((y − μ)/σ) = (σ√(2π))⁻¹ e^{−(y−μ)²/(2σ²)}.

The mean is μ, the variance is κ2 = σ², and all other cumulants are zero.
For x > 0, the ratio of the right tail probability 1 − Φ(x) to the density φ(x) is
called Mills's ratio. The asymptotic expansion is

    (1 − Φ(x))/φ(x) = 1/x − 1/x³ + O(x⁻⁵).
This stands in sharp contrast with heavy-tailed distributions for which the corre-
sponding ratio is increasing in x; in the case of the Cauchy distribution the ratio is
asymptotically linear in x.
The approximate inverse relationship for the Gaussian quantile in terms of the
right tail probability p is

    x ≈ √(−2 log p − log(2π) − 2 log x).
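The relation can be solved by fixed-point iteration; a sketch, with the caveat that the underlying tail approximation is itself asymptotic, so the answer is close to, but not equal to, the exact quantile:

```python
import math

def approx_quantile(p, iters=30):
    """Iterate x = sqrt(-2 log p - log(2 pi) - 2 log x) for the upper-tail quantile."""
    x = math.sqrt(-2.0 * math.log(p))   # crude starting value
    for _ in range(iters):
        x = math.sqrt(-2.0 * math.log(p) - math.log(2.0 * math.pi) - 2.0 * math.log(x))
    return x

print(approx_quantile(1e-6))   # about 4.76; the exact N(0,1) point is 4.7534...
```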
and the cumulant generating function is t 2 /2, which is quadratic and radially
symmetric as a function of t. All of the joint cumulants are zero except for the
variances, which are cov(Xi , Xj ) = δij , i.e., one for i = j and zero otherwise.
Let L be a linear transformation R^n → R^n, so that the matrix L is of order n × n.
The moment generating function of the transformed variable Y = LX is

    E(e^{t′Y}) = ∫_{R^n} e^{t′Lx} φn(x) dx = M0(L′t) = e^{‖L′t‖²/2} = e^{t′Σt/2},

where Σ = LL′, which is symmetric and positive semi-definite. The random variable
Y has the normal distribution in R^n with mean zero and covariance Σ, which is
denoted by Nn(0, Σ).
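The identity ‖L′t‖² = t′Σt with Σ = LL′ is the whole computation; numerically, with a random L chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.normal(size=(3, 3))
Sigma = L @ L.T                  # covariance of Y = LX with X ~ N_3(0, I_3)

t = np.array([0.2, -0.5, 0.1])
lhs = np.sum((L.T @ t) ** 2)     # |L' t|^2
rhs = t @ Sigma @ t              # t' Sigma t
print(np.isclose(lhs, rhs))      # True, so M_Y(t) = exp(t' Sigma t / 2)
```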
If L is invertible, the covariance matrix Σ = LL′ is also invertible with inverse
W = (L′)⁻¹L⁻¹. In that case, the Jacobian of the transformation is the absolute value
of the determinant of the transform matrix
    φ(z) = e^{−|z|²/σ²} / (πσ²)
with respect to two-dimensional Lebesgue measure. The real part and the imaginary
part of Z ∼ CN(0, 1) are independent zero-mean real Gaussian variables with
variance σ 2/2 each. The argument of Z is uniformly distributed on [0, 2π), and
independent of |Z|2 , which is exponentially distributed with mean σ 2 .
Rotational symmetry means that, for every real θ , the rotated variables Zeiθ
have the same distribution as Z. It follows that Z k ekiθ ∼ Z k for every integer k.
Provided that the moments are finite, μk ekiθ = μk implies that complex powers
satisfy E(Z k ) = 0 = E(Z̄ k ) for every integer k ≥ 1. The only non-zero integer
moments are E(|Z|2k ) = k!σ 2k in which Z and Z̄ occur an equal number of times
in the product. The kth order cumulant is cumk (|Z|2 ) = (k − 1)!σ 2k .
    π⁻ⁿ ∏_{r=1}^{n} e^{−|εr|²} = π⁻ⁿ e^{−‖ε‖²} = π⁻ⁿ e^{−ε*ε}
where 1 = −1 . Any pair of identically distributed real Gaussian vectors X, Y
defines a complex Gaussian vector Z = X + iY if and only if the cross-covariances
are anti-symmetric, cov(X, Y ) = − cov(Y, X).
15.2.3 Moments
case is a little simpler than the real case, and the product moment is as follows. To
each permutation π : [k] → [k] there corresponds a 1–1 matching ir → jπ(r) of
conjugated with non-conjugated components. Each matching gives rise to a product
of k covariances

    E(Z_{i1} · · · Z_{ik} Z̄_{j1} · · · Z̄_{jk}) = Σ_π ∏_{r=1}^{k} Σ_{i_r, j_{π(r)}} = per(Σ[i, j]),

which is the permanent of the i × j sub-matrix. Note that rows or columns may be
repeated, so that E(|Z1|^{2k}) = Σ11^k k!, which are the moments of the exponential
distribution. The permanent is the same as the determinant except that all k! terms
in the permutation expansion have coefficient +1.
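A brute-force permanent makes the repeated-index case concrete; a sketch in which the value σ² = Σ11 is chosen arbitrarily:

```python
from itertools import permutations
from math import prod, factorial

def per(A):
    """Permanent: the determinant's expansion with every sign taken as +1."""
    n = len(A)
    return sum(prod(A[r][p[r]] for r in range(n)) for p in permutations(range(n)))

sigma2 = 1.5
k = 4
A = [[sigma2] * k for _ in range(k)]       # sub-matrix for i = j = (1, 1, ..., 1)
print(per(A) == factorial(k) * sigma2**k)  # True: E|Z_1|^{2k} = k! Sigma_11^k
```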
Complex-valued random variables seldom occur in experimental research except
in the setting of Fourier transformation for time series, as in Chap. 7. They are not
used in the remainder of this chapter, but they do also arise in connection with
stationary Gaussian processes, particularly space-time processes in Chap. 16.
The main reasons for endowing the domain with Euclidean structure are as
follows:
1. Orthogonality of subspaces is associated with independence of random variables;
2. The orthogonal projection having a given image is associated with maximum-
likelihood and weighted least squares;
3. The orthogonal projection having a given kernel is associated with a number
of statistically distinct operations such as least-squares residual, prediction,
interpolation, smoothing and Kriging;
4. Cochran’s theorem and much of the distribution-theory associated with linear
regression and analysis of variance become more transparent.
From the vantage of linear algebra, it is natural to specify the inner product directly
through the inner-product matrix W , which is symmetric and strictly positive
definite. The inner product in the dual space of linear functionals is the matrix
inverse, Σ = W⁻¹.
The order of operations in statistical work is ordinarily reversed. A Gaussian
process is defined by its covariance function, which is naturally subject to restric-
tions such as stationarity, isotropy, or exchangeability, depending on the structure of
its domain. Consequently, the matrix Σ, which is the restriction of the covariance
function to the sample points, is specified first. The inverse matrix then determines
the inner product in the observation space H for the particular sample selected.
For a process sampled at points u1, . . . , un in some domain U, the matrix
component Σij = cov(Y(ui), Y(uj)) depends on ui, uj only, and is independent
of the configuration of the remaining sample points. By contrast, wij depends on the
entire configuration of sampled points. For example, if the process is stationary on
the plane, then ui − uj = ui′ − uj′ implies Σij = Σi′j′. But equal separation in the
domain does not imply wij = wi′j′.
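The contrast is easy to demonstrate with a stationary kernel on the line: equal separations give equal entries of Σ but not of W. The sample points and kernel below are invented for illustration:

```python
import numpy as np

u = np.array([0.0, 1.0, 2.0, 3.0])           # sampled points on the line
C = lambda d: np.exp(-d**2)                  # a stationary covariance function
Sigma = C(u[:, None] - u[None, :])
W = np.linalg.inv(Sigma)

print(np.isclose(Sigma[0, 1], Sigma[1, 2]))  # True: both pairs have separation 1
print(abs(W[0, 1] - W[1, 2]) > 1e-3)         # True: W sees the whole configuration
```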
Despite the substantial advantages listed in the preceding section, it is good
to be aware of one additional limitation of associating a specific geometry with
the Gaussian distribution. In statistical work it is often necessary to compare two
candidate distributions on the same observation space, for example by computing
the likelihood ratio. For example, the candidate distributions might be Nn(0, Σ0)
and Nn(0, Σ1) for two given matrices. To compute a likelihood ratio, it is essential to
compare candidate distributions on the same space, so it could be a serious mistake
to associate with each distribution its own geometry.
15.3.3 Projections
Specification by Image
Let X be any matrix of order n × p whose columns span the real subspace X ⊂ H
of dimension p. The transformation P : H → H whose matrix representation is
Specification by Kernel
Self-adjointness Identity
Mixed Products
Mixed products of projections with nested images behave as follows: im(P0) ⊆ im(P1) implies

    P1 P0 = P0 P1 = P0;

i.e., the projection with the smaller image prevails in the product. Mixed products having
nested kernels exhibit the opposite behaviour; ker(Q0) ⊆ ker(Q1) implies Q1 Q0 =
Q0 Q1 = Q1, in which the projection with the larger kernel prevails.
In statistical work related to linear models, two linear transformations T, T′
having the same kernel are statistically equivalent in the sense that there exist linear
transformations L, L′ such that T′ = LT and T = L′T′. One can be obtained from
the other by a linear transformation; for projections, L = T′ and L′ = T.
Rank Degeneracy
Suppose that Y ∼ Nn(0, Σ), where Σ has rank n − p. Let K : R^n → R^{n−p} be any
linear transformation whose kernel coincides with the kernel of Σ, i.e., ker(K) =
ker(Σ) = X. This means that K is a matrix of order (n − p) × n and rank n − p.
Despite the notation, Y (u) or Y (δu ) in isolation is not a Gaussian variable with finite
variance. Provided that the phrase is understood informally as a limit, it is seldom
misleading to regard Y (u) as Gaussian with ‘infinite’ variance. Floating Brownian
motion is not defined pointwise, but it is stationary on its domain of contrasts.
Standard Brownian motion
Y (α) = α1 Y1 + · · · + αn Yn .
Instead of indexing Y by the points i ∈ [n], the preceding notation suggests that
we use the space of linear combinations as an extended index set. Strictly speaking,
the extension is superfluous: Y(·) is linear, e.g., Y(3α + 4β) = 3Y(α) + 4Y(β),
so all values are determined by the values on any basis.
The covariance of two linear combinations is bilinear:
cov(Y(α), Y(β)) = ⟨α, β⟩ = ∑_{i,j} αi βj Σij.
In this section, H is the Hilbert space associated with the distribution Nn(0, Σ). For
simplicity of exposition, Σ is invertible with W = Σ^{-1}, and dim(H) = n.
The squared norm of a vector x ∈ H is ‖x‖² = x′Wx. For Y ∼ N(0, Σ), the
distribution of the scalar random variable ‖Y‖² can be obtained from its moment
generating function
E(e^{t‖Y‖²}) = (2π)^{−n/2} |W|^{1/2} ∫_{Rn} e^{t y′Wy − y′Wy/2} dy = (1 − 2t)^{−n/2},
provided that t < 1/2. The moment generating function of the χ12 -distribution is
(1 − 2t)−1/2 , so Y W Y is distributed as χn2 , which is the distribution of the sum
Z12 + · · · + Zn2 of squares of n independent standard Gaussian variables.
The χ² density function is available in closed form, but is not especially
important for either theory or applications. The cumulant generating function
−n log(1 − 2t)/2 implies that the rth cumulant is κr = n (r − 1)! 2^{r−1}. All cumulants
are proportional to n, the mean and variance are n and 2n, and the central limit
theorem implies χ²_n ≈ N(n, 2n) for large n. For numerical work, the cumulative
distribution function is available in R using the syntax pchisq(x, df=n), and
simulated variables are available using rchisq(..., df=n).
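The χ²_n claim for ‖Y‖² = Y′WY is easy to check by simulation. The sketch below uses Python rather than the R syntax quoted above, and the covariance matrix is an arbitrary illustrative choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)        # an arbitrary positive-definite Sigma
W = np.linalg.inv(Sigma)

Y = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
q = np.einsum('ij,jk,ik->i', Y, W, Y)  # squared norms Y'WY

print(q.mean(), q.var())               # close to n = 5 and 2n = 10
print(stats.kstest(q, stats.chi2(df=n).cdf).pvalue)
```

The Kolmogorov–Smirnov comparison with the χ²_n distribution function plays the role of pchisq here.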
15.4.2 Independence
Cochran’s Theorem
Y = P1Y + · · · + PkY;   ‖Y‖² = ‖P1Y‖² + · · · + ‖PkY‖².
The components PrY are independent Gaussian, with ‖PrY‖² ∼ χ²_{nr} for nr = rank(Pr),
so each ratio of mean squares
(‖PrY‖²/nr) / (‖PsY‖²/ns)
for r ≠ s is distributed as F_{nr,ns}.
0 = X0 ⊂ X1 ⊂ · · · ⊂ Xk ⊂ Xk+1 = H
of dimensions 0 < n1 < n2 < · · · < nk < n. Let Pr be the orthogonal projection onto
Xr so that Pr Ps = Pr∧s , and Qr Qs = Qr∨s for the complementary projections.
Then the increments (ΔP)r = Pr − Pr−1 = Qr−1 − Qr are mutually orthogonal
projections satisfying the conditions for Cochran’s theorem. In particular, if Y ∼
N(Xβ, σ 2 V ) satisfies the standard linear model assumption with non-zero mean
such that X1 = span(X), and X2 = span(X, Z) is any proper subspace of H
containing X as a proper subspace, then
‖Q1Y‖² = ‖(Q1 − Q2)Y‖² + ‖Q2Y‖²,
and the ratio of mean squares
F = (‖(Q1 − Q2)Y‖²/(n2 − n1)) / (‖Q2Y‖²/(n − n2))
is distributed as F_{n2−n1, n−n2}.
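The nested decomposition and the resulting F-ratio can be illustrated numerically. In this sketch (an assumption-laden illustration: V = I so that W-orthogonality reduces to Euclidean orthogonality, and the design matrices are randomly generated), the Pythagorean identity and the F statistic are computed directly from projections:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n1, n2 = 60, 2, 5
X = np.column_stack([np.ones(n), rng.standard_normal(n)])     # X1 = span(X)
XZ = np.column_stack([X, rng.standard_normal((n, n2 - n1))])  # X2 = span(X, Z)

def proj(M):
    # Euclidean orthogonal projection onto the column span of M (V = I assumed)
    return M @ np.linalg.solve(M.T @ M, M.T)

Q1 = np.eye(n) - proj(X)
Q2 = np.eye(n) - proj(XZ)

Y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)         # mean lies in X1
# Pythagorean decomposition |Q1 Y|^2 = |(Q1 - Q2)Y|^2 + |Q2 Y|^2
lhs = Y @ Q1 @ Y
rhs = Y @ (Q1 - Q2) @ Y + Y @ Q2 @ Y
F = (Y @ (Q1 - Q2) @ Y / (n2 - n1)) / (Y @ Q2 @ Y / (n - n2))
print(np.isclose(lhs, rhs), F, stats.f(n2 - n1, n - n2).sf(F))
```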
Thus, the conditional distribution of Y given Z is N(QY, PΣ). These equations are
dual to (15.6).
In standard probability terminology, prediction calls for the conditional distribu-
tion given the σ -field generated by the observation as a measurable transformation.
By definition, the σ -field generated by a linear transformation with kernel K ⊂ Rn
is the Borel σ -field in Rn /K, i.e., all Borel subsets A ⊂ Rn such that A + K = A.
In that probabilistic sense, all linear transformations having the same kernel are
equivalent.
where Jn(i, j) = 1 is the n × n matrix whose components are all one. The inverse
matrix is
Σn^{-1} = σ0^{-2} ( In − θ Jn/(1 + nθ) ),
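The stated inverse is a one-line numerical check; the parameter values below are arbitrary:

```python
import numpy as np

n, sigma0_sq, theta = 6, 2.0, 0.7          # illustrative values
J = np.ones((n, n))                        # the matrix Jn of all ones
Sigma = sigma0_sq * (np.eye(n) + theta * J)
Sigma_inv = (np.eye(n) - theta * J / (1 + n * theta)) / sigma0_sq
print(np.allclose(Sigma @ Sigma_inv, np.eye(n)))   # True
```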
are independent. To avoid confusion in statistical work where the covariance matrix
is not completely known, it is best to fix the inner product in H rather than having a
parameter-dependent inner product: see the cautionary remarks in Sect. 15.2.2. Most
‖PnY‖² = nȲn² ∼ (σ0² + nσ1²) χ²_1,
‖QnY‖² = (n − 1)sn² ∼ σ0² χ²_{n−1}
E(Y[n + 1:m] | Y[n]) = nθȲn/(1 + nθ) · 1m,
cov(Y[n + 1:m] | Y[n]) = σ0² Im + σ1² Jm/(1 + nθ).
The same partitioned-matrix formulae also imply that the conditional distribution of
the average (Yn+1 + · · · + Yn+m )/m given Y [n] is Gaussian with moments
E(Ȳ_{n+1:m} | Y[n]) = nθȲn/(1 + nθ),
var(Ȳ_{n+1:m} | Y[n]) = σ0²/m + σ1²/(1 + nθ).
This conditional distribution has a limit as m → ∞ for fixed n, implying that the
infinite average is a conditionally non-degenerate random variable such that
Ȳ∞ ∼ N( nθȲn/(1 + nθ), σ0²θ/(1 + nθ) ).
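The conditional mean nθȲn/(1 + nθ) for the infinite average can be checked by simulation, using the representation Yi = σ1B + σ0εi of the exchangeable process with a shared component B, so that Ȳ∞ = σ1B (a sketch; the parameter values are illustrative). The empirical regression slope of Ȳ∞ on Ȳn should match nθ/(1 + nθ):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma0, sigma1, n, reps = 1.0, 1.5, 8, 200_000
theta = sigma1**2 / sigma0**2                  # variance ratio

B = rng.standard_normal(reps)                  # shared component, one per replicate
eps = rng.standard_normal((reps, n))
Y = sigma1 * B[:, None] + sigma0 * eps         # each row is exchangeable
Ybar_n = Y.mean(axis=1)
Ybar_inf = sigma1 * B                          # the infinite average

slope = np.cov(Ybar_inf, Ybar_n)[0, 1] / np.var(Ybar_n)
print(slope, n * theta / (1 + n * theta))      # both close to 18/19
```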
The limit θ → ∞ gives Ȳ∞ − Ȳn ∼ N(0, σ0²/n). For n ≥ 2, the internally-standardized
ratio
√n (Ȳ∞ − Ȳn)/sn ∼ t_{n−1},
m(y) = ∫_R p(x) φ(y − x) dx = φ(y) ∫_R p(x) e^{xy − x²/2} dx.
Thus the density ratio m(y)/φ(y) is the Laplace transform of the function
p(x) e^{−x²/2}. In addition, p(x) e^{−x²/2} φ(0)/m(0) is a probability density whose
cumulant generating function is log(m(y)/φ(y)). In the absence of other
considerations, the goal of signal estimation is to compute the conditional expected
value of the signal given the data.
E(e^{tX} | Y) = ∫ e^{tx} p(x) φ(y − x) dx / m(y)
            = (φ(y)/m(y)) ∫ e^{tx + xy − x²/2} p(x) dx
            = (m(y + t)/φ(y + t)) / (m(y)/φ(y));
E(X | Y) = d/dt log( m(y + t)/φ(y + t) ) |_{t=0}
         = d/dy log( m(y)/φ(y) ) = y + m′(y)/m(y).
var(X | Y) = d²/dy² log( m(y)/φ(y) ).
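Eddington's formula E(X | Y = y) = y + m′(y)/m(y) can be verified numerically in a case where the answer is known in closed form: for a Gaussian signal X ∼ N(0, τ²) with unit-variance noise, the posterior mean is τ²y/(1 + τ²). A sketch, with a numerical derivative in place of m′ and illustrative values of τ and y:

```python
import numpy as np
from scipy import stats

tau, y, h = 1.3, 0.8, 1e-5                 # illustrative values; h for the derivative

def m(y):
    # Marginal density m(y) = int p(x) phi(y - x) dx; for X ~ N(0, tau^2) and
    # standard normal noise this is the N(0, 1 + tau^2) density
    return stats.norm(0, np.sqrt(1 + tau**2)).pdf(y)

eddington = y + (m(y + h) - m(y - h)) / (2 * h) / m(y)   # y + m'(y)/m(y)
exact = tau**2 * y / (1 + tau**2)                        # known posterior mean
print(eddington, exact)
```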
where φ is the density of Nd(0, Σ), and m′(y) is the gradient vector.
For the vector formula to be useful in practical work, it is usually necessary
to make further simplifying assumptions. Rotational symmetry for both the signal
and the noise is reasonably natural. In that case ε ∼ N(0, Id ), the signal density
satisfies p(σ x) = p(x) for each orthogonal transformation σ : Rd → Rd , and
m(σy) = m(y) is also rotationally symmetric. The conditional expectation H (y) =
E(X | Y = y) then satisfies the commutativity condition
H (σy) = σ H (y).
Since the sub-group ±1 commutes with every group element, the normalized vector
H(y)/‖H(y)‖ is necessarily equal to ±y/‖y‖. Examples of such transformations
include scalar multiples y ↦ −3y, and invariant multiples such as the Euclidean
normalization y ↦ y/‖y‖2 or the inversion y ↦ y/‖y‖2². But L1-normalization
y ↦ y/‖y‖1 does not commute with G.
15.4 Statistical Interpretations 269
Suppose that the signal X is a random matrix of order n×p, and that the components
of ε are independent standard normal. The first spectral moment is the conditional
expected value of tr(X′X), i.e., the sum of the eigenvalues, given Y = y. By
Eddington’s second-moment formula, the first spectral moment is the trace of the
second partial derivatives of the conditional moment generating function
m(y + t)φ(y)
M(t | Y ) =
φ(y + t)m(y)
with respect to t at the origin. The trace of second derivatives is the standard
Laplacian. The second spectral moment is the conditional expected value of the
sum of squared eigenvalues, i.e., the expected value of tr((X′X)²) given Y = y. By
Eddington’s fourth-moment formula, the second spectral moment is the cyclic scalar
contraction of the fourth partial derivatives of the moment generating function.
Similar remarks apply to the kth-order spectral moment, which is the cyclic scalar
contraction of the partial derivatives of order 2k.
The spectral-moment formulae can be simplified if the signal distribution is
rotationally symmetric with respect to left and right orthogonal transformation.
In other words, p(σ xτ ) = p(x) for all orthogonal matrices σ of order n and τ
of order p. Since ε is rotationally symmetric, the convolution is also rotationally
symmetric in the same sense, and the conditional expectation is equi-variant in the
sense H (σyτ ) = σ H (y)τ . Equi-variance implies that H acts only on the singular
values.
Although the conditional expectation y → H (y) is an action on singular values,
the transformation does not necessarily act component-wise, nor is it necessarily a
shrinkage. Under certain sparsity assumptions, it is possible to be more specific
about the nature of the transformation, which is a shrinkage towards the origin
applied component-wise to the singular values: see Sect. 17.5.3.
l(β, σ²; y) = −½ ‖y − Xβ‖²/σ² − n log σ + const.
PY ∼ N(Xβ, σ²PV),   QY ∼ N(0, σ²QV).
s² = ‖Qy‖²/(n − p),
Exercises 15.4–15.9 give an outline of the argument for combining linear regression
with prediction. The parametric family Y ∼ N(Xβ, σ 2 V ) on H = (Rn , W )
determines a parametric family on the observation space, which is the image of
Nn (QY + P μ, σ 2 P V ).
The least-squares estimate is β̂ = (X0′ V00^{-1} X0)^{-1} X0′ V00^{-1} Y0, giving the fitted mean
with components μ̂0 = X0β̂ and μ̂1 = X1β̂. Given Y0, the predictive distribution
for Y1 has moments
μ̂1 + V10 V00^{-1} (Y0 − μ̂0)   and   σ² W11^{-1},
with σ 2 replaced by s 2 where needed. The predictive mean in this setting goes by
various names—best linear predictor, fiducial predictor, Kriging estimate, smooth-
ing spline—depending on the area of application.
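A minimal numerical sketch of the scheme above, with an illustrative exponential covariance on the line (not a model from the text): generalized least squares for β̂, then the predictive mean μ̂1 + V10V00^{-1}(Y0 − μ̂0):

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 10, 30)                         # 30 sites on the line
n = 25                                             # first 25 observed, last 5 predicted
V = np.exp(-np.abs(t[:, None] - t[None, :]))       # illustrative exponential covariance
X = np.column_stack([np.ones(30), t])              # mean model X beta

L = np.linalg.cholesky(V)
Y = X @ np.array([1.0, 0.5]) + L @ rng.standard_normal(30)

obs, new = slice(0, n), slice(n, 30)
V00, V10 = V[obs, obs], V[new, obs]
X0, X1, Y0 = X[obs], X[new], Y[obs]

# Generalized least squares: beta_hat = (X0' V00^-1 X0)^-1 X0' V00^-1 Y0
W00 = np.linalg.inv(V00)
beta_hat = np.linalg.solve(X0.T @ W00 @ X0, X0.T @ W00 @ Y0)
mu0, mu1 = X0 @ beta_hat, X1 @ beta_hat

# Predictive (kriging) mean for the unobserved sites
pred = mu1 + V10 @ W00 @ (Y0 - mu0)
print(beta_hat, pred)
```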
Fiducial Prediction
y + K = Qy + K = {y′ ∈ Rn | Ky′ = z},   (15.9)
which is the subset of points that are consistent with the observed value.
Any distribution defined on Borel subsets of Rn can be restricted to a sub-σ -field
if the need arises. In the linear-model setting Nn (Xβ, σ 2 V ) with X = span(X),
an event A ⊂ Rn is said to be translation-invariant if A + X = A. For historical
reasons, the invariant events are also called fiducial events; the set of fiducial events
is the Borel σ -field B(Rn /X ).
According to the fiducial argument, the set of distributions Nn (Xβ, σ 2 V ) indexed
by β ∈ Rp for fixed σ is interpreted as a single distribution Nn (0, σ 2 V ) on fiducial
events. For this purpose, two Gaussian distributions Nn(μ0, Σ0) and Nn(μ1, Σ1) are
equivalent modulo X if, for any linear transformation T whose kernel includes X,
Tμ0 = Tμ1 and TΣ0T′ = TΣ1T′. In particular, ker(QX) = X implies that the
distributions Nn (Xβ, V ) and Nn (0, QX V ) are fiducially equivalent. Thus, a single
fiducial distribution has multiple covariance-matrix representations in Rn .
Fiducially speaking, the response distribution is Nn (0, σ 2 QX V ), the observation
is a linear transformation Q† with kernel X + K, and (15.7) implies that the
conditional distribution given the observation is
Nn( Q†Y, σ²(QX − Q†)QX V ) = Nn( Q†Y, σ²(QX − Q†)V )
≅ Nn( Q†Y + μ̂, σ²(P† − PX)V )
≅ Nn( Q†Y + μ̂, σ² PK V ).
15.5 Additivity
F = (‖QY‖² − ‖Q1Y‖²) / (‖Q1Y‖²/(n − p − 1))   (15.10)
T = γ̂ / s.e.(γ̂ ),
and the value compared with the null distribution t_{n−p−1}. As always, T² = F, so
the two approaches are effectively equivalent. If the F-ratio is large, the remedy
suggested is to transform the response Y ↦ Y^λ, and Tukey's suggested power
transform is λ = 1 − 2γ̂Ȳ.
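Tukey's one degree of freedom can be sketched in a few lines: fit the linear model, form the constructed variable zi = μ̂i², refit with z added, and compare the reduction in residual sum of squares to the residual mean square. The design and data below are simulated purely for illustration, with V = I:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
Y = X @ rng.standard_normal(p) + 0.3 * rng.standard_normal(n)

def proj(M):
    # Euclidean orthogonal projection onto the column span of M
    return M @ np.linalg.solve(M.T @ M, M.T)

mu_hat = proj(X) @ Y
z = mu_hat**2                          # Tukey's constructed variable
Xz = np.column_stack([X, z])

rss0 = Y @ (np.eye(n) - proj(X)) @ Y   # |QY|^2
rss1 = Y @ (np.eye(n) - proj(Xz)) @ Y  # |Q1 Y|^2
F = (rss0 - rss1) / (rss1 / (n - p - 1))
print(F, stats.f(1, n - p - 1).sf(F))  # F-ratio and its p-value
```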
and the conditional distribution given μ̂ is N(0, σ²(z′WQz)^{-1}). Given μ̂, the
residual vector is split additively into two orthogonal parts
‖QY‖² = γ̂ (z′WQz) γ̂ + ‖Q1Y‖²
       = Y′WQz (z′WQz)^{-1} z′WQY + Y′WQ1Y.
According to Sect. 15.3.2, these are conditionally independent given μ̂ with
distributions σ²χ²_1 and σ²χ²_{n−p−1} respectively. Thus, Tukey's distributional
assertions are upheld conditionally on μ̂, and therefore unconditionally.
are also linearly independent. For that setting, the 1DOFNA is equivalent to the
one degree of freedom for non-linearity, and specifically quadratic deviations from
linearity.
Provided that the constructed variable is a function of μ̂, any component-wise
non-linear transformation such as zi = exp(μ̂i ), or any non-component-wise
transformation H → H, may be used in the algorithm. Subject to the condition
z ∈ X mentioned above, the distributional argument leading to the conclusion
that the 1DOFNA F -ratio is distributed as F1,n−p−1 is unaffected by the choice of
transformation. For X = 1, the neighbour average zi = avej ∈nb(i) μ̂j is an example
of a linear non-component-wise transformation H → H that might arise in a spatial
or graphical setting.
Note that the word transformation is used above in two algebraically distinct
senses. First, every statistical vector is a function U → R on the units, and
component-wise transformation g : R → R refers to composition y ↦ g ∘ y on the
left, as illustrated by the diagram U −y→ R −g→ R. Component-wise transformation
exploits the fact that R^U is a commutative ring. Second, every statistical vector is
also a point y ∈ H, and a typical linear transformation H → H such as y ↦ μ̂ or
y ↦ Qy does not act component-wise.
15.6 Exercises
is closed under matrix addition and multiplication. Show also that the ‘linear’
mapping into the space of complex n × n matrices
[ A  B ]
[ −B A ]  ↦  A + iB
15.2 Let A + iB be a full-rank Hermitian matrix of order n. Show that the inverse
matrix C + iD is also Hermitian and satisfies the pair of equations
AD + BC = 0; AC − BD = In .
are mutual inverses. What does this matrix isomorphism imply about the relation
between complex Gaussian vectors and real Gaussian vectors?
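The matrix isomorphism above is conveniently checked numerically: the mapping A + iB ↦ [[A, B], [−B, A]] respects sums and matrix products, which is the content of the exercise. A sketch with random matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3

def real_form(M):
    # The real 2n x 2n representation [[A, B], [-B, A]] of M = A + iB
    A, B = M.real, M.imag
    return np.block([[A, B], [-B, A]])

M1 = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
M2 = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# The mapping respects sums and products (an algebra homomorphism)
ok = (np.allclose(real_form(M1 + M2), real_form(M1) + real_form(M2))
      and np.allclose(real_form(M1 @ M2), real_form(M1) @ real_form(M2)))
print(ok)   # True
```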
15.3 By writing the complex vector z and the Hermitian matrix Σ = Σ0 + iΣ1 as a
linear combination of real and imaginary parts, show that the Hermitian quadratic
form z∗Σz reduces to the following linear combination of real quadratic forms:
Hence deduce that the real and imaginary parts of Z ∼ CN(0, Σ) are identically
distributed Gaussian vectors N(0, Σ0) with covariances cov(X, Y) =
− cov(Y, X) = Σ1.
Gaussian Linear Prediction The next five exercises are concerned with estimation
and prediction in the Gaussian linear model Y ∼ Nn (μ = Xβ, σ 2 V ) in which the
observation is the linear transformation Z = KY . The matrices X of order n × p,
K of order (n − k) × n, and V of order n × n are given, while β, σ² are parameters to be
estimated. All three matrices are of full rank, the product KX has rank p ≤ n − k,
while the Hilbert space H with inner-product matrix W = V −1 determines the
geometry.
15.8 Show that the least-squares estimate of the conditional distribution of Y given
Z is
Nn( QY + Pμ̂, s² PV )
for some scalar s 2 . Show that the least-squares estimate is singular and is supported
on the k-dimensional coset QY + K. Explain why self-consistency requires KQ =
K.
15.9 Show that the zero-mean exchangeable Gaussian process in Sect. 15.3.3 with
covariances cov(Yi, Yj) = σ1² + σ0² δij may be generated sequentially by
Y_{n+1} = nθȲn/(1 + nθ) + σ0 (1 + θ/(1 + nθ))^{1/2} ε_{n+1}
for n ≥ 0. Here θ = σ1²/σ0² is the variance ratio, and ε1, ε2, . . . are independent
standard normal variables.
15.10 Suppose that X is uniformly distributed on the surface of the unit sphere in
Rd , and that Y ∼ N(X, σ 2 Id ) is observed. Show that Eddington’s formula reduces
to the projection E(X | Y) = Y/‖Y‖.
15.11 Suppose that X is uniformly distributed on the interior of the unit sphere
in Rd , and that Y ∼ N(X, σ 2 Id ) is observed. Show that Eddington’s formula is a
radial shrinkage so that E(X | Y ) has norm strictly less than one.
Chapter 16
Space-Time Processes
Let U be an arbitrary index set, here identified with the domain. A Gaussian process
associates with each u in the domain a random variable Zu in such a way that for
each sample U = (u1 , . . . , un ), the random variable Z[U ] = (Zu1 , . . . , Zun ) has
a Gaussian distribution. It should be noted that the sample points are taken in a
specific order, so U is an n-tuple of points from the domain, and the components of
Z are taken in the same order. If U contains repeats, say U = (u1 , u1 , u2 ), then the
first two components of Z[U ] are necessarily identical.
As a function on the index set, Z may be real-valued or complex-valued or Rk -
valued or Ck -valued. This chapter focuses chiefly on scalar processes, either real-
valued or complex-valued, so Z is a function U → R or a function U → C into the
space of scalars. Since the complex numbers are in 1–1 correspondence with ordered
pairs of reals, every complex-valued process Z = X + iY is also an R²-valued process
(X, Y ). Any reader who has made it this far has every right to ask why, in a book that
professes to be concerned with scientific applications of statistical ideas, we should
concern ourselves with a complex-valued process when an R²-valued process would
serve the same purpose. However, there is a legitimate reason, which is central to the
theme of this chapter. For reasons discussed below, an arbitrary R2 -valued Gaussian
process (X, Y ) is not a complex Gaussian process in the algebraic sense. The algebra
of the complex numbers is not irrelevant in the real world.
A Gaussian process is determined by its mean function μ(·) and its covariance
function K(·, ·). In the case of a real-valued process, μ is a function U → R, and
K is a symmetric function U × U → R that is also positive definite. In the case
of a complex-valued process, μ is a function U → C, and K is a positive-definite
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 279
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_16
The first equation is the condition for a pair of real-valued Gaussian processes X, Y
to determine a complex Gaussian process in the algebraic sense. The zero real part
implies KX = KY , so the real and imaginary parts of Z are two processes having
the same distribution. The imaginary part of the first equation implies that the cross-
covariances satisfy the mysterious skew-symmetry condition
16.2.1 Definitions
Stationarity and isotropy are properties of a process that are associated with a
group action on the domain. Stationarity is a symmetry or distributional invariance
with respect to domain translation; isotropy is an invariance under rotation or
orthogonal transformation. For translation to make sense, the domain is necessarily
a commutative group such as a vector space or an affine space; for orthogonal
transformation to make sense, the domain is necessarily a Euclidean space in which
the inner product determines the geometry.
A stochastic process Z with domain U is said to be stationary if the following
properties hold:
1. The domain is a commutative group, usually a vector space such as Rd or Cd for
some d ≥ 0. Other possibilities include the integers and the integers modulo k.
2. Each g ∈ U acts on the domain by addition, sending u to u + g;
3. The action on the domain sends the original process to Z g (u) = Z(u + g) by
composition, which is a translation by −g;
4. Each Z g has the same distribution as Z.
Ordinarily, the domain is a group which acts on itself by addition. Addition acts
transitively, which implies that each value Zu = Z(u) has the same distribution
as Z0 , i.e., all one-dimensional marginal distributions are equal. Since differences
are invariant under translation, stationarity implies that (Zu , Zu ) has the same joint
distribution as (Zv , Zv ) whenever u − u = v − v . Stationarity does not imply that
the pair (Zu , Zu ) has the same distribution as the reverse pair (Zu , Zu ).
Isotropy has a similar meaning in relation to a different group—orthogonal,
special orthogonal or unitary—acting on the domain, which is necessarily a
Euclidean space with an inner product.
1. The domain is Euclidean space, either Rd or Cd , or some subset of Euclidean
space that is closed under the group;
2. The orthogonal group, or possibly the special orthogonal group with positive
determinant, acts on the domain, sending u to gu;
3. The action on the domain sends the original process to Z g (u) = Z(gu) by
composition;
4. The process is isotropic if each Z g has the same distribution as Z.
g Z
U −→ U −→ R
which shows the process as a function U → R with the group acting on the domain.
Sometimes it is necessary to ask for clarification whether the full orthogonal
group, including reflections, is intended. Sometimes the domain may be a proper
subset of Euclidean space on which the group acts, for example, the unit circle or
the unit disk in the complex plane or the unit sphere in R3 .
Ordinarily in applied work, the domain has no natural origin, so it is better
described as an affine space. In such applications isotropy is not a natural require-
ment on its own, but the Euclidean group of proper rigid motions (translation plus
rotation) is very natural. Depending on the setting, reflections may or may not be
included.
A zero-mean complex Gaussian process is stationary if and only if K(u, u′) =
G(u − u′) for some function G such that G(−u) = Ḡ(u). In the case of a
real Gaussian process, G is real, and therefore symmetric. A zero-mean complex
Gaussian process is stationary and isotropic if and only if K(u, u′) = G(‖u − u′‖)
for some real-valued function G. Each function G is necessarily positive definite.
It follows that every stationary isotropic complex Gaussian process Z = X + iY
is a pair of independent isotropic real Gaussian processes X ∼ Y having the same
distribution. Conversely, a pair (X, Y ) → X+iY of independent and identically dis-
tributed stationary isotropic real-valued Gaussian processes determines a complex
Gaussian process. The situation for stationary non-isotropic processes is different.
Finally, a process is said to be reversible if Z[−U ] ∼ Z[U ] for every sample.
In this setting, the group {±1} acts on the domain u → ±u, and the process is
invariant. Usually, the domain is U = R representing time, in which case the process
is said to be time-reversible. Every stationary real-valued Gaussian process is time-
reversible; a stationary complex-valued Gaussian process is time-reversible only if
the covariance function is real.
be the vector of increments. The process is stationary on increments if, for each
integer n and each h in the domain, Z[u + h] − Z[v + h] has the same distribution
as Z[u] − Z[v]. A stationary process is automatically stationary on increments.
On U = Rd, the zero-mean Gaussian process with covariance
K(u, v) = ‖u‖ + ‖v‖ − ‖u − v‖
The orbits also come in conjugate pairs, {Or , O−r } with −r ≡ k − r such
that K0,r = K̄0,k−r . The zero orbit is self-conjugate, so values on the diagonal
are real (and positive). If k ≥ 2 is even, orbit k/2 is also self-conjugate, so K0,k/2 is
also real.
For k = 2, the covariance matrix of a stationary process
K = [ K0 K1 ]        [ 1 ρ ]
    [ K1 K0 ]  =  K0 [ ρ 1 ]
has two distinct values; both orbits are self-conjugate, so both values are real.
Stationarity implies that the real and imaginary parts are independent and identically
distributed processes; the reverse-time process has the same distribution as the
original.
For k = 3, the orbits are labelled −1, 0, 1 with conjugate values on conjugate
orbits: K−1 = K̄1. The covariance matrix
K = [ K0 K1 K̄1 ]        [ 1 ρ ρ̄ ]
    [ K̄1 K0 K1 ]  =  K0 [ ρ̄ 1 ρ ]
    [ K1 K̄1 K0 ]        [ ρ ρ̄ 1 ]
has three distinct values, one real and one complex-conjugate pair. The reverse-time
process is Gaussian with covariance K′ = K̄. Positive definiteness implies
K0 ≥ 0, |ρ| ≤ 1, plus the determinantal condition 1 − 3|ρ|² + 2ℜ(ρ³) ≥ 0. These
are equivalent to the conditions K0 ≥ 0 and ρ in the equilateral triangle with vertices
at the primitive roots {1, e^{2πi/3}, e^{−2πi/3}}. See Exercise 16.1 for an explanation. The
condition |ρ| ≤ 1/2 is sufficient but not necessary for positive-definiteness.
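The determinantal condition can be compared with a direct eigenvalue computation for the k = 3 circulant; the sample values of ρ below are illustrative, with ρ = e^{2πi/3} a vertex of the triangle (a boundary case):

```python
import numpy as np

def K3(rho):
    # Circulant covariance matrix for k = 3 with K0 = 1
    return np.array([[1, rho, np.conj(rho)],
                     [np.conj(rho), 1, rho],
                     [rho, np.conj(rho), 1]])

def min_eig(rho):
    return np.linalg.eigvalsh(K3(rho)).min()   # K3 is Hermitian

for rho in [0.5, -0.5, np.exp(2j * np.pi / 3), 0.9, 0.95j]:
    cond = 1 - 3 * abs(rho)**2 + 2 * (rho**3).real
    # The determinantal condition and the smallest eigenvalue agree in sign
    print(rho, round(cond, 6), round(min_eig(rho), 6))
```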
Any process whose domain is the set of real numbers may be regarded as a time
series. To underline the connection, points in the domain are denoted by t. If
time is continuous but values are recorded at constant temporal increments, the
recorded process is a linear transformation of a continuous-time series. The linear
transformation is either a transformation by restriction to the integers Z ⊂ R, or a
transformation by integration over unit intervals. In either case, the continuous series
determines the distribution of the linear transformation. All processes discussed here
are defined in continuous time.
Let ξ(t) = e^{iωt} be the complex harmonic function with frequency ω and period,
or wavelength, 2π/ω. The Hermitian function
Kω(t, t′) = e^{iω(t−t′)} = ξ(t) ξ̄(t′),
which is the Hermitian outer product ξξ∗ of ξ with itself, is positive definite
of rank one. Accordingly, if μ is a non-negative finite measure on frequencies, the
convex combination
Kμ(t, t′) = ∫_{−∞}^{∞} e^{iω(t−t′)} dμ(ω)
In the algebra that follows it is helpful to split the spectral measure into
symmetric and skew-symmetric parts μ = μsym + μalt as follows:
The symmetric part is non-negative. The alternating part is a signed measure such
that −1 ≤ (dμalt /dμsym )(ω) ≤ 1 for every ω. If we write t for the temporal
difference, the result of this decomposition of the measure is
Kμ(t) = ∫_{−∞}^{∞} cos(ωt) dμsym(ω) + i ∫_{−∞}^{∞} sin(ωt) dμalt(ω)
      = Kμ^sym(t) + i Kμ^alt(t).
For the spectral measure translated by a, the covariance is modulated by a complex
harmonic:
Kμ,a(t) = e^{iat} Kμ(t).
The standard Matérn spectral measure with index ν is symmetric with density
dμsym(ω) = dω / (1 + ω²)^{ν+1/2}.
For ν > 0, the measure is finite and proportional to the Student t distribution on 2ν
degrees of freedom. The standard Matérn covariance function is the characteristic
function of this distribution, which is real and symmetric. The characteristic function
of the translated Student t distribution is proportional to
Kν,a(t) = |t|^ν Kν(|t|) × (cos(at) + i sin(at)),
dμalt(ω) = dω/(1 + ω²)^{ν+1/2} × 2bω/(1 + ω²).
The condition −1 ≤ b ≤ 1 implies that the skew factor 2bω/(1+ω2 ) lies in [−1, 1],
so |μalt | ≤ μsym . Other possibilities for the skew factor include bω/(1 + ω2 )1/2 .
For the spectral measures shown above, the covariance function is proportional
to
Kμ(t) = |t|^ν Kν(|t|) (1 + ibt/(ν + 1/2)).   (16.3)
For a derivation of the real part, see Stein (1999, section 2.10) or Exercises 16.2–
16.6. In applications, t is replaced with t/ρ for some temporal range ρ.
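The Matérn function |t|^ν Kν(|t|) is available through the modified Bessel function Kν (scipy.special.kv in Python). A sketch, using the known special case ν = 1/2, where the covariance is proportional to e^{−|t|} (the Ornstein–Uhlenbeck covariance):

```python
import math
import numpy as np
from scipy.special import kv

def matern(t, nu):
    # Unnormalized Matern covariance |t|^nu K_nu(|t|), with its continuous
    # limiting value 2^(nu-1) Gamma(nu) at t = 0
    t = np.abs(np.asarray(t, dtype=float))
    out = np.full(t.shape, 2**(nu - 1) * math.gamma(nu))
    nz = t > 0
    out[nz] = t[nz]**nu * kv(nu, t[nz])
    return out

# For nu = 1/2, K_{1/2}(x) = sqrt(pi/(2x)) e^{-x}, so the covariance is
# proportional to exp(-|t|):
t = np.linspace(0.0, 5.0, 21)
print(matern(t, 0.5) / (np.sqrt(np.pi / 2) * np.exp(-t)))   # all ones
```

Replacing t by t/ρ rescales the temporal range, as noted in the text.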
We assume in this section that the domain is Rd for some d ≥ 1. In most examples,
the domain is also assumed to be Euclidean with an inner product and a norm, so
that the orthogonal group may act on it. To distinguish space from time, particularly
in the case d = 1, points in the domain are called sites and are denoted by x.
The frequency vector ω = (ω1, . . . , ωd) acts as a linear functional on the spatial
domain, so that the scalar product ωx ≡ ω′x is the value at x. The Hermitian
function
Kω(x, x′) = e^{iω′(x−x′)},
16.4 Stationary Spatial Process 287
which is the Hermitian outer product ξξ∗ of the function ξ(x) = e^{iω′x} with itself, is
positive definite of rank one. For each finite non-negative measure μ on frequencies,
the convex combination
Kμ(x, x′) = ∫_{Rd} e^{iω′(x−x′)} dμ(ω)
Kμ,a(x) = ∫_{Rd} e^{iω′x} dμ(ω − a) = e^{ia′x} Kμ(x).
Note that the exponent in the complex harmonic modulation factor is the inner
product a x, or a (x − x ), of the frequency shift vector a with the spatial
displacement x − x of the two sites.
If the spectral measure is rotationally symmetric, the covariance function is also
rotationally symmetric, which means that Kμ(x) is a function of ‖x‖. Both Kμ(x −
x′) and the real part of the frequency-shifted covariance are also real, symmetric and
positive definite, while Kμ,a(x − x′) is positive-definite Hermitian.
The spectral measure may be decomposed into symmetric and alternating parts
as defined in (16.1), so that
−1 ≤ (dμalt/dμsym)(ω) = −(dμalt/dμsym)(−ω) ≤ 1.
Since μsym is even and μalt is odd, the associated covariance function is
Kμ(x) = ∫ cos(ω′x) dμsym(ω) + i ∫ sin(ω′x) dμalt(ω)
      = Kμ^sym(x) + i Kμ^alt(x),
where x is the spatial difference vector for two sites. By construction Kμ^sym(−x) =
Kμ^sym(x) is even, whereas Kμ^alt(−x) = −Kμ^alt(x) is odd.
For general ν > 0, the Matérn spectral measure on Rd is finite and radially
symmetric with density
dμ(ω) = Γ(ν + d/2) dω / (π^{d/2} (1 + ‖ω‖²)^{ν+d/2}).   (16.4)
Mν(x − x′) = ‖x − x′‖^ν Kν(‖x − x′‖),
Spectral Convolution
Let μ∗2 = μ ∗ μ be the two-fold convolution of the Matérn measure with itself.
As a consequence of the definition, the characteristic function of a convolution is
the product of the characteristic functions. Thus, the covariance function associated
with μ∗2 is
Mν²(x − x′) = ‖x − x′‖^{2ν} Kν²(‖x − x′‖).
The behaviour of Mν² near the origin is the same as that of Mν, so the two
processes are equally smooth. Convolution extends to higher-order products such
as Mν Mν′ Mν″, in which case the behaviour near the origin is governed by
min(ν, ν′, ν″).
Frequency Translation
This defines a complex-valued process, Z(·), which is stationary but not isotropic
in Rd . The real and imaginary parts are identically distributed processes with
covariance Mν,a , but they are not independent unless a = 0. Part of the effect
of anisotropy can be understood from the covariances of sums and differences at
sites x, x′:
cov( Z(x) + Z(x′), Z̄(x) + Z̄(x′) ) = 2Mν(0) + 2Mν(x − x′) cos(a′(x − x′));
cov( Z(x) − Z(x′), Z̄(x) − Z̄(x′) ) = 2Mν(0) − 2Mν(x − x′) cos(a′(x − x′));
cov( Z(x) + Z(x′), Z̄(x) − Z̄(x′) ) = −2i Mν(x − x′) sin(a′(x − x′)).
The latter is zero for sites such that x − x′ is perpendicular to a, and a damped
sinusoid for x − x′ parallel to a.
The real part of the frequency-shifted covariance
Mν,a(x − x′) = Mν(x − x′) cos(a′(x − x′))
To a certain extent, the relation between μalt and μsym may be decided in arbitrary
fashion. Apart from frequency translation, there are a few other mathematically
natural choices such as
dμalt(ω) = dμsym(ω) × 2a′ω/(1 + ‖ω‖²),
In addition to the index and the range parameter which is not shown, this covariance
also depends linearly on the polar vector. The effect of the polar asymmetry on the
covariance of sums and differences is similar to (16.5) except that the sinusoids are
replaced by linear functionals.
Domain Restriction
with index ν, range ρ, and frequency-shift vector θ . The real and imaginary parts of
the process have the same distribution, but they are not independent. The frequency
shift vector governs both the magnitude and the direction of the anisotropy, as is
easily seen by visual inspection of the simulations.
Fig. 16.3 Two anisotropic Gaussian fields with ν = 1 and different anisotropy vectors
Fig. 16.4 Two anisotropic Gaussian fields with ν = 1/2 and different anisotropy vectors
so these are both positive definite Hermitian. In fact Mν²(x) is the characteristic
function of μ∗2, the two-fold convolution of the Matérn spectral measure (16.4),
which is spherically symmetric, and (CK)(x) is the characteristic function of the
convolution of the frequency-translated measures. The real part is
ℜ(CK)(x, x′) = Mν²(x − x′) × cos( (a + b)′(x − x′) )
Mν²(‖x − x′‖) (1 + ia′(x − x′)/(ν + d/2)) (1 + ib′(x − x′)/(ν + d/2)),
16.5 Covariance Products 297
Thus, the eigenvalues of the product are the products of the eigenvalues. Hence the
covariance product is positive definite on the product space, and the rank of the
product is the product of the ranks.
One important special case occurs when the spaces U and V are equal. The
covariance product (16.8) restricted to the diagonal of U × U is nothing more than
the Hadamard product of two covariance functions on U. Positive definiteness of the
Hadamard product follows trivially by diagonal restriction.
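Positive definiteness of the Hadamard product (the Schur product theorem) is easy to check numerically for random positive-definite factors:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 6

def rand_pd(rng, n):
    A = rng.standard_normal((n, n))
    return A @ A.T + np.eye(n)          # positive definite by construction

K1, K2 = rand_pd(rng, n), rand_pd(rng, n)
H = K1 * K2                             # Hadamard (element-wise) product
print(np.linalg.eigvalsh(H).min())      # strictly positive
```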
A covariance function on the product space is said to be separable if it is
expressible as a single product, as in (16.8). A statistical covariance model is said to
be separable if each covariance function in the model is separable. For example, the
For purposes of illustration, we use the temporal family (16.3) and the spatial
family (16.5) with the same index. Writing x and t in place of x − x′ and t − t′, the
outer product is
Mν(‖x‖) (1 + ia′x/(ν + d/2)) × Mν(t) (1 + ibt/(ν + 1/2)),   (16.9)
where −1 ≤ b ≤ 1 is a scalar. This product splits into four sub-products, two real
and two imaginary:
Mν(‖x‖) Mν(t);
Mν(‖x‖) Mν(t) × ia′x/(ν + d/2);
Mν(‖x‖) Mν(t) × ibt/(ν + 1/2);
Mν(‖x‖) Mν(t) × −(a′x)(bt)/((ν + 1/2)(ν + d/2)).   (16.10)
For a real spatio-temporal process, the two imaginary terms can be discarded. We
are left with a linear combination of the two real products,
Mν(‖x‖) Mν(t) (1 − (a′x)(bt)/((ν + 1/2)(ν + d/2))).   (16.11)
Only the first of these is positive definite. Nevertheless, the linear combination is
positive definite for all ‖a‖ ≤ 1.
The complex temporal process associated with (16.3) is stationary but not
reversible. The complex spatial process associated with (16.5) is stationary, but
the polar parameter implies a specific directional asymmetry. The real space-time
process with covariance (16.11) is both spatially and temporally stationary. If
ab = 0, the covariance is also space-time symmetric, or space-time reversible, in
the sense
cov(Z(site0, day0), Z(site1, day1)) = cov(Z(site0, day1), Z(site1, day0)).   (16.12)
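The reversibility condition (16.12) can be checked numerically. The sketch below uses the exponential function e^(−r) as a stand-in for Mν (the Matérn-1/2 case), with illustrative values of a, b, ν and d; it confirms that swapping the two days changes the covariance unless ab = 0.

```python
import math

def M(r):                        # Matérn-1/2 stand-in: M(r) = exp(-|r|)
    return math.exp(-abs(r))

def cov(s0, d0, s1, d1, a=(0.6, 0.0), b=1.0, nu=0.5, d=2):
    """Covariance (16.11) with planar sites and scalar times (illustration)."""
    x = (s0[0] - s1[0], s0[1] - s1[1])     # spatial lag
    t = d0 - d1                            # temporal lag
    ax = a[0] * x[0] + a[1] * x[1]         # a'x
    r = math.hypot(x[0], x[1])
    return M(r) * M(t) * (1 - ax * b * t / ((nu + 0.5) * (nu + d / 2)))

s0, s1, d0, d1 = (0.0, 0.0), (1.0, 0.5), 0.0, 2.0

# With ab != 0 the process is stationary but not space-time reversible:
assert abs(cov(s0, d0, s1, d1) - cov(s0, d1, s1, d0)) > 1e-9

# Setting a = 0 restores the symmetry (16.12):
assert abs(cov(s0, d0, s1, d1, a=(0.0, 0.0))
           - cov(s0, d1, s1, d0, a=(0.0, 0.0))) < 1e-12
```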
A process on a finite index set [k] is nothing more than a list of variables Z =
(Z1 , . . . , Zk ), one component Zr for each r ∈ [k]. The covariance function is a
matrix Kr,s , one value for each ordered pair r, s ∈ [k]. In the case of a complex-
valued process such that eiφ Z has the same distribution as Z for each complex
unit scalar multiple, the product Zr Zs has the same distribution as e2iφ Zr Zs , which
implies that the expected value is zero. It is sufficient, therefore, to record only the
expected values of the conjugated products Krs = E(Zr Z̄s); this matrix is positive
definite Hermitian.
When we refer to Z : [k] → C as a Gaussian process rather than a complex
Gaussian vector, we usually mean that the process has certain symmetries, which
are seen as patterns in the covariance matrix. Two rather simple examples suffice by
way of illustration.
For k = 2 and real θ, let χ2(θ) be the skew-symmetric matrix

χ2(θ) = [ 0  −θ ]
        [ θ   0 ],

and let K = I2 + i χ2(θ).   (16.13)
Kθ = I3 + i χ3(θ),   (16.15)

(1 − ‖θ‖²) Kθ^{−1} = I3 − θθ′ − i χ3(θ).
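The inverse identity above is easy to verify numerically. The following sketch builds Kθ = I3 + iχ3(θ) for an arbitrary θ with ‖θ‖ < 1, taking χ3(θ) to be the usual cross-product matrix, an assumption consistent with χ3(θ)v = θ × v.

```python
def chi3(t):
    """Skew-symmetric matrix with chi3(t) v = t x v (assumed form of chi3)."""
    t1, t2, t3 = t
    return [[0.0, -t3, t2], [t3, 0.0, -t1], [-t2, t1, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

theta = (0.3, -0.2, 0.4)
n2 = sum(v * v for v in theta)            # |theta|^2 = 0.29 < 1
X = chi3(theta)

K = [[(1.0 if i == j else 0.0) + 1j * X[i][j] for j in range(3)]
     for i in range(3)]
# Claimed inverse, up to the scalar factor (1 - |theta|^2):
Kinv = [[(1.0 if i == j else 0.0) - theta[i] * theta[j] - 1j * X[i][j]
         for j in range(3)] for i in range(3)]

P = matmul(K, Kinv)
for i in range(3):
    for j in range(3):
        target = (1 - n2) if i == j else 0.0
        assert abs(P[i][j] - target) < 1e-12
```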
16.6 Real Spatio-Temporal Process 301
where det(g) = +1 for a rotation and −1 for a reflection. In other words, the
family of matrices (16.15) is closed under orthogonal conjugation, as is the family
of inverses.
Complex Moments
In the product space C × [2], each domain point is an ordered pair (u, r) with u ∈ C
and r ∈ [2]. If it is convenient, the process values Z(u, r) can be taken in pairs
W (u) = (Z(u, 1), Z(u, 2)). In this form, W is a bivariate planar process, i.e., a
function C → C2 . If Z happens to be real-valued, W is a function C → R2 or a
function C → C, a planar process also taking values in the plane. By construction,
the product of (16.13) on [2] and (16.17) on C is positive definite Hermitian on
the product space. The real part of the product is positive-definite symmetric: it
associates with the ordered pair (u, r), (v, s) the real number
is not symmetric unless θ = 0. In that case, the second term disappears and
K0 (u, s; v, r) = K0 (u, r; v, s) exhibits full symmetry, i.e., cov(W (u), W (v)) =
cov(W (v), W (u)).
A 3D Real Process
which is the product of the real parts minus the product of the imaginary parts.
The diagonal blocks are equal and symmetric; each off-diagonal block is proportional
to K1[u, u′], which is skew-symmetric.
On account of (16.16), the orthogonally transformed process gZ has covariance
function
cov(gZ(u), gZ(v)) = K0(u, v) I3 ∓ K1(u, v) χ3(gθ),
where the sign is − det(g). In other words, the 3D process is not rotationally
symmetric, but the θ -indexed family is closed under orthogonal transformation.
Assuming that K and C are both stationary, the spatial profile is a Gaussian process
with covariance K(x − x′) C(0), which is spatially stationary. Although the spatial
distribution is constant in time, the profile itself is not static unless the temporal
factor is constant, C(t) = C(0).
As a specific example consider a complex-valued Matérn process in Rd with
polar vector a and covariance function (16.5). The covariance function for the
associated wave travelling at velocity v ∈ Rd is
Mν(‖x − x′ − v(t − t′)‖) (1 + ia′(x − x′ − v(t − t′))/(ν + d/2)).   (16.22)
Mν(‖x − vt‖) Mν(|t|);   (16.23)
Mν(‖x − vt‖) Mν(|t|) × ia′(x − vt)/(ν + d/2);
Mν(‖x − vt‖) Mν(|t|) × ibt/(ν + 1/2);
Mν(‖x − vt‖) Mν(|t|) × (−a′xt + a′vt²)/((ν + 1/2)(ν + d/2)).   (16.24)
The group element has two components, θ ∈ [0, 2π) or R (mod 2π), and v ∈
C. The covariance function for the transformed process is obtained by substituting
e^{iθt}x − vt for x − vt in (16.22) and (16.10). If the polar vector a is also taken as a
complex number, a′x is the real part of the complex product a x̄, not the product of
complex numbers.
with v ∈ Rd as velocity vector and b = −a′v as the scalar product. The perturbation-theory
covariance function

Mν(‖x − x′‖) Mν(|t − t′|) e^{ia′((x−vt) − (x′−vt′))}
has some of the characteristics of a travelling wave, but it is not the same as (16.22).
The original process associates with each point x in the domain a value Z(x); we
denote this diagrammatically by x −Z→ Z(x). The domain-transformed process ZR
associates with each point x a value (ZR)(x) = Z(Rx) by composition on the
domain:

x −R→ Rx −Z→ Z(Rx).
The process is said to be isotropic if, for every rotation R, the domain-
transformed process ZR has the same distribution as Z. It is rotationally symmetric
if, for every rotation R, the state-space-transformed process RZ has the same
distribution as Z. For present purposes, neither of these symmetry conditions is
compelling because any modification of the frame-of-reference in the domain has an
equal impact in the state space. If RZR has the same distribution as Z, we say that
the process is hydrodynamically symmetric; in other words, the frame of reference
is immaterial. Hydrodynamic symmetry is equivalent to the condition that RZ have
the same distribution as ZR. It does not imply that ZR has the same distribution
as Z, so hydrodynamic symmetry does not imply isotropy or rotational symmetry.
The gradient of a real-valued process on Rd provides the simplest example of a
hydrodynamic process: see Exercise 16.17.
Our primary focus in this section is on hydrodynamic processes in R2 and R3 .
The situation for R2 is simpler and reasonably well understood, so the algebra is
specific to d = 3. It is also specific to the proper orthogonal group O3+ consisting of
3D rotations, excluding reflections. Typically, the domain-reflected process Z(−x)
has the same distribution as −Z(x), which need not be the same as that of Z(x).
This definition means that x → χ (x) is a linear transformation from R3 into the
space of 3 × 3 matrices. It is a full-rank transformation: the image is the three-
dimensional subspace of skew-symmetric matrices. By construction, Q(x)x =
χ(x)x = 0 and χ²(x) = −Q(x).
For fixed x ≠ 0, the normalized matrices χ̂ = χ(x)/‖x‖ and Q̂ = Q(x)/‖x‖²
can be multiplied as follows:

Q̂² = Q̂;   Q̂χ̂ = χ̂Q̂ = χ̂;   χ̂² = −Q̂.
This means that the four matrices {±Q̂(x), ±χ̂(x)} form a finite group which is
isomorphic with that of the four complex numbers {±1, ±i}. For each x ≠ 0, the
2D subspace of matrices αQ̂ + βχ̂ for (α, β) ∈ R² is a field that is isomorphic with
the complex numbers.
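The multiplication table above can be verified directly. The sketch below assumes Q(x) = ‖x‖²I3 − xx′, which is consistent with Q(x)x = 0 and χ²(x) = −Q(x), and takes χ(x) to be the usual cross-product matrix.

```python
import math

def chi(x):
    """Cross-product matrix: chi(x) v = x cross v."""
    x1, x2, x3 = x
    return [[0.0, -x3, x2], [x3, 0.0, -x1], [-x2, x1, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

x = (1.0, 2.0, -2.0)                       # any nonzero vector; |x| = 3
nrm = math.sqrt(sum(v * v for v in x))
chat = [[v / nrm for v in row] for row in chi(x)]
# Assumed Q(x) = |x|^2 I3 - x x', so Qhat = I3 - xhat xhat':
qhat = [[(1.0 if i == j else 0.0) - x[i] * x[j] / nrm ** 2 for j in range(3)]
        for i in range(3)]

def close(a, b):
    return all(abs(a[i][j] - b[i][j]) < 1e-12 for i in range(3) for j in range(3))

assert close(matmul(qhat, qhat), qhat)                 # Qhat^2 = Qhat
assert close(matmul(qhat, chat), chat)                 # Qhat chat = chat
assert close(matmul(chat, qhat), chat)                 # chat Qhat = chat
neg_qhat = [[-v for v in row] for row in qhat]
assert close(matmul(chat, chat), neg_qhat)             # chat^2 = -Qhat
group_ok = True
```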
The proper orthogonal group consists of 3 × 3 orthogonal matrices whose
determinant is +1. Each group element R ∈ O3+ acts as a linear transformation
so reflections in R3 are not preserved by the group action in the space of matrices.
The full orthogonal group admits two three-dimensional representations, one by its
action on vectors and one on matrices; both are O3 -irreducible, but they are non-
isomorphic as O3 -representations.
The group property (16.26) is satisfied not only by every scalar multiple of χ (x),
but also by the matrix products χ 2 (x), χ 3 (x), and so on for each positive integer.
Thus, every polynomial function of χ (x) and every analytic matrix function such
as exp(χ(x)) also satisfies (16.26). For even powers, χ^{2n}(x) = (−1)^n ‖x‖^{2n} Q̂(x)
is symmetric, and χ^{2n+1}(x) = (−1)^n ‖x‖^{2n+1} χ̂(x) is skew-symmetric. Thus, if
x → P(x) is a polynomial or analytic function such that P(0) = 0, the matrix
polynomial P(χ(x)) is a linear combination of the basis elements Q̂(x) and χ̂(x)
with coefficients depending on ‖x‖. Every even function is symmetric or real,
16.7 Hydrodynamic Processes 309
where Q̂(0) = χ̂(0) = 0. It follows that the matrix sine function satisfies
for R ∈ O3+ .
No attempt is made here to give a proof of (16.26). However, it is easy to
check numerically by testing random values of R and x. It can also be checked
algebraically by finding a suitable parameterization for general 3D-rotations and
verifying that the two sides of (16.26) are identical for all x.
⟨u, v⟩ = uv̄ = u′v − (u2v3 − u3v2)i − (u3v1 − u1v3)j − (u1v2 − u2v1)k;   (16.27)
see Exercises 16.24–16.25. Thus, the real part of uv̄ is the standard scalar product
u′v = v′u in R³, and the imaginary part is the negative vector product, or cross
product

u × v = (u2v3 − u3v2, u3v1 − u1v3, u1v2 − u2v1) = −v × u;   (16.28)
u × v = −ℑ(uv̄),
Q(X) = XX ⊗ I3 − χ (X × X)
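The scalar-product and cross-product identities (16.27)–(16.28) can be verified with a few lines of code; the Hamilton product below is standard, and the vectors u and v are arbitrary examples.

```python
def qmul(p, q):
    """Hamilton product of quaternions represented as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw*qw - px*qx - py*qy - pz*qz,
            pw*qx + px*qw + py*qz - pz*qy,
            pw*qy - px*qz + py*qw + pz*qx,
            pw*qz + px*qy - py*qx + pz*qw)

def conj(q):
    return (q[0], -q[1], -q[2], -q[3])

u3, v3 = (1.0, 2.0, 3.0), (-1.0, 0.5, 2.0)
u, v = (0.0,) + u3, (0.0,) + v3           # embed R^3 as pure quaternions

p = qmul(u, conj(v))                      # the product u vbar
dot = sum(a * b for a, b in zip(u3, v3))
cross = (u3[1]*v3[2] - u3[2]*v3[1],
         u3[2]*v3[0] - u3[0]*v3[2],
         u3[0]*v3[1] - u3[1]*v3[0])

assert abs(p[0] - dot) < 1e-12                                  # Re(u vbar) = u'v
assert all(abs(p[i+1] + cross[i]) < 1e-12 for i in range(3))    # Im(u vbar) = -u x v
```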
such that K(u, v) is the transpose of K(v, u). The process is stationary if K(u +
h, v + h) = K(u, v) for all u, v, h ∈ R3 , which means that K is a function of the
vector difference u − v.
The goal here is to exhibit a non-trivial hydrodynamically-symmetric process,
i.e., a process that is hydrodynamically symmetric, but neither isotropic nor
rotationally symmetric. One such covariance function is derived in Exercise 16.17.
For another example, consider the following matrix-valued covariance functions:
K0(u, v) = Mν(‖u − v‖) V
K1(u, v) = Mν(‖u − v‖) (I3 u′v − χ(u × v)),
Fig. 16.5 Fractional cloud cover on a 15 × 15 spatial grid in central Illinois in half-hour intervals
from 6.00am to 11.30am on June 9, 1998
17.0 × 22.2 km cells. Successive panels show the cloud cover at 30-minute intervals
from 6.00am to noon on June 9, 1998.
Solar irradiance is measured by a geostationary satellite, and fractional cloud
cover is the complement of the ratio of solar irradiance relative to the clear-sky
maximum at that time and location. The value lies between zero and one. On this
morning, the average fractional cloud cover was 35%. Cloud cover is the primary
variable that limits the production of solar energy. Its evolution throughout the day
is of commercial interest for short-term prediction of solar electrical generating
capacity, so that alternative sources may be brought online if needed.
For this illustration, only the first 2.5 hours of data from 6.00am to 8.30am are
used for parameter estimation and model fitting. This corresponds to the first six
panels in Fig. 16.5. The process appears to be relatively smooth in space and in
time, so we use the Matérn model (16.11) with ν = 1 for both space and time. Two
16.8 Summer Cloud Cover in Illinois 313
range parameters, ρ0 for time in minutes and ρ1 for distance in km, are also needed,
so t is replaced by t/ρ0 and x by x/ρ1 . The isotropic sub-model has a = 0. The
maximally anisotropic model takes b = 1 and advection a = (cos θ, sin θ ) as a unit
vector in the east and north directions respectively.
For both the isotropic and the anisotropic covariance models, the mean fractional
cloud cover is taken to vary linearly in both space and time. This may be adequate
for short-term prediction, but it is not recommended for long-term prediction or
extrapolation beyond the spatial domain. As always, a nugget term is included
in the covariance model. The fitted range parameters for the isotropic model are
32.5 minutes and 27.0 km, while the variance components are 0.0219 for the identity
matrix and 0.0283 for the Matérn product covariance. For the anisotropic model
(16.11) using the unit vector a = (cos θ, sin θ ) with θ̂ = 3.01, the fitted range
parameters are 32.0 minutes and 26.0 km, while the variance components are 0.0215
and 0.0276.
The REML log likelihood for the fitted anisotropic model is 11.23 units higher
than that for the isotropic model. Since the anisotropic model has two additional
parameters, both ‖a‖ and θ, the likelihood-ratio test statistic is nominally on two
degrees of freedom. In fact, ‖â‖ = 1 on the boundary, which slightly complicates
the null distribution theory. Nevertheless, the observed likelihood-ratio test statistic
of 22.46 leaves no doubt about the existence of space-time anisotropy for the cloud-
cover process. Whether the formulation (16.11) captures adequately the full extent
of anisotropy is another matter. Most likely, the polar vector could not be expected
to remain constant from one day to the next.
Table 16.1 shows the fitted parameters for four spatial models, all including a
nugget effect. Each of the anisotropic models is a substantial improvement over the
isotropic Matérn product. The simplest travelling wave model (16.23) is the most
effective; the additional polar anisotropy in (16.10) does not substantially improve
the fit.
In both travelling-wave models, the estimated wave velocity is approximately
0.7 km/min, or 42 km/hr, or 26 mph from the east. However, that particular June
morning was calm and humid, with light and variable winds averaging four mph.
In such circumstances, a wave travelling at 42 km/hr in any direction might be
attributed to changes in temperature or pressure, but it cannot be attributed to
atmospheric advection.
The large value of the nugget variance implies that even the best predictor has
substantial variance. The nugget standard error is a lower bound for the root mean
square prediction error, and the fitted values are 0.148 for the isotropic model, and
0.147 for the anisotropic model (16.11). The empirical one-step-ahead root mean
square prediction errors averaged over 225 sites for 9.00am are 0.140 for the isotropic
model and 0.131 for the anisotropic model (16.11).
An alert reader may have noticed that the definition of a Gaussian process in
Sect. 16.1, and the definitions of stationarity and isotropy in Sects. 16.2–16.3, are
not sufficiently broad to include the simplest non-trivial Gaussian processes on the
real line or on the plane. On an arbitrary domain with measure Λ, white noise is
a zero-mean Gaussian process indexed by subsets such that cov(W(A), W(B)) =
Λ(A ∩ B). The process takes independent values on disjoint sets, and variances are
determined by the intensity measure. If Λ is Lebesgue measure on the real line or
Rd , which is the standard choice for those domains, the process is both stationary
and isotropic. The notation here and subsequently in this section presumes that the
process is real-valued.
The earlier definition is inadequate because it assumes that the domain and the
index set are one and the same set. This is sufficient for processes defined pointwise,
but it is not sufficient to cover many of the generalized or intrinsic processes that
occur in applied work. For planar white noise, the domain is D = R2 or C, but the
index set U is the set of Borel subsets in the domain. More correctly, U is the proper
subset consisting of Borel sets of finite Λ-measure. The definition of stationarity
offered in Sect. 16.2.1 is not applicable to white noise because it presumes that
W (x) exists for x ∈ D. In the case of standard white noise W ({x}) exists and the
value is zero for all singletons.
Stationarity and isotropy refer to a group acting on the domain x → gx either by
translation or rotation. There is a natural induced action on the index set A → gA,
which is a rigid Euclidean motion of subsets. With an appropriate modification to
distinguish between the domain and the index set, white-noise with intensity Λ is
stationary or isotropic if the measure is invariant under this action.
16.9 More on Gaussian Processes 315
Every process Z that is defined pointwise and is continuous on the domain can
be extended by integration to an additive process W on domain subsets:

cov(W(A), W(B)) = ∫_{A×B} K(x, x′) dΛ(x) dΛ(x′).
White noise is not a continuous process and does not have a covariance density.
However, cov(W (A), W (B)) = Λ(A∩B) means that there is a covariance measure,
which is the Dirac-type singular measure Λ(·) concentrated on the diagonal in D2 .
The extension to subsets is a half-way house that suffices for a few purposes, but
it is not adequate for mathematical work, which requires all variances to be finite,
and it is not entirely adequate even for applied work. Consider, for example, the
additive planar process defined for regular planar subsets as follows:

cov(W(A), W(B)) =  Λ1(∂A ∩ ∂B)   if Int(A) ⊂ Int(B) or Int(B) ⊂ Int(A);
cov(W(A), W(B)) = −Λ1(∂A ∩ ∂B)   if Int(A) ∩ Int(B) = ∅.
Regular means that each planar subset A has a well-defined interior Int(A), and a
one-dimensional boundary ∂A of finite length Λ1 (∂A). Additivity implies that the
variance for more general regions is the boundary length, and the covariance for two
regions is the total signed length of the common boundary.
It is not clear from the preceding description how we are meant to deal with a
subset having an irregular boundary or an empty interior, so this specification is
not entirely satisfactory. It is also unclear whether a covariance measure exists. The
more satisfactory way to study such processes is to abandon subsets and to use a
suitable Hilbert space as the index set. Planar white noise is associated with the
space of square-integrable functions f : R2 → R; a subset is nothing more than its
indicator function. The process described above is associated with functions such
that the norm of the derivative vector is square-integrable: ∫ ‖∇f(x)‖² dΛ(x) < ∞.
This section deals with questions of two types, the first related to limits of Gaussian
processes, the second related to prediction and limits of conditional distributions.
Two families of processes are used to illustrate the development. Both are
indexed by a single parameter θ > 0, and the limit refers to θ → ∞. The first
is an exchangeable Gaussian process in which the covariance function is
cov(Zi, Zj) = δij + θ, which means that the covariance matrix of Z[n] is In + θJn. The second is a Matérn-
1/2 process defined pointwise on Rd with covariance function

cov(Z(x), Z(x′)) = θ e^{−‖x−x′‖/θ}.   (16.30)
In the first example Z1 ∼ N(0, 1 + θ), and in the second Z(x) ∼ N(0, θ). Neither
sequence of distributions has a limit as θ → ∞, so the answer to the first question
is negative. On the other hand, Zi − Zj ∼ N(0, 2) for i ≠ j, while Z(x) − Z(x′)
has variance

2θ − 2θ e^{−‖x−x′‖/θ} = 2‖x − x′‖ + O(θ⁻¹).
Both limits exist and the distributions are Gaussian. More generally, for n ≥ 1
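A quick numerical confirmation of the expansion 2θ − 2θe^(−‖x−x′‖/θ) = 2‖x − x′‖ + O(θ⁻¹); the lag r and the θ values are arbitrary illustrations.

```python
import math

def var_diff(r, theta):
    """Var(Z(x) - Z(x')) for the Matérn-1/2 covariance theta*exp(-r/theta)."""
    return 2 * theta - 2 * theta * math.exp(-r / theta)

r = 1.7
errs = [abs(var_diff(r, th) - 2 * r) for th in (10.0, 100.0, 1000.0)]
# The error is O(1/theta): each tenfold increase in theta shrinks it
# about tenfold (the leading error term is r^2/theta).
assert errs[0] > errs[1] > errs[2]
assert errs[2] < 0.005
```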
and any coefficient vector α = (α1 , . . . , αn ) whose components add to zero, the
The Hilbert space Hn∗ has dimension n − 1, but it is exhibited here as the subspace
1n⊥ of contrasts in a vector space of dimension n. Consequently, the inner-product
matrix is of order n, and is not unique. For the first example, we could use either the
identity matrix of order n or In − Jn/n.
Since every n-contrast is also an (n + 1)-contrast whose last component is zero, the
Hilbert space Hn∗ of n-contrasts is a subspace of H∗n+1. Kolmogorov consistency is
automatic, but is equivalent to the statement that the restriction or insertion
Hn∗ ↪ H∗n+1 is an isometry. In effect, H∗∞ includes all of the finite-dimensional spaces as
In all examples of the type under consideration, the covariance matrix of Z[n] is
θJn + Σ + O(θ⁻¹), where Σ is independent of θ and is also positive definite on contrasts. For the Matérn
example, Σ is the matrix with components −‖xi − xj‖.
has a Gaussian limit distribution with covariance matrix Jn + QQ′. For each n ≥ 1,
the transformation Z[n] → W is invertible, so the answer to part 2 is affirmative.
Note that W ∈ Rn is not the restriction of the corresponding transformation in
Rn+1 , so there is no W -process associated with these transformations. The existence
of a limit distribution for every n does not imply the existence of a limit process.
E(Zn+1 | Z[n]) = nθ Z̄n / (1 + nθ),   var(Zn+1 | Z[n]) = (1 + (n + 1)θ) / (1 + nθ),
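The conditional moments above follow from the joint normal distribution with covariance I_{n+1} + θJ_{n+1}. The sketch below recovers them by direct linear algebra for illustrative values of n and θ.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [m[r][k] - f * m[c][k] for k in range(n + 1)]
    return [m[i][n] for i in range(n)]

n, theta = 5, 2.5
# Covariance of Z[n] is I_n + theta*J_n; cov(Z_{n+1}, Z_i) = theta.
S11 = [[(1.0 if i == j else 0.0) + theta for j in range(n)] for i in range(n)]
c = [theta] * n

random.seed(1)
z = [random.gauss(0, 1) for _ in range(n)]
w = solve(S11, c)                               # regression weights
cond_mean = sum(wi * zi for wi, zi in zip(w, z))
cond_var = (1 + theta) - sum(wi * ci for wi, ci in zip(w, c))

zbar = sum(z) / n
assert abs(cond_mean - n * theta * zbar / (1 + n * theta)) < 1e-10
assert abs(cond_var - (1 + (n + 1) * theta) / (1 + n * theta)) < 1e-10
```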
For each n, it follows that θ −1/2 Z̄n has a standard normal limit as θ → ∞, and is
asymptotically independent of every contrast Zi − Z̄n , not only for i ≤ n, but also
for i = n + 1. Thus, the limiting conditional mean satisfies
E(Zn+1 − Z̄n | Z[n]) = Σ_{r=1}^n βr Zr,   (16.32)
E(Zn+1 | Z[n]) = Z̄n + Σ_{r=1}^n βr Zr,   (16.33)
The situation regarding conditional distributions for the limit process is different
in a fundamental but subtle way. The joint distribution for any set of contrasts
is determined by the Hilbert-space inner product. In particular, the conditional
distribution of the contrast Zn+1 − Z̄n given the σ -field generated by Z1 , . . . , Zn
is Gaussian with mean (16.32) and variance as described above. However, the limit
process is defined on contrasts only, so the σ -field generated by Z1 , . . . , Zn is the σ -
field generated by contrasts, which means that the coefficient vector β in (16.32) is
a contrast in Hn∗ . The limit process does not admit either Zn+1 or Z̄n as a Gaussian
variable, so the crucial statement that Zn+1 − Z̄n is independent of Z̄n is either
meaningless or mathematically trivial. In either case, the fiducial leap from (16.32)
to (16.33) requires a σ -field extension, which cannot follow from the limit process
alone.
The limit process with probabilities defined on the σ -field generated by contrasts has
a certain mathematical elegance—brutal and minimalist. But the σ -field restriction
is a price too steep for any applied statistician interested in probabilistic prediction.
Is there a way out, a way that retains the elegance of contrasts at a more affordable
price? The answer, we hope, is yes.
A Markov kernel is a function that associates with each μ ∈ R a Gaussian process
Z such that, for each contrast α ∈ 1n⊥, the increment or linear functional Σr αr Zr
has the same distribution as that in the limit process. There is no σ-field restriction.
16.10 Exercises
has determinant 1 − 3|ρ|² + 2ℜ(ρ³). Hence or otherwise, deduce that the matrix is
positive definite if and only if ρ lies in the triangle with vertices at the cube roots of
unity {1, e^{2πi/3}, e^{−2πi/3}}.
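Assuming the matrix in question is the 3 × 3 Hermitian circulant with unit diagonal and off-diagonal entry ρ, an assumption consistent with the stated determinant, the claim can be checked numerically:

```python
import cmath
import math

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0]*(m[1][1]*m[2][2] - m[1][2]*m[2][1])
          - m[0][1]*(m[1][0]*m[2][2] - m[1][2]*m[2][0])
          + m[0][2]*(m[1][0]*m[2][1] - m[1][1]*m[2][0]))

def K(rho):
    """Hermitian circulant with unit diagonal and off-diagonal rho (assumed form)."""
    rb = rho.conjugate()
    return [[1, rho, rb], [rb, 1, rho], [rho, rb, 1]]

# The determinant agrees with 1 - 3|rho|^2 + 2 Re(rho^3):
for rho in (0.3 + 0.1j, -0.2 + 0.25j):
    formula = 1 - 3 * abs(rho) ** 2 + 2 * (rho ** 3).real
    assert abs(det3(K(rho)) - formula) < 1e-12

# It vanishes at the cube roots of unity, the vertices of the triangle:
for k in range(3):
    rho = cmath.exp(2j * math.pi * k / 3)
    assert abs(det3(K(rho))) < 1e-12
```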
16.3 The Matérn spectral measure on the real line is proportional to the symmetric
type IV distribution in the Pearson class, which is also equivalent to the Student t
family (Pearson type VII). For ν > −1/2, show that the standardized version
M1(dω) = Γ(ν + 1/2) dω / (π^{1/2} (1 + ω²)^{ν+1/2})
has positive density, but the total mass is finite only for ν > 0.
∫_{Rd} dω / (1 + ‖ω‖²)^{ν+d/2} = A_{d−1} ∫₀^∞ x^{d−1} dx / (1 + x²)^{ν+d/2},

where A_{d−1} = 2π^{d/2}/Γ(d/2) is the surface area of the unit sphere in Rd. For
ν > −d/2, deduce that the Matérn measure on Rd
Md(dω) = Γ(ν + d/2) dω / (π^{d/2} (1 + ‖ω‖²)^{ν+d/2})
has finite mass if and only if ν > 0. Show that the total mass is a constant
independent of the dimension of the space.
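A numerical check of the dimension-independence of the total mass, using the radial decomposition from Exercise 16.4; the quadrature cutoff and step size are arbitrary choices, and the expected mass is Γ(ν).

```python
import math

def matern_mass(nu, d, cut=200.0, steps=200000):
    """Total mass of the Matérn measure M_d by midpoint-rule radial quadrature."""
    area = 2 * math.pi ** (d / 2) / math.gamma(d / 2)   # surface area A_{d-1}
    h = cut / steps
    s = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        s += x ** (d - 1) / (1 + x * x) ** (nu + d / 2)
    return math.gamma(nu + d / 2) / math.pi ** (d / 2) * area * s * h

nu = 1.0
for d in (1, 2, 3):
    # the total mass should equal Gamma(nu), whatever the dimension
    assert abs(matern_mass(nu, d) - math.gamma(nu)) < 1e-2
```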
16.5 For fixed ν > 0, show that the Matérn measures are mutually consistent in the
sense that Md+1 (A × R) = Md (A) for all d ≥ 0 and subsets A ⊂ Rd . In other
words, show that Md is the marginal distribution of Md+1 after integrating out the
last component. For ν > −1, show that the Matérn measures are mutually consistent
in the sense that Md+1 (A × R) = Md (A) for all d ≥ 2.
16.6 Consistency and finiteness together imply that the normalized Matérn measures
define a real-valued process X1, X2, . . . in which Mn/Γ(ν) is the joint
distribution of the finite sequence X[n] = (X1, . . . , Xn). This process—a special
case of the Gosset process—is not only exchangeable but also orthogonally invariant
for every n. Show that the conditional distribution of Xn+1 given X[n] is Student t,
with a certain location parameter, scale parameter and degrees of freedom. To what
extent is finiteness needed in the construction of the process?
16.7 For the Matérn process, show that the sequence of partial averages X̄n has
a limit X̄∞ = limn→∞ X̄n . For n ≥ 2, what can you say about the conditional
distribution of X̄∞ given X[n]? Consider separately the cases ν = 0 and ν > 0.
∫_{Rd} cos(ω′x) dω / (1 + ‖ω‖²)^{ν+d/2}
  = I(ν + 1/2, d − 1) ∫_R cos(w‖x‖) dw / (1 + w²)^{ν+1/2}
  = π^{d/2} / (2^{ν−1} Γ(ν + d/2)) × ‖x‖^ν Kν(‖x‖),

∫_{Rd} ω′v sin(ω′x) dω / (1 + ‖ω‖²)^{ν+d/2+1}
  = π^{d/2} / (2^{ν−1} Γ(ν + d/2)) × (v′x/(2ν + 1)) × ‖x‖^ν Kν(‖x‖),
16.11 Let Z be a real Gaussian space-time process with zero mean and full-rank
separable covariance function:

cov(Z(x, t), Z(x′, t′)) = K(x, x′) V(t, t′).
Show also that if the prediction site belongs to x, say x0 = x1 , the conditional
expectation reduces to the linear combination
E(Z(x0, t0) | Z[x × t]) = Σ_{r,s} V(t0, tr) (V[t]^{−1})_{rs} Z(x0, ts),
depending only on the values at (x0 , t). In other words, if the model is separable,
the spatial covariance is irrelevant for temporal prediction.
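The irrelevance of the spatial covariance can be seen directly in the prediction weights: under a separable model they factor as a Kronecker product, so the weights attached to sites other than x0 vanish. A small numerical sketch; the matrices K and V and the cross-covariance vector V0 below are arbitrary illustrative values, not fitted quantities.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [m[r][k] - f * m[c][k] for k in range(n + 1)]
    return [m[i][n] for i in range(n)]

# Arbitrary positive-definite spatial and temporal covariance matrices;
# the prediction site x0 is taken to be the first spatial site.
K = [[2.0, 0.8, 0.3], [0.8, 2.0, 0.8], [0.3, 0.8, 2.0]]
V = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]]
V0 = [0.9, 0.6, 0.3]          # V(t0, t_s): temporal cross-covariances

n = 3
# Separable covariance: Sigma[(r,s),(r',s')] = K[r][r'] * V[s][s']
Sigma = [[K[r][rp] * V[s][sp] for rp in range(n) for sp in range(n)]
         for r in range(n) for s in range(n)]
# Cross-covariance of Z(x0, t0) with the data, with x0 = first site:
c = [K[0][r] * V0[s] for r in range(n) for s in range(n)]
w = solve(Sigma, c)           # prediction weights

wt = solve(V, V0)             # purely temporal weights V0' V^{-1}
for r in range(n):
    for s in range(n):
        expected = wt[s] if r == 0 else 0.0
        assert abs(w[3 * r + s] - expected) < 1e-9
```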
16.12 This exercise is concerned with stereographic projection from the unit sphere
in Rd+1 onto the equatorial plane Rd . Latitude on the sphere is measured by the
polar angle θ , starting from zero at the north pole, through θ = π/2 at the equator
up to θ = π at the south pole. Every point on the sphere is a pair z = (e sin θ, cos θ )
where e is a unit equatorial vector. The stereographic image of z is the point
so that the southern hemisphere is projected into the unit ball, and the northern
hemisphere to its complement in Rd . Deduce that the stereographic image of the
uniform spherical distribution is
Γ(d) dω / (π^{d/2} Γ(d/2) (1 + ‖ω‖²)^d).
16.13 Points near the north pole are transformed stereographically to high frequen-
cies, and points near the south pole to low frequencies. For ν > d/2, the weighted
distribution with density proportional to
| sin(θ/2)|2ν−d
reduces the mass on northern latitudes and increases that on southern latitudes,
maintaining radial symmetry. Show that the stereographic image of the weighted
distribution is inversely proportional to (1 + ‖ω‖²)^{ν+d/2}. Find the normalizing
constants for both distributions.
is transformed to

Γ(d) dω / (π^{d/2} Γ(d/2) (1 + |ω|²)^d) × (1 + 2ℜ(a ω̄^k)/(1 + |ω|²)^k).
16.15 Consider a fixed tessellation of the plane into a countable set of polygonal
cells A1, . . ., and let 0 ≤ ℓij < ∞ be the length of the common boundary ∂Ai ∩ ∂Aj.
Associate with each ordered pair of regions (i, j) a Gaussian random variable εij of variance ℓij,
with independent and identically distributed signs independent of |ε|. If all boundary
lengths ℓi· are finite, the row sums W(Ai) = εi· define a Gaussian process indexed
by cells. Find the covariances cov(W(Ai), W(Aj)) for i = j and i ≠ j.
16.16 In the setting of the previous exercise, let W = Lε, where L is a Boolean
matrix. Show that W is a process defined on general planar regions and that it
coincides with the process described at the end of Sect. 16.9.1.
Show also that K(Rx, Rx ) = RK(x, x )R for R ∈ Od , and hence that the gradient
process is hydrodynamically symmetric.
Show that the covariance matrix is positive definite if and only if ‖ρ‖ ≤ 1. For any
real 3 × 3 orthogonal matrix L with det(L) = 1, show that LZ belongs to the same
family with parameter Lρ, i.e., that χ(Lρ) = L χ(ρ)L′.
16.21 For each ρ ∈ R³ such that ‖ρ‖ ≤ 1, deduce that the following symmetric
functions are positive definite on R × [3]:

Mν(|t − t′|) δrs;
Mν(|t − t′|) cos(ω(t − t′)) δrs;
Mν(|t − t′|) cos(ω(t − t′)) δrs − Mν(|t − t′|) sin(ω(t − t′)) χ(ρ)rs.
16.22 For each ν > 0 and ω ∈ R³, the Matérn function Mν(‖x − x′‖) e^{iω′(x−x′)}
defines a stationary complex Gaussian process on R³ with frequency ‖ω‖ and wave
direction ω/‖ω‖. For each ρ ∈ R³ such that ‖ρ‖ ≤ 1, deduce that the following
functions are positive definite symmetric on R³ × [3]:

Mν(‖x − x′‖) δrs;
Mν(‖x − x′‖) cos(ω′(x − x′)) δrs;
Mν(‖x − x′‖) cos(ω′(x − x′)) δrs − Mν(‖x − x′‖) sin(ω′(x − x′)) χ(ρ)rs;
16.23 For each (ω, ρ), deduce that the matrix-valued function
16.25 The parameters ω, ρ of the Gaussian process are two points in R3 , which
determine the frequency and direction of spatial anisotropies in the given frame of
reference. In the rotated frame of reference, the values are Rω, Rρ. Hence justify the
claim that the matrix-valued covariance in Exercise 16.20 is the covariance function
of a hydrodynamic process at a fixed time.
is equal to O3+ . Find the unit vector x such that R(x ) = R (x).
i² = j² = k² = ijk = −1.
16.30 Show that |pq| = |p| × |q|, i.e., that the modulus of a product is the product
of the moduli.
16.31 A quaternion of modulus one is called a unit quaternion. Show that the set of
unit quaternions is a group containing the finite sub-group {±1, ±i, ±j, ±k}. What
is the group inverse of q?
q3 −q2 q1 q0
is a linear representation satisfying χ4(pq) = χ4(p) χ4(q) and χ4(p̄) = χ4(p)′.
16.34 Let p be an arbitrary quaternion and let q be a unit quaternion. Show that the
real part of the product qpq̄ is equal to ℜ(p). Show also that the imaginary part of
the product qpq̄ is a 3D rotation of ℑ(p).
Chapter 17
Likelihood
17.1 Introduction
Pn+m,θ̂ (A | Y [x])
for events A ⊂ Rn+m . In either case, the inferential goal requires not only a point
estimate of the parameter, but also some measure of its uncertainty and the effect of
uncertainty on inferences.
The estimation step is sometimes called ‘learning’ in computer-science circles.
But the parameter value is never learned with the certainty that that word implies;
it is only estimated with error, which might be small or large. A large error
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 327
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_17
θ ≠ θ′ ⇒ Pθ ≠ Pθ′.
A Bayesian model has all of the ingredients listed in the preceding section—plus
one other. The additional feature is a probability distribution π(·) on the parameter
space. The non-Bayesian model is generally portrayed as a stochastic formulation
whose appropriateness in a given application is widely agreed, whereas no broad
consensus is expected regarding the choice of π(·). One is said to be objective and
the other subjective. These adjectives are not only provocative and unhelpful, but
also devoid of mathematical content.
The net effect of the prior is that a Bayesian model is either (i) a single
distribution π(dθ) Pθ,n(·) on the product space Θ × Rⁿ; or (ii) a single mixture
distribution on Rⁿ. The second formulation is a
simplification because the estimation step is by-passed, the model comprises a single
process, and the ambiguity about the choice of process for prediction is eliminated.
Even for problems of parametric inference, it is usually possible in principle to
by-pass the parameter space entirely by re-phrasing the target as a tail event
associated with a limit statistic, for example by computing Pπ (Ȳ∞ ∈ A | Y [x])
or Pπ (θ̂∞ ∈ A | Y [x]).
In practice, two difficulties must be overcome before we can confidently take
advantage of the Bayesian solution. The first is to select a suitable prior distribution
and, more importantly, to convince the reader that this prior is appropriate for
the problem. The Bayes resolution calls for a single prior distribution selected to
represent the information available a priori. In practice, it is often better to depart
from the paradigm by considering a sequence of distributions πν for ν > 0 such that
the available information corresponds either to the limit ν → 0 or to the asymptote
in which ν is small but strictly positive.
Absence of information may be represented by a sequence of distributions such
that πν (A) → 0 at rate ρν > 0 on bounded subsets in such a way that ρν−1 πν (dx)
has a finite non-zero limit. The limit is a measure, sometimes termed ‘improper’
because it is not a probability distribution. However, for sufficiently large samples,
the conditional distribution given the data may have a limit that is satisfactory for
inference. Usually it depends on the limit measure, but is otherwise independent of
the sequence. At the other end of the spectrum, strong information such as sparsity
corresponds to a sequence that tends to the Dirac measure in a suitably regular way
that permits limits for a certain class of integrals (McCullagh and Polson 2018).
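A minimal Gaussian illustration of both regimes (our notation, not an example from the text): the family $\pi_\nu = N(0,\nu^{-1})$ satisfies

```latex
\pi_\nu(dx) = (\nu/2\pi)^{1/2} e^{-\nu x^2/2}\,dx, \qquad
\rho_\nu = (\nu/2\pi)^{1/2}, \qquad
\rho_\nu^{-1}\pi_\nu(dx) = e^{-\nu x^2/2}\,dx \;\longrightarrow\; dx \quad (\nu\to 0),
```

so the weak-information limit $\nu\to 0$ recovers Lebesgue measure, an improper prior, while the strong-information limit $\nu\to\infty$ tends to the Dirac measure at zero in a suitably regular way.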
These limit recipes are reasonably satisfactory for stylized problems in low-
dimensional parameter spaces. For high-dimensional spaces, assumptions of inde-
pendence for selected components are not to be taken lightly because their effect on
conclusions may be substantial.
The second problem, perhaps less of an obstacle today than in the recent past,
is to manage the computations. Posterior distributions can often be approximated
by simulation in various ways, for example, using Markov-chain Monte Carlo.
Bayesian computation is not a focus of this book, so we do not make sweeping
recommendations regarding the choice of prior or how to manage the computation.
17.2.1 Definition
In the simplest setting where the response distribution has a density $P_\theta(dy)=p_\theta(y)\,dy$, the density $p_\theta(y)$ as a function of $\theta$ for fixed $y$ is the likelihood function. The function $p_\theta(y)$ is the density of the probability relative to Lebesgue measure. For present purposes, there is nothing special about Lebesgue measure, so the likelihood function is the density ratio relative to an arbitrary fixed density.
The likelihood is a function L(θ ; y) of the parameter θ and the data y, and the same
applies to the log likelihood l(θ ; y) = log L(θ ; y). Since the likelihood is defined
up to an arbitrary multiplicative factor that is constant in θ , the log likelihood is
defined up to an arbitrary additive term that is constant in θ .
The Bayesian goal is to compute the conditional probability of some specified
inferential event given the data, and in that calculation y is regarded as a fixed
constant. However, frequentist properties of estimators are connected with the
statistical behaviour of log likelihood derivatives and related procedures for fixed θ
as a function of the random variable whose distribution is Pθ . The Bartlett identities
are fundamental for deriving large-sample asymptotic distributions in regular
problems. To keep notation digestible, we pretend that θ is a scalar, so that each
log likelihood derivative is also a scalar. Results for vector-valued parameters are
obtained by replacing scalars with vectors or matrices as appropriate.
The first two log likelihood derivatives are
\[
U_1(\theta; y) = \frac{\partial l(\theta; y)}{\partial\theta}, \qquad
U_2(\theta; y) = \frac{\partial^2 l(\theta; y)}{\partial\theta^2}.
\]
The Bartlett identities are connected with the moments of these and higher-order derivatives. The first identity follows from the constancy of the integral $\int p_\theta(y)\,dy=1$ as a function of $\theta$:
\[
0 = \frac{\partial}{\partial\theta}\int p_\theta(y)\,dy
  = \int \frac{\partial p_\theta(y)}{\partial\theta}\,dy
  = \int \frac{\partial \log p_\theta(y)}{\partial\theta}\,p_\theta(y)\,dy
  = E\bigl(U_1(\theta; Y); \theta\bigr).
\]
The first step in this derivation is to switch the order of differentiation with respect
to θ and integration over the observation space. This step requires a regularity
condition, which fails if the support of Pθ is parameter-dependent. Regularity
conditions must be taught by faculty and learned by students, if only to demonstrate
mastery of Fubini’s theorem, but they almost never fail in practical work. In the last
expression θ occurs twice, the first to indicate the differentiation point, the second
to indicate that the parameter of the distribution Y ∼ Pθ is the same as the point at
which the derivative is computed. The random variable $U_1(\theta; Y)$ has zero mean under the distribution $Y\sim P_\theta$, but in general does not have zero mean under $Y\sim P_{\theta^*}$ for $\theta^*\neq\theta$.
The second identity, which follows from the second derivative of the probability integral, establishes the role of the Fisher information matrix:
\[
0 = \frac{\partial^2}{\partial\theta^2}\int p_\theta(y)\,dy
  = \int \frac{\partial}{\partial\theta}\Bigl(\frac{\partial\log p_\theta(y)}{\partial\theta}\,p_\theta(y)\Bigr)\,dy
  = \int \Bigl(\frac{\partial^2\log p_\theta(y)}{\partial\theta^2}
    + \Bigl(\frac{\partial\log p_\theta(y)}{\partial\theta}\Bigr)^2\Bigr)\,p_\theta(y)\,dy
  = E\bigl(U_2(\theta;Y);\theta\bigr) + E\bigl(U_1^2(\theta;Y);\theta\bigr).
\]
The final expression implies that the variance of the first derivative is the negative expected value of the second derivative, which is called the Fisher information matrix:
\[
I(\theta) = -E\bigl(U_2(\theta;Y);\theta\bigr) = \operatorname{cov}\bigl(U_1(\theta;Y);\theta\bigr).
\]
It follows that I (θ ) > 0, and, for vector parameters, that I (θ ) is positive definite.
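Both identities are easy to check numerically in a regular family. The sketch below (Python, rather than the R used elsewhere in the book) verifies them for the Poisson family with mean $\theta$, for which $U_1=y/\theta-1$, $U_2=-y/\theta^2$ and $I(\theta)=1/\theta$; the expectations are computed by truncated summation over the Poisson probabilities.

```python
import math

def pois_pmf(y, theta):
    # Poisson probability mass function
    return math.exp(-theta) * theta**y / math.factorial(y)

theta = 2.5
ys = range(100)          # truncation point; the omitted tail mass is negligible

# U1 = ∂ log p/∂θ = y/θ − 1 ;  U2 = ∂² log p/∂θ² = −y/θ²
EU1   = sum((y/theta - 1) * pois_pmf(y, theta) for y in ys)
EU2   = sum((-y/theta**2) * pois_pmf(y, theta) for y in ys)
EU1sq = sum((y/theta - 1)**2 * pois_pmf(y, theta) for y in ys)

print(EU1)               # first identity: ≈ 0
print(-EU2, EU1sq)       # second identity: both ≈ I(θ) = 1/θ = 0.4
```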
Regularity conditions for statistical work are of two types, those that can be checked
or verified and those that cannot. Fubini-type conditions permitting the interchange
of sample-space integration with parameter-space differentiation are verifiable.
Conditions regarding the smoothness of functions or topological adequacy of the
parameter space are also verifiable. Statistical models that occur in applied work are
usually field-tested and are seldom in violation of verifiable conditions.
Asymptotic conditions holding in the large-sample limit are a different matter.
Much of the theory of statistical estimation uses asymptotic theory as a device for
distributional approximation. For simple processes having independent and identi-
cally distributed components, the only route to infinity is ‘more independent copies
of the same’. For more general spatial or temporal processes, or processes involving
covariates, the routes to infinity are more numerous. By their nature, asymptotic
conditions are not verifiable in any finite sample because any finite design can be
embedded into a sequence of larger designs in countless ways. The question to be
asked is not whether the given design is part of a particular sequence but whether
one conceptual design sequence provides a better distributional approximation than
another.
The motivation for large-sample theory is most straightforward for independent
and identically distributed sequences. Such sequences seldom occur naked in
applied work, so the independent and identically distributed theory is not directly
relevant. However, the crucial parts of the theory carry over with relatively minor
modification to models having independent observations. Additional conditions are
needed for asymptotic regularity in specialized models for genetics, time series and
spatial processes. The count of individual numbers or observations or rows in a data
file may be impressive, but that does not necessarily translate into an impressive
quantity of information.
In a setting where the components of the response are independent, or condi-
tionally independent given treatment, the log likelihood is a sum of n independent
contributions, and the same applies to the log likelihood derivatives. In particular,
the total Fisher information $I_\bullet(\theta)=\sum_i I_i(\theta)$ is the sum of positive contributions coming from individual components. The first derivative at $\theta^*$ is the sum of independent random variables, $U_1,\ldots,U_n$, having zero mean and finite variances $I_i(\theta^*)<\infty$. Provided that $n$ is large and that no small subset of components dominates the contribution to the total Fisher information, the central limit theorem implies that the first derivative at $\theta^*$ is approximately normally distributed
\[
U_\bullet(\theta^*) \sim N\bigl(0,\; I_\bullet(\theta^*)\bigr)
\]
under the distribution $P_{\theta^*}$. Assuming that the maximum is a stationary point, Taylor approximation in a neighbourhood of $\theta^*$ gives
\[
0 = n^{-1/2}U_\bullet(\hat\theta)
  = n^{-1/2}U_\bullet(\theta^*) + n^{1/2}(\hat\theta-\theta^*)\,n^{-1}U_{2\bullet}(\theta^*) + O_p(n^{-1/2}),
\]
so that $\hat\theta - \theta^* = I_\bullet^{-1}(\theta^*)\,U_\bullet(\theta^*) + O_p(n^{-1})$. \qquad (17.1)
For models having independent and identically distributed components, the first
derivative is Op (n1/2 ), while second and higher-order derivatives are Op (n). As a
result, both terms shown are formally Op (1) while the error term is Op (n−1/2 ).
Under suitable asymptotic conditions, these asymptotic orders also hold more
broadly for generalized linear models and many models having temporal or spatial
correlation.
Using the one-step approximation (17.1) for the parameter estimate, the likelihood-ratio statistic satisfies
\[
2l(\hat\theta; y) - 2l(\theta; y)
  \simeq U_\bullet'(\theta)\,I_\bullet^{-1}(\theta)\,U_\bullet(\theta)
  \simeq U_\bullet'(\theta)\,\bigl(-U_{2\bullet}(\theta)\bigr)^{-1}U_\bullet(\theta)
  \simeq (\hat\theta-\theta)'\,I_\bullet(\theta)\,(\hat\theta-\theta).
\]
The first and second versions are positive definite quadratic forms in the vector of
first derivatives at the true or hypothesized parameter point; the third is a quadratic
form in the parameter space. The likelihood ratio statistic is invariant under smooth
reparameterization, and that property is inherited by the first quadratic form shown,
which is called the Rao statistic, or Fisher-Rao statistic. The third version, called the
Wald statistic, is not invariant under reparameterization. Invariance is desirable in
applied work, but perhaps not absolutely essential.
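A small numerical comparison (hypothetical Poisson counts, a Python sketch rather than the book's R): the three versions are computed directly and should be close, but not equal, in moderate samples.

```python
import math

y = [3, 5, 4, 6, 2, 7, 4, 5]        # hypothetical Poisson counts
n, tot = len(y), sum(y)
theta0 = 4.0                         # hypothesized value
theta_hat = tot / n                  # maximum-likelihood estimate of the mean

# log likelihood up to an additive constant: l(θ) = Σy·log θ − nθ
def loglik(theta):
    return tot * math.log(theta) - n * theta

lr   = 2 * (loglik(theta_hat) - loglik(theta0))   # likelihood-ratio statistic
U    = tot / theta0 - n                           # score at θ0
I0   = n / theta0                                 # expected information at θ0
rao  = U**2 / I0                                  # Rao (score) statistic
wald = (theta_hat - theta0)**2 * (n / theta_hat)  # quadratic form in parameter space

print(round(lr, 3), round(rao, 3), round(wald, 3))  # close but not identical
```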
The central limit approximation for the distribution of log likelihood derivatives
implies that all three versions of the likelihood-ratio statistic are first-order equiv-
alent, and that the limit distribution is $\chi^2_p$ in all cases. They are not second-order equivalent, either in power or in distribution. A more refined analysis taking account of higher-order terms shows that the expected value of the likelihood-ratio statistic is $p\,(1+b(\theta)/n)$, and that the asymptotic distribution is $(1+b(\theta)/n)\chi^2_p$ with error $O(n^{-2})$. Division by the Bartlett adjustment factor $1+b(\theta)/n$ greatly improves the accuracy of the $\chi^2_p$ approximation. This adjustment holds fairly widely for regular problems with continuous distributions.
Most parametric models that occur in applied work make a distinction between
parameters of interest and other parameters, loosely called nuisance parameters.
Despite the nomenclature, nuisance parameters are essential for satisfactory infer-
ences.
The parameter of interest is defined by a differentiable function $T\colon\Theta\to\mathcal T$ from the parameter space of dimension $p$ into a manifold of dimension $q\le p$. We suppose without loss of generality that this mapping is onto, i.e., $T(\Theta)=\mathcal T$. To each $\tau\in\mathcal T$ there corresponds a sub-manifold of dimension $p-q$
\[
\Theta_\tau = \{\theta : T(\theta)=\tau\} \subset \Theta.
\]
All points in $\Theta_\tau$ are similar in the sense that they have the same value of the parameter of interest; differences are associated with nuisance parameters. By construction, the sub-manifolds are disjoint and exhaustive in $\Theta$; they form a partition or a foliation of the parameter space.
The profile likelihood for $\tau$ is the maximum achieved on $\Theta_\tau$:
\[
L_{\mathrm p}(\tau; y) = \sup_{\theta\in\Theta_\tau} L(\theta; y).
\]
To first order in the sample size, the profile likelihood behaves like an ordinary
likelihood function. For example, the first derivative has mean of order O(n−1 ),
which is not zero but is small enough to permit the standard asymptotic argument
to proceed. Likewise, the expected value of the second derivative is not exactly
the variance of the first, but the difference is small enough that it does not
affect first-order asymptotic approximations under standard regularity conditions.
Consequently, the subset consisting of parameter values achieving near-maximum likelihood serves as an approximate likelihood-based confidence set, as in (17.3) below.
The standard Gaussian model for a completely randomized design has three parameters $\theta=(\mu_0,\mu_1,\sigma^2)$, two means and one variance $\sigma^2>0$, so the parameter space has dimension three. The log likelihood function is
\[
l(\theta; y) = -\frac{n}{2}\log\sigma^2
  - \frac{(n-2)s^2 + n_0(\bar y_0-\mu_0)^2 + n_1(\bar y_1-\mu_1)^2}{2\sigma^2} + \text{const},
\]
where $\bar y_0$, $\bar y_1$ are the group means based on $n_0$ and $n_1$ observations, $n=n_0+n_1$, and $(n-2)s^2$ is the pooled within-group sum of squares.
Usually the maximum for fixed $\tau$ must be computed numerically, but for the treatment difference $\tau=\mu_1-\mu_0$ it can be evaluated explicitly in this instance:
\[
\begin{aligned}
\hat\mu_0 &= (n_0\bar y_0 + n_1\bar y_1 - n_1\tau)/n;\\
\hat\mu_1 &= (n_0\bar y_0 + n_0\tau + n_1\bar y_1)/n = \hat\mu_0+\tau;\\
n\hat\sigma^2 &= (n-2)s^2 + n_0(\bar y_0-\hat\mu_0)^2 + n_1(\bar y_1-\hat\mu_1)^2;\\
l(\hat\theta_\tau; y) &= -\tfrac{n}{2}\log(\hat\sigma^2) + \text{const}\\
 &= -\tfrac{n}{2}\log\bigl((n-2)s^2 + n_0(\bar y_0-\hat\mu_0)^2 + n_1(\bar y_1-\hat\mu_1)^2\bigr) + \text{const}\\
 &= -\tfrac{n}{2}\log\bigl((n-2)s^2 + n_0 n_1(\bar y_0-\bar y_1+\tau)^2/n\bigr) + \text{const}.
\end{aligned}
\]
The partially maximized likelihood function is called the profile likelihood for the
parameter of interest. By construction, the overall maximum occurs at the ordinary
maximum of the likelihood, τ̂ = ȳ1 − ȳ0 in this example.
Asymptotically, the profile log likelihood has all of the essential properties of
a log likelihood function. For example, an approximate level-α likelihood-based
confidence interval can be obtained in the standard manner
\[
\{\tau : 2l(\hat\theta; y) - 2l(\hat\theta_\tau; y) \le \chi^2_{1,1-\alpha}\}. \qquad (17.3)
\]
In this example, the likelihood-ratio statistic is a monotone function of the square of the two-sample statistic
\[
t_\tau = \frac{\hat\tau-\tau}{s\sqrt{1/n_0+1/n_1}},
\]
which has the Student $t$ distribution on $n-2$ degrees of freedom. The exact coverage of the likelihood-based interval can be inferred from the fact that the likelihood-ratio statistic $n\log\bigl(1+t_\tau^2/(n-2)\bigr)$ is monotone in $t_\tau^2$.
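The algebra is easy to confirm numerically. This Python sketch, with hypothetical two-sample data, checks that the profile log likelihood is maximized at $\hat\tau=\bar y_1-\bar y_0$ and that the likelihood-ratio statistic equals $n\log(1+t_\tau^2/(n-2))$ exactly:

```python
import math

# hypothetical two-sample data (not from the text)
y0 = [4.1, 5.2, 3.8, 4.6]             # group 0
y1 = [6.0, 5.5, 6.4, 7.1]             # group 1
n0, n1 = len(y0), len(y1)
n = n0 + n1
ybar0, ybar1 = sum(y0)/n0, sum(y1)/n1
s2 = (sum((v - ybar0)**2 for v in y0)
      + sum((v - ybar1)**2 for v in y1)) / (n - 2)   # pooled variance

# profile log likelihood for τ = μ1 − μ0, up to an additive constant
def lp(tau):
    return -0.5*n*math.log((n-2)*s2 + n0*n1*(ybar0 - ybar1 + tau)**2/n)

tau_hat = ybar1 - ybar0               # claimed maximizer
assert all(lp(tau_hat) >= lp(tau_hat + 0.01*k) for k in range(-200, 201))

# likelihood-ratio statistic equals n·log(1 + t²/(n−2))
tau = 1.0                             # hypothesized value
t2 = (tau_hat - tau)**2 / (s2*(1/n0 + 1/n1))
lr = 2*(lp(tau_hat) - lp(tau))
print(abs(lr - n*math.log(1 + t2/(n-2))) < 1e-9)   # True
```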
Suppose that the response of unit $i$ to dose $x$ is a Bernoulli variable with parameter $\pi(x)$ satisfying the linear logistic model
\[
\operatorname{logit}\pi(x) = \theta_0 + \theta_1 x, \qquad (17.4)
\]
with independent responses for distinct units. The goal is to estimate the dose τ for
which π(τ ) = 0.9, the so-called lethal dose 90%. The LD90 is a non-linear function
of the parameters
\[
\operatorname{logit}(0.9) = \log(9) = \theta_0+\theta_1\tau; \qquad
\tau = \bigl(\log(9)-\theta_0\bigr)/\theta_1,
\]
so we take $T(\theta)=\bigl(\log(9)-\theta_0\bigr)/\theta_1$. To compute the profile likelihood for $\tau$, it is
necessary to fit the logistic model (17.4) by maximizing over the parameter subset
for arbitrary but fixed τ . This is not a linear logistic model in the strict technical
sense, but most computer packages have the option to cater for an offset, which is
the constant log(9) in this setting. The likelihood-based confidence region for τ is
the set of values for which the likelihood is sufficiently large in the sense of (17.3).
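As a quick sanity check on the parameter transformation (hypothetical values of $\theta$, not from the text): at $\tau=(\log 9-\theta_0)/\theta_1$ the logistic response probability is exactly 0.9.

```python
import math

theta0, theta1 = -2.0, 1.5                       # hypothetical logistic parameters
tau = (math.log(9) - theta0) / theta1            # LD90 as a function of θ
pi_tau = 1 / (1 + math.exp(-(theta0 + theta1 * tau)))
print(round(pi_tau, 6))                          # 0.9
```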
If we replace the linear logistic model (17.4) with a linear Gaussian model and
ask for the x-value that makes the mean response zero, the goal is the abscissa
or parameter ratio τ = −θ0 /θ1 . Fieller’s method is tailored for problems of this
sort. However, the likelihood-ratio statistic is a function of the standardized ratio on
which Fieller’s method is based, so the two approaches are essentially identical.
One point to note is that a likelihood-based confidence set is not necessarily
an interval. Equivariance under reparameterization makes this unavoidable. For
instance, if the likelihood-based confidence set for $\tau=-\theta_0/\theta_1$ is a bounded interval
containing zero, the confidence set for 1/τ is necessarily an ‘interval’ containing
±∞ but not zero. In both the linear logistic and Gaussian cases, the likelihood-based
confidence set is either an interval or the complement of an interval, or possibly the
whole space.
Let $P_0$ be a fixed probability distribution on the state space $\mathcal S$, and let $T$ be a real-valued statistic with moment generating function
\[
M_0(\theta) = \int_{\mathcal S} e^{\theta T(y)}\,P_0(dy),
\]
so $M_0(0)=1$. It is assumed that the generating function exists in the sense that the subset
\[
\Theta = \{\theta : M_0(\theta)<\infty\}
\]
has a non-empty interior that includes zero. This subset is a real interval, called the canonical parameter space. In that case $M_0$ has a Taylor series
\[
M_0(\theta) = \int \Bigl(1 + \theta T(y) + \cdots + \frac{\theta^r T^r(y)}{r!} + \cdots\Bigr)\,dP_0(y)
            = \sum_r \mu_r\theta^r/r!,
\]
in which $\mu_r$ is the $r$th moment of $T$. For each $\theta\in\Theta$, the exponentially tilted measure
\[
P_\theta(dy) = e^{\theta T(y)}\,P_0(dy)\big/M_0(\theta)
\]
is also a probability distribution on the state space. The set of distributions $\{P_\theta:\theta\in\Theta\}$ is called the exponential family associated with the pair $(P_0, T)$. For a random variable $Y\sim P_\theta$, the moment generating function of $T(Y)$ is
\[
M_\theta(t) = \int_{\mathcal S} e^{tT(y)}\,e^{\theta T(y)}\,P_0(dy)\big/M_0(\theta)
           = M_0(\theta+t)/M_0(\theta).
\]
The cumulant function $K_\theta(t)=\log M_\theta(t)=K_0(\theta+t)-K_0(\theta)$, where $K_0=\log M_0$, also has a Taylor expansion in which the $r$th cumulant is the $r$th derivative of $K_\theta(t)$ at $t=0$, which is the $r$th derivative of $K_0(\cdot)$ at $\theta$. In particular, the mean is $\mu=K_0'(\theta)$, and the variance is $K_0''(\theta)$.
For all generalized linear models, the state space is the real line or a proper
subset, and $T$ is the identity function. The simplest example is the unit exponential distribution $P_0(dy)=e^{-y}\,dy$ for $y>0$. In that case $M_0(\theta)=1/(1-\theta)$ for $\theta<1$, and $P_\theta$ is the exponential distribution with rate $1-\theta$ and mean $\mu=1/(1-\theta)>0$. The cumulant function is $K_0(\theta)=-\log(1-\theta)$. Standard examples of the same type
are listed in Table 2.1 of McCullagh and Nelder (1989). They include the normal,
binomial, Poisson, multinomial, hypergeometric and gamma families.
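For the unit exponential example, the mean and variance claims can be verified by numerical differentiation of $K_0(\theta)=-\log(1-\theta)$; the Python sketch below uses an arbitrary interior point $\theta=0.3$.

```python
import math

def K0(theta):
    # cumulant function of the unit exponential distribution
    return -math.log(1 - theta)

theta, h = 0.3, 1e-5
mean = (K0(theta + h) - K0(theta - h)) / (2*h)               # K0'(θ)
var  = (K0(theta + h) - 2*K0(theta) + K0(theta - h)) / h**2  # K0''(θ)

print(round(mean, 4))   # ≈ 1/(1−θ) = 1.4286
print(round(var, 3))    # ≈ 1/(1−θ)² = 2.041
```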
Examples in which the state space is not the real line include the Ewens
distribution on set partitions (Exercises 10.4, 11.6) and the von-Mises-Fisher family
on the circle or sphere (Exercise 14.8).
In every generalized linear model, the mean vector $\mu=E(Y)$ (or $\mu=E(T(Y))$) is a component-wise non-linear function of the linear predictor $X\beta$, depending on regression parameters $\beta\in\mathbf R^p$. In a regular model, the derivative matrix with components $D_{ir}=\partial\mu_i/\partial\beta_r$ has full rank $p\le n$ at every point. The gradient vector of the log likelihood with respect to $\beta$ is
\[
D'\Sigma^{-1}(Y-\mu),
\]
where $\Sigma$ is the response covariance matrix, and the one-step adjustment satisfies
\[
\hat\beta - \beta = (D'\Sigma^{-1}D)^{-1}D'\Sigma^{-1}(Y-\mu)
\]
in which $\mu$, $D$, $\Sigma$ are computed at the current value. Provided that all eigenvalues of $D'\Sigma^{-1}D$ are large, the asymptotic covariance of $\hat\beta$ is
\[
\operatorname{cov}(\hat\beta) \simeq (D'\Sigma^{-1}D)^{-1}.
\]
Let V0 , . . . , Vk be given n×n matrices that are linearly independent and symmetric.
Usually V0 = In and the remaining matrices are symmetric positive semi-definite. A
variance-components model is one in which the observation vector Y is a zero-mean
Gaussian variable with covariance matrix satisfying the linear model
\[
\Sigma = \gamma_0 V_0 + \cdots + \gamma_k V_k,
\]
with coefficients $\gamma_r$ to be estimated. The derivatives of the log likelihood have covariances
\[
I_{rs} = \operatorname{cov}\Bigl(\frac{\partial l}{\partial\gamma_r},\frac{\partial l}{\partial\gamma_s}\Bigr)
 = \tfrac12\operatorname{tr}\bigl(\Sigma^{-1}V_r\,\Sigma^{-1}V_s\bigr).
\]
Given an initial estimate $\gamma$, the one-step update $\hat\gamma-\gamma$ satisfies the linear equation $I(\hat\gamma-\gamma)=\partial l/\partial\gamma$.
This formula works reasonably well in the vicinity of the solution, but some tempering is often needed to keep the early steps from straying into territory where $\Sigma$ is not positive definite. In practice, some or all of the coefficients may also be subject to positivity conditions, so it is necessary to check for boundary components and to adjust computations for the remaining components.
17.5 Mixture Models
Let $\psi_0$ and $\psi_1$ be the density functions of two distributions on the real line. Both densities are assumed to be strictly positive, so the density ratio $\zeta(y)=\psi_1(y)/\psi_0(y)$ is finite, as is the inverse ratio. The mixture model refers to the family of distributions
\[
(1-\theta)\,\psi_0(y) + \theta\,\psi_1(y), \qquad 0\le\theta\le 1.
\]
The second derivative of the log likelihood is
\[
l''(\theta; y) = -\sum_i \frac{\bigl(\zeta(y_i)-1\bigr)^2}{\bigl(1-\theta+\theta\zeta(y_i)\bigr)^2} < 0.
\]
If all of the observation points satisfy $\psi_0(y_i)=\psi_1(y_i)$, then $\zeta(y_i)=1$ for each $i$, and the log likelihood is constant in $\theta$. Otherwise, the second derivative is everywhere strictly negative, implying concavity. Every stationary point is a global maximum, and there is at most one such point in $(0,1)$.
At the left end-point $l'(0; y)=\sum_i\zeta(y_i)-n$; if the derivative at zero is negative, i.e., if $\sum_i\zeta(y_i)\le n$, the maximum occurs at $\hat\theta=0$. At the right end-point $l'(1; y)=n-\sum_i 1/\zeta(y_i)$. If the derivative is positive, i.e., if $\sum_i 1/\zeta(y_i)\le n$, the maximum occurs at $\hat\theta=1$. The likelihood function has a maximum in the interior of the interval if and only if $\sum_i\zeta(y_i)>n$ and $\sum_i 1/\zeta(y_i)>n$. In that case, the maximum can be computed by a straightforward Newton-Raphson iteration.
For $0<\hat\theta<1$, the condition $l'(\hat\theta; y)=0$ implies
\[
\sum_i \frac{\zeta(y_i)}{1-\hat\theta+\hat\theta\zeta(y_i)}
  = \sum_i \frac{1}{1-\hat\theta+\hat\theta\zeta(y_i)} = n,
\]
so that, at the maximum,
\[
\hat\theta(y_i) = \frac{\hat\theta\,\zeta(y_i)}{1-\hat\theta+\hat\theta\zeta(y_i)}
  = \operatorname{pr}(i\to\text{class I}\mid Y).
\]
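The end-point conditions and the stationarity identity can be confirmed numerically. In the Python sketch below, the density ratios $\zeta(y_i)$ are hypothetical values chosen so that both interior-maximum conditions hold; at convergence both sums equal $n$, as derived above.

```python
# hypothetical density ratios ζ(y_i) = ψ1(y_i)/ψ0(y_i)
zeta = [5.0, 0.2, 3.0, 0.5, 2.0, 0.1, 8.0, 0.4]
n = len(zeta)
assert sum(zeta) > n and sum(1/z for z in zeta) > n   # interior maximum exists

def score(t):
    return sum((z - 1) / (1 - t + t*z) for z in zeta)       # l'(θ)

def hess(t):
    return -sum((z - 1)**2 / (1 - t + t*z)**2 for z in zeta)  # l''(θ) < 0

theta = 0.5
for _ in range(50):                   # Newton-Raphson; the log likelihood is concave
    theta -= score(theta) / hess(theta)
    theta = min(max(theta, 1e-9), 1 - 1e-9)   # keep iterates inside (0, 1)

s1 = sum(z / (1 - theta + theta*z) for z in zeta)
s2 = sum(1 / (1 - theta + theta*z) for z in zeta)
print(round(theta, 4), round(s1, 4), round(s2, 4))   # s1 ≈ s2 ≈ n
```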
For samples from the null component $\psi_0$, the average density ratio satisfies
\[
n^{-1}\sum_{i=1}^n \zeta(Y_i)
  = 1 - \frac{\text{const}}{\log\log n} + \frac{X_n}{\log n\,\log\log n},
\]
where $X_n$ is a random variable in the Landau class (stable with $\alpha=\beta=1$). The event $\sum\zeta(Y_i)>n$ is equivalent in the limit to $X_n>\text{const}\times\log n$. Since the Landau density has an inverse-square right tail, the probability is $O(1/\log n)$.
\[
1-\rho = \int e^{-x^2/2}\,p(x)\,dx.
\]
It is possible to make an elaborate argument for one over the other, but
such arguments are futile because the marginal distributions are indistinguishable to
first order.
Eddington's signal-estimation formula reduces to
\[
E(X_i\mid Y) = \frac{\rho\,\zeta'(y_i)}{1-\rho+\rho\zeta(y_i)} + o(\rho), \qquad
E(X_i^2\mid Y) = \frac{\rho\,\zeta''(y_i)}{1-\rho+\rho\zeta(y_i)} + o(\rho),
\]
while the conditional probability of a signal exceeding a threshold $\epsilon$ is
\[
P(|X_i|>\epsilon\mid Y) = \frac{\rho\,\zeta(y_i)}{1-\rho+\rho\zeta(y_i)} + o(1).
\]
For a given suitably low but strictly positive threshold, the exceedance probability
is approximately independent of the threshold (McCullagh and Polson, 2018), and
ζ (y) is interpretable as the posterior-to-prior odds ratio, also called the Bayes factor.
For example, if ρ = 0.05 and y = 3.5, the inverse-square zeta value is ζ(y) = 55.3
and the exceedance probability is 0.74. Provided that below-threshold signals are
counted as null or false, there is a close formal connection with the concept of false
discovery rate, and particularly with the local false discovery rate (Benjamini &
Hochberg, 1995; Efron, 2010).
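The numerical example in the text is easy to reproduce: with $\rho=0.05$ and $\zeta(y)=55.3$,

```python
rho, zeta_y = 0.05, 55.3                       # values from the text's example
prob = rho * zeta_y / (1 - rho + rho * zeta_y)  # exceedance probability
print(round(prob, 2))                          # 0.74, matching the text
```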
In the great majority of sparse signal identification and detection formulations, the second component of the mixture, $\psi(y)=\phi(y)\zeta(y)$, has heavy tails. The tails are governed by the signal distribution, which may be either Laplace-type $e^{-|y|}$ or Cauchy-like with regularly varying tails. In all such models, the asymptotic null distribution of $\hat\rho$ and of the likelihood-ratio statistic is degenerate at zero. However, a very small signal rate $\rho$, a little larger than $\log(n)/n$, is enough to change the calculus for signal detection, at least in the Cauchy case.
For parametric inference, the event $\theta\in A$ is identified with the set of sequences $y\in\mathbf R^\infty$ such that $\hat\theta(y)=\lim_{n\to\infty}\hat\theta_n(y[n])$ exists and belongs to $A$. In that way, the conditional probability of the event $\theta\in A$ given $Y[n]$ is computable as a tail-event probability. Any consistent estimator can be used in place of $\hat\theta_n$, so this description is not tied in any way to maximum likelihood.
The joint distribution factors into the product of one-step conditional distributions
\[
P_{n,\theta}(dy) = Q_{1,\theta}(dy_1)\,Q_{2,\theta}(dy_2; y[1])\cdots Q_{n,\theta}(dy_n; y[n-1]).
\]
Given an initial sequence $y[m]$, the distribution of successive values in the MLP is defined by the corresponding kernel product with $\theta$ replaced at each step by the current estimate; this makes sense only for $n$ sufficiently large that $\hat\theta_n=\hat\theta_n(y[n])$ exists.
17.7 Exercises
\[
\sum_{i=1}^n \frac{\psi_r(y_i)}{\hat m(y_i)} \le n,
\]
with equality for every r such that θ̂r > 0. Discuss the ‘almost-true’ claim that m̂
exists and is unique for every n ≥ 1 and every y ∈ Rn , even if the model is not
identifiable.
17.2 Let $\psi_0(y)=e^{-y^2/2}/\sqrt{2\pi}$ be the standard normal density. Assume that
17.3 If the claim made in the last paragraph of section 17.5.2 is to be believed,
the re-scaled limit distribution of X̄n does not have a mean. Discuss this apparent
contradiction.
17.5 Show that the random variables Xi = ψ1 (Yi )/ψ0 (Yi ) in the preceding
exercise have a density whose tail behaviour is $1/f(x)\sim x^2(\log x)^{3/2}$ as $x\to\infty$.
17.8 Sparse signal detection. Suppose that the observation Y = X + ε is the sum of
a signal X plus independent Gaussian noise ε ∼ N(0, 1). For any signal distribution
X ∼ Pν , the sparsity rate is defined by the integral
\[
\rho = \int \bigl(1-e^{-x^2/2}\bigr)\,P_\nu(dx).
\]
17.9 For the setting of the previous exercise, show that Y is distributed according
to the mixture with density
\[
m(y) = (1-\rho)\phi(y) + \rho\psi(y) + o(\rho) = \phi(y)\bigl(1-\rho+\rho\zeta(y)\bigr) + o(\rho),
\]
where $\psi(\cdot)$ is a probability density, $\zeta(y)=\psi(y)/\phi(y)$ is the density ratio, and $\zeta(0)=0$. Fill in the details needed to express $\zeta(\cdot)$ or $\psi(\cdot)$ as a function of the family $P_\nu$.
17.12 What does the preceding equation imply about the fraction of non-negligible
signals among sites in the sample such that |Yi | ≥ 3?
for large y. Discuss the implications for mean shrinkage and variance inflation.
\[
\hat F_1(\cdot) = k^{-k}\sum_{\tau\in M_k} \delta_{\lambda\tau y}(\cdot);
\]
Show that F̂0 and F̂1 are both exchangeable with the same marginal distribution,
and that F̂1 also has independent components. For λ = 1, these are called the
permutation estimator and the bootstrap estimator respectively.
18.1 Background
which is linear for fixed λ. On the other hand, a Gaussian graphical model is an
additive specification for the inverse covariance matrix. The simplest version is
\[
\Sigma^{-1} = \tau_0 I_n + \tau_1 G,
\]
where $G\subset[n]^2$ is a graph, identified here with its adjacency matrix, and the coefficients are subject to positive-definiteness conditions. Usually, this means $\tau_0>0$ and $\tau_1\le 0$.
Residual likelihood differs from ordinary likelihood in that it uses only the
residuals $R=LY$, where $L$ is any linear transformation such that
\[
\ker(L) = \mathcal X = \{X\beta : \beta\in\mathbf R^p\}.
\]
By focusing on the residuals, the regression parameters are eliminated from the
distribution
\[
\hat\Sigma = \sum_{r=1}^{k} \hat\sigma_r^2 V_r,
\]
be well-behaved in the sense that $n^{-1}I$ has a limit whose eigenvalues are strictly positive. For the residual likelihood (18.5), $Q = I - X(X'WX)^{-1}X'W$ is the orthogonal projection whose kernel is $\mathcal X$.
In a simple linear regression model with a single variance component, the covariance
matrix is Σ = σ 2 V , where V is known and strictly positive definite. It is convenient
in this setting to take $W=V^{-1}$ as the inner-product matrix, so that $P_X = X(X'WX)^{-1}X'W$ and $Q=I-P$ are complementary $W$-orthogonal projections. Then $WQ$ and $QV$ are both known and symmetric. The model for the residual $QY\sim N(0,\sigma^2 QV)$ has only a single parameter. For this full exponential-family setting, the quadratic form $\|Qy\|^2 = y'WQy$ is minimal sufficient, and the REML estimate is obtained by equating the observed value $\|Qy\|^2$ to its expected value:
\[
y'WQy = E(Y'WQY; \hat\sigma^2) = \hat\sigma^2\operatorname{tr}(VWQ) = \hat\sigma^2\operatorname{tr}(Q),
\]
giving $\hat\sigma^2 = y'WQy/(n-p)$.
Note that the REML estimate is strictly larger than the ordinary maximum-
likelihood estimator which is y W Qy/n. The lesson here is that REML, not ML,
is the norm for variance estimation.
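The distinction is visible even in ordinary least squares. A Python sketch with hypothetical data and $V=I_n$ (so $W=I$ and $y'WQy$ is the residual sum of squares):

```python
# hypothetical data for a straight-line fit; V = I, so W = I
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n, p = len(x), 2                      # p = 2 regression parameters

xbar, ybar = sum(x)/n, sum(y)/n
sxx = sum((xi - xbar)**2 for xi in x)
sxy = sum((xi - xbar)*(yi - ybar) for xi, yi in zip(x, y))
beta1 = sxy / sxx
beta0 = ybar - beta1 * xbar
rss = sum((yi - beta0 - beta1*xi)**2 for xi, yi in zip(x, y))  # y'Qy

sigma2_reml = rss / (n - p)           # REML divides by n − p
sigma2_ml   = rss / n                 # ordinary maximum likelihood divides by n
print(sigma2_reml > sigma2_ml)        # True: REML is strictly larger
```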
18.3.1 Projections
18.3.2 Determinants
Using the standard formula for the determinant of a partitioned matrix, we find that for REML applications where the kernel is specified by $K$, the determinantal term in the marginal likelihood is
\[
\frac{\det(T\Sigma T')}{\det(TT')} = \frac{\det\Sigma\,\det(K'\Sigma^{-1}K)}{\det(K'K)}.
\]
For any linear transformation $T$ having kernel $\mathcal K$, the transformed variable $Y\mapsto TY$ is called a residual modulo $\mathcal K$. All transformations having the given kernel determine the same likelihood function. The marginal log likelihood based on the linear transformation $TY\sim N(T\mu,\,T\Sigma T')$ is
\[
l = -\tfrac12 (y-\mu)'T'(T\Sigma T')^{-1}T(y-\mu) - \tfrac12\log\det(T\Sigma T') + \text{const}.
\]
In this setting, $l$ is a function on the parameter space, and the additive constant may be any function that is constant on the parameter space. It is convenient here to take a particular constant, namely $\tfrac12\log\det(TT')$ plus any function of $y$. This choice ensures that, for every invertible matrix $L$ of order $n-k$, the linear transformations $T$ and $LT$ produce identical versions of the log likelihood. With this choice, the marginal log likelihood based on the residuals modulo $\mathcal K$ is one half of
\[
-(y-\mu)'WQ(y-\mu) - \log\det\Sigma - \log\det(K'WK) + \log\det(K'K), \qquad (18.4)
\]
where $W=\Sigma^{-1}$, $Q$ is the orthogonal projection with kernel $\mathcal K$, and $K$ is any matrix whose columns span $\mathcal K$.
In applications where $\mathcal X$ is the model subspace, the most common choice is $\mathcal K=\mathcal X$, but expression (18.4) is valid for all subspaces, and $\mathcal K\subset\mathcal X$ arises in the computation of likelihood-ratio statistics. The ordinary log likelihood is obtained by setting $\mathcal K=0$. The standard REML likelihood has $\mathcal K=\mathcal X$, so that $\mu\in\mathcal X$ implies $Q\mu=0$, and twice the log likelihood reduces to
\[
-y'WQy - \log\det\Sigma - \log\det(X'WX) + \log\det(X'X). \qquad (18.5)
\]
Formulae (18.4) and (18.5) may be used directly in computer software. The constant term $\log\det(K'K)$ is included to ensure that the log likelihood depends on the kernel subspace, not on the particular choice of basis vectors.
For general-purpose computer software, these formulae are not recommended as stated, because the marginal likelihood requires only that $T\Sigma T'$ be positive definite, which is a weaker condition than positive-definiteness for $\Sigma$. Marginal likelihood modulo a suitable kernel may be used for fitting generalized Gaussian processes, sometimes called intrinsic processes, that are defined by a generalized covariance function $K$, which is not positive definite in the normal sense, but for which $TKT'$ is positive definite. For example, if $i\mapsto z_i$ is a quantitative covariate taking values in $\mathbf R^k$, the matrix $\Sigma_{ij}=-\|z_i-z_j\|$ is positive definite in the Euclidean space $\mathbf R^n/\mathbf 1$, which is the space of residuals modulo the one-dimensional subspace of constant functions. In other words, for general-purpose computer software, it is best to use a version of (18.5) that does not require $\Sigma$ to be positive definite or invertible.
\[
\frac{\sup_{\theta\in\Theta_1} P_\theta(E)}{\sup_{\theta\in\Theta_0} P_\theta(E)}
\]
in which the numerator and denominator are maximized over the respective parameter spaces. It is crucial that all probability measures be defined on the same $\sigma$-field and that the event in the numerator be the same as the event in the denominator; otherwise the ratio is not a fair comparison. In fact, $E$ is always an observation or singleton event, which is best regarded as an infinitesimal event, and commonly denoted by $E=dy$. Operationally speaking, $dy$ is the limiting $\epsilon$-ball $B(y,\epsilon)$ centered at the observation point $y\in\mathbf R^n$, and the likelihood ratio is the density ratio at $y$.
In the case of marginal likelihood, however, the event $E\subset\mathbf R^n$ is necessarily an event in the $\sigma$-field generated by the linear transformation $T$ into the Borel space $\mathbf R^{n-k}$. The induced $\sigma$-field in $\mathbf R^n$ is the class of residual events, which are the Borel subsets $E\subset\mathbf R^n$ such that $E+\mathcal K=E$. In other words, a residual is a point in the quotient space $\mathbf R^n/\mathcal K$, and each residual event $E$ is a union of translates of $\mathcal K$, i.e., a union of $\mathcal K$-cosets. The residual event, $E=B(y,\epsilon)+\mathcal K$, is the union of $\mathcal K$-cosets that intersect the ball. This is, of course, a Borel subset in the space of residuals modulo $\mathcal K$. A residual likelihood ratio statistic modulo $\mathcal K$ is thus a ratio of the form
\[
\frac{\sup_{\theta\in\Theta_1} P_\theta(dy+\mathcal K)}{\sup_{\theta\in\Theta_0} P_\theta(dy+\mathcal K)}
\]
in which the limiting event $B(y,\epsilon)+\mathcal K$ is the observed residual. A ratio such as
\[
\frac{\sup_{\theta\in\Theta_1} P_\theta(dy+\mathcal K_1)}{\sup_{\theta\in\Theta_0} P_\theta(dy+\mathcal K_0)}
\]
with $\mathcal K_0\neq\mathcal K_1$ compares the probabilities of two different events, and is not a fair comparison in the sense described above.
18.4 Computation
18.4.2 Likelihood-Ratios
Here, mf0 and mf1 denote the model formulae for X0 and X1 respectively. The
space of covariance matrices is fixed but arbitrary, and block+V is used solely for
illustration.
Welham and Thompson (1997) discuss two possibilities for a likelihood-ratio
statistic in this setting. The statistic denoted by A in their equation (5) is equivalent
to setting the kernel equal to the null subspace, i.e., K = X0 ⊂ X1 , which is the
computation illustrated above. Provided that $\mathcal K$ is fixed, $\mathcal X_0\subset\mathcal X_1$ and $\mu\in\mathcal X_0$, the log likelihood ratio is distributed approximately as $\chi^2$ on $\dim(\mathcal X_1+\mathcal K)-\dim(\mathcal X_0+\mathcal K)$ degrees of freedom, which simplifies for $\mathcal K\subseteq\mathcal X_0$ to $q=\dim(\mathcal X_1)-\dim(\mathcal X_0)$, independent of the kernel. The numerical value of the likelihood-ratio statistic for $\mathcal K\subseteq\mathcal X_0$ depends on the kernel, but the first-order asymptotic approximation to the null distribution is $\chi^2_q$, which is independent of $\mathcal K$. The choice $\mathcal K=\mathcal X_0$ is thought to be optimal in the sense of power, and in the sense of accuracy of the $\chi^2$ distributional approximation.
The option kernel=0 is permitted but not encouraged; it implies ordinary
maximum likelihood, and is equivalent to the REML=FALSE option in lmer().
The option kernel = X1 is also allowed; this option produces a valid likelihood
ratio statistic that is exactly zero. Why? Because the hypothesis concerns μ ∈ X1 ,
and the residuals modulo X1 contain no information about the parameter.
For the comparison of two nested models having the same mean-value subspace,
the REML default option is recommended:
fit0 <- regress(y~mf0, ~block);
fit1 <- regress(y~mf0, ~block+site);
2*(fit1$llik - fit0$llik);
summary(fit1)
Consider an experimental design consisting of three physical sites, one northern, one southern and one western, separated by a considerable distance that is sufficient to
affect the local climate. Each site consists of four blocks of six plots, all of which
are outdoors. Each plot is assigned by randomization to one of two treatment levels,
which are constant in time. On 127 days over a two-year period, measurements
are made on one plant in certain designated plots. By construction, there is one
The treat×site interaction space has dimension two, so the null distribution of the likelihood ratio statistic is $\chi^2_2$. The parameter estimates reported by fit1a and
fit1b should be very similar, but not identical. If the number of observations is
large, it is helpful to supply an initial value for the iteration, as illustrated for fit1b
above. Note that fit1b uses the default kernel site*treat, so fit1b$llik
is not comparable with fit0$llik.
It may happen that the response has a temporal component that is continuous in
time, as opposed to the process implied by the inclusion of day as a block factor
above, in which the daily contributions are independent and identically distributed.
One simple option is to assume that the temporal component behaves like free Brownian motion, with generalized covariance function proportional to $-|t-t'|$. Free Brownian motion has independent increments on non-overlapping intervals; it is a stationary process in the sense that the distribution of increments is constant in time.
dayv <- as.numeric(paste(day)); BM <- -abs(outer(dayv, dayv, "-"))
X0 <- model.matrix(~site+treat);
fit0 <- regress(y~site+treat, ~block+plot+BM);
fit1a <- regress(y~site*treat, ~block+plot+ BM, kernel=X0);
fit1b <- regress(y~site*treat, ~block+plot+BM, start=fit1a$sigma);
2*(fit1a$llik - fit0$llik); summary(fit1b)
It is essential in this script that the kernel subspace include the constant functions.
The option kernel=1 is acceptable; kernel=0 is not acceptable, and may
produce an error message.
18.5 Exercises
18.1 Welham and Thompson (1997) discuss two possibilities for a Gaussian
likelihood-ratio statistic. For arbitrary mean vector μ, inner product matrix W =
Σ⁻¹, and W-orthogonal projection Q with kernel K = span(K), W&T define the
residual log likelihood as a function of the parameter θ = (μ, Σ) by
where c(K) = rank(Q) log(2π) − log |KᵀK|. Note that Q depends on Σ, and Qμ
need not be zero. For nested subspaces X0 ⊂ X1 , show that version A in their
equation (5) is the difference
where RL attains its maximum at θ̂0 under the null, and at θ̂1 under the alternative.
Discuss whether version D in their equation (6) is or is not a likelihood ratio. If
it is a likelihood ratio, or a non-random multiple of a likelihood ratio, what is the
event for which the probability ratio is computed?
μ ∈ 1ₙ, Σ = σ²(Iₙ + θB),
is parameterized by two scalars μ, σ > 0 and one additional parameter. For the
purposes of this exercise θ > −1/b is not necessarily positive, but Σ is positive
definite. In addition, the residual refers to any linear transformation, such as Y_ij − Ȳ.. ,
whose kernel is 1 ⊂ Rⁿ.
Let Y_ij be the observation for unit j in block i. Show that the within- and
between-block quadratic forms

    SS_W = Σ_ij (Y_ij − Ȳ_i.)²,    SS_B = b Σ_i (Ȳ_i. − Ȳ..)²,

and the F-ratio

    F = (SS_B/(m − 1)) / (SS_W/(n − m))
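These quadratic forms are easy to check numerically. The following Python sketch (not from the book; the block counts m and b and the use of Gaussian noise are illustrative) computes SS_W, SS_B and F for a balanced one-way layout and verifies the usual decomposition of the total corrected sum of squares.

```python
import numpy as np

# Balanced one-way block layout: m blocks of b units each (illustrative values).
rng = np.random.default_rng(0)
m, b = 4, 6
Y = rng.normal(size=(m, b))          # Y[i, j]: unit j in block i
n = m * b

Ybar_block = Y.mean(axis=1)          # block means  Ybar_i.
Ybar = Y.mean()                      # grand mean   Ybar_..

SSW = ((Y - Ybar_block[:, None]) ** 2).sum()   # within-block sum of squares
SSB = b * ((Ybar_block - Ybar) ** 2).sum()     # between-block sum of squares
F = (SSB / (m - 1)) / (SSW / (n - m))          # F-ratio

# The two quadratic forms decompose the total corrected sum of squares.
assert np.isclose(SSW + SSB, ((Y - Ybar) ** 2).sum())
```

Under the null hypothesis of no block variance, F has the F distribution on (m − 1, n − m) degrees of freedom.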
18.3 For the balanced block design in the preceding exercise, show that the implied
distribution for residuals is a two-parameter full exponential-family model with
canonical sufficient statistic (SS_W, SS_B). Hence deduce that the residual maximum-
likelihood estimate satisfies
18.4 For the balanced block design, show that the log determinant is
Show that the ML estimate satisfies 1 + bθ̂ = (m − 1)F/m. Hence deduce that the
ordinary log likelihood ratio statistic for testing θ = 0 is

    log det Σ̂₀ − log det Σ̂₁ = n log((n − m + (m − 1)F)/n) − m log((m − 1)F/m),
What does this expression tell you about the null distribution of the REML
likelihood-ratio statistic?
18.5 Show that the REML estimate with positivity constraint satisfies 1 + bθ̂ =
max(F, 1). What is the REML estimate for the second component? Express the
constrained REML likelihood-ratio statistic as a function of F , and compute the
atom at the origin.
18.6 The following exercise is concerned with the distribution of the likelihood-
ratio statistic in a ‘fixed-effects’ model for a balanced design, where Σ = σ 2 In , and
either μ ∈ 1n under the null hypothesis or μ ∈ span(B) under the alternative. The
meaning of the term ‘residual’ is unchanged, and the F-statistic in Exercise 18.2 is
also unchanged.
Show that the residual log likelihood-ratio statistic for testing μ ∈ 1 versus μ ∈
span(B) is

    (n − 1) log(1 + (m − 1)F/(n − m)).
By simulation or otherwise, show also that the null expected value exceeds that of
χ²_{m−1} by the approximate multiplicative factor

    1 + ½(m + 1)/(n − m).
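Under the null, F has the F_{m−1,n−m} distribution, so the claimed inflation factor can be checked by simulation. The following Python sketch (the values m = 4 and n = 36 are illustrative) compares the Monte Carlo mean of (n − 1) log(1 + (m − 1)F/(n − m)) with (m − 1)(1 + ½(m + 1)/(n − m)).

```python
import numpy as np

# Monte Carlo check of the approximate null mean of the LR statistic.
rng = np.random.default_rng(1)
m, n = 4, 36
N = 300_000

X = rng.chisquare(m - 1, N)          # numerator chi-square
Y = rng.chisquare(n - m, N)          # denominator chi-square
F = (X / (m - 1)) / (Y / (n - m))    # F-statistic under the null

T = (n - 1) * np.log1p((m - 1) * F / (n - m))      # LR statistic
approx = (m - 1) * (1 + 0.5 * (m + 1) / (n - m))   # inflated chi-square mean

assert abs(T.mean() - approx) < 0.05
```

The exact null mean is (n − 1){ψ((n − 1)/2) − ψ((n − m)/2)}, which for these values differs from the approximation only in the third decimal place.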
18.7 The null hypothesis being tested in Exercise 18.5 is the same as that in
Exercise 18.3, but the alternatives are different: one implies exchangeability of block
effects, the other does not. Discuss the implications of the fact that one statistic is
strictly increasing as a function of F , whereas the other is strictly decreasing for
F < 1 and strictly increasing for F > 1.
    K_ij = K(x_i, x_j).
Let x be a point in real Euclidean space R^d. For each λ > 0, the function

    K(x, x′) = e^{−λ‖x−x′‖²}
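At any finite set of points this kernel yields a symmetric matrix K_ij = K(x_i, x_j) that should be positive semi-definite. A Python sketch (the points, λ, dimension, and the squared-norm exponent are illustrative assumptions, not taken from the book):

```python
import numpy as np

# Evaluate K(x, x') = exp(-lam * ||x - x'||^2) on a finite point set and
# check symmetry and positive semi-definiteness of the resulting matrix.
rng = np.random.default_rng(2)
lam, d = 0.5, 3
x = rng.normal(size=(10, d))                              # ten points in R^d

sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)   # ||x_i - x_j||^2
K = np.exp(-lam * sq)

assert np.allclose(K, K.T)                       # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-8       # eigenvalues >= 0
```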
where s_r² is the sample variance in block r, and s²_pool is the pooled variance. Bartlett's
(1937) test for equality of variances is the REML statistic divided by the Bartlett
correction factor

    1 + (1/(3(k − 1))) ( Σ_{r=1}^k 1/(n_r − 1) − 1/(n − k) ).
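Both the statistic and the correction factor can be computed directly. A Python sketch with illustrative unbalanced Gaussian groups; the statistic Σ_r (n_r − 1)(log s²_pool − log s_r²) is the likelihood-ratio form referred to in the text.

```python
import numpy as np

# Bartlett's test: k blocks with sample variances s_r^2 on n_r - 1 degrees
# of freedom (block sizes are illustrative).
rng = np.random.default_rng(3)
n_r = np.array([12, 10, 15, 12, 8])
k = len(n_r)
groups = [rng.normal(size=nr) for nr in n_r]
n = n_r.sum()

s2 = np.array([g.var(ddof=1) for g in groups])    # s_r^2
s2_pool = ((n_r - 1) * s2).sum() / (n - k)        # pooled variance

# Likelihood-ratio (REML) statistic for equality of variances
stat = ((n_r - 1) * (np.log(s2_pool) - np.log(s2))).sum()

# Bartlett correction factor, as quoted in the text
C = 1 + ((1.0 / (n_r - 1)).sum() - 1.0 / (n - k)) / (3 * (k - 1))

bartlett = stat / C
assert stat >= 0 and C > 1
```

The statistic is non-negative by concavity of the logarithm, and division by C > 1 brings its null mean closer to that of χ²_{k−1}.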
    (2π)^{−n/2} |Σ|^{−1/2} e^{−(gy−μ)ᵀ Σ^{−1} (gy−μ)/2} ∏_{i=1}^n |g′(y_i)|
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 363
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8_19
    l_p(g; y) = −½ log det(Σ̂_g) + Σ_{i=1}^n log |g′(y_i)|.     (19.1)
i=1
Finally, for all scalars a and b ≠ 0, the cone condition and 1 ⊂ X imply l_p(a + bg; y) =
l_p(g; y), so that the profile likelihood is invariant with respect to affine composition.
In other words, the transformations y → g(y) and y → a + bg(y) are equivalent
for this comparison: gY ∼ N(μ, Σ) implies a + bgY ∼ N(a + bμ, b²Σ), and
vice-versa.
The preceding analysis assumes that the maximum-likelihood estimate (μ̂_g, Σ̂_g)
exists. Existence and uniqueness cannot be guaranteed in general, but failure is rare
in practice provided that p < n and the residual space is adequate to estimate all
variance components.
One very natural option is to choose a simple parametric family such as the family
of power transformations (Box and Cox, 1964). Provided that 1 ⊂ X and all
observations are strictly positive, the transformation (0, ∞) → R may be taken
in the form y → (y^λ − 1)/λ for some scalar λ, with the limit λ → 0 corresponding
to the log function. The derivative y^{λ−1} is strictly positive, so, by (19.1), the profile
log likelihood for λ is
    l_p(λ; y) = −½ log det(Σ̂_λ) + (λ − 1) Σ_{i=1}^n log y_i     (19.2)
defined origin, and effects are expected to be multiplicative; (ii) the identity if
treatment effects are expected to be additive on the given scale; (iii) occasionally
the reciprocal, square root or cube root if there is a reasonable justification based on
the physical units of measurement. For example, if the observation is a volume, an
argument might be made for the cube-root; if the observation is a time or duration,
conversion by reciprocals to the rate scale or frequency scale might make sense. But
additivity of effects on such scales is usually dubious, so the log transformation is
the preferred choice for most physical variables such as mass, volume, length, time,
or ratios such as speed, density, miles per gallon, and so on. Under no circumstances
should the reported analysis be done on the scale Y λ̂ , where λ̂ is the maximum-
likelihood estimate from (19.2).
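For the simplest i.i.d. setting with X = span(1) and Σ̂_λ = σ̂²_λ Iₙ, the profile log likelihood (19.2) reduces to −(n/2) log σ̂²_λ + (λ − 1) Σ log y_i and can be maximized over a grid. A Python sketch with simulated log-normal data (the sample size, grid, and data model are illustrative assumptions):

```python
import numpy as np

# Grid search for the Box-Cox profile log likelihood in the i.i.d. case,
# where log det(Sigma_hat_lambda) = n * log(sigma_hat^2(lambda)).
rng = np.random.default_rng(4)
n = 200
y = np.exp(rng.normal(size=n))        # positive data; the log scale is "true"

def boxcox(y, lam):
    return np.log(y) if lam == 0 else (y ** lam - 1) / lam

def profile_llik(lam):
    z = boxcox(y, lam)
    s2 = z.var()                      # ML variance estimate of transformed data
    return -0.5 * n * np.log(s2) + (lam - 1) * np.log(y).sum()

grid = np.linspace(-1.0, 1.0, 201)
lam_hat = grid[np.argmax([profile_llik(l) for l in grid])]

assert abs(lam_hat) < 0.25            # maximum near lambda = 0, i.e. the log
```

Since the data are log-normal, the profile likelihood peaks near λ = 0, consistent with the advice that the log is the natural choice for positive multiplicative data.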
Let τ > 0 be a fixed constant, and let y → τ(y^λ − 1)/λ be the re-scaled power
transformation applied component-wise on the transformed scale. The Jacobian is
τⁿ ∏ y_i^{λ−1}, so the log Jacobian is n log τ + (λ − 1) Σ_i log y_i.
Composition on the right sends g(·) to g(τ ·), which is an affine transformation of
g(·):

    g(τy) = τ^λ g(y) + (τ^λ − 1)/λ = τ^λ g(y) + const.
The assumption 1 ⊂ X , and the cone condition on covariance matrices, are
sufficient to ensure that likelihood-based conclusions are unaffected by scalar
composition on the right. Invariance is absolutely essential in applied work, where
the choice of physical units—inches versus centimetres or minutes versus seconds—
is quite arbitrary.
The purpose of re-scaling on the left is not to modify the power transformation in
a substantive way, but to simplify the computation. Nonetheless, the argument in the
first paragraph could easily be misconstrued as a statement that the modified power
transformation
    y_i → y_i^λ/(λ ẏ^{λ−1})   or   y_i → (y_i^λ − 1)/(λ ẏ^{λ−1})
has Jacobian equal to one; here ẏ is the geometric mean of the components of y. As
they are written above, these transformations do not act component-wise. The first
transformation satisfies g(τy) = τg(y), but the Jacobian J = |λ|^{−1} is discontinuous
at λ = 0. For the second transformation, the Jacobian

    J = 1/λ + ((λ − 1)/(nλ)) Σ_i y_i^{−λ}
is continuous at λ = 0, but there do not exist constants a, b such that g(τy) =
a + bg(y). A modified power transformation satisfying both conditions—g(τy) =
τg(y) and continuity in λ—is described in Exercise 19.7. None of these modifica-
tions has a parameter-independent Jacobian, so the Jacobian cannot be ignored in
likelihood calculations.
In the analysis of the woodcutting efficiencies of three brands of saws in Chap. 2, the
response was the time taken to complete a designated task. On the grounds that mul-
tiplicative effects were more plausible than additive effects, the log transformation
was used in all analyses. We now provide an analysis justifying that choice.
Bearing in mind that the chief purpose of transformation is not so much to
induce normality, but to achieve additivity of effects, two additive Gaussian models
were selected as targets. In the first version, the mean of the transformed variable
is additive in the four factors species+bark+team+saw.id, while the variances
are constant, and the covariances are zero. This is a rank-14 sub-model of the
standard Latin-square model, which has rank 16. The transformation model has
two additional parameters, σ² and λ, making 16 total. In the second version, the
mean is additive in the three factors species+bark+saw.brand, which is a subspace
of dimension 6, while there are two additional variance components team+saw.id,
making a total of ten parameters. Both profile log likelihoods for λ in Fig. 19.1
have their maxima near λ̂ = −0.34; both 95% confidence intervals include λ = 0,
but the identity is excluded. The conclusion is that the effects on the time scale
are approximately multiplicative, so taking logs is the natural remedy, as indicated
by Bliss (1970, pp. 440–441). Most experienced statisticians would transform
[Fig. 19.1 shows two profile log likelihood curves plotted against λ, with horizontal reference lines at the thresholds 1.92 and 3.37.]
Fig. 19.1 Log likelihood for the transformation parameter λ for two linear models
instinctively to the log scale on the grounds that additive effects on the time scale
are less plausible than multiplicative effects.
As it happens, the variation between duplicate saws is small, but brand three
is about 15% more efficient than the others. Mean cutting times are in the ratios
1.28:1.00:0.80 for larch:spruce:pine, with an additional factor of 1.14 for bark.
There is substantial variation among the teams.
A crucial point in the computation of log likelihoods for Gaussian transformation
models is that REML, or residual log likelihood, must not be used under any
circumstances. REML calculations are based on the distribution of residuals in
Rn−p , whereas the Jacobian is the determinant of a transformation Rn → Rn . These
are not compatible because the power transformation does not act on residuals. As a
function of λ, the residual likelihood criterion is not a likelihood in the conventional
sense of a density ratio in Rn . If the REML criterion were used with the Jacobian as
in (19.2), the plots shown in Fig. 19.1 would look substantially different. Details are
discussed in the next section.
A 1 − α confidence interval for the transformation parameter may be obtained by
the likelihood-ratio formula
which is the exact threshold for (1 − α)-coverage in the setting of nested linear
models. The 95% F -threshold for n = 36 and p = 14 is 3.37, which is also shown
in Fig. 19.1 for comparison. The greater allowance produces a wider interval, but
does not materially alter the conclusion or subsequent analysis.
An analysis-of-variance decomposition on the log scale shows that the interac-
tions species.bark and bark.brand are negligible. Bark removal reduces the mean
log cutting time by an estimated 0.152 ± 0.031 units for each species and each
brand, so the cutting-time distribution is reduced multiplicatively by about 14%.
    l† = l(μ, Σ, λ) − ½ log det(XᵀWX) + ½ log det(XᵀX),

    l†(Σ, λ; y) = l(Σ, λ; y) − ½ log det(XᵀWX) + ½ log det(XᵀX).
Suppose that two statisticians are asked to examine the same data, which is
concerned with vehicle fuel economy. Statistician I analyzes the consumption rates
in miles per gallon, and statistician II in kilometres per litre, so the pairs of numbers
differ by a constant multiple: y_i^{(1)} = τ y_i^{(2)} with τ ≈ 2.8. For each λ, the transformed
values differ by a parameter-dependent factor τ^λ, the associated variance matrices
satisfy Σ^{(1)} = τ^{2λ} Σ^{(2)}, and the inverse matrices satisfy W^{(1)} = τ^{−2λ} W^{(2)}. As we
should expect, the log likelihood function is scale-invariant in the sense that the two
versions differ by an additive constant
Invariance with respect to scalar multiplication means that two statisticians analyz-
ing the same data on different scales must arrive at the same conclusion regarding
the transformation parameter. By contrast, the two versions of the modified criterion
differ linearly in λ:
Elsewhere in the domain, F̃n (t) is subject to differentiability and strict monotonicity,
but, apart from the sample points, the values are otherwise unspecified. The numbers
F̃ (y1 ), . . . , F̃ (yn ) are the uniform sample quantiles in (0, 1), and the target G-
quantiles are the transformed values
taken with multiplicity in ascending order. If there are no ties, the uniform quantiles
are the numbers (2i − 1)/2n for 1 ≤ i ≤ n.
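A Python sketch of the quantile-matching construction for a standard-normal target G (the sample size and exponential data are illustrative assumptions); the observation of rank i is sent to the G-quantile at the uniform quantile (2i − 1)/2n.

```python
import numpy as np
from statistics import NormalDist

# Quantile matching: map each observation, via its rank, to the
# standard-normal quantile at (2i - 1)/(2n).
rng = np.random.default_rng(5)
n = 9
y = rng.exponential(size=n)                      # continuous data: no ties

ranks = np.argsort(np.argsort(y)) + 1            # rank of each y_i, 1..n
u = (2 * ranks - 1) / (2 * n)                    # uniform sample quantiles
h = np.array([NormalDist().inv_cdf(p) for p in u])   # target G-quantiles

assert (np.argsort(h) == np.argsort(y)).all()    # order is preserved
assert np.allclose(np.sort(h), -np.sort(h)[::-1], atol=1e-6)  # symmetric target
```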
The Jacobian of the transformation h : Rⁿ → Rⁿ is the product of the derivatives
at the domain points

    ∏_{i=1}^n h′(y_i) = ∏_{i=1}^n F̃′(y_i)/g(h(y_i)),
where g = G′ is the target density. The last term is a quadrature sum, which is an
approximation to the entropy integral

    J̃(G) = n⁻¹ Σ_i log g(q_{i:n}) = ∫ log g(x) dG(x) + O(n⁻¹).
From (19.1), the profile log likelihood function for the quantile-matching
transformation h̃ is
    −½ log det(Σ̂_h) + Σ_i log F̃′(y_i) − Σ_i log g(q_{i:n}),     (19.3)

    −½ log det(Σ̂₀⁻¹ Σ̂₁) − n J̃(G₁) + n J̃(G₀),     (19.4)
19.4 Exercises
19.1 Let Y be a non-negative random variable with cumulants κ_r such that κ_r/κ₁^r =
O(ρ^{r−1}) as ρ → 0. In other words, the scale-free variable Z = Y/κ₁ has
variance ρ = κ₂/κ₁², which is the squared coefficient of variation of Y, and the
higher-order scale-free cumulants are O(ρ^{r−1}). Show that the cumulants of the
power-transformed variable are

    E(Z^λ) = 1 + (λ − 1)κ₂/(2κ₁²) + o(ρ);
    var(Z^λ) = κ₂/κ₁² + o(ρ);
    cum₃(Z^λ) = κ₃/κ₁³ + 3(λ − 1)κ₂²/κ₁⁴ + o(ρ²).
19.2 Wilson-Hilferty transformation: Show that the rth cumulant of the exponential
distribution is κ_r = κ₁^r (r − 1)!, and hence that Y^{1/3} is approximately symmetrically
distributed.
19.3 Show that the rth cumulant of the Poisson distribution is κ_r = κ₁, and hence
that Y^{2/3} is approximately symmetrically distributed.
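The two symmetrizing powers can be checked by simulation. A Python sketch (the sample size and Poisson mean are illustrative) compares the sample skewness before and after transformation:

```python
import numpy as np

# Cube root for the exponential (Wilson-Hilferty) and two-thirds power for
# the Poisson both shrink the skewness substantially.
rng = np.random.default_rng(6)
N = 200_000

def skew(z):
    z = z - z.mean()
    return (z ** 3).mean() / (z ** 2).mean() ** 1.5

e = rng.exponential(size=N)           # skewness near 2
p = rng.poisson(10.0, size=N)         # skewness near 1/sqrt(10)

assert abs(skew(e ** (1 / 3))) < abs(skew(e)) / 5
assert abs(skew(p ** (2 / 3))) < abs(skew(p)) / 2
```

The transformed skewnesses are not exactly zero: the o(ρ²) remainder in Exercise 19.1 contributes a small residual asymmetry.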
is linear and invertible with Jacobian J = |λ|^{n−1}. Here, ū is the mean of the
components of the vector u ∈ Rⁿ, and λ is a non-zero constant.

    y_i → y_i^λ/(λ ẏ^{λ−1}),

where ẏ is the geometric mean of the components of y ∈ Rⁿ₊. Using the result of
the previous exercise, show that the modified transformation is invertible Rⁿ₊ → Rⁿ₊
with Jacobian J = |λ|^{−1}.
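The Jacobian value can be confirmed by a finite-difference computation of the matrix of partial derivatives; a Python sketch with illustrative y and λ (for λ > 0 the determinant is 1/λ):

```python
import numpy as np

# Numerical Jacobian of g_i(y) = y_i^lam / (lam * ydot^(lam - 1)),
# where ydot is the geometric mean of the components of y.
lam = 0.7
y = np.array([0.8, 1.5, 2.5])
eps = 1e-6

def g(y):
    ydot = np.exp(np.log(y).mean())              # geometric mean
    return y ** lam / (lam * ydot ** (lam - 1))

n = len(y)
J = np.empty((n, n))
for j in range(n):
    yp, ym = y.copy(), y.copy()
    yp[j] += eps
    ym[j] -= eps
    J[:, j] = (g(yp) - g(ym)) / (2 * eps)        # central differences

assert abs(np.linalg.det(J) - 1 / lam) < 1e-4    # det J = 1/lam
```

The determinant does not depend on y: the diagonal factor contributes ∏ y_i^{λ−1}/ẏ^{n(λ−1)} = 1, and the rank-one correction from ẏ supplies the factor 1/λ.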
    (gy)_i = (y_i^λ − 1)/(λ ẏ^{λ−1}),

    J = 1/λ + ((λ − 1)/(nλ)) Σ_i y_i^{−λ}.

Find the limits for λ = 0, ±1. Discuss the implications regarding invertibility. For
τ > 0, show that g(τy) is not expressible in the form a + bg(y) for any constants
a, b depending on τ, λ.
    (gy)_i = ẏ + (y_i^λ − ẏ^λ)/(λ ẏ^{λ−1})

is continuous at λ = 0 and satisfies g(τy) = τg(y). What are the implications for
statistical applications? Show that the Jacobian is

    J = 1/λ + ((λ − 1)/(nλ)) ẏ^λ Σ_i y_i^{−λ},

which is positive, and that the limits for λ → 0 and λ → 1 are equal.
The first nine of these tips are concerned with technical aspects of statistical reports.
The last six are concerned with English usage, style and semantics.
1. Length: Reports should be no longer than necessary. A short report that makes
the salient points is preferable to a long rambling philosophical essay, even if
the longer essay makes the same points somewhere along the way. Above all,
have compassion for the reader (and grader).
2. Graphs and plots: A plot either of the raw data or of the residuals is almost
always essential at some point, if only to explain the motivation for the analysis.
Some plots provide insight, some do not; only the most useful need to be
shown. Although you should indicate what plots were made, it is generally not
necessary to include in the report a copy of all plots made and all analyses
performed. If necessary for examination purposes, extra plots and lengthy
analyses can be included in an appendix.
3. Executive summary: All major conclusions should be stated at the beginning in
a summary intended for a scientifically literate reader who is not a statistician.
Technical terms associated with the context of the problem are unavoidable, but
technical statistical terms should so far as possible be avoided. One page is the
upper limit. Remember, few readers progress beyond the summary. It is up to
the author to state the conclusions early in as persuasive a manner as possible
if the reader is to be convinced.
4. Statistical analyses: Following the summary, the report should describe the
models fitted, the tests performed, and how these support the conclusions. The
relevance of the models to the context under study is important. Technical
statistical terms are acceptable here only if they are essential to support the
conclusions.
one, main verb. Poor logical organization is a signal of a confused mind, and
poor sentence structure points to a lack of attention to detail.
11. Clarity and word usage: It is good practice to cultivate an awareness of grammar
and word usage. Accurate word usage is important insofar as inaccurate or
careless usage sows confusion; good grammar is important insofar as poor
grammar betrays faulty logic.
For example, some native English speakers who are employed as commen-
tators at sports events seem not to understand the difference between the verbs
substitute and replace. These words are also important in mathematics. Viewers
are likely to be confused when a talking head recommends at the end of the first
quarter that the starting quarterback be substituted! For the correct usage in the
active voice, the coach may substitute a bench player for a starter or he may
replace the starter with a substitute from the bench. In the passive voice, an
active player may be replaced, in which case a bench player is substituted. To
declare that an active player has been substituted on account of injury is to put
the focus on the destination, implying that the coach’s job is to ensure that the
bench is well-supplied with injured players! Unintended, perhaps, but possibly
accurate.
In a similar vein with relevance to genetics, the upstream region of a gene
may be rich in certain motifs, meaning that those short sequences are abundant
in the upstream neighbourhood. The upstream region is enriched with motifs,
but the motifs themselves are neither rich nor enriched anywhere. One could
say that coal is abundant in Wyoming and fruit is plentiful in Florida, but coal
is not enriched in Wyoming nor is fruit richer in Florida than it is in Georgia.
use versus usage: The line What’s the use of crying? from the song Smile
by Nat King Cole is a rhetorical query about the utility or futility of the
act. Similarly, the phrase cocaine use refers to the act—its utility, its benefits
or its prevalence. By contrast, the title Modern English Usage of Fowler’s
celebrated book refers to the manner in which the language is spoken or written,
e.g., imaginatively, in long convoluted sentences, with flair, grammatically,
clichéed, and so on. In the same vein, the phrase cocaine usage refers to
the manner of ingestion. As a statistical factor, cocaine usage has levels
snorting, smoking, injection and other; cocaine use has levels
never, infrequent, occasional and regular.
Verbs for computational activities: Author A writes I created a proportional-
hazards model with covariates...; author B writes I ran a p-h model on the
data...; author C writes I fitted the p-h model...; author D writes I trained the
p-h model...; author E writes The p-h model was trained...; author F writes I
learned the p-h model....
The proportional-hazards model is a set of probability distributions for
survival times. Credit for its creation goes to Cox (1972), not to author A.
Generally speaking, one runs computer code for an algorithm that is designed to
pick the distribution that best fits the data. This activity is called model-fitting—
or learning in CS circles. In a sense, the computer or the algorithm learns the
best-fitting distribution, possibly using data from a training subsample, and
shares that wisdom with the user. Grammatically speaking, if the p-h model
is trained on the data, and learns from it, it would be more accurate for author F
to write The data taught the proportional-hazards model..., or perhaps, I used
the data to teach the proportional-hazards model..., but the semantic anomaly
would then be too evident.
12. Appropriate adjectives: Some computational tasks are easy, while others are
hard; some algorithms are efficient for the task while others are inefficient.
Likewise for a software implementation of an algorithm. Simulation is easy for
some distributions, less so for others. Maximum-likelihood estimation for some
models admits a computationally efficient algorithm; not so for other models.
A task may be easy or it may be hard, but it is neither efficient nor inefficient.
A model as a set of probability distributions may be finite or infinite, finite-
dimensional or infinite-dimensional; it may be suited to the task or it may not;
but it is neither easy nor hard, efficient nor inefficient.
13. Verb tense: Reports are best written in the present tense. If you wish to refer
to a past event, by all means use the past tense; likewise for future events.
If you switch from one tense to another mid-paragraph readers will notice,
and if a good reason is not apparent, the result will be confusion. It is best
to keep the bulk of your report in the present tense, including references to later
sections: An open-air experiment was conducted during the period 2012–2015;
the data from that experiment were analyzed and conclusions are presented in
sections 4–5.
Present tense: Anthropogenic emissions lead to global climate warming.
Past tense: Anthropogenic emissions led to global climate warming. Present
perfect tense: Anthropogenic emissions have led to global climate warming.
Both versions of the past tense, but particularly the first, suggest (probably
incorrectly) that anthropogenic emissions no longer have the effect that they
had in the past. That incorrect implication may be deliberate if the writer is a
White-House hack seeking to justify the U.S. exodus from the Paris Accord,
but it is a distraction for the discerning reader. Future tense: Anthropogenic
emissions will lead to global climate warming. The future tense suggests that
emissions did not have this effect in the past.
14. Numbers in text: A sentence must not begin with a mathematical symbol or a
numeral. Small integers 0–10 or 0–12 are usually spelled out when they occur in
text. A 3^{4−1} design has 27 observational units indexed by four factors with three
levels each. Zero is one of the dose levels. The zero subspace 0 = {0} ⊂ Rn ,
which contains one point and has dimension zero, is not to be confused with the
empty subset ∅ ⊂ Rn , which contains zero points and has no dimension.
15. Quantities; number, amount, volume: a great number of tired tourists, diving
dolphins, ornery kangaroos, football supporters,...; amount of cash in low
denominations, amount of food, alcohol,...; volume or tonnage of crude oil,
undelivered mail, mining sludge, ripe tomatoes,...; mass of water, mass of
humanity; less time, fewer people.
A + B + C, A ∗ B + C, ..., A ∗ B + B ∗ C, ..., A ∗ B ∗ C.
20.3 Exercises
20.1 The verb to write has a subject, a direct object and an indirect object. Some of
the parts may be empty or missing. Identify the three parts in the following sentences:
(i) Joe wrote a letter to Anne; (ii) Anne sent Joe a present; (iii) Joe wrote Anne a
long passionate letter.
20.3 Let A, B be two groups. What are the relations between x, x′, x′′ in A that are
preserved by a group homomorphism h : A → B?
20.4 Let A, B be two vector spaces, i.e., commutative groups with additional
structure. What is the additional structure? What are the additional relations between
x, x′, x′′ in A that are preserved by a vector-space homomorphism h : A → B?
20.5 Discuss whether or not the mapping x → ‖x‖² (or its inverse image) is a
homomorphism

    (R^d, B(R^d), N(0, I_d)) → (R, B(R), χ²_d)

of probability spaces.
Chapter 21
Q&A
Q21. Didn’t you say that the population must exist at baseline? Does Jan 8, 2022
exist at baseline?
A21. Yes, I did say that. And yes, the ordered pair (Kew, Jan 8, 2022) exists
today just as it did in AD 1899. But I did not say that every unit must be
accessible or observable immediately after baseline. Even a mathematician
has little control over the flow of time.
Q22. The space-time product set is uncountable in both dimensions. Isn’t that
excessive and unnecessarily extensive for statistical work?
A22. Maybe so, but imperialism is inscribed in the DNA of mathematics. Besides,
if you restrict the population, you forego the opportunity to make inferences
about the disenfranchised parts.
Q23. Is that a problem?
A23. Only if you decide later that you want to say something about units whose
existence was not declared.
Q24. Couldn’t you declare them retroactively?
A24. So long as the extension fits comfortably into the original scheme, yes you
could and you must, and probably you won’t be penalized for the oversight.
But, if there are multiple extensions, a convincing argument for a particular
extension may be problematic.
Q8. If your sample of plots is a non-random subset, where does the probability
come from?
A8. Probability comes from the mathematical framework that is implicit or
explicit in the protocol. Exchangeability gives rise to probabilities. Ran-
domization also gives rise to probabilities.
Q9. What is the role of randomization analysis?
A9. Randomization is usually associated with the uniform distribution on a finite
group acting on the sample units. Re-randomization enables you to generate
new ‘pseudo-samples’ having the same distribution as the original. For any
non-invariant statistic, you can compute its randomization distribution. This
is a useful way to determine where the observed treatment effect occurs in
the spectrum of treatments effects anticipated under randomization.
Q10. So the set of units in the randomization analysis is the finite sample of plots?
A10. Certainly the sample is finite.
Q11. Isn’t the randomization population the same as the sample?
A11. In a purely arithmetical sense, yes!
Q12. Is there any other sense?
A12. There must always be a wider statistical sense.
Q13. To what end?
A13. Presumably you want to say something about the likely effect of treatment
on other plots of a similar type in the population.
Q14. Couldn’t you just take the finite-population estimate, patch it together with
the randomization distribution or bootstrap distribution, and apply that to
other plots.
A14. If you had no principles or concerns about mathematical integrity, you could
do whatever you liked.
Q15. Isn’t that what every statistician does? Are we all dishonest?
A15. It is true that many statisticians do exactly that—and very often it is the right
thing to do if not always for the reasons stated.
Q16. So what’s the problem?
A16. The problem is one of honesty in mathematics. If you refuse to acknowledge
extra-sample plots, the statement about treatment effect is meaningless. If
you acknowledge their existence you have to establish a connection between
yields on the in-sample plots and yields on extra-sample plots. That step
requires an assumption such as stationarity or exchangeability with respect
to extra-sample plots.
Q17. In that case, what is the role of randomization analysis?
A17. Randomization analyses and bootstrap analyses are logically sound and
useful statistical tools. On its own—restricted to the finite sample of plots—
randomization is a basis for arithmetic and distribution-theory. It is not
otherwise a basis for statistical inference in the sense of prediction for extra-
sample plots.
21.1.4 Covariates
Q1. Apart from the observational units, what else exists before the baseline?
A1. Covariates are recorded pre-baseline.
Q2. What is a covariate?
A2. A variable recorded pre-baseline.
Q3. What is a variable?
A3. A variable is a function on the observational units.
Q4. What types of covariate are there?
A4. Qualitative variables or classification factors, and quantitative variables such
as age or calendar date or spatial position.
Q5. Are there any other types of covariate?
A5. Yes, relationships can also be recorded at baseline.
Q6. What is a relationship?
A6. A relationship is a function on pairs of observational units.
Q7. Can you give examples?
A7. A block factor is an equivalence relation; there are also genetic relationships,
familial relationships, temporal relationships, adjacency relationships, and
metric relationships.
Q8. What is a metric relationship?
A8. A metric is a symmetric non-negative function on pairs that satisfies the
triangle inequality.
Q9. Any other examples of relationships?
A9. On any space, the identity function is a relationship on pairs; it tells
you whether the two elements are the same or different. That’s a rather
fundamental component of elementary set theory. In a Euclidean space, the
inner product is a relationship between pairs of points.
Q10. Are any covariates recorded post-baseline?
A10. No. Every post-baseline variable is a random outcome subject to the rules
of probability.
Q11. What happens at baseline?
A11. The protocol is announced, units are assembled, treatment is assigned by
randomization, and nature or Tyche takes over.
Q12. Who is Tyche?
A12. Tyche is the Greek goddess of chance—Fortuna to the Romans.
Q13. Is treatment a covariate?
A13. No, it is not.
Q14. Why not?
A14. Treatment is the outcome of randomization as specified by protocol. It is a
random variable, albeit ancillary in all settings.
Q15. Is treatment assigned independently to units in the sample?
A15. No, not usually. A balanced design implies non-independent assignments.
Q16. Is the treatment assignment distribution the same for every unit?
A16. Not necessarily. In principle, the treatment assignment probability may vary
from one covariate sub-group to another as specified by protocol. But this
practice is not common and is not encouraged.
Q17. What is the purpose of randomized treatment assignment?
A17. Randomization is a panacea. It has many purposes.
Q18. Tell me one specific purpose.
A18. Concealment of treatment assignment promotes integrity in human trials.
Q19. Can you elaborate?
A19. Where human subjects are involved, the integrity of the experiment is at risk
if the treatment assignment is revealed prematurely, either to the patient or
to the physician. Concealment helps to limit the possibilities for subverting
the design.
Q20. Any other purposes?
A20. To see if God is paying attention.
Q21. What has God got to do with it?
A21. Concealment means that treatment assignment is known only to the control-
ling statistician, who must pay attention to events as they unfold.
Q22. Any other purpose?
A22. To help convince skeptics by levelling the playing field for treatment
comparisons.
Q23. Tell me about the role of exchangeability?
A23. Exchangeability is the fundamental axiom of statistical modelling.
Q24. What does exchangeability imply?
A24. It implies that two units having the same covariate value must have the
same response distribution. Implicitly or explicitly, that’s usually part of
the protocol.
Q25. Is exchangeability a mathematical theorem?
A25. No, it is an axiom of applied statistics. You can think of it as a bill of
rights or a guarantee of equality under the law. If two units are to have
different response distributions, there needs to be a demonstrable reason for
that difference. That’s where covariates enter the story.
Q26. What is the purpose of a covariate in a randomized study?
A26. There are three inter-related purposes.
(i) to accommodate sub-group effects (sex, age,...);
(ii) to improve precision of the treatment estimate;
(iii) to check for interaction.
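Purpose (ii) can be illustrated by simulation. The following Python sketch uses made-up parameters (the effect tau, the covariate slope, and the sample size are all illustrative); it compares the unadjusted difference of arm means with the least-squares estimate adjusted for a baseline covariate.

```python
import numpy as np

rng = np.random.default_rng(2)
reps, n, tau = 2000, 100, 1.0
unadjusted, adjusted = [], []
for _ in range(reps):
    x = rng.normal(size=n)                       # baseline covariate
    t = rng.permutation(n) < n // 2              # balanced randomization
    y = tau * t + 2.0 * x + rng.normal(size=n)   # covariate carries real signal
    unadjusted.append(y[t].mean() - y[~t].mean())
    # Adjusted estimate: coefficient of t in least squares on (1, t, x).
    X = np.column_stack([np.ones(n), t, x])
    adjusted.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

# Both estimators centre on tau, but adjusting for x shrinks the variance.
print(np.var(unadjusted), np.var(adjusted))
```

Both estimates are unbiased for tau under randomization; the adjusted one is much more precise because the covariate absorbs most of the response variation.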
Q27. What does interaction mean?
A27. Interaction means that the effect of treatment on males is different from its
effect on females.
Q28. So that means that you have two numbers, one for males and one for
females.
A28. Yes. If the treatment effect is summarized in a single real number, you have
one number for males and one for females.
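In symbols, with made-up cell means (illustrative numbers, not data from the book), the sex-specific effects and the interaction contrast are:

```python
# Illustrative cell means for a sex-by-treatment layout (made-up numbers).
mean = {("M", "control"): 4.0, ("M", "treated"): 6.0,
        ("F", "control"): 5.0, ("F", "treated"): 5.5}

# One treatment effect per sex, on the difference scale.
effect = {s: mean[(s, "treated")] - mean[(s, "control")] for s in ("M", "F")}
# A non-zero interaction contrast means the two numbers differ.
interaction = effect["M"] - effect["F"]
print(effect, interaction)
```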
21.1 Scientific Investigations
Q1. I’ve read that each patient in a two-arm randomized trial has two potential
outcomes or counterfactual responses, only one of which can be recorded.
Is that a fair description of the way you see it?
A1. No, not really. My inclination is to focus only on what can be observed in
principle. If only one observation can be recorded per patient, there is only
one outcome per patient. Counterfactuals or potential outcomes are not used
in these notes.
Q2. Why not? What’s wrong with counterfactuals?
A2. The issue is not whether there’s anything right or wrong with the concept
of counterfactuals, but whether there is a need for it, either in the real world
or in the mathematics. Certainly, the concept occurs in everyday language,
so there’s a need for it in some sense. Such issues are encountered in thorny
legal matters. X died in a work-related accident at age 32. What would X’s
lifetime earnings have been had he not died? That sort of determination is
important, and we need a way to address it. The question is how we address
it formally in the mathematics. It is a matter of mathematical style.
Q3. If it is merely a matter of style, what is there to argue about?
A3. Style is more than sufficient for an argument. In academia, the lower the
stakes the more tenacious the fight.
Q4. What are the mathematical issues?
A4. The counterfactual world admits duplicate copies of each patient, one
copy for each treatment level, and one outcome or response for each
patient-treatment combination. A counterfactual stochastic model necessar-
ily begins with a joint distribution for all outcomes, potential or otherwise.
For n patients and five treatment levels, you’re talking about a probability
distribution on R^{5n}. The sampling scheme is restricted to one treatment
level for each patient, and the counterfactual process determines the joint
distribution of those particular outcomes.
Q5. That seems straightforward. What other options are there?
A5. The approach taken in these notes is to construct a process indexed by
assignments. A sample consists of a subset of n patients together with a
treatment assignment t. The stochastic model associates with each finite
sample a probability distribution P_t on R^n. One can extend this to a
counterfactual distribution on R^{5n}, but the extension is not unique and is
not needed for observables. Nor is there any possibility of using data to
check the extension.
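The non-uniqueness claim in A5 can be checked by simulation. The Python sketch below (not from the book; all parameters are illustrative) fits a bivariate-normal counterfactual extension and shows that every value of the counterfactual correlation rho yields the same distribution for the observable pair (Y, T).

```python
import numpy as np

def observed(rho, n=200_000, seed=0):
    # Counterfactual pair (Y_A, Y_B): bivariate normal, correlation rho.
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    pair = rng.multivariate_normal([0.0, 1.0], cov, size=n)
    t = rng.integers(0, 2, size=n)       # randomized arm: A = 0, B = 1
    return pair[np.arange(n), t], t      # only one coordinate is revealed

y0, t0 = observed(rho=0.0)
y9, t9 = observed(rho=0.9)

# Arm-wise means and variances agree for every rho: the data cannot
# distinguish one counterfactual extension from another.
print(y0[t0 == 1].mean(), y9[t9 == 1].mean())
```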
Q6. So the two approaches are equivalent for all observables?
A6. For observables, yes they are exactly the same. But not for counterfactuals.
Q7. Can you give an example where the two approaches give different answers?
A7. Patient i was assigned medication A for hypertension, and her outcome was
Y_{i,A} = 12.3 in suitable units. What would her outcome have been if she
had been assigned medication B? That’s a counterfactual question to which
the answer is simply the conditional distribution of the B-value given the
A-value for this patient. So the answer depends on the joint distribution and
the counterfactual correlation.
Q8. Couldn’t the counterfactual correlation be zero?
A8. Yes, it could be any number between −1 and +1 inclusive.
Q9. How do you address the same question without counterfactuals?
A9. A genuinely counterfactual question admits an answer only in the counter-
factual realm. But every applied statistician knows that the initial question
is merely an opening gambit: the standard reply is to re-phrase the question.
If you can persuade the interrogator to accept the re-phrased version, you’re
in business.
Q10. How do you re-phrase the question to avoid counterfactuals?
A10. The population contains extra-sample individuals who have the same
covariates as the target patient, but who were assigned to medication B.
These patients all have the same response distribution. For any assignment
such that the target patient gets A and the other gets B, you can report the
conditional distribution for the extra-sample patient given the data.
Q11. How do you know for sure that the population contains another individual
having exactly the same covariate values as the target patient?
A11. You can be absolutely sure of that because the statistician has the luxury
of defining the population. For reasons mentioned earlier, the population
contains infinitely many units for each covariate value.
Q12. So the counterfactual question is a question about a specific individual, and
the re-phrased question insists on an extra-sample individual. Is that the
only difference?
A12. That’s all there is to it. But if the two patients are one and the same
individual, the answer depends on counterfactual correlations. So the
counterfactual and non-counterfactual answers are numerically different in
general.
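The numerical contrast in A12 can be made explicit with a bivariate-normal counterfactual model (a Python sketch with made-up parameters, not from the book): the same-patient answer is the conditional mean E(Y_B | Y_A = y) = mu_B + rho (sigma_B/sigma_A)(y - mu_A), which moves with the unidentifiable rho, while the extra-sample answer is the rho-free marginal for B.

```python
# Illustrative bivariate-normal counterfactual model (made-up parameters).
mu_A, mu_B, sigma_A, sigma_B = 10.0, 11.0, 2.0, 2.0
y_obs = 12.3   # the recorded A-arm outcome, as in Q7

# Counterfactual answer: conditional mean for the *same* patient,
# one answer for each admissible counterfactual correlation rho.
same_patient = {rho: mu_B + rho * (sigma_B / sigma_A) * (y_obs - mu_A)
                for rho in (0.0, 0.5, 0.9)}
# Re-phrased answer: marginal mean for an exchangeable extra-sample
# patient assigned B; no counterfactual correlation enters.
extra_sample = mu_B

# The two agree only when rho = 0, so they differ in general.
print(same_patient, extra_sample)
```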
Q13. How do you address an explicit counterfactual question such as X’s lifetime
earnings potential?
A13. You could think of injury or death at age 32 as a sort of treatment
assignment. You’d definitely get into trouble if you presented that to the
institutional review board, so you’d have to think it quietly.
Q14. And where does that take you?
A14. The counterfactual set-up has multiple versions of individual X, the real
one who died at age 32, and many who survived beyond that. The literal
answer to the question of future earnings is the conditional distribution of
earnings for X given the counterfactual history of health and earnings up to
age 32. Except for the event of death, the counterfactual history agrees with
the actual history.
Q15. That seems fair enough. So the counterfactual framework is needed to
address such matters?
A15. No, I don’t think it is needed. You have in the population infinitely many
individuals who have the same covariates as X, the same employment
history up to age 32, and so on, but who did not die at that age. You
can compute the distribution of lifetime earnings for one individual in that
subset.
Q16. And you get the same answer?
A16. That’s complicated. The answers are the same, but the questions are
different. The first is an answer to a counterfactual question about a specific
individual X. The second is an answer to a non-counterfactual question
about individuals other than X.
Q17. Why are the two answers numerically equal in the second case but not in
the first?
A17. I’ll leave that up to you. But the presumption is that the post-mortem lifetime
earnings of X are non-random, namely zero.
Q18. Let’s move back to treatment. What do you mean by the effect of treatment?
A18. The effect of treatment is to modify the response distribution by group
action. The effect is to change the control distribution for each patient to
the corresponding active distribution.
Q19. Why must the effect of treatment be a group action?
A19. Well, you want to be in a position of saying ‘Here are the distributions
under consideration for a control, and here are the distributions under
consideration for a treated patient’. And the two sets had better be the same
for obvious reasons; a null treatment effect means that whatever the control
distribution may be, the treatment distribution is the same. It’s not just a
pair of distributions; it’s an action that takes each control distribution to a
specific treatment distribution.
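A19 can be sketched concretely with the location-shift group acting on distributions (an illustrative choice of group, with made-up parameters, not the book's own computation): a treatment effect tau sends each control distribution N(mu, sigma^2) to N(mu + tau, sigma^2), the null effect is the identity element tau = 0, and the set of candidate distributions is the same for both arms.

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 1.5                                  # treatment effect: a group element

# A control distribution, represented by a large sample (illustrative).
control = rng.normal(5.0, 2.0, size=100_000)
treated = control + tau                    # action of tau on the control law

# Group structure: effects compose by addition, and tau has an inverse,
# so applying -tau to the treated law restores the control law.
restored = treated + (-tau)

print(treated.mean() - control.mean())
```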
Q20. What do you mean by the treatment effect?
A20. The treatment effect is a specific group element, a parameter if you like.
Q21. Is the effect of treatment the same for everyone?
A21. Yes, in the sense that it is the same group action on distributions. But no,
the particular group element need not be the same for everyone. It may vary
from one covariate subset to another; in (5.2), the treatment effect increases
linearly as a function of time.
Q22. If the treatment effect is not the same for everyone, how should it be
reported? Must we report the average treatment effect?
A22. The question makes sense only if the treatment group is a vector space, and
only if the population is finite. The first is often true, but not necessarily:
see Exercises 14.7 and 14.8. An average also requires a distribution on the
set of sampling units, and that’s not normally part of the specification unless
the population is finite.
Q23. So, how do you report a variable effect?
A23. If the treatment effect for males is not the same as that for females, you must
report one group element for each sex, or perhaps one distribution for each
sex. Same for population strata determined by any covariate or classification
factor. It would be misleading to report a single value or a single distribution
if there is substantial stratum-to-stratum variation.
Index
A Bates, D., ix
Accelerated-failure model, 238 Bayes estimator, 94, 267
Accessibility, 387 Beach, A., 55
Additivity, 3, 178, 272 Beach, F.A., 38
Adler, L., 136 Belej, M., 199
Advection, 311 Benjamini, Y., 343
Age cohort, 148 Bera, A.K., 199
Aitchison, J., 40 Berger, J., 202
Alpha-stable Biased sampling, 156
covariance, 97 BIC selection procedure, 209
distribution, 97, 111, 116 Biller, O.M., 136
Altuna, J.C., 55 Birnbaum, A., 201
Analysis of variance, 104, 263, 266 Bliss, C.I., 15, 366
Analytic sample path, 97 Block
Ancillary statistic, 208, 393 averages, 88
Andrews, D.A., 38 design, 137
Anscombe, F.J., 189 exchangeability, 140, 224, 226
Armitage, P., 38 factor, 173
Arnold, S., 143 randomization, 226
Assortative mating, 26, 30, 31 BLUP, 48, 49, 94, 271
Atkinson, Q.D., 117 Bock, R.D., 219
Austin, S., 218 Bolker, B., ix
Autocorrelation, 89 Bootstrap, 191, 194, 195, 392
Average treatment effect, 399 Boson process, 255
Boundary point, 249
Box, G.E.P., 47, 48, 53, 200, 205, 364
B Box-Tidwell method, 47
Bailey, R.A., ix, 16, 63, 176 Bradley-Terry model, 207
Baltagi, B.H., 199 Brownian
Barber, N.A., 136 bridge, 93, 100
Barnard, G.A., 201 covariance, 45, 51, 64, 73, 200
Bartlett correction factor, 334, 360, 361 evolution, 64, 73
Bartlett identities, 330 motion, 45, 51, 73, 101, 200, 260
Bartlett, M.S., 76, 361 Bush, S.E., 55
Baseline, 8, 10, 163, 385
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 407
P. McCullagh, Ten Projects in Applied Statistics, Springer Series in Statistics,
https://doi.org/10.1007/978-3-031-14275-8
408 Index
C courtship, 27
Campbell, H.E., 55 diet, 25
Carrick, R., 145 refractory period, 27, 28
Cauchy distribution, 229, 247 Dunnet, G.M., 145
Cavalli-Sforza, L.L., 64 Dynamic range, 3, 143
Cellmer, R., 199 Dyson, F.W., 267, 342
Cemetery state, 182
Censoring, 154, 181
Chinese restaurant process, 188, 194 E
Choleski factorization, 52, 53, 100 Early, J.J., 115
Chordal metric, 92 Eddington, A., 267, 342
Clayton, D.H., 55 Eddington’s formula, 51, 267, 277
Clifford, D., viii Edwards, A.W.F., 64
Cochran’s theorem, 257, 263 Edwards, L.J., 368
Cochran, W.G., 38 Effects
Coefficient of variation, 9 covariate, 177
Cohort plot, 148 treatment, 177
Commutative ring, 274 Efron, B., 268, 343
Competition model, 207 Eligibility, 387
Compliance, 171, 180 Entropy, 371
Compositional response, 40 Equi-variance, 228
Conformity with randomization, 62, 63 Ewens, W.J., 188
Confounding, 5 sampling formula, 188, 193, 203, 208
Congruent samples, 225 Exchangeability, 5, 10, 63, 73, 140, 141, 157,
Consistency, self-, 161 177, 184, 200, 218, 223–225, 238,
Contrast, 316 394
Cooch, E.G., 156 Exclusions, 7
Coolidge effect, 38 Exogenous variable, 172–174
Corbet, A.S., 189 Experimental design, 178
Counterfactual, 166, 242, 397 Experimental unit, 3, 10, 176
as missing value, 245 E.U. vs. O.U., 139, 380
Counterfactual sample, 166 Exponential family model, 188
Covariate, 8, 16, 170, 393 External variable, 172, 173
Cox, D.R., ix, 176, 364, 378 Eynhallow, 145
Crane, H., 188
Crossover design, 178
Cumulant, 37, 85, 188 F
-g.f., 194, 251, 262, 267 Factor
Cycle of a permutation, 208 block, 5, 10
classification, 5, 6, 10
coding, 381
D effects, 5
Da Silva, P.A., 190 subspace, 5
Davison, A.C., 160 treatment, 10
Dawid, A.P., 242–244 Factorial model, 62, 63, 66, 74, 272
Dawson, R.B., 37 Factorial subspace, 381
Degrees of freedom, 3, 4, 11, 20–22, 31, 61, 69, False discovery rate, 343
105, 123, 142, 273 Famous Americans, 41
Denning, T., 34, 41 Feature, 168
Density factorization, 209 Feldman, K.A., 41
1DOFNA, 272 Feller, W., 111, 116
Dong, G., 198, 206 Felsenstein, J., 64
Douthat, R., 218 Feynman diagram, 255
Drosophila Fiducial process, 242, 272
Index 409
Liang, K.-Y., 212
Likelihood
  factorization, 209
  ratio, 353, 355
  residual-, 351
Lilly, J.M., 115
Linder, A., 15
Linear regression, 269
Li, T., 142
Locally finite population, 165
Log normal distribution, 9
Longitudinal design, 182
Long-range dependence, 88, 91
Lord, F.M., 219
Lord’s paradox, 219
Løvlie, H., 39

M
Mächler, M., ix
Mackenzie, D., 243
MARK, 156
Mark-recapture design, 155, 164
Matched pairs design, 395
Matérn, B., 94
Matérn covariance, 94, 285, 288
Matrix exponential, 308
Matsui, M., 111
McCullagh, P., viii, 37, 61, 63, 143, 190, 255, 268, 329, 338, 342, 343, 371
Mead, R., ix
Michelson, A., 142
Mills’s ratio, 252
Missing values, 1, 8, 155, 245
Mixture model, 339
Model
  formula, 5, 11, 48, 63, 74, 138, 355, 377
  selection, 381
  specification, 376
Møller, J., 255
Moment generating function, 9
Montgomery, R.A., 133, 142
Muller, E., 368
Multinomial model, 29
Mulvey, L.I., 55

N
Nelder, J.A., ix, 37, 63, 143, 178, 338
Neutral evolution, 64, 188

O
Observational study, 156
Observational unit, 10, 28, 70, 134, 164, 386
Olhede, S.C., 115
Orzack, S.H., 145
Out of Africa, 117
Out of Ireland, 129
Over-dispersion, 31, 39, 108, 142

P
Patterson, H.D., 349
Pearl, J., 243
Pearson statistic, 31, 33, 37, 38
Permanent of a matrix, 256
Phillips, D.P., 41
Pirotte, A., 199
Pitman, J.W., 188
Pizzari, T., 39
Polson, N., 61, 268, 329, 342, 343, 347
Poole, E.J., 55
Population, 164
  average, 185, 399
  biological, 165
  locally finite, 165
Post-baseline variable, 8
Potential outcome, 242, 397
Prediction, 48, 93, 166, 264
Principal component, 66
Probability weighting, 185
Process, 159, 162
  counterfactual, 242
Profile likelihood, 77
Projection, 14, 22, 23, 76, 246, 258, 259
Proportional-hazards model, 238
Protocol, 10, 163, 169, 171, 175, 385
Pseudo-replication, 179
p-value, 5, 7, 34, 203, 376, 377, 382

Q
Quantile-matching transformation, 370
Quaternion scalar product, 309, 325

R
Random coefficient model, 234, 248
Randomization, 10, 166, 171, 175, 392, 394
Random matching, 36
Random sample, 184
Rao-Fisher-information, 256
Reich, P.B., 133, 142
Reid, N., 176
Relationship, 10, 172, 393
REML, 7, 12, 20, 349
Replace vs. substitute, 378
S
Sample
  accessibility, 184
  path, 96
  simple random-, 88, 166, 184
  stratified-, 184, 185
Sampling consistency, 193, 194
Sampling fraction, 88
Samuels, M.L., 212
Santoso, J., 128, 131
Schur product theorem, 292
Segal, D., 25
Selective sampling, 160
Self-adjoint transformation, 258, 259
Semi-max coefficient, 43, 47
Sendall, K.M., 133
Sen, M., 199
Senn, S.J., 212, 219
Separable covariance, 297, 303, 321
Sexual asymmetry, 27
Sexual isolation index, 26
Shapiro, M.D., 55
Sharon, D., 25
Shi, P., 368
Significance test, 203
Silverman, B.W., 342
Simpson’s paradox, 120
Singular distribution, 253
Size-biased sample, 184
Snell, E.J., ix
Space-time reversibility, 299
Spatial autoregressive model, 224
Speciation, 25, 55
Species diversity model, 189
Spline function, 46, 87, 96, 105, 271
Stability over time, 58
Standard error
  of contrasts, 18
Stationarity, 93, 225, 281
Statistical model, 162
Stefanski, A., 133, 142
Steiner, U.K., 145

T
Table, purpose of, 58, 381
Takemura, A., 111
Tan, C.K.W., 39
Tavaré, S., 188, 190
Thompson, D.J., 186
Thompson, P., 145
Thompson, R., 20, 22, 349, 355, 358
Tidwell, P.W., 47, 48, 53
Time reversibility, 282
Transformation, 3, 18, 143, 363
  Box-Cox, 364
  cube root, 372
  idempotent, 258
  image of, 258
  involution, 383
  isometric, 256
  Jacobian, 366, 370
  kernel of, 258
  nilpotent, 271
  quantile-matching, 370
  self-adjoint, 258
Travelling wave, 303, 305
Treatment
  assignment, 166, 171, 175, 383
  effect, 177, 231–237, 399
  factor, 16, 166, 231
  interference, 133, 179
Trend estimation, 94
Tresoldi, M.F., 371
Tsai, C-L., 368
Tukey, J.W., 272
Tuljapurkar, S., 145
Tweedie’s formula, 268

U
Under-dispersion, 31, 35
Use vs. usage, 378